Citizen Data Science using the Alteryx Intelligence Suite
A Short Introduction to Data Science / Machine Learning
In recent years, considerable hype has built up around the topic of "data science and machine learning". No wonder: the global volume of data alone is expected to grow to around 175 zettabytes by 2025, roughly equivalent to 2 trillion movies [1]. On top of that, on November 15th, 2018, the German government adopted the "Artificial Intelligence Strategy", an AI funding package totaling around 5 billion euros by 2025 [2].
While the demand for individual, use-case-tailored solutions is increasing, so is the demand for easier access to the underlying statistical analyses.
In this article, I show you how to use the Alteryx Intelligence Suite to perform a complete classification analysis without in-depth programming knowledge.
Data Science in Alteryx
With the "Alteryx Designer", Alteryx offers a very powerful and user-friendly platform out of the box, which makes data processing and modelling of any kind for the specific use case much easier and more intuitive. In addition, the data modelling and preparation (data engineering) phase of a data science project is often underestimated, but in many projects consumes about 70-80% of the effort.
In principle, it is possible to carry out the entire project, including data preparation, modeling, and validation, in Alteryx. If the standard tools are not sufficient, the full functionality of external libraries can also be embedded in Alteryx by integrating Python or R.
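As a rough illustration, the following sketch shows how an external model might be embedded via the Python tool. It is a minimal sketch, not a definitive recipe: the `ayx` package and anchor numbering follow the Python tool's documented pattern, while the column names (`Churn`, `ChurnProbability`) are hypothetical and the features are assumed to be numerically encoded already.

```python
# Inside the Alteryx Python tool (Jupyter notebook)
from ayx import Alteryx                      # bridge package provided by the Python tool
from sklearn.ensemble import RandomForestClassifier

df = Alteryx.read("#1")                      # read the data stream arriving at input anchor #1

# Hypothetical target column; features are assumed to be numerically encoded
X = df.drop(columns=["Churn"])
y = df["Churn"]

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

df["ChurnProbability"] = model.predict_proba(X)[:, 1]
Alteryx.write(df, 1)                         # send the scored data to output anchor 1
```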
The following table shows a few examples of how Alteryx can be used in a Data Science project:
| | Citizen Data Scientist | Data Scientist (intermediate) | Data Scientist (expert) |
| --- | --- | --- | --- |
| Data Preparation | The otherwise time-consuming preparation of raw data can be implemented intuitively and quickly using an extensive range of standard tools (this applies equally to all three user groups). | | |
| ML | Tools such as Auto-ML or Assisted Modeling make it easy to integrate various ML algorithms into the workflow. The Assisted Modeling tool also provides basic information about the individual processing steps. | In addition to the Auto-ML and Assisted Modeling tools, the forecasting tools can be used to perform further statistical analyses that are not currently covered by the two aforementioned tools. | If the Auto-ML, Assisted Modeling, and forecasting tools are not sufficient, custom code containing the corresponding model can be embedded via a Python or R node (as in the sketch above). In this way, deep learning models can also be used in Alteryx. |
| Output | The output of the models as well as the model metrics is returned by the respective tools. Especially in the Assisted Modeling tool, the different algorithms can be compared with each other in a manageable way. The resulting data can then be written to a database or exported. | Here, too, the output of the different models is handled by the respective tools. The additional forecasting tools offer further options for comparing and evaluating different metrics. The output data can then be written to a database or exported. | Output takes place in Alteryx; for more sophisticated analyses, you are limited to the tools Alteryx provides. However, there is always the option to export the data to a database or file for further analysis outside of Alteryx. |
| Benefits | The Alteryx Intelligence Suite offers Citizen Data Scientists an excellent platform for trying out various machine learning tools and statistical models. The entry barriers are very low, since little or no prior knowledge is required for tools such as Auto-ML or, above all, Assisted Modeling. | The main advantages for an advanced data scientist are the fast prototyping of various algorithms and statistical models. Existing workflows can quickly and easily be "recycled" and applied to a new use case. This way, you get a first feel for the data in advance and can quickly determine whether more metrics should be used or whether a more extensive analysis is worthwhile at all. In addition, depending on the use case, data can be processed more easily in the workflow without having to handle individual data points manually in an IDE. | For an expert data scientist, Alteryx offers advantages especially in preprocessing and prototyping. Existing workflows can be applied to a new use case relatively quickly and easily, without extensive adaptation of code. It is also possible to integrate a Python or R node at various points if the tools provided by Alteryx are not sufficient. |
Practical Example: Classification with the Data Science Tools
In this practical example, the focus is on predicting customer churn in the telecommunications sector. The aim is to determine, on the basis of various parameters such as connection type, payment method, monthly fee, and the number of years the customer has already been receiving services from the provider, whether a given customer will soon cancel their contract. A trained model can then be used to calculate a churn/non-churn probability for current customers, so that customer churn can be counteracted with targeted measures.
In the following examples, it is assumed that the raw data is already in relatively good condition and that no particularly complex processing is required in advance. This does not mean that such processing is impossible in Alteryx; the focus here is simply on the Alteryx Intelligence Suite, specifically the AutoML tool and the Assisted Modeling tool.
Classification with the AutoML Tool (suitable for users with intermediate ML/DS knowledge)
The following diagram (Figure 1) shows the modeling using the AutoML tool in Alteryx. The "Feature Types" tool (1) first analyzes the data types of the respective features and adjusts them automatically.
Here you can make changes if some features are not assigned correctly. Before analyzing the data with the AutoML tool, the predictive quality of the features should first be checked with the "Data Health" tool (see Figure 1: (2)). This tool analyzes each feature against 6 different metrics (Column Score, Missing Values, Unique Values, Sparsity, ID, and Unary) and gives a rating, a score, and a recommendation. In this case, the tool has detected that the column (feature) "CustomerID" has a very low Column Score. Indeed, IDs are generally not used in machine learning models and should therefore be removed from the dataset for further analysis (see Figure 1: (3)).
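Outside of Alteryx, a comparable sanity check can be sketched in a few lines of pandas. The checks below only approximate what the Data Health tool computes; the file name is hypothetical, while `CustomerID` is the column from the example.

```python
import pandas as pd

df = pd.read_csv("telco_churn.csv")  # hypothetical file name

# Rough stand-ins for some of the Data Health checks
report = pd.DataFrame({
    "missing_ratio": df.isna().mean(),       # share of missing values per column
    "unique_values": df.nunique(),           # cardinality per column
    "unary": df.nunique() == 1,              # constant columns carry no signal
    "id_like": df.nunique() == len(df),      # one unique value per row, like an ID
})
print(report)

# An ID column has no predictive value, so it is dropped
df = df.drop(columns=["CustomerID"])
```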
The data is divided into a training and a test data set (see Figure 1: (4)). If the input data is complete, it can now be modeled using the AutoML tool (see Figure 1: (5)). Here you have the option of defining the target variable, choosing between regression and classification, selecting an objective (loss) function, and picking one or more of the available algorithms. The model quality of the various algorithms is then automatically compared, and the best model, together with various metrics, is output for further processing.
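To make the idea behind steps (4) and (5) concrete, here is a rough scikit-learn analogue of "split, try several algorithms, keep the best by an objective function" — a sketch, not the AutoML tool's actual internals. `X` and `y` are assumed to be numerically encoded features and the churn target from the preparation above.

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Step (4): hold out a test set; 80/20 is a common default
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step (5): candidate algorithms, compared by a cross-validated objective (here ROC AUC)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}
scores = {
    name: cross_val_score(est, X_train, y_train, cv=5, scoring="roc_auc").mean()
    for name, est in candidates.items()
}
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X_train, y_train)
print(scores, "-> best:", best_name)
```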
Subsequently, the model is used to make predictions on the test set (see Figure 1: (6)). Various evaluation metrics can now be calculated, such as the hit rate (accuracy), a confusion matrix, or the AUC (see Figure 1: (7)).
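Continuing the sketch above, the same evaluation step looks like this with scikit-learn's metrics module:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]   # churn probabilities

print("Accuracy:", accuracy_score(y_test, y_pred))             # hit rate
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_prob))
```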
Classification with the Assisted Modeling Tool (preferred for Citizen Data Scientists)
In the case of Assisted Modeling, the tool automates a large part of the steps that were necessary in the previous example (provided you decide to use modeling "with assistance"). First, the "Assisted Modeling" tool is integrated into the workflow (see Figure 4: (1)); its input anchor receives the raw data.
Before running the workflow for the first time, you can choose between "Assistant" and "Expert" mode in the tool. In assistant mode, you are guided through the entire modeling process. In expert mode, you must integrate the respective tools into the workflow yourself, refine the data beforehand as in the previous case, and select a suitable model.
If you now start the assisted modeling, a pop-up window opens in which you are guided step-by-step through the entire modeling process.
The first step is to define the target variable and the machine learning method (classification or regression). Then you can choose between two automation levels: step-by-step or automatic. If you select "Automatic" here, steps 3 to 6 are omitted and Assisted Modeling creates the machine learning pipeline on its own: it defines data types, cleans up missing values, selects features, and chooses an algorithm.
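As a mental model of what such a pipeline looks like, here is a rough scikit-learn analogue of those four stages, a sketch rather than the tool's actual implementation. The column groupings are assumptions for a typical telco churn data set; only `TotalCharges` and `CustomerID` are named in the surrounding text, and `Churn` is a hypothetical target column.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# Hypothetical column groups
numeric = ["tenure", "MonthlyCharges", "TotalCharges"]
categorical = ["Contract", "PaymentMethod", "InternetService"]

# Data types + missing values: impute numerics, one-hot encode categoricals
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# The assembled pipeline mirrors the stages Assisted Modeling automates:
# type handling, imputation, feature handling, algorithm
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(random_state=42)),
])

X_raw = df[numeric + categorical]    # df from the pandas sketch above
y = df["Churn"]                      # hypothetical target column
pipeline.fit(X_raw, y)
```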
In step 3, the data types are now defined or changed. Whereas in the previous example the Data Health tool had to be used to check manually whether the respective features are suitable for the model and whether some may need to be removed, here the tool takes over this step automatically. It also correctly recognizes that the "CustomerID" feature should be removed.
The configuration can be adjusted manually if you are dissatisfied with the given selection. In the next step, missing values are cleaned and replaced using a selected method. In this case, the field "TotalCharges" contains 11 records with null values. For these, automatic replacement with the median is recommended.
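In pandas terms, this recommendation amounts to a two-liner; `TotalCharges` is the field named above, and the coercion step is an assumption in case the column arrives as text:

```python
import pandas as pd

# Coerce to numeric first, in case the column arrives as text (blanks become NaN)
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Replace the missing values with the column median, as the tool recommends
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())
```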
In step 5, the individual features are examined for their predictive quality. Assisted Modeling uses the Goodman-Kruskal tau (GKT) and the Gini score to determine whether a feature is a good predictor or not.
In this case, the feature "PhoneService" is removed from the training data set, since, with a Gini score of 0.57, it has only a very weak association with the target variable.
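The Goodman-Kruskal tau itself is easy to compute by hand. The sketch below implements the textbook definition (the proportional reduction in the Gini variation of the target when conditioning on a feature); whether Assisted Modeling computes it in exactly this form is an assumption, and `Churn` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

def goodman_kruskal_tau(x: pd.Series, y: pd.Series) -> float:
    """Association of categorical predictor x with categorical target y (0 = none, 1 = perfect)."""
    ct = pd.crosstab(x, y).to_numpy().astype(float)
    n = ct.sum()

    # Gini variation of y on its own
    p_y = ct.sum(axis=0) / n
    v_y = 1.0 - np.sum(p_y ** 2)

    # Expected Gini variation of y within each category of x
    p_x = ct.sum(axis=1) / n
    cond = ct / ct.sum(axis=1, keepdims=True)
    v_y_given_x = np.sum(p_x * (1.0 - np.sum(cond ** 2, axis=1)))

    return (v_y - v_y_given_x) / v_y

# Hypothetical usage on the churn data
print(goodman_kruskal_tau(df["PhoneService"], df["Churn"]))
```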
In the final step, one or more algorithms are selected for the analysis of the data. The individual models are then applied to the cleaned training data set and compared with each other. The result is a dashboard with all the important information at a glance.
Via the tabs "Comparison", "Overview", "Interpretation", and "Configuration", you can find further statistics about the selected algorithms and models. You now have the option of adding a specific model to the workflow. To do this, simply select the corresponding model and click the "Add models and continue with workflow" button. The workflow is then built automatically as follows:
Four transformation tools as well as a classification, a fitting, and a prediction tool are added automatically (see Figure 14: (2)). The settings within these tools can still be changed after they have been added to the workflow. As a final step, the test data from the sampling is connected to the prediction tool and, as in the previous example (AutoML), various model-quality metrics can be calculated (see Figure 14: (3)).
Conclusion
The practical examples show how Alteryx, and in particular the Alteryx Intelligence Suite, can be used in a data science project, and which additional possibilities these tools offer. This results in different solutions for different user groups. Whether you want to simplify or even automate preprocessing, prototype different algorithms and methods quickly, or outsource sub-processes, Alteryx offers a solution for almost every step. Furthermore, if the built-in tools are not sufficient, you always have the option to integrate a Python or R node directly into the workflow and make individual adjustments. The Intelligence Suite thus offers a solid basis for statistical analysis and addresses both Citizen Data Scientists and advanced Data Scientists.
We are happy to help you implement your data science use case with Alteryx, so please feel free to contact us!
Sources
[1] https://www.iwd.de/artikel/datenmenge-explodiert-431851/
[2] https://www.bmwi.de/Redaktion/DE/Artikel/Technologie/kuenstliche-intelligenz.html