Francisco Riaño

Step by Step: How to Use Machine Learning in Marketing


Machine learning methods can be used to develop data-driven marketing strategies. This matters because these techniques allow us both to classify and to predict different kinds of variables on a statistical basis, with the aim of making better business decisions. In this tutorial, two machine learning techniques are going to be used:

  • Cluster analysis (K-means): An analysis used to group customers based on several attributes and their similarity.
  • Decision trees and random forests: Among the main machine learning tools used to classify and predict variables.

To carry out these tasks, a dataset containing information about several customers was downloaded from the Kaggle platform. The information covers the following aspects (a short sketch of how the file can be loaded in R follows the list):

  • People: The first part of the dataset contains information about the customers themselves, with attributes such as ID, year of birth, and education, among others.
  • Products: The products section contains data on the customers' purchasing behavior for specific product categories such as wine, meat, and fruit.
  • Promotion: The promotion section contains data on how customers responded to the promotions offered by the store, for instance whether a customer used a specific promotion or how many products were purchased with some kind of discount.
  • Place: Last but not least, the place section contains information about the channels used by the customers and the number of purchases made through each listed channel.
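
As a minimal sketch, the file could be loaded and inspected in R as shown below; both the file name marketing_campaign.csv and the tab separator are assumptions and may need to be adjusted. The object is named Customers_1 to match the code shown later in the post.

# Read the customer file; file name and separator are assumed
Customers_1 <- read.csv("marketing_campaign.csv", sep = "\t")

# Quick look at the structure and the first rows
str(Customers_1)
head(Customers_1)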

Given the downloaded dataset, which can be found here, and the purpose defined above, a specific path is going to be followed to get the desired result. First, we are going to classify all the customers into specific segments, based on some attributes, to organize the dataset better. After that, we are going to perform a MANOVA analysis to confirm whether there is a significant difference between the groups, assessed separately for each variable. Later, a decision tree is going to be built to predict which customers respond, and a random forest is going to complement it by showing each customer's probability of responding. Last but not least, a PCA analysis is going to be done to determine whether the dataset can be simplified by reducing its dimensionality.

Before grouping the dataset into clusters, it is important to define which variables are going to be included in the analysis. First, we drop all the non-continuous variables; then, to avoid multicollinearity (high correlation between the variables in the model), a correlation matrix is calculated to check whether any coefficient is above 0.9. If that is the case, one of the two highly correlated variables has to be removed from the model. To assess this, the base R cor() function and the corrplot() function from the corrplot library were used.
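
As a hedged sketch of this step, assuming the customer data frame is called Customers_1 (as in the code later in the post), the correlation check could look like this:

library(corrplot)

# Keep only the numeric columns before computing correlations
num_vars <- Customers_1[sapply(Customers_1, is.numeric)]

# cor() is base R; corrplot() draws a matrix like the one in Figure 0
corr_matrix <- cor(num_vars, use = "complete.obs")
corrplot(corr_matrix, method = "circle", type = "upper")

# List any pair with an absolute correlation above the 0.9 threshold
which(abs(corr_matrix) > 0.9 & abs(corr_matrix) < 1, arr.ind = TRUE)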

Figure 0

Figure 0 contains a depiction of the correlation matrix produced in R. Although some variables are highly correlated, there are no coefficients above 0.9; therefore, the whole set of variables can be used to build the clusters.

As stated at the beginning, the first step in our project is to split the customers into different clusters. The fundamental question for this first task is how many groups the observations should be divided into. The following code can be used for this:

Image 1
Figure 1

The first lines of code in Image 1 define a new data frame containing the relevant variables that will be used to classify the observations, and scale and center the data; this is necessary to make sure that all the variables have the same "importance" when the clusters are set. The last line of code is a function that helps determine the most appropriate number of clusters for the selected variables. The function produces the chart depicted in Figure 1, which tells us that the most appropriate number of clusters is the one at the "elbow" of the curve, in this case 2. It is also important to know the number of observations per cluster, to check whether there is a good balance between the resulting groups. To get this, each observation has to be assigned to one of the 2 established clusters; this is done through a new column added to the main data frame using the code in Image 2.
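
For reference, here is a sketch of what the code behind Image 1 might look like, assuming the factoextra package's fviz_nbclust() was used for the elbow chart and building on the num_vars selection from the correlation sketch above; the object name model_km4_kgg is chosen to match the code shown later in the post.

library(factoextra)

# Scale and center the selected variables so they all weigh the same
cluster_vars <- scale(num_vars)

# Elbow chart: total within-cluster sum of squares for different k (Figure 1)
fviz_nbclust(cluster_vars, kmeans, method = "wss")

# Fit k-means with the 2 clusters suggested by the elbow
set.seed(123)
model_km4_kgg <- kmeans(cluster_vars, centers = 2, nstart = 25)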

Image 2

With the new dataset (the one with the added column specifying which cluster each observation was assigned to), we can run the following code to get the number of observations in each cluster and therefore check whether the groups are balanced.

Image 3

As the counts depicted in Image 4 show, the clusters are reasonably balanced in terms of the number of observations. With a few extra lines of code, shown in Image 5 (a sketch of which appears after the figures below), it is possible to plot the clusters. Figure 2 depicts both clusters based on two variables: Income and the amount of money spent on meat products, called MntMeatProducts. Both variables were selected because they had significant p-values when one-way ANOVA tests were carried out independently for each variable. Figure 3 contains the graph created with the fviz_cluster function, which displays the created segments using the first two principal components, obtained through PCA, as the axes.

Image 5
Figure 2
Figure 3
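
A sketch of what the plotting code in Image 5 might look like, assuming Customers_2 is the data frame with the Cluster column added in Image 2 (the same object created in the code listing at the end of this post), and reusing model_km4_kgg and cluster_vars from the sketches above:

library(ggplot2)
library(factoextra)

# Figure 2: income vs. spending on meat products, coloured by cluster
ggplot(Customers_2, aes(x = Income, y = MntMeatProducts, colour = factor(Cluster))) +
  geom_point() +
  labs(colour = "Cluster")

# Figure 3: clusters plotted on the first two principal components
fviz_cluster(model_km4_kgg, data = cluster_vars)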

Additionally, it is possible to make a graph similar to the one shown in Figure 2 but with a third variable, the number of purchases made through the web page. Again, this third variable was included because it also had a significant p-value when the one-way ANOVA was run.

Figure 4. Interactive 3D graph of the clusters
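
The interactive 3D chart in Figure 4 could be produced with a sketch like the following, assuming the plotly package was used and that the number of web purchases is stored in a column named NumWebPurchases (the column name is an assumption about the dataset):

library(plotly)

# 3D scatter: income, meat spending and web purchases, coloured by cluster
plot_ly(Customers_2,
        x = ~Income, y = ~MntMeatProducts, z = ~NumWebPurchases,
        color = ~factor(Cluster),
        type = "scatter3d", mode = "markers")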

Now that the clusters have been defined, it is time to analyze on which variables the two clusters significantly differ. Given that there are many potential dependent variables, a MANOVA test was run to get the desired output with just a few lines of code. Image 6 contains the code used to run the MANOVA. First, it was necessary to create a matrix containing only the dependent variables; then, the test was run to get the level of significance of each variable in the dataset, treated as a dependent variable, with the clusters as the single independent variable.
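
A sketch of what the code in Image 6 might look like; the set of dependent variables below is illustrative, and the column names NumCatalogPurchases, NumWebPurchases, and Recency are assumptions about the dataset.

# Bind a few dependent variables into a matrix
dep_vars <- cbind(Customers_2$Income,
                  Customers_2$NumCatalogPurchases,
                  Customers_2$NumWebPurchases,
                  Customers_2$Recency)

# MANOVA with the cluster membership as the only independent variable
manova_model <- manova(dep_vars ~ factor(Cluster), data = Customers_2)

# Multivariate test, then a per-variable ANOVA summary
summary(manova_model)
summary.aov(manova_model)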

Image 6
Table 1

Table 1 contains the results of the MANOVA for a sample of the dependent variables in the dataset. Variables such as income, the number of purchases made through the catalog (# Catalog Purchases), and the number of purchases made through the web (# Web Purchases) differ significantly between the two clusters, while for other variables, such as recency (the number of days since the last purchase), marital status, or whether the customer complained, the clusters did not significantly differ.

After the clustering and MANOVA assessments, it is time to perform linear regressions to look for significant relationships between variables within the dataset. The goal here is to figure out which product category is most significantly related to the customers' yearly income. The dataset contains the following product categories, along with the amount of money spent by each customer on them during the last 2 years:

  • Wines
  • Gold products
  • Fruits
  • Meat Products
  • Fish Products
  • Sweet Products

To carry out this task, six independent regression analyses were performed. In each one, the dependent variable was the amount of money spent on a given category and the independent variable was the customers' income. After all these computations were done, it was possible to determine that the category whose spending is best "explained" by income is meat products.
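
As a sketch of one of these six regressions, the meat products model could be fitted like this (assuming the same Customers_2 data frame as above):

# Simple linear regression: spending on meat explained by yearly income
meat_model <- lm(MntMeatProducts ~ Income, data = Customers_2)

# The slope's p-value and the R-squared are read from this summary
summary(meat_model)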

 

Both outputs above show the result of the individual regression analysis for the meat products category. At first glance, the p-value is small enough, and the R-squared is close to 0.48. The p-value indicates that there is indeed a significant relationship between both variables, so the probability that the slope is actually 0 is very small. An R-squared of 0.48 corresponds to an R of about 0.69, a level that the literature generally accepts as indicating that the model explains a fair share of the changes in the dependent variable.

For reference, the code that attaches the cluster memberships and counts the observations per cluster (as described for Images 2 and 3 above):

library(dplyr)

# Attach the k-means cluster membership to the customer data frame (Image 2)
mem_km_kgg <- model_km4_kgg$cluster
Customers_2 <- mutate(Customers_1, Cluster = mem_km_kgg)
View(Customers_2)

# Number of observations in each cluster (Image 3)
Customers_2 %>%
  count(Cluster)

So far, a fairly complete marketing analysis has been carried out. The computations above can be very useful for better organizing a customer dataset based on the information generated by customers and recorded by the business. Regression techniques are also quite useful for measuring the relationship between two or more variables, and its significance.
