AY1516 T2 Group 18 Project Findings

From Analytics Practicum
Jump to navigation Jump to search

HOME

 

PROJECT OVERVIEW

 

DATA

 

PROJECT FINDINGS

 

SCHEDULE

 

DOCUMENTATION


Logistic Regression

In hopes of identifying the significant factors that affect consumers’ decision on buying Personal Care products, we employed the use of a logistic regression model. A logistic regression model was used due to the categorical dichotomous nature of the response variable. In addition, the relationship between the independent explanatory variables and the dependent response variable is also not linear, which violates the assumption of a linear regression model.

Dealing with Multicollinearity

Before performing logistic regression, we had to check for multicollinearity amongst all of the variables. This was important to administer because logistic regression is sensitive to extremely high correlation among independent variables, which would give rise to a large standard error for parameter estimates. This was done via a measure of association between all (categorical) variables. We looked at the Lambda and Uncertainty measures as indicated in Figure 14 to determine if the categorical variables are highly associated with each other. We dealt with variables with high correlation with each other by eliminating one or more redundant independent variables from the model. We made sure that the correlation among the explanatory variables do not exceed a maximum coefficient of correlation of 0.5 before proceeding with our logistic regression analysis.

Figure14.png
Figure 14: Measures of Association table for example variables Market and Region

Dealing with “Missing Values”

Some of the questions in the survey are asked or skipped based on a respondent’s answer for the previous question. For example, if the answer to the question “In regard to device and media usage, was yesterday a typical day for you?” is “yes”, then the next few questions in the survey would be asked. Otherwise, the questions would be skipped. Figure x is shown below for a clearer understanding of the scenario:

Figure15.png
Figure 15: Example of a question showing routing rules

As a result of such questions, there were areas in the dataset indicated as missing values, simply because the questions were skipped for certain respondents. With the missing values in place, logistic regression will only evaluate the rows without any missing values. Hence, our results will suffer should there be a large number of missing values present. In order to deal with the missing values, we could either exclude the rows with missing values entirely, or do imputation. However, due to the large number of missing values (over 300 responses at an instance), we could potentially miss a lot of information from these respondents should we exclude it or replace it with another value. Hence, we replaced the missing values in two ways:

  • For continuous variables, we converted the responses into categorical variables instead, with 4 measures - “not a typical day”, “did not use this service”, value less than median, and value more than median, where the median was determined after excluding the missing values
  • For categorical variables, we replaced the missing values with “not a typical day”


Figure16.png
Figures 16 and 17: resulting categorical variables after dealing with missing values

Modelling Strategy

We employed the use of the Stepwise function in the Fit Model platform of SAS JMP Pro 12 to build the explanatory model for Personal Care consumers. Since we have a large number of variables at hand, we chose to use Stepwise since it is often used to reduce the number of variables. We changed the direction to “Mixed”, at the 0.05 level of entry and exit. The Mixed option combines the Forward and Backward direction option, so it tends to lessen the magnitude and likelihood of possible interpretation and prediction errors, since variables are evaluated twice in the process, where variables can be dropped after they have been added. An advantage of the Stepwise function is that it is an automated algorithm, hence, it is fast and easy for us to build the model.

However, a limitation of JMP prevented us to evaluate all possible explanatory variables at once. As a result, we needed to split the variables and run the Stepwise function separately based on their sections to drop initial variables that are insignificant. Finally, we combine all variables left in each section in our final model.

Evaluation of Logistic Regression Model


Performance of the Logistic Regression Model
We assess the overall performance of our logistic regression model using several different measures - the whole model test, misclassification rate, confusion matrix, and the Receiver Operating Characteristic (ROC) curve.

Figure25.png
Figure 25: Whole Model Test output for the logistic regression model


    The hypothesis for the Whole Model Test is:

Ho: The logistic model is NOT useful
H1: The logistic model is useful

Since our results indicate that p-value < 0.0001 shown in Figure, there is sufficient statistical evidence to reject the null hypothesis. Therefore, we can conclude that the logistic model is useful to explain the odds of the Personal Care consumers. In other words, the overall model is significant at the 0.0001 level according to the Model Chi-Square statistic.

Figure26.png
Figure 26: Lack of Fit output for the logistic regression model


In addition, we use the Lack of Fit test (i.e. Goodness of Fit) to evaluate if our model is adequate to explain the odds of Personal Care consumers. We form the hypothesis as follows:

Ho: The logistic model is adequate
H1: The logistic model is NOT adequate (i.e. there is a lack of fit)

Since our results show that the p-value = 0.5281 ( >0.05 ), there is insufficient evidence to reject the null hypothesis at the 95% confidence level. Therefore, we can conclude that our logistic regression model is adequate and that there is little to be gained by introducing additional variables to the model.

Figure27.png
Figure 27: ROC curve for the logistic regression model

The ROC curve, which is a plot of sensitivity by (1-specificity) for each value of x, with a value of 0.80345, indicates that there is very good distinguishing ability in identifying Personal Care consumers from the model.

Figure24.png
Figure 24: Confusion Matrix table for the logistic regression model

Lastly, the misclassification rate for our logistic regression model is 0.2445. This rate can also be derived from the confusion matrix table above, there are 339 false positives and 168 false negatives. Therefore, the overall misclassification rate works out to be 24.45% (i.e. (339+168)/2074). We can conclude that according to the misclassification rate measure, our model is able to predict 75.55% of the Personal Care consumers correctly.

Overall, based on the misclassification rate, confusion matrix, whole model test, lack of fit test, and ROC curve, we can draw the conclusion that the performance of our model is good for explaining the odds of Personal Care consumers.

Performance of the Individual Parameters
After we have established that our model is sufficient in explaining the odds of Personal Care consumers, we need to assess the individual parameters to sieve out significant explanatory variables that identify Personal Care consumers. We use the Likelihood-Ratio Tests (LRT) to help us with that. While the logistic regression output in SAS JMP Pro 12 provides the results from the Parameter Estimates based on Wald’s tests, we use the LRT instead because LRT tests are computed iteratively, and hence, would produce more reliable results.

The table below is a summarization of the results based on the different categories:

Category Explanatory Variable Significance Level
Demographics Gender <0.0001
Household Income
Employment Status
Section B: Device Ownership Owns: Pay-TV subscription <0.01
Intends to buy: Nothing
Mobile brand
Mobile service provider
Section C: Digital Activities Freq. of online activities: reading news, sport or weather <0.01
Freq. of online activities: mobile payment
Freq. of online activities: upload photos, videos, or music <0.05
Section D: Daypart Usage Online shopping <0.01
Read newspapers/magazines in the early evening
Don’t know which part of the day listen to the radio
Social networking during dinner
Don’t know which part of the day Shopping & Researching online <0.05
Watch videos or TV online in early morning
Section E: Category Engagement Engages online by - Reading emails from brands <0.01
Willing to engage with online - Government services
Willing to engage with online - Cosmetics or skincare
Prefer to use - YouTube to find out info about a brand
Engages online by - Comment or share brand posts on Facebook <0.05
Prefer to use - WhatsApp to access entertainment content from or about a brand
Section F: Digital & Social Segmentation Seek advice from social networks / official information <0.05


Decision Tree

Another multivariate analysis that can help us to identify the significant factors that affect consumers’ decision on buying Personal Care products is using the decision tree method in recursive partitioning, a classification technique. Decision tree is a hierarchical model composed of decision rules that recursively splits independent variables into homogeneous zones (Lee and Park, 2013), where the most important factor will be located at the root of the tree.

The decision tree method is different from the logistic regression method in that the decision tree, being a nonparametric approach, does not require any statistical assumptions concerning the data. In addition, it can also handle incomplete and qualitative data. In terms of these two aspects, decision tree proves to be at a more advantageous position as compared to logistic regression because firstly, such assumptions are not always met in real world situations, and secondly, it is able to deal with the missing values identified in the previous section.

Modelling Strategy

Unlike logistic regression, we did an additional step to sieve out statistically significant variables at the 0.05 level through the Likelihood Ratio and Pearson’s chi-squared tests. After which, we ran our analysis by launching the Partition method in the Modelling platform. We set the method to Decision Tree and input our response variable and explanatory variables. We start to perform splits to the data by evaluating the Logworth of the explanatory variables, where the split will take place at the variable with the highest Logworth. We continue to split the data until we observe the following: the plot of the R-Square against the number of splits plateaus, because this indicates that additional splits will no longer provide any additional value to the model; the Logworth of each additional variable following each split is no longer increasing as much; and the misclassification rate is no longer improving as much with each additional split. To prevent the tree from growing too big, we will prune the tree as long as we felt that the additional splits are not adding as much value to the decision tree anymore.

Evaluation of the Decision Tree Model


Performance of the Decision Tree Model
We used three measures to evaluate the performance of our decision tree model - misclassification rate, confusion matrix, and the ROC curve.

Figure29.png
Figure 29: Confusion Matrix table for the decision tree model

The misclassification rate for the decision tree model is 0.2430. This rate can also be derived from the confusion matrix table above, where there are 355 false positives and 149 false negatives. Therefore, the overall misclassification rate works out to be 24.3% (i.e. (355+149)/2074). We can conclude that according to the misclassification rate measure, our model is able to predict 75.7% of the Personal Care consumers correctly.

Figure30.png
Figure 30: ROC curve for the decision tree model

The ROC curve for the decision tree model indicates a value of 0.7889. While this indicates that there is very good distinguishing ability in identifying Personal Care consumers from the model, the distinguishing power is lower than that of the logistic regression model.

Overall, based on the misclassification rate, confusion matrix, and ROC curve, we can draw the conclusion that the performance of our decision tree model is good for explaining the odds of Personal Care consumers.

Assessing Individual Parameters

Category Explanatory Variable
Demographics Gender
Household Income
Race
Section B: Device Ownership Mobile Service Provider NPS
Owns - Pay-TV subscription
Owns - Smartwatch
Owns - Tablet
Section C: Digital Activities Freq. of online activities - Reading news, sport or weather
Freq. of online activities - Listen to internet radio / music online
Freq. of online activities - Visiting blogs or forums
Freq. of online activities - Upload photos, videos or music
Social / IM Usage - Facebook Messenger
Section D: Daypart Usage Online shopping
Watched TV
Listened to radio
Watched TV early morning
Listened to radio in bed when woke up
Section E: Category Engagement Willing to engage with cosmetics / skincare online
Willing to engage with any brands online
Willing to engage with mobile network operators
Willing to engage with government services
Prefer to use - access entertainment content from or about a brand Google+

Comparison of Logistic Regression and Decision Tree

Comparing confusion matrix:

Confusion matrix (logistic).jpg Confusion matrix (decision tree).jpg
Figure 31 (left): Confusion Matrix for Logistic Regression model
Figure 32 (right): Confusion Matrix for Decision Tree model

  • Comparing misclassification rate - Logistic Regression: 0.2445, Decision Tree: 0.2430
  • Comparing ROC curve - Logistic Regression: 0.80345, Decision Tree: 0.7889
  • Comparing variables obtained (similar):
    • Demographics: Gender, Household Income
    • Device ownership: Own pay-TV subscription
    • Digital activities: Frequency of online activities - upload photos, videos, or music; reading news, sport or weather
    • Daypart usage: Online shopping yesterday
    • Category Engagement: Willing to engage with online - Government services, cosmetics or skincare

Recommendations

Recommendation for Marketers

Based on the results of our analysis, we found certain demographic characteristics of Personal Care consumers that might be helpful for marketers of Personal Care products. Our results show that the odds of a Personal Care consumer being Female is 2.67 times greater than Male. This is most likely due to the nature of the product. This means that marketing strategies should be targeted at Females to receive better reciprocity.

However, regardless of gender, the odds of a person with the household income bracket of below S$1-2K, S$3-6K, S$12-13K, S$14-15K, or S$17.5k-20K is 1.68 times greater to be a consumer of Personal Care. This shows that different marketing campaigns can be suited for both income brackets below 6K and above 12K.

In addition, we found various online behavioral patterns of Personal Care consumers. In terms of seeking information online, the odds of a person reading news, sport or weather online is 1.66 times greater to be Personal Care consumers than a person who do not, with increasing frequency of doing so being more likely to be Personal Care consumers. In terms of online engagement and interaction, the odds of a person who uploads photos, videos or music is 1.35 times greater to be a Personal Care consumer than a person who do not. In addition, the odds of being a Personal Care consumer is 1.60 times greater if he/she shops online. In terms of willingness to engage with different categories online, our results highlighted that the odds of the people who are willing to engage in cosmetics or skin care online are 1.68 times greater to be a Personal Care consumer than someone who does not; and the odds of people who are willing to engage in government services online is 1.51 times greater to be a Personal Care consumer than those who are not. Marketers can make use of these findings to find ways to engage and market their products to their target consumers online. Through the understanding of the online behavior of Personal Care consumers, we can observe that they are consumers who are heavily engaged in the online platform, and are open to more interactions made online.

In terms of offline digital platform, our results show that the odds of a person having a Pay-TV subscription is 1.54 greater to be a consumer of Personal Care than those who are not. Marketers can make use of this information to perhaps consider advertising Personal Care products through usage of the products in popular pay-TV shows or during TV commercials.

Our findings point out that consumers of Personal Care are engaged in both online and offline activities through multiple platforms. All these information identified are useful for marketers to formulate marketing strategies as these behaviors highlight the diverse online activities that consumers engaged in.

Recommendation for Market Research Firm

The TNS Connected Life online survey questionnaire consists of 13 sections which covers respondents’ socio and economic demographics, devices and digital media platforms usage as well as online behavior. Each section consists of multiple parts, for example, under the demographic section, there are 6 parts such as Gender, Age and Education.

In this study, the objective is to reduce the number of survey questions to increase the accuracy of survey results and reduce cost. As our study focuses only on the FMCG - Personal Care products, we do not consider certain sections of the survey such as the Airlines, Banking and Mobile Operator as they are only relevant for the respective markets. As such, the remaining 10 sections are used for our analysis.

The results of our analysis shows that we can further reduce the number of sections in the survey as well as the questions in each section. Below is the list of 5 sections and its respective questions:

  1. Demographic
    1. Gender
    2. Household Income
  2. Device Ownership
    1. Device/service own or use at home
  3. Frequency of Online Digital Activities
    1. Upload photos, videos or music
    2. Read news, sport and weather
  4. Media Usage
    1. Online shopping
  5. Category Engagement
    1. Willingness to engage in Cosmetic or Skincare online
    2. Willingness to engage in Government services online

We identified a total of 8 main questions in 5 sections that are significant in determining the characteristics of Personal Care consumers. By asking these few questions, market researchers will be able to find out the characteristics of the consumers of Personal Care products. In addition, the lower number of survey questions reduces the time required to complete the survey and also reduces strain on respondents. This can increase the accuracy of the survey results. Furthermore, by reducing the incentives needed to incentivize respondents to complete a survey, the incentives can be redirected to gather more respondents.

Limitations

The data provided to us consists of 2074 responses after sieving out those that are relevant to the Singapore and Malaysia markets. Initially, we wanted to build a model to predict the socio demographics, economic demographics, device and digital media usage as well as online behavior of consumers of Personal Care products to allow marketers and maximize their return on investments. However, we encountered many obstacles and one of the main reasons was that the small sample size is not ideal for any predictive modelling because the results churned out with various models was unstable and could not be used. This is especially so for machine learning methods such as recursive partitioning bootstrap forest.

As a result, we decided to approach the problem using another method that can still benefit our stakeholders (Personal Care marketers and market research firms). With the data that we have, we decided to build an explanatory model to find out the characteristics of Personal Care consumers so as to recommend critical survey questions to retrieve valuable data from respondents. There are two benefits to this approach. First, this allows marketers from the Personal Care industry to learn more about their target consumers. Secondly, recognizing that the dataset consists of 6754 variables mainly due to the large amount of questions in the survey, we wanted to help reduce the cost of conducting survey as well as improve the accuracy of survey results by reducing the number of variables altogether.

However, we were still faced with the limitation that our results only serve to explain the current dataset. This means that we cannot draw conclusions should we extrapolate the data to other industries other than Personal Care as well as to other markets from other countries. Lastly, within the Personal Care industry, this explanatory model is based on the assumption that the behavioral pattern for Personal Care consumers does not change.

Conclusion

We started out having the objective to build an effective explanatory model to identify the significant variables that explains the purchase pattern outcome of Personal Care consumers in the hopes of reducing the number of variables needed for a market research study. We explored, evaluated and compared the results of the logistic regression and decision tree models with consideration of the limitations that comes with our analysis.

In conclusion, market research companies can make use of our analysis as an experiment dataset to map out questions that are redundant and focus on crafting a survey questionnaire that meets their study aim without drowning respondents with a large number of survey questions. To increase the value of the model, market research companies can aim to work towards building a predictive model by increasing the sample size in future studies.