Difference between revisions of "AY1516 T2 Group 18 Data"

From Analytics Practicum
Jump to navigation Jump to search
Line 31: Line 31:
  
 
<div align="left">
 
<div align="left">
==<div style="background:#ff4fa7; padding: 10px; font-size: 14px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #D3D3D3 solid 25px;"><font color="white">Data</font></div>==
+
==<div style="background:#ff4fa7; padding: 10px; font-size: 14px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #D3D3D3 solid 25px;"><font color="white">Data Provided</font></div>==
 +
The 2015 Connected Life study was conducted over 50 markets around the world. For this report, we managed to attain the dataset from the Singapore and Malaysia markets, which was obtained using stratified random sampling based on region. The survey was administered online via TNS’s panel partners, and was carried out for the duration of three months from May to June 2015. The sampling frame of the survey encompasses regular Internet users (defined as people who use Internet at least once a week) in Singapore and Malaysia aged between 16 and 65 years old who are members of TNS’s panel partners. In addition, due to the nature of the survey questions, most of the variables are in categorical form. We were given a total sample size of 2,239 responses to work with - 1,145 from Malaysia, and 1,094 from Singapore.<br><br>
  
<b>Data Provided</b><br><br>
+
The online questionnaire was programmed in such a way that logic checks and routing checks would be done for every response, and hence the data obtained was complete. The dataset was presented to us in SPSS format, but was later converted to a format compatible to run on SAS JMP. Besides the SPSS data file, we were also given the survey questionnaire which contains the questions, options, and the codes corresponding to each option. An example is shown below:<br><br>
The Singapore and Malaysia data provided to us by TNS was collected via online panels and weighed afterwards in proportion of population representative for both countries. The data was presented to us in an SPSS format and is cleaned. This is because the online questionnaires in the online panels were programmed in a way where logic checks and routing were done during the data collection process. In addition, the data collected was also automatically coded in the background based on the specifications in the survey questionnaire, which was also given to us.<br><br>
 
  
<b>Data Preparation</b><br><br>
+
[[Image: Figure2.png|400px|link=]]<br>
Even though the data presented to us has been preliminarily cleaned by the online panels, there is still a need to go through a more stages of cleaning to ensure that the data is relevant for the purpose of this project. As we are only focusing on one particular sector, we also need to filter out all the questions catered specifically for other irrelevant market sectors. This allows us to focus on the key objectives better and provide more meaningful analysis.
+
<i>Figure 2: Example of a column variable in the dataset</i>
 +
 
 +
[[Image: Figure3.png|400px|link=]]<br>
 +
<i>Figure 3: Snippet of survey questionnaire showing question, options (e.g. homemaker, unemployed, etc.), and corresponding codes (e.g. 01, 02, etc.)</i><br>
 +
 
 +
==<div style="background:#ff4fa7; padding: 10px; font-size: 14px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #D3D3D3 solid 25px;"><font color="white">Target Sample Group Used</font></div>==
 +
As we are interested in the consumers who purchase Personal Care products, we need to first start by selecting consumers of fast-moving consumer goods (FMCG), since Personal Care products are under FMCG. We filtered out respondents who did not purchase FMCG products in the past four weeks when the survey was administered, based on the assumption that they are not customers of FMCG to begin with. We ended up with a total of 2074 responses.<br>
 +
 
 +
==<div style="background:#ff4fa7; padding: 10px; font-size: 14px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #D3D3D3 solid 25px;"><font color="white">Selection of the Variables</font></div>==
 +
<b>Response Variable</b><br>
 +
As the aim of our analysis is focused on identifying whether a consumer purchases Personal Care products or not, we reviewed the survey questionnaire and located a question that would help us directly with that - “Have you bought any of the following products in the last 4 weeks? - Personal Care”. The resulting variable comes with two levels of measures (yes/no), as shown in Figure 3 below. The frequencies of the response variable include 1,407 respondents (67.8%) “yes”, and 667 respondents (32.2%) indicating “no”, with “yes” being treated as consumers of Personal Care products.<br><br>
 +
 
 +
[[Image: Figure4.png|150px|link=]]<br>
 +
<i>Figure 4: Response variable column</i><br><br>
 +
 
 +
<b>Explanatory Variables</b><br>
 +
As the dataset given consists of variables from multiple industries in addition to the Personal Care industry (e.g. questions asking specifically about the banking industry and its performance, etc.), we removed those variables in addition to variables we believe do not have an influence on determining the purchase decision of Personal Care products (e.g. respondent’s serial number, what day was yesterday, etc.). We started out with 6753 variables, and after data preparation, we managed to shrink the number of explanatory variables needed to 208 for the decision tree model, and 246 variables for the logistic regression model.<br>
 +
 
 +
==<div style="background:#ff4fa7; padding: 10px; font-size: 14px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #D3D3D3 solid 25px;"><font color="white">Data Preparation</font></div>==
 +
Although the data has been preliminarily cleaned by the TNS’s panel partners, additional data preparation work is required to allow us to focus on the key objectives better and provide more meaningful analysis. We conducted univariate analysis to check for outliers, observing the distributions of each variable, and used it to determine the number of dummy variables to create for a predictor variable. Besides that, we also ran through the distributions for all variables again to ensure that the data has been properly cleaned. In addition, it also helps us to get a feel of the distribution of responses for each question and whether it makes sense, and helps us to form initial expectations of the results.<br><br>
 +
 
 +
We also did bivariate data analysis to test the association between all (categorical) variables against each other to ensure that the variables selected for model building in the later phase are all independent variables. This is done through a Test of Association. This helps to ensure the accuracy of our end results by removing the effects of multicollinearity. Following is the list of data preparation steps taken.
 
<br><br>
 
<br><br>
 +
<b>1. Variables and Dimension Reduction</b><br><br>
 +
<i>Removing Irrelevant Variables</i><br>
 +
The dataset includes variables that are irrelevant for the Singapore and Malaysia market. One such example is the “Renren” social network/instant messaging platform as shown in the figure below, where it does not contain any user from both Singapore and Malaysia, as this is a platform used in the China market. These variables are excluded from our analysis.<br>
 +
 +
In order to provide more in-depth analysis and insights, we will need to explore the data to understand various aspects of the data further. We will also need to manipulate certain variables to improve the data quality as follows:<br><br>
 +
[[Image: Figure5.png|250px|link=]]<br>
 +
<i>Figure 5: Variables with no users from Singapore and Malaysia</i><br>
  
In order to provide more in-depth analysis and insights, we will need to explore the data to understand various aspects of the data further. We will also need to manipulate certain variables to improve the data quality as follows:
+
<i>Excluding Variables with More Than 50% Missing Value or Less Than 50% Usage</i><br><rb>
<br><br>
+
We excluded variables that contain more than 50% of respondents not using that service. For example, the social media or instant messaging platform “Pinterest”, “MySpace” and “Friendster” has respectively 71%, 80% and 88% of respondents who have never used that service. These variables are excluded because the high proportion of respondents who never used this service would not be helpful to determine whether or not a respondent will purchase or not purchase personal care products.<br><br>
 +
[[Image: Figure6.png|500px|link=]]<br>
 +
<i>Figure 6: Variables with less than 50% usage by all respondents</i><br><br>
 +
 
 +
<i>Combining Multiple Variables into One</i><br><br>
 +
One of the reasons why the dataset consists of large amount of variables is because at least one variable is created for each option in a multiple choice question. For example, the question regarding household structure has four separate categorical variables created with Yes/No value for “No Children”, “Children”, “Dependent Children” and “Independent Children”. We reduced the number of variables by combining multiple categorical variables into a single variable. In this case, the four variables are combined into a single variable with three possible options, “No children”, “Dependent Children” and “Independent Children.”<br><br>
 +
[[Image: Figure7.png|400px|link=]]<br>
 +
<i>Figure 7: Survey question for household structure</i><br><br>
 +
 
 +
[[Image: Figure8.png|400px|link=]]&nbsp;&nbsp;&nbsp;&nbsp;[[Image: Figure9.png|130px|link=]]<br>
 +
<i>Figure 8 (left): Four variables created from the above survey question</i><br>
 +
<i>Figure 9 (right): Single variable after combining the results of the four variables</i><br><br>
 +
 
 +
<b>2. Outliers</b><br><br>
 +
We identified the outliers and recoded them to the mean of the distribution. In the dataset, we had 13 respondents who used multiple devices and engaged in multiple social media activities for 24 hours on a typical day. We recognized that it is possible for devices to operate 24 hours a day while engaging in social media activities. However, we assumed that it is highly unlikely for a person to use multiple devices and engage in social media activities 24 hours on a typical day. Nevertheless, we recognized that these respondents use those devices as well as engage in social media activities. Therefore, the time spent was recode to the average time spent of all respondents that did spend time on the device and activity, i.e. the average time spent calculated after excluding the outliers and the respondents that did not spend time on that respective device and activity.<br>
 +
 
 +
[[Image: Figure10.png|link=]]<br>
 +
<i>Figure 10: Respondent who uses PC/Laptop, watch video or TV online and engage in social networking activities for 24 hours a day</i><br>
 +
 
 +
[[Image: Figure11.png|link=]]<br>
 +
<i>Figure 11: Time spent of the above respondent was recode to mean of 2.49. The mean of 2.49 is calculated after excluding the outlier as well as the respondents which did not spent time on social networking</i><br><br>
 +
 
 +
<b>3. Data Binning</b><br><br>
 +
We looked at the distributions of the data and found that for certain variables, there are many categories with very few responses. For example, the range of the distribution of mobile brand usage is wide and the distribution is skewed to the left i.e. there are many mobile brands that are used by very few users. To deal with that, we checked the proportion of users of these mobile brands who have purchased Personal Care products. These mobile brands are then grouped together based on the proportion of users that purchased or did not purchase personal care products, whichever is higher. For example, MobileBrandXYZ has 5 users, 3 of them purchased personal care products while 2 did not. We grouped mobile brands such as MobileBrandXYZ into “Others - G1 Yes” while mobile brands having higher proportion of users who did not purchase personal care products into “Others - G1 No.”<br>
 +
 
 +
[[Image: Figure12.png|link=]]<br>
 +
<i>Figure 12: On the left is the distribution of the Mobile brand before grouping them by proportion. The diagram on the right is the result of grouping the mobile brands by proportion</i><br><br>
 +
 
 +
<b>4. Interpreting Free Text Responses</b><br><br>
 +
There are survey questions that allow respondents to select from a list of options and input free text under the option “Others, please specify: _____” if their desired options are not available. We infer the free text responses and as much as possible by matching them to one of the options available. If there are no suitable options, new options are added to the existing list to capture the free text responses of the respondents. This may be due to the large number of options available or misunderstanding of the meaning of the options thus respondent choose to input in their own words. This takes into consideration the free text responses instead of representing them as “Others” which may exclude important responses from the users.<br>
  
*Exclude variables that are not statistically significant in predicting internet users that purchased personal care products online
+
[[Image: Figure13.png|link=]]<br>
**In addition to single-response variables, there are a few survey questions that takes in multi-responses ie. a user can select the period of day he/she is doing certain activities. The results of these type of questions are then mapped to several dichotomous variables (0 or 1) based on the number of options available. If a user did activity A during period 1 and 2 of the day, the result will be a value of 1 in both variable “Activity A - Period 1” and “Activity A - Period 2”. We have to understand the results of the questionnaire and how they are mapped to the variables before applying the appropriate methods or checking the significance of the variables. As these type of variables are from a multi-responses questions and are related, checking the significance of these variables individually (using the “Fit Y by X” function in JMP Pro 12) will have a very different result as compared to checking these variables as a group of multi-responses variables (using the “Categorical Response Analysis” function and indicating that the role of these variables are “Multiple - Indicator Group”)
+
<i>Figure 13: Respondents’ free text responses of mobile brand they use</i><br><br>
*Dropped variables that are irrelevant to the project such as data for other countries
 
**This includes variables such as social networking platforms that are not used by internet users from Singapore and Malaysia
 
*Recode outliers by looking at data distribution
 
**Recoding the outlier to the mean of the distribution ie. the mean of the distribution without the value of the outlier. Most of our outliers belong to users doing multiple activities such as watching tv, using mobile phone, social networking and using PC/Laptop 24 hours a day. It is highly unlikely for this scenario to take place so we assumed that this user does the mentioned activities regularly, therefore we recoded the outlier to the average time spent of each of the activities
 
*Grouping of various categories with low frequency count, e.g. mobile brand, mobile service provider, etc.
 
**The range of the distribution of mobile brand usage is wide and the distribution is skewed to the left ie. there are many mobile brands that are used by very few users. We then check the proportion of users of these brands that purchased personal care products online. These brands are then grouped based on the proportion of users that purchased or did not purchase personal care products online, whichever is higher. Grouping of these mobile brands did not affect the distribution of this variable. For example, if mobile brand X has 5 users and 3 of them purchased personal care products online, these brands are categorised under brand that have higher proportion of users purchasing personal care products online. After finding out the proportion for all brands with few users, they are then grouped under “Others - Purchased personal care products” or “Others - Did not purchased personal care products”
 
*Reduce dimension by combining similar variables into one
 
**For example, four separate categorical variables with Yes/No value for “No Children”, “Children”, “Dependent Children” and “Independent Children” are combined into a single variable with three possible options, “No children”, “Dependent Children” and “Independent Children”. This reduces the amount of variables used for our analysis while the information of the four variables “No Children”, “Children”, “Dependent Children” and “Independent Children” are not lost
 
*Recode and match free text responses
 
**Making sense of the free text responses and code them into one of the available options. New options are added to the existing list for free text responses that does not match any options already available. This may due to the large number of options available or misunderstanding of the meaning of the options thus respondent choose to input in their own words. This take into consideration the free text responses instead of representing them as “Others” which may exclude important feedback from the users
 
  
 
<div align="left"> <!-- END CHUNK-->
 
<div align="left"> <!-- END CHUNK-->

Revision as of 14:57, 11 April 2016

HOME

 

PROJECT OVERVIEW

 

DATA

 

PROJECT MANAGEMENT

 

DOCUMENTATION


Data Provided

The 2015 Connected Life study was conducted over 50 markets around the world. For this report, we managed to attain the dataset from the Singapore and Malaysia markets, which was obtained using stratified random sampling based on region. The survey was administered online via TNS’s panel partners, and was carried out for the duration of three months from May to June 2015. The sampling frame of the survey encompasses regular Internet users (defined as people who use Internet at least once a week) in Singapore and Malaysia aged between 16 and 65 years old who are members of TNS’s panel partners. In addition, due to the nature of the survey questions, most of the variables are in categorical form. We were given a total sample size of 2,239 responses to work with - 1,145 from Malaysia, and 1,094 from Singapore.

The online questionnaire was programmed in such a way that logic checks and routing checks would be done for every response, and hence the data obtained was complete. The dataset was presented to us in SPSS format, but was later converted to a format compatible to run on SAS JMP. Besides the SPSS data file, we were also given the survey questionnaire which contains the questions, options, and the codes corresponding to each option. An example is shown below:

Figure2.png
Figure 2: Example of a column variable in the dataset

Figure3.png
Figure 3: Snippet of survey questionnaire showing question, options (e.g. homemaker, unemployed, etc.), and corresponding codes (e.g. 01, 02, etc.)

Target Sample Group Used

As we are interested in the consumers who purchase Personal Care products, we need to first start by selecting consumers of fast-moving consumer goods (FMCG), since Personal Care products are under FMCG. We filtered out respondents who did not purchase FMCG products in the past four weeks when the survey was administered, based on the assumption that they are not customers of FMCG to begin with. We ended up with a total of 2074 responses.

Selection of the Variables

Response Variable
As the aim of our analysis is focused on identifying whether a consumer purchases Personal Care products or not, we reviewed the survey questionnaire and located a question that would help us directly with that - “Have you bought any of the following products in the last 4 weeks? - Personal Care”. The resulting variable comes with two levels of measures (yes/no), as shown in Figure 3 below. The frequencies of the response variable include 1,407 respondents (67.8%) “yes”, and 667 respondents (32.2%) indicating “no”, with “yes” being treated as consumers of Personal Care products.

Figure4.png
Figure 4: Response variable column

Explanatory Variables
As the dataset given consists of variables from multiple industries in addition to the Personal Care industry (e.g. questions asking specifically about the banking industry and its performance, etc.), we removed those variables in addition to variables we believe do not have an influence on determining the purchase decision of Personal Care products (e.g. respondent’s serial number, what day was yesterday, etc.). We started out with 6753 variables, and after data preparation, we managed to shrink the number of explanatory variables needed to 208 for the decision tree model, and 246 variables for the logistic regression model.

Data Preparation

Although the data has been preliminarily cleaned by the TNS’s panel partners, additional data preparation work is required to allow us to focus on the key objectives better and provide more meaningful analysis. We conducted univariate analysis to check for outliers, observing the distributions of each variable, and used it to determine the number of dummy variables to create for a predictor variable. Besides that, we also ran through the distributions for all variables again to ensure that the data has been properly cleaned. In addition, it also helps us to get a feel of the distribution of responses for each question and whether it makes sense, and helps us to form initial expectations of the results.

We also did bivariate data analysis to test the association between all (categorical) variables against each other to ensure that the variables selected for model building in the later phase are all independent variables. This is done through a Test of Association. This helps to ensure the accuracy of our end results by removing the effects of multicollinearity. Following is the list of data preparation steps taken.

1. Variables and Dimension Reduction

Removing Irrelevant Variables
The dataset includes variables that are irrelevant for the Singapore and Malaysia market. One such example is the “Renren” social network/instant messaging platform as shown in the figure below, where it does not contain any user from both Singapore and Malaysia, as this is a platform used in the China market. These variables are excluded from our analysis.

In order to provide more in-depth analysis and insights, we will need to explore the data to understand various aspects of the data further. We will also need to manipulate certain variables to improve the data quality as follows:

Figure5.png
Figure 5: Variables with no users from Singapore and Malaysia

Excluding Variables with More Than 50% Missing Value or Less Than 50% Usage
We excluded variables that contain more than 50% of respondents not using that service. For example, the social media or instant messaging platform “Pinterest”, “MySpace” and “Friendster” has respectively 71%, 80% and 88% of respondents who have never used that service. These variables are excluded because the high proportion of respondents who never used this service would not be helpful to determine whether or not a respondent will purchase or not purchase personal care products.

Figure6.png
Figure 6: Variables with less than 50% usage by all respondents

Combining Multiple Variables into One

One of the reasons why the dataset consists of large amount of variables is because at least one variable is created for each option in a multiple choice question. For example, the question regarding household structure has four separate categorical variables created with Yes/No value for “No Children”, “Children”, “Dependent Children” and “Independent Children”. We reduced the number of variables by combining multiple categorical variables into a single variable. In this case, the four variables are combined into a single variable with three possible options, “No children”, “Dependent Children” and “Independent Children.”

Figure7.png
Figure 7: Survey question for household structure

Figure8.png    Figure9.png
Figure 8 (left): Four variables created from the above survey question
Figure 9 (right): Single variable after combining the results of the four variables

2. Outliers

We identified the outliers and recoded them to the mean of the distribution. In the dataset, we had 13 respondents who used multiple devices and engaged in multiple social media activities for 24 hours on a typical day. We recognized that it is possible for devices to operate 24 hours a day while engaging in social media activities. However, we assumed that it is highly unlikely for a person to use multiple devices and engage in social media activities 24 hours on a typical day. Nevertheless, we recognized that these respondents use those devices as well as engage in social media activities. Therefore, the time spent was recode to the average time spent of all respondents that did spend time on the device and activity, i.e. the average time spent calculated after excluding the outliers and the respondents that did not spend time on that respective device and activity.

Figure10.png
Figure 10: Respondent who uses PC/Laptop, watch video or TV online and engage in social networking activities for 24 hours a day

Figure11.png
Figure 11: Time spent of the above respondent was recode to mean of 2.49. The mean of 2.49 is calculated after excluding the outlier as well as the respondents which did not spent time on social networking

3. Data Binning

We looked at the distributions of the data and found that for certain variables, there are many categories with very few responses. For example, the range of the distribution of mobile brand usage is wide and the distribution is skewed to the left i.e. there are many mobile brands that are used by very few users. To deal with that, we checked the proportion of users of these mobile brands who have purchased Personal Care products. These mobile brands are then grouped together based on the proportion of users that purchased or did not purchase personal care products, whichever is higher. For example, MobileBrandXYZ has 5 users, 3 of them purchased personal care products while 2 did not. We grouped mobile brands such as MobileBrandXYZ into “Others - G1 Yes” while mobile brands having higher proportion of users who did not purchase personal care products into “Others - G1 No.”

Figure12.png
Figure 12: On the left is the distribution of the Mobile brand before grouping them by proportion. The diagram on the right is the result of grouping the mobile brands by proportion

4. Interpreting Free Text Responses

There are survey questions that allow respondents to select from a list of options and input free text under the option “Others, please specify: _____” if their desired options are not available. We infer the free text responses and as much as possible by matching them to one of the options available. If there are no suitable options, new options are added to the existing list to capture the free text responses of the respondents. This may be due to the large number of options available or misunderstanding of the meaning of the options thus respondent choose to input in their own words. This takes into consideration the free text responses instead of representing them as “Others” which may exclude important responses from the users.

Figure13.png
Figure 13: Respondents’ free text responses of mobile brand they use