AY1516 T2 Group 18 Data

From Analytics Practicum
Jump to navigation Jump to search

HOME

 

PROJECT OVERVIEW

 

DATA

 

PROJECT FINDINGS

 

SCHEDULE

 

DOCUMENTATION


Data Provided

The 2015 Connected Life study was conducted over 50 markets around the world. For this report, we managed to attain the dataset from the Singapore and Malaysia markets, which was obtained using stratified random sampling based on region. The survey was administered online via TNS’s panel partners, and was carried out for the duration of three months from May to June 2015. The sampling frame of the survey encompasses regular Internet users (defined as people who use Internet at least once a week) in Singapore and Malaysia aged between 16 and 65 years old who are members of TNS’s panel partners. In addition, due to the nature of the survey questions, most of the variables are in categorical form. We were given a total sample size of 2,239 responses to work with - 1,145 from Malaysia, and 1,094 from Singapore.

The online questionnaire was programmed in such a way that logic checks and routing checks would be done for every response, and hence the data obtained was complete. The dataset was presented to us in SPSS format, but was later converted to a format compatible to run on SAS JMP. Besides the SPSS data file, we were also given the survey questionnaire which contains the questions, options, and the codes corresponding to each option. An example is shown below:

Figure2.png
Figure 2: Example of a column variable in the dataset

Figure3.png
Figure 3: Snippet of survey questionnaire showing question, options (e.g. homemaker, unemployed, etc.), and corresponding codes (e.g. 01, 02, etc.)

Target Sample Group Used

As we are interested in the consumers who purchase Personal Care products, we need to first start by selecting consumers of fast-moving consumer goods (FMCG), since Personal Care products are under FMCG. We filtered out respondents who did not purchase FMCG products in the past four weeks when the survey was administered, based on the assumption that they are not customers of FMCG to begin with. We ended up with a total of 2074 responses.

Selection of the Variables

Response Variable
As the aim of our analysis is focused on identifying whether a consumer purchases Personal Care products or not, we reviewed the survey questionnaire and located a question that would help us directly with that - “Have you bought any of the following products in the last 4 weeks? - Personal Care”. The resulting variable comes with two levels of measures (yes/no), as shown in Figure 3 below. The frequencies of the response variable include 1,407 respondents (67.8%) “yes”, and 667 respondents (32.2%) indicating “no”, with “yes” being treated as consumers of Personal Care products.

Figure4.png
Figure 4: Response variable column

Explanatory Variables
As the dataset given consists of variables from multiple industries in addition to the Personal Care industry (e.g. questions asking specifically about the banking industry and its performance, etc.), we removed those variables in addition to variables we believe do not have an influence on determining the purchase decision of Personal Care products (e.g. respondent’s serial number, what day was yesterday, etc.). We started out with 6753 variables, and after data preparation, we managed to shrink the number of explanatory variables needed to 208 for the decision tree model, and 246 variables for the logistic regression model.

Data Preparation

Although the data has been preliminarily cleaned by the TNS’s panel partners, additional data preparation work is required to allow us to focus on the key objectives better and provide more meaningful analysis. We conducted univariate analysis to check for outliers, observing the distributions of each variable, and used it to determine the number of dummy variables to create for a predictor variable. Besides that, we also ran through the distributions for all variables again to ensure that the data has been properly cleaned. In addition, it also helps us to get a feel of the distribution of responses for each question and whether it makes sense, and helps us to form initial expectations of the results.

We also did bivariate data analysis to test the association between all (categorical) variables against each other to ensure that the variables selected for model building in the later phase are all independent variables. This is done through a Test of Association. This helps to ensure the accuracy of our end results by removing the effects of multicollinearity. Following is the list of data preparation steps taken.

1. Variables and Dimension Reduction

Removing Irrelevant Variables
The dataset includes variables that are irrelevant for the Singapore and Malaysia market. One such example is the “Renren” social network/instant messaging platform as shown in the figure below, where it does not contain any user from both Singapore and Malaysia, as this is a platform used in the China market. These variables are excluded from our analysis.

In order to provide more in-depth analysis and insights, we will need to explore the data to understand various aspects of the data further. We will also need to manipulate certain variables to improve the data quality as follows:

Figure5.png
Figure 5: Variables with no users from Singapore and Malaysia

Excluding Variables with More Than 50% Missing Value or Less Than 50% Usage
We excluded variables that contain more than 50% of respondents not using that service. For example, the social media or instant messaging platform “Pinterest”, “MySpace” and “Friendster” has respectively 71%, 80% and 88% of respondents who have never used that service. These variables are excluded because the high proportion of respondents who never used this service would not be helpful to determine whether or not a respondent will purchase or not purchase personal care products.

Figure6.png
Figure 6: Variables with less than 50% usage by all respondents

Combining Multiple Variables into One

One of the reasons why the dataset consists of large amount of variables is because at least one variable is created for each option in a multiple choice question. For example, the question regarding household structure has four separate categorical variables created with Yes/No value for “No Children”, “Children”, “Dependent Children” and “Independent Children”. We reduced the number of variables by combining multiple categorical variables into a single variable. In this case, the four variables are combined into a single variable with three possible options, “No children”, “Dependent Children” and “Independent Children.”

Figure7.png
Figure 7: Survey question for household structure

Figure8.png    Figure9.png
Figure 8 (left): Four variables created from the above survey question
Figure 9 (right): Single variable after combining the results of the four variables

2. Outliers

We identified the outliers and recoded them to the mean of the distribution. In the dataset, we had 13 respondents who used multiple devices and engaged in multiple social media activities for 24 hours on a typical day. We recognized that it is possible for devices to operate 24 hours a day while engaging in social media activities. However, we assumed that it is highly unlikely for a person to use multiple devices and engage in social media activities 24 hours on a typical day. Nevertheless, we recognized that these respondents use those devices as well as engage in social media activities. Therefore, the time spent was recode to the average time spent of all respondents that did spend time on the device and activity, i.e. the average time spent calculated after excluding the outliers and the respondents that did not spend time on that respective device and activity.

Figure10.png
Figure 10: Respondent who uses PC/Laptop, watch video or TV online and engage in social networking activities for 24 hours a day

Figure11.png
Figure 11: Time spent of the above respondent was recode to mean of 2.49. The mean of 2.49 is calculated after excluding the outlier as well as the respondents which did not spent time on social networking

3. Data Binning

We looked at the distributions of the data and found that for certain variables, there are many categories with very few responses. For example, the range of the distribution of mobile brand usage is wide and the distribution is skewed to the left i.e. there are many mobile brands that are used by very few users. To deal with that, we checked the proportion of users of these mobile brands who have purchased Personal Care products. These mobile brands are then grouped together based on the proportion of users that purchased or did not purchase personal care products, whichever is higher. For example, MobileBrandXYZ has 5 users, 3 of them purchased personal care products while 2 did not. We grouped mobile brands such as MobileBrandXYZ into “Others - G1 Yes” while mobile brands having higher proportion of users who did not purchase personal care products into “Others - G1 No.”

Figure12.png
Figure 12: On the left is the distribution of the Mobile brand before grouping them by proportion. The diagram on the right is the result of grouping the mobile brands by proportion

4. Interpreting Free Text Responses

There are survey questions that allow respondents to select from a list of options and input free text under the option “Others, please specify: _____” if their desired options are not available. We infer the free text responses and as much as possible by matching them to one of the options available. If there are no suitable options, new options are added to the existing list to capture the free text responses of the respondents. This may be due to the large number of options available or misunderstanding of the meaning of the options thus respondent choose to input in their own words. This takes into consideration the free text responses instead of representing them as “Others” which may exclude important responses from the users.

Figure13.png
Figure 13: Respondents’ free text responses of mobile brand they use