AY1516 T2 Group 18 Data

From Analytics Practicum


Data

Data Provided

The Singapore and Malaysia data provided to us by TNS was collected via online panels and weighted afterwards to be representative of each country's population. The data was presented to us in SPSS format and had already been cleaned, because the online questionnaires were programmed so that logic checks and routing were performed during data collection. In addition, the collected data was automatically coded in the background based on the specifications in the survey questionnaire, which was also given to us.

Data Preparation

Even though the data presented to us has been preliminarily cleaned by the online panels, it still needs to go through further stages of cleaning to ensure that it is relevant for the purpose of this project. As we are focusing on only one particular sector, we also need to filter out all the questions that cater specifically to other, irrelevant market sectors. This allows us to focus better on the key objectives and provide more meaningful analysis.

To provide more in-depth analysis and insights, we will need to explore the data and understand its various aspects further. We will also need to manipulate certain variables to improve the data quality, as follows:

  • Exclude variables that are not statistically significant in predicting which internet users purchased personal care products online
    • In addition to single-response variables, a few survey questions accept multiple responses, i.e. a user can select every period of the day in which he/she does a certain activity. The results of this type of question are mapped to several dichotomous variables (0 or 1), one per available option. If a user did activity A during periods 1 and 2 of the day, the result is a value of 1 in both “Activity A - Period 1” and “Activity A - Period 2”. We have to understand how the questionnaire results are mapped to the variables before applying the appropriate methods or checking the significance of the variables. Because these variables come from the same multi-response question and are related, checking their significance individually (using the “Fit Y by X” function in JMP Pro 12) can give a very different result from checking them as a group of multi-response variables (using the “Categorical Response Analysis” function and setting the role of these variables to “Multiple - Indicator Group”)
  • Drop variables that are irrelevant to the project, such as data for other countries
    • This includes variables such as social networking platforms that are not used by internet users in Singapore and Malaysia
  • Recode outliers by looking at the data distribution
    • Each outlier is recoded to the mean of the distribution, i.e. the mean computed without the outlier's own value. Most of our outliers belong to users who reported doing multiple activities, such as watching TV, using a mobile phone, social networking and using a PC/laptop, 24 hours a day. This scenario is highly unlikely, so we assumed that such a user does these activities regularly and recoded the outlier to the average time spent on each of the activities
  • Group categories with low frequency counts, e.g. mobile brand, mobile service provider, etc.
    • The range of mobile brands in use is wide and the distribution is highly uneven, with a long tail, i.e. there are many mobile brands that are used by very few users. For each of these brands, we check the proportion of its users that purchased personal care products online. The brand is then assigned to a group according to which proportion is higher: users who purchased or users who did not purchase personal care products online. For example, if mobile brand X has 5 users and 3 of them purchased personal care products online, brand X is categorised as a brand with a higher proportion of users purchasing personal care products online. After finding the proportion for all brands with few users, the brands are grouped under “Others - Purchased personal care products” or “Others - Did not purchase personal care products”. Grouping these mobile brands did not affect the distribution of this variable
  • Reduce dimensionality by combining similar variables into one
    • For example, four separate categorical variables with Yes/No values for “No Children”, “Children”, “Dependent Children” and “Independent Children” are combined into a single variable with three possible options: “No children”, “Dependent Children” and “Independent Children”. This reduces the number of variables used in our analysis without losing the information held in the four original variables
  • Recode and match free text responses
    • We make sense of the free text responses and code them into one of the available options. For free text responses that do not match any existing option, new options are added to the list. Respondents may have chosen to answer in their own words because of the large number of options available or because they misunderstood the meaning of the options. This approach takes the free text responses into account instead of representing them all as “Others”, which might exclude important feedback from the users
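
Several of the steps above can be illustrated with a minimal Python sketch. The project itself carried out these steps in JMP Pro 12, so this is only an illustration under stated assumptions, not the team's actual procedure. All function names, the outlier rule passed in as is_outlier, the min_count threshold, the tie-breaking rule for rare brands and the precedence of “Dependent Children” over “Independent Children” are hypothetical choices made for the example.

```python
from collections import defaultdict
from statistics import mean


def to_indicators(selected, options, prefix):
    """Map one multi-response answer to 0/1 indicator variables, one per option."""
    return {f"{prefix} - {opt}": int(opt in selected) for opt in options}


def recode_outliers_to_mean(values, is_outlier):
    """Replace flagged outliers with the mean of the remaining (non-outlier) values."""
    kept = [v for v in values if not is_outlier(v)]
    replacement = mean(kept)
    return [replacement if is_outlier(v) else v for v in values]


def group_rare_brands(rows, min_count):
    """Build a brand -> group mapping from (brand, purchased) pairs.

    Brands with fewer than min_count users (an assumed threshold) are merged
    into one of two "Others" groups, depending on whether most of their users
    purchased. Ties go to "did not purchase" (an assumption).
    """
    users = defaultdict(int)
    buyers = defaultdict(int)
    for brand, purchased in rows:
        users[brand] += 1
        buyers[brand] += int(purchased)

    mapping = {}
    for brand, n in users.items():
        if n >= min_count:
            mapping[brand] = brand  # frequent brands keep their own category
        elif buyers[brand] * 2 > n:
            mapping[brand] = "Others - Purchased personal care products"
        else:
            mapping[brand] = "Others - Did not purchase personal care products"
    return mapping


def combine_children(no_children, dependent, independent):
    """Collapse the Yes/No children flags into one three-level variable.

    The separate "Children" flag is implied by the other two, so it is not
    needed. If both flags are set, "Dependent Children" wins (an assumption).
    """
    if no_children:
        return "No children"
    if dependent:
        return "Dependent Children"
    if independent:
        return "Independent Children"
    return None  # no information recorded


if __name__ == "__main__":
    print(to_indicators({"Period 1", "Period 2"},
                        ["Period 1", "Period 2", "Period 3"], "Activity A"))
    # Hours per day spent on an activity; 24 is treated as an outlier here.
    print(recode_outliers_to_mean([2, 4, 24], lambda h: h >= 24))
    rows = [("X", True)] * 3 + [("X", False)] * 2 + [("Z", True)] * 10
    print(group_rare_brands(rows, min_count=10))
    print(combine_children(False, True, False))
```

In this sketch, brand X from the example above (5 users, 3 of whom purchased) falls into the “Others - Purchased personal care products” group, while a brand that meets the assumed threshold keeps its own category.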