Difference between revisions of "AY1516 T2 Group 18 Data"

From Analytics Practicum
Line 33: Line 33:
 
==<div style="background:#ff4fa7; padding: 10px; font-size: 14px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #D3D3D3 solid 25px;"><font color="white">Data</font></div>==
 
  
<b>Data Provided</b><br>
+
<b>Data Provided</b><br><br>
 
The Singapore and Malaysia data provided to us by TNS was collected via online panels and weighted afterwards to be representative of the population of each country. The data was given to us in SPSS format and had already been cleaned, because the online questionnaires were programmed so that logic checks and routing were carried out during data collection. In addition, the responses were automatically coded in the background according to the specifications in the survey questionnaire, which was also given to us.<br><br>
 
  
<b>Data Preparation</b><br>
+
<b>Data Preparation</b><br><br>
Even though the data presented to us has been preliminarily cleaned by the online panels, there is still a need to go through a second stage of cleaning to ensure that the data is relevant for the purpose of this project. There is also a need to filter out outliers and anomalies. As we are only focusing on one particular sector, we also need to filter out all the questions catered specifically for other irrelevant market sectors. This allows us to focus on the key objectives better and provide more meaningful visualization.<br><br>
+
Even though the data presented to us has been preliminarily cleaned by the online panels, it still needs to go through more stages of cleaning to ensure that it is relevant to the purpose of this project. As we are focusing on only one sector, we also need to filter out all the questions that cater specifically to other, irrelevant market sectors. This allows us to focus better on the key objectives and provide more meaningful analysis.
 +
<br><br>
  
In order to provide more in-depth analysis and insights, we will need to explore the data to understand various aspects of the data further. After which, we will also need to manipulate certain variables to create multi-dimensional views of the data. E.g. creation of a derivative variable.<br><br>
+
To provide more in-depth analysis and insights, we need to explore the data further to understand its various aspects. We will also need to manipulate certain variables to improve the data quality, as follows:
 +
<br><br>
  
<i>Note:</i><br>
+
*Exclude variables that are not statistically significant in predicting whether internet users purchased personal care products online
We have identified variables from the dataset that are useful for the purpose of our analysis. However, due to the NDA with the company, more information will only be made available in the project proposal.<br>
+
**A few survey questions accept multiple responses, i.e. a user can select each period of the day in which he/she does a certain activity. The results of such questions are mapped to several dichotomous variables (0 or 1), one per available option. If a user did activity A during periods 1 and 2 of the day, the value will be 1 in both “Activity A - Period 1” and “Activity A - Period 2”. We have to understand the questionnaire results and how they are mapped to the variables before applying the appropriate methods or checking the significance of the variables. As these variables come from a single multi-response question and are related, checking their significance individually (using the “Fit Y by X” function in JMP Pro 12) gives very different results from checking them as a group of multi-response variables (using the “Categorical Response Analysis” function and setting the role of these variables to “Multiple - Indicator Group”)
Sample dataset will be in the project proposal as well.
+
*Drop variables that are irrelevant to the project, such as data for other countries
 +
**This includes variables such as social networking platforms that are not used by internet users from Singapore and Malaysia
 +
*Recode outliers by looking at the data distribution
 +
**Recode each outlier to the mean of the distribution, i.e. the mean calculated without the outlier's value. One example is a user who reports doing multiple activities, such as watching TV, using a mobile phone, social networking and using a PC/laptop, 24 hours a day. This scenario is highly unlikely, but we assumed that this user does the mentioned activities regularly, so we recoded the outlier to the average time spent on each of the activities
 +
*Group categories with a low frequency count, e.g. mobile brand, mobile service provider, etc.
 +
**The distribution of mobile brand usage is wide and highly skewed, i.e. there are many mobile brands that are used by very few users. We check the proportion of users of each of these brands who purchased personal care products online. The brands are then grouped by whether the proportion of users who purchased or the proportion who did not purchase is higher. For example, if mobile brand X has 5 users and 3 of them purchased personal care products online, the brand is categorised as one with a higher proportion of users purchasing personal care products online. After finding the proportion for all brands with few users, we group them under “Others - Purchased personal care products online” or “Others - Did not purchase personal care products online”. As the mobile brands with more users have a higher proportion of users who purchased personal care products online, we use this grouping to compare mobile brands with a higher proportion of users purchasing against those with a higher proportion not purchasing personal care products online
 +
*Reduce dimensionality by combining similar variables into one
 +
**For example, four separate categorical variables with Yes/No values for “No Children”, “Children”, “Dependent Children” and “Independent Children” are combined into a single variable with three possible options: “No Children”, “Dependent Children” and “Independent Children”
 +
*Recode and match free-text responses
 +
**Make sense of the free-text responses and code them into one of the available options. Respondents may have chosen to answer in their own words because of the large number of options available or because they misunderstood the meaning of the options
  
 
<div align="left"> <!-- END CHUNK-->
 
<div align="left"> <!-- END CHUNK-->

Revision as of 22:04, 28 February 2016


Data

Data Provided

The Singapore and Malaysia data provided to us by TNS was collected via online panels and weighted afterwards to be representative of the population of each country. The data was given to us in SPSS format and had already been cleaned, because the online questionnaires were programmed so that logic checks and routing were carried out during data collection. In addition, the responses were automatically coded in the background according to the specifications in the survey questionnaire, which was also given to us.

Data Preparation

Even though the data presented to us has been preliminarily cleaned by the online panels, it still needs to go through more stages of cleaning to ensure that it is relevant to the purpose of this project. As we are focusing on only one sector, we also need to filter out all the questions that cater specifically to other, irrelevant market sectors. This allows us to focus better on the key objectives and provide more meaningful analysis.

To provide more in-depth analysis and insights, we need to explore the data further to understand its various aspects. We will also need to manipulate certain variables to improve the data quality, as follows:

  • Exclude variables that are not statistically significant in predicting whether internet users purchased personal care products online
    • A few survey questions accept multiple responses, i.e. a user can select each period of the day in which he/she does a certain activity. The results of such questions are mapped to several dichotomous variables (0 or 1), one per available option. If a user did activity A during periods 1 and 2 of the day, the value will be 1 in both “Activity A - Period 1” and “Activity A - Period 2”. We have to understand the questionnaire results and how they are mapped to the variables before applying the appropriate methods or checking the significance of the variables. As these variables come from a single multi-response question and are related, checking their significance individually (using the “Fit Y by X” function in JMP Pro 12) gives very different results from checking them as a group of multi-response variables (using the “Categorical Response Analysis” function and setting the role of these variables to “Multiple - Indicator Group”)
  • Drop variables that are irrelevant to the project, such as data for other countries
    • This includes variables such as social networking platforms that are not used by internet users from Singapore and Malaysia
  • Recode outliers by looking at the data distribution
    • Recode each outlier to the mean of the distribution, i.e. the mean calculated without the outlier's value. One example is a user who reports doing multiple activities, such as watching TV, using a mobile phone, social networking and using a PC/laptop, 24 hours a day. This scenario is highly unlikely, but we assumed that this user does the mentioned activities regularly, so we recoded the outlier to the average time spent on each of the activities
  • Group categories with a low frequency count, e.g. mobile brand, mobile service provider, etc.
    • The distribution of mobile brand usage is wide and highly skewed, i.e. there are many mobile brands that are used by very few users. We check the proportion of users of each of these brands who purchased personal care products online. The brands are then grouped by whether the proportion of users who purchased or the proportion who did not purchase is higher. For example, if mobile brand X has 5 users and 3 of them purchased personal care products online, the brand is categorised as one with a higher proportion of users purchasing personal care products online. After finding the proportion for all brands with few users, we group them under “Others - Purchased personal care products online” or “Others - Did not purchase personal care products online”. As the mobile brands with more users have a higher proportion of users who purchased personal care products online, we use this grouping to compare mobile brands with a higher proportion of users purchasing against those with a higher proportion not purchasing personal care products online
  • Reduce dimensionality by combining similar variables into one
    • For example, four separate categorical variables with Yes/No values for “No Children”, “Children”, “Dependent Children” and “Independent Children” are combined into a single variable with three possible options: “No Children”, “Dependent Children” and “Independent Children”
  • Recode and match free-text responses
    • Make sense of the free-text responses and code them into one of the available options. Respondents may have chosen to answer in their own words because of the large number of options available or because they misunderstood the meaning of the options
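
As a rough illustration of three of the steps above (recoding outliers, grouping low-frequency categories and combining the children variables), the following is a minimal Python sketch. It is not the JMP Pro 12 workflow described in this section. The function names, the `upper_limit` and `min_count` thresholds, the rule that ties go to the "did not purchase" group, and the precedence of dependent over independent children are all assumptions made for illustration only.

```python
from collections import Counter
from statistics import mean


def recode_outliers_to_mean(values, upper_limit):
    """Replace values above upper_limit (hypothetical cut-off) with the
    mean of the remaining values, i.e. the mean without the outliers."""
    kept = [v for v in values if v <= upper_limit]
    replacement = mean(kept)  # raises StatisticsError if nothing is kept
    return [v if v <= upper_limit else replacement for v in values]


def group_rare_categories(rows, min_count):
    """rows is a list of (brand, purchased) pairs, purchased being a bool.
    Brands with fewer than min_count users (hypothetical threshold) are
    mapped to one of two 'Others' groups by their majority outcome."""
    users = Counter(brand for brand, _ in rows)
    buyers = Counter(brand for brand, purchased in rows if purchased)
    mapping = {}
    for brand, n in users.items():
        if n >= min_count:
            mapping[brand] = brand  # frequent brand keeps its own label
        elif buyers[brand] * 2 > n:  # strictly more buyers than non-buyers
            mapping[brand] = "Others - Purchased personal care products online"
        else:  # assumption: ties fall into the non-purchasing group
            mapping[brand] = "Others - Did not purchase personal care products online"
    return mapping


def combine_children_flags(no_children, dependent, independent):
    """Collapse the Yes/No flags into one three-level variable.
    The 'Children' flag is implied by the other two, so it is not needed.
    Assumption: 'Dependent Children' wins if both child flags are set."""
    if no_children:
        return "No Children"
    if dependent:
        return "Dependent Children"
    if independent:
        return "Independent Children"
    return None  # no flag set: leave as missing


if __name__ == "__main__":
    # Hours per day spent on one activity; 24 is treated as the outlier.
    print(recode_outliers_to_mean([2, 3, 4, 24], upper_limit=20))

    # Brand X mirrors the example in the text: 5 users, 3 of them purchased.
    rows = [("X", True)] * 3 + [("X", False)] * 2 + [("Big", True)] * 10
    print(group_rare_categories(rows, min_count=6))

    print(combine_children_flags(False, True, False))
```

In practice the cut-off for an outlier and the frequency below which a brand counts as rare would be chosen by inspecting the distributions, as described in the list above.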