Difference between revisions of "AY1516 T2 Group 18 Project Overview"

From Analytics Practicum
Line 36: Line 36:
  
 
<div align="left">
 
<div align="left">
==<div style="background:#ff4fa7; padding: 10px; font-size: 14px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #D3D3D3 solid 25px;"><font color="white">Introduction and Project Background</font></div>==
+
==<div style="background:#ff4fa7; padding: 10px; font-size: 14px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #D3D3D3 solid 25px;"><font color="white">Introduction</font></div>==
  
Understanding your target audience remains at the heart of successful marketing. Connected Life is TNS's global syndicated study, designed to understand the connected consumer better. It is the largest and most comprehensive study of the digital behavior of consumers across the world. <br><br>
+
Syndicated market research studies that target clients from multiple industries are a good source of revenue for market research companies (insert source). However, studies of this nature typically have long questionnaires, since they contain questions catering to multiple industries. As a result, a respondent typically takes about 30 minutes to complete one. This has two consequences. Firstly, responses tend to be suboptimal because long questionnaires strain and tire respondents, which lowers both response rates and response quality. Secondly, because of the large number of survey questions (and hence the many resulting variables), a larger monetary incentive is needed to persuade respondents to complete the entire survey. If the survey were shorter, the added incentive could instead be used to recruit more respondents and improve the results. Hence, market research companies need ways to shorten their surveys while upholding the accuracy of their results.<br><br>
The need for the study includes the following: <br>
 
a) There was a gap in the market as no one was offering such comprehensive information about digital consumers <br>
 
b) It was cost prohibitive for one client to undertake such a global venture and hence, clients were only doing these studies selectively and where budgets allowed <br>
 
c) Other studies which also claim to have such a global footprint were either by publishers themselves or by media agencies, thus clients are apprehensive that the analysis offered by them is biased and hence an independent study like Connected Life has great appeal.
 
  
<div align="left">
+
In this report, we aim to build an effective explanatory model that will help to reduce the number of variables needed for a market research study. By identifying pertinent variables and omitting variables that do not add value to the study results, we will be able to effectively reduce the number of survey questions in a study and reduce strain on survey respondents, provided that the behavior and demographics captured of the consumers in the industry remain the same in future studies.<br><br>
==<div style="background:#ff4fa7; padding: 10px; font-size: 14px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #D3D3D3 solid 25px;"><font color="white">Project Motivation</font></div>==
 
  
With the advent of the internet and digital devices over the past decade, it has become increasingly complex to understand and influence the choices of consumers. The media landscape has been shifting, and traditional marketing approaches no longer work as well as they once did. Many companies have started to rely on digital marketing to reach consumers: digital media growth has been estimated at 4.5 trillion online advertisements served annually, with digital media spend at 48% in 2010. Even so, traditional marketing approaches cannot be considered obsolete or neglected, as they still prove more effective for certain products and consumers. As such, businesses now find it difficult to decide how to allocate marketing resources to reap the best results from their target consumers.
+
As our dataset consists of questions catering to numerous industries, we focus our efforts on the Personal Care industry. Personal Care products include facial care products, cosmetics, perfume or cologne, skin care products, and hair care products. Our objective is to identify the significant factors (comprising socio-demographic and economic profile, devices, digital media platforms, and online behavior in terms of time spent, frequency, and part of day for device and activity engagement of Personal Care consumers) that allow us to quantify the relationship between consumers’ behavior and their purchase of Personal Care products.<br><br>
<br><br>
 
  
The Connected Life study conducted by TNS covers over 50 countries and 58 product categories; for our project, we have been given the datasets from two markets, Singapore and Malaysia. Within these, we have chosen to focus on the Personal Care sector, a branch of the Fast-Moving Consumer Goods (FMCG) industry, for the purpose of our analysis.
+
==<div style="background:#ff4fa7; padding: 10px; font-size: 14px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #D3D3D3 solid 25px;"><font color="white">Study Context</font></div>==
  
<div align="left">
+
This report uses the dataset from a 2015 syndicated research study called Connected Life, conducted by Taylor Nelson Sofres (TNS) Singapore, a market research company under the WPP group. The study aims to identify the target consumer profiles, devices, and digital media platforms that today’s connected consumers engage with, so that businesses from different industries can formulate more targeted marketing strategies and maximize the return on investment of their business decisions. The survey questionnaire is therefore designed to cover multiple industries, including Personal Care, Airline, Mobile, etc. As a result, most questions are general ones that give an industry-level view. However, specific parts of the study results can be extracted for further analysis for interested companies. <br>
 
 
==<div style="background:#ff4fa7; padding: 10px; font-size: 14px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #D3D3D3 solid 25px;"><font color="white">Project Objectives</font></div>==
 
 
 
The aim of this project is to help marketers from the FMCG - Personal Care industry to identify target consumer profiles, digital media platforms, and devices to allow for more targeted marketing strategies, thus maximizing return on investment (ROI) on their business decisions.<br><br>
 
 
 
In order to do so, we have identified 5 main objectives that we would like to answer by the end of this project. These objectives follow through a step-by-step process of identifying, connecting, engaging, and lastly, having the power to influence at the end of it:<br>
 
*Who are our target consumers?
 
*What are the digital media platforms and devices that allow marketers to reach their target consumers and connect with them?
 
*How do marketers improve their touchpoint planning?
 
*What are the digital media platforms and content that need to be prioritized in order to drive engagement and advocacy amongst the target consumers?
 
*After engagement is done, how do marketers influence the mindsets and decisions of the connected consumer?
 
 
 
<br>
 
 
 
<div align="left">
 
  
 
==<div style="background:#ff4fa7; padding: 10px; font-size: 14px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #D3D3D3 solid 25px;"><font color="white">Project Methodology</font></div>==
 
==<div style="background:#ff4fa7; padding: 10px; font-size: 14px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #D3D3D3 solid 25px;"><font color="white">Project Methodology</font></div>==
Line 76: Line 55:
 
[[Image: Modelling process.png|300px|link=]]<br><br>
 
[[Image: Modelling process.png|300px|link=]]<br><br>
  
After data preparation, we first sieved out a set of relevant variables from the dataset that would help to answer each of our 5 objectives (i.e. identify target segments and their digital behavior to allow marketers to connect with them, identify how to engage target consumers and prioritize marketing efforts, and lastly, identify the ways to influence their decisions). We used different sections of the questionnaire variables to address the different phases and needs of the objectives.<br><br>
+
The figure above illustrates the explanatory modelling process used for our analysis. The full list of data preparation procedures is given in the following section. After data preparation, we proceeded with exploratory data analysis (EDA) to help us understand more about the data. During this process, we often found ourselves iterating back to the data preparation stage upon observing the distributions of some of the variables. <br><br>
 
 
Before constructing our model, we carried out exploratory data analysis by looking at the data distributions across all variables. Besides confirming that the data had been properly cleaned, this gave us a feel for the distribution of responses to each question and whether it made sense, and helped us to form initial expectations of the results. For example, we split the distributions according to the response variable (G1. Products purchased P4W - NET - Personal Care) into customers and non-customers of Personal Care to see whether there were any obvious discrepancies or patterns in the distributions of the potential predictor variables. We then used those observations to form initial expectations of our model. An example is Gender: we observed a greater proportion of females than males among customers of Personal Care, while the reverse was observed among people who do not buy Personal Care products.<br><br>
 
 
 
The Likelihood Ratio and Pearson’s chi-squared tests were then carried out to sieve out the variables that are statistically significant at the 0.05 level. After that, a test of association between all pairs of (categorical) variables was carried out to ensure that the variables selected for model building in the later phase are independent of one another. This helps to ensure the accuracy of our end results.<br><br>
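As a rough illustration of the two tests named above, the following minimal sketch computes the Pearson chi-squared and likelihood-ratio (G²) statistics for a single 2x2 contingency table. It is not the authors' JMP workflow: the function name `independence_tests` and the counts are hypothetical, and the p-value shortcut only holds for one degree of freedom.

```python
import math

def independence_tests(table):
    """Pearson chi-squared and likelihood-ratio (G^2) tests for a 2x2 table.

    `table` is a list of two rows of two observed counts each.
    """
    if len(table) != 2 or any(len(row) != 2 for row in table):
        raise ValueError("this sketch only handles 2x2 tables")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    pearson = 0.0
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            pearson += (observed - expected) ** 2 / expected
            if observed > 0:
                g2 += 2.0 * observed * math.log(observed / expected)

    def p_value(stat):
        # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2))
        return math.erfc(math.sqrt(stat / 2.0))

    return {
        "pearson": pearson,
        "pearson_p": p_value(pearson),
        "g2": g2,
        "g2_p": p_value(g2),
    }

# Hypothetical counts: rows = gender (female, male),
# columns = (bought Personal Care, did not buy)
counts = [[60, 40], [35, 65]]
result = independence_tests(counts)
significant = result["pearson_p"] < 0.05  # the 0.05 level used in the text
```

A variable would be kept as a candidate predictor when its p-value falls below 0.05; tables with more than two levels per variable need the general chi-squared distribution, which this sketch does not cover.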
 
  
Since our dataset consists mostly of categorical variables, we needed to research the types of predictive modelling suitable for it. Logistic regression is a viable option, as it can take categorical variables into account through the creation of dummy variables. However, given the sheer number of categorical variables in the dataset (as opposed to continuous variables), we judged that logistic regression would yield unsatisfactory results. Instead, we opted for recursive partitioning to build our model. Recursive partitioning can deal with categorical predictor variables without treating them as if they were measured on an interval or ratio scale. Being a non-parametric approach, it can also deal with the so-called small n large p case, unlike parametric models, where interaction effects of high order cannot be included (see Strobl, Malley, and Tutz, 2009). We evaluate our model by assessing the misclassification rate and confusion matrix at each iteration, then improve it through revisions in subsequent iterations. The misclassification rate represents the global accuracy of our prediction model: it is the proportion of incorrect classifications over the total sampled. The confusion matrix, on the other hand, is a two-way classification of actual and predicted responses. Our aim is to minimize the misclassification rate while achieving a high true positive rate. By the end of our project, we hope to arrive at a satisfactory number of explanatory variables that can best predict and classify the customers and behaviors of Personal Care.
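To make the idea of recursive partitioning on categorical predictors concrete, here is a minimal sketch of a single splitting step. It picks the predictor whose levels give the lowest weighted Gini impurity of the response. The Gini criterion, the multiway split on all levels, the function names and the toy rows are all assumptions for illustration; the text does not state which splitting criterion the authors' software applies.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_categorical_split(rows, target):
    """Return (predictor, weighted_gini) for the best single split.

    Each row is a dict of categorical values; every key other than
    `target` is treated as a candidate predictor.
    """
    n = len(rows)
    best = None
    for predictor in rows[0]:
        if predictor == target:
            continue
        # Group the response values by the level of this predictor
        groups = defaultdict(list)
        for row in rows:
            groups[row[predictor]].append(row[target])
        weighted = sum(len(g) / n * gini(g) for g in groups.values())
        if best is None or weighted < best[1]:
            best = (predictor, weighted)
    return best

# Hypothetical respondents; buyer = 1 means bought Personal Care products
rows = [
    {"gender": "F", "device": "mobile", "buyer": 1},
    {"gender": "F", "device": "pc", "buyer": 1},
    {"gender": "M", "device": "mobile", "buyer": 0},
    {"gender": "M", "device": "pc", "buyer": 0},
]
best = best_categorical_split(rows, "buyer")
```

A full tree would apply the same step recursively to each resulting group until a stopping rule is met. No dummy coding is needed because the levels are only used to group rows.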
+
Similarly, during the model fitting stage, we found ourselves iterating between model fitting and evaluation as we calibrated the model for optimal results. We assessed the performance of the models with several statistics, such as the Whole Model Test, the assessment of individual parameters, the Receiver Operating Characteristic (ROC) curve, fit statistics, the misclassification rate and the confusion matrix.<br><br>
  
 +
Furthermore, after fitting and evaluating the model, we discovered ways to improve our analysis. This brought us back to the data preparation stage to reorganize the data, followed by another round of EDA, model fitting and evaluation. Finally, we assessed the models created and recommended a list of actionable improvements to marketers and market research firms. <br>
 
<br>
 
<br>
 
<div align="left">
 
<div align="left">

Revision as of 18:27, 11 April 2016


Taylor Nelson Sofres (TNS) is one of the largest research agencies worldwide. It provides actionable insights to help companies make impactful decisions that drive growth. TNS is part of Kantar, one of the world's largest insight, information and consultancy groups.

Introduction

Syndicated market research studies that target clients from multiple industries are a good source of revenue for market research companies (insert source). However, studies of this nature typically have long questionnaires, since they contain questions catering to multiple industries. As a result, a respondent typically takes about 30 minutes to complete one. This has two consequences. Firstly, responses tend to be suboptimal because long questionnaires strain and tire respondents, which lowers both response rates and response quality. Secondly, because of the large number of survey questions (and hence the many resulting variables), a larger monetary incentive is needed to persuade respondents to complete the entire survey. If the survey were shorter, the added incentive could instead be used to recruit more respondents and improve the results. Hence, market research companies need ways to shorten their surveys while upholding the accuracy of their results.

In this report, we aim to build an effective explanatory model that will help to reduce the number of variables needed for a market research study. By identifying pertinent variables and omitting variables that do not add value to the study results, we will be able to effectively reduce the number of survey questions in a study and reduce strain on survey respondents, provided that the behavior and demographics captured of the consumers in the industry remain the same in future studies.

As our dataset consists of questions catering to numerous industries, we focus our efforts on the Personal Care industry. Personal Care products include facial care products, cosmetics, perfume or cologne, skin care products, and hair care products. Our objective is to identify the significant factors (comprising socio-demographic and economic profile, devices, digital media platforms, and online behavior in terms of time spent, frequency, and part of day for device and activity engagement of Personal Care consumers) that allow us to quantify the relationship between consumers’ behavior and their purchase of Personal Care products.

Study Context

This report uses the dataset from a 2015 syndicated research study called Connected Life, conducted by Taylor Nelson Sofres (TNS) Singapore, a market research company under the WPP group. The study aims to identify the target consumer profiles, devices, and digital media platforms that today’s connected consumers engage with, so that businesses from different industries can formulate more targeted marketing strategies and maximize the return on investment of their business decisions. The survey questionnaire is therefore designed to cover multiple industries, including Personal Care, Airline, Mobile, etc. As a result, most questions are general ones that give an industry-level view. However, specific parts of the study results can be extracted for further analysis for interested companies.

Project Methodology

See here for more information about our data

    Modelling Process:
Modelling process.png

The figure above illustrates the explanatory modelling process used for our analysis. The full list of data preparation procedures is given in the following section. After data preparation, we proceeded with exploratory data analysis (EDA) to help us understand more about the data. During this process, we often found ourselves iterating back to the data preparation stage upon observing the distributions of some of the variables.

Similarly, during the model fitting stage, we found ourselves iterating between model fitting and evaluation as we calibrated the model for optimal results. We assessed the performance of the models with several statistics, such as the Whole Model Test, the assessment of individual parameters, the Receiver Operating Characteristic (ROC) curve, fit statistics, the misclassification rate and the confusion matrix.
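Two of the evaluation statistics listed above, the confusion matrix and the misclassification rate, can be sketched in a few lines. This is only an illustrative sketch under simple assumptions: a binary response coded 1 for Personal Care buyers and 0 otherwise, hypothetical function names, and made-up actual and predicted labels.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for a binary response."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts

def misclassification_rate(cm):
    """Proportion of incorrect classifications over all cases."""
    total = sum(cm.values())
    return (cm["fp"] + cm["fn"]) / total

def true_positive_rate(cm):
    """Share of actual positives that the model classified as positive."""
    return cm["tp"] / (cm["tp"] + cm["fn"])

# Hypothetical labels: 1 = buyer, 0 = non-buyer
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(actual, predicted)
error = misclassification_rate(cm)
tpr = true_positive_rate(cm)
```

Comparing these two numbers across iterations shows whether a lower overall error rate is being bought at the cost of missing actual buyers.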

Furthermore, after fitting and evaluating the model, we discovered ways to improve our analysis. This brought us back to the data preparation stage to reorganize the data, followed by another round of EDA, model fitting and evaluation. Finally, we assessed the models created and recommended a list of actionable improvements to marketers and market research firms.

Project Limitations & Assumptions

  • The datasets given by the sponsor cover only 2 markets, Singapore and Malaysia, so the analysis cannot be taken as representative of other markets.
  • The dataset is small overall, yet has a large number of variables.
  • The dataset consists predominantly of categorical variables, so we cannot use analytical methods designed for predominantly continuous variables.

Analytical Tools

JMP | SAS EM
Scripts can be written in the JMP Scripting Language to customize analyses and generate reports | Scripts can be written in the SAS language to customize analyses and generate reports
Holds data in RAM, so it cannot handle data sets as large as SAS can; with smaller data sets, however, it works faster because processing happens in memory | Can process data on secondary storage instead of RAM, so it can handle huge data sets, including more data than the RAM can hold
Cheaper (about $5k for the first year license) | Expensive (over $100k for the first year license)
Built-in reporting tool that provides general-use reporting capabilities | Powerful reporting tool, with its Business Intelligence and Analytics software, that allows very detailed customization of reports
Does not provide a workflow or history of analysis to keep track of progress | Organizes analysis into projects and process flow diagrams, so the analysis procedure can be tracked
Provides a very interactive GUI that lets users do exploratory data analysis and try out various analytical methods easily and quickly | Provides a server version for ease of collaboration on data cleansing, integration, security and access


The following are the considerations behind choosing JMP as our tool:
1. JMP is easier for us to learn, as we already had some experience with it. Its GUI also makes it easier to explore and manipulate the data. This reduces the time and effort needed to learn a new tool while allowing us to deepen our knowledge of JMP.
2. Both tools have the statistical methods we expect to need for the project, although SAS provides more options than JMP. JMP has the decision tree, bootstrap forest, boosted forest and K nearest neighbour methods, which we expect to be sufficient for our project.
3. Since our data does not exceed the capacity of our RAM, we do not need to process data from secondary storage. Instead, we benefit from a relatively small data set that can be processed in the RAM of our laptops, which gives faster processing.
4. Although both JMP and SAS Enterprise Miner are accessible to us and both provide the capabilities needed for our project, we decided to use JMP for the reasons above.