Group14 Project Findings

From Analytics Practicum
Jump to navigation Jump to search
Group Logo


HOME

 

ABOUT US

 

PROJECT OVERVIEW

 

PROJECT FINDINGS

 

PROJECT MANAGEMENT

 

DOCUMENTATION

 
Data Methodology

Data

Data preparation is a crucial component of any data analysis, it is a time-consuming and tedious task however still extremely important. It allows for more accurate, easier and better interpretation of the data. We received these raw data files from ABC Retail as shown below:

  • Outlet Data
Contains outlet information such as outlet branch code, floor size, scope of service and branch type (if the outlet is within a mall, a standalone outlet or a Regional outlet) for each outlet for the years 2013 and 2014.
  • Customer Data
Contains all the characteristics for a unique customer that has visited the store, such as citizenship, birth year, race, gender, location ID for both the years 2013 and 2014
  • Transactional Data
Contains all the rows of transactions of items purchased by customers over the years 2013 and 2014.

Summary of Data Cleaning

1. Customer Dataset
Inside the customer dataset which we received, there are a few types of invalid data including invalid “Locale Planning ADZID” (data value as “Bad Value” and “Missing Value”), invalid patron birthyear (data value as “1900”, which has been confirmed to be invalid with ABC retail) and invalid subzone (generate subzone value as “CKSZ07” which does not exist in any other government subzone data). So, we discarded all these invalid data in R using subset function.

Invalid Customer Data with “Bad Value”


Invalid Customer Data with “Missing Value”


Invalid Customer Data from Invalid Subzone “CKSZ07”


Invalid Customer Data with Birthyear 1900


Handling of Invalid Customer Data in R

After cross checking with official population census data, we realised that there are also some customers from subzones where population equals zero, which means these customers are also invalid. Therefore, we also discarded these data.

Handling of Invalid Customer Data from Zero-Population Subzone



2. Transaction Dataset
In the transaction dataset, there are also some invalid data which correspond to the invalid customers as identified above. Therefore, to make sure that those invalid data are not passed on to our further analysis steps, we used inner join to exclude them and at the same time, linked subzone information with each transaction records.

Handling of Invalid Data in Transaction Dataset

Additional Data

Having solely government data as input is not enough for our analysis, so additional data generation and abstraction is also necessary prior to the modelling and visualization. As identified in the previous steps, the number of MRT, malls and tuition centres are not only required in the model as part of the attractiveness variables, they are also helpful in the data exploration process. Therefore, we are expected to generate the corresponding data for each outlet based on their location, which was not available before. Besides, as one of the most important components in our model, the distance between each outlet and each subzone is also lacking at this stage. In order to calculate the necessary number of surrounding facilities and distance between, we customized programming functions based on geographical information of existing data (outlets, subzones, MRTs, malls and tuition centres) in R to carry out the task.

Government Data Retrieval
In order to implement the MCI model, the attraction variables and transport cost must be obtained. In order to guarantee the accuracy of our outcome from the first stage, we must incorporate official up-to-date data into our analysis.

Singapore Population Data
In order to analyse the patronage trend among different population cohorts of ABC Retail customers, dataset containing accurate population data is required. In addition to this, population data is also necessary in the MCI model to represent outlet visiting power from each subzone (“purchasing power” in a store market context”). Therefore, our team retrieved the latest updated population data from Statistics Singapore for the years 2000 to 2016, as grouped by subzones, sex and age cohort. Our team selected year 2013 population data for further modelling and analysis in order to match it with the year 2013 transaction data provided by ABC Retail.

Number of Surrounding Facilities
Prior to calculating the number of surrounding facilities, we first identified that the effective buffer distance around any outlet is 1km. From this, we utilized the longitudes and latitudes of various subjects and geo-functions included in R to calculate the number of surrounding facilities that fall into the defined buffer.

Distance from Subzones to outlets
Since subzone itself is a polygon with area, we must identify one representative point in each subzone to measure the distance between it and every single outlet. Our team believes that the centroid of each subzone is the perfect point we are looking for due to its unbiasedness in terms of distance measurement. Therefore, we used a geographical data processing tool in order to determine the centroid of each subzone– QGIS. Firstly, we imported the shape file of polygon data – Subzone, and then used the geometry tool – Polygon Centroids to generate and export the centroid data, which was further imported into our data in R.
Since we have already obtained the subzone centroids data points and imported it in R, next, we used a customized R function to calculate the distances between all subzones and all outlets to get the “Subzone_Distance” data frame.

Outlet Data with Number of Items Purchased
The data that we received from ABC Retail was in the format where each row represented a scanned item which was purchased. Therefore, each row was a part of a transaction but did not represent a full transaction. Thus, our team decided to transform the data in such a manner that each row in the data would then consist of a given transaction carried out by a customer. The main aim of this transformation was to capture the total number of items purchased per transaction for a given customer at a time. Once we had gathered this information, we could then easily map out the number of items purchased at each outlet by customers.

Dummy Variable Creation – Branch Type
In the outlet data given by our sponsor, the column “Branch Type” has one of the three values – “Mall”, “Stand-Alone” and “Regional” to indicate the outlet type. However, such value type is not permitted in the MCI R package, so in order to make sure that the formats meet requirements, we transformed this column into two dummy variables – “type_if_mall” and “type_if”standalone”, whose values are (1, 0) when the branch type is “Mall”, (0, 1) when the branch type is “Stand-Alone” and (0, 0) when the branch type is “Regional”. These two dummy variables will be detected when we put the data into MCI model and thus processed without any further transformation.

Subzone Abstraction for Customer Data
In the given customer dataset, each customer UID corresponds with only one variable “Locale Planning ADZID” to indicate their residential area. Therefore, we need to generate subzone data for each of the customers to further match the data with the rest. In order to do so, keeping in mind that subzone would be the first 6 letters/numbers of Locale Planning ADZID, we truncated the first 6 chars of Locale Planning ADZID and assigned them to a new variable “Subzone”.