Group14 Project Findings

From Analytics Practicum
|}
 
 
------------------------------------------------------------------------>
 
<!---------------START of sub menu---------------------->
{| style="background-color:#ffffff; font-family:DIN Alternate; margin: 3px auto 0 auto" width="70%"
|-
! style="font-size:15px; text-align: center; border-top:solid #ffffff; border-bottom:solid #2e2e2e" width="150px" | [[Group14_Project_Findings| <span style="color:#3d3d3d">Data</span>]]
! style="font-size:15px; text-align: center; border-top:solid #ffffff; border-bottom:solid #ffffff" width="20px"|
! style="font-size:15px; text-align: center; border-top:solid #ffffff; border-bottom:solid #ffffff" width="150px"| [[Methodology| <span style="color:#3d3d3d">Methodology</span>]]
! style="font-size:15px; text-align: center; border-top:solid #ffffff; border-bottom:solid #ffffff" width="20px"|
<!--
! style="font-size:15px; text-align: center; border-top:solid #ffffff; border-bottom:solid #ffffff" width="150px"| [[Group03_Result | <span style="color:#3d3d3d">Result</span>]]
! style="font-size:15px; text-align: center; border-top:solid #ffffff; border-bottom:solid #ffffff" width="20px"|
-->
|}
<!---------------------------End of sub menu--------------------->
 
==<div id="mw-content-text" lang="en-GB" dir="ltr" class="mw-content-ltr"><div style="background: #6FB1D0; padding: 10px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #6A8295 solid 15px; font-size: 20px; font-family:DIN Alternate"><font color="white">Data</font></div>==
 
<div style="border-left: #96C0CE solid 8px;font-family: Avenir; padding: 0px 30px 0px 18px; ">
 
  
Data preparation is a crucial component of any data analysis. Although it is a time-consuming and tedious task, it enables more accurate and easier interpretation of the data. We received the following raw data files from ABC Retail:

*Outlet Data
::Contains outlet information such as outlet branch code, floor size, scope of service and branch type (whether the outlet is within a mall, a standalone outlet or a regional outlet) for each outlet for the years 2013 and 2014.
*Customer Data
::Contains the characteristics of each unique customer who has visited a store, such as citizenship, birth year, race, gender and location ID, for the years 2013 and 2014.
*Transactional Data
::Contains one row per item purchased by customers over the years 2013 and 2014.
 
</div>
 
</div>
  
 
==<div id="mw-content-text" lang="en-GB" dir="ltr" class="mw-content-ltr"><div style="background: #6FB1D0; padding: 10px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #6A8295 solid 15px; font-size: 20px; font-family:DIN Alternate"><font color="white">Summary of Data Cleaning</font></div>==
 
<div style="border-left: #96C0CE solid 8px;font-family: Avenir; padding: 0px 30px 0px 18px; ">
 
  
<b>1. Customer Dataset</b>
<br>
The customer dataset we received contains a few types of invalid data: invalid “Locale Planning ADZID” values (recorded as “Bad Value” or “Missing Value”), invalid patron birth years (the value “1900”, which ABC Retail confirmed to be invalid) and an invalid subzone (the generated subzone value “CKSZ07”, which does not exist in any government subzone data). We discarded all of these invalid records in R using the subset function.
<br>
 
[[Image:Cleaning1.PNG|600px|thumb|center|Invalid Customer Data with “Bad Value”]]
<br>
[[Image:Cleaning2.PNG|600px|thumb|center|Invalid Customer Data with “Missing Value”]]
<br>
[[Image:Cleaning3.png|600px|thumb|center|Invalid Customer Data from Invalid Subzone “CKSZ07”]]
<br>
[[Image:Cleaning4.png|600px|thumb|center|Invalid Customer Data with Birthyear 1900]]
<br>
[[Image:Cleaning5.png|600px|thumb|center|Handling of Invalid Customer Data in R]]
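The subsetting described above can be sketched in R as follows; the column names and sample values are illustrative assumptions for this sketch, not the sponsor's actual schema.

```r
# Toy customer data; column names are assumptions for illustration only.
customers <- data.frame(
  UID       = 1:6,
  ADZID     = c("BDSZ01XX", "Bad Value", "Missing Value",
                "TMSZ02YY", "CKSZ07XX", "YSSZ03ZZ"),
  BirthYear = c(1980, 1975, 1990, 1900, 1985, 1962),
  stringsAsFactors = FALSE
)

# Drop rows with a flagged ADZID, the invalid "1900" birth year placeholder,
# or a derived subzone equal to the non-existent "CKSZ07".
clean <- subset(
  customers,
  !(ADZID %in% c("Bad Value", "Missing Value")) &
    BirthYear != 1900 &
    substr(ADZID, 1, 6) != "CKSZ07"
)

nrow(clean)  # 2 valid customers remain
```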
After cross-checking with official population census data, we realised that some customers come from subzones whose population is zero; these customers are therefore also invalid, and we discarded them as well.
[[Image:Cleaning6.png|600px|thumb|center|Handling of Invalid Customer Data from Zero-Population Subzone]]
<br><br>
<b>2. Transaction Dataset</b><br>
The transaction dataset also contains invalid records that correspond to the invalid customers identified above. To ensure that these records were not carried into further analysis, we used an inner join to exclude them and, at the same time, linked subzone information to each transaction record.
[[Image:Cleaning7.png|600px|thumb|center|Handling of Invalid Data in Transaction Dataset]]
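A minimal sketch of the inner join step, with toy data and assumed column names (the real headers come from the sponsor's header files):

```r
# Cleaned customers with their subzones, and raw transactions (toy data).
clean_customers <- data.frame(UID = c(1, 2, 4),
                              Subzone = c("BDSZ01", "TMSZ02", "YSSZ03"))
txn <- data.frame(UID  = c(1, 2, 3, 4, 4),
                  Item = c("A", "B", "C", "D", "E"))

# merge() with all = FALSE is an inner join: transactions whose UID has no
# match among the cleaned customers (UID 3 here) are dropped, and each
# surviving transaction gains the customer's Subzone column.
txn_clean <- merge(txn, clean_customers, by = "UID", all = FALSE)

nrow(txn_clean)  # 4 transactions survive; the invalid one is excluded
```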
  
 
</div>
 
</div>
 
==<div id="mw-content-text" lang="en-GB" dir="ltr" class="mw-content-ltr"><div style="background: #6FB1D0; padding: 10px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #6A8295 solid 15px; font-size: 20px; font-family:DIN Alternate"><font color="white">Additional Data</font></div>==
 
<div style="border-left: #96C0CE solid 8px;font-family: Avenir; padding: 0px 30px 0px 18px; ">
 
  
Government data alone is not a sufficient input for our analysis, so additional data generation and abstraction were necessary before modelling and visualization. As identified in the previous steps, the numbers of nearby MRT stations, malls and tuition centres are not only required in the model as attractiveness variables; they are also helpful during data exploration. We therefore generated the corresponding data for each outlet based on its location, which was not previously available. In addition, the distance between each outlet and each subzone, one of the most important components of our model, was still missing at this stage. To compute the number of surrounding facilities and the outlet-subzone distances, we wrote custom R functions that use the geographical information in the existing data (outlets, subzones, MRT stations, malls and tuition centres).
<br><br>
<b>Government Data Retrieval</b><br>
To implement the MCI model, the attraction variables and transport costs must first be obtained. To guarantee the accuracy of the first-stage outcome, we incorporated official, up-to-date data into our analysis.
<br><br>
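As background, a common textbook form of the MCI (Multiplicative Competitive Interaction) model, not necessarily the exact parameterization of the MCI R package we used, expresses the probability that a customer in subzone i visits outlet j as:

```latex
p_{ij} = \frac{\prod_{k=1}^{K} A_{jk}^{\gamma_k} \, d_{ij}^{-\lambda}}
              {\sum_{m=1}^{n} \prod_{k=1}^{K} A_{mk}^{\gamma_k} \, d_{im}^{-\lambda}}
```

where A_{jk} is the k-th attraction variable of outlet j (floor size, nearby malls, MRT stations and so on) and d_{ij} is the distance from subzone i to outlet j. This is why both the attraction variables and the transport costs must be assembled before fitting.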
<b>Singapore Population Data</b><br>
To analyse patronage trends across the population cohorts of ABC Retail customers, a dataset with accurate population figures is required. Population data is also needed in the MCI model to represent the outlet-visiting power of each subzone (“purchasing power” in a store market context). We therefore retrieved the latest population data from Statistics Singapore for the years 2000 to 2016, grouped by subzone, sex and age cohort, and selected the 2013 figures for modelling and analysis so as to match the 2013 transaction data provided by ABC Retail.
<br><br>
<b>Number of Surrounding Facilities</b><br>
Before counting the surrounding facilities, we first set the effective buffer distance around each outlet at 1 km. We then used the longitudes and latitudes of the relevant locations, together with geo-functions in R, to count the facilities that fall within this buffer.
<br><br>
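The buffer count can be sketched with a plain haversine distance; the coordinates below are invented for illustration, and this is not necessarily the exact geo-function the team used.

```r
# Great-circle (haversine) distance in metres between (lon, lat) points.
haversine_m <- function(lon1, lat1, lon2, lat2) {
  r <- 6371000                      # mean Earth radius in metres
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(pmin(1, a)))
}

# One outlet and a few candidate facilities (illustrative coordinates).
outlet     <- c(lon = 103.852, lat = 1.288)
facilities <- data.frame(lon = c(103.851, 103.860, 103.900),
                         lat = c(1.289,   1.287,   1.300))

# Count the facilities falling inside the 1 km buffer around the outlet.
d <- haversine_m(outlet["lon"], outlet["lat"], facilities$lon, facilities$lat)
sum(d <= 1000)  # 2 of the 3 facilities are within 1 km
```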
<b>Distance from Subzones to Outlets</b><br>
Since a subzone is a polygon with area, we must identify one representative point in each subzone from which to measure the distance to every outlet. We chose the centroid of each subzone because it is unbiased with respect to distance measurement. To determine the centroids we used the geographical data processing tool QGIS: we imported the subzone polygon shapefile, generated the centroids with the Polygon Centroids geometry tool, and exported them for import into R. With the subzone centroids in R, we then used a customized R function to calculate the distances between all subzones and all outlets, producing the “Subzone_Distance” data frame.
<br><br>
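A sketch of how such a “Subzone_Distance” frame could be assembled. The coordinates are invented, and the distance function here is a flat equirectangular approximation (adequate at Singapore's near-equatorial latitude), not the team's exact function.

```r
# Approximate planar distance in metres between two lon/lat points,
# using an equirectangular projection (adequate near the equator).
approx_dist_m <- function(lon1, lat1, lon2, lat2) {
  m_per_deg <- 111320
  dx <- (lon2 - lon1) * m_per_deg * cos((lat1 + lat2) / 2 * pi / 180)
  dy <- (lat2 - lat1) * m_per_deg
  sqrt(dx^2 + dy^2)
}

# Subzone centroids (as exported from QGIS) and outlet locations (toy values).
subzones <- data.frame(Subzone = c("BDSZ01", "TMSZ02"),
                       lon = c(103.93, 103.95), lat = c(1.32, 1.35))
outlets  <- data.frame(Branch = c("B1", "B2"),
                       lon = c(103.85, 103.94), lat = c(1.29, 1.33))

# Cross every subzone with every outlet (merge with by = NULL is a
# Cartesian product), then compute the distance for each pair.
Subzone_Distance <- merge(subzones, outlets, by = NULL,
                          suffixes = c("_sz", "_out"))
Subzone_Distance$dist_m <- with(Subzone_Distance,
  approx_dist_m(lon_sz, lat_sz, lon_out, lat_out))

nrow(Subzone_Distance)  # 2 subzones x 2 outlets = 4 pairs
```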
<b>Outlet Data with Number of Items Purchased</b><br>
The data we received from ABC Retail contained one row per scanned item, so each row was only part of a transaction rather than a full transaction. We therefore transformed the data so that each row represents a single transaction carried out by a customer, with the aim of capturing the total number of items purchased per transaction. With this information, we could easily map out the number of items purchased by customers at each outlet.
<br><br>
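The item-to-transaction rollup might look like this in base R. The transaction key (customer UID plus a timestamp) is an assumption for this sketch, since the real key depends on the sponsor's headers.

```r
# One row per scanned item (toy data). We assume a transaction is
# identified by customer UID plus a transaction timestamp.
items <- data.frame(UID     = c(1, 1, 1, 2, 2),
                    TxnTime = c("t1", "t1", "t2", "t3", "t3"),
                    Branch  = c("B1", "B1", "B1", "B2", "B2"))

# Collapse item rows into one row per transaction, counting the items.
txns <- aggregate(list(Items = rep(1, nrow(items))),
                  by = items[c("UID", "TxnTime", "Branch")],
                  FUN = sum)

nrow(txns)  # 3 transactions, with 2, 1 and 2 items respectively
```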
<b>Dummy Variable Creation – Branch Type</b><br>
In the outlet data given by our sponsor, the column “Branch Type” takes one of three values – “Mall”, “Stand-Alone” or “Regional” – to indicate the outlet type. Such a categorical value is not permitted in the MCI R package, so to meet its format requirements we transformed this column into two dummy variables, “type_if_mall” and “type_if_standalone”, whose values are (1, 0) when the branch type is “Mall”, (0, 1) when it is “Stand-Alone” and (0, 0) when it is “Regional”. These two dummy variables are detected when the data is passed to the MCI model and are processed without any further transformation.
<br><br>
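The dummy encoding can be sketched as below, with toy outlet rows; the encoding follows the description above.

```r
# Toy outlet data with the three possible branch types.
outlets <- data.frame(Branch = c("B1", "B2", "B3"),
                      BranchType = c("Mall", "Stand-Alone", "Regional"),
                      stringsAsFactors = FALSE)

# "Regional" is the reference level, encoded as (0, 0).
outlets$type_if_mall       <- as.integer(outlets$BranchType == "Mall")
outlets$type_if_standalone <- as.integer(outlets$BranchType == "Stand-Alone")

outlets[, c("type_if_mall", "type_if_standalone")]
# B1 (Mall):        1 0
# B2 (Stand-Alone): 0 1
# B3 (Regional):    0 0
```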
<b>Subzone Abstraction for Customer Data</b><br>
In the given customer dataset, each customer UID corresponds to a single variable, “Locale Planning ADZID”, indicating the customer’s residential area. We therefore needed to generate subzone data for each customer in order to match the customer data with the rest. Since the subzone code is the first six characters of the Locale Planning ADZID, we truncated the ADZID to its first six characters and assigned the result to a new variable, “Subzone”.
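The truncation step is a one-liner with substr(); the ADZID values below are invented for illustration.

```r
# Toy customer rows; real ADZID values come from the sponsor's data.
customers <- data.frame(UID = c(1, 2),
                        ADZID = c("BDSZ01XY12", "TMSZ02AB34"),
                        stringsAsFactors = FALSE)

# The subzone code is the first six characters of the Locale Planning ADZID.
customers$Subzone <- substr(customers$ADZID, 1, 6)

customers$Subzone  # "BDSZ01" "TMSZ02"
```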
 
  
 
</div>
 
</div>
  
 
<br>
 
<br>

Latest revision as of 22:16, 23 April 2017
