Group14 Project Findings

From Analytics Practicum
 
<div style="border-left: #96C0CE solid 8px;font-family: Avenir; padding: 0px 30px 0px 18px; ">
 
<div style="border-left: #96C0CE solid 8px;font-family: Avenir; padding: 0px 30px 0px 18px; ">
  
Data preparation is a crucial component of any data analysis. Although it is a time-consuming and tedious task, it remains extremely important, as it allows for more accurate and easier interpretation of the data. We received the following raw data files from ABC Retail:

*Outlet Data (Collection_Dataset_FY13 and FY14.xlsx)
::Contains outlet information such as the outlet branch code, floor size, scope of service and branch type (whether the outlet is within a mall, a standalone outlet or a regional outlet) for each outlet for the years 2013 and 2014.
*Customer Data (Patron_Headers.csv, Patron_Dataset_FY13.csv and Patron_Dataset_FY14.csv)
::Contains the characteristics of each unique customer who has visited a store, such as citizenship, birth year, race, gender and location ID, for the years 2013 and 2014.
*Transactional Data (TXN_Headers.csv, TXN_FY13.csv and TXN_FY14.csv)
::Contains all transactions of items purchased by customers over the years 2013 and 2014.
 
</div>
 
</div>
  
 
==<div id="mw-content-text" lang="en-GB" dir="ltr" class="mw-content-ltr"><div style="background: #6FB1D0; padding: 10px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #6A8295 solid 15px; font-size: 20px; font-family:DIN Alternate"><font color="white">Summary of Data Cleaning</font></div>==
 
 
<div style="border-left: #96C0CE solid 8px;font-family: Avenir; padding: 0px 30px 0px 18px; ">
 
<div style="border-left: #96C0CE solid 8px;font-family: Avenir; padding: 0px 30px 0px 18px; ">
1. Customer Dataset

The customer dataset we received contains several types of invalid data: invalid “Locale Planning ADZID” values (recorded as “Bad Value” or “Missing Value”), invalid patron birth years (the value “1900”, which ABC Retail has confirmed to be invalid) and an invalid subzone (the value “CKSZ07”, which does not exist in any government subzone data). We discarded all of these invalid records in R using the subset function.
<br>
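The team performed this cleaning in R with the subset function; as a hedged illustration only, the same filter can be sketched in Python. The field names and sample rows below are hypothetical, not taken from the actual dataset:

```python
# Illustrative sketch of the invalid-record filter described above.
# Field names are assumed; the team's actual work used R's subset() function.
customers = [
    {"uid": 1, "adzid": "AD01", "birthyear": 1985, "subzone": "AMSZ01"},
    {"uid": 2, "adzid": "Bad Value", "birthyear": 1990, "subzone": "AMSZ02"},
    {"uid": 3, "adzid": "AD02", "birthyear": 1900, "subzone": "AMSZ03"},
    {"uid": 4, "adzid": "AD03", "birthyear": 1978, "subzone": "CKSZ07"},
]

INVALID_ADZID = {"Bad Value", "Missing Value"}

def is_valid(row):
    # Drop rows with an invalid ADZID, the placeholder birth year 1900,
    # or the non-existent subzone "CKSZ07".
    return (row["adzid"] not in INVALID_ADZID
            and row["birthyear"] != 1900
            and row["subzone"] != "CKSZ07")

clean = [row for row in customers if is_valid(row)]
```

Each condition mirrors one of the three invalid-data types listed above, so a row is kept only if it passes all of them.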
  
[[Image:Cleaning1.PNG|600px|thumb|center|Invalid Customer Data with “Bad Value”]]
[[Image:Cleaning2.PNG|600px|thumb|center|Invalid Customer Data with “Missing Value”]]
[[Image:Cleaning3.PNG|600px|thumb|center|Invalid Customer Data from Invalid Subzone “CKSZ07”]]
[[Image:Cleaning4.PNG|600px|thumb|center|Invalid Customer Data with Birthyear 1900]]
[[Image:Cleaning5.PNG|600px|thumb|center|Handling of Invalid Customer Data in R]]
<br>
After cross-checking with official population census data, we also found customers from subzones whose population is zero; these records are likewise invalid, so we discarded them as well.

[[Image:Cleaning6.PNG|600px|thumb|center|Handling of Invalid Customer Data from Zero-Population Subzone]]
<br><br>

2. Transaction Dataset

1. Both the TXN_Y13 and TXN_Y14 datasets show that each year’s transactions run from April of that year to April of the following year (12 months).
<br><br>
2. In both the TXN_Y13 and TXN_Y14 datasets there were rows with Patron UID set to “0”, even though neither Patron_Dataset_FY13 nor Patron_Dataset_FY14 contains a patron with UID “0”. Exploring the data showed 1,335 such rows in TXN_Y13 and 65 in TXN_Y14.

[[Image:Point2.png|600px|center|PatronUID_0]]
<br><br>
3. In the TXN_Y13 dataset there were Patron Borrower Category Codes such as “DEAR”, “DEARC”, “DEARS” and “Deceased”. From the previous team’s data cleaning we learned that “DEAR” refers to institutional partnership programmes, and NLB suggested removing all records with Patron Borrower Category Code “DEAR” from further analysis. As the other codes (“DEARC”, “DEARS” and “Deceased”) were not mentioned, our team decided, for the time being, to remove those records as well. The frequencies of these codes in the TXN_Y13 and TXN_Y14 datasets are displayed below.

[[Image:Point3.png|400px|center|DEAR]]
<br><br>
4. In the TXN_Y13 transaction dataset, 3 records have Branch Code set to “Bad Value” and 3,065 records set to “Missing Value”; in the TXN_Y14 transaction dataset, 9,467 records are set to “Bad Value” and 1,600 to “Missing Value”.
<br><br>
5. Based on the updated Collection dataset, it was confirmed that NLB currently has 26 active libraries, so transactions in the TXN_Y13 and TXN_Y14 datasets whose branch codes did not belong to these 26 branches were removed, for example:

::::a. 07LKCRL

::::b. 08LKCRL

::::c. 11LKCRL

::::d. ORN

::::e. RN

::::f. RU

<br><br>
6. We removed any transactions in the TXN_Y13 and TXN_Y14 datasets where the patron’s birth year was “1900”, “2015” or “2016”, as these values were set to the year in which the corresponding institutional programme was established.

[[Image:Point6.png|600px|center|Invalid BirthYear]]
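The transaction-level rules above (drop Patron UID “0”, drop the DEAR-family borrower codes, keep only the 26 active branch codes, and drop placeholder birth years 1900/2015/2016) can be combined into a single filter. This is a hedged Python sketch of the idea only; the actual cleaning was done in R, and the field names, sample rows and two-element branch whitelist below are stand-ins:

```python
# Illustrative sketch only: the team performed this cleaning in R.
# Field names and sample rows are hypothetical.
INVALID_CODES = {"DEAR", "DEARC", "DEARS", "Deceased"}
INVALID_BIRTHYEARS = {1900, 2015, 2016}
ACTIVE_BRANCHES = {"AMKPL", "BIPL"}  # stand-ins for the 26 active branch codes

transactions = [
    {"patron_uid": 0,  "code": "ADULT", "branch": "AMKPL", "birthyear": 1980},
    {"patron_uid": 7,  "code": "DEAR",  "branch": "AMKPL", "birthyear": 1975},
    {"patron_uid": 8,  "code": "ADULT", "branch": "ORN",   "birthyear": 1975},
    {"patron_uid": 9,  "code": "ADULT", "branch": "BIPL",  "birthyear": 2015},
    {"patron_uid": 10, "code": "ADULT", "branch": "BIPL",  "birthyear": 1990},
]

def keep(txn):
    return (txn["patron_uid"] != 0                  # rule 2: no matching patron record
            and txn["code"] not in INVALID_CODES    # rule 3: DEAR-family codes
            and txn["branch"] in ACTIVE_BRANCHES    # rule 5: active branches only
            and txn["birthyear"] not in INVALID_BIRTHYEARS)  # rule 6: placeholder years

clean_txns = [t for t in transactions if keep(t)]
```

Combining the rules into one pass keeps the cleaning order-independent: a transaction survives only if it violates none of the rules.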
  
 
</div>
 
</div>

Revision as of 21:52, 23 April 2017

==<div id="mw-content-text" lang="en-GB" dir="ltr" class="mw-content-ltr"><div style="background: #6FB1D0; padding: 10px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #6A8295 solid 15px; font-size: 20px; font-family:DIN Alternate"><font color="white">Additional Data</font></div>==

1. Surrounding Facility Dataset:

*Geographical location of Shopping Malls / Plazas

::As indicated in the senior group’s report, there are “positive inter-store externalities generated by the shopping malls that operate near the library (Brueckner, 2011)”: as more consumers visit the shopping malls, the patronage of the nearby library will likely rise in a similar way. Our team will therefore continue to study the effect of the distribution of shopping malls on library patronage.

*Geographical location of Primary Schools / Secondary Schools / Junior Colleges

::As one of the largest groups visiting libraries, students cannot be neglected: they are likely to spend time in libraries after school hours and during examination periods. Hence, our team will also look closely at the impact of the location distribution of nearby educational institutions (primary schools, secondary schools and junior colleges) on library patronage, using data gathered online.
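One simple way to quantify “nearby” facilities of this kind is the straight-line (great-circle) distance from each library to each facility. The sketch below uses the standard haversine formula; the library and school coordinates are made-up placeholders, not real locations from the dataset:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points in kilometres,
    # using a mean Earth radius of 6371 km.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Hypothetical library and school coordinates (roughly in Singapore).
library = (1.3521, 103.8198)
schools = [(1.3550, 103.8250), (1.3000, 103.9000), (1.3530, 103.8210)]

# Count schools within a 1 km radius of the library.
nearby = sum(1 for s in schools if haversine_km(*library, *s) <= 1.0)
```

The same count-within-radius idea applies unchanged to malls and plazas; only the facility list and the radius differ.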

2. Transportation Accessibility Dataset:

*Geographical location of MRT Stations (a greater weight will be assigned to MRT interchanges in the analyses)
*Geographical location of Bus Stops and the number of Bus Services provided

::The accessibility of public transport also plays an important role in the likelihood that a patron visits a library: an easily accessible transport network connected to a library means less hindrance and thus a higher probability of a visit. The impact of public transport may also vary between neighbourhoods whose residents differ in social and financial status. Our team will therefore incorporate the available transportation datasets (MRT stations and bus stops) into our model, with weights assigned, to better measure and predict the attractiveness of the libraries.
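The weighting idea described above (MRT interchanges counting more than ordinary stations, plus bus-stop coverage) can be prototyped as a simple linear score. The weights below are illustrative placeholders only, not the team’s calibrated values:

```python
# Hypothetical accessibility score for a library.
# Weights are illustrative; the team's actual weights would be calibrated.
W_INTERCHANGE = 2.0   # MRT interchanges weighted more heavily than ordinary stations
W_STATION = 1.0
W_BUS_SERVICE = 0.1   # each bus service calling at a nearby stop

def accessibility(n_interchanges, n_stations, n_bus_services):
    # Weighted sum of nearby transport options for one library.
    return (W_INTERCHANGE * n_interchanges
            + W_STATION * n_stations
            + W_BUS_SERVICE * n_bus_services)

score = accessibility(n_interchanges=1, n_stations=2, n_bus_services=15)
```

A score like this makes the “greater weight for interchanges” assumption explicit and easy to tune once the model is fitted.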

3.Geographical Dataset:

  • Building within costaloutline.shp

As mentioned above, although the subzone clustering analysis conducted by the senior’s team returned a relatively executable model, our team aims to build up on the next level and present a more precise and accurate analysis. In terms of the geographical dataset, subzones no longer meet our demand due to the wide coverage of each subzone and the inequality analysis on patrons from different parts within the same subzone. Therefore, our team will utilize the geospatial data at HDB level (after transformation) and link it to the post-geocoding patron’s data so as to better analyze the patronage of each library.