Group14 Project Findings

From Analytics Practicum
Revision as of 21:52, 23 April 2017 by Gaurib.2013 (talk | contribs)
 
Data Methodology

Data

Data preparation is a crucial component of any data analysis. Although it is time-consuming and tedious, it makes the data easier to interpret and the interpretation more accurate. We received the following raw data files from ABC Retail:

  • Outlet Data
Contains outlet information such as outlet branch code, floor size, scope of service and branch type (whether the outlet is within a mall, a standalone outlet or a regional outlet) for each outlet for the years 2013 and 2014.
  • Customer Data
Contains the characteristics of each unique customer who has visited the store, such as citizenship, birth year, race, gender and location ID, for both 2013 and 2014.
  • Transactional Data
Contains all transaction records of items purchased by customers in 2013 and 2014.

Summary of Data Cleaning

1. Customer Dataset

The customer dataset we received contains several types of invalid data: invalid “Locale Planning ADZID” values (recorded as “Bad Value” or “Missing Value”), invalid patron birth years (recorded as “1900”, which ABC Retail confirmed to be invalid) and an invalid subzone (the subzone value “CKSZ07”, which does not exist in any government subzone data). We discarded all of these invalid records in R using the subset function.

Invalid Customer Data with “Bad Value”


Invalid Customer Data with “Missing Value”


File:Cleaning3.PNG
Invalid Customer Data from Invalid Subzone “CKSZ07”


File:Cleaning4.PNG
Invalid Customer Data with Birthyear 1900


File:Cleaning5.PNG
Handling of Invalid Customer Data in R
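
The filtering described above was done in R with subset. As a rough illustration of the same logic, here is a minimal sketch in plain Python rather than R. The field names (adzid, birth_year, subzone), the subzone code "SZ_A" and the sample rows are assumptions made for this example and are not the actual ABC Retail schema.

```python
# Hypothetical field names and sample rows; the real column names in the
# ABC Retail customer file may differ.
INVALID_ADZID = {"Bad Value", "Missing Value"}
INVALID_BIRTH_YEAR = 1900        # confirmed invalid with ABC Retail
INVALID_SUBZONES = {"CKSZ07"}    # not found in government subzone data


def is_valid_customer(row):
    """Return True if the row passes all three validity checks."""
    return (
        row["adzid"] not in INVALID_ADZID
        and row["birth_year"] != INVALID_BIRTH_YEAR
        and row["subzone"] not in INVALID_SUBZONES
    )


def clean_customers(rows):
    """Keep only valid rows, similar in spirit to R's subset()."""
    return [row for row in rows if is_valid_customer(row)]


# Made-up sample data: only the first row is valid.
customers = [
    {"id": 1, "adzid": "A01", "birth_year": 1985, "subzone": "SZ_A"},
    {"id": 2, "adzid": "Bad Value", "birth_year": 1990, "subzone": "SZ_A"},
    {"id": 3, "adzid": "Missing Value", "birth_year": 1978, "subzone": "SZ_A"},
    {"id": 4, "adzid": "A02", "birth_year": 1900, "subzone": "SZ_A"},
    {"id": 5, "adzid": "A03", "birth_year": 2001, "subzone": "CKSZ07"},
]
cleaned = clean_customers(customers)
```

Keeping the invalid markers in named constants makes it easy to extend the rules if further invalid values are confirmed later.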

After cross-checking with official population census data, we found that some customers come from subzones where the population equals zero, which means these records are also invalid. We therefore discarded them as well.

File:Cleaning6.PNG
Handling of Invalid Customer Data from Zero-Population Subzone
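
The zero-population check amounts to a lookup against the census table. The sketch below shows one way this could look in plain Python; the dictionary layout, the subzone codes and the choice to keep customers whose subzone is missing from the census lookup are all assumptions for illustration, not the team's actual procedure.

```python
def drop_zero_population(customers, subzone_population):
    """Drop customers whose subzone has a census population of zero.

    Subzones missing from the lookup are kept here; whether to keep or
    drop them is a judgment call this sketch does not settle.
    """
    return [
        c for c in customers
        if subzone_population.get(c["subzone"]) != 0
    ]


# Hypothetical census lookup: subzone code -> resident population.
subzone_population = {"SZ_A": 12000, "SZ_B": 0}

# Made-up customers; customer 2 lives in a zero-population subzone.
census_customers = [
    {"id": 1, "subzone": "SZ_A"},
    {"id": 2, "subzone": "SZ_B"},
    {"id": 3, "subzone": "SZ_C"},  # not in the lookup, kept by this sketch
]
kept = drop_zero_population(census_customers, subzone_population)
```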

Additional Data

1. Surrounding Facility Dataset:

  • Geographical location of Shopping Malls/ Plazas

As noted in the senior group’s report, there are “positive inter-store externalities generated by the shopping malls that operate near the library (Brueckner, 2011), as more consumers visit the shopping malls, the patronage level of the nearby library will likely follow a similar increase.” Our team will therefore continue to study how the distribution of shopping malls affects library patronage.

  • Geographical location of Primary Schools/ Secondary Schools/ Junior Colleges

Students are one of the largest groups visiting libraries and cannot be neglected, since they are likely to spend time in the libraries after school hours and during examination periods. Our team will therefore also examine how the locations of nearby educational institutions (primary schools, secondary schools and junior colleges) affect library patronage, using data obtained online.

2. Transportation Accessibility Dataset:

  • Geographical location of MRT Stations (A greater weight will be assigned to MRT interchanges in the analyses)
  • Geographical location of Bus Stops & No. of Bus Services Provided

Transport accessibility plays an important role in how likely a patron is to visit a library. When a library is well connected to the public transport network, there is less hindrance and thus a higher probability that a patron will visit it. The impact of public transport may also vary between neighbourhoods whose residents differ in social and financial status. Our team will therefore incorporate the available transportation datasets (MRT stations and bus stops) into our model, with weights assigned, to better measure and predict the attractiveness of the libraries.
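
To make the idea of a weighted transport measure concrete, the sketch below computes a simple accessibility score for one library: each MRT station within a radius adds a weight (larger for interchanges), and each bus stop adds a weight proportional to the number of services it provides. The weights, the 500 m radius, the field names and the coordinates are all hypothetical values chosen for illustration; the page does not state the weights or distance measure actually used.

```python
import math

EARTH_RADIUS_M = 6_371_000

# Hypothetical weights; the real values are not given in the write-up.
MRT_WEIGHT = 1.0
INTERCHANGE_WEIGHT = 2.0      # interchanges count more than ordinary stations
BUS_SERVICE_WEIGHT = 0.1      # applied per bus service at a stop


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def accessibility_score(library, mrt_stations, bus_stops, radius_m=500):
    """Sum the weights of all transport nodes within radius_m of the library."""
    score = 0.0
    for s in mrt_stations:
        d = haversine_m(library["lat"], library["lon"], s["lat"], s["lon"])
        if d <= radius_m:
            score += INTERCHANGE_WEIGHT if s["interchange"] else MRT_WEIGHT
    for b in bus_stops:
        d = haversine_m(library["lat"], library["lon"], b["lat"], b["lon"])
        if d <= radius_m:
            score += BUS_SERVICE_WEIGHT * b["services"]
    return score


# Made-up coordinates: one nearby interchange, one distant station,
# one nearby bus stop with 5 services, one distant bus stop.
library = {"lat": 1.3000, "lon": 103.8000}
mrt_stations = [
    {"lat": 1.3010, "lon": 103.8000, "interchange": True},
    {"lat": 1.3500, "lon": 103.8000, "interchange": False},
]
bus_stops = [
    {"lat": 1.3000, "lon": 103.8020, "services": 5},
    {"lat": 1.3200, "lon": 103.8000, "services": 3},
]
score = accessibility_score(library, mrt_stations, bus_stops)
```

A hard distance cut-off is the simplest choice; a distance-decay weight would be a natural alternative if nearer stops should count for more.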

3. Geographical Dataset:

  • Building within costaloutline.shp

As mentioned above, although the subzone clustering analysis conducted by the senior team returned a reasonably workable model, our team aims to build on it and present a more precise and accurate analysis. Subzones no longer meet our needs as a geographical unit, because each subzone covers a wide area and patrons from different parts of the same subzone are not analysed equally. Our team will therefore use geospatial data at the HDB level (after transformation) and link it to the geocoded patron data to better analyse the patronage of each library.
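
Linking geocoded patrons to building-level geometry is essentially a point-in-polygon join. The sketch below illustrates the idea with a dependency-free ray-casting test on planar (projected) coordinates. The record layout, the building IDs and the square footprints are invented for this example; a real workflow would more likely read the shapefile with a GIS library and use its spatial join.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices in order; coordinates are assumed
    to be planar (for example, projected metres), not latitude/longitude.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def link_patrons_to_buildings(patrons, buildings):
    """Map each patron ID to the first building containing it, else None."""
    links = {}
    for p in patrons:
        links[p["id"]] = None
        for b in buildings:
            if point_in_polygon(p["x"], p["y"], b["footprint"]):
                links[p["id"]] = b["id"]
                break
    return links


# Hypothetical building footprints and geocoded patron locations.
buildings = [
    {"id": "B1", "footprint": [(0, 0), (10, 0), (10, 10), (0, 10)]},
    {"id": "B2", "footprint": [(20, 0), (30, 0), (30, 10), (20, 10)]},
]
patrons = [
    {"id": "P1", "x": 5, "y": 5},    # inside B1
    {"id": "P2", "x": 25, "y": 5},   # inside B2
    {"id": "P3", "x": 15, "y": 5},   # between the two buildings
]
links = link_patrons_to_buildings(patrons, buildings)
```

Patrons that fall outside every footprint are mapped to None here; assigning them to the nearest building instead would be one possible refinement.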