Group14 Project Findings
Data
Data preparation is a crucial component of any data analysis. It is time-consuming and tedious, but it makes the data easier to interpret and the interpretation more accurate. We received the following raw data files from ABC Retail:
- Outlet Data
  - Contains outlet information such as outlet branch code, floor size, scope of service and branch type (whether the outlet is within a mall, a standalone outlet or a regional outlet) for each outlet, for the years 2013 and 2014.
- Customer Data
  - Contains the characteristics of each unique customer who has visited the store, such as citizenship, birth year, race, gender and location ID, for both 2013 and 2014.
- Transactional Data
  - Contains every transaction of items purchased by customers over the years 2013 and 2014.
Summary of Data Cleaning
1. Customer Dataset
The customer dataset we received contains several types of invalid data: invalid “Locale Planning ADZID” values (recorded as “Bad Value” or “Missing Value”), an invalid patron birth year (“1900”, which ABC Retail confirmed to be invalid) and an invalid subzone (the value “CKSZ07”, which does not exist in any government subzone data). We discarded all of these records in R using the subset function.
After cross-checking with official population census data, we found that some customers are recorded in subzones with a population of zero, which means these records are also invalid. We therefore discarded them as well.
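The cleaning rules above can be sketched as a simple row filter. The team's own cleaning was done in R with the subset function; Python is used here only for illustration. The column names (`adzid`, `birth_year`, `subzone`), the sample rows and the population table are hypothetical and do not come from the ABC Retail files.

```python
# Minimal sketch of the customer cleaning rules, under assumed column names.
# "adzid", "birth_year" and "subzone" are hypothetical field names.

INVALID_ADZID = {"Bad Value", "Missing Value"}
INVALID_BIRTH_YEAR = 1900          # confirmed invalid by ABC Retail
INVALID_SUBZONES = {"CKSZ07"}      # not found in government subzone data


def is_valid_customer(row, subzone_population):
    """Return True if a customer record passes every cleaning rule."""
    if row["adzid"] in INVALID_ADZID:
        return False
    if row["birth_year"] == INVALID_BIRTH_YEAR:
        return False
    if row["subzone"] in INVALID_SUBZONES:
        return False
    # Assumption: a subzone missing from the census table is treated
    # the same way as a subzone with zero population.
    if subzone_population.get(row["subzone"], 0) == 0:
        return False
    return True


def clean_customers(rows, subzone_population):
    """Keep only the customer records that pass all rules."""
    return [r for r in rows if is_valid_customer(r, subzone_population)]


# Hypothetical sample data for illustration only.
sample_population = {"SZ01": 12000, "SZ02": 0}
sample_customers = [
    {"id": 1, "adzid": "A01", "birth_year": 1985, "subzone": "SZ01"},
    {"id": 2, "adzid": "Bad Value", "birth_year": 1990, "subzone": "SZ01"},
    {"id": 3, "adzid": "A02", "birth_year": 1900, "subzone": "SZ01"},
    {"id": 4, "adzid": "A03", "birth_year": 1975, "subzone": "CKSZ07"},
    {"id": 5, "adzid": "A04", "birth_year": 2001, "subzone": "SZ02"},
]
cleaned = clean_customers(sample_customers, sample_population)
```

In this sample only the first record survives: the others fail the ADZID, birth year, subzone and zero-population checks respectively.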
Additional Data
1. Surrounding Facility Dataset:
- Geographical location of Shopping Malls/ Plazas
As the senior group’s report indicates, there are “positive inter-store externalities generated by the shopping malls that operate near the library (Brueckner, 2011), as more consumers visit the shopping malls, the patronage level of the nearby library will likely follow a similar increase.” Therefore, our team will continue to study how the distribution of shopping malls affects the patronage of the libraries.
- Geographical location of Primary Schools/ Secondary Schools/ Junior Colleges
Students are one of the largest groups visiting libraries and cannot be neglected, as they are likely to spend time in the libraries after school hours and during examination periods. Hence, our team will also examine closely how the locations of nearby educational institutions (primary schools, secondary schools, junior colleges) affect the patronage of the libraries, using data obtained online.
2. Transportation Accessibility Dataset:
- Geographical location of MRT Stations (A greater weight will be assigned to MRT interchanges in the analyses)
- Geographical location of Bus Stops & No. of Bus Services Provided
Transport accessibility also plays an important role in how likely a patron is to visit a library. A library connected to an easily accessed public transport network presents less hindrance, and thus a higher probability of a visit. The impact of public transportation may also vary between neighborhoods whose residents differ in social and financial status. Therefore, our team will include the available transportation datasets (MRT stations and bus stops) in our model, with weights assigned, to better measure and predict the attractiveness of the libraries.
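One way to turn the weighting idea into a number is a weighted count of stops within walking distance of a library. The sketch below is only an illustration: the weight values, the 500 m radius, the field names and the scoring formula are all assumptions, not the team's actual model. The only points taken from the text are that MRT interchanges weigh more than ordinary stations and that bus stops carry their number of services.

```python
import math

EARTH_RADIUS_M = 6_371_000

# Hypothetical weights; the text only says interchanges get a greater weight.
WEIGHT_MRT = 1.0
WEIGHT_MRT_INTERCHANGE = 2.0
WEIGHT_PER_BUS_SERVICE = 0.1


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def stop_weight(stop):
    """Weight of a single stop: interchanges count more, bus stops scale with services."""
    if stop["kind"] == "mrt":
        return WEIGHT_MRT_INTERCHANGE if stop.get("interchange") else WEIGHT_MRT
    if stop["kind"] == "bus":
        return WEIGHT_PER_BUS_SERVICE * stop["services"]
    raise ValueError(f"unknown stop kind: {stop['kind']}")


def accessibility_score(library, stops, radius_m=500):
    """Sum the weights of all stops within radius_m of the library."""
    return sum(
        stop_weight(s)
        for s in stops
        if haversine_m(library["lat"], library["lon"], s["lat"], s["lon"]) <= radius_m
    )


# Hypothetical coordinates for illustration only.
library = {"lat": 1.3000, "lon": 103.8000}
stops = [
    {"kind": "mrt", "interchange": True, "lat": 1.3000, "lon": 103.8000},
    {"kind": "bus", "services": 5, "lat": 1.3010, "lon": 103.8000},
    {"kind": "mrt", "interchange": False, "lat": 1.3500, "lon": 103.8000},  # too far
]
score = accessibility_score(library, stops)
```

A hard radius cutoff is the simplest choice; a distance-decay weight would be an alternative if stops just outside the radius should still contribute.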
3. Geographical Dataset:
- Building within costaloutline.shp
As mentioned above, although the subzone clustering analysis conducted by the senior team returned a workable model, our team aims to build on it and present a more precise analysis. Subzones no longer meet our needs, because each subzone covers a wide area and patrons from different parts of the same subzone are treated unevenly. Therefore, our team will use geospatial data at the HDB level (after transformation) and link it to the geocoded patron data so as to better analyze the patronage of each library.
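Linking geocoded patrons to HDB-level data could, for example, be done with a nearest-building match. The sketch below assumes both datasets are already in the same projected coordinate system with `x` and `y` in metres; the field names, the sample points and the nearest-neighbour rule itself are assumptions for illustration, not the team's confirmed method.

```python
import math
from collections import Counter


def nearest_building(patron, buildings):
    """Return the building whose point is closest to the patron (planar distance)."""
    return min(
        buildings,
        key=lambda b: math.hypot(b["x"] - patron["x"], b["y"] - patron["y"]),
    )


def patrons_per_building(patrons, buildings):
    """Count how many geocoded patrons are matched to each building."""
    counts = Counter()
    for patron in patrons:
        counts[nearest_building(patron, buildings)["building_id"]] += 1
    return counts


# Hypothetical projected coordinates (metres) for illustration only.
buildings = [
    {"building_id": "HDB-A", "x": 0.0, "y": 0.0},
    {"building_id": "HDB-B", "x": 100.0, "y": 0.0},
]
patrons = [
    {"patron_id": 1, "x": 10.0, "y": 5.0},
    {"patron_id": 2, "x": 90.0, "y": -5.0},
    {"patron_id": 3, "x": 20.0, "y": 0.0},
]
counts = patrons_per_building(patrons, buildings)
```

This linear scan is adequate for a small sample. For the full patron dataset a spatial index or a GIS spatial join would be the more practical choice.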