Difference between revisions of "Qui Vivra Verra - Data Exploration"

From Analytics Practicum
Jump to navigation Jump to search
Line 86: Line 86:
 
<div style="background: #dce6f9; line-height: 0.3em; font-family:Century Gothic;  border-left: #003464 solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#000000"><strong>Data Pre-processing</strong></font></div></div>
 
<div style="background: #dce6f9; line-height: 0.3em; font-family:Century Gothic;  border-left: #003464 solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#000000"><strong>Data Pre-processing</strong></font></div></div>
  
''Coming Soon!''
+
'''Planning Area Reconciliation'''
 +
 
 +
In the Patron Datasets, the residential areas of the patrons are given as Local Planning ADZID, which is equivalent to the subzones where they live in. However, the team find that it may be hard to identify the residential areas with subzones as there are simply too many subzones. Hence the team has decided to match the subzones to URA Planning Areas instead.
 +
 
 +
 
 +
The matching of Subzones to Planning Area was performed in QGIS and the procedure is as follows:
 +
 
 +
1. Extracted the first 6 characters (column ‘Subzone’) of Locale Planning ADZID from the Patron Dataset using JMP Pro.
 +
 
 +
 
 +
[[File:Planning Area Recon1.png|border|center|Planning Area Recon1.png]]
 +
 
 +
 
 +
2. The Patron Dataset is then joined with the MP14_SUBZONE shapefile which the team has obtained from data.gov.sg, matching the extracted characters from Patron Dataset with the variable ‘SUBZONE_C’ found in the MP14_SUBZONE shape file to get the Planning Area for each Subzone. The team has also obtained the region (column REGION_N) in which the Subzones fall in, which can be used to display the data in a higher level of detail.
 +
 
 +
 
 +
[[File:Planning Area Recon2.png|border|center|Planning Area Recon2.png]]
 +
 
 +
 
 +
3. The team also needed the centroid of each Planning Area to be used for initial visualisation in Tableau, thus the team has made use of the ‘Polygon Centroid’ function available in QGIS to obtain the centroid and coordinates (Latitude and Longitude) of each Planning Area.
 +
 
 +
 
 +
[[File:Planning Area Recon3.png|border|center|Planning Area Recon3.png]]
 +
 
 +
 
 +
'''Library Branch Code Reconciliation'''
 +
 
 +
From the Collection Dataset, the only identifier for the libraries is the attribute Branch Code. However, it is hard to identify a library based on the Branch Code. Hence, the Branch Code is then matched to the ‘DESCRIPTIO’ variable found in the attribute table of the LIBRARIES shapefile which the team has obtained from data.gov.sg to get the library names. Furthermore, as the team wanted to perform a geospatial visualisation, we needed to get the coordinates of the libraries which was also obtained from performing the join.
 +
 
 +
[[File:Library Branch Code Recon1.png|border|center|Library Branch Code Recon1.png]]
 +
 
 +
 
 +
[[File:Library Branch Code Recon2.png|border|center|Library Branch Code Recon2.png]]
 +
 
  
 
<div style="background: #dce6f9; line-height: 0.3em; font-family:Century Gothic;  border-left: #003464 solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#000000"><strong>External Data Sources</strong></font></div></div>
 
<div style="background: #dce6f9; line-height: 0.3em; font-family:Century Gothic;  border-left: #003464 solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#000000"><strong>External Data Sources</strong></font></div></div>
  
 
''Coming Soon!''
 
''Coming Soon!''

Revision as of 23:49, 10 October 2016



  HOME

  ABOUT US

  PROJECT OVERVIEW

  PROJECT FINDINGS

  PROJECT MANAGEMENT

  DOCUMENTATION



Summary of Anomalies & Errors

The team has performed exploratory data analysis and highlighted some anomalies as listed:

  • Anomaly 1

In the 2013 Patron Dataset, 2.534% of all records have Locale Planning ADZID set to “Bad Value” and “Missing Value”. In the 2013 Patron Dataset, 2.876% of all records have Locale Planning ADZID set to “Bad Value” and “Missing Value”. A snapshot of the anomaly is as shown below.

Anomaly 1.png
  • Anomaly 2

In the 2013 Transaction Dataset, 19,828 records have Patron_UID set to ‘0’. In the 2014 Transaction Dataset, 641 records have Patron_UID set to ‘0’. A snapshot of the anomaly is as shown below. A snapshot of the anomaly is as shown below.

Anomaly 2.png
  • Anomaly 3

In the 2013 Transaction Dataset, 3 records have Branch Code set to “Bad Value” and 3,065 records set to “Missing Value”. In the 2014 Transaction Dataset, 9,467 records have Branch Code set to “Bad Value” and 1,600 records set to “Missing Value”. A snapshot of the anomaly is as shown below.

Anomaly 3.png
  • Anomaly 4

In the 2013 Transaction Dataset, there are 893 patrons with Avg. No. of Books Borrowed (aggregated value) exceeding 32. In the 2014 Transaction Dataset, there are 779 patrons with Avg. No. of Books Borrowed exceeding 32. Furthermore, from the records with Avg. No. of Books Borrowed exceeding 32, we find that most of the records have attribute Patron Borrower Category Code set to “DEAR”. These records also have unrealistic values in attribute Patron Birthyear i.e. “1900”, “2015, and “2016”. These records have Patron Citizenship, Patron Race, Patron Gender set to “Others”. A snapshot of the anomaly is as shown below.

Anomaly 4.png
  • Anomaly 5

In the 2013 and 2014 Transaction Datasets, there are records with Branch Codes that do not exist in Collection_Dataset_FY13 and FY14, e.g. ‘07LKCRL’, 08LKCRL’. A snapshot of the anomaly is as shown below.

Anomaly 5.png

During Sponsor Meeting 01 with the NLB held on 06 October 2016, the team has consulted the NLB Analytics Team regarding the above-mentioned anomalies and noted the following:

  • Patron Borrower Category Code “DEAR”, is not a unique patron per se, but refers to institutional partnership programmes. NLB suggested to remove all records with Patron Borrower Category Code set to “DEAR” from further analysis.
  • Patrons with Birthyears ‘1900’, ‘2015’ and ‘2016’ are due to values being set to the year that the institutional programme was set up.
  • NLB suggested to exclude the Branch Codes that are not listed in the Collection Dataset.
  • NLB is agreeable with removing all records that contain the anomalies as described above.


Data Pre-processing

Planning Area Reconciliation

In the Patron Datasets, the residential areas of the patrons are given as Local Planning ADZID, which is equivalent to the subzones where they live in. However, the team find that it may be hard to identify the residential areas with subzones as there are simply too many subzones. Hence the team has decided to match the subzones to URA Planning Areas instead.


The matching of Subzones to Planning Area was performed in QGIS and the procedure is as follows:

1. Extracted the first 6 characters (column ‘Subzone’) of Locale Planning ADZID from the Patron Dataset using JMP Pro.


Planning Area Recon1.png


2. The Patron Dataset is then joined with the MP14_SUBZONE shapefile which the team has obtained from data.gov.sg, matching the extracted characters from Patron Dataset with the variable ‘SUBZONE_C’ found in the MP14_SUBZONE shape file to get the Planning Area for each Subzone. The team has also obtained the region (column REGION_N) in which the Subzones fall in, which can be used to display the data in a higher level of detail.


Planning Area Recon2.png


3. The team also needed the centroid of each Planning Area to be used for initial visualisation in Tableau, thus the team has made use of the ‘Polygon Centroid’ function available in QGIS to obtain the centroid and coordinates (Latitude and Longitude) of each Planning Area.


Planning Area Recon3.png


Library Branch Code Reconciliation

From the Collection Dataset, the only identifier for the libraries is the attribute Branch Code. However, it is hard to identify a library based on the Branch Code. Hence, the Branch Code is then matched to the ‘DESCRIPTIO’ variable found in the attribute table of the LIBRARIES shapefile which the team has obtained from data.gov.sg to get the library names. Furthermore, as the team wanted to perform a geospatial visualisation, we needed to get the coordinates of the libraries which was also obtained from performing the join.

Library Branch Code Recon1.png


Library Branch Code Recon2.png


External Data Sources

Coming Soon!