Difference between revisions of "Hiryuu Findings"

From Analytics Practicum
Jump to navigation Jump to search
Line 81: Line 81:
 
* Clarify unclear data/abnormalities with Sponsors
 
* Clarify unclear data/abnormalities with Sponsors
  
==<div style="background: #A4A4A4; line-height: 0.3em; font-family:Roboto;  border-left: #848484 solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#ffffff"><strong>General Steps Taken </strong></font></div></div>==
+
==<div style="background: #A4A4A4; line-height: 0.3em; font-family:Roboto;  border-left: #848484 solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#ffffff"><strong>Complexities Involved</strong></font></div></div>==
# <h4> 1. Joining data tables for shipment details and status update </h4>
+
# <h4> 1. Complexities in Geocoding by Country </h4>
* Used left outer join with statuses as the main table so as to add on the corresponding shipment details to each status update
+
The shipment data provided to us includes geospatial data pertaining to the sender and receiver addresses. These properties made it appropriate for us to geocode the data in order to perform detailed analysis to improve processes. However, before such data could be utilised effectively, the intermediate steps of cleaning and comprehending has to be performed.
* Used Match Flag to check if there is missing data from either source
+
 
 +
# Hong Kong
 +
A major restriction in conducting geospatial analysis for Hong Kong’s data is the lack of a proper national postal code system in Hong Kong. Due to the inability to easily geocode a shipment, attempts to geocode a shipment using information such as address and district have also proved futile.
 +
This is due to the inconsistency in data recording across employees and customers.
 +
For example in figure 1, there are 3 different records of Mong Kong, a district in Hong Kong. Also, there are also differences in the way addresses are recorded. Addresses could be seen recorded in full with the unit and block number, whereas there are other instances where addresses only contained of the street name or building name.
 +
 
 +
[[File:Figure1.png|center]]
 +
<center>Fig 1: District Names in Hong Kong</center>
 +
 
 +
# Korea
 +
Problems arose in both the sponsor’s and publicly available data. Korea used to operate on a 6-digit postal code system, but as of August 1st, 2015, the country switched to a new 5-digit postal code system. However, the sponsor’s data had both the old and new system present in their data, and this unnecessarily complicated the geocoding process.
 +
A workaround this problem was to make use of the state and city names. The sponsor’s data had information on Receiver’s and Sender’s City, State, and Address. Fortunately, all State names were unique. (E.g. Busan, Incheon). Tying the unique State name together with the City residing within, we were able to generate our own unique national ID for each City in Korea. The naming of States and Cities were consistent across both the sponsor’s dataset as well as the publicly available data. This approach was adopted to bypass the complexities of two different postal code system residing in the sponsor’s dataset.
 +
 
 +
[[File:Figure2.JPG|center]]
 +
<center>Fig 2: Geocodes to create Korea's Kcodes</center>
 +
 
 +
Lastly there also a related issue pertaining to the geospatial data accuracy of the publicly available shape file. The latest version was the 2015 version and there were issues with the polygons. A single entity in the public dataset was linked to 2 polygons erroneously. More specifically, selecting Jung-gu in Busan also led to Jung-gu in Incheon being selected. (as demonstrated in the figure 3 below). This issue did not emerge at the first few steps of cleaning, it was only after the mapping over of data resulted in always the same city always being left out (Busan’s Jung-gu) were we made aware of it.
 +
 
 +
[[File:Figure3.png|center]]
 +
<center>Fig 3: Jung-gu in Incheon and Busan</center>
  
 
# <h4> 2. Joined the data table with the status codes list </h4>
 
# <h4> 2. Joined the data table with the status codes list </h4>

Revision as of 19:30, 22 April 2017

Current Project

Logo Hiryuu.png


Home

About Us

Project Overview

Findings

Project Management

Documentations

Data Preparation Analysis

Data Tools Used

Microsoft Excel

Used a VBA macro to merge all individual data files into a compilation Macro’s function was to open all csv files in a folder and add them into a master sheet by the rows

Pros:

  • Excel was the tool of choice by the sponsors, and it preserves the character encoding of non-western characters
  • Easy to use

Cons:

  • Limited in the number of rows it could handle

JMP Pro

Pros:

  • Useful in joining data tables with large number of data
  • Useful in tabulating data of different variables into a data table
  • Easy creation of visuals to look into the data

Cons:

  • Faced issues with non-western characters inconsistently that made it difficult to determine the cause
  • Difficulty in trying to use advanced functions that required scripting ability
  • No formula to find date difference for working days only

QGIS

Pros:

  • Familiar with this tool, used before in previous modules
  • Open source software
  • Can read greater viariet file types compared to ArcGIS

Cons:

  • Cannot open more than 1 project at a time
  • Unable to plot Vector map for both (1-grid) and (2-grid)
  • Unable to log contours

Objectives

  • Identifying the correct set of data
  • Identifying complete shipments
  • Identifying shipments as inbound or outbound
  • Ensure no missing data
  • Identifying the relevant geospatial data for analysis
  • Clarify unclear data/abnormalities with Sponsors

Complexities Involved

  1. 1. Complexities in Geocoding by Country

The shipment data provided to us includes geospatial data pertaining to the sender and receiver addresses. These properties made it appropriate for us to geocode the data in order to perform detailed analysis to improve processes. However, before such data could be utilised effectively, the intermediate steps of cleaning and comprehending has to be performed.

  1. Hong Kong

A major restriction in conducting geospatial analysis for Hong Kong’s data is the lack of a proper national postal code system in Hong Kong. Due to the inability to easily geocode a shipment, attempts to geocode a shipment using information such as address and district have also proved futile. This is due to the inconsistency in data recording across employees and customers. For example in figure 1, there are 3 different records of Mong Kong, a district in Hong Kong. Also, there are also differences in the way addresses are recorded. Addresses could be seen recorded in full with the unit and block number, whereas there are other instances where addresses only contained of the street name or building name.

Figure1.png
Fig 1: District Names in Hong Kong
  1. Korea

Problems arose in both the sponsor’s and publicly available data. Korea used to operate on a 6-digit postal code system, but as of August 1st, 2015, the country switched to a new 5-digit postal code system. However, the sponsor’s data had both the old and new system present in their data, and this unnecessarily complicated the geocoding process. A workaround this problem was to make use of the state and city names. The sponsor’s data had information on Receiver’s and Sender’s City, State, and Address. Fortunately, all State names were unique. (E.g. Busan, Incheon). Tying the unique State name together with the City residing within, we were able to generate our own unique national ID for each City in Korea. The naming of States and Cities were consistent across both the sponsor’s dataset as well as the publicly available data. This approach was adopted to bypass the complexities of two different postal code system residing in the sponsor’s dataset.

Figure2.JPG
Fig 2: Geocodes to create Korea's Kcodes

Lastly there also a related issue pertaining to the geospatial data accuracy of the publicly available shape file. The latest version was the 2015 version and there were issues with the polygons. A single entity in the public dataset was linked to 2 polygons erroneously. More specifically, selecting Jung-gu in Busan also led to Jung-gu in Incheon being selected. (as demonstrated in the figure 3 below). This issue did not emerge at the first few steps of cleaning, it was only after the mapping over of data resulted in always the same city always being left out (Busan’s Jung-gu) were we made aware of it.

Figure3.png
Fig 3: Jung-gu in Incheon and Busan
  1. 2. Joined the data table with the status codes list

  • Status data only contained the stage code without explanation
  • Needed to join with another data table with the stage codes and relevant descriptions
  1. 3. Removing Duplicate Rows

  • Duplicate rows with identical data were present in the data given to us
  • Needed to delete duplicates to ensure greater data accuracy for our analysis
  • Duplicated data rows serves no practical purpose hence ought to be removed
  • Used a JMP jsl script posted on the JMP community website to delete duplicate rows whilst preserving the first unique occurrence of it
  1. 4. Checking the data for any abnormalities based on standard parameters given by the sponsor

  • Basic guidelines put forth by the Sponsor includes:
  • Quantity = 1
  • All should be domestic shipments
    • With the exception for App 2's data
  • Sender and Receiver Information should be included

Detailed Data Preparation Steps

Exploratory Analysis

  1. 1. Identifying inbound(IB) / outbound(OB) shipments

Different methods are used for different data sets according to the specifications of our sponsor.
Methods used include:

  • Checking the receiver and sender name
  • Identifying specific customer account no for IB and OB shipments
  • Identifying IB and OB shipments using starting and ending stage codes
  1. 2. Calculating turnaround time (TAT)

TAT time is used to calculate the total time taken from the starting point to the ending point of the shipment. It is used to flag out and ensure that shipments stay within the stipulated time.
Steps conducted:

  1. Identify the starting and ending dates using the starting and ending stage codes and their relevant stage updated dates.
  2. Calculate the TAT time by the difference in the number of days between the ending and starting date.

For shipments that are missing either a starting or ending point, they are flagged out and labelled as "Incomplete" shipments.
Shipments that are complete, they will either by labelled as "Pass" or "Fail" depending on the calculated TAT time. If the TAT time <= N, it will be labelled as "Pass" else "Fail".

Geospatial Analysis

  1. Compatibility of Sponsor Data and Online SHP files

Before we can proceed with mapping count values into our SHP map, we had to find identify a joining column. This column will ensure that it is unique to each City/Province and must be identical for both Sponsor and Online data. Examples of joining columns include:

  • Unique country-code based on ProvinceID and CityID that includes Postal Code as well
  • If such country-code is not avaiable, we will create our own Country-Code by inserting our own ProvinceID and CityID and then creating the code as such: "ProvinceID-CityID"
  • Sponsor data for some countries did not include

Further Problems:
Sponsor data had to be cleaned as well as they sometimes input "District, City" as Receiver. If shp map and sponsor data have too many mismatches, but lat long is provided, then we will have to create our own polygons and shp file based on these coordinates.