Hiryuu Findings

From Analytics Practicum
Jump to navigation Jump to search

Current Project

Logo Hiryuu.png


Home

About Us

Project Overview

Findings

Project Management

Documentations

Data Preparation Analysis

Data Tools Used

Microsoft Excel

Used a VBA macro to merge all individual data files into a compilation Macro’s function was to open all csv files in a folder and add them into a master sheet by the rows

Pros:

  • Excel was the tool of choice by the sponsors, and it preserves the character encoding of non-western characters
  • Easy to use

Cons:

  • Limited in the number of rows it could handle

JMP Pro

Pros:

  • Useful in joining data tables with large number of data
  • Useful in tabulating data of different variables into a data table
  • Easy creation of visuals to look into the data

Cons:

  • Faced issues with non-western characters inconsistently that made it difficult to determine the cause
  • Difficulty in trying to use advanced functions that required scripting ability
  • No formula to find date difference for working days only

QGIS

Pros:

  • Familiar with this tool, used before in previous modules
  • Open source software
  • Can read greater viariet file types compared to ArcGIS

Cons:

  • Cannot open more than 1 project at a time
  • Unable to plot Vector map for both (1-grid) and (2-grid)
  • Unable to log contours

Objectives

  • Identifying the correct set of data
  • Identifying complete shipments
  • Identifying shipments as inbound or outbound
  • Ensure no missing data
  • Identifying the relevant geospatial data for analysis
  • Clarify unclear data/abnormalities with Sponsors

Complexities Involved

  1. 1. Complexities in Geocoding by Country

The shipment data provided to us includes geospatial data pertaining to the sender and receiver addresses. These properties made it appropriate for us to geocode the data in order to perform detailed analysis to improve processes. However, before such data could be utilised effectively, the intermediate steps of cleaning and comprehending has to be performed.

  • Hong Kong

A major restriction in conducting geospatial analysis for Hong Kong’s data is the lack of a proper national postal code system in Hong Kong. Due to the inability to easily geocode a shipment, attempts to geocode a shipment using information such as address and district have also proved futile. This is due to the inconsistency in data recording across employees and customers. For example in figure 1, there are 3 different records of Mong Kong, a district in Hong Kong. Also, there are also differences in the way addresses are recorded. Addresses could be seen recorded in full with the unit and block number, whereas there are other instances where addresses only contained of the street name or building name.

Figure1.png
Fig 1: District Names in Hong Kong
  • Korea

Problems arose in both the sponsor’s and publicly available data. Korea used to operate on a 6-digit postal code system, but as of August 1st, 2015, the country switched to a new 5-digit postal code system. However, the sponsor’s data had both the old and new system present in their data, and this unnecessarily complicated the geocoding process. A workaround this problem was to make use of the state and city names. The sponsor’s data had information on Receiver’s and Sender’s City, State, and Address. Fortunately, all State names were unique. (E.g. Busan, Incheon). Tying the unique State name together with the City residing within, we were able to generate our own unique national ID for each City in Korea. The naming of States and Cities were consistent across both the sponsor’s dataset as well as the publicly available data. This approach was adopted to bypass the complexities of two different postal code system residing in the sponsor’s dataset.

Figure2.JPG
Fig 2: Geocodes to create Korea's Kcodes

Lastly there also a related issue pertaining to the geospatial data accuracy of the publicly available shape file. The latest version was the 2015 version and there were issues with the polygons. A single entity in the public dataset was linked to 2 polygons erroneously. More specifically, selecting Jung-gu in Busan also led to Jung-gu in Incheon being selected. (as demonstrated in the figure 3 below). This issue did not emerge at the first few steps of cleaning, it was only after the mapping over of data resulted in always the same city always being left out (Busan’s Jung-gu) were we made aware of it.

Figure3.png
Fig 3: Jung-gu in Incheon and Busan
  • Japan

In contrast to the above countries, Japan has implemented a systematic standard in the format for addresses along with its postal code systems, which made it easy to geocode the addresses in the data given to us. The Japanese address follows a clear format of the following example:

〒123-4567
東京都
新宿区
西早稲田X丁目YY-ZZ

〒123-4567
X-YY-ZZ Nishiwaseda
Shinjuku-ku,
Tokyo-to

In the writing of the address, they make clear distinctions between the different levels of classification. This clear distinction is reflected in the forms for documentation used by their national offices which allow you to fill in the addresses in the systematic order from the 都道府県 (prefecture), to 市区町村 (city/ward/town/county) then to the street number and room number if available. The first line of the address in both languages refers to the postal code. The second line of the address in Japanese refers to the prefecture, which uses the suffixes of 都道府県 with the name of the prefecture. For example, Tokyo would be written as Tokyo-to, whereas Osaka would be written as Osaka-fu. The third line of the address in Japanese refers to the city/ward/town/county, which uses the suffixes of 市区町村 with the name. For example, the 23 special wards of Tokyo follow with the -ku suffix, such as Shinjuku-ku, Taito-ku. The fourth line of the address in Japanese refers to the street level, which is similar to how we write our addresses in Singapore as 81 Victoria Street or Ang Mo Kio Avenue 3. The X丁目 refers to Avenue X, whereas the YY-ZZ refers to the specific lot number to locate the exact location of the address, similar to the 81 of 81 Victoria Street to find the exact lot on the street. The systematic writing of the addresses in Japan with clear distinction over the different geographic classification layers made it easy for clear identification of the actual location that correspond to the address. Furthermore, the postal code is accurate to the avenue name, allowing for one to narrow down to a specific locality such as 西早稲田X丁目, which would narrow down to the local post offices in charge of the parcel to find out the exact address of the receiver should the postal code and some ambiguous writings be the only address written for a postal item. The postal codes themselves also follow a fixed format in the identification of the prefecture level and city level names, using the first 3 to 5 digits of the postal code. This postal code data is published by the Japan Post Office online, allowing for one to download the data to use for geocoding purposes. By using the postal code data, the prefecture and city names were effectively identified from the postal code provided in the sponsor’s data. The data published by the Japan Post Office is regularly updated and maintained, allowing all data to be kept up to date with changes made.


  1. 2. Complexities in Data Provided

Detailed Data Preparation Steps

Exploratory Analysis

  1. 1. Identifying inbound(IB) / outbound(OB) shipments

Different methods are used for different data sets according to the specifications of our sponsor.
Methods used include:

  • Checking the receiver and sender name
  • Identifying specific customer account no for IB and OB shipments
  • Identifying IB and OB shipments using starting and ending stage codes
  1. 2. Calculating turnaround time (TAT)

TAT time is used to calculate the total time taken from the starting point to the ending point of the shipment. It is used to flag out and ensure that shipments stay within the stipulated time.
Steps conducted:

  1. Identify the starting and ending dates using the starting and ending stage codes and their relevant stage updated dates.
  2. Calculate the TAT time by the difference in the number of days between the ending and starting date.

For shipments that are missing either a starting or ending point, they are flagged out and labelled as "Incomplete" shipments.
Shipments that are complete, they will either by labelled as "Pass" or "Fail" depending on the calculated TAT time. If the TAT time <= N, it will be labelled as "Pass" else "Fail".

Geospatial Analysis

  1. Compatibility of Sponsor Data and Online SHP files

Before we can proceed with mapping count values into our SHP map, we had to find identify a joining column. This column will ensure that it is unique to each City/Province and must be identical for both Sponsor and Online data. Examples of joining columns include:

  • Unique country-code based on ProvinceID and CityID that includes Postal Code as well
  • If such country-code is not avaiable, we will create our own Country-Code by inserting our own ProvinceID and CityID and then creating the code as such: "ProvinceID-CityID"
  • Sponsor data for some countries did not include

Further Problems:
Sponsor data had to be cleaned as well as they sometimes input "District, City" as Receiver. If shp map and sponsor data have too many mismatches, but lat long is provided, then we will have to create our own polygons and shp file based on these coordinates.