Hiryuu Findings
|
|
|
|
|
|
Data Preparation | Analysis |
---|
Contents
- 1 Data Tools Used
- 2 Objectives
- 3 Complexities Involved
- 3.1 1. Complexities in Geocoding by Country
- 3.2 2. Complexities in Data Provided
- 3.2.1 1) Different Data Systems Utilized
- 3.2.2 2) Differences Across Different Countries
- 3.2.3 3) Different Vendors/Service Providers
- 3.2.4 4) Different Definitions for the Completion of a Shiment
- 3.2.5 5) Different Data Sources Required to Assess Performance
- 3.2.6 6) Unique Geospatial Mapping for Australia and Japan
- 4 Detailed Data Preparation Steps
Data Tools Used
Microsoft Excel
Used a VBA macro to merge all individual data files into a compilation Macro’s function was to open all csv files in a folder and add them into a master sheet by the rows
Pros:
- Excel was the tool of choice by the sponsors, and it preserves the character encoding of non-western characters
- Easy to use
Cons:
Limited in the number of rows it could handle
JMP Pro
Pros:
- Useful in joining data tables with large number of data
- Useful in tabulating data of different variables into a data table
- Easy creation of visuals to look into the data
Cons:
- Faced issues with non-western characters inconsistently that made it difficult to determine the cause
- Difficulty in trying to use advanced functions that required scripting ability
- No formula to find date difference for working days only
QGIS
Pros:
- Familiar with this tool, used before in previous modules
- Open source software
- Can read greater viariet file types compared to ArcGIS
Cons:
- Cannot open more than 1 project at a time
- Unable to plot Vector map for both (1-grid) and (2-grid)
- Unable to log contours
Objectives
- Identifying the correct set of data
- Identifying complete shipments
- Identifying shipments as inbound or outbound
- Ensure no missing data
- Identifying the relevant geospatial data for analysis
- Clarify unclear data/abnormalities with Sponsors
Complexities Involved
1. Complexities in Geocoding by Country
The shipment data provided to us includes geospatial data pertaining to the sender and receiver addresses. These properties made it appropriate for us to geocode the data in order to perform detailed analysis to improve processes. However, before such data could be utilised effectively, the intermediate steps of cleaning and comprehending has to be performed.
- Hong Kong
A major restriction in conducting geospatial analysis for Hong Kong’s data is the lack of a proper national postal code system in Hong Kong. Due to the inability to easily geocode a shipment, attempts to geocode a shipment using information such as address and district have also proved futile.
This is due to the inconsistency in data recording across employees and customers.
For example in figure 1, there are 3 different records of Mong Kong, a district in Hong Kong. Also, there are also differences in the way addresses are recorded. Addresses could be seen recorded in full with the unit and block number, whereas there are other instances where addresses only contained of the street name or building name.
- Korea
Problems arose in both the sponsor’s and publicly available data. Korea used to operate on a 6-digit postal code system, but as of August 1st, 2015, the country switched to a new 5-digit postal code system. However, the sponsor’s data had both the old and new system present in their data, and this unnecessarily complicated the geocoding process.
A workaround this problem was to make use of the state and city names. The sponsor’s data had information on Receiver’s and Sender’s City, State, and Address. Fortunately, all State names were unique. (E.g. Busan, Incheon). Tying the unique State name together with the City residing within, we were able to generate our own unique national ID for each City in Korea. The naming of States and Cities were consistent across both the sponsor’s dataset as well as the publicly available data. This approach was adopted to bypass the complexities of two different postal code system residing in the sponsor’s dataset.
Lastly there also a related issue pertaining to the geospatial data accuracy of the publicly available shape file. The latest version was the 2015 version and there were issues with the polygons. A single entity in the public dataset was linked to 2 polygons erroneously. More specifically, selecting Jung-gu in Busan also led to Jung-gu in Incheon being selected. (as demonstrated in the figure 3 below). This issue did not emerge at the first few steps of cleaning, it was only after the mapping over of data resulted in always the same city always being left out (Busan’s Jung-gu) were we made aware of it.
- Japan
In contrast to the above countries, Japan has implemented a systematic standard in the format for addresses along with its postal code systems, which made it easy to geocode the addresses in the data given to us.
The Japanese address follows a clear format of the following example:
〒123-4567 |
〒123-4567 |
---|
In the writing of the address, they make clear distinctions between the different levels of classification. This clear distinction is reflected in the forms for documentation used by their national offices which allow you to fill in the addresses in the systematic order from the 都道府県 (prefecture), to 市区町村 (city/ward/town/county) then to the street number and room number if available.
The first line of the address in both languages refers to the postal code.
The second line of the address in Japanese refers to the prefecture, which uses the suffixes of 都道府県 with the name of the prefecture. For example, Tokyo would be written as Tokyo-to, whereas Osaka would be written as Osaka-fu.
The third line of the address in Japanese refers to the city/ward/town/county, which uses the suffixes of 市区町村 with the name. For example, the 23 special wards of Tokyo follow with the -ku suffix, such as Shinjuku-ku, Taito-ku.
The fourth line of the address in Japanese refers to the street level, which is similar to how we write our addresses in Singapore as 81 Victoria Street or Ang Mo Kio Avenue 3. The X丁目 refers to Avenue X, whereas the YY-ZZ refers to the specific lot number to locate the exact location of the address, similar to the 81 of 81 Victoria Street to find the exact lot on the street.
The systematic writing of the addresses in Japan with clear distinction over the different geographic classification layers made it easy for clear identification of the actual location that correspond to the address.
Furthermore, the postal code is accurate to the avenue name, allowing for one to narrow down to a specific locality such as 西早稲田X丁目, which would narrow down to the local post offices in charge of the parcel to find out the exact address of the receiver should the postal code and some ambiguous writings be the only address written for a postal item.
The postal codes themselves also follow a fixed format in the identification of the prefecture level and city level names, using the first 3 to 5 digits of the postal code. This postal code data is published by the Japan Post Office online, allowing for one to download the data to use for geocoding purposes. By using the postal code data, the prefecture and city names were effectively identified from the postal code provided in the sponsor’s data. The data published by the Japan Post Office is regularly updated and maintained, allowing all data to be kept up to date with changes made.
2. Complexities in Data Provided
1) Different Data Systems Utilized
Two different logistics tracking systems, App1 and App2 have been utilised by our sponsors in managing logistic shipments in different countries. App1 and App2 are the systems that record the data pertaining to the shipment details such as the sender and receiver information, and tracks the distribution process for each shipment from the picking up of the shipment to its successful delivery. Japan, Korea, Taiwan, Thailand and Hong Kong uses App1, whereas Australia and New Zealand uses App2. There exists a number of differences in the reporting of data across these two apps.
As the data structure across both apps are different, this makes the reconciliation of data across both apps difficult and impossible. This is due to the lack of consistency in naming different columns, the lack of specific columns of information in App2 which makes understanding the problems of a shipment more difficult compared to App1.
2) Differences Across Different Countries
As with differences in cultures in different countries, this thus resulted in differences in recording data across different countries. Below, we will explain in detail how the handling of these regional data adds new layers of complexity beyond what we would have had experienced in the classroom with the data given to us by our school.
Different Languages
In Singapore, we see that we have standardized all official processes to be done and documented in the English language for that it is the official language of our nation. Similarly, other countries would choose to document their work in their official language, which may not always be English. This adds another layer of complexity in our data, for that not all countries follow with inputting their data in English. Although system interfaces have been designed in English with some fields to be picked from a list in English, some information such as the addresses and names which were originally written in another language by the end consumer has to be inputted in exactly into the system.
For the list of countries that we have the data for, the data collected includes that of English, Japanese, Chinese, Thai and Korean languages. Although not all fields written in non-English languages may be key figures used in determining the key performance indicators, some of them involve data that would be used in further analysis. Main data fields that were written in different languages were those pertaining to geospatial data, for that addresses could be written in local languages instead of English. This affected us in using some of the data fields originally in the data. For example, in the data for Japan, the column that records the state of an address has that of both languages of Tokyo and 東京都, and some cases it was written in different formats such as Tokyo and Tokyo-to. The variations across languages and the corresponding variations when written in English saw a need to be reconciled in a singular standard format to be derived from the data we have.
For countries with a postal code system and their postal code data freely available online and is up-to-date, we were able to derive a standardized list of values in English using these sources and matched it to our data columns which consists of postal code values. Postal code values were written in numerical formats, free from the languages, hence allowing us to use as an accurate reference in deriving the required data columns of state and city level details in English from external sources.
The difference in languages not only affects our ability to assess the data directly, but also in accessing data sources available online that may be specific to their national tongue. Used to the westernized society of Singapore, we may be inclined to think that all official data on the internet such as our own data.gov.sg will necessarily be in English. However, that had not been the case when we had to source for data for our usage. For example, the Japan Post Office had posted their postal code data online but could only be accessed on their Japanese website. On the other hand, Korea’s geospatial data files were on their equivalent of our data.gov.sg, however was in Korean and required identifications as a Korean resident in order to gain access to it. This challenge faced is not just limited to us who are working on this project, but also applies to the sponsors also had to work with local entities or speakers of the local languages to gain access to data in their implementation of the different projects. Despite the challenges faced in this, it also allowed us to gain exposure to such data that we would not have the opportunity to unless we were working on regional data, which proved useful to us in preparing ourselves and also learning how to handle it despite our lack of knowledge in the languages.
Different Ways of Identifying Shipment Type as Inbound/Outbound
Through our experience in the project, we understood that there are different ways in identifying an inbound or outbound shipment for different countries and for different apps.
The major difference lies in countries using App1. Firstly, for countries using App1. In Japan, a specific string (e.g. ABC XXX) in the ‘Receiver Name’ is used to identify inbound shipments. If the name in ‘Receiver Name’ matches our intended string, it is labelled as an inbound shipment, else outbound. This method is similarly used in Korea and Taiwan, however the difference would be the specific string used to identify. On the other hand, Thailand identifies its outbound shipments using the “Sender Address”. As such, if the sender address matches the address of the company address in Thailand, it is thus an outbound shipment. Lastly, Hong Kong identifies its inbound shipments using the “Customer Account No”. If the customer account number matches the specific account number for our sponsor, it is thus labelled as an inbound shipment. Other account numbers are labelled as outbound shipments.
On the other hand, App2 which is used by Australia and New Zealand identifies their inbound and outbound shipments using the “Pickup Address” and “Receiver Address”. If the “Pickup Address” matches the warehouse’s address, it is an outbound shipment. Whereas if the “Receiver Address” matches the warehouse’s address, it is then an inbound shipment.
As such, the various differences in simply identifying an inbound or outbound shipment for further analysis suggests the complexities managing the data. This also proves the inflexibility in creating a unified dashboard for all different countries.
3) Different Vendors/Service Providers
Different vendors and service providers may be used for different shipments and also in the different countries. Some vendors specialize in handling bulk shipments, while others may differ in the areas they service, hence our sponsors engage a number of different vendors in their logistics services, and integrates the shipment data from the vendors into their own systems of App1 or App2. The engagement of different vendors and the assessment of their performances may differ, such that there might be different service level agreements (SLA) made between the different vendors despite the same areas handled by them. For example, Vendor A might be able to deliver to State XYZ within 1 day, however Vendor B might only be able to do so within 2 days, hence they cannot be assessed on one singular standard but has to be assessed based on what they have agreed as per the SLA with our Sponsor. Next, we will explain in detail the complexities resulting from the engagement of multiple vendors.
The integration process of shipment data between the vendors and our sponsor requires extensive status mapping to be done, to find the best match between the statuses provided by the vendor and those currently existing in App1 or App2. The statuses currently in the system may not always be a one-size-fits-all, hence may require the mapping and creation of additional statuses to fit the level of detail in the statuses provided. It is also important to note that the different levels of detail in the statuses of the different vendors may also correspond to the expected level of detail of each locale, which may not be like the Singaporean context which our sponsor’s App1 and App2 were based on. This level of detail may also be affected by the differences in the shipment distribution process by the vendors in response to the needs of the local community, hence needs to be taken into account in assessing the performance of each vendor. Firms handling logistics processes of different regions will have to understand the vendors it engages and the local context in order to better handle the expectations of their clients.
The different vendors engaged may follow different SLA agreed upon between the individual vendors and the sponsors, hence this results in the need to assess the different vendors by the different standards. The Sponsor’s App1 data contains a data column to record “Agent / Vendor name”, which is required in order to make the distinction from the list of shipments on which are handled by which vendor. By matching with the vendor and the corresponding geolocation specified in the SLA, we could compute the key performance indicators following the standards from our sponsor in measuring turnaround time taken to assess the vendors’ performance.
4) Different Definitions for the Completion of a Shiment
As end consumers and users of the services available on the market, we tend to have the perception that the end of the shipment cycle would be when the item is delivered successfully into the hands of the recipient. This may not be an inaccurate definition of the end point, but is only one of the many different possibilities. Below is a flowchart to show an example of possible endpoints that would mark the completion of the shipment.
As we can see from the flowchart, delivery successful may not always be the end point. The shipment could be lost in transporting, which would mark it as an exception case to be investigated. Multiple delivery unsuccessful attempts could mark the item to be returned to the sender, following the standard protocols set out by each logistics service provider in assessing how long will they hold a shipment following unsuccessful delivery attempts.
The representation of these different stages of the shipment’s cycle are recorded in the systems as status codes corresponding to each unique status code. The details of what is involved in causing the different statuses are also captured in the specific status codes, such as the number of times a shipment was out for a delivery attempt, exception cases involving shipment damage and the extent of damage. The status codes hence contains these specific details in them to be used in understanding the issue beyond the general level of details shown to the general consumer.
Following the different statuses to be taken into account as potential endpoints of the shipment cycle, we cannot analyse the data in isolation with a single fixed endpoint, but have to understand which statuses are to be accounted for in the data. This requires us to allow for variable change following ways to mark a completion and provide for the flexibility in adjusting for the definitions of the endpoints of the shipment cycle. Perhaps enhancements can be made for existing systems to allow for certain fixed end points, however the dynamics in the data coverage of the different contexts may be eroded should that be done instead.
5) Different Data Sources Required to Assess Performance
Adding to the complexity of managing data across different systems, our team have also identified a difficulty due to the different data sources required to assess the performance of a shipment. Due to the lack of a consolidated data system in the company, the procedures to prepare the data to assess a shipment’s performance is tedious and complicated.
The following is a list of sources required to be consolidated before further analysis could be done.
6) Unique Geospatial Mapping for Australia and Japan
Not all countries have progressed on the same developmental stages in mapping geospatial data for their country, some may have done extensive data collection to gather all details of the geographical landscape whereas others may only have basic data regarding the roads only. In addition, the geospatial data collected may not be geocoded in similar manners, dependent on the context and culture of the locality on what attributes have been assigned to such data records. For the purposes of the project, we have focused on the geospatial data of Australia and Japan, the two main territories handled by our sponsor. Due to the factors explained above, the format and attributes of the geospatial data available from these two countries differed greatly.
The geospatial attributes of the data for Japan captured in App1 were the postal codes of the receiver. However, the publicly available shape file for Japan that we sourced for did not use postal codes to identify the polygons denoting the different areas in Japan. Instead, the shape file contained data of the JIS code, to be explained in detail in the later section. By utilizing a separate data source to map the JIS codes to the postal codes, we were able to join the shape file to the shipment data from our sponsor’s data and hence map out the geospatial data based on our sponsor’s requirements.
While the geospatial mapping was rather straightforward for Japan for they had datasets that were easily convertible to map with each other, it was not the same for Australia’s geospatial data. Unlike Japan that had clearly defined areas and methods of classifying these territories, Australia did not have a fixed list of the classifications and definitions of the areas, whereby some definitions were often subjected to arbitrary definitions by the locals. Despite the clear definitions of what consisted as a state in Australia, the suburbs that were referenced in addresses from our sponsor’s data were not the nationally recognized standard, hence lacked a complete and up-to-date list for our geospatial mapping. The suburbs used in the sponsor’s data were a mix of either broadly recognized shires and countries or specific neighbourhoods.
In comparison to that, the publicly available shape file that we had sourced used the national standards of Suburbs to name each polygon, which we found to be more standardised and uniform compared to the arbitrary naming standards we saw in the sponsor’s data. Hence to be able to better visualize the sponsor’s data and allow it to conform to widely accepted standards, we have decided to use a separate source to map the latitude and longitude information to the postal codes captured in our sponsor’s data in order to get each individual coordinate mapped on the shape file and then using the shape file’s existing polygon to act as the classifier. By doing so, we were then able to map the shipments to their respective suburbs by the national standards in the map visualization utilized.
Detailed Data Preparation Steps
Exploratory Analysis
1. Identifying inbound(IB) / outbound(OB) shipments
Different methods are used for different data sets according to the specifications of our sponsor.
Methods used include:
- Checking the receiver and sender name
- Identifying specific customer account no for IB and OB shipments
- Identifying IB and OB shipments using starting and ending stage codes
2. Calculating turnaround time (TAT)
TAT time is used to calculate the total time taken from the starting point to the ending point of the shipment. It is used to flag out and ensure that shipments stay within the stipulated time.
Steps conducted:
- Identify the starting and ending dates using the starting and ending stage codes and their relevant stage updated dates.
- Calculate the TAT time by the difference in the number of days between the ending and starting date.
For shipments that are missing either a starting or ending point, they are flagged out and labelled as "Incomplete" shipments.
Shipments that are complete, they will either by labelled as "Pass" or "Fail" depending on the calculated TAT time. If the TAT time <= N, it will be labelled as "Pass" else "Fail".
Geospatial Analysis
Compatibility of Sponsor Data and Online SHP files
Before we can proceed with mapping count values into our SHP map, we had to find identify a joining column. This column will ensure that it is unique to each City/Province and must be identical for both Sponsor and Online data. Examples of joining columns include:
- Unique country-code based on ProvinceID and CityID that includes Postal Code as well
- If such country-code is not avaiable, we will create our own Country-Code by inserting our own ProvinceID and CityID and then creating the code as such: "ProvinceID-CityID"
- Sponsor data for some countries did not include
Further Problems:
Sponsor data had to be cleaned as well as they sometimes input "District, City" as Receiver.
If shp map and sponsor data have too many mismatches, but lat long is provided, then we will have to create our own polygons and shp file based on these coordinates.