Hiryuu Data Preparation

From Analytics Practicum
Jump to navigation Jump to search

Current Project

Logo Hiryuu.png


Home

About Us

Project Overview

Findings

Project Management

Documentations

Data Preparation Analysis

Data Tools Used

Microsoft Excel

Used a VBA macro to merge all individual data files into a compilation Macro’s function was to open all csv files in a folder and add them into a master sheet by the rows

Pros:

  • Excel was the tool of choice by the sponsors, and it preserves the character encoding of non-western characters
  • Easy to use

Cons:

  • Limited in the number of rows it could handle

JMP Pro

Pros:

  • Useful in joining data tables with large number of data
  • Useful in tabulating data of different variables into a data table
  • Easy creation of visuals to look into the data

Cons:

  • Faced issues with non-western characters inconsistently that made it difficult to determine the cause
  • Difficulty in trying to use advanced functions that required scripting ability
  • No formula to find date difference for working days only

Objectives

  • Identifying the correct set of data
  • Identifying complete shipments
  • Identifying shipments as inbound or outbound
  • Ensuring no missing data
  • Identifying the relevant geospatial data for analysis
  • Clarify unclear data/abnormalities with Sponsors
    • E.g. Quantity > 1

General Steps Taken

  1. 1. Joining data tables for shipment details and status update

  • Used left outer join with statuses as the main table so as to add on the corresponding shipment details to each status update
  • Used Match Flag to check if there is missing data from either source
  1. 2. Joined the data table with the status codes list

  • Status data only contained the stage code without explanation
  • Needed to join with another data table with the stage codes and relevant descriptions
  1. 3. Removing Duplicate Rows

  • Duplicate rows with identical data were present in the data given to us
  • Needed to delete duplicates to ensure greater data accuracy for our analysis
  • Duplicated data rows serves no practical purpose hence ought to be removed
  • Used a JMP jsl script posted on the JMP community website to delete duplicate rows whilst preserving the first unique occurrence of it
  1. 4. Checking the data for any abnormalities based on standard parameters given by the sponsor

  • Basic guidelines put forth by the Sponsor includes:
  • Quantity = 1
  • All should be domestic shipments
    • With the exception for App 2's data
  • Sender and Receiver Information should be included