IS428 AY2019-20T1 Assign Harvey Kristanto Lauw: Dataset
|
|
|
|
Datasets
Before we start with the data analysis, there is a need to understand and read the raw data. For this VAST Challenge 2019: Mini-Challenge 1, I am given 1 dataset which is:
- mc1-reports-data.csv [Data columns:Time | Sewer & Water | Power | Roads & Bridges | Medical | Buildings | Shake intensity | Location ]
For this Challenge, I will be leveraging on the custom polygon maps for St. Himark in order to create map visualizations with the 2 datasets listed below:
- StHimark_Features.csv [Data columns:Id | Nbrhood | StHimark_ID | Latitude | Longitude ]
- StHimark_Points.csv [Data columns:StHimark_ID | sub_polygon_id | point_order | Latitude | Longitude ]
Exploratory Data Analysis
Moving on to the data analysis, I will be performing Exploratory Data Analysis using Tableau Prep Builder. There are some issues with the original dataset listed below:
Columns with Null Values
- Sewer & Water: 171 counts of Null values
- Medical: 47441 counts of Null values
- Building: 170 counts of Null values
- Shake Intensity: 12144 counts of Null values
We cannot simply replace the Null values as columns like “Medical” and “Shake Intensity” contains a significant number of records with Null value to replace. If chosen to replace with mean or median, it will drastically change the variation of the distribution. Hence, the decision is to remove these values. Removed values will be in a form of records with the sole category and the rating itself, as pivoting will be done during the dataset transformation.
Dataset Transformation
Description | Reference |
---|---|
Overview |
|
Step 1: Creation of Calculated Field Using “location” as an ID to create a calculated field (Refer to Figure 0.2) of “Location Name” based on the word document reference for the background of St. Himark, VAST 2019 - St. Himark - About Our City.docx. |
|
Step 2: Pivot and Clean Pivot 6 columns: Sewer & Water, Power, Roads & Bridges, Medical, Buildings & Shake intensity. Cleaning of data source: Proper Renaming of header row (Refer to Figure 0.3) and Removal of Null Values (Refer to Figure 0.4).
Result in Number of rows: 83072 -> 438494 |
|
Step 3: Join and Clean Part 1 Joining with mc1-reports-data.csv & StHimark_Features.csv with these 2 identifiers (Refer to Figure 0.5) respectively: “location” & “id” to populate the data column, “StHimark_ID”. And Removal of unwanted columns: Latitude & Longitude (These points are not accurate and does not create the shape of the map, hence the decision to remove).
|
|
Step 4: Join and Clean Part 2 Joining with StHimark_Features.csv & StHimark_Points.csv with these 2 identifiers (Refer to Figure 0.7) respectively: “StHimark_ID” & “StHimark_ID”. And Removal of unwated columns: Latitude (StHimark_Features.csv) & Longitude (StHimark_Features.csv), Nbrhood, sub_polygon_id and StHimark_ID (StHimark_Features.csv).
|
|