Difference between revisions of "IS428 AY2019-20T1 Assign Wei Ming DataTransformationAnalysis"
Latest revision as of 17:49, 13 October 2019
Preliminary Data Analysis
Geographic Information
Figure 1.1 St. Himark Labeled Map
St. Himark is subdivided into 19 neighborhoods. The infrastructure provided includes sewer & water, roads & bridges, gas, garbage, healthcare and power; 72% of the power is supplied by the Always Safe Nuclear Power Plant located at SAFE TOWN (Neighborhood 4).
St. Himark has an exceptional network of 8 hospitals: one each at PALACE HILLS (Neighborhood 1), OLD TOWN (Neighborhood 3), SOUTHWEST (Neighborhood 5), BROADVIEW (Neighborhood 9), TERRAPIN SPRINGS (Neighborhood 11) and SOUTHTON (Neighborhood 16), and two at DOWNTOWN (Neighborhood 6).
Report Data
Data Description
The data for MC1 consists of one CSV file spanning the entire length of the event, containing individual (categorical) reports of shaking/damage by neighborhood over time. Citizens can make reports at any time, but the reports are only recorded in 5-minute batches/increments due to the server configuration. Furthermore, delays in the receipt of reports may occur during power outages.
mc1-reports-data.csv fields:
- time: timestamp of incoming report/record, in the format YYYY-MM-DD hh:mm:ss
- location: id of neighborhood where person reporting is feeling the shaking and/or seeing the damage
- {shake_intensity, sewer_and_water, power, roads_and_bridges, medical, buildings}: reported categorical value of how violent the shaking was/how bad the damage was (0 - lowest, 10 - highest; missing data allowed)
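As a minimal sketch of how a file with this layout could be read outside Tableau, the snippet below parses the documented columns with Python's standard library and turns empty cells into None, since missing data is allowed. The two sample rows, the year in the timestamps and the helper name parse_reports are made up for illustration and are not taken from the real file.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample in the documented column layout; the values are invented.
SAMPLE = """time,location,shake_intensity,sewer_and_water,power,roads_and_bridges,medical,buildings
2020-04-08 08:35:00,3,6,7,,5,4,8
2020-04-08 08:35:00,7,,2,1,,,3
"""

CATEGORIES = ["shake_intensity", "sewer_and_water", "power",
              "roads_and_bridges", "medical", "buildings"]


def parse_reports(text):
    """Parse report rows, converting empty cells to None."""
    reports = []
    for row in csv.DictReader(io.StringIO(text)):
        record = {
            "time": datetime.strptime(row["time"], "%Y-%m-%d %H:%M:%S"),
            "location": int(row["location"]),
        }
        for name in CATEGORIES:
            cell = row[name].strip()
            # float() first so that cells such as "6.0" are also accepted
            record[name] = int(float(cell)) if cell else None
        reports.append(record)
    return reports


reports = parse_reports(SAMPLE)
```

For a real file, the same function could be fed the file's text instead of SAMPLE.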
Exploratory Data Analysis
There are 83070 report records in this file, covering five days from April 6 to April 10. For any point in time at a given location, there can be multiple records, one record, or none.
1. Number of Records by Location
Figure 1.2 Number of Records by Location
Finding 1: According to the stacked bar chart, Locations 3 (OLD TOWN), 8 (SCENIC VISTA) and 9 (BROADVIEW) received the most reports, while Location 7 (WILSON FOREST) received very few. According to the shake map, Location 3 is one of the most affected areas, whereas Location 7 is a developing area where not many people live.
Finding 2: Broken down by damage category, the proportions are fairly even across locations, except for "medical" damage, shown in red in the chart. Locations 2, 4, 7, 8, 10, 12, 13, 14, 15, 17, 18 and 19 have very few reports about medical damage. This is most likely because there are no hospitals in those areas.
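The counts behind a stacked bar chart like this can be sketched as follows. This is only an illustration of the idea, assuming each report has already been parsed into a location id and a dictionary of category values (None for missing); the three sample reports and the function name count_by_location are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical parsed reports: (location, {category: value or None}).
reports = [
    (3, {"power": 7, "medical": 5, "buildings": 8}),
    (3, {"power": 6, "medical": None, "buildings": 9}),
    (7, {"power": 2, "medical": None, "buildings": None}),
]


def count_by_location(reports):
    """Return total reports per location and non-missing reports per category."""
    totals = Counter()
    by_category = defaultdict(Counter)
    for location, values in reports:
        totals[location] += 1
        for category, value in values.items():
            if value is not None:  # only count categories the citizen filled in
                by_category[location][category] += 1
    return totals, by_category


totals, by_category = count_by_location(reports)
```

Here totals gives the bar height per location, and by_category gives the size of each colored segment, so a location without medical reports simply has a zero-height medical segment.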
2. Number of Records by Time
Figure 1.3 Number of Records by Time
Finding 1: There are very few records from April 6 until around 7am on April 8, followed by a sharp increase after 7am. This suggests that a major quake may have occurred on April 8.
Finding 2: There are several peaks on April 9 and 10, indicating that there may have been several aftershocks on these two days.
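One way to look for such peaks programmatically is to bucket the report timestamps by hour and find the busiest bucket. The sketch below assumes timestamps in the documented YYYY-MM-DD hh:mm:ss format; the four sample timestamps (including the year) and the function name reports_per_hour are invented for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps in the documented format.
timestamps = [
    "2020-04-06 10:05:00",
    "2020-04-08 08:35:00",
    "2020-04-08 08:40:00",
    "2020-04-08 09:10:00",
]


def reports_per_hour(timestamps):
    """Count reports per hour by truncating each timestamp to the hour."""
    counts = Counter()
    for text in timestamps:
        moment = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
        counts[moment.replace(minute=0, second=0)] += 1
    return counts


hourly = reports_per_hour(timestamps)
busiest_hour, busiest_count = hourly.most_common(1)[0]
```

Because reports are recorded in 5-minute batches, the same approach would also work with 5-minute buckets; hourly buckets are used here only to keep the example short.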
Data Preparation
Report Data Transformation
The original dataset is not tidy enough for visualization. Therefore, we need to tidy it up first so that every record is an observation and every field is a variable.
Pivot data using Tableau Prep
In the original dataset, each type of damage has its own field. We therefore need to merge these fields into a single field named “damage” and map their values to another field named “severity”.
Issue: Should “shake_intensity” be treated as one type of damage here?
Trade-off:
Although “shake_intensity” is not a kind of infrastructure damage, keeping it as a separate field would add redundancy, because one “shake_intensity” value would have to be repeated across the multiple records that each report becomes in the tidy data. Likewise, if “shake_intensity” is missing for a report in the original data, that missing value would be repeated across those records as well.
Conclusion: Treat “shake_intensity” as a type of damage.
Figure 2.1 Tableau Prep Pivot
Figure 2.2 Data Before and After
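The pivot itself was done in Tableau Prep, but the same wide-to-long reshaping can be sketched in plain Python to show what the tidy data looks like. The field names “damage” and “severity” follow the text above; the sample report and the function name pivot_report are hypothetical. In this sketch a missing value is kept as a single row with a None severity.

```python
CATEGORIES = ["shake_intensity", "sewer_and_water", "power",
              "roads_and_bridges", "medical", "buildings"]


def pivot_report(report):
    """Turn one wide report into one long row per category."""
    return [
        {
            "time": report["time"],
            "location": report["location"],
            "damage": name,            # shake_intensity is treated as a damage type
            "severity": report[name],  # may be None when the value is missing
        }
        for name in CATEGORIES
    ]


# Hypothetical wide record with one missing value.
wide = {
    "time": "2020-04-08 08:35:00", "location": 3,
    "shake_intensity": 6, "sewer_and_water": 7, "power": None,
    "roads_and_bridges": 5, "medical": 4, "buildings": 8,
}
long_rows = pivot_report(wide)
```

Each wide report becomes six long rows, and a missing “shake_intensity” affects only its own row instead of being copied onto the other five.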
Map Setup
Now that the report data is ready for use, we need a map to serve as the base of the visualization. To place the neighborhood shapes on the blank background map, we need to break each shape down into its border points and position those points on the background. To achieve that, two generated files are needed:
Figure 2.3 Shape Details
In Tableau, we import these two data sets and join them on "StHimark_ID". We then set "Marks" to "Polygon", put "Point Order" on "Path", put "StHimark_ID" and "Sub Polygon_Id" on "Detail", and finally add "longitude" and "latitude". The result is a map like this:
Figure 2.4 Map Setup
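What the join and the "Path" ordering do can be sketched outside Tableau as follows. The contents of the two generated files are not listed in the text, so this assumes one table of neighborhood attributes keyed by "StHimark_ID" and one table of border points; the sample values, the placeholder names and the function name build_polygons are all hypothetical.

```python
from collections import defaultdict

# Hypothetical attribute table keyed by StHimark_ID.
neighborhoods = {1: "Neighborhood A", 2: "Neighborhood B"}

# Hypothetical border points, deliberately out of order.
points = [
    {"StHimark_ID": 1, "Sub Polygon_Id": 0, "Point Order": 2, "longitude": 1.0, "latitude": 1.0},
    {"StHimark_ID": 1, "Sub Polygon_Id": 0, "Point Order": 1, "longitude": 0.0, "latitude": 0.0},
    {"StHimark_ID": 1, "Sub Polygon_Id": 0, "Point Order": 3, "longitude": 0.0, "latitude": 1.0},
    {"StHimark_ID": 9, "Sub Polygon_Id": 0, "Point Order": 1, "longitude": 5.0, "latitude": 5.0},
]


def build_polygons(neighborhoods, points):
    """Join points to neighborhoods and order each outline by Point Order."""
    grouped = defaultdict(list)
    for point in points:
        if point["StHimark_ID"] in neighborhoods:  # inner join on StHimark_ID
            key = (point["StHimark_ID"], point["Sub Polygon_Id"])
            grouped[key].append(point)
    polygons = {}
    for key, members in grouped.items():
        members.sort(key=lambda p: p["Point Order"])  # same role as "Path"
        polygons[key] = [(p["longitude"], p["latitude"]) for p in members]
    return polygons


polygons = build_polygons(neighborhoods, points)
```

Grouping by both "StHimark_ID" and "Sub Polygon_Id" mirrors placing both fields on "Detail": each group is drawn as its own closed shape, with the vertices connected in "Point Order".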