IS428 AY2019-20T1 Assign Pham Anh Kim Chau

From Visual Analytics for Business Intelligence
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Capture-MC2-2019-Header.png   VAST Challenge MC2 - Citizen Science to the Rescue

Problem Statement

One of St. Himark’s largest employers is the Always Safe nuclear power plant. The pride of the city, it produces power for St. Himark’s needs and exports the excess to the mainland providing a steady revenue stream. However, the plant was not compliant with international standards when it was constructed and is now aging. As part of its outreach to the broader community, Always Safe agreed to provide funding for a set of carefully calibrated professional radiation monitors at fixed locations throughout the city. Additionally, a group of citizen scientists led by the members of the Himark Science Society started an education initiative to build and deploy lower cost homemade sensors, which people can attach to their cars. The sensors upload data to the web by connecting through the user’s cell phone. The goal of the project was to engage the community and demonstrate that the nuclear plant’s operations were not significantly changing the region’s natural background levels of radiation.

When an earthquake strikes St. Himark, the nuclear power plant suffers damage resulting in a leak of radioactive contamination. Further, a coolant leak sprayed employees’ cars and contaminated them at varying levels. Now, the city’s government and emergency management officials are trying to understand if there is a risk to the public while also responding to other emerging crises related to the earthquake as well as satisfying the public’s concern over radiation

Data Provided

  • St. Himark shapefile
  • StaticSensorReadings.csv
    • 4 columns: Timestamp, Sensor-id, Value, Units
    • Import to Python as "value" dataframe
  • StaticSensorLocations.csv
    • 3 columns: Sensor-id, Lat, Long
    • Import to Python as "location" dataframe
  • MobileSensorReadings.csv
    • 7 columns: Timestamp, Sensor-id, Long, Lat, Value, Units, User-id
    • Import to Python as "mobile" dataframe

Data Analysis and Preparation

Aggregate the data

  • Problem: The 2 datasets is recorded every 5 seconds which lead to the enormous amount of data points. It is impossible to merge the 2 datasets (mobile & static) and store the data back to Excel
  • Solution: Aggregate the data from every 5 seconds to every 5 minutes with an underlying assumption - The mobile sensors would not be able to move so far within 5 minutes. The radiation measure would be the average value of all data points within 5 minutes time to prevent the distortion of data (The radiation would be higher if there is no missing data in the period but it would not mean that there is a spike in radiation at the point in time)
  • Tool and Packages: Python (Numpy, Pandas, Datetime)
  • Steps applied to both Static and Mobile data
Step Code
Create a new list of data & time ranges within
the start time and end time of the data set

Real start date & time
d1 = '2020-04-06 00:00:00'
d1= pd.to_datetime(d1)
Dummy end date & time
d2 = '2020-04-11 00:00:00'
d2= pd.to_datetime(d2)
Real end date & time
d3 = '2020-04-10 23:59:45'
d3= pd.to_datetime(d3)
Create a date range between start date and end date with 5 minutes interval
dd = pd.date_range(d1, d2, freq='5min')

Create a list that contains the IDs of all static sensors

Create a list that contains all unique values in column "Sensor-id"
id_static = value['Sensor-id'].unique().tolist()
id_static.sort()

Create an list that contains the IDs of all mobile sensors

Create a list that contains all unique values in column "Sensor-id"
id_mobile = mobile['Sensor-id'].unique().tolist()
id_mobile.sort()

  • Steps to aggregate static data:

1/ Create a new list called "static_data"
2/ Loop through each index in the date list and each item in the static id list, check the requirements and carry out the steps depend on the scenario

Scenario Step Code
Scenario 1: When date-time is the start date-time (There will be 1 value for each sensor) Assign value to the variables time = str(dd[d])

ss_id = i
Search in the column "Value" to find all items that have the same timestamp and sensor id as indicated. Because there is only 1 value without any missing data, num will be a list with only 1 value. Take the first value in that list
num = value.loc[(value['Timestamp'] == dd[d]) & (value['Sensor-id'] == i), 'Value'].iloc[0]

Create a new list called "row" that contains the newly created variables row = [time,ss_id,num]
Append "row" to the "static_data" list static_data.append(row)
Scenario 2: When date-time is the end date-time (There will be multiple values for each sensor) Assign value to the variables time = str(d3)

ss_id = i
Search in the column "Value" to find all items that have the timestamp latter than "2020-04-10 23:55:00" and the same sensor id as indicated.
value_list = value.loc[(value['Timestamp'] >= dd[d-2]) & (value['Sensor-id'] == i), 'Value']
Take the average of all value inside the list
num = value_list.sum()/len(value_list)

Create a new list called "row" that contains the newly created variables row = [time,ss_id,num]
Append "row" to the "static_data" list static_data.append(row)
Scenario 3: All the other date-time in between (There will be multiple values for each sensor) Assign value to the variables time = str(dd[d])

ss_id = i
Search in the column "Value" to find all items that have the timestamp between the previous date-time index and this date-time index and the same sensor id as indicated.
value_list = value.loc[(value['Timestamp'] > dd[d-1]) & (value['Timestamp'] <= dd[d]) & (value['Sensor-id'] == i), 'Value']
Take the average of all value inside the list
num = value_list.sum()/len(value_list)

Create a new list called "row" that contains the newly created variables row = [time,ss_id,num]
Append "row" to the "static_data" list static_data.append(row)

3/ Create a new dataframe with the data from static_data list. The new dataframe will have 3 columns: 'Timestamp','Sensor ID','Value' 4/ Vertically merge the new dataframe with the static sensors' location by matching the Sensor ID
5/ Delete the duplicated "Sensor-id" column, which was the result of the vertical merge above
6/ Add in two new columns, User-id and Type of sensor to match the mobile sensors' data

Static.png
  • Steps to aggregate mobile data

1/ Create a new dataframe,mobile_sensor, by copying the "mobile" dataframe and keep only the "Sensor-id" and "User-id" columns. Remove the duplicated values within the new dataframe so we can match the "User-id" column later
2/ Create a new list called "mobile_data"
3/ Loop through each index in the date list and each item in the mobile id list, check the requirements and carry out the steps depend on the scenario

Scenario Step Code
Scenario 1: When date-time is the start date-time (There will be 1 value for each sensor) Assign value to the variables time = str(dd[d])

ss_id = i
Find all items that have the same timestamp and sensor id as indicated to retrieve the "Value", "Long" and "Lat" Because there is only 1 value without any missing data, each of this function will be a list with only 1 value. Take the first value in those lists
long = mobile.loc[(mobile['Timestamp'] == dd[d]) & (mobile['Sensor-id'] == i), 'Long'].iloc[0]
lat = mobile.loc[(mobile['Timestamp'] == dd[d]) & (mobile['Sensor-id'] == i), 'Lat'].iloc[0]
num = mobile.loc[(mobile['Timestamp'] == dd[d]) & (mobile['Sensor-id'] == i), 'Value'].iloc[0]

Create a new list called "row" that contains the newly created variables row = [time,ss_id,num,lat,long]
Append "row" to the "static_data" list mobile_data.append(row)
Scenario 2: When date-time is the end date-time (There will be multiple values for each sensor) Assign value to the variables time = str(d3)

ss_id = i
Find all items that have the timestamp latter than "2020-04-10 23:55:00" and the same sensor id as indicated to retrieve the "Value", "Long" and "Lat"
Because there are multiple values, take the average of all the values in the list
value_list = mobile.loc[(mobile['Timestamp'] > dd[d-2]) & (mobile['Sensor-id'] == i), 'Value']
num = value_list.sum()/len(value_list)
For longtitude and latitude value, because the result of the function is a list, there might be chance that it is an empty list. As the function recorded the search results from top to bottom in the list, the latest coordinate will be at the end of the list. Therefore, we will check whether the list is empty. If it is not, take the last item in the list as coordinate, else leave it blank
long = mobile.loc[(mobile['Timestamp'] > dd[d-2]) & (mobile['Sensor-id'] == i), 'Long']
long = long.iloc[-1] if len(long) >= 1 else np.nan
lat = mobile.loc[(mobile['Timestamp'] > dd[d-2]) & (mobile['Sensor-id'] == i), 'Lat']
lat = lat.iloc[-1] if len(lat) >= 1 else np.nan

Create a new list called "row" that contains the newly created variables row = [time,ss_id,num,lat,long]
Append "row" to the "static_data" list mobile_data.append(row)
Scenario 3: All the other date-time in between (There will be multiple values for each sensor) Assign value to the variables time = str(dd[d])

ss_id = i
Search in the column "Value" to find all items that have the timestamp between the previous date-time index and this date-time index and the same sensor id as indicated.
Because there are multiple values, take the average of all the values in the list value_list = mobile.loc[(mobile['Timestamp'] > dd[d-1]) & (mobile['Timestamp'] <= dd[d]) & (mobile['Sensor-id'] == i), 'Value']
num = value_list.sum()/len(value_list)
For longtitude and latitude value, because the result of the function is a list, there might be chance that it is an empty list. As the function recorded the search results from top to bottom in the list, the latest coordinate will be at the end of the list. Therefore, we will check whether the list is empty. If it is not, take the last item in the list as coordinate, else leave it blank
long = mobile.loc[(mobile['Timestamp'] > dd[d-1]) & (mobile['Timestamp'] <= dd[d]) & (mobile['Sensor-id'] == i), 'Long']
long = long.iloc[-1] if len(long) >= 1 else np.nan
lat = mobile.loc[(mobile['Timestamp'] > dd[d-1]) & (mobile['Timestamp'] <= dd[d]) & (mobile['Sensor-id'] == i), 'Lat']
lat = lat.iloc[-1] if len(lat) >= 1 else np.nan

Create a new list called "row" that contains the newly created variables row = [time,ss_id,num,lat,long]
Append "row" to the "static_data" list mobile_data.append(row)

4/ Create a new dataframe,mobile_agg, with the data in mobile_data. The new dataframe will have 4 columns: 'Timestamp','Sensor ID','Value','Lat','Long'
5/ Vertical merge the new dataframe with the mobile_sensor dataframe based on Sensor ID to get the User ID
6/ Delete the duplicated "Sensor-id" column, which was the result of the vertical merge above
7/ Add in 1 new column,Type of sensor to differentiate with the static sensors

Mobile.png

Merge mobile data and static data

  • Problem: Currently, the data for the static sensors and mobile sensors are in 2 dataframes. This makes the analysis process difficult with the need to blend or join data in Tableau.
  • Solution: Use Python to concatenate the 2 dataframes into 1 and write the data back to csv
  • Tool and Packages: Python (Pandas)
  • Detailed Steps:
    • Change the name of "User-id" column to match that of static data (there is a space in front of the words)
    • Merge the static and mobile data into 1 dataframe
    • Create a new "Sensor-id" column
      • If the sensor is static sensor, the sensor id will be "Static - " and the ID (e.g. Static - 1)
      • If the sensor is mobile sensor, the sensor id will be "Mobile - " and the ID (e.g. Mobile - 1)
    • Remove the old "Sensor ID" column
    • Write the new data back to csv file
Merge.png

Add neighborhood information to the data set

  • Problem: There is only longtitude and latitude of the sensors in the data. There is no way to know which neighborhood a sensor is in unless manually compared against the St.Himark map
  • Solution: Based on the provided longtitude and latitude of the map, map the points against the map to retrieve the neighborhood information
  • Tool and Packages: R (tidyverse, sf)
  • Detail steps
    • Load the value data set
    • Load th St.Himark shapefile
    • Change the longtitude and latitude of the points to the correct format and property
    • Combine the 2 dataframes based on the longtitudes and latitudes
    • Write the new value file with the neighborhood information to csv
R.png

Dashboard Design

Dashboard Link: https://public.tableau.com/views/Individualassignment-MC2/Story1?:embed=y&:display_count=yes&publish=yes&:origin=viz_share_link In case there is any distortion to the display on Tableau Public, please choose to switch to "Desktop view" to get the best layout

  • There are 5 pages in the Tableau story:
  1. Homepage: Problem Statement
  2. Overall situation: The radiation situation in St.Himark within the 6-10 April 2020
  3. Neighborhood: Average radiation level in 1 hour by neighborhood, recorded by static and mobile sensors separately
  4. Sensor ID: Average radiation level in 1 hour recorded by each sensor
  5. Contaminated cars: The flow map of contaminated cars, which recorded the radiation level higher than 75 cpm, to see the routes and potential radiation spread caused by them
  • Navigation:
Navigation button Description
Navigate.png Show the current page you are at & move forward/backward
Table of content.png
Go back to the Homepage
Left-chevron.png
Go to the previous page
Right-chevron.png
Go to the next page

Interactive Visualization

Color Legends

Gammarscout-and-guide.jpg

According to the some sources, the normal radiation level is around 25 to 75 while any measurement that is 100 and above is dangerous. As a result, I set the property of the color legend like the picture below. This is also to prevent a few extreme outliers in the dataset

Colorlegends.png

Interactive Visualization

Throughout all of the dashboards, there are different visualization and filters used to provide additional information to viewers depend on their own interests. For details, please refer to the table below:

Interactive features Rationale Brief implementation steps
Type of sensor filter
Typeofsensor.png
To provide better insight regarding the value and location of sensors
  • Drag "Type.of.sensor" to Filters
  • Right click to select "Show Filter"
  • Change the Filter Type into "Single Value (dropdown)"
Neighborhood filter
Neighborhodo.png
To provide more insight regarding the area of interest
  • Drag "Neighborhood" to Filters
  • Right click to select "Show Filter"
  • Change the Filter Type into "Multiple Value (list)"
"Go through..." filter
Filter.png
To narrow the the cars that have gone through potential contaminated location
  • Create a calculated field "Safe Town" with the formula:
Safetown.png
  • Create a calculated field "Go Through Safe Town" with the formula:
Gothrsafetown.png
  • Drag the "Go Through Safe Town" to filter
  • Right click to select "Show Filter"
  • Change the Filter Type to "Single Value (list)"
Contaminated filter (hidden) To narrow down to potential contaminated vehicles *Create a calculated field "Higher than 75" with the formula:
Higherthan75.png
  • Create a calculated field "Value higher than 75" with the formula:
Contaminated filter.png
  • Drag the "Value higher than 75" to filter and choose "Yes"

Overall situation

Potential contaminated area on map

This map shows all radiation value recorded over time as well as the location of potential contamination. Each value would either a star or a circle depends it was recorded by which type of sensor. Moreover, based on the size and the color, we will know whether the value is in acceptable or dangerous range
The color opacity of the point is also reduced to 75% to sow how frequent there is a mobile sensor cross by an area. The darker the area, the more mobile sensors passed through and lead to higher chance of radiation contamination.

Map.png

Overall calendar heatmap

The calendar heatmap provide the average value provided by all sensors in 1 hour at each neighborhood. It would allow viewers to quickly identify area of problem and at what time. Moreover, this type of visualization would minimize the the effect of the extreme outliers

Calendar heatmap.png

Neighborhood

This calendar heatmap show the radiation value at each neighborhood but these value are separated based on which type of sensors captured the value. This would help viewers to identify the difference in value captured by different type of sensors at the same neighborhood at the same point of time.

Neighborhood.png

Sensor ID

This visualization help to show the records of each sensor within the period. In case there is any anomalies with the type of sensors, any individual sensors, this heatmap can help to identify the pattern

Id.png

Contaminated cars

This map helps to identify the contaminated cars and their routes within the time period. From here, we can drill down further to analyze why some areas might get contaminated without being closed to the nuclear power plant. In the future, it can expand to have the filter check for all of the neighborhoods (filter for the mobile sensor that go through the particular neighborhood) to cater for all scenario.

Flow map.png

Observation & Insight

Question 1

Visualize radiation measurements over time from both static and mobile sensors to identify areas where radiation over background is detected. Characterize changes over time.

From the calendar heatmap below, we can clearly see that there are 5 neighborhoods contaminated with radiation at unhealthy level: Old Town, Safe Town, Scenic Vista, Wilson Forest and unidentified neighborhood (NA). Among these neighborhoods, Wilson Forest is the worst of all with extremely high level of radiation by the end of 9th till the beginning of 10th April

Calendar heatmap by neighborhood.png

For the unidentified neighborhood, when we compare the measurement with the map of St. Himark, the "NA" location are at highways/ bridges out of St.Himark. That is why there is information about the neighborhood. On the other hand, when we check the location of Wilson Forest and Scenic Vista, the distance is actually quite close so there might be the chance that the radiation is spread from one area to the other

Radiation spot.png

Question 2

Use visual analytics to represent and analyze uncertainty in the measurement of radiation across the city

  • Compare uncertainty of the static sensors to the mobile sensors. What anomalies can you see? Are there sensors that are too uncertain to trust?

In the picture below, there are the calendar heatmaps based on the values captured by static sensors and mobile sensors separately. For Calendar heatmap for mobile sensors, the filter is set to show only the neighborhoods that have static sensors. Despite displaying the radiation level at the regions at the same time frame with the same color legends setting, there are big differences between these 2 visualizations. The mobile sensors could pick up a few concerning radiation signal while according to static sensors, the radiation level is safe all the time. On the other hand, each type of sensors have its own uncertainty due to its property:

    • Static sensor: can only measure the radiation within certain diameter. Hence, it is hard to cover the whole neighborhood
    • Mobile sensor: can only measure the radiation at the region it went through so there is a risk of spreading the contamination around. Even if there is a high value recorded by the sensor, we do not know if it is the region got contaminated or it is the vehicle itself. Moreover, there is no way to measure the radiation situation in the area over time.
Calendar heatmap- tatic & mobile.png

Comparing the locations of static sensors against the radiation hotspots detected by mobile sensors, there are 2 points where the static sensors are quite near to the radiation zones but they did not pick up anything the whole time. From this point, we can concluded that one type of sensors might provide incorrect value

Capture.png
  • Which regions of the city have greater uncertainty of radiation measurement? Use visual analytics to explain your rationale.

According to this visualization below, Old Town and Safe Town are the 2 neighborhood with the greater uncertainty of radiation measurement

Calendar heatmap by neighborhood (2).png
  • What effects do you see in the sensor readings after the earthquake and other major events? What effect do these events have on uncertainty?

There is a sudden surge in radiation measurement at Safe Town around 1pm on the 8th April so I assume that this is the time of earthquake/ major events that led to the leak of radiation. The radiation spiked for 3-4 hours then returned to normal at around 5pm. This measurement was picked up by the mobile sensor(s) so it is hard to know for sure whether the location was back to normal or the contaminated car(s) moved away.

Calendar heatmap by neighborhood.png

Question 3

Given the uncertainty you observed in question 2, are the radiation measurements reliable enough to locate areas of concern?

  • Highlight potential locations of contamination, including the locations of contaminated cars. Should St. Himark officials be worried about contaminated cars moving around the city?

From the visualization on the left, we can identify that there are 5 potential locations of contamination: Jade Bridge, Old Town, Safe Town, Scenic Vista and Wilson Forest. In the visualization on the right side, there are flow map and a list of 17 mobile sensors had recorded any value that higher than 75 cpm at any point in time. This means 34% of the mobile sensors had came in contact with the radiation and moved around town. Moreover, considering the distance of Scenic Vista and Wilson Forest to the main source of radiation, Safe Town, we can concluded that some of vehicles had spread the radiation.

Radiation location n car.PNG
  • Estimate how many cars may have been contaminated when coolant leaked from the Always Safe plant. Use visual analysis of radiation measurements to determine if any have left the area.

From the flow map, we can see that there are 11 out of 17 contaminated cars have gone through Safe Town. All of these vehicles then left Safe Town and moved across different neighborhood

Flow map of mobile sensors - Safe Town.png

Out of those 11 vehicles, 7 of them went through Old Town and may have spread the contamination there because there are no contamination cars went through Old Town only.

Flow map of mobile sensors - Safe Town & Old Town.png

3 out of the 11 contaminated cars from Safe Town traveled to Wilson Forest and contaminated the area

Flow map of mobile sensors - Safe Town & Wilson Forest.png

5 out of of 17 cars did not pass through Safe Town but they came in contact with the radiation at Wilson Forest and then spread the radiation to Scenic Vista, which lead to the contamination of mobile sensor 20

Flow map of mobile sensors - Wilson Forest.png
Flow map of mobile sensors - Scenic Vista.png
  • Indicated where you would deploy more sensors to improve radiation monitoring in the city. Would you recommend more static sensors or more mobile sensors or both? Use your visualization of radiation measurement uncertainty to justify your recommendation.

From all of the visualizations, I believe there is a need to place additional static and mobile sensors across the town. For static sensors, there should be more at the southern part of St.Himark as there are a lot of regions that are not recorded by the static sensors due to the distance. Moreover, we can reduce the risk of missing data or radiation spreading with static sensors. On the other hand, there is also a need to additional mobile sensors as they are more dynamic. Even within a neighborhood, 1-2 static sensors are not enough due to neighborhood size so mobile sensors would be a better choice in term of economical choice.

Calendar heatmap by neighborhood, type of sensors.png
Radiation spot-static.png

Question 4

Summarize the state of radiation measurements at the end of the available period. Use your novel visualizations and analysis approaches to suggest a course of action for the city. Use visual analytics to compare the static sensor network to the mobile sensor network. What are the strengths and weaknesses of each approach? How do they support each other?
At the end of radiation measurement, all sensors reported the radiation measurement within the safety level. In the future, there is a need to increase the number of sensors across the city, both static and mobile sensors for the reasons mentioned above in Question 3. The static sensors would be more stable with less missing data and more detailed information about radiation measurement in the area while mobile sensors would be more dynamic and can cover more ground.
There should be at least 2-3 static sensors in a neighborhood depend of the area size to measure the radiation situation of the area. Moreover, there should be 1 static sensors at every entrance to St. Himark island to check the radiation level. If the bridge/ highway have high radiation, the city council can block that way to avoid the spreading the radiation to other towns.
For mobile sensors, there should be an system to send text messages to the mobile sensors when there is a sudden spike in radiation recorded so they can avoid moving around while carrying the risk of radiation exposure. These car owners can also carry out protection measures to clean their cars and reduce the risk for themselves as well.

Calendar heatmap by neighborhood.png

Question 5

The data for this challenge can be analyzed either as a static collection or as a dynamic stream of data, as it would occur in a real emergency. Describe how you analyzed the data - as a static collection or a stream. How do you think this choice affected your analysis?
I analyzed the data as a static collection because the I used Python and R to aggregate the data and add in neighborhood information to provide a more accurate and detailed data for the analysis. This approach can reduce the processing time of the visualization as well as provide more insight but it would have lead time in case of real emergency. However, in reality, it is possible to develop a closed process to automatically retrieve and process the information before feeding to the dashboard. Although there would be lead time but much shorter.

References