ISSS608 2018-19 T1 Assign Stanley Alexander Dion Task 2

From Visual Analytics and Applications


  Help'em Breathe in Sofia
      A Visual Analytics Case Study



Sensor Distribution and Activities

To see our visualisation dashboard running on Tableau Public, please visit the following link: Sensor Functionality Analysis

Sensor Coverage

The density map of sensors in Sofia

The citizen air quality data include sensors located in many cities across Bulgaria; however, we focus on visualising the sensors located in Sofia. Using our gridded topography data, we filter out the sensors that fall outside the minimum and maximum lng and lat of the grid. The sensors are not evenly distributed across the city: density is higher in the northern part, so there is more overlap in pollution sensing in the north than in the south.
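The bounding-box filter described above can be sketched as follows. This is a minimal illustration only: the field names `lat` and `lng`, the helper names, and all coordinates are assumptions made up for the example, not the actual dataset schema or workflow.

```python
def bounding_box(grid_points):
    """Return (min_lat, max_lat, min_lng, max_lng) of the topography grid."""
    lats = [p["lat"] for p in grid_points]
    lngs = [p["lng"] for p in grid_points]
    return min(lats), max(lats), min(lngs), max(lngs)


def sensors_inside(sensors, box):
    """Keep only the sensors whose coordinates fall inside the box (inclusive)."""
    min_lat, max_lat, min_lng, max_lng = box
    return [
        s for s in sensors
        if min_lat <= s["lat"] <= max_lat and min_lng <= s["lng"] <= max_lng
    ]


# Made-up coordinates, for illustration only.
grid = [{"lat": 42.62, "lng": 23.23}, {"lat": 42.73, "lng": 23.42}]
sensors = [
    {"id": "a", "lat": 42.70, "lng": 23.30},  # inside the grid extent
    {"id": "b", "lat": 43.20, "lng": 27.90},  # far outside the grid extent
]
inside = sensors_inside(sensors, bounding_box(grid))
```

A rectangular box is the simplest possible filter; it keeps any sensor within the grid extent even if it lies just outside the city boundary.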

Sensor Report Frequency

The plot on the right examines whether every sensor reports routinely and consistently every hour. Interestingly, when we count the number of readings per sensor throughout the observation period, we notice that many sensors report more than once per hour. These peculiar sensors are more common in the eastern part of Sofia than in the west. Looking at the number of readings received each day, we see that activity varies between sensors, meaning that some sensors do not report routinely. Again, there are many more active sensors in the eastern region than in the west. Most of the peculiar sensors that report more than once an hour also appear to be more active than the sensors with a normal reporting frequency.
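One way to detect sensors that report more than once per hour is to count readings per sensor and hour. The sketch below assumes readings arrive as `(sensor_id, timestamp)` pairs; the function names and the sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime


def readings_per_hour(readings):
    """Count readings for every (sensor_id, hour) pair."""
    counts = Counter()
    for sensor_id, ts in readings:
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[(sensor_id, hour)] += 1
    return counts


def over_reporting_sensors(readings):
    """Return the ids of sensors with more than one reading in any hour."""
    counts = readings_per_hour(readings)
    return {sensor_id for (sensor_id, _), n in counts.items() if n > 1}


# Made-up sample: s1 reports twice within the same hour, s2 once per hour.
readings = [
    ("s1", datetime(2018, 3, 1, 10, 5)),
    ("s1", datetime(2018, 3, 1, 10, 40)),
    ("s2", datetime(2018, 3, 1, 10, 15)),
    ("s2", datetime(2018, 3, 1, 11, 15)),
]
flagged = over_reporting_sensors(readings)
```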

Reporting count of the year

Nonactive Sensors

Calendar View for Proportion of Missing Sensor

To view the proportion of sensors missing each day, we visualise the count of missing readings over the standard total number of reports per day. The count of missing readings gradually decreases over the observation period. Because the baseline is the total number of sensors ever recorded in our dataset, the decrease in non-active sensors may be due to the increasing number of sensors installed by the citizen science community. We can easily spot several periods, such as the end of March, the beginning of April, and the beginning of July, with suspicious sudden spikes of missing sensors.
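The daily missing proportion can be approximated as the share of all sensors ever recorded that sent no reading on a given day. The sketch below follows that interpretation, which is an assumption about the calculation rather than the exact Tableau formula; the names and data are illustrative.

```python
from collections import defaultdict
from datetime import date


def missing_proportion_by_day(readings):
    """Map each day to the share of known sensors with no reading that day."""
    all_sensors = {sensor_id for sensor_id, _ in readings}
    active = defaultdict(set)
    for sensor_id, day in readings:
        active[day].add(sensor_id)
    total = len(all_sensors)
    return {
        day: (total - len(ids)) / total
        for day, ids in sorted(active.items())
    }


# Made-up sample with four sensors over three days.
readings = [
    ("s1", date(2018, 3, 30)), ("s2", date(2018, 3, 30)),
    ("s1", date(2018, 3, 31)),
    ("s1", date(2018, 4, 1)), ("s2", date(2018, 4, 1)),
    ("s3", date(2018, 4, 1)), ("s4", date(2018, 4, 1)),
]
proportions = missing_proportion_by_day(readings)
```

Because the denominator is fixed at every sensor ever seen, days before a sensor was installed count it as missing, which is consistent with the gradual decline noted above. Days with no readings at all do not appear in the result and would need to be filled in separately.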

Missing and Reading

Density Graph on Missing Percentage and PM10 Reading

If we plot the PM10 reading concentration against the proportion of missing sensors, we observe that most readings occur when about 50% of the sensors are missing, and that these readings are lower than 50 µg/m³, which is still below the safety level. Strangely, the readings reach their highest when about 70% of the sensors are missing. The readings return to the healthy range as the missing proportion approaches 100%. A smaller density of observations with normal air quality is also visible at around 90% missing.


A Prelude to Sensor Readings

Correlations between the P1 & P2 of Citizen and Government Official Data

The plot is built from the citizen data, with each sensor tagged with its nearest official government station. The aim is to find out whether the concentration measurements differ between the official and citizen data, and how the PM10 and PM2.5 measurements differ within the regions defined by the government stations.
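Tagging each sensor with its nearest station can be done with a great-circle distance. The sketch below uses the haversine formula as one reasonable choice; the source does not state which distance measure was used, and the station coordinates are placeholders, not the real station positions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two lat/lng points."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_station(lat, lng, stations):
    """Return the name of the station closest to the given point."""
    return min(
        stations,
        key=lambda name: haversine_km(lat, lng, *stations[name]),
    )


# Placeholder coordinates, for illustration only.
stations = {"Mladost": (42.655, 23.383), "Nadezhda": (42.732, 23.310)}
tag = nearest_station(42.66, 23.37, stations)
```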

The plot shows an acceptable correlation between the PM10 readings of the citizen science data and the official government data. The most correlated regions are Mladost and Nadezhda, both reaching an R-squared value of around 0.52. On the other hand, the least correlated region is detected in Nadezhda, where the R-squared is only 0.45. Judging from the graph, the official government measurements tend to exceed the citizen science sensor readings.

Nevertheless, the PM10 and PM2.5 measurements are consistently highly correlated within the data. This leads us to treat the PM10 and PM2.5 measurements as interchangeable.
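For a simple linear fit between two series, the R-squared value equals the squared Pearson correlation. The sketch below computes it in plain Python; the sample numbers are made up and are not taken from the Sofia data.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


# Made-up paired readings: citizen sensor vs. nearest official station.
citizen_pm10 = [12.0, 25.0, 31.0, 48.0, 60.0]
official_pm10 = [20.0, 24.0, 45.0, 50.0, 81.0]
fit_quality = r_squared(citizen_pm10, official_pm10)
```

The same function applies to the PM10 versus PM2.5 comparison by passing the two pollutant series of the same sensors.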

Getting Clustered

Here is the link to the full storyboard on Tableau: Sensor Reading Story Board

Cluster Overview

As introduced in the data preparation stage, we employed a learning algorithm to find similar patterns in our sensor readings and group them into clusters. Before analysing the readings, we look at the coverage of each cluster across the city. The three clusters formed cover different parts of the city:

  • Cluster 1 sensors are denser in the eastern side of the city
  • Cluster 2 covers the northern part of the city
  • Cluster 3 covers the western side of the city
Clusters Coverage

Cluster Time-Series

Clustering Time-Series

Plotting the time-series of all sensors, coloured by cluster, we see that the three clusters have different characteristics. Cluster 1 captures all sensors with extremely high readings, with concentrations reaching 2000. As such readings are implausible (40 times the allowed level), we consider these sensors to have broken down. A breakdown happened once in December 2017, and breakdowns became more frequent from May 2018 onwards. Clusters 2 and 3 have less fluctuating readings than Cluster 1. While Cluster 2 shows a small peak in June and July 2018, Cluster 3 has relatively stable readings once the winter period ends (from March onwards).

Cluster Distribution

Spatio-Temporal Analysis Graph

To understand which parts of the city have higher pollution concentrations, we created a dashboard containing two linked graphs. A selection in the dot plot at the top selects the corresponding sensors on the second graph, a map.

To start with, we use the linked graph to select the sensors with extreme pollution observations. The sensors on the map with extreme PM10 concentrations are highlighted at the same time. The selected sensors all come from Cluster 1 and are located in the middle of the city. Furthermore, if we highlight the sensors with hazardous concentrations, we see that these readings come mainly from Cluster 3 sensors, with a few from Cluster 2. Those sensors are located towards the north of the city: half are in the north-east and the others are in the north-west. From the timestamps of the selected points in the dot plot, we can roughly see that most of these northern sensors detected high pollution around December 2017 and June 2018.

Spiral Plot

Animated Spiral Plot

Taking a deeper look at the time element, we can identify two different hourly patterns for different seasons of the year: pollution in the winter period always peaks in the evening and at dawn, whereas in March we see pollution in the morning and the afternoon.

Temporal Pollution Monitor

Grid of Colour-Coded Air Quality

To find out whether high-pollution events are also time-dependent, we use the semi-calendar plots above. The plot covers all sensors in Sofia and shows the average PM10 concentration for every hour of the day across each month.
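The grid behind such a semi-calendar plot is an average per month and hour of day. A minimal sketch of that aggregation is shown below, assuming readings arrive as `(timestamp, pm10)` pairs; the function name and sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def hourly_mean_by_month(readings):
    """Average PM10 for every (year, month, hour-of-day) cell."""
    cells = defaultdict(lambda: [0.0, 0])  # running [sum, count] per cell
    for ts, pm10 in readings:
        cell = cells[(ts.year, ts.month, ts.hour)]
        cell[0] += pm10
        cell[1] += 1
    return {key: total / n for key, (total, n) in cells.items()}


# Made-up readings, for illustration only.
readings = [
    (datetime(2018, 1, 3, 6, 10), 40.0),
    (datetime(2018, 1, 9, 6, 45), 60.0),
    (datetime(2018, 1, 9, 19, 0), 90.0),
    (datetime(2018, 3, 2, 8, 30), 30.0),
]
grid = hourly_mean_by_month(readings)
```

Adding a cluster label to the key would give the per-cluster version of the same grid.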

When the semi-calendar plot is grouped by cluster, we can clearly see that the distinct peak times of the different sensor clusters discussed earlier translate to the different regions of Sofia.

  • For the eastern part of Sofia, dominated by Cluster 1, the majority of the sensors have a significant number of missing observations across the non-winter months. March is the only month in which most of the sensors from this cluster are active.
  • In the northern part of Sofia, described by Cluster 2 sensors, the pollution peak starts earlier than what is detected by Cluster 1 sensors. However, while some of them detect the pollution, the majority are not active during the winter period. Unlike the sensors in the eastern region, the sensors in the northern region detect above-normal pollution at dawn and in the evening in March. Most of these sensors are active only during the summer period and detect good conditions.
  • Cluster 3 represents most of the sensors in the western part of Sofia. This cluster contains more active sensors than the other clusters. It detects dawn and evening pollution starting from October, with readings spiking in January. Unlike the Cluster 2 sensors, these sensors detect additional dawn and evening peaks starting from February. Most of the sensors then become non-active in the summer period.