IS428 AY2018-19T1 Chua Sing Rue
Problem and Motivation
Task 1: Spatio-temporal Analysis of Official Air Quality
Data Preparation: EEA Dataset
Name of Air Quality Station | Local ID | Data available |
---|---|---|
| ||
| ||
| ||
| ||
| ||
|
Issue 1: The official air quality measurements in the EEA Data folder contain PM10 measurements from air quality stations in Sofia City for the years 2013 to 2018. However, the dataset is incomplete. Air quality station Mladost has data for 2018 only, and air quality station Orlov Most is missing data for 2016, 2017 and 2018.
Solution: For the purposes of comparison, air quality stations Mladost and Orlov Most will be excluded.
Issue 2: PM10 measurements are reported as an averaged daily value for the years 2013 to 2016, but reported as an averaged hourly value for the years 2017 to 2018.
Solution: When comparing across the periods 2013 to 2016 and 2017 to 2018, the hourly PM10 average values will be recalculated as a daily average in order to get a common basis for comparison. At the same time, averaged hourly PM10 values will still be used to drill down into within-day trends for the more recent information.
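The recalculation of hourly values into daily averages can be sketched as follows. This is a minimal illustration using pandas, not the exact procedure used for the analysis; the column names `time` and `pm10` and the sample readings are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical hourly readings for one station over two days.
# Column names and values are assumptions, not taken from the EEA files.
timestamps = pd.Timestamp("2018-01-01") + pd.to_timedelta(list(range(48)), unit="h")
hourly = pd.DataFrame({
    "time": timestamps,
    "pm10": [20.0] * 24 + [40.0] * 24,
})

def hourly_to_daily(df, time_col="time", value_col="pm10"):
    """Average hourly readings into one value per calendar day."""
    daily = df.set_index(time_col)[value_col].resample("D").mean()
    return daily.rename("pm10_daily_avg")

daily = hourly_to_daily(hourly)
print(daily)
```

The hourly frame is left untouched, so it can still be used for the within-day drill-down.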
Issue 3: In the 2017 data, PM10 values for 1 Jan 2017 to 26 Nov 2017 are missing. Additionally, 2018 data is recorded only up to 14 Sept 2018.
Solution: As neither year is complete, 2017 and 2018 data should not be used for yearly trends.
Overall trend in Daily Average PM10
1. Time series for Daily Average PM10, 2013 - 2016
2. Time series for Daily Average PM10, Nov 2017 - Sept 2018
Seasonal trend in Daily Average PM10
1. Calendar Heatmap
2. Cycle Plot
Intra-day trend in Hourly Average PM10
1. Calendar Heatmap: Within-day trend of Average Hourly PM10 by Month
Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements
Data preparation: Air Tube Dataset
Issue 1: The citizen science air quality measurements in the Air Tube Data folder are mapped to unique geohashes. However, Tableau does not currently support geohashes.
Solution: The geohashes must be decoded to obtain the latitude and longitude of these stations. Use the R package geohash to decode geohashes into latitude/longitude pairs.
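To show what the decoding step does, here is a minimal pure-Python sketch of the standard geohash scheme. It is a stand-in for illustration only and is not the API of the R geohash package named above; the function name `decode_geohash` is hypothetical.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def decode_geohash(geohash):
    """Return the (latitude, longitude) at the centre of a geohash cell."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # bits alternate between longitude and latitude, longitude first
    for char in geohash.lower():
        value = _BASE32.index(char)  # raises ValueError on an invalid character
        for shift in range(4, -1, -1):  # each character carries 5 bits
            bit = (value >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2

# "ezs42" is a commonly cited example geohash, located near 42.6 N, 5.6 W.
lat, lon = decode_geohash("ezs42")
print(lat, lon)
```

Each additional character narrows the cell, so the returned pair is the centre of a box rather than an exact sensor position.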
Issue 2: A latitude/longitude pair obtained through reverse geocoding is simply a reference point and, as such, does not represent an exact point location.
Solution: Techniques such as hexagon binning should be used to better represent the stations.
Issue 3: Due to the large number of records, data for 2017 and 2018 are stored in separate files. Additionally, it is not possible to explore the data using Excel as not all records will load.
Solution: Use the Python package pandas to merge the 2017 and 2018 CSV files into one.
Issue 4: After connecting the merged file to Tableau, it is clear that the available data extends beyond Sofia City. As such, the dataset must be filtered to include only information from Sofia City.
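A merge of this kind might look like the sketch below. The in-memory CSV strings stand in for the two yearly files, and the column names and the output file name are assumptions for illustration; the real files should be passed by path instead.

```python
import io
import pandas as pd

# Stand-ins for the 2017 and 2018 files; contents and columns are assumptions.
csv_2017 = io.StringIO("time,geohash,P1\n2017-12-31T23:00:00Z,ezs42,30\n")
csv_2018 = io.StringIO("time,geohash,P1\n2018-01-01T00:00:00Z,ezs42,25\n")

def merge_yearly_files(sources):
    """Read each CSV source and stack the rows into a single DataFrame."""
    frames = [pd.read_csv(src) for src in sources]
    return pd.concat(frames, ignore_index=True)

merged = merge_yearly_files([csv_2017, csv_2018])
print(merged)

# Writing the result back out (hypothetical file name):
# merged.to_csv("airtube_merged.csv", index=False)
```

`ignore_index=True` renumbers the rows so that the merged file does not carry duplicate index values from the two sources.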
Solution: Use the lasso tool in Tableau to select all data points within Sofia City, then include them as members in a set while excluding all other data points.
Sensor Coverage
As observed, there are fewer sensors near the city boundary and more sensors near the city centre. The sensors are therefore clustered near the city centre rather than evenly distributed across the entire city. This distribution pattern is similar across 2017 and 2018.
As such, readings from sensors may not be representative of Sofia City as a whole. Additionally, the clustering of sensors near the city centre may lead to an upward bias in pollution measures, especially since traffic is typically heavier closer to the city centre.
Sensor Performance
1. Time trend of total number of recorded measurements
From the above line graph, it can be observed that there is a general upward trend in terms of the total number of recorded air quality measurements. However, several notable points stand out, recording significant dips that go against the prevailing upward trend. These points are labelled in red.
These anomalies suggest that not all sensors work at all times. In other words, there are instances when some sensors fail.
2. Time trend of the rate of change in total number of recorded measurements
The visualisation of percentage change in total number of recorded measurements over time similarly reveals anomalies in the data. Generally, the rate of change in total number of recorded measurements from one day to the next is fairly constant and stable. However, several instances in the data show a drastic change from one interval to the next, with the most significant change being a 1087% increase in total number of recorded measurements in July 2018.
It is unlikely that these dramatic changes were caused by new sensors being installed. A more probable explanation is that the sensors were taken off the grid for maintenance, or that a power outage affected a cluster of sensors.
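The percentage change between consecutive days can be computed as in the sketch below. The daily counts are invented for illustration and are not taken from the Air Tube data; they are chosen so that a drop followed by a recovery from 23 to 273 records gives an increase of roughly 1087%, showing how a figure of that size can arise after a dip. The 50% threshold is likewise an arbitrary assumption.

```python
import pandas as pd

# Hypothetical daily totals of recorded measurements (not real data).
counts = pd.Series(
    [200, 210, 23, 273],
    index=pd.to_datetime(["2018-07-01", "2018-07-02", "2018-07-03", "2018-07-04"]),
    name="records",
)

# Percentage change from one day to the next: (today - yesterday) / yesterday * 100.
pct_change = counts.pct_change() * 100

# Flag days whose change exceeds an assumed threshold in either direction.
THRESHOLD = 50.0
flagged = pct_change[pct_change.abs() > THRESHOLD]
print(flagged.round(1))
```

Because the change is relative to the previous day, a recovery after a sharp dip appears as a very large percentage increase even when the count only returns to its usual level.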
Spatial Distribution of Citizen Science Air Quality Measures
Exploratory Data Analysis
Using a scatter plot matrix, pairwise relationships among the five measures can be easily observed, particularly any linear relationship or correlation between two of the measures.
From the scatter plot matrix, outlier records are easily observed, as labelled in the image above. These outlier points will be excluded from the analysis. Additionally, records with a measurement value of zero for all measures except temperature will also be excluded, as these may be due to issues with sensor performance instead of representing a true measure.
In order to effectively compare the readings across different parts of Sofia City, all five measures were converted to percentiles.
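The zero-value exclusion could be expressed as in the sketch below. It assumes the rule means that a record is dropped when every measure other than temperature is zero, and it assumes the column names `P1`, `P2`, `temperature`, `humidity` and `pressure`; both are assumptions for illustration and the sample rows are made up.

```python
import pandas as pd

# Assumed names for the non-temperature measures.
NON_TEMPERATURE = ["P1", "P2", "humidity", "pressure"]

def drop_all_zero_records(df, cols=NON_TEMPERATURE):
    """Drop rows in which every non-temperature measure equals zero."""
    all_zero = (df[cols] == 0).all(axis=1)
    return df.loc[~all_zero].reset_index(drop=True)

# Made-up sample records.
sample = pd.DataFrame({
    "P1":          [10, 0, 0],
    "P2":          [5, 0, 3],
    "temperature": [0, 12, 15],   # a zero temperature is a plausible reading
    "humidity":    [50, 0, 60],
    "pressure":    [950, 0, 948],
})

cleaned = drop_all_zero_records(sample)
print(cleaned)
```

Temperature is left out of the check because 0 is a legitimate temperature reading, whereas zero humidity or pressure is physically implausible.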
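One way to perform such a conversion is a percentile rank, sketched below with pandas. The source does not state which percentile method was used, so the ranking approach, the column names and the sample values are all assumptions.

```python
import pandas as pd

# Made-up readings; column names are assumed.
readings = pd.DataFrame({
    "P1": [30, 10, 40, 20],
    "P2": [8, 2, 6, 4],
})

def to_percentiles(df):
    """Replace each value with its percentile rank (0-100) within its column."""
    return (df.rank(pct=True) * 100).add_suffix("_pctile")

percentiles = to_percentiles(readings)
print(percentiles)
```

Putting every measure on the same 0 to 100 scale allows measures with different units to share one colour legend on a map.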