ISSS608 2018-19 T1 Assign Debbie Siah Mei Ping Data Prep
Sofia City - Air Quality Analysis
Data Preparation
As in any data discovery study, the raw data must be cleaned and prepared before analytical work begins. The data quality issues found and the preparation steps taken are documented below.
Task 1: Spatio-temporal Analysis of Official Air Quality
Air Quality Data Distributed
The air quality data available from the EEA covers 6 air quality stations and is distributed over 28 CSV files. The files were checked and found to share the same columns, so they were concatenated directly.
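The check-then-concatenate step described above can be sketched in Python as follows. This is only an illustration using the standard library, not the tooling actually used for this assignment; the function name `concat_csv_files` is an assumption.

```python
import csv


def concat_csv_files(paths):
    """Read CSV files that share one header and return (header, rows).

    Raises ValueError if any file's header differs from the first file's,
    which mirrors the "same data columns" check before concatenation.
    """
    header = None
    rows = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            if header is None:
                header = reader.fieldnames
            elif reader.fieldnames != header:
                raise ValueError(f"{path} has different columns: {reader.fieldnames}")
            rows.extend(reader)  # each row is a dict keyed by column name
    return header, rows
```

Calling `concat_csv_files` with the list of 28 file paths would return the shared header and one combined list of rows, or stop at the first file whose columns do not match.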
Uneven Spread of Dataset over Time
Of the 6 air quality stations, 4 provide data across the full date range from 2013 to 2018. Of the remaining 2, BG0054A only contains data from 2013 to 2015, possibly because the station closed and produced no further data from 2016 onwards. BG0079A only contains data from 1 January 2018 onwards, which suggests it is a new station that only started operations in 2018.
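A coverage check of this kind can be sketched as below: for each station, find the first and last timestamp in the combined data. The column names `station` and `timestamp` are assumptions, not the actual EEA column names, and the sample rows are made up purely to show the shape of the result.

```python
from datetime import datetime


def station_coverage(rows, station_key="station", time_key="timestamp"):
    """Return {station: (first_timestamp, last_timestamp)} for the given rows."""
    coverage = {}
    for row in rows:
        stamp = datetime.fromisoformat(row[time_key])
        station = row[station_key]
        first, last = coverage.get(station, (stamp, stamp))
        coverage[station] = (min(first, stamp), max(last, stamp))
    return coverage


# Made-up rows for illustration only; real rows would come from the EEA files.
sample_rows = [
    {"station": "BG0054A", "timestamp": "2013-01-01 00:00:00"},
    {"station": "BG0054A", "timestamp": "2015-12-31 00:00:00"},
    {"station": "BG0079A", "timestamp": "2018-01-01 00:00:00"},
]
for station, (first, last) in station_coverage(sample_rows).items():
    print(station, first.date(), "to", last.date())
```

Stations whose range is much shorter than the rest stand out immediately in such a summary.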
Difference in Interval of Measurements
Air quality measurements up to the end of 2016 were taken on a daily basis. From 2017 onwards, they were taken on an hourly basis. This could be due to process improvements to provide data on a more granular basis.
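One way to confirm such a change in measurement interval is to look at the most common gap between consecutive readings in each year. The sketch below assumes the timestamps of a single station have already been parsed into `datetime` objects; the function name is hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def typical_interval_by_year(timestamps):
    """Return {year: most common gap between consecutive readings in that year}.

    Intended for the readings of one station; gaps that straddle a year
    boundary are ignored.
    """
    ordered = sorted(timestamps)
    gaps = defaultdict(Counter)
    for earlier, later in zip(ordered, ordered[1:]):
        if earlier.year == later.year:
            gaps[earlier.year][later - earlier] += 1
    return {year: counts.most_common(1)[0][0] for year, counts in gaps.items()}
```

A result of one day for the years up to 2016 and one hour from 2017 onwards would match the pattern described above.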
Joining EEA table with Station Info
Each air quality station is identified by a unique string identifier, the EOL code. For readability, the table of monitoring station data is joined with the EEA table, so that every reading carries the corresponding station name, latitude and longitude.
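The join described above amounts to a lookup on the station code. A minimal Python sketch, assuming both tables are lists of dictionaries that share a key column (called `station_code` here, which is a hypothetical name):

```python
def add_station_info(readings, stations, key="station_code"):
    """Left-join station metadata onto each reading by its station code."""
    lookup = {station[key]: station for station in stations}
    joined = []
    for reading in readings:
        info = lookup.get(reading[key], {})
        # Reading fields win on a name clash; a reading with no matching
        # station is kept with only its own fields.
        joined.append({**info, **reading})
    return joined


# Placeholder values for illustration only, not real station metadata.
stations = [{"station_code": "S1", "name": "Station A", "lat": 1.0, "lon": 2.0}]
readings = [{"station_code": "S1", "value": 10}, {"station_code": "S9", "value": 7}]
enriched = add_station_info(readings, stations)
```

Keeping unmatched readings (a left join) makes missing station metadata visible instead of silently dropping rows.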
Task 2
Geographical Data available only as Geohash
The geographical data in the Air Tube dataset is only available as geohashes. The geohashes therefore need to be converted to latitude and longitude before they can be plotted on a map. This is done by decoding the geohashes in R, before joining the tables together to form the final table used for analysis.
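The conversion itself was done in R. For illustration only, the standard geohash decoding algorithm can be sketched in Python without external packages; the function name `decode_geohash` is an assumption and this is not the code used for the assignment.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def decode_geohash(geohash):
    """Return the (latitude, longitude) at the centre of the geohash cell."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    is_lon = True  # bits alternate between longitude and latitude, longitude first
    for char in geohash.lower():
        value = BASE32.index(char)  # raises ValueError on an invalid character
        for shift in range(4, -1, -1):
            bit = (value >> shift) & 1
            rng = lon_range if is_lon else lat_range
            mid = (rng[0] + rng[1]) / 2
            # A 1 bit keeps the upper half of the interval, a 0 bit the lower half.
            if bit:
                rng[0] = mid
            else:
                rng[1] = mid
            is_lon = not is_lon
    return (sum(lat_range) / 2, sum(lon_range) / 2)


lat, lon = decode_geohash("ezs42")  # widely used textbook example, not a Sofia cell
```

Each additional character narrows the cell, so the decoded point is the centre of a small rectangle rather than an exact sensor position.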