ISSS608 2018-19 T1 Assign Wong Yam Yip Data Preparation
EEA Dataset
The EEA dataset contains official PM10 air pollution readings from 6 air stations in Sofia: Druzhba, Hipodruma, IAOS/Pavlovo, Mladost, Nadezhda and Orlov Most. The dataset is made up of multiple CSV files, each holding one year's worth of air pollution data from 1 of the 6 air stations. Overall, there are 39,715 observations spanning 1 Jan 2013 to 14 Sep 2018. Additionally, metadata is provided for each station, including its altitude, the type of air station, and the type and placement of the equipment used.
For this analysis, we merge the data from all stations for the entire period of 2013 to 2018. The corresponding air station information from metadata.xlsx is then added to each row using JMP Pro. Data exploration shows that Averaging Time has 3 categories: vars, hour and day. Measuring the time difference between Datetime Begin and Datetime End shows that vars is also an hourly measure, so vars is recoded to hour. For the purpose of this analysis, we use Datetime Begin as our time basis.
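The recoding step was done in JMP Pro; the following is only a minimal Python sketch of the same idea, not the procedure actually used. The sample rows and the timestamp format are assumptions made up for illustration; only the field names Averaging Time, Datetime Begin and Datetime End come from the dataset description above.

```python
from datetime import datetime, timedelta

# Assumed timestamp format; the real EEA files may use a different one.
FMT = "%Y-%m-%d %H:%M:%S"


def recode_averaging_time(row):
    """Return the averaging-time label implied by the begin/end interval."""
    begin = datetime.strptime(row["Datetime Begin"], FMT)
    end = datetime.strptime(row["Datetime End"], FMT)
    span = end - begin
    if span == timedelta(hours=1):
        return "hour"
    if span == timedelta(days=1):
        return "day"
    # Leave any unexpected interval untouched so it can be inspected later.
    return row["Averaging Time"]


# Made-up sample rows, one per category mentioned in the text.
rows = [
    {"Averaging Time": "vars",
     "Datetime Begin": "2018-01-01 00:00:00",
     "Datetime End": "2018-01-01 01:00:00"},
    {"Averaging Time": "hour",
     "Datetime Begin": "2018-01-01 01:00:00",
     "Datetime End": "2018-01-01 02:00:00"},
    {"Averaging Time": "day",
     "Datetime Begin": "2015-03-01 00:00:00",
     "Datetime End": "2015-03-02 00:00:00"},
]

# Build new rows rather than mutating the originals.
recoded = [dict(r, **{"Averaging Time": recode_averaging_time(r)}) for r in rows]
labels = [r["Averaging Time"] for r in recoded]
```

Deriving the label from the interval, rather than renaming vars directly, also flags any row whose label and interval disagree.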
To compare past and present, the data is further split into 2 time frames at the gap in the data between Dec 2016 and Nov 2017. Data collected in 2013-2016 is grouped as past data, 2013-2016, while data collected from 28 Nov 2017 to 14 Sep 2018 is grouped as present data, 2017-2018. A description of the dataset is as follows:
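The split into two time frames can be sketched as below. This is an illustrative assumption of how the grouping rule might be coded, using the boundary dates stated above; the function name time_frame is hypothetical.

```python
from datetime import date

# Boundaries taken from the text: past = 2013-2016,
# present = 28 Nov 2017 to 14 Sep 2018.
PAST_START, PAST_END = date(2013, 1, 1), date(2016, 12, 31)
PRESENT_START, PRESENT_END = date(2017, 11, 28), date(2018, 9, 14)


def time_frame(day):
    """Return the time-frame label for a date, or None if it falls in the gap."""
    if PAST_START <= day <= PAST_END:
        return "2013-2016"
    if PRESENT_START <= day <= PRESENT_END:
        return "2017-2018"
    return None


# Made-up dates: one in each frame and one inside the gap.
examples = [date(2015, 6, 1), date(2018, 1, 15), date(2017, 5, 1)]
frames = [time_frame(d) for d in examples]
```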
Metadata.xlsx:
Air Tube Dataset
The Air Tube Dataset is made up of 2 CSV files containing civilian sensor readings for 2017 and 2018. In total, this dataset has 3,610,146 readings from 1,265 civilian sensors distributed across Bulgaria, from 06 Sep 2017 to 16 Aug 2018. The civilian sensors measure temperature, humidity, pressure and 2 pollution indicators, P1 and P2. According to Airtube, P1 and P2 represent PM10 and PM2.5 concentration readings respectively, in µg/m³. The definition of PM10 and PM2.5 can be found here. Each observation is accompanied by the sensor location as a geohash and the time of reading. A description of the variables in the dataset is as follows:
For data preparation, the 2017 and 2018 data tables are appended row-wise into one data table. The geohash is then decoded into 2 separate columns, Latitude and Longitude, in RStudio.
One geohash was mapped to a location in the Indian Ocean. In addition, there are 4 observations with a missing geohash. These 5 observations are therefore disregarded in this analysis.
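The decoding itself was done in RStudio. To show what turning a geohash into Latitude and Longitude involves, here is a minimal pure-Python sketch of the standard geohash decoding scheme. The function name decode_geohash is hypothetical, and the sample value ezs42 is the commonly cited textbook geohash, not a value from the Air Tube data.

```python
# Geohash base-32 alphabet (digits plus lowercase letters without a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def decode_geohash(geohash):
    """Return (latitude, longitude) of the centre of the geohash cell."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # bits alternate, starting with longitude
    for ch in geohash.lower():
        value = BASE32.index(ch)  # raises ValueError on an invalid character
        for shift in (4, 3, 2, 1, 0):
            bit = (value >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2


# Textbook example, roughly 42.6 N, 5.6 W.
lat, lon = decode_geohash("ezs42")
```

Each extra character narrows the cell, so longer geohashes give more precise coordinates. The returned point is the cell centre, not an exact sensor position.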
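One way to catch such observations automatically is sketched below. This is not necessarily how they were found here. The bounding box is a rough, deliberately generous approximation of Bulgaria's extent chosen for this illustration, and the sample rows, including the placeholder geohash strings, are made up.

```python
# Approximate bounding box around Bulgaria (assumed, rounded outwards).
LAT_MIN, LAT_MAX = 41.0, 44.5
LON_MIN, LON_MAX = 22.0, 29.0


def is_usable(row):
    """Keep rows that have a geohash and decode to a point inside the box."""
    if not row.get("geohash"):
        return False  # missing geohash: no location available
    return (LAT_MIN <= row["Latitude"] <= LAT_MAX
            and LON_MIN <= row["Longitude"] <= LON_MAX)


# Made-up rows: one near Sofia, one far out at sea, one without a geohash.
rows = [
    {"geohash": "<hash-1>", "Latitude": 42.70, "Longitude": 23.32},
    {"geohash": "<hash-2>", "Latitude": -20.0, "Longitude": 80.0},
    {"geohash": None, "Latitude": None, "Longitude": None},
]

kept = [r for r in rows if is_usable(r)]
dropped = [r for r in rows if not is_usable(r)]
```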
Meteorology Dataset
This dataset contains 2,449 meteorological readings on different dates from 1 Jan 2012 to 17 Sep 2018. A description of the dataset variables is as follows:
To compare the relation between local meteorology and air pollution, the official PM10 readings and the civilian PM10 and PM2.5 readings are each aggregated to a daily average. These 3 mean aggregates are then added to the meteorology dataset. As the dates available in the official and civilian datasets do not fully match, we look at their correlation with local meteorology readings separately, filtering out dates that do not have concentration readings. In addition, the values of PRCPMAX and PRCPMIN are all missing (-9999), so we exclude these 2 variables from our analysis.
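The aggregation and filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the workflow actually used. The sample readings, the date and TEMP fields, the timestamp format and the helper names daily_mean and all_missing are hypothetical. Only PRCPMAX, PRCPMIN and the -9999 missing-value code come from the text.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import mean

MISSING = -9999  # missing-value code noted in the text


def daily_mean(readings):
    """Map each calendar day to the mean of its (timestamp, value) readings."""
    by_day = defaultdict(list)
    for timestamp, value in readings:
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date()
        by_day[day].append(value)
    return {day: mean(values) for day, values in by_day.items()}


def all_missing(rows, column):
    """True if every row carries the missing-value code in this column."""
    return all(r[column] == MISSING for r in rows)


# Made-up hourly PM10 readings covering two days.
readings = [
    ("2018-01-01 00:00:00", 40.0),
    ("2018-01-01 01:00:00", 60.0),
    ("2018-01-02 00:00:00", 30.0),
]

# Made-up meteorology rows; 3 Jan has no concentration reading.
weather = [
    {"date": date(2018, 1, 1), "TEMP": -1.0, "PRCPMAX": MISSING, "PRCPMIN": MISSING},
    {"date": date(2018, 1, 2), "TEMP": 0.5, "PRCPMAX": MISSING, "PRCPMIN": MISSING},
    {"date": date(2018, 1, 3), "TEMP": 2.0, "PRCPMAX": MISSING, "PRCPMIN": MISSING},
]

pm10 = daily_mean(readings)

# Keep only days that have a concentration reading, and attach the daily mean.
joined = [dict(w, PM10=pm10[w["date"]]) for w in weather if w["date"] in pm10]

# Columns that carry no information and can be excluded.
excluded = [c for c in ("PRCPMAX", "PRCPMIN") if all_missing(weather, c)]
```

Running the same join once with the official daily means and once with the civilian ones keeps the two correlation analyses separate, as described above.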
Topology Dataset
This dataset contains 196 topology measurements, each giving the Latitude, Longitude and Elevation of a different location in Sofia City. A detailed description of the variables is as follows:
Reference
The Tableau workbook for the above image can be found here.
Banner image credit to: MarcusObal