ISSS608 2018-19 T1 Assign Wong Yam Yip Data Preparation

From Visual Analytics and Applications

Wyy Image1.jpeg   What's Suffocating Sofia?

Overview

Data Preparation

Task 1: Official Air Quality

Task 2: Citizen Science Air Quality

Task 3: Identifying Factors of Pollution


Dataset and Preparation

EEA Dataset

The EEA dataset contains official PM10 air pollution readings from 6 air stations in Sofia: Druzhba, Hipodruma, IAOS/Pavlovo, Mladost, Nadezhda and Orlov Most. The dataset is made up of multiple CSV files, each holding one year's worth of air pollution data from one of the 6 air stations. Overall, there are 39,715 observations stretching from 1 Jan 2013 to 14 Sep 2018. Additionally, a metadata file provides more information on each station, including its altitude and type, and the type and placement of the equipment used.

For this analysis, we merge the data from all stations for the entire period of 2013 to 2018. Subsequently, the corresponding air station information from metadata.xlsx is added to each row using JMP Pro. Data exploration shows that Averaging Time has 3 different categories: vars, hour and day. Measuring the difference in time between Datetime Begin and Datetime End shows that vars is also an hourly measure, so vars is recoded to hour. For the purpose of this analysis, we use Datetime Begin as our basis of time.
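The recoding step was done in JMP Pro; the Python sketch below only illustrates the same check. The dictionary keys, the timestamp format and the sample rows are assumptions made for illustration, not values taken from the EEA files.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout; the real EEA files may use a different format.
FMT = "%Y-%m-%d %H:%M:%S"


def recode_averaging_time(row):
    """Return a copy of row with 'vars' recoded to 'hour' when the
    measurement window (Datetime End - Datetime Begin) is exactly one hour."""
    begin = datetime.strptime(row["DatetimeBegin"], FMT)
    end = datetime.strptime(row["DatetimeEnd"], FMT)
    out = dict(row)
    if row["AveragingTime"] == "vars" and end - begin == timedelta(hours=1):
        out["AveragingTime"] = "hour"
    return out


# Hypothetical sample rows.
rows = [
    {"AveragingTime": "vars", "DatetimeBegin": "2018-01-01 00:00:00",
     "DatetimeEnd": "2018-01-01 01:00:00"},
    {"AveragingTime": "hour", "DatetimeBegin": "2018-01-01 01:00:00",
     "DatetimeEnd": "2018-01-01 02:00:00"},
    {"AveragingTime": "day", "DatetimeBegin": "2015-03-01 00:00:00",
     "DatetimeEnd": "2015-03-02 00:00:00"},
]

recoded = [recode_averaging_time(r) for r in rows]
```

Checking the window length before recoding, rather than renaming every vars row blindly, keeps any row with an unexpected duration visible for inspection.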

Furthermore, to compare past and present, the data is split into 2 time frames at the gap in the data between Dec 2016 and Nov 2017. Data collected between 2013 and 2016 is grouped as past data (2013-2016), while data collected from 28 Nov 2017 to 14 Sep 2018 is grouped as present data (2017-2018). The dataset is described as follows:

Wyy EEA.png
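The split into past and present data can be sketched as a simple labelling function. This is a minimal Python illustration, not the JMP Pro steps actually used; the function name and the exact end date of the past period (31 Dec 2016) are assumptions.

```python
from datetime import date

# Assumed boundaries: the past period is taken to end on 31 Dec 2016;
# the present period starts on 28 Nov 2017, as stated in the text.
PAST_END = date(2016, 12, 31)
PRESENT_START = date(2017, 11, 28)


def label_period(day):
    """Return the period label for a reading date, or None if the date
    falls inside the gap between the two periods."""
    if day <= PAST_END:
        return "2013-2016"
    if day >= PRESENT_START:
        return "2017-2018"
    return None


labels = [label_period(d) for d in (date(2015, 6, 1), date(2017, 5, 1), date(2018, 1, 1))]
```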

 

Metadata.xlsx:

Metadata.png

 

Air Tube Dataset

The Air Tube dataset is made up of 2 CSV files containing civilian sensor readings for 2017 and 2018. In total, this dataset has 3,610,146 readings from 1,265 civilian sensors distributed across Bulgaria, from 06 Sep 2017 to 16 Aug 2018. The civilian sensors measure temperature, humidity, pressure and 2 pollution indicators, P1 and P2. According to Airtube, P1 and P2 represent PM10 and PM2.5 concentration readings respectively, in µg/m³. The definition of PM10 and PM2.5 can be found here. Each observation is accompanied by the sensor location as a geohash and the time of reading. The variables in the dataset are described as follows:

Wyy Airtube.png

For data preparation, the 2017 and 2018 data tables are merged by rows into one data table. Subsequently, the geohash is decoded into 2 separate columns, Latitude and Longitude, in RStudio.

Wyy Image2.png
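The geohash decoding was done in RStudio. For readers who want to see what that step involves, the sketch below implements the standard geohash decoding in plain Python: each base-32 character contributes 5 bits, which alternately halve the longitude and latitude intervals. The function name is hypothetical and this is not the code used in the analysis.

```python
# Standard geohash base-32 alphabet (digits plus letters, without a, i, l, o).
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def decode_geohash(geohash):
    """Decode a geohash string to the (latitude, longitude) of its cell centre."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # bits alternate between longitude and latitude, longitude first
    for char in geohash.lower():
        value = _BASE32.index(char)  # raises ValueError on an invalid character
        for shift in range(4, -1, -1):
            bit = (value >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2


# "ezs42" is a commonly cited example geohash, not one from the Air Tube data.
latitude, longitude = decode_geohash("ezs42")
```

The decoded point is the centre of the geohash cell, so its precision depends on the length of the geohash string.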

 

One sensor's geohash was mapped to a location in the Indian Ocean. In addition, there are 4 observations with a missing geohash. Therefore, these 5 observations are disregarded in this analysis.

Wyy Image3.png
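One way to automate this exclusion is to drop rows with a missing geohash or with decoded coordinates outside a bounding box around Bulgaria. The sketch below assumes that approach; the bounding box values are rough approximations, and the field names and sample rows (including the placeholder geohash strings) are hypothetical.

```python
# Rough bounding box around Bulgaria; approximate values assumed for illustration.
LAT_MIN, LAT_MAX = 41.0, 44.5
LON_MIN, LON_MAX = 22.0, 29.0


def is_usable(row):
    """Keep a row only if it has a geohash and its decoded coordinates
    fall inside the bounding box."""
    if not row.get("geohash"):
        return False
    return (LAT_MIN <= row["Latitude"] <= LAT_MAX
            and LON_MIN <= row["Longitude"] <= LON_MAX)


# Hypothetical rows; the geohash strings are placeholders and were not
# derived from the coordinates shown.
rows = [
    {"geohash": "placeholder_sofia", "Latitude": 42.70, "Longitude": 23.32},
    {"geohash": "placeholder_ocean", "Latitude": -20.00, "Longitude": 80.00},
    {"geohash": "", "Latitude": None, "Longitude": None},
]

kept = [r for r in rows if is_usable(r)]
```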

 

Meteorology Dataset

This dataset contains 2,449 meteorological readings on different dates stretching from 1 Jan 2012 to 17 Sep 2018. The dataset variables are described as follows:

Wyy Meteor.png

To compare the relationship between local meteorology and air pollution, the official PM10 readings and the civilian PM10 and PM2.5 readings are aggregated to daily averages. These 3 mean aggregates are subsequently added to the meteorology dataset. As the dates available in the official and civilian datasets do not fully overlap, we look at their correlation with local meteorology readings separately, filtering out dates that do not have concentration readings. In addition, as the values of PRCPMAX and PRCPMIN are all missing (-9999), we exclude these 2 variables from our analysis.
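The aggregation and filtering described above can be sketched in plain Python as follows. This is an illustrative outline under stated assumptions, not the procedure actually used: the helper names, the TAVG and PM10_mean field names and all sample values are hypothetical, while PRCPMAX, PRCPMIN and the -9999 missing-value code come from the text.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import mean

MISSING = -9999  # missing-value code used in the meteorology dataset


def daily_mean(readings):
    """Average (timestamp, value) readings per calendar day."""
    by_day = defaultdict(list)
    for timestamp, value in readings:
        by_day[timestamp.date()].append(value)
    return {day: mean(values) for day, values in by_day.items()}


def drop_all_missing_columns(rows):
    """Remove every column whose value is MISSING in all rows."""
    keep = [c for c in rows[0] if any(r[c] != MISSING for r in rows)]
    return [{c: r[c] for c in keep} for r in rows]


# Hypothetical sample data.
pm10_readings = [
    (datetime(2018, 1, 1, 0), 10.0),
    (datetime(2018, 1, 1, 1), 20.0),
    (datetime(2018, 1, 3, 0), 30.0),
]
meteo = [
    {"date": date(2018, 1, 1), "TAVG": 1.5, "PRCPMAX": MISSING, "PRCPMIN": MISSING},
    {"date": date(2018, 1, 2), "TAVG": 2.0, "PRCPMAX": MISSING, "PRCPMIN": MISSING},
    {"date": date(2018, 1, 3), "TAVG": 0.5, "PRCPMAX": MISSING, "PRCPMIN": MISSING},
]

pm10_daily = daily_mean(pm10_readings)

# Attach the daily mean to each meteorology row, then keep only the dates
# that actually have a concentration reading.
joined = [dict(row, PM10_mean=pm10_daily.get(row["date"]))
          for row in drop_all_missing_columns(meteo)]
with_pm10 = [row for row in joined if row["PM10_mean"] is not None]
```

Running the same join separately for the official and the civilian daily means keeps the two comparisons independent, which matches the decision to analyse them separately.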

 

Topology Dataset

This dataset contains 196 topology measurements of Latitude, Longitude and Elevation for different locations in Sofia City. A detailed description of the variables is as follows:

Wyy Topo.png


Reference

The Tableau Workbook for the above image can be found here
Banner image credit to: MarcusObal