IS428 2018-19 Term1 Assign Aaron Poh Weixin
Background & Objectives
Air quality in Bulgaria is a big concern: measurements show that citizens all over the country breathe in air that is considered harmful to health. For example, concentrations of PM2.5 and PM10 are much higher than the limits that the EU and the World Health Organization (WHO) have set to protect health.
Bulgaria had the highest PM2.5 concentrations of all EU-28 member states in urban areas over a three-year average. For PM10, Bulgaria also tops the list of most polluted countries, with a daily mean concentration of 77 μg/m3 (the EU limit value is 50 μg/m3). According to the WHO, 60 percent of the urban population in Bulgaria is exposed to dangerous (unhealthy) levels of particulate matter (PM10).
The objective of this project is to first understand..... (continue)
Task 1: Spatio-temporal Analysis of Official Air Quality
Data Preparation
Before diving into the data cleaning process, it is important to detail some findings from an initial scan of the data:
- Stations 60881 and 9484 have serious data quality issues
  - Station 60881 only has data for 2018
  - Station 9484 only has data up to the year 2015
  - Station 9484 has missing data for October to December 2015
  - Station 9484 has missing data for August to September 2013
- Averaging format is inconsistent across time periods
  - Years 2013-2014 average data daily
  - Year 2015 averages data daily, except for 31 Dec, which is classified as hourly
  - Year 2016 mostly averages data daily, with some hourly averages
  - Year 2017 averages hourly and only has data for the months of November and December. It also contains a 'var' averaging type, which does not make sense
  - Year 2018 mostly averages hourly, with some daily averages
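Findings like these can be checked programmatically rather than by eye. The following is a minimal sketch using only the Python standard library, not the code used for this assignment. The rows are made up for illustration, the field name `station` is hypothetical, and `AveragingTime` is an assumed column name; only `DatetimeBegin` appears in the write-up. It also assumes timestamps begin with `YYYY-MM-DD HH:MM:SS`.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Made-up rows for illustration only; real rows would come from the CSV files.
# "station" and "AveragingTime" are assumed field names.
rows = [
    {"station": "9484", "DatetimeBegin": "2015-09-30 00:00:00", "AveragingTime": "day"},
    {"station": "9484", "DatetimeBegin": "2015-12-31 00:00:00", "AveragingTime": "hour"},
    {"station": "60881", "DatetimeBegin": "2018-01-01 00:00:00", "AveragingTime": "hour"},
    {"station": "60881", "DatetimeBegin": "2018-01-02 00:00:00", "AveragingTime": "day"},
]

def audit(rows):
    """Return (months covered per station and year, averaging counts per year)."""
    months = defaultdict(set)
    averaging = defaultdict(Counter)
    for row in rows:
        # Parse only the first 19 characters so a trailing UTC offset is ignored.
        ts = datetime.strptime(row["DatetimeBegin"][:19], "%Y-%m-%d %H:%M:%S")
        months[(row["station"], ts.year)].add(ts.month)
        averaging[ts.year][row["AveragingTime"]] += 1
    return months, averaging

months, averaging = audit(rows)
for (station, year), covered in sorted(months.items()):
    missing = sorted(set(range(1, 13)) - covered)
    print(station, year, "missing months:", missing)
for year, counts in sorted(averaging.items()):
    print(year, dict(counts))
```

Run over the full dataset, the first loop would list the months with no readings for each station and year, and the second would show how many rows use each averaging type in each year.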
Since the task requires analysing both the past and the most recent air quality situation in Sofia City, this amount of missing data is unacceptable, so I decided to exclude stations 60881 and 9484 from the analysis. Furthermore, since Bulgaria is not a big country, the 4 remaining stations should be sufficient to explain the air quality situation.
(Note: the original dataset has severe gaps for 2017, but I managed to download the latest version, which captures most of the daily data.)
With the above information in mind, I decided on the following cleaning procedure (in Python):
1. I started off with cleaning the years 2013-2015 and 2017 (updated) because they had the cleanest data format. Cleaning was simply converting 'DatetimeBegin' into a suitable DateTime format ("dd/mm/yyyy").
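As an illustration of the date conversion in step 1, here is a small standard-library sketch; it is not the original cleaning code. The helper name `to_day_string` is hypothetical, and the input layout (`YYYY-MM-DD HH:MM:SS`, optionally followed by a UTC offset) is an assumption about how 'DatetimeBegin' is stored.

```python
from datetime import datetime

def to_day_string(datetime_begin: str) -> str:
    """Convert a 'YYYY-MM-DD HH:MM:SS ...' timestamp to 'dd/mm/yyyy'."""
    # Only the first 19 characters are parsed; a trailing UTC offset,
    # if present, is ignored (assumption about the raw format).
    parsed = datetime.strptime(datetime_begin[:19], "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%d/%m/%Y")

print(to_day_string("2013-01-05 00:00:00 +01:00"))  # 05/01/2013
```

Applying this helper to every row's 'DatetimeBegin' value yields one day-level date string per reading.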
2. Years 2016 and 2018 were slightly more complicated because they contained both hourly and daily averages.
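The write-up does not preserve how the mixed hourly and daily rows were reconciled, so the sketch below shows one plausible approach rather than the method actually used: keep a daily value where one exists, and otherwise average the hourly values for that day. The field names `station`, `AveragingTime` and `Concentration` are assumptions, and the sample rows are made up.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def daily_means(rows):
    """Collapse mixed hourly/daily rows to one value per (station, day)."""
    daily = {}
    hourly = defaultdict(list)
    for row in rows:
        day = datetime.strptime(row["DatetimeBegin"][:10], "%Y-%m-%d").date()
        key = (row["station"], day)
        if row["AveragingTime"] == "day":
            daily[key] = row["Concentration"]
        elif row["AveragingTime"] == "hour":
            hourly[key].append(row["Concentration"])
        # Any other averaging label (for example 'var') is skipped.
    result = {key: mean(values) for key, values in hourly.items()}
    result.update(daily)  # a reported daily value overrides the hourly mean
    return result

# Made-up rows for illustration only.
sample = [
    {"station": "A", "DatetimeBegin": "2016-01-01 00:00:00",
     "AveragingTime": "hour", "Concentration": 10.0},
    {"station": "A", "DatetimeBegin": "2016-01-01 01:00:00",
     "AveragingTime": "hour", "Concentration": 30.0},
    {"station": "A", "DatetimeBegin": "2016-01-02 00:00:00",
     "AveragingTime": "day", "Concentration": 25.0},
]

for (station, day), value in sorted(daily_means(sample).items()):
    print(station, day.strftime("%d/%m/%Y"), value)
```

Letting the reported daily value win over the hourly mean is a design choice; averaging only days with a minimum number of hourly readings would be a stricter alternative.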
3. This step is the simplest: it merges all the cleaned datasets into one dataset to be loaded into the relevant visualisation tools.
Functions used:
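As a rough illustration of the merge in step 3 (not the original code or its function list), the cleaned per-year datasets can be concatenated and sorted with the standard library. The function name `merge_cleaned`, the field names `station`, `date` and `pm10`, and the sample rows are all hypothetical; dates are assumed to already be in "dd/mm/yyyy" form.

```python
from datetime import datetime
from itertools import chain

def merge_cleaned(*datasets):
    """Concatenate cleaned per-year row lists and sort by station, then date."""
    merged = list(chain.from_iterable(datasets))
    merged.sort(key=lambda row: (row["station"],
                                 datetime.strptime(row["date"], "%d/%m/%Y")))
    return merged

# Made-up rows for illustration only.
cleaned_2016 = [{"station": "A", "date": "02/01/2016", "pm10": 25.0}]
cleaned_2015 = [{"station": "A", "date": "31/12/2015", "pm10": 40.0}]

merged = merge_cleaned(cleaned_2016, cleaned_2015)
for row in merged:
    print(row["station"], row["date"], row["pm10"])
```

Sorting on the parsed date rather than the "dd/mm/yyyy" string keeps the rows in true chronological order across year boundaries.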
Data Visualisation
- Typical day
- Trend
- Anomalies
- How anomalies affect