IS428 AY2018-19T1 Le Thanh An
Contents
► Problem & Motivation
Air pollution is an important risk factor for health in Europe and worldwide. A recent review of the global burden of disease showed that it is one of the top ten risk factors for health globally. Worldwide an estimated 7 million people died prematurely because of pollution; in the European Union (EU) 400,000 people suffer a premature death. The Organisation for Economic Cooperation and Development (OECD) predicts that in 2050 outdoor air pollution will be the top cause of environmentally related deaths worldwide. In addition, air pollution has also been classified as the leading environmental cause of cancer.
Air quality in Bulgaria is a big concern: measurements show that citizens all over the country breathe in air that is considered harmful to health. For example, concentrations of PM2.5 and PM10 are much higher than what the EU and the World Health Organization (WHO) have set to protect health.
Bulgaria had the highest PM2.5 concentrations of all EU-28 member states in urban areas over a three-year average. For PM10, Bulgaria is also leading on the top polluted countries with 77 μg/m3on the daily mean concentration (EU limit value is 50 μg/m3).
According to the WHO, 60 percent of the urban population in Bulgaria is exposed to dangerous (unhealthy) levels of particulate matter (PM10).
► Exploratory Data Analysis & Data Transformation
EEA Data
We are given 27 csv files, along with 1 metadata xlsx file. Below are the columns present in the 27 csv files and their descriptions.
Since there are 27 files, we will have to combine them. I used Tableau's in-built Union function as shown below
And below are the columns present in the metadata and their descriptions.
We will need to combine the data in the csv files with the metadata. We will do this by Inner Joining AirQualityStationEolCode in Tableau
Problem #1 | |
---|---|
Issue | There are missing data from Jan 2017 to Nov 2017. In addition, Orlov station doesn't have any data from year 2016-2018, and within 2012-2015 for Orlov there are missing data between July 2013 to September 2013 as well as Feb 12 2015 to Feb 24 2015. Mladost station only have data in 2018. |
Solution | Since the data all follow the same pattern, Average Concentration is not affected. I have tried using Average Concentration as well as excluding the data for the 2 cities, and the result is the same. |
Air Tube Data
We are given 2 csv files, with one being 2017 data and the other being 2018 data. The columns of the files are as follow
Problem #1 | |
---|---|
Issue | Tableau does not recognise geohash as geospatial data. Air Tube data uses geohashes to locate its sensors. We would need to transform the geohashes into latitude and longitude for further visualisation |
Solution | I use a python library geohash2 to perform geodecoding. I then add an additional column called lat and lng to the dataset. The code snippet for the geodecoding is as below |
Problem #2 | |
Issue | There were invalid geohashes during geodecode |
Solution | I decided to check the geohashes to see where "m-2105171" location is before deciding whether to remove the place. It turns out to have latitude of 44.99296188354492 and longitude of 57.792205810546875. However, Google Map returns an invalid place based on the coordinates, so I decided that the geohash is invalid and the row will be ignored in further analysis |
► Task 1: Spatio-temporal Analysis of Official Air Quality
A day in Sofia...
As shown on the heatmap, 2nd week of October, November, December, January and February are generally unhealthy (with some days go up to very unhealthy level) while a normal day in other months are at low level of concentration. There is seemingly no correlation between the day of the week and the PM10 concentration level.
In order to see the yearly trend in concentration level, I plot a line chart of concentration level against day for all 5 stations.
As shown on the line chart, there are no anomalous sequence in the different stations. Other than the missing 2017 data and stations that have impartial data, there are recurring outliers every year (hovering on the point would reveal the exact day of the outlier). One of the outliers fall on December 25th, which is Christmas day. This might be due to Bulgarian's tradition of keeping a fireplace lit in a hearth, burning long enough to last through Christmas and into the day. Other peaks of the graph includes New Year period (1st week of January) and end of January.
Over the year, we can see that there is a general decreasing trend of the peaks, indicating that air pollution in Sofia is getting better. This is in tandem with the trend observed in the Calendar Heatmap. In addition, there are less peak points at the end of the year (with 2014 includes points in February while subsequent years the same day/period do not have any peak).
In conclusion, it seems that in Sofia, the average day is of moderate level of PM10 concentration. Start and end of year seems to have worst air quality, and the worst days happen on holidays. However, over the years, the PM10 concentration has been steadily decreasing.
► Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements
Sensors, how good are you?
► Sensors, where are people placing you?
First, I plot the locations of the sensors on a map using the latitude and longitude generated earlier. The resulting density map (which is a new feature in the latest Tableau version) has sensors that are placed outside of Sofia, but for this exercise I decided to exclude those and only focus on the ones that are in Sofia city. The resulting density map is as follow:
As seen on the above maps, there are more sensors in 2018 than in 2017. However, for both years, the sensors are very concentrated in the middle of the city. There are little sensors nearer to the border of the city.
► Have the sensors been working....?
In order to find out whether the sensors are working properly, I plot the number of records against the Hours to see if there are any hour that there are an unusual number of records. The result is as follow:
There are multiple dips in the chart and 1 time where the number of records went abnormally high, indicating that there were times where the sensors were not working properly (by suddenly not taking as many records as it is supposed to be, or suddenly malfunction and take a lot more than usual). However, a closer look at the data shows that the dips only started happening in 2018. This could be due to the new sensors that are added in 2018.
► So the sensors measure stuff, but how accurate are they...?
In order to find out unexpected behaviour in the sensors, I plot a series of boxplots of maximum and minimum temperature, humidity and pressure recorded for the data collected to check for outliers. Since the hottest temperature ever recorded in Bulgaria was 45.6 Celcius and the coldest ever was -38.3, I set those as my limits. Any temperature outside that range would be considered as outliers. Humidity is recorded in percentage, thus I deemed any humidity recorded below 0 and above 100 to be outliers. For pressure data, outliers are deemed as data lying outside the quartiles. The result is as follow:
As shown in the diagram, there is a significant number of outliers in temperature, humidity and pressure being recorded. Thus, I suspect that the sensors might not be working as expected, or are put in places that have additional heat source, pressure and humidity sources, leading to different readings.
► How about the particles? Where in the city are we getting pollutions?
Since the number of sensors in 2017 and 2018 are different, it only makes sense to take a look at 2017 and 2018 separately and see any commonalities/differences. For this task, I have plot a density map with the total PM2.5 and PM10 concentration (P1 + P2). I will be looking at 3 different timing for each year, namely midnight-early morning (12am-8am), working hours (8am-6pm) and after hours (6pm-12am). The result is as follow
As you can see, the identified spots shown in the diagrams above have varying pollution concentration level across the time periods. During working hours, the identified spots have the same pollution levels as other spots. Let's see if the same spots appear in 2018, or are there any new spots in 2018?