IS428 AY2018-19T1 Le Thanh An

From Visual Analytics for Business Intelligence
To Be a Visual Detective: Sofia Air Pollution

► Problem & Motivation

Air pollution is an important risk factor for health in Europe and worldwide. A recent review of the global burden of disease showed that it is one of the top ten risk factors for health globally. Worldwide, an estimated 7 million people die prematurely because of pollution; in the European Union (EU), 400,000 people suffer a premature death. The Organisation for Economic Co-operation and Development (OECD) predicts that by 2050 outdoor air pollution will be the top cause of environmentally related deaths worldwide. In addition, air pollution has been classified as the leading environmental cause of cancer.

Air quality in Bulgaria is a big concern: measurements show that citizens all over the country breathe air that is considered harmful to health. For example, concentrations of PM2.5 and PM10 are much higher than the limits that the EU and the World Health Organization (WHO) have set to protect health.

Bulgaria had the highest PM2.5 concentrations in urban areas of all EU-28 member states over a three-year average. For PM10, Bulgaria also tops the list of most polluted countries, with a daily mean concentration of 77 μg/m³ (the EU limit value is 50 μg/m³).

According to the WHO, 60 percent of the urban population in Bulgaria is exposed to dangerous (unhealthy) levels of particulate matter (PM10).

► Exploratory Data Analysis & Data Transformation

EEA Data

We are given 27 csv files, along with 1 metadata xlsx file. Below are the columns present in the 27 csv files and their descriptions.

Screenshot 2018-11-11 at 4.37.53 AM.png

Since there are 27 files, we have to combine them. I used Tableau's built-in Union function, as shown below.

Union.png
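
For readers who want to reproduce the union outside Tableau, the following is a minimal sketch in Python using only the standard library. It is not what was used for this project. It assumes that all files share the same header row; the folder name in the usage comment, the `source_file` column and the default encoding are assumptions made for illustration.

```python
import csv
import glob
import os

def union_csv_files(paths, encoding="utf-8"):
    """Stack the rows of several CSV files that share the same columns."""
    rows = []
    for path in sorted(paths):
        with open(path, newline="", encoding=encoding) as handle:
            for row in csv.DictReader(handle):
                # Extra column (an assumption) so each row can be traced to its file.
                row["source_file"] = os.path.basename(path)
                rows.append(row)
    return rows

# Hypothetical usage: "EEA Data" is an assumed folder name.
# all_rows = union_csv_files(glob.glob("EEA Data/*.csv"))
```

Each row is returned as a dictionary keyed by column name, so the combined list can be filtered or aggregated before it is loaded into a visualisation tool.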

And below are the columns present in the metadata and their descriptions.

Metadata columns.png

We need to combine the data in the csv files with the metadata. We do this with an inner join on AirQualityStationEolCode in Tableau.

Innerjoin.png
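
The logic of the inner join can be sketched in plain Python as follows. This is only an illustration of what the join does, not the Tableau implementation. The key name is spelled as in the text above; the other column names and all sample values are made up.

```python
from collections import defaultdict

def inner_join(left_rows, right_rows, key):
    """Return one merged row for every left/right pair that shares the key value."""
    index = defaultdict(list)
    for right in right_rows:
        index[right[key]].append(right)
    joined = []
    for left in left_rows:
        # Left rows without a matching right row are dropped (inner join).
        for right in index.get(left[key], []):
            joined.append({**right, **left})
    return joined

# Made-up sample rows; only the key name comes from the text.
measurements = [
    {"AirQualityStationEolCode": "STATION_A", "Concentration": "31.2"},
    {"AirQualityStationEolCode": "STATION_X", "Concentration": "18.0"},
]
stations = [
    {"AirQualityStationEolCode": "STATION_A", "Latitude": "42.7"},
    {"AirQualityStationEolCode": "STATION_B", "Latitude": "42.6"},
]
joined = inner_join(measurements, stations, "AirQualityStationEolCode")
```

In this sample only the STATION_A measurement survives, because STATION_X has no metadata row and STATION_B has no measurements.
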
Problem #1
Issue Data are missing from January 2017 to November 2017. In addition, the Orlov station has no data for 2016-2018, and within 2012-2015 it has gaps from July 2013 to September 2013 and from 12 February 2015 to 24 February 2015. The Mladost station only has data for 2018.
Solution Since the data from all stations follow the same pattern, Average Concentration is not affected. I compared Average Concentration with and without the data for these 2 stations, and the result is the same.
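
Gaps like the ones listed in the issue can also be located programmatically. The sketch below is one possible approach, not the method used for this project. It takes the days on which a station reported data and returns each run of missing days; the function name and the `min_gap_days` threshold are assumptions.

```python
from datetime import date, timedelta

def find_gaps(observed_days, min_gap_days=2):
    """Return (first_missing_day, last_missing_day) for each run of missing days."""
    days = sorted(set(observed_days))
    gaps = []
    for previous, current in zip(days, days[1:]):
        missing = (current - previous).days - 1
        if missing >= min_gap_days:
            gaps.append((previous + timedelta(days=1), current - timedelta(days=1)))
    return gaps

# Made-up observation days shaped like the February 2015 gap for Orlov.
sample_days = [date(2015, 2, 10), date(2015, 2, 11), date(2015, 2, 25), date(2015, 2, 26)]
sample_gaps = find_gaps(sample_days)
```

For the sample days, the only reported gap runs from 12 February 2015 to 24 February 2015.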

Air Tube Data

Problem #1
Issue Tableau does not recognise geohashes as geospatial data. Air Tube data uses geohashes to locate its sensors, so we need to transform the geohashes into latitude and longitude for further visualisation.
Solution I used the Python library geohash2 to perform the geodecoding. I then added two additional columns, lat and lng, to the dataset. The code snippet for the geodecoding is shown below.
Geodecode.jpg
Problem #2
Issue Some geohashes turned out to be invalid during geodecoding.
Solution I checked where the geohash "m-2105171" is located before deciding whether to remove it. It decodes to a latitude of 44.99296188354492 and a longitude of 57.792205810546875. However, Google Maps returns an invalid place for these coordinates, so I treated the geohash as invalid and ignored the row in further analysis.
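
Since the original snippet is only available as an image, here is a minimal sketch of geohash decoding in pure Python. It implements the standard geohash algorithm directly instead of calling geohash2, so it is not the code used for this project. The column name `geohash` and the helper `add_lat_lng` are assumptions. Unlike the result reported above, this sketch rejects any string containing characters outside the geohash alphabet (such as the hyphen in "m-2105171") rather than decoding it.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def decode_geohash(geohash):
    """Return the (lat, lng) centre of a geohash cell, or None if it is invalid."""
    if not geohash:
        return None
    lat_range = [-90.0, 90.0]
    lng_range = [-180.0, 180.0]
    use_lng = True  # bits alternate between longitude and latitude, longitude first
    for char in geohash:
        value = BASE32.find(char)
        if value == -1:
            return None  # character outside the geohash alphabet
        for shift in range(4, -1, -1):  # each character carries 5 bits
            bit = (value >> shift) & 1
            target = lng_range if use_lng else lat_range
            mid = (target[0] + target[1]) / 2
            if bit:
                target[0] = mid  # keep the upper half of the interval
            else:
                target[1] = mid  # keep the lower half of the interval
            use_lng = not use_lng
    lat = (lat_range[0] + lat_range[1]) / 2
    lng = (lng_range[0] + lng_range[1]) / 2
    return lat, lng

def add_lat_lng(rows, geohash_field="geohash"):
    """Add lat and lng to each row and drop rows whose geohash cannot be decoded."""
    valid_rows = []
    for row in rows:
        coords = decode_geohash(row[geohash_field])
        if coords is None:
            continue
        row["lat"], row["lng"] = coords
        valid_rows.append(row)
    return valid_rows
```

The decoder returns the centre of the geohash cell, so shorter geohashes give coarser positions. A row that decodes successfully can still point to an implausible location, which is why a manual check such as the Google Maps lookup described above remains useful.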

Task 1: Spatio-temporal Analysis of Official Air Quality

A day in Sofia...

Sofia calendar heatmap.png
To see a typical day in Sofia, I plotted a calendar heatmap of average concentration, broken down by year, month and day of the week. The values are organised into bins as per the EPA PM2.5 standards.
AQI Index.jpg
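
The binning can be sketched in Python as follows. The breakpoints are the US EPA 24-hour PM2.5 AQI breakpoints as I understand them for the period covered here; they are an assumption and should be checked against the table in the image above. As a simplification, values are compared directly against the upper bounds without EPA's truncation to one decimal place.

```python
from bisect import bisect_left

# Upper bound (μg/m3) of each category except the last; assumed EPA breakpoints.
PM25_UPPER_BOUNDS = [12.0, 35.4, 55.4, 150.4, 250.4]
CATEGORIES = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]

def pm25_category(concentration):
    """Map a 24-hour average PM2.5 concentration (μg/m3) to an AQI category."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    # Index of the first upper bound that is >= the concentration.
    return CATEGORIES[bisect_left(PM25_UPPER_BOUNDS, concentration)]
```

For example, `pm25_category(77)` falls between 55.4 and 150.4 and therefore returns "Unhealthy".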

As shown on the heatmap, the 2nd week of October and the months of November, December, January and February are generally very unhealthy (with some days reaching the hazardous level), while a normal day in the other months is at best moderate and at worst unhealthy. There is seemingly no correlation between the day of the week and the PM2.5 concentration level.

To see the yearly trend in concentration level, I plotted a line chart of concentration level against day for all 5 stations.

Linegraph sofia.png

As shown on the graph, apart from the missing 2017 data and the stations that have only partial data, there are recurring outliers every year (hovering over a point reveals the exact day of the outlier). One of the outliers falls on December 25th, which is Christmas Day. This might be due to the Bulgarian tradition of keeping a fire lit in the hearth, burning long enough to last through Christmas Eve and into Christmas Day. Other peaks in the graph include the New Year period (1st week of January) and the end of January.

Over the years, there is a general decreasing trend in the peaks, suggesting that air pollution in Sofia is improving. This is consistent with the trend observed in the calendar heatmap. In addition, there are fewer peak points at the end of the year (2014 includes peak points in February, while the same period in subsequent years has no peak).

Another

► Task 2

► Task 3