IS428 AY2018-19T1 Huang Huanyuan

From Visual Analytics for Business Intelligence
Revision as of 23:41, 11 November 2018 by Hy.huang.2014 (talk | contribs)
Jump to navigation Jump to search

Va.jpg IS428 Visual Analytics for Business Intelligence

About

Assignment Dropbox

 


To be a Visual Detective

Overview

Air pollution is an important risk factor for health in Europe and worldwide. A recent review of the global burden of disease showed that it is one of the top ten risk factors for health globally. Worldwide an estimated 7 million people died prematurely because of pollution; in the European Union (EU) 400,000 people suffer a premature death. The Organisation for Economic Cooperation and Development (OECD) predicts that in 2050 outdoor air pollution will be the top cause of environmentally related deaths worldwide. In addition, air pollution has also been classified as the leading environmental cause of cancer.

Air quality in Bulgaria is a big concern: measurements show that citizens all over the country breathe in air that is considered harmful to health. For example, concentrations of PM2.5 and PM10 are much higher than what the EU and the World Health Organization (WHO) have set to protect health.

Bulgaria had the highest PM2.5 concentrations of all EU-28 member states in urban areas over a three-year average. For PM10, Bulgaria is also leading on the top polluted countries with 77 μg/m3on the daily mean concentration (EU limit value is 50 μg/m3).

According to the WHO, 60 percent of the urban population in Bulgaria is exposed to dangerous (unhealthy) levels of particulate matter (PM10).

Data Pre-Processing

In this section, we will be looking into the data pre-processing process to clean the data, before we dive into the analysis proper. In general, I will be using Python and Jupyter Notebook for the pre-preparing of data. Subsequently, joins across different data sources will be done through both Python and the SQL-like build in functions within Tableau.

The Dataset given contains a .zip file from four different sources. These sources are: Official Air Quality Measurements (EEA Data), Citizen Science Air Quality Measurements (Air Tube Data), Meteorological Measurements (METEO Data), and Topography Data for Sofia City. For the EEA and Air Tube datasets, the transformation will combine the data into a single .csv file for convenience’s sake.

Official Air Quality Measurements (EEA Data)

The EEA datasets contain data from 2013-2018 stored in 28 .csv files (partitioned by year) from five stations in the city. The columns within the dataset is as follow:

EEA Data Frame.png

Issues:

  1. The data for each station has various levels of completeness for each station. Some stations contain all five years of data from 2013 to 2018, while others may only contain between one to three years of data.
  2. The granularity of data is different in 2017 and 2018 compared with the other years. The dataset of 2017 and 2018 contains hour-level data, while other years contain only day-level data. This added granularity gives us an additional layer of analysis, but we will not be able to compare the analysis across the years from 2013-2016.
  3. Although 2017 data provides added granularity, it is also incomplete. One would expect a full-year worth of data from 2017, but the dataset only contains hour-level data from November to December. This raises an issue as we are unable to conduct any form of analyses between January to October of 2017.
  4. The hour-level data for 2018 is only complete up till early September, and does not contain the full 24-hour data onwards.

Solutions:

As the given sets of data from the .zip folder is incomplete, I have opt to redownload the datasets from the official EEA website. I will elaborate on how I issues stated above are addressed below.

  1. I have chosen to only include the datasets from stations with the complete five-year data from 2013 to 2018. This ensures consistency of data used in our analyses.
  2. This proves to be an issue that cannot be addressed, since hour-level data is unavailable for the years 2013-2016.
  3. For 2017, I have chosen to use day-level data as it is a complete dataset.
  4. I only took data up till July 31st, 2018.

The resulting dataset contains 35713 rows, as compared to 39715 rows from above. Additionally, I have included a Datebegin column, derived from the DatetimeBegin column, to use as a key to join with the METEO dataset later on.

Citizen Science Air Quality Measurements (Air Tube Data)

The Air Tube datasets contain data from 2017-2018, stored in 2 .csv files (partitioned by year) from instruments placed around the country. The columns within the dataset is as follow:

Air Tube Data Frame.png

Issues:

  1. Datetime value in the ‘time’ column is stored in an unusable format (‘%Y-%m-%dT%H:%M:%SZ).
  2. Geographic data is stored in a geohash format.
  3. Dataset contains data from across Bulgaria, even though we are only interested in Sofia City.

Solutions:

  1. Parsed the time into a usable datetime format using Python’s library: datetime’s strptime() function.
  2. Used Python’s library: pygeohash to decode the geohash data.
  3. Used Python’s library: reverse_geocode to get tag each lat-long pair to a city within Bulgaria. This allowed me to filter easily for Sofia City during the analyses afterwards.
    1. Initially tried to use Google Map’s API to do the reverse geo-coding. However, the API constantly times out due to large number of calls.
    2. The reverse_geocode library has a set of known geocoded locations and uses a k-d tree to efficiently find the nearest neighbour, and hence allows geocoding to be done offline. This however leads to some inaccuracies in the parsed data, since the algorithm finds the nearest neighbour, which might not be the correct city of interest.
  4. (Additionally) I only took data up till July 31st, 2018.

The resulting dataset contains 35713 rows, as compared to 39715 rows from above. Additionally, I have included a Datebegin column, derived from the DatetimeBegin column, to use as a key to join with the METEO dataset later on.


Meteorological Measurements (METEO Data)

The METEO dataset contains information on temperature, humidity, wind speed, pressure, rainfall and visibility from 2012-2018. The columns within the dataset is as follow:

METEO Data.png

Issues:

  1. Date information is stored in 3 different columns – year, month and day.

Solutions:

  1. Create a new ‘Date’ column that combines the date information into ‘%Y-%m-%d’ format.
  2. (Additionally) I did not limit the date range to lie between 2013 and July 31st 2018, as we will be using this table for an inner join within Tableau.

Topography Data for Sofia City (TOPO Dataset)

The TOPO dataset contains information on elevation of several coordinates around Sofia City. The columns within the dataset is as follow:

TOPO Data Frame.png


Data Transformation

In this section, we will be looking into how we transform the data that we pre-processed previously into data sources that we will be using for our analyses subsequently. For the analyses, the Tableau workbook be powered primarily by two data sources:

  1. EEA,METEO: Official Air Quality Measurements (EEA Data) + Meteorological Data (METEO Data)
  2. AirTube,METEO,TOPO: Citizen Science Air Quality Measurements (Air Tube Data) + Meteorological Measurements (METEO Data) + Topography Data for Sofia City (TOPO Dataset)

EEA,METEO

The transformation process is done within Tableau itself, with an inner join between the pre-processed EEA dataset and the pre-processed METEO dataset. The joining keys are Datebegin column from EEA and Date column from METEO.

AirTube,METEO,TOPO

The transformation process is broken down into two parts: joining the AirTube dataset to the TOPO dataset (Python), and joining the resulting table to the METEO dataset (within Tableau).

As the TOPO dataset’s coordinates are not comprehensive, a join was based on the lowest resulting distances using the Harversine formula. The Haversine formula determines the great circle distance between two points on a sphere given their longitudes and latitudes. Minimizing this distance allow us to classify each lat-long coordinate pair in the pre-processed Air Tube dataset to its closest counterpart on the TOPO dataset, and in turn allow us to estimate the elevation for each lat-long coordinate pair. Since the TOPO dataset only contains lat-long coordinates within Sofia City, the resulting table will incorrectly estimate elevation for lat-long coordinate pairs outside Sofia City to its closest counterpart within Sofia City. This is not a problem if the other coordinates are on the border of Sofia City, but will pose a problem if they are far away. However, since our analyses will focus primarily on data within Sofia City, the incorrect data can be ignored.

The resulting table is then joined on the pre-processed METEO dataset within Tableau, with an inner join between resulting table and the METEO dataset. The joining keys are Date column from EEA and Date column from METEO.



The Task

Task 1: Spatio-temporal Analysis of Official Air Quality

S/N Observation Information Graph
1 We see here a plot of average concentration of air-pollutants by day, between 2013-2018 from the EEA dataset. We can see immediately that the concentration of air pollutants are highest during the end and start of the years (November, December, January, February). The trend starts to pick up at around October of each year, increases, and stabilizes at around March.

From the chart, there are definitely quite a few notable trends worth investigating. First is the spikes in concentrations as mentioned before. As this is a recurring trend for the six years, it is noteworthy to find out what is causing the spikes at the particular times. Furthermore, the trend seems to last around four months long – which is substantial since citizens of Sofia City will be living amidst high concentrations of air pollutants for four months of each year.

Secondly, it is worth diving into each of the spikes and understanding why the day-on-day changes can be so drastic. For example, if we look at end-December of 2013, we see that the average concentration of air pollutants dropped from 345.8 down to 56.7 within the span of 3 days.

Thirdly, we might be interested in finding out the discontinuities within the dataset itself. Throughout the years, we see instances where the line ‘breaks’. This indicates that data is not captured during that particular day.
Data Source: EEA,METEO

Title: EEA by Day
EEA by Day.png
2 With the added time granularity, we are able to do a plot of average concentration of air pollutants by hour in 2018. From this plot, we observe several interesting spikes, particularly in March 17. Data Source: EEA,METEO

Title: EEA by Hour 2018
EEA by Hour 2018.png
3 With this plot, we separate the readings taken by each Air Quality Station. As we can see, some of the readings are not consistent across the stations. Data Source: EEA,METEO

Title: EEA by Air Quality Station

EEA by Air Quality Station.png

Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements

S/N Observation Information Graph
1 We see here the distribution of sensors within Bulgaria. In the above diagram, each colour represents a city within Bulgaria, characterized by the reverse-geocoding algorithm described under Data Pre-Processing. We can see immediately that although the clustering algorithm is comprehensive, it is not one hundred percent accurate, as we see overlaps between cities. Despite the lack in accuracy, it still provides us an easy way to filter for the data we need. Note, the data points in blue represents Sofia City.

We can also see that most of the sensors are concentrated around the city center of Sofia City. Although this will not give a fair picture to the overall air pollution level within the entire city, it gives us a clear understanding on the relevant air pollution levels that has a direct impact on people’s lives, since the city center comprises of the most densely populated areas of Sofia City.
Data Source: AirTube,METEO,TOPO

Title: Distribution of Sensors
Distribution of Sensors.png
2 We see here the Average Concentration of P1(Size of Squares) and Average Concentration of P2 (Intensity of Colours) within Sofia City. We can see immediately that the data is skewed by the data point represented by the Big Red Square on the plot. To minimize the impact of outliers, we will be removing the top three highest concentrations. Data Source: AirTube,METEO,TOPO

Title: P1 P2 Concentrations

P1 P2 Concentrations.png

3 In this graph, we plot the overall trend of Average Concentrations of P1 and Average Concentrations of P2 respectively. From the graph, we are able to identify regions where the lines break – this indicate that no data was collected for all sensors for that particular period of time. We might also suspect drastic changes in spikes of concentration to indicate that the sensors are not picking up the data properly. Data Source: AirTube,METEO,TOPO

Title: P1 P2 Hourly

P1 P2 Hourly.png

4 After removing the three outliers, the results become more pronounced. As we can see from above, the areas with higher relative air-pollutant concentration are mostly scattered around the borders of Sofia City, with some high concentration areas within the center itself. This is probably due to the sensors being close to industrial areas in Sofia City. Data Source: AirTube,METEO,TOPO

Title: P1 P2 Concentrations

P1 P2 Concentrations Clean.png

5 Perhaps interestingly, we see that the concentration of air pollutants differs by time - in this case, we compare between the concentrations at 2am and 11am. However, this is arguably due to the Data Source: AirTube,METEO,TOPO

Title: Concentration by Time

TIME.png

Task 3

S/N Observation Information Graph
1 At a glance, we are able to infer a correlation between Temperature, Humidity and Concentration of Air Pollutants. Data Source: EEA,METEO

Title: Meteorological Dashboard

Meteorological Dashboard.png

2 Diving deeper, we are able to observe the same result with the correlation plot. Data Source: EEA,METEO

Title: Correlation Plot

Correlation Plot.png

3 Looking at the plot for Elevation, we can conclude that there is little correlation between elevation and the concentration of air pollutants (represented by P1 and P2 plots here). Data Source: AirTube,METEO,TOPO

Title: Elevation

Elevation.png


Reference

Haversine Formula
Pygeohash Reverse Geocode