Difference between revisions of "IS428 AY2018-19T1 Gokarn Malika Nitin"

From Visual Analytics for Business Intelligence
 
(14 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
=<div style="background: #581845; padding: 15px; line-height: 0.3em; text-indent: 15px; font-size:18px; font-family:Open Sans, Arial, sans-serif"><font color= #ffffff>Problem and Motivation</font></div>=
 
=<div style="background: #581845; padding: 15px; line-height: 0.3em; text-indent: 15px; font-size:18px; font-family:Open Sans, Arial, sans-serif"><font color= #ffffff>Problem and Motivation</font></div>=
<!--
 
 
<div style="font-family:Open Sans, Arial, sans-serif;font-size:12px">
 
  
 
Air Pollution is the single largest environmental health risk in Europe as well as the rest of the world. It refers to the contamination of the atmosphere by harmful chemicals or biological materials. The fact that it is such a large health risk is backed by the high number of metrics pointing toward air pollution as the primary cause of distress in terms of disease (most deadly of which include cancer) and death. That is to say that the health effects can be both short term and long term. For example, it is estimated that 7 million people died prematurely across the world due to air population. In fact, in the European Union, 400,000 people suffered a premature death.
 
Air Pollution is the single largest environmental health risk in Europe as well as the rest of the world. It refers to the contamination of the atmosphere by harmful chemicals or biological materials. The fact that it is such a large health risk is backed by the high number of metrics pointing toward air pollution as the primary cause of distress in terms of disease (most deadly of which include cancer) and death. That is to say that the health effects can be both short term and long term. For example, it is estimated that 7 million people died prematurely across the world due to air population. In fact, in the European Union, 400,000 people suffered a premature death.
Line 11: Line 8:
  
 
To solve the problem of air pollution, it’s necessary to understand the issues and look for ways to counter it. Therefore, the aim of this assignment is to reveal the spatiotemporal patterns of air quality and measurement techniques in Sofia City of Bulgaria, thereby identifying issues of concern.
 
To solve the problem of air pollution, it’s necessary to understand the issues and look for ways to counter it. Therefore, the aim of this assignment is to reveal the spatiotemporal patterns of air quality and measurement techniques in Sofia City of Bulgaria, thereby identifying issues of concern.
</div>
 
 
-->
 
  
 
=<div style="background: #581845; padding: 15px; line-height: 0.3em; text-indent: 15px; font-size:18px; font-family:Open Sans, Arial, sans-serif"><font color= #ffffff>Dataset Analysis and Transformation Process</font></div>=
 
=<div style="background: #581845; padding: 15px; line-height: 0.3em; text-indent: 15px; font-size:18px; font-family:Open Sans, Arial, sans-serif"><font color= #ffffff>Dataset Analysis and Transformation Process</font></div>=
 
<div style="font-family:Open Sans, Arial, sans-serif;font-size:15px">
 
<div style="font-family:Open Sans, Arial, sans-serif;font-size:15px">
<!--
+
 
 
==<div style="font-family:Open Sans, Arial, sans-serif;font-size:17px">Dataset Download</div>==
 
==<div style="font-family:Open Sans, Arial, sans-serif;font-size:17px">Dataset Download</div>==
  
Line 43: Line 37:
 
The format for data contained by these sensor data files looks as follows:
 
The format for data contained by these sensor data files looks as follows:
 
[[File:AirTube Data.jpg|550px|center]]
 
[[File:AirTube Data.jpg|550px|center]]
 +
 +
<div style="font-family:Open Sans, Arial, sans-serif;font-size:16px"><b>METEO Data</b></div>
 +
 +
The downloaded zip file has 1 file under METEO data which looks as follows:
 +
[[File:Metero-data.jpg|550px|center]]
  
 
==<div style="font-family:Open Sans, Arial, sans-serif;font-size:17px">Data Cleaning and Transformation</div>==
 
==<div style="font-family:Open Sans, Arial, sans-serif;font-size:17px">Data Cleaning and Transformation</div>==
Line 76: Line 75:
 
| Issue || The citizen science air quality measurement readings (AirTube data) do not include the longitude and latitude of the place of measurement. Instead, they are contained in the form of a geohash code. Unfortunately, Tableau is not built to handle geohash code.
 
| Issue || The citizen science air quality measurement readings (AirTube data) do not include the longitude and latitude of the place of measurement. Instead, they are contained in the form of a geohash code. Unfortunately, Tableau is not built to handle geohash code.
 
|-
 
|-
| Solution || Making use of the GitHub python geohash2 library [https://github.com/DBarthe/geohash] I am able to write a python script that can do the decoding for me, taking into consideration the error of transformation as well.
+
| Solution || Making use of the GitHub python geohash2 library [https://github.com/DBarthe/geohash] I am able to write a python script that can do the decoding for me, taking into consideration the error of transformation as well. The code snippet is as below:
 +
[[File:Codesnippet1.jpg|550px|center]]
 +
 
 
<br/>
 
<br/>
 
Upon importing the decoded dataset into Tableau, I found 4 points that have latitude and longitude values of 0.000000, as well as 1 point that has a latitude value of -4.025953, and a longitude value of 78.751781. As neither of these 5 points is anywhere near Bulgaria or Sofia City I have excluded them from the dataset as a whole.
 
Upon importing the decoded dataset into Tableau, I found 4 points that have latitude and longitude values of 0.000000, as well as 1 point that has a latitude value of -4.025953, and a longitude value of 78.751781. As neither of these 5 points is anywhere near Bulgaria or Sofia City I have excluded them from the dataset as a whole.
Line 90: Line 91:
 
| Issue || The citizen science air quality measurement readings (AirTube data) has multiple "wrong" readings with some being noise while some being representative of broken sensors. Through a simple internet search, one can find that the lowest temperature Bulgaria has ever faced is -38.3 degrees Celsius, while the highest is 45.2 degrees Celsius.
 
| Issue || The citizen science air quality measurement readings (AirTube data) has multiple "wrong" readings with some being noise while some being representative of broken sensors. Through a simple internet search, one can find that the lowest temperature Bulgaria has ever faced is -38.3 degrees Celsius, while the highest is 45.2 degrees Celsius.
 
|-
 
|-
| Solution || In order to remove the noise and outliers, the recorded temperatures above 50 degrees Celsius and below -40 degrees Celsius are removed.
+
| Solution || In order to remove the noise and outliers, the recorded temperatures above 50 degrees Celsius and below -40 degrees Celsius are removed. Further changes are made to the pressure and humidity variables as well, as is discussed in task 2.
 +
|}
 +
 
 +
{| class="wikitable" style="background-color:#FFFFFF;" width="100%"
 +
|-
 +
! style="font-weight: bold;background: #6F1E29;color:#FFFFFF;width: 10%" | Problem #4
 +
! style="font-weight: bold;background: #6F1E29;color:#FFFFFF;" | Meteo Data Building Issues
 +
|-
 +
| Issue || The data currently has year, day and month as separate variables. Additionally, there is no information on the air quality measurements.
 +
|-
 +
| Solution || Firstly, I used a python script to build the full date so that it is comparable to a date format. I then extracted my unionised data from Air Tube data and extracted the average air quality measure per unique date. Lastly, I imported both the resultant csv files with an inner join on the date data. The code snippet used is below:
 +
[[File:Codesnippet2.jpg|550px|center]]
 
|}
 
|}
 
</div>
 
</div>
-->
 
  
 
=<div style="background: #581845; padding: 15px; line-height: 0.3em; text-indent: 15px; font-size:18px; font-family:Open Sans, Arial, sans-serif"><font color= #ffffff>Task 1: Spatio-temporal Analysis of Official Air Quality</font></div>=
 
=<div style="background: #581845; padding: 15px; line-height: 0.3em; text-indent: 15px; font-size:18px; font-family:Open Sans, Arial, sans-serif"><font color= #ffffff>Task 1: Spatio-temporal Analysis of Official Air Quality</font></div>=
 
+
==<div style="font-family:Open Sans, Arial, sans-serif;font-size:17px">Analysis</div>==
<!--
 
 
I started by bringing in the EEA Data for the years 2013 to 2018. The aim is to visualize the concentration in terms of the average, across a Calendar Heatmap, to understand the outliers, and any potential anomalies. It can be understood that data across all stations is missing for the time period of 1 January 2017 to 28 November 2017.  
 
I started by bringing in the EEA Data for the years 2013 to 2018. The aim is to visualize the concentration in terms of the average, across a Calendar Heatmap, to understand the outliers, and any potential anomalies. It can be understood that data across all stations is missing for the time period of 1 January 2017 to 28 November 2017.  
  
Line 163: Line 173:
 
# Public Holidays in Bulgaria in 2018. (n.d.). Retrieved from https://www.officeholidays.com/countries/bulgaria/index.php
 
# Public Holidays in Bulgaria in 2018. (n.d.). Retrieved from https://www.officeholidays.com/countries/bulgaria/index.php
 
# Bulgaria: Kresna Gorge forest fires lead to more evacuations. (2017, August 30). Retrieved from https://sofiaglobe.com/2017/08/28/bulgaria-kresna-gorge-forestfires-lead-to-more-evacuations/
 
# Bulgaria: Kresna Gorge forest fires lead to more evacuations. (2017, August 30). Retrieved from https://sofiaglobe.com/2017/08/28/bulgaria-kresna-gorge-forestfires-lead-to-more-evacuations/
 
-->
 
  
 
=<div style="background: #581845; padding: 15px; line-height: 0.3em; text-indent: 15px; font-size:18px; font-family:Open Sans, Arial, sans-serif"><font color= #ffffff>Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements</font></div>=
 
=<div style="background: #581845; padding: 15px; line-height: 0.3em; text-indent: 15px; font-size:18px; font-family:Open Sans, Arial, sans-serif"><font color= #ffffff>Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements</font></div>=
  
<!--
 
-->
 
 
==<div style="font-family:Open Sans, Arial, sans-serif;font-size:17px">Sensor's Coverage</div>==
 
==<div style="font-family:Open Sans, Arial, sans-serif;font-size:17px">Sensor's Coverage</div>==
I brought in the sensors for the Sofia region only, and use the Density Marks feature of Tableau 10.3 to plot the coverage of the sensors. I find that the sensors are generally placed in central Sophia, with lack of coverage in South Eastern and North Eastern regions. The number of sensors increases from 2017 to 2018, however, this is mainly concentrated in the central region, with little focus on the areas that lack coverage. Below is the coverage for the year 2018.
+
I brought in the sensors for the Sofia region only and used the Density Marks feature of Tableau 10.3 to plot their coverage. The sensors are generally placed in central Sofia, with a lack of coverage in the south-eastern and north-eastern regions. The number of sensors increases from 2017 to 2018; however, the increase is mainly concentrated in the central region, with little focus on the areas that lack coverage. <br/>
{| class="wikitable" style="background-color:#FFFFFF;" width="100%"
+
Below is the coverage for the year 2017.
|- [[File:Coverage 2017.jpg|500px|left]] [[File:Coverage.jpg|500px|right]]
+
[[File:Coverage 2017.jpg|550px|center]]  
|}
+
Below is the coverage for the year 2018.
 +
[[File:Coverage.jpg|550px|center]]
  
 
<br/>
 
 
==<div style="font-family:Open Sans, Arial, sans-serif;font-size:17px">Sensors' Stable Operations</div>==
 
==<div style="font-family:Open Sans, Arial, sans-serif;font-size:17px">Sensors' Stable Operations</div>==
 
To understand whether the sensors are operating properly at all times, I looked into the time series data of the number of records being captured by the sensors over the hours. This shows that indeed the number of records and thereby sensors has increased over time. However, they are not working properly all the time, as shown by the sudden steep drop in the number of records collected and a sudden spike. Perhaps this is due to sensors suddenly not taking as many or taking too many records as compared to their configuration due to malfunction. A closer look at the time axis shows that these dips started in 2018. This could be due to the new sensors that were added in 2018. Perhaps certain members of the new group of sensors are not working correctly.
 
To understand whether the sensors are operating properly at all times, I looked into the time series data of the number of records being captured by the sensors over the hours. This shows that indeed the number of records and thereby sensors has increased over time. However, they are not working properly all the time, as shown by the sudden steep drop in the number of records collected and a sudden spike. Perhaps this is due to sensors suddenly not taking as many or taking too many records as compared to their configuration due to malfunction. A closer look at the time axis shows that these dips started in 2018. This could be due to the new sensors that were added in 2018. Perhaps certain members of the new group of sensors are not working correctly.
Line 183: Line 188:
  
 
==<div style="font-family:Open Sans, Arial, sans-serif;font-size:17px">Sensors' Performance Accuracy</div>==
 
==<div style="font-family:Open Sans, Arial, sans-serif;font-size:17px">Sensors' Performance Accuracy</div>==
 +
Checking the performance accuracy of the sensors requires some basic knowledge of the measures that are being acquired by the sensors. The three measures I am looking into here are temperature, pressure and humidity.
 +
# <b>Temperature in Bulgaria: </b> The highest ever [https://www.kashkaval-tourist.com/weather-in-bulgaria-all-year-guide/ temperature] recorded is 45.2 degrees Celsius, while the lowest ever recorded is -38.3 degrees Celsius. When the sensors recorded temperatures nearing 400 degrees Celsius and -5K Celsius, they are obviously not behaving accurately.
 +
# <b> Pressure across the world: </b> Human beings have a limited range of atmospheric pressure that they can bear and survive in. [https://www.engineeringtoolbox.com/air-altitude-pressure-d_462.html Above 46.4kPa] we cannot survive as the atmospheric pressure is vital to our survival in terms of breathing and other functions. Similarly, we cannot survive in a vacuum. Interestingly there is a record from a sensor that measures atmospheric pressure nearing 0. This is definitely an anomaly as an absolute vacuum will actually provide lethal damage to the human lungs.
 +
# <b> Humidity across the world: </b>  [https://en.wikipedia.org/wiki/Humidity Relative humidity] is always recorded as a percentage value, therefore values of 898 and -999 are just not possible or anywhere near accurate. Additionally, we see some humidity values of around 110 as well.
 +
[[File:Performance Accuracy.jpg|550px|center]]
  
 +
==<div style="font-family:Open Sans, Arial, sans-serif;font-size:17px">Air Pollution Measurements Analysis</div>==
 +
Looking into the air pollution measurements, I chose to plot the sensor data based on location and based on time, for a changeable time granularity. Using an average of P1 and P2, as well as the P1:P2 ratio, I built a dashboard which could be used to analyse at the time of interest, area of interest and value of interest. It is important to note that the number of sensors changes over 2017 to 2018, so they must be looked at separately. The [https://public.tableau.com/profile/mallika.gokarn7278#!/vizhome/Spatio-temporalAnalysisofCitizenScienceAirQualityMeasurements/Spatio-temporalAnalysisofCitizenScienceAirQualityMeasurements resultant dashboard] looks as shown below:
 +
[[File:SpatioTemporal Analysis.jpg|550px|center]]
  
* Characterize the sensors’ coverage, performance and operation. Are they well distributed over the entire city? Are they all working properly at all times? Can you detect any unexpected behaviours of the sensors by analyzing the readings they capture? Limit your response to no more than 4 images and 600 words.
+
It is interesting to see that the city shows relatively higher readings in the northern-central part of Sofia city. This can further be drilled down based on the specific hour of the day, to see that the areas of pollution change from morning to night based on denser population and increased human activity.  
* Now turn your attention to the air pollution measurements themselves. Which part of the city shows relatively higher readings than others?  Are these differences time-dependent? Limit your response to no more than 6 images and 800 words.
+
P1 and P2 follow a positive relationship wherein when one is high the other is also high. Additionally, the insights from Task 1 are further confirmed with regards to the activity in the Winter months and the festivities.
  
 
==<div style="font-family:Open Sans, Arial, sans-serif;font-size:17px">References</div>==
 
==<div style="font-family:Open Sans, Arial, sans-serif;font-size:17px">References</div>==
# Air quality index. (2018, May 04). Retrieved from https://www.eea.europa.eu/themes/air/air-quality-index/index#tab-based-on-data
+
# Bozhinov, T. (2018, August 15). The weather in Bulgaria: An all-year guide. Retrieved from https://www.kashkaval-tourist.com/weather-in-bulgaria-all-year-guide/
# Air Quality Standards. (n.d.). Retrieved from http://ec.europa.eu/environment/air/quality/standards.htm
+
# (n.d.). Retrieved from https://www.engineeringtoolbox.com/air-altitude-pressure-d_462.html
# Bulgaria Celebrates with Christmas Eve Traditions. (n.d.). Retrieved from https://www.novinite.com/articles/135151/Bulgaria Celebrates with Christmas Eve Traditions
+
# Humidity. (2018, October 30). Retrieved from https://en.wikipedia.org/wiki/Humidity
#
+
 
# Climate - Bulgaria. (n.d.). Retrieved from https://www.climatestotravel.com/climate/Bulgaria
+
=<div style="background: #581845; padding: 15px; line-height: 0.3em; text-indent: 15px; font-size:18px; font-family:Open Sans, Arial, sans-serif"><font color= #ffffff>Task 3: Air Quality Measurements with respect to environmental factors</font></div>=
# Public Holidays in Bulgaria in 2018. (n.d.). Retrieved from https://www.officeholidays.com/countries/bulgaria/index.php
+
 
# Bulgaria: Kresna Gorge forest fires lead to more evacuations. (2017, August 30). Retrieved from https://sofiaglobe.com/2017/08/28/bulgaria-kresna-gorge-forestfires-lead-to-more-evacuations/
+
I found the environmental factors to be most interesting in their effect on the air quality. It would be interesting to understand whether air quality, once bad, is further exacerbated or dissipated by changes in environmental factors. For this purpose I looked into the Meteo dataset and attempted to understand the correlation of P1 and P2, respectively, with the
 +
* daily average temperature
 +
* daily average dew point
 +
* daily average relative humidity
 +
* daily average wind speed
 +
* daily average surface pressure
 +
* daily average precipitation amount
 +
 
 +
I chose to look into the P1 and P2 measurements separately in case there are any features that could be lost through aggregation into an average value between the two. The [https://public.tableau.com/profile/mallika.gokarn7278#!/vizhome/AirQualityMeasurementswithrespecttotheenvironmentalfactors/P2ConcentrationagainstEnvironmentalFactors resultant dashboard] for P1 values looks as shown below:
 +
[[File:P1 Dashboard.jpg|550px|center]]
 +
The resultant dashboard for P2 values looks as shown below:
 +
[[File:P2 Dashboard.jpg|550px|center]]
 +
 
 +
These visualizations show that surface pressure has little to no effect on the concentration of PM10 and PM2.5. Wind speed, temperature and dew point temperature, on the other hand, are negatively correlated with it. While a sweeping statement cannot be made regarding humidity and precipitation, we can infer that when the temperature is low, air quality measures are high due to fuel burning and, as a result of the heat, precipitation and humidity are low.
  
-->
+
=<div style="background: #581845; padding: 15px; line-height: 0.3em; text-indent: 15px; font-size:18px; font-family:Open Sans, Arial, sans-serif"><font color= #ffffff>Conclusion</font></div>=
  
=Task 3=
+
In conclusion, based on the data visualisation, we can deduce that the air pollution concentration in Sofia City is relatively high and hazardous. In particular, the concentration spikes during the festive months of January, February, November and December. Apart from these spikes, Sofia City is also heavily polluted due to the high volume of pollutants from coal-fired plants and thermal plants.
  
Urban air pollution is a complex issue. There are many factors affecting the air quality of a city. Some of the possible causes are:
+
The sensors from the Citizen Science group are mainly clustered in the central area of Sofia City, similar to the population trend. While there are issues with the functioning and performance of these sensors, they help to substantiate the data collected by the official stations and show that the insights from the station data hold here as well. The added benefit of having extra sensors is that we can analyse the population movement and human activity during the day through their heat signature and thereby air pollution.
  
* Local energy sources.  For example, according to [http://unmaskmycity.org/project/sofia/ Unmask My City], a global initiative by doctors, nurses, public health practitioners, and allied health professionals dedicated to improving air quality and reducing emissions in our cities, Bulgaria’s main sources of PM10, and fine particle pollution PM2.5 (particles 2.5 microns or smaller) are household burning of fossil fuels or biomass, and transport.
+
Environmental factors such as wind speed and precipitation also contribute to the high air pollution. With faster wind speeds, the particles that pollute the air are carried out of Sofia City, which in the long run helps reduce air pollution. Conversely, a slower wind speed leads to a relatively higher air pollution concentration due to the stillness of the air.
* Local meteorology such as temperature, pressure, rainfall, humidity, wind etc
 
* Local topography
 
* Complex interactions between local topography and meteorological characteristics.
 
* Transboundary pollution, for example, the haze that intruded into Singapore from our neighbours.
 
  
In this third task, you are required to reveal the relationships between the factors mentioned above and the air quality measure detected in Task 1 and Task 2. Limit your response to no more than 5 images and 600 words.
+
In order to improve the air pollution issue in Sofia City, the governmental authorities will need to take steps to cut back on air pollution from factories and industries. Additionally, more rules and policies should be passed by the government to push Sofia City, Bulgaria towards the goal of being a greener city.
  
 
=<div style="background: #581845; padding: 15px; line-height: 0.3em; text-indent: 15px; font-size:18px; font-family:Open Sans, Arial, sans-serif"><font color= #ffffff>Software</font></div>=
 
=<div style="background: #581845; padding: 15px; line-height: 0.3em; text-indent: 15px; font-size:18px; font-family:Open Sans, Arial, sans-serif"><font color= #ffffff>Software</font></div>=
Line 216: Line 238:
 
* Tableau - for visualization of the various tasks
 
* Tableau - for visualization of the various tasks
 
* Python - for geocoding  
 
* Python - for geocoding  
</div>
 
=<div style="background: #581845; padding: 15px; line-height: 0.3em; text-indent: 15px; font-size:18px; font-family:Open Sans, Arial, sans-serif"><font color= #ffffff>References</font></div>=
 
<div style="font-family:Open Sans, Arial, sans-serif;font-size:12px">
 
*
 
 
</div>
 
</div>

Latest revision as of 02:50, 12 November 2018

Problem and Motivation

Air pollution is the single largest environmental health risk in Europe as well as the rest of the world. It refers to the contamination of the atmosphere by harmful chemicals or biological materials. The scale of the risk is backed by the many metrics that point to air pollution as a primary cause of disease (the deadliest of which include cancer) and death, with health effects that can be both short term and long term. For example, it is estimated that 7 million people across the world died prematurely due to air pollution. In the European Union alone, 400,000 people suffered a premature death.

The level of air pollution across the world is only increasing. Air pollution damages crops, animals, forests, and bodies of water. It also contributes to the depletion of the ozone layer, which protects the Earth from the sun’s UV rays, and thereby does significant harm to the planet. Within the European Union, Bulgaria is one of the countries with the highest three-year average PM2.5 concentration in urban areas. Bulgaria also ranks among the most polluted countries on the PM10 measure, with a daily mean concentration of 77 μg/m3, which is much higher than both the WHO limit and the EU limit (50 μg/m3).

A major concern in Bulgaria now is how clean the air that people breathe actually is. Measurements show that citizens all over the country breathe air that is considered harmful to health. The Organization for Economic Cooperation and Development (OECD) predicts that in 2050 outdoor air pollution will be the top cause of environmentally related deaths worldwide.

To solve the problem of air pollution, it’s necessary to understand the issues and look for ways to counter them. Therefore, the aim of this assignment is to reveal the spatiotemporal patterns of air quality and measurement techniques in Sofia City of Bulgaria, thereby identifying issues of concern.

Dataset Analysis and Transformation Process

Dataset Download

Four major data sets in zipped file format are used and are available for download below:

  • Official air quality measurements (5 stations in the city)(EEA Data.zip) – as per EU guidelines on air quality monitoring see the data description here
  • Citizen science air quality measurements (Air Tube.zip), incl. temperature, humidity and pressure (many stations) and topography (gridded data).
  • Meteorological measurements (1 station)(METEO-data.zip): Temperature; Humidity; Wind speed; Pressure; Rainfall; Visibility
  • Topography data (TOPO-DATA)

They can be downloaded by clicking on this link.

Data Description and Understanding

Official Air Quality Measurements

The downloaded zip file has 30 files under EEA data, including 1 metadata file, 1 readme.txt and 28 station data files in a .csv format. The format for data contained by the station files:

EEA stations data.jpg

The format for data contained by the metadata file is as below:

EEA metadata.jpg
Citizen Science Air Quality Measurements

The downloaded zip file has 3 files under AirTube data, including 1 sample file, 1 .csv.gz file for 2017 and 1 for 2018. The format for data contained by these sensor data files looks as follows:

AirTube Data.jpg
METEO Data

The downloaded zip file has 1 file under METEO data which looks as follows:

Metero-data.jpg

Data Cleaning and Transformation

Problem #1 EEA Data Building Issues
Issue The official air quality measurement readings (EEA data) do not include the longitude and latitude of the place of measurement. Instead, they are contained in a separate metadata file. Additionally, each station's recordings for a specific year are stored in separate .csv files.
Solution Append all the files together through a Tableau Union. Eliminate data for station 9484, the station named "Orlov Most", because its data from 2016 onwards is missing. Additionally, I choose to exclude the data for station 60881, the station named "Mladost", solely because its data only covers 2018 onwards and it can be considered a new addition to the station list.


Lastly, an inner join of the union and the metadata file is conducted. This is done in order to assign the respective longitude and latitudes to all the rows, based on their respective Air Quality Stations. Therefore the join is done based on the EoI Code.
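
The union and join above were carried out in Tableau. As a rough equivalent, the sketch below shows the same two steps in plain Python. The column name `AirQualityStationEoICode`, the station codes and the sample values are assumptions made for illustration and are not taken from the actual files.

```python
def union_rows(*tables):
    """Append several lists of row dicts into one list (a simple union)."""
    combined = []
    for table in tables:
        combined.extend(table)
    return combined


def inner_join(rows, metadata, key):
    """Keep only rows whose key appears in the metadata and attach its fields."""
    lookup = {m[key]: m for m in metadata}
    joined = []
    for row in rows:
        match = lookup.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


# Hypothetical sample data standing in for the per-station, per-year CSV files.
KEY = "AirQualityStationEoICode"  # assumed column name
station_2016 = [{KEY: "STA_A", "Concentration": 41.0}]
station_2017 = [
    {KEY: "STA_A", "Concentration": 55.5},
    {KEY: "STA_B", "Concentration": 30.2},  # no metadata row, dropped by the join
]
metadata = [{KEY: "STA_A", "Latitude": 42.70, "Longitude": 23.32}]

readings = union_rows(station_2016, station_2017)
located = inner_join(readings, metadata, KEY)
```

Because the join is an inner join, any reading whose station is absent from the metadata is silently dropped, which is worth checking before relying on the joined row count.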

Problem #2 EEA TimeFrame for Analysis
Issue Missing data is an issue because seasonality is pivotal to understanding the trends in air pollution. Data is missing for much of 2017 (further elaborated on in Task 1). Additionally, the stations only start to record at the hourly level from 2016 onwards, which affects any hourly analysis.
Solution For the daily analysis, use the "DateTimeEnd" variable and leave the data as is; the generalizations and insights may change if new data is uncovered.


For the hourly analysis, use data only from November 2017 onwards. This tackles the sparsity of data in 2016's hourly analysis as well as the missing data in 2017.

Problem #3 AirTube Data Building Issues
Issue The citizen science air quality measurement readings (AirTube data) do not include the longitude and latitude of the place of measurement. Instead, they are contained in the form of a geohash code. Unfortunately, Tableau is not built to handle geohash code.
Solution Making use of the GitHub python geohash2 library [1] I am able to write a python script that can do the decoding for me, taking into consideration the error of transformation as well. The code snippet is as below:
Codesnippet1.jpg


Upon importing the decoded dataset into Tableau, I found 4 points that have latitude and longitude values of 0.000000, as well as 1 point that has a latitude value of -4.025953 and a longitude value of 78.751781. As none of these 5 points is anywhere near Bulgaria or Sofia City, I have excluded them from the dataset as a whole.
Tackling geocode issues
There are sensors plotted to be outside the boundaries of Sofia city. Most of the citizen science sensors as well as the stations from EEA data both appear to be in the central area of Sofia city. Therefore I choose to exclude the sensor locations outside Sofia City boundaries, using Tableau's lasso selection.
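
The original decoding script is only available as an image, so the following is a minimal stand-in rather than the author's code. It is a hand-written, pure-Python decoder for the standard geohash encoding that returns the cell centre together with the half-width of the cell as the decoding error. The bounding box used to flag points far from Bulgaria is a rough assumption chosen for illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def decode_exactly(geohash):
    """Decode a geohash into (lat, lon, lat_err, lon_err)."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    is_lon = True  # bits alternate between axes, starting with longitude
    for char in geohash:
        value = BASE32.index(char)
        for shift in range(4, -1, -1):  # each character carries 5 bits
            bit = (value >> shift) & 1
            rng = lon_range if is_lon else lat_range
            mid = (rng[0] + rng[1]) / 2
            if bit:
                rng[0] = mid  # keep the upper half
            else:
                rng[1] = mid  # keep the lower half
            is_lon = not is_lon
    lat = (lat_range[0] + lat_range[1]) / 2
    lon = (lon_range[0] + lon_range[1]) / 2
    lat_err = (lat_range[1] - lat_range[0]) / 2
    lon_err = (lon_range[1] - lon_range[0]) / 2
    return lat, lon, lat_err, lon_err


# Assumed approximate bounding box around Bulgaria (illustrative only).
LAT_MIN, LAT_MAX = 41.0, 44.5
LON_MIN, LON_MAX = 22.0, 29.0


def in_bounding_box(lat, lon):
    """Return True if the point lies inside the assumed bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX
```

Points such as (0.000000, 0.000000) or (-4.025953, 78.751781) fail the bounding-box check and could be dropped before the data is imported into Tableau.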

Problem #4 AirTube Data Outliers and Noise Removal
Issue The citizen science air quality measurement readings (AirTube data) have multiple "wrong" readings, some of which are noise while others are representative of broken sensors. Through a simple internet search, one can find that the lowest temperature Bulgaria has ever recorded is -38.3 degrees Celsius, while the highest is 45.2 degrees Celsius.
Solution In order to remove the noise and outliers, the recorded temperatures above 50 degrees Celsius and below -40 degrees Celsius are removed. Further changes are made to the pressure and humidity variables as well, as is discussed in task 2.
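
The temperature filter described above can be sketched as follows. This is a minimal illustration in plain Python; the field name `temperature` and the sample rows are assumptions, while the -40 and 50 degree thresholds come from the text.

```python
TEMP_MIN_C = -40.0  # readings below this are treated as noise
TEMP_MAX_C = 50.0   # readings above this are treated as noise


def remove_temperature_outliers(rows, low=TEMP_MIN_C, high=TEMP_MAX_C):
    """Keep only rows whose temperature lies within [low, high]."""
    return [row for row in rows if low <= row["temperature"] <= high]


# Hypothetical sample readings, including two implausible values.
sample = [
    {"temperature": 21.5},
    {"temperature": 400.0},
    {"temperature": -5000.0},
    {"temperature": 50.0},
    {"temperature": -40.0},
]
cleaned = remove_temperature_outliers(sample)
```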
Problem #5 Meteo Data Building Issues
Issue The data currently has year, day and month as separate variables. Additionally, there is no information on the air quality measurements.
Solution Firstly, I used a python script to build the full date from the separate year, month and day variables so that it is in a comparable date format. I then took the unioned Air Tube data and extracted the average air quality measure per unique date. Lastly, I imported both resultant csv files with an inner join on the date. The code snippet used is below:
Codesnippet2.jpg
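The three steps can be sketched with the standard library as follows. This is an illustration rather than the script in the screenshot: the column names (`year`, `month`, `day`, `temp_avg`, `P1`) and the sample rows are assumptions.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Made-up sample rows; the column names are assumptions.
meteo_rows = [
    {"year": 2018, "month": 1, "day": 18, "temp_avg": -3.0},
    {"year": 2018, "month": 1, "day": 19, "temp_avg": -1.0},
]
airtube_rows = [
    {"date": date(2018, 1, 18), "P1": 80.0},
    {"date": date(2018, 1, 18), "P1": 120.0},
    {"date": date(2018, 1, 20), "P1": 30.0},
]

# Step 1: build a full date from the separate year, month and day fields.
for row in meteo_rows:
    row["date"] = date(row["year"], row["month"], row["day"])

# Step 2: average the air quality measure per unique date.
p1_by_date = defaultdict(list)
for row in airtube_rows:
    p1_by_date[row["date"]].append(row["P1"])
daily_p1 = {day: mean(values) for day, values in p1_by_date.items()}

# Step 3: inner join, keeping only dates present in both datasets.
joined = [
    {**row, "P1_avg": daily_p1[row["date"]]}
    for row in meteo_rows
    if row["date"] in daily_p1
]
print(joined)
```

Because the join is an inner join, dates that appear in only one of the two datasets are dropped.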

Task 1: Spatio-temporal Analysis of Official Air Quality

Analysis

I started by bringing in the EEA data for the years 2013 to 2018. The aim is to visualize the average concentration on a calendar heatmap in order to understand the outliers and any potential anomalies. The heatmap shows that data is missing across all stations for the period from 1 January 2017 to 28 November 2017.

OriginalHeatmap.jpg

The heatmap above suggests a possible trend during the winter months, from November onwards. However, the trend is conveyed only by the assigned palette, so properly defined bin boundaries are required to see a trend that we can make sense of.

I therefore made use of the legend that accompanies the map of the European Air Quality Index for the year 2017. This legend is defined by the European Environment Agency. From it, I built the binning criteria shown below:

Lower Bound (inclusive, μg/m3) | Upper Bound (exclusive, μg/m3) | Label
- | 20 | Good
20 | 35 | Moderate
35 | 50 | Unhealthy for Sensitive Groups
50 | 100 | Unhealthy
100 | - | Hazardous
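The binning criteria above can be expressed as a small lookup, sketched here with the standard library's `bisect` module. The function name `pm10_category` is hypothetical; in the analysis itself the bins were built in Tableau.

```python
import bisect

# Bin edges in μg/m3; each edge is the inclusive lower bound of the next label.
BIN_EDGES = [20, 35, 50, 100]
BIN_LABELS = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Hazardous",
]


def pm10_category(concentration):
    """Map a daily average concentration to its label."""
    # bisect_right puts a value equal to an edge into the higher bin,
    # which matches "lower bound inclusive, upper bound exclusive".
    return BIN_LABELS[bisect.bisect_right(BIN_EDGES, concentration)]


print(pm10_category(19.9))  # Good
print(pm10_category(50))    # Unhealthy
```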

It is important to note that the daily limit for Bulgaria is 50 μg/m3, with 35 permitted exceedances each year, as defined by the EEA. It is therefore important that the visualization clearly pinpoints the days on which the concentration exceeds 50 μg/m3. This separates them from the days on which residents of Sofia City are breathing healthy air. Based on the above bins, a colour scale can be developed, allowing us to visualize a typical day in Sofia City. The resulting visualization is shown below:
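To make the limit concrete, the sketch below counts the days per year on which the daily average exceeds 50 μg/m3 and checks the count against the 35 permitted exceedances. The input layout (pairs of ISO date string and concentration) and the sample values are assumptions made for illustration.

```python
from collections import Counter

DAILY_LIMIT = 50.0        # μg/m3, daily limit quoted in the text
ALLOWED_EXCEEDANCES = 35  # permitted exceedance days per year, as quoted in the text


def exceedance_days(daily_means):
    """Count, per year, the days whose mean concentration exceeds the daily limit.

    daily_means: iterable of (iso_date, concentration) pairs, e.g. ("2013-12-25", 180.0).
    """
    counts = Counter()
    for iso_date, concentration in daily_means:
        if concentration > DAILY_LIMIT:
            counts[iso_date[:4]] += 1  # the first four characters are the year
    return counts


# Made-up sample values.
sample = [
    ("2013-12-24", 48.0),
    ("2013-12-25", 180.0),
    ("2014-01-20", 95.0),
    ("2014-07-01", 18.0),
]
counts = exceedance_days(sample)
over_limit_years = [year for year, days in counts.items() if days > ALLOWED_EXCEEDANCES]
print(dict(counts), over_limit_years)
```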

Calendar HeatMap Final.jpg

This second categorization shows that, true to the reputation of Bulgaria, Sofia City too has a very high concentration of PM10. This holds throughout the year, with a dip in the summer months of May, June and July. More importantly, there are spikes in January and December. The global maximum of all the data falls on 25th December 2013, for which there is a cultural explanation: the article "Bulgaria Celebrates with Christmas Eve Traditions" (reference 3) states that “Strict tradition demanded that a fire be built in the hearth, with enough wood to burn all night and into Christmas Day, to help with the new birth of the sun.”

Control Plot.jpg

The calendar heatmap highlights the overall daily trend of high PM10 pollution. However, to better visualize the size of the spikes between days, a control plot is more intuitive. It is noticeable that there are spikes between 18th and 24th January, as well as at Christmas, each year. The high air pollution at Christmas has already been explained by the cultural traditions above. While there are no significant public holidays during the days of interest in January, I wondered whether 18th to 24th January are consistently the coldest days of the year in Bulgaria. Although I have not found specific data showing that these dates are the coldest of the year, the average temperature recorded for the month of January is -5 to 2 degrees Celsius.
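The idea behind a control plot can be sketched as follows: compute the mean and standard deviation of the daily averages, then flag the days that fall outside a band around the mean. The width of the band (two standard deviations here) and the sample values are assumptions; the source does not state which limits the Tableau plot uses.

```python
from statistics import mean, stdev


def control_limits(values, n_sigma=2.0):
    """Return (lower limit, centre line, upper limit) for a control plot."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation
    return centre - n_sigma * spread, centre, centre + n_sigma * spread


def out_of_control(values, n_sigma=2.0):
    """Return the indices of the values outside the control limits."""
    lower, _, upper = control_limits(values, n_sigma)
    return [i for i, value in enumerate(values) if value < lower or value > upper]


# Made-up daily averages in μg/m3 with one spike at the end.
daily_pm10 = [38, 42, 40, 41, 39, 40, 43, 37, 40, 180]
spikes = out_of_control(daily_pm10)
print(spikes)  # [9]
```

Note that a single large spike also inflates the standard deviation, so on short series a narrower band is needed to flag it.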

Taking this into account, residents of Bulgaria might be more inclined to light fires to get through the cold. Additionally, I found that forest fires are not rare in Bulgaria, and these could contribute significantly to the deteriorating air quality.

Lastly, the pattern of air pollution across the day reflects human activity during the day. This would likely include the burning of motor vehicle fuel during the morning and evening hours, when people travel to and from work. Additionally, since the night-time and early morning hours are the coldest during the winter season, it can be speculated that solid fuel is also burnt for heating at those times. These two reasons could explain the trend within a given day, as shown in the heat map below. (Note that data is only available from November 2017 until September 2018.)

Average hourly heatmap.jpg

The final dashboard I designed to represent the spatio-temporal analysis of the official air quality measurements is shown below:

Final dashboard.jpg

This dashboard shows each day's average measurement, binned by pollution level. Additionally, the difference between days can be seen in the control plot, which highlights all outliers relative to the average measurements for Sofia City. The most important outliers are the maximum values (peaks) each year. Interactivity is provided through the selection of station and year.


References

  1. Air quality index. (2018, May 04). Retrieved from https://www.eea.europa.eu/themes/air/air-quality-index/index#tab-based-on-data
  2. Air Quality Standards. (n.d.). Retrieved from http://ec.europa.eu/environment/air/quality/standards.htm
  3. Bulgaria Celebrates with Christmas Eve Traditions. (n.d.). Retrieved from https://www.novinite.com/articles/135151/Bulgaria Celebrates with Christmas Eve Traditions
  4. Climate - Bulgaria. (n.d.). Retrieved from https://www.climatestotravel.com/climate/Bulgaria
  5. Public Holidays in Bulgaria in 2018. (n.d.). Retrieved from https://www.officeholidays.com/countries/bulgaria/index.php
  6. Bulgaria: Kresna Gorge forest fires lead to more evacuations. (2017, August 30). Retrieved from https://sofiaglobe.com/2017/08/28/bulgaria-kresna-gorge-forestfires-lead-to-more-evacuations/

Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements

Sensor's Coverage

I brought in the sensors for the Sofia region only and used the Density Marks feature of Tableau 2018.3 to plot the coverage of the sensors. I found that the sensors are generally placed in central Sofia, with a lack of coverage in the south-eastern and north-eastern regions. The number of sensors increases from 2017 to 2018; however, the increase is mainly concentrated in the central region, with little attention to the areas that lack coverage.
Below is the coverage for the year 2017.

Coverage 2017.jpg

Below is the coverage for the year 2018.

Coverage.jpg

Sensors' Stable Operations

To understand whether the sensors operate properly at all times, I looked at the time series of the number of records captured by the sensors per hour. This shows that the number of records, and thereby of sensors, has indeed increased over time. However, the sensors are not working properly all the time, as shown by the sudden steep drops and sudden spikes in the number of records collected. Perhaps malfunctioning sensors are taking far fewer or far more records than they are configured to take. A closer look at the time axis shows that these dips started in 2018. This could be due to the new sensors added in 2018: perhaps some members of the new group of sensors are not working correctly.
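A check of this kind can be sketched by counting records per hour and flagging hours whose count falls sharply compared with the previous hour. The 50% threshold, the function names and the sample timestamps are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime


def records_per_hour(timestamps):
    """Count records in each clock hour."""
    return Counter(ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps)


def sudden_drops(counts, ratio=0.5):
    """Return the hours whose count is below `ratio` times the previous observed hour.

    Hours with no records at all do not appear in `counts`, so a complete
    outage shows up as a gap between hours rather than as a flagged hour.
    """
    hours = sorted(counts)
    return [
        current
        for previous, current in zip(hours, hours[1:])
        if counts[current] < ratio * counts[previous]
    ]


# Made-up timestamps: four records at 10:00, four at 11:00, one at 12:00.
base = datetime(2018, 3, 1)
timestamps = (
    [base.replace(hour=10, minute=m) for m in (0, 15, 30, 45)]
    + [base.replace(hour=11, minute=m) for m in (0, 15, 30, 45)]
    + [base.replace(hour=12, minute=5)]
)
counts = records_per_hour(timestamps)
drops = sudden_drops(counts)
print([hour.hour for hour in drops])  # [12]
```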

Stability Operation.jpg

Sensors' Performance Accuracy

Checking the performance accuracy of the sensors requires some basic knowledge of the measures that are being acquired by the sensors. The three measures I am looking into here are temperature, pressure and humidity.

  1. Temperature in Bulgaria: The highest temperature ever recorded is 45.2 degrees Celsius, while the lowest is -38.3 degrees Celsius. Sensors that record temperatures nearing 400 degrees Celsius or -5,000 degrees Celsius are obviously not behaving accurately.
  2. Pressure across the world: Human beings can only survive within a limited range of atmospheric pressure. Standard pressure at sea level is about 101 kPa, and below roughly 46.4 kPa we cannot survive for long, as atmospheric pressure is vital for breathing and other functions. Similarly, we cannot survive in a vacuum. Interestingly, one sensor reports an atmospheric pressure near 0. This is definitely an anomaly, as an absolute vacuum would cause lethal damage to human lungs.
  3. Humidity across the world: Relative humidity is always recorded as a percentage, so values of 898 and -999 are simply not possible. Additionally, there are some humidity values of around 110, which are also above 100%.
Performance Accuracy.jpg
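The three checks above can be combined into one routine that reports which measures of a reading are implausible. This is only a sketch: the field names, the pressure unit (kPa) and the pressure bounds are assumptions, and the temperature bounds are the cut-offs used during data preparation.

```python
# Plausible ranges; the pressure bounds and the pressure unit (kPa) are assumptions.
PLAUSIBLE_RANGES = {
    "temperature": (-40.0, 50.0),  # degrees Celsius, cut-offs from data preparation
    "pressure": (50.0, 110.0),     # kPa, assumed bounds
    "humidity": (0.0, 100.0),      # percent
}


def implausible_fields(reading):
    """Return the names of the measures in `reading` that fall outside their range."""
    return [
        name
        for name, (low, high) in PLAUSIBLE_RANGES.items()
        if name in reading and not low <= reading[name] <= high
    ]


# Made-up sample readings.
readings = [
    {"temperature": 21.5, "pressure": 95.0, "humidity": 60.0},
    {"temperature": 400.0, "pressure": 0.0, "humidity": 898.0},
    {"temperature": 5.0, "pressure": 94.8, "humidity": 110.0},
]
flags = [implausible_fields(reading) for reading in readings]
print(flags)
```

Reporting the failing measure, rather than only dropping the row, makes it easier to tell a single noisy value from a sensor that is broken across all measures.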

Air Pollution Measurements Analysis

Looking into the air pollution measurements, I chose to plot the sensor data by location and by time, with an adjustable time granularity. Using the average of P1 and P2, as well as the P1:P2 ratio, I built a dashboard that can be used to analyse a time, area and value of interest. It is important to note that the number of sensors changes from 2017 to 2018, so the two years must be looked at separately. The resulting dashboard is shown below:
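The two derived measures can be sketched as below. In the analysis they were built in Tableau; the function name `pollution_summary` is hypothetical, and returning `None` for the ratio when P2 is zero is a design choice of this sketch.

```python
def pollution_summary(p1, p2):
    """Return (average of P1 and P2, P1:P2 ratio) for one reading."""
    average = (p1 + p2) / 2
    ratio = p1 / p2 if p2 else None  # the ratio is undefined when P2 is zero
    return average, ratio


print(pollution_summary(60.0, 30.0))  # (45.0, 2.0)
```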

SpatioTemporal Analysis.jpg

It is interesting to see relatively higher readings in the north-central part of Sofia City. This can be drilled down further by hour of the day, to see that the areas of pollution change from morning to night according to population density and human activity. P1 and P2 have a positive relationship: when one is high, the other is also high. Additionally, the insights from Task 1 regarding activity in the winter months and the festivities are further confirmed.

References

  1. Bozhinov, T. (2018, August 15). The weather in Bulgaria: An all-year guide. Retrieved from https://www.kashkaval-tourist.com/weather-in-bulgaria-all-year-guide/
  2. (n.d.). Retrieved from https://www.engineeringtoolbox.com/air-altitude-pressure-d_462.html
  3. Humidity. (2018, October 30). Retrieved from https://en.wikipedia.org/wiki/Humidity

Task 3: Air Quality Measurements with respect to environmental factors

I found the environmental factors to be most interesting in their effect on air quality. It would be interesting to understand whether poor air quality is further exacerbated or relieved by changes in environmental factors. For this purpose I looked into the Meteo dataset and attempted to understand the correlation of P1 and P2, respectively, with the

  • daily average temperature
  • daily average dew point
  • daily average relative humidity
  • daily average wind speed
  • daily average surface pressure
  • daily average precipitation amount

I chose to look into the P1 and P2 measurements separately in case there are any features that could be lost through aggregation into an average value between the two. The resultant dashboard for P1 values looks as shown below:

P1 Dashboard.jpg

The resultant dashboard for P2 values looks as shown below:

P2 Dashboard.jpg

What can be seen from these visualizations is that surface pressure has little to no effect on the concentration of PM10 and PM2.5. Wind speed, temperature and dew point temperature, on the other hand, are negatively correlated with these concentrations. While a sweeping statement cannot be made regarding humidity and precipitation, we can infer that when temperature is low, air quality measures are high due to fuel burning, and that precipitation and humidity are low as a result of the heat from that burning.
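A negative correlation of this kind can be quantified with the Pearson correlation coefficient. The sketch below implements it directly; the sample values are made up to show a perfectly negative relationship and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return covariance / sqrt(var_x * var_y)


# Made-up daily averages: higher wind speed paired with lower P1.
wind_speed = [1.0, 2.0, 3.0, 4.0]
p1 = [80.0, 60.0, 40.0, 20.0]
print(pearson(wind_speed, p1))  # -1.0
```

Values near -1 indicate a strong negative relationship, values near 0 little linear relationship, and values near 1 a strong positive one.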

Conclusion

In conclusion, based on the data visualisation, we can deduce that air quality in Sofia City is relatively poor, with pollution concentrations that are at times hazardous. In particular, the concentration level spikes during the winter and festive months of January, February, November and December. Apart from these spikes, Sofia City is also heavily polluted due to the high volume of pollutants from coal-fired plants and thermal plants.

The sensors from the Citizen Science group are mainly clustered in the central area of Sofia City, similar to the population distribution. While there are issues with the functioning and performance of these sensors, they help to substantiate the data collected by the official stations and show that the insights from the station data hold here as well. The added benefit of having extra sensors is that we can analyse population movement and human activity during the day through their heat signature and thereby air pollution.

Environmental factors such as wind speed and precipitation also influence the level of air pollution. With faster wind speed, the particles that pollute the air are carried out of Sofia City, which would help reduce air pollution in the long run. Conversely, a slower wind speed leads to a relatively higher air pollution concentration level due to the stillness of the air.

To improve the air pollution issue in Sofia City, the governmental authorities will need to take steps to cut back on air pollution from factories and industries. Additionally, the government should pass more rules and policies to push Sofia City, Bulgaria, towards the goal of being a greener city.

Software

  • Tableau - for visualization of the various tasks
  • Python - for geocoding