IS428 AY2018-19T1 Lau Zi Quan

From Visual Analytics for Business Intelligence

To be a Visual Detective

Overview

Air pollution is an important risk factor for health in Europe and worldwide. A recent review of the global burden of disease showed that it is one of the top ten risk factors for health globally. Worldwide, an estimated 7 million people died prematurely because of pollution; in the European Union (EU), 400,000 people suffer a premature death. The Organisation for Economic Cooperation and Development (OECD) predicts that in 2050 outdoor air pollution will be the top cause of environmentally related deaths worldwide. In addition, air pollution has been classified as the leading environmental cause of cancer.

Air quality in Bulgaria is a big concern: measurements show that citizens all over the country breathe air that is considered harmful to health. For example, concentrations of PM2.5 and PM10 are much higher than the limits that the EU and the World Health Organization (WHO) have set to protect health.

Bulgaria had the highest PM2.5 concentrations of all EU-28 member states in urban areas over a three-year average. For PM10, Bulgaria also leads the list of most polluted countries, with a daily mean concentration of 77 μg/m3 (the EU limit value is 50 μg/m3).

According to the WHO, 60 percent of the urban population in Bulgaria is exposed to dangerous (unhealthy) levels of particulate matter (PM10).

Dataset Analysis & Transformation Process


Sofia Dataset

Four major data sets in zipped file format are provided. They can be downloaded by clicking on this link.

  • Official air quality measurements (5 stations in the city) (EEA Data.zip) – as per EU guidelines on air quality monitoring; see the data description HERE…
  • Citizen science air quality measurements (Air Tube.zip), incl. temperature, humidity and pressure (many stations) and topography (gridded data).
  • Meteorological measurements (1 station) (METEO-data.zip): Temperature; Humidity; Wind speed; Pressure; Rainfall; Visibility
  • Topography data (TOPO-DATA)


Dataset Data Attributes Rationale Of Usage
EEA Data.zip
EEA EDA.png
  1. Official air quality measurements (5 stations in the city)
  2. There is a Metadata.xlsx file which contains information of stations.
  3. There are a total of 6 Stations.
  4. Druzhba, Hipodruma, IAOS/Pavlovo and Nadezhda have data from 2013 to 2018.
  5. Mladost only has data for 2018; it may be the replacement for the OrlovMost station, which only has data from 2013 to 2015.
  6. All data from EEA are validated, valid and verified.
Air Tube.zip
AirTube EDA.png
  1. Citizen science air quality measurements
  2. Because the information and coverage provided by official air quality measurements are limited, crowdsourcing data from citizens is important.
  3. These sensors were deployed all over Bulgaria from 2017 to 2018.
  4. Data from AirTube is expected to be noisy and requires cleaning and verification.
  5. P1 data is PM10 and P2 data is PM2.5
METEO-data.zip
Meteo EDA.png
  1. Data collected by Meteorological measurements
  2. There are a total of 2449 records in this dataset.
  3. This data is collected at Sofia Airport and is measured at a standard 2-meter height. Pressure data is adjusted to sea-level.
  4. Data have undergone QA and should be without error.


Transformation Process

Issue 1 : Merging of Data Set
For the EEA data, there are a total of 28 CSV files from different station locations in Sofia across different years. In addition, there is an xlsx metadata file which contains important information such as CommonName (station name) and the Latitude and Longitude of each station. In order to proceed with the analysis, we need to merge all the data sets together.
Solution :
Method 1: Using Python Pandas:
I used the pandas read_csv function to load each file into a dataframe and then concatenated the dataframes. After concatenating the data, we can merge it with the metadata on StationEoICode to get more information on the stations.

MergeEEAPython.PNG


Using this method, I can export the merged data into a single CSV file and load it into different analytics tools for visualization.
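
The concatenate-then-merge steps of Method 1 can be sketched roughly as follows. This is a minimal illustration and not the code shown in the screenshot: the in-memory CSV text, the station codes, the coordinates and the Concentration and DatetimeBegin columns are all made-up assumptions, and StationEoICode is assumed to be the shared key in both the measurement files and the metadata.

```python
import io
import pandas as pd

# Illustrative stand-ins for two of the per-station, per-year CSV files.
# Column names and values are assumptions, not the real EEA schema.
file_a = io.StringIO(
    "StationEoICode,DatetimeBegin,Concentration\n"
    "ST001,2017-01-01 00:00,41.2\n"
)
file_b = io.StringIO(
    "StationEoICode,DatetimeBegin,Concentration\n"
    "ST002,2018-01-01 00:00,35.7\n"
)

# Illustrative stand-in for the station metadata sheet.
metadata = pd.DataFrame({
    "StationEoICode": ["ST001", "ST002"],
    "CommonName": ["Station A", "Station B"],
    "Latitude": [42.70, 42.65],
    "Longitude": [23.30, 23.38],
})

# Step 1: read every file and stack the rows into one dataframe.
frames = [pd.read_csv(f) for f in (file_a, file_b)]
measurements = pd.concat(frames, ignore_index=True)

# Step 2: attach the station details by joining on the station code.
merged = measurements.merge(metadata, on="StationEoICode", how="left")

# Step 3: export a single CSV (to a string buffer here instead of disk).
out = io.StringIO()
merged.to_csv(out, index=False)
```

With the real files, the list of inputs would presumably come from something like glob.glob over the extracted folder, and the metadata would be loaded with pd.read_excel. A left join keeps measurements even when a station code is missing from the metadata, which makes such gaps easy to spot.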
Method 2: Using Tableau Union Function:
Similarly, Tableau can combine the files using its union feature. Subsequently, we can inner join the result with the metadata on StationEoICode.

TabMergeEEA2.PNG

Issue 2 : Geohashing for Airtube Data
Solution : I used Python to convert the geohashes to coordinates, with reference to the Python geohash2 library (link).

GeoHashDecoder.PNG


After decoding the geohashes, I discovered noise in the data. One particular geohash, "m-2105171", could not be decoded, so I used this converter to obtain its latitude and longitude. There are also null values in the geohash data. In this case, I would probably remove these 5 records, as their geolocations are not in Sofia or even near Bulgaria.

GeoHashError.png
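
To illustrate what the decoding step does and why a string such as "m-2105171" fails, here is a small pure-Python sketch of the standard geohash decoding algorithm. It is a stand-in written for this explanation, not the geohash2 library used above; the function name decode_geohash is hypothetical, and returning None for invalid or missing input is one possible way to flag such records.

```python
# The 32-character geohash alphabet (no "a", "i", "l" or "o").
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def decode_geohash(code):
    """Return (lat, lon) of the cell centre, or None if the input is invalid."""
    if not isinstance(code, str) or not code:
        return None  # null or empty geohash
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # bits alternate, starting with longitude
    for ch in code.lower():
        idx = BASE32.find(ch)
        if idx == -1:
            return None  # character outside the alphabet, e.g. "-"
        # Each character carries 5 bits, most significant first.
        for shift in range(4, -1, -1):
            bit = (idx >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    return ((lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2)


print(decode_geohash("ezs42"))      # a well-known example cell in Spain
print(decode_geohash("m-2105171"))  # None, because "-" is not a valid character
```

Rows for which the function returns None can then be inspected by hand or dropped before plotting.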


Issue 3 : Outlier/Noise in Airtube Dataset
During the exploratory data analysis, I discovered extreme values in the dataset for certain measures. However, Task 2 requires exploring outliers in the Airtube data, so both the uncleaned and the cleaned dataset are used in Task 2 to compare the results.

Solution : I removed records whose temperature is above 50 degrees Celsius or below -50 degrees Celsius. About 25,068 records were removed.
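
The temperature filter described above could look something like the following pandas sketch. The sample values and the column name temperature are assumptions for illustration; the actual Airtube column name may differ.

```python
import pandas as pd

# Illustrative readings; the column name "temperature" is an assumption.
readings = pd.DataFrame({"temperature": [21.5, -3.0, 120.0, -75.0, 50.0]})

# Keep rows whose temperature lies within [-50, 50] degrees Celsius.
# between() is inclusive on both ends, so exactly 50 or -50 is kept.
# Note that missing temperatures evaluate to False and would be dropped too.
in_range = readings["temperature"].between(-50, 50)
cleaned = readings[in_range]

# Count how many rows were filtered out.
removed_count = int((~in_range).sum())
print(removed_count)
```

Keeping the boolean mask in its own variable makes it easy to report the number of removed records and to keep the uncleaned dataframe around for the outlier comparison.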


Task 1

Task 2

Task 3

Conclusion

Reference

Feedbacks