IS428 AY2018-19T1 Lau Zi Quan
To be a Visual Detective
Overview
Air pollution is an important risk factor for health in Europe and worldwide. A recent review of the global burden of disease showed that it is one of the top ten risk factors for health globally. Worldwide, an estimated 7 million people died prematurely because of pollution; in the European Union (EU), 400,000 people suffer a premature death. The Organisation for Economic Cooperation and Development (OECD) predicts that in 2050 outdoor air pollution will be the top cause of environmentally related deaths worldwide. In addition, air pollution has been classified as the leading environmental cause of cancer.
Air quality in Bulgaria is a big concern: measurements show that citizens all over the country breathe in air that is considered harmful to health. For example, concentrations of PM2.5 and PM10 are much higher than what the EU and the World Health Organization (WHO) have set to protect health.
Bulgaria had the highest PM2.5 concentrations of all EU-28 member states in urban areas over a three-year average. For PM10, Bulgaria also tops the list of most polluted countries, with a daily mean concentration of 77 μg/m3 (the EU limit value is 50 μg/m3).
According to the WHO, 60 percent of the urban population in Bulgaria is exposed to dangerous (unhealthy) levels of particulate matter (PM10).
Dataset Analysis & Transformation Process
Four major data sets are provided in zipped file format. They can be downloaded by clicking on this link.
- Official air quality measurements (5 stations in the city) (EEA Data.zip) – as per EU guidelines on air quality monitoring; see the data description HERE…
- Citizen science air quality measurements (Air Tube.zip), incl. temperature, humidity and pressure (many stations) and topography (gridded data).
- Meteorological measurements (1 station) (METEO-data.zip): temperature, humidity, wind speed, pressure, rainfall, visibility
- Topography data (TOPO-DATA)
| Dataset | Data Attributes | Rationale Of Usage |
|---|---|---|
Transformation Process
Issue 1 : Merging of Data Set
For the EEA data, there are a total of 28 CSV files from different station locations in Sofia across different years. In addition, there is an XLSX metadata file that contains important information such as CommonName (station name) and the latitude and longitude of each station. To proceed with the analysis, we need to merge all of these data sets together.
Solution :
Method 1: Using Python Pandas:
I used the pandas read_csv function to load each file into a DataFrame and then concatenated the DataFrames. After concatenating the data, we can merge it with the metadata on StationEoICode to get more information about the stations.
With this method, I can export the result as a single CSV file and load it into different analytics tools for visualization.
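The concatenate-then-merge step could look roughly like the sketch below. It is not my exact script: the function name `merge_eea`, the station codes `ST_A`/`ST_B`, and the sample values are hypothetical, and the choice of a left join (rather than an inner join) is an assumption made so that no readings are dropped silently.

```python
import pandas as pd


def merge_eea(frames, metadata, key="StationEoICode"):
    """Stack per-station readings, then attach station metadata on `key`."""
    readings = pd.concat(frames, ignore_index=True)
    # A left join keeps every reading even if a station is missing from metadata.
    return readings.merge(metadata, on=key, how="left")


# Tiny in-memory stand-ins for two station files and the metadata sheet
# (all codes, names and values here are made up for illustration).
station_a = pd.DataFrame({"StationEoICode": ["ST_A", "ST_A"], "Concentration": [41.0, 55.5]})
station_b = pd.DataFrame({"StationEoICode": ["ST_B"], "Concentration": [73.2]})
stations = pd.DataFrame({
    "StationEoICode": ["ST_A", "ST_B"],
    "CommonName": ["Station A", "Station B"],
    "Latitude": [42.7, 42.6],
    "Longitude": [23.3, 23.4],
})

merged = merge_eea([station_a, station_b], stations)
```

With real files, the `frames` list would come from calling `pd.read_csv` on each of the 28 CSV files and `metadata` from `pd.read_excel` on the XLSX file; `merged.to_csv(...)` then writes the single file used by the visualization tools.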
Method 2: Using Tableau Union Function:
Similarly, Tableau can combine the data using its union feature. Subsequently, we can inner join the result with the metadata on StationEoICode.
Issue 2 : Geohashing for Airtube Data
Solution :
I used Python to convert the geohashes to coordinates, with reference to the Python geohash2 library link.
After decoding the geohashes, I discovered noise in the data. One particular geohash, "m-2105171", cannot be decoded. I used this converter to decode the geohash and obtain the latitude and longitude. There are also null values in the geohash data. In this case, I would remove these 5 records, as their geolocation is not in Sofia or even near Bulgaria.
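To show what decoding a geohash involves, here is a minimal pure-Python sketch of the standard geohash decoding scheme. It is a stand-in written for illustration, not the code of the geohash2 library, and the function name `decode_geohash` is hypothetical. It returns `None` for nulls and for strings containing characters outside the geohash alphabet, such as the "-" in "m-2105171".

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def decode_geohash(geohash):
    """Return (lat, lon) of the cell centre, or None if the input is invalid."""
    if not isinstance(geohash, str) or not geohash:
        return None  # covers None, NaN and empty strings
    if any(ch not in BASE32 for ch in geohash):
        return None  # e.g. "m-2105171" contains "-"
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    is_lon = True  # bits alternate between longitude and latitude, longitude first
    for ch in geohash:
        value = BASE32.index(ch)
        for shift in range(4, -1, -1):  # five bits per character, high bit first
            bit = (value >> shift) & 1
            rng = lon_range if is_lon else lat_range
            mid = (rng[0] + rng[1]) / 2
            if bit:
                rng[0] = mid  # keep the upper half of the interval
            else:
                rng[1] = mid  # keep the lower half of the interval
            is_lon = not is_lon
    return (sum(lat_range) / 2, sum(lon_range) / 2)
```

Applying such a function to the geohash column and dropping the rows where it returns `None` removes both the null values and the undecodable entries in one pass.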
Issue 3 : Outlier/Noise in Airtube Dataset
During the exploratory data analysis process, I discovered extreme values in the dataset for certain measures. However, Task 2 requires exploring outliers in the Airtube data. Thus the cleaned dataset is used before and after Task 2, so that the results can be compared.
- Temperature: The lowest temperature recorded in Bulgaria is -38.3 degrees Celsius and the highest is 45.2 degrees Celsius. In the Airtube data, there are values as low as -400 and as high as 70 degrees. Thus, these records are likely erroneous and need to be treated.
- Pressure: Pressure at sea level is about 100 kPa (100,000 Pa), and the higher the elevation, the lower the pressure: pressure drops by roughly 1.2 kPa for every 100 meters of elevation. Therefore a pressure below 0 or above 200 kPa would be suspicious.
- Humidity: Humidity is often measured on a relative scale from 0 to 100. Thus values like 1000 and -1000 could be typos or noise.
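The pressure rule of thumb above can be turned into a quick sanity check. The sketch below is only a linear approximation using the two figures from the list (100 kPa at sea level, 1.2 kPa drop per 100 m); the function name is hypothetical and the 550 m elevation used for Sofia is an illustrative assumption, not a value from the data.

```python
def expected_pressure_kpa(elevation_m, sea_level_kpa=100.0, drop_per_100m=1.2):
    """Rough expected pressure at a given elevation (linear rule of thumb)."""
    return sea_level_kpa - drop_per_100m * elevation_m / 100.0


# Assumed elevation of about 550 m for Sofia (illustrative value only).
sofia_estimate = expected_pressure_kpa(550)
```

Under these assumptions the estimate is roughly 93.4 kPa, so readings far from that value deserve a closer look.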
Solution :
- Temperature: I removed temperature values that are > 50 degrees Celsius or < -50 degrees Celsius. About 25,068 records were removed.
- Pressure: Removed values < 50 kPa or > 150 kPa.
- Humidity: Removed values > 1000 or < -1000.
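A range filter of this kind could be sketched in pandas as follows. The column names (`temperature`, `pressure`, `humidity`), the assumption that pressure is already in kPa, and the sample rows are all hypothetical; the bounds mirror the thresholds listed above.

```python
import pandas as pd

# (low, high) bounds per column; column names and units are assumptions.
BOUNDS = {
    "temperature": (-50.0, 50.0),    # degrees Celsius
    "pressure": (50.0, 150.0),       # kPa (convert first if the data is in Pa)
    "humidity": (-1000.0, 1000.0),   # mirrors the text; (0, 100) would be stricter
}


def drop_out_of_range(df, bounds=BOUNDS):
    """Keep only rows whose values fall inside every (low, high) interval."""
    keep = pd.Series(True, index=df.index)
    for column, (low, high) in bounds.items():
        # between() is inclusive and treats missing values as out of range.
        keep &= df[column].between(low, high)
    return df[keep]


# Made-up rows: only the first one passes all three checks.
sample = pd.DataFrame({
    "temperature": [21.5, -400.0, 70.0, 18.0, 19.0],
    "pressure": [94.0, 94.0, 94.0, 0.0, 94.0],
    "humidity": [55.0, 55.0, 55.0, 55.0, 1500.0],
})
cleaned = drop_out_of_range(sample)
```

Keeping the bounds in one dictionary makes it easy to rerun the analysis with different thresholds and compare the results before and after cleaning.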
Task 1: Spatio-temporal Analysis of Official Air Quality
First, we look only at the EEA data from 2013 to 2018. Looking at the data as a whole, we identified that all stations have missing values for the period from 1 January 2017 to 28 November 2017.
From this simple plot, we can identify a recurring pattern of increases in PM10 concentration, which suggests there could be an interesting underlying cause. I therefore decided to look up the current standards at which PM10 is considered unhealthy. Sofia City is located in Bulgaria, which is part of the EU, so I referred to the EU air quality standards from this link. Using this link, we can further group PM air quality into categories based on the EU Air Quality Standards. The daily mean limit for PM10 is 50 μg/m3, with 35 permitted exceedances each year. Thus we need a graph that clearly pinpoints the days on which the concentration exceeds the limit and the days on which people in Sofia City can enjoy breathing healthy air.
I used the above categorization as my color scale to visualize what a typical day in Sofia City looks like.
By categorizing the concentration, we can see that Sofia City faces high PM10 concentrations. Surprisingly, apart from the spikes in January and December, Sofia City also faces high pollutant concentrations throughout the year, except in June.
Although a heatmap highlights the seriousness of the pollution Sofia is facing, a control plot helps identify the underlying pattern and interesting insights. Notice that every year, on 24 December and between 18 and 24 January, there is a significant rise in the concentration of PM10 in Sofia. Could this be a coincidence, or is there a reason behind it? I looked up the national holidays of Bulgaria to see whether festive seasons other than Christmas also show a significant rise, but in this case there are none.
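The EU daily limit for PM10 (50 μg/m3, with 35 permitted exceedances per year) can also be checked numerically. The sketch below assumes a DataFrame of daily means with hypothetical column names `date` and `pm10`; the sample values are made up and only illustrate the counting.

```python
import pandas as pd

DAILY_LIMIT = 50.0         # µg/m3, EU daily PM10 limit
ALLOWED_EXCEEDANCES = 35   # permitted exceedance days per year


def exceedance_days_per_year(daily):
    """Count days per calendar year whose daily mean PM10 exceeds the limit."""
    over = daily[daily["pm10"] > DAILY_LIMIT]
    return over.groupby(over["date"].dt.year).size()


# Made-up data: 40 polluted days in one year, 3 in the next.
daily = pd.DataFrame({
    "date": list(pd.date_range("2016-01-01", periods=60))
    + list(pd.date_range("2017-01-01", periods=5)),
    "pm10": [80.0] * 40 + [20.0] * 20 + [60.0] * 3 + [10.0] * 2,
})
counts = exceedance_days_per_year(daily)
years_over_allowance = counts[counts > ALLOWED_EXCEEDANCES]
```

If the source readings are hourly, they would first need to be averaged per day and per station before a count like this is meaningful.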
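For readers unfamiliar with control plots, the idea can be sketched as follows. This assumes the common convention of a centre line at the mean with limits at three standard deviations, which may differ from the settings used in the chart; the function names and the sample readings are hypothetical.

```python
from statistics import mean, pstdev


def control_limits(values, sigmas=3.0):
    """Return (lower limit, centre line, upper limit) for a series of readings."""
    centre = mean(values)
    spread = pstdev(values)
    return centre - sigmas * spread, centre, centre + sigmas * spread


def out_of_control(values, sigmas=3.0):
    """Indices of readings that fall outside the control limits."""
    lower, _, upper = control_limits(values, sigmas)
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Made-up daily PM10 means: a steady baseline followed by one spike.
readings = [28.0, 30.0, 32.0] * 10 + [300.0]
flagged = out_of_control(readings)
```

Days flagged this way are the ones worth cross-checking against the calendar, as done above for late December and mid-January.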
Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements
Part 1 :
Sensor Coverage :
Input First Map Here
Sensor Performance :
Input Line Chart Here
Sensor Operations : Unusual Behaviour (Detecting Outlier)
Input Dashboard Here
Part 2 :
Using appropriate data visualisation, you are required to answer the following types of questions:
Characterize the sensors’ coverage, performance and operation. Are they well distributed over the entire city? Are they all working properly at all times? Can you detect any unexpected behaviors of the sensors through analyzing the readings they capture? Limit your response to no more than 4 images and 600 words.
Now turn your attention to the air pollution measurements themselves. Which part of the city shows relatively higher readings than others? Are these differences time dependent? Limit your response to no more than 6 images and 800 words.
Task 3: Air Quality Measure Analysis
Urban air pollution is a complex issue. There are many factors affecting the air quality of a city. Some of the possible causes are:
- Local energy sources. For example, according to Unmask My City, a global initiative by doctors, nurses, public health practitioners, and allied health professionals dedicated to improving air quality and reducing emissions in our cities, Bulgaria’s main sources of PM10, and fine particle pollution PM2.5 (particles 2.5 microns or smaller) are household burning of fossil fuels or biomass, and transport.
- Local meteorology such as temperature, pressure, rainfall, humidity, wind etc
- Local topography
- Complex interactions between local topography and meteorological characteristics.
- Transboundary pollution, for example the haze that intruded into Singapore from our neighbours.
In this third task, you are required to reveal the relationships between the factors mentioned above and the air quality measure detected in Task 1 and Task 2. Limit your response to no more than 5 images and 600 words.
Conclusion
Reference
Feedback
Please feel free to provide your feedback. Thank you.