Difference between revisions of "IS428 AY2018-19T1 Liu Jinlongu"
Latest revision as of 23:40, 11 November 2018
To be a Visual Detective: Detecting spatio-temporal patterns
Overview
In Sofia, Bulgaria, air pollution has been a long-standing serious problem. Things got so out of control that even the European Court of Justice ruled against Bulgaria in a case brought by the European Commission against the country over its failure to implement measures to reduce air pollution.
Sofia has 5 metropolitan weather stations that capture weather data at hourly intervals. The analysis and comparison are based on the data collected from the five stations. The main measure of pollution is the concentration of a pollutant, PM10. The assignment will explore the factors, for example, humidity, altitude, position, etc., that affect the pollution level.
An interactive and informative visualisation will be designed and developed to demonstrate the results of the tasks below.
The Task
Task 1: Spatio-temporal Analysis of Official Air Quality
Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements
Task 3: Relationships between the factors mentioned above and the air quality measure detected in Task 1 and Task 2
Background Information
Air pollution is an important risk factor for health in Europe and worldwide. A recent review of the global burden of disease showed that it is one of the top ten risk factors for health globally. Worldwide an estimated 7 million people died prematurely because of pollution; in the European Union (EU) 400,000 people suffer a premature death. The Organisation for Economic Cooperation and Development (OECD) predicts that in 2050 outdoor air pollution will be the top cause of environmentally related deaths worldwide. In addition, air pollution has also been classified as a leading environmental cause of cancer.
Air quality in Bulgaria is a big concern: measurements show that citizens all over the country breathe in air that is considered harmful to health. For example, concentrations of PM2.5 and PM10 are much higher than what the EU and the World Health Organization (WHO) have set to protect health.
Bulgaria had the highest PM2.5 concentrations of all EU-28 member states in urban areas over a three-year average. For PM10, Bulgaria also leads the list of most polluted countries, with a daily mean concentration of 77 μg/m3 (the EU limit value is 50 μg/m3).
According to the WHO, 60 per cent of the urban population in Bulgaria is exposed to dangerous (unhealthy) levels of particulate matter (PM10).
Data Cleaning Procedure
Problem #1 | Location is not connected with the original EEA data |
---|---|
Issue | Combine the EEA data with the metadata to bring in lat/long so that the stations can be shown on the map |
Solution |
|
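The join described in this table can be sketched as follows. This is a minimal illustration, not the author's actual procedure: the column names, the station code `BG9999X`, and the coordinates are placeholders chosen for the example, and the real EEA files contain more columns.

```python
import csv
import io

# Hypothetical excerpts. The coordinates are placeholders, not the
# stations' true positions, and BG9999X is an invented station code.
MEASUREMENTS_CSV = """AirQualityStationEoICode,DatetimeBegin,Concentration
BG0079A,2018-01-01 00:00:00,41.2
BG0079A,2018-01-01 01:00:00,38.7
BG9999X,2018-01-01 00:00:00,12.0
"""

METADATA_CSV = """AirQualityStationEoICode,Latitude,Longitude
BG0079A,42.65,23.38
"""


def attach_coordinates(measurements_text, metadata_text):
    """Left-join station coordinates onto every measurement row."""
    coords = {
        row["AirQualityStationEoICode"]: (
            float(row["Latitude"]),
            float(row["Longitude"]),
        )
        for row in csv.DictReader(io.StringIO(metadata_text))
    }
    joined = []
    for row in csv.DictReader(io.StringIO(measurements_text)):
        # Stations missing from the metadata keep None coordinates.
        lat, lon = coords.get(row["AirQualityStationEoICode"], (None, None))
        joined.append({**row, "Latitude": lat, "Longitude": lon})
    return joined


rows = attach_coordinates(MEASUREMENTS_CSV, METADATA_CSV)
```

Keeping unmatched stations with empty coordinates, rather than dropping them, makes it easy to spot measurements that would not appear on the map.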
Problem #2 | Need consistent aggregation across all data for accuracy. |
---|---|
Issue | BG_5_60881_2018_timeseries.csv has ‘AveragingTime’ as hour |
Solution |
|
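One possible way to bring an hourly file onto a coarser common averaging period is sketched below. It assumes that daily means are the target and that timestamps look like `YYYY-MM-DD HH:MM:SS`; the source only states that this one file is hourly, so both are assumptions.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_to_daily(readings):
    """Average hourly (timestamp, concentration) pairs into one value per day."""
    by_day = defaultdict(list)
    for timestamp, concentration in readings:
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date()
        by_day[day].append(concentration)
    # Sort by date so the result is in chronological order.
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Made-up sample readings, for illustration only.
hourly = [
    ("2018-01-01 00:00:00", 40.0),
    ("2018-01-01 01:00:00", 60.0),
    ("2018-01-02 00:00:00", 30.0),
]
daily = hourly_to_daily(hourly)
```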
Problem #3 | Geohash cannot be parsed directly by Tableau |
---|---|
Issue | Geohash is a convenient way of expressing a location (anywhere in the world) using a short alphanumeric string, with greater precision obtained with longer strings. Each geohash value corresponds to one pair of longitude and latitude values. Tableau needs the longitude and latitude values instead of the geohash, so the data must be transformed. |
Solution |
|
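The geohash-to-coordinate transformation can be sketched with the standard decoding algorithm. This is an illustrative implementation, not necessarily the tool the author used; the function name `decode_geohash` is chosen for this example.

```python
# Standard geohash base-32 alphabet (no "a", "i", "l", "o").
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def decode_geohash(geohash):
    """Return the (latitude, longitude) at the centre of a geohash cell."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # bits alternate, starting with longitude
    for char in geohash.lower():
        value = _BASE32.index(char)  # raises ValueError on invalid characters
        for shift in range(4, -1, -1):  # five bits per character, MSB first
            bit = (value >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2


# "ezs42" is a commonly quoted test value, not a Sofia sensor location.
lat, lon = decode_geohash("ezs42")
```

Each extra character narrows the cell by five more bits, which is why longer geohash strings give more precise coordinates.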
Problem #4 | Difficulty identifying the data points in the city |
---|---|
Issue | In the citizen dataset, the sensor data covers the whole country, while the assignment focuses mainly on the city of Sofia. Data cleaning is required to remove or mark the unrelated data. |
Solution |
|
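One simple way to mark the data points that fall inside the city is a bounding-box test, sketched below. The box limits are rough illustrative values around Sofia, not the boundaries the author used, and the sensor records are made up.

```python
# Rough, illustrative bounding box around Sofia (an assumption).
SOFIA_BOUNDS = {
    "lat_min": 42.60,
    "lat_max": 42.80,
    "lon_min": 23.20,
    "lon_max": 23.45,
}


def in_sofia(lat, lon, bounds=SOFIA_BOUNDS):
    """Return True when the point lies inside the bounding box."""
    return (
        bounds["lat_min"] <= lat <= bounds["lat_max"]
        and bounds["lon_min"] <= lon <= bounds["lon_max"]
    )


def mark_city_points(points):
    """Add an 'in_city' flag to each point instead of deleting rows."""
    return [{**p, "in_city": in_sofia(p["lat"], p["lon"])} for p in points]


# Made-up sensors: one inside the box, one far to the east of it.
sensors = [
    {"id": "a", "lat": 42.70, "lon": 23.32},
    {"id": "b", "lat": 43.21, "lon": 27.91},
]
marked = mark_city_points(sensors)
```

Flagging rows rather than removing them keeps the option of filtering in the visualisation tool later.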
Problem #5 | Pollutant concentration data does not appear in the meteo data set |
---|---|
Issue | Merge the concentration data with the meteo data set |
Solution |
|
Final Data Files
- pollution_master_data.csv: the aggregated data from the original EEA dataset
- timeseries.csv: the original EEA dataset
- citizen.csv: the aggregated data from the original Air Tube dataset
- meteo-concentration.csv: the aggregated data from the meteo and timeseries data
Visualisation
Task 1: Spatio-temporal Analysis of Official Air Quality
- PM10 Concentration over the timeline
- PM10 Concentration over the timeline with shade
- PM10 Concentration over Christmas
Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements
- Citizen geo-distribution
- No. of records by hour across the citizen
- Time dependency of sensor data
Task 3: Relationships between the factors mentioned above and the air quality measure detected in Task 1 and Task 2
- Relationship between altitude and concentration
[Task 1] PM10 Concentration over the timeline |
---|
Purpose / Description This diagram shows the average concentration of the PM10 recorded from the five stations by hours across years.
|
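Interactive Technique Select: Hover. Tooltips are provided to show the air quality station type, averaging time, common name, timestamp, average altitude, and average concentration.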
|
Analysis Every year around Christmas Day, the chart peaks: the air pollution level rises above that of any other period in the year. This might be mainly because of fireworks. A deeper inspection of the data also shows that hourly readings from the Mladost station (BG0079A) are regularly missing between 9 and 10 AM during the critical first week of January. The readings in the hours following these gaps spike significantly. What causes these dropped signals during these hours? Was there an instrument malfunction at the official weather stations? If the official instruments are so costly relative to the citizen weather stations, should they be expected to be unreliable under some conditions? The missing data from the Orlov and Mladost stations may make the average concentration lower than expected. The maximum concentration among the five stations may be an alternative measure; however, it would fail to show the overall situation of the city, as the most polluted area is always at the same station.
|
[Task 1] PM10 Concentration over the timeline with shade |
---|
Purpose / Description This diagram shows the average concentration of the PM10 recorded from the five stations by month across years.
|
Interactive Technique
|
Analysis A monthly aggregated view shows the Druzhba station having the highest peaks during the holiday/Christmas period. Druzhba is at an altitude of 548 meters. This elevation is not very high, and it is a relevant official weather station. The missing data from 2017 to 2018 leads to an inaccurate visualisation. Judging by the previous years, the air pollution level should be lower than what is displayed. The changes in pollution level at the five stations are relatively similar. In other words, the concentrations of PM10 at the five stations increase and decrease simultaneously.
|
[Task 1] PM10 Concentration over Christmas |
---|
Purpose / Description This diagram shows the average concentration of PM10 recorded at the Hipodruma station. |
Interactive Technique
|
Analysis The Christmas period is typically a time when the pollution level rises dramatically and then returns to the normal level within 2 days. In the diagram, the concentration rises to 4 times the normal level on the afternoon of 29 Nov. It reaches its highest level at midnight; the situation improves after the start of 30 Nov.
|
[Task 2] Citizen geo-distribution |
---|
Purpose / Description This diagram shows a geospatial distribution of all the sensors across the whole city.
|
Interactive Technique
|
Analysis This diagram aims to show the geospatial coverage of sensors across the whole country. This is essential since the spatial coverage of the citizen data reflects the confidence and completeness of the whole dataset. Because the dataset comes from a citizen database, the coverage must be checked before looking at the pollution level it reflects: if a large area is not tracked, the overall result might not be trustworthy. Only the data points within the city area are displayed; the irrelevant data is hidden. The way to distinguish the data points is described in the data cleaning procedure above. In this visualization, the citizen data fairly reflects the overall situation of the country. There is no obvious empty region on the map. However, the northern and south-eastern parts of the map have a lower sensor density than the central area. Hence, the pollution records in the central area are more credible. The colour encodes the highest concentration reported by the sensor at that location. The points with the deepest colour appear in the central area, indicating that it is the most polluted area. The data can be grouped by month or year, and the size of each point represents the number of records at that location.
|
[Task 2] No. of records by hour across citizen |
---|
Purpose / Description This diagram shows the number of records reported from the sensors during the past two years. |
Interactive Technique
|
Analysis This visualization aims to investigate the time coverage of the dataset. Over the past two years, the number of records may not be evenly distributed, and the graph shows this distribution. The graph can be filtered by year or by month to change the date range, and the data changes accordingly; the aim is to analyze whether the pollution is time-independent. The upper graph shows the pollution data for 2017, with the big red dots marking the high-pollution places. In the lower graph, which shows the data for 2018, the positions of the big red dots have changed, which shows that the two data sets differ from each other. Analyzing further, each month can be selected for comparison, and the results turn out to be the same: the pollution reading is time-independent.
|
[Task 2] Time dependency of sensor data |
---|
Purpose / Description This diagram shows Time dependency of sensor data
|
Interactive Technique Select: Hover. Tooltips are provided to show the date and the number of records by time.
|
Analysis This visualization aims to investigate the time dependency of the sensor data. If the data shows a common trend across the year, the concentration is time-dependent; if the data fluctuates randomly or stays at a constant level, it is time-independent. The upper diagram shows some random fluctuation due to anomalies (e.g. PM10=2000), so a filter should be applied to remove the extreme values. The lower diagram has the filter applied. From March to August, the pollution concentration remains at a relatively low level. From August to December, it increases, and it reaches its highest point in January. From January to March, the situation improves and the level returns to normal.
|
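The outlier filter mentioned in the analysis, followed by a monthly aggregation, could look like the sketch below. The cut-off of 1000 is a hypothetical threshold chosen for illustration (the source only cites PM10=2000 as an example anomaly), and the readings are made up.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical upper limit for a plausible PM10 reading (an assumption).
MAX_PLAUSIBLE_PM10 = 1000.0


def monthly_means(readings, upper=MAX_PLAUSIBLE_PM10):
    """Drop out-of-range readings, then average per 'YYYY-MM' month."""
    by_month = defaultdict(list)
    for day, pm10 in readings:  # day is a 'YYYY-MM-DD' string
        if 0 <= pm10 <= upper:
            by_month[day[:7]].append(pm10)
    return {month: mean(values) for month, values in sorted(by_month.items())}


# Made-up readings, including one extreme value that gets filtered out.
sample = [
    ("2018-01-05", 80.0),
    ("2018-01-06", 2000.0),
    ("2018-01-07", 100.0),
    ("2018-06-01", 20.0),
]
result = monthly_means(sample)
```

Filtering before averaging matters here: a single extreme reading would otherwise dominate the monthly mean and hide the seasonal pattern.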
[Task 3] Relationship between concentration and altitude |
---|
Purpose / Description This diagram shows the relationship between concentration and altitude |
Interactive Technique
|
Analysis This visualisation aims to investigate the relationship between the altitude and the concentration of pollutants. The five stations are located at different altitudes. Among them, the Pavlovo station has the highest altitude, while the Hipodruma station is the most polluted. Hence, there is no clear relationship between the pollution level and the altitude.
|
[Task 3] Relationship between concentration and temperature |
---|
Purpose / Description This diagram shows the relationship between concentration and temperature |
Interactive Technique Select: Hover. Tooltips are provided to show the date and the concentration.
|
Analysis This visualization aims to investigate the relationship between the temperature and the concentration of pollutants. The relationship is inverse: the higher the temperature, the lower the pollutant concentration. This might be a cause of the Christmas spike, since temperatures are low at that time of year.
|
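The inverse relationship described above can be quantified with a Pearson correlation coefficient, sketched here in plain Python. The temperature and PM10 values are invented for illustration and are not taken from the meteo-concentration data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up daily values: temperature in degrees Celsius, PM10 in ug/m3.
temperatures = [-5.0, 0.0, 5.0, 10.0, 20.0]
pm10_levels = [120.0, 90.0, 70.0, 40.0, 25.0]
r = pearson(temperatures, pm10_levels)
```

A coefficient close to -1 would support the claim that concentration falls as temperature rises; a value near 0 would suggest no linear relationship.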
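Final Work Publication
- https://public.tableau.com/profile/liu.jinlong#!/vizhome/SofiaMetroPollutionOfficialDataExploration_0/SofiaOfficialPollutionDataStory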
Visualisation Software
I used the following software to perform the visual analysis.
- Tableau
- Excel
- Java
References
- https://www.datasciencesociety.net/sofia-air-quality-eda-exploratory-data-analysis/
- https://www.datasciencesociety.net/monthly-challenge-sofia-air-solution-kung-fu-panda/