Difference between revisions of "IS428 AY2018-19T1 Ji Xinyi"

From Visual Analytics for Business Intelligence

Latest revision as of 00:26, 12 November 2018

Problem & Motivation

Air pollution is an important risk factor for health in Europe and worldwide. A recent review of the global burden of disease showed that it is one of the top ten risk factors for health globally. Worldwide an estimated 7 million people died prematurely because of pollution; in the European Union (EU) 400,000 people suffer a premature death. The Organisation for Economic Cooperation and Development (OECD) predicts that in 2050 outdoor air pollution will be the top cause of environmentally related deaths worldwide. In addition, air pollution has also been classified as the leading environmental cause of cancer.

Air quality in Bulgaria is a big concern: measurements show that citizens all over the country breathe in air that is considered harmful to health. For example, concentrations of PM2.5 and PM10 are much higher than what the EU and the World Health Organization (WHO) have set to protect health.

Bulgaria had the highest PM2.5 concentrations in urban areas of all EU-28 member states over a three-year average. For PM10, Bulgaria also leads the list of most polluted countries, with a daily mean concentration of 77 μg/m3 (the EU limit value is 50 μg/m3).

According to the WHO, 60 percent of the urban population in Bulgaria is exposed to dangerous (unhealthy) levels of particulate matter (PM10).

Dataset Analysis & Transformation Process

This section elaborates on the exploratory data analysis and transformation process used to prepare each dataset for analysis. Four zip files were provided for the assignment: Air Tube, EEA Data, METEO-data and TOPO-DATA.

In this section, I examine the quality of the data provided by looking for bad data and gaps in the data, which informs the next steps.


Problem #1: Location is needed so that the final result can be shown as a map; it is also a learning feature for the NN.
Issue: Bring lat/long/elev data from the metadata.xls file into the EEA Data metropolitan data.
Solution
Dcone.jpg

Left merge EEA_Data with metadata.xls.
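
A left merge of this kind can be sketched in plain Python as follows. This is only an illustration: the column names (`station_id`, `lat`, `lon`, `elev`) and the sample values are hypothetical placeholders, not the actual fields of the EEA files or of metadata.xls.

```python
# Minimal left-join sketch. Column names and values are hypothetical;
# the real EEA_Data and metadata.xls columns may differ.

def left_merge(readings, stations, key):
    """Keep every reading and attach station fields when the key matches."""
    lookup = {row[key]: row for row in stations}
    merged = []
    for reading in readings:
        station = lookup.get(reading[key], {})
        # Reading fields win on a name clash, so measurements are never overwritten.
        merged.append({**station, **reading})
    return merged


readings = [
    {"station_id": "STA-1", "pm10": 41.0},
    {"station_id": "STA-2", "pm10": 63.5},
    {"station_id": "STA-9", "pm10": 12.0},  # no metadata row for this station
]
stations = [
    {"station_id": "STA-1", "lat": 42.68, "lon": 23.29, "elev": 586},
    {"station_id": "STA-2", "lat": 42.69, "lon": 23.33, "elev": 540},
]

merged = left_merge(readings, stations, "station_id")
```

Because it is a left merge, a reading whose station has no metadata row is kept, only without location fields, so such rows can be inspected rather than silently dropped.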

Problem #2: Aggregation must be consistent across all data for accuracy.
Issue: BG_5_60881_2018_timeseries.csv has ‘AveragingTime’ set to hour.
Solution
Dctwo.jpg
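
One way to bring an hourly file to a common granularity is to collapse it into daily means. The sketch below assumes, for illustration only, that the target granularity is one value per day and that each row is a (timestamp, concentration) pair; the actual column layout of the timeseries file is not shown in this report.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_mean(hourly_rows):
    """Collapse (ISO timestamp, concentration) pairs into one mean per calendar day."""
    by_day = defaultdict(list)
    for timestamp, value in hourly_rows:
        day = datetime.fromisoformat(timestamp).date()
        by_day[day].append(value)
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Illustrative values only, not taken from the dataset.
hourly = [
    ("2018-01-01 00:00:00", 80.0),
    ("2018-01-01 01:00:00", 60.0),
    ("2018-01-02 00:00:00", 30.0),
]
daily = daily_mean(hourly)
```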

Problem #3: Geohash cannot be parsed directly by Tableau.
Issue: A geohash is a convenient way of expressing a location (anywhere in the world) as a short alphanumeric string; longer strings give greater precision. Each geohash value corresponds to one pair of longitude and latitude values. Tableau needs the longitude and latitude values rather than the geohash, so the data must be transformed.
Solution
Dcthree.png

A script decodes every geohash to longitude/latitude. Note that the geohash field is retained, since it is the unique identifier of each sensor.
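
The decoding script itself is only shown as an image, so the following is a sketch of the standard geohash decoding algorithm rather than the exact code used here. Each base-32 character contributes five bits, which alternately halve the longitude and latitude ranges; the result is the centre of the final cell.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def decode_geohash(geohash):
    """Return the (latitude, longitude) at the centre of the geohash cell."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    is_lon = True  # bits alternate between the two axes, starting with longitude
    for char in geohash.lower():
        value = BASE32.index(char)
        for shift in range(4, -1, -1):
            bit = (value >> shift) & 1
            rng = lon_range if is_lon else lat_range
            mid = (rng[0] + rng[1]) / 2
            if bit:
                rng[0] = mid  # keep the upper half of the range
            else:
                rng[1] = mid  # keep the lower half of the range
            is_lon = not is_lon
    lat = (lat_range[0] + lat_range[1]) / 2
    lon = (lon_range[0] + lon_range[1]) / 2
    return lat, lon


lat, lon = decode_geohash("ezs42")  # a commonly used example geohash
```

The decoded point is the centre of a cell, not an exact sensor position, so the precision depends on the length of the geohash string.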

Problem #4: It is difficult to identify the data points that lie in the city.
Issue: In the citizen dataset, the sensor data covers the whole country, while the assignment focuses mainly on Sofia city. Data cleaning is required to remove or mark the unrelated data.

Solution
Dcfour.png


The lat/long boundaries are found in the TOPO-DATA. A script checks whether the position of each sensor lies within the city boundary. An additional boolean value is then assigned to each record to indicate whether the sensor is in the city.
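
The boundary check can be sketched as a simple bounding-box test. The box below is a placeholder chosen for illustration, not the actual boundary values from TOPO-DATA, and the field name `in_city` is likewise hypothetical.

```python
def in_bounds(lat, lon, bounds):
    """True when the point lies inside the (lat_min, lat_max, lon_min, lon_max) box."""
    lat_min, lat_max, lon_min, lon_max = bounds
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


# Placeholder box; the real limits should be read from TOPO-DATA.
CITY_BOUNDS = (42.60, 42.80, 23.20, 23.50)

sensors = [
    {"geohash": "sensor-a", "lat": 42.70, "lon": 23.32},
    {"geohash": "sensor-b", "lat": 43.21, "lon": 27.91},
]
for sensor in sensors:
    # Mark each record instead of deleting it, so out-of-city data stays available.
    sensor["in_city"] = in_bounds(sensor["lat"], sensor["lon"], CITY_BOUNDS)
```

Storing a flag instead of deleting rows means the same file can be filtered in Tableau without losing the out-of-city records.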


Final Data Files

  1. pollution_master_data
     This data-set contains the aggregated data of the original EEA dataset.
  2. citizen
     This data-set contains Citizen science air quality measurements with decoded longitude and latitude.
  3. meteo-concentration
     The aggregated data from the meteo and timeseries data.


Interactive Visualization

The interactive visualisation can be accessed from this link: https://public.tableau.com/profile/ji.xinyi#!/vizhome/SofiaMetroPollutionDataExploration/Story1

Task 1: Spatio-temporal Analysis of Official Air Quality

Characterize the past and most recent situation with respect to air quality measures in Sofia City. What does a typical day look like for Sofia city? Do you see any trends of possible interest in this investigation? What anomalies do you find in the official air quality dataset? How do these affect your analysis of potential problems to the environment?

Characterize the past and most recent situation with respect to air quality measures in Sofia City.
Xinyi task1 1.JPG
Xinyi task12.JPG
Xinyi task13.JPG


As seen in the time-series and the trend line above, the average PM10 concentration has fallen over the years. From the time-series above, it can also be seen that seasonality of concentration levels is relatively constant. PM10 levels tend to rise at the start and end of each year, and this trend can be observed throughout all the years.
In the heatmap, the green cells are those that satisfy the standard set by the EU (below 50). The situation is normally good, but the air quality is bad during winter.
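
The colouring rule behind such a heatmap can be sketched as follows. This is a simplified illustration: it compares a monthly mean against the EU value of 50 quoted in this report, and the input layout (a date tuple plus a concentration) and sample values are assumptions.

```python
from collections import defaultdict
from statistics import mean

EU_PM10_LIMIT = 50.0  # limit value quoted in this report


def monthly_cells(daily_rows):
    """Return {(year, month): (mean PM10, within_limit)} for heatmap-style cells."""
    buckets = defaultdict(list)
    for (year, month, _day), value in daily_rows:
        buckets[(year, month)].append(value)
    cells = {}
    for key, values in sorted(buckets.items()):
        average = mean(values)
        cells[key] = (average, average < EU_PM10_LIMIT)
    return cells


# Illustrative values only: a winter month and a summer month.
daily_rows = [
    ((2016, 1, 10), 90.0),
    ((2016, 1, 11), 70.0),
    ((2016, 7, 10), 20.0),
    ((2016, 7, 11), 30.0),
]
cells = monthly_cells(daily_rows)
```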

PM10 Concentration over the timeline with shade
Purpose / Description

This diagram shows the average concentration of the PM10 recorded from the five stations by month across years.


Xinyi task11.png



A monthly aggregated view shows that all stations have their highest peaks around the Christmas holiday period. The missing data from 2017 to 2018 leads to an inaccurate visualization: judging by the previous years, the air pollution level should be lower than what is displayed. The changes in pollution level at the given stations are relatively similar. In other words, the concentrations of PM10 from the five stations increase and decrease simultaneously.

What does a typical day look like for Sofia city?
Xinyi task15.JPG


This graph shows the hourly trend of PM10 concentration over one day in 2018, because the hourly data is more complete for 2018.
As we can see, the trend of PM10 is similar across the 5 air quality stations. At 00:00, PM10 starts out at a higher value, and as morning approaches the concentration fluctuates up and down. Then we see a sudden dip in the concentration around 08:00 at the stations. After the dip, the concentration increases at 10:00, then decreases again from 11:00 to 16:00. From then onwards the concentration level increases again until the next morning.


Do you see any trends of possible interest in this investigation?

From the graphs above, I found several points of interest:
1. A monthly aggregated view shows that all stations have their highest peaks around the Christmas holiday period. There may be fireworks, which can have a bad impact on air quality.
2. November, December and January have particularly high levels of concentration compared to the rest of the months. Thus, it will be useful to investigate the relation between temperature and air quality in Sofia City.
3. The concentration trends are the same for all stations, but the values differ, so it will be useful to investigate the relation between topography and concentration in Sofia City.

What anomalies do you find in the official air quality dataset?
Xinyi task16.JPG


1. Hourly data is available only for late 2017 and 2018. Hourly data is more representative of the concentration levels and better shows a typical day in Sofia city. The gap may be because earlier data was lost, or because the stations previously collected data only once a day.
2. The graph above shows that 2017 has a massive lack of data across all 6 stations. The station Mladost started to operate in January 2018. The station Orlov Most has not collected data since late 2015.


Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements

Using appropriate data visualisation, you will be asked to answer the following types of questions:

  • Characterize the sensors’ coverage, performance and operation. Are they well distributed over the entire city? Are they all working properly at all times? Can you detect any unexpected behaviors of the sensors through analyzing the readings they capture? Limit your response to no more than 4 images and 600 words.
  • Now turn your attention to the air pollution measurements themselves. Which part of the city shows relatively higher readings than others? Are these differences time dependent?
Characterize the sensors’ coverage, performance and operation. Are they well distributed over the entire city? Are they all working properly at all times?
Coverage and Distribution
Xinyitask21.png

This diagram shows the geospatial coverage of the sensors across the city. This is essential because the spatial coverage of the citizen data reflects the confidence and completeness of the whole dataset. Because this dataset comes from a citizen database, the coverage must be assessed before looking at the pollution levels it reports: if a large area is not tracked, the overall result might not be trustworthy.

Only the data points within the city area are displayed; the irrelevant data is hidden. The way these data points are identified is described in the data cleaning procedure above.

From the visualization above, the citizen data reflects the overall situation of the city fairly well. There is no obvious empty region on the map. However, the northern and south-eastern parts of the map have a lower sensor density than the central area. Hence, the pollution records in the central area are more credible.

The colour encodes the highest concentration recorded by the sensor at that location. The points with the deepest colour appear in the central area, indicating that it is the most polluted area.
Performance and Operation

Xinyitask22.png

The time series above shows the number of measurements over time and displays an obvious increase in the number of citizen science sensors from September 2017 to August 2018. There are certain days where measurements are missing, as seen from the massive downward spikes. These sudden drops in measurements seem to occur at the end and start of the month. (They may result from system downtime or regular maintenance.)
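
Missing days and sudden drops of this kind can also be listed programmatically rather than read off the chart. The sketch below works on a mapping from day to measurement count; the 50% drop threshold and the sample counts are assumptions made for illustration.

```python
from datetime import date, timedelta


def find_gaps(counts, drop_ratio=0.5):
    """Return (missing_days, drop_days) for a {date: measurement count} mapping."""
    days = sorted(counts)
    missing, drops = [], []
    previous = None  # count on the most recent day that had data
    day = days[0]
    while day <= days[-1]:
        n = counts.get(day, 0)
        if n == 0:
            missing.append(day)
        else:
            # A drop is a day with far fewer measurements than the last active day.
            if previous is not None and n < drop_ratio * previous:
                drops.append(day)
            previous = n
        day += timedelta(days=1)
    return missing, drops


# Illustrative counts around a month boundary.
counts = {
    date(2018, 1, 29): 100,
    date(2018, 1, 30): 104,
    date(2018, 1, 31): 20,
    date(2018, 2, 2): 110,
}
missing, drops = find_gaps(counts)
```

Comparing each day with the previous active day, rather than with a global average, keeps the check usable even though the number of sensors grows over time.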


Can you detect any unexpected behaviors of the sensors through analyzing the readings they capture?
Xinyitask23.png

The instruments used by citizen scientists are not professional measuring devices, so errors are expected to occur. The captured citizen science readings range from 0 to 2000, which is a much wider range than the official air quality measurements of 0 to 690. It can therefore be deduced that the citizen science instruments are very inaccurate in comparison to the official air quality stations. The graph shows some random fluctuation due to anomalies (e.g. PM10=2000), so a filter should be applied to remove the extreme data.
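
A filter of this kind could look like the sketch below. The cut-off is an assumption: it reuses the upper end of the official range (690) quoted above as a plausibility limit, which is one possible choice rather than an established threshold.

```python
# Assumed cut-off: the upper end of the official PM10 range quoted above.
OFFICIAL_MAX_PM10 = 690.0


def split_plausible(readings, lower=0.0, upper=OFFICIAL_MAX_PM10):
    """Separate readings inside [lower, upper] from the extreme ones."""
    kept = [r for r in readings if lower <= r <= upper]
    rejected = [r for r in readings if not lower <= r <= upper]
    return kept, rejected


# Illustrative readings, including the PM10=2000 style of anomaly.
readings = [35.0, 120.5, 2000.0, -1.0, 690.0]
kept, rejected = split_plausible(readings)
```

Returning the rejected values as well makes it easy to count how many records each sensor loses, which is itself a useful indicator of sensor quality.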


Task 3

Urban air pollution is a complex issue. There are many factors affecting the air quality of a city.

Relationship between concentration and factors
This diagram shows the relationship between concentration and altitude
Task31.png

This visualization investigates the relationship between altitude and the concentration of pollutants, using the official data. The five stations are located at different altitudes. Among them, the Pavlovo station has the highest altitude, while the station Hipodruma is the most polluted. Hence, there is no clear relationship between the pollution level and the altitude.



This diagram shows the relationship between concentration and temperature

Task32.png


From observing the data set, I found that some local meteorological variables, such as pressure and humidity, do not vary much. This visualization therefore investigates the relationship between temperature and the concentration of pollutants. The relationship is such that the higher the temperature, the lower the pollutant concentration. This may explain why November, December and January have particularly high levels of concentration compared to the rest of the months.
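
The strength of such a relationship can be quantified with the Pearson correlation coefficient, where a value near -1 indicates that concentration falls as temperature rises. The sketch below uses made-up, perfectly linear values purely to show the calculation; it is not a result from the Sofia data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up values: concentration falls steadily as temperature rises.
temperature = [-5.0, 0.0, 5.0, 10.0, 15.0]
pm10 = [90.0, 75.0, 60.0, 45.0, 30.0]
r = pearson(temperature, pm10)
```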



This diagram shows the relationship between PM10 and PM2.5

Task34.png


This visualization investigates the relationship between PM2.5 and PM10. The relationship is such that the higher the PM2.5, the higher the PM10. The main sources of fine particle pollution PM2.5 (particles 2.5 microns or smaller) are household burning of fossil fuels or biomass, and transport, which means that these sources can severely degrade air quality.


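References

  • https://www.datasciencesociety.net/sofia-air-quality-eda-exploratory-data-analysis/
  • https://www.datasciencesociety.net/monthly-challenge-sofia-air-solution-kung-fu-panda/
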
Comments

Do provide me with your feedback! :)