IS428 AY2018-19T1 Liu Jinlongu

From Visual Analytics for Business Intelligence

To be a Visual Detective: Detecting spatio-temporal patterns

Overview

In Sofia, Bulgaria, air pollution has been a long-standing serious problem. Things got so out of control that even the European Court of Justice ruled against Bulgaria in a case brought by the European Commission against the country over its failure to implement measures to reduce air pollution.

Sofia has 5 metropolitan weather stations that capture weather data at hourly intervals. The analysis and comparison are based on the data collected from these five stations. The main measure of pollution is the concentration of one pollutant, PM10. The assignment explores the factors that affect the pollution level, for example humidity, altitude and position.

An interactive and informative visualisation will be designed and developed to demonstrate the results of the above tasks.

The Task

Task 1: Spatio-temporal Analysis of Official Air Quality
Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements
Task 3: Relationships between the factors mentioned above and the air quality measure detected in Task 1 and Task 2

Background Information

Air pollution is an important risk factor for health in Europe and worldwide. A recent review of the global burden of disease showed that it is one of the top ten risk factors for health globally. Worldwide an estimated 7 million people died prematurely because of pollution; in the European Union (EU) 400,000 people suffer a premature death. The Organisation for Economic Cooperation and Development (OECD) predicts that in 2050 outdoor air pollution will be the top cause of environmentally related deaths worldwide. In addition, air pollution has also been classified as a leading environmental cause of cancer.

Air quality in Bulgaria is a big concern: measurements show that citizens all over the country breathe in air that is considered harmful to health. For example, concentrations of PM2.5 and PM10 are much higher than what the EU and the World Health Organization (WHO) have set to protect health.

Bulgaria had the highest PM2.5 concentrations of all EU-28 member states in urban areas over a three-year average. For PM10, Bulgaria also leads the list of most polluted countries, with a daily mean concentration of 77 μg/m3 (the EU limit value is 50 μg/m3).

According to the WHO, 60 per cent of the urban population in Bulgaria is exposed to dangerous (unhealthy) levels of particulate matter (PM10).


Air index in Eruope.png

Data Cleaning Procedure


Problem #1 Location is not included in the original EEA data
Issue The EEA data must be joined with the metadata to bring in lat/long so that the stations can be shown on the map
Solution
Add location EEA.png


Left merge EEA_Data with metadata.xls.
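
The left merge described above can be sketched in plain Python as follows. This is a minimal illustration, not the code used for the assignment: the column names (AirQualityStationEoICode, Latitude, Longitude) and all sample values are assumptions, and the real files may be laid out differently. A left merge keeps every EEA reading, even when no metadata row matches its station.

```python
def left_merge(left_rows, right_rows, key, right_fields):
    """Left join: keep every row of left_rows and add right_fields where key matches."""
    lookup = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = lookup.get(row[key])
        # Unmatched readings are kept; their location fields are set to None.
        extra = {field: (match[field] if match else None) for field in right_fields}
        merged.append({**row, **extra})
    return merged


# Hypothetical sample rows (placeholder values, not taken from the dataset).
eea_rows = [
    {"AirQualityStationEoICode": "BG0079A", "Concentration": 35.0},
    {"AirQualityStationEoICode": "UNKNOWN", "Concentration": 12.0},
]
metadata_rows = [
    {"AirQualityStationEoICode": "BG0079A", "Latitude": 42.0, "Longitude": 23.0},
]

merged_rows = left_merge(
    eea_rows, metadata_rows, "AirQualityStationEoICode", ["Latitude", "Longitude"]
)
```

Readings whose station has no metadata still appear in the result, which makes missing locations easy to spot before plotting.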

Problem #2 Need consistent aggregation across all data for accuracy.
Issue BG_5_60881_2018_timeseries.csv has ‘AveragingTime’ as hour
Solution


Problem #3 Geohash cannot be parsed directly by Tableau
Issue Geohash is a convenient way of expressing a location (anywhere in the world) using a short alphanumeric string, with greater precision obtained with longer strings. Each geohash value corresponds to one pair of longitude and latitude values. Tableau needs longitude and latitude values instead of geohashes, so the data must be transformed.
Solution
Decode geohash.png


Use code to decode every geohash to long/lat. Note that the geohash field is retained, since it is the unique identifier for the different sensors.
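
A geohash can be decoded without any external library. The sketch below follows the standard geohash scheme (base-32 characters, bits alternating between longitude and latitude, each bit halving the current range) and returns the centre of the resulting cell. The function name decode_geohash is chosen for illustration and is not necessarily what was used in the assignment.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def decode_geohash(geohash):
    """Return (latitude, longitude) at the centre of the geohash cell."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    is_lon_bit = True  # bits alternate, starting with longitude
    for char in geohash.lower():
        value = BASE32.index(char)  # raises ValueError on an invalid character
        for shift in range(4, -1, -1):  # 5 bits per character, most significant first
            bit = (value >> shift) & 1
            rng = lon_range if is_lon_bit else lat_range
            mid = (rng[0] + rng[1]) / 2
            if bit:
                rng[0] = mid  # keep the upper half
            else:
                rng[1] = mid  # keep the lower half
            is_lon_bit = not is_lon_bit
    return sum(lat_range) / 2, sum(lon_range) / 2


lat, lon = decode_geohash("ezs42")  # well-known example, roughly (42.6, -5.6)
```

The decoded latitude and longitude can then be written as two new columns next to the original geohash column, so that Tableau can plot the sensors while the geohash remains available as the sensor identifier.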

Problem #4 Difficulty identifying the data points in the city.
Issue

In the citizen dataset, the sensor data covers the whole country, while the assignment focuses mainly on the city of Sofia. Data cleaning is required to remove or mark the unrelated data.

Solution
Compare with topo.png


The lat/long boundaries are found in the TOPO-DATA. Code is used to check whether the position of each sensor lies within the city boundary. An additional boolean value is then assigned to each record to indicate whether the sensor is in the city.
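
The boundary check can be sketched as a simple bounding-box test. This assumes that the city boundary is approximated by the minimum and maximum lat/long found in the TOPO-DATA; the field names (lat, lon, in_city) and the coordinates below are placeholders for illustration, not values from the dataset.

```python
def bounding_box(points):
    """Return (lat_min, lat_max, lon_min, lon_max) for a list of (lat, lon) points."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return min(lats), max(lats), min(lons), max(lons)


def flag_in_city(sensors, box):
    """Add a boolean in_city field to each sensor record."""
    lat_min, lat_max, lon_min, lon_max = box
    flagged = []
    for sensor in sensors:
        inside = (lat_min <= sensor["lat"] <= lat_max
                  and lon_min <= sensor["lon"] <= lon_max)
        flagged.append({**sensor, "in_city": inside})
    return flagged


# Placeholder topo points and sensors (hypothetical values).
topo_points = [(42.6, 23.2), (42.8, 23.5)]
sensors = [
    {"geohash": "sensor_a", "lat": 42.7, "lon": 23.3},
    {"geohash": "sensor_b", "lat": 43.2, "lon": 27.9},
]

city_box = bounding_box(topo_points)
flagged_sensors = flag_in_city(sensors, city_box)
```

Marking records rather than deleting them keeps the full dataset available, and the boolean field can be used as a filter in Tableau.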

Problem #5 Pollutant concentration data does not appear in the meteo data set
Issue The concentration data needs to be merged with the meteo data set
Solution
Concatination.png


Use code to align the time formats and inner join the two tables.
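
The two steps, aligning the time format and inner joining, can be sketched as below. The date formats and field names are assumptions chosen for illustration, since the actual formats of the two files are not listed here. Both date columns are parsed into the same date type, and only days present in both tables are kept.

```python
from datetime import datetime


def to_date(text, fmt):
    """Parse a date string with the given format into a date object."""
    return datetime.strptime(text, fmt).date()


def inner_join_on_date(conc_rows, meteo_rows, conc_fmt, meteo_fmt):
    """Inner join two tables on their date column after normalising the format."""
    meteo_by_date = {to_date(row["date"], meteo_fmt): row for row in meteo_rows}
    joined = []
    for row in conc_rows:
        day = to_date(row["date"], conc_fmt)
        if day in meteo_by_date:  # inner join: drop days missing from either table
            conc = {k: v for k, v in row.items() if k != "date"}
            meteo = {k: v for k, v in meteo_by_date[day].items() if k != "date"}
            joined.append({"date": day.isoformat(), **conc, **meteo})
    return joined


# Hypothetical rows with two different date formats (placeholder values).
conc_rows = [
    {"date": "2018-01-01", "pm10": 80.0},
    {"date": "2018-01-02", "pm10": 60.0},
]
meteo_rows = [
    {"date": "01/01/2018", "humidity": 85},
    {"date": "03/01/2018", "humidity": 70},
]

joined_rows = inner_join_on_date(conc_rows, meteo_rows, "%Y-%m-%d", "%d/%m/%Y")
```

Unlike the left merge used for the station locations, the inner join drops any day that lacks either a concentration or a meteo record, so every row in the result has both.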


Final Data Files

  1. pollution_master_data.csv: the aggregated data of the original EEA dataset.
  2. timeseries.csv: the original EEA dataset.
  3. citizen.csv: the aggregated data of the original Air Tube dataset.
  4. meteo-concentration.csv: the aggregated data from the meteo and timeseries data.

Visualisation


Task 1: Spatio-temporal Analysis of Official Air Quality

  1. PM10 Concentration over the timeline
  2. PM10 Concentration over the timeline with shade
  3. PM10 Concentration over Christmas

Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements

  1. Citizen geo-distribution
  2. No. of records by hour across the citizen
  3. Time dependency of sensor data

Task 3: Relationships between the factors mentioned above and the air quality measure detected in Task 1 and Task 2

  1. Relationship between altitude and concentration


[Task 1] PM10 Concentration over the timeline
Purpose / Description

This diagram shows the average PM10 concentration recorded by the five stations, by hour across the years.


Interactive Technique
  1. Select : Hover
  2. Tooltips are provided to show air quality station type, averaging time, common name, timestamp, average altitude, average concentration.


Analysis

Every year the concentration peaks around Christmas Day, when air pollution rises above the level seen on other days of the year. This may be mainly due to fireworks.

Also, a closer inspection of the data shows that hourly readings from the Mladost station (BG0079A) are regularly missing between 9 and 10 AM during the critical first week of January. The readings in the hours following the missing data spike significantly. What causes these dropped readings during these hours? Was there an instrument malfunction at the official weather stations? If these instruments are so costly relative to the citizen sensors, should they be expected to be unreliable under some conditions?

The missing data from the Orlov and Mladost stations may make the average concentration lower than expected. The maximum concentration among the five stations is an alternative measure; however, it would fail to show the overall situation of the city, because the highest reading always comes from the same station.


[Task 1] PM10 Concentration over the timeline with shade
Purpose / Description

This diagram shows the average concentration of the PM10 recorded from the five stations by month across years.

PM10 Concentration over the timeline with shade.png


Interactive Technique
  1. Select : Pointer
  2. The records from a particular station will be highlighted and the remaining records dimmed.
  3. Select : Hover
  4. Tooltips are provided to show station name, concentration of PM10, and the timestamp.


Analysis

A monthly aggregated view shows the Druzhba station having the highest peaks during the Christmas holiday period. Druzhba is at an altitude of 548 meters, which is not very high, and it is a relevant official weather station.

The missing data from 2017 to 2018 makes the visualisation inaccurate. Judging by the previous years, the air pollution level should be lower than what is displayed.

The changes in pollution level at the five stations are relatively similar. In other words, the PM10 concentrations at the five stations rise and fall together.

[Task 1] PM10 Concentration over Christmas
Purpose / Description

This diagram shows the average PM10 concentration recorded at the Hipodruma station.

OneThree.JPG


Interactive Technique
  1. Select : Hover
  2. Tooltips are provided to show station name, concentration of PM10, and the timestamp.


Analysis

The Christmas period is a typical period in which the pollution level rises dramatically and returns to the normal level within 2 days. In the diagram, the concentration rises to 4 times the normal level on the afternoon of 29 Nov. It reaches its highest level at midnight, and the situation improves after the start of 30 Nov.



[Task 2] Citizen geo-distribution
Purpose / Description

This diagram shows a geospatial distribution of all the sensors across the whole city.

Citizen geo-distribution.png


Interactive Technique
  1. Select : Hover
  2. Tooltips are provided to show the sensor's latitude/longitude, highest PM10 concentration and highest PM2.5 concentration.


Analysis

This diagram aims to show the geospatial coverage of the sensors across the city. This is essential because the spatial coverage of the citizen data determines how complete the dataset is and how much confidence can be placed in it. Because the dataset comes from a citizen database, the coverage must be checked before looking at the pollution level it reflects: if a large area is not tracked, the overall result might not be trustworthy.

Only the data points within the city area are displayed; the irrelevant data is hidden. The way the data points are distinguished is described in the data cleaning procedure above.

From the visualization above, the citizen data reflects the overall situation of the city fairly well. There is no obvious empty region on the map. However, the north and south-east parts of the map have a lower sensor density than the central area. Hence, the pollution records in the central area are more credible.

The color encodes the highest concentration reported by the sensor at that location. The points with the deepest colour appear in the central area, indicating that it is the most polluted area.

The data can be clustered in month/year and sizes will represent the number of records in that location.




[Task 2] No. of records by hour across citizen
Purpose / Description

This diagram shows the number of records reported by the sensors during the past two years.

No. of records by hour accross citizen.png


Interactive Technique
  1. Select : Hover
  2. Tooltips are provided to show the date and the number of records reported on that day.


Analysis

This visualization aims to investigate the time coverage of the dataset. Over the past two years, the records may not be evenly distributed. If the number of records collected in some period is significantly lower than in the other periods, the data for that period is not sufficient to show the pollution level of the whole country. It may also indicate major sensor failures during that period.

According to the visualization, the records are more concentrated in the second half of 2018. This suggests that sensor performance improved at that time. During July 2018, the sensors appear to report fewer records than on other days. On 4 and 5 July in particular, the number of records is approximately ten times lower than on other days. The sensors might have been under maintenance on those days.
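
The same check can be made outside Tableau by counting records per day and flagging days that fall far below the typical count. This is a minimal sketch: it assumes ISO-style timestamps that start with YYYY-MM-DD, and the ratio parameter is a hypothetical threshold, not a value taken from the analysis.

```python
from collections import Counter
from statistics import median


def records_per_day(timestamps):
    """Count records per calendar day; assumes timestamps start with YYYY-MM-DD."""
    return Counter(ts[:10] for ts in timestamps)


def low_coverage_days(counts, ratio=0.2):
    """Return the days whose record count is at most ratio times the median count."""
    typical = median(counts.values())
    return sorted(day for day, n in counts.items() if n <= typical * ratio)


# Placeholder timestamps (hypothetical): two normal days and one sparse day.
timestamps = (
    ["2018-07-03T10:00:00"] * 20
    + ["2018-07-04T10:00:00"] * 2
    + ["2018-07-06T10:00:00"] * 20
)

daily_counts = records_per_day(timestamps)
sparse_days = low_coverage_days(daily_counts)
```

Using the median as the reference keeps a few sparse days from pulling down the baseline they are compared against.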


[Task 2] Time dependency of sensor data
Purpose / Description

This diagram shows the time dependency of the sensor data.


Interactive Technique
  1. Select : Hover
  2. Tooltips are provided to show date and the concentration.


Analysis

This visualization aims to investigate the time dependency of the sensor data. If the data shows a common trend across the year, the concentration is time-dependent; if the data fluctuates randomly or stays at a constant level, it is time-independent.

The upper diagram shows some random fluctuation caused by anomalies (e.g. PM10=2000), so a filter should be applied to remove the extreme values.

The lower diagram has the filter applied. From March to August, the pollution concentration remains at a relatively low level. From August to December it increases, and it reaches its highest point in January. From January to March the situation improves and the concentration returns to the normal level.
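
The filtering step can be sketched as follows: readings outside a plausible range are dropped before averaging by month. The cut-off max_valid=1000.0 is a hypothetical value chosen only so that anomalies such as PM10=2000 are excluded; the actual filter used in Tableau may differ, and the sample readings are placeholders.

```python
from collections import defaultdict
from statistics import mean


def monthly_mean(readings, max_valid=1000.0):
    """Average PM10 per month, ignoring readings outside [0, max_valid]."""
    by_month = defaultdict(list)
    for timestamp, pm10 in readings:
        if 0 <= pm10 <= max_valid:  # drop extreme or negative values
            by_month[timestamp[:7]].append(pm10)  # key is YYYY-MM
    return {month: mean(values) for month, values in sorted(by_month.items())}


# Placeholder readings (hypothetical): one anomaly in January.
readings = [
    ("2018-01-05", 80.0),
    ("2018-01-06", 2000.0),
    ("2018-01-07", 100.0),
    ("2018-06-01", 20.0),
]

monthly = monthly_mean(readings)
```

Without the filter, the single anomalous value would dominate the January average, which is the kind of random fluctuation visible in the upper diagram.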

[Task 3] Relationship between concentration and altitude
Purpose / Description

This diagram shows the relationship between concentration and altitude.

Threeone.png
Interactive Technique
  1. Select : Hover
  2. Tooltips are provided to show date and the concentration.


Analysis

This visualisation aims to investigate the relationship between the altitude and the concentration of pollutants. The five stations are located at different altitudes. Among them, the Pavlovo station has the highest altitude, while the Hipodruma station is the most polluted. Hence, there is no clear relationship between the pollution level and the altitude.
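
One way to put a number on this visual impression is a Pearson correlation coefficient between station altitude and average concentration. The sketch below is not part of the original analysis, and the sample values are toy numbers rather than station measurements. With only five stations, any coefficient would be weak evidence and should be read alongside the chart.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy values for illustration only (not taken from the stations).
altitudes = [1.0, 2.0, 3.0]
rising = [2.0, 4.0, 6.0]
falling = [6.0, 4.0, 2.0]

r_up = pearson(altitudes, rising)     # close to +1: perfectly increasing
r_down = pearson(altitudes, falling)  # close to -1: perfectly decreasing
```

A value near 0 would be consistent with the conclusion that there is no clear relationship between altitude and pollution level.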



Visualisation Software

The following software was used to perform the visual analysis.

  • Tableau
  • Excel
  • VS Code


References

Comments

Do provide me your feedback!:)