IS428 2018-19 T1 Assign Koh How Han Vincent

From Visual Analytics for Business Intelligence
Jump to navigation Jump to search

Contents

Problem & Motivation

Air pollution is an important risk factor for health in Europe and worldwide. A recent review of the global burden of disease showed that it is one of the top ten risk factors for health globally. Worldwide an estimated 7 million people died prematurely because of pollution; in the European Union (EU) 400,000 people suffer a premature death. The Organisation for Economic Cooperation and Development (OECD) predicts that in 2050 outdoor air pollution will be the top cause of environmentally related deaths worldwide. In addition, air pollution has also been classified as the leading environmental cause of cancer.

Air quality in Bulgaria is a big concern: measurements show that citizens all over the country breathe in air that is considered harmful to health. For example, concentrations of PM2.5 and PM10 are much higher than what the EU and the World Health Organization (WHO) have set to protect health.

Bulgaria had the highest PM2.5 concentrations of all EU-28 member states in urban areas over a three-year average. For PM10, Bulgaria is also leading on the top polluted countries with 77 μg/m3on the daily mean concentration (EU limit value is 50 μg/m3).

According to the WHO, 60 percent of the urban population in Bulgaria is exposed to dangerous (unhealthy) levels of particulate matter (PM10).


Dataset Analysis & Transformation Process

Dataset Analysis

Dataset Data Description Observation
EEA Data.zip
Content of EEA
Metadata of EEA
  1. Official Air Quality Measurements (EEA Data.zip)
  2. Contain official data recorded from 6 different station
  3. Druzhba, Hipodruma, IAOS/Pavlovo and Nadezhda contain data from 2013 to 2018
  4. Mladost contain data only in January 2018
  5. Metadata is provided. Contain details of each station including latitude and longitude
  6. All data provided have undergo QA and are valid and verified.
Air Tube.zip Example
  1. Citizen Science Air Quality Measurements (Air Tube.zip)
  2. Contains data from September 2017 to August 2018
  3. Also contain meteorological data (Temperature, Humidity & Pressure)
  4. As dataset contain geohash which is not readable by Tableau, transformation is required to convert it into latitude and longitude
  5. P1 is PM10 and P2 is PM2.5
METEO-data.zip
Content of METEO-data
  1. Meteorological measurements (1 station)(METEO-data.zip)
TOPO-DATA.zip Example
  1. Topography data (TOPO-DATA)

Transformation Process

Official air quality measurements

Using various Air Quality Stations around Sofia City from the year 2013 – 2018, Sofia City is able to obtain daily / hourly PM10 concentration readings of air quality (µg/m3)

Issue: Dataset are split into various CSV files based on the air quality station and years. This will make analysis process difficult as data might be required to load in multiple times, causing unnecessary inefficient and possible inaccurate insight drawn.
Solution: Using Python programming language and Pandas library to combine all relevant dataset into one full dataset.

Issue: Incomplete dataset from 2 of the 6 air quality stations, STA-BG0054A and STA-BG0079A.
Solution: Exclude incomplete datasets from merging.

Issue: As dataset are recorded by air quality station and on different year, the period of recording is on different scale (example: certain year contains both day and hourly data). As such, this will cause inaccurate insight being drawn out. In addition, there are plans to benchmark data using EC air quality standards which is only shown in daily or yearly.
Solution: As daily data could not be drill down further to hourly / var, I will adopt the method of drill up by taking the daily mean instead of hourly / var (daily mean = sum of hourly recorded data per day / number of rows of per day).

Citizen science air quality measurements

Issue: Dataset does not provide any latitude or longitude data, instead, only Geohash data are provided. As Tableau is unable to interpret Geohash data, decoding must be perform to retrieve the latitude and longitude data.
Solution: Using Python programming language and pygeohash library to decode and retrieve latitude and longitude of the location.

Decode Geohash


Issue: Dataset are split into various CSV files based on the air quality station and years. This will make analysis process difficult as data might be required to load in multiple times, causing unnecessary inefficient and possible inaccurate insight drawn.
Solution: Using Python programming language and Pandas library to combine all relevant dataset into one full dataset.

Issue: Dataset contains data outside of Sofia City

Original number of sensors

Solution: Manually high and exclude data that are outside of Sofia City.

Exclusion of sensors outside sofia city

Meteorological measurements (1 station)

Issue: Dataset contain day, month and year separately.
Solution: Make use of Tableau MAKEDATE() to form the date variable. With the date variable, visualization can perform drilldown properly.

Issue: Visibility contains negative value which represent missing values in the dataset. However, this could cause potential shift in scale during visualization.
Solution: Change all values than is less than 0 to null

Conversion of negative value to 0


Dataset Import Structure & Process

With all the relevant data analysis and transformation completed, the next step is to import data into Tableau.

Data Source 1, 2 & 6
Data Source 3 and 4
Data source 5 and 7
Data source 8

Each of the above table represent 1 data source. In total, the visualization contains 8 data source.

Additional processing is performed for the following data source

Data Source Issue or Limitation face with original data Process Technique
Task 1: 2013 - 2017 EEA processed data (Merged with 2013 - 2017 data) and metadata are 2 separate file Use Tableau Inner Join (Selected Air quality station from both dataset)
Task 1: Hourly data
Task 3: Topo + Task 1
EEA processed data (All 2018 data) and metadata are 2 separate file Use Tableau Inner Join (Selected Air quality station from both dataset)
Task 1: 2013 - 2017
Task 1: Hourly data
Task 2
Task 3: Meteorological Data
No dropdown to select date type Setup Parameter control
Setup parameter control for Date Type
Task 1: 2013 - 2017
Task 1: Hourly data
Task 2
Task 3: Meteorological Data
As sorting is done automatically by Tableau, having a custom date format would help to prevent wrong sorting from happening during visualization Setup custom month and quarter and selection field
Custom quarter and month for viz

Setup date selection

Condition for date selection
Task 2
Task 3: Meteorological Data
Use to visualize data in day basics. Also to prevent wrong sorting by Tableau. Setup custom day
Custom day

Interactive Visualization

Public Tableau Visualization can be access from here: https://public.tableau.com/profile/vincent.koh#!/vizhome/VA-Assignment/MainDashboard?publish=yes For best viewing experience, viewers are recommended to change their browser screen resolution to 1366 x 768 with full screen mode.

To improve user experience, several interactive tool has been setup to ensure flexibility in navigating and generating of visualization.

Common button
Interactive Technique Rationale Implementation Steps
Click to go back one visualization To allow viewers to return to the previous dashboard.
  1. Click "Floating" under Objects in Dashboard (On the left side of the screen)
  2. Drag "Button" into dashboard
  3. Click on the dragged button and right click. Select "Edit Button"
  4. Select the previous dashboard to navigate to under "Navigate to"
  5. Choose your desired images by click on "Choose" under "Image"
Click to proceed to next visualization To allow viewers to proceed to the next dashboard. Repeat above steps with a few point to take note.
  • "Navigate to" must be the next dashboard
  • Choose another image to prevent confusion
Click to return to main dashboard To allow viewers to return back to main dashboard Repeat above steps with a few point to take note.
  • "Navigate to" must be the main dashboard
  • Choose another image to prevent confusion

Notes: Certain dashboard might not have all 3. Depending on the numbering of the dashboard, button for proceed and return will appear accordingly.


Main Dashboard

As there is a lot of visualization involved for this assignment, using only 1 dashboard is impossible. In addition, with 4 dataset given for generating of visualization, having all 4 in one dashboard may not be able to use the dataset to its fullest potential. Hence, I have decided to split the dashboard into multiple dashboard. These dashboard will be categorise into 3 different category. Official Data, Citizen Data and relationship between Official Data and/or Citizen Data with Topological and/or Meteorological data.

Main Dashboard Guide

The following technique has been adopted to help viewers in navigating between dashboards.

Interactive Technique Rationale Implementation Steps
Navigate across dashboard To allow viewers to access other dashboard easily with a click of the button
  1. Click "Floating" under Objects in Dashboard (On the left side of the screen)
  2. Drag "Button" into dashboard
  3. Click on the dragged button and right click. Select "Edit Button"
  4. Select the dashboard to navigate to under "Navigate to"
  5. Choose your desired images by click on "Choose" under "Image"
Display text upon hovering on button Provide user information on what to expect for the next dashboard.
  1. Under Tooltip of "Edit Button", type in your desired texts for viewers.

Typical day of Sofia City

The following image shows the typical day of Sofia City

Typical day of Sofia City

The following technique has been adopted to help viewers in navigating between dashboards.

Interactive Technique Rationale Implementation Steps
Hover to see average concentration by week To allow viewers to see the average concentration by week in terms of day
  1. Right click on "Datetime Begin", select "Change Data Type" and select "Date & Time"
  2. Drag "Datetime Begin" into "Rows" of the sheet.
  3. Right click on "Datetime Begin", hover to the first "More" seen in the popup menu, select "Weekday"
  4. Drag "Concentration" into "Label" and another "Concentration" into "Color"
  5. Return to the sheet that contain calendar heatmap
  6. Click on "Tooltip" under Mars. Create a new line after the last entry of the Tooltip. Click "Insert" and hover to "Sheet" and select the visualization created previously.
  7. Once the code has been generated, add "Common Name" and "WEEK(Datetime Begin)" into filter.

(<Sheet name="Avg Con. by Weekday" maxwidth="300" maxheight="300" filter="<Common Name>,<WEEK(Datetime Begin)>">)

Click to filter by "Common Name" To provide viewers in depth view of data collected by individual air quality station
  1. Drag "Common Name" into calendar heatmap's Filters. Right click and select "Show Filter"
  2. Drag Calendar Heatmap into Dashboard

Trends & Anomalies

The following image shows the trends & anomalies

Trends & Anomalies
Interactive Technique Rationale Implementation Steps
Filter visualization by date type To provide viewers the ability to show the trend in different date setting Ensure visualization is using the custom date selection created during the additional processing stage in Tableau.
Highlight all chart by common name To provide viewers the ability to view the air quality station only in all 3 visualization Ensure all visualization are using the same "Common Name". Move to the top right hand corner of the legend and click on the funnel (When hover, a tooltip will appear "Highlight Selected Items")

Distribution & Status of Sensors and Data

The following image shows the distribution and status of sensors. In addition, data from the sensors is also shown in the line chart.

Distribution & Status of Sensors and Data
Interactive Technique Rationale Implementation Steps
Drag to highlight the line chart. Selected line will show the location of the sensor in the geographic map To allow viewers to see multiple sensor location based on the recorded data shown in the line chart
  1. Click on Dashboard on Tableau header and select "Action"
  2. Follow the configuration shown in the image.
Filter configuration

Local Energy Source

The following image shows the concentration of P1 and P2 in Sofia.

Local energy source
Interactive Technique Rationale Implementation Steps
Drag the slider to find the date. Geographic map will show the density of the concentration for P1 and P2 To allow viewers to see selected day concentration. Ensure visualization is using the custom date selection created during the additional processing

Local Meteorological Data

The following image shows the meteorological data with the average P1 and P2.

Local meteorology data
Interactive Technique Rationale Implementation Steps
Select data type to see all visualization based on selected date setting To provide viewers the ability to show the trend in different date setting Ensure visualization is using the custom date selection created during the additional processing

Interesting & Anomalous Observations

Task 1: Spatio-temporal Analysis of Official Air Quality

What does a typical day look like for Sofia city?

Based on the visualization below, air pollution are still on the higher side for the first 3 months of 2018.

Overall view


Drilling down to week level and using one of the sensor, Druzhba, we can see that concentration are relatively low throughout the 24 periods. However, there are special cases whereby at certain period of the day, concentration would be higher as compare to others. One such example would be week 31, 0600hrs.

Week view

However, these visualization can be misleading as it does not show the trend throughout the year. Therefore, another visualization is introduce to prevent any inaccurate or bias in analysis. Based on the visualization below, observation can be made that the average air pollution by year has decrease (evident from 2013Q3 - 2014Q1 having higher than 2013Q2, 2013Q3 and 2014Q2). However, certain period of the year still have higher average concentration.

Quarter view


Do you see any trends of possible interest in this investigation?

  1. After generating the average concentration of p1 by month, I have notice that on the month of October, average concentration start to rise and at January, average concentration start to drop till about February to March. One possible reason would be seasonality. Based on the visualization above, there might be a possibility that at lower temperature, concentration will increase.
Quarter over the years


What anomalies do you find in the official air quality dataset?

  1. Missing data happens to all sensors in all years with 2017 being one of the year that has missing data from January 2017 to 27 November 2018
    missing data
  2. During 2017, air quality station with common name Druzhba, has a high average concentration as compare to other air quality station. In addition, air quality station Nadezhda, has a lower average concentration.
    Year View
  3. Nadezhda has unusual high concentration during the period of April to May 2018 between 4am to 11am.
    Unusual Data recording
  4. All 4 sensors contains daily records that are high in concentration.
    Day outlier


How do these affect your analysis of potential problems to the environment?

Anomaly / Unusual Events Potential Trend / Issue
Missing data happens to all sensors in all years with 2017 being one of the year that has missing data from January 2017 to 27 November 2018 This could potentially cause an inaccurate analysis. As there might be a correlation between season and air pollution, having only December will cause average concentration to be high for the year of 2017, leading to inaccurate analysis.
During 2017, air quality station with common name Druzhba, has a high average concentration as compare to other air quality station. In addition, air quality station Nadezhda, has a lower average concentration.

Initial calculation of the 2 sensors distance is approximately 10km apart. Within 10km, concentration shouldn't be too far apart between this 2 sensors. For Nadezhda, one possible would be the amount of data recorded as compare to other 3 sensors. With only 274 days of data, insight could be inaccurate due to lack of data. For Druzhba, there is a possibilities that the sensor is picking up incorrect readings.

Nadezhda has unusual high concentration during the period of April to May 2018 between 4am to 11am. As the sensor is located within the vicinity of home for locals, it may be difficult to narrow down to the actual cause of this issue because there could be many different factors ranging from incorrect reading to household burning of fossil fuels that affect the reading.
All 4 sensors contains daily records that are high in concentration. Since the amount of outliers are fairly consistent, there might be a pattern in the data.

Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements

Are they well distributed over the entire city?

Sensors are relatively distributed around the city, however, using a density map, I observe that north-east side of Sofia city has slightly more sensors as compare to other part of the city.

Sensor distribution

Are they all working properly at all times?

  1. Most of the time, sensors are working well with a steady increase in the first few months and having a stable number of sensors during the later part of 2018. However, at certain period of the dataset, there are sensors that stop working. This can be seen as there is a sudden dip in the visualization.
  2. On 26 March 2018 12pm, there is a sudden increase by approximate 40 sensors. This could be new user trying out the sensors during that period of the day.
Status of sensor

Can you detect any unexpected behaviors of the sensors through analyzing the readings they capture?

  1. On certain occasion, average concentration of p1 and p2 seem to be higher as compare to usual. Using the visualization below, we can see that there are several spike in the data.
P1 and P2 comparison

Which part of the city shows relatively higher readings than others? Are these differences time dependent?

As outlier data has been remove prior to generating this chart and this dashboard only show data point that are above the EC Air Quality Standards for p1 and p2. By highlight all the data point in the above chart (Avg Con. by Hour By Location), the data point that have relatively high reading will be show in the sensor distribution map.
Based on the result and some Google Map search on latitude and longitude, 2 of the location ([42.676, 23.320], [42.696, 23.304]) are denser with locals. Higher reading from this 2 location could be due to household burning of fossil fuels or transportation. The last 2 location ([42.67, 23.232], [42.610, 23.344]) are rather less denser with locals (based on Google Street View), however high readings due to household burning could not be rejected as there are still locals living in this 2 locations.
These higher readings are quite consistent as shown in the chart (Avg Con. by Hour by Location) throughout the 24 hours.

Higher reading

Task 3

Local energy sources

After reviewing data from 2017 to 2018, I realise that as the place get denser with local (ex: central of Sofia City), the concentration seem to be higher. This could be due to household burning of fossil fuels or transportation. However, the result can be inconclusive as there could be other external factors that affect the concentration.

Local Energy Sources vs P1 & P2

Local meteorology such as temperature, pressure, rainfall, humidity, wind etc

One relationship can be observe between local meteorology and air pollution. When temperature start to decrease, concentration for PM2.5 and PM10 start to increases, precipitation and visibility start to decreases. One possible reasons could be due to temperature inversion. Where higher air sinks due to the air being squeezed or compressed from higher air pressure at lower elevation, sinking air can be warm to the point it is warmer than air below it. Also known as upper air inversion, it create a barrier to prevent thunderstorms from forming. This can cause poor air quality as pollutant can build up.

Relationship between meteorology and air pollution

Complex interactions between local topography and meteorological characteristics.

One relationship can be observe is that when surface temperature increase or decrease, high grounds (darker part of the map in the visualization) does not seem to be affected by the increase in concentration level. One possible reason is that high ground causes difficulties for the pollutant to be disperse which results the concentration of P1 and P2 relatively high in Sofia City.

Topology vs Meteorology

Transboundary pollution for example the haze that intruded into Singapore from our neighbours.

After viewing the data by week. Within the visualization, it was noted that the next high concentration area is at Plovdiv (See below image).
After some research, I found out that there is a heating station at that location. Manage by Toplofikatsiya Plovdiv in the city of Plovdiv in Southern Bulgaria, it is one of the largest heating network in the country. Consisting of 2 power stations, one is steam-gas power station for electricity and thermal power and another, heat-only boiler station.
Publish on 6 April 2018, novinite (Sofia news agency) reported that Plovdiv has the most polluted air in the last heating season. With the news as evident and the distance between 2 state, I am confident to say that the power station located at Plovdiv could be a reason that Sofia city is experiencing high air pollution.

Transboundary identification

References

  1. EC Air Quality Standard [[1]]
  2. How winter affect air pollution [[2]]
  3. Environment in Sofia [[3]]
  4. Cause, effect and solution for air pollution [[4]]
  5. How Meteo affect air pollutant [[5]]
  6. Bulgaria season [[6]]
  7. News on Plovdiv (For Transboundary) [[7]]
  8. Topo Layer on Tableau Map [[8]]

Comments