IS428 2018-19 T1 Assign Koh How Han Vincent
Contents
- 1 Problem & Motivation
- 2 Dataset Analysis & Transformation Process
- 3 Dataset Import Structure & Process
- 4 Interactive Visualization
- 5 Interesting & Anomalous Observations
- 5.1 Task 1: Spatio-temporal Analysis of Official Air Quality
- 5.2 Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements
- 5.2.1 Are they well distributed over the entire city?
- 5.2.2 Are they all working properly at all times?
- 5.2.3 Can you detect any unexpected behaviors of the sensors through analyzing the readings they capture?
- 5.2.4 Which part of the city shows relatively higher readings than others? Are these differences time dependent?
- 5.3 Task 3
- 6 References
- 7 Comments
Problem & Motivation
Air pollution is an important risk factor for health in Europe and worldwide. A recent review of the global burden of disease showed that it is one of the top ten risk factors for health globally. Worldwide an estimated 7 million people died prematurely because of pollution; in the European Union (EU) 400,000 people suffer a premature death. The Organisation for Economic Cooperation and Development (OECD) predicts that in 2050 outdoor air pollution will be the top cause of environmentally related deaths worldwide. In addition, air pollution has also been classified as the leading environmental cause of cancer.
Air quality in Bulgaria is a big concern: measurements show that citizens all over the country breathe in air that is considered harmful to health. For example, concentrations of PM2.5 and PM10 are much higher than what the EU and the World Health Organization (WHO) have set to protect health.
Bulgaria had the highest PM2.5 concentrations of all EU-28 member states in urban areas over a three-year average. For PM10, Bulgaria is also leading on the top polluted countries with 77 μg/m3on the daily mean concentration (EU limit value is 50 μg/m3).
According to the WHO, 60 percent of the urban population in Bulgaria is exposed to dangerous (unhealthy) levels of particulate matter (PM10).
Dataset Analysis & Transformation Process
Dataset Analysis
Dataset | Data Description | Observation |
---|---|---|
EEA Data.zip |
| |
Air Tube.zip | Example |
|
METEO-data.zip |
| |
TOPO-DATA.zip | Example |
|
Transformation Process
Official air quality measurements
Using various Air Quality Stations around Sofia City from the year 2013 – 2018, Sofia City is able to obtain daily / hourly PM10 concentration readings of air quality (µg/m3)
Issue: Dataset are split into various CSV files based on the air quality station and years. This will make analysis process difficult as data might be required to load in multiple times, causing unnecessary inefficient and possible inaccurate insight drawn.
Solution: Using Python programming language and Pandas library to combine all relevant dataset into one full dataset.
Issue: Incomplete dataset from 2 of the 6 air quality stations, STA-BG0054A and STA-BG0079A.
Solution: Exclude incomplete datasets from merging.
Issue: As dataset are recorded by air quality station and on different year, the period of recording is on different scale (example: certain year contains both day and hourly data). As such, this will cause inaccurate insight being drawn out. In addition, there are plans to benchmark data using EC air quality standards which is only shown in daily or yearly.
Solution: As daily data could not be drill down further to hourly / var, I will adopt the method of drill up by taking the daily mean instead of hourly / var (daily mean = sum of hourly recorded data per day / number of rows of per day).
Citizen science air quality measurements
Issue: Dataset does not provide any latitude or longitude data, instead, only Geohash data are provided. As Tableau is unable to interpret Geohash data, decoding must be perform to retrieve the latitude and longitude data.
Solution: Using Python programming language and pygeohash library to decode and retrieve latitude and longitude of the location.
Issue: Dataset are split into various CSV files based on the air quality station and years. This will make analysis process difficult as data might be required to load in multiple times, causing unnecessary inefficient and possible inaccurate insight drawn.
Solution: Using Python programming language and Pandas library to combine all relevant dataset into one full dataset.
Issue: Dataset contains data outside of Sofia City
Solution: Manually high and exclude data that are outside of Sofia City.
Meteorological measurements (1 station)
Issue: Dataset contain day, month and year separately.
Solution: Make use of Tableau MAKEDATE() to form the date variable. With the date variable, visualization can perform drilldown properly.
Issue: Visibility contains negative value which represent missing values in the dataset. However, this could cause potential shift in scale during visualization.
Solution: Change all values than is less than 0 to null
Dataset Import Structure & Process
With all the relevant data analysis and transformation completed, the next step is to import data into Tableau.
Each of the above table represent 1 data source. In total, the visualization contains 8 data source.
Additional processing is performed for the following data source
Data Source | Issue or Limitation face with original data | Process Technique |
---|---|---|
Task 1: 2013 - 2017 | EEA processed data (Merged with 2013 - 2017 data) and metadata are 2 separate file | Use Tableau Inner Join (Selected Air quality station from both dataset) |
Task 1: Hourly data Task 3: Topo + Task 1 |
EEA processed data (All 2018 data) and metadata are 2 separate file | Use Tableau Inner Join (Selected Air quality station from both dataset) |
Task 1: 2013 - 2017 Task 1: Hourly data Task 2 Task 3: Meteorological Data |
No dropdown to select date type | Setup Parameter control |
Task 1: 2013 - 2017 Task 1: Hourly data Task 2 Task 3: Meteorological Data |
As sorting is done automatically by Tableau, having a custom date format would help to prevent wrong sorting from happening during visualization | Setup custom month and quarter and selection field
Setup date selection |
Task 2 Task 3: Meteorological Data |
Use to visualize data in day basics. Also to prevent wrong sorting by Tableau. | Setup custom day |
Interactive Visualization
Public Tableau Visualization can be access from here: https://public.tableau.com/profile/vincent.koh#!/vizhome/VA-Assignment/MainDashboard?publish=yes For best viewing experience, viewers are recommended to change their browser screen resolution to 1366 x 768 with full screen mode.
To improve user experience, several interactive tool has been setup to ensure flexibility in navigating and generating of visualization.
Interactive Technique | Rationale | Implementation Steps |
---|---|---|
Click to go back one visualization | To allow viewers to return to the previous dashboard. |
|
Click to proceed to next visualization | To allow viewers to proceed to the next dashboard. | Repeat above steps with a few point to take note.
|
Click to return to main dashboard | To allow viewers to return back to main dashboard | Repeat above steps with a few point to take note.
|
Notes: Certain dashboard might not have all 3. Depending on the numbering of the dashboard, button for proceed and return will appear accordingly.
Main Dashboard
As there is a lot of visualization involved for this assignment, using only 1 dashboard is impossible. In addition, with 4 dataset given for generating of visualization, having all 4 in one dashboard may not be able to use the dataset to its fullest potential. Hence, I have decided to split the dashboard into multiple dashboard. These dashboard will be categorise into 3 different category. Official Data, Citizen Data and relationship between Official Data and/or Citizen Data with Topological and/or Meteorological data.
The following technique has been adopted to help viewers in navigating between dashboards.
Interactive Technique | Rationale | Implementation Steps |
---|---|---|
Navigate across dashboard | To allow viewers to access other dashboard easily with a click of the button |
|
Display text upon hovering on button | Provide user information on what to expect for the next dashboard. |
|
Typical day of Sofia City
The following image shows the typical day of Sofia City
The following technique has been adopted to help viewers in navigating between dashboards.
Interactive Technique | Rationale | Implementation Steps |
---|---|---|
Hover to see average concentration by week | To allow viewers to see the average concentration by week in terms of day |
(<Sheet name="Avg Con. by Weekday" maxwidth="300" maxheight="300" filter="<Common Name>,<WEEK(Datetime Begin)>">) |
Click to filter by "Common Name" | To provide viewers in depth view of data collected by individual air quality station |
|
Trends & Anomalies
The following image shows the trends & anomalies
Interactive Technique | Rationale | Implementation Steps |
---|---|---|
Filter visualization by date type | To provide viewers the ability to show the trend in different date setting | Ensure visualization is using the custom date selection created during the additional processing stage in Tableau. |
Highlight all chart by common name | To provide viewers the ability to view the air quality station only in all 3 visualization | Ensure all visualization are using the same "Common Name". Move to the top right hand corner of the legend and click on the funnel (When hover, a tooltip will appear "Highlight Selected Items") |
Distribution & Status of Sensors and Data
The following image shows the distribution and status of sensors. In addition, data from the sensors is also shown in the line chart.
Interactive Technique | Rationale | Implementation Steps |
---|---|---|
Drag to highlight the line chart. Selected line will show the location of the sensor in the geographic map | To allow viewers to see multiple sensor location based on the recorded data shown in the line chart |
|
Local Energy Source
The following image shows the concentration of P1 and P2 in Sofia.
Interactive Technique | Rationale | Implementation Steps |
---|---|---|
Drag the slider to find the date. Geographic map will show the density of the concentration for P1 and P2 | To allow viewers to see selected day concentration. | Ensure visualization is using the custom date selection created during the additional processing |
Local Meteorological Data
The following image shows the meteorological data with the average P1 and P2.
Interactive Technique | Rationale | Implementation Steps |
---|---|---|
Select data type to see all visualization based on selected date setting | To provide viewers the ability to show the trend in different date setting | Ensure visualization is using the custom date selection created during the additional processing |
Interesting & Anomalous Observations
Task 1: Spatio-temporal Analysis of Official Air Quality
What does a typical day look like for Sofia city?
Based on the visualization below, air pollution are still on the higher side for the first 3 months of 2018.
Drilling down to week level and using one of the sensor, Druzhba, we can see that concentration are relatively low throughout the 24 periods. However, there are special cases whereby at certain period of the day, concentration would be higher as compare to others. One such example would be week 31, 0600hrs.
However, these visualization can be misleading as it does not show the trend throughout the year. Therefore, another visualization is introduce to prevent any inaccurate or bias in analysis. Based on the visualization below, observation can be made that the average air pollution by year has decrease (evident from 2013Q3 - 2014Q1 having higher than 2013Q2, 2013Q3 and 2014Q2). However, certain period of the year still have higher average concentration.
Do you see any trends of possible interest in this investigation?
- After generating the average concentration of p1 by month, I have notice that on the month of October, average concentration start to rise and at January, average concentration start to drop till about February to March. One possible reason would be seasonality. Based on the visualization above, there might be a possibility that at lower temperature, concentration will increase.
What anomalies do you find in the official air quality dataset?
- Missing data happens to all sensors in all years with 2017 being one of the year that has missing data from January 2017 to 27 November 2018
- During 2017, air quality station with common name Druzhba, has a high average concentration as compare to other air quality station. In addition, air quality station Nadezhda, has a lower average concentration.
- Nadezhda has unusual high concentration during the period of April to May 2018 between 4am to 11am.
- All 4 sensors contains daily records that are high in concentration.
How do these affect your analysis of potential problems to the environment?
Anomaly / Unusual Events | Potential Trend / Issue |
---|---|
Missing data happens to all sensors in all years with 2017 being one of the year that has missing data from January 2017 to 27 November 2018 | This could potentially cause an inaccurate analysis. As there might be a correlation between season and air pollution, having only December will cause average concentration to be high for the year of 2017, leading to inaccurate analysis. |
During 2017, air quality station with common name Druzhba, has a high average concentration as compare to other air quality station. In addition, air quality station Nadezhda, has a lower average concentration. |
Initial calculation of the 2 sensors distance is approximately 10km apart. Within 10km, concentration shouldn't be too far apart between this 2 sensors. For Nadezhda, one possible would be the amount of data recorded as compare to other 3 sensors. With only 274 days of data, insight could be inaccurate due to lack of data. For Druzhba, there is a possibilities that the sensor is picking up incorrect readings. |
Nadezhda has unusual high concentration during the period of April to May 2018 between 4am to 11am. | As the sensor is located within the vicinity of home for locals, it may be difficult to narrow down to the actual cause of this issue because there could be many different factors ranging from incorrect reading to household burning of fossil fuels that affect the reading. |
All 4 sensors contains daily records that are high in concentration. | Since the amount of outliers are fairly consistent, there might be a pattern in the data. |
Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements
Are they well distributed over the entire city?
Sensors are relatively distributed around the city, however, using a density map, I observe that north-east side of Sofia city has slightly more sensors as compare to other part of the city.
Are they all working properly at all times?
- Most of the time, sensors are working well with a steady increase in the first few months and having a stable number of sensors during the later part of 2018. However, at certain period of the dataset, there are sensors that stop working. This can be seen as there is a sudden dip in the visualization.
- On 26 March 2018 12pm, there is a sudden increase by approximate 40 sensors. This could be new user trying out the sensors during that period of the day.
Can you detect any unexpected behaviors of the sensors through analyzing the readings they capture?
- On certain occasion, average concentration of p1 and p2 seem to be higher as compare to usual. Using the visualization below, we can see that there are several spike in the data.
Which part of the city shows relatively higher readings than others? Are these differences time dependent?
As outlier data has been remove prior to generating this chart and this dashboard only show data point that are above the EC Air Quality Standards for p1 and p2. By highlight all the data point in the above chart (Avg Con. by Hour By Location), the data point that have relatively high reading will be show in the sensor distribution map.
Based on the result and some Google Map search on latitude and longitude, 2 of the location ([42.676, 23.320], [42.696, 23.304]) are denser with locals. Higher reading from this 2 location could be due to household burning of fossil fuels or transportation. The last 2 location ([42.67, 23.232], [42.610, 23.344]) are rather less denser with locals (based on Google Street View), however high readings due to household burning could not be rejected as there are still locals living in this 2 locations.
These higher readings are quite consistent as shown in the chart (Avg Con. by Hour by Location) throughout the 24 hours.
Task 3
Local energy sources
After reviewing data from 2017 to 2018, I realise that as the place get denser with local (ex: central of Sofia City), the concentration seem to be higher. This could be due to household burning of fossil fuels or transportation. However, the result can be inconclusive as there could be other external factors that affect the concentration.
Local meteorology such as temperature, pressure, rainfall, humidity, wind etc
One relationship can be observe between local meteorology and air pollution. When temperature start to decrease, concentration for PM2.5 and PM10 start to increases, precipitation and visibility start to decreases. One possible reasons could be due to temperature inversion. Where higher air sinks due to the air being squeezed or compressed from higher air pressure at lower elevation, sinking air can be warm to the point it is warmer than air below it. Also known as upper air inversion, it create a barrier to prevent thunderstorms from forming. This can cause poor air quality as pollutant can build up.
Complex interactions between local topography and meteorological characteristics.
One relationship can be observe is that when surface temperature increase or decrease, high grounds (darker part of the map in the visualization) does not seem to be affected by the increase in concentration level. One possible reason is that high ground causes difficulties for the pollutant to be disperse which results the concentration of P1 and P2 relatively high in Sofia City.
Transboundary pollution for example the haze that intruded into Singapore from our neighbours.
After viewing the data by week. Within the visualization, it was noted that the next high concentration area is at Plovdiv (See below image).
After some research, I found out that there is a heating station at that location. Manage by Toplofikatsiya Plovdiv in the city of Plovdiv in Southern Bulgaria, it is one of the largest heating network in the country. Consisting of 2 power stations, one is steam-gas power station for electricity and thermal power and another, heat-only boiler station.
Publish on 6 April 2018, novinite (Sofia news agency) reported that Plovdiv has the most polluted air in the last heating season. With the news as evident and the distance between 2 state, I am confident to say that the power station located at Plovdiv could be a reason that Sofia city is experiencing high air pollution.
References
- EC Air Quality Standard [[1]]
- How winter affect air pollution [[2]]
- Environment in Sofia [[3]]
- Cause, effect and solution for air pollution [[4]]
- How Meteo affect air pollutant [[5]]
- Bulgaria season [[6]]
- News on Plovdiv (For Transboundary) [[7]]
- Topo Layer on Tableau Map [[8]]