Difference between revisions of "IS428 AY2018-19T1 Tian Seet Yuen"

From Visual Analytics for Business Intelligence

Revision as of 16:59, 11 November 2018

Problem & Motivation

Air pollution is an important risk factor for health in Europe and worldwide. A recent review of the global burden of disease showed that it is one of the top ten risk factors for health globally. Worldwide an estimated 7 million people died prematurely because of pollution; in the European Union (EU) 400,000 people suffer a premature death. The Organisation for Economic Cooperation and Development (OECD) predicts that in 2050 outdoor air pollution will be the top cause of environmentally related deaths worldwide. In addition, air pollution has also been classified as the leading environmental cause of cancer.

Air quality in Bulgaria is a big concern: measurements show that citizens all over the country breathe in air that is considered harmful to health. For example, concentrations of PM2.5 and PM10 are much higher than what the EU and the World Health Organization (WHO) have set to protect health.

Bulgaria had the highest PM2.5 concentrations of all EU-28 member states in urban areas over a three-year average. For PM10, Bulgaria also tops the list of most polluted countries, with a daily mean concentration of 77 μg/m³ (the EU limit value is 50 μg/m³).

According to the WHO, 60 percent of the urban population in Bulgaria is exposed to dangerous (unhealthy) levels of particulate matter (PM10).

Data Exploration and Preparation

Official Air Quality Dataset

EEA Data

Firstly, let's look at the Official Air Quality data. According to the metadata, there are 6 stations - Nadezhda, Hipodruma, Druzhba, Orlov Most, IAOS/Pavlovo and Mladost. All datasets were merged together into a single csv file.

The important variables are as follows:

  1. AirQualityStationEoICode
  2. CommonName
  3. AirPollutant
  4. AveragingTime
  5. Concentration
  6. DateTime
  7. Longitude
  8. Latitude
Problem #1 Missing Values
Issue Upon further inspection, data for 2016 - 2018 were missing for Orlov Most station, and data for 2013 - 2017 were missing for Mladost. In addition, data from Jan 2017 to Oct 2017 were missing for all stations.
Solution Since the main goal is to visualize the overall characteristics of air quality in Sofia City, Orlov Most station was excluded completely because most of its data are missing. The remaining stations were kept in the EDA process to discover potential patterns and insights.
Problem #2 Different AveragingTime formats
Issue In this set of time-series data, PM10 concentration values are recorded under 3 different AveragingTime formats: Day, Var and Hour. For example, data for 2016 - 2018 are mostly recorded in Hour or Var format.
Solution The data were separated into two sets: Daily and Hourly. First, Var values were converted to Hour values by deducting 1 hour from the Var values. Next, Hour values were converted to Day values by computing the mean of the Hour values in each day. This yields two sets of data: Daily and Hourly.
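The Hour-to-Day step can be sketched in Python as below. This is only an illustration of the averaging described above, not the procedure actually used for this page; the function name and the sample readings are hypothetical, and the Var-to-Hour adjustment is not shown.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_means(hourly_rows):
    """Average (timestamp, concentration) pairs into one value per calendar day."""
    by_day = defaultdict(list)
    for timestamp, concentration in hourly_rows:
        by_day[timestamp.date()].append(concentration)
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Hypothetical hourly PM10 readings, for illustration only.
hourly = [
    (datetime(2018, 1, 1, 0), 40.0),
    (datetime(2018, 1, 1, 1), 60.0),
    (datetime(2018, 1, 2, 0), 20.0),
]
daily = daily_means(hourly)
for day, value in daily.items():
    print(day, value)
```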

Citizen Science Air Quality Dataset

Air Tube

Next, let's explore the Citizen Science Air Quality measurements. There were 2 datasets, one for 2017 and another for 2018. Both datasets were combined using a notebook written in R.

R Notebook to combine datasets.png
Problem #1 Tableau unable to read 11-character Geohash values
Issue In the datasets, 11-character geohash was used to provide geographical details. However, Tableau is unable to read these values.
Solution
R geo decode.png

An R Notebook was used to derive the latitude and longitude values from these geohash values. These values were then added to the dataset as new columns.
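For readers without the R Notebook, the decoding step can be sketched in plain Python. This is a minimal sketch of the standard geohash algorithm, not a copy of the notebook shown in the screenshot; the function name is hypothetical and the sample geohash is a well-known textbook value rather than one from the dataset.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def decode_geohash(geohash):
    """Return the (latitude, longitude) at the centre of the geohash cell."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # bits alternate, starting with longitude
    for char in geohash.lower():
        index = BASE32.index(char)
        for shift in range(4, -1, -1):  # 5 bits per character, high bit first
            bit = (index >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2


lat, lon = decode_geohash("ezs42")
print(round(lat, 3), round(lon, 3))
```

The same function works on 11-character geohashes; longer strings simply narrow the cell further.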

Problem #2 Presence of additional latitude and longitude values outside of Sofia City
Issue
Prob1.png

After plotting the map based on the geohash values, many latitude and longitude values were found to lie outside of Sofia City. These points are irrelevant to our research, which is based solely in Sofia City.

Solution
Sol1.png

Tableau was used to manually exclude the dots (lat and lng values) that appeared outside of Sofia City.
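The exclusion above was done by hand in Tableau. A scripted alternative could look like the sketch below, which keeps only points inside a rectangular bounding box. The box coordinates are rough, hypothetical values chosen for illustration and are not the boundary used on this page; the sample points are also made up.

```python
# Hypothetical bounding box around Sofia City (an assumption, not from the dataset).
SOFIA_BOUNDS = {"lat_min": 42.6, "lat_max": 42.8, "lon_min": 23.2, "lon_max": 23.5}


def inside_bounds(lat, lon, bounds=SOFIA_BOUNDS):
    """Return True if the point lies inside the rectangular bounding box."""
    return (bounds["lat_min"] <= lat <= bounds["lat_max"]
            and bounds["lon_min"] <= lon <= bounds["lon_max"])


# Illustrative points: one near the city centre, two far outside it.
points = [(42.70, 23.32), (43.21, 27.91), (42.14, 24.75)]
kept = [p for p in points if inside_bounds(*p)]
print(kept)
```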

Problem #3 Lack of Regional Identification
Issue
Prob2.png

Task 2 specifically asks for regional differences, yet on this map viewers cannot tell which region each dot (lat and lng value) belongs to.

Solution
Kmeans clustering.png

First, I exported the data from Tableau to SAS Enterprise Guide. Next, I applied k-means clustering on two variables, the latitude and longitude values of each Geohash. I set k (the number of clusters) to 9, which I determined to be representative of the 952 Geohash values. I then saved the output and re-ran Tableau. Now, I have a Regional Identity for each Geohash value, which allows me to tackle Task 2 and spot regional differences in terms of P1 and P2 values.
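The clustering itself was run in SAS Enterprise Guide. To show what k-means does with latitude and longitude pairs, here is a minimal pure-Python sketch of Lloyd's algorithm. It is not the SAS procedure, its initialisation and stopping rule are assumptions, and the six sample points are made up (the real run used 952 geohashes with k = 9).

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster 2-D points with Lloyd's algorithm; return (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed init: k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


# Hypothetical (lat, lng) pairs forming two obvious groups.
sample = [(42.70, 23.30), (42.71, 23.31), (42.69, 23.29),
          (42.60, 23.45), (42.61, 23.46), (42.59, 23.44)]
labels, centroids = kmeans(sample, k=2)
print(labels)
```

Each label plays the role of the Regional Identity described above.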

Interactive Visualization and Design

Task 1 Filter bar
Because many visualizations are needed to tackle Task 1, squeezing all of them into one sheet would result in an overly cluttered dashboard. Instead, icons were created to allow users to navigate across sheets and view visualizations for different purposes.
Design2
This geographical map acts as a filter and gives users a good sense of each station's geographical location. Upon hovering over a circle, a tooltip shows the station's name. Sheets within the dashboard respond to this filter, which allows users to draw insights about the relationship between geographical location and PM10 values.
Design3
Reference lines are added to this density plot so that users can immediately see which range a PM10 measurement falls within. The ranges follow the Air Quality Index given by EEA. Source: http://airindex.eea.europa.eu/. Reference lines are also color coded to represent the danger of PM10 values, whereby “Good” is color coded as dark green and “Very Poor” is color coded as bright red, which signifies danger.
Design4
Upon hovering, the tooltip tells you immediately the AQI category and Average concentration value for the PM10 measurement.
Design5
This line chart is also color coded based on the AQI category. Viewers are able to tell immediately the trend of PM10 values based on hour.
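The mapping from a PM10 value to an AQI category, as used for the reference lines, tooltip and line colors above, can be sketched as follows. Only the Moderate range (35 – 50) and the category names are stated on this page; the remaining cut points (20 and 100) are assumptions based on the EEA index linked above and should be checked against the source, as should the handling of values that fall exactly on a boundary.

```python
from bisect import bisect_left

# Upper bounds in μg/m³. 35 and 50 come from the Moderate range on this page;
# 20 and 100 are assumed from the EEA index and are not confirmed here.
PM10_UPPER_BOUNDS = [20, 35, 50, 100]
PM10_CATEGORIES = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor"]


def pm10_category(concentration):
    """Return the AQI category for a PM10 concentration (upper bounds inclusive)."""
    return PM10_CATEGORIES[bisect_left(PM10_UPPER_BOUNDS, concentration)]


for value in (10, 42, 120):
    print(value, pm10_category(value))
```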

The following Dashboard is designed to tackle Task 1 specifically:

Dashboard1
Dasboard1.2
Design6.png
Dashboard icons allow viewers to navigate across sheets. Multiple sheets are used to avoid cluttering a single sheet with too many graphs.
Design7.png
Slider allows viewers to step through each Month and observe any time-dependent patterns.
Design8.png
Toggle for users to select measure and filter the graph based on interest.
Design9.png
Color-coded map which shows each Geohash as a coloured circle, according to its clustered region. This map also acts as a filter and highlight, to allow viewers to identify any geographically dependent patterns. Viewers are also able to select the Region based on the Legend shown.

The following Dashboard is designed to tackle Task 2 specifically:

Dashboardver2.png
Dashboardver2.1.png

Task 1: Spatio-temporal Analysis of Official Air Quality

Characterize the past and most recent situation with respect to air quality measures in Sofia City. Do you see any trends of possible interest in this investigation?
Answer1.png

Past vs Most Recent Characteristics of Air Pollution in Sofia City and Observed Trends

1. This Calendar Heatmap shows the average values of PM10 in Sofia City, by Month and Hour. In this heatmap, the colors shift towards yellow in November, then red in December and January, before fading back to yellow and light green in February. From March to October, most boxes are shaded green.

This suggests that PM10 levels in Sofia City follow a seasonal trend, peaking in December and January at Poor to Very Poor levels, while remaining relatively stable within Good to Moderate AQI (Air Quality Index) ratings in the 8 months from March to October.

This observation is interesting as December and January fall within the Winter season for most countries in Europe. There may be a negative correlation between temperature and PM10 values, whereby the lower the temperature, the higher the PM10 value. I will explore this further in Task 3 to test this hypothesis.

2. As we look across this Heatmap (Top Down), we see that the colors are trending towards yellow and green. The most obvious trend could be observed by comparing the month of December in 2017 to the Decembers in earlier years.

In 2018, for the months of May to September, the green boxes are shaded even darker than those between 2013 and 2016. This suggests that PM10 values in Sofia City could be improving slightly towards the healthier range in recent times.

Answer2.png

3. At the left of the above image, the density plot shows all absolute values of PM10 recorded by EEA for Sofia City. As we filter the months of May to September, we can observe that values of PM10 in Year 2018 are mostly concentrated within Good to Satisfactory levels. This was not the case in the earlier years, where PM10 values (especially 2013 – 2015) were concentrated within the Moderate level as well.

In addition, we can also observe a consistent decrease in density level of PM10 measurements that fall within the Moderate Range (35 – 50) over the years.

Hence, this supports the earlier observation that PM10 values in Sofia City could be improving slightly towards the healthier range in recent times.

Answer4.png

4. In these two stacked bar graphs, we can observe a significantly higher proportion of PM10 measurements falling within the “Poor” and “Very Poor” AQI levels in January, November and December than in the other months.

What does a typical day look like for Sofia city?
Answer5.png

Typical day in Sofia City

1. This Heatmap shows average PM10 values by Day and Hour. In Sofia City, PM10 levels seem to peak in the early hours (0:00 – 8:00) and in the evening (18:00 – 24:00), and seem to be lowest during the day (9:00 – 17:00). However, this trend is not very consistent across the days of the month.

What anomalies do you find in the official air quality dataset? How do these affect your analysis of potential problems to the environment?
Answer6.png

Anomalies and their Impact on Analysis

1. Diving deeper into the PM10 values for each station, Druzhba Station interestingly had significantly healthier values of PM10 from November 2017 to September 2018. These values could be anomalies that distort the overall trend of PM10 values in Sofia City. For instance, the earlier conclusion (that PM10 values have improved towards the healthier ranges) could be false, as most of the improvement could be attributable to Druzhba Station rather than Sofia City as a whole.

Answer7.png

2. Between December 2013 and January 2014, average PM10 values were extremely high and frequently fell within the “Very Poor” AQI range.

3. Data were also missing from January to October 2017. This could skew the overall trend of PM10 values towards a healthier range, since January measurements, which tend to be higher, are missing.

Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements

Characterize the sensors’ coverage, performance and operation. Are they well distributed over the entire city?
Answerqn2

Sensors’ coverage

1. In this Density Choropleth Map, the intensity of the color red represents the distribution and concentration of Sensors within Sofia City. Interestingly, there is a significant lack of sensors in the north-east region of Sofia City, while the central region has a high concentration of sensors. The lack of sensors in the north-east may be because it is sparsely populated, while the central region is likely to be densely populated and thus has more sensors.

Are they all working properly at all times?
Answerv2-2.png

Sensors’ performance and operation

1. This line graph shows the total sum of Pressure recorded by all sensors. Sum was used instead of average so that any unusual behaviour shows up as a stark difference, which an average could smooth away. From this graph, we see that on 31 January, 30 March, 1 April and 1 May, Pressure values were 0. This is likely due to scheduled monthly maintenance at the end/start of each month.

However, between 4 and 12 July 2018, values of 0 were recorded as well. This may be due to a sensor equipment breakdown or an omission of values.

Lastly, we see that on 26 March 2018, there was a stark fall in Pressure readings, and this continued for the rest of the year. This is likely due to the removal of several sensors.
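The check described above (sum Pressure per day across all sensors, then look for days where the total is 0) can be sketched as below. The function names and the sample readings are hypothetical; the actual analysis was done visually in Tableau.

```python
from collections import defaultdict
from datetime import date


def daily_pressure_sums(readings):
    """Sum (day, pressure) readings from all sensors into one total per day."""
    totals = defaultdict(float)
    for day, pressure in readings:
        totals[day] += pressure
    return dict(totals)


def zero_days(totals):
    """Return the days on which the summed reading is exactly 0."""
    return sorted(day for day, total in totals.items() if total == 0)


# Hypothetical readings from two sensors over two days.
readings = [
    (date(2018, 1, 30), 95000.0), (date(2018, 1, 30), 94800.0),
    (date(2018, 1, 31), 0.0), (date(2018, 1, 31), 0.0),
]
totals = daily_pressure_sums(readings)
print(zero_days(totals))
```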

Answerv2-3.png


Answerv2-4.png

2. The same trend was observed for Humidity and P2, whereby values dipped significantly on 31 January, 30 March, 1 April, 1 May and between 4 and 12 July 2018. However, apart from 31 January, Humidity values were not completely 0 on the stated days. This suggests that a minority of the sensors were unaffected by the unconfirmed event that caused measurements to be recorded as 0.

Similar to Pressure, Humidity values dropped and did not recover back to the original trend after 26 March 2018.

Answerv2-5.png

3. Meanwhile, the sensors for P1 seem to have worked consistently well from September 2017 to August 2018. There was no unexpected behaviour from these sensors.

In conclusion, it is likely that the sensors used to measure Humidity, Pressure and P2 are multi-functional and measure the 3 variables together, while P1 is measured by separate sensors.

Now turn your attention to the air pollution measurements themselves. Which part of the city shows relatively higher readings than others?
3-1.png

Regional differences in P1 and P2 readings

1. In January, Regions 2, 4 and 9 seem to show relatively higher readings for P1 and P2. Average values for P1 and P2 were significantly higher in these top three Regions than in the bottom four Regions 3, 5, 7 and 8. In addition, P1 and P2 values appear to fluctuate greatly in January.

3-2.png

2. The top three regions 2, 4 and 9 also tend to be in the central area of Sofia City, while the bottom four Regions 3, 5, 7 and 8 tend to be located within the outskirts of the central regions. As such, differences in P1 and P2 readings are likely to be location dependent, at least for the month of January.

Next, to investigate if these differences are time dependent, we look at the average values of P1 and P2 across each month, for each region.

3-3.png

3. As we scroll from February to March, we observe that the rankings start to change. By May, the differences between regions are not as large as in January. In addition, all regions show lower variance in the average values of P1 and P2 than in January.
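The comparison above relies on the mean and variance of P1 (or P2) per region and month, and on ranking regions by their mean. A minimal sketch of that calculation is shown below; the function names and the sample rows are hypothetical and the numbers are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean, pvariance


def region_month_stats(rows):
    """Map (region, month) to (mean, variance) of the readings in that group."""
    groups = defaultdict(list)
    for region, month, value in rows:
        groups[(region, month)].append(value)
    return {key: (mean(vals), pvariance(vals)) for key, vals in groups.items()}


def rank_regions(stats, month):
    """Return regions ordered from highest to lowest mean for the given month."""
    in_month = [(region, avg) for (region, m), (avg, _) in stats.items() if m == month]
    return [region for region, _ in sorted(in_month, key=lambda item: item[1], reverse=True)]


# Hypothetical (region, month, P1) rows for January.
rows = [
    (2, 1, 80.0), (2, 1, 100.0),
    (3, 1, 30.0), (3, 1, 50.0),
    (9, 1, 90.0), (9, 1, 130.0),
]
stats = region_month_stats(rows)
print(rank_regions(stats, month=1))
```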

3-4.png

4. Interestingly, in August, while all other regions showed relatively low values of P1 and P2, Region 9 showed much higher values for both measures. Region 9 also had the highest variance.

3-5.png

5. Lastly, as we scroll towards the month of December, we see that the rankings return towards those observed in January. Differences in P1 and P2 values between regions are large and obvious once more. As such, these differences in P1 and P2 values are time dependent: Regions 2, 4 and 9 show relatively higher average values of P1 and P2 in December and January.

Task 3: Relationships and Causal Factors of Air Pollution


References


Comments