Difference between revisions of "IS428 AY2018-19T1 Lau Zi Quan"

From Visual Analytics for Business Intelligence
Jump to navigation Jump to search
Line 136: Line 136:
 
''' Sensor Performace:'''<br> Sensor Performance can be identified by looking at the number of records collected at every hour. we will roughly identify the number of sensors we have and if there is no records or drop in the number of records, that would mean that it has unusual performance.  
 
''' Sensor Performace:'''<br> Sensor Performance can be identified by looking at the number of records collected at every hour. we will roughly identify the number of sensors we have and if there is no records or drop in the number of records, that would mean that it has unusual performance.  
  
<centre> 2017 </centre>
+
<center> 2017 </center>
 
[[File:SensorDistribution2017.PNG|center|500px]]
 
[[File:SensorDistribution2017.PNG|center|500px]]
<centre> 2018 </centre>
+
<center> 2018 </center>
 
[[File:SensorDistribution2018.PNG|center|500px]]
 
[[File:SensorDistribution2018.PNG|center|500px]]
 
<br>  
 
<br>  
We zoom in to the map of Sofia City to look at the distribution as AirTube sensor data is spread across from the whole Bulgaria.  
+
We zoom in to the map of Sofia City to look at the distribution as AirTube sensor data is spread across from the whole Bulgaria. 2017 could be the start of the project, where sensors are deployed and crowdsource to gather the information. We see a steady growth in 2017 and more clustered in Sofia city and some of its outskirts. 2018 data points increases but we identified that there are some days where the sensors collected drastically drop to as low as only 6 readings.  
  
  
 
''' Sensor Operations : Unusual Behaviour (Detecting Outlier) '''<br>
 
''' Sensor Operations : Unusual Behaviour (Detecting Outlier) '''<br>
''' Input Dashboard Here'''<br>
+
Next detecting the unusual behaviour of the data collected. Sensors are collected on a voluntary basis so it is expected the data to be dirty and consist outliers.
 +
Air Tube data consists of Humidity, Pressure, Temperature and Concentration (PM10 and PM2.5). Even the Geohash consist outlier, but we have cleared it earlier.
 +
The best way to calculate outlier and look at the distribution for numerical data is with a box plot. An outlier can be calculated based on 1.5 times of IQR higher than Q3 or lower than Q1.  But in this case, the data are scientific data. This means that we first have to understand the nature of the measures like what does Pressures means and how much pressure on Earth is an acceptable value. For example, Pressure is acceptable for 100kpa at sea level. So if we have like 0kpa means the person is at very high altitude, would probably be located at the Stratosphere, 10km above sea level. I also look at the weather forcast to look at the normal temperature range in Bulgaria and Sofia City. We can give a high threshold as it might increase or degrees due global warmining, but in this case, our objective is to identify the outlier first.
  
 +
<center> Min and Max Outlier Box Plot </center>
 
[[File:IdentifySensorUnusualBehaviour.PNG|center|500px]]
 
[[File:IdentifySensorUnusualBehaviour.PNG|center|500px]]
 +
<br>
 +
From this data, the blue data points are the maximum values recorded at each location, and red represent the minimum values recorded. For each measure we can determine the outliers clearly.
 +
Firstly, the concentration of PM 10 and 2.5 is as high as 2000, that is way above the unhealthy and dangerous range of 300.
 +
Secondly, pressure has extreme values of 0 to 300kpa. Next, humidity is way outside the 0-100 scale. Values of 5000 and -5000 could be missing values placeholder.
 +
Lastly, Temperature collected is way below and above the highest temperature range of Bulgaria.
  
 +
Therefore, I felt that there is a need to clean the data first before we proceed to use and visualise the data.
 +
 +
<center> Min and Max Outlier Box Plot (After) </center>
 
[[File:Cleaned Outlier.PNG|center|500px]]
 
[[File:Cleaned Outlier.PNG|center|500px]]
  
 +
From this, we can see that the values are more acceptable and at a realistic range.
  
 
''' Part 2 : ''' <hr>
 
''' Part 2 : ''' <hr>

Revision as of 02:27, 12 November 2018

To be a Visual Detective


Overview

Air pollution is an important risk factor for health in Europe and worldwide. A recent review of the global burden of disease showed that it is one of the top ten risk factors for health globally. Worldwide an estimated 7 million people died prematurely because of pollution; in the European Union (EU) 400,000 people suffer a premature death. The Organisation for Economic Cooperation and Development (OECD) predicts that in 2050 outdoor air pollution will be the top cause of environmentally related deaths worldwide. In addition, air pollution has also been classified as the leading environmental cause of cancer.

Air quality in Bulgaria is a big concern: measurements show that citizens all over the country breathe in air that is considered harmful to health. For example, concentrations of PM2.5 and PM10 are much higher than what the EU and the World Health Organization (WHO) have set to protect health.

Bulgaria had the highest PM2.5 concentrations of all EU-28 member states in urban areas over a three-year average. For PM10, Bulgaria is also leading on the top polluted countries with 77 μg/m3on the daily mean concentration (EU limit value is 50 μg/m3).

According to the WHO, 60 percent of the urban population in Bulgaria is exposed to dangerous (unhealthy) levels of particulate matter (PM10).


Dataset Analysis & Transformation Process

Four major data sets in zipped file format are provided. They can be download by click on this link.

  • Official air quality measurements (5 stations in the city)(EEA Data.zip) – as per EU guidelines on air quality monitoring see the data description HERE…
  • Citizen science air quality measurements (Air Tube.zip), incl. temperature, humidity and pressure (many stations) and topography (gridded data).
  • Meteorological measurements (1 station)(METEO-data.zip): Temperature; Humidity; Wind speed; Pressure; Rainfall; Visibility
  • Topography data (TOPO-DATA)


Dataset Data Attributes Rationale Of Usage
EEA Data.zip
EEA EDA.png
  1. Official air quality measurements (5 stations in the city)
  2. There is a Metadata.xlsx file which contains information of stations.
  3. There are a total of 6 Stations.
  4. Druzhba, Hipodruma, IAOS/Pavlovo and Nadezhda have data from 2013 to 2018
  5. Mladost only has data in 2018, it may be the replacement the station of OrlovMost which only have data from 2013 to 2015.
  6. All data from EEA are validated, valid and verified.
Air Tube.zip
AirTube EDA.png
  1. Citizen science air quality measurements
  2. Because of the information and coverage by official air measurements data are limited, Crowdsourcing from the citizens to gather data are important.
  3. These sensors are deployed all over Bulgaria across the year 2017 to 2018.
  4. Data from AirTube are expected to be noisy and requires cleaning and verification.
  5. P1 data is PM10 and P2 data is PM2.5
METEO-data.zip
Meteo EDA.png
  1. Data collected by Meteorological measurements
  2. There are a total of 2449 records in this dataset.
  3. This data is collected at Sofia Airport and is measured at a standard 2-meter height. Pressure data is adjusted to sea-level.
  4. Data have undergone QA and should be without error.


Transformation Process

Issue 1 : Merging of Data Set
For EEA Data, there is a total of 28 csv file from different stations location in sofia across different years. In addition there a xlsx metadata which consists imporatant information like CommonName (Station Name) and Latitude and Longtitude of the station. In order to proceed with analysis, we need to merge all the data set together.
Solution :
Method 1: Using Python Pandas:
I made use of python pandas read_csv function to load the data into dataframe in order to concatenate the data. After we concat the data, we can merge the data based on the StationEoICode to get more information on the stations.

MergeEEAPython.PNG


I used the same method to merged Air Tube Data for the year 2017 and 2018. By using this method, I can export the csv into a single file and load it into different analytics tool for visualization. Therefore I used this method over the next method, however eventually did not manage to use other visualisation tool
Method 2: Using Tableau Union Function:
Similarly, Tableau can merge the data together using the union features. Subsequently, we can inner join the data based on the StationEoICode.

TabMergeEEA2.PNG

Issue 2 : Geohashing for Airtube Data
Solution  : I used python to convert the geohash to coordinates. I reference to python geohash2 library link.

GeoHashDecoder.PNG


After I decode the geohash, I discover that there are noises in the process. There is a particular geohash "m-2105171" which is unable to be decode. I used this converter to decode the geohash to obtain the latitude and longtitude. There are also a total of null values in the geohash data. In this case, I would probably be remove these 5 records as the geolocation is not located in Sofia or even near Bulgaria.

GeoHashError.png


Issue 3 : Outlier/Noise in Airtube Dataset
During the Exploratory Data Analysis Process, I discovered extreme values in the dataset for certain measures. However, for Task 2, exploring outlier for Airtube Data is required. Thus this clean dataset is used after and before Task 2 to compare the results.

  1. Temperature: The lowest temperature in Bulgaria is -38.3 degrees Celsius and the highest is 45.2 degrees Celsius. In the Airtube Data, there are values as low as -400 and high as 70 degrees. Thus, these records could have erroneous data and need to be treated.
  2. Pressure: Pressure at sea level is 100 kpa,(100,000 Pa) and the higher the elevation, then the lower the pressure. Therefore it would be interesting if the pressure falls below 0 or higher that 200kpa. Pressure reduce by 1.2kpa for every 100 meters elevation.
  3. Humidity: Humidity is often measured in a relative scale from 0 to 100. Thus values like 1000 and -1000 could be type or noises.

Solution :

  1. Temperature : I removed values for Temperatures that are > 50 degrees Celsius and < -50 degrees Celsius. That is about 25,068 records removed.
  2. Pressure: Remove > 50kpa and < 150 kpa
  3. Humidity: Remove the > 1000 and -1000

Issue 4 : Transforming Meteo Dataset
Meteo Data is collected from Sofia Airport. It contains data like wind speed, humidity, temperature and etc. The missing value is replaced with -9999. However, in order for tableau to read the data, we need to transform the 3 columns of Year, Month and Day into one. Solution :
I used python to concatenate the columns and export it.


Task 1: Spatio-temporal Analysis of Official Air Quality

Firstly, we are looking at only EEA Data from 2013 to 2018. By looking at the data as a whole, we identified that all stations have missing values from the period of 1 Jan 2017 to 28 November 2017.

OverallControlChart.PNG


From this simple plot, we are able to identify that there is a pattern in the increase of the concentration of PM10. This means that there could be an interesting reason for the cause. Thus I decided to explore what is the current standards for PM10 to be considered unhealthy. Sofia City is located in Bulgaria, which is part of EU, thus I referenced to their standards of air quality from this link. From this link we can further categorize the PM Air quality into different categories by on EU Air Quality Standards. Firstly 50μg/m3 measured daily is the limit for Bulgaria with a 35 exceedences each year. Thus we need to generate a graph that can clearly pinpoint on which day the concentration exceeds and when are the days where people in Sofia city can enjoy breathing healthy air.

ColorGroup.PNG

I used the above categorization as my Color Scaling to visualize how a typical day in Sofia City looks like.

Heatmap PM10 Concentration.PNG

By categorizing the concentration, we can identify that actually, Sofia City is facing a high level of concentration of PM10. Surprisingly, other than the spikes in January and December, Sofia City is also facing a high concentration of pollutant across the years except for June.

ControlChart.PNG

Although Heatmap can highlight the seriousness of pollution Sofia is facing, but using Control Plot, we can use it to identify the underlying pattern and interesting insight from this graph. You can notice that every year during 24th December and between 18th to 24th January, there is a significant rise in the concentration of PM10 in Sofia. Could this be a coincidence or a reason behind this? I look up the national holidays of Bulgaria and try to identify to see other Festive Seasons also have a significant rise other than Christmas Season, but in this case, there isn't.

I found out that this is due to their culture and custom. This is article, people in Bulgaria celebrate Christmas Eve by following their tradition. Strict tradition demanded that a fire is built in the hearth, with enough wood to burn all night and into Christmas Day, to help with the new birth of the sun. Thus, we can actually see that the air pollutant rises over the night during those periods. It is kind of similar to how the Chinese celebrate the lunar festival by burning incense.

Putting the charts together
Task 1 Dashboard.png

After building all the individual charts, I put together those charts which I deem that can tell the most story from and gather insights to build a dashboard. We can further split and identify the trend of each Official Station.


Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements

Part 1 :


Sensor Coverage :
In order to identify sensor coverage, we need to use a map to look at the location of each sensor and how many sensors are located in which area. This can be answered by looking at the distribution across Sofia map. Sensor Performace:
Sensor Performance can be identified by looking at the number of records collected at every hour. we will roughly identify the number of sensors we have and if there is no records or drop in the number of records, that would mean that it has unusual performance.

2017
SensorDistribution2017.PNG
2018
SensorDistribution2018.PNG


We zoom in to the map of Sofia City to look at the distribution as AirTube sensor data is spread across from the whole Bulgaria. 2017 could be the start of the project, where sensors are deployed and crowdsource to gather the information. We see a steady growth in 2017 and more clustered in Sofia city and some of its outskirts. 2018 data points increases but we identified that there are some days where the sensors collected drastically drop to as low as only 6 readings.


Sensor Operations : Unusual Behaviour (Detecting Outlier)
Next detecting the unusual behaviour of the data collected. Sensors are collected on a voluntary basis so it is expected the data to be dirty and consist outliers. Air Tube data consists of Humidity, Pressure, Temperature and Concentration (PM10 and PM2.5). Even the Geohash consist outlier, but we have cleared it earlier. The best way to calculate outlier and look at the distribution for numerical data is with a box plot. An outlier can be calculated based on 1.5 times of IQR higher than Q3 or lower than Q1. But in this case, the data are scientific data. This means that we first have to understand the nature of the measures like what does Pressures means and how much pressure on Earth is an acceptable value. For example, Pressure is acceptable for 100kpa at sea level. So if we have like 0kpa means the person is at very high altitude, would probably be located at the Stratosphere, 10km above sea level. I also look at the weather forcast to look at the normal temperature range in Bulgaria and Sofia City. We can give a high threshold as it might increase or degrees due global warmining, but in this case, our objective is to identify the outlier first.

Min and Max Outlier Box Plot
IdentifySensorUnusualBehaviour.PNG


From this data, the blue data points are the maximum values recorded at each location, and red represent the minimum values recorded. For each measure we can determine the outliers clearly. Firstly, the concentration of PM 10 and 2.5 is as high as 2000, that is way above the unhealthy and dangerous range of 300. Secondly, pressure has extreme values of 0 to 300kpa. Next, humidity is way outside the 0-100 scale. Values of 5000 and -5000 could be missing values placeholder. Lastly, Temperature collected is way below and above the highest temperature range of Bulgaria.

Therefore, I felt that there is a need to clean the data first before we proceed to use and visualise the data.

Min and Max Outlier Box Plot (After)
Cleaned Outlier.PNG

From this, we can see that the values are more acceptable and at a realistic range.

Part 2 :


YearlyQuardrant.PNG
DecemberQuardrant NoFilter.PNG
DecemberQuardrant.PNG
HourlyQuardrant.PNG
CompareHourly.png
HourlyCyclePlot.PNG

Using appropriate data visualisation, you are required will be asked to answer the following types of questions:

Characterize the sensors’ coverage, performance and operation. Are they well distributed over the entire city? Are they all working properly at all times? Can you detect any unexpected behaviors of the sensors through analyzing the readings they capture? Limit your response to no more than 4 images and 600 words.


Now turn your attention to the air pollution measurements themselves. Which part of the city shows relatively higher readings than others? Are these differences time dependent? Limit your response to no more than 6 images and 800 words.


Task 3: Air Quality Measure Analysis

CorrelationMetrix.PNG
Tenary.PNG
FilterTenary.PNG
MeteoDataCorrMetrix.PNG


Urban air pollution is a complex issue. There are many factors affecting the air quality of a city. Some of the possible causes are:

  1. Local energy sources. For example, according to Unmask My City, a global initiative by doctors, nurses, public health practitioners, and allied health professionals dedicated to improving air quality and reducing emissions in our cities, Bulgaria’s main sources of PM10, and fine particle pollution PM2.5 (particles 2.5 microns or smaller) are household burning of fossil fuels or biomass, and transport.
  2. Local meteorology such as temperature, pressure, rainfall, humidity, wind etc
  3. Local topography
  4. Complex interactions between local topography and meteorological characteristics.
  5. Transboundary pollution for example the haze that intruded into Singapore from our neighbours.

In this third task, you are required to reveal the relationships between the factors mentioned above and the air quality measure detected in Task 1 and Task 2. Limit your response to no more than 5 images and 600 words.


Conclusion

https://public.tableau.com/views/Final_StoryBoard/Spatial-TemporalAnalysisofAirQualityinSofiaCity?:embed=y&:display_count=yes&publish=yes

StoryBoard.PNG


Reference

  1. IS428 2016-17 Term1 Assign3 Gwendoline Tan Wan Xin link
  2. IS428 2016-17 Term1 Assign3 Tan Kee Hock link
  3. Datasciencesociety Monthly Challenge Sofia City By Jeremy link
  4. Datasciencesociety Monthly Challenge Sofia City By MARTINPETROV link
  5. Neelesh Kamkolkar Tableau - Sofia Pollution Data Exploration link
  6. Neelesh Kamkolkar Tableau - Sofia Official Pollution Data Story link
  7. Bulgaria Celebrates with Christmas Eve Traditions - link
  8. BULGARIAN OFFICIAL HOLIDAYS - link
  9. EU Air Quality - link
  10. Air Quality Index - link


Feedbacks

Please feel free to provide your feedback. Thank you.