Difference between revisions of "IS428 AY2018-19T1 Gokarn Malika Nitin"
Line 16: | Line 16: | ||
=<div style="background: #581845; padding: 15px; line-height: 0.3em; text-indent: 15px; font-size:18px; font-family:Open Sans, Arial, sans-serif"><font color= #ffffff>Dataset Analysis and Transformation Process</font></div>= | =<div style="background: #581845; padding: 15px; line-height: 0.3em; text-indent: 15px; font-size:18px; font-family:Open Sans, Arial, sans-serif"><font color= #ffffff>Dataset Analysis and Transformation Process</font></div>= | ||
− | + | <div style="font-family:Open Sans, Arial, sans-serif;font-size:15px"> | |
<!-- | <!-- | ||
− | + | ==<div style="font-family:Open Sans, Arial, sans-serif;font-size:17px">Dataset Download</div>== | |
− | |||
− | <div style="font-family:Open Sans, Arial, sans-serif;font-size: | ||
− | |||
− | </div> | ||
− | |||
− | |||
Four major data sets in zipped file format are used and are available for download below: | Four major data sets in zipped file format are used and are available for download below: | ||
− | * Official air quality measurements (5 stations in the city)(EEA Data.zip) – as per EU guidelines on air quality monitoring see the data description [https://drive.google.com/file/d/1v5yCL-LdriDwa65qXPbFL7b0tydylDlb/view | + | * Official air quality measurements (5 stations in the city)(EEA Data.zip) – as per EU guidelines on air quality monitoring see the data description [https://drive.google.com/file/d/1v5yCL-LdriDwa65qXPbFL7b0tydylDlb/view here] |
* Citizen science air quality measurements (Air Tube.zip), incl. temperature, humidity and pressure (many stations) and topography (gridded data). | * Citizen science air quality measurements (Air Tube.zip), incl. temperature, humidity and pressure (many stations) and topography (gridded data). | ||
* Meteorological measurements (1 station)(METEO-data.zip): Temperature; Humidity; Wind speed; Pressure; Rainfall; Visibility | * Meteorological measurements (1 station)(METEO-data.zip): Temperature; Humidity; Wind speed; Pressure; Rainfall; Visibility | ||
Line 34: | Line 28: | ||
They can be downloaded by clicking on this [https://storage.cloud.google.com/global-datathon-2018/sofia-air/air-sofia.zip link]. | They can be downloaded by clicking on this [https://storage.cloud.google.com/global-datathon-2018/sofia-air/air-sofia.zip link]. | ||
− | |||
− | <div style="font-family:Open Sans, Arial, sans-serif;font-size: | + | ==<div style="font-family:Open Sans, Arial, sans-serif;font-size:17px">Data Description and Understanding</div>== |
− | <b> | + | <div style="font-family:Open Sans, Arial, sans-serif;font-size:16px"><b>Official Air Quality Measurements</b></div> |
− | </div> | ||
+ | The downloaded zip file has 30 files under EEA data, including 1 metadata file, 1 readme.txt and 28 station data files in a .csv format. | ||
+ | The format for data contained by the station files: | ||
+ | [[File:EEA stations data.jpg|550px|center]] | ||
+ | The format for data contained by the metadata file is as below: | ||
+ | [[File:EEA metadata.jpg|550px|center]] | ||
+ | <div style="font-family:Open Sans, Arial, sans-serif;font-size:16px"><b>Citizen Science Air Quality Measurements</b></div> | ||
− | + | The downloaded zip file has 3 files under AirTube data, including 1 sample file, 1 .csv.gz file for 2017 and 1 for 2018. | |
− | + | The format for data contained by these sensor data files looks as follows: | |
− | + | [[File:AirTube Data.jpg|550px|center]] | |
+ | ==<div style="font-family:Open Sans, Arial, sans-serif;font-size:17px">Data Cleaning and Transformation</div>== | ||
{| class="wikitable" style="background-color:#FFFFFF;" width="100%" | {| class="wikitable" style="background-color:#FFFFFF;" width="100%" | ||
|- | |- | ||
Line 61: | Line 60: | ||
|- | |- | ||
! style="font-weight: bold;background: #6F1E29;color:#FFFFFF;width: 10%" | Problem #2 | ! style="font-weight: bold;background: #6F1E29;color:#FFFFFF;width: 10%" | Problem #2 | ||
+ | ! style="font-weight: bold;background: #6F1E29;color:#FFFFFF;" | EEA TimeFrame for Analysis | ||
+ | |- | ||
+ | | Issue || Missing data is a problem because seasonality is pivotal to understanding the trends in air pollution. Data is missing for much of 2017 (further elaborated on in Task 1). Additionally, the stations only start recording at the hourly level from 2016 onwards, which would affect any hourly analysis. | ||
+ | |- | ||
+ | | Solution || For the daily analysis, use the "DateTimeEnd" variable and leave the data as is; generalizations and insights may change if additional data is uncovered. | ||
+ | <br/> | ||
+ | For the hourly analysis, use data only from November 2017 onwards. This tackles the sparsity of data in 2016's hourly analysis as well as the missing data in 2017. | ||
+ | |} | ||
+ | |||
+ | {| class="wikitable" style="background-color:#FFFFFF;" width="100%" | ||
+ | |- | ||
+ | ! style="font-weight: bold;background: #6F1E29;color:#FFFFFF;width: 10%" | Problem #3 | ||
! style="font-weight: bold;background: #6F1E29;color:#FFFFFF;" | AirTube Data Building Issues | ! style="font-weight: bold;background: #6F1E29;color:#FFFFFF;" | AirTube Data Building Issues | ||
|- | |- | ||
Line 68: | Line 79: | ||
<br/> | <br/> | ||
Upon importing the decoded dataset into Tableau, I found 4 points that have latitude and longitude values of 0.000000, as well as 1 point that has a latitude value of -4.025953 and a longitude value of 78.751781. As none of these 5 points is anywhere near Bulgaria or Sofia City, I have excluded them from the dataset as a whole. | Upon importing the decoded dataset into Tableau, I found 4 points that have latitude and longitude values of 0.000000, as well as 1 point that has a latitude value of -4.025953 and a longitude value of 78.751781. As none of these 5 points is anywhere near Bulgaria or Sofia City, I have excluded them from the dataset as a whole. | ||
+ | <br/> | ||
+ | Additionally, there are sensors plotted to be way outside the boundaries of Sofia city, using the | ||
|} | |} | ||
{| class="wikitable" style="background-color:#FFFFFF;" width="100%" | {| class="wikitable" style="background-color:#FFFFFF;" width="100%" | ||
|- | |- | ||
− | ! style="font-weight: bold;background: #6F1E29;color:#FFFFFF;width: 10%" | Problem # | + | ! style="font-weight: bold;background: #6F1E29;color:#FFFFFF;width: 10%" | Problem #4 |
! style="font-weight: bold;background: #6F1E29;color:#FFFFFF;" | AirTube Data Outliers and Noise Removal | ! style="font-weight: bold;background: #6F1E29;color:#FFFFFF;" | AirTube Data Outliers and Noise Removal | ||
|- | |- | ||
Line 79: | Line 92: | ||
| Solution || In order to remove the noise and outliers, the recorded temperatures above 50 degrees Celsius and below -40 degrees Celsius are removed. | | Solution || In order to remove the noise and outliers, the recorded temperatures above 50 degrees Celsius and below -40 degrees Celsius are removed. | ||
|} | |} | ||
+ | </div> | ||
--> | --> | ||
Revision as of 14:54, 11 November 2018
Contents
Problem and Motivation
Dataset Analysis and Transformation Process
Task 1: Spatio-temporal Analysis of Official Air Quality
Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements
Task 3
Urban air pollution is a complex issue. There are many factors affecting the air quality of a city. Some of the possible causes are:
- Local energy sources. For example, according to Unmask My City, a global initiative by doctors, nurses, public health practitioners, and allied health professionals dedicated to improving air quality and reducing emissions in our cities, Bulgaria’s main sources of PM10, and fine particle pollution PM2.5 (particles 2.5 microns or smaller) are household burning of fossil fuels or biomass, and transport.
- Local meteorology such as temperature, pressure, rainfall, humidity, wind, etc.
- Local topography
- Complex interactions between local topography and meteorological characteristics.
- Transboundary pollution, for example, the haze that intruded into Singapore from our neighbours.
In this third task, you are required to reveal the relationships between the factors mentioned above and the air quality measure detected in Task 1 and Task 2. Limit your response to no more than 5 images and 600 words.
Software
- Tableau - for visualization of the various tasks
- Python - for geocoding
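The cleaning rules described in the Data Cleaning and Transformation notes (dropping points nowhere near Bulgaria, removing temperatures above 50 or below -40 degrees Celsius, and restricting the hourly analysis to November 2017 onwards) could be sketched in Python roughly as follows. This is only an illustrative sketch, not the workflow actually used in Tableau: the field names `lat`, `lon` and `temperature`, the helper function names, and the rough bounding box for Bulgaria are all assumptions made for this example. Only `DateTimeEnd` and the temperature thresholds come from the notes above.

```python
from datetime import datetime

# Temperature thresholds from the cleaning notes: readings above 50 or
# below -40 degrees Celsius are treated as noise and removed.
TEMP_MIN_C = -40.0
TEMP_MAX_C = 50.0

# ASSUMPTION: a deliberately loose bounding box around Bulgaria, used only
# to drop points such as (0, 0) that are nowhere near Sofia.
LAT_RANGE = (41.0, 44.5)
LON_RANGE = (22.0, 29.0)

# Start of the window used for the hourly analysis (November 2017 onwards).
HOURLY_START = datetime(2017, 11, 1)


def is_valid_location(lat, lon):
    """Return True if the point falls inside the assumed bounding box."""
    return (LAT_RANGE[0] <= lat <= LAT_RANGE[1]
            and LON_RANGE[0] <= lon <= LON_RANGE[1])


def is_valid_temperature(temp_c):
    """Return True if the temperature lies within [-40, 50] degrees Celsius."""
    return TEMP_MIN_C <= temp_c <= TEMP_MAX_C


def clean_airtube(rows):
    """Keep only sensor rows with a plausible location and temperature.

    Each row is a dict with the hypothetical keys 'lat', 'lon' and
    'temperature'.
    """
    return [
        row for row in rows
        if is_valid_location(row["lat"], row["lon"])
        and is_valid_temperature(row["temperature"])
    ]


def hourly_window(rows):
    """Keep only rows whose 'DateTimeEnd' is on or after 1 November 2017."""
    return [row for row in rows if row["DateTimeEnd"] >= HOURLY_START]
```

For example, passing a list of decoded sensor rows to `clean_airtube` would drop the four (0, 0) points and the point at (-4.025953, 78.751781), since all of them fall outside the assumed bounding box. A tighter box around Sofia city could be substituted to handle the sensors plotted outside the city boundaries.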