Difference between revisions of "Sofia City: Data Preparation"

From Visual Analytics and Applications

Revision as of 12:31, 18 November 2018

Banner.jpg Understanding Air Quality in Sofia

Data Preparation

EEA Dataset

The EEA folder contains 28 files covering 6 stations from 2013 to 2018. Some stations, notably Orlov Most and Mladost, do not have complete records across the years (denoted by x in the table below). We first consolidate the dataset for each station by concatenating its files from 2013 to 2018.

Info.jpg


Concat 9421.jpg
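
The per-station consolidation could also be reproduced in script form. The following is a minimal sketch, assuming the yearly files for a station are CSV files that share one header; the column names `DatetimeBegin` and `Concentration` and the sample values are hypothetical placeholders, not taken from the EEA files.

```python
import csv
import io


def concat_csv(sources):
    """Concatenate CSV sources that share the same header into one list of rows."""
    header = None
    rows = []
    for src in sources:
        reader = csv.DictReader(src)
        if header is None:
            header = reader.fieldnames
        elif reader.fieldnames != header:
            # Refuse to merge files whose columns differ.
            raise ValueError("column mismatch between files")
        rows.extend(reader)
    return header, rows


# Hypothetical two-year sample for one station. Real files would be
# opened with open(path, newline="") instead of io.StringIO.
y2013 = io.StringIO("DatetimeBegin,Concentration\n2013-01-01 00:00,41.2\n")
y2014 = io.StringIO("DatetimeBegin,Concentration\n2014-01-01 00:00,38.7\n")
header, rows = concat_csv([y2013, y2014])
```

Checking the header before appending guards against silently stacking files whose columns are in a different order.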

The original files do not include the geographical location, altitude, or common name of each station. This information can, however, be found in the metadata table provided. We added it to the individual station files by creating the new columns “Longitude”, “Latitude”, “Altitude” and “Name” respectively.

Addincolumns.jpg
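
As an illustration of the step above, the metadata columns could be attached with a simple lookup. This is a sketch under stated assumptions: the station key `STATION_A`, the coordinate values, and the helper name `add_station_metadata` are all hypothetical, and only the four new column names come from the text.

```python
def add_station_metadata(rows, station_id, metadata):
    """Return copies of rows with the station's metadata columns appended."""
    meta = metadata[station_id]
    return [{**row, **meta} for row in rows]


# Hypothetical metadata table keyed by station identifier (placeholder values).
metadata = {
    "STATION_A": {
        "Longitude": 23.3,
        "Latitude": 42.7,
        "Altitude": 550,
        "Name": "Example Station",
    },
}

# Hypothetical measurement rows for that station.
station_rows = [
    {"DatetimeBegin": "2013-01-01 00:00", "Concentration": "41.2"},
    {"DatetimeBegin": "2013-01-01 01:00", "Concentration": "39.5"},
]
enriched = add_station_metadata(station_rows, "STATION_A", metadata)
```

Because every row of a station file receives the same four values, the later concatenation across stations keeps each measurement traceable to its location.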


Combine EEA.jpg

The final working file is created by concatenating the datasets from all 6 stations across all years.

Air Tube Dataset

The Air Tube folder contains data collected from Citizen Science Air Quality stations from Sep 2017 to Aug 2018. Each file includes a “Geohash” column, which encodes the geographic location of an air station as a short string of letters and digits.

We extracted the corresponding Longitude and Latitude using the R package Geohash.

Rcode.jpg
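
For readers without the R package at hand, the decoding itself is short enough to sketch directly. The following is a minimal pure-Python version of the standard geohash decoding scheme (base-32 characters, bits alternating between longitude and latitude, each bit halving the current interval). It is an illustration only, not the code shown in the screenshot, and the sample geohash `ezs42` is a commonly cited example rather than a value from the Air Tube files.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def decode_geohash(geohash):
    """Return the (latitude, longitude) at the centre of the geohash cell."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # the first bit refines longitude, then the bits alternate
    for ch in geohash.lower():
        value = _BASE32.index(ch)  # raises ValueError on an invalid character
        for shift in range(4, -1, -1):
            bit = (value >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2


lat, lon = decode_geohash("ezs42")
```

Longer geohashes narrow the cell further, so the returned point is the cell centre and its precision depends on the string length.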



By concatenating the files from 2017 and 2018, we were able to compile the final working file containing 3610146 rows.

Combine AirTube.jpg

Using the descriptive analysis function in JMP Pro 14.0, we performed an exploratory visualization of the dataset. Initial investigation shows that the distributions of P1 and P2 fall mostly within the range of 0-300. On closer inspection, we found 3262 records with a value of 2000 for P1 (PM10) and 3314 records with a value of 1000 for P2 (PM2.5). Incidentally, these values occur during similar recording periods. To preserve the integrity of the dataset, these values are retained but treated as outliers in the later visualization steps in Tableau.


Distributions.jpg
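
The retain-but-flag approach described above can be sketched as follows. This is a minimal illustration, assuming the rows are dictionaries with numeric `P1` and `P2` fields; the values 2000 and 1000 come from the text, while the `suspect` column name, the helper name, and the sample rows are hypothetical.

```python
from collections import Counter

# Values reported in the text as suspicious for each pollutant column.
SUSPECT_VALUES = {"P1": 2000.0, "P2": 1000.0}


def flag_suspect_rows(rows, suspect=SUSPECT_VALUES):
    """Keep every row, add a boolean 'suspect' column, and count hits per column."""
    counts = Counter()
    flagged = []
    for row in rows:
        hits = [col for col, cap in suspect.items() if float(row[col]) == cap]
        counts.update(hits)
        flagged.append({**row, "suspect": bool(hits)})
    return flagged, counts


# Hypothetical sample rows (not taken from the Air Tube files).
sample = [
    {"P1": "35", "P2": "20"},
    {"P1": "2000", "P2": "1000"},
    {"P1": "120", "P2": "1000"},
]
flagged, counts = flag_suspect_rows(sample)
```

Keeping the rows and adding a flag, rather than deleting them, lets a later visualization filter or highlight these records without altering the underlying data.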