ISSS608 2018-19 T1 Assign Chen Jingyi Data preparation


O706xof6f9b3e2b8.jpg Observable Effects of Bulgaria Air Pollution Crisis

Data source: Official air quality measurements (5 stations in the city) (EEA Data.zip)

Data preparation:
1. Use VLOOKUP to match the data from the 5 stations with the station information in metadata.xlsx. The data of station BG0079A is excluded because it only has records for 2018. The following figure shows how each column is handled (a pandas sketch of this matching step follows the figure).
Xnip2018-11-319 19-51-16.jpg
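The same matching can also be scripted. Below is a minimal pandas sketch of this step, assuming each station's measurements sit in a CSV inside EEA Data.zip and that a shared station code column (here called station_code, a hypothetical name) links the measurements to metadata.xlsx.

```python
import glob
import pandas as pd

# Stack the per-station measurement files (file layout is an assumption).
frames = [pd.read_csv(path) for path in glob.glob("EEA Data/*.csv")]
measurements = pd.concat(frames, ignore_index=True)

# Drop station BG0079A, which only has records for 2018.
measurements = measurements[measurements["station_code"] != "BG0079A"]

# VLOOKUP equivalent: left-join the station metadata onto every measurement row.
metadata = pd.read_excel("metadata.xlsx")
merged = measurements.merge(metadata, on="station_code", how="left")
```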

2. Use Tableau to create new fields and groups.
2.1 Since the 'var' and 'hour' values differ by only one minute in time length, group them together into a new group named "hour & var".

Picture1.png

2.2 Similarly, group the time values together (a pandas sketch of these groupings follows the figure).

Picture2.png
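Continuing from the merged table in the earlier sketch, the same grouping can be done with a simple value replacement; the column name averaging_time is an assumption about where the 'hour'/'var' values live.

```python
# Collapse 'var' and 'hour' into one label, mirroring the Tableau group "hour & var";
# the 'time' values in step 2.2 can be grouped with the same pattern.
merged["averaging_group"] = merged["averaging_time"].replace(
    {"var": "hour & var", "hour": "hour & var"}
)
```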

3. Create a new field called "weekday" to extract the day of the week (a pandas equivalent follows the figure).

Picture3.png
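A pandas equivalent of the "weekday" field, continuing from the merged table above and assuming a parseable timestamp column (the name datetime is hypothetical):

```python
import pandas as pd

# Derive the day of the week from the timestamp.
merged["datetime"] = pd.to_datetime(merged["datetime"])
merged["weekday"] = merged["datetime"].dt.day_name()   # e.g. "Monday"
```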

4. Group the stations by air quality station type (background = Druzhba + Hipodruma + Nadezhda; traffic = IAOS + Orlov Most); a pandas mapping is sketched after the figure.

Picture4.png
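The station-type group can likewise be expressed as an explicit mapping, continuing from the merged table above; the column name station_name is an assumption.

```python
# Map each station to its type, mirroring the Tableau group in step 4.
station_type = {
    "Druzhba": "background",
    "Hipodruma": "background",
    "Nadezhda": "background",
    "IAOS": "traffic",
    "Orlov Most": "traffic",
}
merged["station_type"] = merged["station_name"].map(station_type)
```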
Data source: Citizen science air quality measurements (Air Tube.zip)

Data preparation:
1. Use gh_decode from the geohash package in R to decode each geohash into longitude and latitude.
2. Use the reverse_geocode package in Python to convert the coordinates into a new column of city names (a Python sketch of steps 1 and 2 is given after the pressure plot below).
3. Concatenate the 2017 and 2018 data in JMP, giving about 1,700,000 rows for exploration and data cleaning.
4. 'pressure' values should be greater than 0, but there are 195,390 records equal to 0, which is impossible.
Pressure.png
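Steps 1 and 2 can also be done entirely in Python. The sketch below decodes the geohash by hand instead of calling R's gh_decode, and assumes the reverse_geocode package's search() function, which returns the nearest known city for each (latitude, longitude) pair; file and column names are hypothetical.

```python
import pandas as pd
import reverse_geocode

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def decode_geohash(gh):
    """Decode a geohash string into the (latitude, longitude) of its cell centre."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    lon_bit = True  # geohash bits alternate, starting with longitude
    for ch in gh:
        bits = BASE32.index(ch)
        for shift in range(4, -1, -1):
            bit = (bits >> shift) & 1
            if lon_bit:
                mid = (lon_lo + lon_hi) / 2
                lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
            else:
                mid = (lat_lo + lat_hi) / 2
                lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
            lon_bit = not lon_bit
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2

airtube = pd.read_csv("air_tube_2017_2018.csv")          # hypothetical file name
coords = airtube["geohash"].apply(decode_geohash)
airtube["lat"] = coords.str[0]
airtube["lon"] = coords.str[1]

# Look up the nearest city for every coordinate pair.
places = reverse_geocode.search(list(zip(airtube["lat"], airtube["lon"])))
airtube["city"] = [place["city"] for place in places]
```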

5. 'humidity' should be within the range of 0 to 100, so records outside this range need to be removed as outliers.

Humidity.png

6. For 'temperature', the data show that temperatures in Bulgaria are usually between -10 and 40°C, but to leave some allowance for measurement error the accepted range is expanded to -25 to 57°C.

Temp.png

7. Filter out the abnormal data identified in steps 4, 5 and 6 and move the remaining records into a new sheet, leaving nearly 1,400,000 rows of clean data. Then filter by the city name 'Sofia' to get around 800,000 rows, and bring them into Tableau for further analysis and visualization (a pandas version of this filtering is sketched after the figure). The format of the final dataset is shown below:

Xnip2018-11-320 19-18-03.jpg
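The filtering in steps 4 to 7 can be expressed as one pandas mask using the thresholds above; column names are assumptions and airtube refers to the concatenated 2017-2018 data from the earlier sketch.

```python
# Keep only physically plausible records, then restrict to Sofia.
mask = (
    (airtube["pressure"] > 0)
    & airtube["humidity"].between(0, 100)
    & airtube["temperature"].between(-25, 57)
)
clean = airtube[mask]                       # roughly 1.4 million rows remain
sofia = clean[clean["city"] == "Sofia"]     # roughly 0.8 million rows remain
sofia.to_csv("air_tube_sofia_clean.csv", index=False)   # export for Tableau
```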
Data source: Meteorological measurements (1 station) (METEO-data.zip)

Data preparation:
1. Delete all missing values.
2. Rename each column according to readme_IP.txt.
3. Combine the 'Day', 'Month' and 'Year' attributes into a single 'Date' column in Excel (a pandas sketch of these steps follows).
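A minimal pandas version of the same three steps, assuming the METEO data loads as a CSV with 'Day', 'Month' and 'Year' columns; the rename mapping is only a placeholder, since the real names come from readme_IP.txt.

```python
import pandas as pd

meteo = pd.read_csv("METEO-data.csv")                     # hypothetical file name
meteo = meteo.dropna()                                    # step 1: delete missing values
meteo = meteo.rename(columns={"old_name": "new_name"})    # step 2: placeholder mapping
meteo["Date"] = pd.to_datetime(meteo[["Year", "Month", "Day"]])   # step 3
```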
Data source: Topography data (TOPO-DATA)

Data preparation:
1. Nothing special needs to be done for this data set.
2. The distribution of the data shows a gradient in altitude across Sofia: the elevation is lowest in the north-east and rises towards the south-west. We chose not to match these coordinates with the earlier datasets because of the many mismatches, and simply treat this elevation pattern as background knowledge.
Xnip2018-11-321 11-54-38.jpg