IS428 AY2018-19T1 S Jonas Nevin

Visual Detective - Solving the Mystery behind Sofia City

Background

Air pollution is an important risk factor for health in Europe and worldwide. The Organisation for Economic Cooperation and Development (OECD) predicts that by 2050, outdoor air pollution will be the top cause of environment-related deaths worldwide. In addition, air pollution has been classified as the leading environmental cause of cancer. Air quality in Bulgaria is a big concern: measurements show that citizens all over the country breathe air that is considered harmful to health. In fact, Sofia City, Bulgaria has been identified as one of the most polluted cities in the world, with some of the highest readings of PM2.5 and PM10.

However, that is not to say that Bulgaria alone is to blame for these high readings; external factors could also be contributing. Thus, with what I have learnt in class, I aim to create a data visualization that helps users identify trends and easily spot possible reasons for the high pollution readings captured.


Data Preparation

The data given contained 4 different folders: Air Tube (citizen science measurements), EEA (official air quality measurements), METEO and TOPO. Data transformation had to be done on these datasets to solve the following issues:

EEA Dataset

Issue 1: Though the data were provided from 6 different stations, the station highlighted in the image below only has data from 2013 to 2015. As a result, its readings might not be current and relevant.

Solution: This data has been completely removed from our analysis.


Issue 2: When opening each of these datasets, all the data appears in a single column at first glance. The header in cell A1 shows the different fields given.

Solution: The data is split into separate columns using the Text to Columns function under the Data tab of Excel. The data would thus be as follows:

Jns1.png


Issue 3: The data are split into separate files by year, which makes them tedious to work with as there are too many different files. Also, as we will see later, combining them helps ease the data transformation in Issue 4.

Solution: All data taken from the same station are combined to form one compiled sheet covering all the different years. Eventually, there are 5 compiled sheets for the 5 remaining stations.


Issue 4: For the 2018 data across the 5 stations, the values under "AveragingTime" are in terms of hours (or marked as variable), while the data for other years are in terms of days. Therefore, this must be standardised, or it might lead to inaccurate comparisons between the different years.

Jns4.png

Solution: All data for 2018 are converted to daily figures instead. The steps are as follows:

a. Split DatetimeBegin into 3 separate columns: Date, Time and Timezone, as follows. Time and Timezone can be removed completely from the dataset, since we will only base our analysis on daily figures.

Jns5.png

b. Cut this new Date column, together with its corresponding concentration column, and paste them in an empty space on the right of the sheet. We will work with just these 2 columns to convert the data into daily figures.

c. All the concentration levels that share the same date are averaged using an "AVERAGEIF" function to get a single daily figure for that date. This converts all the hourly data into daily average concentrations. Rename the header for the dates to "DatetimeBegin".

d. Thereafter, link this new concentration column back to the original data where the other columns, such as Country and Namespace, are. The result should look as follows; the "DatetimeBegin" column no longer has any repeated dates. (A scripted version of this aggregation is sketched after the screenshot below.)

Jns6.png
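The same hourly-to-daily aggregation can also be scripted as a cross-check; a minimal sketch in R, assuming columns named "DatetimeBegin" and "Concentration" (the names are illustrative) and the dplyr package:

 library(dplyr)
 
 hourly <- read.csv("station_2018.csv", stringsAsFactors = FALSE)
 
 # Drop the time component so all readings from the same day share one key
 hourly$Date <- as.Date(hourly$DatetimeBegin)
 
 # Average all readings sharing a date, mirroring Excel's AVERAGEIF
 daily <- hourly %>%
   group_by(Date) %>%
   summarise(Concentration = mean(Concentration, na.rm = TRUE))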


Airtube Dataset

Issue 1: When first extracted, the data comes in GZ format, which I was unable to open directly.

Solution: Download external software such as 7-Zip (https://www.7-zip.org/download.html) to extract these files.
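As a side note, full extraction can be skipped when working in R, since read.csv() can stream a gzip-compressed CSV directly (the file name here is illustrative):

 # Read the .gz archive without unpacking it first
 airtube_raw <- read.csv(gzfile("data_bg_2018.csv.gz"), stringsAsFactors = FALSE)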

Issue 2: The datasets in the Air Tube folder only provide geohashes for the locations, whereas latitude and longitude values are needed.

Solution: To convert the geohashes into latitude and longitude values, R Studio must be used. The steps taken are as follows:

a. Extract just the 'Geohashes' column from the 2017 and 2018 data files into a separate CSV each.

b. In each CSV, remove duplicates using the 'Remove Duplicates' function under the Data tab in Excel. This spares R unnecessary time and effort.

c. Move these files to a separate folder on any drive of the computer. I placed mine in the C drive of Windows.

d. Open R Studio, open an R Notebook and save it into the same folder the data files are in. Then carry out the following steps in R. The steps below are shown just for the 2018 CSV; they have to be repeated for the 2017 file, with the file names changed accordingly.

A. Ensure the geohash package is installed and loaded in your R Studio before you transform the data.

B. Import the data. Ensure that the name of the file is copied exactly, especially since it is case-sensitive.

C. To convert the geohashes to latitude and longitude, use the code shown. The '$geohash' in the code ensures that the decode function works on every value in the CSV.

e. Running the code presents the output, which then has to be copied page by page manually and pasted into the CSV it belongs to. Since I decoded the 2018 CSV in this example, I would copy the output and paste it into the 2018 CSV so that all the rows are filled. These steps are repeated for the 2017 data.

(There is probably a more automatic and time-efficient way of doing this but I’ve yet to find out how)
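For reference, a sketch of one more automated route, assuming the CRAN geohash package's gh_decode() returns the latitude and longitude for each hash, and that the extracted column is named 'geohash' (file names are illustrative):

 library(geohash)
 
 geo2018 <- read.csv("geohashes_2018.csv", stringsAsFactors = FALSE)
 
 # gh_decode() is assumed to return one decoded row per input hash,
 # with 'lat' and 'lng' columns
 decoded <- gh_decode(geo2018$geohash)
 geo2018$lat  <- decoded$lat
 geo2018$long <- decoded$lng
 
 # Writing the result back out avoids the manual page-by-page copying
 write.csv(geo2018, "geohashes_2018_decoded.csv", row.names = FALSE)

Repeating this with the 2017 file name would cover both years.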

The final output in the CSV should look something like this. This file will be denoted as Table 1.

f. This data then needs to be merged with the original data, which contains the other variables such as pressure and time. Move the 'lat' and 'long' values from Table 1 to the original dataset by first moving the sheet over and then using the "VLOOKUP" function to copy the values from one sheet to the other. (A scripted equivalent is sketched below.)
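This lookup can also be scripted; a minimal sketch of the same join in R, assuming both sheets were saved as CSVs sharing a 'geohash' column as the key (file names are illustrative):

 # Original AirTube data (pressure, time, etc.) and the decoded Table 1
 airtube <- read.csv("airtube_2018.csv", stringsAsFactors = FALSE)
 table1  <- read.csv("geohashes_2018_decoded.csv", stringsAsFactors = FALSE)
 
 # Left-join the coordinates onto the original rows, mirroring VLOOKUP
 airtube <- merge(airtube, table1, by = "geohash", all.x = TRUE)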


Issue 3: After combining the 2017 and 2018 datasets in the AirTube data folder and plotting the geographical distribution using the geocodes, it can be seen from the following screenshot that the points are spread all around Bulgaria and not in Sofia City alone. Therefore, the points outside Sofia City have to be removed.

Solution: To remove the unnecessary points, the following steps were taken:

1. Create a CSV with just the latitude and longitude of Sofia City (latitude 42.69833, longitude 23.31994).

2. Create a new Tableau map visualization to identify how Sofia City looks on the map, as follows:

3. After identifying how and where Sofia City sits on the map, head back to the AirTube geographical visualization and, using the lasso tool, highlight the same area to select only the points in Sofia City.

4. Hover over any of the highlighted points and select "Keep Only". That filters out all points that are not part of Sofia City. If needed, you can zoom in even further and individually remove points that are not in the region by right-clicking the data point and selecting "Exclude". (A scriptable alternative is sketched below.)
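As a scriptable alternative to the manual lasso, out-of-region points could also be dropped with a rough bounding-box filter; a sketch in R, where the box limits are my own rough assumption of Sofia City's extent, not an official boundary:

 # Keep only points whose coordinates fall inside an approximate
 # bounding box around Sofia City (limits are assumed, not official)
 sofia <- subset(airtube,
                 lat  > 42.60 & lat  < 42.80 &
                 long > 23.20 & long < 23.45)

Here 'airtube' is the merged table from the previous step.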

METEO Dataset

Issue 1: The column headers in the dataset are difficult to decode on their own. For example, the daily average temperature is denoted as "TASAVG". As a result, it may be difficult to create visualizations while working with such headers.

Solution: Rename the headers so that they can be understood at first glance. The precipitation columns can also be removed, as all their rows contain values of either -9999 or 0. The other headers have been renamed as follows:

Before → After
TASMAX → MaxTemp
TASAVG → AvgTemp
TASMIN → MinTemp
DPMAX → MaxDewPointTemp
DPAVG → AVGDewPointTemp
DPMIN → MinDewPointTemp
RHMAX → MaxHumidity
RHAVG → AVGHumidity
RHMIN → MinHumidity
sfcWindMAX → MaxWindSpeed
sfcWindAVG → AVGWindSpeed
sfcWindMIN → MinWindSpeed
PSLMAX → MaxSurfacePressure
PSLAVG → AVGSurfacePressure
PSLMIN → MinSurfacePressure
VISIB → AVGvisibility
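The renaming itself can also be scripted; a minimal sketch in R using an old-to-new lookup vector built from the table above (abridged; extend it with the remaining pairs, and the file name is illustrative):

 meteo <- read.csv("meteo_sofia.csv", stringsAsFactors = FALSE)
 
 # Old header -> new header, taken from the renaming table above
 new_names <- c(TASMAX = "MaxTemp", TASAVG = "AvgTemp", TASMIN = "MinTemp",
                RHMAX = "MaxHumidity", RHAVG = "AVGHumidity", RHMIN = "MinHumidity")
 
 # Rename only the headers that appear in the lookup; leave the rest as-is
 hits <- names(meteo) %in% names(new_names)
 names(meteo)[hits] <- unname(new_names[names(meteo)[hits]])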

Issue 2: As per the screenshot below, the METEO data splits the date into 3 separate columns for year, month and day. This makes it harder to work with when creating visualizations.

Solution: Do the following: 1. Right-click on any of these columns > "Create Calculated Field". 2. Type the date-combining formula shown and click OK:

That would create a new column with the full date.
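In Tableau, this kind of calculated field is typically written with MAKEDATE. The equivalent combination as a sketch in R, where the column names 'Year', 'Month' and 'Day' are my assumptions about the METEO headers:

 # Combine the three date-part columns into a single Date column
 meteo$Date <- as.Date(ISOdate(meteo$Year, meteo$Month, meteo$Day))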

Interactive Visualization

The interactive visualization can be accessed here: https://public.tableau.com/profile/jonas.nevin#!/vizhome/VisualDetectiveAssignment/SofiaCityStoryboard?publish=yes


Storyboard: The storyboard function was chosen as the main way to traverse the different dashboards. Though there is no single correct order in which the data should be analysed, a clear approach was taken to understand this data better, and this function was the best way to bring that across. As one traverses the different stories, he/she gets a brief overview of what to expect from each of the dashboards. The following are the various types of visualizations used in the dashboards:

Time Series: Across the 3 datasets, the time series is essential for understanding patterns across the various measurements of time (monthly, daily, hourly). Main features: trendlines, to help identify the average and the time periods in which most pollution takes place; filters, to enable the user to pinpoint a specific time period.

Heatmap Calendar: The heatmap calendar enables users to easily view trends and identify time periods with unique observations.

Symbol Map: By combining the map and boxplot, users can identify specific areas on the map just by highlighting points on the boxplot. Main features: a slider filter, which makes it easier for the user to traverse between the different time periods.

Correlation Matrix: This visualization type is extremely useful for the third task, where we identify relationships between the different factors and the air measurements.


Task 1: EEA Dataset

Anomalies


There is missing data from January 2017 to October 2017. This might affect our analysis, as the missing data is very recent; coupled with the fact that not all months of 2018 are available, it might hamper our analysis.

There is also a lot of missing data for 2017 across all 5 air quality stations. This prevents us from clearly identifying trends from year to year.

Trends and Findings

Trend 1

Overall, there is high variability in the readings from 2013 to 2018. However, a clear trend is observed in which the concentration readings are highest at the start and the end of each year. For example, looking at 2013 in the image above, January and December had the highest concentration readings, way above the average of 5,187 µg per cubic metre. Also, the months in the middle of the year seem to have the lowest readings across the years. It might be interesting to understand why this is the case.


Trend 2

From the cycle plot, another trend can be observed. There seems to be an inverse relationship between the concentration readings across the various years for the months of January and December, the two months with the highest readings. For example, in January 2013 the concentration is below the January average of 9,563, while the converse is seen in December 2013, where the reading is above the December average of 8,847. This trend is observed over the 4 years of data.

Trend 3

From the heatmap calendar view titled "Air Pollution Heatmap Calendar (By Month and Day)", 2 observations can be made.

Fixing our attention on the 2013 to 2016 part of the visualization, the concentration of PM10 seems to be significantly greater toward the second half of December. More specifically, the concentration seems to be higher on the days around the 25th of December (Christmas Day).

Secondly, from the 2017-2018 part of the visualisation, it can be seen that in January (the first row in the image right above), the concentration from the 10th to the 24th is not as high as towards the start and end of the month. The 2017-2018 data is read separately from the earlier years because it was initially hourly and had to be transformed into daily data; also, 2017 has a lot of missing data.

For the visualization titled "Air Pollution (By Month & Hour)", there is what seems to be an anomaly in January at hour 9. However, it just happens that the hour-9 data for most of the days in 2018 was not recorded. This might affect our analysis, as it may stop us from telling whether the data is cyclical by hour. For example, looking at the visualization just above, the concentration seems high at hours 0-1, dips from hours 2-6, then rises again from hours 8-10, and so on; this pattern seems to continue until hour 23. Also, in December, the concentration seems to be highest from hour 18 onwards.

Task 2: Airtube Dataset

Looking at these two visualizations for September 2017 and April 2018, the sensor coverage can be seen to increase dramatically. Initially, in September 2017, most of the sensors are located in the central area of Sofia City, and as the months go by, more sensors appear. The coverage expands, with more sensors appearing towards the four corners of the region, though only as it gets closer to April 2018.

The sensors calendar visualization supports this finding as well. When scrolling from September 2017 to April 2018, the visualization (screenshot below) becomes darker, denoting that the number of sensors has increased. The spread of sensors over the months has been quite even, apart from a few outliers. Two of these outliers can be seen in the screenshot that follows.

Firstly, from 2pm on the 29th of March till 6pm on the 1st of April, the number of sensors running seems to have decreased drastically, as shown by the light blue area in the screenshot above. A similar pattern is once again observed on the 7th of April from 7pm onwards. This could possibly be because the sensors were not working during those times. Beyond these outliers, no other unexpected sensor behaviour could be captured.


The time series graphs of the P1 and P2 air pollution measures by month and day show that P1 has higher readings than P2 throughout the entire period. Moreover, there is a trend that can be observed over the months. As per the screenshot above, in the most polluted months there are two distinctive peaks: one from day 5 to 10 and another between day 25 and 30. This trend is only observed from November 2017 to January 2018, the three months with the greatest pollution.



Overall, the geographical visualisations of P1 and P2 follow a similar trend, in which areas that are darker blue on the P1 map are somewhat replicated on the P2 map. An example can be seen in the screenshot above. Also, similar to the EEA data findings, the Citizen Science Air visualization shows December and January to have the highest levels of pollution. Using January as a gauge, since it was the most heavily polluted month, I tried to identify where the most polluted locations in Sofia City were. As can be seen from the screenshots, I highlighted every point above the third-quartile line in the boxplot so that it is reflected on the map visualisation, and repeated this for both the P1 and P2 measures. Interestingly, the highlighted points for both P1 and P2 were mainly located in the upper half of Sofia City. Therefore, there could be some special characteristics of the central and northern regions that cause them to be more polluted than the rest. The same trend can be observed for the other months as well.

To identify whether the top polluted areas are the same throughout the various months, and whether there is a time dependence, the boxplot from the "P1 & P2 Geo Distribution" visualization was used as follows: 1. Start by highlighting the topmost points, from roughly the 75th percentile onwards to the last point, for any month. For illustrative purposes, I used January, since it was the most polluted, and focused on the P2 measure.


2. Traverse using the 'Period' slider filter over all the different months, from the first month (September 2017) to the last month (April 2018). What can be observed is as follows.

When traversing through the data, it is observed that areas that were heavily polluted in January 2018 were not heavily polluted throughout all the months for which data were captured. For example, as can be seen from the screenshot above, the most heavily polluted areas in January sit in the median range in September 2017. These areas' pollution also decreases again after January, and by March 2018 most of the points are near the median region. The same trends are observed for the P1 measures. What this shows is that the Citizen Science air pollution measurements are time-dependent; there might be external factors, such as temperature, that cause areas to be more polluted in some months and not in others.


Identifying Relationships

As per the AirTube trends visualization, it can be noted that there is a significant relationship between the air pollution measures and temperature. More specifically, the graph in the top left shows an inverse relationship, whereby in months where the temperature is high, the air pollution is low, and vice versa. No such observable trends can be seen for the pollution measures against pressure and humidity. *Note that only P1 data is used for illustrative purposes in the screenshot; these trends are also observed for P2.* These findings could also explain why more regions are polluted in December and January than in other months, as observed in Task 2: it could be mainly the cooler temperatures causing more of the polluted gases to be trapped near the surface, increasing the readings.

The topo visualization shows that there are no clear trends between air pollution and topography (as observed in the screenshot). The visualization below refers to the topographical map of Sofia City. The lower left of the "Topo Area" is a darker blue, denoting where the areas of high elevation are. Overall, the pollution does not seem to occur in the regions of high elevation. As can be seen above for December 2017 in "P1 Geo Viz", most of the heavily polluted areas are located towards the centre or slightly towards the top, away from the elevated areas.


For the correlation matrix, PM10 concentration data from the EEA dataset was joined with the METEO data, as they share common daily dates. This visualization shows that there is a negative correlation between PM10 concentration and both temperature and dew point temperature (the two of which are close to perfectly correlated with each other). This coincides with our earlier finding that temperature has a negative correlation with the P1 and P2 measures. The other factors do not show any sign of correlation.
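A minimal sketch of this join and correlation in R, assuming each table has a daily 'Date' column and using the renamed METEO headers from earlier (file names are illustrative):

 eea   <- read.csv("eea_pm10_daily.csv", stringsAsFactors = FALSE)  # Date, PM10
 meteo <- read.csv("meteo_renamed.csv", stringsAsFactors = FALSE)   # Date, AvgTemp, ...
 
 # Join on the shared daily dates, then correlate PM10 with a factor
 joined <- merge(eea, meteo, by = "Date")
 cor(joined$PM10, joined$AvgTemp, use = "complete.obs")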

Overall, the trends for PM10, P1 and P2 are very similar. PM10 here refers to the official measurements, while P1 and P2 are taken from the citizen science air measurements. As seen in the time series graph, whenever P1 or P2 increases, the PM10 level increases as well. However, something unique is that towards April 2018, the official PM10 measures become greater than the citizen science P1 and P2 measures, whereas in the previous months the citizen science readings were greater than the official ones. This could possibly be due to the decrease in the number of citizen science sensors noted in the Task 2 findings: with fewer sensors from end March onwards, there may not have been enough sensors covering the areas near where the official measurements were taken, areas where there might have been some unique behaviour.