Difference between revisions of "IS428 AY2018-19T1 S Jonas Nevin"
Revision as of 21:41, 11 November 2018
Background
Air pollution is an important risk factor for health in Europe and worldwide. The Organisation for Economic Co-operation and Development (OECD) predicts that in 2050 outdoor air pollution will be the top cause of environmentally related deaths worldwide. Air pollution has also been classified as the leading environmental cause of cancer. Air quality in Bulgaria is a big concern: measurements show that citizens all over the country breathe air that is considered harmful to health. In fact, Sofia City, Bulgaria has been identified as one of the most polluted cities in the world, with some of the highest readings of PM2.5 and PM10.
However, that is not to say that Bulgaria alone is to blame for these high readings; external factors could be contributing. Thus, with what I’ve learnt in class, I aim to create a data visualization that helps users identify trends and possible reasons for the high pollution readings captured.
The data given came in 4 different folders: Air Tube (citizen science measurements), EEA (official air quality measurements), METEO and TOPO. Data transformation had to be done on these datasets to solve the following issues:
EEA Dataset
Issue 1: Though data were provided for 6 different stations, the data for the station highlighted in the image below covers only 2013 to 2015. As a result, it might not be current or relevant.
Solution: This data has been completely removed from our analysis.
Issue 2: When each of these datasets is opened, all the data appears at first glance to be contained in a single column. The header in cell A1 lists the different fields given.
Solution: The data is split into separate columns using the Text to Columns function under the Data tab of Excel. The data would then look as follows:
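For readers who prefer a script to Excel, the same split can be sketched in Python. This is only a minimal illustration, not the method used in this assignment: it assumes the fields are comma-separated, and the column names and values in the sample are made up for the example.

```python
import csv

def split_single_column(lines, delimiter=","):
    """Split rows stored as one delimited string each into named fields.

    `lines` is any iterable of strings; the first one is taken as the header.
    """
    rows = list(csv.reader(lines, delimiter=delimiter))
    if not rows:
        return []
    header, records = rows[0], rows[1:]
    # Pair every value with its column name from the header row.
    return [dict(zip(header, record)) for record in records]

# Hypothetical sample: column names and values are assumptions for illustration.
sample = [
    "Countrycode,AirPollutant,Concentration,DatetimeBegin",
    "BG,PM10,28.5,2018-01-01 00:00:00 +01:00",
]
rows = split_single_column(sample)
print(rows[0]["Concentration"])
```

If the real files use a different separator, it can be passed through the `delimiter` argument.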
Issue 3: The data are split across different sheets according to year. This makes them tedious to work with, as there are too many different files. Combining them also eases the data transformation process in Issue 4, as we will see later.
Solution: All data taken from the same station are combined to form a compiled sheet covering all the different years. Eventually, there would be 5 compiled sheets for the 5 different stations.
Issue 4: For the 2018 data of the 5 stations, the values under “AveragingTime” are either in terms of hours or variable, while the data for other years are in terms of days. This must be standardised, or else it might lead to inaccurate comparison of data between the different years.
Solution: All data for 2018 are converted to daily figures instead. The steps are as follows:
a. Separate DatetimeBegin into 3 separate columns, Date, Time and Timezone, as follows. Time and Timezone can be removed completely from the dataset since we will base our analysis only on daily figures.
b. Cut this new Date column, together with the concentration column corresponding to it, and paste them into an empty space on the right of the sheet. We will work with just these 2 columns to convert them into daily figures.
c. Average all the concentration levels that share the same date using the ‘AVERAGEIF’ function to get a single daily figure for that date. This converts all the hourly data into a daily average concentration. Rename the title of the date column to ‘DatetimeBegin’.
d. Thereafter, link this new concentration column back to the original data, where the other columns such as Country and Namespace are. The result should look as follows. The ‘DatetimeBegin’ column should not have any repeated dates.
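The averaging in steps a to c can also be sketched outside Excel. The Python below is a rough equivalent of the ‘AVERAGEIF’ step, not what was actually done here; it assumes DatetimeBegin looks like "date time timezone" separated by spaces, and the readings are invented for the example.

```python
from collections import defaultdict

def daily_averages(records):
    """Average concentrations that share the same calendar date.

    `records` is an iterable of (datetime_begin, concentration) pairs, where
    datetime_begin is assumed to start with the date, followed by a space.
    Returns a dict mapping each date string to its mean concentration.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for datetime_begin, concentration in records:
        date = datetime_begin.split(" ")[0]  # drop the Time and Timezone parts
        sums[date] += float(concentration)
        counts[date] += 1
    return {date: sums[date] / counts[date] for date in sorted(sums)}

# Invented hourly readings, for illustration only.
hourly = [
    ("2018-01-01 00:00:00 +01:00", 30.0),
    ("2018-01-01 01:00:00 +01:00", 50.0),
    ("2018-01-02 00:00:00 +01:00", 20.0),
]
daily = daily_averages(hourly)
print(daily)
```

Each date appears only once in the result, which matches the requirement in step d that the ‘DatetimeBegin’ column has no repeated dates.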
Airtube Dataset
Issue 1: When first extracted, the data are in GZ format. Therefore, I was unable to open them.
Solution: Download an external tool such as 7-Zip (https://www.7-zip.org/download.html) to extract these files.
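As an alternative to installing extra software, GZ files can also be unpacked with Python's standard gzip module. The sketch below is only an illustration; the file name and contents are made up, and the demo writes a throwaway archive so it can run on its own.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

def gunzip(src, dest=None):
    """Decompress a .gz file; by default the output drops the .gz suffix."""
    src = Path(src)
    if dest is None:
        if src.suffix != ".gz":
            raise ValueError("give dest explicitly when src does not end in .gz")
        dest = src.with_suffix("")
    dest = Path(dest)
    with gzip.open(src, "rb") as f_in, open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # stream the data, so large files are fine
    return dest

# Demo on a temporary archive; the name and contents are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "sample_2018.csv.gz"
    with gzip.open(archive, "wb") as f:
        f.write(b"geohash,pressure\nsx8dfsy,1013\n")
    extracted = gunzip(archive)
    extracted_name = extracted.name
    extracted_bytes = extracted.read_bytes()
print(extracted_name)
```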
Issue 2: The datasets in the Air Tube folder only provide geohashes for the locations, whereas latitude and longitude values are needed.
Solution: To convert the geohashes into latitude and longitude values, R Studio was used. The steps taken are as follows:
a. Extract just the ‘Geohashes’ column from the 2017 and 2018 data files into separate CSVs.
b. In each CSV, remove duplicates using the ‘Remove Duplicates’ function under the Data tab in Excel. This saves R the time and effort of decoding the same geohash more than once.
c. Move these files to a separate folder in any drive on the computer. I placed mine in the C drive on Windows.
d. Open R Studio and create an R Notebook. Save it in the same folder as the data files. Then do the following steps in R. The steps below are shown just for the 2018 CSV; they have to be repeated for the 2017 file, with the names changed accordingly.
A. Ensure the geohash package is installed and loaded in R Studio before you transform the data.
B. Import the data. Ensure that the name of the file is copied exactly, since it is case sensitive.
C. To convert the geohashes to latitude and longitude, use this code. The ‘$geohash’ in the code ensures that the decode function works on every value in the CSV.
e. Running the code presents the output, which then needs to be copied manually, page by page, and pasted into the CSV it belongs to. Since I’ve decoded the 2018 CSV in this example, I would copy the output and paste it into the 2018 CSV so that all the rows are filled. These steps would be repeated for the 2017 data.
(There is probably a more automatic and time-efficient way of doing this but I’ve yet to find out how)
The final output in the CSV should look something like this. This file will be denoted as Table 1.
f. This data then needs to be merged with the original data, which contains the other variables such as pressure and time. Copy the ‘lat’ and ‘long’ values from Table 1 into this original dataset by first moving the Table 1 sheet over and then using the ‘VLOOKUP’ function to copy the values from one sheet to the other.
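One possibly more automatic route for steps d to f is to decode the geohashes and attach the coordinates in a single script, which avoids the manual copy-and-paste and the VLOOKUP. The Python below is a sketch of the standard geohash decoding algorithm, not the R package used above; the column names ‘geohash’, ‘lat’ and ‘long’ follow this write-up, while the sample row is invented.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def decode_geohash(geohash):
    """Return the (latitude, longitude) at the centre of a geohash cell."""
    if not geohash:
        raise ValueError("empty geohash")
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # bits alternate, starting with longitude
    for char in geohash.lower():
        value = BASE32.index(char)  # raises ValueError on an invalid character
        for shift in range(4, -1, -1):  # each character carries 5 bits
            bit = (value >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2

def add_coordinates(rows, geohash_key="geohash"):
    """Return copies of the row dicts with 'lat' and 'long' values added."""
    cache = {}  # decode each distinct geohash only once
    result = []
    for row in rows:
        code = row[geohash_key]
        if code not in cache:
            cache[code] = decode_geohash(code)
        lat, lon = cache[code]
        result.append({**row, "lat": lat, "long": lon})
    return result

# Invented sample row, for illustration only.
merged = add_coordinates([{"geohash": "ezs42", "pressure": "1013"}])
print(merged[0]["lat"], merged[0]["long"])
```

Because the coordinates are attached directly to each original row, the duplicate removal in step b is replaced by the small cache inside `add_coordinates`.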
Issue 3: After combining the 2017 and 2018 datasets in the Air Tube data folder and plotting their geographical distribution using the geocodes, it can be seen, as per the following screenshot, that the points are spread all around Bulgaria and not in Sofia City alone. Therefore, the points outside Sofia City have to be removed.
Solution: To remove the unnecessary points, the following steps were taken:
1. Create a CSV with just the latitude and longitude of Sofia City:
a. Latitude: 42.69833
b. Longitude: 23.31994
2. Create a new Tableau map visualization to identify how Sofia City looks on the map as follows: