IS428 2018-19 Term1 Assign Aaron Poh Weixin

From Visual Analytics for Business Intelligence
Jump to navigation Jump to search

Background & Objectives

Air quality in Bulgaria is a big concern: measurements show that citizens all over the country breathe in air that is considered harmful to health. For example, concentrations of PM2.5 and PM10 are much higher than what the EU and the World Health Organization (WHO) have set to protect health.

Bulgaria had the highest PM2.5 concentrations of all EU-28 member states in urban areas over a three-year average. For PM10, Bulgaria is also leading on the top polluted countries with 77 μg/m3on the daily mean concentration (EU limit value is 50 μg/m3). According to the WHO, 60 percent of the urban population in Bulgaria is exposed to dangerous (unhealthy) levels of particulate matter (PM10).

The objective of this project is to first understand..... (continue)

Task 1: Spatio-temporal Analysis of Official Air Quality

Data Preparation

Before diving into the data cleaning process, it is important to detail some findings/anomalies I got from simply observing and scanning the data:

  • The given dataset only focuses on PM10, which will therefore be the main focus of our analysis
  • Stations 60881 and 9484 have serious data quality issues
    • Station 60881 only has data for 2018
    • Station 9484 only has data up to the year 2015
    • Station 9484 has missing data for the months of Oct to Dec for the year 2015
    • Station 9484 has missing data for the months of Aug to Sep for the year 2013
  • Averaging format inconsistent across time periods
    • Year 2013-2014 averages data daily
    • Year 2015 averages data daily, except for 31 Dec, which they classify as hourly
    • Year 2016 mostly averages data daily, with some hourly averages
    • Year 2017 averages hourly and only has data for the months of November and December. They also have a 'var' average which does not make sense
    • Year 2018 averages mostly hourly, with some daily averages

Considering the task requires analysis of the past and the most recent air quality situation in Sofia City, the large amount of missing data in Station 9484 would be unacceptable, hence I decided to exclude it from the analysis. Furthermore, since Bulgaria is not a big country, 4 stations should be sufficient to explain the air quality situation
Note: (Year 2017 only has severe missing data in the original dataset, but I managed to download the latest version which captures most of the daily data)

With the above information in mind, I decided on the following cleaning procedure (in Python):

Steps
Steps Taken
Screenshot
1

I started off with cleaning the years 2013-2015 and 2017(updated) because they had the cleanest data format. Cleaning was simply converting 'DatetimeBegin' into a suitable DateTime format ("dd/mm/yyyy")

I also included a calendar list to help me count the number of missing dates and more importantly identify which are the missing dates.

Note:

  • 'DatetimeBegin' was used as the base for the true measurement's date
  • 2015 contains 1 hourly average concentration. Since the corresponding date is missing from the daily and there was only 1 such row, I simply manually changed the 'AveragingTime' to 'day'
Picture2.png
2

Years 2016 and 2018 were slightly more complicated because they contained both hourly and daily averages.

My cleaning procedure is as follows:

  1. Convert 'DatetimeBegin' to the suitable DateTime format
  2. Average concentration using 'AveragingTime' and 'Date'(newly created column from above step) as the MultiIndex
  3. Now we have 2 types of daily averages: (1) one from the original daily average and (2) a second from the above step. We now check for any missing data in (1) that can be filled in by the new data in (2)
  4. If there are 2 data of the same dates, data from (1), the original data will be prioritised
  5. We fill in the rest of the columns with data from the original dataset (since the data is repeated ever row for these columns)


Note:

  • Station 9642 for the year 2016 does not have data for July, Aug, Oct. At this point only approximately 100 data points are missing from the 5 years, so I will decide again whether to exclude this station later during the visualisation stage.
  • I will not be imputing any missing data, because I lack knowledge on the most suitable imputation technique for this dataset. Using the year's average also don't seem suitable considering how some months can have sudden spikes in concentrations which might skew the average. Furthermore, most visualisation tools automatically draw a straight line across the dates with missing data between the nearest dates with known concentrations. This is reasonable since most datasets do not have very severe missing data (<8% missing data).
Picture3.png
3

This step is the simplest as it simply merges all the cleaned datasets into 1 dataset to be parsed into the relevant visualisation tools

Functions used:

  • frames = [df0,df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12,df13,df14,df15,df16,df17,df18,df19,df20,df21,df2,df23]
  • merge_csv = pd.concat(frames)

Data Visualisation

Picture4.png

Findings:
A typical day in Sofia City really depends on the month you are referring to because concentration levels change very drastically very quickly.

  1. For the majority of the year from March till October, average concentration levels are below the EU limit of 50μg/m3
  2. Concentration levels usually peak around December or January
    1. This could be due to added pollution during the festive months

While the visualisation alone is not able to tell us the possible cause of the spike in concentration levels, it is able to highlight trends in concentration levels.

  1. For example, the station STA-BG00052A has a consistent trend of lower concentration levels over the years. This might indicate a healthier environment for the citizens around the region
  2. More encouragingly, the average concentration levels in December have been dropping significantly over the years. This could have a spillover effect unto January (another unhealthy month) should this trend continue
  3. Looking at the below simple line graph, we can observe that peak average concentration levels seem to be slightly trending downwards, which brings further good news for the citizens of Sofia City
Picture5.png


Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements

Data Preparation

The first step of any visualisation project will always be a visual inspection to understand the data and cleaning the data if necessary. Similar to the earlier task, I have listed down below the steps I have taken to clean the data:

Steps
Issue
Steps Taken
1

Geohash is not a readable format by most visualisation tools including Tableau

Decode geohash using Prof's geohash_geocoder R tool

2

Data is stored separately for the years 2017 and 2018

Furthermore, the large number of rows makes initial data exploration difficult especially with Excel

Use JMP's concatenate tool to combine both 2017 and 2018 data

3

There are duplicate points where a geohash may produce multiple different readings of P1, P2, Temperature etc for the same hour

This could be due to multiple readings within the hour

Use JMP's summary tool to average out readings grouped by location and time variables

Also, we exclude the 4 NA's rows

Data Visualisation

Unexpectd behaviours of sensors

Picture3.5.png

I did this step first as I think this step is essential to not just understand unexpected behaviours, but importantly, identify outliers. The outlier plots on the right of the histogram plots show the data points that lie outside the interquartile range (IQR) of the normal box plot.

Initial analysis of the above image suggests that there are many unexpected readings by the sensors, especially those that lie outside the IQR. However, data lying outside the IQR is not sufficient evidence to warrant exclusion. Therefore, given my limited understanding of weather metrics data, I'll be analysing the variables individually and suggest conservative actions to clean data that we are confident are wrong.

Variable
Issue
Steps Taken
P1 & P2

These variables likely to be either PM10 or PM2.5 according to the AirTube website.

It's unexpected to see extremely high PM values such as 1000 and 2000

Technically, PM values can go very high, but given my limited knowledge, I will only exclude the maximum values 2000 and 1000

Temperature

The range of values suggests that values are in degree celsius

What's unexpected is that we see temperatures above 100°C and below -100°C which is impossible

I will exclude temperatures above 60°C and below -90°C (the highest and lowest temperatures ever recorded on Earth)

Humidity

It is impossible for humidity to be negative

Exclude negative humidity readings

Pressure

It's weird just looking at the histogram that the most commonly observed pressure readings can differ so greatly

Exclude negative pressure readings

The above criteria exclude only 0.7% of the full dataset, which is reasonable. This indicates that at least 0.7% of all readings are faulty. Having carefully cleaned the anomalous readings accordingly, we can now analyse the sensors' coverage and performance:

Are they well distributed over the entire city?

Using the lasso tool, I was able to filter the data points within Sofia City:

Map Dist.png

The above diagram clearly shows that the sensors are highly concentrated within the city centre of Sofia, Sofia Grad. This is true even across 2017 to 2018

Are they all working properly at all times?

Sensor's effectiveness 2.png

The above graphs tell the following stories:

  1. Number of sensors have been steadily increasing over time, with some variance (possibly due to spoilt sensors and replacement by additional contingency sensors)
  2. There are at least 6 major disruptions with all sensors not working averaging 2hrs to 2days
  3. Sensors which have a longer working shelf life are less likely to experience 'blackout periods'

Which part of the city shows relatively higher readings than others?

Readings 1s.gif

P1 and P2
Focusing on the left image, we generally see that the city centre (Sofia Grad), tend to have higher readings of P1 and P2 than the outskirts of Sofia City.
One interesting finding is that the values of P1 and P2 then to be higher from November to January. These findings are consistent with our findings from Task 1 (ignoring anomalies reported by some possibly faulty sensors during the months of June to August).


Temperature
Temperature doesn't seem to be location dependent as visually there doesn't seem to be a connection between locations within Sofia City and temperature readings. However, we can observe that similar temperatures tend to cluster around other similar temperatures nearby.
However, there is a strong correlation between time of year and temperature. As we can see in the image above, high temperatures (redness of circle) tend to occur around the months of April to August, which unsurprisingly, are the summer months.


Humidity
Similar to temperature, humidity doesn't seem to be location dependent as visually there doesn't seem to be a connection between locations within Sofia City and humidity readings. However, we can observe that humidity readings can differ very drastically even for stations located close to each other.
Also similar to temperature, humidity seems to have a positive correlation with summer months - humidity readings increase in value as it approaches summer.


Pressure
Pressure can be observed by focusing on the size of the circles. As we can see, pressure seems to be relatively constant regardless of the location of the sensor and time of the year. However, as previously discussed in our analysis of unexpected readings of sensors, this analysis should be read with caution as the size of the circles could be disproportionately skewed due to the presence of small pressures (~0) causing a majority of pressures (>50,000) to be viewed about the same size.

Task 3

Data Preparation

Fortunately, data preparation for this diagram is relatively easy due to a relatively clean dataset

METEO Data

  1. Step 1: Delete columns PRCPMAX and PRCPMIN and (because all the values are missing)
  2. Step 2: Remove rows with missing values, i.e values = -9999 or -9999.9
  3. Step 3: Remove 1 extremem value from PRCPAVG column

The above simple steps lead to a reasonable 11.1% reduction in sample size.

Summary statistics below show that there are no major issues with the cleaned data:

Meteo summary.png

The next step of the cleaning process will be to pivot the data using Excel.

TOPO Data

The problem with this data set is that the geocodes in the data set does not directly match any of the geocodes in our earlier AirTube data.

One solution I tried was to map AirTube data to the closest elevation in the TOPO data using KNN classification. However, this was not a feasible solution given the limited time I had as it was way too computationally intensive requiring a total of nearly 700 million comparisons to be made.

I then decided not clean any data and proceed straight to the visualisation step

Data Visualisation

METEO Data

We will send two copies of the data to Tableau and join them using inner-join by the year, month and day. This step will allow us to produce a correlation matrix as shown below:

Note: Due to multiple similar data categories, I will only be using the columns that capture average readings plus the visibility column for a total of 7 columns for comparison (i.e. DPAVG, PRCPAVG, PSLAVG, RHAVG, sfcWindAVG, TASAVG and VISIB)

Corr matrix.png

Looking at the above correlation matrix, one can easily see a snapshot of all correlations by month-year from January 2012 to August 2018. The combination of colours makes it easy to interpret the strength of the correlation between variables.

Some Findings:

  1. TASAVG and DPAVG have a very strong positive correlation, which is not surprising since they are both measuring some variant of atmospheric temperature
  2. VISIB and PSLPAVG repeatedly switches between positive and negative axis suggesting that there isn't a strong correlation between the 2 variables.



However, the above plot only shows average relationship across time. There could be cyclical correlations that are time-dependent. A variant of the above graph below allows users to compare correlations within each year to see if there is a possible cyclical pattern.

Other Findings:

  1. For example, the below graph suggests that relative humidity and temperature might have a strong negative correlation during the middle of the year. Users can scroll through the individual years to verify their hypothesis.
Corr matrix2.png

TOPO Data

To understand the relationship between topological and meteorological factors, I mapped out the elevation from the TOPO Data unto Sofia City and used that as a background for the sensor readings I did in Task 2. An example is shown below:

Image.png

Users can also select the various views they are interested in, such as temperature, P1 and P2, etc.

Note that unlike Task 2, I have filtered out data points that lie outside the topological region to help us better focus on the relationship within the square

Some Findings:

  1. Elevation is largely flat in Sofia City, with only the South West corner of the city experiencing significant increase in elevation
  2. Sensors are often placed on areas with lower elevation, possibly because the city is built mostly on flat ground
  3. There are insufficient sensors in the elevated areas to suggest any correlation between elevation and meteorological factors