IS428 AY2018-19T1 Huiyeon Kimn

From Visual Analytics for Business Intelligence
Revision as of 23:54, 11 November 2018 by Huiyeon.kim.2016

Problem & Motivation

Bulgaria faces a serious problem that it shares with many other European countries: air pollution, known to be one of the top risk factors for health. PM2.5 and PM10, two widespread air pollutants, are found throughout Bulgaria at levels known to far exceed the limits set by the European Union and the WHO (World Health Organization). Over the past 3 years, Bulgaria has had the highest PM2.5 concentrations among its neighboring countries, making it one of the most polluted regions in the world.

With over 60 percent of the urban population in this beautiful country exposed to dangerous particulate matter, the health risk among Bulgarians is increasing. There is therefore an urgent need to analyze the current trends and patterns of PM2.5 and PM10 concentrations so that effective measures can be taken.

We decided to create an interactive visualization, using a visual analytics platform such as Tableau, to analyze the data collected over the 6 years from 2013 to 2018. The visualization is designed to meet the following objectives:

1. Identify patterns, events and anomalies in the Citizen Science Air Quality data, using pollutant concentrations and other meteorological data

2. Identify typical patterns, interesting events and trends, both past and recent, in the levels of PM10 concentration reported in the official data

3. Analyze and identify potential associations among variables that may correlate with the air pollution.


Data Transformation and Analysis Process

We received 4 sets of data for this analysis assignment. Each folder contains a different set of records:

  • EEA Data (time series PM10 concentrations from 2013 to 2018, recorded as official)
  • Air Tube Data (meteorological and concentrations from 2017-2018 in various regions)
  • METEO data (basic statistic summary such as wind, etc. from 2012-2018)
  • TOPO data (topographical data with elevation)

Official Air Quality Data (EEA)

The given data covers 6 stations, with time ranges between 2013 and 2018 depending on the station.

Issue: Different Stations with Different Time Range
Solution: The highlighted station, BG_5_9484, has data only from 2013 to 2015. Because its coverage differs so much from that of the other stations, analysis based on this station would be neither meaningful nor comparable, so it was removed from the analysis. The remaining data points appeared to be correct, so we used them in the time series as they were.

EEAWrong.png

Issue: Geographical Data separated into another Excel Workbook
Solution: The Excel formula shown in the accompanying image was used to match the latitude and longitude to the measurement data. This formula sped up the process considerably.

Excel Formula EEA.png
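The same lookup can also be expressed in Python. The following is a minimal sketch rather than the formula actually used: the field names (`station_id`, `latitude`, `longitude`, `pm10`), the station IDs and all sample values are hypothetical and only illustrate the matching step.

```python
def attach_coordinates(readings, stations):
    """Add latitude/longitude to each reading by looking up its station ID."""
    # Build a station_id -> (lat, lon) lookup table once.
    lookup = {s["station_id"]: (s["latitude"], s["longitude"]) for s in stations}
    enriched = []
    for row in readings:
        # Stations without metadata get None instead of raising an error.
        lat, lon = lookup.get(row["station_id"], (None, None))
        enriched.append({**row, "latitude": lat, "longitude": lon})
    return enriched


# Hypothetical sample rows (not taken from the EEA files).
stations = [
    {"station_id": "ST_A", "latitude": 42.70, "longitude": 23.30},
    {"station_id": "ST_B", "latitude": 42.65, "longitude": 23.40},
]
readings = [
    {"station_id": "ST_A", "date": "2017-01-01", "pm10": 61.0},
    {"station_id": "ST_B", "date": "2017-01-01", "pm10": 48.5},
    {"station_id": "ST_C", "date": "2017-01-01", "pm10": 30.0},  # no metadata
]
enriched = attach_coordinates(readings, stations)
```

Readings whose station has no entry in the metadata keep empty coordinates, which makes unmatched rows easy to spot before loading the data into Tableau.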

Air Tube Data

Issue: Location Data encoded in Geohash
Solution: A Python script was written to decode each Geohash back into latitude and longitude.

Coding.png
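For readers without access to the screenshot, Geohash decoding can be sketched in plain Python as follows. This is an illustrative implementation of the standard Geohash algorithm, not necessarily the script used in this project; the function name `decode_geohash` is an assumption, and the sample hash is a common textbook example rather than a value from the Air Tube data.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet (no a, i, l, o)


def decode_geohash(geohash):
    """Return the (latitude, longitude) at the centre of a Geohash cell."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # bits alternate, starting with longitude
    for ch in geohash.lower():
        value = BASE32.index(ch)
        for shift in range(4, -1, -1):  # 5 bits per character, high bit first
            bit = (value >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2


# Textbook example, not from the dataset: roughly (42.605, -5.603).
lat, lon = decode_geohash("ezs42")
```

Each extra character narrows the cell, so longer hashes give more precise coordinates.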

METEO Data

The METEO data had to be transformed in order to answer some of the questions in Task 3. Exploratory data analysis revealed two useful facts:

  1. Using the map functionality in Tableau, we found that the METEO data describes a location that coincides with Mladost, one of the stations in Task 1.
  2. The Mladost data is recorded hourly, while the METEO data is recorded daily. The Mladost data therefore had to be aggregated by day.

First, the EEA data was aggregated using a Python script.

Coding2a.png
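An hourly-to-daily aggregation of this kind might look like the sketch below. It assumes that each record is a (timestamp, PM10) pair and that the daily value is the mean of the hourly readings; the timestamp format, the function name `daily_means` and the sample values are assumptions, not details taken from the original script.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_means(hourly_rows):
    """Average hourly PM10 readings into one value per calendar day."""
    by_day = defaultdict(list)
    for timestamp, pm10 in hourly_rows:
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").date()
        by_day[day].append(pm10)
    # Sort by day so the output is in chronological order.
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Hypothetical hourly readings for illustration only.
hourly = [
    ("2018-01-01 00:00", 80.0),
    ("2018-01-01 01:00", 60.0),
    ("2018-01-02 00:00", 30.0),
]
daily = daily_means(hourly)
```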

Then the date fields of the METEO data were combined into one column using the "Combine Column" option in Power Query in Excel.

The data are now ready to be analyzed!

Interactive Visualization

The following visuals are the output of the analysis.

Task 1: Spatio-temporal Analysis of Official Air Quality

The averaging time in the EEA dataset differs by year: the 2018 data is hourly, while the data for the other years is daily. We therefore split the data into two separate dashboards.

Dashboard 1 - 2013 - 2017


Task 1 - Dashboard 1.png


Dashboard 2 - 2018


Task 1 - Dashboard 2.png


What does a typical day look like for Sofia city?

Task 1 - Typical 1.png


The above data visualization shows the hourly trend of PM10 concentration in 2018. Only 2018 is shown because hourly data is available only for that year.

As we can see, the PM10 trend is similar across the 5 air quality stations. At 00:00, PM10 starts at a relatively high value and then fluctuates through the early morning. Around 09:00 there is a sudden dip in concentration at all stations (possibly an anomaly). After the dip, the concentration rises at 10:00 and then decreases from 11:00 to 16:00. From then on, the concentration increases again until the next morning.

This is the typical daily concentration pattern in Sofia city.
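A typical-day profile like this one can be reproduced outside Tableau by averaging the readings for each hour of the day. The sketch below is only an illustration under assumed inputs: the (timestamp, PM10) record layout, the function name `hourly_profile` and the sample values are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_profile(rows):
    """Return the mean PM10 for each hour of the day across all days."""
    by_hour = defaultdict(list)
    for timestamp, pm10 in rows:
        hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").hour
        by_hour[hour].append(pm10)
    return {hour: mean(values) for hour, values in sorted(by_hour.items())}


# Invented readings for two days, with a dip at 09:00 on both.
rows = [
    ("2018-01-01 08:00", 50.0), ("2018-01-01 09:00", 10.0), ("2018-01-01 10:00", 45.0),
    ("2018-01-02 08:00", 40.0), ("2018-01-02 09:00", 12.0), ("2018-01-02 10:00", 55.0),
]
profile = hourly_profile(rows)
# Hour with the lowest average concentration.
lowest_hour = min(profile, key=profile.get)
```

Looking up the hour with the lowest average is a quick way to confirm whether a dip is consistent across days or caused by a single outlier.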

Do you see any trends of possible interest in this investigation?

Task 1 - Trend 1.png


As shown in the map visualization above, the PM10 concentration level is higher in the western region of Sofia and lower in the eastern region.

Task 1 - Trend 2.png


2018 - Monthly trend

As shown in the visualization above, the concentration level is much higher in January and February 2018 than in the other months of 2018. This pattern holds for all 5 stations.

Task 1 - Trend 3.png


2013 - 2017 - Monthly trend

Across all years, the PM10 concentration in Sofia city is highest from November to January. The same pattern appears in 2018, which suggests that something during this period raises the concentration level.

For most months of the year, the concentration level stays below the EU limit of 50 µg/m³, but the months from November to January pose a real problem for Sofia city.
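The monthly comparison against the limit can be sketched in Python as well. This is a minimal example under stated assumptions: the input is a list of (date, daily mean PM10) pairs, the threshold is the value of 50 quoted in the text, and the function name `monthly_summary` and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

EU_DAILY_LIMIT = 50.0  # limit value quoted in the text


def monthly_summary(daily_rows):
    """Return {month: (mean PM10, number of days above the limit)}."""
    by_month = defaultdict(list)
    for date_str, pm10 in daily_rows:
        month = int(date_str[5:7])  # "YYYY-MM-DD" -> MM, pooled across years
        by_month[month].append(pm10)
    return {
        month: (mean(values), sum(v > EU_DAILY_LIMIT for v in values))
        for month, values in sorted(by_month.items())
    }


# Invented daily means for illustration only.
daily = [
    ("2016-01-05", 90.0), ("2016-01-06", 70.0), ("2016-07-10", 20.0),
    ("2017-01-03", 40.0), ("2017-07-11", 30.0),
]
summary = monthly_summary(daily)
```

Pooling the same calendar month across years mirrors the seasonal view in the chart. Grouping by year and month instead would show whether the winter peak is changing over time.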

On further investigation, we found that a fire-burning tradition during this period could be contributing to the spike in PM10 concentration. The government could therefore look into ways of reducing the air pollutants released by this tradition.

What anomalies do you find in the official air quality dataset?

Task 1 - Anomaly 1.png


The circled portion of the chart has large stretches of missing data, which show up as a straight line in the chart.

Task 1 - Anomaly 2.png


The circled portion has missing data. In addition, the concentration level drops suddenly at 9AM in every month. Because this drop is so consistent, it looks more like a data error than a real effect, and its cause needs to be investigated.
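Gaps like these can also be located programmatically rather than by eye. The sketch below assumes hourly timestamps in a "YYYY-MM-DD HH:MM" format and reports every pair of consecutive readings that are more than one step apart; the function name `find_gaps` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_step=timedelta(hours=1)):
    """Return (start, end) pairs where consecutive readings are too far apart."""
    times = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M") for t in timestamps)
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > expected_step]


# Invented timestamps: readings for 02:00 to 04:00 are missing.
timestamps = [
    "2018-03-01 00:00",
    "2018-03-01 01:00",
    "2018-03-01 05:00",
    "2018-03-01 06:00",
]
gaps = find_gaps(timestamps)
```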


How do these affect your analysis of potential problems to the environment?
These anomalies affect our analysis because we do not know what happened during those periods or why the anomalies occurred. If the missing data hides something important, the assumptions we made might be wrong and could lead to incorrect conclusions.

However, apart from the missing portions, the dataset is comprehensive enough to support the points made above.


Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements

Dashboard - 1

Task 2 - Dashboard.png

Dashboard - 2

Task 2 - Dashboard 2.png

Part 1 - Sensor Operation, Coverage and Performance

Are they well distributed over the entire city?

Task 2 - City.png


The image above shows the sensor coverage of Sofia city. As the density on the map shows, the sensors are not evenly distributed across the city. Sofia-Grad, the center of Sofia, the capital of Bulgaria, has sensors packed closely together, while other regions such as the north and the east have very few. This is something the Bulgarian government could look into.

Are they all working properly at all times?

Task 2 - Working.png


The image above shows the sensor operation in Sofia city. As the line graph shows, the number of sensors in Sofia city has generally increased from 2017 to 2018. However, there are sudden drops in the number of sensors reporting, which could mean that the number of non-working sensors rose sharply on those days. The counts are low on 31 March 2018, 1 May 2018 and 6 July 2018, among others. Did the sensors stop working for a day?
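One way to check for such drops is to count the distinct sensors that report on each day and flag days whose count falls well below the previous day's. The sketch below assumes (day, sensor ID) records and a 50 percent threshold; both function names, the threshold and the sample rows are assumptions chosen for illustration.

```python
from collections import defaultdict


def active_sensors_per_day(rows):
    """Count distinct sensors that reported at least once on each day."""
    sensors = defaultdict(set)
    for day, sensor_id in rows:
        sensors[day].add(sensor_id)
    # ISO date strings sort chronologically.
    return {day: len(ids) for day, ids in sorted(sensors.items())}


def sudden_drops(counts, ratio=0.5):
    """Return days whose count is below `ratio` times the previous day's count."""
    days = list(counts)
    return [d for prev, d in zip(days, days[1:]) if counts[d] < ratio * counts[prev]]


# Invented records: only one sensor reports on the second day.
rows = [
    ("2018-01-01", "a"), ("2018-01-01", "b"), ("2018-01-01", "c"), ("2018-01-01", "a"),
    ("2018-01-02", "a"),
    ("2018-01-03", "a"), ("2018-01-03", "b"), ("2018-01-03", "c"),
]
counts = active_sensors_per_day(rows)
drops = sudden_drops(counts)
```

A day flagged this way only shows that fewer sensors reported. Whether they were broken, offline or simply not uploading still has to be checked separately.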

Can you detect any unexpected behaviors of the sensors through analyzing the readings they capture?

Task 2 - Performance.png


The image above shows the sensor performance in Sofia city. Three box plots show the distribution of the pressure, humidity and temperature readings. All three contain extreme data points that are not physically plausible: a pressure of zero, a humidity of zero and temperatures as low as -70 degrees Celsius. Such conditions would not be survivable, so these values are most likely errors. The sensors may have been exposed to external sources of heat or moisture that made the readings fluctuate badly, or they may simply have been faulty.
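Implausible readings of this kind can be flagged with a simple range check before the data is visualized. In the sketch below, the plausible ranges are hypothetical thresholds chosen for illustration, and the pressure unit (pascals) is an assumption about the data; both would need to be adjusted to the actual Air Tube fields.

```python
# Hypothetical plausibility limits; adjust to the real units of the dataset.
PLAUSIBLE_RANGES = {
    "temperature": (-40.0, 50.0),       # degrees Celsius
    "humidity": (1.0, 100.0),           # percent
    "pressure": (80000.0, 110000.0),    # pascals (assumed unit)
}


def flag_implausible(reading, ranges=PLAUSIBLE_RANGES):
    """Return the names of fields whose values fall outside their plausible range."""
    return [
        name
        for name, (low, high) in ranges.items()
        if not low <= reading[name] <= high
    ]


# Invented readings: the second mimics the extreme values described above.
readings = [
    {"temperature": 12.0, "humidity": 65.0, "pressure": 95000.0},
    {"temperature": -70.0, "humidity": 0.0, "pressure": 0.0},
]
flags = [flag_implausible(r) for r in readings]
```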

Part 2 - Air Pollutant Analysis

Which part of the city shows relatively higher readings than others?

Task 2 - City 2.png


The above image shows the concentration of P1 and P2 in the city of Sofia. The distributions of P1 and P2 are quite similar, and both appear highest in the central region of Sofia. However, this could be because many sensors are packed into the central region, rather than because the concentration there is actually higher.

Are these differences time dependent?

Task 2 - Time Series.png


The above image shows the concentration of P1 and P2 over time, to check whether the concentration is time dependent. As seen before, the months from November to February have very high concentrations of P1 and P2, which may be linked to the fire-burning tradition in Sofia city. P1 and P2 are also positively related: when P1 is high, P2 tends to be high as well. Regardless of the year, this tradition appears very likely to be the reason for the high P1 and P2 levels and the associated health risks in Sofia, Bulgaria.


Task 3 - Potential Factor Analysis

Dashboard 1

Task 3 - Dashboard 1.png


Dashboard 2

Task 3 - Dashboard 2.png


What are the factors influencing Concentration Levels?

Task 3 - Comparison 1.png


The above image shows the detailed analysis of concentration level vs the meteorology data found in METEO.csv. It describes 4 different factors which are compared to the PM10 concentration in Sofia City.

  • Dew Point vs Concentration - dew point has a negative relationship with concentration: when dew point increases, the concentration of PM10 decreases, and vice versa.
  • Wind Speed vs Concentration - like dew point, wind speed has a negative relationship with PM10 concentration.
  • Relative Humidity vs Concentration - unlike the above 2 factors, relative humidity seems to have a slight positive relationship with concentration: when RH increases, concentration tends to increase slightly.
  • Precipitation vs Concentration - like relative humidity, precipitation also seems to have a slight positive relationship with concentration.
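The relationships above were read from the charts. One way to put a number on them is the Pearson correlation coefficient, which is negative for a negative relationship and positive for a positive one. The sketch below is an illustration only: the function name `pearson` and the sample values are invented and are not results from the METEO or EEA data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented daily values: PM10 falls as wind speed rises.
wind_speed = [1.0, 2.0, 3.0, 4.0, 5.0]
pm10 = [80.0, 65.0, 50.0, 42.0, 30.0]
r = pearson(wind_speed, pm10)  # strongly negative for this sample
```

A coefficient near zero would indicate that a relationship is weak even if the trend line has a visible slope, which helps when judging the "slight" relationships.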



Task 3 - Comparion 3.png


From the above visualization, we can draw a few insights:

  • P1 vs Temp - temperature has a negative relationship with P1: when temperature increases, P1 decreases.
  • P1 vs Pressure - pressure has a positive relationship with P1. Although the relationship is not strong, P1 tends to increase when pressure increases.
  • P1 vs Humidity - humidity follows the same trend as pressure. The relationship is not strong, but P1 tends to increase when humidity increases.

The visuals for Temp, Pressure and Humidity vs P2 show a similar trend: temperature is negatively related to P2, while pressure and humidity have a slight positive relationship with P2.

Conclusion

Based on these analyses, we conclude that the high concentration levels are mainly associated with the December fire-burning tradition, and can also be influenced by factors such as temperature and wind speed. These findings could be a stepping stone for the Bulgarian government to make Sofia city a better place to live for everyone. I hope this project has some influence on the world.

Tableau Link

https://public.tableau.com/profile/huiyeon.kim#!/vizhome/Question1_127/2018?publish=yes
