ISSS608 2016-17 T3 Assign ANGAD SRIVASTAVA DataPrep

From Visual Analytics and Applications
ISSS608 AngadSrivastava WikiHeading.jpg

The Challenge

Data Preparation

Visualization Tools

VAST Submissions

Feedback and Comments

 


The visualization exercises covered in the next 2 tabs make use of 4 distinct datasets. A brief overview of the data preparation and univariate analysis for all 4 datasets is covered in subsequent sections. These datasets and the corresponding analyses were prepared using both JMP Pro 13 and Tableau.

A brief description of all 4 datasets and preparatory steps is given below:

Geolocation Data

This dataset was created from the X and Y location coordinates provided as part of the challenge description. The VAST challenge documentation provides geographical coordinates for all 4 factories and 9 sensors. In addition, a “Type” column was added to distinguish the type of structure (Factory or Sensor) that each pair of coordinates represents.

Based on the given information, the 4-column table below was created. When plotted on a 200x200 coordinate block map, it shows the following geographical layout of all factories and sensors.

ISSS608 AngadSrivastava DataPrep 1.jpg
Figure 1


Sensor Data

ISSS608 AngadSrivastava DataPrep 2.jpg
Figure 3

This dataset, provided by the VAST challenge documentation in its original format, contains hourly readings for each of the 4 chemicals captured by every Sensor. The records span 3 months: April, August and December 2016. The adjacent image shows a sample of the dataset provided. A missing-value check in JMP Pro 13 confirmed that none of the 79,243 rows in this dataset has missing values.
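The check above was done in JMP Pro 13. As a rough equivalent outside JMP, a minimal sketch in plain Python is shown here. The column names and sample values are hypothetical and do not come from the challenge file; one reading is deliberately left blank so the counter has something to find.

```python
import csv
import io

# Hypothetical sample in an assumed layout. Column names and values are
# illustrative only and are not taken from the challenge data.
SAMPLE = """Chemical,Monitor,DateTime,Reading
chem_A,3,2016-04-01 00:00,0.27
chem_B,3,2016-04-01 00:00,
chem_C,6,2016-04-01 01:00,1.52
"""


def count_missing(rows, columns):
    """Return {column: number of rows whose value is empty or absent}."""
    missing = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value.strip() == "":
                missing[col] += 1
    return missing


reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
missing = count_missing(rows, reader.fieldnames)
print(missing)
```

A dataset with no missing values, as reported for the Sensor data, would give a count of zero for every column.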


ISSS608 AngadSrivastava DataPrep 3.jpg
Figure 2


Univariate analysis of the readings also shows that the distribution of chemical readings is highly skewed: the median reading is 0.39 and the 99.5th percentile is 6.46, while the maximum value is 101.1. These insights have been used in the preparation of selected visualizations, as covered in the next tab.
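To illustrate how summary figures of this kind expose skewness, the sketch below computes a median, a 99.5th percentile and a maximum for a small made-up list of readings. The values are invented for illustration, and the nearest-rank percentile used here is an assumption; JMP may use a different percentile definition.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up, right-skewed readings (not the challenge data).
readings = [0.1, 0.2, 0.3, 0.4, 0.4, 0.5, 0.6, 0.9, 2.5, 40.0]

summary = {
    "median": statistics.median(readings),
    "p99_5": percentile(readings, 99.5),
    "max": max(readings),
    "mean": statistics.fmean(readings),
}
print(summary)
```

A mean far above the median, and a maximum far above the bulk of the data, are the same symptoms of right skew that the reported figures show.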

The distribution of the Monitor field, a numeric field treated as nominal, was also analysed to ascertain the frequency of readings for each Sensor. As shown in the adjacent image, all 9 Sensors capture a similar number of readings over the 3 months of data provided, with minor differences as shown in the frequency count below:

ISSS608 AngadSrivastava DataPrep 4.jpg

Figure 4
ISSS608 AngadSrivastava DataPrep 5.jpg
Figure 5
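A frequency count like the one above can be sketched in a few lines. The records below are hypothetical; the only point carried over from the text is that the numeric Monitor identifier is treated as a category label rather than as a quantity.

```python
from collections import Counter

# Hypothetical (monitor, reading) pairs, not the challenge data.
records = [(1, 0.2), (2, 0.4), (1, 0.3), (3, 1.1), (2, 0.1), (1, 0.9)]

# Convert the numeric identifier to a string so it behaves as a nominal label.
counts = Counter(str(monitor) for monitor, _ in records)

for monitor, n in sorted(counts.items()):
    print(f"Monitor {monitor}: {n} readings")
```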


Meteorological Data

ISSS608 AngadSrivastava DataPrep 6.jpg
Figure 7

This dataset, as provided by the VAST challenge documentation in its original format, contains atmospheric wind-related information. Meteorological readings are recorded once every 3 hours, with the general wind direction and wind speed captured for each time frame. A snippet of this dataset is shown in the adjacent image.

The relevance of the data field for “Elevation” given in this dataset is unclear and requires additional information. For the purpose of this investigation, this field has been ignored until further information can be provided.

Possible missing values were checked with the following results:

ISSS608 AngadSrivastava DataPrep 7.jpg
Figure 6

As shown above, there are 2 anomalous records. Further investigation showed that these are an empty row and a record with missing values at 3 AM on 30 August 2016. Both records were removed as part of the data cleaning process.
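The cleaning step can be sketched as a simple completeness filter. The field names and wind values below are hypothetical; only the two kinds of anomaly (an empty row, and a row with missing values at 3 AM on 30 August 2016) follow the description above. The Elevation field is left out of the check because it is ignored in this investigation.

```python
# Fields that must be present for a record to be usable (assumed names).
REQUIRED = ("DateTime", "WindDirection", "WindSpeed")

# Hypothetical records; the wind values are made up.
rows = [
    {"DateTime": "2016-08-30 00:00", "WindDirection": "190.5", "WindSpeed": "3.1"},
    {"DateTime": "2016-08-30 03:00", "WindDirection": "", "WindSpeed": ""},
    {"DateTime": "", "WindDirection": "", "WindSpeed": ""},
    {"DateTime": "2016-08-30 06:00", "WindDirection": "201.0", "WindSpeed": "2.7"},
]


def is_complete(row):
    """True when every required field holds a non-blank value."""
    return all((row.get(col) or "").strip() for col in REQUIRED)


clean = [row for row in rows if is_complete(row)]
removed = [row for row in rows if not is_complete(row)]
print(len(rows), len(clean), len(removed))
```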

Univariate analysis of Wind Speed (m/s) shows that wind speeds are well distributed between 0.1 m/s and 6.8 m/s.

ISSS608 AngadSrivastava DataPrep 8.jpg
Figure 8

Frequency analysis of Wind Direction shows that the distribution is skewed towards directions between 150 and 360 degrees. Since the given wind direction data is north-facing (that is, measured relative to north), this indicates that the wind blows mostly from west to east.

ISSS608 AngadSrivastava DataPrep 9.jpg
Figure 9
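One way to read a direction histogram like this is to bin the bearings into compass sectors. The sketch below assumes the direction is a bearing in degrees measured clockwise from north; that convention, the four 90-degree sectors and the sample bearings are all assumptions made for illustration.

```python
from collections import Counter

SECTORS = ("N", "E", "S", "W")


def sector(direction_deg):
    """Map a bearing (degrees clockwise from north, assumed) to a 90-degree sector.

    N covers 315-45, E covers 45-135, S covers 135-225, W covers 225-315.
    """
    return SECTORS[int(((direction_deg % 360) + 45) // 90) % 4]


# Made-up bearings, not the challenge data.
directions = [10, 170, 200, 250, 270, 300, 330, 355]
sector_counts = Counter(sector(d) for d in directions)

for name in SECTORS:
    print(name, sector_counts.get(name, 0))
```

Finer sectors (for example 16 compass points) would follow the same pattern with a smaller bin width.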


Factory Identification Data

This is a manually created, derived dataset that uses all of the aforementioned data tables. Its purpose is to map the chemical readings captured by Sensors at each time frame to the general wind speed and direction, in an effort to perform geolocation analysis and estimate the factory origin of chemical emissions.

This dataset combines the Sensor data, Meteorological data and Geolocation data to perform an overall geographical analysis. Steps to recreate this dataset are as follows:

  • Using JMP Pro 13, a Cartesian join was performed on Meteorological data and Geolocation data.


ISSS608 AngadSrivastava DataPrep 10.jpg
Figure 10


The resultant dataset contains all Meteorological records for all 9 sensors and 4 factories, as shown below. Our investigation only requires us to consider the Sensor data and estimate the factory origin of chemical emissions. The next 2 steps show the process of retrieving this subset of data.
  • The Sensor names were recoded to their numerical equivalents before creating a subset, as shown below:
ISSS608 AngadSrivastava DataPrep 11.jpg
Figure 11


  • A subset of this dataset was created to retain records for the 9 Sensors only, excluding the 4 factories. This was done using the Subset functionality of JMP on the field “Type”.
ISSS608 AngadSrivastava DataPrep 12.jpg
Figure 12


  • The resultant Subset created for Sensors with 6,345 rows is shown below:


ISSS608 AngadSrivastava DataPrep 13.jpg
Figure 13


ISSS608 AngadSrivastava DataPrep 14.jpg
Figure 14
  • Further investigation showed that, in order to plot the influence area of each Sensor and trace it back to the Factory origin of chemical emissions, any visualization requires a polygon that geographically covers the area of meteorological influence.
By applying and modifying a theoretical understanding of the Windrose model, a triangle was chosen as the polygon. To plot a triangle for each Sensor based on the wind speed and direction at every given time frame, the data must contain 3 coordinate points to capture the 3 corners of the triangle. To this end, a dummy table containing the 3 coordinate types was created, as shown in the adjacent image.
  • A Cartesian join was performed on these 2 data tables (the Sensor subset and the dummy table), which resulted in a consolidated data table with 19,035 records in the following format:


ISSS608 AngadSrivastava DataPrep 15.jpg
Figure 15


  • Finally, this dataset was loaded into Tableau to perform visual analysis. The corresponding Sensor readings for the time frames captured in this dataset were mapped by performing an inner join with the Sensor data in Tableau. The screenshot below shows the inner join performed on the Sensor and Date time fields.


ISSS608 AngadSrivastava DataPrep 16.jpg
Figure 16


The limitation of the last step is that the inner join retains only the time frames present in the 3-hourly Meteorological data, so the analysis performed in Question 3 of Mini Challenge 2 from the VAST Challenge 2017 is based on Sensor readings every 3 hours and not every hour.
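The join steps above were carried out in JMP Pro 13 and Tableau. As a rough illustration of the same logic, the sketch below runs the pipeline on a tiny hypothetical input: a Cartesian join of Meteorological and Geolocation records, recoding and subsetting to Sensors, a second Cartesian join with the 3 triangle vertex types, and an inner join with the Sensor readings. All field names, sensor names, coordinates and values are assumptions. The recode and subset are combined into one step here, and the computation of the triangle coordinates themselves is not covered.

```python
from itertools import product

# Hypothetical miniature inputs; none of these values come from the challenge data.
met = [  # one record every 3 hours
    {"DateTime": "2016-04-01 00:00", "WindDir": 190.0, "WindSpeed": 3.1},
    {"DateTime": "2016-04-01 03:00", "WindDir": 210.0, "WindSpeed": 2.4},
]
geo = [
    {"Name": "Sensor 1", "Type": "Sensor", "X": 10, "Y": 20},
    {"Name": "Sensor 2", "Type": "Sensor", "X": 30, "Y": 40},
    {"Name": "Factory A", "Type": "Factory", "X": 50, "Y": 60},
]
vertices = [{"Vertex": 1}, {"Vertex": 2}, {"Vertex": 3}]  # dummy table
readings = [  # hourly readings
    {"Monitor": 1, "DateTime": "2016-04-01 00:00", "Reading": 0.3},
    {"Monitor": 1, "DateTime": "2016-04-01 01:00", "Reading": 0.5},
    {"Monitor": 1, "DateTime": "2016-04-01 03:00", "Reading": 0.2},
    {"Monitor": 2, "DateTime": "2016-04-01 00:00", "Reading": 0.8},
    {"Monitor": 2, "DateTime": "2016-04-01 02:00", "Reading": 0.1},
    {"Monitor": 2, "DateTime": "2016-04-01 03:00", "Reading": 0.4},
]


def cross_join(left, right):
    """Cartesian join: every left row paired with every right row."""
    return [{**a, **b} for a, b in product(left, right)]


def recode(row):
    """Add an integer Monitor id parsed from a name like 'Sensor 3' (assumed naming)."""
    return {**row, "Monitor": int(row["Name"].split()[-1])}


met_geo = cross_join(met, geo)                                          # 2 x 3 rows
sensors_only = [recode(r) for r in met_geo if r["Type"] == "Sensor"]    # 2 x 2 rows
triangles = cross_join(sensors_only, vertices)                          # 4 x 3 rows

# Inner join on (Monitor, DateTime): rows without a match on both sides are dropped.
by_key = {}
for r in readings:
    by_key.setdefault((r["Monitor"], r["DateTime"]), []).append(r)
joined = [
    {**t, **r}
    for t in triangles
    for r in by_key.get((t["Monitor"], t["DateTime"]), [])
]

print(len(met_geo), len(sensors_only), len(triangles), len(joined))
print(sorted({row["DateTime"] for row in joined}))
```

The same multiplication explains the reported row counts: the Sensor subset of 6,345 rows joined with 3 vertex types gives 6,345 x 3 = 19,035 records, and 6,345 / 9 sensors implies 705 Meteorological records per sensor. In the sketch, the hourly readings at 01:00 and 02:00 have no Meteorological counterpart and drop out of the inner join, which is the 3-hourly limitation noted above.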