ISSS608 2016-17 T3 Assign ANGAD SRIVASTAVA DataPrep
The visualization exercises covered in the next 2 tabs make use of 4 distinct datasets. A brief overview of the data preparation and univariate analysis for all 4 datasets is covered in subsequent sections. These datasets and the corresponding analysis were prepared using both JMP Pro 13 and Tableau.
A brief description of all 4 datasets and preparatory steps is given below:
Geolocation Data
This dataset was created from the X and Y location coordinates provided as part of the challenge description. The VAST challenge documentation provides geographical coordinates for all 4 factories and 9 sensors. In addition, a “Type” column was added to distinguish the kind of facility (Factory or Sensor) represented by each pair of coordinates.
Based on this information, the 4-column table below was created. When plotted on a 200x200 coordinate grid, it shows the following geographical layout of all factories and sensors.
Sensor Data
This dataset, as provided by the VAST challenge documentation in its original format, contains hourly readings for each of the 4 chemicals captured by every sensor. The records span 3 months: April, August and December 2016. The adjacent image shows a sample of the dataset provided. A missing-value check in JMP Pro 13 confirmed that none of the 79,243 rows in this dataset contains a missing value.
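The missing-value check described above was done in JMP Pro 13. As a rough equivalent, the sketch below counts blank or absent values per column using only the Python standard library. The column names and the three sample rows are placeholders chosen for illustration, not the challenge data, and `count_missing` is a hypothetical helper.

```python
import csv
import io


def count_missing(rows, columns):
    """Return {column: number of rows where the value is absent or blank}."""
    missing = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value.strip() == "":
                missing[col] += 1
    return missing


# Hypothetical sample: column names and values are assumptions for illustration.
SAMPLE_CSV = """Chemical,Monitor,DateTime,Reading
Chemical A,1,2016-04-01 00:00,0.25
Chemical B,1,2016-04-01 00:00,
Chemical A,2,2016-04-01 00:00,0.5
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
missing_counts = count_missing(rows, reader.fieldnames)
print(missing_counts)
```

A dataset with no missing values, as reported for the sensor data, would yield a count of zero for every column.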
Univariate analysis of the readings also shows that the chemical readings are highly right-skewed: the median reading is 0.39 and the 99.5th percentile is 6.46, while the maximum value is 101.1. These insights have been used in the preparation of selected visualizations, as covered in the next tab.
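The summary statistics quoted above come from JMP Pro 13. The sketch below shows one way to compute the same three measures (median, 99.5th percentile, maximum) in Python. It uses linear interpolation between order statistics, which is an assumption and may not match JMP's quantile method exactly. The `readings` list is made-up data that only mimics a long right tail, and `percentile` is a hypothetical helper.

```python
import statistics


def percentile(values, pct):
    """Linear-interpolation percentile (0 <= pct <= 100) of a non-empty sequence."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    # Fractional position of the requested percentile within the sorted data.
    position = (len(ordered) - 1) * pct / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up readings with one extreme value, for illustration only.
readings = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 50.0]

summary = {
    "median": statistics.median(readings),
    "p99_5": percentile(readings, 99.5),
    "max": max(readings),
}
print(summary)
```

A large gap between the median and the upper percentiles, as in the sensor data, is the pattern that signals a heavy right tail.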
The distribution of the Monitor field, which is numeric but treated as nominal, was also analysed to ascertain how many readings each sensor recorded. As shown in the adjacent image, all 9 sensors capture a similar number of readings across the 3 months of data provided, with minor differences as shown in the frequency count below:
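A frequency count of this kind can be sketched in Python with `collections.Counter`. The `monitor_ids` list is a hypothetical stand-in for the Monitor column, and `spread` is simply an illustrative measure of how far apart the largest and smallest counts are.

```python
from collections import Counter

# Hypothetical Monitor values; the real column holds one sensor ID per reading.
monitor_ids = [1, 2, 3, 1, 2, 3, 1, 2]

# Number of readings recorded by each sensor.
counts = Counter(monitor_ids)

# Difference between the most and least frequent sensor.
spread = max(counts.values()) - min(counts.values())

for monitor, n in sorted(counts.items()):
    print(monitor, n)
print("spread:", spread)
```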
Meteorological Data
This dataset, as provided by the VAST challenge documentation in its original format, contains wind-related atmospheric information. Readings are recorded once every 3 hours and capture the general wind direction and wind speed for that time frame. A snippet of this dataset is shown in the adjacent image.
The relevance of the data field for “Elevation” given in this dataset is unclear and requires additional information. For the purpose of this investigation, this field has been ignored until further information can be provided.
Possible missing values were checked with the following results:
As shown above, there are 2 anomalous records. Further investigation showed that these are an empty row and a record with missing values on 30th August, 2016 at 3AM. Both records were removed as part of the data cleaning process.
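The cleaning step above can be sketched as a filter that separates complete records from incomplete ones. The field names in `REQUIRED` and all sample values are assumptions for illustration, and `is_complete` and `split_records` are hypothetical helpers rather than the procedure actually used in JMP Pro 13.

```python
def is_complete(record, required_fields):
    """True when every required field is present and non-blank."""
    for field in required_fields:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            return False
    return True


def split_records(records, required_fields):
    """Return (kept, dropped) lists, preserving the original order."""
    kept, dropped = [], []
    for record in records:
        target = kept if is_complete(record, required_fields) else dropped
        target.append(record)
    return kept, dropped


# Assumed field names; values are made up to mimic one empty row and one
# record with missing wind measurements.
REQUIRED = ["Date", "Wind Direction", "Wind Speed (m/s)"]
records = [
    {"Date": "2016-08-29 21:00", "Wind Direction": "190.0", "Wind Speed (m/s)": "3.1"},
    {"Date": "", "Wind Direction": "", "Wind Speed (m/s)": ""},
    {"Date": "2016-08-30 03:00", "Wind Direction": "", "Wind Speed (m/s)": ""},
    {"Date": "2016-08-30 06:00", "Wind Direction": "200.0", "Wind Speed (m/s)": "2.4"},
]

kept, dropped = split_records(records, REQUIRED)
print(len(kept), "kept;", len(dropped), "dropped")
```

Keeping the dropped records in a separate list makes it easy to inspect exactly which rows were removed before discarding them.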
Univariate analysis of Wind Speed (m/s) shows that wind speeds are spread across the range from 0.1 m/s to 6.8 m/s.
Frequency analysis of Wind Direction shows that readings are concentrated between 150 and 360 degrees. Since the given wind direction is measured relative to north (north-facing), the frequency distribution indicates that the wind blows mostly from west to east.
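To make a direction frequency analysis like this easier to read, bearings can be grouped into compass sectors. The sketch below assumes bearings are given in degrees clockwise from north and uses 8 sectors of 45 degrees each, which is an illustrative choice. It only labels bearings: whether a bearing denotes where the wind comes from or where it blows to should be confirmed against the challenge documentation. The `bearings` list is made up.

```python
from collections import Counter

# Sector labels in clockwise order, starting with the sector centred on north.
SECTORS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def compass_sector(degrees):
    """Map a bearing in degrees clockwise from north to one of 8 sectors."""
    # Shift by half a sector (22.5) so that each sector is centred on its label.
    index = int(((degrees % 360) + 22.5) // 45) % 8
    return SECTORS[index]


# Made-up bearings for illustration only.
bearings = [150.0, 200.0, 270.0, 300.0, 359.0, 45.0]

sector_counts = Counter(compass_sector(b) for b in bearings)
for sector in SECTORS:
    print(sector, sector_counts.get(sector, 0))
```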