Difference between revisions of "ISSS608 2017-18 T3 Assign Vaishnavi Praveen Agarwal DataPrep"

From Visual Analytics and Applications
Jump to navigation Jump to search
 
(102 intermediate revisions by the same user not shown)
Line 33: Line 33:
 
|}
 
|}
  
==Data Description==
+
 
<div style="margin:0px; padding: 2px; background: #E6E6FA; font-family: Arial; border-radius: 1px; text-align:left">
+
=Data Description=
 +
<div style="margin:0px; padding: 2px; font-family: Arial; border-radius: 1px; text-align:left">
 
{| class="wikitable" style="background-color:#FFFFFF;" width="100%"
 
{| class="wikitable" style="background-color:#FFFFFF;" width="100%"
 
|-
 
|-
|  
+
|style="text-align:center"|
<b>Step</b>
+
;
 +
<b>File Name</b>
 +
|style="text-align:center"|
 +
<b>Variables</b>
 +
 
 +
|-
 +
|
 +
<b>Boonsong Lekagul waterways readings (.csv file)</b>  
 
||
 
||
<b>Approach</b>
 
||
 
<b>Description</b>
 
  
 +
The given data file has reading of 106 chemicals at 10 different locations from 1998 - 2016.
 +
 +
<b>i. id</b> (numeric , unique)
 +
<br><b>ii. value</b> (numeric)
 +
<br><b>iii. location</b> (string)
 +
<br><b>iv. sample date</b> (string)
 +
<br><b>v. measure</b> (string)
 +
 
|-
 
|-
 
|
 
|
1.
+
<b>chemical units of measure (.csv file)</b>
||
 
<b>Data Understanding</b>  
 
 
||
 
||
<b>i. Read in Raster Layer (Lekagul Roadways Map)</b>
 
* It is a single layer raster file. 200x200.
 
  
class      : RasterLayer
+
The given data file has the unit of each measure (106 measures) in which the readings were taken.
<br> dimensions  : 200, 200, 40000  (nrow, ncol, ncell)
 
<br> resolution  : 1, 1  (x, y)
 
<br> extent      : 0, 200, 0, 200  (xmin, xmax, ymin, ymax)
 
<br> coord. ref. : NA
 
<br> names      : Lekagul_Roadways_2018
 
<br> values      : 0, 255  (min, max)
 
  
 +
<b>i. measure</b> (string, unique)
 +
<br><b>ii. unit</b> (string)
  
<b>ii. Find out structure of Raster Layer</b>
 
<br> Extent          : 40000
 
<br> CRS arguments  : NA
 
<br> File Size      : 41078
 
<br> Object Size    : 14376 bytes
 
<br> Layer          : 1
 
 
|-
 
|-
|
+
|
 
+
<b>Waterways Final (.jpg file)</b>
2.
 
||
 
<b>Data Cleaning</b>  
 
 
||
 
||
<b>i. Import two CSV Files (Birds)</b>
 
* 2081 Training Birds (Metadata)
 
* 15 Test Birds (Provided by Kasios)
 
  
 +
The Waterways Final is a map image that shows the location of dumping site and the waterways.
 +
<span style=text-align:center;">[[Image:Waterways Final.jpg|200px|thumb|left]]</span>
  
<b>ii. Fix Data Quality Issues</b>  
+
|}
* Change File ID from numeric to character
+
</div>
* Change coordinates to numeric
 
* Change Date from Character to Date
 
* Omit the two NA values for the Y coordinate.
 
* Clean the Dates (All standardise to m/d/y. For missing month/year, I will replace with NA. For missing day, I will impute as 1st day of the month.)
 
* Clean the Timing (Standardise all to 24 hour formatting. Use “.” instead of ":")
 
* Clean the Vocalisation Type (Standardise all to lower case. For values consisting of both ‘song and call’, change to ‘call’, assumed as a sign of distress while ‘song’ is assumed as the default)
 
* Clean the Quality (Recode ‘no score’ as ‘NA’)
 
  
  
 +
=Data Preparation=
 +
===1. Grouping Measures===
  
<b>iii. Data Manipulation</b>
+
There are a total of 106 measures given in the data file, and it is difficult to view all the measures at a glance. In order to make the visualization simple I have grouped the measure based on their Chemical Composition and Behavior.
* Extract out the “Year” and “Month” from the date, as new columns
 
* Create a new column for Quarter (Q1,Q2,Q3,Q4) & Season (Spring, Summer, Fall, Winter)
 
  
 +
- A new column was added in the file <b>chemical units of measure.csv</b>
 +
<br>- Name the column as <b>groups</b>
 +
<br>- Chemical properties of each measure was studied and they were divided in <b>12 groups</b>.
 +
<br>- Manually assigned the relevant group to each measure.
  
<b>iv. Geospatial File Compatibility</b>  
+
<b>The 12 groups formed are:</b>
* Convert CSV file (2081 birds) into the following:
 
** spatial point data frame
 
** sp format
 
** shp format
 
** st_read compatible format
 
** readOGR compatible format
 
** ppp format (for spatstat compatibility)
 
  
 +
<div style="margin:0px; padding: 2px; font-family: Arial; border-radius: 1px; text-align:left">
 +
{| class="wikitable" style="background-color:#FFFFFF;" width="100%"
 +
|-
 +
|style="text-align:center"|
 +
;
 +
<b>Groups</b>
 +
|style="text-align:center"|
 +
<b>Measures</b>
  
<b>v. Data Overview & Exploration</b>  
+
|-
* Overlay 2081 Birds, Raster Map & Dumping Site, for an integrated overview using plot()
+
|
* Use facet_wrap() to visualise location of clustering across species, across time, and across season, and by call/song in a trellis plot
+
<b>Field measurements</b>  
 +
||
  
 +
Anionic active surfactants, Fecal coliforms, Fecal streptococci, Total coliforms, Total extractable matter, Total hardness
 +
 +
|-
 +
|
 +
<b>Heavy metals</b>
 +
||
  
<b>vi. Selection of Treatment & Control Groups</b>
+
Arsenic, Cadmium, Chromium, Copper, Iron, Lead, Manganese, Mercury, Nickel, Zinc
* Use ‘Rose Pipits’ as Treatment Group
 
* Use ‘Ordinary Snape’ and ‘Lesser Birchbeere’ as Control Groups to see if dumping were the cause
 
* Use ‘All Birds’ as third control to see if external factors were the cause
 
  
 
|-
 
|-
 
|
 
|
3.
+
<b>Hydrocarbon</b>
 
||
 
||
<b>Geospatial Visualisation </b>  
+
 
 +
Acenaphthene, Acenaphthylene, Anthracene, Benzo(a)anthracene, Benzo(a)pyrene, Benzo(b)fluoranthene, Benzo(g,h,i)perylene, Benzo(k)fluoranthene, Chrysene, Fluoranthene, Fluorene, Indeno(1,2,3-c,d)pyrene, Naphthalene, PAHs, Pentachlorobenzene, Petroleum hydrocarbons, Phenanthrene, Pyrene, Tetrachloromethane
 +
 
 +
|-
 +
|
 +
<b>Insecticides, Pesticides and Herbicides</b>  
 
||
 
||
<b><u>Spatial Point Pattern Visualisation</u></b>
 
  
<b>i. Prepare polygon layer </b>
+
Alachlor, Aldrin, Atrazine, Endosulfan (alpha), Endosulfan (beta), gamma-Hexachlorocyclohexane, Isodrin, Methoxychlor, Metolachlor, p,p-DDD, p,p-DDE, p,p-DDT, Simazine, Trifluralin
* Create a 200x200 spatial polygon to depict the boundaries of Lekagul raster map
+
 
* Merge Raster Polygon with Rose Pipit Layer, using owin() from spatstat package
+
|-
 +
|
 +
<b>Metal</b>  
 +
||
  
 +
Aluminium, Barium, Berilium, Boron, Calcium, Cesium, Magnesium, Potassium, Selenium, Sodium
  
<b>ii. Kernel Density Plot </b>
+
|-
* First, set sigma=bw.diggle (Uses cross-validation to select a smoothing bandwidth for the kernel estimation of point process intensity)
+
|
* Apply the Kernel Density Plot (By Year; 2012-2017)
+
<b>Mineral and Nutrients</b>  
** For All Birds
+
||
** For Rose Pipits only (Treatment Group)
 
** For OS & LB only (Control Groups)
 
  
 +
Dissolved silicates, Orthophosphate-phosphorus, Silica (SiO2), Total dissolved phosphorus, Total nitrogen, Total phosphorus
  
<b>iii. Adjust Parameters (sigma) </b>
+
|-
* Adjust the plots by using the sigma of the most dense cluster
+
|
** This is typically the largest sigma
+
<b>Nitrogen</b>  
 +
||
  
 +
Inorganic nitrogen, Organic nitrogen, Total organic carbon
  
<b>iv. Fine-Tune for Clearer Visualisation </b>
 
* Then add in the dumping site & adjust the colour/size
 
* So that we can visualize the clusters relative to the dumping site
 
 
|-
 
|-
 
|
 
|
4.
+
<b>Organic</b>  
||
 
<b>Statistical Confirmation </b>  
 
 
||
 
||
<b><u>Spatial Point Pattern Analysis & Cluster Confirmation </u></b>
 
  
<b>i. Quadrat Analysis (Density Based Measure)  </b>
+
alpha-Hexachlorocyclohexane, AOX, beta-Hexaxchlorocyclohexane, Dieldrin, Endrin, Heptachlor, Heptachloroepoxide, Hexachlorobenzene, PCB 101, PCB 118, PCB 138, PCB 153, PCB 180, PCB 28, PCB 52
* Apply Monti-Carlo Simulation
 
* Followed by Quadrat Test to test for clustering
 
  
 +
|-
 +
|
 +
<b>Others</b>
 +
||
  
<b>ii. Nearest Neighbour (Density Based Measure)  </b>
+
1,2,3-Trichlorobenzene, 1,2,4-Trichlorobenzene, AGOC-3A, Macrozoobenthos, Methylosmoline
* Apply Monti-Carlo Simulation
 
* Followed by Clark-Evans Test to test for clustering
 
  
 +
|-
 +
|
 +
<b>Oxygen</b>
 +
||
  
<b>iii. K-Function (Distance Based Measure) </b>
+
Biochemical Oxygen, Chemical Oxygen Demand (Cr), Chemical Oxygen Demand (Mn), Dissolved organic carbon, Dissolved oxygen, Oxygen saturation
* Apply Monti-Carlo simulation
 
* Visualise significance based on confidence envelope
 
  
 +
|-
 +
|
 +
<b>Salt</b>
 +
||
  
<b>iv. K-Cross (for bivariate analysis) </b>
+
Ammonium, Bicarbonates, Carbonates, Chlorides, Chlorodinine, Cyanides, Nitrates, Nitrites, Sulfides, Sulphates, Total dissolved salts
* Apply Monti-Carlo simulation
 
* Visualise spatial dependence (significance) based on confidence envelope
 
  
 
|-
 
|-
 
|
 
|
5.
+
<b>Water temperature</b>
 
||
 
||
<b>Audio Processing</b>
 
||
 
<b>i. Data Preparation </b>
 
* Read in MP3 Files (Training & Testing Data)
 
* Convert to .wav format using writeWav()
 
* Convert .wav files to data frame using analyzeFolder()
 
* Read in data frame
 
  
 +
Water temperature
 +
 +
|}
 +
</div>
  
<b>ii. Audio Extraction & Manipulation </b>
+
===2. Coordinate Plotting===
* Extract only 1 of 2 channels (choose left).
 
* Convert each sound array to floating point values ranging from -1 to 1.
 
  
 +
Waterways Final is a Map image provided in the data, but the data files does not contain Latitude and Longitude for the Places shown on the Map. Therefore, to use the Map as an interactive sheet, I will create the Latitude and Longitude manually in the following way:
  
<b>iii. Adjust Parameters (sigma) </b>
+
<div style="margin:0px; padding: 2px; font-family: Arial; border-radius: 1px; text-align:left">
* Adjust the plots by using the sigma of the most dense cluster
+
{| class="wikitable" style="background-color:#FFFFFF;" width="100%"
** This is typically the largest sigma
+
|-
 +
|style="text-align:center"|
 +
;
 +
<b>Steps</b>
 +
|style="text-align:center"|
 +
<b>Image</b>
  
 +
|-
 +
|
 +
<b>Adding Data and Background Image for Map</b>
  
<b>iv. Fine-Tune for Clearer Visualisation </b>
+
*Open Boonsong Lekagul waterways readings (.csv file) in Tableau, right click on Location Dimension, click on Geographic Role and select ZIP Code / Postcode. This will generate Latitude and Longitude in Measures. Add the generated fields to Columns and Rows respectively.
* Then add in the dumping site & adjust the colour/size
+
*Go to Map options and click on Background Images. In the dialogue box, browse the Waterways Final (.jpg file). In the X Field set Right value to 249 and in the Y Field set the Top value to 249. Go to options and select Always show entire image option and Apply.
* So that we can visualize the clusters relative to the dumping site
+
 +
||
 +
<span style=text-align:center;">[[Image:Vaishnavi_1.png|300px|thumb|center]]</span>
  
 
|-
 
|-
 
|
 
|
6.  
+
<b>Annotate Point to generate Coordinates</b>
 +
 
 +
*Create a new excel file Locations, add a column named Location and manually enter the places. Add other two column named X and Y.
 +
*Right click on the tip of the arrow on map, click on Annotate and select Point.
 +
*In the dialog box, insert fields Latitude(generated) and Longitude(generated).
 +
*Manually copy paste the annotated values in the Location.xlsx file, X and Y columns.
 +
*Do this for all the locations.
 +
 
 
||
 
||
<b>Audio Visualisation</b>
+
<span style=text-align:center;">[[Image:Vaishnavi_2.png|300px|thumb|center]]</span>
||
 
<b>i. Amplitude Envelope Plot</b>  
 
* Use env() to plot the envelopes of the amplitude plots
 
* Do this for all the 15 test birds
 
* Do this for 5 training birds per species and select most representative plot as ‘dictionary’
 
  
 +
|}
 +
</div>
  
<b>ii. Oscillogram Plot</b>
+
===3. Combining Files===
* Use seewave package to plot oscillogram
 
* Do this for all the 15 test birds
 
* Do this for 5 of the training birds, per species and select most representative plot as ‘dictionary’
 
  
 +
After cleaning and preparing the data, we have total 3 files: <b>Boonsong Lekagul waterways readings (.csv file)</b>, <b>chemical units of measure (.csv file)</b> and <b>Location (.xlsx file)</b>. We will combine these file using the common columns as Join clause and then do the Visualization.
 +
 +
<div style="margin:0px; padding: 2px; font-family: Arial; border-radius: 1px; text-align:left">
 +
{| class="wikitable" style="background-color:#FFFFFF;" width="100%"
 +
|-
 +
|style="text-align:center"|
 +
;
 +
<b>Steps</b>
 +
|style="text-align:center"|
 +
<b>Image</b>
  
<b>iii. Distribution of audio parameters, using Trellis Plot</b>
 
* Run analyzeFolder() from soundgen library to the entire collections to extract dataframe from .wav files
 
* Find the acoustic parameters that are particularly relevant i.e. mean>standard deviation
 
* Out of the 15 <b>median</b> (mean and sd not used) attributes available after extracting the dataframe, the following 7 will be used for analysis as they produced greatest variation across species:
 
** dom_median,HNR_median, meanFreq_median, peakFreq_median, pitch_median, pitchAutocor_median, pitchSpec_median
 
* These were selected as they vary across the species more, than the other of the 8 variables
 
* Use ggplot() to plot a trellis plot using the 19 training species
 
* Label the mean using black solid line
 
* Use ggplot() to insert the 15 testing birds, with blue dotted line as the testing bird's mean
 
* Visualise and identify the top 3 closest species to the mean, by parameter
 
* Select the species based on most no. of parameters selected as closest
 
 
|-
 
|-
 
|
 
|
7.  
+
<b>Combining Boonsong Lekagul waterways readings (.csv file) and chemical units of measure (.csv file)</b>
 +
 
 
||
 
||
<b>Audio Classification</b>  
+
*Open the Data Source, add a new Text File <b>Boonsong Lekagul waterways readings (.csv file)</b>.
 +
*Again click on Add in Connections, click on Text File and add <b>chemical units of measure (.csv file)</b>.
 +
*Click on Join and select <b>Left Join</b>.
 +
<span style=text-align:center;">[[Image:Vaishnavi_3.png|550px|thumb|center]]</span>
 +
*Use <b>Measure</b> as the Join clause between the two files.
 +
<span style=text-align:center;">[[Image:Vaishnavi_4.png|300px|thumb|center]]</span>
 +
 
 +
 
 +
|-
 +
|
 +
<b>Combining Boonsong Lekagul waterways readings (.csv file) and Location (.xlsx file)</b>
 +
 
 
||
 
||
<b>i. Decision Tree</b>  
+
*Open the Data Source, add a new Microsoft Excel File <b>Location (.xlsx file)</b>.
* Use rpart, caret and e1071 libraries
+
*Click on Join and select <b>Left Join</b>.
* Using the extracted dataframe with analyzeFolder(), out of the 2081 birds, set 70% as training data, 30% as validation data
+
<span style=text-align:center;">[[Image:Vaishnavi_5.png|400px|thumb|center]]</span>
* Build decision tree model using training data. Evaluate misclassification rate.
+
*Use <b>Location</b> as the Join clause between the two files.
* Apply model to 15 testing birds to predict the species
+
<span style=text-align:center;">[[Image:Vaishnavi_6.png|300px|thumb|center]]</span>
  
 
<b>ii. Random Forest</b>
 
* Use randomForest library to create a Random Forest model with default parameters
 
* Then we will fine tune the model by changing 'mtry'
 
* We can tune the random forest model by changing the number of trees (ntree) and the number of variables randomly sampled at each stage (mtry).
 
** Ntree: Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.
 
** Mtry: Number of variables randomly sampled as candidates at each split. Default value for classification is sqrt(p) where p is number of variables in x.
 
* Evaluate the RF's misclassification rate, with the Decision Tree
 
* Compare with visualisation plots, to see if the prediction matches
 
  
 
|}
 
|}
 
</div>
 
</div>
 +
 +
===4. Final Dataset===
 +
 +
 +
 +
<span style=text-align:center;">[[Image:Vaishnavi_7.png|1100px|thumb|left]]</span>

Latest revision as of 11:44, 8 July 2018

Pic.jpg   VAST Challenge: Mini Challenge 2

The Challenge

Data Preparation

Visualizations

Insights and Conclusion

Feedback and Comments

All Assignments

 


Data Description

File Name

Variables

Boonsong Lekagul waterways readings (.csv file)

The given data file has reading of 106 chemicals at 10 different locations from 1998 - 2016.

i. id (numeric , unique)
ii. value (numeric)
iii. location (string)
iv. sample date (string)
v. measure (string)

chemical units of measure (.csv file)

The given data file has the unit of each measure (106 measures) in which the readings were taken.

i. measure (string, unique)
ii. unit (string)

Waterways Final (.jpg file)

The Waterways Final is a map image that shows the location of dumping site and the waterways.

Waterways Final.jpg


Data Preparation

1. Grouping Measures

There are a total of 106 measures given in the data file, and it is difficult to view all the measures at a glance. In order to make the visualization simple I have grouped the measure based on their Chemical Composition and Behavior.

- A new column was added in the file chemical units of measure.csv
- Name the column as groups
- Chemical properties of each measure was studied and they were divided in 12 groups.
- Manually assigned the relevant group to each measure.

The 12 groups formed are:

Groups

Measures

Field measurements

Anionic active surfactants, Fecal coliforms, Fecal streptococci, Total coliforms, Total extractable matter, Total hardness

Heavy metals

Arsenic, Cadmium, Chromium, Copper, Iron, Lead, Manganese, Mercury, Nickel, Zinc

Hydrocarbon

Acenaphthene, Acenaphthylene, Anthracene, Benzo(a)anthracene, Benzo(a)pyrene, Benzo(b)fluoranthene, Benzo(g,h,i)perylene, Benzo(k)fluoranthene, Chrysene, Fluoranthene, Fluorene, Indeno(1,2,3-c,d)pyrene, Naphthalene, PAHs, Pentachlorobenzene, Petroleum hydrocarbons, Phenanthrene, Pyrene, Tetrachloromethane

Insecticides, Pesticides and Herbicides

Alachlor, Aldrin, Atrazine, Endosulfan (alpha), Endosulfan (beta), gamma-Hexachlorocyclohexane, Isodrin, Methoxychlor, Metolachlor, p,p-DDD, p,p-DDE, p,p-DDT, Simazine, Trifluralin

Metal

Aluminium, Barium, Berilium, Boron, Calcium, Cesium, Magnesium, Potassium, Selenium, Sodium

Mineral and Nutrients

Dissolved silicates, Orthophosphate-phosphorus, Silica (SiO2), Total dissolved phosphorus, Total nitrogen, Total phosphorus

Nitrogen

Inorganic nitrogen, Organic nitrogen, Total organic carbon

Organic

alpha-Hexachlorocyclohexane, AOX, beta-Hexaxchlorocyclohexane, Dieldrin, Endrin, Heptachlor, Heptachloroepoxide, Hexachlorobenzene, PCB 101, PCB 118, PCB 138, PCB 153, PCB 180, PCB 28, PCB 52

Others

1,2,3-Trichlorobenzene, 1,2,4-Trichlorobenzene, AGOC-3A, Macrozoobenthos, Methylosmoline

Oxygen

Biochemical Oxygen, Chemical Oxygen Demand (Cr), Chemical Oxygen Demand (Mn), Dissolved organic carbon, Dissolved oxygen, Oxygen saturation

Salt

Ammonium, Bicarbonates, Carbonates, Chlorides, Chlorodinine, Cyanides, Nitrates, Nitrites, Sulfides, Sulphates, Total dissolved salts

Water temperature

Water temperature

2. Coordinate Plotting

Waterways Final is a Map image provided in the data, but the data files does not contain Latitude and Longitude for the Places shown on the Map. Therefore, to use the Map as an interactive sheet, I will create the Latitude and Longitude manually in the following way:

Steps

Image

Adding Data and Background Image for Map

  • Open Boonsong Lekagul waterways readings (.csv file) in Tableau, right click on Location Dimension, click on Geographic Role and select ZIP Code / Postcode. This will generate Latitude and Longitude in Measures. Add the generated fields to Columns and Rows respectively.
  • Go to Map options and click on Background Images. In the dialogue box, browse the Waterways Final (.jpg file). In the X Field set Right value to 249 and in the Y Field set the Top value to 249. Go to options and select Always show entire image option and Apply.
Vaishnavi 1.png

Annotate Point to generate Coordinates

  • Create a new excel file Locations, add a column named Location and manually enter the places. Add other two column named X and Y.
  • Right click on the tip of the arrow on map, click on Annotate and select Point.
  • In the dialog box, insert fields Latitude(generated) and Longitude(generated).
  • Manually copy paste the annotated values in the Location.xlsx file, X and Y columns.
  • Do this for all the locations.
Vaishnavi 2.png

3. Combining Files

After cleaning and preparing the data, we have total 3 files: Boonsong Lekagul waterways readings (.csv file), chemical units of measure (.csv file) and Location (.xlsx file). We will combine these file using the common columns as Join clause and then do the Visualization.

Steps

Image

Combining Boonsong Lekagul waterways readings (.csv file) and chemical units of measure (.csv file)

  • Open the Data Source, add a new Text File Boonsong Lekagul waterways readings (.csv file).
  • Again click on Add in Connections, click on Text File and add chemical units of measure (.csv file).
  • Click on Join and select Left Join.
Vaishnavi 3.png
  • Use Measure as the Join clause between the two files.
Vaishnavi 4.png


Combining Boonsong Lekagul waterways readings (.csv file) and Location (.xlsx file)

  • Open the Data Source, add a new Microsoft Excel File Location (.xlsx file).
  • Click on Join and select Left Join.
Vaishnavi 5.png
  • Use Location as the Join clause between the two files.
Vaishnavi 6.png


4. Final Dataset

Vaishnavi 7.png