Difference between revisions of "ISSS608 2017-18 T3 Assign Vaishnavi Praveen Agarwal DataPrep"

From Visual Analytics and Applications
Line 44: Line 44:
 
|-
 
|-
 
|
 
|
<b>Data Understanding</b>  
+
<b>Boonsong Lekagul waterways readings</b>  
 
||
 
||
 
<b>i. Read in Raster Layer (Lekagul Roadways Map)</b>  
 
<b>i. Read in Raster Layer (Lekagul Roadways Map)</b>  
 
* It is a single layer raster file. 200x200.  
 
* It is a single layer raster file. 200x200.  
 
class      : RasterLayer
 
<br> dimensions  : 200, 200, 40000  (nrow, ncol, ncell)
 
<br> resolution  : 1, 1  (x, y)
 
<br> extent      : 0, 200, 0, 200  (xmin, xmax, ymin, ymax)
 
<br> coord. ref. : NA
 
<br> names      : Lekagul_Roadways_2018
 
<br> values      : 0, 255  (min, max)
 
 
  
 
<b>ii. Find out structure of Raster Layer</b>
 
<b>ii. Find out structure of Raster Layer</b>
Line 67: Line 58:
 
|  
 
|  
  
<b>Data Cleaning</b>  
+
<b>chemical units of measure</b>  
 
||
 
||
 
<b>i. Import two CSV Files (Birds)</b>  
 
<b>i. Import two CSV Files (Birds)</b>  
Line 90: Line 81:
 
* Create a new column for Quarter (Q1,Q2,Q3,Q4) & Season (Spring, Summer, Fall, Winter)
 
* Create a new column for Quarter (Q1,Q2,Q3,Q4) & Season (Spring, Summer, Fall, Winter)
  
 
<b>iv. Geospatial File Compatibility</b>
 
* Convert CSV file (2081 birds) into the following:
 
** spatial point data frame
 
** sp format
 
** shp format
 
** st_read compatible format
 
** readOGR compatible format
 
** ppp format (for spatstat compatibility)
 
 
 
<b>v. Data Overview & Exploration</b>
 
* Overlay 2081 Birds, Raster Map & Dumping Site, for an integrated overview using plot()
 
* Use facet_wrap() to visualise location of clustering across species, across time, and across season, and by call/song in a trellis plot
 
 
 
<b>vi. Selection of Treatment & Control Groups</b>
 
* Use ‘Rose Pipits’ as Treatment Group
 
* Use ‘Ordinary Snape’ and ‘Lesser Birchbeere’ as Control Groups to test whether dumping was the cause
 
* Use ‘All Birds’ as third control to see if external factors were the cause
 
 
|-
 
|
 
<b>Geospatial Visualisation </b>
 
||
 
<b><u>Spatial Point Pattern Visualisation</u></b>
 
 
<b>i. Prepare polygon layer </b>
 
* Create a 200x200 spatial polygon to depict the boundaries of Lekagul raster map
 
* Merge Raster Polygon with Rose Pipit Layer, using owin() from spatstat package
 
 
 
<b>ii. Kernel Density Plot </b>
 
* First, set sigma=bw.diggle (Uses cross-validation to select a smoothing bandwidth for the kernel estimation of point process intensity)
 
* Apply the Kernel Density Plot (By Year; 2012-2017)
 
** For All Birds
 
** For Rose Pipits only (Treatment Group)
 
** For OS & LB only (Control Groups)
 
 
 
<b>iii. Adjust Parameters (sigma) </b>
 
* Adjust the plots by using the sigma of the most dense cluster
 
** This is typically the largest sigma
 
 
 
<b>iv. Fine-Tune for Clearer Visualisation </b>
 
* Then add in the dumping site & adjust the colour/size
 
* So that we can visualize the clusters relative to the dumping site
 
|-
 
|
 
<b>Statistical Confirmation </b>
 
||
 
<b><u>Spatial Point Pattern Analysis & Cluster Confirmation </u></b>
 
 
<b>i. Quadrat Analysis (Density Based Measure)  </b>
 
* Apply Monte Carlo Simulation
 
* Followed by Quadrat Test to test for clustering
 
 
 
<b>ii. Nearest Neighbour (Distance Based Measure)  </b>
 
* Apply Monte Carlo Simulation
 
* Followed by Clark-Evans Test to test for clustering
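The nearest-neighbour step above can be sketched in base R. This is only an illustration of the Monte Carlo idea, assuming a hypothetical clustered pattern and a 200 x 200 study area; it compares the mean nearest-neighbour distance against simulated random patterns and is not the spatstat implementation of the Clark-Evans test.

```r
set.seed(42)  # arbitrary seed so the simulation is reproducible

# Mean nearest-neighbour distance of a point set given as x/y vectors.
mean_nn_dist <- function(x, y) {
  d <- as.matrix(dist(cbind(x, y)))
  diag(d) <- Inf                  # ignore each point's distance to itself
  mean(apply(d, 1, min))
}

# Hypothetical clustered pattern (not the real bird data), clipped to the map.
n <- 60
x <- pmin(pmax(rnorm(n, mean = 150, sd = 8), 0), 200)
y <- pmin(pmax(rnorm(n, mean = 40,  sd = 8), 0), 200)
observed <- mean_nn_dist(x, y)

# Reference distribution under complete spatial randomness (CSR).
n_sim <- 199
simulated <- replicate(
  n_sim,
  mean_nn_dist(runif(n, 0, 200), runif(n, 0, 200))
)

# One-sided Monte Carlo p-value: unusually small distances indicate clustering.
p_value <- (1 + sum(simulated <= observed)) / (n_sim + 1)
```

A small p_value means the observed points sit closer together than randomly scattered points would, which is the evidence of clustering the test looks for.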
 
 
 
<b>iii. K-Function (Distance Based Measure)  </b>
 
* Apply Monte Carlo simulation
 
* Visualise significance based on confidence envelope
 
 
 
<b>iv. K-Cross (for bivariate analysis) </b>
 
* Apply Monte Carlo simulation
 
* Visualise spatial dependence (significance) based on confidence envelope
 
 
|-
 
|
 
<b>Audio Processing</b>
 
||
 
<b>i. Data Preparation </b>
 
* Read in MP3 Files (Training & Testing Data)
 
* Convert to .wav format using writeWave()
 
* Convert .wav files to data frame using analyzeFolder()
 
* Read in data frame
 
 
 
<b>ii. Audio Extraction & Manipulation </b>
 
* Extract only 1 of 2 channels (choose left).
 
* Convert each sound array to floating point values ranging from -1 to 1.
 
 
 
 
 
|-
 
|
 
<b>Audio Visualisation</b>
 
||
 
<b>i. Amplitude Envelope Plot</b>
 
* Use env() to plot the envelopes of the amplitude plots
 
* Do this for all the 15 test birds
 
* Do this for 5 training birds per species and select most representative plot as ‘dictionary’
 
 
 
<b>ii. Oscillogram Plot</b>
 
* Use seewave package to plot oscillogram
 
* Do this for all the 15 test birds
 
* Do this for 5 of the training birds, per species and select most representative plot as ‘dictionary’
 
 
 
<b>iii. Distribution of audio parameters, using Trellis Plot</b>
 
* Run analyzeFolder() from soundgen library to the entire collections to extract dataframe from .wav files
 
* Identify the most relevant acoustic parameters, i.e. those whose mean exceeds their standard deviation
 
* Of the 15 <b>median</b> attributes in the extracted data frame (the mean and sd attributes are not used), the following 7 are used for analysis because they showed the greatest variation across species:
 
** dom_median, HNR_median, meanFreq_median, peakFreq_median, pitch_median, pitchAutocor_median, pitchSpec_median
 
* These seven vary more across species than the remaining eight median attributes
 
* Use ggplot() to plot a trellis plot using the 19 training species
 
* Mark each species mean with a solid black line
 
* Use ggplot() to overlay the 15 test birds, marking each test bird's mean with a blue dotted line
 
* Visualise and identify the top 3 closest species to the mean, by parameter
 
* Select the species that ranks among the closest for the greatest number of parameters
 
|-
 
|
 
<b>Audio Classification</b>
 
||
 
<b>i. Decision Tree</b>
 
* Use rpart, caret and e1071 libraries
 
* From the data frame extracted with analyzeFolder() for the 2081 birds, set aside 70% as training data and 30% as validation data
 
* Build decision tree model using training data. Evaluate misclassification rate.
 
* Apply model to 15 testing birds to predict the species
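The 70/30 split and the misclassification rate can be sketched in base R. The data frame below is a hypothetical stand-in for the analyzeFolder() output (species labels and values are invented for illustration), and a majority-class baseline is used in place of the actual decision tree.

```r
set.seed(2018)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for the extracted feature table: one row per recording.
n_birds <- 2081
features <- data.frame(
  species      = sample(c("species_a", "species_b", "species_c"),
                        n_birds, replace = TRUE),
  pitch_median = rnorm(n_birds),
  stringsAsFactors = FALSE
)

# Randomly assign 70% of the rows to training and the rest to validation.
train_idx  <- sample(seq_len(n_birds), size = floor(0.7 * n_birds))
train      <- features[train_idx, ]
validation <- features[-train_idx, ]

# Misclassification rate: share of predictions that differ from the true label.
misclassification_rate <- function(predicted, actual) {
  mean(predicted != actual)
}

# Baseline for comparison: always predict the most common training species.
majority <- names(which.max(table(train$species)))
baseline_error <- misclassification_rate(
  rep(majority, nrow(validation)),
  validation$species
)
```

Any fitted model should beat baseline_error on the validation rows; the same helper can score the decision tree's predictions once they are available.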
 
 
 
<b>ii. Random Forest</b>
 
* Use randomForest library to create a Random Forest model with default parameters
 
* Then fine-tune the model by changing the number of trees (ntree) and the number of variables randomly sampled as candidates at each split (mtry):
 
** ntree: Number of trees to grow. This should not be set too low, so that every input row gets predicted at least a few times.
 
** mtry: Number of variables randomly sampled as candidates at each split. The default for classification is sqrt(p), where p is the number of variables in x.
 
* Compare the random forest's misclassification rate with that of the decision tree
 
* Compare the predictions with the visualisation plots to check whether they match
 
  
 
|}
 
|}
 
</div>
 
</div>

Revision as of 15:04, 6 July 2018

VAST Challenge: Mini Challenge 2

Data Preparation
 

Data Description

File Name

Variables

Boonsong Lekagul waterways readings

i. Read in Raster Layer (Lekagul Roadways Map)

  • It is a single layer raster file. 200x200.

ii. Find out structure of Raster Layer
Number of cells : 40000
CRS arguments : NA
File Size : 41078
Object Size : 14376 bytes
Layer : 1

chemical units of measure

i. Import two CSV Files (Birds)

  • 2081 Training Birds (Metadata)
  • 15 Test Birds (Provided by Kasios)


ii. Fix Data Quality Issues

  • Change File ID from numeric to character
  • Change coordinates to numeric
  • Change Date from Character to Date
  • Omit the two NA values for the Y coordinate.
  • Clean the Dates (standardise all to m/d/y; replace dates with a missing month or year with NA; impute a missing day as the 1st of the month)
  • Clean the Timing (standardise all to 24-hour format, using “.” instead of “:” as the separator)
  • Clean the Vocalisation Type (standardise all to lower case; recode values containing both ‘song’ and ‘call’ as ‘call’, since a call is assumed to signal distress while ‘song’ is assumed to be the default)
  • Clean the Quality (Recode ‘no score’ as ‘NA’)
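
The cleaning steps above can be sketched in base R. This is a minimal illustration on a made-up four-row sample: the column names, the sample values, and the use of "0" to encode a missing day are all assumptions, not the real metadata layout.

```r
# Hypothetical sample; column names and values are illustrative only.
birds <- data.frame(
  File_ID           = c(101, 102, 103, 104),
  X                 = c("49", "125", "61", "88"),
  Y                 = c("63", "77", "120", NA),
  Date              = c("3/17/2015", "6/0/2016", "11/5/2017", "1/2/2014"),
  Time              = c("7:30", "18:05", "9:00", "12:15"),
  Vocalization_type = c("Song", "call, song", "CALL", "song"),
  Quality           = c("A", "no score", "B", "C"),
  stringsAsFactors  = FALSE
)

# File ID to character, coordinates to numeric, then drop rows with a missing Y.
birds$File_ID <- as.character(birds$File_ID)
birds$X <- as.numeric(birds$X)
birds$Y <- as.numeric(birds$Y)
birds <- birds[!is.na(birds$Y), ]

# Impute a missing day (assumed to be written as 0) as the 1st of the month,
# then parse m/d/y; dates that still cannot be parsed become NA.
birds$Date <- sub("^([0-9]+)/0+/", "\\1/1/", birds$Date)
birds$Date <- as.Date(birds$Date, format = "%m/%d/%Y")

# Use "." instead of ":" as the time separator.
birds$Time <- sub(":", ".", birds$Time, fixed = TRUE)

# Lower-case the vocalisation type; values naming both song and call become "call".
voc <- tolower(birds$Vocalization_type)
voc[grepl("call", voc) & grepl("song", voc)] <- "call"
birds$Vocalization_type <- voc

# Recode "no score" as NA.
birds$Quality[birds$Quality == "no score"] <- NA
```

The row with the missing Y coordinate is dropped, so three of the four sample rows remain after cleaning.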


iii. Data Manipulation

  • Extract out the “Year” and “Month” from the date, as new columns
  • Create a new column for Quarter (Q1,Q2,Q3,Q4) & Season (Spring, Summer, Fall, Winter)
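
The derived columns can be sketched in base R as follows. The sample dates are hypothetical, and the month-to-season mapping (December to February as Winter, and so on) is an assumption, since the page does not state which convention was used.

```r
# Hypothetical dates standing in for the cleaned Date column.
birds <- data.frame(
  Date = as.Date(c("2015-03-17", "2016-06-01", "2017-11-05", "2014-01-02"))
)

# Extract year and month as integer columns.
birds$Year  <- as.integer(format(birds$Date, "%Y"))
birds$Month <- as.integer(format(birds$Date, "%m"))

# Quarter: months 1-3 map to Q1, 4-6 to Q2, 7-9 to Q3, 10-12 to Q4.
birds$Quarter <- paste0("Q", (birds$Month - 1) %/% 3 + 1)

# Season lookup indexed by month number (assumed northern-hemisphere convention).
season_by_month <- c(
  "Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
  "Summer", "Summer", "Fall", "Fall", "Fall", "Winter"
)
birds$Season <- season_by_month[birds$Month]
```

Indexing a 12-element lookup vector by month keeps the season rule in one place, so a different convention only requires editing season_by_month.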