ISSS608 2017-18 T3 Assign Vaishnavi Praveen Agarwal DataPrep

==Data Description==
 
<div style="margin:0px; padding: 2px; background: #E6E6FA; font-family: Arial; border-radius: 1px; text-align:left">
{| class="wikitable" style="background-color:#FFFFFF;" width="100%"
|-
|
<b>Step</b>
||
<b>Approach</b>
||
<b>Description</b>
|-
|
1.
||
<b>Data Understanding</b>
||
<b>i. Read in Raster Layer (Lekagul Roadways Map)</b>
* It is a single-layer raster file, 200 x 200 cells (see the R sketch below).

class      : RasterLayer
<br> dimensions  : 200, 200, 40000  (nrow, ncol, ncell)
<br> resolution  : 1, 1  (x, y)
<br> extent      : 0, 200, 0, 200  (xmin, xmax, ymin, ymax)
<br> coord. ref. : NA
<br> names      : Lekagul_Roadways_2018
<br> values      : 0, 255  (min, max)


<b>ii. Find out the structure of the Raster Layer</b>
<br> Extent          : 40000
<br> CRS arguments  : NA
<br> File Size      : 41078
<br> Object Size    : 14376 bytes
<br> Layer          : 1
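A minimal R sketch of sub-steps i and ii, assuming the roadways map has been saved locally as an image file (the file name below is a placeholder):
<pre>
# Read the Lekagul Roadways map as a single-layer raster and inspect it
library(raster)

lekagul <- raster("Lekagul Roadways 2018.bmp")   # placeholder path to the map image

lekagul              # prints class, dimensions, resolution, extent and value range
nlayers(lekagul)     # confirm it is a single-layer raster
object.size(lekagul) # object size in memory
plot(lekagul)        # quick visual check of the 200 x 200 grid
</pre>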
|-
|
2.
||
<b>Data Cleaning</b>
||
<b>i. Import two CSV Files (Birds)</b>
* 2081 Training Birds (Metadata)
* 15 Test Birds (Provided by Kasios)


<b>ii. Fix Data Quality Issues</b> (a cleaning sketch follows this list)
* Change File ID from numeric to character
* Change the coordinates to numeric
* Change Date from character to Date format
* Omit the two records with NA values for the Y coordinate
* Clean the Dates (standardise all to m/d/y; replace a missing month/year with NA; impute a missing day as the 1st day of the month)
* Clean the Timing (standardise all to 24-hour format; use “.” instead of ":")
* Clean the Vocalisation Type (standardise all to lower case; recode values of ‘song and call’ to ‘call’, which is assumed to be a sign of distress, while ‘song’ is assumed to be the default)
* Clean the Quality (recode ‘no score’ as NA)
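A rough dplyr/lubridate sketch of these cleaning steps; the column names (File.ID, X, Y, Date, Vocalization_type, Quality) and the CSV file name are assumptions and may differ from the actual metadata file:
<pre>
library(dplyr)
library(lubridate)

birds <- read.csv("AllBirdsv4.csv", stringsAsFactors = FALSE)   # hypothetical file name

birds_clean <- birds %>%
  mutate(File.ID = as.character(File.ID),               # file ID as character
         X = as.numeric(X),                             # coordinates as numeric
         Y = as.numeric(Y),
         Date = mdy(Date),                              # m/d/y; unparseable dates become NA
         Vocalization_type = tolower(Vocalization_type),
         Vocalization_type = ifelse(grepl("song", Vocalization_type) &
                                      grepl("call", Vocalization_type),
                                    "call", Vocalization_type),
         Quality = na_if(Quality, "no score")) %>%      # recode 'no score' as NA
  filter(!is.na(Y))                                     # drop the two records with missing Y
</pre>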

<b>iii. Data Manipulation</b>
* Extract the “Year” and “Month” from the date as new columns
* Create new columns for Quarter (Q1, Q2, Q3, Q4) & Season (Spring, Summer, Fall, Winter)


<b>iv. Geospatial File Compatibility</b> (a conversion sketch follows this list)
* Convert the CSV file (2081 birds) into the following:
** spatial point data frame
** sp format
** shp format
** st_read compatible format
** readOGR compatible format
** ppp format (for spatstat compatibility)
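A minimal sketch of the format conversions, assuming birds_clean has numeric X/Y columns on the 200 x 200 map grid (no real-world CRS is defined) and that the species label column is named English_name:
<pre>
library(sp)
library(sf)
library(spatstat)

# SpatialPointsDataFrame (sp format)
birds_sp <- birds_clean
coordinates(birds_sp) <- ~ X + Y

# Write to a shapefile, which can then be re-read with st_read() or readOGR()
st_write(st_as_sf(birds_sp), "birds.shp", delete_layer = TRUE)

# ppp object for spatstat, bounded by the 200 x 200 map window,
# marked by species so that multitype analyses (e.g. Kcross) are possible
birds_ppp <- ppp(x = birds_clean$X, y = birds_clean$Y,
                 window = owin(c(0, 200), c(0, 200)),
                 marks = factor(birds_clean$English_name))
</pre>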

<b>v. Data Overview & Exploration</b>
* Overlay the 2081 birds, the raster map & the dumping site for an integrated overview, using plot()
* Use facet_wrap() to visualise the location of clustering across species, time and season, and by call/song, in a trellis plot


<b>vi. Selection of Treatment & Control Groups</b>
* Use ‘Rose Pipits’ as the Treatment Group
* Use ‘Ordinary Snape’ and ‘Lesser Birchbeere’ as Control Groups, to check whether dumping was the cause
* Use ‘All Birds’ as a third control, to check whether external factors were the cause

|-
|
3.
||
<b>Geospatial Visualisation </b>
||
<b><u>Spatial Point Pattern Visualisation</u></b>

<b>i. Prepare polygon layer</b>
* Create a 200 x 200 spatial polygon to depict the boundaries of the Lekagul raster map
* Merge the raster polygon with the Rose Pipit layer, using owin() from the spatstat package


<b>ii. Kernel Density Plot</b> (a density sketch follows this list)
* First, set sigma = bw.diggle (uses cross-validation to select a smoothing bandwidth for kernel estimation of the point process intensity)
* Apply the Kernel Density Plot (by year; 2012-2017)
** For All Birds
** For Rose Pipits only (Treatment Group)
** For Ordinary Snape & Lesser Birchbeere only (Control Groups)
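A sketch of the kernel density plot, assuming pipit_ppp is a ppp object containing only the Rose Pipit observations (built as in Step 2.iv):
<pre>
library(spatstat)

# Cross-validated bandwidth selection (Diggle's method)
bw <- bw.diggle(pipit_ppp)

# Kernel density estimate of the point intensity
pipit_kde <- density(pipit_ppp, sigma = bw)
plot(pipit_kde, main = "Rose Pipit kernel density")

# If pipit_ppp carries a 'year' mark, a per-year view can be produced with:
# plot(density(split(pipit_ppp), sigma = bw))
</pre>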

<b>iii. Adjust Parameters (sigma)</b>
* Adjust the plots by using the sigma of the densest cluster
** This is typically the largest sigma


<b>iv. Fine-Tune for Clearer Visualisation</b>
* Then add the dumping site & adjust the colour/size
* This lets us visualise the clusters relative to the dumping site
|-
|
4.
||
<b>Statistical Confirmation </b>
||
<b><u>Spatial Point Pattern Analysis & Cluster Confirmation</u></b>

<b>i. Quadrat Analysis (Density-Based Measure)</b>
* Apply Monte Carlo simulation
* Followed by a Quadrat Test to test for clustering


<b>ii. Nearest Neighbour (Density-Based Measure)</b> (a sketch of both density-based tests follows this list)
* Apply Monte Carlo simulation
* Followed by a Clark-Evans Test to test for clustering
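A sketch of the two density-based tests on the Rose Pipit point pattern; the quadrat grid size and the number of simulations are illustrative choices:
<pre>
library(spatstat)

# Quadrat test of Complete Spatial Randomness, using Monte Carlo simulation
quadrat.test(pipit_ppp, nx = 5, ny = 5, method = "MonteCarlo", nsim = 999)

# Clark-Evans test of aggregation based on nearest-neighbour distances
clarkevans.test(pipit_ppp, alternative = "clustered")
</pre>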

<b>iii. K-Function (Distance-Based Measure)</b>
* Apply Monte Carlo simulation
* Visualise significance based on the confidence envelope


<b>iv. K-Cross (for bivariate analysis)</b> (a sketch of both distance-based measures follows this list)
* Apply Monte Carlo simulation
* Visualise spatial dependence (significance) based on the confidence envelope
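A sketch of the two distance-based measures with simulation envelopes; birds_ppp is assumed to be the species-marked point pattern from Step 2.iv, and the species labels must match the mark levels in the data:
<pre>
library(spatstat)

# K-function for the Rose Pipit pattern with a 99-simulation envelope under CSR
k_env <- envelope(pipit_ppp, Kest, nsim = 99)
plot(k_env, main = "K-function with simulation envelope")

# Cross-type K-function between two species of the marked pattern
kcross_env <- envelope(birds_ppp, Kcross,
                       i = "Rose-crested Blue Pipit", j = "Ordinary Snape",
                       nsim = 99)
plot(kcross_env, main = "Cross-type K-function with simulation envelope")
</pre>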

|-
|
5.
||
<b>Audio Processing</b>
||
<b>i. Data Preparation</b>
* Read in the MP3 files (Training & Testing Data)
* Convert to .wav format using writeWave()
* Convert the .wav files to a data frame using analyzeFolder()
* Read in the data frame


<b>ii. Audio Extraction & Manipulation</b> (an audio-preparation sketch follows this list)
* Extract only 1 of the 2 channels (choose the left channel)
* Convert each sound array to floating-point values ranging from -1 to 1
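A rough tuneR sketch of the audio preparation and manipulation steps; the file name is a placeholder and the normalisation assumes the bit depth stored in the Wave object:
<pre>
library(tuneR)

snd <- readMP3("402108.mp3")              # hypothetical training file
writeWave(snd, "402108.wav")              # save a .wav copy

wav <- readWave("402108.wav")
left <- wav@left                          # keep only the left channel
left_norm <- left / 2^(wav@bit - 1)       # scale samples to roughly [-1, 1]
</pre>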
 +
|
 +
6.
 +
||
 +
<b>Audio Visualisation</b>
 +
||
 +
<b>i. Amplitude Envelope Plot</b>
 +
* Use env() to plot the envelopes of the amplitude plots
 +
* Do this for all the 15 test birds
 +
* Do this for 5 training birds per species and select most representative plot as ‘dictionary’
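A minimal seewave sketch of the amplitude envelope, reusing the Wave object from Step 5 (the file name is a placeholder); the oscillo() call previews sub-step ii below:
<pre>
library(tuneR)
library(seewave)

wav <- readWave("402108.wav")      # placeholder .wav file
env(wav, f = wav@samp.rate)        # amplitude envelope plot (sub-step i)
oscillo(wav, f = wav@samp.rate)    # oscillogram (sub-step ii)
</pre>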

<b>ii. Oscillogram Plot</b>
* Use the seewave package to plot the oscillogram
* Do this for all 15 test birds
* Do this for 5 of the training birds per species and select the most representative plot as a ‘dictionary’


<b>iii. Distribution of audio parameters, using a Trellis Plot</b> (a plotting sketch follows this list)
* Run analyzeFolder() from the soundgen library on the entire collection to extract a data frame from the .wav files
* Find the acoustic parameters that are particularly relevant, i.e. mean > standard deviation
* Out of the 15 <b>median</b> attributes available after extracting the data frame (the mean and sd attributes are not used), the following 7 will be used for analysis as they produced the greatest variation across species:
** dom_median, HNR_median, meanFreq_median, peakFreq_median, pitch_median, pitchAutocor_median, pitchSpec_median
* These were selected as they vary more across the species than the other 8 variables
* Use ggplot() to plot a trellis plot of the 19 training species
* Label the mean using a black solid line
* Use ggplot() to overlay the 15 testing birds, with a blue dotted line as each testing bird's mean
* Visualise and identify the top 3 closest species to the mean, by parameter
* Select the species based on the largest number of parameters for which it is closest
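A rough ggplot2 sketch of the trellis view; feat_df is assumed to be the data frame returned by analyzeFolder(), joined with the species labels (the column name English_name is an assumption) and restricted to the 7 median parameters:
<pre>
library(dplyr)
library(tidyr)
library(ggplot2)

params <- c("dom_median", "HNR_median", "meanFreq_median", "peakFreq_median",
            "pitch_median", "pitchAutocor_median", "pitchSpec_median")

feat_long <- feat_df %>%
  pivot_longer(all_of(params), names_to = "parameter", values_to = "value")

species_means <- feat_long %>%
  group_by(English_name, parameter) %>%
  summarise(mu = mean(value, na.rm = TRUE), .groups = "drop")

ggplot(feat_long, aes(x = value)) +
  geom_histogram(bins = 30) +
  geom_vline(data = species_means, aes(xintercept = mu), colour = "black") +  # species mean
  facet_grid(English_name ~ parameter, scales = "free_x")
</pre>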

|-
|
7.
||
<b>Audio Classification</b>
||
<b>i. Decision Tree</b> (a modelling sketch follows this list)
* Use the rpart, caret and e1071 libraries
* Using the data frame extracted with analyzeFolder(), split the 2081 birds into 70% training data and 30% validation data
* Build a decision tree model using the training data and evaluate its misclassification rate
* Apply the model to the 15 testing birds to predict their species
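A sketch of the decision tree workflow; feat_df is assumed to hold the acoustic features plus a species factor column for the 2081 training birds, and test_feat the same features for the 15 Kasios test birds:
<pre>
library(rpart)
library(caret)

set.seed(608)
idx   <- createDataPartition(feat_df$species, p = 0.7, list = FALSE)
train <- feat_df[idx, ]
valid <- feat_df[-idx, ]

tree_fit <- rpart(species ~ ., data = train, method = "class")

# Misclassification rate on the 30% validation set
valid_pred <- predict(tree_fit, valid, type = "class")
mean(valid_pred != valid$species)

# Predicted species for the 15 test birds
predict(tree_fit, test_feat, type = "class")
</pre>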

<b>ii. Random Forest</b> (a tuning sketch follows this list)
* Use the randomForest library to create a Random Forest model with default parameters
* Then fine-tune the model by changing 'mtry'
* The random forest model can be tuned by changing the number of trees (ntree) and the number of variables randomly sampled at each split (mtry):
** ntree: the number of trees to grow. This should not be set too small, to ensure that every input row gets predicted at least a few times.
** mtry: the number of variables randomly sampled as candidates at each split. The default value for classification is sqrt(p), where p is the number of variables in x.
* Compare the Random Forest's misclassification rate with the Decision Tree's
* Compare with the visualisation plots, to see if the predictions match
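A sketch of the random forest workflow with a simple mtry search; the train, valid and test_feat objects from the decision tree sketch are assumed:
<pre>
library(randomForest)

set.seed(608)
rf_default <- randomForest(species ~ ., data = train)   # defaults: ntree = 500, mtry = sqrt(p)

# Try a few mtry values and keep the one with the lowest validation error
errs <- sapply(2:6, function(m) {
  fit <- randomForest(species ~ ., data = train, mtry = m, ntree = 500)
  mean(predict(fit, valid) != valid$species)
})
best_mtry <- (2:6)[which.min(errs)]

rf_best <- randomForest(species ~ ., data = train, mtry = best_mtry, ntree = 500)
mean(predict(rf_best, valid) != valid$species)   # compare with the decision tree's rate
predict(rf_best, test_feat)                      # predicted species for the 15 test birds
</pre>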

|}
</div>