Difference between revisions of "ISSS608 2017-18 T3 Assign Chan En Ying Grace Methodology"
Line 71: | Line 71: | ||
==<font size="5"><font color="#000000">'''Approach Taken'''</font></font>== | ==<font size="5"><font color="#000000">'''Approach Taken'''</font></font>== | ||
− | The following outlines the 7 key steps used for the analysis - Data Cleaning, Data Preparation, Geospatial Visualisation, Statistical Confirmation, Audio Processing, Audio Visualisation and Audio Classification. For the full codes, the R | + | The following outlines the 7 key steps used for the analysis - Data Cleaning, Data Preparation, Geospatial Visualisation, Statistical Confirmation, Audio Processing, Audio Visualisation and Audio Classification. For the full codes, the R notebook is uploaded in the assignment folder. |
<div style="margin:0px; padding: 2px; background: #E6E6FA; font-family: Arial; border-radius: 1px; text-align:left"> | <div style="margin:0px; padding: 2px; background: #E6E6FA; font-family: Arial; border-radius: 1px; text-align:left"> | ||
Line 147: | Line 147: | ||
<b>v. Data Overview & Exploration</b> | <b>v. Data Overview & Exploration</b> | ||
− | * Overlay 2081 Birds, Raster Map & Dumping Site, for an integrated overview using | + | * Overlay 2081 Birds, Raster Map & Dumping Site, for an integrated overview using plot() |
− | * Use | + | * Use facet_wrap() to visualise location of clustering across species, across time, and across season, and by call/song in a trellis plot |
Line 166: | Line 166: | ||
<b>i. Prepare polygon layer </b> | <b>i. Prepare polygon layer </b> | ||
* Create a 200x200 spatial polygon to depict the boundaries of Lekagul raster map | * Create a 200x200 spatial polygon to depict the boundaries of Lekagul raster map | ||
− | * Merge Raster Polygon with Rose Pipit Layer, using | + | * Merge Raster Polygon with Rose Pipit Layer, using owin() from spatstat package |
Line 220: | Line 220: | ||
<b>i. Data Preparation </b> | <b>i. Data Preparation </b> | ||
* Read in MP3 Files (Training & Testing Data) | * Read in MP3 Files (Training & Testing Data) | ||
− | * Convert to .wav format using | + | * Convert to .wav format using writeWave() |
− | * Convert .wav files to data frame using | + | * Convert .wav files to data frame using analyzeFolder() |
* Read in data frame | * Read in data frame | ||
Line 246: | Line 246: | ||
|| | || | ||
<b>i. Amplitude Envelope Plot</b> | <b>i. Amplitude Envelope Plot</b> | ||
− | * Use | + | * Use env() to plot the envelopes of the amplitude plots |
* Do this for all the 15 test birds | * Do this for all the 15 test birds | ||
* Do this for 5 training birds per species and select most representative plot as ‘dictionary’ | * Do this for 5 training birds per species and select most representative plot as ‘dictionary’ | ||
Line 258: | Line 258: | ||
<b>iii. Distribution of audio parameters, using Trellis Plot</b> | <b>iii. Distribution of audio parameters, using Trellis Plot</b> | ||
− | * Run | + | * Run analyzeFolder() from soundgen library to the entire collections to extract dataframe from .wav files |
* Find the acoustic parameters that are particularly relevant i.e. mean>standard deviation | * Find the acoustic parameters that are particularly relevant i.e. mean>standard deviation | ||
* Out of the 15 <b>median</b> (mean and sd not used) attributes available after extracting the dataframe, the following 7 will be used for analysis as they produced greatest variation across species: | * Out of the 15 <b>median</b> (mean and sd not used) attributes available after extracting the dataframe, the following 7 will be used for analysis as they produced greatest variation across species: | ||
Line 275: | Line 275: | ||
|| | || | ||
<b>i. Decision Tree</b> | <b>i. Decision Tree</b> | ||
− | * Use | + | * Use rpart, caret and e1071 libraries |
− | * Using the extracted dataframe with | + | * Using the extracted dataframe with analyzeFolder(), out of the 2081 birds, set 70% as training data, 30% as validation data |
* Build decision tree model using training data. Evaluate misclassification rate. | * Build decision tree model using training data. Evaluate misclassification rate. | ||
* Apply model to 15 testing birds to predict the species | * Apply model to 15 testing birds to predict the species | ||
Line 282: | Line 282: | ||
<b>ii. Random Forest</b> | <b>ii. Random Forest</b> | ||
− | * Use | + | * Use randomForest library to create a Random Forest model with default parameters |
− | * Then we will fine tune the model by changing | + | * Then we will fine tune the model by changing 'mtry' |
* We can tune the random forest model by changing the number of trees (ntree) and the number of variables randomly sampled at each stage (mtry). | * We can tune the random forest model by changing the number of trees (ntree) and the number of variables randomly sampled at each stage (mtry). | ||
** Ntree: Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times. | ** Ntree: Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times. | ||
Line 293: | Line 293: | ||
</div> | </div> | ||
− | Note: Frequency Analysis using the Fourier Transformation, for a Spectrogram Plot comparison, was attempted but did not produce good variation across the species. Pitch contour and pitch tracking with identified peaks also produced little variation in visualisations. These methods were hence dropped from this analysis but the visualisations are made available in the R | + | Note: Frequency Analysis using the Fourier Transformation, for a Spectrogram Plot comparison, was attempted but did not produce good variation across the species. Pitch contour and pitch tracking with identified peaks also produced little variation in visualisations. These methods were hence dropped from this analysis but the visualisations are made available in the R notebook for reference. |
<br> | <br> |
Latest revision as of 19:16, 30 June 2018
Tools
R was the primary tool used in this analysis. SAS JMP Pro, Tableau and QGIS were also used to supplement the initial stage of the exploratory analysis.
The following lists the packages used within the project's scope, covering data cleaning, data visualisation, geospatial analysis and audio processing/classification.
|
Approach Taken
The analysis follows 7 key steps: Data Cleaning, Data Preparation, Geospatial Visualisation, Statistical Confirmation, Audio Processing, Audio Visualisation and Audio Classification. The full code is available in the R notebook uploaded to the assignment folder.
Step | Approach | Description
1. | Data Understanding | i. Read in Raster Layer (Lekagul Roadways Map); class : RasterLayer
2. | Data Cleaning | i. Import two CSV Files (Birds); iii. Data Manipulation
3. | Geospatial Visualisation | Spatial Point Pattern Visualisation: i. Prepare polygon layer
4. | Statistical Confirmation | Spatial Point Pattern Analysis & Cluster Confirmation: i. Quadrat Analysis (Density Based Measure)
5. | Audio Processing | i. Data Preparation
6. | Audio Visualisation | i. Amplitude Envelope Plot
7. | Audio Classification | i. Decision Tree
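The decision tree workflow in step 7 (a 70/30 training/validation split, a fitted tree, and a misclassification rate) can be sketched roughly as follows. This is a minimal illustration and not the code from the R notebook: the built-in iris data set stands in for the extracted bird feature table, the variable names are hypothetical, and a plain table() replaces whatever confusion-matrix helper from caret the notebook may use.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(42)    # make the random split reproducible

# Stand-in data (assumption): iris replaces the bird feature data frame,
# with Species playing the role of the bird species label.
data(iris)
n <- nrow(iris)

# 70% of rows for training, the remaining 30% for validation
train_idx <- sample(n, size = round(0.7 * n))
train <- iris[train_idx, ]
valid <- iris[-train_idx, ]

# Fit a classification tree on the training data
fit <- rpart(Species ~ ., data = train, method = "class")

# Predict the class of each validation row and compare with the truth
pred <- predict(fit, newdata = valid, type = "class")
conf_mat <- table(predicted = pred, actual = valid$Species)
misclass_rate <- mean(pred != valid$Species)

print(conf_mat)
print(misclass_rate)
```

The same fitted model could then be passed to predict() with the feature rows of the 15 test recordings as newdata to obtain their predicted species.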
Note: Frequency analysis using the Fourier transform, intended for a spectrogram plot comparison, was attempted but did not show clear variation across the species. Pitch contour and pitch tracking with identified peaks likewise produced little variation in the visualisations. These methods were therefore dropped from the analysis, but their visualisations are available in the R notebook for reference.
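As a rough illustration of the kind of frequency analysis mentioned in the note, the sketch below computes a magnitude spectrum with base R's fft() and locates its dominant frequency. It is not taken from the R notebook: the synthetic 440 Hz tone, the 8000 Hz sampling rate and all variable names are assumptions chosen for illustration, and a real recording would be read from a .wav file instead.

```r
# Synthetic signal (assumption): one second of a 440 Hz sine tone
sample_rate <- 8000                       # samples per second
n <- sample_rate                          # one second of audio
t <- (0:(n - 1)) / sample_rate            # time stamps in seconds
tone_hz <- 440
signal <- sin(2 * pi * tone_hz * t)

# Discrete Fourier transform; Mod() gives the magnitude of each complex bin
magnitude <- Mod(fft(signal))

# Keep the first half of the bins (frequencies below the Nyquist limit)
half <- magnitude[1:(n / 2)]
freqs <- (0:(n / 2 - 1)) * sample_rate / n  # frequency of each bin in Hz

# The bin with the largest magnitude marks the dominant frequency
peak_freq <- freqs[which.max(half)]
print(peak_freq)
```

A spectrogram repeats this computation over short, successive windows of the signal, so that the change in frequency content over time can be plotted.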