Difference between revisions of "ISSS608 2017-18 T3 Assign Vaishnavi Praveen Agarwal DataPrep"
Line 34: | Line 34: | ||
==Data Description== | ==Data Description== | ||
+ | <div style="margin:0px; padding: 2px; background: #E6E6FA; font-family: Arial; border-radius: 1px; text-align:left"> | ||
+ | {| class="wikitable" style="background-color:#FFFFFF;" width="100%" | ||
+ | |- | ||
+ | | | ||
+ | <b>Step</b> | ||
+ | || | ||
+ | <b>Approach</b> | ||
+ | || | ||
+ | <b>Description</b> | ||
+ | |||
+ | |- | ||
+ | | | ||
+ | 1. | ||
+ | || | ||
+ | <b>Data Understanding</b> | ||
+ | || | ||
+ | <b>i. Read in Raster Layer (Lekagul Roadways Map)</b> | ||
+ | * It is a single-layer raster file of 200 x 200 cells. | ||
+ | |||
+ | class : RasterLayer | ||
+ | <br> dimensions : 200, 200, 40000 (nrow, ncol, ncell) | ||
+ | <br> resolution : 1, 1 (x, y) | ||
+ | <br> extent : 0, 200, 0, 200 (xmin, xmax, ymin, ymax) | ||
+ | <br> coord. ref. : NA | ||
+ | <br> names : Lekagul_Roadways_2018 | ||
+ | <br> values : 0, 255 (min, max) | ||
+ | |||
+ | |||
+ | <b>ii. Find out structure of Raster Layer</b> | ||
+ | <br> Extent : 40000 | ||
+ | <br> CRS arguments : NA | ||
+ | <br> File Size : 41078 | ||
+ | <br> Object Size : 14376 bytes | ||
+ | <br> Layer : 1 | ||
+ | |- | ||
+ | | | ||
+ | |||
+ | 2. | ||
+ | || | ||
+ | <b>Data Cleaning</b> | ||
+ | || | ||
+ | <b>i. Import two CSV Files (Birds)</b> | ||
+ | * 2081 Training Birds (Metadata) | ||
+ | * 15 Test Birds (Provided by Kasios) | ||
+ | |||
+ | |||
+ | <b>ii. Fix Data Quality Issues</b> | ||
+ | * Change File ID from numeric to character | ||
+ | * Change coordinates to numeric | ||
+ | * Change Date from Character to Date | ||
+ | * Omit the two NA values for the Y coordinate. | ||
+ | * Clean the Dates (standardise all to m/d/y; where the month or year is missing, replace the date with NA; where only the day is missing, impute the 1st day of the month) | ||
+ | * Clean the Timing (standardise all to 24-hour format, using “.” instead of ":" as the separator) | ||
+ | * Clean the Vocalisation Type (standardise all to lower case; recode values containing both ‘song’ and ‘call’ as ‘call’, on the assumption that a call signals distress while ‘song’ is the default) | ||
+ | * Clean the Quality (Recode ‘no score’ as ‘NA’) | ||
+ | |||
+ | |||
+ | |||
+ | <b>iii. Data Manipulation</b> | ||
+ | * Extract out the “Year” and “Month” from the date, as new columns | ||
+ | * Create a new column for Quarter (Q1,Q2,Q3,Q4) & Season (Spring, Summer, Fall, Winter) | ||
+ | |||
+ | |||
+ | <b>iv. Geospatial File Compatibility</b> | ||
+ | * Convert CSV file (2081 birds) into the following: | ||
+ | ** spatial point data frame | ||
+ | ** sp format | ||
+ | ** shp format | ||
+ | ** st_read compatible format | ||
+ | ** readOGR compatible format | ||
+ | ** ppp format (for spatstat compatibility) | ||
+ | |||
+ | |||
+ | <b>v. Data Overview & Exploration</b> | ||
+ | * Overlay 2081 Birds, Raster Map & Dumping Site, for an integrated overview using plot() | ||
+ | * Use facet_wrap() to visualise location of clustering across species, across time, and across season, and by call/song in a trellis plot | ||
+ | |||
+ | |||
+ | <b>vi. Selection of Treatment & Control Groups</b> | ||
+ | * Use ‘Rose Pipits’ as Treatment Group | ||
+ | * Use ‘Ordinary Snape’ and ‘Lesser Birchbeere’ as Control Groups to see if dumping was the cause | ||
+ | * Use ‘All Birds’ as third control to see if external factors were the cause | ||
+ | |||
+ | |- | ||
+ | | | ||
+ | 3. | ||
+ | || | ||
+ | <b>Geospatial Visualisation </b> | ||
+ | || | ||
+ | <b><u>Spatial Point Pattern Visualisation</u></b> | ||
+ | |||
+ | <b>i. Prepare polygon layer </b> | ||
+ | * Create a 200x200 spatial polygon to depict the boundaries of Lekagul raster map | ||
+ | * Merge Raster Polygon with Rose Pipit Layer, using owin() from spatstat package | ||
+ | |||
+ | |||
+ | <b>ii. Kernel Density Plot </b> | ||
+ | * First, set sigma=bw.diggle (Uses cross-validation to select a smoothing bandwidth for the kernel estimation of point process intensity) | ||
+ | * Apply the Kernel Density Plot (By Year; 2012-2017) | ||
+ | ** For All Birds | ||
+ | ** For Rose Pipits only (Treatment Group) | ||
+ | ** For OS & LB only (Control Groups) | ||
+ | |||
+ | |||
+ | <b>iii. Adjust Parameters (sigma) </b> | ||
+ | * Adjust the plots by using the sigma of the most dense cluster | ||
+ | ** This is typically the largest sigma | ||
+ | |||
+ | |||
+ | <b>iv. Fine-Tune for Clearer Visualisation </b> | ||
+ | * Add the dumping site and adjust the colour/size, so that the clusters can be visualised relative to the dumping site | ||
+ | |- | ||
+ | | | ||
+ | 4. | ||
+ | || | ||
+ | <b>Statistical Confirmation </b> | ||
+ | || | ||
+ | <b><u>Spatial Point Pattern Analysis & Cluster Confirmation </u></b> | ||
+ | |||
+ | <b>i. Quadrat Analysis (Density Based Measure) </b> | ||
+ | * Apply Monte Carlo Simulation | ||
+ | * Followed by Quadrat Test to test for clustering | ||
+ | |||
+ | |||
+ | <b>ii. Nearest Neighbour (Distance Based Measure) </b> | ||
+ | * Apply Monte Carlo Simulation | ||
+ | * Followed by Clark-Evans Test to test for clustering | ||
+ | |||
+ | |||
+ | <b>iii. K-Function (Distance Based Measure) </b> | ||
+ | * Apply Monte Carlo simulation | ||
+ | * Visualise significance based on confidence envelope | ||
+ | |||
+ | |||
+ | <b>iv. K-Cross (for bivariate analysis) </b> | ||
+ | * Apply Monte Carlo simulation | ||
+ | * Visualise spatial dependence (significance) based on confidence envelope | ||
+ | |||
+ | |- | ||
+ | | | ||
+ | 5. | ||
+ | || | ||
+ | <b>Audio Processing</b> | ||
+ | || | ||
+ | <b>i. Data Preparation </b> | ||
+ | * Read in MP3 Files (Training & Testing Data) | ||
+ | * Convert to .wav format using writeWave() | ||
+ | * Convert .wav files to data frame using analyzeFolder() | ||
+ | * Read in data frame | ||
+ | |||
+ | |||
+ | <b>ii. Audio Extraction & Manipulation </b> | ||
+ | * Extract only 1 of 2 channels (choose left). | ||
+ | * Convert each sound array to floating point values ranging from -1 to 1. | ||
+ | |||
+ | |||
+ | |||
+ | |- | ||
+ | | | ||
+ | 6. | ||
+ | || | ||
+ | <b>Audio Visualisation</b> | ||
+ | || | ||
+ | <b>i. Amplitude Envelope Plot</b> | ||
+ | * Use env() to plot the envelopes of the amplitude plots | ||
+ | * Do this for all the 15 test birds | ||
+ | * Do this for 5 training birds per species and select most representative plot as ‘dictionary’ | ||
+ | |||
+ | |||
+ | <b>ii. Oscillogram Plot</b> | ||
+ | * Use seewave package to plot oscillogram | ||
+ | * Do this for all the 15 test birds | ||
+ | * Do this for 5 of the training birds, per species and select most representative plot as ‘dictionary’ | ||
+ | |||
+ | |||
+ | <b>iii. Distribution of audio parameters, using Trellis Plot</b> | ||
+ | * Run analyzeFolder() from the soundgen library on the entire collection to extract a data frame from the .wav files | ||
+ | * Identify the acoustic parameters that are particularly relevant, i.e. those whose mean exceeds their standard deviation | ||
+ | * Of the 15 <b>median</b> attributes available in the extracted data frame (the mean and sd attributes are not used), the following 7 are used for analysis because they vary the most across species: | ||
+ | ** dom_median, HNR_median, meanFreq_median, peakFreq_median, pitch_median, pitchAutocor_median, pitchSpec_median | ||
+ | * The other 8 median attributes vary less across species and are left out | ||
+ | * Use ggplot() to plot a trellis plot using the 19 training species | ||
+ | * Mark the mean with a solid black line | ||
+ | * Use ggplot() to overlay the 15 test birds, with a blue dotted line marking the test bird's mean | ||
+ | * For each parameter, visually identify the 3 species closest to the mean | ||
+ | * Select the species that ranks among the closest for the largest number of parameters | ||
+ | |- | ||
+ | | | ||
+ | 7. | ||
+ | || | ||
+ | <b>Audio Classification</b> | ||
+ | || | ||
+ | <b>i. Decision Tree</b> | ||
+ | * Use rpart, caret and e1071 libraries | ||
+ | * Using the data frame extracted with analyzeFolder() for the 2081 birds, set aside 70% as training data and 30% as validation data | ||
+ | * Build decision tree model using training data. Evaluate misclassification rate. | ||
+ | * Apply model to 15 testing birds to predict the species | ||
+ | |||
+ | |||
+ | <b>ii. Random Forest</b> | ||
+ | * Use randomForest library to create a Random Forest model with default parameters | ||
+ | * Then fine-tune the model by changing the number of trees (ntree) and the number of variables randomly sampled at each split (mtry): | ||
+ | ** ntree: number of trees to grow. This should not be set too low, so that every input row gets predicted at least a few times. | ||
+ | ** mtry: number of variables randomly sampled as candidates at each split. The default for classification is sqrt(p), where p is the number of variables in x. | ||
+ | * Compare the RF's misclassification rate with the Decision Tree's | ||
+ | * Compare with visualisation plots, to see if the prediction matches | ||
+ | |||
+ | |} | ||
+ | </div> |
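The cleaning and manipulation rules listed under step 2 (ii and iii) could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the sample rows, the column names (File_ID, Vocalization_type, Quality, Time, Date) and the month-to-season mapping (northern-hemisphere meteorological seasons) are hypothetical and are not taken from the actual metadata file. Imputing missing days or months is not covered here.

```r
# Hypothetical sample rows; the column names are assumptions, not the real schema.
birds <- data.frame(
  File_ID = c(101, 102, 103),
  Vocalization_type = c("Song", "call, song", "CALL"),
  Quality = c("A", "no score", "B"),
  Time = c("14:30", "9:05", "23:59"),
  Date = c("3/15/2016", "11/2/2017", "7/1/2015"),
  stringsAsFactors = FALSE
)

# File ID: numeric -> character
birds$File_ID <- as.character(birds$File_ID)

# Vocalisation type: lower case; values containing both "song" and "call" become "call"
voc <- tolower(birds$Vocalization_type)
both <- grepl("call", voc) & grepl("song", voc)
voc[both] <- "call"
birds$Vocalization_type <- voc

# Quality: recode "no score" as NA
birds$Quality[tolower(birds$Quality) %in% "no score"] <- NA

# Timing: use "." instead of ":" as the separator
birds$Time <- sub(":", ".", birds$Time, fixed = TRUE)

# Date: character (m/d/y) -> Date
birds$Date <- as.Date(birds$Date, format = "%m/%d/%Y")

# New columns: Year, Month, Quarter and Season
birds$Year <- as.integer(format(birds$Date, "%Y"))
birds$Month <- as.integer(format(birds$Date, "%m"))
birds$Quarter <- paste0("Q", (birds$Month - 1) %/% 3 + 1)

# Assumed mapping: Dec-Feb Winter, Mar-May Spring, Jun-Aug Summer, Sep-Nov Fall
season_by_month <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
                     "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
birds$Season <- season_by_month[birds$Month]

print(birds)
```

A lookup vector indexed by month keeps the season rule in one place, so a different season definition only needs that vector changed.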
Revision as of 14:59, 6 July 2018