From Visual Analytics and Applications
Revision as of 15:02, 6 July 2018
VAST Challenge: Mini Challenge 2
Data Description
File Name
|
Variables
|
Data Understanding
|
i. Read in Raster Layer (Lekagul Roadways Map)
- It is a single-layer raster file of 200 x 200 cells.
class : RasterLayer
dimensions : 200, 200, 40000 (nrow, ncol, ncell)
resolution : 1, 1 (x, y)
extent : 0, 200, 0, 200 (xmin, xmax, ymin, ymax)
coord. ref. : NA
names : Lekagul_Roadways_2018
values : 0, 255 (min, max)
ii. Find out structure of Raster Layer
Extent : 40000
CRS arguments : NA
File Size : 41078
Object Size : 14376 bytes
Layer : 1
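The inspection steps above can be sketched in R as follows. This is a minimal sketch, assuming the raster package. Because the bitmap itself is not reproduced on this page, a synthetic 200 x 200 layer of random values stands in for the roadways map; with the real file, raster() would be given its path instead.

```r
library(raster)

# Synthetic stand-in for the roadways bitmap (assumption: the values are random;
# only the 200 x 200 grid and the 0-255 range mirror the printout above).
set.seed(1)
r <- raster(nrows = 200, ncols = 200, xmn = 0, xmx = 200, ymn = 0, ymx = 200)
values(r) <- sample(0:255, ncell(r), replace = TRUE)

print(r)          # class, dimensions, resolution, extent, CRS, value range
dim(r)            # rows, columns, layers
res(r)            # cell size in x and y
extent(r)         # xmin, xmax, ymin, ymax
ncell(r)          # total number of cells
nlayers(r)        # number of layers
object.size(r)    # memory used by the R object
```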
|
Data Cleaning
|
i. Import two CSV Files (Birds)
- 2081 Training Birds (Metadata)
- 15 Test Birds (Provided by Kasios)
ii. Fix Data Quality Issues
- Change File ID from numeric to character
- Change coordinates to numeric
- Change Date from Character to Date
- Omit the two NA values for the Y coordinate.
- Clean the Dates (Standardise all to m/d/y. Where the month or year is missing, replace the date with NA. Where only the day is missing, impute the 1st day of the month.)
- Clean the Timing (Standardise all to 24-hour format. Use “.” instead of “:” as the separator.)
- Clean the Vocalisation Type (Standardise all to lower case. Where a record lists both ‘song’ and ‘call’, recode it as ‘call’, on the assumption that a call signals distress while ‘song’ is the default.)
- Clean the Quality (Recode ‘no score’ as ‘NA’)
iii. Data Manipulation
- Extract “Year” and “Month” from the date as new columns
- Create a new column for Quarter (Q1,Q2,Q3,Q4) & Season (Spring, Summer, Fall, Winter)
iv. Geospatial File Compatibility
- Convert CSV file (2081 birds) into the following:
- spatial point data frame
- sp format
- shp format
- st_read compatible format
- readOGR compatible format
- ppp format (for spatstat compatibility)
v. Data Overview & Exploration
- Overlay 2081 Birds, Raster Map & Dumping Site, for an integrated overview using plot()
- Use facet_wrap() to draw trellis plots showing where birds cluster by species, over time, by season, and by call/song
vi. Selection of Treatment & Control Groups
- Use ‘Rose Pipits’ as Treatment Group
- Use ‘Ordinary Snape’ and ‘Lesser Birchbeere’ as Control Groups to test whether the dumping was the cause
- Use ‘All Birds’ as a third control to test whether external factors were the cause
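The cleaning rules in step ii and the derived columns in step iii could be sketched in base R as below. This is an illustration only: the column names, the toy records, the use of 0 to mark a missing day, month or year, and the month-to-season mapping are all assumptions, not details taken from the real metadata file.

```r
# Toy records (assumption: column names and the use of 0 for a missing
# day, month or year are illustrative, not taken from the real metadata).
birds <- data.frame(
  File_ID = c(101, 102, 103, 104),
  Date = c("3/14/2016", "7/0/2015", "0/0/0", "11/2/2017"),
  Vocalization_type = c("Song", "call, song", "CALL", "song"),
  Quality = c("A", "no score", "B", "C"),
  stringsAsFactors = FALSE
)

birds$File_ID <- as.character(birds$File_ID)

# Dates: missing day -> 1st of the month; missing month or year -> NA.
parts <- do.call(rbind, strsplit(birds$Date, "/", fixed = TRUE))
month <- as.integer(parts[, 1])
day   <- as.integer(parts[, 2])
year  <- as.integer(parts[, 3])
day[day == 0] <- 1L
usable <- month >= 1 & month <= 12 & year > 0
iso <- ifelse(usable, sprintf("%04d-%02d-%02d", year, month, day), NA_character_)
birds$Date <- as.Date(iso, format = "%Y-%m-%d")

# Vocalisation: lower case; anything mentioning a call becomes "call".
voc <- tolower(birds$Vocalization_type)
birds$Vocalization_type <- ifelse(grepl("call", voc, fixed = TRUE), "call", "song")

# Quality: treat "no score" as missing.
birds$Quality[tolower(birds$Quality) == "no score"] <- NA

# Derived columns.
birds$Year  <- as.integer(format(birds$Date, "%Y"))
birds$Month <- as.integer(format(birds$Date, "%m"))
birds$Quarter <- ifelse(is.na(birds$Month), NA_character_,
                        paste0("Q", (birds$Month - 1L) %/% 3L + 1L))
# Season by month, northern-hemisphere meteorological seasons (assumption).
season_by_month <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
                     "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
birds$Season <- season_by_month[birds$Month]

print(birds)
```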
|
Geospatial Visualisation
|
Spatial Point Pattern Visualisation
i. Prepare polygon layer
- Create a 200x200 spatial polygon to depict the boundaries of Lekagul raster map
- Merge Raster Polygon with Rose Pipit Layer, using owin() from spatstat package
ii. Kernel Density Plot
- First, set sigma=bw.diggle (bw.diggle uses cross-validation to select a smoothing bandwidth for the kernel estimate of point process intensity)
- Apply the Kernel Density Plot (By Year; 2012-2017)
- For All Birds
- For Rose Pipits only (Treatment Group)
- For OS & LB only (Control Groups)
iii. Adjust Parameters (sigma)
- Re-plot using the sigma of the densest cluster
- This is typically the largest sigma
iv. Fine-Tune for Clearer Visualisation
- Add the dumping site and adjust the colour and size
- This lets us see the clusters relative to the dumping site
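The kernel density steps above might look roughly like this with spatstat. It is a hedged sketch on synthetic points: the two groups, their coordinates and the dumping-site location are made up for illustration, and reusing the largest cross-validated bandwidth across groups simply follows the rule stated in step iii.

```r
library(spatstat)

set.seed(42)
win <- owin(xrange = c(0, 200), yrange = c(0, 200))   # 200 x 200 map boundary

# Synthetic sightings (assumption): one clustered group, one scattered group.
clamp <- function(v) pmin(pmax(v, 0), 200)
treatment <- ppp(clamp(rnorm(150, 140, 15)), clamp(rnorm(150, 120, 15)), window = win)
control   <- runifpoint(150, win = win)
groups <- list(treatment = treatment, control = control)

# Cross-validated bandwidth per group, then one shared value for comparable maps.
sigmas <- sapply(groups, function(p) as.numeric(bw.diggle(p)))
common_sigma <- max(sigmas)
dens <- lapply(groups, function(p) density(p, sigma = common_sigma))

plot(dens$treatment, main = "Treatment group (synthetic)")
# Hypothetical dumping-site location, for illustration only.
points(100, 150, pch = 17, col = "red", cex = 1.5)
```

To repeat this by year, the same calls could be applied to each yearly subset of the points while keeping common_sigma fixed, so that the maps stay comparable.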
|
Statistical Confirmation
|
Spatial Point Pattern Analysis & Cluster Confirmation
i. Quadrat Analysis (Density-Based Measure)
- Apply Monte Carlo simulation
- Follow with the quadrat test to test for clustering
ii. Nearest Neighbour (Distance-Based Measure)
- Apply Monte Carlo simulation
- Follow with the Clark-Evans test to test for clustering
iii. K-Function (Distance-Based Measure)
- Apply Monte Carlo simulation
- Visualise significance based on the confidence envelope
iv. K-Cross (for bivariate analysis)
- Apply Monte Carlo simulation
- Visualise spatial dependence (significance) based on the confidence envelope
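The four tests above can be sketched with spatstat as follows. This is a minimal illustration on synthetic points, not the analysis itself: the clustered pattern, the 5 x 5 quadrat grid, the group labels and the simulation counts are all assumptions chosen to keep the example small.

```r
library(spatstat)

set.seed(7)
win <- owin(c(0, 200), c(0, 200))
clamp <- function(v) pmin(pmax(v, 0), 200)

# Synthetic clustered pattern standing in for one species (assumption).
pts <- ppp(clamp(rnorm(150, 140, 15)), clamp(rnorm(150, 120, 15)), window = win)

# i. Quadrat test with a Monte Carlo p-value (5 x 5 grid is an arbitrary choice).
qt <- quadrat.test(pts, nx = 5, ny = 5, method = "MonteCarlo", nsim = 999)

# ii. Clark-Evans test; an index R below 1 suggests clustering.
ce <- clarkevans.test(pts, correction = "none", alternative = "clustered", nsim = 999)

# iii. K-function with a simulation envelope under complete spatial randomness.
k_env <- envelope(pts, Kest, nsim = 99, verbose = FALSE)
plot(k_env, main = "K-function envelope (synthetic)")

# iv. Cross K-function between two labelled groups.
both <- superimpose(treatment = pts, control = runifpoint(100, win = win))
kx_env <- envelope(both, Kcross, i = "treatment", j = "control",
                   nsim = 99, verbose = FALSE)
plot(kx_env, main = "Cross K envelope (synthetic)")

print(qt$p.value)
print(ce$statistic)
```

An observed curve that lies above the envelope points to clustering (or, for the cross K-function, attraction between the two groups) at that distance; a curve below it points to regularity or repulsion.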
|
Audio Processing
|
i. Data Preparation
- Read in MP3 Files (Training & Testing Data)
- Convert to .wav format using writeWave()
- Extract acoustic features from the .wav files into a data frame using analyzeFolder()
- Read in data frame
ii. Audio Extraction & Manipulation
- Keep only one of the two channels (the left).
- Convert each sound array to floating point values ranging from -1 to 1.
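The conversion, channel extraction and rescaling steps could be sketched as below. This assumes the tuneR package, which the page does not name. Because no recordings are available here, a synthetic stereo signal stands in for a decoded MP3; a real file would first be read with tuneR's readMP3().

```r
library(tuneR)

# Synthetic 1-second stereo clip standing in for a decoded MP3 (assumption).
set.seed(3)
stereo <- Wave(
  left  = as.integer(round(8000 * sin(2 * pi * 440 * (0:7999) / 8000))),
  right = as.integer(sample(-1000:1000, 8000, replace = TRUE)),
  samp.rate = 8000, bit = 16
)

# Write to .wav, then read it back as the later steps would.
wav_path <- file.path(tempdir(), "example.wav")
writeWave(stereo, wav_path)
wav <- readWave(wav_path)

# Keep the left channel only.
left <- mono(wav, which = "left")

# Scale the 16-bit integer samples to floating point values in [-1, 1).
samples <- left@left / 2^(left@bit - 1)

range(samples)
```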
|
Audio Visualisation
|
i. Amplitude Envelope Plot
- Use env() to plot the amplitude envelope of each recording
- Do this for all 15 test birds
- Do this for 5 training birds per species and select the most representative plot as the ‘dictionary’
ii. Oscillogram Plot
- Use the seewave package to plot the oscillogram
- Do this for all 15 test birds
- Do this for 5 training birds per species and select the most representative plot as the ‘dictionary’
iii. Distribution of audio parameters, using Trellis Plot
- Run analyzeFolder() from the soundgen library on the entire collection to extract a data frame from the .wav files
- Find the acoustic parameters that are particularly relevant, i.e. those with mean > standard deviation
- Of the 15 median attributes available in the extracted data frame (the mean and sd attributes are not used), the following 7 are used for analysis because they vary the most across species:
- dom_median, HNR_median, meanFreq_median, peakFreq_median, pitch_median, pitchAutocor_median, pitchSpec_median
- The other 8 variables vary less across species
- Use ggplot() to draw a trellis plot of the 19 training species
- Mark each species' mean with a solid black line
- Use ggplot() to overlay the 15 test birds, marking each test bird's mean with a blue dotted line
- For each parameter, identify the 3 species whose means are closest to the test bird's mean
- Select the species that appears among the closest for the largest number of parameters
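The "closest species by parameter" selection in step iii can also be expressed numerically. The base R sketch below is one possible reading of that rule. The species labels, the values and the three parameters shown are made up, and the real analysis works from the plots rather than from this code.

```r
# Toy per-recording medians in Hz (assumption: species names and values are made up).
train <- data.frame(
  species = rep(c("Species A", "Species B", "Species C", "Species D", "Species E"),
                each = 2),
  pitch_median    = c(2000, 2200, 2400, 2600, 3900, 4100,  900, 1100, 2900, 3100),
  peakFreq_median = c(3000, 3200, 3400, 3600, 4900, 5100, 5900, 6100, 1900, 2100),
  dom_median      = c(1900, 2100, 3900, 4100, 2200, 2400, 2500, 2700, 4900, 5100),
  stringsAsFactors = FALSE
)
test_bird <- c(pitch_median = 2200, peakFreq_median = 3200, dom_median = 2100)

# Keep parameters whose overall mean exceeds their standard deviation.
params <- names(test_bird)
params <- params[sapply(train[params], function(v) mean(v) > sd(v))]

# Mean of each parameter per species.
species_means <- aggregate(train[params], by = list(species = train$species),
                           FUN = mean)

# For each parameter, the 3 species whose mean is closest to the test bird.
top3 <- lapply(params, function(p) {
  gap <- abs(species_means[[p]] - test_bird[[p]])
  as.character(species_means$species)[order(gap)[1:3]]
})
names(top3) <- params

# The species that appears in the most top-3 lists is the candidate label.
votes <- table(unlist(top3))
best <- names(votes)[which.max(votes)]

print(top3)
print(votes)
print(best)
```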
|
Audio Classification
|
i. Decision Tree
- Use rpart, caret and e1071 libraries
- From the data frame extracted with analyzeFolder() for the 2081 birds, set aside 70% as training data and 30% as validation data
- Build decision tree model using training data. Evaluate misclassification rate.
- Apply model to 15 testing birds to predict the species
ii. Random Forest
- Use randomForest library to create a Random Forest model with default parameters
- Then fine-tune the model by changing the number of trees (ntree) and the number of variables randomly sampled at each split (mtry):
- ntree: Number of trees to grow. This should not be set too small, so that every input row gets predicted at least a few times.
- mtry: Number of variables randomly sampled as candidates at each split. The default for classification is sqrt(p), where p is the number of variables in x.
- Compare the random forest's misclassification rate with the decision tree's
- Compare the predictions with the visualisation plots to see whether they agree
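The split, fit and compare workflow above could be sketched as follows. This is a hedged example using the built-in iris data as a stand-in for the acoustic feature data frame, so the formula, the number of predictors and the resulting error rates are illustrative only.

```r
library(rpart)
library(randomForest)

set.seed(2018)
# Stand-in data (assumption): iris replaces the acoustic feature data frame.
idx   <- sample(seq_len(nrow(iris)), size = floor(0.7 * nrow(iris)))
train <- iris[idx, ]    # 70% training
valid <- iris[-idx, ]   # 30% validation

misclass <- function(pred, truth) mean(pred != truth)

# Decision tree.
tree     <- rpart(Species ~ ., data = train, method = "class")
tree_err <- misclass(predict(tree, valid, type = "class"), valid$Species)

# Random forest with the default-style mtry = floor(sqrt(p)).
p  <- ncol(train) - 1                       # number of predictors
rf <- randomForest(Species ~ ., data = train, ntree = 500, mtry = floor(sqrt(p)))
rf_err <- misclass(predict(rf, valid), valid$Species)

# Try every possible mtry and record the validation error for each.
mtry_err <- sapply(seq_len(p), function(m) {
  fit <- randomForest(Species ~ ., data = train, ntree = 500, mtry = m)
  misclass(predict(fit, valid), valid$Species)
})

print(c(tree = tree_err, forest = rf_err))
print(mtry_err)
```

The chosen model would then be applied to the test birds' features with predict(), in the same way it is applied to the validation rows here.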
|