ISSS608 2017-18 T3 Assign Zhang Yingdi Task2
Kasios reports that plenty of Rose-crested Blue Pipits are happily living and nesting in the Preserve. They have provided a set of Pipit bird calls, recently recorded across the Preserve, together with the locations where they were recorded. They claim that the Rose-crested Blue Pipits are a thriving population.
The objective of this question is to investigate whether Kasios' claim is factual. To support the investigation, both a machine learning approach and a visualization approach are applied. The analysis is carried out mainly in R and Tableau.
Data Preparation
R is used for the data preparation. The audio files in the “ALL BIRDS” folder are used to train and test the model, and the audio files in the “Test Birds from Kasios” folder are the ones to be classified. For the machine learning approach, a classifier is built to predict whether each of the 15 given test audio files is a Pipit bird call.
All the audio files are provided in MP3 format. To analyse them, the MP3 files are first converted to .wav format using the writeWave() function. The .wav files are then converted to a data frame using the analyzeFolder() function. Because processing a large number of files takes a long time, only the audio files with quality “A” in the “ALL BIRDS” folder are used in this question. The analyzeFolder() results for the “ALL BIRDS” and “Test Birds from Kasios” folders are saved as “all_birds_wav.csv” and “test_birds_from_kasios_wav.csv” respectively.
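The conversion step could look roughly like the sketch below. It assumes the tuneR package for readMP3() and writeWave() and the soundgen package for analyzeFolder(). The helper convert_mp3_folder() and the folder paths are hypothetical, and newer soundgen releases may no longer provide analyzeFolder(), so the commented usage is illustrative rather than the exact script used here.

```r
# Hypothetical helper: convert every MP3 in `mp3_dir` to a WAV file in `wav_dir`.
# Assumes the tuneR package is installed.
convert_mp3_folder <- function(mp3_dir, wav_dir) {
  dir.create(wav_dir, showWarnings = FALSE, recursive = TRUE)
  mp3_files <- list.files(mp3_dir, pattern = "\\.mp3$",
                          full.names = TRUE, ignore.case = TRUE)
  for (f in mp3_files) {
    wave <- tuneR::readMP3(f)                 # decode the MP3 into a Wave object
    out  <- file.path(wav_dir,
                      paste0(tools::file_path_sans_ext(basename(f)), ".wav"))
    tuneR::writeWave(wave, filename = out)    # write the Wave object as .wav
  }
  invisible(length(mp3_files))                # number of files converted
}

# Example usage (placeholder paths, not necessarily the author's):
# convert_mp3_folder("ALL BIRDS", "all_birds_wav")
# features <- soundgen::analyzeFolder("all_birds_wav")
# write.csv(features, "all_birds_wav.csv", row.names = FALSE)
```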
The resulting .csv files contain 72 columns, as shown below:
The column sound holds the .wav file name. The remaining columns are numeric parameters of the audio files.
A new column with the bird name is added to the data table.
Next, a new column Target holding the target variable is added to the data table. Because the classifier predicts whether or not a test bird is a Rose-crested Blue Pipit, Target takes two values: “Rose Pipit” and “not Rose Pipit”.
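A minimal sketch of deriving the two-valued Target column is given below. The column name bird, the feature column and the non-Pipit species labels are assumptions made for illustration; only the two Target values come from the text.

```r
# Toy data frame; `bird` is an assumed name for the bird-name column and
# `feature_a` is a made-up numeric feature.
birds <- data.frame(
  bird = c("Rose-crested Blue Pipit", "Other Bird A",
           "Rose-crested Blue Pipit", "Other Bird B"),
  feature_a = c(0.12, 0.30, 0.15, 0.22),
  stringsAsFactors = FALSE
)

# Label each recording as the target species or not.
birds$Target <- factor(ifelse(birds$bird == "Rose-crested Blue Pipit",
                              "Rose Pipit", "not Rose Pipit"))

# Count recordings per class.
table(birds$Target)
```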
Next, the column sound is removed, since it only holds the .wav file name and is not used for building the model.
Next, all columns and rows are checked for missing values. The number of missing values in each column is shown below. Every row of the columns "pitchCep_median", "pitchCep_sd" and "pitchCep_mean" is “NA”, so these three columns are removed. The remaining rows that contain “NA” are then also removed from the data.
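The missing-value handling can be sketched in base R as follows. The data frame is a toy stand-in with hypothetical feature columns; only pitchCep_mean is a column name taken from the text. Note that the all-NA columns have to be dropped before the rows are filtered, otherwise every row would be discarded.

```r
# Toy stand-in for the feature table (values are made up).
df <- data.frame(
  Target = c("Rose Pipit", "not Rose Pipit", "not Rose Pipit", "Rose Pipit"),
  pitchCep_mean = c(NA_real_, NA, NA, NA),  # entirely missing, as in the text
  feature_a = c(1.2, NA, 0.7, 0.9),         # hypothetical, one missing value
  feature_b = c(10, 12, 11, 13)             # hypothetical, complete
)

# Number of missing values per column.
na_counts <- colSums(is.na(df))
print(na_counts)

# Drop columns that are NA in every row, then drop rows with any remaining NA.
all_na   <- na_counts == nrow(df)
df_clean <- df[, !all_na, drop = FALSE]
df_clean <- na.omit(df_clean)
```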
Lastly, the column Target is moved to the first position to make building the classifier easier at a later stage.
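Dropping the file-name column and moving Target to the front could be done along these lines. This is a sketch on a toy data frame with a hypothetical feature column, not the author's actual script.

```r
# Toy data frame with the file-name column, one made-up feature and the target.
df <- data.frame(
  sound = c("a.wav", "b.wav"),
  feature_a = c(0.4, 0.9),
  Target = c("Rose Pipit", "not Rose Pipit")
)

# Remove the file-name column, which is not a predictor.
df$sound <- NULL

# Put Target first and keep the other columns in their original order.
df <- df[, c("Target", setdiff(names(df), "Target"))]
names(df)
```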
Machine Learning Approach
The objective of the model is to predict whether a test bird is a Rose-crested Blue Pipit. Two classification models, Random Forest and Decision Tree, are built in this assignment. Decision Tree outputs are easy to comprehend and their graphical representation is intuitive; a Decision Tree also requires less data cleaning than other algorithms. The data are split into 70% training data and 30% test data.
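As an illustration of the 70/30 split and of fitting a decision tree, the sketch below uses the rpart package on synthetic data. The choice of rpart, the seed and the feature columns are all assumptions; the text does not say which R package was used for the tree.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(42)    # arbitrary seed so the split is reproducible
n <- 200

# Synthetic stand-in for the acoustic feature table (values are made up).
df <- data.frame(
  Target = factor(rep(c("Rose Pipit", "not Rose Pipit"), each = n / 2)),
  feature_a = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 3)),
  feature_b = runif(n)
)

# 70% of the rows for training, the remaining 30% for testing.
train_idx <- sample(seq_len(nrow(df)), size = floor(0.7 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Fit a classification tree using all other columns as predictors.
tree <- rpart(Target ~ ., data = train, method = "class")
pred <- predict(tree, newdata = test, type = "class")

# Confusion matrix and share of test rows classified correctly.
cm <- table(Predicted = pred, Actual = test$Target)
print(cm)
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

Sampling row indices once and using their complement for the test set keeps the two subsets disjoint, so the reported accuracy is measured on rows the tree has not seen.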
Random Forest
A Random Forest model is built; its accuracy on the test data is 0.96. The confusion matrix is shown below: