Difference between revisions of "ISSS608 2017-18 T3 Assign Li Hongxin Methodology"

From Visual Analytics and Applications

Latest revision as of 11:56, 8 July 2018

VAST Mini Challenge 1: "Cheep" Shots?


Tools

a. R: used for data cleaning.

Packages: tidyverse

b. Tableau: used for map and pattern visualization.

c. Python: used for density visualization, audio visualization and audio classification.

Packages: os, glob, pandas, numpy, matplotlib, seaborn, librosa, sklearn


Process for Data Preparation

The following five steps clean and reshape the data for visualization and analysis.

  • Step 1: Deal with Missing Values. Replace all placeholders that stand for missing values, such as "?", "??:??" in Time and "No score" in Quality, with NA.
  • Step 2: Fix Data Quality Issues. Convert all letters to uppercase for consistency, and remove extra spaces and stray "?" characters.
  • Step 3: Unify the Date & Time Format. Convert every Date to the "%Y-%m-%d" format. If the raw value has no month or day, impute "-01-" (January) and "-01" (the first day). Convert every Time to the "HH:mm" format on a 24-hour clock. If the raw value has no minutes, set them to "00". If the raw value contains words such as "morning" or "dawning", impute them as "08:00" or "18:00".
  • Step 4: Modify Data Types. Change the X and Y coordinates from character to integer.
  • Step 5: Create Season and Timeslot variables based on Date and Time. For example, set March to May as "Spring", and set 06:00 to 12:00 as "Morning".
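
The five steps above can be sketched in a few lines. The page states that R (tidyverse) was used for the cleaning; the sketch below uses Python with pandas instead, purely as an illustration. The column names, the sample rows, the seasons other than Spring and the timeslot labels other than "Morning" are assumptions, and textual times such as "morning" are not handled here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the column names are assumptions.
raw = pd.DataFrame({
    "Date": ["2015-06-14", "2012", "2016-11"],
    "Time": ["7:30", "??:??", "14"],
    "Quality": [" a ", "No score", "b?"],
    "X": ["120", "45", "?"],
    "Y": ["88", "150", "60"],
})

MISSING = ["?", "??:??", "No score"]

# Only Spring (March to May) comes from the text; the other seasons are assumed.
SEASONS = {12: "Winter", 1: "Winter", 2: "Winter",
           3: "Spring", 4: "Spring", 5: "Spring",
           6: "Summer", 7: "Summer", 8: "Summer",
           9: "Autumn", 10: "Autumn", 11: "Autumn"}


def complete_date(value):
    """Pad 'YYYY' or 'YYYY-MM' with '-01' for the missing month or day."""
    if pd.isna(value):
        return value
    parts = str(value).split("-")
    return "-".join(parts + ["01"] * (3 - len(parts)))


def complete_time(value):
    """Return 'HH:mm' on a 24-hour clock, using '00' when minutes are missing."""
    if pd.isna(value):
        return value
    hour, _, minute = str(value).partition(":")
    return f"{int(hour):02d}:{int(minute or 0):02d}"


def clean(frame):
    # Step 1: whole-cell placeholders become missing values.
    df = frame.replace(MISSING, np.nan)
    # Step 2: uppercase, trim spaces, drop stray "?".
    df["Quality"] = (df["Quality"].str.upper().str.strip()
                     .str.replace("?", "", regex=False))
    # Step 3: unify date and time formats.
    df["Date"] = pd.to_datetime(df["Date"].map(complete_date), format="%Y-%m-%d")
    df["Time"] = df["Time"].map(complete_time)
    # Step 4: coordinates as nullable integers.
    for col in ("X", "Y"):
        df[col] = pd.to_numeric(df[col], errors="coerce").astype("Int64")
    # Step 5: derived variables. Only 06:00 to 12:00 = "Morning" is from the text.
    df["Season"] = df["Date"].dt.month.map(SEASONS)
    hour = df["Time"].str[:2].astype(float)
    df["Timeslot"] = pd.cut(hour, bins=[0, 6, 12, 18, 24], right=False,
                            labels=["Night", "Morning", "Afternoon", "Evening"])
    return df


cleaned = clean(raw)
print(cleaned)
```

Using a nullable integer type for X and Y keeps rows with a missing coordinate in the table instead of forcing the whole column to floating point.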


Pattern Visualization and Analysis

Approach

Description

Geo-spatial Visualization
Assess the effect of the dumping site by examining the geographical distribution of birds

i. Scatter Plot on Map

  • The scatter plot uses the XY coordinates, which indicate the geographical distribution of birds
  • Create a 200x200 map background so that sightings are shown against the actual layout of the wildlife reserve
  • Highlight the dumping site to help analyse its effect on the birds
  • Re-scale the XY coordinates from a 1x1 grid to a 5x5 grid to show density
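
The re-scaling in the last bullet can be sketched as follows. The page says the map was built in Tableau and does not give the exact formula, so snapping each coordinate to the lower-left corner of its 5x5 cell is an assumption, as are the function names and the sample coordinates.

```python
from collections import Counter

import numpy as np


def to_coarse_grid(x, y, cell=5):
    """Snap 1x1 coordinates to the lower-left corner of their cell x cell square."""
    x = np.asarray(x)
    y = np.asarray(y)
    return (x // cell) * cell, (y // cell) * cell


def cell_counts(x, y, cell=5):
    """Count sightings per coarse cell; the count can drive marker size or colour."""
    cx, cy = to_coarse_grid(x, y, cell)
    return Counter(zip(cx.tolist(), cy.tolist()))


# Hypothetical sightings on the 200x200 map.
xs = [0, 3, 4, 5, 199]
ys = [0, 1, 4, 9, 199]
counts = cell_counts(xs, ys)
print(counts)
```

Sightings that were spread over 25 distinct 1x1 positions now stack on one point, so dense areas stand out in a scatter plot.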

ii. Kernel Density Plot

  • Apply a kernel density plot by year (from 2012 to 2017) to the Rose-crested Blue Pipit
  • Add the map background and highlight the dumping site
  • Create a GIF file to show how the dense clusters change over time
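
As a rough illustration of what a kernel density plot computes, the sketch below evaluates a two-dimensional Gaussian kernel density estimate on a grid with NumPy, once per year. The page lists seaborn among the Python packages but does not say which plotting call or bandwidth was used, so the bandwidth, the grid step, the function name and the sample points are all assumptions.

```python
import numpy as np


def gaussian_kde_grid(points, size=200, step=5, bandwidth=10.0):
    """Evaluate a 2-D Gaussian KDE on a regular grid covering [0, size] x [0, size]."""
    pts = np.asarray(points, dtype=float)                 # shape (n, 2)
    axis = np.arange(0, size + 1, step)
    gx, gy = np.meshgrid(axis, axis)                      # density[i, j] is at x=axis[j], y=axis[i]
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)     # shape (m, 2)
    sq_dist = ((grid[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
    kernel = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    density = kernel.sum(axis=1) / (len(pts) * 2.0 * np.pi * bandwidth ** 2)
    return axis, density.reshape(gx.shape)


# Hypothetical sightings for two years; real data would cover 2012 to 2017.
sightings_by_year = {
    2012: [(50, 150), (52, 148), (48, 152)],
    2017: [(120, 60), (118, 62), (122, 58)],
}

densities = {year: gaussian_kde_grid(pts)[1] for year, pts in sightings_by_year.items()}
axis = np.arange(0, 201, 5)
for year, density in densities.items():
    row, col = np.unravel_index(density.argmax(), density.shape)
    print(year, "densest cell at x =", axis[col], "y =", axis[row])
```

Rendering one frame per year from these grids and stitching the frames together gives the kind of animation the last bullet describes.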

Trend Visualization
Understand the growth pattern of birds

Area/Line Graph

  • Remove the 2018 data, since only three months are available and they are not comparable with the other years, which have 12 months of data
  • Create area/line graphs for all species combined and for each bird species from 1983 to 2017
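
The data behind such a graph is a table of yearly counts. The page states that Tableau was used for the chart itself; the pandas sketch below only illustrates the filtering and counting, and the column names, the sample rows and the second species name are hypothetical.

```python
import pandas as pd

# Hypothetical sightings; "Year" and "Species" are assumed column names.
sightings = pd.DataFrame({
    "Year": [2016, 2016, 2017, 2017, 2017, 2018],
    "Species": ["Rose-crested Blue Pipit", "Species B",
                "Rose-crested Blue Pipit", "Rose-crested Blue Pipit",
                "Species B", "Rose-crested Blue Pipit"],
})

# Keep only the years with a full 12 months of data (1983 to 2017).
comparable = sightings[sightings["Year"].between(1983, 2017)]

# One row per year, one column per species: the series of a per-species graph.
per_species = comparable.groupby(["Year", "Species"]).size().unstack(fill_value=0)

# Row sums give the series for the all-species graph.
all_species = per_species.sum(axis=1)

print(per_species)
print(all_species)
```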

Interactive Dashboard
Understand the pattern of all 19 bird species, not only the Rose-crested Blue Pipit

Combine the results of the geo-spatial visualization and the trend visualization

  • The dashboard includes the scatter plot with map background and the area graph
  • User selections: Year, Season, Timeslot (e.g. Morning) and Bird Species


Audio Data Analysis

Approach

Description

Audio Visualization

i. Waveplot

  • As the train set, plot one "Call" sample and one "Song" sample for each of the 19 classes, since the two types of vocalization may show different patterns
  • Only samples with "Quality A" were selected to limit noise
  • Plot the 15 test birds as well and compare them with the train set

ii. Specgram

  • Plot a specgram (spectrogram) for the same train set and test birds to show how the frequency content of each bird song changes over time

Audio Classification

i. Feature Extraction
Five types of features were extracted and combined, giving a total of 193 features for each bird song

  • Mel-frequency cepstral coefficients (MFCC)
  • Chromagram of a short-time Fourier transform
  • Mel-scaled power spectrogram
  • Octave-based spectral contrast
  • Tonnetz

ii. Data Partition and Feature Labels

  • Out of 2081 bird songs, use 70% as training data and 30% as test data
  • Label each feature vector with the name of its bird species

iii. Classification Methods

  • Logistic Regression
  • SVM
  • Random Forest
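
The partition and the three classifiers can be sketched with scikit-learn as below. The page does not give hyperparameters, so the default settings, the feature scaling in front of the logistic regression and the SVM, the stratified split and the fixed random seeds are assumptions. A synthetic feature matrix with four hypothetical species stands in for the 2081 real feature vectors.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 4 hypothetical species, 50 clips each, 193 features.
rng = np.random.default_rng(0)
species = ["species_a", "species_b", "species_c", "species_d"]
n_per_class, n_features = 50, 193
labels = np.repeat(species, n_per_class)
centers = rng.normal(size=(len(species), n_features))
X = np.repeat(centers, n_per_class, axis=0) + rng.normal(size=(len(labels), n_features))

# 70% training data, 30% test data, keeping the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=0, stratify=labels)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training part and report accuracy on the held-out part.
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.2f}")
```

Accuracy on this synthetic data says nothing about performance on the bird songs; the sketch only shows the workflow.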