Difference between revisions of "ISSS608 2017-18 T3 Assign Li Hongxin Methodology"
(7 intermediate revisions by the same user not shown) | |||
Line 15: | Line 15: | ||
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#B3D9D9; text-align:center;" width="25%" | | | style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#B3D9D9; text-align:center;" width="25%" | | ||
; | ; | ||
− | [[ISSS608_2017- | + | [[ISSS608_2017-18_T3_Assign_Li_Hongxin_Answers| <font color="#FFFFFF">Answers</font>]] |
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#B3D9D9; text-align:center;" width="25%" | | | style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#B3D9D9; text-align:center;" width="25%" | | ||
Line 23: | Line 23: | ||
| | | | ||
|} | |} | ||
+ | |||
+ | <br> | ||
==Tools== | ==Tools== | ||
Line 35: | Line 37: | ||
<i>Packages: os, glob, pandas, numpy, matplotlib, seaborn, librosa, sklearn</i> | <i>Packages: os, glob, pandas, numpy, matplotlib, seaborn, librosa, sklearn</i> | ||
+ | <br> | ||
==Process for Data Preparation== | ==Process for Data Preparation== | ||
The following are 5 key steps for data cleaning and data manipulation ahead of visualization and analysis. | The following are 5 key steps for data cleaning and data manipulation ahead of visualization and analysis. | ||
− | + | * <b>Step 1: Deal with Missing Values. </b>Replace all placeholders that stand for missing values, such as "?" and "??:??" in Time and "No score" in Quality, with NA. |
− | + | * <b>Step 2: Fix Data Quality Issues. </b>Convert all letters to uppercase for convenience, and remove extra spaces and stray "?" characters. |
− | + | * <b>Step 3: Unify the Date & Time Format. </b>Convert every Date to "%Y-%m-%d" format. If the raw data lacks the month or day, impute "-01-" (January) and "-01" (the first day). Convert every Time to "HH:mm" format on a 24-hour clock. If the raw data lacks minutes, set them to "00". If the raw data contains words such as "morning" or "dawning", impute them as "08:00" or "18:00". |
− | + | * <b>Step 4: Modify Data Types. </b>Change the X and Y coordinates from character to integer. |
− | + | * <b>Step 5: Create Season and Timeslot variables based on Date and Time. </b>For example, set March to May as "Spring", and set 06:00 to 12:00 as "Morning". |
+ | <br> | ||
==Pattern Visualization and Analysis== | ==Pattern Visualization and Analysis== | ||
<div style="margin:0px; padding: 2px; background:#B0E0E6; font-family: Arial; border-radius: 1px; text-align:left"> | <div style="margin:0px; padding: 2px; background:#B0E0E6; font-family: Arial; border-radius: 1px; text-align:left"> | ||
Line 66: | Line 70: | ||
* Create a 200x200 map background to integrate the real geography with the wildlife reserve | * Create a 200x200 map background to integrate the real geography with the wildlife reserve | ||
* Highlight the dumping site to help analyse its effect on the birds. | * Highlight the dumping site to help analyse its effect on the birds. | ||
+ | * Re-scale the XY coordinates from a 1x1 grid to a 5x5 grid to show density | ||
<u> ii. Kernel Density Plot</u> | <u> ii. Kernel Density Plot</u> | ||
* Apply a kernel density plot by year (from 2012 to 2017) to the Rose-crested Blue Pipit | * Apply a kernel density plot by year (from 2012 to 2017) to the Rose-crested Blue Pipit | ||
Line 89: | Line 94: | ||
</div> | </div> | ||
+ | <br> | ||
==Audio Data Analysis== | ==Audio Data Analysis== | ||
<div style="margin:0px; padding: 2px; background:#B0E0E6; font-family: Arial; border-radius: 1px; text-align:left"> | <div style="margin:0px; padding: 2px; background:#B0E0E6; font-family: Arial; border-radius: 1px; text-align:left"> | ||
Line 112: | Line 118: | ||
|| | || | ||
<u> i. Feature Extraction</u> | <u> i. Feature Extraction</u> | ||
− | * | + | <br>Five types of features were selected; combined, they give a total of 193 features for each bird song |
− | <u> ii. Classification Methods</u> | + | * Mel-frequency cepstral coefficients (MFCC) |
+ | * Chromagram of a short-time Fourier transform | ||
+ | * Mel-scaled power spectrogram | ||
+ | * Octave-based spectral contrast | ||
+ | * Tonnetz | ||
+ | <u> ii. Data Partition and Feature Labels</u> | ||
+ | * Out of 2081 bird songs, set 70% as training data and 30% as test data | ||
+ | * Label each feature vector with the name of its bird species | ||
+ | <u> iii. Classification Methods</u> | ||
* Logistic Regression | * Logistic Regression | ||
* SVM | * SVM |
Latest revision as of 11:56, 8 July 2018
Tools
a. R: used for data cleaning.
Packages: tidyverse
b. Tableau: used for Map & Pattern visualization.
c. Python: used for density visualization, audio visualization and audio classification.
Packages: os, glob, pandas, numpy, matplotlib, seaborn, librosa, sklearn
Process for Data Preparation
The following are 5 key steps for data cleaning and data manipulation ahead of visualization and analysis.
- Step 1: Deal with Missing Values. Replace all placeholders that stand for missing values, such as "?" and "??:??" in Time and "No score" in Quality, with NA.
- Step 2: Fix Data Quality Issues. Convert all letters to uppercase for convenience, and remove extra spaces and stray "?" characters.
- Step 3: Unify the Date & Time Format. Convert every Date to "%Y-%m-%d" format. If the raw data lacks the month or day, impute "-01-" (January) and "-01" (the first day). Convert every Time to "HH:mm" format on a 24-hour clock. If the raw data lacks minutes, set them to "00". If the raw data contains words such as "morning" or "dawning", impute them as "08:00" or "18:00".
- Step 4: Modify Data Types. Change the X and Y coordinates from character to integer.
- Step 5: Create Season and Timeslot variables based on Date and Time. For example, set March to May as "Spring", and set 06:00 to 12:00 as "Morning".
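The cleaning itself was done in R with tidyverse. As an illustration only, the rules in Steps 1, 3 and 5 can be sketched in plain Python. All function names here are hypothetical. The raw date layout (year first, any separator), the AM/PM handling, the word-to-time pairing, and every season or timeslot boundary other than "March to May is Spring" and "06:00 to 12:00 is Morning" are assumptions, not taken from the original data.

```python
import re
from datetime import date

# Placeholders treated as missing (Step 1), compared after upper-casing (Step 2).
MISSING = {"", "?", "??:??", "NO SCORE"}
# Assumed pairing of a word with an imputed time; other words would be added here.
WORD_TIMES = {"MORNING": "08:00"}


def clean_date(raw):
    """Return 'YYYY-MM-DD', imputing January and day 1 when absent (Step 3)."""
    text = (raw or "").strip().upper()
    if text in MISSING:
        return None
    parts = re.findall(r"\d+", text)
    if not parts or len(parts[0]) != 4:  # assumes the year comes first
        return None
    year = int(parts[0])
    month = int(parts[1]) if len(parts) > 1 else 1
    day = int(parts[2]) if len(parts) > 2 else 1
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return None


def clean_time(raw):
    """Return 'HH:mm' on a 24-hour clock, with missing minutes set to 00 (Step 3)."""
    text = (raw or "").strip().upper()
    if text in MISSING:
        return None
    if text in WORD_TIMES:
        return WORD_TIMES[text]
    match = re.fullmatch(r"(\d{1,2})(?::(\d{2}))?\s*(AM|PM)?", text)
    if not match:
        return None
    hour, minute = int(match.group(1)), int(match.group(2) or 0)
    if match.group(3) == "PM" and hour < 12:
        hour += 12
    if match.group(3) == "AM" and hour == 12:
        hour = 0
    if hour > 23 or minute > 59:
        return None
    return f"{hour:02d}:{minute:02d}"


def season_of(iso_date):
    """Map a cleaned date to a season (Step 5); only Spring is from the text."""
    month = int(iso_date[5:7])
    if 3 <= month <= 5:
        return "Spring"
    if 6 <= month <= 8:
        return "Summer"   # assumed
    if 9 <= month <= 11:
        return "Autumn"   # assumed
    return "Winter"       # assumed


def timeslot_of(hhmm):
    """Map a cleaned time to a timeslot (Step 5); only Morning is from the text."""
    hour = int(hhmm[:2])
    if 6 <= hour < 12:
        return "Morning"
    if 12 <= hour < 18:
        return "Afternoon"  # assumed
    if 18 <= hour < 24:
        return "Evening"    # assumed
    return "Night"          # assumed
```

For example, `clean_date("2015")` gives a date in January, which `season_of` then places in the assumed "Winter" bucket, so imputed dates should be interpreted with care when reading seasonal patterns.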
Pattern Visualization and Analysis
Approach |
Description |
Geo-spatial Visualization
|
i. Scatter Plot on Map
ii. Kernel Density Plot
|
Trend Visualization
|
Area/Line Graph
|
Interactive Dashboard
|
Combine the result of Geo-spatial visualization and trend visualization
|
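The scatter plot step re-scales the XY coordinates from a 1x1 grid to 5x5 cells to show density. A minimal sketch of that binning follows, assuming integer coordinates that start at 0 and labelling each cell by its lower-left corner; the function name `bin_to_grid` is hypothetical, and the original work did this step in Tableau rather than in code.

```python
from collections import Counter


def bin_to_grid(points, cell=5):
    """Count sightings per cell-by-cell block of the map.

    Each (x, y) is snapped down to the lower-left corner of its block,
    so with cell=5 the point (12, 7) falls in the block labelled (10, 5).
    """
    counts = Counter()
    for x, y in points:
        counts[(x // cell * cell, y // cell * cell)] += 1
    return counts
```

The resulting counts can be used as the size or colour of one mark per cell, which makes dense areas visible where individual 1x1 points would overlap.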
Audio Data Analysis
Approach |
Description |
Audio Visualization |
i. Waveplot
ii. Specgram
|
Audio Classification |
i. Feature Extraction
ii. Data Partition and Feature Labels
iii. Classification Methods
|
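The audio classification steps combine five feature types into 193 features per bird song and split the 2081 songs 70/30. The sketch below shows one breakdown that adds up to 193 and one way to make the split. The per-type sizes are an assumption (40 MFCCs plus 12 chroma bins, 128 mel bands, 7 spectral contrast bands and 6 tonnetz dimensions); only the total comes from the text. The shuffling, the seed, and the function name `split_70_30` are likewise hypothetical.

```python
import random

# Assumed size of each feature group; only the total of 193 is from the text.
FEATURE_SIZES = {
    "mfcc": 40,
    "chroma_stft": 12,
    "mel_spectrogram": 128,
    "spectral_contrast": 7,
    "tonnetz": 6,
}


def split_70_30(n_items, seed=0):
    """Shuffle item indices and return (train, test) lists in a 70/30 ratio."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    cut = round(n_items * 0.7)
    return indices[:cut], indices[cut:]


train_idx, test_idx = split_70_30(2081)
total_features = sum(FEATURE_SIZES.values())
```

Splitting on indices rather than on the feature matrix keeps each song's 193-value vector and its species label together when both are selected with the same index list.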