ISSS608 2017-18 T3 Assign Akanksha Shrirang Yadav dataprep

From Visual Analytics and Applications

VAST Mini Challenge 1: Are The Beloved Pipits Disappearing?

Data Description

The provided dataset files are described below:

Data Details
ALL BIRDS.zip
  • Contains calls & songs of the known birds in the Boonsong Lekagul Wildlife Preserve
  • 2081 MP3 files
  • Each filename combines the species name with an integer that indexes into the metadata file
AllBirdsv4.csv
  • Metadata file for the sound files
  • Fields:
    • File ID: An index to the file name in the ALL BIRDS sound file collection
    • Vocalization_type: Kind of bird sound. Values are call, song, or another specific sound type
    • Quality: Measure of the quality of the bird sound. Values are A, B, C, D, E, No Score
    • Time & Date: When the sound was captured
    • X & Y Coordinates: Values ranging from 0 to 199
Lekagul Roadways 2018 Map
  • 200 X 200-pixel map of the Preserve
  • The coordinates in AllBirdsv4 run from the bottom left (0,0) to the top right (199,199) of the map
Test Birds from Kasios
  • Bird sounds that Kasios claims are Pipits
  • Captured over the past couple of months
Test Bird Location
  • X & Y Coordinates: Location of the test bird sound recordings
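
The bottom-left origin matters when recordings are overlaid on the map image, because raster images are usually indexed from the top-left corner. Below is a minimal Python sketch of the conversion. It assumes the map is read as a 200 x 200 array with row 0 at the top. The function name to_pixel is hypothetical, and the original analysis was done in R rather than Python.

```python
MAP_SIZE = 200  # the Lekagul Roadways map is 200 x 200 pixels


def to_pixel(x, y, size=MAP_SIZE):
    """Convert preserve coordinates (origin bottom-left) to image (row, col).

    Assumption: the image array has its origin at the top-left corner,
    so the vertical axis must be flipped.
    """
    if not (0 <= x < size and 0 <= y < size):
        raise ValueError(f"coordinate ({x}, {y}) is outside the {size} x {size} map")
    row = (size - 1) - y  # flip the vertical axis
    col = x               # the horizontal axis is unchanged
    return row, col


print(to_pixel(0, 0))      # bottom-left corner -> (199, 0)
print(to_pixel(148, 159))  # alleged dumping site -> (40, 148)
```

Plotting libraries that let the y-axis origin be set to the bottom left make this flip unnecessary. It is only needed when indexing the raw pixel array directly.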


Approach In A Nutshell

  1. After initial exploration of the dataset in SAS JMP, a few data inconsistencies and coding issues were discovered. The steps taken to clean the dataset are covered in the ‘Data Cleaning’ section.
  2. Tableau was used for further exploration of the cleaned dataset. Basic visualizations built in Tableau revealed a few trends & patterns across the bird species.
  3. Geospatial and distribution analyses were performed in R using the bird call collection data & the provided Lekagul Roadways map to uncover trends for all the known bird species across the Preserve. This analysis included exploring how birds move from the ‘Alleged Dumping Location of Kasios’ (X=148 & Y=159 on the map).
  4. Next, audio analysis was performed in R on the 2081 provided sound files to extract audio features. Relevant features were selected from the extracted set and used to build visualizations of the bird species according to their sound waves. The test bird sound files were compared against these visualizations in an attempt to uncover the truth behind Kasios’ claim about the Rose-crested Blue Pipit species.
  5. Based on this analysis, insights were drawn concerning the state of the Rose-crested Blue Pipit.
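
One simple way to quantify movement relative to the dumping location in step 3 is the straight-line distance from each recording to (148, 159), averaged per year. The Python sketch below illustrates that idea only. It is not the geospatial method used in the R analysis, the sample recordings are hypothetical, and the helper names are made up for this example.

```python
import math

DUMP_SITE = (148, 159)  # alleged Kasios dumping location given in the text


def distance_to_dump(x, y, site=DUMP_SITE):
    """Euclidean distance, in map units, from a recording to the dumping site."""
    return math.hypot(x - site[0], y - site[1])


def mean_distance_by_year(records):
    """Average distance to the dumping site for each year of recordings."""
    totals = {}
    for rec in records:
        dist = distance_to_dump(rec["x"], rec["y"])
        running_sum, count = totals.get(rec["year"], (0.0, 0))
        totals[rec["year"]] = (running_sum + dist, count + 1)
    return {year: s / n for year, (s, n) in sorted(totals.items())}


# Hypothetical recordings, for illustration only (not taken from AllBirdsv4.csv)
recordings = [
    {"year": 2015, "x": 145, "y": 155},
    {"year": 2015, "x": 148, "y": 159},
    {"year": 2017, "x": 100, "y": 95},
]

print(mean_distance_by_year(recordings))
```

A rising yearly average for a species would be consistent with recordings shifting away from the site, although sampling effort per year would also need to be considered.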


Tools Used

  • Primary Tool Used for Creating Visualizations & Performing Analysis -> R
  • Initial Exploration & Data Cleaning -> SAS JMP
  • Basic Plot Creation -> Tableau

R Packages Used:

  • ggplot2
  • sp
  • dplyr
  • tidyverse
  • spatstat
  • readbitmap
  • raster
  • grid
  • imager
  • pixmap
  • stringr
  • soundgen
  • seewave
  • tuneR
  • corrplot
  • ClustOfVar
  • cluster


Data Cleaning

  • The ‘AllBirdsv4.csv’ file contains 8 variables, as explained above. The columns ‘Vocalization_type’, ‘Date’ & ‘Time’ were found to have inconsistent coding. Additionally, the ‘Date’ field was typed as ‘Nominal’ & contained invalid values such as ‘0000-00-00’ or ‘2012-12-00’.


T3 Assign DataCleaning1.png


  • The modeling type of the ‘Date’ field was changed from ‘Nominal’ to ‘Continuous’, and the date format was made consistent as ‘mm/dd/yyyy’. Dates missing the day value were imputed with the 1st day of the month, since the recordings are only visualized at the granularity of year, quarter or month. The ‘Time’ column was also cleaned to use ‘:’ instead of ‘.’, and ‘?’ was replaced with ‘NA’.
  • The values in ‘Vocalization_type’ were recoded to lowercase & ‘?’ was replaced with ‘NA’.
  • The modeling type of the coordinate column ‘Y’ was changed from ‘Nominal’ to ‘Continuous’. Two values in column ‘Y’ contained ‘?’ and were removed during cleaning.
  • Finally, ‘Year’, ‘Quarter’ & ‘Month’ were extracted from the ‘Date’ column and added to the dataset. An additional column called ‘Is_Dumping_Location?’ was calculated, indicating whether the location where the call was recorded is the alleged dumping site.
  • Cleaned and final dataset ->


T3 Assign DataCleaning2.png
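
The cleaning itself was done in SAS JMP. For readers who prefer a scripted version, the rules described above can be sketched in Python as follows. The sketch assumes raw dates arrive as ‘yyyy-mm-dd’ strings (as the invalid examples suggest), that a ‘?’ makes up the whole cell value, and that ‘Is_Dumping_Location?’ is an exact match on (148, 159). Python's None stands in for ‘NA’, and all function names are hypothetical.

```python
import re
from datetime import date


def clean_date(raw):
    """Parse 'yyyy-mm-dd'; impute day 1 when the day is 00; None if unusable."""
    match = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", raw.strip())
    if not match:
        return None
    year, month, day = (int(part) for part in match.groups())
    if year == 0 or month == 0:
        return None          # e.g. '0000-00-00' cannot be recovered
    if day == 0:
        day = 1              # e.g. '2012-12-00' becomes 1 December 2012
    try:
        return date(year, month, day)
    except ValueError:       # impossible dates such as month 13
        return None


def clean_time(raw):
    """Use ':' as the separator; treat '?' or empty values as missing."""
    raw = raw.strip()
    return None if raw in ("", "?") else raw.replace(".", ":")


def clean_vocalization(raw):
    """Lowercase the vocalization type; treat '?' or empty values as missing."""
    raw = raw.strip()
    return None if raw in ("", "?") else raw.lower()


def derive_fields(d, x, y, dump_site=(148, 159)):
    """Build the derived columns described in the text from a cleaned date."""
    return {
        "Date": d.strftime("%m/%d/%Y"),
        "Year": d.year,
        "Quarter": (d.month - 1) // 3 + 1,
        "Month": d.month,
        "Is_Dumping_Location?": (x, y) == dump_site,  # assumed exact match
    }


print(derive_fields(clean_date("2012-12-00"), 148, 159))
```

Rows whose date cannot be parsed return None from clean_date and need to be handled before calling derive_fields.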


A Look At The Distributions

  • The distributions show that there are 19 species of birds. The species of interest, the Rose-crested Blue Pipit, has 186 recordings over the data collection period.
  • There are more recordings of bird calls than of bird songs.


T3 Assign EDA Dist1.png
T3 Assign EDA Dist2.png
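
The distributions above were produced interactively, but the same tallies can be sketched in a few lines of Python. The rows below are hypothetical placeholders rather than the real data, and the ‘species’ key is an assumed column name. ‘Vocalization_type’ follows the metadata description.

```python
from collections import Counter

# Hypothetical cleaned rows, for illustration only
rows = [
    {"species": "Rose-crested Blue Pipit", "Vocalization_type": "call"},
    {"species": "Rose-crested Blue Pipit", "Vocalization_type": "song"},
    {"species": "Rose-crested Blue Pipit", "Vocalization_type": "call"},
    {"species": "Species A", "Vocalization_type": "call"},
    {"species": "Species A", "Vocalization_type": None},  # missing (NA)
]

# Recordings per species, and the number of distinct species
species_counts = Counter(row["species"] for row in rows)
n_species = len(species_counts)

# Recordings per vocalization type, skipping missing values
vocalization_counts = Counter(
    row["Vocalization_type"] for row in rows if row["Vocalization_type"] is not None
)

print(n_species)
print(species_counts.most_common())
print(vocalization_counts.most_common())
```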