ISSS608 2017-18 T3 Assign NEVIL BRUNO Data Prep
Data
For our analysis, the following datasets are at our disposal:
Tools Used
For the data prep, analysis, and visualizations, the following tools will be used:
1. SAS JMP Pro 13
2. Microsoft Excel
3. Tableau Desktop: Professional Edition
4. RStudio
- plotly
- tuneR
5. Microsoft Paint
6. Photo to GIF: GIF Maker
Data Prep
Initial exploratory data analysis shows no issues with the MP3 files or the map. In the CSV files, the vocalization type, date, and time fields are inconsistently formatted. We use SAS JMP to recode these values: vocalization types are mapped to a standard nomenclature, dates are converted to the dd/mm/yyyy format, and times to the 24-hour format.
There are no format issues with the File IDs, bird names, quality scores, or X and Y coordinates in either file. Records with missing values are excluded from the analysis.
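The recoding itself was done in SAS JMP, but the same rules can be sketched in Python for illustration. This is a minimal sketch under stated assumptions: the input date and time layouts, the target labels "Song" and "Call", and the column names `date`, `time`, and `vocalization` are all hypothetical, since the raw formats in the CSV files are not listed here.

```python
from datetime import datetime

# Assumed input layouts; the real CSV files may use others.
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d")
TIME_FORMATS = ("%H:%M", "%I:%M %p", "%I:%M%p")


def _parse(value, formats):
    """Return the first successful parse of value, or None if none match."""
    text = (value or "").strip()
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def normalize_date(value):
    """Recode a date string to dd/mm/yyyy, or None if it cannot be parsed."""
    parsed = _parse(value, DATE_FORMATS)
    return parsed.strftime("%d/%m/%Y") if parsed else None


def normalize_time(value):
    """Recode a time string to 24-hour HH:MM, or None if it cannot be parsed."""
    parsed = _parse(value, TIME_FORMATS)
    return parsed.strftime("%H:%M") if parsed else None


def normalize_vocalization(value):
    """Map spelling and case variants to one label (the labels are assumptions)."""
    text = (value or "").strip().lower()
    if not text:
        return None
    if text.startswith("song"):
        return "Song"
    if text.startswith("call"):
        return "Call"
    return text.title()


def clean_record(record):
    """Return a recoded copy of record, or None if a recoded field is missing."""
    cleaned = dict(record)
    cleaned["date"] = normalize_date(record.get("date"))
    cleaned["time"] = normalize_time(record.get("time"))
    cleaned["vocalization"] = normalize_vocalization(record.get("vocalization"))
    if None in (cleaned["date"], cleaned["time"], cleaned["vocalization"]):
        return None  # excluded from the analysis
    return cleaned
```

Returning `None` for unparseable or empty values mirrors the decision to exclude records with missing values rather than impute them.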
Preliminary EDA gives us the following insights:
1. There are 19 unique bird species. Of the 2081 records, 186 are of the Rose-Crested Blue Pipit.
2. The Vocalization type is mainly ‘Songs’ and ‘Calls’
3. There are 5 levels for audio quality: A, B, C, D, E. In the next section, we will analyse the audio samples to determine what each grade signifies.
4. There are very few recordings pre-2007. There is also a lack of data for 2018.
5. Most of the recordings took place during the morning (06:00 to 12:00)
6. A significant number of the pipit recordings were made after 2009.
7. At monthly granularity, the number of recordings in most months is very low. May has, on average, more recordings than the other months, especially between 2011 and 2017. Apart from this, many months have no recordings at all.
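As an illustration of how the time-of-day observation could be checked, the following sketch computes the share of recordings that fall in the morning window. It assumes times have already been recoded to 24-hour HH:MM strings, and that 12:00 itself falls outside the window; neither boundary rule is specified in the data.

```python
from datetime import datetime, time

MORNING_START = time(6, 0)
MORNING_END = time(12, 0)  # treated as exclusive (an assumption)


def is_morning(text):
    """True if a 24-hour HH:MM string falls within 06:00 to 12:00."""
    recorded = datetime.strptime(text, "%H:%M").time()
    return MORNING_START <= recorded < MORNING_END


def morning_share(times):
    """Fraction of recordings made in the morning window."""
    if not times:
        return 0.0
    return sum(is_morning(t) for t in times) / len(times)
```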
Map Prep
The map provided is a 200 X 200 .bmp image of the Preserve. To analyse patterns and anomalies across the Preserve, the map can be imported into Tableau for use in our analysis.
We are also provided with the coordinates of the dumping ground (148,159). To analyse the dumping site together with its immediate surroundings, we need an area around this point. A square centered on the dumping ground, extending 20 units in each direction (40 X 40 overall), can be created in Tableau by plotting its corner coordinates and shading the enclosed area. The coordinates of the square's corners are as follows:
- TOP LEFT: (128,179)
- TOP RIGHT: (168,179)
- BOTTOM LEFT: (128,139)
- BOTTOM RIGHT: (168,139)
This area can be used to determine whether the dumping site affects the bird scatter across the Preserve.
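The corner coordinates listed above can be reproduced by offsetting 20 units from the dumping ground in each direction. The sketch below does this and adds a simple containment test for a recording's coordinates. The function names are illustrative, and treating points on the boundary as inside the zone is an assumption.

```python
DUMP_SITE = (148, 159)
HALF_SIDE = 20  # distance from the centre to each edge, as implied by the corners


def square_corners(centre, half_side):
    """Corner coordinates of an axis-aligned square around centre."""
    x, y = centre
    return {
        "top_left": (x - half_side, y + half_side),
        "top_right": (x + half_side, y + half_side),
        "bottom_left": (x - half_side, y - half_side),
        "bottom_right": (x + half_side, y - half_side),
    }


def in_square(point, centre=DUMP_SITE, half_side=HALF_SIDE):
    """True if point lies inside the square or on its boundary."""
    return (
        abs(point[0] - centre[0]) <= half_side
        and abs(point[1] - centre[1]) <= half_side
    )
```

For example, `in_square((150, 170))` flags a recording near the dumping ground, which could be used to split recordings into those inside and outside the zone.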
Final Decisions & Assumptions before Analysis
1. The low numbers in 2018 are not a sign of decline; they reflect the lack of data recorded during this period.
2. Because there are so few recordings pre-2007, they are excluded from the analysis. The analysis period is 2007-2017 inclusive.
3. Due to irregularities in the monthly data, the analysis is kept at a yearly level.
4. The square area marked on the map (40 X 40, extending 20 units in each direction from the dumping ground) is treated as the zone that is affected by the dumping at point (148,159).
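The scoping decisions on time period and granularity can be sketched as a single filter-and-aggregate step. This is an illustrative sketch only: it assumes dates are already in dd/mm/yyyy format, and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime

FIRST_YEAR, LAST_YEAR = 2007, 2017  # analysis period, inclusive


def yearly_counts(dates, first_year=FIRST_YEAR, last_year=LAST_YEAR):
    """Count recordings per year, dropping dates outside the analysis period."""
    # Start every year at zero so years without recordings remain visible.
    counts = Counter({year: 0 for year in range(first_year, last_year + 1)})
    for text in dates:
        year = datetime.strptime(text, "%d/%m/%Y").year
        if first_year <= year <= last_year:
            counts[year] += 1
    return dict(counts)
```

Seeding each year with zero keeps gaps explicit, so a year with no recordings is not silently dropped from a yearly chart.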