ISSS608 2017-18 T3 Assign Joel Choo Peng Yeow Are You Guilty

"You See, But You Do Not Observe" - Sherlock
Looking Deeper Into The Network Of Connected Individuals

Do You Plead Guilty?

1. Data Preparation

Using the list of suspects provided by the insider, we want to determine whether anyone else appears to be closely related to the group and which employees are making suspicious purchases. The figure below depicts the original network of suspicious individuals provided by the insider. The size of each node and its label indicates the node's in-degree. A larger node has more incoming edges, meaning more communications are directed at that person.

Lindsy Henion, Richard Fox and Jose Ringwalk seem to be the most prominent here, and we will investigate them further.

After filtering all activities that happened in the company, we obtain 1722 employees (nodes) and 1904 activities (edges) as seen below.

2. Understanding Centrality Measures

In a connected graph, closeness centrality (or closeness) of a node is a measure of centrality in a network, calculated as the reciprocal of the sum of the lengths of the shortest paths between the node and all other nodes in the graph. Thus the more central a node is, the closer it is to all other nodes. We will use this metric to identify who is close to the group of suspects.

Betweenness, on the other hand, measures how often a node lies on the shortest paths between other nodes. High betweenness means more information passes through that node, and removing it can disconnect a large part of the graph.

After running the algorithm in Gephi, we obtain the centrality measures. Betweenness is right-skewed: only a few employees have high betweenness. Closeness, on the other hand, is distributed more evenly, so we would expect many employees to be closely connected.
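To make the two measures concrete, here is a minimal sketch in plain Python. It is not Gephi's implementation: it assumes an unweighted, undirected graph, normalises closeness as (n - 1) divided by the sum of distances, and reports unnormalised betweenness using Brandes' algorithm. The five-node graph and its node names are hypothetical.

```python
from collections import deque

def bfs_distances(graph, source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def closeness(graph, node):
    """(reachable nodes - 1) divided by the sum of shortest-path lengths."""
    dist = bfs_distances(graph, node)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0

def betweenness(graph):
    """Brandes' algorithm, unweighted and undirected, unnormalised."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}   # predecessors on shortest paths
        sigma = {v: 0 for v in graph}    # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:                     # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # every undirected pair was counted from both endpoints
    return {v: b / 2 for v, b in score.items()}

# Hypothetical toy network: A - B - C, with D and E both attached to C.
graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D", "E"],
    "D": ["C"],
    "E": ["C"],
}
scores = betweenness(graph)
```

In this toy graph, C sits on the most shortest paths and is also closest to everyone else, which is the kind of node the analysis sizes up when looking for influential employees.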

3. Solving The Crime

3.1 Finding Those Who Are Close To The Group

Sizing each node by closeness, we obtain a network in which many employees appear close to the suspicious group. 896 nodes remain in the filtered graph, and we will use them to examine interactions over time.

Sizing each node by betweenness instead, 4 big players stand out in the company, and they are likely to be very influential people.

3.2 Suspicious Purchases

Training Birds (19 Species)

First, we will plot the amplitude envelope of each sound wave of the bird species, for a quick visualization across the 19 bird species.
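As a rough illustration of what an amplitude envelope is, the sketch below takes the peak absolute amplitude in consecutive windows of a signal. This is only one simple way to compute an envelope and is not necessarily how the plots here were produced; the synthetic fading tone, the sample rate, and the window size are all assumptions chosen for the example.

```python
import math

def amplitude_envelope(samples, window):
    """Peak of |signal| within consecutive non-overlapping windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        max(abs(x) for x in samples[start:start + window])
        for start in range(0, len(samples), window)
    ]

# Synthetic stand-in for a bird call: a 440 Hz tone that fades out linearly,
# sampled at 8000 Hz for 0.1 s (800 samples).
sample_rate = 8000
signal = [
    (1 - n / 800) * math.sin(2 * math.pi * 440 * n / sample_rate)
    for n in range(800)
]

# One envelope value per 100 samples, i.e. 8 values for this signal.
envelope = amplitude_envelope(signal, window=100)
```

Plotting `envelope` against time gives the smooth outline of the waveform, which is what makes quick side-by-side comparison across species possible.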

Testing Birds (15 Birds Identified By Kasios)

Next, we will plot the amplitude envelope of each sound wave for the 15 test birds.

Predicted Results, Based on Visualisation

By visually comparing the amplitude envelope plots of the training and testing data, we predict a species for each of the 15 test birds. The last column shows the predicted species.

2 out of 15 birds are predicted to be Rose Pipits. They are Test Bird 2 and Test Bird 9.

4. Oscillogram Comparison

For confirmation, let us also look at the oscillogram, which gives us not only the envelope but also the waves. An oscillogram is a plot of relative amplitude vs. time in seconds (while a spectrogram is a plot of frequency in kHz vs. time in seconds).

Training Birds

The oscillograms of each of the 19 species of the training birds are as follows:

No.	Species Name	Oscillogram
1.	Bent Beat Riffraff
2.	Blue Collared Zipper
3.	Bombadil
4.	Broad Winged Jojo
5.	Canadian Cootamum
6.	Carries Champagne Pipit
7.	Darkwing Sparrow
8.	Eastern Corn Skeet
9.	Green Tipped Scarlet Pipit
10.	Lesser Birchbeere
11.	Orange Pine Plover
12.	Ordinary Snape
13.	Pinkfinch
14.	Purple Tooting Tout
15.	Qax
16.	Queenscoat
17.	Rose-Crested Blue Pipit
18.	Scrawny Jay
19.	Vermillion Trillian

Testing Birds

The oscillograms of each of the 15 test birds are as follows.

The predicted species is indicated in the last column, after visually comparing the similarity of the amplitude plots. The species predicted from the oscillograms match those predicted from the envelope plots. This is not a surprise, because the envelope is obtained from the oscillogram.

We plot both because the envelope gives a quick comparison, while the oscillogram provides a more in-depth visualisation.

ID	Oscillogram	Predicted Species	Same as earlier predicted by envelope?
1		Eastern Corn Skeet	Yes, though this is quite close to the Rose-Crested Blue Pipit. However, the Pipit produces more ‘chirps’ per 100 sec than the Skeet.
2		Rose-Crested Blue Pipit	Yes.
3		Queenscoat	Yes.
4		Bombadil	Yes.
5		Canadian Cootamum	Yes.
6		Qax	Yes.
7		Canadian Cootamum	Yes.
8		Green-Tipped Scarlet Pipit	Yes.
9		Rose-Crested Blue Pipit	Yes.
10		Qax	Yes.
11		Scrawny Jay	Yes.
12		Qax	Yes.
13		Qax	Yes.
14		Bombadil	Yes.
15		Pinkfinch	Yes.

5. Trellis Plot (Distribution Analysis)

A caveat to the previous analysis is that we did not make use of all the training birds in the visualisation. Rather, we randomly selected 5 birds per species to visualise, and then chose 1 to represent the entire species. Thus, we now make use of all the training birds by plotting the distributions across the parameters.

Narrowed Down 7 Acoustic Parameters

To make the analysis more rigorous, we will use analyzeFolder() to obtain the values of each parameter for each audio file. analyzeFolder() provides an acoustic analysis of all .wav files in a folder.

There are 15 parameters in total, of which 7 are chosen because they show greater distinction between the species. The 7 parameters are: dom_median, HNR_median, meanFreq_median, peakFreq_median, pitch_median, pitchAutocor_median, pitchSpec_median.

Distributions

The trellis plot of the 7 parameters of the training birds is shown below, where the mean is indicated by the black solid line.

Next, we will plot each of the 15 testing birds from Kasios onto this plot, in blue dotted line. We will then select the closest species for each parameter. The species with the most parameters selected will be assigned as the predicted species.
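The selection rule described above can be sketched as a simple vote count. The sketch below is an illustration under stated assumptions, not the procedure used to produce the plots: the species names ("Species A" etc.), the per-species means, and the test values are all hypothetical, and only three parameters are shown. The `k` argument is an added convenience for keeping the top few closest species per parameter, with ties at the cut-off retained.

```python
from collections import Counter

def closest_species(test_value, species_means, k=1):
    """Species whose mean is among the k nearest to test_value (ties kept)."""
    ranked = sorted(species_means.items(),
                    key=lambda item: abs(item[1] - test_value))
    cutoff = abs(ranked[min(k, len(ranked)) - 1][1] - test_value)
    return [name for name, mean in ranked
            if abs(mean - test_value) <= cutoff]

def predict_by_votes(test_bird, training_means, k=1):
    """Count, per species, how many parameters it is closest on."""
    votes = Counter()
    for parameter, value in test_bird.items():
        votes.update(closest_species(value, training_means[parameter], k))
    return votes.most_common()

# Hypothetical per-species means for three of the parameters.
training_means = {
    "pitch_median": {"Species A": 2000.0, "Species B": 3500.0, "Species C": 5000.0},
    "HNR_median":   {"Species A": 10.0,   "Species B": 14.0,   "Species C": 20.0},
    "dom_median":   {"Species A": 1800.0, "Species B": 3000.0, "Species C": 4200.0},
}
# Hypothetical parameter values for one test bird.
test_bird = {"pitch_median": 3400.0, "HNR_median": 19.0, "dom_median": 3100.0}

ranking = predict_by_votes(test_bird, training_means)
```

Here the first entry of `ranking` is the species that wins on the most parameters. Note that every parameter gets one equal vote regardless of how far the test value is from the mean, which is one reason this approach can disagree with a visual comparison of the waveforms.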

Given that Test Bird 2 and Test Bird 9 were predicted to be Rose-Crested Blue Pipits, we will focus on these two birds for visualisation.

Test Bird 2 Against Training Birds

The following shows the trellis plot of Test Bird 2 (in blue dotted line) against the 2081 Training Birds (black solid line).

The ticks in green mark, for each parameter, the top 3 (or more if there is a tie) training species closest to the test bird. The species with the most ticks (i.e. closest to the test bird on the most parameters) is selected as the predicted species.

Based on this, Test Bird 2 is predicted to be a Qax.

Test Bird 9 Against Training Birds

The following shows the trellis plot of Test Bird 9 (in blue dotted line) against the 2081 Training Birds (black solid line).

The ticks in green mark, for each parameter, the top 3 (or more if there is a tie) training species closest to the test bird. The species with the most ticks (i.e. closest to the test bird on the most parameters) is selected as the predicted species.

Based on this, Test Bird 9 seems like a Vermillion Trillian.

Results

From the above plots, counting the parameters on which each species' mean is closest to the test bird, Test Bird 2 seems like a Qax while Test Bird 9 seems like a Vermillion Trillian. Unfortunately, this does not match our earlier predictions from visualising the amplitude plots. We conclude that this method may not be ideal, as it reduces each call to a few summary statistics, while the amplitude plots are more likely to reflect the actual shape of the call (though they are less representative of the entire training population).

As such, we will rely on Method 1 (Envelope Plot) and Method 2 (Oscillogram Plot), and leave Method 3 (Trellis Plot) out of our concluding hypothesis, in which Test Birds 2 and 9 are predicted to be Rose Pipits. But before that, let's also try audio classification to see if it supports our results.

6. Audio Classification

We also attempted classification to predict the bird species, first with a Decision Tree and then with a Random Forest.

After extracting the dataframe of the training data using `analyzeFolder`, we set aside 70% of the 2081 birds as training data, and 30% as validation data. The 15 birds from Kasios form the testing data.
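A 70/30 split of this kind can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the integer recording IDs stand in for rows of the extracted dataframe, and the seed is an arbitrary assumption added for reproducibility.

```python
import random

def train_validation_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of records and cut it into training and validation parts."""
    rng = random.Random(seed)        # local generator, so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 2081 training recordings.
recording_ids = list(range(2081))
train, validation = train_validation_split(recording_ids)
```

A purely random split like this does not guarantee that every species appears in both parts in the same proportion; splitting within each species (stratifying) is a common refinement when some species have few recordings.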

Method 1: Decision Tree

The decision tree produced a high misclassification error rate of 0.574.
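For reference, the misclassification error rate is the fraction of validation recordings whose predicted species differs from the true species. The sketch below shows the calculation on a handful of hypothetical labels, not on the actual validation data.

```python
def misclassification_rate(actual, predicted):
    """Fraction of items whose predicted label differs from the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one labelled item")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Hypothetical labels for four validation recordings.
actual = ["Qax", "Qax", "Bombadil", "Pinkfinch"]
predicted = ["Qax", "Bombadil", "Bombadil", "Qax"]
rate = misclassification_rate(actual, predicted)
```

With 19 species, always guessing one species at random would be wrong far more often than not, so a rate near 0.57 is better than chance but still too high to trust individual predictions.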

Based on the Decision Tree Model, Test Bird 2 was predicted as a Lesser Birchbeere (with a low probability of 0.51) while Test Bird 9 was predicted as a Green Tipped Scarlet Pipit (with a low probability of 0.41).

This is contrary to our earlier predictions. Out of the 15 predictions, only 1 matches: Test Bird 7 (in green below). Given that the misclassification rate is rather high (57%), we should not rely on the classification results from the Decision Tree model. While decision trees are easy to interpret, a single tree tends to be simplistic and often does not produce good accuracy.

Method 2: Random Forest

Instead, we use Random Forest to improve on the performance of a single decision tree. The algorithm builds many trees, each on a bootstrap sample of the training data with a random subset of parameters considered at each split, and combines their votes. We attempt 3 different Random Forest models, tuning the parameters to reduce the misclassification rate. First, we set no. of trees = 500 and mtry = 2 (Model 1). Next, we set mtry = 6 (Model 2) and mtry = 10 (Model 3).

Unfortunately, the lowest misclassification rate is 0.5565 (Models 2 & 3), which is still high and only slightly better than the Decision Tree model (0.574). Moreover, not only did the predicted results not match our visualisation plots, the table below shows that they did not match those of the Decision Tree either.

We will thus not rely on the predicted results from classification. More work on fine-tuning the model needs to be done.

7. Visualisation or Classification?

Visualisation.

In my opinion, classification is not a good method for predicting bird species here. This is because the data used is actually the same as that used in the Trellis Plots (i.e. the 15 parameters representing sound, e.g. amplitude mean). Bird calls across species may have similar amplitude mean, pitch frequency, etc., but be different in nature. We should look at the shape (wave pattern) rather than at the statistical parameters.

Thus, the best method is still visualising the oscillogram and amplitude envelopes.

8. Where Did the 2 Suspected Rose Pipits Come From?

Coordinate Plot

Going back to our earlier findings from the visualisation plots that Birds 2 and 9 are likely to be Rose Pipits, let’s analyse their location coordinates to understand whether these 2 predicted birds are found near the Rose Pipit clusters.

In fact, these two predicted-to-be-Pipit birds (represented in green below) are not found in the two clusters near the dumping site. But they did appear together, which makes sense since birds of the same species tend to fly together, lending credibility to our prediction by visualisation.

Concluding Hypothesis: Pipits NOT being found across the Preserve, based on Kasios Birds

Key Observations

1. Only 2 out of the 15 birds resemble the Rose Pipits.
2. These 2 birds were not found near the dumping site, nor were they found in the 2 clusters identified previously.

Hypothesis: Pipits not found across preserve

Given that only 2 of the 15 birds provided by Kasios are likely to be Pipits, Kasios' claim that the Pipits are thriving across the Preserve is doubtful. The set of bird calls supplied by Kasios does not support the claim that Pipits are found across the Preserve.