ISSS608 2016-17 T1 Assign3 Nguyen Tien Duong
01:00 PM "Hallo, KAMPONG detective office here"
01:00 PM "Hi, We are calling from Dinno world, we are chaotic now... need your help urgently please......" - a worrying voice
01:01 PM "Sir, please stay calm. We understand your situation, let us know the case"
01:01 PM "Please help us, we are closing down now, a VIP may be in danger."
...
01:14 PM "A team has been dispatched to your place. We will hunt the criminal down."
A team from KAMPONG detective office was immediately dispatched to Dinno park to conduct interviews, collect data and hunt down the criminal.
Dinno fun park is divided into 5 subzones:
1.Coaster Alley
2.Entry Corridor
3.Kiddie Land
4.Tundra Land
5.Wet Land
In this investigation, the major locations directly related to the crime are Pavilion 32 and Stage 63, where the soccer player was invited to perform.
The park distributed a mobile device/app to all visitors for ticketing, check-in and communication.
The park was running a game via the app that encouraged visitors to answer questions to earn prizes.
The dispatched team at Dinno Park sent back a dataset collected from the park, which contains all visitor tracking for the 3 days on which the soccer player performed, 6-8 Jun 2014.
The team has collected evidence in digital form: "Communication", "Movement" and "Settings" (metadata of the park on current park activities, marketing events, etc.).
COMMUNICATION
4.1M Number of Records
9,429 Number of Source IDs
9,391 Number of Target IDs
MOVEMENT
05 seconds Resolution time tracking
100x100 Resolution of geographical movement
02 Types of movement tracking: Check-in, Movement
SETTINGS
Map of the park
Major activities
This BACKSTAGE TECHNICAL information provides insight into the evidence investigation process. Since the collected evidence is encoded in data, the detective team needs to apply proper data analysis techniques, from overview through filtering, brushing and zooming into detail, to extract insight from the collected data.
- Data Exploration: Dive into the dataset, recognize problems and narrow them down to target issues before any slicing is done.
- Data Preparation: The critical process of making data usable. Slice the data in different ways, then aggregate and stage it in appropriate storage for investigation.
- Trials: A proof-of-concept (POC) process for trying different technologies, approaches and techniques to visualize datasets that may be challenging to understand. This process involves plenty of trial and error.
- Hackathon: Code implementation to build the visualizations and create meaningful information from the provided data. This task involves plenty of code and application usage.
After a very short phone call with the Dinno managers, the team quickly moves on to exploring the content collected by the team deployed to the park. The first assignment is always to get the highest-level overview of the data.
Different approaches are proposed: network analysis using sigma.js, a heat map to check the overall distribution of the data, and a bar chart to get the quickest read on the data statistics. Evaluating options:
- Sigma.js (or any network analysis): Not promising. The communication data is very big, and mass network analysis is inefficient.
- Heatmap: Worth considering. It can give an overall feeling of how the frequency of each communication is distributed. However, it is tough to differentiate colors, and it cannot be sorted by value.
- Bar chart: This is the selected solution. It gives a simple, immediate answer about who communicates the most and how the communication data is distributed. The drawback is that it cannot show the whole picture at once, since the number of visitors could be in the thousands. We can consider using a heatmap to complement the bar chart.
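The counting behind such a bar chart can be sketched as below. This is a minimal illustration only: the `(sender, receiver)` record layout, the sample IDs and the helper names `volume_per_id` and `text_bars` are assumptions made for the example, not the team's actual pipeline or data.

```python
from collections import Counter

# Hypothetical sample records: (sender_id, receiver_id). Not real park data.
records = [
    ("A", "B"),
    ("A", "C"),
    ("A", "external"),
    ("B", "A"),
    ("C", "external"),
]


def volume_per_id(rows):
    """Return (out_counts, in_counts): messages sent and received per ID."""
    out_counts = Counter(sender for sender, _ in rows)
    in_counts = Counter(receiver for _, receiver in rows)
    return out_counts, in_counts


def text_bars(counts, top=3):
    """Render the top-N IDs as a text bar chart, sorted by volume."""
    return [f"{id_:>10} {'#' * n} {n}" for id_, n in counts.most_common(top)]


out_counts, in_counts = volume_per_id(records)

print("OUT")
for line in text_bars(out_counts):
    print(line)
print("IN")
for line in text_bars(in_counts):
    print(line)
```

Sorting by volume is what makes the outliers visible immediately, which is the property the heatmap lacks. An ID that shows up only in the "IN" counts is one that never sends.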
#Found 01
As soon as the graphs are plotted, 2 IDs stand out with extremely high communication volume in both graphs, and 1 additional ID appears only in the "IN" graph, which suggests it is purely a receiver.
Let's tackle the easiest ID first, the one named "External". It is a dummy target for messages sent out of the park, so "External" only receives SMS from people in the park. Its count is therefore an aggregate, combining everyone outside the park to whom people in the park sent SMS.
Now we are left with 2 other IDs: 839736 and 1278894. Both have a high volume of communication that needs further investigation. Who did they communicate with? Did they send plenty of SMSes to small groups of people, or to a huge group of recipients?
#Found 02
To answer those questions, we plotted a rough cut of how many distinct IDs ever communicated with these 2 IDs. The result on the right-hand side confirms that the 2 IDs carried out mass communication with thousands of people.
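Counting distinct counterparts, rather than raw message volume, can be sketched as follows. The record layout, sample IDs and the function name `distinct_counterparts` are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical sample records: (sender_id, receiver_id). Not real park data.
records = [
    ("hub", "v1"),
    ("hub", "v2"),
    ("hub", "v3"),
    ("v1", "hub"),
    ("v1", "v2"),
    ("v1", "v2"),  # repeated message to the same partner counts once
]


def distinct_counterparts(rows):
    """Map each ID to the number of distinct IDs it exchanged messages with."""
    partners = defaultdict(set)
    for sender, receiver in rows:
        partners[sender].add(receiver)
        partners[receiver].add(sender)
    return {id_: len(others) for id_, others in partners.items()}


reach = distinct_counterparts(records)
for id_, n in sorted(reach.items(), key=lambda kv: -kv[1]):
    print(id_, n)
```

An ID with high volume but few counterparts is chatting within a small group; an ID with high volume and a very large number of counterparts is broadcasting.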
"External" behaves as expected: this ID has no records of responding back, since it is a common ID representing everything outside the park.
Next, we want to know in depth what exactly these IDs are. The current chart does not allow a conclusion, so the next step is to observe the behavior of those IDs. The results are presented in the next section.
#Found 03
ID 1278894 keeps "office working hours". Visualizing the messages sent by this ID, we found that sending only happens from 12 noon until 8.55PM on each of the 3 days. This is interesting and leads us to suspect that the ID may be related to the park office. We will zoom in to find out what this ID does during that time window.
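A check of the daily active window for one sender might look like the sketch below. The timestamp format, the sample rows and the name `active_window` are assumptions for illustration; the sample values are made up and only mimic the pattern described above.

```python
from datetime import datetime

# Hypothetical sample records: (timestamp, sender_id). Not real park data.
records = [
    ("2014-06-06 12:00:05", "bot"),
    ("2014-06-06 20:55:00", "bot"),
    ("2014-06-07 12:05:00", "bot"),
    ("2014-06-07 09:30:00", "guest"),
    ("2014-06-08 22:10:00", "guest"),
]


def active_window(rows, sender):
    """Return (earliest, latest) time of day at which `sender` sent a message,
    pooled over all days, or None if the sender never sent anything."""
    times = [
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").time()
        for ts, s in rows
        if s == sender
    ]
    return (min(times), max(times)) if times else None


first, last = active_window(records, "bot")
print("bot active from", first.isoformat(), "to", last.isoformat())
```

Pooling over days is deliberate: a window that stays narrow across all 3 days is what suggests scheduled, office-like operation.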
#Found 04
ID 1278894 shows hourly seasonality in its activity, switching on and off regularly. This suggests that the ID's communication is tied to the operation of an organization. The pattern of alternating on and off each hour stands out from normal human communication patterns.
#Found 05
ID 1278894 has a "machine style". Diving to a finer resolution, we conclude that the ID is operated by a robot, since it keeps sending out messages every 5 minutes. Furthermore, it sends to a huge number of IDs in the park, and the location from which its SMS are sent is fixed in Entry Corridor, where the office is located. Therefore, it should be a POST server used by the park to send regular messages to subscribers in the park. The "In" graph shows visitors sending responses back to the automated server, which explains why there are some late responses, for instance at the 2 points in time highlighted in the chart below.
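One way to test for a "machine style" sender is to look at the gaps between its distinct send times and see whether a single gap dominates. The sketch below does this on made-up timestamps; the names `send_gaps` and `dominant_gap` and both sample series are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def send_gaps(timestamps):
    """Gaps in seconds between consecutive distinct send times."""
    times = sorted(set(timestamps))
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


def dominant_gap(timestamps):
    """Return (most common gap in seconds, share of all gaps), or None."""
    gaps = send_gaps(timestamps)
    if not gaps:
        return None
    gap, count = Counter(gaps).most_common(1)[0]
    return gap, count / len(gaps)


start = datetime(2014, 6, 6, 12, 0, 0)

# Hypothetical broadcaster: every 5 minutes, 3 recipients per broadcast,
# so each send time appears 3 times.
bot_times = [start + timedelta(minutes=5 * i) for i in range(6) for _ in range(3)]

# Hypothetical human sender: irregular offsets in seconds.
human_times = [start + timedelta(seconds=s) for s in (0, 40, 310, 330, 900)]

print("bot:", dominant_gap(bot_times))
print("human:", dominant_gap(human_times))
```

Collapsing identical timestamps first matters because a broadcast to many recipients would otherwise produce a flood of zero-second gaps and hide the schedule.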
Matching this behavior with the park information shown on its official website, we concluded that this is the GAME server, operated by the park to send messages to visitors and receive their responses.
Let's move on to the next ID to find out what was happening with it.
#Found 07
ID 839736 is similar to ID 1278894 in its working-hours pattern. This ID also has active hours, but over a wider range, 8AM to 11.30PM, which matches the official operating hours of the park. We cannot draw a conclusion at this point. However, this ID looks like a "public service" that sends SMS to and receives SMS from a massive crowd. Further investigation is needed. The graph below shows this ID's outgoing response SMS for selected zones, excluding Wetland and Coaster Alley. Higher rates were found on Saturday and Sunday.
#Found 08
ID 839736, for all its similarity with ID 1278894, shows a clearly disturbed pattern on Sunday, with 2 peaks at 12 noon and 2.45PM and erratic fluctuation. This is highly correlated with the timing of the crime. The pattern points to human-related behavior. Our close guess is that this is a Customer Service caller ID, staffed by humans who deal with everyone in the park. They handle plenty of calls when the public finds out about the crime and wants to call to clarify or inquire. Pulling out records FROM and TO this ID, we found a similar pattern in both directions of communication. That fits our assumption that ID 839736 is a customer service.
#Found 09
ID External is an arbitrary ID representing any communication node outside the park. Tracking this ID reveals an interesting fact about when the public found the crime scene. At 11.59AM on Sunday, there is a top peak of communication from guests in the park to the outside world, and the peak is sharp. The data suggest that at about 11.55AM the public found the crime scene, so they called customer service and at the same time took pictures and/or sent SMS to friends to share the "hot news".
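Locating such a peak amounts to bucketing the messages sent to the external ID by minute and taking the busiest bucket. A minimal sketch, assuming a hypothetical `(timestamp, sender, receiver)` layout and made-up sample rows:

```python
from collections import Counter

# Hypothetical sample records: (timestamp, sender_id, receiver_id).
# Values are made up for illustration and are not real park data.
records = [
    ("2014-06-08 11:58:10", "g1", "external"),
    ("2014-06-08 11:59:02", "g2", "external"),
    ("2014-06-08 11:59:30", "g3", "external"),
    ("2014-06-08 11:59:55", "g1", "external"),
    ("2014-06-08 12:00:20", "g4", "external"),
    ("2014-06-08 11:59:40", "g5", "g6"),  # guest-to-guest, ignored below
]


def peak_minute(rows, target):
    """Return (minute, count) for the busiest minute of messages sent to
    `target`, or None if `target` received nothing."""
    # The first 16 characters of the timestamp are "YYYY-MM-DD HH:MM".
    per_minute = Counter(ts[:16] for ts, _, receiver in rows if receiver == target)
    return per_minute.most_common(1)[0] if per_minute else None


print(peak_minute(records, "external"))
```

Minute-level buckets are a choice made here for the example; a coarser bucket would smooth the peak, a finer one would make it noisier.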
We have dived into the 3 outstanding IDs and identified them. However, we are also very interested in the communication pattern of the guests in the park. Therefore, in this part, we filter out all 3 IDs above to study the rest.
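The filtering step can be sketched as follows. The two numeric IDs come from the findings above; the lowercase `"external"` label, the record layout with a zone column, the sample rows and the helper names are assumptions for illustration.

```python
from collections import Counter

# IDs identified earlier; the exact label used for the external ID is an assumption.
EXCLUDED = {"839736", "1278894", "external"}

# Hypothetical sample records: (timestamp, sender_id, receiver_id, zone).
records = [
    ("2014-06-07 11:02:00", "101", "102", "Coaster Alley"),
    ("2014-06-07 11:40:00", "102", "101", "Coaster Alley"),
    ("2014-06-07 11:45:00", "1278894", "101", "Entry Corridor"),
    ("2014-06-07 11:50:00", "103", "839736", "Kiddie Land"),
    ("2014-06-07 12:10:00", "103", "external", "Wet Land"),
    ("2014-06-07 12:15:00", "104", "103", "Wet Land"),
]


def guest_only(rows, excluded=EXCLUDED):
    """Keep only messages where neither end is one of the excluded IDs."""
    return [r for r in rows if r[1] not in excluded and r[2] not in excluded]


def hourly_by_zone(rows):
    """Count messages per (zone, 'YYYY-MM-DD HH') bucket."""
    return Counter((zone, ts[:13]) for ts, _, _, zone in rows)


guests = guest_only(records)
counts = hourly_by_zone(guests)
for (zone, hour), n in sorted(counts.items()):
    print(zone, hour, n)
```

Dropping a message when either end is excluded removes both the broadcasts and the replies to them, so what remains is guest-to-guest traffic only.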
#Found 10
The overall trend shows more communication among the park's guests on Saturday and Sunday. That may be because the volume of visitors is higher during weekends. The data also show a trend familiar from the show business: visitors normally aim for the last show rather than the first show, and that is happening here as well.
This communication pattern shows high peaks appearing regularly in Coaster Alley at the times the performance was held (11AM, 4PM+), but on Sunday the second peak disappeared due to the cancellation of the tour. We also found an interesting peak in Wetland on Sunday, which formed a second, local peak at 12 noon. We will look at them more closely.
And this is a zoom-in on COASTER ALLEY:
#Found 11
Since communication is by nature a network, it has its own characteristics of direction and weight with which each node interacts with the rest of the network. We therefore visualize the communication activity at the 11AM peak on Sunday to analyze the communication of this crowd; that will help us understand in more detail how the peak was made and by whom.
(The SVG (scalable vector graphics) version is available at https://wiki.smu.edu.sg/1617t1ISSS608g1/File%3ANTD_Comm_Network_11AM_Sun.svg and is intentionally not loaded upfront here to avoid slowing down page loading.)
In this analysis, in order to group nodes that have a close relationship (communication) with each other into "communities", we apply modularity with a ratio of 1.0 and color each class differently. The visualization shows 8 major groups.
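To make the modularity idea concrete, the sketch below scores a given partition of a small undirected graph with the standard modularity formula. It does not detect communities and is not the tool used for the graph above; it only shows what the score rewards. It assumes that the "ratio of 1.0" corresponds to the usual resolution parameter, and the toy graph and the name `modularity` are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community, resolution=1.0):
    """Modularity Q of an undirected, unweighted graph for a given partition.

    Q = sum over communities c of  L_c / m  -  resolution * (d_c / (2m))^2
    where m is the number of edges, L_c the number of edges inside c and
    d_c the total degree of the nodes in c.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    degree = defaultdict(int)
    internal = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    total_degree = defaultdict(int)
    for node, k in degree.items():
        total_degree[community[node]] += k
    return sum(
        internal[c] / m - resolution * (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


# Toy graph: two triangles joined by a single edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
single = {n: 0 for n in split}

q_split = modularity(edges, split)
q_single = modularity(edges, single)
print(q_split, q_single)
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which is the sense in which modularity groups nodes that communicate closely with each other.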
Interestingly, if we build a custom application using D3.js to rearrange the nodes, we find even more insightful information using the same data and time filter:
It clearly shows that each community has a "centroid" who sends SMS OUT to the group of people. That could be a tour group whose leader keeps track of everyone in the group.
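Such "centroid" senders can also be picked out numerically by their fan-out, that is, the number of distinct recipients each node sends to. A minimal sketch on a made-up directed edge list; the names `fan_out` and `hubs` and the threshold of 3 recipients are assumptions for the example.

```python
from collections import defaultdict

# Hypothetical directed edges (sender, receiver) within one time window.
edges = [
    ("lead1", "a"), ("lead1", "b"), ("lead1", "c"),
    ("a", "lead1"),
    ("lead2", "x"), ("lead2", "y"), ("lead2", "z"),
    ("c", "x"),  # a link between the two groups
]


def fan_out(edge_list):
    """Map each sender to its number of distinct recipients."""
    recipients = defaultdict(set)
    for sender, receiver in edge_list:
        recipients[sender].add(receiver)
    return {node: len(targets) for node, targets in recipients.items()}


def hubs(edge_list, min_recipients=3):
    """Senders whose fan-out reaches the (assumed) threshold, sorted by name."""
    return sorted(n for n, k in fan_out(edge_list).items() if k >= min_recipients)


print(hubs(edges))
```

A node with high fan-out and few incoming messages fits the picture of a group leader messaging followers; the threshold would need tuning on the real data.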
A further interesting fact: we thought people in the park were strangers, but it turns out they are not. Somehow, each community has communication from one of its nodes to a node belonging to another community. There are some "floating" points which are not linked to anyone; however, most nodes are somehow linked with each other.