ISSS608 2016-17 T1 Assign3 Chen Yi Fan
Overview
DinoFun World is a typical modest-sized amusement park. For this assignment, we were given a set of communication and movement data for visitors to the park. The dataset covers 3 days, from Friday (6 June 2014) to Sunday (8 June 2014), which coincided with a weekend tribute named “Scott Jones Weekend”. The event celebrated the years of stardom in international play of Scott Jones, who came from a town near DinoFun World. Scott was scheduled to appear in two stage shows on each of Friday, Saturday and Sunday to talk about his life and career. In addition, memorabilia related to his illustrious career would be displayed in the park’s Pavilion. However, the event did not go as planned: Scott’s weekend was marred by crime and mayhem perpetrated by a poor, misguided and disgruntled figure from Scott’s past.
We are tasked with detecting the communication patterns among the visitors and park employees. Specifically, the following questions are set:
- Identify those IDs that stand out for their large volumes of communication. For each of these IDs, characterize the communication patterns we see.
- Describe up to 10 communication patterns in the data. Characterize who is communicating, with whom, when and where.
- From this data, can we hypothesize when the vandalism was discovered?
Dataset Preparation
Because the questions all concern the communication patterns among the visitors, we focus on the communication dataset as a start. The 3 days’ communication data were imported into JMP to be prepared for the analysis. No missing data were found in the 3 days’ communication dataset. The following preparation steps were applied:
- Converted the Timestamp column in each of the 3 daily datasets into Day of Week and Time of Day (12-hour format) using the DateTime function in JMP.
- Renamed the from and to columns to Source and Target.
- Added a new column, Type, to the 3 datasets and hardcoded its value to Directed. This prepared the edge table for use in Gephi.
- Used the table summary function in JMP to find the unique IDs in each table. This gives the number of unique IDs and revisiting IDs, as shown below. The unique IDs table was also imported into Gephi as the node file for the later network analysis.
- From the same table summary, we could also find which IDs had the most or the least communication.
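The preparation itself was done in JMP. As a rough cross-check, the same steps can be sketched in plain Python. The timestamp format string, the function names `prepare_edges` and `communication_counts`, and the sample rows are assumptions made for illustration; only the column names Timestamp, from, to, Source, Target and Type come from the steps above.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def prepare_edges(rows, fmt="%Y-%m-%d %H:%M:%S"):
    """Build a Gephi-style directed edge list from raw communication rows."""
    edges = []
    for row in rows:
        # The timestamp layout (fmt) is an assumption, not taken from the dataset.
        ts = datetime.strptime(row["Timestamp"], fmt)
        hour12 = ts.hour % 12 or 12           # 0 and 12 both display as 12
        suffix = "AM" if ts.hour < 12 else "PM"
        edges.append({
            "Source": row["from"],            # renamed from "from"
            "Target": row["to"],              # renamed from "to"
            "Type": "Directed",               # hardcoded edge type for Gephi
            "Day of Week": DAY_NAMES[ts.weekday()],
            "Time of Day": f"{hour12}:{ts.minute:02d} {suffix}",
        })
    return edges

def communication_counts(edges):
    """Count how many messages each ID sent or received."""
    counts = Counter()
    for edge in edges:
        counts[edge["Source"]] += 1
        counts[edge["Target"]] += 1
    return counts

# Made-up rows for illustration only; these are not values from the dataset.
sample_rows = [
    {"Timestamp": "2014-06-06 12:00:05", "from": "A", "to": "B"},
    {"Timestamp": "2014-06-06 16:30:00", "from": "A", "to": "C"},
    {"Timestamp": "2014-06-06 09:15:42", "from": "C", "to": "A"},
]
edges = prepare_edges(sample_rows)
counts = communication_counts(edges)
node_ids = sorted(counts)                     # unique IDs, usable as a node list
print(edges[0])
print(counts.most_common(1))
```

Here `node_ids` plays the role of the unique IDs table and `counts` the role of the table summary used to find the busiest IDs.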
From the table summary results, we could easily see that the number of unique visitors coming to the park increased from Friday to Sunday, as shown in the summary below.
| | Friday | Saturday | Sunday |
|---|---|---|---|
| Number of unique visitors | 2,950 | 5,297 | 6,118 |
Approach
Large Volume of Communication
Based on the above data preparation, a total of 9,429 unique IDs were recorded over the 3 days, of which 1,510 (16%) were repeat IDs. Among all of them, 2 IDs (1278894 and 839736) had the most communication with others.
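The repeat share quoted above is simply 1,510 / 9,429 ≈ 0.16. A minimal sketch of how such a summary could be derived from per-day ID sets follows. It assumes that a repeat ID is one seen on more than one day, which the report does not state explicitly. The function name `visit_summary` and the toy IDs are hypothetical.

```python
from collections import Counter

def visit_summary(ids_by_day):
    """Return (total unique IDs, IDs seen on more than one day, repeat share)."""
    days_seen = Counter()
    for ids in ids_by_day.values():
        days_seen.update(set(ids))            # count each ID once per day
    total = len(days_seen)
    repeat = sum(1 for n in days_seen.values() if n > 1)
    share = repeat / total if total else 0.0
    return total, repeat, share

# Toy IDs for illustration only; these are not values from the dataset.
ids_by_day = {
    "Friday": {"A", "B", "C"},
    "Saturday": {"B", "C", "D"},
    "Sunday": {"C", "E"},
}
total, repeat, share = visit_summary(ids_by_day)
print(total, repeat, share)

# The share reported in the text, recomputed from its two counts.
reported_share = round(1510 / 9429, 2)
print(reported_share)
```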
ID 1278894 showed the same communication pattern on all 3 days. It sent batch messages every 5 minutes from 12pm to 12:55pm, 2pm to 2:55pm, 4pm to 4:55pm, 6pm to 6:55pm and 8pm to 8:55pm. Meanwhile, it also received response messages from those IDs.
Given the regular pattern and the mass of messages sent at the same time, we can hypothesize that ID 1278894 is an automated system that sends messages at a fixed interval to the IDs registered with it. As more visitors came to the park over the 3 days, the number of messages it sent also increased, with Sunday having the most messages delivered. All of the messages were sent from Entry Corridor.
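One way to check for a fixed sending interval like the one described above is to group a sender's messages by minute and look at the gaps between consecutive batches. The sketch below is an illustration in Python, not the JMP procedure used in the analysis. The function name `batch_profile`, the timestamp format and the sample timestamps are assumptions.

```python
from collections import Counter
from datetime import datetime

def batch_profile(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Return (messages per send minute, gaps in minutes between batches)."""
    per_minute = Counter(
        datetime.strptime(t, fmt).replace(second=0) for t in timestamps
    )
    batch_times = sorted(per_minute)
    gaps = [
        int((later - earlier).total_seconds() // 60)
        for earlier, later in zip(batch_times, batch_times[1:])
    ]
    return per_minute, gaps

# Made-up send times for one sender; these are not values from the dataset.
sent = [
    "2014-06-06 12:00:10",
    "2014-06-06 12:00:20",
    "2014-06-06 12:05:01",
    "2014-06-06 12:10:30",
]
per_minute, gaps = batch_profile(sent)
is_regular = len(set(gaps)) == 1              # a single gap size suggests a schedule
print(gaps, is_regular)
```

A sender whose gaps collapse to one value within each sending window is a reasonable candidate for an automated account. Human senders would normally show irregular gaps.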
The messages that 1278894 received also followed a periodic pattern: between a few dozen and fewer than 200 messages arrived within the hour right after it sent out the first batch of messages. It is plausible that the messages sent out were contest questions that park visitors could only answer within an hour. Prizes may have been awarded to those who sent in their answers, to attract visitors to participate in the game.
There were 2 peaks at the same time of day, 4pm on Friday and Saturday, both from Coaster Alley. Since Scott Jones was scheduled to appear in the park twice per day from Friday to Sunday, the contest questions sent at 4pm on these 2 days might be related to his show, and some of the messages that flooded in could be from his fans. However, this did not happen on Sunday. As we know that his show was marred by vandalism, we could infer that the vandalism happened on Sunday, which would explain why we did not see the same pattern on that day.
Throughout the 3 days, Wet Land was the area from which most messages were received. This matches the information provided in the question that memorabilia related to his illustrious career was displayed in the park’s Pavilion, which is located in Wet Land.
On the other hand, ID 839736, also located at Entry Corridor, received no more than 19 messages and sent fewer than 24 messages throughout the whole of Friday and Saturday.
But on Sunday, it suddenly received 1,573 messages at 12pm from Wet Land. By 12:44pm, this had dropped to 24 messages. There were a few small spikes afterwards, and it finally returned to its usual frequency at about 3:49pm. Meanwhile, we could also observe a short spike at Coaster Alley from 2:28pm to 3:02pm.
From the above, we could hypothesize that ID 839736 was a customer service account responsible for helping visitors and responding to their questions.
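A spike such as the Sunday one can also be located programmatically by counting received messages per minute and flagging the minutes that far exceed a baseline. The sketch below uses the median per-minute count as the baseline and a multiplier of 5. Both choices are assumptions for illustration, as are the function name `busy_minutes` and the sample labels.

```python
from collections import Counter
from statistics import median

def busy_minutes(minute_labels, factor=5):
    """Return the minutes whose message count exceeds factor * median count."""
    counts = Counter(minute_labels)
    if not counts:
        return []
    baseline = median(counts.values())
    return sorted(m for m, c in counts.items() if c > factor * baseline)

# One zero-padded "HH:MM" label per received message.
# Made-up counts for illustration only; these are not values from the dataset.
received = (
    ["11:58"] * 1 + ["11:59"] * 2 + ["12:00"] * 30
    + ["12:01"] * 25 + ["12:02"] * 2
)
spike = busy_minutes(received)
print(spike)
```

The first and last flagged minutes give a rough start and end for the spike. Zero-padded 24-hour labels are used so that plain string sorting matches time order.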
Noticeable Communication Patterns
Next, we concentrate on the few spikes observed in the above analysis to understand individual groups’ communication patterns.
- Pattern 1
First, we exported the Friday and Saturday data at 4pm, when the 2 spikes (one on each day) occurred at Coaster Alley. There were 2,147 rows filtered for Friday and 3,707 rows for Saturday.
From the Gephi network chart shown below, we could see that the 2 IDs 1049061 and 619203 sent the most messages on Friday and Saturday.
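The filter-then-rank step behind Pattern 1 can be approximated outside Gephi by keeping only the rows for one location and hour and counting messages per sender. This is a minimal sketch. The `location` and `hour` fields, the function name `top_senders` and the sample rows are assumptions for illustration.

```python
from collections import Counter

def top_senders(rows, location, hour, n=2):
    """Rank senders by messages sent from one location within one hour."""
    sent = Counter(
        row["from"]
        for row in rows
        if row["location"] == location and row["hour"] == hour
    )
    return sent.most_common(n)

def make_rows(sender, location, hour, count):
    """Helper that builds `count` identical made-up rows."""
    return [{"from": sender, "location": location, "hour": hour}] * count

# Made-up rows for illustration only; these are not values from the dataset.
rows = (
    make_rows("A", "Coaster Alley", 16, 3)
    + make_rows("B", "Coaster Alley", 16, 2)
    + make_rows("C", "Coaster Alley", 16, 1)
    + make_rows("C", "Wet Land", 16, 5)       # other location, ignored
    + make_rows("B", "Coaster Alley", 12, 4)  # other hour, ignored
)
ranking = top_senders(rows, "Coaster Alley", 16)
print(ranking)
```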
- Pattern 2
On Sunday, the mass of messages sent from Coaster Alley occurred between 2:28pm and 3pm. There were 2 distinct groups initially. One group sent messages mainly to 839736 as well as 1278894.
After removing the 2 nodes with large in-degree, 8 more groups appeared. Among them, ID 1022772 from the red group sent the most messages.
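The effect of removing the two high in-degree nodes can be illustrated by dropping every edge that touches a hub and then collecting the groups that remain. The sketch below uses connected components as a simple stand-in for grouping. The report does not say how Gephi formed its groups, so this is an assumption, as are the function name `groups_without_hubs` and the toy edges.

```python
from collections import defaultdict

def groups_without_hubs(edges, hubs):
    """Drop edges touching any hub, then return the connected components."""
    adj = defaultdict(set)
    for source, target in edges:
        if source in hubs or target in hubs:
            continue                          # ignore traffic to or from hubs
        adj[source].add(target)               # treat edges as undirected
        adj[target].add(source)
    seen, groups = set(), []
    for start in list(adj):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:                          # depth-first walk of one group
            node = stack.pop()
            component.add(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(component)
    return groups

# Toy edges for illustration only; "hub1" and "hub2" are placeholder IDs.
edges = [
    ("a", "hub1"), ("b", "hub1"), ("a", "b"),
    ("c", "d"), ("e", "hub2"),
]
groups = groups_without_hubs(edges, hubs={"hub1", "hub2"})
print(groups)
```

Nodes that only ever talked to a hub (such as `e` here) disappear entirely, which is why smaller groups become visible once the hubs are removed.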
- Pattern 3
Since the vandalism occurred between 11am and 1pm on Sunday, we should examine the communication pattern during this period further. By importing the data into Gephi, we obtained the 4 groups below.
Betweenness centrality is an indicator of a node’s centrality in a network, based on how often the node lies on the shortest paths between other nodes. A node with high betweenness centrality has a large influence on the transfer of items through the network.
968967 and 445493 both ranked higher than the other nodes, which indicates that these 2 nodes have a large influence on the transfer of items through the network.
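To make the measure concrete, the sketch below computes unnormalized betweenness centrality on an unweighted graph with Brandes' algorithm. It is an illustration of the definition, not Gephi's implementation, and the toy path graph and the function name `betweenness` are assumptions. The adjacency dict must contain every node as a key.

```python
from collections import deque

def betweenness(adj, directed=False):
    """Unnormalized betweenness centrality for an unweighted graph."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []                            # nodes in order of discovery
        preds = {v: [] for v in adj}          # shortest-path predecessors
        sigma = {v: 0 for v in adj}           # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                          # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):             # accumulate dependencies
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    if not directed:                          # each pair was counted twice
        score = {v: value / 2 for v, value in score.items()}
    return score

# Toy undirected path a - b - c - d, with each edge listed in both directions.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = betweenness(path)
print(scores)
```

In this toy path the two middle nodes carry every shortest path that crosses the graph, so they score highest while the end nodes score zero.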
Closeness centrality measures how close a node is to all other nodes in a network. The more central a node is, the closer it is to all other nodes. In the following chart, 2082743 was shown to be closer to the other nodes.
Eccentricity measures the distance between a node and the node that is furthest from it. A high eccentricity means that the furthest node in the network is a long way away, and a low eccentricity means that the furthest node is actually quite close. But we did not observe much difference among the nodes for this measurement.
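Both measures follow directly from shortest-path distances, as the sketch below shows for an unweighted graph. Closeness is taken here as the number of reachable nodes divided by the sum of distances to them, so a higher value means a more central node. Tools differ on this convention, so treat it as an assumption, along with the function names and the toy path graph.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distance (in hops) from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def closeness_and_eccentricity(adj):
    """Map each node to (closeness, eccentricity)."""
    result = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        others = [d for node, d in dist.items() if node != v]
        closeness = len(others) / sum(others) if others else 0.0
        eccentricity = max(others) if others else 0
        result[v] = (closeness, eccentricity)
    return result

# Toy undirected path a - b - c - d, with each edge listed in both directions.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
measures = closeness_and_eccentricity(path)
print(measures)
```

The middle nodes of the path are closer to everything else and have the smaller eccentricity, which matches the verbal definitions above.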
When was the vandalism discovered?
The above analysis shows that the vandalism likely happened on Sunday between 11:59am and 12:44pm.