ISSS608 2016-17 T1 Assign3 CHIA Yong Jian

From Visual Analytics and Applications

What Happened?

Based on the case information, this is what can be gathered:

  1. What event was going on? - Scott Jones, an international football star, was to celebrate his years of stardom by appearing in two stage shows each day from Friday to Sunday. His memorabilia were also to be displayed at the park's pavilion.
  2. What was the crime? - Vandalism was discovered.
  3. Who did the crime? - A "poor, misguided and disgruntled figure from Scott’s past". The identity of the person is not yet known.
  4. Where did the crime take place? - Within DinoFun World. Since Scott Jones's weekend included two events, the crime happened either at the venue of his stage shows or at the pavilion where his memorabilia were displayed. Since the crime was vandalism, it is likely that his items on display were vandalised. However, we will look into the communications and movement data to confirm this.


The DinoFun World website provides the following additional details:

  • The memorabilia exhibition is at the Creighton Pavilion (entrance at Wet Land). There is no specific mention of where the stage shows are held, although based on the park map, the park has only one location that is a stage: Grinosaurus Stage (entrance at Coaster Alley).
CHIA YONG JIAN Assign3 ParkMap.png
  • There is also a DinoFun World app that allows visitors to check in at rides and send text messages to friends. It can be installed on a visitor's own phone or used on devices borrowed from the park. In addition, a Cindysaurus trivia game is embedded within the mobile application. It will be useful to examine the communications and movement data to observe patterns.
  • Unfortunately, some information, such as the park's opening and closing times, is not available in the case information or on the DinoFun website.

Data Review and Preparation

SAS JMP Pro 12 was used to review and prepare data.

Communications Data

Three days' worth of data (Friday, Saturday and Sunday) from the fateful weekend was provided, with each day's file containing between 948,739 and 1,655,866 records. Each file has the following columns:

Column | Description | Remarks
Timestamp | In YYYY-MM-DD HH24:MI:SS format | -
from | Unique identifier of the sender | All communications came from a unique identifier; no messages were sent by an "external" party.
to | Unique identifier of the receiver | The receiver can also be "external", which is assumed to mean a party not in the park.
location | Where the message was sent from | All messages came from one of the following areas in the park: Wet Land, Tundra Land, Kiddie Land, Entry Corridor, Coaster Alley.

No missing data was observed in the dataset.

For loading into Gephi later, the following columns will be renamed:

  • "from" renamed to "Source"
  • "to" renamed to "Target"
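
The renaming step can also be scripted outside JMP. The following is a minimal sketch, assuming the communications files are comma-separated with the column names listed above; the helper name `rename_for_gephi` and the sample rows (including the year) are made up for illustration and are not part of the original workflow.

```python
import csv
import io

# Mapping from the original column names to the names Gephi looks for in an edge list.
GEPHI_COLUMNS = {"from": "Source", "to": "Target"}

def rename_for_gephi(rows):
    """Return copies of the rows with "from"/"to" renamed to "Source"/"Target"."""
    return [
        {GEPHI_COLUMNS.get(name, name): value for name, value in row.items()}
        for row in rows
    ]

# Made-up sample in the same column layout as the communications files.
sample_csv = io.StringIO(
    "Timestamp,from,to,location\n"
    "2014-06-06 08:03:19,100,101,Kiddie Land\n"
    "2014-06-06 08:03:47,100,external,Wet Land\n"
)
edges = rename_for_gephi(csv.DictReader(sample_csv))
print(list(edges[0].keys()))
```

All other columns are passed through unchanged, so the timestamp and location remain available as edge attributes.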

Movement Data

Movement data was also provided for the 3 days, with each day's file containing between 6 and 10 million records and the following columns:

Column | Description | Remarks
Timestamp | In YYYY-MM-DD HH24:MI:SS format | -
id | Unique identifier of the person making the movement | -
type | Either movement or check-in | -
X | X-coordinate of the location | Values are between 0 and 100
Y | Y-coordinate of the location | Values are between 0 and 100

The movement data, together with the park map, will be plotted using Tableau to visualise movement of the individuals in the park.

Which IDs have a large amount of communications?

The communications data was explored in Tableau. The results are shown below.

Overall Observations

Using a boxplot, two IDs stand out for the number of communications sent across the three days: 1278894 and 839736. When the data was examined day by day over the three-day period, both IDs still consistently accounted for the highest percentage of the total communications sent. An interesting observation is that on Sunday (June 8), the percentage of communications sent by ID 839736 jumped in comparison to the other days.

CHIA YONG JIAN Assign3 Task1 FromByDay.png



Using a boxplot, the top three receivers of communications were IDs 1278894 and 839736, and "External" parties. When observed by day, the top receivers remained consistent. An interesting observation is that on Sunday (June 8) there was a spike of communications to ID 839736. A drill-down into the data for this ID is needed to understand the cause of the spike.

CHIA YONG JIAN Assign3 Task1 ToByDay.png
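
The per-day percentages discussed above were produced in Tableau. As a rough equivalent, the share of records per sender or receiver can be computed with a simple counter. This is only a sketch: the function name `share_by_id` and the sample rows are hypothetical.

```python
from collections import Counter

def share_by_id(rows, column):
    """Return {id: percentage of all rows} for the given column ("from" or "to")."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {key: 100.0 * n / total for key, n in counts.items()}

# Made-up rows for one day; only the two columns used here are filled in.
day_rows = [
    {"from": "1278894", "to": "100"},
    {"from": "1278894", "to": "101"},
    {"from": "839736", "to": "100"},
    {"from": "100", "to": "839736"},
]
sent_share = share_by_id(day_rows, "from")
top_sender = max(sent_share, key=sent_share.get)
print(top_sender, sent_share[top_sender])
```

Running the same function once per day, and once with `"to"` as the column, gives the figures needed to compare senders and receivers across the three days.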

Next, we drill down into the IDs identified above.

ID 1278894

A heatmap was constructed in Tableau with the hour and second timings displayed for this ID on each date. It shows that, whenever it is active, this ID sends out messages exactly every 5 minutes. Furthermore, the messages are sent from only one location: Entry Corridor. The number of messages also increases towards Sunday. Given the precision of the timestamps, it is hypothesized that this could be the Cindysaurus Trivia Game server sending messages to park visitors throughout the day.

CHIA YONG JIAN Assign3 Task1 1278894.png
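
The "right on the dot every 5 minutes" pattern can be checked numerically as well as visually. The sketch below assumes timestamps in the format described in the data review; the function names and sample timestamps are illustrative only.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"

def on_five_minute_mark(timestamp):
    """True if the timestamp falls exactly on a multiple of 5 minutes."""
    t = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
    return t.minute % 5 == 0 and t.second == 0

def fraction_on_mark(timestamps):
    """Fraction of timestamps that sit exactly on a 5-minute mark."""
    hits = sum(on_five_minute_mark(ts) for ts in timestamps)
    return hits / len(timestamps)

# Made-up timestamps: two on the mark, two off it.
sample = [
    "2014-06-08 12:00:00",
    "2014-06-08 12:05:00",
    "2014-06-08 12:07:30",
    "2014-06-08 12:10:01",
]
print(fraction_on_mark(sample))
```

A fraction close to 1 for a given sender would support the hypothesis of an automated service rather than a human visitor.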

ID 839736

For this ID, the amount of communications remained generally flat on Friday and Saturday, but showed large spikes on Sunday. Based on the timestamps, there was a spike in messages sent to this ID at 12pm, with replies coming at 12.03pm. Then, at about 2.42pm, there was a spike in messages from this ID, with responses from other IDs following around 10 seconds later.

This ID is unlikely to be an individual, since sending so many messages to other park visitors at one time would likely be restricted in the app. The spikes in messages to and from this ID suggest that some event or activity was going on. If that is the case, this ID could be the park's information and assistance line, through which visitors report and receive important information on what is going on in the park. This is supported by the fact that all communications from this ID came from the Entry Corridor, which is the entrance to the park.

CHIA YONG JIAN Assign3 Task1 839736.png
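
Spikes such as the ones described for this ID can be located by counting messages per minute. The following is a minimal sketch, assuming zero-padded timestamps so that the first 16 characters identify the minute; `messages_per_minute` and the sample rows are hypothetical.

```python
from collections import Counter

def messages_per_minute(rows, target_id, column):
    """Count rows per minute where row[column] equals target_id."""
    # "YYYY-MM-DD HH:MM:SS"[:16] keeps the date, hour and minute.
    return Counter(
        row["Timestamp"][:16] for row in rows if row[column] == target_id
    )

# Made-up rows around noon on Sunday.
rows = [
    {"Timestamp": "2014-06-08 12:00:05", "from": "100", "to": "839736"},
    {"Timestamp": "2014-06-08 12:00:40", "from": "101", "to": "839736"},
    {"Timestamp": "2014-06-08 12:01:00", "from": "102", "to": "external"},
    {"Timestamp": "2014-06-08 12:03:10", "from": "839736", "to": "100"},
]
received = messages_per_minute(rows, "839736", "to")
busiest_minute, busiest_count = received.most_common(1)[0]
print(busiest_minute, busiest_count)
```

Calling the function with `"from"` instead of `"to"` gives the outgoing counts, so the delay between an incoming spike and the replies can be read off directly.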

10 Communication Patterns in the Data

Spike of messages sent to external parties

Observation 1: In the chart below, there is a spike of communications to external parties on Sunday close to 12pm. This is almost the same pattern observed in communications sent to ID 839736 at around 12pm. Could this indicate something of interest to the park visitors (such as vandalism) that they could not help but share with their family and friends?

CHIA YONG JIAN Assign3 Task2 Obs external.png

Network graph observations when a spike of messages was sent on Sunday 8 June at Lunchtime

From the previous observations, the period from 11.57am to 12.03pm on Sunday shows a spike of messages sent to external parties and ID 839736. Furthermore, the spike appeared in the Wet Land area of the park. The data for this time period and location was extracted to review the communication patterns during these few crucial minutes. In the network graph below, the Force Atlas layout was used as it gave a clearer picture than other layouts such as Yifan Hu and Fruchterman Reingold. The Modularity statistic was run to detect communities in the data, and the graph was then coloured by community. The out-degree attribute was used to set the size of the nodes. A few observations can be made:

  • Observation 2: The largest community, on the lower left (shaded blue), consists of park visitors who sent messages to ID 839736, which an earlier section suggested is a park information service.
CHIA YONG JIAN Assign3 Task2 Obs outdegree.png
  • Observation 3: Park visitors tend to travel in groups. Furthermore, there appear to be not just group leaders but also sub-group leaders, especially in large groups like the ones below. The sub-group leaders also take on responsibility for communicating with other group members, but to a lesser extent than the overall group leader.
CHIA YONG JIAN Assign3 Task2 Obs outdegree groups.png
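
The data extract and the out-degree used for node sizing in this section were done in JMP and Gephi. A rough stand-alone sketch of the same two steps is given below. It assumes zero-padded timestamps, and it counts distinct receivers per sender, on the assumption that repeated messages between the same pair are merged into one edge. The function names and sample rows are made up.

```python
from collections import defaultdict

def window_edges(rows, start, end, location):
    """Keep (sender, receiver) pairs sent from `location` with start <= Timestamp <= end."""
    # Zero-padded "YYYY-MM-DD HH:MM:SS" strings compare correctly as plain text.
    return [
        (r["from"], r["to"])
        for r in rows
        if start <= r["Timestamp"] <= end and r["location"] == location
    ]

def out_degree(edges):
    """Number of distinct receivers each sender contacted."""
    neighbours = defaultdict(set)
    for sender, receiver in edges:
        neighbours[sender].add(receiver)
    return {sender: len(receivers) for sender, receivers in neighbours.items()}

# Made-up rows; only some fall inside the window and location of interest.
rows = [
    {"Timestamp": "2014-06-08 11:58:00", "from": "100", "to": "839736", "location": "Wet Land"},
    {"Timestamp": "2014-06-08 11:59:00", "from": "100", "to": "101", "location": "Wet Land"},
    {"Timestamp": "2014-06-08 11:59:30", "from": "100", "to": "101", "location": "Wet Land"},
    {"Timestamp": "2014-06-08 12:01:00", "from": "101", "to": "external", "location": "Wet Land"},
    {"Timestamp": "2014-06-08 12:01:00", "from": "102", "to": "100", "location": "Coaster Alley"},
    {"Timestamp": "2014-06-08 12:30:00", "from": "101", "to": "100", "location": "Wet Land"},
]
edges = window_edges(rows, "2014-06-08 11:57:00", "2014-06-08 12:03:59", "Wet Land")
degrees = out_degree(edges)
print(degrees)
```

Community detection (the Modularity statistic) is not reproduced here; it is left to Gephi as in the original analysis.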


Network graph observations when a spike of messages was sent on Sunday 8 June from 2.40pm

In the observations above, we noticed a burst of messages sent to ID 839736 from 2.40pm onwards. A network graph was generated for the period from 2.40pm to 2.45pm, with node size determined by out-degree. To aid discovery, communications to and from ID 1278894 were excluded. In addition, the edges were coloured to show the location each message was sent from.

  • Observation 4: The overview of the network graph shows that most park visitors were somehow connected to what was happening during that period. Only some park visitors at the edge of the network diagram appear to be in their own groups, oblivious to whatever was happening.
CHIA YONG JIAN Assign3 Task2 Obs outdegree2.png
  • Observation 5: Some group leaders sent messages to their members, who in turn sent messages to ID 839736. Perhaps the group members wanted to validate the information that the group leader had sent them. The group leaders appear to have sent their messages predominantly from Wet Land and Coaster Alley, which are close to the Pavilion and the Stage. Messages from ID 839736 were then sent from the Entry Corridor. Examples of such groups are shown in the image below.
CHIA YONG JIAN Assign3 Task2 Obs outdegree2 relay.png

Park visitors who are in the park across 3 days

By intersecting the sources of communications across the 3 days, a list of IDs was extracted (excluding "External", 1278894 and 839736). Each day's data was then filtered against this list by source, and a network graph was plotted for each day. The ForceAtlas 2 layout produced easy-to-understand communities. The degree attribute was used as a simple measure of the number of connections the visitors have with each other. The image below shows the progression of the networks across the three days:

CHIA YONG JIAN Assign3 Task2 Obs outdegree3 all3days overview.png
  • Observation 6: The majority of park visitors are connected to each other in one way or another. Large groups can be seen on Friday. These could be tour groups, or families and friends who came to the park together and took advantage of the 3-day passes that provide great savings. There is also a large cluster of visitors in the middle of the network graph, which appears to consist of smaller groups or individuals who also interacted with the larger groups from time to time. Interestingly, larger groups tend to communicate less with other similarly sized groups than with the smaller groups and individuals, as seen in the number of edges.
  • Observation 7: On Saturday and Sunday, most of the larger groups remained intact, but the communication patterns started to change. The 7 distinct large groups on Friday became 6 on the last 2 days. This could indicate significant communication between one group and the smaller groups or individuals, possibly forming new groups in the process.
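
The intersection and filtering steps described at the start of this section amount to a set intersection over the senders of each day. A minimal sketch follows, assuming each day's records are available as a list of rows; the function names and sample rows are hypothetical.

```python
# IDs left out of the list, as in the text (compared in lower case).
EXCLUDED = {"external", "1278894", "839736"}

def senders(rows):
    """Set of sender IDs appearing in one day's rows."""
    return {row["from"] for row in rows}

def present_all_days(days):
    """Senders that appear on every day, minus the excluded IDs."""
    common = set.intersection(*(senders(rows) for rows in days))
    return {s for s in common if s.lower() not in EXCLUDED}

def filter_by_source(rows, keep):
    """Rows of one day whose sender is in the `keep` set."""
    return [row for row in rows if row["from"] in keep]

# Made-up rows; only the sender column is filled in.
friday = [{"from": "100"}, {"from": "101"}, {"from": "1278894"}]
saturday = [{"from": "100"}, {"from": "102"}, {"from": "1278894"}]
sunday = [{"from": "100"}, {"from": "101"}, {"from": "1278894"}, {"from": "839736"}]

regulars = present_all_days([friday, saturday, sunday])
sunday_regulars = filter_by_source(sunday, regulars)
print(regulars, sunday_regulars)
```

Each filtered day can then be exported as an edge list and loaded into Gephi separately.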

Unusual communications patterns in the morning and evening

On Friday and Sunday mornings, some communications were sent from places within the park before the sending ID had passed through the Entry Corridor. On Friday it was ID 439105, and on Sunday it was ID 1401601. This could mean that these visitors entered the park through some illegal or suspicious means. For each ID, the data for the relevant day was extracted wherever the ID was the sender or receiver of a communication. A network graph was then plotted to observe this phenomenon. Out-degree was used to set node sizes, while the Force Atlas layout provided a clear view of the network.

  • Observation 8: For ID 439105 on Friday, a check against the movement data reveals that the first check-in for this ID was directly at Kiddie Land. Its main communications were with IDs 1464748, 1696241, 580064 and 1053224. Coincidentally, these communications all took place at Wet Land. These IDs are worth watching, although 439105 also chatted with many other visitors. This ID might not be the suspect, who would probably try to minimise or eliminate communications to avoid suspicion.
CHIA YONG JIAN Assign3 Task2 Obs SuspiciousMorning439105.png


  • Observation 9: For ID 1401601 on Sunday, a check against the movement data reveals that the first check-in for this ID was also directly at Kiddie Land. This person has few communications, of which the most significant was with ID 1434060 (ignoring IDs 839736 and 1278894, which are speculated to be park services and app features). This could signify that this person came into the park with just one partner. This ID is also worth watching in case it belongs to one of the suspects.
CHIA YONG JIAN Assign3 Task2 Obs SuspiciousMorning1401601.png
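
How these two IDs were first spotted is not detailed above. One possible way to screen for similar cases is sketched below. It assumes a lookup `entry_times`, hypothetically derived from the movement data, giving the first time each ID was seen at the Entry Corridor, and it flags any sender with a message from elsewhere in the park before that time (or with no entrance record at all). The function name, the lookup and the sample rows are assumptions for illustration.

```python
def early_senders(comm_rows, entry_times, entrance="Entry Corridor"):
    """IDs with a message sent from inside the park before their first entrance record.

    entry_times: hypothetical dict {id: first Timestamp seen at the entrance},
    assumed to be prepared separately from the movement data.
    """
    flagged = set()
    for row in comm_rows:
        if row["location"] == entrance:
            continue
        entered = entry_times.get(row["from"])
        # Zero-padded timestamp strings compare correctly as plain text.
        if entered is None or row["Timestamp"] < entered:
            flagged.add(row["from"])
    return sorted(flagged)

# Made-up rows and entrance times.
comm_rows = [
    {"Timestamp": "2014-06-06 08:10:00", "from": "100", "location": "Kiddie Land"},
    {"Timestamp": "2014-06-06 08:45:30", "from": "101", "location": "Entry Corridor"},
    {"Timestamp": "2014-06-06 09:00:00", "from": "101", "location": "Wet Land"},
    {"Timestamp": "2014-06-06 09:05:00", "from": "102", "location": "Tundra Land"},
]
entry_times = {"100": "2014-06-06 08:30:00", "101": "2014-06-06 08:45:00"}
suspects = early_senders(comm_rows, entry_times)
print(suspects)
```

Any ID flagged this way would still need to be checked against the movement data by hand, as was done for IDs 439105 and 1401601.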

When was vandalism discovered?