ISSS608 2016-17 T1 Assign3 Vaishnavi AMS
DinoFun World is a typical modest-sized amusement park, sitting on about 215 hectares and hosting thousands of visitors each day. It has a small town feel, but it is well known for its exciting rides and events.
One event last year was a weekend tribute to Scott Jones, internationally renowned football (“soccer,” in US terminology) star. Scott Jones is from a town near DinoFun World. He was a classic hometown hero, with thousands of fans who cheered his success as if he were a beloved family member. To celebrate his years of stardom in international play, DinoFun World declared “Scott Jones Weekend”, where Scott was scheduled to appear in two stage shows each on Friday, Saturday, and Sunday to talk about his life and career. In addition, a show of memorabilia related to his illustrious career would be displayed in the park’s Pavilion. However, the event did not go as planned. Scott’s weekend was marred by crime and mayhem perpetrated by a poor, misguided and disgruntled figure from Scott’s past.
While the crimes were rapidly solved, park officials and law enforcement figures are interested in understanding just what happened during that weekend to better prepare themselves for future events. They are interested in understanding how people move and communicate in the park, as well as how patterns change and evolve over time, and what can be understood about motivations for changing patterns.
The Task
Using the in-app communication data over the three days of the Scott Jones celebration, visual analytics is applied to solve the crime and discover patterns in the crowd that will help officials better prepare themselves for such future events.
Questions for investigation
1. Identify the IDs that stand out for their large volumes of communication and describe the characteristics of their communication patterns.
2. Identify up to 10 communication patterns in the data and characterize who is communicating, with whom, when and where, prioritizing the patterns that are most likely to relate to the crime.
3. From this data, hypothesize when the vandalism was discovered.
Tools utilized
- Microsoft Excel 2016 – Data cleaning and data preparation
- JMP Pro 12 – Data cleaning and data preparation
- Tableau 10.0 – Data visualization and analysis
- NodeXL – Data visualization and analysis of communication data
- Gephi – Data visualization and analysis of communication data
Data provided
- Communication data of three days of Scott Jones celebration
- Movement data of three days of Scott Jones celebration
- Park layout
- Park website
Data preparation: Examine the data and make appropriate changes wherever necessary using Excel and JMP Pro 12 to make the data fit for analysis. Extract selected data for analysis.
Data visualization and Analysis: Construct Network graphs and heat maps to examine the underlying insights and patterns and draw conclusions.
Data Visualization and Analysis
Question 1
In the first question we aim to identify the IDs that stand out among all IDs in the park over the course of the three days. We do this by aggregating the total number of messages sent from each ID and the total number of messages received by each ID. The aggregation was done in JMP Pro through the Tables > Summary function, and the results were visualized in Tableau as bubble charts.
- The above bubble chart shows the IDs that send out messages. Two IDs stand out: 1278894 and 839736.
- In the bubble chart of received messages, the IDs that receive the most messages stand out. In addition to the two IDs observed in the previous chart (1278894 and 839736), the external ID also receives a large share of the messages sent each day.
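The per-ID aggregation described above was done in JMP Pro, but the same idea can be sketched in a few lines of Python. This is only an illustration: the record layout (timestamp, sender, receiver, location) and every sample row are assumptions made for the example, not the actual data file.

```python
from collections import Counter

# Hypothetical sample rows: (timestamp, sender, receiver, location).
# IDs 1278894 and 839736 appear in the report; all other values are made up.
messages = [
    ("2014-06-06 09:00:05", "1278894", "100001", "Entry Corridor"),
    ("2014-06-06 09:00:05", "1278894", "100002", "Entry Corridor"),
    ("2014-06-06 09:01:10", "100001", "839736", "Wet Land"),
    ("2014-06-06 09:01:40", "839736", "100001", "Entry Corridor"),
    ("2014-06-06 09:02:00", "100002", "external", "Tundra Land"),
]

def message_volumes(rows):
    """Return two Counters: messages sent per ID and messages received per ID."""
    sent = Counter(sender for _, sender, _, _ in rows)
    received = Counter(receiver for _, _, receiver, _ in rows)
    return sent, received

sent, received = message_volumes(messages)
print(sent.most_common(2))      # IDs that send the most messages
print(received.most_common(2))  # IDs that receive the most messages
```

Sorting both counters by volume gives the same ranking that the size of the bubbles conveys in the charts.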
Drilling down on ID 839736 on an hourly basis for each day reveals some interesting patterns:
- Almost all the IDs that visit the park communicate with this ID, and messages are sent to it and received back from it within short intervals. This could be a help desk or information center of DinoFun World, where people send queries and receive answers.
- The number of messages this ID receives peaks at 9 AM, between 2 and 3 PM, around 4 PM and around 6 PM.
- At these peaks, the number of messages the ID receives exceeds the number of messages it sends out.
- More people could be querying the Information Center because of the shows being held for the Scott Jones celebration.
- We know that two shows were held at the Grinosaurus stage every day for the Scott Jones celebration. We can assume that the first show took place around 9 to 10 AM and the second around 3 to 4 PM.
- ID 1278894 has a very different pattern from the Information Center ID. There is a very significant difference between the number of messages it sends and the number it receives.
- The number of messages sent is much higher than the number of messages received.
- During an active hour, this ID sends out messages every 5 minutes. It starts receiving messages only after it has sent messages to those IDs.
- After an hour in which messages are sent out, no messages are sent out in the following hour.
- From these characteristics we can assume that this ID is the Cindysaurus trivia game. Details about this game can be found on the park website.
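The hourly drill-down and the 5-minute sending cadence can be checked with a small sketch like the one below. It assumes hypothetical records of the form (timestamp, sender, receiver); the sample rows are invented to mimic the trivia-game behaviour described above and are not taken from the data.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical rows: (timestamp, sender, receiver). Values are made up.
trivia_rows = [
    ("2014-06-06 12:00:00", "1278894", "100001"),
    ("2014-06-06 12:00:00", "1278894", "100002"),
    ("2014-06-06 12:05:00", "1278894", "100001"),
    ("2014-06-06 12:06:30", "100001", "1278894"),
    ("2014-06-06 12:10:00", "1278894", "100002"),
]

def hourly_profile(rows, target_id):
    """Count messages sent and received by target_id for each hour of the day."""
    sent, received = Counter(), Counter()
    for ts, sender, receiver in rows:
        hour = datetime.strptime(ts, FMT).hour
        if sender == target_id:
            sent[hour] += 1
        if receiver == target_id:
            received[hour] += 1
    return sent, received

def send_gaps_minutes(rows, target_id):
    """Minutes between consecutive distinct send times of target_id."""
    times = sorted({datetime.strptime(ts, FMT) for ts, s, _ in rows if s == target_id})
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]

sent, received = hourly_profile(trivia_rows, "1278894")
gaps = send_gaps_minutes(trivia_rows, "1278894")
print(dict(sent), dict(received), gaps)
```

A list of gaps that is constant at 5 minutes, combined with far more sent than received messages, is the signature attributed to the trivia game above.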
Question 2
Observation 1
- Comparing the Information Center (839736) communication pattern on an hourly basis across Friday, Saturday and Sunday, we can observe that the patterns on Friday and Saturday are similar, with peaks at 9 AM, 12 PM, 2 PM, 4 PM and 8 PM in both messages sent and received.
- On Sunday, however, the communication volume spikes primarily at two points: one at 12 PM and the other around 3 PM. This could mean that a large number of people were communicating with the Information Center about an issue. We can assume that the vandalism was discovered around the time of the first spike and that people were reporting it to, or inquiring about it with, the Information Center. The second spike could be due to the second show of the day being cancelled because of the vandalism, with people who were not aware of the cancellation inquiring about it.
Observation 2
- Comparing the trivia game communication patterns over the three days, we find that the patterns are similar for both sent and received messages, with no disruption due to the vandalism.
- The trivia game messages are first sent out at 12 PM and then at 2 PM, 4 PM, 6 PM and 8 PM, once every five minutes for one hour each time.
Observation 3
Blue – June 6, Friday
Yellow – June 7, Saturday
Red – June 8, Sunday
- Comparing the communication patterns over the three days by location, we can observe that more messages were exchanged in the Wetland region than in any other region.
- Tundra Land and the Entry Corridor had the next highest volumes of messages exchanged.
- The high volume in the Wetland region was due to the Scott Jones memorabilia show held at Creighton Pavilion, which is in the Wetland region.
- The high volume at the Entry Corridor could be attributed to the Information Center and the Cindysaurus trivia game both being located there.
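The location comparison amounts to counting messages per region and per day. A minimal sketch, assuming hypothetical (timestamp, location) pairs with invented values:

```python
from collections import Counter

# Hypothetical rows: (timestamp, location). Values are made up for illustration.
location_rows = [
    ("2014-06-06 10:15:00", "Wet Land"),
    ("2014-06-06 10:20:00", "Wet Land"),
    ("2014-06-06 10:25:00", "Tundra Land"),
    ("2014-06-08 10:30:00", "Wet Land"),
    ("2014-06-08 10:35:00", "Entry Corridor"),
]

def volume_by_location(rows):
    """Count messages for each (location, date) pair."""
    counts = Counter()
    for ts, location in rows:
        counts[(location, ts[:10])] += 1  # first 10 characters hold YYYY-MM-DD
    return counts

counts = volume_by_location(location_rows)
for (location, day), n in sorted(counts.items()):
    print(location, day, n)
```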
Observation 4
- From the above time series graphs we can observe the trend of the communication pattern over the three days. The green line denotes the communication pattern on Sunday and the red line that on Saturday. They follow a similar pattern throughout the day except at two critical points.
- On Sunday there is a peak at 12 PM and another peak at 2 PM.
- The peak at 12 PM could be due to the vandalism being discovered at the Creighton Pavilion at this time.
- The peak at 2 PM could be due to the show at the Grinosaurus stage being cancelled.
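The visual comparison of Sunday against Saturday can also be written as a simple rule: flag the hours in which one day's volume exceeds the other day's by some factor. The hourly counts and the 1.5 threshold below are assumptions chosen for illustration, not values read from the charts.

```python
# Hypothetical hourly message counts (hour of day -> messages); values are made up.
saturday = {10: 100, 11: 110, 12: 120, 13: 115, 14: 105}
sunday = {10: 98, 11: 112, 12: 260, 13: 118, 14: 190}

def unusual_hours(baseline, observed, ratio=1.5):
    """Return the hours where observed volume exceeds ratio times the baseline."""
    return sorted(h for h, n in observed.items() if n > ratio * baseline.get(h, 0))

flagged = unusual_hours(saturday, sunday)
print(flagged)  # hours where Sunday departs from the Saturday pattern
```

A ratio test is only one possible choice; an absolute difference or a per-hour z-score across more days would serve the same purpose.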
Observation 5
To analyse the communication data further, we use NodeXL.
- We load the file into NodeXL and mark the graph as directed.
- Next, we apply the Harel-Koren Fast Multiscale layout and refresh the graph to obtain the graph below.
- Having identified from both the movement and communication patterns that the vandalism was discovered around 11:40 AM, we extracted the data between 10:00 AM and 1:00 PM to identify the IDs that were communicating with each other around the time of the discovery.
- The two major IDs sending and receiving messages during this period were the external ID and the Information Center ID.
- The external ID only received messages and did not send any messages back.
- The Information Center was the major player, with most IDs reaching out to the park's Information Center to inquire about or report the vandalism.
- From the above graph we can also observe a group of IDs that communicate only among themselves and do not contact the Information Center or the external ID during this period.
- These IDs could be tour guides or park officials, since they communicate only with each other and do not reach out to the external ID or the Information Center (IDs 2047906, 668872, 543006, 955733).
- We can also observe two or three groups of three or four IDs communicating among themselves without reaching out to anyone else in the park. These IDs could be the suspects (a few of these IDs: 281145, 1217112, 2090883).
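The groups that talk only among themselves were read off the NodeXL layout by eye. One way to find such groups programmatically is sketched below: restrict the messages to a time window, leave out the two hub IDs, and keep the connected components whose members never contacted a hub. The record layout and all sample IDs except 839736 are hypothetical, and this is a swapped-in technique rather than what NodeXL does internally.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
HUBS = {"839736", "external"}  # Information Center ID and external ID

# Hypothetical rows: (timestamp, sender, receiver). Visitor IDs are made up.
window_rows = [
    ("2014-06-08 11:41:00", "200001", "200002"),
    ("2014-06-08 11:42:00", "200002", "200003"),
    ("2014-06-08 11:45:00", "100001", "839736"),
    ("2014-06-08 11:46:00", "100001", "100002"),
    ("2014-06-08 11:50:00", "100003", "external"),
    ("2014-06-08 14:00:00", "100004", "100005"),  # outside the window
]

def isolated_groups(rows, start, end, hubs=HUBS):
    """Connected groups inside [start, end] in which nobody messaged a hub."""
    start_t, end_t = datetime.strptime(start, FMT), datetime.strptime(end, FMT)
    neighbours = defaultdict(set)
    touched_hub = set()
    for ts, sender, receiver in rows:
        if not (start_t <= datetime.strptime(ts, FMT) <= end_t):
            continue
        if sender in hubs or receiver in hubs:
            touched_hub.update({sender, receiver} - hubs)
            continue
        neighbours[sender].add(receiver)
        neighbours[receiver].add(sender)
    groups, seen = [], set()
    for node in sorted(neighbours):
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:  # depth-first walk over one connected component
            current = stack.pop()
            if current not in component:
                component.add(current)
                stack.extend(neighbours[current] - component)
        seen |= component
        if not component & touched_hub:
            groups.append(sorted(component))
    return groups

groups = isolated_groups(window_rows, "2014-06-08 10:00:00", "2014-06-08 13:00:00")
print(groups)
```

Edge direction is ignored here because the question is only who is connected to whom; the resulting groups are candidates for closer inspection, not conclusions.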
Observation 6
- From the graphs we can see that the external ID is one of the largest receivers of messages on all three days.
- From the above graph we can observe a spike in communication to the external ID between 10 AM and 12 PM on Sunday. This could be attributed to the vandalism being discovered on Sunday around 11:40 AM.
- Outside this time period, communication to the external ID follows a consistent pattern on all three days.
- We can assume the external ID represents social networking sites where people post about the vandalism. It could also mean that all communications to people outside the park, such as family and friends, are classified by the park as external.
Question 3
To determine when the vandalism was discovered, we use both the communication and the movement data.
Interactive dashboards were created for both movement and communication data to observe the variations over time.
Dino park - Communication pattern
Dino park - Crowd Movement pattern
- From the above graphs, we can observe the movement patterns across the three days.
- On Sunday, the number of people at the Creighton Pavilion is lower than on Friday and Saturday. This suggests that the vandalism happened earlier that day and that the attraction was then closed to the public. There are also no check-ins at Creighton Pavilion after 12 PM, which further supports the vandalism having happened before 12 PM.
- We can also observe a much smaller crowd at the Grinosaurus stage at 4 PM on Sunday than on Friday and Saturday. We can assume from this that the show at the Grinosaurus stage had been cancelled due to the vandalism at the Creighton Pavilion. This is also reflected in the lack of check-ins at the stage.
- From the communication data we can observe a large volume of communication at 11:40 AM in the Wetland area on Sunday. This is the area in which the Creighton Pavilion is located, so we can conclude that this is when the vandalism was discovered and messages were sent to other IDs such as the Information Center, the external ID, and family and friends in the group.
- On Friday and Saturday there is a large volume of communication in the Wetland area. This decreased substantially on Sunday, as the attraction had been closed to the public.
- 4 PM may be the time at which the crowd is usually allowed to view the Scott Jones memorabilia in the evening.
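The observation that check-ins at the Creighton Pavilion stop around midday on Sunday can be verified by taking the latest check-in per day for that attraction. The sketch below assumes hypothetical movement records of the form (timestamp, visitor ID, attraction); the sample rows are invented.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical check-in rows: (timestamp, visitor_id, attraction). Values are made up.
check_ins = [
    ("2014-06-07 10:15:00", "100001", "Creighton Pavilion"),
    ("2014-06-07 18:30:00", "100002", "Creighton Pavilion"),
    ("2014-06-08 09:50:00", "100003", "Creighton Pavilion"),
    ("2014-06-08 11:30:00", "100004", "Creighton Pavilion"),
    ("2014-06-08 16:00:00", "100005", "Grinosaurus Stage"),
]

def last_check_in(rows, attraction):
    """Return {date: 'HH:MM'} with the latest check-in time at the attraction."""
    latest = {}
    for ts, _visitor, place in rows:
        if place != attraction:
            continue
        t = datetime.strptime(ts, FMT)
        day = t.date().isoformat()
        if day not in latest or t > latest[day]:
            latest[day] = t
    return {day: t.strftime("%H:%M") for day, t in latest.items()}

latest = last_check_in(check_ins, "Creighton Pavilion")
print(latest)
```

A last check-in that falls much earlier on one day than on the others is consistent with the attraction having been closed from that point on.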