ISSS608 2016-17 T1 Assign3 Parikshit Ravindra MAYEE
Overview
In this assignment I explore and visualize the communication patterns of visitors to DinoFun World (a fictitious amusement park) over 3 days. All of my analysis and visualization is consolidated in an interactive dashboard published to Tableau Public at Communication_Patterns.
Approach
My approach for this assignment focused on answering the following questions:
1. Identify those IDs that stand out for their large volumes of communication. For each of these IDs:
a. Characterize the communication patterns you see.
b. Based on these patterns, what do you hypothesize about these IDs?
2. Describe up to 10 communications patterns in the data. Characterize who is communicating, with whom, when and where. If you have more than 10 patterns to report, please prioritize those patterns that are most likely to relate to the crime.
3. From this data, can you hypothesize when the vandalism was discovered? Describe your rationale.
Data
Only the Communication dataset for the 3 days was used for the analysis in this assignment. Movement data was not used.
The following files were used:
• DinoFunWorld_CommData.zip consists of in-app communication data over the three days of the Scott Jones celebration.
• DinoFunWorld_LayoutMap.zip consists of a jpg file.
• DinoFunWorld_Website.zip consists of webpages of DinoFun World Park.
Data Preparation & Analysis
Communications data for park visitors was made available for each day. The dataset was structured to show the timestamp of each communication, who initiated it, who the recipient was, and the location from which it was sent.
Basic data exploration & Combining 3 days data:
To simplify visualization in Tableau, I first analysed the data in SAS JMP Pro.
I performed basic data exploration by plotting distributions of the columns. I observed that the ‘To’ column contained the text ‘external’ for every communication whose recipient was outside the park and therefore had no ID. I recoded (Cols ==> Utilities ==> Recode) the ‘external’ value as ‘0’.
After recoding, I changed the data type and modeling type of the ‘From’ & ‘To’ columns to Numeric and Nominal respectively.
I performed a quick check to identify any missing patterns (Tables ==> Missing data patterns). No missing values were observed in the dataset.
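For readers without JMP, the recode and missing-value check above can be sketched in plain Python. This is only an illustration under assumptions: the column names Timestamp, from, to and location, the helper names recode_recipient and count_missing, and the sample rows are all hypothetical and not taken from the actual files.

```python
def recode_recipient(value):
    """Return 0 for the text 'external', otherwise the recipient ID as an int."""
    text = value.strip()
    return 0 if text.lower() == "external" else int(text)


def count_missing(rows, columns):
    """Count empty or absent cells per column (a rough stand-in for a missing-data check)."""
    return {
        col: sum(1 for row in rows if not str(row.get(col, "")).strip())
        for col in columns
    }


# Illustrative rows only; the IDs, times and locations are made up.
rows = [
    {"Timestamp": "2014-06-06 08:03:19", "from": "439105", "to": "1053224", "location": "Kiddie Land"},
    {"Timestamp": "2014-06-06 08:03:47", "from": "439105", "to": "external", "location": "Kiddie Land"},
]

for row in rows:
    row["from"] = int(row["from"])           # numeric ID, treated as a category later
    row["to"] = recode_recipient(row["to"])  # 'external' becomes 0

missing = count_missing(rows, ["Timestamp", "from", "to", "location"])
```

Using 0 as the placeholder keeps the column numeric, but it relies on no real visitor having the ID 0.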
Next, I combined the 3 days of data into one file using JMP’s Tables ==> Concatenate function.
The combined dataset for the 3 days consisted of 4,153,329 rows. I saved it as a CSV file from JMP, then imported that file into Tableau as the source for analysis.
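The concatenation step could also be done outside JMP. The sketch below stacks several CSV files that share one header using Python's csv module. The in-memory files, their contents and the function name concatenate_days are assumptions made for illustration; in practice each entry would be an open file for one day's data.

```python
import csv
import io


def concatenate_days(day_files):
    """Stack the rows of several CSV file objects that share the same header."""
    combined = []
    for f in day_files:
        combined.extend(csv.DictReader(f))
    return combined


# Tiny in-memory stand-ins for the three daily files (made-up rows).
header = "Timestamp,from,to,location\n"
day1 = io.StringIO(header + "2014-06-06 08:03:19,439105,1053224,Kiddie Land\n")
day2 = io.StringIO(header + "2014-06-07 09:15:02,1278894,439105,Entry Corridor\n")
day3 = io.StringIO(header + "2014-06-08 11:39:40,839736,external,Entry Corridor\n")

combined = concatenate_days([day1, day2, day3])
row_count = len(combined)  # for the real data the report gives 4,153,329 rows
```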
Question 1: Identify those IDs that stand out for their large volumes of communication. For each of these IDs.
1. Characterize the communication patterns you see.
2. Based on these patterns, what do you hypothesize about these IDs?
By plotting a bubble chart for the ‘From’ & ‘To’ columns of the dataset, with bubble size representing the number of communications for each ID, I was able to identify the IDs sending and receiving a high volume of communications.
Based on this visualization, IDs 1278894 & 839736 show an unusually high volume of communication compared to the other IDs.
The high-volume bubble labelled ‘0’ represents the communications sent to recipients outside the park (recoded from ‘external’).
The next step was to identify the communication pattern for these 2 IDs.
I went back to JMP to prepare a new dataset for these two IDs. Using the Select Rows functionality, I selected all rows where the communication was sent by or received by 1278894 or 839736.
I created 2 new columns, UID and Communication type. UID is the ID that the row describes. Communication type specifies whether the communication was sent (Out) or received (IN) by the UID.
There was no communication between 1278894 & 839736, so each selected row belongs to exactly one of the two IDs and the formulas used to derive these values stay unambiguous in this case.
This dataset was saved as a CSV file and imported into Tableau.
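The counts behind such a bubble chart can be approximated with a simple tally per ID. The sketch below is not the Tableau computation itself, and the rows are invented; it assumes the columns are named from and to, with 0 standing for external recipients.

```python
from collections import Counter


def communication_volume(rows):
    """Return (sent, received) counters keyed by ID."""
    sent = Counter(row["from"] for row in rows)
    received = Counter(row["to"] for row in rows)
    return sent, received


# Made-up rows: ID 1 sends three messages, two messages go to external (0).
rows = [
    {"from": 1, "to": 2},
    {"from": 1, "to": 3},
    {"from": 1, "to": 0},
    {"from": 2, "to": 0},
]

sent, received = communication_volume(rows)
top_sender = sent.most_common(1)[0]        # (ID, number of messages sent)
top_recipient = received.most_common(1)[0]  # (ID, number of messages received)
```

Sorting either counter with most_common makes outliers such as the two IDs discussed here easy to spot before any chart is drawn.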
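The derivation of the two new columns can be sketched as follows. The column names from and to and the function name tag_rows are assumptions; the rule itself relies on the observation above that the two IDs never communicate with each other, so each selected row matches exactly one branch.

```python
HIGH_VOLUME_IDS = {1278894, 839736}


def tag_rows(rows):
    """Keep rows that involve a high-volume ID and add UID / Communication type."""
    tagged = []
    for row in rows:
        if row["from"] in HIGH_VOLUME_IDS:
            tagged.append({**row, "UID": row["from"], "Communication type": "Out"})
        elif row["to"] in HIGH_VOLUME_IDS:
            tagged.append({**row, "UID": row["to"], "Communication type": "IN"})
        # rows that involve neither ID are dropped
    return tagged


# Made-up rows for illustration; only the two high-volume IDs come from the report.
rows = [
    {"from": 1278894, "to": 111, "location": "Entry Corridor"},
    {"from": 222, "to": 839736, "location": "Wet Land"},
    {"from": 333, "to": 444, "location": "Coaster Alley"},
]

tagged = tag_rows(rows)
```

If the two IDs had exchanged messages, a single row would need to appear twice (once as Out, once as IN), which this rule does not handle.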
I analyzed the communication pattern for 1278894 first. My observations on its characteristics are documented below:
1) 1278894 sends out communications in a very precise manner. All communications sent by this ID originate from Entry Corridor.
2) The first communication is sent at 12 noon and the last at 8:55 PM. Communications are sent every alternate hour (12.x, 2.x, 4.x, 6.x, 8.x pm).
3) Within each of these hours, communications are sent at 5-minute intervals, i.e. they go out exactly 12 times in each active hour.
4) These communications are sent to a huge number of IDs in the park. The set of unique IDs that 1278894 sends to is not constant.
5) This ID starts receiving communications after it has sent its first communication of the hour. These incoming communications originate from all over the park.
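The schedule described in observations 2 and 3 can be checked programmatically rather than by eye. The sketch below groups send times by clock hour and tests whether the distinct minutes form a 5-minute grid. The timestamps are synthetic, generated to mimic the described schedule for a single day, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def send_slots(timestamps):
    """Map each clock hour to the sorted distinct minutes at which messages were sent."""
    slots = defaultdict(set)
    for ts in timestamps:
        slots[ts.hour].add(ts.minute)
    return {hour: sorted(minutes) for hour, minutes in sorted(slots.items())}


def is_five_minute_grid(minutes):
    """True when consecutive send minutes are exactly 5 minutes apart."""
    return all(b - a == 5 for a, b in zip(minutes, minutes[1:]))


# Synthetic send times for one day: every 5 minutes in the 12, 14, 16, 18 and 20 hours.
start = datetime(2014, 6, 6, 12, 0)
synthetic = [
    start + timedelta(hours=h, minutes=m)
    for h in (0, 2, 4, 6, 8)
    for m in range(0, 60, 5)
]

slots = send_slots(synthetic)
active_hours = list(slots)
```

On real data the function should be applied to one day at a time, since it groups by clock hour only and would merge the same hour across days.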
Hypothesis based on communication pattern of 1278894:
ID 1278894 sends communications to and receives them from a very large number of IDs in the park. Realistically, no park visitor could be acquainted with so many people and communicate with them in such a precise manner. I therefore hypothesise that this ID is associated with park services.
Also, since the communications are sent in a very precise manner, I further hypothesise that this ID is an automated system for sending and receiving messages.
Next I analysed the communication pattern for 839736. My observations on its characteristics are documented below:
1) 839736 sends & receives communications throughout the day on all 3 days, with no specific pattern in the timing or in the IDs to which the communications are sent.
2) All communications sent by this ID originate from Entry Corridor. Communications received by this ID originate from all over the park.
3) Communication starts at the start of the day, around 8 am, and ends with the end of park operations after 11 pm.
4) One major observation is the spike in communication volume between 11 am & 4 pm on the 3rd day.
5) After excluding the communications between 12 & 4 pm (which are extremely numerous and overpower the other patterns), I observed that the number of communications gradually increases from 8 am to 12 noon. After 12 noon it drops gradually until 3 pm, after which it increases again until 7 pm. Finally, after 7 pm the number of communications drops continuously and dies out around 11:30 pm, i.e. with the end of park operations.
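The filtering idea in observation 5 can be sketched as an hourly profile with an optional exclusion window. The timestamps below are synthetic and the function name hourly_counts is hypothetical; treating "between 12 & 4 pm" as clock hours 12 to 15 is also an assumption.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(timestamps, exclude_hours=()):
    """Count messages per clock hour, skipping any hours listed in exclude_hours."""
    counts = Counter(ts.hour for ts in timestamps if ts.hour not in exclude_hours)
    return dict(sorted(counts.items()))


# Synthetic timestamps: a few messages in the morning and evening, many at 12.x.
synthetic = (
    [datetime(2014, 6, 8, 9, m) for m in (5, 30)]
    + [datetime(2014, 6, 8, 12, m) for m in range(50)]
    + [datetime(2014, 6, 8, 18, m) for m in (10, 20, 40)]
)

full_profile = hourly_counts(synthetic)
filtered_profile = hourly_counts(synthetic, exclude_hours=range(12, 16))
```

Removing the dominant hours rescales the remaining profile, which is what lets the smaller morning and evening trends become visible.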
Hypothesis based on communication pattern of 839736:
ID 839736 sends communications to and receives them from a very large number of IDs in the park. Realistically, no park visitor could be acquainted with so many people and communicate with them at such a high volume. I therefore hypothesise that this ID is associated with park services.
I did not find any specific pattern in the timing of the communications. Communications are sent & received continuously throughout the day. This leads me to hypothesise that ID 839736 is operated manually by park officials.
Also, an important observation differentiating ID 839736 from ID 1278894 is that 839736 received a very high volume of communications on the 3rd day around 12 noon. I did not observe a similar spike for 1278894. This further strengthens my hypothesis that 1278894 is an automated system rather than a manually operated one.
Question 2: Describe up to 10 communications patterns in the data. Characterize who is communicating, with whom, when and where. If you have more than 10 patterns to report, please prioritize those patterns that are most likely to relate to the crime.
Pattern 1: External communications
The volume of communications sent outside the park (external communication) shows one specific pattern. Every day the number of communications drops during the 10.x am hour and drops again during the 3.x pm hour. The exception is the 3rd day, when there is no drop at 3.x pm.
Pattern 2:
Communications from Coaster Alley show a specific pattern: a high volume of communications is sent out in the 11.x am and 4.x pm hours. The exception is the 3rd day, when there is no spike around the 4.x pm hour.
Pattern 3:
Communications from Entry Corridor follow a pattern of high-volume communication 5 times a day at regular intervals. This relates to the communication pattern explained for ID 1278894 in Question 1.
Pattern 4:
Communications originating from Wetland make up the highest volume of any location on each of the 3 days.
Pattern 5:
Communications from Wetland show an unusual spike in volume starting around 11:39 am, a sudden drop at 11:53 am, and then another spike that reaches maximum volume at 12 noon.
Question 3: From this data, can you hypothesize when the vandalism was discovered? Describe your rationale.
Based on the communication patterns documented above, one clear observation is that most of the patterns show an exception or break on the 3rd day after 11 am. The extremely high volume of communications starting in the 11.x hour on the 3rd day suggests that the vandalism was discovered during this time frame, since the communications are exchanged between many IDs and are not confined to specific groups.
Focusing on this time frame (starting 11.x am on the 3rd day) and comparing the communication patterns across locations shows that the spike in communications started in Wetland at around 11:30 AM. I hypothesise that this spike marks the most likely time at which the vandalism was discovered.
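The onset of such a spike was read off the charts in this analysis. As a rough programmatic cross-check, one could count messages per minute for a single location and flag the first minute whose count exceeds a multiple of the median. The sketch below does this on synthetic timestamps; the threshold factor of 3, the function names and the data are all assumptions, not part of the original analysis.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import median


def per_minute_counts(timestamps):
    """Count messages per minute by truncating each timestamp to the minute."""
    return Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)


def first_spike_minute(timestamps, factor=3.0):
    """Return the first minute whose count exceeds factor times the median count."""
    counts = per_minute_counts(timestamps)
    if not counts:
        return None
    baseline = median(counts.values())
    for minute in sorted(counts):
        if counts[minute] > factor * baseline:
            return minute
    return None


# Synthetic data: 2 messages per minute from 11:00 to 11:29, then 20 per minute.
base = datetime(2014, 6, 8, 11, 0)
quiet = [base + timedelta(minutes=m, seconds=s) for m in range(30) for s in (10, 40)]
busy = [base + timedelta(minutes=m, seconds=s) for m in range(30, 40) for s in range(20)]

onset = first_spike_minute(quiet + busy)
```

A median baseline is used because a short burst barely moves it, whereas a mean would be pulled upward by the spike itself.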
Results
Results of my visual analysis are available on Tableau Public: Communication_Patterns
Tools Utilized
1. SAS JMP Pro: Used for initial data analysis, data cleaning and restructuring data.
2. Tableau: Used for exploratory data analysis and to provide visual representations.
3. Tableau Public: The visual dashboard was published to Tableau Public and its URL is shared above.
References
1. http://hcil2.cs.umd.edu/newvarepository/VAST%20Challenge%202015/challenges/Mini-Challenge%202/
2. http://hcil2.cs.umd.edu/newvarepository/VAST%20Challenge%202015/challenges/Mini-Challenge%202/entries/Purdue%20University%20-%20Zhao/
3. https://community.jmp.com/community/academic
4. https://community.tableau.com/