Difference between revisions of "ISSS608 2016-17 T1 Assign3 Agrim Gairola"

From Visual Analytics and Applications
Latest revision as of 04:14, 1 November 2016

MAYHEM AT DINOFUN WORLD

Title.jpg

Overview


DinoFun World is a typical modest-sized amusement park, sitting on about 215 hectares and hosting thousands of visitors each day. It has a small town feel, but it is well known for its exciting rides and events. One event last year was a weekend tribute to Scott Jones, internationally renowned football (“soccer,” in US terminology) star. Scott Jones is from a town nearby DinoFun World. He was a classic hometown hero, with thousands of fans who cheered his success as if he were a beloved family member. To celebrate his years of stardom in international play, DinoFun World declared “Scott Jones Weekend”, where Scott was scheduled to appear in two stage shows each on Friday, Saturday, and Sunday to talk about his life and career. In addition, a show of memorabilia related to his illustrious career would be displayed in the park’s Pavilion. However, the event did not go as planned. Scott’s weekend was marred by crime and mayhem perpetrated by a poor, misguided and disgruntled figure from Scott’s past.

The Task

You have access to the in-app communication data over the three days of the Scott Jones celebration. This includes communications between the paying park visitors, as well as communications between the visitors and park services. In addition, the data also contains records indicating if and when the user sent a text to an external party.

Task1: Use visual analytics to analyze the available data and develop responses to the questions below.
a. Identify those IDs that stand out for their large volumes of communication.
b. For each of these IDs, characterize the communication patterns you see.
c. Based on these patterns, what do you hypothesize about these IDs?

Task2: Describe up to 10 communications patterns in the data. Characterize who is communicating, with whom, when and where. If you have more than 10 patterns to report, please prioritize those patterns that are most likely to relate to the crime.

Task3: From this data, can you hypothesize when the vandalism was discovered? Describe your rationale. Note: Please limit your response to no more than 3 images and 300 words.


Tools Used

  • Tableau version 10.0
  • JMP Pro
  • Gephi
  • Microsoft Office


Task 1

a. Identification of IDs with Large Communication Volume
To find the IDs responsible for the largest amount of communication, we plot the total number of outgoing calls made by each ID. Two IDs clearly stand out with an exceptionally large number of calls: 1278894 and 839736.

First1.jpg
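
The ranking behind a chart like this can be sketched in a few lines of Python. This is only an illustration, not the Tableau workflow used for the figure: it assumes each record is a dictionary with a "from" field holding the sender ID, and the sample rows are made up.

```python
from collections import Counter

def outgoing_counts(records):
    """Count outgoing communications per sender ID.

    `records` is an iterable of dicts with a "from" key (the sender ID).
    """
    return Counter(row["from"] for row in records)

def top_senders(records, n=2):
    """Return the n sender IDs with the most outgoing communications."""
    return outgoing_counts(records).most_common(n)

# Tiny made-up sample, for illustration only (not the park data).
sample = [
    {"from": "A", "to": "B"},
    {"from": "A", "to": "C"},
    {"from": "A", "to": "D"},
    {"from": "B", "to": "A"},
    {"from": "C", "to": "A"},
    {"from": "B", "to": "C"},
]
print(top_senders(sample))
```

On the real data, the two IDs reported above would be expected at the top of this ranking.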


To understand where these two IDs were located when the calls were made, let us take a closer look at them.

Second.jpg


The graph above shows that these IDs do not move: all of their communication originates from the Entry Corridor. This is unusual, since an ordinary visitor would be expected to communicate from many locations across the park.
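
The same check can be sketched in Python as a rough cross-check of the chart. The field name "location" and the `min_messages` cut-off are assumptions made for this example, and the sample rows are invented.

```python
from collections import defaultdict

def locations_by_sender(records):
    """Map each sender ID to the set of locations it sent messages from."""
    locations = defaultdict(set)
    for row in records:
        locations[row["from"]].add(row["location"])
    return dict(locations)

def single_location_senders(records, min_messages=3):
    """Senders with at least `min_messages` messages, all from one location."""
    totals = defaultdict(int)
    for row in records:
        totals[row["from"]] += 1
    locations = locations_by_sender(records)
    return sorted(
        sender for sender, places in locations.items()
        if len(places) == 1 and totals[sender] >= min_messages
    )

# Made-up sample rows; "Other Area" is a placeholder location name.
sample = [
    {"from": "A", "location": "Entry Corridor"},
    {"from": "A", "location": "Entry Corridor"},
    {"from": "A", "location": "Entry Corridor"},
    {"from": "B", "location": "Wet Land"},
    {"from": "B", "location": "Other Area"},
    {"from": "B", "location": "Wet Land"},
    {"from": "C", "location": "Wet Land"},
]
print(single_location_senders(sample))
```

The `min_messages` threshold keeps IDs with only one or two messages from being flagged as stationary by accident.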

b. Communication Patterns

To analyse the communication patterns of these IDs, we use Gephi. To prepare the provided data for Gephi, we make the following change to the data:

  • Rename the column "from" to "Source" and the column "to" to "Target"
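
For large files, the renaming can also be scripted. The sketch below is one possible way to do it with Python's csv module; the other column names and the sample row are assumptions made for the example.

```python
import csv
import io

def to_gephi_edges(src, dst):
    """Copy a CSV stream, renaming the "from"/"to" headers to "Source"/"Target"."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    rename = {"from": "Source", "to": "Target"}
    writer.writerow([rename.get(name, name) for name in header])
    writer.writerows(reader)  # data rows are copied unchanged

# Made-up one-row sample; "Timestamp" and "location" are assumed column names.
raw = io.StringIO(
    "Timestamp,from,to,location\n"
    "2014-06-08 11:35:00,111,222,Wet Land\n"
)
out = io.StringIO()
to_gephi_edges(raw, out)
print(out.getvalue())
```

With real files, `src` and `dst` would be file objects opened with `newline=""`, as the csv module documentation recommends.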

Importing the data for the three days into Gephi, with the settings shown below, gives the following networks:

Fri.jpg
Sat.jpg
Sun.jpg

The network diagrams show three nodes that account for the largest volume of communication. Two of these nodes are the IDs identified previously. The third node represents all communication sent to external parties. IDs 1278894 and 839736 both communicate with a large number of park visitors. Interestingly, these two IDs do not communicate with each other, nor do they have any external communication.
A closer look at the network shows that the communication volume of ID 839736 is significantly higher than that of ID 1278894. In addition, ID 839736 has similar volumes of incoming and outgoing communication, while ID 1278894 has a large volume of incoming communication.

c. ID Hypothesis
From the above visualizations, we have the following information:

  • The communication volume of IDs 1278894 and 839736 is significantly higher than that of other park visitors.
  • All communication made by these IDs is from the Entry Corridor.
  • These IDs communicate with almost all other IDs present in the park.
  • These IDs do not communicate with each other, nor with any external party.
  • ID 1278894 sends out a large volume of communication at intervals of 5 minutes.

Hypothesis
It is reasonable to assume that neither of these IDs is a park visitor. They are most likely park employees or machines whose specific task is to communicate with park visitors.
ID 1278894: This ID is most likely an automated messaging service that sends out communication to visitors every 5 minutes, to which the visitors respond. The very high frequency of its communication with other IDs supports this.
ID 839736: This ID appears to be a park employee who is responsible for handling the queries of park visitors.

Task 2

Interesting Pattern 1
The plot below shows the communication patterns between 12 PM and 1 PM on Sunday. There is a clear peak in communication volume every 5 minutes. This is interesting because the data shown represents communication taking place in the Entry Corridor.
These peaks could represent broadcast messages/calls being sent out by a park employee to convey information about the vandalism in the park.

Interesting.png
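
A periodic pattern like this can also be checked numerically. The following Python sketch counts messages per minute, flags minutes that are far busier than the median minute, and reports the gaps between those minutes. The timestamp format, the `factor` threshold and the synthetic sample are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime
from statistics import median

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

def per_minute_counts(timestamps, fmt=FMT):
    """Count messages per calendar minute."""
    counts = Counter()
    for ts in timestamps:
        counts[datetime.strptime(ts, fmt).replace(second=0)] += 1
    return counts

def peak_minutes(counts, factor=3):
    """Minutes whose count exceeds `factor` times the median minute count."""
    threshold = factor * median(counts.values())
    return sorted(m for m, c in counts.items() if c > threshold)

def gaps_in_minutes(peaks):
    """Gaps, in whole minutes, between consecutive peak minutes."""
    return [int((b - a).total_seconds() // 60) for a, b in zip(peaks, peaks[1:])]

# Synthetic sample: one message per minute, plus a burst every 5th minute.
sample = []
for minute in range(16):
    n = 10 if minute % 5 == 0 else 1
    sample += [f"2014-06-08 12:{minute:02d}:{s:02d}" for s in range(n)]

print(gaps_in_minutes(peak_minutes(per_minute_counts(sample))))
```

A list of gaps that are all equal to 5 would be consistent with the 5-minute broadcast pattern described above.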


Interesting Pattern 2 (Discovery of Vandal)
Using the movement data, we analyse the distance covered and the time spent in the park by each ID on each of the three days. The movement data for Saturday shows something peculiar: ID 1983765 covered a very large distance relative to the time spent in the park. He checked into very few rides and was mostly on the move the entire day.
On Sunday the same ID spends very little time in the park and does not cover much distance either. This ID visited the park on all three days. The behaviour of ID 1983765 looks suspicious, so we analyse his movement further.

Criminal2.jpg


Criminal01.jpg
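
The two measures used here, distance covered and time spent, can be derived from the movement records roughly as follows. This Python sketch assumes each record is a tuple of (ID, timestamp, X, Y) and that distance is the sum of straight-line steps between consecutive positions; both are assumptions, and the sample track is invented.

```python
from collections import defaultdict
from datetime import datetime
from math import hypot

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

def distance_and_duration(moves, fmt=FMT):
    """Return {id: (total distance in grid units, hours in park)}."""
    tracks = defaultdict(list)
    for visitor, ts, x, y in moves:
        tracks[visitor].append((datetime.strptime(ts, fmt), x, y))
    summary = {}
    for visitor, points in tracks.items():
        points.sort()  # order each track by time
        dist = sum(
            hypot(x2 - x1, y2 - y1)
            for (_, x1, y1), (_, x2, y2) in zip(points, points[1:])
        )
        hours = (points[-1][0] - points[0][0]).total_seconds() / 3600
        summary[visitor] = (dist, hours)
    return summary

# Made-up track for a single hypothetical visitor "V1".
sample = [
    ("V1", "2014-06-07 09:00:00", 0, 0),
    ("V1", "2014-06-07 09:30:00", 3, 4),
    ("V1", "2014-06-07 10:00:00", 3, 10),
]
print(distance_and_duration(sample))
```

Dividing distance by hours gives a simple ratio for spotting visitors who move a lot relative to the time they spend in the park.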


To analyse the movement of ID 1983765 further, we plot the X and Y coordinates from the movement data on the park map. The ID enters the park in the morning, travels straight to Creighton Pavilion, and then moves to Scholts express. The ID later exits the park around 11:40 AM.

Criminal 4.jpg


Criminal4.jpg


Based on these movements, ID 1983765 is treated as a suspect.

Task 3

This task involves discovering the time of the vandalism. The communication data can be used to estimate both the time and the location of the vandalism, on the assumption that communication volume peaks shortly after the vandalism happens. Because the data set is very large, we start the exploration with the communication data for Sunday, since Sunday was the most eventful day of the weekend, with the maximum number of visitors.

Fifth.jpg

The heatmap depicts the volume of communication over time for the various locations. There is an intense peak in communication in the Wet Land area between 11 AM and 1 PM. The heatmap also shows that soon after communication peaks in the Wet Land, the volume of communication in the Entry Corridor rises. This could be park visitors trying to contact the park helpdesk to report the vandalism or seek assistance.
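
The numbers behind such a heatmap are simply message counts per location and time slot. A minimal Python sketch follows, assuming hourly slots and records with "Timestamp" and "location" fields; the field names, the slot size and the sample rows are assumptions.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

def location_hour_counts(records, fmt=FMT):
    """Count messages per (location, hour of day) cell."""
    counts = Counter()
    for row in records:
        hour = datetime.strptime(row["Timestamp"], fmt).hour
        counts[(row["location"], hour)] += 1
    return counts

def busiest_cell(counts):
    """Return ((location, hour), count) for the most active cell."""
    return counts.most_common(1)[0]

# Made-up sample rows, for illustration only.
sample = [
    {"Timestamp": "2014-06-08 11:35:00", "location": "Wet Land"},
    {"Timestamp": "2014-06-08 11:40:10", "location": "Wet Land"},
    {"Timestamp": "2014-06-08 11:52:30", "location": "Wet Land"},
    {"Timestamp": "2014-06-08 12:05:00", "location": "Entry Corridor"},
    {"Timestamp": "2014-06-08 12:06:00", "location": "Entry Corridor"},
    {"Timestamp": "2014-06-08 09:10:00", "location": "Wet Land"},
]
print(busiest_cell(location_hour_counts(sample)))
```

Narrower slots, for example 5 or 10 minutes, would give a finer estimate of when a peak begins.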

Let us now take a closer look at the communication data between 11 AM and 1 PM.

Sixth.jpg


We can conclude from the graph that the vandalism happened around 11:30 AM in the Wet Land area.

To support this hypothesis, we check the time of the vandalism against the volume of external calls.

Seventh.jpg

The peak could indicate a rise in external calls just after the vandalism occurred, as visitors could be contacting the police or their family and friends about the vandalism. The peak in external calls supports our hypothesis that the vandalism occurred between 11 AM and 12 PM on Sunday.

Result

From the visual evidence provided above, we can conclude the day and time of the vandalism and identify a possible suspect. Below is a summary of the day and time of the crime and the possible suspect.

Day of Crime: Sunday
Time of Crime: Between 11 AM and 12 PM
Location of Crime: Wet Land (Creighton Pavilion)


Supporting Evidence:

  • Peak in communication volume between 11 AM and 12 PM on Sunday from the Wet Land area.
  • Peak in external calls between 11 AM and 12 PM from the Wet Land area.
  • Broadcast messages/calls sent from the Entry Corridor every 5 minutes, possibly to inform visitors about the vandalism.

Suspected ID: 1983765

Supporting Evidence:

  • This visitor came to the park on Friday, Saturday and Sunday. He covered an unexpectedly large distance on Saturday, which suggests that he was walking around inspecting the park rather than checking into rides.
  • The visitor spent just 4 hours in the park on Sunday. He entered after 8 AM and left around 11:40 AM, which suspiciously coincides with the time of the vandalism.
  • No communication data could be found for this ID: although he visited the park on all three days, he made no communication.
  • On Sunday this visitor walked straight to the Creighton Pavilion in the Wet Land area (the location of the vandalism) and exited the park soon afterwards. This suggests that his movement inside the park was targeted and that he had planned in advance where he wanted to go.