Difference between revisions of "ANLY482 AY2016-17 T2 Group16: PROJECT FINDINGS/Finals"


Revision as of 01:24, 23 April 2017


Analysis Workflow

The workflow diagram (Diagram 1) shows three processes: “data cleaning for general analysis process”, “horizon chart process” and “word cloud process”. The data preparation process produces a cleaned log dataset with additional attributes. These attributes are derived from the URL and from the additional data sets collected.

Diag2.PNG


Duplicate requests

A duplicate request is defined as one of multiple identical requests made by the same user to the same URL within a 30-second time span. These are presumably caused by automatic page refresh.
Duplicate records were removed in two rounds for efficiency. The first round treats each record as a single string and looks back 20 lines in a rolling cache for an exact match. The second round looks only for matches on hashed email and URL within a 30-second range.
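
The two rounds described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function names, the `(timestamp, hashed_email, url)` record layout, and the assumption that records arrive sorted by time are all hypothetical. It also assumes that the rolling cache holds the last 20 lines read, and that the second round compares each request with the most recent kept request for the same user and URL.

```python
from collections import deque
from datetime import datetime, timedelta


def drop_exact_repeats(lines, lookback=20):
    """Round 1 (assumed behaviour): drop a line that exactly matches
    one of the previous `lookback` lines read."""
    recent = deque(maxlen=lookback)  # rolling cache of the last lines read
    kept = []
    for line in lines:
        if line not in recent:
            kept.append(line)
        recent.append(line)  # cache every line read, kept or not
    return kept


def drop_timed_repeats(records, window_seconds=30):
    """Round 2 (assumed behaviour): drop a request when the same
    (hashed_email, url) pair was already kept within the window.

    `records` is an iterable of (timestamp, hashed_email, url) tuples,
    assumed to be sorted by timestamp.
    """
    window = timedelta(seconds=window_seconds)
    last_kept = {}  # (hashed_email, url) -> timestamp of last kept request
    kept = []
    for ts, hashed_email, url in records:
        key = (hashed_email, url)
        prev = last_kept.get(key)
        if prev is not None and ts - prev <= window:
            continue  # duplicate within the 30-second range
        last_kept[key] = ts
        kept.append((ts, hashed_email, url))
    return kept


if __name__ == "__main__":
    t0 = datetime(2017, 1, 1, 0, 0, 0)
    sample = [
        (t0, "u1", "/x"),
        (t0 + timedelta(seconds=10), "u1", "/x"),  # dropped: 10 s after t0
        (t0 + timedelta(seconds=40), "u1", "/x"),  # kept: 40 s after t0
    ]
    for row in drop_timed_repeats(sample):
        print(row)
```

Running the cheap exact-match pass first reduces the number of records the second pass has to parse and key by user and URL, which is one plausible reading of the "efficiency" motivation.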

1. Understanding the data: