Difference between revisions of "ANLY482 AY2016-17 T2 Group16: PROJECT FINDINGS"

From Analytics Practicum

Revision as of 14:17, 15 January 2017


DATA EXPLORATION

Once the objectives are established, the problem is analysed by studying the available data. The purpose of this step is to 1) uncover the underlying structure, 2) extract relevant variables, and 3) test assumptions. Data exploration allows the team to determine the feasibility of the objectives and to agree on any necessary changes with the sponsor.

The data we will work with are requests (i.e. the digital trace): log data in NCSA Common Log Format, comprising billions of records captured by the library's URL rewriting proxy server. This data set captures all user requests to external databases. Its dimensions include user ID, request time, HTTP request line, response time, and user agent. Student data, specifying faculty, admission year, graduation year, and degree program, is also provided to the team. For non-disclosure reasons, the user identifiers (emails) are obfuscated by hashing them to 64-digit hexadecimal strings. The hashed ID will be used to link the two tables. Please refer to the appendix for the complete data dimensions and samples.

Our team has already started exploratory analysis on a sample covering one day of requests, which contains 6.62 million records. We have identified the following features:

  1. Over 25 percent of the requests are for web resources (e.g. js, gif).
  2. 87 percent of the requests are HTTP GET requests.
  3. Requests are unevenly distributed across the databases (stdev = 46.6, avg = 10.2).
  4. The search phrase is encoded differently in the request URL depending on the database.
  5. Requested items are usually serialised in a format specific to each database.
  6. 10 percent of the requests point to internal URLs.
  7. 16 percent of the requests have a status code other than 200.
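
As an illustration of how shares such as those in items 1, 2 and 7 could be computed, the following is a minimal Python sketch. It assumes the standard Common Log Format field order (host, ident, user, timestamp, request line, status, size) and ignores any trailing fields; the regular expression, the list of static file extensions, and the function names are illustrative assumptions, not the team's actual code.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed layout: host ident user [timestamp] "METHOD url PROTOCOL" status size
# Anything after the size field (e.g. user agent) is ignored by this pattern.
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical list of extensions treated as web resources.
STATIC_EXTENSIONS = (".js", ".gif", ".css", ".png", ".jpg", ".ico")


def parse_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None


def summarise(lines):
    """Compute the share of static-resource, GET, and non-200 requests."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            counts["unparsed"] += 1
            continue
        counts["total"] += 1
        path = urlsplit(record["url"]).path.lower()
        if path.endswith(STATIC_EXTENSIONS):
            counts["static"] += 1
        if record["method"] == "GET":
            counts["get"] += 1
        if record["status"] != "200":
            counts["non_200"] += 1
    total = counts["total"] or 1  # avoid division by zero on empty input
    return {
        "static_share": counts["static"] / total,
        "get_share": counts["get"] / total,
        "non_200_share": counts["non_200"] / total,
        "unparsed": counts["unparsed"],
    }
```

Because `summarise` consumes any iterable of lines, it can be fed an open file handle directly, which avoids loading a multi-million-record day of logs into memory at once.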

Judging from the features above, accurately extracting search phrases from the request data is a challenging task. Its feasibility will be ascertained after analysing the full dataset.
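
To make the difficulty concrete, the sketch below shows one possible approach in Python: a per-database lookup of the query parameter that carries the search phrase. The host names and parameter names in `SEARCH_PARAMS` are hypothetical placeholders, since the actual encodings used by each database are not listed here, and the approach only covers databases that pass the phrase as a plain query-string parameter.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical mapping from database host to the query parameter that
# holds the search phrase; real values would come from inspecting the logs.
SEARCH_PARAMS = {
    "db-one.example.com": "q",
    "db-two.example.com": "searchTerm",
}


def extract_search_phrase(url):
    """Return the decoded search phrase, or None if it cannot be found."""
    parts = urlsplit(url)
    param = SEARCH_PARAMS.get(parts.hostname)
    if param is None:
        return None  # unknown database: no extraction rule defined
    # parse_qs decodes percent-encoding and turns '+' into spaces.
    values = parse_qs(parts.query).get(param)
    return values[0].strip() if values else None
```

Databases that embed the phrase in the URL path or in a serialised payload would need their own extraction rules, which is part of why feasibility can only be judged on the full dataset.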

SUMMARY OF DATA EXPLORATION

xxx