Difference between revisions of "ANLY482 AY2016-17 T2 Group16: PROJECT FINDINGS"

From Analytics Practicum

Latest revision as of 14:55, 15 January 2017


DATA EXPLORATION

Once the objectives are established, the team analyses the problem by studying the available data. The purpose of this step is to 1) uncover the underlying structure, 2) extract relevant variables, and 3) test assumptions. Data exploration lets the team determine the feasibility of the objectives and agree on any necessary changes with the sponsor.

The data we will work with are requests (i.e. digital traces) in NCSA Common Log Format, with billions of records captured by the library’s URL rewriting proxy server. This data set captures all user requests to external databases. Its dimensions include user ID, request time, HTTP request line, response time, and user agent. Student data specifying faculty, admission year, graduation year, and degree program is also provided to the team. For non-disclosure reasons, the user identifiers (emails) are obfuscated by hashing them into 64-digit hexadecimal numbers. The hashed ID will be used to link the two tables. Please refer to the appendix for the complete data dimensions and samples.

Our team has already started exploratory analysis on a sample of one day’s requests, which contains 6.62 million records, and has identified the following features:

  1. Over 25 percent of the requests are for web resources (e.g. js, gif).
  2. 87 percent of the requests are HTTP GET requests.
  3. Requests are unevenly distributed across the databases (stdev = 46.6, avg = 10.2).
  4. Search phrases are encoded in multiple ways in the request URL, depending on the database.
  5. Requested items are usually serialised in their own way.
  6. 10 percent of the requests point to internal URLs.
  7. 16 percent of the requests have a status code other than 200.

Judging from the features above, accurately extracting search phrases from the request data is a challenging task. Its feasibility will be ascertained after analysing the full dataset.
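Shares such as the ones listed above could be computed along the following lines. This is a minimal sketch, not the team's actual script: the sample lines, host names and the list of static file extensions are hypothetical, and the regular expression covers only the basic NCSA Common Log Format fields (the real proxy log may carry extra fields such as the user agent).

```python
import re
from urllib.parse import urlsplit

# Basic NCSA Common Log Format: host ident user [time] "request line" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

# Assumed list of extensions that mark a request as a web resource
STATIC_EXTENSIONS = (".js", ".css", ".gif", ".png", ".jpg", ".ico")

# Hypothetical sample lines; the user field stands in for the hashed ID
SAMPLE = [
    '10.0.0.1 - a3f1 [15/Jan/2017:14:55:01 +0800] "GET http://db.example.com/search?q=data+mining HTTP/1.1" 200 5120',
    '10.0.0.1 - a3f1 [15/Jan/2017:14:55:02 +0800] "GET http://db.example.com/static/app.js HTTP/1.1" 200 812',
    '10.0.0.2 - b7c9 [15/Jan/2017:15:10:45 +0800] "POST http://other.example.org/login HTTP/1.1" 302 0',
    '10.0.0.2 - b7c9 [15/Jan/2017:15:11:00 +0800] "GET http://other.example.org/logo.gif HTTP/1.1" 404 -',
]


def parse_line(line):
    """Return a dict of CLF fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None


def summarise(lines):
    """Compute the share of web-resource, GET and non-200 requests."""
    records = [r for r in map(parse_line, lines) if r is not None]
    total = len(records)
    if total == 0:
        return {"total": 0, "static": 0.0, "get": 0.0, "non_200": 0.0}
    static = sum(
        urlsplit(r["url"]).path.lower().endswith(STATIC_EXTENSIONS) for r in records
    )
    get = sum(r["method"] == "GET" for r in records)
    non_200 = sum(r["status"] != "200" for r in records)
    return {
        "total": total,
        "static": static / total,
        "get": get / total,
        "non_200": non_200 / total,
    }


summary = summarise(SAMPLE)
```

Lines that do not match the pattern are skipped rather than raising an error, so the shares are relative to the parseable records only.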

DATA PREPARATION

The given data is well formatted, but we have identified the following areas where data preparation is required:

  1. Remove irrelevant requests, such as those to web resources or general web pages.
  2. Student and request data are collected from different sources and stored in separate tables; the tables must be joined before student request patterns can be analysed.
  3. Extract datetime components from the CLF datetime string to slice the data.
  4. Extract the domain of each request to identify the database requested.
  5. Split the data lines into three groups: 1) with an explicit search phrase in the URL, 2) without an explicit search phrase but with other forms of reference to a title
  6. Extract the requested title/search phrase.
  7. Infer the category of resource from the requested title/search phrase.
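Steps 3, 4 and 6 above can be sketched with the Python standard library. The query parameter names in SEARCH_PARAMS are assumptions for illustration only; as noted earlier, each database encodes search phrases differently, so the real names have to be found by inspecting the requests per database. The grouping here is also simplified to "explicit search phrase found" versus "not found".

```python
from datetime import datetime
from urllib.parse import urlsplit, parse_qs

# Hypothetical parameter names that may carry a search phrase
SEARCH_PARAMS = ("q", "query", "search", "term")


def parse_clf_time(value):
    """Parse a CLF timestamp such as '15/Jan/2017:14:55:01 +0800'.

    The %b month abbreviation assumes an English (C) locale.
    """
    return datetime.strptime(value, "%d/%b/%Y:%H:%M:%S %z")


def extract_domain(url):
    """Return the lower-cased host name of the requested URL."""
    return (urlsplit(url).hostname or "").lower()


def extract_search_phrase(url):
    """Return the decoded search phrase, or None if no known parameter is present."""
    params = parse_qs(urlsplit(url).query)
    for name in SEARCH_PARAMS:
        if name in params:
            return params[name][0]
    return None


def prepare(time_str, url):
    """Derive the datetime components, domain and search phrase for one request."""
    stamp = parse_clf_time(time_str)
    phrase = extract_search_phrase(url)
    return {
        "date": stamp.date().isoformat(),
        "hour": stamp.hour,
        "weekday": stamp.weekday(),  # Monday is 0, Sunday is 6
        "domain": extract_domain(url),
        "search_phrase": phrase,
        "group": "explicit_search" if phrase is not None else "no_explicit_search",
    }


row = prepare(
    "15/Jan/2017:14:55:01 +0800",
    "http://DB.Example.com/search?q=data+mining&page=2",
)
```

For the example URL, parse_qs decodes the plus sign, so the extracted phrase is "data mining" rather than "data+mining".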

DATA ANALYSIS

1. Slice records by various time frames and user attributes, and count the number of records in each slice. By analysing the number of requests under each dimension, we will be able to capture the characteristics of the databases.
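As an illustration of this slicing, the sketch below joins prepared request rows to the student table on the hashed ID and counts requests per slice. The rows, shortened IDs, faculty names and domains are all hypothetical, and requests without a matching student record are dropped (an inner join), which is an assumption rather than a decision stated in the proposal.

```python
from collections import Counter

# Hypothetical prepared request rows: (hashed user ID, hour of day, database domain)
REQUESTS = [
    ("a3f1", 14, "db.example.com"),
    ("a3f1", 14, "db.example.com"),
    ("b7c9", 15, "other.example.org"),
    ("ffff", 15, "db.example.com"),  # no matching student record
]

# Hypothetical student table keyed by the same hashed ID
STUDENTS = {
    "a3f1": {"faculty": "Business", "admission_year": 2014},
    "b7c9": {"faculty": "Law", "admission_year": 2015},
}


def count_by(requests, students, key):
    """Join requests to students on the hashed ID and count rows per slice.

    `key` receives the request and student dicts and returns the slice label.
    """
    counts = Counter()
    for user_id, hour, domain in requests:
        student = students.get(user_id)
        if student is None:
            continue  # inner join: skip requests without a student record
        counts[key({"hour": hour, "domain": domain}, student)] += 1
    return counts


by_faculty = count_by(REQUESTS, STUDENTS, lambda req, stu: stu["faculty"])
by_hour_and_db = count_by(
    REQUESTS, STUDENTS, lambda req, stu: (req["hour"], req["domain"])
)
```

Passing the slice as a function keeps one counting routine for every dimension, so adding a new time frame or user attribute only needs a new key function.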

2. Apply K-means clustering to profile students on the following dimensions:

  • user agent
  • most frequently used database
  • most frequently requested resource category
  • faculty
  • whether the user is on the Dean’s List

This result will help management better understand user behaviour across different student segments.
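The clustering step could look roughly like the sketch below. It is a minimal pure-Python illustration under several assumptions: the student profiles are hypothetical, the categorical dimensions are one-hot encoded before clustering, and the starting centroids are simply the first k distinct profiles (a deterministic simplification of the usual random initialisation). A production run would more likely use an established library implementation.

```python
COLUMNS = ["user_agent", "top_database", "top_category", "faculty", "deans_list"]

# Hypothetical student profiles, one row per student
PROFILES = [
    {"user_agent": "desktop", "top_database": "db-a", "top_category": "finance", "faculty": "Business", "deans_list": True},
    {"user_agent": "mobile", "top_database": "db-b", "top_category": "case law", "faculty": "Law", "deans_list": False},
    {"user_agent": "mobile", "top_database": "db-a", "top_category": "finance", "faculty": "Business", "deans_list": True},
    {"user_agent": "mobile", "top_database": "db-b", "top_category": "case law", "faculty": "Law", "deans_list": False},
    {"user_agent": "desktop", "top_database": "db-a", "top_category": "finance", "faculty": "Business", "deans_list": True},
]


def one_hot(rows, columns):
    """Encode categorical columns as 0/1 vectors, one position per (column, value)."""
    features = sorted({(c, row[c]) for row in rows for c in columns})
    index = {feature: i for i, feature in enumerate(features)}
    vectors = []
    for row in rows:
        vec = [0.0] * len(features)
        for c in columns:
            vec[index[(c, row[c])]] = 1.0
        vectors.append(vec)
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def initial_centroids(vectors, k):
    """Use the first k distinct vectors as starting centroids."""
    centroids = []
    for v in vectors:
        if v not in centroids:
            centroids.append(list(v))
        if len(centroids) == k:
            return centroids
    raise ValueError("fewer than k distinct profiles")


def kmeans(vectors, k, iterations=20):
    """Alternate between assigning points to the nearest centroid and updating centroids."""
    centroids = initial_centroids(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        for j in range(k):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


labels, centroids = kmeans(one_hot(PROFILES, COLUMNS), k=2)
```

One-hot encoding is only one way to feed categorical attributes to K-means; since Euclidean distance on 0/1 vectors just counts mismatching attributes, the choice of encoding and of k would need to be validated on the real data.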

3. Perform Market Basket Analysis on the titles viewed by a student in a session. The results could potentially be used for a recommendation system.
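A small sketch of the Market Basket Analysis idea follows, restricted to rules between pairs of titles. The sessions and titles are hypothetical, requests are assumed to be already grouped into sessions (the proposal does not define how a session is delimited), and the minimum support threshold is an arbitrary choice for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions: the set of titles one student viewed in one session
SESSIONS = [
    {"Title A", "Title B", "Title C"},
    {"Title A", "Title B"},
    {"Title A", "Title C"},
    {"Title B", "Title D"},
]


def pair_rules(sessions, min_support=0.5):
    """Return association rules lhs -> rhs between pairs of titles.

    support    = share of sessions containing both titles
    confidence = share of sessions with lhs that also contain rhs
    lift       = confidence divided by the overall share of sessions with rhs
    """
    n = len(sessions)
    item_counts = Counter(item for s in sessions for item in s)
    pair_counts = Counter(
        pair for s in sessions for pair in combinations(sorted(s), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            lift = confidence / (item_counts[rhs] / n)
            rules.append(
                {"lhs": lhs, "rhs": rhs, "support": support,
                 "confidence": confidence, "lift": lift}
            )
    # Strongest associations first; ties broken alphabetically for stable output
    return sorted(rules, key=lambda r: (-r["lift"], r["lhs"], r["rhs"]))


rules = pair_rules(SESSIONS)
```

A lift above 1 means the two titles appear together more often than their individual popularity would suggest, which is the kind of rule a recommendation feature could use.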

REPORTING AND DATA PROCESSING PIPELINE

At the end of the semester, the analysis results will be presented with interactive charts, and a formal report will be delivered to the library analytics team. The report will contain all useful findings from our study and actionable recommendations for the library, and the team will advise on areas of interest for further study. A data processing pipeline consisting of the necessary scripts and programs will also be delivered, with instructions on how to set it up and run it, so that the library analytics team can rerun the analysis on new request log data.