Difference between revisions of "ANLY482 AY2016-17 T2 Group16: PROJECT OVERVIEW"

From Analytics Practicum
 
==<div style="background: #00ADEF; line-height: 0.3em; border-left: #00ADEF solid 13px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;"><font face ="Garamond" color= "white" size="3">BUSINESS PROBLEM & MOTIVATION</font></div></div>==

<div><font face="Roboto">
The library subscribes to eBook platforms with content from a range of publishers. These databases have greatly enriched the library’s resources and form an integral part of the library repository.<br />

When a student requests content from the databases, the request goes through the library’s proxy server. The proxy server captures a digital trace for each user request, which contains the request URL, user ID, and user agent. With the aim of making its services easier to use, the management hopes to better understand the usage patterns of the databases. The challenge is to programmatically extract the meaningful user inputs from billions of request records, as most records are irrelevant to the project objectives.<br />

The main focus of the project is to understand how useful the library eBook databases are in fulfilling student queries. By analysing the proxy trace, we can define the characteristics of the users and examine the usage rate of each database. The success rate of student queries is also among the target findings. Since the dataset is not static, we aim to provide a processing pipeline that helps the sponsor look for new findings as new students enrol in the future.
</font></div>
==<div style="background: #00ADEF; line-height: 0.3em; border-left: #00ADEF solid 13px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;"><font face ="Garamond" color= "white" size="3">PROJECT OBJECTIVE</font></div></div>==

<div><font face="Roboto">
===Understand the characteristics of databases and user requests===
:1. Help the management understand and act on each database by profiling the e-book databases. The parameters considered include:
::*user profiles (faculty, program, year)
::*number of requests
::*popular requested items
::The profiling will be done on multiple timescales (e.g. by hour of the day, by day of the week, or by week) to identify chronological patterns.
::The analysis results will help decision makers understand who requested which items from each database. By slicing the data on multiple timescales, the team will be able to identify the peak periods and the general trend of requests for each database.
:2. Help the management better understand user behaviour by profiling users; identify distinctive e-book usage patterns for students from different faculties or with different academic performance. The usage patterns considered are chronological patterns, devices used, and sites requested. This would help the management better understand the users and devise better approaches to improve the service for each type of user.

===How users approach the databases===
# The timing of student access to the databases is another aspect of our focus, which digs deeper into the findings and patterns regarding database users. One focus is to compare the behaviour patterns of dean's-list students against those of other students, in particular when they access the materials. One possible angle is to find out whether any group is particularly prone to last-minute work.
# The consistency of student access to the resources extends the previous objective, as it is useful to know how students use the materials across the entire semester: whether access is spread out or intensely concentrated in a certain period. This can give managers insight into student behaviour patterns and inform suggestions on workshop planning and resource preparation.

===How e-resources are related===
Within the sessions of all users, we can carry out ‘Market Basket Analysis’ to identify popular combinations of e-books being viewed and downloaded. The purpose is to potentially lay a foundation for e-resource recommendations in the future.
</font></div>
  
 
==<div style="background: #00ADEF; line-height: 0.3em; border-left: #00ADEF solid 13px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;"><font face ="Garamond" color= "white" size="3">METHODOLOGY</font></div></div>==
 
 
===<div><font face="Roboto">Data exploration</font></div>===
 
Once the objectives are established, the problem is analysed by studying the available data. The purpose of this step is to 1) uncover the underlying structure, 2) extract relevant variables, and 3) test assumptions. Through data exploration, the team will be able to determine the feasibility of the objectives and make any necessary changes together with the sponsor.

The data we will work with are requests (i.e. the digital trace): a dataset in NCSA Common Log Format with billions of records captured by the library’s URL-rewriting proxy server. This dataset captures all user requests to external databases. Its dimensions include user ID, request time, HTTP request line, response time, and user agent. Student data, specifying faculty, admission year, graduation year, and degree program, is also provided to the team. For non-disclosure reasons, the user identifiers (emails) are obfuscated by hashing them to 64-digit hexadecimal numbers. The hashed ID will be used to link the two tables. Please refer to the appendix for the complete data dimensions and samples.
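As an illustration of how one such record could be read, the following is a minimal Python sketch. The exact field layout of the proxy log is not given here, so the layout in the pattern (standard CLF fields followed by a quoted user agent), the sample line, and the names <code>LOG_PATTERN</code> and <code>parse_line</code> are all assumptions made for the example.

```python
import re
from datetime import datetime

# Assumed layout (hypothetical): host, ident, hashed user ID, [timestamp],
# "request line", status, bytes, and an optional quoted user agent at the end.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # CLF timestamps look like 15/Jan/2017:15:01:32 +0800
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record


# Made-up sample line; the 64-character user ID stands in for a hashed email.
sample = (
    "10.0.0.1 - " + "ab" * 32 + " [15/Jan/2017:15:01:32 +0800] "
    '"GET http://ebooks.example.com/search?q=data+mining HTTP/1.1" 200 5123 '
    '"Mozilla/5.0"'
)
record = parse_line(sample)
```

Lines that do not match the assumed layout return <code>None</code>, so they can be counted and inspected separately rather than silently dropped.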
 
Our team has already started exploratory analysis on a sample of one day’s requests, which contains 6.62 million records, and has identified the following features:

# Over 25 percent of the requests are for web resources (e.g. js, gif).
# 87 percent of the requests are HTTP GET requests.
# Requests are unevenly distributed across the databases (stdev = 46.6, avg = 10.2).
# The search phrase is encoded differently in the request URL depending on the database.
# Requested items are usually serialised in their own way.
# 10 percent of the requests point to internal URLs.
# 16 percent of the requests have a status code other than 200.

Judging from the features above, accurately extracting search phrases from the request data is a challenging task. Its feasibility will be ascertained after analysing the full dataset.
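Shares like the ones listed above can be computed with a few simple predicates. The sketch below is illustrative only: the extension list in <code>STATIC_EXTENSIONS</code>, the helper names, and the four sample records are assumptions, not the team's actual rules or data.

```python
from urllib.parse import urlparse

# Hypothetical list of extensions treated as web resources.
STATIC_EXTENSIONS = (".js", ".gif", ".css", ".png", ".jpg", ".ico")


def is_web_resource(url):
    """True if the URL path ends with a static-resource extension."""
    path = urlparse(url).path.lower()
    return path.endswith(STATIC_EXTENSIONS)


def share(records, predicate):
    """Fraction of records for which predicate(record) is true."""
    if not records:
        return 0.0
    return sum(1 for r in records if predicate(r)) / len(records)


# Made-up records as (method, url, status) tuples.
records = [
    ("GET", "http://ebooks.example.com/static/app.js", 200),
    ("GET", "http://ebooks.example.com/search?q=statistics", 200),
    ("POST", "http://ebooks.example.com/login", 302),
    ("GET", "http://ebooks.example.com/img/logo.gif?v=2", 304),
]

web_resource_share = share(records, lambda r: is_web_resource(r[1]))
get_share = share(records, lambda r: r[0] == "GET")
non_200_share = share(records, lambda r: r[2] != 200)
```

Checking the extension against the parsed path rather than the raw URL keeps query strings such as <code>?v=2</code> from hiding a static resource.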
 
 
===<div><font face="Roboto">Data Preparation</font></div>===
 
The given data is well formatted, but we have identified that preparation is required in the following areas:

# Remove irrelevant requests, such as those to web resources or general web pages.
# Student and request data are collected from different sources and stored in separate tables; the two must be joined before student request patterns can be analysed.
# Extract datetime components from the CLF datetime string to slice the data.
# Extract the domain of each request to identify the database requested.
# Split the data lines into three groups: 1) with an explicit search phrase in the URL, 2) without an explicit search phrase but with other forms of reference to a title
# Extract the requested title/search phrase.
# Infer the category of resource from the requested title/search phrase.
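The datetime, domain, and search-phrase steps above can be sketched as follows. This is a minimal example under stated assumptions: the query-parameter names in <code>SEARCH_KEYS</code> are hypothetical (each database encodes search phrases differently), and the helper names and sample requests are invented for illustration.

```python
from collections import Counter
from datetime import datetime
from urllib.parse import parse_qs, urlparse

# Hypothetical query-parameter names that may carry a search phrase.
SEARCH_KEYS = ("q", "query", "searchterm")


def time_parts(clf_time):
    """Split a CLF timestamp into keys usable for slicing."""
    moment = datetime.strptime(clf_time, "%d/%b/%Y:%H:%M:%S %z")
    return {
        "hour": moment.hour,
        "weekday": moment.weekday(),  # Monday is 0, Sunday is 6
        "week": moment.isocalendar()[1],  # ISO week number
    }


def domain_of(url):
    """Host name of the request, used to identify the database."""
    return urlparse(url).hostname


def search_phrase(url):
    """Decoded search phrase if the URL carries one explicitly, else None."""
    params = parse_qs(urlparse(url).query)
    for key in SEARCH_KEYS:
        if key in params:
            return params[key][0]
    return None


# Made-up (timestamp, url) pairs.
requests = [
    ("15/Jan/2017:15:01:32 +0800", "http://ebooks.example.com/search?q=data+mining"),
    ("15/Jan/2017:15:40:10 +0800", "http://library.example.org/book/12345"),
    ("16/Jan/2017:09:05:00 +0800", "http://ebooks.example.com/search?query=machine%20learning"),
]

by_hour = Counter(time_parts(t)["hour"] for t, _ in requests)
by_domain = Counter(domain_of(u) for _, u in requests)
phrases = [search_phrase(u) for _, u in requests]
```

Requests for which <code>search_phrase</code> returns <code>None</code> fall into the group without an explicit search phrase and would need database-specific handling to recover a title.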
 
 
===<div><font face="Roboto">Data analysis</font></div>===
 
# Slice records by various time frames and user attributes, and count the number of records in each slice. By analysing the number of requests under each dimension, we will be able to capture the characteristics of the databases.
# Apply K-means clustering to profile students on the dimensions of:
#* user agent
#* most frequently used database
#* most frequently requested resource category
#* faculty
#* whether the user is on the Dean’s List
#: The result will help the management better understand user behaviour for different segments of students.
# Perform Market Basket Analysis on the titles viewed by a student in a session. The results can potentially be used for a recommendation system.
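To make the Market Basket Analysis step concrete, the sketch below computes support and confidence for pairs of titles that appear in the same session. It is a simplified pairwise version written for illustration, not a full frequent-itemset algorithm such as Apriori; the function name <code>pair_rules</code>, the <code>min_support</code> threshold, and the sample sessions are assumptions.

```python
from collections import Counter
from itertools import combinations


def pair_rules(sessions, min_support=0.5):
    """Map each ordered pair (a, b) to (support, confidence of a -> b).

    Support is the share of sessions containing both titles; confidence is
    the share of sessions containing a that also contain b. Only pairs whose
    support reaches min_support are returned.
    """
    total = len(sessions)
    item_counts = Counter()
    pair_counts = Counter()
    for titles in sessions:
        unique = sorted(set(titles))  # count each title once per session
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / total
        if support >= min_support:
            rules[(a, b)] = (support, count / item_counts[a])
            rules[(b, a)] = (support, count / item_counts[b])
    return rules


# Made-up sessions, each a list of titles viewed by one student.
sessions = [
    ["Statistics 101", "Data Mining"],
    ["Statistics 101", "Data Mining", "Corporate Finance"],
    ["Corporate Finance"],
    ["Statistics 101"],
]
rules = pair_rules(sessions)
```

In this toy data, every session that contains "Data Mining" also contains "Statistics 101", so that rule has a confidence of 1.0, while the reverse rule is weaker because "Statistics 101" also appears on its own.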
 
 
===<div><font face="Roboto">Reporting and data processing pipeline</font></div>===
 
  
 
<!--/Content-->
 

Latest revision as of 15:01, 15 January 2017
