ANLY482 AY2016-17 T2 Group16: PROJECT FINDINGS

DATA EXPLORATION

After establishing the objectives, we analyse the problem by studying the available data. The purpose of this step is to 1) uncover the underlying structure, 2) extract relevant variables, and 3) test assumptions. Through data exploration, the team will be able to determine the feasibility of the objectives and agree on any necessary changes with the sponsor. The data we will work with are requests (i.e. digital traces) in NCSA Common Log Format, with billions of records captured by the library’s URL rewriting proxy server. This data set captures all user requests to external databases and includes the dimensions of user ID, request time, HTTP request line, response time, and user agent. Student data, specifying faculty, admission year, graduation year, and degree program, is also provided to the team. For non-disclosure reasons, the user identifiers (emails) are obfuscated by hashing into 64-digit hexadecimal numbers; the hashed ID will be used to link the two tables. Please refer to the appendix for the complete data dimensions and samples. Our team has already started exploratory analysis on a sample of one day’s requests, which contains 6.62 million records, and we have identified the following features:

  1. Over 25 percent of the requests are for web resources (e.g. js, gif).
  2. 87 percent of the requests are HTTP GET requests.
  3. Requests are unevenly distributed across the databases (stdev = 46.6, avg = 10.2).
  4. The search phrase is encoded in the request URL in multiple ways, depending on the database.
  5. Requested items are usually serialised in a database-specific way.
  6. 10 percent of the requests point to internal URLs.
  7. 16 percent of the requests have a status code other than 200.

Judging from the features above, accurately extracting the search phrase from the request data is a challenging task; its feasibility will be ascertained after analysing the full dataset.
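To make the log structure concrete, the minimal Python sketch below parses one proxy log line. The field layout is an assumption: something close to the standard NCSA Common Log Format, with the hashed user ID in the user position and referrer/user-agent fields appended as in the combined format; the exact order in the proxy logs may differ, and the sample line is invented for illustration.

  import re

  # Assumed layout: host, ident, hashed user ID, timestamp, request line,
  # status, size, and optional referrer/user-agent fields.
  CLF_PATTERN = re.compile(
      r'(?P<host>\S+) \S+ (?P<user_hash>\S+) '            # client and hashed user ID
      r'\[(?P<time>[^\]]+)\] '                            # e.g. 10/Jan/2017:13:55:36 +0800
      r'"(?P<method>\S+) (?P<url>\S+) \S+" '              # request line: method, URL, protocol
      r'(?P<status>\d{3}) (?P<size>\S+)'                  # status code and response size
      r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?')  # combined-format extras, if present

  def parse_line(line):
      """Return a dict of request fields, or None if the line does not match."""
      match = CLF_PATTERN.match(line)
      return match.groupdict() if match else None

  # Invented sample line, for illustration only.
  sample = ('10.0.0.1 - ab12cd34 [10/Jan/2017:13:55:36 +0800] '
            '"GET http://example-database.com/search?q=data+mining HTTP/1.1" '
            '200 5120 "-" "Mozilla/5.0"')
  print(parse_line(sample))

Parsed this way, the proportions above (web-resource requests, GET requests, non-200 status codes) follow from simple counts over the extracted fields.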

DATA PREPARATION

The given data is well-formatted, but we have identified that data preparation is required in the following areas:

  1. Remove irrelevant requests, such as those for web resources or general web pages
  2. Join the student and request data, which are collected from different sources and stored in separate tables, before student request patterns can be analysed (see the sketch after this list)
  3. Extract datetime components from the CLF datetime string to slice the data
  4. Extract the domain of each request to identify the database requested
  5. Split the data lines into three groups: 1) those with an explicit search phrase in the URL, 2) those without an explicit search phrase but with another form of reference to a title, and 3) those with neither
  6. Extract the requested title/search phrase
  7. Infer the category of resource from the requested title/search phrase
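As a sketch of steps 2 to 6, one possible pandas implementation is shown below. The column names (user_hash, time, url) and the use of a q query parameter as the search phrase are assumptions for illustration; as noted in the exploration findings, each database encodes searches differently, so the real extraction will need per-database rules.

  import pandas as pd
  from urllib.parse import urlsplit, parse_qs

  def prepare(requests: pd.DataFrame, students: pd.DataFrame) -> pd.DataFrame:
      # Step 2: join the request and student tables on the hashed user ID.
      df = requests.merge(students, on='user_hash', how='left')

      # Step 3: CLF timestamps look like 10/Jan/2017:13:55:36 +0800.
      ts = pd.to_datetime(df['time'], format='%d/%b/%Y:%H:%M:%S %z')
      df['date'] = ts.dt.date
      df['hour'] = ts.dt.hour
      df['weekday'] = ts.dt.day_name()

      # Step 4: the domain identifies which external database was requested.
      df['domain'] = df['url'].map(lambda u: urlsplit(u).netloc.lower())

      # Step 6 (simplified): treat a 'q' query parameter as the search phrase.
      df['search_phrase'] = df['url'].map(
          lambda u: parse_qs(urlsplit(u).query).get('q', [None])[0])
      return df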

DATA ANALYSIS

1. Slice the records by various time frames and user attributes, and count the number of requests. By analysing the request counts along each dimension, we will be able to capture the usage characteristics of the databases.
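For instance, with a prepared table like the one sketched in the previous section, the counting reduces to a handful of group-bys (the column names are again assumed, not a fixed schema):

  import pandas as pd

  def request_counts(df: pd.DataFrame) -> dict:
      """Count requests along a few slicing dimensions of the prepared table."""
      return {
          'by_hour': df.groupby('hour').size(),
          'by_database': df.groupby('domain').size(),
          'by_faculty_and_database': df.groupby(['faculty', 'domain']).size(),
      }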

2. K-means clustering to profile students along the dimensions of:

  • user agent
  • most frequently used database
  • most frequently requested resource category
  • faculty
  • whether the student is on the Dean’s List

This result will help the management better understand user behaviour across different segments of students.
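A minimal clustering sketch with scikit-learn is given below. It assumes a per-student feature table with one row per student and columns named after the dimensions above (these names are placeholders); since the dimensions are categorical, they are one-hot encoded first, and the number of clusters k is a parameter to be tuned.

  import pandas as pd
  from sklearn.cluster import KMeans
  from sklearn.preprocessing import StandardScaler

  def cluster_students(features: pd.DataFrame, k: int = 5) -> pd.Series:
      """Assign each student to one of k segments; column names are assumed."""
      # One-hot encode the categorical profiling dimensions.
      encoded = pd.get_dummies(features[['user_agent', 'top_database',
                                         'top_category', 'faculty', 'deans_list']])
      scaled = StandardScaler().fit_transform(encoded)
      labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(scaled)
      return pd.Series(labels, index=features.index, name='segment')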

3. Perform Market Basket Analysis on the titles viewed by a student within a session. The resulting associations can potentially be used for a recommendation system.
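One way to run this is with the apriori implementation in the mlxtend package, sketched below. Here sessions is assumed to be a list of title lists (one list per student session), and the support and confidence thresholds are placeholders to be tuned on the full data.

  import pandas as pd
  from mlxtend.preprocessing import TransactionEncoder
  from mlxtend.frequent_patterns import apriori, association_rules

  def title_rules(sessions, min_support=0.01, min_confidence=0.3):
      """Mine association rules over the titles viewed together in a session."""
      encoder = TransactionEncoder()
      onehot = pd.DataFrame(encoder.fit(sessions).transform(sessions),
                            columns=encoder.columns_)
      frequent = apriori(onehot, min_support=min_support, use_colnames=True)
      # Rules of the form {title A} -> {title B} could seed a simple recommender.
      return association_rules(frequent, metric='confidence',
                               min_threshold=min_confidence)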

REPORTING AND DATA PROCESSING PIPELINE

At the end of the semester, the analysis results will be presented with interactive charts. A formal report will be delivered to the library analytics team, consisting of all useful findings from our study and actionable recommendations for the library; the team will also advise on areas of interest for further study. In addition, a data processing pipeline consisting of the necessary scripts and programs will be delivered, with instructions on setting up and executing the process, so that the library analytics team can rerun the analysis on new request log data.