ANLY482 AY2016-17 T2 Group16: HOME/Interim

From Analytics Practicum
Revision as of 16:29, 18 February 2017

INTERIM PROGRESS

Overview

The objective of the project is to provide insights on users and ebook databases for the Li Ka Shing Library Analytics Team.

Data Integration and Filtering

Extracted Table

Challenges

Choice of Key Measurements

Data Cleaning and Exploration

Issues

Exploration

Findings

Page Level

Post Level

Revised Methodology

Proxy Approach

Filter for search-result pages and downloads only. Use the keyword "query" and the regular expression "\Wq=" to identify search results, and use file size to identify downloads.
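The filtering rules above can be sketched as a small classifier. This is a minimal illustration, not the project's actual code: the 1 MB download cutoff and the function name are assumptions, while the "query" keyword and "\Wq=" regex come from the approach described here.

```python
import re

# Heuristics from the proxy approach: a request is a search-result page if its
# URL contains the word "query" or a "q=" parameter preceded by a non-word
# character; it is a download if the response size exceeds a threshold.
SEARCH_RE = re.compile(r"query|\Wq=")
DOWNLOAD_SIZE_THRESHOLD = 1_000_000  # bytes; hypothetical cutoff, not from the source

def classify_request(url: str, response_bytes: int) -> str:
    if SEARCH_RE.search(url):
        return "search"
    if response_bytes > DOWNLOAD_SIZE_THRESHOLD:
        return "download"
    return "other"
```

For example, a request to a URL containing "?query=" is tagged as a search regardless of size, while a large PDF response with no query keyword falls through to the download rule.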

Pattern Approach

Systematically identify the useful URL patterns and determine their meaning.
Motivation: identify the user action behind each request.

Original steps:

  1. For each domain, summarise all requests into URL patterns, starting from the most popular domains (the ones with the most user counts).
  2. For each pattern, find one example and describe the user action for that example (whether it shows a search result, views an item, downloads an item, or other).
  3. For each pattern, get request counts and user counts (based on sessions).
  4. Recognise and attach actions to each pattern.
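The summarisation steps can be sketched as a single pass over the request log. This is an illustrative sketch, not the project's code: the `(session_id, url)` input shape and the stand-in `pattern_of` function (which simply strips the query string) are assumptions.

```python
from collections import Counter, defaultdict

def pattern_of(url: str) -> str:
    # Placeholder for the real pattern-extraction step; here we
    # just strip the query string.
    return url.split("?")[0]

def summarise(requests):
    """requests: iterable of (session_id, url) pairs.

    Returns {pattern: (request_count, user_count, example_url)},
    where user_count is based on distinct sessions.
    """
    request_counts = Counter()
    sessions = defaultdict(set)
    example = {}
    for session_id, url in requests:
        pat = pattern_of(url)
        request_counts[pat] += 1
        sessions[pat].add(session_id)
        example.setdefault(pat, url)  # keep one example per pattern
    return {pat: (request_counts[pat], len(sessions[pat]), example[pat])
            for pat in request_counts}
```

Sorting the result by user count then gives the "most popular domains first" ordering that step 1 calls for.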

New approach: identify databases of interest based on sponsor preference (and perhaps technical feasibility).

Database Domains (appearing in sample data)

lawnet

  • .lawnet.sg

westlaw

  • .westlaw.co.uk
  • .westlaw.com
  • .westlaw.co

Ebsco ebooks

  • .ebscohost.com

MyiLibrary

  • .myilibrary.com

ebrary

  • ebrary.com

To extract patterns for each domain, we used the Python urlparse library to extract each URL's path and parameters. The path (sub-domain hierarchy) and the set of parameter names together define a pattern; the ordering of parameters does not matter.

We realised that paths may contain hashed string variables, so we replaced them with a placeholder consisting of a type marker (i.e. $NUM for numerical, $ALN for alphanumerical, or $STR for strings with non-alphanumerical characters) followed by the segment's length. For example, "YnRoX183Nzg0MzUxNF9fQU41" and "YnRoX183Nzg0MzUxNF9fQU42" in a path are both replaced by the placeholder "$ALN24". The purposes of doing this are to accurately reflect the URL patterns, to reduce the number of patterns for readability, and to identify book identifiers such as ISBN and DOI.

We then attach a count and an example to each pattern, and identify patterns for search, view or download. This requires some manual work, although user inputs are typically easy to identify. From pattern extraction, we also enlarged the keyword pool for "junk requests" (e.g. 341 of 8098 requests to ebrary.com are font files with extensions ttf, woff and eot).

We also tried revisiting some pages to identify their function. For example, changing

http://www.ebrary.com/lib/smu/docDetail.action?docID=10572555&p00=thailand+%22chinese+diaspora%22&token=967a22b8-85dc-48b5-bfa3-09178cf75492

to the proxied URL

http://site.ebrary.com.libproxy.smu.edu.sg/lib/smu/docDetail.action?docID=10572555&p00=thailand+%22chinese+diaspora%22&token=967a22b8-85dc-48b5-bfa3-09178cf75492

takes us to the item page.
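The pattern-extraction step can be sketched with the standard-library `urllib.parse` module (the modern home of urlparse). This is an illustrative sketch, not the project's code: the 20-character cutoff for treating a path segment as a variable is an assumption, as are the function names.

```python
from urllib.parse import urlparse, parse_qsl

def placeholder(segment: str) -> str:
    # Assumption: segments shorter than 20 characters are literal path words;
    # longer ones are treated as hashed variables and replaced by a typed
    # placeholder followed by their length.
    if len(segment) < 20:
        return segment
    if segment.isdigit():
        return f"$NUM{len(segment)}"
    if segment.isalnum():
        return f"$ALN{len(segment)}"
    return f"$STR{len(segment)}"

def url_pattern(url: str) -> str:
    parts = urlparse(url)
    path = "/".join(placeholder(seg) for seg in parts.path.split("/"))
    # Parameter names only, sorted so that ordering does not matter.
    params = sorted({name for name, _ in
                     parse_qsl(parts.query, keep_blank_values=True)})
    return f"{parts.netloc}{path}?{','.join(params)}"
```

Under this sketch, two docDetail URLs that differ only in hashed segments and parameter order collapse to the same pattern string.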

“docID=10572555” identifies the item viewed, while “p00=thailand+%22chinese+diaspora%22”, i.e. thailand "chinese diaspora", is the user's input search query. This query corroborates with the item title “Library of China Studies : Chinese Diaspora in South-East Asia : The Overseas Chinese in IndoChina (1)”. From the query, we can at least obtain keywords for topics.

Database Domains (appearing in sample data) and Keywords

lawnet

  • .lawnet.sg

  • “pdfFileName” for file names (e.g. “[1996] 3 SLR(R) 0371.pdf”)
  • “contentDocID” for internal content location (e.g. [1985-1986] SLR(R) 0241.xml)
  • “queryStr=” for query input

westlaw

  • .westlaw.co.uk
  • .westlaw.com
  • .westlaw.co
  • .westlaw.co.uk contains “docguid”, but the content is coded with an internal ID

“query”


Ebsco ebooks

  • .a.ebscohost.com
  • .b.ebscohost.com

“bquery” as a parameter name indicates search input

MyiLibrary

  • .myilibrary.com

“tid” for title, but it is coded with an internal ID

ebrary

  • ebrary.com

“p00” for search queries; “docSearch.action?docID=10596700&p00=” is document data for rendering

Decipher Query Strings and Content IDs

As we cannot revisit the pages as we desire, we cannot tag all URLs with user actions based on the matched pattern. However, the pattern approach helped us extract the query strings.

Integration:

  • Use the training dataset to extract patterns
  • Construct rules to get the access type and query/ID
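The keyword tables above can be turned into extraction rules. The sketch below is an assumption about how such rules might be organised, not the project's implementation; the parameter names (queryStr, bquery, p00, docID, docguid, tid, contentDocID, pdfFileName) come from the sample data described above, but the dictionary layout and function name are illustrative.

```python
from urllib.parse import urlparse, parse_qs

# Per-database parameter names observed in the sample data.
QUERY_PARAMS = {
    "lawnet.sg": ["queryStr"],
    "westlaw.com": ["query"],
    "ebscohost.com": ["bquery"],
    "ebrary.com": ["p00"],
}
ID_PARAMS = {
    "lawnet.sg": ["contentDocID", "pdfFileName"],
    "westlaw.co.uk": ["docguid"],
    "myilibrary.com": ["tid"],
    "ebrary.com": ["docID"],
}

def extract_access(url: str) -> dict:
    """Return the search query and/or content ID found in a request URL."""
    parts = urlparse(url)
    qs = parse_qs(parts.query)
    result = {}
    for domain, names in QUERY_PARAMS.items():
        if parts.netloc.endswith(domain):
            for name in names:
                if name in qs:
                    result["query"] = qs[name][0]
    for domain, names in ID_PARAMS.items():
        if parts.netloc.endswith(domain):
            for name in names:
                if name in qs:
                    result["id"] = qs[name][0]
    return result
```

Applied to the ebrary example discussed earlier, this yields docID 10572555 as the item ID and the decoded search query thailand "chinese diaspora".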

Revised Work Scope

Revised Work Plan

Appendix