Difference between revisions of "ANLY482 AY2016-17 T2 Group16: HOME/Interim"

From Analytics Practicum
 
<div style="height: 1em"></div>
 
 
<div><font face="Roboto">
 
<div><font face="Roboto">
=== Overview ===
 
  
=== Executive Summary ===
 
The objective of the project is to provide insights on users and ebook databases for the Li Ka Shing Library Analytics Team.
  
=== Data Integration and Filtering ===
=== Exploratory Analysis ===
==== Extracted Table ====
 
  
==== Horizon Chart Analysis ====
  
==== Challenges ====
 
  
==== Population Pyramid Analysis ====
  
==== Choice of Key Measurements ====
 
  
==== Customer Surveys Results Analysis ====
  
=== Data Cleaning and Exploration ===
 
==== Issues ====
 
  
 
==== Exploration ====
 
 
 
=== Findings ===
 
==== Page Level ====
 
 
 
==== Post Level ====
 
 
 
 
 
 
=== Revised Methodology ===
 
 
 
==== Proxy Approach ====
 
Keep only search result pages and downloads. Use the keywords "query" and "\Wq=" to identify search results, and use file size to identify downloads.
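The proxy approach can be sketched as a small classifier. This is only an illustration: the size threshold `DOWNLOAD_MIN_BYTES`, the function name `classify_request`, and the choice to check the search keywords before the file size are assumptions, as no cutoff or precedence is stated here.

```python
import re

# "query" anywhere in the URL, or "q=" preceded by a non-word character
# such as "?" or "&" (the two keywords named in the text).
SEARCH_RE = re.compile(r"query|\Wq=")

# Hypothetical threshold: no file-size cutoff is given in the text.
DOWNLOAD_MIN_BYTES = 500_000


def classify_request(url, size_bytes):
    """Return 'search', 'download', or None for one proxy-log request."""
    if SEARCH_RE.search(url):
        return "search"
    if size_bytes >= DOWNLOAD_MIN_BYTES:
        return "download"
    return None  # neither a search result nor a download; dropped
```

Requests that return `None` would be discarded under this approach.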
 
 
==== Pattern Approach ====
 
Systematically find the useful URL patterns and determine their meaning.<br>
 
<b>Motivation:</b> identify the user action behind each request.
 
 
Original steps:
 
For each domain, summarise all requests into URL patterns, starting from the most popular domains (those with the highest user counts).
 
For each pattern, find one example and describe the user action it represents (showing search results, viewing an item, downloading an item, or other).
 
For each pattern, get the request count and the user count (based on sessions).
 
Recognise and attach an action to each pattern.
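The counting step above could look roughly like the following. The input layout, pairs of `(session_id, pattern)`, and the function name `summarise` are hypothetical, since the log format is not described here.

```python
from collections import defaultdict


def summarise(records):
    """Return {pattern: (request_count, user_count)}.

    records is an iterable of (session_id, pattern) pairs; this layout is
    an assumption made for the sketch.
    """
    requests = defaultdict(int)
    sessions = defaultdict(set)
    for session_id, pattern in records:
        requests[pattern] += 1             # every request counts
        sessions[pattern].add(session_id)  # users are counted once per session
    return {p: (requests[p], len(sessions[p])) for p in requests}
```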
 
 
New approach:
 
Identify the databases of interest based on sponsor preference (and perhaps technical feasibility).
 
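<table>
<tr><th>Database</th><th>Domains (appearing in sample data)</th></tr>
<tr><td>lawnet</td><td>*.lawnet.sg</td></tr>
<tr><td>westlaw</td><td>*.westlaw.co.uk<br>*.westlaw.com<br>*.westlaw.co</td></tr>
<tr><td>Ebsco ebooks</td><td>*.ebscohost.com</td></tr>
<tr><td>MyiLibrary</td><td>*.myilibrary.com</td></tr>
<tr><td>ebrary</td><td>ebrary.com</td></tr>
</table>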
 
 
To extract patterns for each domain, we used Python's urlparse library to extract the path and parameters. The path (the hierarchy under the domain) and the parameter names together define a pattern. The ordering of parameters does not matter.
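A minimal sketch of this extraction, assuming Python 3, where the `urlparse` functions live in `urllib.parse`. The function name `url_pattern` and the tuple layout of the result are illustrative choices, not the project's actual code.

```python
from urllib.parse import urlparse, parse_qsl


def url_pattern(url):
    """Return (domain, path, sorted parameter names) for one URL."""
    parts = urlparse(url)
    # keep_blank_values=True so an empty parameter such as "p00=" still counts.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Sorting the names makes the pattern independent of parameter order.
    names = tuple(sorted({name for name, _ in pairs}))
    return parts.netloc, parts.path, names
```

Two requests to the same path whose parameters appear in a different order, or carry different values, map to the same pattern.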
 
We realised that paths may contain hashed string variables, so we replaced each with a placeholder consisting of its type (i.e. $NUM for numeric, $ALN for alphanumeric, or $STR for strings containing non-alphanumeric characters) followed by its length. For example, “YnRoX183Nzg0MzUxNF9fQU41” and “YnRoX183Nzg0MzUxNF9fQU42” in a path are both replaced by the placeholder “$ALN24”. This reflects the URL patterns accurately, reduces the number of patterns for readability, and helps identify book identifiers such as ISBN and DOI.
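The placeholder substitution might be sketched as below. How a path segment is recognised as a variable is not specified here, so `looks_variable` uses a stand-in heuristic (any segment containing a digit), which is purely an assumption. The function names and the example path are hypothetical.

```python
def placeholder(segment):
    """Map a variable segment to $NUM, $ALN or $STR followed by its length."""
    if segment.isdigit():
        kind = "$NUM"   # digits only
    elif segment.isalnum():
        kind = "$ALN"   # letters and digits
    else:
        kind = "$STR"   # contains non-alphanumeric characters
    return f"{kind}{len(segment)}"


def looks_variable(segment):
    # Assumed heuristic, not from the source: a digit marks a variable segment.
    return any(ch.isdigit() for ch in segment)


def normalise_path(path):
    """Replace variable segments of a URL path with placeholders."""
    return "/".join(
        placeholder(s) if looks_variable(s) else s for s in path.split("/")
    )


print(normalise_path("/ebook/YnRoX183Nzg0MzUxNF9fQU41/view"))  # /ebook/$ALN24/view
```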
 
Attach a count and an example to each pattern.
 
Identify the patterns for search, view or download. This requires some manual work. User inputs are typically easy to identify.
 
Pattern extraction also let us enlarge the keyword pool for “junk requests” (e.g. 341 of 8098 requests to ebrary.com are font files with the extensions ttf, woff or eot).
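As an illustration of the junk filter, the sketch below checks only the three font extensions mentioned above. The rest of the keyword pool is not listed here, and the name `is_junk` is hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Only the font extensions named in the text; a fuller pool would extend this.
JUNK_EXTENSIONS = {".ttf", ".woff", ".eot"}


def is_junk(url):
    """Return True when the URL path ends in a known junk extension."""
    path = urlparse(url).path  # ignores any query string after the file name
    ext = posixpath.splitext(path)[1].lower()
    return ext in JUNK_EXTENSIONS
```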
 
Try revisiting some pages to identify their function.
 
For example, change
 
http://www.ebrary.com/lib/smu/docDetail.action?docID=10572555&p00=thailand+%22chinese+diaspora%22&token=967a22b8-85dc-48b5-bfa3-09178cf75492
 
to
 
http://site.ebrary.com.libproxy.smu.edu.sg/lib/smu/docDetail.action?docID=10572555&p00=thailand+%22chinese+diaspora%22&token=967a22b8-85dc-48b5-bfa3-09178cf75492
 
The rewritten URL takes me to the item page, where:
 
 
“docID=10572555” identifies the item viewed.
 
“p00=thailand+%22chinese+diaspora%22”, which decodes to thailand "chinese diaspora", is the user's search query.
 
This query is consistent with the item title “Library of China Studies : Chinese Diaspora in South-East Asia : The Overseas Chinese in IndoChina (1)”. At the very least, the query gives us keywords for topics.
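The decoding of this example can be reproduced with the standard library. A minimal sketch, assuming Python 3's `urllib.parse`:

```python
from urllib.parse import urlparse, parse_qs

url = (
    "http://www.ebrary.com/lib/smu/docDetail.action"
    "?docID=10572555&p00=thailand+%22chinese+diaspora%22"
    "&token=967a22b8-85dc-48b5-bfa3-09178cf75492"
)

# parse_qs turns "+" into spaces and decodes percent-escapes such as %22.
params = parse_qs(urlparse(url).query)
doc_id = params["docID"][0]
query = params["p00"][0]

print(doc_id)  # 10572555
print(query)   # thailand "chinese diaspora"
```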
 
<table>
<tr><th>Database</th><th>Domains (appearing in sample data)</th><th>Keywords</th></tr>
<tr><td>lawnet</td><td>*.lawnet.sg</td><td>“pdfFileName” for file names (e.g. “[1996] 3 SLR(R) 0371.pdf”)<br>“contentDocID” for internal content location (e.g. “[1985-1986] SLR(R) 0241.xml”)<br>“queryStr=” for query input</td></tr>
<tr><td>westlaw</td><td>*.westlaw.co.uk<br>*.westlaw.com<br>*.westlaw.co</td><td>*.westlaw.co.uk contains “docguid”, but the content is coded with an internal ID<br>“query”</td></tr>
<tr><td>Ebsco ebooks</td><td>*.a.ebscohost.com<br>*.b.ebscohost.com</td><td>“bquery” as a parameter name indicates search input</td></tr>
<tr><td>MyiLibrary</td><td>*.myilibrary.com</td><td>“tid” for title, but it is located with an internal ID</td></tr>
<tr><td>ebrary</td><td>ebrary.com</td><td>“p00” for search queries<br>“docSearch.action?docID=10596700&p00=” is document data for rendering</td></tr>
</table>
 
 
5. Decipher query string and content IDs
 
Because we cannot revisit every page, we cannot tag all URLs with user actions based on the matched pattern. However, the pattern approach did help us extract the query strings.
 
 
Integration:
 
Use the training dataset to extract patterns.
 
Construct rules to derive the access type and the query or ID.
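One way such rules might be expressed is a lookup from domain suffix to parameter names, using the keywords listed for each database. This is a sketch under several assumptions: the `RULES` layout, the function name `apply_rules`, matching by domain suffix, and labelling a request as a "view" whenever an item ID is present are all illustrative choices rather than the project's actual rules. The EBSCO URL in the usage line is hypothetical.

```python
from urllib.parse import urlparse, parse_qs

# domain suffix -> (query parameter, item-ID parameter); None when not known.
RULES = {
    "lawnet.sg": ("queryStr", "contentDocID"),
    "ebscohost.com": ("bquery", None),
    "myilibrary.com": (None, "tid"),
    "ebrary.com": ("p00", "docID"),
}


def apply_rules(url):
    """Return access type, query and item ID, or None if no rule applies."""
    parts = urlparse(url)
    host = parts.hostname or ""
    params = parse_qs(parts.query)
    for suffix, (query_param, id_param) in RULES.items():
        if host == suffix or host.endswith("." + suffix):
            query = params.get(query_param, [None])[0]
            item_id = params.get(id_param, [None])[0]
            # Assumed precedence: an item ID means a view, else a query means a search.
            access = "view" if item_id else ("search" if query else None)
            return {"access": access, "query": query, "id": item_id}
    return None


print(apply_rules("http://web.a.ebscohost.com/ehost/results?bquery=contract+law"))
```

Keeping the query alongside the ID preserves the search terms even when the request itself is an item view, as in the ebrary example.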
 
 
=== Revised Work Scope ===
 
 
 
=== Revised Work Plan ===
 
 
 
=== Appendix ===
 
  
  
 
<!--/Content-->
 

Revision as of 16:40, 18 February 2017

INTERIM PROGRESS

Executive Summary

The objective of the project is to provide insights on users and ebook databases for the Li Ka Shing Library Analytics Team.

Exploratory Analysis

Horizon Chart Analysis

Population Pyramid Analysis

Customer Surveys Results Analysis