ANLY482 AY2016-17 T2 Group16: HOME/Interim
Contents
Overview
Data Integration and Filtering
Extracted Table
Challenges
Choice of Key Measurements
Data Cleaning and Exploration
Issues
Exploration
Findings
Page Level
Post Level
Revised Methodology
Proxy Approach
Keep only search-result pages and downloads. Use the keywords "query" and "\Wq=" to identify search results, and use file size to identify downloads.
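A minimal sketch of the proxy approach, assuming each log record gives a URL and a response size in bytes. The function name, the size threshold and the rule that size is checked before the keywords are illustrative assumptions; the source does not state them.

```python
import re

# Search markers from the text: the word "query" anywhere in the URL, or a
# parameter named "q" (a non-word character followed by "q=").
SEARCH_RE = re.compile(r"query|\Wq=")

# Hypothetical cut-off: the source does not give the file-size threshold.
DOWNLOAD_MIN_BYTES = 500_000


def classify_request(url, size_bytes):
    """Label a request as 'download', 'search' or 'other'."""
    # Assumption: a large response counts as a download even if the URL
    # also carries a query parameter.
    if size_bytes >= DOWNLOAD_MIN_BYTES:
        return "download"
    if SEARCH_RE.search(url):
        return "search"
    return "other"
```

Note that "\Wq=" deliberately does not match parameter names that merely end in "q" (such as "faq="), because the character before "q" must be a non-word character such as "?" or "&".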
Pattern Approach
Systematically find the useful URL patterns and determine what each one means.
Motivation: identify the user action for each request
Original steps:
- For each domain, summarise all requests into URL patterns, starting from the most popular domains (the ones with the highest user counts).
- For each pattern, find one example. Describe the user action for that example (whether it is show search results, view item, download item, or other).
- For each pattern, get request counts and user counts (based on sessions).
- Recognise and attach actions to each pattern.
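The counting part of these steps can be sketched as follows. This assumes the log has already been reduced to (domain, pattern, session ID, URL) tuples; that record layout and the function name are hypothetical, not taken from the project data.

```python
def summarise_patterns(records):
    """Summarise requests per (domain, pattern).

    records: iterable of (domain, pattern, session_id, url) tuples
             (a hypothetical layout chosen for this sketch).
    Returns rows of (domain, pattern, request_count, user_count, example_url),
    most popular first.
    """
    stats = {}
    for domain, pattern, session_id, url in records:
        # The first URL seen for a pattern is kept as its example.
        entry = stats.setdefault(
            (domain, pattern),
            {"requests": 0, "sessions": set(), "example": url},
        )
        entry["requests"] += 1
        entry["sessions"].add(session_id)

    rows = [
        (d, p, e["requests"], len(e["sessions"]), e["example"])
        for (d, p), e in stats.items()
    ]
    # Users are counted as distinct sessions; sort by users, then requests.
    rows.sort(key=lambda r: (r[3], r[2]), reverse=True)
    return rows
```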
New approach:
1. Identify databases of interest based on sponsor preference (and perhaps technical feasibility).
Databases and their domains (appearing in sample data):
lawnet
- .lawnet.sg
westlaw
- .westlaw.co.uk
- .westlaw.com
- .westlaw.co
Ebsco ebooks
- .ebscohost.com
MyiLibrary
- .myilibrary.com
ebrary
- ebrary.com
2. To extract patterns for each domain, we used Python's urlparse library to extract the path and parameters. The path (the directory hierarchy below the domain) and the parameter names together define a pattern. The ordering of parameters does not matter. We realised that paths may contain hashed string variables, so we replaced each of them with a placeholder consisting of its type (i.e. $NUM for numerical, $ALN for alphanumerical, or $STR for strings with non-alphanumerical characters) followed by its length. For example, “YnRoX183Nzg0MzUxNF9fQU41” and “YnRoX183Nzg0MzUxNF9fQU42” in the path are both replaced by the placeholder “$ALN24”. The purposes of doing this are to reflect the URL patterns accurately, to reduce the number of patterns for readability, and to identify book identifiers such as ISBN and DOI.
3. Attach its count and an example to each pattern.
4. Identify patterns for search, view or download. This requires some manual work. User inputs are typically easy to identify. From pattern extraction, we also enlarged the keyword pool for “junk requests” (e.g. 341 of 8098 requests to ebrary.com are fonts with the extensions ttf, woff, eot).
Try revisiting some pages to identify their function. Changing
http://www.ebrary.com/lib/smu/docDetail.action?docID=10572555&p00=thailand+%22chinese+diaspora%22&token=967a22b8-85dc-48b5-bfa3-09178cf75492
to
http://site.ebrary.com.libproxy.smu.edu.sg/lib/smu/docDetail.action?docID=10572555&p00=thailand+%22chinese+diaspora%22&token=967a22b8-85dc-48b5-bfa3-09178cf75492
takes us to a page on which:
- “docID=10572555” identifies the item viewed.
- “p00=thailand+%22chinese+diaspora%22”, i.e. thailand "chinese diaspora", is the user's search query.
This query is consistent with the item title “Library of China Studies : Chinese Diaspora in South-East Asia : The Overseas Chinese in IndoChina (1)”. From the query, we can at least get keywords for topics.
Databases, their domains (appearing in sample data) and keywords:
lawnet
- .lawnet.sg
Keywords: “pdfFileName” for file names (e.g. “[1996] 3 SLR(R) 0371.pdf”); “contentDocID” for internal content location (e.g. “[1985-1986] SLR(R) 0241.xml”); “queryStr=” for query input
westlaw
- .westlaw.co.uk
- .westlaw.com
- .westlaw.co
Keywords: “docguid” (.westlaw.co.uk contains it, but the content is coded with an internal ID); “query”
Ebsco ebooks
- .a.ebscohost.com
- .b.ebscohost.com
Keywords: “bquery” as a parameter name indicates search input
MyiLibrary
- .myilibrary.com
Keywords: “tid” for the title, but the title is referenced by an internal ID
ebrary
- ebrary.com
Keywords: “p00” for search queries; “docSearch.action?docID=10596700&p00=” is document data for rendering
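The pattern extraction described above (path plus parameter names, with placeholders for variable path segments) can be sketched as below. The source names Python's urlparse library; in Python 3 the same functions live in urllib.parse, which this sketch uses. The rule for deciding which path segments are variables is not given in the source, so `looks_variable` (any segment containing a digit) is purely an illustrative assumption, as are the function names.

```python
import re
from urllib.parse import urlsplit, parse_qsl


def placeholder(segment):
    """Return the type tag ($NUM, $ALN or $STR) followed by the length."""
    if re.fullmatch(r"[0-9]+", segment):
        kind = "$NUM"
    elif re.fullmatch(r"[0-9A-Za-z]+", segment):
        kind = "$ALN"
    else:
        kind = "$STR"
    return f"{kind}{len(segment)}"


def looks_variable(segment):
    # Assumption for this sketch only: a segment with at least one digit is
    # treated as a variable (ID or hash); the source gives no explicit rule.
    return any(ch in "0123456789" for ch in segment)


def url_pattern(url):
    """Reduce a URL to its pattern: normalised path + sorted parameter names."""
    parts = urlsplit(url)
    segments = [
        placeholder(s) if looks_variable(s) else s
        for s in parts.path.split("/")
        if s
    ]
    # Parameter order does not matter, so names are de-duplicated and sorted.
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    path = "/" + "/".join(segments)
    return path + ("?" + "&".join(names) if names else "")
```

With this sketch, the ebrary URL quoted earlier reduces to "/lib/smu/docDetail.action?docID&p00&token", and a path segment such as “YnRoX183Nzg0MzUxNF9fQU41” becomes “$ALN24”.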
5. Decipher query strings and content IDs
As we cannot revisit the pages at will, we cannot tag all URLs with user actions based on the pattern matched. However, the pattern approach helped us extract the query strings.
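As an illustration of extracting query strings, the sketch below looks up the search parameter for a domain and decodes its value. The parameter names come from the keyword list above; treating “query” as a parameter name for westlaw, matching domains by host suffix, and the names `QUERY_PARAMS` and `extract_query` are assumptions made for this example.

```python
from urllib.parse import urlsplit, parse_qs

# Search-input parameter per domain suffix, taken from the keyword list.
QUERY_PARAMS = {
    "lawnet.sg": "queryStr",
    "westlaw.co.uk": "query",  # assumption: "query" is a parameter name
    "westlaw.com": "query",    # assumption, as above
    "ebscohost.com": "bquery",
    "ebrary.com": "p00",
}


def extract_query(url):
    """Return the decoded search query in the URL, or None if none is found."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    for suffix, param in QUERY_PARAMS.items():
        if host == suffix or host.endswith("." + suffix):
            # parse_qs decodes "+" as a space and %XX escapes.
            values = parse_qs(parts.query).get(param)
            return values[0] if values else None
    return None
```

Applied to the ebrary URL quoted earlier, this returns thailand "chinese diaspora", matching the query identified by hand.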
Integration:
- Use the training dataset to extract patterns.
- Construct rules to get the access type and the query/ID.