ANLY482 AY2016-17 T2 Group16: HOME/Interim

=== Revised Methodology ===

==== Proxy Approach ====

Sort out search result pages and downloads only. Use the keyword "query" and the pattern "\Wq=" to identify search result pages, and use file size to identify downloads.
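As a minimal sketch of how these two rules might be applied, assuming each proxy log record exposes the requested URL and its response size in bytes (the example URL and the 1 MB download cut-off are illustrative assumptions, not figures from our data):

<pre>
import re

# "\Wq=" requires a non-word character before "q=", so "?q=" and "&q="
# match while, say, "faq=" does not; "query" matches anywhere in the URL.
SEARCH_PATTERN = re.compile(r'query|\Wq=')

DOWNLOAD_MIN_BYTES = 1_000_000  # illustrative cut-off only

def classify(url, size_bytes):
    """Label a proxy log record as a search result, a download, or other."""
    if SEARCH_PATTERN.search(url):
        return 'search'
    if size_bytes >= DOWNLOAD_MIN_BYTES:
        return 'download'
    return 'other'

print(classify('http://www.westlaw.co.uk/search?q=negligence', 20480))  # search
</pre>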
 
==== Pattern Approach ====

Systematically identify useful URL patterns and determine their meaning.<br>
<b>Motivation:</b> identify the user action behind each request.
Original steps:
# For each domain, summarise all requests into URL patterns, starting from the most popular domains (the ones with the most user counts).
# For each pattern, find one example and describe the user action for that example (whether it is a search result page, an item view, an item download, or something else).
# For each pattern, get request counts and user counts (based on sessions).
# Recognise and attach actions to each pattern.
New approach:

1. Identify databases of interest, based on sponsor preference (and perhaps technical feasibility). The candidate databases and their domains are:
{| class="wikitable"
! Database !! Domains (appearing in sample data)
|-
| lawnet || *.lawnet.sg
|-
| westlaw || *.westlaw.co.uk<br>*.westlaw.com<br>*.westlaw.co
|-
| Ebsco ebooks || *.ebscohost.com
|-
| MyiLibrary || *.myilibrary.com
|-
| ebrary || ebrary.com
|}
2. Extract patterns for each domain. We used Python's urlparse library to extract the path and the parameters of each URL. The path (sub-domain hierarchy) and the set of parameter names together define a pattern; the ordering of the parameters does not matter.

We realised that paths may contain hashed string variables. We therefore replaced each of them with a placeholder consisting of a type (i.e. $NUM for numerical, $ALN for alphanumerical, or $STR for strings containing non-alphanumerical characters) followed by the string's length. For example, "YnRoX183Nzg0MzUxNF9fQU41" and "YnRoX183Nzg0MzUxNF9fQU42" in a path are both replaced by the placeholder "$ALN24". This is done to reflect the URL patterns accurately, to reduce the number of patterns for readability, and to help identify book identifiers such as ISBNs and DOIs.
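A minimal sketch of this step, using Python 3's urllib.parse (the modern home of the urlparse functions mentioned above); the length-20 cut-off for treating a segment as a hashed variable is an illustrative assumption:

<pre>
from urllib.parse import urlparse, parse_qs

def placeholder(segment):
    """Replace a variable-looking path segment with a typed placeholder."""
    if segment.isdigit():
        return f'$NUM{len(segment)}'
    if len(segment) >= 20:  # assumed cut-off for hashed string variables
        return f'$ALN{len(segment)}' if segment.isalnum() else f'$STR{len(segment)}'
    return segment  # keep short, human-readable segments as-is

def url_pattern(url):
    """A pattern = domain + normalised path + sorted parameter names."""
    parts = urlparse(url)
    path = '/'.join(placeholder(s) for s in parts.path.split('/'))
    names = sorted(parse_qs(parts.query, keep_blank_values=True))
    return parts.netloc + path + ('?' + '&'.join(names) if names else '')

url = ('http://site.ebrary.com/lib/smu/docDetail.action'
       '?docID=10572555&p00=thailand&token=967a22b8')
print(url_pattern(url))  # site.ebrary.com/lib/smu/docDetail.action?docID&p00&token
</pre>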
3. Attach a request count and one example URL to each pattern.
4. Identify patterns for search, view or download. This requires some manual work, though user inputs are typically easy to identify.

From pattern extraction, we also grew the keyword pool for "junk requests" (e.g. 341 of the 8,098 requests to ebrary.com are font files, with extensions ttf, woff and eot).
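For instance, a sketch of an extension-based junk filter, seeded with the font extensions above (the example URL is hypothetical; the pool grows as inspection surfaces more asset types):

<pre>
import os.path
from urllib.parse import urlparse

# Seeded from the ebrary font requests above; grown as more junk surfaces.
JUNK_EXTENSIONS = {'.ttf', '.woff', '.eot'}

def is_junk(url):
    """Flag requests for static assets that carry no user action."""
    extension = os.path.splitext(urlparse(url).path)[1].lower()
    return extension in JUNK_EXTENSIONS

print(is_junk('http://www.ebrary.com/fonts/reader.woff'))  # True
</pre>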
We also tried revisiting some pages to identify their function. For example, changing

http://www.ebrary.com/lib/smu/docDetail.action?docID=10572555&p00=thailand+%22chinese+diaspora%22&token=967a22b8-85dc-48b5-bfa3-09178cf75492

to

http://site.ebrary.com.libproxy.smu.edu.sg/lib/smu/docDetail.action?docID=10572555&p00=thailand+%22chinese+diaspora%22&token=967a22b8-85dc-48b5-bfa3-09178cf75492

takes us to the corresponding document page. In this URL:

* "docID=10572555" identifies the item viewed.
* "p00=thailand+%22chinese+diaspora%22", i.e. thailand "chinese diaspora", is the user's search query input.

The query matches the title of the item viewed, "Library of China Studies : Chinese Diaspora in South-East Asia : The Overseas Chinese in IndoChina (1)". From such queries, we can at least extract keywords for topics.
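Decoding such p00 values back into readable text is a one-liner with Python's urllib.parse:

<pre>
from urllib.parse import unquote_plus

# "+" decodes to a space and "%22" to a double quote.
print(unquote_plus('thailand+%22chinese+diaspora%22'))  # thailand "chinese diaspora"
</pre>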
{| class="wikitable"
! Database !! Domains (appearing in sample data) !! Keywords
|-
| lawnet || *.lawnet.sg || "pdfFileName" for file names (e.g. "[1996] 3 SLR(R) 0371.pdf")<br>"contentDocID" for internal content locations (e.g. [1985-1986] SLR(R) 0241.xml)<br>"queryStr=" for query input
|-
| westlaw || *.westlaw.co.uk<br>*.westlaw.com<br>*.westlaw.co || *.westlaw.co.uk contains "docguid", but the content is referenced by an internal ID<br>"query" for query input
|-
| Ebsco ebooks || *.a.ebscohost.com<br>*.b.ebscohost.com || "bquery" as a parameter name indicates search input
|-
| MyiLibrary || *.myilibrary.com || "tid" for the title, but the title is referenced by an internal ID
|-
| ebrary || ebrary.com || "p00" for search queries<br>"docSearch.action?docID=10596700&p00=" is document data for rendering
|}
5. Decipher query strings and content IDs. As we cannot revisit the pages at will, we cannot tag every URL with a user action based on the pattern matched. However, the pattern approach helped us extract the query strings.

Integration:
* Use the training dataset to extract patterns.
* Construct rules to get the access type and the query/ID, as sketched below.
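A minimal sketch of how such rules could be encoded and applied. Each rule below pairs a database domain with one keyword from the table above; the (domain suffix, parameter name, access type) encoding and the access-type labels are our working assumptions, not a finalised rule set:

<pre>
from urllib.parse import urlparse, parse_qs

# One illustrative rule per database, drawn from the keywords table above.
RULES = [
    ('lawnet.sg',      'pdfFileName', 'download'),
    ('lawnet.sg',      'queryStr',    'search'),
    ('westlaw.co.uk',  'query',       'search'),
    ('ebscohost.com',  'bquery',      'search'),
    ('myilibrary.com', 'tid',         'view'),
    ('ebrary.com',     'p00',         'search'),
]

def tag(url):
    """Return (access type, query/ID) for the first matching rule, if any."""
    parts = urlparse(url)
    params = parse_qs(parts.query)
    for suffix, name, access in RULES:
        if parts.netloc.endswith(suffix) and name in params:
            return access, params[name][0]
    return None

print(tag('http://www.ebrary.com/lib/smu/docSearch.action?p00=diaspora'))
# ('search', 'diaspora')
</pre>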
  
 
=== Revised Work Scope ===

=== Revised Work Plan ===

=== Appendix ===