Difference between revisions of "ANLY482 AY2016-17 T2 Group16: HOME/Interim"

Revision as of 16:40, 18 February 2017

INTERIM PROGRESS

@@ Line 39: / Line 39: @@
 <div style="height: 1em"></div>
 <div><font face="Roboto">
-=== Overview ===
+=== Executive Summary ===
 The objective of the project is to provide insights on users and ebook databases for the Li Ka Shing Library Analytics Team.
-=== Data Integration and Filtering ===
+=== Exploratory Analysis ===
-==== Extracted Table ====
+==== Horizon Chart Analysis ====
-==== Challenges ====
+==== Population Pyramid Analysis ====
-==== Choice of Key Measurements ====
+==== Customer Surveys Results Analysis ====
-=== Data Cleaning and Exploration ===
-==== Issues ====
-==== Exploration ====
-=== Findings ===
-==== Page Level ====
-==== Post Level ====
-=== Revised Methodology ===
-==== Proxy Approach ====
-Keep only search result pages and downloads. Use the keywords "query" and "\Wq=" to identify search results. Use file size to identify downloads.
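The proxy approach above can be sketched roughly as follows. This is a minimal illustration, not the team's actual script: the function name `classify_request` and the size threshold `DOWNLOAD_MIN_BYTES` are hypothetical, since the text does not state which file size separates a download from an ordinary page.

```python
import re

# Search results: the URL contains "query", or a "q=" parameter preceded by
# a non-word character (the "\Wq=" keyword from the text).
SEARCH_RE = re.compile(r"query|\Wq=")

# Hypothetical threshold; the original text does not give a cut-off value.
DOWNLOAD_MIN_BYTES = 500_000


def classify_request(url: str, size_bytes: int) -> str:
    """Label a logged request as 'search', 'download' or 'other'."""
    if SEARCH_RE.search(url):
        return "search"
    if size_bytes >= DOWNLOAD_MIN_BYTES:
        return "download"
    return "other"
```

The search check runs first so that a large search result page is not mislabelled as a download.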
-==== Pattern Approach ====
-Systematically find the useful URL patterns and determine their meaning.<br>
-<b>Motivation:</b> identify the user action for each request
-Original steps:
-For each domain, summarise all requests into URL patterns, starting from the most popular domains (the ones with the highest user counts)
-For each pattern, find one example. Describe the user action for that example (whether it shows a search result, views an item, downloads an item, or does something else).
-For each pattern, get request counts and user counts (based on sessions).
-Recognise and attach an action to each pattern
-New approach:
-Identify databases of interest based on sponsor preference (and perhaps technical feasibility)
-<table>
-<tr><th>Database</th><th>Domains (appearing in sample data)</th></tr>
-<tr><td>lawnet</td><td>*.lawnet.sg</td></tr>
-<tr><td>westlaw</td><td>*.westlaw.co.uk<br>*.westlaw.com<br>*.westlaw.co</td></tr>
-<tr><td>Ebsco ebooks</td><td>*.ebscohost.com</td></tr>
-<tr><td>MyiLibrary</td><td>*.myilibrary.com</td></tr>
-<tr><td>ebrary</td><td>ebrary.com</td></tr>
-</table>
-To extract patterns for each domain, we used the Python urlparse library to extract the path and parameters. The path (sub-domain hierarchy) and parameter names together define a pattern. The ordering of parameters does not matter.
-We realised that paths may contain hashed string variables. Thus, we replaced them with a placeholder consisting of a type (i.e. $NUM for numerical, $ALN for alphanumerical, or $STR for strings with non-alphanumerical characters) followed by the length. For example, “YnRoX183Nzg0MzUxNF9fQU41” and “YnRoX183Nzg0MzUxNF9fQU42” in the path are both replaced by the placeholder “$ALN24”. The purposes of doing this are to accurately reflect the URL patterns, to reduce the number of patterns for readability, and to identify book identifiers such as ISBN and DOI.
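A minimal sketch of this pattern extraction is given below, using `urllib.parse` (the Python 3 home of urlparse). The function names are hypothetical, and so is the rule for deciding which path segments count as variables: here any segment containing a digit is treated as a variable, which is an assumption, since the text does not say how hashed strings were detected.

```python
from urllib.parse import urlparse, parse_qsl


def to_placeholder(segment: str) -> str:
    """Replace a variable-looking path segment with type + length."""
    # Assumption: segments without any digit are literal path names.
    if not any(ch.isdigit() for ch in segment):
        return segment
    if segment.isdigit():
        kind = "$NUM"   # numerical
    elif segment.isalnum():
        kind = "$ALN"   # alphanumerical
    else:
        kind = "$STR"   # contains non-alphanumerical characters
    return f"{kind}{len(segment)}"


def url_pattern(url: str) -> str:
    """Build a pattern from the normalised path and sorted parameter names."""
    parts = urlparse(url)
    path = "/".join(to_placeholder(seg) for seg in parts.path.split("/"))
    # Sorting the names makes the pattern independent of parameter order.
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    query = "&".join(names)
    return f"{parts.netloc}{path}" + (f"?{query}" if query else "")
```

With this sketch, `to_placeholder("YnRoX183Nzg0MzUxNF9fQU41")` gives `$ALN24`, matching the example in the text, and two URLs that differ only in parameter order or parameter values map to the same pattern.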
-Attach a count and an example to each pattern
-Identify patterns for search, view or download. This requires some manual work. User inputs are typically easy to identify.
-From pattern extraction, we also expanded the keyword pool for “junk requests” (e.g. 341 of 8098 requests to ebrary.com are font files with the extensions ttf, woff or eot)
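Filtering such junk requests could look like the sketch below. It only covers the three font extensions named above; the set name `JUNK_EXTENSIONS` and the function `is_junk` are hypothetical, and the real keyword pool presumably contained more entries.

```python
import posixpath
from urllib.parse import urlparse

# Font extensions mentioned in the text; the full keyword pool is not listed.
JUNK_EXTENSIONS = {".ttf", ".woff", ".eot"}


def is_junk(url: str) -> bool:
    """Return True if the URL path ends with a known junk extension."""
    path = urlparse(url).path  # ignores the query string and fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension in JUNK_EXTENSIONS
```

Checking the parsed path rather than the raw URL means that a font request with a trailing query string (for example a cache-busting version parameter) is still caught.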
-Try revisiting some pages to identify their function. For example, change
-http://www.ebrary.com/lib/smu/docDetail.action?docID=10572555&p00=thailand+%22chinese+diaspora%22&token=967a22b8-85dc-48b5-bfa3-09178cf75492
-to
-http://site.ebrary.com.libproxy.smu.edu.sg/lib/smu/docDetail.action?docID=10572555&p00=thailand+%22chinese+diaspora%22&token=967a22b8-85dc-48b5-bfa3-09178cf75492
-it takes me to
-“docID=10572555” identifies the item viewed
-“p00=thailand+%22chinese+diaspora%22”, which decodes to thailand "chinese diaspora", is the user's search query
-This query is consistent with the item title “Library of China Studies : Chinese Diaspora in South-East Asia : The Overseas Chinese in IndoChina (1)”. From the query, we can at least extract keywords for topics
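Decoding these two parameters from the example URL can be sketched with the standard library as follows. The function name `extract_ebrary_fields` is hypothetical; the parameter names `docID` and `p00` are the ones observed above.

```python
from urllib.parse import urlparse, parse_qs


def extract_ebrary_fields(url: str) -> dict:
    """Pull the item ID and the decoded search query out of an ebrary URL."""
    params = parse_qs(urlparse(url).query)
    return {
        # parse_qs returns a list per parameter; take the first value if present.
        "doc_id": params.get("docID", [None])[0],
        "search_query": params.get("p00", [None])[0],
    }


url = (
    "http://www.ebrary.com/lib/smu/docDetail.action"
    "?docID=10572555&p00=thailand+%22chinese+diaspora%22"
    "&token=967a22b8-85dc-48b5-bfa3-09178cf75492"
)
fields = extract_ebrary_fields(url)
```

`parse_qs` turns `+` into a space and `%22` into a double quote, so the query comes back in the form the user typed it.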
-<table>
-<tr><th>Database</th><th>Domains (appearing in sample data)</th><th>Keywords</th></tr>
-<tr><td>lawnet</td><td>*.lawnet.sg</td><td>“pdfFileName” for file names (e.g. “[1996] 3 SLR(R) 0371.pdf”)<br>“contentDocID” for internal content location (e.g. [1985-1986] SLR(R) 0241.xml)<br>“queryStr=” for query input</td></tr>
-<tr><td>westlaw</td><td>*.westlaw.co.uk<br>*.westlaw.com<br>*.westlaw.co</td><td>*.westlaw.co.uk contains “docguid” but the content is coded with internal ID<br>“query”</td></tr>
-<tr><td>Ebsco ebooks</td><td>*.a.ebscohost.com<br>*.b.ebscohost.com</td><td>“bquery” as parameter name indicates search input</td></tr>
-<tr><td>MyiLibrary</td><td>*.myilibrary.com</td><td>“tid” for title, but it is located with internal ID</td></tr>
-<tr><td>ebrary</td><td>ebrary.com</td><td>“p00” for search queries<br>“docSearch.action?docID=10596700&p00=” is document data for rendering</td></tr>
-</table>
-Decipher query strings and content IDs
-As we cannot revisit the pages as we would like, we cannot tag all URLs with user actions based on the pattern matched. However, the pattern approach helped us extract the query strings.
-Integration:
-Use the training dataset to extract patterns
-Construct rules to get the access type and the query or ID
-=== Revised Work Scope ===
-=== Revised Work Plan ===
-=== Appendix ===
 <!--/Content-->

Contents

Executive Summary

Exploratory Analysis

Horizon Chart Analysis

Population Pyramid Analysis

Customer Surveys Results Analysis
