Difference between revisions of "ANLY482 AY2016-17 T2 Group16: PROJECT FINDINGS/Finals"
Revision as of 01:45, 23 April 2017
Analysis Workflow
The workflow diagram below shows the “data cleaning for general analysis process”, the “horizon chart process” and the “word cloud process”. The data preparation process produces a cleaned log dataset with additional attributes. These attributes are derived from the request URLs and from the additional datasets collected.
Duplicate requests
A duplicate request is defined as one of multiple identical requests made by the same user to the same URL within a 30-second time span. These are presumably caused by automatic page refreshes.
For efficiency, duplicate records were removed in two rounds. The first round treats each record as a single string and looks back 20 lines in a rolling cache for an exact match. The second round looks only for matches on hashed email and URL within a 30-second range.
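The two rounds can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the function names and the record layout `(email_hash, url, timestamp)` are assumptions, the second round assumes records are sorted by time, and it measures the 30 seconds from the most recent request of the same user and URL.

```python
from collections import deque
from datetime import datetime, timedelta


def drop_exact_repeats(lines, window=20):
    """Round 1: drop a line if an identical line is among the last `window` kept lines."""
    recent = deque(maxlen=window)  # rolling cache of recently kept lines
    kept = []
    for line in lines:
        if line in recent:
            continue  # exact match found in the cache, so skip it
        recent.append(line)
        kept.append(line)
    return kept


def drop_user_url_repeats(records, span_seconds=30):
    """Round 2: drop a record if the same (hashed email, URL) was requested within the span.

    `records` is an iterable of (email_hash, url, timestamp) tuples sorted by timestamp.
    """
    span = timedelta(seconds=span_seconds)
    last_seen = {}  # (email_hash, url) -> timestamp of the most recent request
    kept = []
    for email_hash, url, ts in records:
        key = (email_hash, url)
        previous = last_seen.get(key)
        last_seen[key] = ts
        if previous is not None and ts - previous <= span:
            continue  # treated as an automatic refresh
        kept.append((email_hash, url, ts))
    return kept
```

Running the cheap string comparison first shrinks the input before the second round, which needs parsed fields and a per-user lookup.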
Requests to web resources
Even with duplicate records removed, the data would still be noisy because of the many requests to web assets. Web assets are used for page rendering and display; typical ones are JavaScript (.js), Cascading Style Sheets (.css) and images (.img, .jpg). Requests for such assets do not help us understand user behaviour, as they are not generated by the user and are database dependent. They are identified by the file extensions of the web assets.
The requests to web assets should be removed from analysis on two grounds:
- Inflation of intra-domain analysis, as web assets inflate the number of requests within a domain
- Distortion of inter-domain analysis, as the number of web assets used to render pages varies from site to site.
We iteratively built up a list of web asset extension stubs that is as exhaustive as possible. Subsequent pattern extraction also uncovered more types of web assets, which were added to the list. We then searched each request line for the extension stubs in the list. Because sites render pages differently, the pattern of requests to web assets varies from site to site.
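A minimal sketch of such a filter is shown below. The extension set is illustrative only and is not the list the project built. The sketch also checks the extension of the URL path rather than searching the whole request line, which is a slightly stricter variant chosen here so that a search term such as "style.css" in the query string is not misclassified.

```python
import posixpath
from urllib.parse import urlsplit

# Illustrative extension stubs only; the project's real list was built iteratively.
ASSET_EXTENSIONS = {".js", ".css", ".jpg", ".png", ".gif"}


def is_web_asset(url):
    """Return True if the URL path ends with a known web asset extension."""
    path = urlsplit(url).path.lower()
    extension = posixpath.splitext(path)[1]  # e.g. ".js" for "/static/app.min.js"
    return extension in ASSET_EXTENSIONS


def remove_asset_requests(urls):
    """Keep only the requests that are not for web assets."""
    return [u for u in urls if not is_web_asset(u)]
```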
Derive domain and date time
The domain names in URLs indicate the resource provider. A simple URL utility toolset can extract the domains from request URLs. However, since April 2016, when the library website redirects users to an external e-resource page, the destination URL is encoded in the URL data as a value. For example, we need to extract the URL of the Economist from the “qurl” section, as that reflects the actual user intention. This requires recognising the libproxy redirect requests, extracting the actual URL from them and decoding it into plain text.
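The redirect handling could look something like the sketch below. The function name and the example URL (including its `/login` path) are assumptions for illustration; only the `qurl` parameter and the libproxy.smu.edu.sg host come from the log description above.

```python
from urllib.parse import parse_qs, urlsplit

PROXY_HOST = "libproxy.smu.edu.sg"


def resolve_destination(url):
    """Return the real destination URL, unwrapping a libproxy redirect if present."""
    parts = urlsplit(url)
    if parts.hostname == PROXY_HOST:
        qurl = parse_qs(parts.query).get("qurl")
        if qurl:
            # parse_qs has already percent-decoded the value into plain text.
            return qurl[0]
    return url


# Hypothetical redirect request, for illustration only.
wrapped = "http://libproxy.smu.edu.sg/login?qurl=http%3A%2F%2Fwww.economist.com%2Fprintedition"
destination = resolve_destination(wrapped)
domain = urlsplit(destination).hostname
```

Requests that are not redirects pass through unchanged, so the same function can be applied to every record before the domain is extracted.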
The date-time section also requires transformation. The date-time section (e.g. “[01/Oct/2016:00:00:01 +0800]”) in the records is not in a format accepted by the analysis tool that we used, and it is too rigid for time-lapse analysis. Thus, we expanded it into timestamp, date, day of week, week and hour.
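As an illustration, the expansion can be done with Python's standard library as sketched below. The output field names and the use of ISO week numbers are assumptions, since the exact definitions used in the project are not stated. Parsing the month abbreviation with `%b` also assumes an English (or C) locale.

```python
from datetime import datetime

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")


def expand_timestamp(raw):
    """Expand a log date-time such as "[01/Oct/2016:00:00:01 +0800]" into separate fields."""
    dt = datetime.strptime(raw.strip("[]"), "%d/%b/%Y:%H:%M:%S %z")
    return {
        "timestamp": dt.isoformat(),
        "date": dt.date().isoformat(),
        "day_of_week": DAY_NAMES[dt.weekday()],
        "week": dt.isocalendar()[1],  # ISO week number (assumed definition)
        "hour": dt.hour,
    }


fields = expand_timestamp("[01/Oct/2016:00:00:01 +0800]")
```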
Deriving database name from URL
To understand the features of databases, we need to group all query URLs by database (i.e. determine which database each URL requests). This is not a straightforward task because of the many-to-many relationship between database names and domain names, and because of the scale of the manual resolution. In fact, our preliminary analysis shows that each database maps to 7.38 domain names on average. An extreme case is Sage, which uses 495 different domain names in the September data.
The major challenges
- 1. Manual resolution :
Domain names come in various forms. Some are obviously identifiable; some are not. For example, it is not apparent that tandfonline.com is a domain name for the journal database Taylor and Francis. There is therefore no systematic way to map the unwieldy domain names to the database names that are familiar to the project sponsor.
- 2. Variation in domain levels :
A database can use as many as 500 different domains of varying length. There is no single pattern to automatically translate the domain names back to the corresponding database. Most databases use the same top and second level domains. (For example, Financial Times, a common database, uses as many as 39 different domain names, but they all take the form *.ft.com.) However, many databases, such as Wall Street Journal and Westlaw, use multiple top-level domains.
- 3. Encoded domain name :
Starting from April, the real domain names of some databases are encoded into parameters of requests whose domain name is libproxy.smu.edu.sg. This was flagged when the number of requests to libproxy.smu.edu.sg surged in April. In subsequent months, such requests amount to 21% of overall requests.
- 4. Domains may not contain database name :
For a couple of e-resource providers, the database name is encoded in the URL request data instead of being explicitly shown in the domain section. These include Ebsco and ProQuest. For these sites, we need to read the entire URL to find the DB stub and decipher it into the real database name.
The first step we took was to extract the top-two level domains (e.g. wsj.com) of all requests into a domain statistics file, since intuitively the top-two level domain would suffice to identify most of the databases. The statistics file summarises the following for each top-two level domain:
- number of requests,
- number [proportion] of post requests,
- number [proportion] of requests to web assets,
- number of unique users,
- and number of user sessions
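A rough sketch of this step is given below, covering only a subset of the statistics listed above. The record layout `(user, method, url)` and the function names are assumptions. Taking the last two labels of the host name is a deliberately naive reading of "top-two level domain": it yields wsj.com for www.wsj.com but only com.sg for www.businesstimes.com.sg, which is why some databases cannot be identified at this level.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def top_two_level_domain(url):
    """Return the last two labels of the host name, e.g. "wsj.com"."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def domain_statistics(records):
    """Summarise requests, POST requests and unique users per top-two level domain.

    `records` is an iterable of (user, method, url) tuples (assumed layout).
    """
    stats = defaultdict(lambda: {"requests": 0, "post_requests": 0, "users": set()})
    for user, method, url in records:
        row = stats[top_two_level_domain(url)]
        row["requests"] += 1
        if method == "POST":
            row["post_requests"] += 1
        row["users"].add(user)
    return stats
```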
With that, we manually looked up the database names (or “semantic” domain names) in the library database catalogue (http://researchguides.smu.edu.sg/az.php), which lists LKS Library’s 191 databases (as of Feb 2017) and their descriptions. The statistics showed that over 90% of overall traffic can be attributed to known databases.
We then extracted all full domain names from the request URLs, found the identifiable token (wsj in www.wsj.com or businesstimes in www.businesstimes.com.sg), automatically populated database names from the existing mappings, and manually added or corrected the mismatched ones. This data is stored in the Domain_DB_Mapping table in Diagram 2 (ER Diagram). Depending on whether a database can be identified by its top-two level domains, we categorise databases as:
- Domain matching database: databases that can be identified by top-two level domain (e.g. aisnet.org for AIS Electronic Library)
- Regex matching database: databases that cannot be identified by top-two level domain and have to be identified with a regex (e.g. www.businesstimes.com.sg for Business Times)
- Mixed database: databases that use both types of domains (e.g. cambridge.org and “*.cam.ac.uk” for Cambridge University Press)
As regular expression (regex) matching is computationally intensive and most databases can be identified by top-two level domain, we match databases by top-two level domain first.
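The matching order can be illustrated with the sketch below: a dictionary lookup on the top-two level domain is tried first, and the regex list is scanned only when the lookup fails. The two small mapping tables are stand-ins built from the examples in this section; the regex patterns themselves are assumptions, and the project's real mappings live in the Domain_DB_Mapping table.

```python
import re

# Stand-in mappings built from the examples above (not the full project tables).
DOMAIN_TO_DB = {
    "aisnet.org": "AIS Electronic Library",
    "ft.com": "Financial Times",
    "cambridge.org": "Cambridge University Press",
}
REGEX_TO_DB = [
    (re.compile(r"(^|\.)businesstimes\.com\.sg$"), "Business Times"),
    (re.compile(r"\.cam\.ac\.uk$"), "Cambridge University Press"),
]


def resolve_database(host):
    """Resolve a host name to a database name, or None if it is unknown."""
    top_two = ".".join(host.split(".")[-2:])
    database = DOMAIN_TO_DB.get(top_two)  # cheap lookup first
    if database is not None:
        return database
    for pattern, name in REGEX_TO_DB:  # slower regex fallback
        if pattern.search(host):
            return name
    return None
```

A mixed database such as Cambridge University Press appears in both tables, so either path can resolve it.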
Diagram 5 illustrates the process for resolving “sub” database names under Ebscohost. Once the sub DB stub is extracted from the URL, it is looked up in Ebscohost_Sub_DB from Diagram 2.
Extract and decipher user input data
Top requested content, along with many other attributes, partially defines students’ usage patterns. To find the content of interest, identifying key user actions (KUA) is a necessary step. A KUA is one of:
- Search - identifiable search result page with search query as a URL parameter
- Landing Page - a request to the summary page of a resource, which typically shows the title, author, abstract and other metadata of the resource
- View - web embedded reader for online viewing of content
- Download - direct request to content files for offline viewing or, less commonly, to an embedded PDF viewer
The data extracted for KUAs can take one of the following formats:
- Text – instantly readable strings, from which we can directly obtain the information of user input. Decoding may be necessary to remove word connectors.
- IDs – Database-specific identifiers. We cannot infer the meaning from an ID, but IDs can be compared with one another.
- DOI - Digital Object Identifier is a persistent identifier or handle used to uniquely identify objects. Public APIs can resolve it to a resource title. It looks like “10.1002/(ISSN)1932-443X”
As we cannot revisit the pages at will, and given the sheer magnitude of the work, tagging each and every URL with its meaning is impossible. We used two approaches to extract URL pattern rules for each KUA. The URL pattern rules are then used to tag KUA records and to extract the requested resource from the URL.
- 1. Pattern Approach :
A pattern captures URL subdomains and request parameter names, and defines a group of URLs with the same subdomain structure and parameter names. A pattern disregards data values and reduces the amount of manual work in discovering URL KUA rules. Diagram 5 shows the general pattern extraction process.
Once the URLs have been reduced into patterns, we manually identify patterns related to one of the three KUAs according to traces from the pattern literals. From the identified patterns, we can then build URL pattern rules based on the observed pattern.
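The reduction of URLs into patterns might be sketched as follows. This is only one possible reading of the description above: the pattern notation (host, path, then sorted parameter names with `*` in place of values) and the example.com URLs are hypothetical, and the project's actual extraction process is the one shown in the diagram.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def url_pattern(url):
    """Reduce a URL to its host, path and sorted parameter names, dropping data values."""
    parts = urlsplit(url)
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    pattern = (parts.hostname or "") + parts.path
    if names:
        pattern += "?" + "&".join(f"{name}=*" for name in names)
    return pattern


def count_patterns(urls):
    """Group URLs by pattern so that only the distinct patterns need manual review."""
    return Counter(url_pattern(u) for u in urls)


# Hypothetical URLs, for illustration only.
sample = [
    "http://search.example.com/results?q=risk&page=2",
    "http://search.example.com/results?page=7&q=credit+default",
    "http://search.example.com/article/view?id=123",
]
patterns = count_patterns(sample)
```

In this sample the first two URLs collapse into a single pattern, so a reviewer would inspect two patterns instead of three URLs.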
We further reduced the patterns by replacing fixed-length subdomain strings with placeholders that capture the type and length of the string. This is necessary because some databases (e.g. ebscohost) encode content IDs or DOIs as a subdomain parameter (instead of a conventional data parameter). The table below illustrates the meaning of some placeholders used: