ANLY482 AY2016-17 T2 Group16: PROJECT FINDINGS/Finals
Contents
Analysis Workflow
The workflow diagram below shows “data cleaning for general analysis process”, “horizon chart process” and “word cloud process”. A cleaned log dataset with additional attributes is produced after data preparation process. These attributes come from URL and additional data set collected.
Duplicate requests
Duplicate request is defined as a user makes multiple identical requests with the same URL within 30-second time span This is supposedly caused by auto page refresh.
The removal of duplicate records was done in two rounds for efficiency reason. The first round takes each record as a single string and look backs 20 lines in the rolling cache for exact match. The second round only looks for matches by hashed email and URL within 30-second range.
Requests to web resources
With removed duplicate records, the data would still be noisy due to many requests to web assets. Web assets are used for page rendering and display. The typical ones are JavaScript (.js), Cascading Style Sheets (.css) and images (.img, jpg). Such web assets do not help us understand user behaviour as the requests are not generated by user and are database dependent. The requests to web assets are identified by web assets file extensions.
The requests to web assets should be removed from analysis on two grounds:
- Inflation on intra-domain analysis, as the web assets indefinitely increase the number of requests within a domain
- Distortion on inter-domain analysis, as the number of web assets used to render pages vary from site to site.
We iteratively built up a list of web assets extension stubs as exhaustive as possible. Subsequent pattern extraction also uncovered more types of web assets to be added to the list. We then search in each request line for the extension stubs from the list. Sites render pages differently, thus the pattern of requests to web assets vary.
Derive domain and date time
The domain names in URLs indicate the resource provider. A simple URL utility toolset can extract the domains from request URLs. However, since April 2016, when the library website redirects users to external e-resource page, the
destination URL is encoded in the URL data as a value. For example, we need to extract the URL to Economist from the “qurl” section as that is the actual user intention. This requires recognising and extracting the actual URL from the libproxy redirect requests and decode them into plain text.
Date time section also requires transformation. The date time section (e.g. “[01/Oct/2016:00:00:01 +0800]”) in the records does not take a format acceptable to the analysis tool that we used, and it is rigid for time lapse analysis. Thus, we extended it into timestamp, date, day of week, week and hour.
Deriving database name from URL
To understand the features of databases, we need to group all query URLs by their databases. (i.e. which database is requests with each URL). This is not a straightforward task due to a many-to-many relationship between database name and domain name, and the scale of the manual resolution. In fact, from our preliminary analysis, each database is mapped to 7.38 domain names on average. An extreme case is Sage, which uses 495 different domain names in September data.
The major challenges
- Manual resolution
Domain names comes in various forms. Some are obviously identifiable, some aren’t. For example, it is not apparent that tandfonline.com is a domain name for the journal database Taylor and Francis. Therefore, there is no systematic way to map the mouthful domain names to the database names which are familiar to the project sponsor.
2. Variation in domain levels
A database can use as much as 500 different domains which comes in various length. There is no single pattern to automatically translate the domain names back to the corresponding database. Most databases use the same top and second level domains. (For example, a common database Financial Times use as many as 39 different domain names, but they are all something like *.ft.com.) However, many databases use multiple top level domains, such as Wall Street Journal and Westlaw.
3. Encoded domain name
Starting from April, the real domain name for some database are encoded into parameters for requests with domain name libproxy.smu.edu.sg. This is flagged out as the that the number of requests to libproxy.smu.edu.sg surged in April. In subsequent months, such domains amount to 21% of overall requests.
4. Domains may not contain database name
For a couple of e-resources providers, the domain name is encoded in URL request data instead of explicitly shown in the domain section. These include Ebsco and ProQuest. For these sites, we will need to read the entire URL to find the DB stub and decipher it to real database name.
The first step we took is extracting top-two level domains (e.g. wsj.com) for all requests in domain statistic file, since intuitively the top-two level domain would suffice to identify most of the databases. The statistic files summarized the statistics below for each top-two level domain.
- number of requests,
- number [proportion] of post requests,
- number [proportion] of requests to web assets,
- number of unique users,
- and number of user sessions
With that, we manually looked up the database names (or “semantic” domain names) from library database catalogue (http://researchguides.smu.edu.sg/az.php) which lists out LKS Library’s 191 databases (till Feb 2017) and their descriptions. The statistics proved over 90% of overall traffic can be attributed to known databases.
We then extracted all full domain names from request URLs, find the identifiable token (wsj in www.wsj.com or businesstimes in www.businesstimes.com.sg), automatically populate database names from the existing mappings, and manually add or modify the mismatched ones. This data is stored in Domain_DB_Mapping table in Diagram 2 ER Diagram. Depending if it is identifiable with top-two level domains, we categorise databases as:
- Domain matching database: databases that can be identified by top-two level domain (e.g. aisnet.org for AIS Electronic Library)
- Regex matching database: databases that cannot be identified by top-two level domain but have to be identified with regex. (e.g. www.businesstimes.com.sg for Business Times)
- Mixed database: databases that uses both type of domains (e.g. cambridge.org and “*.cam.ac.uk” for Cambridge University Press)
As regular expression (regex) is computational intensive and most databases can be identified by top-two level domain, we match database with top-two level domain first.
- 1. Understanding the data :