ANLY482 AY2016-17 T2 Group7: Project Overview
The analytics team (part of Learning & Information Services) at Li Ka Shing Library aims to discover meaningful insights about user behaviour so that it can provide appropriate assistance in the form of library e-resources training, helpdesk services and support. However, the team currently does not know how to make use of the log data collected from the library’s main web page, http://library.smu.edu.sg/. As a result, the log files are neglected, and the library analytics team wishes to collaborate with us to realize the full potential of this data.
We felt that this is a great opportunity lost, as the log files could be extracted and analyzed to provide valuable insights for the Library team so that e-resources could be better optimized. The exact problem that this would solve is still unknown to us at this stage; however, we felt that this should not hamper our motivation to embark on this project. An exploratory data analysis (EDA) of the dataset could reveal much about the user journey as users complete their searches on the library e-resources database. The results may change the user experience for the better. We may not be able to witness the full change, but it would be really pleasing to see the start of it.
This project aims to do analysis on log files to:
- Understand user behavior by using a data-driven approach to better discern the reach for each e-resource and the querying capabilities of each user category.
- Understand the relationship between different search queries for different users.
- Examine the event sequence for unique users (e.g. which articles User A searched for together or one after another) to provide recommendations for improving the user experience. For example, these event sequence insights could complement the optimization of the Search function by suggesting other searches based on each unique user’s event sequence.
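The event-sequence idea in the last aim can be illustrated with a minimal sketch. It assumes the log lines have already been reduced to (user, timestamp, search term) tuples; the user labels, timestamps and search terms below are hypothetical and are not taken from the library's data.

```python
from collections import Counter, defaultdict
from datetime import datetime


def build_event_sequences(events):
    """Group search terms by user, ordered by timestamp."""
    sequences = defaultdict(list)
    # Sorting by (user, timestamp) puts each user's events in time order.
    for user, _timestamp, term in sorted(events, key=lambda e: (e[0], e[1])):
        sequences[user].append(term)
    return dict(sequences)


def consecutive_pairs(sequence):
    """Return (earlier, later) pairs of searches made one after another."""
    return list(zip(sequence, sequence[1:]))


# Hypothetical events, for illustration only.
events = [
    ("user_a", datetime(2016, 1, 1, 0, 5), "contract law"),
    ("user_b", datetime(2016, 1, 1, 0, 2), "audit risk"),
    ("user_a", datetime(2016, 1, 1, 0, 1), "tort law"),
    ("user_a", datetime(2016, 1, 1, 0, 9), "negligence"),
]

sequences = build_event_sequences(events)

# Count how often one search is followed by another, across all users.
pair_counts = Counter(
    pair for seq in sequences.values() for pair in consecutive_pairs(seq)
)
```

Pairs that occur frequently across many users would be natural candidates for "related search" suggestions, although whether that holds for this dataset can only be confirmed by the analysis itself.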
Following discussions with our project supervisor, we reached a mutual understanding that a large amount of time would be spent on pre-processing the dataset due to the nature of the extraction process. The expectations for our project therefore differ from those of other projects: we would not have a working sandbox model by the interim report/presentation. Instead, our expected deliverables are:
- Data curation completed, with the log file data tabulated into proper data tables
- Exploratory data analysis completed and tested against a few library databases to see if it works, with the thought process documented
- Gap analysis of excessive system logging of search queries (logged every 2 character presses), recognizing that triggering multiple log entries before a search is completed wastes resources and can potentially slow down the server, hindering the overall experience for end-users
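One way to quantify the excessive logging described in the gap analysis is sketched below. It assumes that the search terms of one session have already been extracted in time order, and it treats a logged term as redundant when the next logged term simply extends it. This prefix rule is our own simplifying assumption (it does not handle users deleting characters), and the sample terms are hypothetical.

```python
def collapse_incremental_queries(queries):
    """Keep only the final term of each run of prefix-extending queries."""
    completed = []
    for query in queries:
        if completed and query.startswith(completed[-1]):
            # The user is still typing the same search: overwrite the partial term.
            completed[-1] = query
        else:
            # A new search has started.
            completed.append(query)
    return completed


# Hypothetical terms logged within one session, in time order.
logged = ["th", "the ", "the gr", "the great", "pe", "peac", "peace"]

completed = collapse_incremental_queries(logged)
redundant_entries = len(logged) - len(completed)
```

Here `completed` is `["the great", "peace"]`, so 5 of the 7 logged entries are partial queries. The same ratio computed over the real logs would give a rough estimate of how much logging is wasted.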
Proxy log data & student information data (Names of Students are Hashed)
Proxy log data:
59.189.71.33 tDU1zb0CaV2B8qZ 65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba [01/Jan/2016:00:01:39 +0800] "GET http://heinonline.org:80/HOL/VMTP?base=js&handle=hein.journals/fchlj23&div=7&collection=journals&input=(The%20Great%20Peace)&set_as_cursor=19&disp_num=20&viewurl=SearchVolumeSOLR%3Finput%3D%2528The%2520Great%2520Peace%2529%26div%3D7%26f_size%3D600%26num_results%3D10%26handle%3Dhein.journals%252Ffchlj23%26collection%3Djournals%26set_as_cursor%3D19%26men_tab%3Dsrchresults%26terms%3D%2528The%2520Great%2520Peace%2529 HTTP/1.1" 200 2121 "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
Proxy Log Data:
Parameters | Description | Example |
---|---|---|
IP address | This is the IP address from which the request was made | 59.189.71.33 |
Session ID | Each session is identified by a unique ID, which corresponds to 1 session by a single user | tDU1zb0CaV2B8qZ |
Unique Student ID (Hashed) | The student ID is hashed by the SMU Library so as to protect the identity of users | 65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba |
Timestamp | This is the time at which the log entry is recorded; an entry is recorded whenever the user performs a task. The time is in 24-hour format and in local Singapore time (GMT+0800). | [01/Jan/2016:00:01:39 +0800] |
HTTP method | This is the HTTP request method. The requested URL, which contains the user's search query, typically comes after it. | GET |
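To tabulate the proxy log, each line first has to be split into the fields described above. The following is a minimal sketch using a regular expression. The field names, the assumption that the last three fields are the status code, response size and user agent, and the use of the `input` URL parameter for the search term are inferred from the single sample line and may not hold for every database. The sample line in the code is a shortened version of the one shown earlier.

```python
import re
from datetime import datetime
from urllib.parse import parse_qs, urlsplit

# Assumed layout: IP, session ID, hashed user ID, [timestamp],
# "METHOD URL PROTOCOL", status, size, "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<session_id>\S+) (?P<user_hash>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Example timestamp: 01/Jan/2016:00:01:39 +0800
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    record["status"] = int(record["status"])
    return record


def extract_query_param(url, param="input"):
    """Return the decoded value of one URL parameter, or None if absent."""
    values = parse_qs(urlsplit(url).query).get(param)
    return values[0] if values else None


# Shortened version of the sample line (URL and user agent truncated).
SAMPLE_LINE = (
    '59.189.71.33 tDU1zb0CaV2B8qZ '
    '65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba '
    '[01/Jan/2016:00:01:39 +0800] '
    '"GET http://heinonline.org:80/HOL/VMTP?base=js&input=(The%20Great%20Peace) HTTP/1.1" '
    '200 2121 "Mozilla/5.0 (Windows NT 10.0; WOW64)"'
)

record = parse_log_line(SAMPLE_LINE)
search_term = extract_query_param(record["url"])
```

Lines that do not match the pattern return `None`, so they can be counted and inspected separately rather than silently dropped. Each library database is likely to name its search parameter differently, so `param` would need to be set per database.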
Student Information Data:
“feb0e4d05b236c0bcc0c7331dc754921cf9189c4c1317b0b112696fcf68cd2f8, MASTER School of Accountancy, MSc in CFO Leadership, AY_2014, GY_2015”
Parameters | Description | Example |
---|---|---|
Unique Student ID (Hashed) | This is provided so that we can match the unique student ID to the corresponding ones in the proxy data logs. | feb0e4d05b236c0bcc0c7331dc754921cf9189c4c1317b0b112696fcf68cd2f8 |
Level of Education | This indicates which level of education the user is in, typically a Master's or Bachelor's programme. | MASTER |
School | This indicates the school that the user is from. | School of Accountancy |
Type of Programme | This indicates the specific programme the user is undertaking. | MSc in CFO Leadership |
Admission Year | This indicates the year in which the user was admitted into SMU. | AY_2014 |
Graduating Year | This indicates the year in which the user graduates from SMU. | GY_2015 |
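The student information can be matched to the proxy log through the hashed student ID. The sketch below parses one student record and looks up the profile for a log entry. It assumes records are comma-separated as in the sample, that the level of education is the first word of the second field, and that programme names contain no commas. The field names and the `log_event` dictionary are hypothetical.

```python
def parse_student_record(line):
    """Split one student record into named fields, or return None if malformed."""
    # Remove surrounding whitespace and quotation marks, then split on commas.
    fields = [f.strip() for f in line.strip().strip('“”"').split(",")]
    if len(fields) != 5:
        return None
    user_hash, level_and_school, programme, admission, graduation = fields
    # "MASTER School of Accountancy" -> level "MASTER", school "School of Accountancy"
    level, _, school = level_and_school.partition(" ")
    return {
        "user_hash": user_hash,
        "level": level,
        "school": school,
        "programme": programme,
        "admission_year": int(admission.split("_", 1)[1]),    # "AY_2014" -> 2014
        "graduating_year": int(graduation.split("_", 1)[1]),  # "GY_2015" -> 2015
    }


SAMPLE_RECORD = (
    "feb0e4d05b236c0bcc0c7331dc754921cf9189c4c1317b0b112696fcf68cd2f8, "
    "MASTER School of Accountancy, MSc in CFO Leadership, AY_2014, GY_2015"
)

student = parse_student_record(SAMPLE_RECORD)

# Index students by hashed ID so each log entry can be matched in constant time.
students_by_hash = {student["user_hash"]: student}

# Hypothetical parsed log entry that carries the same hashed ID.
log_event = {"user_hash": student["user_hash"], "method": "GET"}
profile = students_by_hash.get(log_event["user_hash"])
```

Log entries whose hashed ID has no match return `None` from the lookup, which makes it easy to count how many entries cannot be linked to a student record.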