Difference between revisions of "ANLY482 AY2016-17 T2 Group7: Project Overview"

(Added: Interim: Adjusted Project Requirements)
m
 
(2 intermediate revisions by the same user not shown)
Line 11: Line 11:
  
 
| style="border-bottom:7px solid #005192;" width="20%" |
 
| style="border-bottom:7px solid #005192;" width="20%" |
[[ANLY482_AY2016-17_T2_Group7: Project Findings | <font color="#bbdefb">Project Findings</font>]]
+
[[ANLY482_AY2016-17_T2_Group7: Methodology | <font color="#bbdefb">Project Findings</font>]]
  
 
| style="border-bottom:7px solid #005192;" width="20%" |
 
| style="border-bottom:7px solid #005192;" width="20%" |
Line 24: Line 24:
  
 
<!-- Start Information -->
 
<!-- Start Information -->
<div style="background:#307FBB; line-height:0.3em; font-family:sans-serif; font-size:120%; border-left:#bbdefb solid 15px;"><div style="border-left:#fff solid 5px; padding:15px;"><font color="#fff"><strong>Business Problem & Motivation</strong></font></div></div>
+
<div style="background:#307FBB; line-height:0.3em; font-family:sans-serif; font-size:120%; border-left:#bbdefb solid 15px;"><div style="border-left:#fff solid 5px; padding:15px;"><font color="#fff"><strong>Introduction</strong></font></div></div>
  
 
<div style="color:#212121;">
 
<div style="color:#212121;">
The role of the analytics team (part of Learning & Information Services) in Li Ka Shing Library is to discover meaningful insights about user behaviour so as to provide necessary assistance in forms of library e-resources training, helpdesk and support. However, the current problem is that they do not know what to do with the logging data collected from the library’s main web page, http://library.smu.edu.sg/. Thus, the logging data files are neglected and therefore the library analytics team wishes to collaborate with us in realizing the full potential of this data.
+
The project sponsor, Singapore Management University’s Library, which consists of the Li Ka Shing Library and the Kwa Geok Choo Law Library, offers a wide array of electronic research resources through its EZproxy server. The Library wants deeper insight into how students access its online databases through EZproxy. Although the librarians hold an extensive repository of EZproxy log files, they lacked the time and resources to process the data for analysis, and so could not use it to improve the user experience. This paper presents our solution, developed in Python 3.0.1 using Jupyter Notebook, which processes and cleans the EZproxy data. The processed data is then tested against 2 test cases: a data analysis of search counts, and text analytics on 3 databases, namely Euromonitor, Lawnet and Marketline Advantage. The paper concludes with possible directions for continuing the project.
 
 
We felt that this is a great opportunity lost as the log data files could be extracted and analyzed in order to provide valuable insights for the Library team so that e-resources could be better optimized. The exact problem which this would solve is unknown to us even at this stage, however, we felt that this should not hamper our motivation to embark on this project. An EDA on the dataset could reveal much about the user journey as he/she completes his/her searches on the library e-resources database. The results of this may impact and change the user experience for the better. We may not be able to witness the change but it would be really pleasing to see the start of a change.
 
 
</div>
 
</div>
  
  
<div style="background:#307FBB; line-height:0.3em; font-family:sans-serif; font-size:120%; border-left:#bbdefb solid 15px;"><div style="border-left:#fff solid 5px; padding:15px;"><font color="#fff"><strong>Project Objective</strong></font></div></div>
+
<div style="background:#307FBB; line-height:0.3em; font-family:sans-serif; font-size:120%; border-left:#bbdefb solid 15px;"><div style="border-left:#fff solid 5px; padding:15px;"><font color="#fff"><strong>Motivation</strong></font></div></div>
  
 
<div style="color:#212121;">
 
<div style="color:#212121;">
This project aims to do analysis on log files to:
+
Currently, there is no single platform on which EZproxy log data can be processed into proper data frames or from which topics can be extracted. We saw this as an opportunity: the log files could be extracted and analyzed to give the SMU library team insights for better optimizing the electronic resources database for its users. This motivation resonates with us as students who actively use the SMU library’s electronic resources. We use these databases frequently to research academic projects and have often had trouble finding the most relevant results on the most appropriate platform. We therefore believe that preparing the raw log data on a single platform, together with formulating possible directions for data analysis and textual analytics, could allow the SMU library team to work more efficiently with the data collected. This in turn could inform future projects on optimizing the electronic resources database for current and future SMU students.
# Understand user behavior by using a data-driven approach to better discern the reach for each e-resource and the querying capabilities of each user category.
 
# Understand the relationship between different search queries for different users
 
# Examine the event sequence for unique users (E.g. What articles did User A searched together or 1 after another in sequence) to provide recommendations for improvement in User Experience. For example, these event sequence insights could complement the optimization of the Search function, to suggest other searches that are based on each unique users’ event sequence.
 
  
 
</div>
 
</div>
  
  
<div style="background:#307FBB; line-height:0.3em; font-family:sans-serif; font-size:120%; border-left:#bbdefb solid 15px;"><div style="border-left:#fff solid 5px; padding:15px;"><font color="#fff"><strong>Interim: Adjusted Project Requirements</strong></font></div></div>
+
<div style="background:#307FBB; line-height:0.3em; font-family:sans-serif; font-size:120%; border-left:#bbdefb solid 15px;"><div style="border-left:#fff solid 5px; padding:15px;"><font color="#fff"><strong>Objectives</strong></font></div></div>
  
 
<div style="color:#212121;">
 
<div style="color:#212121;">
As with our discussions with our project Supervisor, we had a mutual understanding that there would be a huge amount of time spent on the pre-processing of the dataset due to the nature of the extraction process. Thus, the expectations for our project would be different from that of other projects, where we would not have a working sandbox model by this interim report/presentation. Instead, the expectations we have to deliver would be:
+
This project aims to create a single platform that enables the preparation of raw EZproxy log data and the extraction of search queries. It is built in Jupyter Notebook using Python 3.0.1 to offer a ‘plug & play’ solution for preparing future EZproxy data collected by the SMU library team. The processed data is then tested against 2 test cases, which cover insights on search counts and textual analytics on 3 electronic databases: Euromonitor, Lawnet and Marketline.
# Data curation done with data of log files tabulated into proper data tables
 
# Exploratory data analysis done and tested against a few library databases to see if it works & thought process to be written down
 
# Gap analysis of excessive system logging of search queries (logged every 2 character presses) & recognizing that triggering of multiple logging before searches are completed is a wastage of resources and can potentially slow down the server, hindering the overall experience for end-users
 
 
 
 
</div>
 
</div>
  
Line 66: Line 57:
 
<div style="color:#212121;">
 
<div style="color:#212121;">
 
Proxy log data:  
 
Proxy log data:  
<p>59.189.71.33 tDU1zb0CaV2B8qZ 65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba [01/Jan/2016:00:01:39 +0800] "GET http://heinonline.org:80/HOL/VMTP?base=js&handle=hein.journals/fchlj23&div=7&collection=journals&input=(The%20Great%20Peace)&set_as_cursor=19&disp_num=20&viewurl=SearchVolumeSOLR%3Finput%3D%2528The%2520Great%2520Peace%2529%26div%3D7%26f_size%3D600%26num_results%3D10%26handle%3Dhein.journals%252Ffchlj23%26collection%3Djournals%26set_as_cursor%3D19%26men_tab%3Dsrchresults%26terms%3D%2528The%2520Great%2520Peace%2529 HTTP/1.1" 200 2121 "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"</p>
+
<p>59.189.71.33 tDU1zb0CaV2B8qZ 65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba [01/Jan/2016:00:01:36 +0800] "GET http://heinonline.org:80/HOL/VMTP?base=js&handle=hein.journals/bclr54&div=62&collection=journals&input=(The%20Great%20Peace)&set_as_cursor=0&disp_num=1&viewurl=SearchVolumeSOLR%3Finput%3D%2528The%2520Great%2520Peace%2529%26div%3D62%26f_size%3D600%26num_results%3D10%26handle%3Dhein.journals%252Fbclr54%26collection%3Djournals%26set_as_cursor%3D0%26men_tab%3Dsrchresults%26terms%3D%2528The%2520Great%2520Peace%2529 HTTP/1.1" 200 2291 "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"</p>
 
Proxy Log Data:
 
Proxy Log Data:
 
{| class="wikitable"
 
{| class="wikitable"
Line 74: Line 65:
 
! Example
 
! Example
 
|-
 
|-
| Http address
+
| libuser_ID
| This is the IP address of the webpage
+
| Student ID hashed by the SMU Library so as to protect the identity of users 
| 59.189.71.33
+
| 65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba
 
|-
 
|-
| Session ID
+
| libsession_ID
| Each session is identified by an unique ID, which corresponds to 1 session by a single user
+
| Each session is identified by a unique ID, which corresponds to 1 session by a single user  
| tDU1zb0CaV2B8qZ
+
| tDU1zb0CaV2B8qZ  
 
|-
 
|-
| Unique Student ID (Hashed)
+
| search_database
| The student ID is hashed by the SMU Library so as to protect the identity of users
+
| The e-resources database on which the search query was run
| 65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba
+
| heinonline
 
|-
 
|-
| Timestamp
+
| timestamp
| This is the timing which the log is recorded, and the log is recorded whenever the user performs a task. The time is in 24 hours format and in local Singapore time GST+0800.
+
| Date and time at which the user executed the search query, in the format DD/MMM/YYYY:HH:MM:SS
| [01/Jan/2016:00:01:39 +0800]
+
| 01/Jan/2016:00:01:36
 
|-
 
|-
| HTML method
+
| search_query
| The search query by the user typically comes after this HTML method.
+
| Search query entered by the user, as it appears (URL-encoded) in the log
| GET
+
| (The%20Great%20Peace)
 
|}
 
|}
 
Student Information Data:
 
Student Information Data:
<p>“feb0e4d05b236c0bcc0c7331dc754921cf9189c4c1317b0b112696fcf68cd2f8, MASTER School of Accountancy, MSc in CFO Leadership, AY_2014, GY_2015”
 
</p>
 
 
{| class="wikitable"
 
{| class="wikitable"
 
|-
 
|-
Line 103: Line 92:
 
! Example
 
! Example
 
|-
 
|-
| Unique Student ID (Hashed)
+
| libuser_ID
| This is provided so that we can match the unique student ID to the corresponding ones in the proxy data logs.
+
| Student ID hashed by the SMU Library so as to protect the identity of users
| feb0e4d05b236c0bcc0c7331dc754921cf9189c4c1317b0b112696fcf68cd2f8
+
| 65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba
|-
 
| Level of Education
 
| This indicates which level of education the user is in, typically Masters or Bachelors programme.
 
| MASTER
 
 
|-
 
|-
| Unique Student ID (Hashed)
+
| school
| The student ID is hashed by the SMU Library so as to protect the identity of users
+
| This indicates the school that the user is from
| 65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba
+
| School of Law
 
|-
 
|-
| School
+
| programme_type
| This indicates the school that the user is from.
+
| This indicates the specific programme the user is undertaking
| School of Accountancy
+
| Bachelor of Laws
 
|-
 
|-
| Type of Programme
+
| admission_year
| This indicates the specific programme the user is undertaking.
+
| This indicates the year in which the user was admitted into SMU
| MSc in CFO Leadership
+
| AY_2013
 
|-
 
|-
| Admission Year
+
| graduating_year
| This indicates the year which the user is admitted into SMU.
+
| This indicates the year in which the user graduates from SMU
| AY_2014
+
| GY_2017
 
|-
 
|-
| Graduating Year
+
| education_level
| This indicates the year which the user is graduated from SMU. 
+
| This indicates the user’s level of education, typically a Master’s or Bachelor’s programme
| GY_2015
+
| UNDERGRADUATE STUDENTS
 
|-
 
|-
 
|}
 
|}

Latest revision as of 18:22, 20 April 2017



Introduction

The project sponsor, Singapore Management University’s Library, which consists of the Li Ka Shing Library and the Kwa Geok Choo Law Library, offers a wide array of electronic research resources through its EZproxy server. The Library wants deeper insight into how students access its online databases through EZproxy. Although the librarians hold an extensive repository of EZproxy log files, they lacked the time and resources to process the data for analysis, and so could not use it to improve the user experience. This paper presents our solution, developed in Python 3.0.1 using Jupyter Notebook, which processes and cleans the EZproxy data. The processed data is then tested against 2 test cases: a data analysis of search counts, and text analytics on 3 databases, namely Euromonitor, Lawnet and Marketline Advantage. The paper concludes with possible directions for continuing the project.


Motivation

Currently, there is no single platform on which EZproxy log data can be processed into proper data frames or from which topics can be extracted. We saw this as an opportunity: the log files could be extracted and analyzed to give the SMU library team insights for better optimizing the electronic resources database for its users. This motivation resonates with us as students who actively use the SMU library’s electronic resources. We use these databases frequently to research academic projects and have often had trouble finding the most relevant results on the most appropriate platform. We therefore believe that preparing the raw log data on a single platform, together with formulating possible directions for data analysis and textual analytics, could allow the SMU library team to work more efficiently with the data collected. This in turn could inform future projects on optimizing the electronic resources database for current and future SMU students.


Objectives

This project aims to create a single platform that enables the preparation of raw EZproxy log data and the extraction of search queries. It is built in Jupyter Notebook using Python 3.0.1 to offer a ‘plug & play’ solution for preparing future EZproxy data collected by the SMU library team. The processed data is then tested against 2 test cases, which cover insights on search counts and textual analytics on 3 electronic databases: Euromonitor, Lawnet and Marketline.


Datasets

Proxy log data and student information data (student identities are hashed)


Data Dictionary

Proxy log data:

59.189.71.33 tDU1zb0CaV2B8qZ 65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba [01/Jan/2016:00:01:36 +0800] "GET http://heinonline.org:80/HOL/VMTP?base=js&handle=hein.journals/bclr54&div=62&collection=journals&input=(The%20Great%20Peace)&set_as_cursor=0&disp_num=1&viewurl=SearchVolumeSOLR%3Finput%3D%2528The%2520Great%2520Peace%2529%26div%3D62%26f_size%3D600%26num_results%3D10%26handle%3Dhein.journals%252Fbclr54%26collection%3Djournals%26set_as_cursor%3D0%26men_tab%3Dsrchresults%26terms%3D%2528The%2520Great%2520Peace%2529 HTTP/1.1" 200 2291 "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"

Proxy Log Data:

{| class="wikitable"
|-
! Parameters
! Description
! Example
|-
| libuser_ID
| Student ID hashed by the SMU Library so as to protect the identity of users
| 65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba
|-
| libsession_ID
| Each session is identified by a unique ID, which corresponds to 1 session by a single user
| tDU1zb0CaV2B8qZ
|-
| search_database
| The e-resources database on which the search query was run
| heinonline
|-
| timestamp
| Date and time at which the user executed the search query, in the format DD/MMM/YYYY:HH:MM:SS
| 01/Jan/2016:00:01:36
|-
| search_query
| Search query entered by the user, as it appears (URL-encoded) in the log
| (The%20Great%20Peace)
|}
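
The mapping from a raw log line to the fields in the table above can be sketched as follows. This is a minimal illustration and not the project's actual notebook code: the regular expression assumes the field order seen in the sample line (IP address, session ID, hashed user ID, timestamp, request, status, size), the function name `parse_log_line` is hypothetical, and the rule for deriving `search_database` from the host name and the use of the `input` URL parameter are assumptions based on the single HeinOnline example. Other databases would likely need their own parameter names.

```python
import re
from urllib.parse import unquote, urlsplit

# Assumed field order, read off the sample line: IP address, session ID,
# hashed user ID, [timestamp offset], "METHOD URL PROTOCOL", status, size.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) (?P<libsession_ID>\S+) (?P<libuser_ID>\S+) '
    r'\[(?P<timestamp>\S+) (?P<utc_offset>[+-]\d{4})\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line, query_param="input"):
    """Return a dict of data-dictionary fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None

    parts = urlsplit(match.group("url"))
    labels = (parts.hostname or "").split(".")
    # Hypothetical rule: use the label just before the top-level domain,
    # e.g. "heinonline.org" -> "heinonline".
    search_database = labels[-2] if len(labels) >= 2 else labels[0]

    # Keep the raw (still URL-encoded) value of the first matching parameter.
    raw_query = None
    for pair in parts.query.split("&"):
        name, _, value = pair.partition("=")
        if name == query_param:
            raw_query = value
            break

    return {
        "libuser_ID": match.group("libuser_ID"),
        "libsession_ID": match.group("libsession_ID"),
        "search_database": search_database,
        "timestamp": match.group("timestamp"),
        "search_query": raw_query,
        "search_query_decoded": unquote(raw_query) if raw_query is not None else None,
    }


# The sample proxy log line from this page, split across literals for readability.
SAMPLE_LINE = (
    '59.189.71.33 tDU1zb0CaV2B8qZ '
    '65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba '
    '[01/Jan/2016:00:01:36 +0800] '
    '"GET http://heinonline.org:80/HOL/VMTP?base=js&handle=hein.journals/bclr54'
    '&div=62&collection=journals&input=(The%20Great%20Peace)&set_as_cursor=0'
    '&disp_num=1&viewurl=SearchVolumeSOLR%3Finput%3D%2528The%2520Great%2520Peace%2529'
    '%26div%3D62%26f_size%3D600%26num_results%3D10%26handle%3Dhein.journals%252Fbclr54'
    '%26collection%3Djournals%26set_as_cursor%3D0%26men_tab%3Dsrchresults'
    '%26terms%3D%2528The%2520Great%2520Peace%2529 HTTP/1.1" 200 2291 '
    '"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"'
)

if __name__ == "__main__":
    for field, value in parse_log_line(SAMPLE_LINE).items():
        print(field, "=", value)
```

The sketch keeps the raw, URL-encoded query alongside a decoded copy, since the table shows the encoded form while text analytics would presumably work on the decoded text.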

Student Information Data:

{| class="wikitable"
|-
! Parameters
! Description
! Example
|-
| libuser_ID
| Student ID hashed by the SMU Library so as to protect the identity of users
| 65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba
|-
| school
| This indicates the school that the user is from
| School of Law
|-
| programme_type
| This indicates the specific programme the user is undertaking
| Bachelor of Laws
|-
| admission_year
| This indicates the year in which the user was admitted into SMU
| AY_2013
|-
| graduating_year
| This indicates the year in which the user graduates from SMU
| GY_2017
|-
| education_level
| This indicates the user’s level of education, typically a Master’s or Bachelor’s programme
| UNDERGRADUATE STUDENTS
|}
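
Because both datasets share the hashed `libuser_ID`, parsed log rows can be linked to student attributes for the search-count analysis. The sketch below shows one way this might be done with plain dictionaries. The helper names `index_students` and `count_searches_by` are hypothetical, the rows reuse the example values from the tables on this page, and the `"unmatched-placeholder-id"` row is an invented placeholder to show how users missing from the student data are handled.

```python
from collections import Counter


def index_students(student_rows):
    """Build a lookup from hashed libuser_ID to the student's record."""
    return {row["libuser_ID"]: row for row in student_rows}


def count_searches_by(log_rows, students_by_id, field="school"):
    """Count searches per (student attribute, search_database) pair.

    Log rows whose libuser_ID has no student record are grouped as "UNKNOWN".
    """
    counts = Counter()
    for log_row in log_rows:
        student = students_by_id.get(log_row["libuser_ID"])
        group = student[field] if student is not None else "UNKNOWN"
        counts[(group, log_row["search_database"])] += 1
    return counts


# Example values taken from the data dictionary tables above.
USER_ID = "65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba"

students = [
    {
        "libuser_ID": USER_ID,
        "school": "School of Law",
        "programme_type": "Bachelor of Laws",
        "admission_year": "AY_2013",
        "graduating_year": "GY_2017",
        "education_level": "UNDERGRADUATE STUDENTS",
    },
]

log_rows = [
    {"libuser_ID": USER_ID, "search_database": "heinonline",
     "timestamp": "01/Jan/2016:00:01:36"},
    {"libuser_ID": USER_ID, "search_database": "heinonline",
     "timestamp": "01/Jan/2016:00:01:39"},
    # Invented placeholder ID with no matching student record.
    {"libuser_ID": "unmatched-placeholder-id", "search_database": "heinonline",
     "timestamp": "01/Jan/2016:00:02:00"},
]

search_counts = count_searches_by(log_rows, index_students(students))

if __name__ == "__main__":
    for (group, database), n in sorted(search_counts.items()):
        print(group, database, n)
```

Passing `field="education_level"` or `field="programme_type"` would group the same counts by a different student attribute.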
