Difference between revisions of "ANLY482 AY2016-17 T2 Group7: Exploratory Data Analysis"

From Analytics Practicum
m
Line 39: Line 39:
  
 
<!-- Start Information -->
 
<!-- Start Information -->
<div style="background:#307FBB; line-height:0.3em; font-family:sans-serif; font-size:120%; border-left:#bbdefb solid 15px;"><div style="border-left:#fff solid 5px; padding:15px;"><font color="#fff"><strong>Exploratory Data Analysis</strong></font></div></div>
+
<div style="background:#307FBB; line-height:0.3em; font-family:sans-serif; font-size:120%; border-left:#bbdefb solid 15px;"><div style="border-left:#fff solid 5px; padding:15px;"><font color="#fff"><strong>Test Case I: Data Analysis</strong></font></div></div>
  
<div style="color:#212121;">
+
After processing the data, we tested it against 2 test cases, namely the data analysis of the search counts and text analytics of 2 databases, Euromonitor and Lawnet. The first test case is as follows:  
[[File:BJJ1.png|700px]]<br/>
+
Tools: Tableau 10.1, SAS Enterprise Guide 7.1 (64-bit)
''Chart 1: Overall Search Counts by Month for All Users''<br/>
 
 
 
[[File:Overall search by existing students.png|700px]]<br/>
 
''Chart 1.1: Overall Search by Month for Existing Students''<br/>
 
 
 
[[File:Search counts by existing students during academic weeks1.png|1040px]]<br/>
 
''Chart 1.2: Search Count by Existing Students during Academic Weeks''<br/>
 
 
 
[[File:Overall search by alumni.png|700px]]<br/>
 
''Chart 1.3: Overall Search by Month for Alumni''<br/>
 
 
 
[[File:Search Counts by Alumni during Academic Weeks1.png|1040px]]<br/>
 
''Chart 1.4: Search Count by Alumni during Academic Weeks''<br/>
 
 
 
[[File:user_group_search_counts.jpg|300px]]<br/>
 
''Chart 2: User Group Search Counts''<br/>
 
  
[[File:others_search_counts.jpg|300px]]<br/>
+
For the 12 months’ worth of data (2016_processed_log.csv)<br/>
''Chart 3: Search Count of 'Others'''<br/>
 
  
 
{| class="wikitable"
 
{| class="wikitable"
 
| Subject Matter:
 
| Awareness of the number of searches throughout the year
 
 
|-
 
|-
| Thought Process:
+
! Parameters !! Description !! Example
| We want to understand the number of searches throughout the year and see if there are any observable trends.
+
|-
Thus, we initiated the break down of the number of searches by months, to have a better look at where the peak periods are.
+
| libuser_ID || Student ID hashed by the SMU Library so as to protect the identity of users  || 65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba
 
+
|-
 +
| libsession_ID || Each session is identified by a unique ID, which corresponds to 1 session by a single user || tDU1zb0CaV2B8qZ
 +
|-
 +
| search_database || The e-resources database which the search query is searched on || heinonline
 +
|-
 +
| timestamp || Date and time when the search query is executed by the user in the format: DD/MMM/YYYY:HH:MM:SS || 01/Jan/2016:00:01:36
 
|-
 
|-
| Analysis:
+
| search_query || Search query that was being searched by the user || (The%20Great%20Peace)
| There is great variation in the number of searches across the span of a year, and these searches on the Ezproxy are contributed by students - Undergraduate, Masters, PhD and others (international exchange, local exchange, visiting students). As the users of the Ezproxy site are students of Singapore Management University, the spike in the number of searches can be seen during the months of the regular semesters (Term 1 and 2) - January to March and Mid-August to November.
+
|}
  
'''Identifying the start and end of regular terms just by looking at the number of searches'''
+
Student Information Data (Student User List)
 
 
In Chart 1, we could potentially identify the start and end of the 2 regular terms just by observing where the number of searches experience a gradual dip.
 
 
 
The overall trend of the number of searches forms the shape of a jagged mountain for both terms, thus the start and ends of the mountains fall around the start and ends of the terms.
 
 
 
'''Existing Students and Alumni in Charts 1.1 & 1.3 respectively'''
 
 
 
We discovered that Chart 1 may not reveal much about who are the users who are actually performing the searches. Thus, we decided to filter by Graduation Year to showcase the Overall Search by Month for Existing Students and Alumni in Charts 1.1 & 1.3 respectively.
 
 
 
The filtering and classification is as follows:
 
  
 
{| class="wikitable"
 
{| class="wikitable"
 
|-
 
|-
! Graduating Year !! Type of User (Existing Student or Alumni) !! Thought Process
+
! Original Parameters || New Parameters !! Description !! Example
 
|-
 
|-
| Null || Existing Student || This consists of users who are still far away from their graduating year
+
| email || libuser_ID || Student ID hashed by the SMU Library so as to protect the identity of users || 65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba
 
|-
 
|-
| GY_2012 || Alumni || This indicates the students who have graduated in 2012 and are considered ‘alumni’ in 2016 where this dataset is based.
+
| Statistical Category 1 || school || This indicates the school that the user is from || School of Law
 
|-
 
|-
| GY_2013 || Alumni || This indicates the students who have graduated in 2013 and are considered ‘alumni’ in 2016 where this dataset is based.
+
| Statistical Category 2 || programme_type || This indicates the specific programme the user is undertaking  || Bachelor of Laws
 
|-
 
|-
| GY_2014 || Alumni || This indicates the students who have graduated in 2014 and are considered ‘alumni’ in 2016 where this dataset is based.
+
| Statistical Category 3 || admission_year || This indicates the year in which the user was admitted to SMU || AY_2013
 
|-
 
|-
| GY_2015 || Alumni || This indicates the students who have graduated in 2015 and are considered ‘alumni’ in 2016 where this dataset is based.
+
| Statistical Category 4 || graduating_year || This indicates the year in which the user graduates from SMU || GY_2017
 
|-
 
|-
| GY_2016 || Existing Student || This indicates students who are graduating in 2016 but are still considered students in the year 2016.
+
| User Group || education_level || This indicates which level of education the user is in, typically Masters or Bachelors programme || UNDERGRADUATE STUDENTS
|-
 
| GY_2017 || Existing Student || This indicates students who are graduating in 2017 but are still considered students in the year 2016.
 
 
|}
 
|}
  
We observed that the line chart in Chart 1.1 follows about the same shape as that of Chart 1. This could be due to existing students contributing to majority of the overall searches.  
+
We assume that each unique combination of session ID, user ID and database represents one search query, and group the data set by these 3 variables.
 +
 
 +
The search count is extracted from the log data and is valuable for understanding the search behaviour of SMU students throughout 2016. Trends and peaks emerge when the number of searches is broken down by month.
 +
<br/>
 +
 
 +
<div style="color:#212121;">
 +
[[File:BJJ1.png|700px]]<br/>
 +
''Chart 1: Overall Search Counts by Month for All Users''<br/>
 +
 
 +
[[File:Overall search by existing students.png|700px]]<br/>
 +
''Chart 1.1: Overall Search by Month for Existing Students''<br/>
 +
 
 +
[[File:Search counts by existing students during academic weeks1.png|1040px]]<br/>
 +
''Chart 1.2: Search Count by Existing Students during Academic Weeks''<br/>
 +
 
 +
[[File:user_group_search_counts.jpg|300px]]<br/>
 +
''Chart 2: User Group Search Counts''<br/>
  
In Chart 1.3, however, the shape is vastly different from that of Chart 1. The dip after April 2016 could be because alumni typically receive their job offers around that period and are thus less involved in searching for e-resources than when they were still students.
+
[[File:others_search_counts.jpg|300px]]<br/>
 +
''Chart 3: Search Count of 'Others'''<br/>
  
'''Chart 1.2: Students in Academic Weeks'''
+
<u>Analysis: Awareness of the number of searches throughout the year </u> <br/>
 +
The number of searches varies greatly over the year. These searches on EZproxy come from students: Undergraduate, Masters, PhD and others (international exchange, local exchange and visiting students). As the users of the EZproxy site are students of Singapore Management University, the number of searches spikes during the regular Terms (Terms 2 and 1), which run from January to March and from mid-August to November respectively.
  
From Chart 1.1, we decided to generate another chart showing how students search throughout the weeks in academic terms. We observed that the peaks in the regular terms, Terms 2 and 1, occur during Week 8, which is the recess week. This could be because that majority of the students start their research during recess week.  
+
In Chart 1, we could potentially identify the start and end of the 2 regular Terms just by observing where the number of searches dips gradually. The overall trend forms the shape of a jagged mountain for each Term, and the start and end of each mountain fall around the start and end of the Term. From Chart 1.1, we generated another chart (Chart 1.2) showing how students search across the weeks of the academic terms. We observed that the peaks in both regular Terms occur during Week 8, which is the recess week. This could be because the majority of students start their research during recess week.
  
Next, we observed that there is a decrease in the number of searches in the weeks following the recess week (Week 8) and then we noticed there is an unusual increase in the number of searches again in Week 14, which is the study week. This same trend can be seen on both Term 2 & 1. We believe that this increase in the number of searches could be due to the students performing searches as they revise for their final examinations.  
+
Next, we observed that the number of searches decreases in the weeks following the recess week (Week 8) and then rises again in Week 14, which is the study week. The same trend appears in both Terms 2 and 1. We believe this increase could be due to students performing searches as they revise for their final examinations. <br/>
  
'''Chart 1.4: Alumni in Academic Weeks'''
+
<u> Discussion: </u> <br/>
 +
We want to understand the number of searches throughout the year and see if there are any observable trends. Thus, we broke down the number of searches by month to get a better look at where the peak periods are.
  
From Chart 1.4, we observed that the number of searches for Term 2 is close to none. The data for Term 1 shows no recognizable pattern.
+
The sponsors will be able to use these results to estimate the load their server must be ready to handle at different periods of the year, especially during the undergraduate semesters. Furthermore, the results can help the sponsors decide when in the year to organize library training on effective academic search querying, which differs greatly from the generic search methods users typically apply on search engines such as Google. <br/>
|}
 
  
 
[[File:Chart4.png|1040px]]<br/>
 
[[File:Chart4.png|1040px]]<br/>
Line 130: Line 121:
 
[[File:Bjj6.png|1040px]]<br/>
 
[[File:Bjj6.png|1040px]]<br/>
 
''Chart 6: Search Count by Days for Term 2: Jan-March 2016''<br/>
 
''Chart 6: Search Count by Days for Term 2: Jan-March 2016''<br/>
 
[[File:Bjj6.png|1040px]]<br/>
 
''Chart 7: Search Count by Days for Term 1: Aug-Nov 2016''<br/>
 
  
 
[[File:Chart5.png|1040px]]<br/>
 
[[File:Chart5.png|1040px]]<br/>
Line 140: Line 128:
 
''Chart 8: Chinese New Year in 2016''<br/>
 
''Chart 8: Chinese New Year in 2016''<br/>
  
{| class="wikitable"
 
 
| Subject Matter:
 
| Understanding the students’ behaviours in searches during the semesters.
 
|-
 
| Thought Process:
 
| In SMU, one common perception is that SMU students study all day every day, even on weekends. Thus, we want to see if this perception is indeed true.
 
 
Next, we want to see if there is a surge in searches when the semester reaches the week where group projects are released. This is because projects in SMU largely require students to perform desk research, and one of the many places to do so is the SMU library’s EZproxy e-resources database.
 
 
|-
 
| Analysis:
 
| '''Dip in Weekend Searches'''
 
 
From Charts 3 & 4, we noticed that the number of searches dips every weekend (Saturdays & Sundays). For example, there is a plunge in the number of searches on 16 January (Saturday). This suggests that the perception of SMU students studying all day every day, even on weekends, may be untrue. Alternatively, SMU students may simply perform fewer searches for their research on weekends.
 
 
'''Research on Recess Week?'''
 
 
However, upon contrasting the trends in Charts 5 and 6 side by side, we discovered that there is always a spike in the number of searches on the first day of Recess week in both Terms. For Term 2, Recess week starts on 22 Feb, and there is a visible spike from 21 Feb to 22 Feb. For Term 1, Recess week starts on 3 Oct, and there is also a visible spike from 2 Oct to 3 Oct (this spike happens to be the highest in the entire Term 1). In both cases, the number of searches decreases gradually until the end of the Recess week (28 Feb for Term 2 and 9 Oct for Term 1). This is an interesting discovery, as it potentially shows that students typically start their research on the first day of Recess week, contributing to the spike, and that the amount of research they perform tapers off as the Recess week comes to a close.
 
 
'''Highest Spike in Term 2: 11 Feb, end of CNY?'''
 
 
From Chart 7, we observed the highest spike in Term 2, which takes place on 11 Feb 2016. We could not find a possible explanation for this other than it being the end of the Chinese New Year holidays (9 & 10 Feb 2016) and students may be picking up on their research, thus explaining the spike in number of searches performed on 11 Feb 2016.
 
 
|}
 
  
 
[[File:Percent_of_Search_Counts_by_Degrees_in_Weekends.png|800px]]<br/>
 
[[File:Percent_of_Search_Counts_by_Degrees_in_Weekends.png|800px]]<br/>

Revision as of 13:42, 20 April 2017



Test Case I: Data Analysis

After processing the data, we tested it against 2 test cases, namely the data analysis of the search counts and text analytics of 2 databases, Euromonitor and Lawnet. The first test case is as follows:

Tools: Tableau 10.1, SAS Enterprise Guide 7.1 (64-bit)

For the 12 months’ worth of data (2016_processed_log.csv)

{| class="wikitable"
|-
! Parameters !! Description !! Example
|-
| libuser_ID || Student ID hashed by the SMU Library so as to protect the identity of users || 65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba
|-
| libsession_ID || Each session is identified by a unique ID, which corresponds to 1 session by a single user || tDU1zb0CaV2B8qZ
|-
| search_database || The e-resources database on which the search query is run || heinonline
|-
| timestamp || Date and time when the search query is executed by the user, in the format DD/MMM/YYYY:HH:MM:SS || 01/Jan/2016:00:01:36
|-
| search_query || Search query entered by the user || (The%20Great%20Peace)
|}
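As an illustration of how one row of the log could be read into usable values, here is a minimal Python sketch. It is not the team's actual processing (which used Tableau and SAS Enterprise Guide); the function name `parse_record` is hypothetical, and the timestamp format string is an assumption inferred from the example value 01/Jan/2016:00:01:36.

```python
from datetime import datetime
from urllib.parse import unquote

# Assumed format, inferred from the example timestamp "01/Jan/2016:00:01:36".
TIMESTAMP_FORMAT = "%d/%b/%Y:%H:%M:%S"


def parse_record(row):
    """Return a copy of a raw log row with a typed timestamp and a decoded query.

    `row` is a dict keyed by the column names listed in the table above.
    """
    parsed = dict(row)
    parsed["timestamp"] = datetime.strptime(row["timestamp"], TIMESTAMP_FORMAT)
    # Search queries are percent-encoded in the log, e.g. "%20" for a space.
    parsed["search_query"] = unquote(row["search_query"])
    return parsed


example = parse_record({
    "libuser_ID": "65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba",
    "libsession_ID": "tDU1zb0CaV2B8qZ",
    "search_database": "heinonline",
    "timestamp": "01/Jan/2016:00:01:36",
    "search_query": "(The%20Great%20Peace)",
})
```

Here `example["search_query"]` holds the decoded, human-readable query, and `example["timestamp"]` can be used directly for grouping by month, week or weekday.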

Student Information Data (Student User List)

{| class="wikitable"
|-
! Original Parameters !! New Parameters !! Description !! Example
|-
| email || libuser_ID || Student ID hashed by the SMU Library so as to protect the identity of users || 65ff93f70ca7ceaabcca62de3882ed1633bcd14ecdebbe95f9bd826bd68609ba
|-
| Statistical Category 1 || school || This indicates the school that the user is from || School of Law
|-
| Statistical Category 2 || programme_type || This indicates the specific programme the user is undertaking || Bachelor of Laws
|-
| Statistical Category 3 || admission_year || This indicates the year in which the user was admitted to SMU || AY_2013
|-
| Statistical Category 4 || graduating_year || This indicates the year in which the user graduates from SMU || GY_2017
|-
| User Group || education_level || This indicates which level of education the user is in, typically Masters or Bachelors programme || UNDERGRADUATE STUDENTS
|}

We assume that each unique combination of session ID, user ID and database represents one search query, and group the data set by these 3 variables.
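The grouping rule can be sketched in Python as follows. This is only an illustration of the stated assumption, not the SAS or Tableau steps actually used; the function name `group_searches` and the sample rows are hypothetical.

```python
from collections import defaultdict


def group_searches(rows):
    """Group log rows by (libuser_ID, libsession_ID, search_database).

    Under the stated assumption, each distinct key counts as one search,
    however many raw log lines share that key.
    """
    groups = defaultdict(list)
    for row in rows:
        key = (row["libuser_ID"], row["libsession_ID"], row["search_database"])
        groups[key].append(row)
    return dict(groups)


# Hypothetical sample: the first two rows share user, session and database.
sample_rows = [
    {"libuser_ID": "u1", "libsession_ID": "s1", "search_database": "heinonline"},
    {"libuser_ID": "u1", "libsession_ID": "s1", "search_database": "heinonline"},
    {"libuser_ID": "u1", "libsession_ID": "s1", "search_database": "lawnet"},
    {"libuser_ID": "u2", "libsession_ID": "s2", "search_database": "heinonline"},
]

grouped = group_searches(sample_rows)
search_count = len(grouped)  # number of distinct (user, session, database) keys
```

Keeping the raw rows under each key, rather than only a count, leaves the timestamps available for the later breakdowns by month and by day.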

The search count is extracted from the log data and is valuable for understanding the search behaviour of SMU students throughout 2016. Trends and peaks emerge when the number of searches is broken down by month.
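A monthly breakdown of this kind could be computed along the following lines. This is a sketch only: the function name `searches_per_month` is hypothetical, the timestamp format is assumed from the example value in the log description, and the input is taken to be one timestamp per search after grouping.

```python
from collections import Counter
from datetime import datetime

# Assumed format, inferred from the example timestamp "01/Jan/2016:00:01:36".
TIMESTAMP_FORMAT = "%d/%b/%Y:%H:%M:%S"


def searches_per_month(timestamps):
    """Count searches per (year, month), returned in chronological order."""
    counts = Counter()
    for raw in timestamps:
        ts = datetime.strptime(raw, TIMESTAMP_FORMAT)
        counts[(ts.year, ts.month)] += 1
    return dict(sorted(counts.items()))


# Hypothetical sample timestamps, one per search.
monthly = searches_per_month([
    "01/Jan/2016:00:01:36",
    "15/Jan/2016:10:00:00",
    "22/Feb/2016:09:30:00",
])
```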

BJJ1.png
Chart 1: Overall Search Counts by Month for All Users

Overall search by existing students.png
Chart 1.1: Overall Search by Month for Existing Students

Search counts by existing students during academic weeks1.png
Chart 1.2: Search Count by Existing Students during Academic Weeks

User group search counts.jpg
Chart 2: User Group Search Counts

Others search counts.jpg
Chart 3: Search Count of 'Others'

Analysis: Awareness of the number of searches throughout the year
The number of searches varies greatly over the year. These searches on EZproxy come from students: Undergraduate, Masters, PhD and others (international exchange, local exchange and visiting students). As the users of the EZproxy site are students of Singapore Management University, the number of searches spikes during the regular Terms (Terms 2 and 1), which run from January to March and from mid-August to November respectively.

In Chart 1, we could potentially identify the start and end of the 2 regular Terms just by observing where the number of searches dips gradually. The overall trend forms the shape of a jagged mountain for each Term, and the start and end of each mountain fall around the start and end of the Term. From Chart 1.1, we generated another chart (Chart 1.2) showing how students search across the weeks of the academic terms. We observed that the peaks in both regular Terms occur during Week 8, which is the recess week. This could be because the majority of students start their research during recess week.

Next, we observed that the number of searches decreases in the weeks following the recess week (Week 8) and then rises again in Week 14, which is the study week. The same trend appears in both Terms 2 and 1. We believe this increase could be due to students performing searches as they revise for their final examinations.
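To place each search in an academic week, a date can be converted to a week number relative to the start of its Term. The sketch below is illustrative only. The Week 1 start dates are assumptions, back-calculated from Week 8 (recess week) starting on 22 Feb 2016 for Term 2 and 3 Oct 2016 for Term 1; the names `TERM_STARTS` and `academic_week` are hypothetical.

```python
from datetime import date

# Assumed Week 1 Mondays: 7 weeks before the recess weeks starting
# 22 Feb 2016 (Term 2) and 3 Oct 2016 (Term 1).
TERM_STARTS = {
    "Term 2": date(2016, 1, 4),
    "Term 1": date(2016, 8, 15),
}


def academic_week(day, term_start):
    """Return the 1-based academic week of `day`, or None if before the Term."""
    offset_days = (day - term_start).days
    if offset_days < 0:
        return None
    return offset_days // 7 + 1


recess_term2 = academic_week(date(2016, 2, 22), TERM_STARTS["Term 2"])
recess_term1 = academic_week(date(2016, 10, 3), TERM_STARTS["Term 1"])
```

The function does not cap the week number, so dates after the end of a Term would need to be filtered out separately before charting.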

Discussion:
We want to understand the number of searches throughout the year and see if there are any observable trends. Thus, we broke down the number of searches by month to get a better look at where the peak periods are.

The sponsors will be able to use these results to estimate the load their server must be ready to handle at different periods of the year, especially during the undergraduate semesters. Furthermore, the results can help the sponsors decide when in the year to organize library training on effective academic search querying, which differs greatly from the generic search methods users typically apply on search engines such as Google.

Chart4.png
Chart 4: Dip in Weekends for Term 2: Jan-March 2016

BJJ5.png
Chart 5: Dip in Weekends for Term 1: Aug-Nov 2016

Bjj6.png
Chart 6: Search Count by Days for Term 2: Jan-March 2016

Chart5.png
Chart 7: Search Count by Days for Term 1: Aug-Nov 2016

Bjj8.png
Chart 8: Chinese New Year in 2016


Percent of Search Counts by Degrees in Weekends.png
Chart 9: Percentage of Search Counts by Degrees in Weekends for Term 2: Jan-March 2016

Percent of Search Counts by Degrees in Weekends from Sep to Dec.png
Chart 10: Percentage of Search Counts by Degrees in Weekends for Term 1: Sep to Nov 2016

Subject Matter: Understanding the percentage of weekend searches contributed by students across their Degrees
Thought Process: We want to dive deeper into the analysis of weekend searches and find out who still contributes to them, despite the dip in the number of weekend searches.
Analysis: In Chart 9, we noticed that 56.75% of searches performed on weekends were done by students enrolled in the Bachelor of Laws programme, which accounts for the majority. Additionally, 16.91% were done by students from Bachelor of Business Management and 7.36% by students from the Juris Doctor programme.

One possible conclusion from this observation is that students in the Law field (Bachelor of Laws and Juris Doctor programmes) do not typically stop searching or researching simply because it is the weekend. In addition, students in the Bachelor of Business Management programme contribute significantly to the number of weekend searches too, perhaps because the programme is research-intensive. This is in contrast to students from non-research-intensive programmes such as Bachelor of Science (Information Systems), at 1.64% of the total number of weekend searches.

In Chart 10, we can observe that the abovementioned trend is consistent for students in the Bachelor of Laws, Bachelor of Business Management and Juris Doctor programmes. Thus, our trend analysis holds for both Terms 1 and 2.
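The weekend percentages by programme could be derived roughly as in the sketch below. It is an illustration under assumptions rather than the team's Tableau workflow: the function name `weekend_share_by_programme`, the `programme_of` lookup (imagined as built from the Student User List by libuser_ID) and the sample data are all hypothetical.

```python
from collections import Counter
from datetime import datetime

# Assumed format, inferred from the example timestamp "01/Jan/2016:00:01:36".
TIMESTAMP_FORMAT = "%d/%b/%Y:%H:%M:%S"


def weekend_share_by_programme(searches, programme_of):
    """Return each programme's percentage share of weekend searches.

    `searches` is an iterable of (libuser_ID, timestamp string) pairs, one per
    search; `programme_of` maps libuser_ID to programme_type.
    """
    counts = Counter()
    for user_id, raw in searches:
        ts = datetime.strptime(raw, TIMESTAMP_FORMAT)
        if ts.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            counts[programme_of.get(user_id, "Unknown")] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {prog: round(100 * n / total, 2) for prog, n in counts.items()}


# Hypothetical sample: 16, 17 and 23 Jan 2016 fall on a weekend; 18 Jan is a Monday.
programme_of = {"u1": "Bachelor of Laws", "u2": "Bachelor of Business Management"}
shares = weekend_share_by_programme(
    [
        ("u1", "16/Jan/2016:14:00:00"),
        ("u1", "17/Jan/2016:09:00:00"),
        ("u1", "23/Jan/2016:20:15:00"),
        ("u2", "17/Jan/2016:11:30:00"),
        ("u2", "18/Jan/2016:11:30:00"),
    ],
    programme_of,
)
```

Weekday searches are excluded before the percentages are computed, so each share is relative to weekend searches only, matching how Charts 9 and 10 are described.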


Usage of databases by schools.jpg
Chart 11: Usage of Database by Schools

