ANLY482 AY2017-18T2 Group19 Project Findings

From Analytics Practicum
Revision as of 21:27, 15 April 2018 by Joanne.ong.2014 (talk | contribs)

ADDITIONAL COLUMNS

The same fields were added to the 2-hour, 3-hour and 3-day datasets solely to facilitate our data exploration. All calculations were performed in JMP. The following fields were included:

No.  Name of Field  Description
01 Loan_Timestamp This variable is the result of concatenating the ‘loan_date’ and ‘loan_time’ variables into a single field, which is easier to use and to read.
02 Return_Timestamp This variable is the result of concatenating the ‘return_date’ and ‘return_time’ variables into a single field, which is easier to use and to read.
03 Term This variable segments the ‘loan_timestamp’ into academic terms so that the analysis can be broken down by term later on. We obtained the start and end dates of each academic term from SMU’s official academic calendar document and applied an IF() logical statement to classify each loan timestamp into its academic term.
04 Hours_borrowed This variable refers to the number of hours borrowed per transaction. It is derived by calculating the difference between the ‘return_timestamp’ and the ‘loan_timestamp’ in hours. This could help us understand the usage pattern of each loan.
05 Assigned_loan_period This variable refers to the number of hours a user is entitled to when borrowing a book. The library’s policy is as follows:
G19 Assigned Loan Period Policy.png

The hours of usage allowed to the user differ depending on the hour in which the transaction occurs. As such, we created a calculated field that follows the above-mentioned rules by using an IF() logical statement.
06 Overdue_period This variable was created to further our analysis of users’ usage patterns. Because the assigned loan periods vary, we considered it necessary to take them into account when assessing whether the current loan period is sufficient for users. As such, ‘overdue_period’ is calculated by deducting ‘hours_borrowed’ from ‘assigned_loan_period’. A positive value indicates that the assigned loan period was adequate for the user, while a negative value indicates that the book was returned late. This value is also referred to as the sufficiency measure.
07 Overdue? This is a binary variable that classifies whether a transaction is overdue. A value of 1 indicates that the transaction was overdue and a value of 0 indicates otherwise. This variable will be used primarily to investigate users who return books late.
08 Time_Elapsed This variable calculates the time elapsed (in hours) since the same user last borrowed the same book. It is derived by subtracting the previous ‘return_timestamp’ from the ‘loan_timestamp’ of the current transaction, provided that both transactions involve the same ‘email’ and ‘title’. This variable will be used primarily to investigate users who borrow in succession.
09 Transaction_Group This variable serves as an identifier for transactions belonging to the same group. To be considered part of the same group, transactions must share the same ‘email’ and ‘title’ and have a ‘time_elapsed’ value of not more than 4 hours. This means that when the same user borrows the same title within 4 hours of the preceding ‘return_timestamp’, the transactions are treated as a single transaction. This variable will be used primarily to investigate users who borrow in succession.
10 Hours_Borrowed_With_Successions This variable sums the ‘hours_borrowed’ of all transactions belonging to the same ‘transaction_group’. Duplicates will be removed during the analysis. It is an updated version of ‘hours_borrowed’ that accounts for the successive borrowing behaviour that library users exhibit. This variable will be used primarily to investigate users who borrow in succession.
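The succession fields (08 to 10) can be illustrated outside JMP. The following is a minimal Python sketch, not the JMP formulas we actually used: it assumes each transaction is a dictionary with ‘email’, ‘title’, ‘loan_timestamp’ and ‘return_timestamp’ keys holding datetime values, and the function names `hours_between` and `add_succession_fields` are hypothetical.

```python
from datetime import datetime

MAX_GAP_HOURS = 4  # threshold stated in the Transaction_Group description


def hours_between(start, end):
    """Difference between two datetimes, in hours."""
    return (end - start).total_seconds() / 3600


def add_succession_fields(rows):
    """Return annotated copies of the rows, sorted by user, title and loan time.

    Adds hours_borrowed, time_elapsed, transaction_group and
    hours_borrowed_with_successions. Key names are assumptions.
    """
    # Sort so that each user's loans of a title are adjacent and chronological.
    ordered = sorted(rows, key=lambda r: (r["email"], r["title"], r["loan_timestamp"]))

    out, group_id, prev = [], 0, None
    for r in ordered:
        row = dict(r)
        row["hours_borrowed"] = hours_between(r["loan_timestamp"], r["return_timestamp"])

        same_key = prev is not None and (prev["email"], prev["title"]) == (r["email"], r["title"])
        # time_elapsed is only defined when the previous row is the same user and title.
        row["time_elapsed"] = (
            hours_between(prev["return_timestamp"], r["loan_timestamp"]) if same_key else None
        )

        # Start a new group unless this loan follows the previous return within 4 hours.
        if not (same_key and row["time_elapsed"] <= MAX_GAP_HOURS):
            group_id += 1
        row["transaction_group"] = group_id

        out.append(row)
        prev = r

    # Sum hours_borrowed per group and attach the total to every row in the group.
    totals = {}
    for row in out:
        totals[row["transaction_group"]] = totals.get(row["transaction_group"], 0.0) + row["hours_borrowed"]
    for row in out:
        row["hours_borrowed_with_successions"] = totals[row["transaction_group"]]
    return out
```

Every row in a group carries the same total in this sketch, so keeping one row per ‘transaction_group’ before analysis avoids counting a group more than once.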

 

MISSING VALUES

We ran ‘Summary Statistics’ on all the columns in our 3-hour and 3-day datasets to identify the number of observations with missing values. Missing data can influence our findings and the conclusions drawn from the data, so this analysis was essential for determining how much data is missing from the datasets and deciding how to proceed from there.

Dataset                     Before                                             After
3-Hour Transaction Dataset  G19 Missing Values 3H Before.png (Total = 13281)   G19 Missing Values 3H After.png (Total = 12958)
3-Day Transaction Dataset   G19 Missing Values 3D Before.png (Total = 1401)

As our client has a stake in the project, we consulted them on the rows with a missing ‘return_timestamp’. Since the ‘return_timestamp’ for these transactions cannot be recovered, we excluded those rows with our client’s permission. Without the ‘return_timestamp’, it would also not be possible to analyze the ‘hours_borrowed’ and ‘sufficiency_measure’ of these observations. This exclusion was performed on both transaction datasets.

The other fields with missing values, including ‘active cc’ and ‘isbn’, have not been used in any of our analysis thus far and hence remain untouched.
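The missing-value check and the exclusion described above can be sketched as follows. This is only an illustration in Python rather than the JMP ‘Summary Statistics’ output; it assumes rows are dictionaries and that a missing value appears as `None` or a blank string, and the helper names are hypothetical.

```python
def is_missing(value):
    """Treat None and blank strings as missing (an assumption about the export)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def count_missing(rows, columns):
    """Return {column: number of rows in which that column is missing}."""
    return {c: sum(1 for r in rows if is_missing(r.get(c))) for c in columns}


def drop_missing(rows, column="return_timestamp"):
    """Keep only the rows in which the given column has a value."""
    return [r for r in rows if not is_missing(r.get(column))]
```

Only ‘return_timestamp’ is used as the exclusion criterion here, so rows with a missing ‘isbn’ or ‘active cc’ are left in place, as in the text.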

 

REMOVING OUTLIERS
G19 Lost Damaged Items Policy.png

For both loan policies, we removed outliers where the hours borrowed was less than 0. A negative value in the ‘hours_borrowed’ variable would indicate that a book was returned before it was borrowed. Since this is not possible, we removed such observations.

3-Hour loan

Under the library’s loan policy, books on 3-hour loan can be returned the next day, depending on the time of borrowing. The longest assigned loan period is therefore 19.5 hours, from Sat, 6pm to Sun, 1.30pm. A book borrowed at that time and returned immediately would have a sufficiency measure of 19.5 hours, which sets the sufficiency measure upper bound for our analysis.

At the same time, the library does not accept overdue books that are returned more than 2 weeks late. We have therefore restricted the maximum overdue time to 336 hours (14 days × 24 hours). This sets the sufficiency measure lower bound for our analysis at -336.

3-Day loan

Books on 3-day loan do not have an overnight policy like the 3-hour loans do. Hence, the upper bound for the sufficiency measure is simply 72 hours.

The same policy applies to books that are overdue by more than 2 weeks, so the maximum overdue time is again 336 hours. This sets the sufficiency measure lower bound for our analysis at -336.
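The arithmetic behind these bounds can be checked with a short Python sketch. It assumes the sign convention implied by the bounds themselves (sufficiency measure = assigned loan period minus hours borrowed, so a positive value means the book came back early) and treats the bounds as inclusive. The calendar dates are an arbitrary Saturday and Sunday, and the names `bounds` and `within_bounds` are hypothetical.

```python
from datetime import datetime

MAX_OVERDUE_HOURS = 14 * 24  # 2 weeks = 336 hours

# Longest 3-hour loan window: Sat 6pm to Sun 1.30pm (dates are an arbitrary weekend).
_window = datetime(2018, 1, 7, 13, 30) - datetime(2018, 1, 6, 18, 0)
MAX_3H_LOAN_HOURS = _window.total_seconds() / 3600
MAX_3D_LOAN_HOURS = 72


def bounds(max_loan_hours):
    """Inclusive (lower, upper) bounds for each variable, given the longest loan period."""
    return {
        "hours_borrowed": (0, max_loan_hours + MAX_OVERDUE_HOURS),
        "overdue_period": (-MAX_OVERDUE_HOURS, max_loan_hours),
    }


def within_bounds(hours_borrowed, assigned_loan_period, max_loan_hours):
    """True if a transaction falls inside both sets of bounds."""
    # Assumed convention: positive = returned early, negative = returned late.
    sufficiency = assigned_loan_period - hours_borrowed
    b = bounds(max_loan_hours)
    lo_h, hi_h = b["hours_borrowed"]
    lo_s, hi_s = b["overdue_period"]
    return lo_h <= hours_borrowed <= hi_h and lo_s <= sufficiency <= hi_s
```

Under these assumptions, `bounds(MAX_3H_LOAN_HOURS)` gives 0 to 355.5 hours borrowed and -336 to 19.5 for the sufficiency measure, and `bounds(MAX_3D_LOAN_HOURS)` gives 0 to 408 and -336 to 72.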

In conclusion, the following table summarizes the boundaries set for our datasets.

Dataset                       Variable        Lower Bound  Upper Bound  Number of Outliers Removed
2/3-Hour Transaction Dataset  Hours Borrowed  0            355.5        6
                              Overdue Period  -336         19.5         6
3-Day Transaction Dataset     Hours Borrowed  0            408          0
                              Overdue Period  -336         72           0

REDEFINITION OF SCOPE

As undergraduates are SMU Libraries’ main group of users, this paper focuses on Year 1 to Year 4 undergraduates. As such, the following patron groups and years of study were filtered out:

  1. Patron Group: Adjunct, Admin Staff, Alumni, Faculty, Master, PhD, Others, Research Staff
  2. Year of Study: Year 0, Year 5 and above

A total of 2164 rows were removed from the 2-hour, 3-hour and 3-day datasets.
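The scope filter can be expressed as a simple predicate. The Python sketch below is illustrative only: the key names ‘patron_group’ and ‘year_of_study’ and the assumption that the year is stored as an integer are ours, not taken from the datasets.

```python
# Patron groups excluded from the analysis, as listed above.
EXCLUDED_PATRON_GROUPS = {
    "Adjunct", "Admin Staff", "Alumni", "Faculty",
    "Master", "PhD", "Others", "Research Staff",
}


def in_scope(row):
    """True for rows kept in the analysis: Year 1 to Year 4, not an excluded group."""
    # Assumes year_of_study is an integer; Year 0 and Year 5 and above are dropped.
    return (
        row["patron_group"] not in EXCLUDED_PATRON_GROUPS
        and 1 <= row["year_of_study"] <= 4
    )


def filter_scope(rows):
    """Return only the in-scope rows."""
    return [r for r in rows if in_scope(r)]
```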