IS428 AY2019-20T2 Assign ONG LI TING: Data
About the Data
The survey sent out to the SMU community consists of 3 sections – General Information, Survey Questions and Your Comments. The respondents does not require to answer all questions in the survey form.
For General Information, the question response is presented in multiple choice format. This section allow us to gain an overview of the profile of the respondents in terms of their faculty, student status and profession in SMU. Addionally, we will also know which library the respondents patronise more.
For Survey Questions, the question response is presented in rating scale format in terms of importance and actual perceived performance. The questions are relaed to the service and product quality provided by SMU Libraries for the area – Communication, Service Delivery, Facilities and Equipment and, Information Resources.
For Your Comments, the overall satisfaction will be measured in a rating scale format question. Respondents can also provide any suggestion, comments or overall improvement in an open-ended question format. Additionally, respondents will also be asked to answer how frequent they visit the campus, library and have access the library resources in multiple choice format.
The survey responses are consolidated in 2 files:
1. Raw data 2018-03-07 SMU LCS data file – KLG.xls
2. 2018-02-16 SMU Library Survey Comments MAC.xls
Raw data 2018-03-07 SMU LCS data file – KLG.xls
This file contains 2 tabs – SMU and Legend.
The SMU tab contains all the answers gathered from the respondents in numeric coded values.
Header | Description |
---|---|
Numerical values that gives every respondent a unique number. | |
Answers coded in numerical value for every type of answer. The legend tab will be needed to map the value under these headers to a description. |
The primary use of the legend tab is to give SMU Libraries the information needed to map the numerical values in the SMU tab to a description.
Header | Description |
---|---|
Contains 88 unique answer keys for the questions answered.
E.g. HowOftenL represents the question on “How frequently do you visit the library?” and HowOftenC represents the question on “How frequently do you visit the Campus?”. | |
Contains 88 questions presented to the respondents. | |
Any open-ended questions response. | |
The numerical representation for the multiple-choice questions answered.
For example: 1 represents YES, 2 represents NO. |
2018-02-16 SMU Library Survey Comments MAC.xls
This file contains the comments posted by the respondents and their overall profile.
Header | Description |
---|---|
Multiple choice (1 of the following):
| |
Multiple choice (1 of the following):
| |
Multiple choice (1 of the following):
| |
Multiple choice (1 of the following):
| |
Open ended answer |
Data Cleaning & Preparation
File: Raw data 2018-03-07 SMU LCS data file – KLG.xls
Assumption
- Exchange student (Answered in Q3) cannot be a non-exchange student (Answered in Q4)
- Graduate student (Answered in Q3) can be an international non-exchange student (Answered in Q4)
- Faculty profession respondents cannot (Answered in Q3) cannot be a non-exchange student (Answered in Q4)
- Staff profession respondents cannot (Answered in Q3) cannot be a non-exchange student (Answered in Q4)
Cleaning #1 | Steps |
---|---|
Remove rows that do not fulfil the assumptions stated above. It is essential to keep the dataset error free. |
|
Cleaning #2 | Steps |
---|---|
Respondent that did not answer questions with regards to their profile information or less than 10% of the entire questionnaires should be removed from the dataset. This is to improve the accuracy of the insights derived. |
|
Cleaning #3 | Steps |
---|---|
Headers give little meaning (E.g. I01, I02 and etc). Change the headers to the questions in the survey. This will allow ease of visualisation in the later stage. |
|
Cleaning #4 | Steps |
---|---|
The responses are encoded in numerical value. To ensure the ease of visualisation in Tableau, we need to change the numerical values to match the description in the Legend tab. |
|
Cleaning #5 | Steps |
---|---|
Continued: Cleaning with Python import pandas as pd import nltk from nltk.corpus import stopwords import numpy as np from nltk.stem import WordNetLemmatizer def identify_tokens(row): review = str(row['comments']) tokens = nltk.word_tokenize(review) token_words = [w for w in tokens if w.isalpha()] return token_words def lemmatize_list(row): my_list = row['words'] lemmatized_list = [lemmatizer.lemmatize(word) for word in my_list] return (lemmatized_list) def remove_stops(row): my_list = row['lemmatized_words'] meaningful_words = [w for w in my_list if not w in stops] return (meaningful_words) def rejoin_words(row): my_list = row['lemmatized_meaningful'] joined_words = (" ".join(my_list)) return joined_words # Load lemmatizer = WordNetLemmatizer() stops = set(stopwords.words("english")) df = pd.read_csv('Comments_v2.csv', engine='python') # Sentiment based on overall satisfaction conditions = [ (df['overall'] > 4), (df['overall'] < 4)] choices = ['positive', 'negative'] df['sentiment'] = np.select(conditions, choices, default='neutral') # LowerCase df['comments'] = df['comments'].str.lower() # Tokenization df['words'] = df.apply(identify_tokens, axis=1) # Lemmatize and remove punctuation df['lemmatized_words'] = df.apply(lemmatize_list, axis=1) # Remove Stop Words df['lemmatized_meaningful'] = df.apply(remove_stops, axis=1) # Rejoin df['processed'] = df.apply(rejoin_words, axis=1) # Remove empty rows df = df[~df['processed'].isin(['nil', '', 'na', 'nan', 'none'])] # Drop uneccessary columns cols_to_drop = ['comments', 'overall', 'words', 'lemmatized_words', 'lemmatized_meaningful'] df.drop(cols_to_drop, axis=1, inplace=True) print(df) df.to_csv('CleanedComments.csv', index=False)
|
Output:
- Meaningful descriptive headers for all the columns
- Meaningful descriptive inputs for all the responses
- Removed rows that are inaccurate/contains error
- List of cleaned words, along with the sentiment category (positive/negative/neutral)
- New Excel sheet for word cloud – CleanedComments.csv
Import File: Raw data 2018-03-07 SMU LCS data file – KLG.xls to Tableau
Cleaning #1 | Steps |
---|---|
|
Import File: CleanedComments.csv data file – KLG.xls to Tableau
Cleaning #1 | Steps |
---|---|
|
Cleaning #2 | Steps |
---|---|
|