IS428 AY2019-20T2 Assign ONG LI TING: Data

From Visual Analytics for Business Intelligence
Jump to navigation Jump to search
Library.jpg


OVERVIEW

DATA

VISUALIZATION

TASK ANSWERS

MISCELLANEOUS


About the Data

The survey sent out to the SMU community consists of 3 sections – General Information, Survey Questions and Your Comments. The respondents does not require to answer all questions in the survey form.

For General Information, the question response is presented in multiple choice format. This section allow us to gain an overview of the profile of the respondents in terms of their faculty, student status and profession in SMU. Addionally, we will also know which library the respondents patronise more.

For Survey Questions, the question response is presented in rating scale format in terms of importance and actual perceived performance. The questions are relaed to the service and product quality provided by SMU Libraries for the area – Communication, Service Delivery, Facilities and Equipment and, Information Resources.

For Your Comments, the overall satisfaction will be measured in a rating scale format question. Respondents can also provide any suggestion, comments or overall improvement in an open-ended question format. Additionally, respondents will also be asked to answer how frequent they visit the campus, library and have access the library resources in multiple choice format.

The survey responses are consolidated in 2 files:
1. Raw data 2018-03-07 SMU LCS data file – KLG.xls
2. 2018-02-16 SMU Library Survey Comments MAC.xls


Raw data 2018-03-07 SMU LCS data file – KLG.xls

This file contains 2 tabs – SMU and Legend.
The SMU tab contains all the answers gathered from the respondents in numeric coded values.

Header Description
Code
Numerical values that gives every respondent a unique number.
[Remaining Headers]
Answers coded in numerical value for every type of answer. The legend tab will be needed to map the value under these headers to a description.


The primary use of the legend tab is to give SMU Libraries the information needed to map the numerical values in the SMU tab to a description.

Header Description
Code
Contains 88 unique answer keys for the questions answered.

E.g. HowOftenL represents the question on “How frequently do you visit the library?” and HowOftenC represents the question on “How frequently do you visit the Campus?”.

Item
Contains 88 questions presented to the respondents.
Free Text
Any open-ended questions response.
0-10
The numerical representation for the multiple-choice questions answered.

For example: 1 represents YES, 2 represents NO.

2018-02-16 SMU Library Survey Comments MAC.xls

This file contains the comments posted by the respondents and their overall profile.

Header Description
Which Library do you use more?
Multiple choice (1 of the following):
  • Li Ka Shing Library
  • Gwa Geok Choo Law Library
Position?
Multiple choice (1 of the following):
  • Undergraduate year 1
  • Undergraduate year 2
  • Undergraduate year 3
  • Undergraduate year 4 & above
  • Exchange student
  • Graduate: Masters
  • Graduate: Doctoral
  • Faculty: Professor
  • Faculty: Assistant Professor
  • Faculty: Lecturer/Senior Lecturer
  • Staff: Researcher
  • Staff: Other administration position
  • Others
What is your major area of study, research or teaching?
Multiple choice (1 of the following):
  • Accountancy
  • Business
  • Economics
  • Information Systems
  • Law
  • Social Sciences
  • Others
Are you an international (non-exchange) student?
Multiple choice (1 of the following):
  • Yes
  • No
Please give us your suggestions for improvement or any other comments about the Library.
Open ended answer

Data Cleaning & Preparation

File: Raw data 2018-03-07 SMU LCS data file – KLG.xls

Assumption

  • Exchange student (Answered in Q3) cannot be a non-exchange student (Answered in Q4)
  • Graduate student (Answered in Q3) can be an international non-exchange student (Answered in Q4)
  • Faculty profession respondents cannot (Answered in Q3) cannot be a non-exchange student (Answered in Q4)
  • Staff profession respondents cannot (Answered in Q3) cannot be a non-exchange student (Answered in Q4)
Cleaning #1 Steps
Tab: SMU
Remove rows that do not fulfil the assumptions stated above. It is essential to keep the dataset error free.
Olt1a.jpg
  • For column “ID”, sort it in ascending order. This will ensure a faster run time for Excel.


Olt1b.jpg
  • For “1” under column “ID”, remove the entire role that contains number 5 (Exchange student), 9 – 11 (Faculty profession) and 12 – 13 (Staff profession) from the “Position” column.
  • Select the group of data that potentially contains the number 5, 9 – 11 and 12 – 13.
  • Ctrl + F and find the values.
  • Highlight the results in the Find and Replace window.
  • Close the window and Ctrl + - to open the Delete window.
  • Select “Entire row”.
Cleaning #2 Steps
Tab: SMU
Respondent that did not answer questions with regards to their profile information or less than 10% of the entire questionnaires should be removed from the dataset. This is to improve the accuracy of the insights derived.
V2A.jpg
  • Did eye balling across the entire dataset.
  • Remove responseID 833 from the dataset.
Cleaning #3 Steps
Tab: SMU

Headers give little meaning (E.g. I01, I02 and etc).

Change the headers to the questions in the survey. This will allow ease of visualisation in the later stage.
V3A.jpg
  • Create a new row above the header.
  • Enter the formula: =VLOOKUP(F2;Legend!$A$2:$B$89;2;FALSE) in the new cell above the first header.
  • Drag the formula to the rest of the header.


V3B.jpg
V3C.jpg
(Legend tab)
  • Rename NA1, NA2, NA3, NA4, NA5, NA6, NA7, NA8, NA9 to NA01, NA02, NA03, NA04, NA05, NA06, NA07, NA08, NA09 to match the legend fields.
Cleaning #4 Steps
Tab: SMU
The responses are encoded in numerical value. To ensure the ease of visualisation in Tableau, we need to change the numerical values to match the description in the Legend tab.
V4A.jpg
  • Create a dummy column on the right of the column “campus”.


V4B.jpg
  • Create a new sheet – HLOOKUP. This sheet will contain the tables needed to convert the numerical value to description as referenced from the Legend tab.


V4C.jpg
  • Enter the formula: =HLOOKUP(B3;HLOOKUP!$B$3:$C$4;2;FALSE) in the first cell of the newly created column.
  • Drag the formula to the rest of the column.
  • Highlight the new column and Ctrl + C.


V4D.jpg
  • Highlight the column “Campus” and select “Paste Special” > “Values”.
  • Delete the dummy column.
  • Repeat these steps for the other columns – “Position”, “ID?” and for column NA01 to NA26, change all the value “1” to “N/A”.
Cleaning #5 Steps
Unable to do an accurate word cloud of the mentioned words with current dataset.
V5A.jpg
  • Created a new Excel sheet called “Comments” for the word cloud analysis
  • Copy column “Position”, “StudyArea”, “P27” and “Comment1” into the new sheet
  • Deleted observations with empty value under the column “Comment1”

Continued: Cleaning with Python

import pandas as pd
import nltk
from nltk.corpus import stopwords
import numpy as np
from nltk.stem import WordNetLemmatizer

def identify_tokens(row):
    review = str(row['comments'])
    tokens = nltk.word_tokenize(review)
    token_words = [w for w in tokens if w.isalpha()]
    return token_words

def lemmatize_list(row):
    my_list = row['words']
    lemmatized_list = [lemmatizer.lemmatize(word) for word in my_list]
    return (lemmatized_list)

def remove_stops(row):
    my_list = row['lemmatized_words']
    meaningful_words = [w for w in my_list if not w in stops]
    return (meaningful_words)

def rejoin_words(row):
    my_list = row['lemmatized_meaningful']
    joined_words = (" ".join(my_list))
    return joined_words

# Load
lemmatizer = WordNetLemmatizer()
stops = set(stopwords.words("english"))
df = pd.read_csv('Comments_v2.csv', engine='python')

# Sentiment based on overall satisfaction
conditions = [
    (df['overall'] > 4),
    (df['overall'] < 4)]
choices = ['positive', 'negative']
df['sentiment'] = np.select(conditions, choices, default='neutral')

# LowerCase
df['comments'] = df['comments'].str.lower()

# Tokenization
df['words'] = df.apply(identify_tokens, axis=1)

# Lemmatize and remove punctuation
df['lemmatized_words'] = df.apply(lemmatize_list, axis=1)

# Remove Stop Words
df['lemmatized_meaningful'] = df.apply(remove_stops, axis=1)

# Rejoin
df['processed'] = df.apply(rejoin_words, axis=1)

# Remove empty rows
df = df[~df['processed'].isin(['nil', '', 'na', 'nan', 'none'])]

# Drop uneccessary columns
cols_to_drop = ['comments', 'overall', 'words',
                'lemmatized_words', 'lemmatized_meaningful']
df.drop(cols_to_drop, axis=1, inplace=True)

print(df)
df.to_csv('CleanedComments.csv', index=False)
  • Removed special characters and punctuations from the comments using python.
  • Removed stop words and performed word stemming and lemmatisation with python and packages from NLTK.
  • Created a new column called “Sentiment” to identify comments that are rated as positive (score >4), negative (score <4) or neutral (score = 4) based on the overall satisfaction score.

Output:

  • Meaningful descriptive headers for all the columns
  • Meaningful descriptive inputs for all the responses
  • Removed rows that are inaccurate/contains error
  • List of cleaned words, along with the sentiment category (positive/negative/neutral)
  • New Excel sheet for word cloud – CleanedComments.csv


Import File: Raw data 2018-03-07 SMU LCS data file – KLG.xls to Tableau

Cleaning #1 Steps
Unable to do Likert scale with the current dataset structure.
OltTa.jpg
  • Pivot all columns except for ID, Study Area, Response ID and Position.

Import File: CleanedComments.csv data file – KLG.xls to Tableau

Cleaning #1 Steps
Unable to do word cloud with the current dataset structure.
OltCa.jpg
  • Unable to do word cloud with the current dataset structure.
Cleaning #2 Steps
Unable to provide meaningful word cloud.
OltCb.jpg
  • Manually filter the words out in Tableau.