Difference between revisions of "IS428 AY2019-20T2 Assign PEH ANQI"
Line 56: | Line 56: | ||
<br/> | <br/> | ||
There is quite a number of rows with empty comments, therefore I started by removing empty rows in the dataset. | There is quite a number of rows with empty comments, therefore I started by removing empty rows in the dataset. | ||
− | |||
− | |||
[[File:PAQ pre-processing4.png|400px|center]] | [[File:PAQ pre-processing4.png|400px|center]] | ||
'''Step 2. Convert text to lowercase and remove punctuation''' | '''Step 2. Convert text to lowercase and remove punctuation''' | ||
− | |||
− | |||
[[File:Text process 2.png|400px|center]] | [[File:Text process 2.png|400px|center]] | ||
Line 68: | Line 64: | ||
<br/> | <br/> | ||
As different people have different way of typing and depending on the sentence structure, often words like "happily" and "happiness" both have the same root word which is "happy". Since they both have the same meaning and belong to the same root word, I would like them to be counted together when plotting the word cloud. | As different people have different way of typing and depending on the sentence structure, often words like "happily" and "happiness" both have the same root word which is "happy". Since they both have the same meaning and belong to the same root word, I would like them to be counted together when plotting the word cloud. | ||
− | |||
Lemmatizing is one of the methods to convert words back to its root word. Hence I did lemmatization for each word in the comments. | Lemmatizing is one of the methods to convert words back to its root word. Hence I did lemmatization for each word in the comments. | ||
+ | [[File:Text_process3.png|400px|center]] | ||
− | |||
− | |||
'''Step 4. Stop words removal ''' | '''Step 4. Stop words removal ''' | ||
Line 79: | Line 73: | ||
Code used: | Code used: | ||
[[File:Text process 4.png|400px|center]] | [[File:Text process 4.png|400px|center]] | ||
+ | |||
+ | Below is a sample of how the comment change during the different steps of text processing. | ||
+ | [[File:After text processing.png|400px|center]] | ||
==Visualization== | ==Visualization== |
Revision as of 16:30, 15 March 2020
Contents
Overview
Singapore Management University have two libraries, the Li Ka Shing Library and the Kwa Geok Choo Law Library. Every two years, the libraries would conduct a comprehensive survey in which faculty, students and staff have the opportunity to rate various aspects of SMU library's services. The survey provides SMU libraries with inputs to help enhance existing services and to anticipate emerging needs of SMU faculty, students and staff.
The past survey reports were mainly made-up of pages of tables, which are very difficult to comprehend. Hence, the task is to create interactive data visualisation to transform these tables into visual representation that allow SMU libraries to gain useful insights, and reveal the level of services provided by SMU libraries as perceived by the following stakeholders:
- Faculty
- Undergraduate students
- Postgraduate students
- Staff
Data Preparation
About the Data
The 2018 library survey data is used for this assignment. Two excel data was provided, Raw data 2018-03-07 SMU LCS data file - KLG.xlsx and the 2018-02-16 SMU Library Survey Comments MAC.xls. The 2018-02-16 SMU Library Survey Comments MAC.xls file consist of comments provided by each respondents which is also available in the Raw data 2018-03-07 SMU LCS data file - KLG.xlsx file. Hence, I will be using only the Raw data 2018-03-07 SMU LCS data file - KLG.xlsx file.
A total of 2639 responds was collected from the 2018 Survey. However 1 respondent, ID 833, only answered 4 questions out of a total of 87 and did not provide any information on the respondent position, campus and study area. Hence respondent 833's input will be omitted from the visualization.
The dataset contained of 88 columns. 7 of them provide basic information of the respondents, their position, Studyarea, frequency of them visiting the library, etc. 2 of them are on respondents satisfaction level and likelihood of recommending the library to others. 78 of them are likert scale questions where respondents have to rank different services provided by the library by importance of the services and how well the library perform in providing these services. Lastly, 1 of them is a free text containing additional comments from the respondents.
Data cleaning in excel
1. Remove respondent 833 |
Respondent with responseID 833 is removed as identification attributes are blank (Campus, Position, StudyArea), the respondent also only answered 4 out of the 88 questions. Hence I decided to omitted this respondent from the visualization. | |
2. Replace column Campus, Position, StudyArea and ID's numerical value |
The values in the excel file are represented by numerical values, making it difficult to interpret the data without referring to the legend. Hence, I used the “LOOKUP” function in excel to get the values for Campus, Position, StudyArea and ID field by referring to the legend worksheet. | |
3. Replace column name for likert scale questions |
The column name for the likert scale questions on importance and performance is represented by a code. Eg: I01, P01, NA01, etc. This makes it difficult to know what each column represents. Hence, I used the "HLOOKUP" function in excel to get the column code with column name by referring to the legend worksheet. | |
4. Pre-processing Comments |
The column name for the likert scale questions on importance and performance is represented by a code. Eg: I01, P01, NA01, etc. This makes it difficult to know what each column represents. Hence, I used the "HLOOKUP" function in excel to get the column code with column name by referring to the legend worksheet. |
Data cleaning using Python for comments
Comments provided by the respondents are free text which is difficult to work with in tableau. In order to provide a overview on main concerns that that the respondents have, I have decided to pre-process the comments using Python NLTK package to remove commonly used words and punctuation in the comments and build a word cloud in tableau to show the dominant words mentioned.
I removed the comment column in the excel file and placed it in another excel file containing the comments, respondent ID and position. Using Python, I conducted stop words removal and lemmatization.
Step 1. Remove empty rows
There is quite a number of rows with empty comments, therefore I started by removing empty rows in the dataset.
Step 2. Convert text to lowercase and remove punctuation
Step 3. lemmatizing
As different people have different way of typing and depending on the sentence structure, often words like "happily" and "happiness" both have the same root word which is "happy". Since they both have the same meaning and belong to the same root word, I would like them to be counted together when plotting the word cloud. Lemmatizing is one of the methods to convert words back to its root word. Hence I did lemmatization for each word in the comments.
Step 4. Stop words removal
Stop words are words commonly used in a sentence such as "I", "You", "Are". These words are commonly mentioned but have no meaning to them. Hence, I will be removing these words.
Code used:
Below is a sample of how the comment change during the different steps of text processing.