Difference between revisions of "IS428 AY2019-20T2 Assign PEH ANQI"

From Visual Analytics for Business Intelligence
Jump to navigation Jump to search
Line 51: Line 51:
 
Comments provided by the respondents are free text which is difficult to work with in tableau. In order to provide a overview on main concerns that that the respondents have, I have decided to pre-process the comments using Python NLTK package to remove commonly used words and punctuation in the comments and build a word cloud in tableau to show the dominant words mentioned.  
 
Comments provided by the respondents are free text which is difficult to work with in tableau. In order to provide a overview on main concerns that that the respondents have, I have decided to pre-process the comments using Python NLTK package to remove commonly used words and punctuation in the comments and build a word cloud in tableau to show the dominant words mentioned.  
  
I removed the comment column in the excel file and used and placed it in another excel file containing the comments, respondent ID and position. I open the new excel file using Python and conducted stop words removal and lemmatization.
+
Packages used:
{| class="wikitable"
+
[[File:PAQ Packages Used.png|400px|center]]
|-
 
|<center>'''Data Cleaning'''</center>
 
|<center>'''Steps Taken'''</center>
 
|-
 
|1. Remove respondent 833 <br/> [[File:PAQ Pre-process 1.png|center|400px]]
 
|Respondent with responseID 833 is removed as identification attributes are blank (Campus, Position, StudyArea), the respondent also only answered 4 out of the 88 questions. Hence I decided to omitted this respondent from the visualization.
 
|valign="top"|
 
  
|-
+
I removed the comment column in the excel file and placed it in another excel file containing the comments, respondent ID and position. Using Python, I conducted stop words removal and lemmatization.
|2. Replace column Campus, Position, StudyArea and ID's numerical value <br/> [[File:PAQ Processing 2.png|400px|center]]
 
|The values in the excel file are represented by numerical values, making it difficult to interpret the data without referring to the legend. Hence, I used the “LOOKUP” function in excel to get the values for Campus, Position, StudyArea and ID field by referring to the legend worksheet.  
 
|valign="top"|
 
  
|-
+
'''Step 1. Remove empty rows'''
|3. Replace column name for likert scale questions <br/> [[File:PAQ Preprocessing 3.png|400px|center]]
+
<br/>
|The column name for the likert scale questions on importance and performance is represented by a code. Eg: I01, P01, NA01, etc. This makes it difficult to know what each column represents. Hence, I used the "HLOOKUP" function in excel to get the column code with column name by referring to the legend worksheet.
+
There is quite a number of rows with empty comments, therefore I started by removing empty rows in the dataset.
|valign="top"|
 
  
|-
+
Code used:
|4. Pre-processing Comments <br/> [[File:PAQ Preprocessing 3.png|400px|center]]
+
[[File:PAQ pre-processing4.png|400px|center]]
|The column name for the likert scale questions on importance and performance is represented by a code. Eg: I01, P01, NA01, etc. This makes it difficult to know what each column represents. Hence, I used the "HLOOKUP" function in excel to get the column code with column name by referring to the legend worksheet.
+
 
|valign="top"|
+
'''Step 2. Convert text to lowercase and remove punctuation'''
|-
+
<br/>
|}
+
Code used:
 +
[[File:Text process 2.png|400px|center]]
 +
 
 +
'''Step 3. lemmatizing '''
 +
<br/>
 +
As different people have different way of typing and depending on the sentence structure, often words like "happily" and "happiness" both have the same root word which is "happy". Since they both have the same meaning and belong to the same root word, I would like them to be counted together when plotting the word cloud.  
 +
 
 +
Lemmatizing is one of the methods to convert words back to its root word. Hence I did lemmatization for each word in the comments.  
 +
 
 +
Code used:
 +
[[File:PAQ Text process3.png|400px|center]]
 +
 
 +
'''Step 4. Stop words removal '''
 +
Stop words are words commonly used in a sentence such as "I", "You", "Are". These words are commonly mentioned but have no meaning to them. Hence, I will be removing these words.
 +
 
 +
Code used:
 +
[[File:Text process 4.png|400px|center]]
  
 
==Visualization==
 
==Visualization==

Revision as of 16:16, 15 March 2020

SMU Libraries


Overview

Singapore Management University have two libraries, the Li Ka Shing Library and the Kwa Geok Choo Law Library. Every two years, the libraries would conduct a comprehensive survey in which faculty, students and staff have the opportunity to rate various aspects of SMU library's services. The survey provides SMU libraries with inputs to help enhance existing services and to anticipate emerging needs of SMU faculty, students and staff.

The past survey reports were mainly made-up of pages of tables, which are very difficult to comprehend. Hence, the task is to create interactive data visualisation to transform these tables into visual representation that allow SMU libraries to gain useful insights, and reveal the level of services provided by SMU libraries as perceived by the following stakeholders:

  • Faculty
  • Undergraduate students
  • Postgraduate students
  • Staff


Data

Data Preparation

About the Data

The 2018 library survey data is used for this assignment. Two excel data was provided, Raw data 2018-03-07 SMU LCS data file - KLG.xlsx and the 2018-02-16 SMU Library Survey Comments MAC.xls. The 2018-02-16 SMU Library Survey Comments MAC.xls file consist of comments provided by each respondents which is also available in the Raw data 2018-03-07 SMU LCS data file - KLG.xlsx file. Hence, I will be using only the Raw data 2018-03-07 SMU LCS data file - KLG.xlsx file.

A total of 2639 responds was collected from the 2018 Survey. However 1 respondent, ID 833, only answered 4 questions out of a total of 87 and did not provide any information on the respondent position, campus and study area. Hence respondent 833's input will be omitted from the visualization.

The dataset contained of 88 columns. 7 of them provide basic information of the respondents, their position, Studyarea, frequency of them visiting the library, etc. 2 of them are on respondents satisfaction level and likelihood of recommending the library to others. 78 of them are likert scale questions where respondents have to rank different services provided by the library by importance of the services and how well the library perform in providing these services. Lastly, 1 of them is a free text containing additional comments from the respondents.

Data cleaning in excel

Data Cleaning
Steps Taken
1. Remove respondent 833
PAQ Pre-process 1.png
Respondent with responseID 833 is removed as identification attributes are blank (Campus, Position, StudyArea), the respondent also only answered 4 out of the 88 questions. Hence I decided to omitted this respondent from the visualization.
2. Replace column Campus, Position, StudyArea and ID's numerical value
PAQ Processing 2.png
The values in the excel file are represented by numerical values, making it difficult to interpret the data without referring to the legend. Hence, I used the “LOOKUP” function in excel to get the values for Campus, Position, StudyArea and ID field by referring to the legend worksheet.
3. Replace column name for likert scale questions
PAQ Preprocessing 3.png
The column name for the likert scale questions on importance and performance is represented by a code. Eg: I01, P01, NA01, etc. This makes it difficult to know what each column represents. Hence, I used the "HLOOKUP" function in excel to get the column code with column name by referring to the legend worksheet.
4. Pre-processing Comments
PAQ Preprocessing 3.png
The column name for the likert scale questions on importance and performance is represented by a code. Eg: I01, P01, NA01, etc. This makes it difficult to know what each column represents. Hence, I used the "HLOOKUP" function in excel to get the column code with column name by referring to the legend worksheet.

Data cleaning using Python for comments

Comments provided by the respondents are free text which is difficult to work with in tableau. In order to provide a overview on main concerns that that the respondents have, I have decided to pre-process the comments using Python NLTK package to remove commonly used words and punctuation in the comments and build a word cloud in tableau to show the dominant words mentioned.

Packages used:

PAQ Packages Used.png

I removed the comment column in the excel file and placed it in another excel file containing the comments, respondent ID and position. Using Python, I conducted stop words removal and lemmatization.

Step 1. Remove empty rows
There is quite a number of rows with empty comments, therefore I started by removing empty rows in the dataset.

Code used:

PAQ pre-processing4.png

Step 2. Convert text to lowercase and remove punctuation
Code used:

Text process 2.png

Step 3. lemmatizing


As different people have different way of typing and depending on the sentence structure, often words like "happily" and "happiness" both have the same root word which is "happy". Since they both have the same meaning and belong to the same root word, I would like them to be counted together when plotting the word cloud.

Lemmatizing is one of the methods to convert words back to its root word. Hence I did lemmatization for each word in the comments.

Code used:

Step 4. Stop words removal Stop words are words commonly used in a sentence such as "I", "You", "Are". These words are commonly mentioned but have no meaning to them. Hence, I will be removing these words.

Code used:

Text process 4.png

Visualization

Undergraduate students

Postgraduate students

Faculty

Staff