ISSS608 2016-17 T1 Assign2 Ong Han Ying - Data Preparation

From Visual Analytics and Applications

CONTENT

Section 5: Data Preparation
_____5.1: Source of Dataset
_____5.2: Best Practices for Evaluating Results of an Online Survey
__________5.2.1: Common Mistake 1 - Using Arithmetic Mean
__________5.2.2: Common Mistake 2 - Inconsistent Order to the Questions
__________5.2.3: Common Mistake 3 - Asking Compound Questions
__________5.2.4: Common Mistake 4 - Not Allowing Neutral or NA Responses
_____5.3: Evaluation of Tools for Data Preparation
_____5.4: Data Preparation Implementation
__________5.4.1: Dataset 1 - Survey Results
_______________5.4.1.1: Completeness of Data
_______________5.4.1.2: Consistency of Data
_______________5.4.1.3: Accuracy of Data
_______________5.4.1.4: Final Output of "Survey Result" Table
__________5.4.2: Survey Question Master

Section 5: Data Preparation

"Flawed data can guide even the greatest leaders to the wrong conclusion." Evaluating the results of an online survey is tricky because the quality of the results depends heavily on the quality of the survey's questions, as well as on the responses from the respondents. It is therefore essential to prepare the data well so that the subsequent analysis is precise. The strategy to prepare the data is as follows:

Data Preparation Strategy

5.1: Source of Dataset

The sources of the dataset used are as below:

  1. Wiki's Online Survey Results
  2. Survey Questionnaire Master - a manual database created based on information from;

5.2: Best Practices for Evaluating Results of an Online Survey

It is important to evaluate the methodologies used to analyze the survey data, especially since the quality of the analysis is highly dependent on the quality of the data and its interpretation. With reference to various online sources, and with guidance from the course instructor, the following sections identify common mistakes and describe how they are overcome.

5.2.1: Common Mistake 1 - Using Arithmetic Mean

It is a common (and yet critical) mistake to analyze results on a Likert scale by treating them as continuous data. This is wrong because it implies that there is a measure of distance between values. For instance, it is not logical to say that "3 - Neutral" is three times the feeling of "1 - Strongly Agree". Similarly, 100 respondents giving a score of 3 tell a very different story from 50 respondents giving a score of 5 and another 50 respondents giving a score of 1, yet both situations result in a mean of 3. [[1]]

Thus, converting ordinal labels to numbers can produce misleading results. Likert-type data has to be treated as ordinal data. It is therefore essential to recode the numerical results from the survey back to their labels, since:

  1. Using numbers can lead to misleading results - for example, by taking a sum/average of the scores.
  2. When interpreting the results, it is common for the reader to forget whether 1 is assigned to "Strongly Agree" or "Strongly Disagree", and therefore to have to refer back to the scale again.

As such, replacing the numerical results with their labels helps to overcome the above problems. This will be completed in section 5.4.1.2: Consistency of Data.
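
As a rough sketch of this recoding step outside JMP (the question name and scores below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Map numeric Likert scores back to their ordinal labels.
likert_labels = {1: "Strongly Disagree", 2: "Disagree", 3: "Neutral",
                 4: "Agree", 5: "Strongly Agree"}

scores = pd.Series([1, 3, 5, 2], name="PU1")  # illustrative responses
labels = scores.map(likert_labels)

# Keep the ordering information so the data stays ordinal, not nominal.
labels = pd.Categorical(labels, categories=list(likert_labels.values()),
                        ordered=True)
print(list(labels))
```

Using an ordered categorical keeps the scale's ranking available for later analysis without implying any distance between the levels.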

5.2.2: Common Mistake 2 - Inconsistent Order to the Questions

Next, it is important to ensure that the order of the scale is consistent throughout the survey. Otherwise, there is a risk of confusing the respondents and collecting data that does not represent their true feelings. As such, it is essential to check both (1) the order of the scale (1 to represent XX, and 5 to represent XX on the scale), and (2) the polarity of the questions.

This will be completed in section 5.4.1.3: Accuracy of Data.

A check over the questions in the survey found that none of the questions in the database makes such a mistake. Therefore, all the questions can be included in the analysis of the results later on.

5.2.3: Common Mistake 3 - Asking Compound Questions

Asking a "compound question" is another common mistake made in the design of market surveys. For example, "The use of Wikipedia improves students' learning, and is useful for teaching" is a compound question because it forces the respondent to give a single answer reflecting his/her feelings on both "students' learning" and "useful for teaching". This is a serious mistake because it leads to false interpretation of the results.

A check over the questions in the survey found that none of the questions in the database makes such a mistake. Therefore, all the questions can be included in the analysis of the results later on.

5.2.4: Common Mistake 4 - Not Allowing Neutral or NA Responses

Last but not least, while it is common to feel the need to force a respondent to give an answer, doing so may "pollute" the data with unreal answers. Thus, although it seems undesirable to let respondents give an "NA" or "Neutral" response, allowing it can improve the quality of the data instead.

A check over the survey questions shows that the Likert scale used allows a "Neutral/NA" response at "3", and this is consistent throughout the survey. Therefore, all the questions can be included in the analysis of the results later on.

5.3: Evaluation of Tools for Data Preparation

With many tools available, we compared the common tools and selected the final tool for data preparation, as below:

Note: "X" indicates that the software can fulfill the task with ease (within 5 - 10 clicks).

Function Required                                          Tableau   JMP   MS Excel
Splitting of data into columns                                X       X       X
Summary of data statistics (to identify missing values)               X
Replacing of data across columns                                      X       X
Recoding of categorical data                                  X       X
Joining of tables                                             X       X
Integration with other visualization tools                            X

As a result, we will primarily use JMP to prepare the datasets. Upon completion, the dataset will be exported from JMP as a .csv file and saved as a separate dataset to be used in various visualization tools at a later stage.

5.4: Data Preparation Implementation

5.4.1: Dataset 1 - Survey Results

Firstly, the data file, "wiki4H.csv", is imported into JMP by conducting the following steps:

Splitting the Data into Columns

5.4.1.1: Completeness of Data

In order to check for completeness of data, we replace the missing-value marker "?" with " " (blank) in JMP, as below:

Replacing missing value of "?" with " "

This then generates statistics to identify the missing values, as below:

Statistics of data
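
The same missing-value check can be sketched outside JMP with Python and pandas; the inline data below is an illustrative stand-in for "wiki4H.csv", keeping only the "?" missing-value convention from the steps above:

```python
import io
import pandas as pd

# Illustrative stand-in for "wiki4H.csv": "?" marks a missing value.
raw = io.StringIO(
    "AGE,GENDER,UOC_POSITION\n"
    "40,0,1\n"
    "35,?,2\n"
    "?,1,?\n"
)

# Treat "?" as missing while reading, mirroring the JMP replacement step.
df = pd.read_csv(raw, na_values="?")

# Count missing values per column, like JMP's column statistics.
print(df.isna().sum())
```

Declaring `na_values="?"` at read time avoids a separate find-and-replace pass over the columns.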

With the above findings, we will:

  1. Keep the field "UOC_Position", because it is important information for identifying the background of the respondents. Missing values will be recoded as "Unknown" instead.
  2. Exclude "Other_Position" and "OtherStatus", since they provide only additional information to "UOC_Position"; also, more than 25% of their values are missing.

Output: The results are as below:

Output of Initial Data-Cleaning
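
The two decisions above can be sketched in pandas as follows (the data is illustrative; only the column names come from the dataset):

```python
import pandas as pd

# Illustrative rows with missing values in the three affected columns.
df = pd.DataFrame({"UOC_Position": [1.0, None, 2.0],
                   "Other_Position": [None, None, 3.0],
                   "OtherStatus": [None, 1.0, None]})

# Keep UOC_Position but recode its missing values as "Unknown";
# drop the two sparsely filled columns entirely.
df["UOC_Position"] = df["UOC_Position"].fillna("Unknown")
df = df.drop(columns=["Other_Position", "OtherStatus"])
print(df)
```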

5.4.1.2: Consistency of Data

Next, we check the consistency of the data. Other than the survey results for each survey question, the dataset contains other relevant information about the respondents. Name labels work better visually than the numerical labels used in the dataset. Therefore, the categorical fields that describe the respondents are recoded. The summary of the actions is as below:

No.  Field Name     Type of Information                  Action
1    AGE            Respondent's Profile                 -
2    GENDER         Respondent's Profile                 Change to nominal, and recode as "Male"/"Female"
3    DOMAIN         Teaching Background of Respondent    Recode with the respective definition based on the Research Paper
4    PhD            Teaching Background of Respondent    Recode as "Yes"/"No"
5    YEARSEXP       Teaching Background of Respondent    -
6    UNIVERSITY     Teaching Background of Respondent    Recode with the respective definition based on the "Attribute Information" on the webpage
7    UOC_POSITION   Teaching Background of Respondent    Recode with the respective definition based on the "Attribute Information" on the webpage
8    USERWIKI       Teaching Background of Respondent    Recode with the respective definition based on the "Attribute Information" on the webpage

The output is as below:

Output of Re-Coding
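
A pandas sketch of this demographic recoding; note that the numeric codings here are assumptions for illustration only, as the actual codes come from the dataset's "Attribute Information" page:

```python
import pandas as pd

# Hypothetical numeric codings -- illustrative, not the dataset's real codes.
gender_map = {0: "Male", 1: "Female"}
phd_map = {0: "No", 1: "Yes"}

df = pd.DataFrame({"GENDER": [0, 1, 1], "PhD": [1, 0, 1]})

# Recode to labels and store as nominal (categorical) columns.
df["GENDER"] = df["GENDER"].map(gender_map).astype("category")
df["PhD"] = df["PhD"].map(phd_map).astype("category")
print(df)
```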

Next, referring to 5.2.1: Common Mistake 1 - Using Arithmetic Mean, we replace the numerical values of the survey results with their descriptions: 1 - Strongly Disagree, 2 - Disagree, 3 - Neutral, 4 - Agree, 5 - Strongly Agree.

Since a total of 41 questions/fields are affected, we save the JMP file as "wiki4H_Step1_ClearedfromJMP.xlsx" and proceed with the cleaning of the data in the Excel file, as below:

Cleaned Data

The file is saved as "wiki4H_Step2_Final.xlsx".
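
The same label replacement across many question columns can be done in one pass; a sketch with illustrative column names (the real file has 41 such fields):

```python
import pandas as pd

likert_labels = {1: "Strongly Disagree", 2: "Disagree", 3: "Neutral",
                 4: "Agree", 5: "Strongly Agree"}

# Illustrative frame: two of the 41 question columns plus one profile field.
df = pd.DataFrame({"PU1": [1, 5, 3], "QU1": [2, 4, 4], "AGE": [40, 35, 50]})

# Recode every question column in one pass, leaving profile fields untouched.
question_cols = ["PU1", "QU1"]
df[question_cols] = df[question_cols].apply(lambda s: s.map(likert_labels))
print(df)
```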

5.4.1.3: Accuracy of Data

Different Order of Question:

With reference to 5.2.2: Common Mistake 2 - Inconsistent Order to the Questions, it is essential to check the order of the survey scale and the polarity of the questions. Reading through the survey questions, QU4 was spotted to show negative polarity in its statement. Since:

  • we cannot assume that "disagree" with the statement "Wikipedia has a lower quality" is the same as "agree" with the statement "Wikipedia has a high quality"; and
  • there are sufficient questions (a total of 4, excluding QU4) under the "Quality" section,

we shall exclude the results of QU4 from our analysis, and the column shall be removed.

Same Option Selected by Respondents Throughout the Survey:

Next, we also examine the accuracy of the submissions in Excel. We can assume that if a respondent submitted the same answer throughout all 43 questions, the responses are unlikely to reflect real feelings, because it is unlikely for a person to feel exactly the same throughout the survey, especially when question QU4 shows negative polarity.

Inaccurate Entries by Respondents

Thus, the 5 identified rows of data will be removed.
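
One way to flag such "straight-lining" respondents is to count the distinct answers per row; a minimal sketch with illustrative columns and responses:

```python
import pandas as pd

# Illustrative responses: rows 0 and 2 give the same answer to every question.
df = pd.DataFrame({
    "PU1": [3, 1, 5],
    "PU2": [3, 2, 5],
    "QU1": [3, 4, 5],
})

# A respondent with only one unique value across all question columns is suspect.
straight_liners = df.nunique(axis=1) == 1
df_clean = df[~straight_liners]
print(df_clean)
```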

Similar Type of Questions:

"Use Behavior" and "Experience" pose similar questions, and it does not make sense for a respondent to "Agree" with a question under "Use Behavior" while disagreeing with the corresponding question under "Experience". For instance, if a respondent agrees with "I use Wikipedia to develop my teaching material" (Use Behavior), he/she should not disagree with "I consult Wikipedia for issues related to my field of expertise".

Since "Use Behavior" is used in the Research Paper, we shall exclude the results of "Experience" from the analysis.

5.4.1.4: Final Output of "Survey Result" Table

A final check on the results of the data preparation of this table shows that the total number of rows has been reduced to 904, with 5 categories of survey results.

Statistics of Final Output

The file is then exported and saved as "wiki4H_SurveyResultTable_Final.xlsx", ready for use later.

5.4.2: Survey Question Master

Next, a manual table was created so as to help to:

  1. Design the dashboard such that the question appears upon selection, so that the audience can relate the results to the question.
  2. Categorize the survey questions, based on the category name assigned to each question.
  3. Identify whether the question is used in the analysis of the survey results, with reference to the research paper.

Field Name       Type        Description                                            Formula / Value                                                                Source
Qn_Code (Raw)    Input       Survey's code and question, as copied from the website NA                                                                             Original data copied from the website
Category         Input       Category of the question                               NA                                                                             Original data copied from the website
Qn_Code          Calculated  Survey's question code only                            =TRIM(LEFT([Qn_Code (Raw)],FIND(":",[Qn_Code (Raw)])-1))                       -
Qn_Details       Calculated  Survey question only                                   =TRIM(RIGHT([Qn_Code (Raw)],LEN([Qn_Code (Raw)])-FIND(":",[Qn_Code (Raw)])))   -
Research_Paper   Input       If the survey question is used in the research paper   NA                                                                             With reference to pages 4 & 8 of the Research Paper
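
The two Excel formulas above split the raw string at its first ":" into the question code and the question text. An equivalent sketch in Python (the question strings are illustrative placeholders):

```python
import pandas as pd

df = pd.DataFrame({"Qn_Code (Raw)": ["PU1: Sample question one",
                                     "QU4: Sample question two"]})

# Split at the first ":" only -- code before it, question text after it.
parts = df["Qn_Code (Raw)"].str.split(":", n=1, expand=True)
df["Qn_Code"] = parts[0].str.strip()
df["Qn_Details"] = parts[1].str.strip()
print(df[["Qn_Code", "Qn_Details"]])
```

Limiting the split to `n=1` keeps any ":" inside the question text itself intact, just as the FIND-based Excel formulas do.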

Note: the survey question "VIS3 - I cite Wikipedia in my academic papers" is re-classified under "Quality" instead of "Visibility" because it measures the quality of a Wikipedia article more than its visibility. For instance, an article is most likely cited because it is relevant, rather than because it is searchable. Citing an article also acts as an endorsement of the article.

The file is then saved as "Survey_QN_Master.xlsx", ready for use later.


Next Sub-section: After Design Iteration