ISSS608 2016-17 T1 Assign2 Ong Han Ying - Data Preparation
Section 5: Data Preparation
"Flawed data can guide even the greatest leaders to the wrong conclusion". Evaluating the results of an online survey is tricky because the quality of the results is highly dependent on the quality of the survey's questions, and also; the responses from the respondents. Therefore, is it essential to prepare the data well such that the quality of the analysis, later on, is precise. The strategy to prepare the data as below;
5.1: Source of Dataset
The dataset is drawn from the following sources:
- Wiki's Online Survey Results
- Survey Questionnaire Master - a manually created database based on information from;
5.2: Best Practices for Evaluating the Results of an Online Survey
It is important to evaluate the methodologies used to analyze the survey data, especially since the quality of the analysis depends heavily on the quality of the data and of its interpretation. With reference to various online sources and guidance from the course instructor, the following subsections identify common mistakes and then address them in turn.
5.2.1: Common Mistake 1 - Using Arithmetic Mean
It is a common (and critical) mistake to analyze results on a Likert scale as if they were continuous data. This is wrong because it implies that there is a measure of distance between values: it is not logical to say that "3 - Neutral" is three times the feeling of "1 - Strongly Agree". Likewise, 100 respondents giving a score of 3 tell a very different story from 50 respondents giving a score of 5 and another 50 giving a score of 1, yet both situations produce a mean of 3. [[1]]
Thus, converting ordinal labels to numbers can produce misleading results; Likert-type data has to be treated as ordinal data. It is therefore essential to recode the numerical results from the survey back to their labels, since:
- using numbers invites misleading operations, such as summing or averaging the scores;
- when interpreting the results, readers often forget whether 1 was assigned to "Strongly Agree" or "Strongly Disagree" and have to refer back to the scale.
As such, recoding the results with their labels overcomes both problems. This will be completed in section 5.4.1.2: Consistency of Data.
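To make the pitfall concrete, here is a minimal Python sketch with hypothetical response counts: both patterns collapse to the same mean even though they describe opposite situations.

```python
import statistics
from collections import Counter

# Hypothetical responses on a 1-5 Likert scale
unanimous_neutral = [3] * 100        # 100 respondents all answer "Neutral"
polarised = [5] * 50 + [1] * 50      # 50 "Strongly Agree" + 50 "Strongly Disagree"

# Both patterns collapse to the same arithmetic mean ...
print(statistics.mean(unanimous_neutral))   # 3.0
print(statistics.mean(polarised))           # 3.0

# ... but the frequency counts tell two very different stories
print(Counter(unanimous_neutral))           # Counter({3: 100})
print(Counter(polarised))                   # Counter({5: 50, 1: 50})
```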
5.2.2: Common Mistake 2 - Inconsistent Order to the Questions
Next, it is important to ensure that the order of the questions is consistent throughout the survey. If not, there is a risk of confusing the respondents and collecting data that does not represent their true feelings. It is therefore essential to check both (1) the order of the questions (1 to represent XX, and 5 to represent XX on the scale), and (2) the polarity of each question.
This will be completed in section 5.4.1.3: Accuracy of Data.
5.2.3: Common Mistake 3 - Asking Compound Questions
Asking "Compound Question" is another common mistake that is made in the designing of market survey. For example, "The use of Wikipedia improves student's learning, and is useful for teaching" - is a compounded question because it forced the respondent to give a single answer to reflect his/her feeling on both "student's learning" and "useful for teaching". This is a serious mistake because it leads to the false interpretation of the results.
A check over the questions in the survey has found that none of the questions in the database has made such a mistake. Therefore, all the questions can be included in the analysis of the results later on.
5.2.4: Common Mistake 4 - Allow Neutral or NA Responses
Last but not least, while it is common to feel the need to force a respondent to give an answer, doing so may "pollute" the data with unreal answers. So although letting respondents give an "NA" or "Neutral" response may seem undesirable, it can actually improve the quality of the data.
A check of the survey questions has shown that, with the Likert scale applied, an "NA"/"Neutral" response is allowed at "3", and this is consistent throughout the survey. Therefore, all the questions can be included in the analysis of the results later on.
5.3: Evaluation of Tools for Data Preparation
With many tools available, we compared the common ones and selected the final tools for data preparation, as below.
Note: "X" indicates that the software can fulfill the task with ease (within 5 - 10 clicks).
Function Required | Tableau | JMP | MS Excel |
---|---|---|---|
Splitting of data into columns | | | |
Summary of data statistics (to identify missing values) | | | |
Replacing of data across columns | | | |
Recoding of categorical data | | | |
Joining of tables | | | |
Integration with other visualization tools | | | |
As a result, we will primarily use JMP to prepare the datasets. Upon completion, the dataset will be exported from JMP as a .csv file and saved as a separate dataset for use with various visualization tools at a later stage.
5.4: Data Preparation Implementation
5.4.1: Dataset 1 - Survey Results
Firstly, the data file, "wiki4H.csv", is imported into JMP by conducting the following steps:
5.4.1.1: Completeness of Data
To check the completeness of the data, we replace the missing-value marker "?" with " " (blank) in JMP, as below.
This then generates statistics that identify the missing values, as below.
With the above findings, we will:
- Keep the field "UOC_Position", because it is important information for identifying the background of the respondents. Missing values will be recoded as "Unknown" instead.
- Exclude "Other_Position" and "OtherStatus", since they only provide additional information for "UOC_Position", and both have more than 25% missing data.
Output: the results are as below.
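For reproducibility, here is a minimal pandas sketch of the same completeness checks performed in JMP; the column names follow the dataset's attribute list and should be adjusted if your copy of the file differs.

```python
import pandas as pd

# Read the survey file, treating the "?" marker as a missing value
df = pd.read_csv("wiki4H.csv", na_values="?")

# Summary statistics to locate the missing values, column by column
print(df.isna().sum().sort_values(ascending=False))

# Exclude the two sparse fields (more than 25% missing)
df = df.drop(columns=["Other_Position", "OtherStatus"])

# Keep UOC_Position, recoding its missing values as "Unknown"
df["UOC_Position"] = df["UOC_Position"].fillna("Unknown")
```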
5.4.1.2: Consistency of Data
Next, we check the consistency of the data. Other than the survey results for each survey question, the dataset contains other relevant information about the respondents. Name labels work better visually than the numerical labels used in the raw dataset, so the categorical fields describing the respondents are recoded. The actions are summarized below.
No. | Field Name | Type of Information | Action |
---|---|---|---|
1 | AGE | Respondent's Profile | |
2 | GENDER | Respondent's Profile | Change to nominal, and recode as "Male"/"Female" |
3 | DOMAIN | Teaching Background of Respondent | Recode with the respective definition from the Research Paper |
4 | PhD | Teaching Background of Respondent | Recode as "Yes"/"No" |
5 | YEARSEXP | Teaching Background of Respondent | |
6 | UNIVERSITY | Teaching Background of Respondent | Recode with the respective definition from the "Attribute Information" on the webpage |
7 | UOC_POSITION | Teaching Background of Respondent | Recode with the respective definition from the "Attribute Information" on the webpage |
8 | USERWIKI | Teaching Background of Respondent | Recode with the respective definition from the "Attribute Information" on the webpage |
The output is as below.
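A pandas sketch of the recoding above is shown here; the numeric codes in the mappings are placeholders, so substitute the actual codes documented under "Attribute Information" on the webpage and in the Research Paper.

```python
# Placeholder mappings - replace the numeric codes with the actual ones
# documented on the dataset webpage / in the Research Paper
recode_maps = {
    "GENDER":   {0: "Male", 1: "Female"},
    "PhD":      {0: "No", 1: "Yes"},
    "USERWIKI": {0: "No", 1: "Yes"},
}
for col, mapping in recode_maps.items():
    # map() turns the numeric codes into nominal labels
    df[col] = df[col].map(mapping).astype("category")
```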
Next, referring to 5.2.1: Common Mistake 1 - Using Arithmetic Mean, we replace the numerical values of the survey results with their descriptions: 1 - Strongly Disagree, 2 - Disagree, 3 - Neutral, 4 - Agree, 5 - Strongly Agree.
Since a total of 41 questions/fields are affected, we save the JMP file as "wiki4H_Step1_ClearedfromJMP.xlsx" and proceed with cleaning the data in the Excel file, as below.
The file is saved as "wiki4H_Step2_Final.xlsx".
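The same 41-column recode can also be scripted rather than done by hand in Excel. Below is a minimal pandas sketch; the profile-column list is an assumption used to isolate the question columns, so adjust it to match your file.

```python
likert_labels = {1: "Strongly Disagree", 2: "Disagree", 3: "Neutral",
                 4: "Agree", 5: "Strongly Agree"}

# Everything that is not respondent-profile information is treated as a
# survey-question column (an assumption - adjust the list to your file)
profile_cols = ["AGE", "GENDER", "DOMAIN", "PhD", "YEARSEXP",
                "UNIVERSITY", "UOC_Position", "USERWIKI"]
question_cols = [c for c in df.columns if c not in profile_cols]

# Replace the numeric scores with their labels across all question columns
df[question_cols] = df[question_cols].replace(likert_labels)
```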
5.4.1.3: Accuracy of Data
Different Order of Question:
With reference to 5.2.2: Common Mistake 2 - Inconsistent Order to the Questions, it is essential to check the order of the survey questions and the polarity of each question. Reading through the survey questions, QU4 was spotted to show negative polarity in its statement. Since
- we cannot assume that "disagree" with the statement "Wikipedia has a lower quality" is the same as "agree" with the statement "Wikipedia has a high quality", and
- there are sufficient questions (a total of 4, excluding QU4) under this section of "Quality",
we shall exclude the results of QU4 from our analysis, and the column shall be removed.
Same Option Selected by Respondents Throughout the Survey:
Next, we also examine the accuracy of the submissions in Excel. We can assume that if a respondent submitted the same answer to all 43 questions, the answers are unlikely to reflect real feelings: it is improbable for a person to feel exactly the same throughout the survey, especially given that QU4 shows negative polarity.
Thus, the 5 rows of data identified this way will be removed.
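A sketch of both accuracy checks follows; "Qu4" and the question-column list carried over from the earlier sketch are assumptions about the file's column names.

```python
# 1) Straight-lining: flag respondents whose answer never varies across
#    the question columns (checked while Qu4 is still present, since
#    identical answers to a statement and its negation is the red flag)
straight_liners = df[question_cols].nunique(axis=1, dropna=True) == 1
print(straight_liners.sum())        # the text identifies 5 such rows
df = df[~straight_liners]

# 2) Negative polarity: drop Qu4 from the analysis entirely
df = df.drop(columns=["Qu4"])
question_cols = [c for c in question_cols if c != "Qu4"]
```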
Similar Type of Questions:
"Use Behavior" and "Experience" have the similar question posted, and it does not make sense to have a respondent to "Agree" with the question in "Use Behavior", and on the other hand, "Disagree" in question in "Experience". For instance, if a respondent agrees with "I use Wikipedia to develop my teaching material" (Use Behavior), but he/she should not disagree with "I consult Wikipedia for the issue related to my field of expertise".
Since "Use Behavior" is used in the Research Paper, we shall exclude the use of results from "Experience" in the analysis.
5.4.1.4: Final Output of "Survey Result" Table
A final check on the results of the data preparation for this table shows that the total number of rows has been reduced to 904, across 5 categories of survey results.
The file is then exported and saved as "wiki4H_SurveyResultTable_Final.xlsx", ready for use later.
5.4.2: Survey Question Master
Next, a manual table was created to help:
- design the dashboard such that each question appears upon selection, so that the audience can relate the results to the question;
- categorize the survey questions based on the category name assigned to each question; and
- identify whether a question is used in the analysis of the survey results, with reference to the research paper.
Field Name | Type | Header Text | Description | Source |
---|---|---|---|---|
Qn_Code (Raw) | Input | Survey's code and question, as copied from the website | NA | Original data copied from the website |
Category | Input | Category of the question | NA | Original data copied from the website |
Qn_Code | Calculated | Survey's question code only | =TRIM(LEFT([Qn_Code (Raw)],FIND(":",[Qn_Code (Raw)])-1)) | |
Qn_Details | Calculated | Survey question only | =TRIM(RIGHT([Qn_Code (Raw)],LEN([Qn_Code (Raw)])-FIND(":",[Qn_Code (Raw)]))) | |
Research_Paper | Input | Whether the survey question is used in the research paper | NA | With reference to pages 4 & 8 of the Research Paper |
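For reference, here is a pandas equivalent of the two Excel formulas above, splitting the raw string at its first colon into the question code and the question text (the sample row is illustrative):

```python
import pandas as pd

master = pd.DataFrame(
    {"Qn_Code (Raw)": ["QU1: Articles in Wikipedia are reliable"]}  # illustrative row
)

# Split at the first ":" - the left part is the code, the right the question
split = master["Qn_Code (Raw)"].str.split(":", n=1, expand=True)
master["Qn_Code"] = split[0].str.strip()
master["Qn_Details"] = split[1].str.strip()
print(master[["Qn_Code", "Qn_Details"]])
```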
Note: the survey question "VIS3 - I cite Wikipedia in my academic papers" is reclassified under "Quality" instead of "Visibility" because it measures the quality of the Wikipedia article more than its visibility. For instance, an article is most likely cited because it is relevant, rather than because it is easily searchable; citing an article also acts as an endorsement of it.
The file is then saved as "Survey_QN_Master.xlsx", ready for use later.