ISSS608 2016-17 T1 Assign2 Wan Xulang

From Visual Analytics and Applications
Revision as of 16:05, 26 September 2016 by Xulang.wan.2015

Abstract

Wikipedia is probably the most powerful knowledge base in the world. It is open, web-based and free, and nearly anything you want to know can be looked up in this online library. One reason for its success is that anyone in the world can contribute what they know, which makes sharing knowledge easy and free. In this assignment, using a dataset on people's perceptions of Wikipedia, we explore how people use and think about this project.

Problem

Before starting, we set out the questions we want to answer:

  1. What is the distribution of the participants?
  2. Which kinds of people are more likely to give Wikipedia a higher overall evaluation score?
  3. Who is more likely to become a registered user?
  4. According to the survey, how do faculty members perceive Wikipedia from different perspectives?
  5. According to the feedback, what impacts behavioural intention the most?

Data Introduction & Preparation

Data Introduction

The dataset comes from the UCI Machine Learning Repository. It was collected from a survey of faculty members at two universities, UOC and UPF. The dataset has 53 attributes and 913 records. The attributes fall into two parts: the first covers each faculty member's personal information, and the second holds the responses to a series of Likert-scale questions covering a wide range of experiences with Wikipedia. The raw data looks like this:

Raw-data.PNG

Data Preparation

Since this dataset is fairly large, we first clean it in several steps:

  1. Import the raw set into Excel
  2. Build two categorical variables named “AGE-GROUP” and “YEARS-GROUP” by grouping the two interval columns “AGE” and “YEARSEXP”.
  3. Replace the numeric codes of “GENDER”, “Domain”, “PhD”, “UNIVERSITY”, “UOC_POSITION”, “OTHER_POSITION”, “OTHERSTATUS” and “USERWIKI” with readable English words according to the data dictionary.
  4. Replace the missing values of each interval variable with the mean of that variable.
  5. Calculate a total survey score for each person: the sum of all answers divided by 215, which is the full mark of the survey. This variable is therefore a proportion of the full mark, and it broadly represents a person’s perception of Wikipedia.
  6. The survey is divided into 13 kinds of questions, so we calculate the average score each person gives to each kind. This yields 13 new variables, each the average of the answers belonging to one part.
  7. Delete the rows that have missing values in categorical variables.
  8. Delete variables that are no longer needed, such as the answers to individual questions, since the newly generated variables represent them.

Finally, we have a cleaned dataset like this:

Cleaned.PNG
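
The preparation steps listed above were carried out in Excel. Purely as an illustration, a rough pandas equivalent might look like the sketch below. The toy data, the item names Q1 to Q3, the group names G1 and G2, the age bins and the GENDER coding are all assumptions made for this example; they are not taken from the real dataset or its data dictionary. The full mark is computed as number of items times 5, which matches the 215 used above only if the real survey has 43 five-point items.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data. Q1..Q3 are hypothetical item names.
raw = pd.DataFrame({
    "AGE": [28, 45, 61, 39],
    "YEARSEXP": [3.0, 20.0, np.nan, 12.0],
    "GENDER": [0, 1, 1, np.nan],
    "Q1": [4, 5, np.nan, 3],
    "Q2": [2, 5, 4, 3],
    "Q3": [3, 4, 5, 3],
})

# Hypothetical grouping of items into question kinds (13 kinds in the real survey).
ITEM_GROUPS = {"G1": ["Q1", "Q2"], "G2": ["Q3"]}


def prepare(df, item_groups, interval_cols, category_cols, max_per_item=5):
    """Sketch of steps 2 to 8 of the data preparation."""
    out = df.copy()
    item_cols = [c for cols in item_groups.values() for c in cols]

    # Step 4: fill missing interval values with the column mean.
    numeric = interval_cols + item_cols
    out[numeric] = out[numeric].fillna(out[numeric].mean())

    # Step 2: bin AGE into a categorical variable (bin edges are assumed).
    out["AGE-GROUP"] = pd.cut(
        out["AGE"],
        bins=[0, 30, 40, 50, 120],
        labels=["<=30", "31-40", "41-50", ">50"],
    )

    # Step 3: replace numeric codes with words (coding is assumed).
    out["GENDER"] = out["GENDER"].map({0: "Male", 1: "Female"})

    # Step 5: total score as a fraction of the full mark.
    full_mark = len(item_cols) * max_per_item
    out["TOTAL_SCORE"] = out[item_cols].sum(axis=1) / full_mark

    # Step 6: average score per kind of question.
    for name, cols in item_groups.items():
        out[name] = out[cols].mean(axis=1)

    # Step 7: drop rows with missing categorical values.
    out = out.dropna(subset=category_cols)

    # Step 8: drop the individual answers, now summarised by the new variables.
    return out.drop(columns=item_cols).reset_index(drop=True)


clean = prepare(raw, ITEM_GROUPS, ["AGE", "YEARSEXP"], ["GENDER"])
print(clean)
```

Note that the means in step 4 are computed before any rows are dropped in step 7, which follows the order of the steps as listed.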

Approaches

What is the distribution of the participants?

Which kinds of people are more likely to give Wikipedia a higher overall evaluation score?

Who is more likely to become a registered user?

According to the survey, how do faculty members perceive Wikipedia from different perspectives?

According to the feedback, what impacts behavioural intention the most?