Difference between revisions of "ISSS608 2016-17 T1 Assign2 Wan Xulang"
Revision as of 16:44, 26 September 2016
Contents
- 1 Abstract
- 2 Problem
- 3 Data Introduction & Preparation
- 4 Approaches
- 4.1 What are the participants’ distribution?
- 4.2 Who and which kind of people are more likely to give higher overall evaluation score to Wiki?
- 4.3 Who is more likely to become a registered user?
- 4.4 According to the survey, how do the faculty members perceive Wikipedia in different perspectives?
- 4.5 According to the feedback, what impacts behavioural intention the most?
Abstract
Wikipedia is probably the most powerful knowledge base in the world. It is open, web-based and free, and nearly anything you want to know can be looked up in this online library. One reason for its success is that anyone in the world can contribute what they know, which makes sharing knowledge easy and free. In this assignment, using a dataset on how people perceive Wikipedia, we explore how people use and think about this great project.
Problem
Before starting our journey, we set out the following questions:
- What are the participants’ distribution?
- Who and which kind of people are more likely to give higher overall evaluation score to Wiki?
- Who is more likely to become a registered user?
- According to the survey, how do the faculty members perceive Wikipedia in different perspectives?
- According to the feedback, what impacts behavioural intention the most?
Data Introduction & Preparation
Data Introduction
The dataset comes from the UCI Machine Learning Repository. It was collected through a survey of faculty members at two universities, UOC and UPF. It has 53 attributes and 913 records. The attributes fall into two parts: the first covers each faculty member’s personal information, while the second holds the responses to a series of Likert-scale questions about a wide range of user experiences with Wikipedia. The raw data looks like this:
Data Preparation
Because this dataset is fairly large, we clean it first. The steps are:
- Import the raw set into Excel.
- Build two categorical variables named “AGE-GROUP” and “YEARS-GROUP”, binned from the two interval columns “AGE” and “YEARSEXP”.
- Replace the numeric codes of “GENDER”, “Domain”, “PhD”, “UNIVERSITY”, “UOC_POSITION”, “OTHER_POSITION”, “OTHERSTATUS” and “USERWIKI” with readable English labels according to the data dictionary.
- Generate a new categorical variable named “POSITION” from “UOC_POSITION” and “OTHERSTATUS”. The logic is: if UOC_POSITION != ?, then POSITION = UOC_POSITION; otherwise POSITION = OTHERSTATUS.
- Replace the missing values of each interval variable with the mean of that variable.
- Calculate the total survey score for each person as the sum of all answers divided by 215 (the full mark of the survey). This variable is therefore a percentage that broadly represents a person’s perception of Wikipedia.
- As the survey is divided into 13 kinds of questions, calculate the average score a person gives to each kind. This yields 13 new variables, each the average of the answers belonging to one part.
- Delete the rows that have missing values in categorical variables.
- Delete unnecessary variables, such as the answers to individual questions, since the new variables represent them.
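The preparation itself was done in Excel. As an illustration only, the derivation steps could be sketched in plain Python as below. The function names and the age cut points are assumptions made for this sketch (the actual bins of “AGE-GROUP” are not listed here), and rows are assumed to be dictionaries keyed by column name with “?” marking a missing value.

```python
from statistics import mean

MISSING = "?"      # missing-value marker, as implied by the POSITION rule
FULL_MARK = 215    # full mark of the survey, as stated in the steps above


def age_group(age):
    """Bucket an age in whole years. The cut points are illustrative assumptions."""
    if age < 30:
        return "<30"
    if age >= 60:
        return "60+"
    return f"{age // 10 * 10}s"


def derive_position(row):
    """POSITION = UOC_POSITION unless it is missing, otherwise OTHERSTATUS."""
    if row["UOC_POSITION"] != MISSING:
        return row["UOC_POSITION"]
    return row["OTHERSTATUS"]


def impute_mean(rows, column):
    """Replace missing values of an interval column with the column mean (in place)."""
    fill = mean(r[column] for r in rows if r[column] != MISSING)
    for r in rows:
        if r[column] == MISSING:
            r[column] = fill


def total_score(answers):
    """Sum of all answers divided by the full mark, giving a value between 0 and 1."""
    return sum(answers) / FULL_MARK
```

Under the assumption that the survey has 43 questions scored from 1 to 5, a respondent who answers 5 everywhere gets `total_score([5] * 43) == 1.0`, which matches the full mark of 215.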
Finally, we have a cleaned dataset like this:
Approaches
What are the participants’ distribution?
To explore the distribution of the participants, we use a parallel set. The result is like this:
From this graph, we can get a rough picture of the survey participants. By age, participants in the 30s to 50s groups account for 91%, while the other two groups make up only about 9%. By domain, apart from “Else”, participants from arts & humanities account for about 20%, while each of the other domains accounts for less than 15%. University is another notable attribute, as almost 88% of participants come from UOC. By position, most participants are adjuncts, who account for 73%.
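The percentages read off the parallel set can be cross-checked against the cleaned data. The helper below is a minimal sketch rather than part of the original workflow; `category_shares` is a hypothetical name, and rows are assumed to be dictionaries keyed by column name.

```python
from collections import Counter


def category_shares(rows, column):
    """Return the fraction of rows falling into each value of a categorical column."""
    counts = Counter(r[column] for r in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}
```

Calling it with a column such as “UNIVERSITY” or “POSITION” would give the shares that the widths of the bands in the parallel set represent.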
Who and which kind of people are more likely to give higher overall evaluation score to Wiki?
With so many attributes, it is not easy to answer this question with ordinary charts, so we build a tree map instead. The tree map looks like this:
The colour of each square represents the average total evaluation percentage for that subset: the deeper the colour, the higher the average evaluation. Squares on the left belong to males, while the rest belong to females. The colour on the left is deeper overall than on the right, which suggests that males are more likely to give Wikipedia a high evaluation. Another notable finding is that faculty members under 30 years old gave Wikipedia the highest evaluation.
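The value behind each square's colour is a group average. A minimal sketch of that aggregation is shown below; the column name `TOTAL_SCORE` and the function name are assumptions for illustration, not names taken from the cleaned dataset.

```python
from collections import defaultdict


def mean_score_by_group(rows, keys, score="TOTAL_SCORE"):
    """Average the score column within each combination of the grouping keys."""
    sums = defaultdict(lambda: [0.0, 0])   # group -> [running sum, row count]
    for r in rows:
        group = tuple(r[k] for k in keys)
        sums[group][0] += r[score]
        sums[group][1] += 1
    return {group: total / n for group, (total, n) in sums.items()}
```

Grouping by, say, `("GENDER", "AGE-GROUP")` would produce one average per square. Small subsets yield averages based on very few people, so a very deep or very pale square is worth checking against its size before drawing conclusions.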