ISSS608 2016-17 T1 Assign2 Lee Mei Hui Cheryl
Abstract
The data provided were survey responses from faculty members at two universities in Spain regarding their perceptions of and attitudes toward Wikipedia, which is the theme of this investigation.
At first glance, the attributes provided prompted the following questions:
- What are the general trends in the survey?
- Is there a relationship between demographic and response?
Data Preparation
Before the dataset could be analysed, it first had to be formatted appropriately. JMP was used for this purpose.
Recode missing values
First, the dataset was opened in JMP and its attributes were compared against the metadata from the website to check whether all variables had been assigned the right data type. Most of the variables had been categorized as character although they contained numerical values. Next, a univariate analysis was conducted, which revealed that missing values were labelled as “?”. In every affected column, “?” was then recoded to blank, with the recoded values stored in separate new columns.
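The recoding itself was done in JMP. For readers who prefer code, a rough pandas equivalent of the same step might look like the sketch below. The toy data and the "_recoded" suffix are assumptions made for illustration and are not taken from the actual dataset.

```python
import pandas as pd

# Toy stand-in for the survey data (illustrative values only).
# Every column is read as text because "?" marks a missing value.
raw = pd.DataFrame({
    "GENDER": ["0", "1", "?"],
    "QU4": ["5", "?", "2"],
})

# Blank out "?" (it becomes NaN), then convert the text columns to numbers.
cleaned = raw.mask(raw == "?").apply(pd.to_numeric)

# Keep the original columns and add the recoded ones as separate new columns.
# The "_recoded" suffix is a hypothetical naming choice.
recoded = raw.join(cleaned.add_suffix("_recoded"))
print(recoded)
```

Keeping the original columns next to the recoded ones makes it easy to check that only the “?” entries changed.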
Recode demographics to their corresponding meaning
Next, the demographic columns were recoded to their respective meanings. For instance, in the gender column, 0 was recoded to M and 1 to F. This was repeated for all columns from Gender to UserWiki. The univariate analysis showed that the “DOMAIN” column contained values from 1 to 6, although the metadata only defined values from 1 to 5. Because a significant proportion of respondents chose 6, it was assumed to represent "Others": these respondents may not have belonged to any of the specified faculties (for instance, they could have been Business graduates). The value of 6 was therefore recoded to "Others".
The “Other_Status” column contained values from 1 to 7, but the metadata only defined values from 1 to 6. Unlike in the "DOMAIN" column, however, the meaning of the additional value 7 could not be inferred. Hence, this column was dropped from the analysis.
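The demographic recoding described above could be sketched in pandas roughly as follows. This is only an illustration of the idea, not the JMP procedure that was actually used; the toy values and the exact column spellings are assumptions, and the DOMAIN codes 1 to 5 are left as they are because their labels are not listed in this report.

```python
import pandas as pd

# Toy demographic data (illustrative values; column spellings are assumed).
demo = pd.DataFrame({
    "GENDER": [0, 1, 1],
    "DOMAIN": [1, 6, 3],
    "OTHER_STATUS": [2, 7, 1],
})

# Recode gender: 0 -> M, 1 -> F.
demo["GENDER"] = demo["GENDER"].map({0: "M", 1: "F"})

# Recode the undocumented DOMAIN value 6 to "Others".
# Codes 1 to 5 stay as text codes here; their labels come from the metadata.
demo["DOMAIN"] = demo["DOMAIN"].astype(str).replace({"6": "Others"})

# Drop OTHER_STATUS because its extra value 7 could not be interpreted.
demo = demo.drop(columns=["OTHER_STATUS"])
print(demo)
```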
Recode Question QU4
According to the metadata, question QU4 was phrased negatively, asking whether Wikipedia had a lower quality than other educational resources. The other questions in that category, however, were phrased positively about quality. Hence, the data for QU4 were recoded such that 5 became 1, 4 became 2, and vice versa. This ensured that all questions within the category pointed in the same direction and affected the overall category in the same way.
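The reversal amounts to subtracting each answer from 6, assuming a 1 to 5 response scale (which the 5 to 1 and 4 to 2 mapping implies). A minimal pandas sketch, with a hypothetical "QU4_reversed" column name:

```python
import pandas as pd

# One row per possible answer on the assumed 1-5 scale.
responses = pd.DataFrame({"QU4": [1, 2, 3, 4, 5]})

# Reverse-code: 5 -> 1, 4 -> 2, 3 stays 3, 2 -> 4, 1 -> 5.
responses["QU4_reversed"] = 6 - responses["QU4"]
print(responses)
```

Missing answers stay missing under this arithmetic, so the step can be applied before or after the missing values are recoded.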
Creating ID column
The dataset provided did not label each row as an individual response. Hence, an ID column was created using the formula Row().
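JMP's Row() returns the row number starting from 1. A comparable ID column could be added in pandas as in this sketch (the toy data is illustrative only):

```python
import pandas as pd

# Toy data without any respondent identifier (illustrative values only).
survey = pd.DataFrame({"GENDER": ["M", "F", "F"], "QU4": [3, 5, 2]})

# Add a 1-based ID as the first column, mirroring what Row() produces in JMP.
survey.insert(0, "ID", range(1, len(survey) + 1))
print(survey)
```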
Stack column
The stack function was then applied to all question columns to create a modified dataset in which each respondent's answers are stacked into one column instead of being spread across multiple columns. Having a single column for all responses made the subsequent analysis easier.
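Stacking corresponds to reshaping the data from wide to long format. In pandas this could be sketched with melt, as below; the column names "Question" and "Response" and the toy values are assumptions for illustration, not the names produced by JMP.

```python
import pandas as pd

# Toy wide-format data: one row per respondent, one column per question.
wide = pd.DataFrame({
    "ID": [1, 2],
    "GENDER": ["M", "F"],
    "QU1": [4, 2],
    "QU4": [3, 5],
})

# Stack the question columns into a single "Response" column.
# ID and GENDER are repeated on every row so each answer keeps its respondent.
stacked = wide.melt(
    id_vars=["ID", "GENDER"],
    value_vars=["QU1", "QU4"],
    var_name="Question",
    value_name="Response",
)
print(stacked)
```

Each respondent now contributes one row per question, which is the shape Tableau needs to treat the question as a dimension.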
Results and evolution of visualization
The formatted dataset from JMP was then saved in .csv format and opened in Tableau to visualize the data and answer the questions defined at the beginning. In Tableau, columns that were not used were hidden (e.g. the pre-correction versions of the recoded columns). Question codes were given aliases describing their respective questions before I started experimenting with data visualization in Tableau.
What are the general trends in the survey?
While looking for relationships between questions and the distribution of answers, I realized that there were too many questions to visualize in a single view. The questions were therefore grouped into their respective categories, and a hierarchy was created from question group to individual question. A graph was then created plotting the questions against the number of responses, by category, together with the average response for each category.
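The grouping and averaging were done in Tableau. The same summary could be approximated in pandas as sketched below. The sketch assumes that a question's category is the letter prefix of its code (so QU4 belongs to category QU); the code "PU1" and all values are hypothetical and used only for illustration.

```python
import pandas as pd

# Toy stacked data: one row per respondent per question (illustrative values).
stacked = pd.DataFrame({
    "Question": ["QU1", "QU1", "QU4", "PU1", "PU1"],
    "Response": [4, 2, 3, 5, None],
})

# Assumption: the category is the leading letters of the question code.
stacked["Category"] = stacked["Question"].str.extract(r"^([A-Za-z]+)", expand=False)

# Number of non-missing responses and average response per category.
summary = stacked.groupby("Category")["Response"].agg(["count", "mean"])
print(summary)
```

Missing responses are excluded from both the count and the mean, so a category with many blanks is not pulled toward zero.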