ISSS608 2016-17 T1 Assign2 Thian Fong Mei
Abstract
This assignment explores data discovery in high-dimensional data using visual analytics techniques and methods. The data set used is the wiki4HE Data Set (https://archive.ics.uci.edu/ml/datasets/wiki4HE).
Problem and Motivation
Wikipedia is built on an open collaboration model. There are reservations about the use of Wikipedia in academia, as it is commonly perceived as a "flawed knowledge community" and "a collaboratively generated encyclopedia (which) cannot meet the high standards of quality." The data set comes from a survey of university faculty perceptions and practices of using Wikipedia as a teaching resource, and draws on the Technology Acceptance Model (TAM) to examine the interrelationships. The study reported that "both the perception of colleagues' opinions about Wikipedia and the perceived quality of the information in Wikipedia play a central role in the obtained model". tfm_TAM.png
Theme of Interest
This assignment does not seek to validate the TAM. Instead, it seeks to uncover highlights of the results and to examine the relationship between the results and the profile of the faculty survey participants.
Approach
Data Source
The data source is the wiki4HE Data Set (https://archive.ics.uci.edu/ml/datasets/wiki4HE). There are 913 records (survey participants) and 53 variables. 10 of the variables are user profile information. The 43 survey question variables are classified into 13 main categories.
Data Preparation & Approach
Data preparation is done mainly in JMP, which is also used for some of the analysis.
1st iteration: The data is imported into JMP for a first round of exploration. The GENDER, DOMAIN, PhD, YEARSEXP, UNIVERSITY, UOC_POSITION, OTHER_POSITION, OTHERSTATUS and USERWIKI variables are recoded from numeric codes to category names, so that these user profile variables are more meaningful and easier to interpret. Missing values are recoded as Unknown. Comparing the data against the attribute information given shows that the OTHER_POSITION and OTHERSTATUS variable names are swapped, so these 2 variable names are changed to reflect the variables correctly. Univariate/Bivariate/Multivariate distribution, Scatterplot and Graph Builder are run to provide an overview of the data's distribution.
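The recoding in this iteration was done in JMP; the same idea can be sketched in plain Python for readers who want to reproduce it outside JMP. This is a minimal sketch only. The code-to-label pairs in LABELS and the "?" missing-value marker are assumptions for illustration and should be checked against the data set's attribute information; recode_profile and swap_column_names are hypothetical helper names.

```python
# Illustrative label maps. The code-to-label pairs are assumptions and
# should be checked against the data set's attribute information.
LABELS = {
    "GENDER": {"0": "Male", "1": "Female"},
    "PhD": {"0": "No", "1": "Yes"},
}
MISSING = "?"  # assumed marker for a missing value in the raw file


def recode_profile(row):
    """Return a copy of row with coded profile values replaced by labels.

    Missing values, and any code not listed in LABELS, become "Unknown".
    """
    out = dict(row)
    for column, mapping in LABELS.items():
        value = out.get(column, MISSING)
        out[column] = mapping.get(value, "Unknown")
    return out


def swap_column_names(row, a="OTHER_POSITION", b="OTHERSTATUS"):
    """Return a copy of row with the values under names a and b exchanged."""
    out = dict(row)
    out[a], out[b] = row[b], row[a]
    return out


sample = {"GENDER": "1", "PhD": "?", "OTHER_POSITION": "2", "OTHERSTATUS": "1"}
recoded = swap_column_names(recode_profile(sample))
print(recoded)
```

Keeping the label maps in one dictionary makes it easy to extend the recoding to the remaining profile variables once their codes are confirmed.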
2nd iteration: Missing values in the survey questions are recoded as 0 in the interim to denote no response. The individual survey question scores are totaled under the 13 categories, creating 13 additional columns. Ternary plots are run to look for any association or pattern among any 3 variables. Parallel plots are also run to check for relations between the survey questions. Secondary tools such as Mondrian, Treemap and High-D are also used.
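The category totals described above can be sketched as follows. This is an illustration rather than the JMP steps actually used. It assumes that each survey question column is named as a category prefix followed by a question number (for example PU1, PU2) and that missing answers are marked "?"; both are assumptions to verify against the data. category_of and category_totals are hypothetical helper names.

```python
import re
from collections import defaultdict

MISSING = "?"  # assumed marker for a missing answer


def category_of(question):
    """Strip the trailing question number, e.g. 'PU1' -> 'PU'."""
    return re.sub(r"\d+$", "", question)


def category_totals(answers):
    """Sum one participant's scores per category, counting missing as 0."""
    totals = defaultdict(int)
    for question, value in answers.items():
        score = 0 if value == MISSING else int(value)
        totals[category_of(question)] += score
    return dict(totals)


answers = {"PU1": "4", "PU2": "?", "PU3": "5", "ENJ1": "3", "ENJ2": "2"}
totals = category_totals(answers)
print(totals)
```

Note that treating a missing answer as 0 lowers the total for that category, so totals for participants with many missing answers are not directly comparable with complete responses.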
The work from the 1st and 2nd iterations does not prove to be very useful.
3rd iteration: The survey questions are transposed/stacked in JMP, with each faculty survey participant tagged by a newly created ID column. 2 additional columns are created: one for the survey category, and one that recodes the numeric survey responses to nominal/text values. A metadata table with the survey question wording is also created for greater clarity. In this 3rd iteration, Tableau is used to create a divergent bar chart, a heatmap and a treemap.
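The stacking step can be sketched in plain Python as a wide-to-long reshape. This is a minimal sketch, not the JMP procedure itself. The wording of the LIKERT labels, the "No response" label, and the rule that derives the category from the question name are all assumptions for illustration; stack_responses is a hypothetical helper name.

```python
import re

# Assumed text labels for the 1-5 response scale.
LIKERT = {
    1: "Strongly disagree",
    2: "Disagree",
    3: "Neutral",
    4: "Agree",
    5: "Strongly agree",
}


def stack_responses(rows, question_columns):
    """Turn one row per participant into one row per (participant, question)."""
    long_rows = []
    for respondent_id, row in enumerate(rows, start=1):
        for question in question_columns:
            raw = row[question]
            # Non-numeric values (e.g. "?") are treated as no response.
            score = int(raw) if raw.isdigit() else None
            long_rows.append({
                "ID": respondent_id,
                "Question": question,
                "Category": re.sub(r"\d+$", "", question),
                # Scores outside 1-5, including 0, also map to "No response".
                "Response": LIKERT.get(score, "No response"),
            })
    return long_rows


rows = [{"PU1": "4", "PU2": "?"}, {"PU1": "1", "PU2": "5"}]
long_rows = stack_responses(rows, ["PU1", "PU2"])
for record in long_rows:
    print(record)
```

A long table of this shape, with one row per participant and question, is what charting tools such as Tableau typically expect when building a divergent bar chart or heatmap of responses by question.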