ISSS608 2016-17 T1 Assign2 Mukund Krishna Ravi
Contents
- 1 Overview
- 2 Theme of Interest and Motivation
- 3 Data Set
- 4 Data Preparation
- 5 Analysis
- 6 Tools Utilized
- 7 Conclusion
- 8 References
Overview
In this digital economy age, massive and complex data are captured and stored in organizational databases and data warehouses. By and large, these data contain a large number of variables describing a particular product, customer or activity. Due to limitations in perceptual and screen space, the graphical techniques available in traditional business intelligence systems tend to be confined to univariate and bivariate displays such as bar charts, pie charts and scatter plots. As a result, many important relationships in these data remain undiscovered. For instance, the wiki4HE dataset contains many relationships between the survey responses and the different academic segments. These relationships are hidden and require more complex visualization techniques to uncover.
Theme of Interest and Motivation
The theme is ongoing research on university faculty perceptions and practices of using Wikipedia as a teaching resource. Based on a Technology Acceptance Model, the relationships within the internal and external constructs of the model are analyzed. Both the perception of colleagues' opinion about Wikipedia and the perceived quality of the information in Wikipedia play a central role in the obtained model. For this problem I have chosen to focus on only a few areas of interest and to discover intricate relationships in the data set that would not be visible with basic visualization techniques. The following are a few key aspects of the problem:
How do various user segments and domains rate Wikipedia and the perceived quality of information in Wikipedia?
To understand this behavior we analyse the following criteria:
1. How different domains rate perceived usefulness
2. How different users rate perceived usefulness
3. How different users rate their experience of Wikipedia
4. How different users rate the quality of Wikipedia
Do registered users of a particular age have any preference towards Wikipedia (rating of 4 and 5)?
To understand this behavior we examine how different age segments perceive the usefulness and the quality of Wikipedia by analyzing the associated parameters, such as the usefulness and quality ratings.
Relationship between registered users and age
To understand this behavior we need to uncover correlations between registration status and age.
Data Set
For this assignment, I have chosen the wiki4HE dataset, which comes from ongoing research on university faculty perceptions and practices of using Wikipedia as a teaching resource. Based on a Technology Acceptance Model, the relationships within the internal and external constructs of the model are analysed. Both the perception of university teaching staff’s opinion about Wikipedia and the perceived quality of the information in Wikipedia play a central role in the obtained model. The original data set can be found on the UC Irvine Machine Learning Repository’s website [1]. The original data set is formatted as a CSV file.
Data Preparation
Before we can analyse and explore the data, it has to be prepared. JMP Pro and Excel were used to clean and prepare the data set. The data set contains a total of 913 entries.
1. Step 1
In the first step we hide the columns that are irrelevant to us. The data set was downloaded in .csv format and opened with Excel 2013. All the values were comma-delimited. The data set was separated into survey responses and non-survey responses. Except for age, all the values are converted to ordinal values, because they are categorical and have a natural order (rating values from 1-6).
2. Step 2
In the second step we analyse all the non-survey values. In this step we mainly aim to eliminate all values that are '?'. In the DOMAIN category 0.2% of the values are '?', in UOC 12.37% of the values are '?', and in userwiki 0.4% of the values are '?'. As these values cannot be imputed without using statistical models, they have to be excluded from our analysis. To exclude them, we select the filter option in Excel and exclude all rows that contain a '?'. The total number of rows is now reduced to 796.
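The filtering in this step was done with Excel filters. As a rough equivalent, the sketch below shows the same idea in plain Python: measure the share of '?' in each non-survey column, then keep only rows with no '?' in any of them. The sample rows are hypothetical, and the column names (DOMAIN, UOC, USERWIKI) are assumed from the text rather than checked against the file header.

```python
# Hypothetical sample rows; the real data set has 913 entries.
# Column names are assumptions based on the names used in the text.
rows = [
    {"AGE": "40", "DOMAIN": "2", "UOC": "1", "USERWIKI": "0"},
    {"AGE": "35", "DOMAIN": "?", "UOC": "2", "USERWIKI": "1"},
    {"AGE": "51", "DOMAIN": "1", "UOC": "?", "USERWIKI": "0"},
    {"AGE": "29", "DOMAIN": "6", "UOC": "6", "USERWIKI": "?"},
    {"AGE": "44", "DOMAIN": "3", "UOC": "6", "USERWIKI": "0"},
]

NON_SURVEY = ["DOMAIN", "UOC", "USERWIKI"]
MISSING = "?"


def missing_share(rows, column):
    """Percentage of rows whose value in `column` is the '?' marker."""
    if not rows:
        return 0.0
    return 100.0 * sum(r[column] == MISSING for r in rows) / len(rows)


def drop_missing(rows, columns):
    """Keep only rows that have no '?' in any of the given columns."""
    return [r for r in rows if all(r[c] != MISSING for c in columns)]


shares = {c: missing_share(rows, c) for c in NON_SURVEY}
cleaned = drop_missing(rows, NON_SURVEY)

print(shares)
print(len(rows), "rows before,", len(cleaned), "rows after")
```

Reporting the share per column before dropping rows makes it easy to see which column is responsible for most of the lost rows.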
3. Step 3
In the third step we analyse the distribution of each of the non-survey data. We notice that there are '?' values in the data set which need to be handled. These values have been imputed to 0, using filters in Excel 2013. The 0 entries will not affect the distribution of each feature, because 0 is not a valid rating and therefore carries no meaning in the analysis.
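The replacement of '?' by 0 can be sketched as follows. This is only an illustration under assumptions: the column name PU1 and the sample values are hypothetical, and leaving the 0 placeholder out when counting ratings is one possible way to make sure it does not influence the distribution, not necessarily how the Excel workflow handled it.

```python
from collections import Counter

# Hypothetical raw values for one column, as strings read from the CSV.
# "PU1" is an assumed column name used only for illustration.
pu1_raw = ["4", "?", "3", "5", "?", "4"]


def impute_zero(values, missing="?"):
    """Replace the '?' marker with 0 and convert the rest to int."""
    return [0 if v == missing else int(v) for v in values]


def rating_counts(values):
    """Count each rating, leaving out the 0 placeholder."""
    return Counter(v for v in values if v != 0)


pu1 = impute_zero(pu1_raw)
counts = rating_counts(pu1)

print(pu1)
print(sorted(counts.items()))
```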
Analysis
In our analysis we first try to understand the overall distribution of, and the relationships between, parameters such as Domain, UOC and age.
The following observations can be made from the heat map:
- A large portion of the data set has an unknown domain, and the majority of the faculty in this domain appear to be adjunct faculty. The remaining faculty are assistant, lecturer, associate and professor. The age range is 27-59.
- In the Arts and Humanities domain, the age of the faculty lies between 32 and 62. The major chunk of the sample is adjunct faculty, and the oldest member of this domain is also adjunct. Interestingly, the adjunct faculty seem to be the oldest among all the faculty types. A possible reason is that the adjunct faculty could be emeritus professors.
- In the Law and Politics domain, the age ranges from 30 to 59. The faculty comprise adjunct, associate, lecturer, assistant and professor. The faculty in this domain seem to belong to a slightly younger age group.
- In the Engineering domain, the age of the faculty lies between 28 and 69. The faculty comprise adjunct, associate, assistant and lecturer, and are mainly adjunct. Interestingly, a large number of the younger faculty are adjunct.
- In the Health Sciences domain, the age of the faculty lies between 28 and 62, but the majority of the faculty are quite old, with only a few exceptions. The majority of the faculty in this domain are adjunct. The remaining faculty are associate, assistant and lecturer.
- In the Sciences domain, the age of the faculty lies between 29 and 64. As in all the other domains, most of the faculty are adjunct.
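The observations above were read off the heat map. A small numerical cross-check of the same quantities (age range and most common faculty type per domain) could look like the sketch below. The records are hypothetical sample values, not rows from the data set.

```python
from collections import Counter, defaultdict

# Hypothetical records: (domain, faculty type, age).
records = [
    ("Engineering", "Adjunct", 28),
    ("Engineering", "Associate", 69),
    ("Engineering", "Adjunct", 35),
    ("Sciences", "Adjunct", 29),
    ("Sciences", "Lecturer", 64),
    ("Sciences", "Adjunct", 41),
]


def summarize(records):
    """Return the age range and most common faculty type per domain."""
    ages = defaultdict(list)
    positions = defaultdict(Counter)
    for domain, position, age in records:
        ages[domain].append(age)
        positions[domain][position] += 1
    return {
        domain: {
            "age_range": (min(ages[domain]), max(ages[domain])),
            "top_position": positions[domain].most_common(1)[0][0],
        }
        for domain in ages
    }


summary = summarize(records)
for domain, info in summary.items():
    print(domain, info["age_range"], info["top_position"])
```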
Interactive Divergent Stacked Bar Chart

- The divergent stacked bar chart allows quick visualization of the different responses that the respondents gave to each question. The filters on the right include domain, question type, age range, years of experience, and teaching position. The dashboard helps in understanding the respondents' sentiments on a topic of interest. For example, if I am concerned with the respondents' experience with Wikipedia and their use behaviour, I would use the Question Type filter to select these questions and review the responses.
Please refer to the interactive dashboard at the following URL: https://public.tableau.com/profile/publish/DivergentStackedBarChart/Sheet1#!/publish-confirm
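The dashboard itself is hosted on Tableau Public. To illustrate how the bars of a divergent stacked bar chart can be positioned, the sketch below converts response counts for one question into percentage segments and shifts them so that the middle category straddles zero. This is one common convention and not necessarily the calculation used in the dashboard. The counts and the five-point scale are assumptions for illustration, and the centring logic assumes an odd number of categories.

```python
# Hypothetical response counts for one question on an assumed 1-5 scale.
counts = {1: 10, 2: 20, 3: 40, 4: 20, 5: 10}


def diverging_segments(counts, scale=(1, 2, 3, 4, 5)):
    """Return (rating, left, right) bar segments in percent.

    The middle rating is centred on zero, so lower ratings extend to
    the left and higher ratings extend to the right.
    """
    total = sum(counts.get(k, 0) for k in scale)
    if total == 0:
        return []
    pct = [100.0 * counts.get(k, 0) / total for k in scale]
    mid = len(scale) // 2
    # Start far enough left that half of the middle bar lies below zero.
    left = -(sum(pct[:mid]) + pct[mid] / 2.0)
    segments = []
    for rating, width in zip(scale, pct):
        segments.append((rating, left, left + width))
        left += width
    return segments


segments = diverging_segments(counts)
for rating, left, right in segments:
    print(rating, left, right)
```

Each tuple can be drawn as a horizontal bar from left to right, so questions with more favourable ratings visibly lean to the right of the zero line.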