ISSS608 2016-17 T1 Assign2 Shishir Nehete

From Visual Analytics and Applications

Abstract

As organizations rely increasingly on technology to collect and store data, the demand for extracting insights from that data keeps growing. Currently, most traditional business intelligence systems are confined to univariate and bivariate data analysis. This project applies interactive data exploration and analysis techniques to discover patterns in multivariate data and to explore the relationships within it. The topic used for exploring these techniques is “University faculty perceptions and practices of using Wikipedia as a teaching resource”. This is ongoing research in which the perception of colleagues' opinions about Wikipedia and the perceived quality of information in Wikipedia play a central role.

Theme of Interest and Motivation

The dataset used for this project is the wiki4HE Data Set (https://archive.ics.uci.edu/ml/datasets/wiki4HE).

Identifying a theme of interest

The dataset provides information about the survey respondents on multiple variables such as:
Age, Gender, Domain, PhD, Experience, University (Universitat Oberta de Catalunya, Universitat Pompeu Fabra), UOC_Position, Other, Other_Position, UserWiki.

The survey consists of questions in the following categories to analyse the use of Wikipedia for education purposes.

  1. Perceived Usefulness
  2. Perceived Ease of Use
  3. Perceived Enjoyment
  4. Quality
  5. Visibility
  6. Social Image
  7. Sharing attitude
  8. Use behaviour
  9. Profile 2.0
  10. Job relevance
  11. Behavioural intention
  12. Incentives
  13. Experience

To define the scope of the assignment, I consider five of the categories listed above: Perceived Usefulness, Quality, Visibility, Experience and Sharing Attitude. Limiting the scope gives a confined field of analysis, which can later be extended to the other categories.

Data Preparation

1. Import the data into JMP Pro for data preparation.

  • The data consists of 913 rows, one per survey response.

2. Check for missing data patterns.

  • After initial analysis, the data shows inconsistencies in the attribute values, and multiple attributes have missing values. The following steps describe how these missing values are fixed, based on the data dictionary provided with the data set.
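The missing-data check above was done in JMP Pro. As a rough Python analogue (not the author's workflow), the sketch below counts missing cells per column. It assumes that missing values are marked with '?', as the recoding lists later in this section suggest, and that the file is semicolon-delimited; the sample rows are invented for illustration and are not taken from the real data.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows (NOT from the real file); '?' marks a missing value.
# The ';' delimiter is an assumption about the file layout.
SAMPLE = """AGE;GENDER;DOMAIN;YEARSEXP;USERWIKI
40;0;2;14;0
42;0;5;18;0
37;1;?;?;1
29;1;6;3;?
"""


def count_missing(text, delimiter=";", marker="?"):
    """Return a Counter mapping column name -> number of cells equal to marker."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    missing = Counter()
    for row in reader:
        for column, value in row.items():
            if value.strip() == marker:
                missing[column] += 1
    return missing


missing = count_missing(SAMPLE)
for column, n in missing.most_common():
    print(f"{column}: {n} missing")
```

Columns that never contain the marker simply do not appear in the printed summary.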

3. Check for attribute appropriateness with the data set description.

  • The following attributes are provided in the data dictionary.
     AGE: numeric 
     GENDER: 0=Male; 1=Female 
     DOMAIN: 1=Arts & Humanities; 2=Sciences; 3=Health Sciences; 4=Engineering & Architecture; 5=Law & Politics 
     PhD: 0=No; 1=Yes 
     YEARSEXP (years of university teaching experience): numeric 
     UNIVERSITY: 1=UOC; 2=UPF 
     UOC_POSITION (academic position of UOC members): 1=Professor; 2=Associate; 3=Assistant; 4=Lecturer; 5=Instructor; 6=Adjunct 
     OTHER (main job in another university for part-time members): 1=Yes; 2=No 
     OTHER_POSITION (work as part-time in another university and UPF members): 1=Professor; 2=Associate; 3=Assistant; 4=Lecturer; 5=Instructor; 6=Adjunct 
     USERWIKI (Wikipedia registered user): 0=No; 1=Yes

Comparing the attributes with the data dictionary leads to the following observations:

  • Age, Gender and University do not have any discrepancy.
  • DOMAIN: This attribute has an extra value (6) and missing values, both of which need to be handled. Hence, the attribute values are recoded as below:
     1=Arts & Humanities
     2=Sciences
     3=Health Sciences
     4=Engineering & Architecture
     5=Law & Politics
     6=Others
     ?=Unknown (7)
  • Yearsexp: 23 records are missing a value for this attribute.
     As this is a small share of the data (2.5%), these are recoded as ‘0’.
  • UOC_POSITION (academic position of UOC members): This field is specific to University type 1 (UOC), so the missing values, which belong to the other university, are recoded as NA.
     1=Professor
     2=Associate
     3=Assistant
     4=Lecturer
     5=Instructor
     6=Adjunct 
     ?=NA (7)
  • OTHER (main job in another university for part-time members): This attribute is also specific to UOC, as it is missing for all the UPF records. The missing values are recoded as NA.
     1=Yes
     2=No
     ?=NA (3)
  • OTHER_POSITION (work as part-time in another university and UPF members): This attribute has 1 extra classification, which is recoded as Other; missing values are recoded as Unknown.
     1=Professor
     2=Associate
     3=Assistant
     4=Lecturer
     5=Instructor
     6=Adjunct 
     7=Other
     ?=Unknown (8)
  • USERWIKI (Wikipedia registered user): This attribute indicates whether the respondents are registered users of Wikipedia. The data is missing in 4 records, so these are recoded as Unknown.
     0=No
     1=Yes
     ?=Unknown (2)
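The recoding rules above can be summarised in a small lookup table. The following is a minimal Python sketch of that idea, not the JMP recode steps actually used; it assumes each record is a dictionary of strings with '?' as the missing marker, and the example record is hypothetical. The last lines recompute the share of records with a missing Yearsexp value from the counts given above (23 of 913).

```python
# Codes assigned to missing values ('?'), copied from the recoding lists above.
MISSING_CODE = {
    "DOMAIN": "7",          # Unknown
    "YEARSEXP": "0",        # recoded as 0
    "UOC_POSITION": "7",    # NA
    "OTHER": "3",           # NA
    "OTHER_POSITION": "8",  # Unknown
    "USERWIKI": "2",        # Unknown
}


def recode_missing(row, marker="?"):
    """Return a copy of row with missing markers replaced by the codes above."""
    cleaned = dict(row)
    for column, code in MISSING_CODE.items():
        if cleaned.get(column) == marker:
            cleaned[column] = code
    return cleaned


# Hypothetical record, invented for illustration.
record = {
    "AGE": "37",
    "DOMAIN": "?",
    "YEARSEXP": "?",
    "UOC_POSITION": "?",
    "OTHER": "?",
    "OTHER_POSITION": "2",
    "USERWIKI": "1",
}
cleaned = recode_missing(record)
print(cleaned)

# Share of records with a missing Yearsexp value: 23 of 913.
share = 23 / 913
print(f"{share:.1%}")  # prints 2.5%
```

Keeping the rules in one dictionary makes it easy to review them against the data dictionary before they are applied.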

4. Change data types of the attributes.

  • Gender: Numeric, Nominal
  • PhD: Numeric, Nominal
  • Yearsexp: Numeric, Continuous
  • University: Numeric, Nominal
  • All Question attributes: Numeric, Continuous
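JMP distinguishes a column's data type from its modeling type. In plain Python a loose analogue is to cast nominal codes to integers and continuous values to floats. The sketch below follows the list above; the column name ITEM_1 is a hypothetical stand-in for a survey question, and the code assumes missing markers have already been recoded to numbers.

```python
# Modeling type per attribute, following the list above.
# Any column not listed is treated as a question attribute (continuous).
SCHEMA = {
    "GENDER": "nominal",
    "PHD": "nominal",
    "YEARSEXP": "continuous",
    "UNIVERSITY": "nominal",
}


def cast_row(row, default="continuous"):
    """Cast string cells: nominal -> int code, continuous -> float."""
    typed = {}
    for column, value in row.items():
        kind = SCHEMA.get(column, default)
        typed[column] = int(value) if kind == "nominal" else float(value)
    return typed


# Hypothetical record; ITEM_1 is an invented name for a question column.
row = {"GENDER": "1", "PHD": "0", "YEARSEXP": "14", "UNIVERSITY": "1", "ITEM_1": "4"}
typed = cast_row(row)
print(typed)
```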

5. Create new columns to understand the attributes better.

  • Gender
  • Domain
  • PhD
  • University
  • UOC_Position
  • Other
  • Other_Position
  • UserWiki
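One plausible reading of this step is that each new column holds the readable label for a coded attribute. The sketch below shows that idea in Python for a few of the attributes, using the code-to-label mappings from the data dictionary and the recoding lists; the `_LABEL` suffix and the example record are assumptions made for illustration.

```python
# Code-to-label mappings copied from the data dictionary and recoding lists.
LABELS = {
    "GENDER": {"0": "Male", "1": "Female"},
    "DOMAIN": {
        "1": "Arts & Humanities",
        "2": "Sciences",
        "3": "Health Sciences",
        "4": "Engineering & Architecture",
        "5": "Law & Politics",
        "6": "Others",
        "7": "Unknown",
    },
    "PHD": {"0": "No", "1": "Yes"},
    "UNIVERSITY": {"1": "UOC", "2": "UPF"},
    "USERWIKI": {"0": "No", "1": "Yes", "2": "Unknown"},
}


def add_label_columns(row, suffix="_LABEL"):
    """Return a copy of row with one extra label column per coded attribute."""
    labelled = dict(row)
    for column, mapping in LABELS.items():
        if column in row:
            # Codes outside the mapping are flagged rather than dropped.
            labelled[column + suffix] = mapping.get(row[column], "Unmapped")
    return labelled


# Hypothetical record, invented for illustration.
record = {"GENDER": "1", "DOMAIN": "7", "PHD": "0", "UNIVERSITY": "2", "USERWIKI": "2"}
labelled = add_label_columns(record)
print(labelled["DOMAIN_LABEL"], labelled["UNIVERSITY_LABEL"])
```

The original coded columns are kept, so the labels can be used for display while the codes remain available for analysis.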

6. Exclude and hide attributes that are out of the scope of the assignment.

7. Export the data in CSV format so it can be used for further visualization in other tools. (<v2>)
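Steps 6 and 7 were carried out in JMP. A comparable export in Python could look like the sketch below, which keeps only the in-scope columns and writes CSV text. The column names PU1 and PEU1 and the sample rows are hypothetical, and the output goes to an in-memory buffer rather than a file so the example has no side effects.

```python
import csv
import io


def export_csv(rows, columns):
    """Write only the given columns of each row to CSV text and return it."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        extrasaction="ignore",  # silently drop out-of-scope columns
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows and column names, invented for illustration.
rows = [
    {"AGE": "40", "GENDER": "0", "PU1": "4", "PEU1": "5"},
    {"AGE": "37", "GENDER": "1", "PU1": "3", "PEU1": "2"},
]
in_scope = ["AGE", "GENDER", "PU1"]
text = export_csv(rows, in_scope)
print(text)
```

To write a real file, the same writer can be pointed at a file opened with `open(path, "w", newline="")`.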



Tools Utilised

  1. JMP – To explore and transform the data into a usable data set. Also used to check the distribution of the ratings for the selected questions in the scope of the assignment.
  2. Tableau – To create interactive data visualizations for finding insights and relationships between multiple variables.
  3. High-D – To create interactive visualization for analysing the quality criteria of the Wikipedia survey.


Interactive Result



Results



Citations



Comments