IS428 2016-17 Term1 Assign2 Tan Kee Hock

From Visual Analytics for Business Intelligence

Interactive Data Exploration and Analysis (IDEAL)

Overview

In this age of the digital economy, massive and complex data have been captured and stored in organizational databases and/or data warehouses. By and large, these data contain a large number of variables about a particular product or activity. Due to limitations in perceptual and screen space, the graphical techniques available in traditional business intelligence systems tend to be confined to univariate and bivariate displays such as bar charts, pie charts and scatterplots. As a result, many important relationships in these data remain undiscovered. In this assignment, you are required to apply interactive data exploration and analysis techniques to discover patterns in multivariate data. The goal of this assignment is not to develop a new visualization tool, but to apply the interactive data exploration and analysis techniques you have learned using commercial off-the-shelf and open-source software. It also aims to give you hands-on experience with the visualization tool and, at the same time, to let you evaluate the pros and cons of the tool in real-world applications.

Recommended Best Practices

After you have the initial question and the appropriate datasets, construct a visualization that provides an answer to your question. As you construct the visualization, you will find that your question evolves – very often, it will become more specific. Keep track of this evolution and the other questions that occur to you along the way. Once you have answered all the questions to your satisfaction, think of a way to present the data and the answers as clearly as possible. The presentation must be in the form of interactive visualization.

As you go, maintain a wiki notebook of what you have to do to construct the visualizations and how the questions evolved. Include in the notebook:

  1. Where you get the data from.
  2. Documentation about the format of the dataset.
  3. Any transformations or rearrangements of the dataset that you need to perform; in particular, how you get the data into the format needed by the visualization system.
  4. Copies of any intermediate visualizations that have helped you refine your question.
  5. A caption and a paragraph describing the final visualization and how it answers the question you posed, written after you have constructed it.

Think of the figure, the caption and the text as materials you might include in a research paper.

Theme

I want to investigate work-related accidents across industries and how they vary. I have therefore decided to use the Workplace Injuries Data 2014, with reference to the WSHI National Statistics Report 2014. To start, I took a quick look at the WSHI National Statistics Report 2014 and the WSHI website so that I would have a clearer idea of, and purpose for, my analytic process.

My overarching theme for this assignment is: Investigate work-related accidents across industries and how they vary.

Summary of Questions

The analytical process is about answering the questions that come to the analyst's mind when presented with the data. A summarised overview of my questions is shown below:

  1. How does the severity of accidents compare across industries?
    • Which industry faces the most accidents?
    • Are there any patterns in accidents at different times of the day?
      • If so, were the victims working overtime?
      • Is there a difference between those who were working overtime and those working a normal shift?
    • How is the victims’ job experience distributed?
    • How long after the victim started work did the accident occur?
      • How is it distributed?
    • How is the accident severity (No. MC days) correlated with the employee’s experience?
  2. Among these accidents, how do the injuries differ?
    • How do the injured body parts vary across industries?
    • What is the main economic activity of the occupier in relation to the accidents?
      • Does the pattern differ between genders?
    • What are the more common causes of injuries within the industries?
    • How do the injuries differ between types of work? Who or what induced the injuries?
  3. What are the demographics of the victims like?
    • Is there a relationship between the victim’s age and employment length?
    • If so, what is the difference between male and female workers?
    • What is the distribution of the employment length like?
    • What is the distribution of the “Lag in Reporting (Days)”? How long do companies take to report an injury in their workplace?

Data Cleansing

IsMajor – Severity Score (New Calculated Field)
The severity of the accident (Minor/Major) is currently a Dimension. To rank industries by severity in Tableau, it must first be transformed into a Measure through a new calculated field.
[INSERT IMAGE]
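
The transformation described above can be sketched outside Tableau as follows. This is a minimal illustration, not the calculated field actually used: the encoding (Major = 1, Minor = 0), the field names `industry` and `severity`, and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample rows; the real WPIData column names may differ.
records = [
    {"industry": "Construction", "severity": "Major"},
    {"industry": "Construction", "severity": "Minor"},
    {"industry": "Manufacturing", "severity": "Minor"},
    {"industry": "Construction", "severity": "Major"},
    {"industry": "Manufacturing", "severity": "Major"},
]


def is_major(severity: str) -> int:
    """Return 1 for a major accident and 0 otherwise (assumed encoding)."""
    return 1 if severity.strip().lower() == "major" else 0


def severity_score_by_industry(rows):
    """Sum the IsMajor flag per industry and sort with the highest score first."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["industry"]] += is_major(row["severity"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = severity_score_by_industry(records)
print(ranking)
```

Once the flag is numeric, it can be summed or averaged per industry, which is what allows it to drive a ranking.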

SSIC Code
After an initial study of SSIC, I decided to add more detail on SSIC codes to the data, as it provides useful information on the occupier’s economic activity. This information gives us more insight into the accidents. Data on SSIC Code 2010 can be found here. [[1]]
[INSERT IMAGE]
Further transformation in Excel was needed to make the data easier to work with. The data was then joined to the WPIData dataset.
[INSERT IMAGE]
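
The join described above was done with Excel and Tableau; the sketch below only illustrates the same idea as a left join in Python. The field names (`ssic_code`, `ssic_description`), the placeholder codes and descriptions, and the clean-up step (trimming whitespace and left-padding to five digits, in case leading zeros were dropped) are assumptions for illustration.

```python
def normalise_code(code) -> str:
    """Trim whitespace and left-pad to 5 digits (assumed clean-up step)."""
    return str(code).strip().zfill(5)


def join_ssic(accidents, lookup, key="ssic_code"):
    """Left join: keep every accident row and attach the description, or None."""
    joined = []
    for row in accidents:
        enriched = dict(row)  # copy so the input rows are not modified
        enriched["ssic_description"] = lookup.get(normalise_code(row[key]))
        joined.append(enriched)
    return joined


# Placeholder lookup and rows, not real SSIC data.
ssic_lookup = {"01111": "Example activity A", "41001": "Example activity B"}
accident_rows = [
    {"case_id": 1, "ssic_code": 1111},      # leading zero lost
    {"case_id": 2, "ssic_code": "41001 "},  # trailing space
    {"case_id": 3, "ssic_code": "99999"},   # no match in the lookup
]

joined_rows = join_ssic(accident_rows, ssic_lookup)
for row in joined_rows:
    print(row["case_id"], row["ssic_description"])
```

A left join keeps accident records whose code has no match, so unmatched codes show up as missing descriptions rather than silently dropping rows.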

Hierarchical Relationship between Major Industry and Sub-Industry
Each Sub-Industry is a subset of a Major Industry. Defining this as a hierarchical relationship lets us explore the data more effectively in Tableau.
[INSERT IMAGE]

Age Group
The victims’ ages span a wide range of values. Grouping them into age categories makes it easier for the user to filter the data.
[INSERT IMAGE]
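
As a rough illustration of the grouping above, ages can be binned into fixed-width bands. The 10-year band width and the label format are assumptions; the groups actually used in the Tableau workbook may differ.

```python
def age_group(age: int, width: int = 10) -> str:
    """Map an age to a fixed-width band label such as '30-39'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


# Hypothetical ages for demonstration.
sample_ages = [19, 34, 60]
groups = [age_group(a) for a in sample_ages]
print(groups)
```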

Visualisation Plan

  1. Dashboard: How do accidents differ from industry to industry?
  2. Breakdown of the Injuries from these Accidents
    [INSERT IMAGE]
    Sub-Questions:
    • How do the injured body parts vary across industries?
      Data Attributes:
      1. Major Industry (SSIC 2010) (String – Dimension)
      2. Sub Industry (SSIC 2010) (String – Dimension)
      3. Accident Type level 2 Category (String – Dimension)
      4. Number of Rows (Number – Measure)

      [INSERT IMAGE]
      Chart: Heatmap
      Interpretation: The frequency of each accident type is denoted by colour.
      Objective: To allow the user to understand at a glance which accident types happen more frequently in which industry.
      Conclusion: The construction industry experiences more accidents relating to falls from height. This is very likely attributable to the nature of the occupier’s economic activity. (The next chart shows the economic activity of the occupier.)

    • What is the main economic activity of the occupier with relation to the accidents?
      Data Attributes:
      1. Number of Records (Number – Measure) – Left Axis
      2. Number of Records (Number – Measure) – Right Axis
      3. SSIC 2010 Description (String – Dimension)
      4. Number of Rows (Number – Measure)

      [INSERT IMAGE]
      Chart: Pareto
      Interpretation: The yellow line shows the cumulative percentage of accidents accounted for by each type of occupier economic activity.
      Objective: To allow the user to understand the breakdown of the occupiers’ business/operations (economic activity) for the filters applied in the dashboard.
      Conclusion: The economic activity with the highest number of accidents is “General contractors (building construction including major upgrading works)”. This economic activity contributed 12.34% of the total accidents.

    • What are the more common causes of injuries within the industries?
      Data Attributes:
      1. Number of Records (Number – Measure) – Left Axis
      2. Number of Records (Number – Measure) – Right Axis
      3. Accident Agency Level 2 Desc (String - Dimension)
      4. Number of Rows (Number – Measure)

      [INSERT IMAGE]
      Chart: Pareto
      Interpretation: The yellow line shows the cumulative percentage of accidents accounted for by each cause of injury.
      Objective: To allow the user to understand the breakdown of the causes of injuries in the workplace.
      Conclusion: The single highest contributor to accidents is “Other Physical Workplace - Floor/Level Surfaces”.

  3. Dashboard: What are the demographics of the victims like?
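
For the two Pareto charts in the injuries breakdown, the cumulative percentage line can be reproduced outside Tableau roughly as follows. This is a sketch under stated assumptions: the category labels and counts are made up for the example, and the function name `pareto_table` is hypothetical.

```python
from collections import Counter


def pareto_table(categories):
    """Return (category, count, cumulative %) rows, largest count first."""
    counts = Counter(categories)
    total = sum(counts.values())
    running = 0
    table = []
    for category, count in counts.most_common():
        running += count
        table.append((category, count, round(100 * running / total, 2)))
    return table


# Made-up accident causes, one entry per record.
causes = ["floor"] * 5 + ["ladder"] * 3 + ["machine"] * 2
table = pareto_table(causes)
for category, count, cumulative in table:
    print(f"{category:10s} {count:3d} {cumulative:6.2f}%")
```

The bars of the chart correspond to the counts and the line to the last column, which always ends at 100%.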

Critique on Tableau Software

Additional Findings from other Non-Tableau Software

Software Used

Final Deliverables

[Insert URL to Tableau]

Data Sets

We have provided the following data sets and would like to encourage you to use one of them in order to get started quickly and therefore have more time to explore the data and develop your analysis questions.


Data Visualisation System Design Process

Step 1: Identify a theme of interest

Each of the datasets provides a wide range of parameters that can be used for many different purposes. Hence, it is very important for you to identify a theme clearly before you start your investigation. For example, you might want to focus on issues related to business competitiveness.


Step 2: Define questions for investigation

After you have identified a theme, you should formulate questions for investigation. For example: Is there a relationship between sales revenue and marketing expenditure? Are the growth of GDP per capita and the growth of productivity correlated? Are there different patterns of energy consumption in different regions of the world?


Step 3: Find appropriate data attributes

Extract and download the datasets in convenient formats such as Excel or a CSV file. The online database contains a lot of tabulated data. In some cases, you will need to convert the data to a format you can use. Format conversion is a big part of visualization research so it is worth learning techniques for doing such conversions.

You will need to iterate through these steps a few times. It may be challenging to find interesting questions and a dataset that has the information that you need to answer those questions.


Recommended Best Practice

After you have the initial question and the appropriate datasets, construct a visualization that provides an answer to your question. As you construct the visualization, you will find that your question evolves – very often, it will become more specific. Keep track of this evolution and the other questions that occur to you along the way. Once you have answered all the questions to your satisfaction, think of a way to present the data and the answers as clearly as possible. The presentation must be in the form of interactive visualization.

Before starting, write down the initial question clearly. And, as you go, maintain a wiki notebook of what you have to do to construct the visualizations and how the questions evolved. Include in the notebook where you get the data from, and documentation about the format of the dataset. Describe any transformations or rearrangements of the dataset that you need to perform; in particular, describe how you get the data into the format needed by the visualization system. Keep copies of any intermediate visualizations that have helped you refine your question. After you have constructed the final visualization for presenting your answer, write a caption and a paragraph describing the visualization, and how it answers the question you posed. Think of the figure, the caption and the text as materials you might include in a research paper.

You should maintain a section on the assignment wiki that documents all the questions you asked and the steps you performed from start to end.


Visualization Software

To perform the analysis, you are encouraged to explore the following Data Visualisation toolkits:

  • Tableau
  • Qlik Sense
  • Power BI

and specialised visual analytics techniques

  • Ternary diagram
  • Parallel coordinates
  • Trellis
  • Mosaic plot
  • Divergent bar chart
  • Parallel Sets
  • TableLens
  • Treemap

One of the goals of this assignment is for you to learn to use and evaluate the effectiveness of these visualization tools. Please do not hesitate to consult me if you encounter problems in using the tool.

Useful resources


Submission details

This is an individual assignment. You are required to work on the assignment and prepare submission individually. Your completed assignment is due on 26th September 2016, by 11.30am.

You need to edit your assignment in the appropriate wiki page of the Assignment Dropbox. The title of the wiki page should be in the form of: IS428_2016-17_Term1_Assign2_FullName.

The assignment 2 wiki page should include the URL link to the web-based interactive data visualization system prepared.