Difference between revisions of "IS428 2016-17 Term1 Assign2 Lim Kim Yong"

From Visual Analytics for Business Intelligence
Jump to navigation Jump to search
(Created page with "== Theme of Interest: Industry Accident Tracking == Each industry plays a critical role in contributing to Singapore's economy. It is essential to enforce and encourage a safe...")
 
 
(17 intermediate revisions by the same user not shown)
Line 3: Line 3:
  
 
== Analytical/Investigation Questions ==
 
== Analytical/Investigation Questions ==
* Which industry and sub-industry has the highest number of cases by accident types?  
+
* Which industry/sub-industry has the highest number of cases by accident types?  
* What are the most common types of accidents occurring by sub industry?  
+
* What are the most common types of accidents occurring in industry/sub industry?  
* Which month has the highest number of cases reported by sub industry?   
+
* Which month has the highest number of cases reported based on the industry/sub industry?   
  
 
== Selected Data Attributes for Analysis ==
 
== Selected Data Attributes for Analysis ==
 
The following data attributes are selected for analysis:
 
The following data attributes are selected for analysis:
 
* Reported Date
 
* Reported Date
 +
* Accident Day
 
* Major Industry
 
* Major Industry
 
* Sub Industry
 
* Sub Industry
Line 16: Line 17:
  
 
== Data Transformation Process ==
 
== Data Transformation Process ==
Before the analysis began, the dataset is analysed to identify its format and attributes. The dataset comes from a survey conducted across 20,663 researchers, librarians and other groups. In the survey, participants are presented with questions that provide them with the option to choose multiple answers. For example, one of the question in the survey asked participants to check all tools that they use during the process of researching for data. They could select multiple tools, such as a combination of Google Scholar and Web of Science. Each tool is a column in the dataset and if users indicate that they use the tool, the column will be filled. However, if they did not indicate it, the column will be left blank. An example of a row in the dataset is as follows:<br/><br/>
+
In this assignment, I have decided to use Workplace Injuries data in 2014 attained from Assign2 at https://wiki.smu.edu.sg/1617t1IS428g1/Assign2 and with reference to WSHI National Statistics Report 2014. The excel sheets have been cleaned up to suit the needs of the various visualization tools used.  
[[File:01 Explanation of Dataset GwendolineTanWanXin.png|center|800px|Explanation of Dataset]]
+
<br><br><b>Sub Industry + Accident Type Category + Victim’s Age</b>
<br/>After understanding the format of the data, the following is performed to transform the data into a format that could be used for analysis.
+
<br>I had filtered and removed unnecessary data, perform data formatting and, lastly organizing it in a clearer manner.
# Firstly, the columns in the initial dataset is grouped into its own respective questions. Each question will have its own sheet. For all demographics questions (e.g. role, country etc.), it will be grouped together into a single sheet. To identify the user who answered each question, the ID column will be present in all the spreadsheets.<p></p>
+
[[File:SortedData KimYong.png|500px|none|left|Sorted Data]]
# For each column, the response is either a “Yes” (user use the tool) or a “No” (user don’t use the tool). To allow better visualization, empty fields are filled with “No” and fields that are filled are changed to “Yes”. This is done using Excel’s Replace All feature. The resulting dataset will then look like the following:<p></p>[[File:02 Transformed Dataset Part 1 GwendolineTanWanXin.png|600px|center|Data Transformation Part 1]]<br/>
+
<br>
# It is difficult to perform analysis and processing on the data even though they have now been grouped into its own sheet. To make the data readable by Tableau, the data needs to be reshaped. This is done using Tableau’s Excel Add-In. The usage of the tool is as shown in the following steps:<br/>
+
[[File:02 MarineData KimYong.png|500px|none|left|Sub Industry:Others-Marine]]
## Install the Tableau Add-In in Excel and make use of the “Reshape Data” Feature. Enter the cell to begin the data reshaping process.<p></p>[[File:03 Transformed Dataset Part 2 GwendolineTanWanXin.png|center|800px|Tableau-Excel Installation]]<br/>
+
Under the Sub Industry, I have combined the ‘Marine-Other Marine’ provided by the Raw Data into ‘Others –Marine’ in the Clean WPI Data as they refer to the same sub Industry. This is to avoid confusion and the inaccuracy of information retrieve when analyzing data.
## A progress bar will be shown as the data is been reshaped. The end result of the reshaped data will look like this:<p></p>[[File:04 Transformed Dataset Part 3 GwendolineTanWanXin.png|center|800px|Tableau-Excel Data Reshape]]<br/>
+
<br><br><b>Major Industry & Sub Industry + Age Group</b>
# The dataset is now transformed and it will be imported into the tools (e.g. Tableau, PowerBI) for analysis and processing.
+
[[File:03 Ternary Kim Yong.png|600px|none|left|Data for Ternary]]
 +
Based on the total number of reported cases by industry/Sub industry, I have created a new worksheet with age group ranging from Age 15 all the way to Age 80 to identify which age group has the highest number of accidents in the respective industries.  
  
== Analysis & Visualization Construction Process ==
+
As such, we proceed with the analysis and Visualization construction process as follows:
The following section elaborates on the analysis and visualization construction process. It details how various graphs are generated, analyzed and how the dataset are further transformed/rearranged to lead to the final visualization constructed.<p></p>
 
# '''[Analysis]''' To facilitate better understanding in the data collected, a chart was created to show an overview of the survey participants. As there were multiple attributes (career stage, discipline and role), a treemap was plotted to view the distribution of the research participants’ demographics. These 3 attributes could serve as factors that affect the way in which how people use tools throughout the research process.<p></p>[[File:11 Final Chart 1 GwendolineTanWanXin.png|center|600px|Understanding Research Group]]<br/><p></p>Other than using Tableau, I also attempted to import the data into PowerBI to perform analysis. A good function that PowerBI supports is the ability to drill down into each category. As such, it won’t be directly confusing for a user when they look at the treemap directly. As they drill up and down, they can view information clearer as compared to the one in Tableau. However, the way in which data is segmented is not as clear as compared to a similar one generated in Tableau when it has been broken down into its respective levels. The following shows a chart generated in PowerBI. Given the same data, the chart above (generated by Tableau) allows one to view information clearer. As such, I have preferred to display the chart in Tableau for better visualization. <p></p>[[File:07 IntermediateVisualization Part 2 GwendolineTanWanXin.png|center|600px|Understanding Research Group - In PowerBI]]<br/>
 
# '''[Analysis & Intermediate Visualization]''' Another possible factor to look at is the distribution of research participants across the world. This aims to identify whether there are any countries that have a lot of research participants in one discipline but not the other. Based on the chart, majority of the research participants are congregated in the United States across all disciplines. This does not provide much useful information as to whether countries may affect the way in which how people use tools. As such, I have decided to continue with the analysis and look back at the region later on to decide if it could be a factor that influence participant’s choice of tools/sites during their research process.[[File:06 IntermediateVisualization Part 1 GwendolineTanWanXin.png|center|300px|Graphical View of Participant's Geographical Region]]<br/>
 
# '''[Data Rearrangement]''' After knowing possible factors that could affect people’s choice of tools/sites, I aim to identify which phases of the research process requires the most number of tools. In order to retrieve this value, there is a need to identify the number of tools that each participant uses during each stage of the research process. This data is currently not available and therefore, new data has to be added into the transformed dataset. This is done by using Excel’s COUNTIF function to count the number of “Yes” in each row. This is done for all the questions and the data column is imported into the “Demographics” sheet. The following illustrates the addition of new data into the transformed dataset.<p></p>[[File:05 Analysis Part 1 GwendolineTanWanXin.png|center|800px|Data Rearrangement]]<br/>
 
# '''[Analysis & Intermediate Visualization]''' The data source from Step 3 is then, imported into JMP Pro to perform analysis. Initially, I attempted to plot the number of tools that each person used throughout the entire research process. The following diagram illustrates the chart generated in JMP Pro. However, the graph looks messy and because the data values are almost similar, they pass through similar points. This is not useful when displayed in a parallel coordinates plot. I attempted to use Parallel Sets to display the data instead but the tool crashed repeatedly.<p></p>[[File:08 IntermediateVisualization Part 3 GwendolineTanWanXin.png|center|800px|Intermediate Visualization]]<br/>
 
# '''[Analysis]''' To find other alternatives, I attempted to plot an average of the number of search tools/sites used across disciplines. The data source is refreshed in Tableau to reflect the new columns added from Step 3. Tableau’s average function is then used to calculate the average number of tools used for each discipline across the research process. A parallel coordinates plot (as shown below) is then created to show the distribution of number of tools used during each stage of the research process according to their research discipline.<p></p>
 
# '''[Analysis & Question Evolution]''' From the parallel coordinates plot, we can see that the majority of the users are using a lot more tools in the phase of searching for literature and data. However, for phases like archiving codes or posters/presentations, they typically don’t use any or use at most 1 tool to do so. As such, the question narrows down to look at which tools are actually the most popular in helping users to search data.This aims to answer the main question in identifying possible factors that could affect one's choice of tools, by first narrowing the analysis process into one main phase - search literature/data.<p></p>[[File:12 Final Chart 2 GwendolineTanWanXin.png|center|800px|Parallel Coordinates Plot]]<br/>
 
# '''[Analysis & Question Evolution]''' To identify which search tools/sites are the most popular, a bar chart was plotted to show the total number of users using each of the search tool/site. Based on the chart plotted (as shown below), we can see that majority of the users are using Google Scholar. However, there were tools/sites like Web of Science, PubMed or Other Tools/Sites that participants used commonly. From this chart, the question is further evolved into identifying what are the factors that may affect people’s choice of search tools/sites they use throughout the research process.<p></p>[[File:13 Final Chart 3 GwendolineTanWanXin.png|center|600px|Google Scholar as main search tool]]<br/>
 
# Previously, it has been identified that 4 factors (career stage, role, discipline, country) could affect people’s choice of tools/sites throughout the research process. As such, the following analysis targets at each of these 4 factors to identify which will affect people’s choice of search tools/sites.<p></p>
 
# '''[Analysis]''' A chart was plotted to show the total number of people using each search tool/site based on their roles. From the colours of the circle view chart (as shown below), we can see that they go in almost the same order. This shows that despite its role, there is not much difference in the way they use these tools/sites. However, minor differences do exist. From the chart, we can see that librarian use WorldCat more often than a PhD student. Such minor differences can only be spotted if one analyse closes enough or interact with the filters repeatedly. As such, we can conclude that the role of the user does play a part but it's rather insignificant, in determining the choice of search tools/sites that people will use.<p></p>[[File:14 Final Chart 4 GwendolineTanWanXin.png|center|600px|Circle View Chart]]<br/>
 
# '''[Analysis & Intermediate Visualization]''' Looking at the second factor – career stage, a chart was plotted to show the total number of people using each tool/site based on their career stages. From the chart, we can see that people in the intermediate and advanced levels used almost similar tools as compared to people in the beginner level. However, it does not provide much information as to what kinds of people use these tools other than its career stage.<p></p>[[File:09 IntermediateVisualization Part 4 GwendolineTanWanXin.png|center|400px|Intermediate Visualizations]]<br/>
 
# '''[Analysis]''' To provide more in-depth comparison, I attempted to look at the discipline of participants in addition to its career stages. From this chart, we can see that across all disciplines, the choice of search tools/sites used by people are different. For example, participants in the Medicine field use PubMed more than they use Google Scholar. This was not the case for the other disciplines. However, when we look at people in the same discipline with different career stages, the patterns on the type of tools they used to assist them in searching for literature/data were relatively similar. As such, we could potentially identify that discipline is one factor that affect participant's choice of tools/sites used.<p></p>[[File:16 Final Chart 6 (Career & Discipline) GwendolineTanWanXin.png|center|600px|Career Stage VS Discipline]]<br/>
 
# '''[Data Rearrangement]''' Previously, we have identified that countries may not play a huge factor in determining the way how people use search tool/site. We would like to look further into this and identify whether it is indeed true. However, there were too many countries and by plotting them individually will produce an ultimately huge chart. As such, I decided to group countries into their continents. To do so, I retrieved a list of UN countries and continents and convert it into a form of “.csv” file format. This new data is imported into Tableau. In order to group the countries into its continent, it is performed via a left outer join.<p></p>[[File:10 Analysis Part 2 GwendolineTanWanXin.png|center|600px|Mapping Countries to Continents]]<br/>
 
# '''[Analysis]''' With the users grouped into their continents, a chart can be plotted to show whether the region or discipline will affect the choice of search tools/sites that people use. From the chart, we can see that people’s choice of search tools/sites varies across disciplines. However, within the same discipline, different regions use the same tools/sites to search for literature/data. Supporting our analysis in Step 11, discipline is a main factor in determining participant's choice of tools/sites used.<p></p>[[File:15 Final Chart 5 (Career & Region) GwendolineTanWanXin.png|center|600px|Region VS Discipline]]<br/>
 
From the entire analysis, we will then be able to answer the question on the factors that influence people’s choice of tools/sites used during their research process.
 
  
== Final Visualization ==
+
== Visualization 1 ==
Based on the entire analysis process, multiple charts were combined to create a dashboard to form a final story on the analysis. The final visualization can be accessed via this URL: https://public.tableau.com/views/Assignment2_v1_5/Story1?:embed=y&:display_count=yes
+
Firstly, I used Tableau to visualize the Clean WPI Data of which Sub industry has the highest number of reported cases based on their accident types. Before I move on to generate the visualizations, I split up the Sub industry based on its separator (‘-‘) and created a new column – Age Group to facilitate my analysis.
 +
[[File:04 AgeGroup KimYong.png|center|600px|Tableau Data Fixed]]
 +
At a glance, the bar graph is by far one of the most common graphs that everyone will use to show the distributions of number of reported cases by Sub industry in each accident type. This is not a high-dimensional graph and the purpose of this visualization is to see if there is any relationship or pattern between Accident Type, Sub Industry and Number of Reported cases. My reason of this graph is to help me visualize what other possible questions that I could come up with and what visualizations would probably be better to derive my answers for my questions.
 +
[[File:05 BarChart KimYong.png|center|600px]]
 +
I have selected “TreeMap” as it provides a dynamic data visualization paradigm that facilitates the representation and understanding of large and multivariate data. From the TreeMap we can clearly see that the Construction industry has one of the highest reported cases. Besides, when we hovered over each segment within the Sub industry, the tooltips will be able to show us the underlying data such as the type of accidents (e.g. Fall from Height) and what is the rate of the accident based on the no. of reported cases as shown below:
 +
[[File:06 TreeMapKimYong.png |center|600px|TreeMap]]
 +
The rationale of choosing treeMap is because Major Industry and Sub Industry are both categorical attributes and they also hierarchical. Furthermore, some of the industries have only small amount of reported cases and thus cannot be easily spotted in the bar chart. Therefore, we can also use the "Filter" function to find out more on those sub-industries that are visually insignificant for us to see. 
 +
[[File:07 AgeGroupKY.png |center|600px|AgeGroup]]
 +
Subsequently, I went to explore on the age Group by the Major industry and based on my findings I found out that Young adults of between 15 to 34 years of age are more susceptible to accidents than the graying seniors in all the Major industries. From my perspective, one reason that might attribute to such accidents is the lack of proper introduction into the jobs or tasks assigned or given to them. Employers in those industries do not fork out their time to provide information on the procedures and safety rules. As such, the younger age group has seen its accident number increased dramatically. However, as we go deeper into analyzing the Sub industry, we realized that the number of accidents is also based on the nature of the job and not just the specific age group itself. As shown in the graph below, for example, Construction and Metalworking industries have more number of reported cases as compared to other industries.
 +
[[File:08 Nature of Job KY.png |center|600px|Nature of Job Graph]]
 +
Additionally, I also try using JMP’s ternary plot to see the distribution of accidents by age group in the various industry. Ternary plot allows us to visualize the proportion of the victims across various age group – Young, Middle age, and Old from various Major Industry. Each major industry is given a colour which is indicated in the legend. Generally, both Young and Middle age group are susceptible to accidents at work.
 +
[[File:09 Ternary KY.png|center|400px|Ternary Graph]]
  
'''Caption for Final Visualization''': The analysis process aims to identify possible factors that could influence the choice of tools/sites that people use throughout the research process. To do so, the interactive visualization focuses on tools/sites to search literature and data. The rationale for focusing on search tools/sites is due to the fact that most research participants use the most number of tools/sites in this phase throughout the entire research process. As such, more data will be available to perform an analysis. By making use of the interactive visualization, one can identify that a participant’s discipline is a main factor that affects its choice of tools/sites throughout the research process.
+
== Visualization 2 ==
 +
From data visualization 1, it seemed that Construction industry has the highest number of accident cases than other industries. As such, I went on to make use of Power BI tools to visualize the most common types of accidents occurred in Construction and also the other Sub industries. As shown from the bar graph below, the top 3 most common accidents type happened in all Industries are-  Fall from Height, Struck by Moving Objects and Slips & Trips.
 +
[[File:11 Highest Accident Type KY.png|center|600px|AccidentType BarChart]]
 +
So moving forward, I continue to investigate what are the common types of accidents that happened in all industries. I decided to create a parallel coordinate as the no. of reported cases is discrete variable rather than continuous variable. It automatically creates a reference line and provides you with the flexibility to drag the line along the x-axis hovering over data points displaying tooltips with all its underlying data. From there I am able to identify the particular accident type, from the highest to lowest reported cases and by which industry.  As such, the graph below showed that construction industry has the highest amount of reported cases of 258 while the pharmaceutical & biological pdts industry has only 5 reported cases on the accident type of falling from heights.
 +
[[File:10 Parallel Cordinates KY.png|center|600px|Parallel Coordinates]]
  
'''Using The Interactive Visualization To Answer Questions Posed'''<br/>
+
Additionally, we can also use the bubble chart provides a quick and easy way for user to read (Which industry has the highest number of accidents cases? Based on the selected industry, what are some of the top few accident types?) The circles in the bubble chart uses area to represent numbers. For example, the higher the number of reported cases in an industry, the bigger the area of the circle will be. As such, you will be able to see that Construction has the highest number of accidents and some of the most common accidents are fall from height, struck by moving objects and so on and so forth. One good thing of using Power BI tool is that every chart will be given full control without any necessary configurations unlike Tableau (e.g. Tableau actions). It automatically synchronized all the charts and allows users to selects a data point of one visualization and then sees what happens on the other visualizations.
Using the final visualization constructed, one could perform the visualization process in the following steps to answer the questions posed:<br/>
+
[[File:12.Largest Industry KY.png|center|600px|Bubble Chart]]
# The analysis begins with the illustration of the demographics of the research participants who took part in the 101 Innovations Research Survey. These are possible factors that could affect participants' choice of tools/site usage throughout the research process.
 
# After looking at participants’ demographics, we move on to look at the average number of tools used by participants in each stage of the research process based on their discipline. We conclude that people use the most number of tools in searching for literature and data. As such, the next part of the analysis will be focused on search tools/sites to identify what are the factors that could affect the way how people use tools/sites.
 
# The analysis process begins by looking at participants’ roles, career stages, regions and disciplines. By interacting with the 3 charts, we can identify that the main factor which affects the way participants' choice of tools/sites usage during the research process is based on their discipline. Their role may play a slight role in determining their choice of tools/sites used but the level of influence was not significant. In addition, participants’ career stages and region does not impact their choices on the type of tools/sites used during their research process.
 
  
== Tools & References ==
+
== Visualization 3 ==
 +
My final visualization is to find out whether different months plays a part in contributing to the number of reported cases. Ultimately, the stacked graph below showed the distribution of reported cases by their months, and we can see that the month of January has the least cases followed by February and December. The rest of the months are relatively the same in pattern. As such, based on my opinion, January, February and December usually have quite a lot of public holidays (e.g. Christmas Day, New Year, Chinese New Year). Therefore, employees or workers at those periods are either clearing/taking/extending their leaves, hence the number of accident rates tends to be lower as compared to other months. Additionally, there are also two possible reasons. Firstly, the volume of jobs or tasks given were not so intensive during January and February, therefore, employees/workers have the luxury of time to focus on their safety when comes to completing their tasks. However, from the start of Q2 to Q4, more tasks are usually given and expected to quickly and efficiently delivery. This definitely caused employees/workers to feel more pressured and wanting to be more productive and neglected on their safety. Secondly, information on the safety procedures and rules might be briefed at the beginning of the year. As such, those regulations are fresh and instilled in them for the first two months and gradually employees/workers will forget about it as the months goes by.
 +
[[File:13.StackedGraph KY.png|center|600px|Bubble Chart]]
 +
I chose stacked graph because this visualization was sufficient in allowing me to view the distribution of reported cases by all Sub industries across all the months in Year 2014. Any significance differences in the respective Sub industry can be more noticeable by applying the filter function to focus on only one Sub industry at a time.
 +
 
 +
== Tools used and Lesson learnt ==
 
The following tools were used throughout the data transformation, analysis and visualization construction process:
 
The following tools were used throughout the data transformation, analysis and visualization construction process:
 
* Microsoft Excel 2016
 
* Microsoft Excel 2016
* Tableau
+
* Tableau 10.0
 +
Used to construct visualizations such as Bar chart, Parallel Coordinates, Tree Map, Stacked graph, Histogram and an interactive dashboard. After that, the constructed interactive dashboard is published onto Tableau public which provides users a cloud platform to host their Tableau workbooks/dashboards as a webpage for others to interact or see.
 
* Microsoft PowerBI
 
* Microsoft PowerBI
 +
When using Power BI as one of the data visualization toolkit, we need to pay attention to the Excel file extension that we are using. There are basically two types of file extension for Excel, XLS and XLSX. The XLS is a binary format while XLSX is an Open XML format. If you try to import the .XLS data sheet into Power BI you will encounter an error. However, you could simply resolve this issue by just saving or changing the workbook as an .XLSX file and try to import again.
 
* JMP Pro
 
* JMP Pro
 +
 +
==Conclusion==
 +
The interactive dashboard can be accessed via this URL: https://public.tableau.com/views/TreeMap_VA_2/DashBoard-IndustryAccidentTracking?:embed=y&:display_count=yes&:toolbar=no
 +
 +
In conclusion, many insights can be found when visualizing the data with the right usage of charts and graphs. With the kind of information gathered from the aforementioned questions, it provides a better guidance for the Observatory for WSH Landscape (OWL) 1 to convey and communicate the changes quickly and effectively to the respective individuals (e.g. policy makers, work-safety inspectors, industries). For example, policy makers can implement better policies especially on industries with high number of accident cases. Subsequently, work-safety inspectors can also perform their routine checks more effectively and efficiently based on various months to save on resources and efforts. As such, this makes Singapore’s industries a safe working environment for everyone to feel motivated and safe to work in.
 +
 
<br/>
 
<br/>
 
The following websites provided additional guidance on the completion of this assignment:
 
The following websites provided additional guidance on the completion of this assignment:
* Dataset for 101 Innovations - Research Tools Survey (https://www.kaggle.com/bmkramer/101-innovations-research-tools-survey)
+
* Workplace Safety&Health Profile 2014 (http://www.mom.gov.sg/~/media/mom/documents/safety-health/reports-stats/wsh-profile/wsh-profile-2014.pdf)
* Using Tableau To Visualize Survey Data Part 1 (http://www.datarevelations.com/using-tableau-to-visualize-survey-data-part-1.html)
+
* The Power of Tableau Actions (http://www.clearlyandsimply.com/clearly_and_simply/2010/08/the-power-of-tableau-actions.html)
* Using Tableau To Visualize Survey Data Part 2 (http://www.datarevelations.com/using-tableau-to-visualize-survey-data-part-2.html)
 
* Creating Parallel Coordinates In Tableau (https://www.youtube.com/watch?v=kD1iEQYtMYo)
 

Latest revision as of 16:03, 2 October 2016

Theme of Interest: Industry Accident Tracking

Each industry plays a critical role in contributing to Singapore's economy. It is essential to enforce and encourage a safe working environment for all workers in different industries. One of the utmost concerns when a person chooses a job is Safety First. Despite the expansion of Singapore’s population over the years, poor working environment has been known to deter many individuals from joining these industries. Moreover, certain age group has shown to be more susceptible to injured themselves at work. This is a growing concern for Singapore as individuals becoming more skeptical of the industries which greatly affect the labor force and the economic growth of Singapore in the future. I decide to specifically focus on the accident types. Some of the questions I will be examining are:

Analytical/Investigation Questions

  • Which industry/sub-industry has the highest number of cases by accident types?
  • What are the most common types of accidents occurring in industry/sub industry?
  • Which month has the highest number of cases reported based on the industry/sub industry?

Selected Data Attributes for Analysis

The following data attributes are selected for analysis:

  • Reported Date
  • Accident Day
  • Major Industry
  • Sub Industry
  • Accident Type Category
  • Victim's Age

Data Transformation Process

In this assignment, I have decided to use Workplace Injuries data in 2014 attained from Assign2 at https://wiki.smu.edu.sg/1617t1IS428g1/Assign2 and with reference to WSHI National Statistics Report 2014. The excel sheets have been cleaned up to suit the needs of the various visualization tools used.

Sub Industry + Accident Type Category + Victim’s Age
I had filtered and removed unnecessary data, perform data formatting and, lastly organizing it in a clearer manner.

Sorted Data


Sub Industry:Others-Marine

Under the Sub Industry, I have combined the ‘Marine-Other Marine’ provided by the Raw Data into ‘Others –Marine’ in the Clean WPI Data as they refer to the same sub Industry. This is to avoid confusion and the inaccuracy of information retrieve when analyzing data.

Major Industry & Sub Industry + Age Group

Data for Ternary

Based on the total number of reported cases by industry/Sub industry, I have created a new worksheet with age group ranging from Age 15 all the way to Age 80 to identify which age group has the highest number of accidents in the respective industries.

As such, we proceed with the analysis and Visualization construction process as follows:

Visualization 1

Firstly, I used Tableau to visualize the Clean WPI Data of which Sub industry has the highest number of reported cases based on their accident types. Before I move on to generate the visualizations, I split up the Sub industry based on its separator (‘-‘) and created a new column – Age Group to facilitate my analysis.

Tableau Data Fixed

At a glance, the bar graph is by far one of the most common graphs that everyone will use to show the distributions of number of reported cases by Sub industry in each accident type. This is not a high-dimensional graph and the purpose of this visualization is to see if there is any relationship or pattern between Accident Type, Sub Industry and Number of Reported cases. My reason of this graph is to help me visualize what other possible questions that I could come up with and what visualizations would probably be better to derive my answers for my questions.

05 BarChart KimYong.png

I have selected “TreeMap” as it provides a dynamic data visualization paradigm that facilitates the representation and understanding of large and multivariate data. From the TreeMap we can clearly see that the Construction industry has one of the highest reported cases. Besides, when we hovered over each segment within the Sub industry, the tooltips will be able to show us the underlying data such as the type of accidents (e.g. Fall from Height) and what is the rate of the accident based on the no. of reported cases as shown below:

TreeMap

The rationale of choosing treeMap is because Major Industry and Sub Industry are both categorical attributes and they also hierarchical. Furthermore, some of the industries have only small amount of reported cases and thus cannot be easily spotted in the bar chart. Therefore, we can also use the "Filter" function to find out more on those sub-industries that are visually insignificant for us to see.

AgeGroup

Subsequently, I went to explore on the age Group by the Major industry and based on my findings I found out that Young adults of between 15 to 34 years of age are more susceptible to accidents than the graying seniors in all the Major industries. From my perspective, one reason that might attribute to such accidents is the lack of proper introduction into the jobs or tasks assigned or given to them. Employers in those industries do not fork out their time to provide information on the procedures and safety rules. As such, the younger age group has seen its accident number increased dramatically. However, as we go deeper into analyzing the Sub industry, we realized that the number of accidents is also based on the nature of the job and not just the specific age group itself. As shown in the graph below, for example, Construction and Metalworking industries have more number of reported cases as compared to other industries.

Nature of Job Graph

Additionally, I also try using JMP’s ternary plot to see the distribution of accidents by age group in the various industry. Ternary plot allows us to visualize the proportion of the victims across various age group – Young, Middle age, and Old from various Major Industry. Each major industry is given a colour which is indicated in the legend. Generally, both Young and Middle age group are susceptible to accidents at work.

Ternary Graph

Visualization 2

From data visualization 1, it seemed that Construction industry has the highest number of accident cases than other industries. As such, I went on to make use of Power BI tools to visualize the most common types of accidents occurred in Construction and also the other Sub industries. As shown from the bar graph below, the top 3 most common accidents type happened in all Industries are- Fall from Height, Struck by Moving Objects and Slips & Trips.

AccidentType BarChart

So moving forward, I continue to investigate what are the common types of accidents that happened in all industries. I decided to create a parallel coordinate as the no. of reported cases is discrete variable rather than continuous variable. It automatically creates a reference line and provides you with the flexibility to drag the line along the x-axis hovering over data points displaying tooltips with all its underlying data. From there I am able to identify the particular accident type, from the highest to lowest reported cases and by which industry. As such, the graph below showed that construction industry has the highest amount of reported cases of 258 while the pharmaceutical & biological pdts industry has only 5 reported cases on the accident type of falling from heights.

Parallel Coordinates

Additionally, we can also use the bubble chart provides a quick and easy way for user to read (Which industry has the highest number of accidents cases? Based on the selected industry, what are some of the top few accident types?) The circles in the bubble chart uses area to represent numbers. For example, the higher the number of reported cases in an industry, the bigger the area of the circle will be. As such, you will be able to see that Construction has the highest number of accidents and some of the most common accidents are fall from height, struck by moving objects and so on and so forth. One good thing of using Power BI tool is that every chart will be given full control without any necessary configurations unlike Tableau (e.g. Tableau actions). It automatically synchronized all the charts and allows users to selects a data point of one visualization and then sees what happens on the other visualizations.

Bubble Chart

Visualization 3

My final visualization is to find out whether different months plays a part in contributing to the number of reported cases. Ultimately, the stacked graph below showed the distribution of reported cases by their months, and we can see that the month of January has the least cases followed by February and December. The rest of the months are relatively the same in pattern. As such, based on my opinion, January, February and December usually have quite a lot of public holidays (e.g. Christmas Day, New Year, Chinese New Year). Therefore, employees or workers at those periods are either clearing/taking/extending their leaves, hence the number of accident rates tends to be lower as compared to other months. Additionally, there are also two possible reasons. Firstly, the volume of jobs or tasks given were not so intensive during January and February, therefore, employees/workers have the luxury of time to focus on their safety when comes to completing their tasks. However, from the start of Q2 to Q4, more tasks are usually given and expected to quickly and efficiently delivery. This definitely caused employees/workers to feel more pressured and wanting to be more productive and neglected on their safety. Secondly, information on the safety procedures and rules might be briefed at the beginning of the year. As such, those regulations are fresh and instilled in them for the first two months and gradually employees/workers will forget about it as the months goes by.

Bubble Chart

I chose stacked graph because this visualization was sufficient in allowing me to view the distribution of reported cases by all Sub industries across all the months in Year 2014. Any significance differences in the respective Sub industry can be more noticeable by applying the filter function to focus on only one Sub industry at a time.

Tools used and Lesson learnt

The following tools were used throughout the data transformation, analysis and visualization construction process:

  • Microsoft Excel 2016
  • Tableau 10.0

Used to construct visualizations such as Bar chart, Parallel Coordinates, Tree Map, Stacked graph, Histogram and an interactive dashboard. After that, the constructed interactive dashboard is published onto Tableau public which provides users a cloud platform to host their Tableau workbooks/dashboards as a webpage for others to interact or see.

  • Microsoft PowerBI

When using Power BI as one of the data visualization toolkit, we need to pay attention to the Excel file extension that we are using. There are basically two types of file extension for Excel, XLS and XLSX. The XLS is a binary format while XLSX is an Open XML format. If you try to import the .XLS data sheet into Power BI you will encounter an error. However, you could simply resolve this issue by just saving or changing the workbook as an .XLSX file and try to import again.

  • JMP Pro

Conclusion

The interactive dashboard can be accessed via this URL: https://public.tableau.com/views/TreeMap_VA_2/DashBoard-IndustryAccidentTracking?:embed=y&:display_count=yes&:toolbar=no

In conclusion, many insights can be found when visualizing the data with the right usage of charts and graphs. With the kind of information gathered from the aforementioned questions, it provides a better guidance for the Observatory for WSH Landscape (OWL) 1 to convey and communicate the changes quickly and effectively to the respective individuals (e.g. policy makers, work-safety inspectors, industries). For example, policy makers can implement better policies especially on industries with high number of accident cases. Subsequently, work-safety inspectors can also perform their routine checks more effectively and efficiently based on various months to save on resources and efforts. As such, this makes Singapore’s industries a safe working environment for everyone to feel motivated and safe to work in.


The following websites provided additional guidance on the completion of this assignment: