IS428 2016-17 Term1 Assign2 Manas Mohapatra


Background

The bigger the data, the more robust the conclusion. This is the general rule of thumb used by data analysts, and it is where the term Big Data comes from as well. But it is not so straightforward, as there are several aspects to it. If a dataset has more rows, that usually strengthens the hypothesis and makes the conclusion more robust. But if it has more columns, or variables, the result is often more confusing, as the dataset becomes increasingly difficult to deal with. The Workplace Injuries dataset contains nearly 50 variables. At first glance, I decided to investigate the major causes of workplace injuries based on the variables given, such as type of work and industry. But such a dataset needs to be approached with a systematic procedure, as follows.

Research and Data Preparation

The first step in the process is to look through all the variables and get a sense of the secondary research that needs to be done. This research could cover things like industry standards and benchmarks, for instance, why the dataset segregates "percentage of manual labour" into two halves: >50% and <=50% manual work.
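
As a toy illustration (not the actual dataset), the split described above amounts to a simple two-way binning; the column name pct_manual_labour below is purely hypothetical:

```python
import pandas as pd

# Toy data: a hypothetical column holding the percentage of manual work per record.
df = pd.DataFrame({"pct_manual_labour": [10, 45, 50, 80, 95]})

# Reproduce the dataset's two-bucket segregation: >50% vs <=50% manual work.
df["manual_labour_band"] = df["pct_manual_labour"].apply(
    lambda p: ">50%" if p > 50 else "<=50%"
)
print(df)
```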

Next comes data cleaning. A closer inspection shows that there are several redundant variables, such as Year (nothing significant is gained from it, since all the data was collected in 2014). Often the same variable is recorded both as a string and as a number: the first day of the week appears as "Monday" under Weekday and as 1 under Weekday no. This is clearly a repeated variable, and of the two I decided to keep the former so that it is automatically interpreted as a string, and hence a categorical variable, in JMP and Tableau.
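
The cleaning itself was done in JMP/Tableau, but a minimal scripted sketch of the same step might look like this; the file name and the exact column spellings (Year, Weekday, Weekday no.) are assumptions based on the description above:

```python
import pandas as pd

# Hypothetical file name for the 2014 workplace injuries extract.
df = pd.read_csv("workplace_injuries_2014.csv")

# Drop redundant variables: Year is constant (all records are from 2014),
# and "Weekday no." duplicates the string "Weekday" column.
df = df.drop(columns=["Year", "Weekday no."])

# Keeping the string version means the day of week is treated as categorical
# rather than numeric (in pandas, as in JMP and Tableau).
print(df["Weekday"].value_counts())
```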

The next step is to clean out anonymized variables. But before deleting them entirely, I deemed it necessary to observe their distributions and check for any unusual patterns. I was about to remove other unnecessary variables, such as names and IDs, because they are obviously anonymized: variables such as informant name, organization name and organization SSIC code seemed ready to be deleted from the dataset. But this is where the secondary research helped me. On closer inspection, SSIC is not anonymized; rather, it is a code set by the government to identify the industry a company belongs to. There are several versions of SSIC codes, but the ones used in this dataset are based on the 2010 edition, which consists of a 2-digit code at the first level, a 3-digit code at the second level, and so on up to the complete 5-digit code. This is discussed further in the analysis.
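
To make the hierarchy concrete, here is a minimal sketch of how a full 5-digit SSIC 2010 code can be unpacked into its levels by taking prefixes; the example codes and the column name SSIC are illustrative only:

```python
import pandas as pd

# Toy SSIC 2010 codes stored as strings (the real column name may differ).
df = pd.DataFrame({"SSIC": ["41001", "47219", "86101"]})

# First level = first 2 digits, second level = first 3 digits,
# and the complete 5-digit code is the most detailed level.
df["ssic_level_1"] = df["SSIC"].str[:2]
df["ssic_level_2"] = df["SSIC"].str[:3]
print(df)
```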

The final step in data understanding and preparation was to classify the data. This helps to arrange the data neatly for a better understanding, and it supports forming hypotheses and observing potential cause-and-effect relationships. Below is a snapshot of the data classification: the variables in yellow, green and red indicate, respectively, variables to clean out, variables to inspect before deleting, and variables to analyze.

I had to classify and reclassify the above as I gained a deeper understanding of the dataset. For instance, before researching SSIC I had organized the SSIC variables under "Reporting", but after understanding them I reclassified them under type of work, since the SSIC codes work together with the industry and sub-industry variables to define the type of work the victim was engaged in. Other data preparation included assigning a categorical type to variables such as weekday no., codes and accident types, which would otherwise have been automatically assigned a numeric type.
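
If that type reassignment were scripted rather than done in JMP/Tableau, a minimal sketch might look like the following; the file name and column names are assumptions:

```python
import pandas as pd

# Hypothetical file name for the 2014 workplace injuries extract.
df = pd.read_csv("workplace_injuries_2014.csv")

# Force variables that would otherwise be auto-detected as numeric
# to be treated as categorical instead.
for col in ["Weekday no.", "SSIC", "Accident type"]:
    df[col] = df[col].astype("category")

print(df.dtypes)
```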

Data Analysis

I made a rough flow chart to form hypotheses:

RELEVANT CONTEXT --> KNOWN CAUSE (+ Hidden Correlations) --> INJURY --> REPORTING

The relevant context is the type of work the victim was engaged in. The known cause is the explicitly recorded cause that leads to the injury. My aim is to provide insights into hidden correlations that could also be contributing to injuries; these correlations could arise among the known causes, or could work alongside the relevant context. I decided not to focus on reporting, as there was no interesting insight to be found in how injuries were reported. Note that I say correlation, since correlation cannot be equated with causation: these tools can only point out a correlation, and to establish a correlation as causation, experimentation is usually required.

Since there are many variables in this dataset, the obvious analysis tool would be multivariate analysis. But there is one issue with it: multivariate analysis works much better for continuous data, while our dataset is nearly 90% categorical. Thus most of my analysis is bivariate in nature, with mosaic plots as the predominant tool. Occasionally I used a logistic fit to look for a correlation between a categorical and a continuous variable. None of my hypotheses required an analysis involving only continuous variables.
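
Although the charts themselves were built in JMP and Tableau, the same bivariate view can be sketched programmatically. Below is a minimal sketch in Python, assuming a hypothetical file name and hypothetical column names ("Industry", "Accident type"):

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

# Hypothetical file name for the 2014 workplace injuries extract.
df = pd.read_csv("workplace_injuries_2014.csv")

# Cross-tabulate the two categorical variables to see their joint frequencies ...
print(pd.crosstab(df["Industry"], df["Accident type"]))

# ... and draw a mosaic plot: tile areas are proportional to cell counts,
# so over- or under-represented combinations stand out visually.
mosaic(df, ["Industry", "Accident type"])
plt.show()
```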


The following are my hypotheses and results:

Hypothesis: injuries result from negligence when victims work long hours. The starting-time-of-work variable does not help here, since a reference line would be needed for each organization; a better variable to compare against is overtime (although how accurately overtime is captured is open to question).