Difference between revisions of "ANLY482 AY2017-18T2 Group18/TeamDAcct Project Data"

From Analytics Practicum

Revision as of 23:48, 25 February 2018


Data

Our client will be supplying us with data for the years 2016 and 2017. However, because the client changed the way it stores data in June 2016, some data cleaning must be performed on the data for the period before June 2016. The types of files we received are summarized in the table below.

Data377.PNG


Data Cleaning and Transformation

After examining and exploring each data file supplied to us by LS 2, we performed the following data cleaning, transformation and integration to ensure the data is suitable for our analysis and model building. Figure 4 shows the general flow of our data preparation process.

DACCTdatacleaning.PNG

Challenge 1: The main challenge came from the absence of an integrated data management system. LS 2 uses an ERP system, but many of its documents are maintained outside that system. Often these documents have no proper connection among themselves or with the documents extracted from the ERP system. These disjointed datasets necessitated the creation of a master list of projects, each assigned a unique serial number.
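To illustrate the idea of a master list, here is a minimal Python sketch that assigns one serial number to each distinct project found across several disconnected sources. It is not the procedure we actually followed: the function `build_master_list`, the `project` field, the name normalization rule and the sample records are all hypothetical and chosen only for illustration.

```python
def build_master_list(*sources):
    """Map each distinct (normalized) project name to a unique serial number.

    Each source is a list of dicts with a hypothetical "project" field.
    """
    master = {}
    for records in sources:
        for record in records:
            # Assumed normalization: collapse repeated whitespace, ignore case.
            key = " ".join(record["project"].split()).upper()
            if key not in master:
                # Serial numbers start at 1 and follow first appearance.
                master[key] = len(master) + 1
    return master


# Made-up sample data standing in for an ERP extract and offline documents.
erp_extract = [{"project": "Alpha Tower"}, {"project": "Beta Mall"}]
offline_docs = [{"project": "alpha  tower"}, {"project": "Gamma Depot"}]

master = build_master_list(erp_extract, offline_docs)
```

In this sketch, "Alpha Tower" and "alpha  tower" resolve to the same serial number because of the assumed normalization. Real project names would likely need manual review rather than a purely automatic rule.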

Challenge 2: Some of the data was maintained in .doc files and in hard copy, which made it difficult for us to create a dataset. As the data was not in tabular or Excel format, we had to manually key in a considerable amount of it. Transforming text information into a quantifiable table format was therefore a rather time-consuming process.

Challenge 3: While creating the master list of projects, we found projects that we were unable to cross-reference across the different data sources (they had no matching reference instances). These projects therefore have incomplete data, which decreased the number of projects from which we could get insights.
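The cross-referencing check described above can be sketched in Python as follows. This is an illustrative example under assumed names, not our actual workflow: `split_by_coverage`, the source names (`erp`, `contracts`, `timesheets`) and the serial numbers are hypothetical.

```python
def split_by_coverage(master_keys, sources):
    """Separate projects found in every source from those missing somewhere.

    master_keys: iterable of project serial numbers from the master list.
    sources: dict mapping a source name to the set of serial numbers it contains.
    Returns (complete, incomplete), where incomplete maps each project to the
    list of sources in which it has no matching record.
    """
    complete, incomplete = [], {}
    for key in master_keys:
        missing = [name for name, keys in sources.items() if key not in keys]
        if missing:
            incomplete[key] = missing
        else:
            complete.append(key)
    return complete, incomplete


# Made-up serial numbers and source names for illustration only.
sources = {
    "erp": {"P001", "P002", "P003"},
    "contracts": {"P001", "P003"},
    "timesheets": {"P001", "P002"},
}

complete, incomplete = split_by_coverage(["P001", "P002", "P003"], sources)
```

Here only P001 appears in all three sources, so it is the only project with a complete dataset. The `incomplete` mapping records which source is missing for each of the others, which helps when deciding whether a project should be excluded.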

Beyond these general challenges, more specific details of the data preparation process, such as the treatment of missing values, the treatment of duplicates and the exclusion of projects, are given in the remarks on the following pages.