Difference between revisions of "ANLY482 AY2017-18T2 Group18/TeamDAcct Project Data"

From Analytics Practicum
 
</div><br>
 
</div><br>

Revision as of 21:14, 12 April 2018


 

Data

The data given to us was extracted from the company’s Enterprise Resource Planning (ERP) system. Because the ERP system was changed in June 2016, the data in the previous system is either incomplete or irretrievable. Hence, we decided to use data that primarily ranges from June 2016 to December 2017. We do not have authorization to access the ERP system, and LS 2 supplies data-sets only when we request them. As a result, document collection took a considerable amount of time, and we performed data cleaning, transformation and integration whenever we received more data. The documents we received were in different formats, some of them hard copy. We categorized each file according to its nature and the kind of information it contains. A summary table of the data-sets we collected can be seen below.

Data grp18 1.PNG
Data grp18 2.PNG


Data Cleaning and Transformation

After examining, exploring and understanding each data file supplied to us by LS 2, we performed the following data cleaning, transformation and integration to ensure the data is suitable for our analysis and model building. The points below represent the general flow of our data preparation process.

DACCTdatacleaning.PNG

The main challenges faced are as follows:

Challenge 1 - Absence of Integrated Data Management System: The main challenge came from the absence of an integrated data management system. LS 2 uses an ERP system, but many of its documents are maintained outside it. Often these documents have no proper connection established among themselves or with the documents extracted from the ERP system. These disjointed data-sets necessitated the creation of a master list of projects, each assigned a unique serial number.

Challenge 2 - Data Maintained in Non-tabular / Non-Excel Format: Some of the data was maintained in .doc files and hard copy, which made it difficult for us to create a data-set. As this data was not in tabular or Excel format, we had to manually key in a considerable amount of it. Transforming text information into a quantifiable table format was therefore a rather time-consuming process.

Challenge 3 - Incomplete Data-set: While creating the master list of projects, we found projects that we were unable to cross-reference to the different data sources (they had no matching reference instances). These projects therefore have incomplete data-sets, which decreased the number of projects from which we could get insights.
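
To make Challenges 1 and 3 concrete, here is a minimal Python sketch of how a master list with unique serial numbers could be built and then used to flag projects that cannot be cross-referenced. It is an illustration only and not the procedure we actually followed. The source labels, project names, helper functions and the name-matching rule (ignoring case and extra spaces) are all hypothetical assumptions.

```python
from itertools import count


def normalise(name):
    """Collapse case and whitespace so the same project matches across sources."""
    return " ".join(name.lower().split())


def build_master_list(sources):
    """Give each distinct project one serial number and record where it appears.

    `sources` maps a source label to an iterable of project names.
    """
    serials = count(1)
    master = {}
    for label, names in sources.items():
        for name in names:
            key = normalise(name)
            if key not in master:
                master[key] = {"serial": next(serials), "found_in": set()}
            master[key]["found_in"].add(label)
    return master


def split_by_completeness(master, required):
    """Separate projects found in every required source from the rest."""
    complete, incomplete = [], []
    for key, row in master.items():
        (complete if required <= row["found_in"] else incomplete).append(key)
    return complete, incomplete


# Hypothetical sample records; the real files and fields are not shown here.
sources = {
    "erp_extract": ["Project Alpha", "Project Beta", "Project Gamma"],
    "word_documents": ["project alpha", "Project  Beta"],
    "hardcopy_keyed_in": ["Project Alpha", "Project Delta"],
}

master = build_master_list(sources)
complete, incomplete = split_by_completeness(master, set(sources))
```

In this made-up sample, only the project that appears in all three sources counts as complete. The others are the kind of project with no matching reference instances that reduced the number of projects available for analysis.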

Beyond these general challenges, more specific details of the data preparation process, such as the treatment of missing values, the treatment of duplicates and the exclusion of projects, are elaborated in the remarks on the following pages.