ANLY482 AY2017-18T2 Group18/TeamDAcct Project Data

From Analytics Practicum
Revision as of 22:06, 13 April 2018 by Zqlow.2014 (talk | contribs)

TeamDAcctnew.png


 

Data Provided

The data given to us was extracted from the company’s Enterprise Resource Planning (ERP) system. Because the ERP system was changed in June 2016, the data in the previous system is either incomplete or unretrievable. Hence, we decided to use data that primarily ranges from June 2016 to December 2017. We are not authorized to access the ERP system, and LS 2 supplies data-sets only when we request them. As a result, document collection took a considerable amount of time, and we performed Data Cleaning, Transformation and Integration whenever we received more data. The documents we received were in different formats, some of them in hard copy. We have categorized each file according to its nature and the kind of information it contains. A summary table of the data-sets we collected is shown below.

Data grp18 1.PNG
Data grp18 2.PNG


Data Cleaning and Transformation

After examining, exploring and understanding each data file supplied to us by LS 2, we performed the following data cleaning, transformation and integration steps to ensure the data is suitable for our analysis and model building. The following represents our general flow in the data preparation process.

DACCTdatacleaning.PNG

The main challenges faced are as follows:

1) Absence of an Integrated Data Management System: LS 2 uses an ERP system, but many of its documents are maintained outside it. Often these documents have no proper connection established among themselves or with the documents extracted from the ERP system. These disjointed data-sets necessitated the creation of a master list of projects, each assigned a unique serial number.

2) Data Maintained in Non-tabular / Non-Excel Format: Some of the data was maintained in .doc files and hard copy, which made it difficult for us to create a data-set. As this data was not in tabular or Excel format, we had to manually key in a considerable amount of it. Transforming text information into a quantifiable table format was therefore a rather time-consuming process.

3) Incomplete Data-set: While creating the master list of projects, we found projects that we were unable to cross-reference across the different data sources (they had no matching reference instances). These projects therefore have incomplete data, which decreased the number of projects from which we could get insights.
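
The master list and cross-referencing described in challenges 1) and 3) can be illustrated with a minimal Python sketch. It is not the actual procedure we ran: the source names (erp_extract, manual_records), the project_ref field, the normalisation rule and the sample records are all hypothetical, chosen only to show the idea of assigning one serial number per project and flagging projects that do not appear in every source.

```python
from itertools import count


def normalise(ref):
    """Normalise a project reference so spelling variants match.

    Assumed rule (hypothetical): drop whitespace and upper-case.
    """
    return "".join(ref.split()).upper()


def build_master_list(sources):
    """Assign one serial number per distinct project reference.

    `sources` maps a source name to a list of records (dicts) that each
    carry a 'project_ref' field. Both names are hypothetical.
    """
    serials = count(1)
    master = {}
    for name, records in sources.items():
        for record in records:
            key = normalise(record["project_ref"])
            if key not in master:
                master[key] = {"serial": next(serials), "found_in": set()}
            # Remember every source in which this project appears.
            master[key]["found_in"].add(name)
    return master


def incomplete_projects(master, required_sources):
    """Return the projects missing from at least one required source."""
    required = set(required_sources)
    return sorted(
        key for key, entry in master.items()
        if not required <= entry["found_in"]
    )


# Made-up sample records, for illustration only.
sources = {
    "erp_extract": [{"project_ref": "P-001"}, {"project_ref": "P-002"}],
    "manual_records": [{"project_ref": "p-001 "}, {"project_ref": "P-003"}],
}

master = build_master_list(sources)
missing = incomplete_projects(master, sources.keys())
```

In this sketch only the first project is found in both sources, so the other two are reported as incomplete, which mirrors how unmatched projects reduced the number usable for analysis.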