ANLY482 AY2017-18 T1 Group1/Final Progress



Item No.  Final Deliverables
01 Conference Paper (submitted on eLearn)
02 Presentation Poster (submitted on eLearn)
03 Project Artefacts (submitted on eLearn)


Abstract

Supply chain management is a complex process filled with information that can be mined using data analytics techniques. From the procurement of raw materials, to the manufacturing of the different components of a finished good, to distribution and sales, there is a great deal of information that a company can use to improve the efficiency of its supply chain. This can be done by uncovering discrepancies in procurement and distribution costs or by creating demand forecasting models to anticipate future demand for a finished good.

This is the first iteration of a data project our sponsor has undertaken, and thus the objective is to surface data issues and create an overview of our client's supply chain. The data and insights obtained can then be used in the future for more complex analysis. For the project, we will be utilizing Python and R and their respective data manipulation packages (pandas, dplyr, etc.) to combine the client's disparate Excel files into a single master table. Finally, Tableau will be used to visualize cost comparisons of the constituents of a product throughout its manufacturing and distribution.


Literature Review

The Supply Chain Operations Reference (SCOR) model developed by the Supply Chain Council (APICS, 2017) is a good framework for conducting analytics with J&J’s supply chain.

Figure 1. Analytics techniques used in supply chains

As seen in the table above by Souza (2014), we will first approach our sponsor's business problem with descriptive analytics, via mapping and visualization of the constituent costs of their products. Since this is the first iteration of the project, descriptive analytics will give a good overview of the sponsor's supply chain and expose any weaknesses in their data. We thus aim to build a solid foundation on which future projects can develop predictive and prescriptive models, so that our sponsor can make strategic and tactical decisions on how to source, make and deliver, as laid out in the SCOR model.


Project Overview

Sponsor's Current Approach
Currently, our sponsor spends a lot of time on manual data cleaning and harmonization across multiple Excel files to reach the insights they want about their products.

Figure 2. Sample product analysis output generated by the sponsor

For example, our client had to produce multiple static bar chart diagrams in Excel to show two views: an overall outlook on manufacturing costs across different plants, and a deeper dive into a specific plant's manufacturing cost components. This process is time-consuming, as it requires our sponsor to create many different calculations and data tables for each level of cost comparison (e.g. two data tables are required just to display the two bar chart diagrams shown above). The static diagrams also do not allow users to explore different levels of cost comparison on demand.

For our project, twelve data files were required to provide complete information on the different costs involved for each SKU, from the manufacturing plant all the way to the distribution centres. As we aim to automate the process of data cleaning, data harmonization and master table generation, a lot of data manipulation is required. We also initially planned to produce a static report of statistical cost analysis on top of the Tableau dashboard, so we decided to use R, which offers more libraries for data cleaning, data table manipulation and statistical analysis.

Group's Proposed Approach
At the end of the project, we propose that the process be automated in four phases, with each phase being performed by one R code file. (Unless stated otherwise, all functions mentioned in this paper are from the dplyr library in R.)

A. Merge Data Files
First, we use a self-defined function to standardize the column types and names of the sixty-three BOM files we received for a product line. Then, using the bind_rows function, we merge the sixty-three files into a single file, before using the write_xlsx function from the writexl library to save the newly combined data table as an .xlsx file (refer to Figure 3).

Figure 3. Usage of self-defined function f1, bind_rows from dplyr, and write_xlsx from writexl.
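Below is a minimal sketch of this merge phase, assuming the BOM files sit in a single folder and are read with read_excel from the readxl library. The folder path, the body of the standardization function f1, and the column handling are illustrative assumptions rather than the sponsor's actual setup.

  library(dplyr)
  library(readxl)
  library(writexl)

  # Illustrative standardization function: coerce every column to character
  # and normalize column names (the real f1 used in the project may differ)
  f1 <- function(df) {
    df[] <- lapply(df, as.character)
    names(df) <- gsub(" ", "_", tolower(names(df)))
    df
  }

  # Read and standardize every BOM file in the (hypothetical) bom_files folder
  bom_paths <- list.files("bom_files", pattern = "\\.xlsx$", full.names = TRUE)
  bom_list  <- lapply(bom_paths, function(p) f1(read_excel(p)))

  # Stack the sixty-three standardized tables into one data frame and save it
  combined_bom <- bind_rows(bom_list)
  write_xlsx(combined_bom, "combined_bom.xlsx")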

B. Clean Data Files
For data cleaning, we first extract all relevant columns from the raw files using the base R subset function. Then, depending on the data, we use either the mutate function to change or replace values, the filter function to extract rows that fulfill certain conditions, or the rename function to standardize column names across different files. In cases where we only want to store a substring of a value, we also use the str_extract function from the stringr library.

Figure 4. Usage of multiple functions in dplyr library to perform data cleaning.
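A hedged sketch of the cleaning step is shown below; the column names, filter condition and regular expression are illustrative placeholders, not the sponsor's actual fields.

  library(dplyr)
  library(stringr)

  cleaned_bom <- combined_bom %>%
    # Keep only the columns relevant to cost analysis (base R subset)
    subset(select = c(plant, sku, component, currency, freight_cost, duty_cost)) %>%
    # Standardize column names across files
    rename(material = component) %>%
    # Keep only rows that fulfill the conditions of interest
    filter(plant != "") %>%
    # Change/replace values, e.g. keep only the numeric code embedded in the plant string
    mutate(plant_code   = str_extract(plant, "[0-9]+"),
           freight_cost = as.numeric(freight_cost),
           duty_cost    = as.numeric(duty_cost))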

C. Reshape Data Table for Easier Manipulation
We used the gather function from the tidyr library to transpose twenty-one cost type columns into a pair of key-value columns. By reshaping the data table, we can more easily apply calculations to standardize costs across different currencies into US Dollars (USD). The reshaped data table is also a more desirable input for our Tableau dashboard, because we can easily sum up the multiple components related to one finished good without having to pivot the data table in Tableau's import settings.

Figure 5. Usage of gather function to reshape the data table.

Figure 6. Data table structure before transposition.

Figure 7. A leaner data table structure after transposition.
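A minimal sketch of the reshape step follows, showing only two of the twenty-one cost columns for brevity and assuming exchange rates are available in a small lookup table named exchange_rates with currency and usd_rate columns; all names here are illustrative assumptions.

  library(dplyr)
  library(tidyr)

  long_costs <- cleaned_bom %>%
    # Transpose the cost type columns into a key-value pair
    gather(key = "cost_type", value = "cost_value", freight_cost, duty_cost) %>%
    # Standardize every cost into USD using the (hypothetical) exchange-rate lookup
    left_join(exchange_rates, by = "currency") %>%
    mutate(cost_value_usd = cost_value * usd_rate)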

D. Combine All Cleaned-Up Files to Generate Mastertable
Lastly, we used the left_join function to join the multiple cleaned-up files on the common keys between them, generating the final Mastertable that serves as the input for our Tableau dashboard.
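A hedged sketch of this final phase is shown below; the table names and join keys are illustrative, and the actual Mastertable combines more files than shown here.

  library(dplyr)
  library(writexl)

  # Join each cleaned-up file onto the starting table via their common keys;
  # rows of the starting table with no match simply carry NA values
  mastertable <- starting_table %>%
    left_join(long_costs,      by = c("sku", "plant")) %>%
    left_join(plant_reference, by = "plant")

  # Save the Mastertable as the input file for the Tableau dashboard
  write_xlsx(mastertable, "mastertable.xlsx")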


Methodology

Data Selection
We seek to perform data harmonization by first identifying an Excel file as the "starting point" to be built upon when pulling in relevant information from the other files; this file forms the left-hand side (LHS) of every join operation we perform. As we use the left_join function from dplyr to i) eliminate irrelevant information found in the RHS that has no corresponding records in the LHS, and ii) identify records in the LHS that are missing information in the RHS (a short sketch of this check follows the requirements below), the starting table needs to fulfill a few requirements:

A. Information in the starting file should be applicable for our client's context.
As our client is concerned with the costs involved in the supply chain, we want accurate values for cost items such as duty or freight incurred during the transportation of goods. Though we initially chose one data file to set the foundation for our Mastertable, it was later dropped from the project entirely for two reasons.

Firstly, all numbers in this file are valid only for accounting purposes within the finance team, and do not represent the actual costs incurred in the supply chain of the goods. Secondly, the sponsor and the group struggled to pin down the absolute values of distribution costs expressed as percentages, because the different Incoterms used by different countries meant the percentages were not always applied to the same base in every transfer of goods between two plants.

B. Information in the starting file should be up-to-date.
We chose another file to set the foundation for the Mastertable, but also faced issues later when we found that many SKUs in this file were obsolete and no longer manufactured at the time of the project.

In the end, we agreed to choose a third file as the foundation for the Mastertable, as the information included was both applicable to the supply chain analytics context and up-to-date, covering only the SKUs currently in the market.
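As mentioned above, the choice of a left join makes both checks straightforward. The sketch below is illustrative only: the table and column names are assumptions, not the sponsor's actual files.

  library(dplyr)

  # RHS-only records are dropped automatically by the left join;
  # LHS records without a match carry NA in the RHS columns
  joined <- left_join(starting_table, cost_file, by = "sku")

  # LHS SKUs that are missing cost information in the RHS file
  missing_costs <- filter(joined, is.na(cost_value))

  # Equivalently, anti_join lists the LHS records with no RHS match at all
  missing_costs_alt <- anti_join(starting_table, cost_file, by = "sku")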


Data Cleaning
During the course of data cleaning, the group saw that many data woes were the result of manual input, maintenance and extraction of data in the organization, which are prone to errors caused by carelessness or fatigue on the part of the users. A lack of data governance in the organization also resulted in unstandardized column names across multiple data files, unstandardized data table structures across countries, and unstandardized value inputs across different stakeholders.


Limitations

We had difficulties generating a visualization that gives a broad overview of the products' total trade volume and trade distribution across different markets and routes throughout the supply chain. Much of the data needed to construct a more comprehensive and detailed view of the supply chain is not easily accessible or is riddled with errors. This is primarily because of the sponsor's existing data silos: the data we need is hard to access and often comes from different sources, so the data we do finally obtain is often inconsistent.

Because of the sheer amount and complexity of the data provided, we spent most of our time cleaning and sorting out the data. There were many data issues, such as changes in the data structure, which we had to continuously sort out with the sponsor, prolonging the process. As a result, most of our time was spent on generating the Mastertable for accurate cost comparisons and visualizations. Due to the lack of real-time data, we were also limited to descriptive analytics; tools such as predictive modelling were not feasible for our project because our data sets only contained static, descriptive information.


Conclusion

Thus far, we have shown that there are many inconsistencies within our sponsor's data, mainly attributed to a lack of a central data warehouse and various departments in different countries operating in silos. This leads to a glut of inconsistent data and a lot of time wasted in verifying the structures and true figures of the products' costs.

Having proper data governance and infrastructure in place is critical for performing further analysis, especially for a large FMCG MNC like our sponsor. With the current data silo structure, even generating descriptive analytics insights has been a challenge. It would be beneficial for our sponsor to establish a data engineering team to address its data woes, as this would make descriptive analytics much faster and pave the way for predictive or prescriptive models.