Difference between revisions of "ANLY482 AY2016-17 T2 Group13 - Project Findings"

From Analytics Practicum
Line 51: Line 51:
 
<!------- Details ---->
 
<!------- Details ---->
  
<div style="background: #dce6f9; line-height: 0.3em; font-family:Century Gothic;  border-left: #003464 solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#000000"><strong>Data Preparation</strong></font></div></div>
+
<div style="background: #dce6f9; line-height: 0.3em; font-family:Century Gothic;  border-left: #003464 solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#000000"><strong>Data Pre-Processing</strong></font></div></div>
# Removal of Unused Columns
 
# Appending Sector Details
 
# Appending Country Details
 
# Calculation of Market Share
 
  
XXX
+
Before embarking on an Exploratory Data Analysis (EDA), the data had to be transformed for further analysis. Viewing the data by 2-digit (2D) HS codes alone was insufficient because each aggregated category grouped together too many potentially unrelated items. These were the data cleaning steps that were attempted:
 +
 
 +
<div style="background: #dce6f9; line-height: 0.3em; font-family:Century Gothic;  border-left: #003464 solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#000000"><strong>1. Combination of Data</strong></font></div></div>
 +
[[File:.png|center|500px]]
 +
All the .csv files were placed in a single folder so that only the datasets used for the project would be concatenated. The “copy *.csv” command was then used to combine all the .csv files, leaving the original files intact.
 +
 
 +
[[File:.png|center|500px]]
 +
R was also used to combine the files. Each .csv file was loaded into a variable, and the “rbind” function was then used to combine the datasets.
 +
 
 +
<div style="background: #dce6f9; line-height: 0.3em; font-family:Century Gothic;  border-left: #003464 solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#000000"><strong>2. Removal of Unnecessary Columns</strong></font></div></div>
 +
[[File:.png|center|500px]]
 +
Many of the columns in the raw data file did not contain any values. Because these columns would not have been useful to our analysis, we removed them in Microsoft Excel.
 +
 
 +
[[File:.png|center|500px]]
 +
Above is the data set after removing the unnecessary columns. Variables containing weight appeared in the 4D and 6D data but were less apparent in the 2D data. For this reason, we decided to keep these variables in the dataset.
 +
 
 +
<div style="background: #dce6f9; line-height: 0.3em; font-family:Century Gothic;  border-left: #003464 solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#000000"><strong>3. Sectors and Dimension Tables</strong></font></div></div>
 +
We received the names for 18 sectors that the DIT focuses on, listed below.
 +
 
 +
{| class="wikitable"
 +
! s/n
 +
! High Value Campaign
 +
! High Priority Volume
 +
! Low Priority Volume
 +
! Unclassified
 +
|-
 +
| 1
 +
| Aerospace
 +
| Retail/Consumer
 +
| Advanced
 +
  Manufacturing (Excluding aerospace and automotive)
 +
| Others - Raw
 +
  Materials
 +
|-
 +
| 2
 +
| Food and Beverage
 +
| Education
 +
| Automotive
 +
| Others -
 +
  Manufacturing
 +
|-
 +
| 3
 +
| Infrastructure
 +
  (Water and Environment)
 +
| Energy
 +
| Bio-economy
 +
  (Agri-tech)
 +
| Others
 +
|-
 +
| 4
 +
| Infrastructure
 +
  (Rail)
 +
| Financial and
 +
  Professional Business Services
 +
| Bio-economy
 +
  (Chemicals)
 +
| Financial and
 +
  Professional Business Services  -
 +
  Others
 +
|-
 +
| 5
 +
| Technology
 +
| Healthcare
 +
| Sports
 +
|
 +
|-
 +
| 6
 +
| Food and Beverage
 +
| Life Sciences
 +
|
 +
|
 +
|-
 +
| 7
 +
|
 +
| Infrastructure
 +
  (Airports)
 +
|
 +
|
 +
|}

Revision as of 00:22, 20 February 2017






Data Pre-Processing

Before embarking on an Exploratory Data Analysis (EDA), the data had to be transformed for further analysis. Viewing the data by 2-digit (2D) HS codes alone was insufficient because each aggregated category grouped together too many potentially unrelated items. These were the data cleaning steps that were attempted:

1. Combination of Data

All the .csv files were placed in a single folder so that only the datasets used for the project would be concatenated. The “copy *.csv” command was then used to combine all the .csv files, leaving the original files intact.

R was also used to combine the files. Each .csv file was loaded into a variable, and the “rbind” function was then used to combine the datasets.
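
The concatenation step can be sketched in Python as follows. This is only an illustration, not the team's actual procedure (which used the “copy *.csv” command and R's “rbind”). The function name `combine_csv_files` and the assumption that every file shares the same header row are hypothetical.

```python
import csv
from pathlib import Path


def combine_csv_files(folder, output_path):
    """Concatenate every .csv file in `folder` into one file at `output_path`.

    Assumption: all files share the same header row. The header is written
    once, and the source files are left untouched.
    Returns the number of data rows written.
    """
    folder = Path(folder)
    output_path = Path(output_path)
    # Sort for a deterministic order; skip the output file if it sits in the same folder.
    sources = [
        p for p in sorted(folder.glob("*.csv"))
        if p.resolve() != output_path.resolve()
    ]
    header = None
    rows_written = 0
    with output_path.open("w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for source in sources:
            with source.open(newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # empty file, nothing to copy
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"{source.name} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Writing the header only once keeps repeated header rows from ending up as data rows in the combined file.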

2. Removal of Unnecessary Columns

Many of the columns in the raw data file did not contain any values. Because these columns would not have been useful to our analysis, we removed them in Microsoft Excel.

Above is the data set after removing the unnecessary columns. Variables containing weight appeared in the 4D and 6D data but were less apparent in the 2D data. For this reason, we decided to keep these variables in the dataset.
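
The column removal was done by hand in Microsoft Excel. A minimal programmatic sketch of the same idea is shown here, assuming that "unnecessary" means a column with no value in any row. The function name `drop_empty_columns` and the sample column names are hypothetical.

```python
def drop_empty_columns(rows):
    """Return (header, data) without the columns that are blank in every data row.

    `rows` is a list of lists of strings, with rows[0] as the header row.
    A column is kept as soon as one row holds a non-blank value in it.
    """
    header, data = rows[0], rows[1:]
    keep = [
        i for i in range(len(header))
        if any(i < len(r) and r[i].strip() for r in data)
    ]
    new_header = [header[i] for i in keep]
    # Pad short rows with empty strings so every output row has the same width.
    new_data = [[r[i] if i < len(r) else "" for i in keep] for r in data]
    return new_header, new_data


# Hypothetical sample: "weight" is blank in every row, so it is dropped.
sample = [
    ["hs_code", "weight", "value"],
    ["01", "", "10"],
    ["02", " ", "20"],
]
header, data = drop_empty_columns(sample)
```

Under this rule a column survives if any row has a value, so a weight column populated only in the 4D and 6D rows of a combined dataset would be kept.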

3. Sectors and Dimension Tables

We received the names for 18 sectors that the DIT focuses on, listed below.

{| class="wikitable"
! s/n
! High Value Campaign
! High Priority Volume
! Low Priority Volume
! Unclassified
|-
| 1
| Aerospace
| Retail/Consumer
| Advanced Manufacturing (Excluding aerospace and automotive)
| Others - Raw Materials
|-
| 2
| Food and Beverage
| Education
| Automotive
| Others - Manufacturing
|-
| 3
| Infrastructure (Water and Environment)
| Energy
| Bio-economy (Agri-tech)
| Others
|-
| 4
| Infrastructure (Rail)
| Financial and Professional Business Services
| Bio-economy (Chemicals)
| Financial and Professional Business Services - Others
|-
| 5
| Technology
| Healthcare
| Sports
|
|-
| 6
| Food and Beverage
| Life Sciences
|
|
|-
| 7
|
| Infrastructure (Airports)
|
|
|}