Difference between revisions of "Group06 Report"

From Visual Analytics and Applications
==Project Motivation==
Google and Temasek Holdings have forecast that Singapore’s e-commerce market will grow to S$7.5bn within the next 10 years. As the market grows, the amount of data available for analysis will grow rapidly with it. This presents unique opportunities to find out how retailers can maximize the value of their data.

==Objective==
Our objective for this project is to push the boundaries of market and customer analytics. We want to explore how retailers' transaction data can be visualized and how various machine learning techniques can be applied to generate insights from it.

==Data Source and Preparation==
We will use an e-commerce dataset from the UCI Machine Learning Repository that contains transactional data from 2010 and 2011. The retailer behind the dataset mainly sells unique all-occasion gifts, and most of its customers are wholesalers. The data dictionary below lists the fields available in the dataset:
{| class="wikitable"
|-
! Field !! Description
|-
| InvoiceNo || Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.
|-
| StockCode || Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
|-
| Description || Product (item) name. Nominal.
|-
| Quantity || The quantities of each product (item) per transaction. Numeric.
|-
| InvoiceDate || Invoice Date and time. Numeric, the day and time when each transaction was generated.
|-
| UnitPrice || Unit price. Numeric, Product price per unit in sterling.
|-
| CustomerID || Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
|-
| Country || Country name. Nominal, the name of the country where each customer resides.
|}

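As an illustration of how the cancellation rule in the data dictionary might be applied during preparation, here is a minimal sketch in Python (chosen only for illustration; the project itself targets R). The helper names and the sample rows are hypothetical, and matching the letter case-insensitively is an assumption rather than something the dataset description states.

```python
def is_cancellation(invoice_no):
    """True when the invoice code starts with the letter 'c'.

    Case-insensitive matching is an assumption made for this sketch.
    """
    return str(invoice_no).strip().upper().startswith("C")


def split_cancellations(rows):
    """Split transaction rows into (sales, cancellations)."""
    sales, cancellations = [], []
    for row in rows:
        target = cancellations if is_cancellation(row["InvoiceNo"]) else sales
        target.append(row)
    return sales, cancellations


# Hypothetical sample rows using the field names from the data dictionary.
sample_rows = [
    {"InvoiceNo": "100001", "StockCode": "10001", "Quantity": 6, "UnitPrice": 2.50},
    {"InvoiceNo": "C100002", "StockCode": "10001", "Quantity": 6, "UnitPrice": 2.50},
    {"InvoiceNo": "100003", "StockCode": "10002", "Quantity": 12, "UnitPrice": 1.25},
]

sales, cancellations = split_cancellations(sample_rows)
print(len(sales), "sales,", len(cancellations), "cancellation(s)")
```

Keeping the cancelled rows in a separate list, rather than discarding them, leaves room to analyze cancellations later if that turns out to be useful.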
==Approach==
The application will be built primarily on the R Shiny framework. The key focus areas are:
* Visualization of bipartite networks (i.e. between customers and products)
* Visualization of clustering/segmentation of customers (RFM Model)
* Visualization of popular products through text analytics

Some examples of the planned visualizations follow:
===Visualization of bipartite networks===
[[File:ForceDirectedGraph.PNG|571px|center|Force Directed Network Graph]]

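A customer-product bipartite network can be derived directly from the transaction rows: each customer and each product becomes a node, and an edge links a customer to every product they bought. The sketch below, in Python for illustration only, builds such a weighted edge list. The function name, the choice of total quantity as the edge weight, and the sample rows are all assumptions made for this example.

```python
from collections import Counter


def build_bipartite_edges(rows):
    """Return (customers, products, edges).

    edges maps (CustomerID, StockCode) to the total quantity purchased,
    which is used here as an assumed edge weight.
    """
    edges = Counter()
    for row in rows:
        edges[(row["CustomerID"], row["StockCode"])] += row["Quantity"]
    customers = sorted({customer for customer, _ in edges})
    products = sorted({product for _, product in edges})
    return customers, products, dict(edges)


# Hypothetical sample rows using the field names from the data dictionary.
sample_rows = [
    {"CustomerID": "20001", "StockCode": "10001", "Quantity": 6},
    {"CustomerID": "20001", "StockCode": "10002", "Quantity": 2},
    {"CustomerID": "20002", "StockCode": "10001", "Quantity": 12},
    {"CustomerID": "20001", "StockCode": "10001", "Quantity": 4},
]

customers, products, edges = build_bipartite_edges(sample_rows)
for (customer, product), weight in sorted(edges.items()):
    print(customer, "->", product, "weight", weight)
```

An edge list of this shape is the kind of input that graph libraries typically accept when drawing a force-directed layout.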
===Visualization of clustering/segmentation of customers===
We intend to build a feature that lets users explore clustering results interactively. Interactive exploration helps users give more meaningful labels to the different clusters and, ultimately, arrive at a more insightful segmentation of the customers.

[[File:RFM_Model.gif|300px|center|RFM Clustering]]

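Before customers can be clustered, each one needs an RFM (recency, frequency, monetary) profile. The following Python sketch, for illustration only, computes one under commonly used definitions that are assumed here rather than taken from the proposal: recency is the number of days between a reference date and the customer's last invoice, frequency is the number of distinct invoices, and monetary is the sum of Quantity × UnitPrice. The function name, reference date, and sample rows are hypothetical.

```python
from datetime import datetime


def rfm_table(rows, reference_date):
    """Compute recency (days), frequency, and monetary value per customer."""
    per_customer = {}
    for row in rows:
        entry = per_customer.setdefault(
            row["CustomerID"],
            {"last": row["InvoiceDate"], "invoices": set(), "monetary": 0.0},
        )
        entry["last"] = max(entry["last"], row["InvoiceDate"])
        entry["invoices"].add(row["InvoiceNo"])
        entry["monetary"] += row["Quantity"] * row["UnitPrice"]
    return {
        customer: {
            "recency_days": (reference_date - entry["last"]).days,
            "frequency": len(entry["invoices"]),
            "monetary": round(entry["monetary"], 2),
        }
        for customer, entry in per_customer.items()
    }


# Hypothetical sample rows using the field names from the data dictionary.
sample_rows = [
    {"CustomerID": "20001", "InvoiceNo": "100001",
     "InvoiceDate": datetime(2011, 1, 10), "Quantity": 6, "UnitPrice": 2.50},
    {"CustomerID": "20001", "InvoiceNo": "100001",
     "InvoiceDate": datetime(2011, 1, 10), "Quantity": 2, "UnitPrice": 1.00},
    {"CustomerID": "20001", "InvoiceNo": "100004",
     "InvoiceDate": datetime(2011, 3, 1), "Quantity": 4, "UnitPrice": 2.50},
    {"CustomerID": "20002", "InvoiceNo": "100003",
     "InvoiceDate": datetime(2011, 2, 15), "Quantity": 12, "UnitPrice": 1.25},
]

rfm = rfm_table(sample_rows, reference_date=datetime(2011, 3, 31))
for customer, scores in sorted(rfm.items()):
    print(customer, scores)
```

The resulting table has one row per customer and three numeric columns, which is the form a clustering step would take as input; the clustering itself is outside the scope of this sketch.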
===Visualization of product popularity===
Retailers need to understand customer preferences for different products, so we will apply natural language processing to explore product popularity. Knowing which products are most and least popular, and what properties the popular products share, will help the retailer make decisions that better match customer preferences and thereby improve revenue. Visualization techniques such as bar charts and word clouds will be used to make the findings easier to understand.

[[File:Word cloud1.jpg|450px|center|Text Analytics]]
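A word cloud or bar chart of popular product terms starts from a term-frequency table. The Python sketch below, for illustration only, counts words in the Description field. Weighting each word by the quantity sold, the simple letters-only tokenizer, the function name, and the sample descriptions are all assumptions made for this example, not the project's chosen method.

```python
import re
from collections import Counter


def word_frequencies(rows, stopwords=frozenset()):
    """Count words in product descriptions, weighted by quantity sold."""
    counts = Counter()
    for row in rows:
        # Lowercase the description and keep runs of letters as tokens.
        for token in re.findall(r"[a-z]+", row["Description"].lower()):
            if token not in stopwords:
                counts[token] += row["Quantity"]
    return counts


# Hypothetical sample rows using the field names from the data dictionary.
sample_rows = [
    {"Description": "RED GIFT BAG", "Quantity": 6},
    {"Description": "BLUE GIFT BOX", "Quantity": 4},
    {"Description": "RED RIBBON", "Quantity": 2},
]

frequencies = word_frequencies(sample_rows)
for word, weight in frequencies.most_common(3):
    print(word, weight)
```

Passing a set of stopwords filters out terms that carry little meaning, so the remaining counts can be handed to whichever charting component renders the bar chart or word cloud.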

<br>

==Selection of Tools==
Based on our preliminary assessment, we will use the following libraries, among others, to develop the R Shiny dashboard: ''tidyr, dplyr, ggplot2, igraph, htmlwidgets, networkD3, mclust, shiny, shinyTime, shinydashboard, shinythemes, sjmisc, readxl, stringr, data.table, dummies, sjPlot, car, DT, reshape2, sqldf''.

Revision as of 15:12, 13 August 2018
