Difference between revisions of "Group06 Report"
| Ydzhang.2017 (talk | contribs) | Ydzhang.2017 (talk | contribs)  | ||
Revision as of 16:03, 13 August 2018
Group 6
Report
1 Introduction
An e-commerce company in the United Kingdom has provided its transaction data from 1 December 2010 to 9 December 2011. The company’s customers are mainly wholesalers and, based on the data provided, its client portfolio is mainly based in the United Kingdom. The key variables in the data set are: (i) invoice number; (ii) stock code; (iii) description; (iv) quantity; (v) invoice date; (vi) unit price; (vii) customer identification number; and (viii) country. As the company is primarily sales-driven, its key objective should be to maximize revenue. To achieve this, we adopt a four-pronged approach: (a) understand the seasonality of the flow of goods over the period; (b) identify cross-sell opportunities through customer similarities; (c) cluster high-value customers to target; and (d) review product descriptions to understand which kinds of products sell well.
2 Motivation and Objective
On the market, there are various customer intelligence platforms available, such as “DataSift”, “SAS Customer Intelligence” and “Accenture Insights Platform”. However, none of them offers an integrated bespoke solution for the data on hand.
Our motivation is to build an entirely bespoke application that allows the company to fully analyse its data right from the outset.
2.1 Cosine Similarity
This part of the application aims to find similarities between customers. Given that there are 3,866 products, we use the customers’ historical purchasing patterns to measure how similar they are. To do this, we created a matrix of customers (columns) and products (rows), where each value is the quantity of a product that a customer bought. Customers with similar product purchases can then be grouped into clusters. Depending on the strength of the similarity between customers, a network diagram can be drawn to visualize them. This provides a meaningful way of grouping customers so that different marketing campaigns and client relationship officers can be allocated to each group.
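As a minimal sketch of the idea above, the base R snippet below computes cosine similarity between customer columns and keeps the pairs above a cut-off as edges for a network diagram. The toy matrix, the 0.5 threshold and all variable names are assumptions made for illustration; they are not taken from the application.

```r
# Toy purchase matrix (hypothetical data): rows are products, columns are customers.
purchases <- matrix(
  c(2, 0, 1,
    4, 0, 2,
    0, 3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P1", "P2", "P3"), c("C1", "C2", "C3"))
)

# Cosine similarity between columns: dot products divided by the product of the column norms.
col_norms <- sqrt(colSums(purchases^2))
customer_sim <- crossprod(purchases) / outer(col_norms, col_norms)

# Keep each customer pair once (upper triangle) when its similarity exceeds a hypothetical cut-off.
threshold <- 0.5
pairs <- which(customer_sim > threshold & upper.tri(customer_sim), arr.ind = TRUE)
edges <- data.frame(
  from = colnames(customer_sim)[pairs[, "row"]],
  to = colnames(customer_sim)[pairs[, "col"]],
  similarity = customer_sim[pairs],
  stringsAsFactors = FALSE
)
```

Here customers C1 and C3 buy the same products in the same proportions, so they form the only edge, while C2 shares no products with either of them.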
2.2 Customer Segmentation
One of the aims of this application is to allow users to perform customer segmentation visually and interactively. RFM, an intuitive methodology, is applied in this project to strengthen the results of customer segmentation. RFM supports consumer analytics along three dimensions: recency (how recently the customer purchased), frequency (how often the customer purchases) and monetary (how much the customer spends).
The K-means clustering algorithm uses the RFM variables to group customers into clusters. By understanding the characteristics of the different customer groups, the business operator can tailor strategies to target each group.
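The RFM and K-means steps can be sketched in base R as follows. The transactions, the snapshot date, the column names and the choice of two clusters are all hypothetical and only serve to show the mechanics; the application itself may compute these differently.

```r
# Hypothetical transactions, loosely shaped like the data set described in the introduction.
tx <- data.frame(
  CustomerID  = c(1, 1, 2, 3, 3, 3, 4),
  InvoiceNo   = c("A1", "A2", "B1", "C1", "C2", "C3", "D1"),
  InvoiceDate = as.Date(c("2011-11-01", "2011-12-01", "2011-03-15",
                          "2011-10-10", "2011-11-20", "2011-12-05", "2011-01-20")),
  Quantity    = c(10, 5, 2, 20, 30, 25, 1),
  UnitPrice   = c(2.5, 4.0, 1.0, 3.0, 2.0, 2.0, 5.0),
  stringsAsFactors = FALSE
)
tx$Amount <- tx$Quantity * tx$UnitPrice

# Assumed reference date: the day after the last date covered by the data.
snapshot <- as.Date("2011-12-10")

# Recency: days since the last purchase; frequency: distinct invoices; monetary: total spend.
recency   <- tapply(tx$InvoiceDate, tx$CustomerID, function(d) as.numeric(snapshot - max(d)))
frequency <- tapply(tx$InvoiceNo, tx$CustomerID, function(x) length(unique(x)))
monetary  <- tapply(tx$Amount, tx$CustomerID, sum)

rfm <- data.frame(
  CustomerID = names(recency),
  Recency    = as.numeric(recency),
  Frequency  = as.numeric(frequency),
  Monetary   = as.numeric(monetary)
)

# Standardize the three variables so that none dominates the distance, then cluster.
set.seed(1)
km <- kmeans(scale(rfm[, c("Recency", "Frequency", "Monetary")]), centers = 2, nstart = 10)
rfm$Cluster <- km$cluster
```

Scaling before clustering is a design choice in this sketch: without it, the monetary values would outweigh recency and frequency simply because they are measured on a larger scale.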
2.3 Natural Language Processing
It is very important for management to understand customer preferences for different products, so Natural Language Processing is used to explore the popularity of products. By understanding which products are the most and least popular, and the properties of the popular products, management will be able to make smarter decisions that match customers’ preferences and thus improve revenue.
We use the “description” and “quantity” columns in the data set to conduct the Natural Language Processing. The description gives the product name, for example “WHITE METAL LANTERN”. The quantity gives the number of units of the product ordered.
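One simple way to relate description words to sales volume is to split each description into words and sum the ordered quantity per word. The base R sketch below illustrates this under stated assumptions: only “WHITE METAL LANTERN” comes from the data set, while the other descriptions, the quantities and the tokenization rule are hypothetical, and this is not necessarily the text-processing method used in the application.

```r
# Hypothetical order lines; only the first description is taken from the report.
items <- data.frame(
  Description = c("WHITE METAL LANTERN", "RED METAL LANTERN", "WHITE WOODEN FRAME"),
  Quantity    = c(6, 2, 4),
  stringsAsFactors = FALSE
)

# Split each description into lower-case words on any non-letter character.
tokens <- strsplit(tolower(items$Description), "[^a-z]+")

# Repeat each line's quantity once per word so that words carry the ordered volume.
word_qty <- data.frame(
  word = unlist(tokens),
  qty  = rep(items$Quantity, lengths(tokens)),
  stringsAsFactors = FALSE
)

# Total quantity per word, most popular first.
word_totals <- aggregate(qty ~ word, data = word_qty, FUN = sum)
word_totals <- word_totals[order(-word_totals$qty), ]
```

In this toy example “white” ranks first with 10 units, followed by “metal” and “lantern” with 8 units each.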
3 R Programming and usage of R Packages
We developed the application in Shiny, an open-source R package that provides a powerful framework for building web applications. For data cleaning and building the visualizations, we experimented with different R packages, and in this paper we introduce some of the key packages that we found useful.
3.1 Reducing Development Time on Shiny
As all team members wanted to contribute while keeping development time down, we adopted two initiatives to keep the development process running smoothly.
3.1.1 Git Server
The team set up a Git repository on GitHub to manage the development process. This is key to ensuring that code developed for one section does not conflict with the other sections. At the end of each development cycle, the team was able to merge the code with less hassle while keeping bugs in the application to a minimum.
3.1.2 R Shiny Folder Structure
Due to the number of people working on the code, it is essential to ensure that the code is fully readable. Therefore, we adopted the following folder structure so that each section of code sits in a separate R script:
[Figure: R Shiny Folder Structure.JPG]
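The actual layout is the one shown in the figure. Purely as an illustration of the one-script-per-section idea, the sketch below writes two hypothetical section scripts to a temporary folder and loads them from a main script with `source()`. The folder name, file names and function names are invented for this example.

```r
# Hypothetical layout: an "R" folder holding one script per section of the app.
app_dir <- file.path(tempdir(), "shiny_app_sketch")
dir.create(file.path(app_dir, "R"), recursive = TRUE, showWarnings = FALSE)

# Each team member would own one of these files (contents are placeholders).
writeLines("section_title <- function() 'Cosine similarity'",
           file.path(app_dir, "R", "cosine_similarity.R"))
writeLines("segment_title <- function() 'Customer segmentation'",
           file.path(app_dir, "R", "customer_segmentation.R"))

# The main script sources every section script, so each section stays in its own file.
section_files <- list.files(file.path(app_dir, "R"), pattern = "\\.R$", full.names = TRUE)
app_env <- new.env()
for (f in section_files) source(f, local = app_env)

loaded <- sort(ls(app_env))
```

Keeping each section in its own file also means that two people rarely edit the same file, which reduces merge conflicts in Git.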
3.2 Cosine Similarity
coop: There are 4,370 customers and 3,866 products. With a 4,370 by 3,866 matrix, computing the cosine similarity matrix for this data set is slow. With the “coop” R package, the process was noticeably quicker, because it operates directly on a matrix using native functions.
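A minimal sketch of the call is shown below. `coop::cosine()` computes cosine similarity between the columns of a numeric matrix; the random matrix, its size and the helper name `cosine_cols` are assumptions for illustration, and the helper falls back to an equivalent base R formula when “coop” is not installed.

```r
# Hypothetical stand-in for the purchase matrix: 200 products (rows) by 50 customers (columns).
set.seed(42)
m <- matrix(rpois(200 * 50, lambda = 0.3), nrow = 200, ncol = 50)
storage.mode(m) <- "double"  # make sure the matrix holds doubles rather than integers

# Hypothetical helper: use coop when available, otherwise the plain base R formula.
cosine_cols <- function(x) {
  if (requireNamespace("coop", quietly = TRUE)) {
    coop::cosine(x)
  } else {
    norms <- sqrt(colSums(x^2))
    crossprod(x) / outer(norms, norms)
  }
}

sim <- cosine_cols(m)  # 50 x 50 matrix of customer-to-customer similarities
```

Because similarity is computed between columns, the customers need to be the columns of the matrix passed in; transposing with `t()` would give product-to-product similarities instead.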
tidyr: The initial data set is in a long format. To convert it into a wide format table (i.e. by spreading the key-value pairs across multiple columns), we applied the “spread” function from the “tidyr” package. Without this package, it would be difficult to keep data formats consistent between tables.
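The long-to-wide step can be sketched as follows, assuming the “tidyr” package is installed. The toy data and column names are hypothetical. Note that `spread()` expects one row per product and customer pair, so repeated purchases would need to be summed beforehand, and that filling missing pairs with 0 is an assumption made here.

```r
library(tidyr)  # assumes tidyr is installed

# Hypothetical long-format data: one row per product and customer pair.
long <- data.frame(
  StockCode  = c("P1", "P1", "P2", "P3"),
  CustomerID = c("C1", "C3", "C1", "C2"),
  Quantity   = c(2, 1, 4, 3),
  stringsAsFactors = FALSE
)

# Spread customers across columns; pairs with no purchase are filled with 0.
wide <- spread(long, key = CustomerID, value = Quantity, fill = 0)
```

The result has one row per product and one column per customer, which is the shape needed before converting to a matrix.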
data.table: A key requirement to utilize the “coop” package is that the data set needs to be presented in the form of a matrix. To do this, we converted the data set into a data table. Subsequently, we converted this data table into a matrix with the “as.matrix” function.
One thing to note is that the values within the matrix need to be of a consistent data type. This means that the row identifiers must be set as the matrix’s row names rather than kept as a variable (column) of their own.
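The effect of the row names can be seen in a small sketch, assuming the “data.table” package is installed. The table contents and column names are hypothetical. Leaving the identifier as a column forces every value to character, whereas passing it through the `rownames` argument of `as.matrix()` keeps the matrix numeric.

```r
library(data.table)  # assumes data.table is installed

# Hypothetical wide table: one identifier column plus one numeric column per customer.
wide <- data.table(
  StockCode = c("P1", "P2", "P3"),
  C1 = c(2, 4, 0),
  C2 = c(0, 0, 3),
  C3 = c(1, 0, 0)
)

# Identifier kept as a variable: the whole matrix is coerced to character.
mixed <- as.matrix(wide)

# Identifier used as row names: the remaining values stay numeric.
m <- as.matrix(wide, rownames = "StockCode")
```

Only the second matrix is suitable as input for a cosine similarity computation, since the first one no longer contains numbers.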
