Group11-Team AYE: Analysis



Data Compilation

We have identified the following sets of data to be useful in understanding C&K’s local market in China:

  • Demographic
      • Average age of country/region
      • Income
      • % of Female by Age Category
      • Population Size
  • Economic
      • GDP
      • GDP per capita

We will obtain these data sets for all the cities where C&K has a foothold. The data will be extracted from the China Statistical Yearbook Database and China Knowledge.


Data Preparation

Data preprocessing is required for accurate insight and knowledge discovery. Data preparation operations such as reducing the number of attributes, outlier detection and discretization are performed to improve the accuracy of the models that we will use for analysis.

  • Dimensionality Reduction

First, we will familiarise ourselves with the dataset provided. As it will consist of store transactions from C&K’s China market, the fields are likely to be in Mandarin, so we will first translate them into English. Thereafter, we will perform dimensionality reduction by keeping the data fields or attributes that are relevant to our analysis and eliminating those that are not.
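The translate-then-select step can be sketched roughly as follows. The Mandarin headers, their English names and the list of relevant fields are all hypothetical, since the actual fields in C&K’s transaction data are not known at this stage.

```python
# Hypothetical Mandarin headers mapped to English names; the real field
# names in the transaction data are an assumption made for illustration.
FIELD_TRANSLATIONS = {
    "门店": "store",
    "日期": "date",
    "商品编号": "sku",
    "金额": "amount",
    "备注": "remarks",
}

# Assumed subset of fields worth keeping for the analysis.
RELEVANT_FIELDS = ["store", "date", "sku", "amount"]


def translate_and_select(record):
    """Rename known fields to English, then keep only the relevant ones."""
    translated = {FIELD_TRANSLATIONS.get(k, k): v for k, v in record.items()}
    return {k: translated[k] for k in RELEVANT_FIELDS if k in translated}


row = {"门店": "Beijing-01", "日期": "2016-01-05", "商品编号": "SKU123",
       "金额": 399.0, "备注": "gift wrap"}
clean_row = translate_and_select(row)
```

Fields without a translation keep their original name, so nothing is silently lost before the selection step.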

  • Filling and Handling

We will also deal with missing data before conducting our analysis. Where appropriate, missing values will be filled in using a suitable approach; otherwise, the affected records will be removed. Outliers and inaccurate values will be handled or removed from the dataset as well.
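As a minimal sketch of this step, the example below fills missing values with the median and drops values outside the 1.5 × IQR fences. Both choices are assumptions made for illustration, not methods the project has settled on, and the sample amounts are invented.

```python
import statistics


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Invented transaction amounts: one missing value and one extreme value.
amounts = [120.0, 150.0, None, 135.0, 140.0, 5000.0, 130.0]
filled = fill_missing_with_median(amounts)
low, high = iqr_bounds(filled)
kept = [v for v in filled if low <= v <= high]
```

Whether an extreme value is an error or a genuine large purchase would still need to be checked against the data before it is removed.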

  • Transformation

Transformation of the attributes’ values, such as a log transformation, may be required if the range of the provided data is too wide or if no obvious trends or clusters can be identified.
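A small illustration of how a log transformation compresses a wide range, using invented spend values. The use of `log1p` (rather than a plain log) is an assumption, chosen so that zero values remain valid.

```python
import math

# Invented spend values spanning several orders of magnitude.
spend = [50.0, 200.0, 1_000.0, 25_000.0]

# log1p(x) = log(1 + x); it stays defined when a value is exactly 0.
log_spend = [math.log1p(x) for x in spend]

# The raw range is 24,950, while the transformed range is only a few units.
raw_range = max(spend) - min(spend)
log_range = max(log_spend) - min(log_spend)
```

The transformation preserves the ordering of the values and can be reversed with `math.expm1`.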

  • Discretization/ Binning

Continuous attributes should be encoded by discretizing the original values into a small number of value ranges, or bins, as these provide more meaning to the analysis. The main variables that we are focusing on are location, age and RFM categorisation.

After that, we will categorize the dataset based on the store location where the transactions were made. We will classify the stores using the China City Tier System. For example, Beijing will be classified as a Tier 1 city, and all the stores located in Tier 1 cities will be grouped together. This classification gives us the various Tiered City Clusters.
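The grouping could look something like the sketch below. Only Beijing’s Tier 1 status comes from the text above; the other cities, the store codes and the amounts are placeholders, and the mapping is not an authoritative tier list.

```python
from collections import defaultdict

# Illustrative mapping only: Beijing as Tier 1 comes from the text;
# "City B" and "City C" are placeholders, not real classifications.
CITY_TIER = {"Beijing": 1, "City B": 2, "City C": 3}

# Invented transactions for illustration.
transactions = [
    {"store": "BJ-01", "city": "Beijing", "amount": 399.0},
    {"store": "BJ-02", "city": "Beijing", "amount": 259.0},
    {"store": "B-01", "city": "City B", "amount": 199.0},
    {"store": "X-01", "city": "Unknown City", "amount": 99.0},
]


def group_by_tier(rows, city_tier):
    """Group transactions into tier clusters; unmapped cities go under None."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[city_tier.get(row["city"])].append(row)
    return dict(clusters)


tier_clusters = group_by_tier(transactions, CITY_TIER)
```

Keeping unmapped cities under a separate `None` key makes gaps in the tier mapping visible instead of silently dropping those transactions.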

Our group believes that this method of classification is a robust way of grouping C&K’s stores in China, as it takes into account each city’s population size, Gross Domestic Product (GDP), average economic growth, connectivity as a transportation hub, and historical and cultural significance. Furthermore, based on theories of consumer behaviour, customers’ buying preferences differ according to factors such as level of disposable income, lifestyle and economic environment, which are largely characterised by the tier of city one lives in. For example, the buying attitudes of a consumer in an urban metropolis may be different from those of a consumer in a provincial capital. (Assumption: a customer who purchased from a store in Beijing is likely to live and work in Beijing.) Hence, slicing the data into specific local categories will give our group a better understanding of consumer behaviour and trends.

Next, we will bin the data into different categories. Based on the product category list extracted from C&K’s e-commerce site, we will bin the different SKUs into their respective product categories. Furthermore, based on the interquartile range of product prices within each product category, we will bin the products by price range.
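One possible reading of the price binning is sketched below: within each product category, prices below the first quartile are labelled "low", prices above the third quartile "high", and everything inside the interquartile range "mid". The three labels, the SKUs, the category and the prices are all assumptions made for illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical SKUs, category and prices for illustration.
products = [
    ("SKU1", "bags", 40.0), ("SKU2", "bags", 60.0), ("SKU3", "bags", 80.0),
    ("SKU4", "bags", 100.0), ("SKU5", "bags", 120.0), ("SKU6", "bags", 140.0),
    ("SKU7", "bags", 160.0),
]


def price_bands(items):
    """Label each SKU 'low', 'mid' or 'high' using its category's Q1 and Q3."""
    by_category = defaultdict(list)
    for _, category, price in items:
        by_category[category].append(price)
    # Quartile cut points per category (needs at least two prices each).
    cuts = {}
    for category, prices in by_category.items():
        q1, _, q3 = statistics.quantiles(prices, n=4)
        cuts[category] = (q1, q3)
    bands = {}
    for sku, category, price in items:
        q1, q3 = cuts[category]
        bands[sku] = "low" if price < q1 else "high" if price > q3 else "mid"
    return bands


bands = price_bands(products)
```

Because the cut points are computed per category, a price that counts as "high" in one category may fall in the "mid" band of another.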


Data Analysis

Market Basket Analysis (MBA)
MBA is a collection of undirected data mining methods for discovering customer purchasing patterns by finding associations between different items in customers’ shopping carts. This project will focus on the Apriori algorithm as a means of identifying actionable rules in the in-store transaction data provided by C&K.
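To make the idea concrete, here is a minimal, unoptimised sketch of Apriori in plain Python. It finds itemsets whose support meets a threshold, then derives rules whose confidence meets a second threshold. The function names, thresholds and sample baskets are illustrative assumptions; an actual analysis would more likely rely on an established data mining tool.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}
    frequent = {}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) tuples from frequent itemsets."""
    out = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    out.append((antecedent, itemset - antecedent, confidence))
    return out


# Invented baskets at the product-category level.
baskets = [
    {"bag", "wallet"},
    {"bag", "wallet", "shoes"},
    {"bag", "shoes"},
    {"wallet"},
    {"bag", "wallet"},
]
frequent = apriori(baskets, min_support=0.4)
found_rules = rules(frequent, min_confidence=0.7)
```

In this toy data, every basket containing shoes also contains a bag, so the rule shoes → bag has confidence 1, while {wallet, shoes} appears in only one of five baskets and is dropped for low support.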

MBA will be conducted at different data levels in an attempt to discover different customer purchasing patterns. The levels this project hopes to explore are as follows:

  • Product Category
  • Tiers of City
  • Product Materials
  • Product Price Range