Difference between revisions of "Group05 Proposal"
| Ivygoh.2017 (talk | contribs)  | |||
Revision as of 19:13, 30 November 2018
Abstract
Time series clustering partitions time series data into groups based on similarity or distance, so that series in the same cluster are similar. Time-series datasets contain valuable information that can be obtained through pattern discovery, and clustering is a common way to uncover these patterns. Representing the cluster structures as visual images (visualization of time-series data) can help users quickly understand the structure of the data, clusters, anomalies, and other regularities in datasets. Time series clustering offers a wide variety of strategies, including a family of methods specific to the Dynamic Time Warping (DTW) distance. dtwclust is an R package that implements many algorithms specifically tailored to DTW. A great amount of effort went into implementing them as efficiently as possible, and the functions were designed with flexibility and extensibility in mind. As such, the functions of dtwclust are comparable, if not superior, to those of expensive commercial off-the-shelf analytical toolkits such as SAS Enterprise Miner. However, to date, the usage of the dtwclust package tends to be confined to academic research, as it requires intermediate R programming skills.
The project aims to provide a user-friendly interface to the dtwclust package using the R Shiny framework. The interface allows casual users to import data and to manage, explore, calibrate, visualise and evaluate clusters without having to type a single line of code. In addition, the application aims to incorporate graph visualization to enhance data exploration, to aid the interpretability of the clustering outputs and to investigate the similarities or dissimilarities within each cluster.
Background
What is Time Series Clustering?
Clustering is a data analysis technique for organizing observed data (e.g. people, things, events, brands, companies) into meaningful taxonomies, groups or clusters without advance knowledge of the groups’ definitions. Clusters are formed based on combinations of input variables, maximizing the similarity of cases within each cluster while maximizing the dissimilarity between groups that are initially unknown. Time-series clustering is a special type of clustering made to handle dynamic data, that is, data whose feature values change as a function of time. Time series pose some challenging issues due to the large size and high dimensionality commonly associated with them.
Motivation
Currently, there are a few comprehensive time series clustering packages available in R which are mathematically robust. However, their output plots are based on default base R graphics, which can be improved in terms of visualisation. This project aims to provide an interface for users to apply time series clustering to time-related data, so that they can perform clustering analysis without the need to code and can visualise the results in a more interactive manner. Bike sharing data from Citibike for New York City will be used as a case study for this application.
Objective
To build an application for Time-Series Clustering
Time-series data are of interest due to their ubiquity in areas ranging from science, engineering, business, economics and healthcare to government. This dashboard aims to allow users to perform time series clustering on time-series data to uncover patterns which have potential use cases in the respective domains.
To improve functions of current R packages
The current dtwclust package is intended to provide a consistent and user-friendly way of interacting with classic and new clustering algorithms, taking into consideration the nuances of time-series data. However, the focus of the R package is on robust mathematical algorithms, and its output plots are based on default base R graphics. As such, by building this R Shiny application, we aim to improve the application and visualization of time series clustering, specifically:
- Interactive Time-Series Plot
- Dendrogram Visualization
Packages Used
This dashboard mainly uses the following packages from R. 
dtwclust:
The dtwclust package provides the functionality to choose the time-series representation, preprocessing and clustering algorithm, and includes implementations of recently developed time-series clustering algorithms and optimizations. It serves as a bridge between classical clustering algorithms and time-series data, additionally providing visualization and evaluation routines that can handle time-series. 
Key Parameters of Time-Series Clustering
Time-series clustering is affected by several factors, such as the characteristics of the time series themselves and the choice of distance or centroid function. In this application, we will focus only on the following key parameters.
| Parameters | Algorithm |
|---|---|
| Type | |
| Distances | |
| Centroid | DTW Barycenter Averaging (DBA), Partitioning Around Medoids (PAM), Shape Extraction (Shape) |
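To make the role of the distance parameter concrete, the following is a minimal sketch of the classic DTW dynamic-programming recurrence next to a plain Euclidean distance. It is written in plain Python for illustration only and is not the dtwclust implementation; the function names, the absolute-difference local cost and the toy series are all assumptions.

```python
from math import inf


def dtw_distance(a, b):
    """Unconstrained DTW with an absolute-difference local cost (assumed here)."""
    n, m = len(a), len(b)
    # cost[i][j] holds the cheapest cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the best of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def euclidean_distance(a, b):
    """Point-by-point distance; requires series of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Two hypothetical usage profiles with the same peak, shifted by one time step.
early_peak = [0, 1, 5, 1, 0, 0]
late_peak = [0, 0, 1, 5, 1, 0]

print(dtw_distance(early_peak, late_peak))        # 0.0: warping absorbs the shift
print(euclidean_distance(early_peak, late_peak))  # about 5.83: the shift is penalised
```

The toy series show why a warping-based distance may suit usage profiles whose peaks occur at slightly different times: the two profiles have the same shape, and only the lock-step Euclidean distance treats them as far apart.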
Key Methods of Hierarchical Clustering
Within hierarchical clustering itself, there are several agglomeration methods available for the user to select in R.
| Agglomeration Method | Methods in R |
|---|---|
| Single-Linkage (single) | |
| Complete-Linkage (complete) | |
| Average Agglomerative Clustering | |
| Ward’s Minimum Variance | |
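As a rough illustration of how the choice of agglomeration method changes the result, the sketch below implements a naive bottom-up merge loop for single, complete and average linkage on a precomputed distance matrix. It is plain Python rather than R's own routines; the function name `agglomerate`, the linkage labels and the toy points are assumptions, and Ward's method is left out because it needs a different update rule.

```python
def agglomerate(dist, k, linkage="average"):
    """Merge clusters bottom-up until k remain.

    dist is a symmetric matrix of pairwise distances (for time series these
    could be DTW distances); clusters are returned as lists of row indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def between(c1, c2):
        # Distance between two clusters under the chosen linkage.
        pairwise = [dist[i][j] for i in c1 for j in c2]
        if linkage == "single":
            return min(pairwise)
        if linkage == "complete":
            return max(pairwise)
        if linkage == "average":
            return sum(pairwise) / len(pairwise)
        raise ValueError(f"unsupported linkage: {linkage}")

    while len(clusters) > k:
        # Find the closest pair of clusters and merge them.
        _, p, q = min(
            (between(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return clusters


# Toy 1-D points standing in for series; the matrix holds absolute differences.
points = [0.0, 1.0, 2.1, 3.3, 5.5, 6.4]
dist = [[abs(x - y) for y in points] for x in points]

print(agglomerate(dist, 3, "single"))    # [[0, 1, 2], [3], [4, 5]]
print(agglomerate(dist, 3, "complete"))  # [[0, 1], [2, 3], [4, 5]]
```

On these points single linkage chains the first three observations together and leaves one point alone, while complete linkage produces three pairs, which is the kind of difference a dendrogram view is meant to expose.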

