Difference between revisions of "Group05 Proposal"

Revision as of 19:13, 30 November 2018

Visual Application for Time Series Clustering

1 Abstract
2 Background
- 2.1 What is Time Series Clustering?
- 2.2 Motivation
3 Objective
- 3.1 To build an application for Time-Series Clustering
- 3.2 To improve functions of current R packages
4 Packages Used
- 4.1 Key Parameters of Time-Series Clustering
- 4.2 Key Methods of Hierarchical Clustering
5 Reference

Abstract

Time series clustering partitions time series data into groups based on similarity or distance, so that time series in the same cluster are similar. Time-series datasets contain valuable information that can be obtained through pattern discovery, and clustering is a common way to uncover these patterns. Representing the time-series cluster structures as visual images (visualization of time-series data) can help users quickly understand the structure of the data, clusters, anomalies, and other regularities in datasets. Time series clustering offers a wide variety of strategies, including a family specific to Dynamic Time Warping (DTW) distance. dtwclust is an R package that implements many algorithms specifically tailored to DTW. A great amount of effort went into implementing them as efficiently as possible, and the functions were designed with flexibility and extensibility in mind. As such, the functions of dtwclust are comparable, if not superior, to those of expensive commercial off-the-shelf analytical toolkits such as SAS Enterprise Miner. However, to date, use of the dtwclust package tends to be confined to academic research because it requires intermediate R programming skills.

The project aims to provide a user-friendly interface to the dtwclust package using the R Shiny framework. The interface allows casual users to import and manage data and to explore, calibrate, visualise and evaluate clusters without having to type a single line of code. In addition, the application aims to incorporate graph visualization to enhance data exploration, to aid the interpretability of the cluster outputs, and to investigate the similarities or dissimilarities within each cluster.

Background

What is Time Series Clustering?

Clustering is a data analysis technique for organizing observed data (e.g. people, things, events, brands, companies) into meaningful taxonomies, groups or clusters without advance knowledge of the groups’ definitions. Clusters are formed based on combinations of input variables so as to maximize the similarity of cases within each cluster while maximizing the dissimilarity between groups, which are initially unknown. Time-series clustering is a special type of clustering made to handle dynamic data, that is, data whose feature values change as a function of time. Time series pose some challenging issues because of the large size and high dimensionality commonly associated with them.

Motivation

Currently, a few comprehensive time series clustering packages are available in R. They are mathematically robust, but their output plots are based on default base R graphics, which can be improved in terms of visualisation. This project aims to provide an interface that lets users apply time series clustering to time-related data, so that they can perform clustering analysis without needing to code and can visualise the results in a more interactive manner. Bike sharing data from Citibike for New York City will be used as a case study for this application.

Objective

To build an application for Time-Series Clustering

Time-series data are of interest because of their ubiquity in areas ranging from science, engineering, business, economics and healthcare to government. This dashboard aims to allow users to perform time series clustering on time-series data and uncover patterns that have potential use cases in the respective domain.

To improve functions of current R packages

The current dtwclust package is intended to provide a consistent and user-friendly way of interacting with classic and new clustering algorithms, taking into consideration the nuances of time-series data. However, the focus of the R package is on robust mathematical algorithms, and its output plots are based on default base R graphics. By building this R Shiny application, we aim to improve the application and visualization of time series clustering. Specifically, we will improve the following visualizations:

Interactive Time-Series Plot
Dendrogram Visualization

Packages Used

This dashboard mainly uses the following packages from R.
dtwclust:
The dtwclust package provides the functionality to choose the time-series representation, preprocessing and clustering algorithm, and includes implementations of recently developed time-series clustering algorithms and optimizations. It serves as a bridge between classical clustering algorithms and time-series data, additionally providing visualization and evaluation routines that can handle time-series.

Key Parameters of Time-Series Clustering

Time-series clustering is affected by several factors, such as the characteristics of the time series themselves and the choice of distance or centroid function. In this application, we will focus only on the following key parameters.

Parameters	Algorithm
Type	Hierarchical Clustering; Partitional Clustering
Distances	Dynamic Time Warping (DTW); Global Alignment Kernels (GAK); Shape-Based Distance (SBD)
Centroid	DTW Barycenter Averaging (DBA); Partitioning Around Medoids (PAM); Shape Extraction (Shape)
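
To make the DTW entry in the table above more concrete, the following is a minimal sketch of the textbook DTW recurrence, written in plain Python for illustration. It is not the dtwclust implementation, which is in R and uses its own step patterns and optimizations. The function name `dtw_distance` and the three toy series are hypothetical and are not taken from the project's data.

```python
import math


def dtw_distance(a, b):
    """Textbook unconstrained DTW with absolute difference as the local cost."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays on the same point
                cost[i][j - 1],      # b advances, a stays on the same point
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# Hypothetical toy series: series_b is a time-shifted copy of series_a,
# series_c is a flat line.
series_a = [0, 1, 2, 1, 0]
series_b = [0, 0, 1, 2, 1, 0]
series_c = [2, 2, 2, 2, 2]

print(dtw_distance(series_a, series_b))  # 0.0: the shift is absorbed by warping
print(dtw_distance(series_a, series_c))  # 6.0: the shapes genuinely differ
```

The shifted copy has distance zero even though the two series differ in length, which is the property that makes DTW attractive for comparing time series whose patterns are similar but not aligned in time.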

Key Methods of Hierarchical Clustering

Within hierarchical clustering, several agglomeration methods are available in R for the user to select.

Agglomeration Method	Methods in R
Single-Linkage (single)	single - Nearest Neighbour Clustering
Complete-Linkage (complete)	complete - Furthest Neighbour Sorting
Average Agglomerative Clustering	average - Unweighted Arithmetic Average Clustering (UPGMA); mcquitty - Weighted Pair Group Method with Arithmetic Mean (WPGMA); centroid - Unweighted Centroid Clustering (UPGMC); median - Weighted Centroid Clustering (WPGMC)
Ward’s Minimum Variance	ward.D - Does not implement Ward’s (1963) clustering criterion; ward.D2 - Implements that criterion (Murtagh and Legendre 2014)
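
The effect of the agglomeration method can be illustrated with a small sketch. The plain-Python code below covers only the single, complete and average methods from the table, operating on a precomputed dissimilarity matrix. It is an illustration under stated assumptions rather than the algorithm used by R: the helper names `linkage_distance` and `agglomerate` and the toy matrix are hypothetical, and ties between equally close cluster pairs are broken by search order.

```python
def linkage_distance(dist, cluster_a, cluster_b, method):
    """Distance between two clusters under a given agglomeration method."""
    pairwise = [dist[i][j] for i in cluster_a for j in cluster_b]
    if method == "single":      # nearest pair of members
        return min(pairwise)
    if method == "complete":    # furthest pair of members
        return max(pairwise)
    if method == "average":     # unweighted mean over all pairs (UPGMA)
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unsupported method: {method}")


def agglomerate(dist, k, method="average"):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage_distance(dist, clusters[x], clusters[y], method)
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical dissimilarities between four series, derived from toy
# positions on a line purely so the numbers are easy to check by hand.
positions = [0, 10, 30, 52]
dist = [[abs(p - q) for q in positions] for p in positions]

print(agglomerate(dist, 2, "single"))    # [[0, 1, 2], [3]]
print(agglomerate(dist, 2, "complete"))  # [[0, 1], [2, 3]]
```

On the same dissimilarities, single linkage chains the third series onto the first two, while complete linkage pairs it with the fourth, which is why the choice of method is exposed to the user as a parameter.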
