Group05 Proposal

From Visual Analytics and Applications

Visual Application for Time Series Clustering


Abstract

Time series clustering partitions time series data into groups based on similarity or distance, so that series in the same cluster are similar. Time-series datasets contain valuable information that can be obtained through pattern discovery, and clustering is a common way to uncover these patterns. Representing the cluster structures as visual images (visualization of time-series data) can help users quickly understand the structure of the data, clusters, anomalies, and other regularities in the datasets. Time series clustering offers a wide variety of strategies, including a family specific to the Dynamic Time Warping (DTW) distance. dtwclust is an R package that implements many algorithms specifically tailored to DTW. A great amount of effort went into implementing them as efficiently as possible, and the functions were designed with flexibility and extensibility in mind. As such, the functions in dtwclust are comparable to, if not better than, those of expensive commercial off-the-shelf analytical toolkits such as SAS Enterprise Miner. To date, however, use of the dtwclust package has tended to be confined to academic research because it requires intermediate R programming skills.

The project aims to provide a user-friendly interface to the dtwclust package using the R Shiny framework. The interface allows casual users to import, manage and explore data, and to calibrate, visualise and evaluate clusters, without having to type a single line of code. In addition, the application aims to incorporate graph visualization to enhance data exploration, to aid the interpretation of the clustering outputs, and to investigate the similarities or dissimilarities within each cluster.

Background

What is Time Series Clustering?

Clustering is a data analysis technique for organizing observed data (e.g. people, things, events, brands, companies) into meaningful taxonomies, groups or clusters without advance knowledge of the groups’ definitions. Clusters are formed based on combinations of input variables, maximizing the similarity of cases within each cluster while maximizing the dissimilarity between groups, which are initially unknown. Time-series clustering is a special type of clustering made to handle dynamic data, that is, data whose feature values change as a function of time. Time series pose challenging issues because of the large size and high dimensionality commonly associated with them.

Motivation

The current dtwclust package provides comprehensive functions that incorporate both classic time series clustering approaches and improved techniques introduced in the past few years. It is intended to provide a consistent and user-friendly way for users to apply time series clustering algorithms with different distance metrics and centroid algorithms, taking into consideration the nuances of time-series data.

However, the existing dtwclust package does not offer data preparation functions for time series clustering: converting raw transactional data into time-aggregated series requires data transformation with functions from other R packages. In addition, the output plots are based on default base R graphics, which can be improved in terms of visualisation and interactivity. Finally, the clustering results do not offer insights into the characteristics of each cluster, so users must merge the clustering results with other attributes themselves.

Objective

This project aims to integrate the dtwclust package with packages for data manipulation and packages that provide interactivity. The objective is to build a hassle-free interface that lets users perform time series clustering without the need to code, and visualize the clustering results more interactively in order to uncover patterns with potential use cases in their respective domains.

Visual Data Exploration

To gain a better understanding of the data, data exploration is a crucial step prior to performing any form of analysis. This application aims to provide users with multiple time aggregation options, and the aggregated value (i.e. frequency) for the selected time aggregation will be plotted accordingly.
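
As a rough illustration of what a time aggregation option involves, the sketch below counts raw event timestamps per day, ISO week or month. It is a language-neutral sketch written in Python, not the application's code; the function names `bucket_label` and `aggregate_counts` and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def bucket_label(ts, level):
    """Map a timestamp to the label of the time bucket it falls in."""
    if level == "day":
        return ts.strftime("%Y-%m-%d")
    if level == "week":
        # ISO calendar: weeks start on Monday and belong to the ISO year.
        iso_year, iso_week, _ = ts.isocalendar()
        return f"{iso_year}-W{iso_week:02d}"
    if level == "month":
        return ts.strftime("%Y-%m")
    raise ValueError(f"unsupported aggregation level: {level}")


def aggregate_counts(timestamps, level="month"):
    """Count events per bucket and return them in chronological order.

    Buckets with no events are simply absent; a real pipeline would
    probably fill them with zeros before clustering.
    """
    counts = Counter(bucket_label(ts, level) for ts in timestamps)
    # The zero-padded labels sort chronologically as plain strings.
    return dict(sorted(counts.items()))


# Hypothetical raw transactions, one timestamp per event.
raw_events = [
    datetime(2024, 1, 1),
    datetime(2024, 1, 1),
    datetime(2024, 1, 3),
    datetime(2024, 1, 9),
    datetime(2024, 2, 5),
]

monthly = aggregate_counts(raw_events, "month")
weekly = aggregate_counts(raw_events, "week")
```

The same raw events yield a different series at each level, which is why the choice of aggregation is worth exploring visually before clustering.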

Interactive Clustering Result

Even though the dtwclust package is mathematically robust and relatively computationally efficient, its output plots use the default base R plot function, which can be further improved in terms of visualization and interactivity. The visualization of the dendrogram from hierarchical clustering will be enhanced, and the time series charts for each cluster will be made interactive.

Visualization of Cluster Features

As with all types of clustering, it is important for users to understand the characteristics of the clusters by overlaying the clustering results with other attributes. The application supports this by providing placeholders where users select variables of interest, so that they can visually comprehend the clustering results.
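
The overlay described here amounts to joining cluster labels with attribute data and tabulating one attribute per cluster. The following is a minimal sketch of that join in Python, under the assumption that each series has an identifier; `profile_clusters`, the store identifiers and the `region` attribute are all hypothetical and not part of the application.

```python
from collections import Counter, defaultdict


def profile_clusters(labels, attributes, field):
    """Tabulate the values of one attribute within each cluster.

    labels:     dict mapping series id -> cluster id
    attributes: dict mapping series id -> dict of attribute values
    field:      name of the attribute to summarise
    """
    profile = defaultdict(Counter)
    for series_id, cluster in labels.items():
        # Series without the attribute are counted under None.
        value = attributes.get(series_id, {}).get(field)
        profile[cluster][value] += 1
    return {cluster: dict(counts) for cluster, counts in sorted(profile.items())}


# Hypothetical clustering output and attribute table.
labels = {"store_a": 1, "store_b": 1, "store_c": 2, "store_d": 2}
attributes = {
    "store_a": {"region": "north"},
    "store_b": {"region": "north"},
    "store_c": {"region": "south"},
    "store_d": {"region": "north"},
}

region_profile = profile_clusters(labels, attributes, "region")
```

A table of this shape is what a bar chart or mosaic plot per cluster would be drawn from.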

Packages Used

This dashboard mainly uses the following packages from R.
dtwclust:
The dtwclust package provides the functionality to choose the time-series representation, preprocessing and clustering algorithm, and includes implementations of recently developed time-series clustering algorithms and optimizations. It serves as a bridge between classical clustering algorithms and time-series data, additionally providing visualization and evaluation routines that can handle time-series.

Key Parameters of Time-Series Clustering

Time-series clustering is affected by several factors, such as the characteristics of the time series themselves and the choice of distance or centroid function. In this application, we focus only on the following key parameters.

Parameter: Algorithms
Type
  • Hierarchical Clustering
  • Partitional Clustering
Distances
  • Dynamic Time Warping (DTW)
  • Global Alignment Kernels (GAK)
  • Shape-Based Distance (SBD)
Centroid
  • DTW Barycenter Averaging (DBA)
  • Partitioning Around Medoids (PAM)
  • Shape Extraction (Shape)
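
To make the distance and centroid choices more concrete, the sketch below shows the textbook DTW recursion and a medoid selection of the kind PAM relies on. It is an illustration in Python under simplifying assumptions (absolute-difference local cost, no window constraint, unweighted steps) and is not the dtwclust implementation, whose defaults may differ; `dtw_distance` and `medoid` are hypothetical names.

```python
def dtw_distance(a, b):
    """Textbook DTW between two numeric sequences.

    Assumptions: absolute difference as local cost, no warping window,
    and equal weight for horizontal, vertical and diagonal steps.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def medoid(series_list, distance=dtw_distance):
    """Index of the series with the smallest total distance to all others."""
    totals = [sum(distance(s, t) for t in series_list) for s in series_list]
    return totals.index(min(totals))


# Two series with the same shape but different lengths, plus an outlier.
series = [[0, 1, 2], [0, 1, 1, 2], [5, 5, 5]]
```

Because DTW may repeat points, `dtw_distance([0, 1, 2], [0, 1, 1, 2])` is zero even though the lengths differ, which is the property that makes it attractive for series that are similar in shape but shifted or stretched in time. A medoid is always one of the input series, whereas DBA and shape extraction compute a new averaged series.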

Key Methods of Hierarchical Clustering

Within hierarchical clustering, several agglomeration methods are available for the user to select in R.

Agglomeration Method: Methods in R
Single-Linkage (single)
  • single - Nearest Neighbour clustering
Complete-Linkage (complete)
  • complete - Furthest Neighbour Sorting
Average Agglomerative Clustering
  • average - Unweighted Arithmetic Average Clustering (UPGMA)
  • mcquitty - Weighted Pair Group Method with Arithmetic Mean (WPGMA)
  • centroid - Unweighted Centroid Clustering (UPGMC)
  • median - Weighted Centroid Clustering (WPGMC)
Ward’s Minimum Variance
  • ward.D – Does not implement Ward’s (1963) clustering criterion
  • ward.D2 – Implements that criterion (Murtagh and Legendre 2014)
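
To show how the choice of agglomeration method can change the result, the sketch below implements a naive bottom-up merge over a precomputed distance matrix for single, complete and average (UPGMA) linkage only. It is an illustrative Python sketch, not R's implementation; `agglomerate` and the sample points are hypothetical.

```python
def agglomerate(dist, k, linkage="average"):
    """Merge clusters bottom-up until k clusters remain.

    dist:    symmetric matrix (list of lists) of pairwise distances
    linkage: 'single', 'complete' or 'average' (UPGMA)
    """
    combiners = {
        "single": min,                                       # nearest pair
        "complete": max,                                     # furthest pair
        "average": lambda values: sum(values) / len(values)  # mean of all pairs
    }
    combine = combiners[linkage]
    clusters = [[i] for i in range(len(dist))]

    def between(c1, c2):
        return combine([dist[i][j] for i in c1 for j in c2])

    while len(clusters) > k:
        # Find the closest pair of clusters; ties keep the first pair found.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = between(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Four points on a line; distances are absolute differences.
points = [0.0, 2.0, 4.5, 7.5]
dist = [[abs(p - q) for q in points] for p in points]

single_result = agglomerate(dist, 2, "single")
complete_result = agglomerate(dist, 2, "complete")
```

On these points single linkage chains the third point onto the first two, giving clusters {0, 1, 2} and {3}, while complete linkage pairs the points as {0, 1} and {2, 3}. In the application the matrix would hold a time-series distance such as DTW rather than differences between points.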
