Difference between revisions of "Group05 Proposal"
Ivygoh.2017 (talk | contribs) |
Ivygoh.2017 (talk | contribs) |
||
Line 27: | Line 27: | ||
==Abstract== | ==Abstract== | ||
− | + | ||
+ | Time series clustering is to partition time series data into groups based on similarity or distance, so that time series in the same cluster are similar. Time-series datasets contain valuable information that can be obtained through pattern discovery. Clustering is a common solution performed to uncover these patterns on time-series datasets. It represents the time-series cluster structures as visual images (visualization of time-series data) can help users quickly understand the structure of data, clusters, anomalies, and other regularities in datasets. | ||
+ | Time series clustering has a wide variety of strategies and a series specific to Dynamic Time Warping (DTW) distance. The <b>dtwclust</b> is a package of R statistical software so that have many of the algorithm implemented in this package that are specifically tailored to DTW. A great amount of effort went into implementing them as efficiently as possible, and the functions were designed with flexibility and extensibility in mind. As such, the <b>dtwclust</b> is a package with its functions comparable to, if not more superior than the expensive commercial-of-the-shelves analytical toolkit such as SAS Enterprise Miner. However, till date, the usage of <b>dtwclust</b> package tends to be confined within academic research as it required intermediate R programming skill. | ||
+ | In view of this limitation, our project seeks to provide an user-interface to <b>dtwclust</b> package by using R Shiny framework. The user-friendly interface design allows casual users to manage, explore, calibrate and visualise complex items mining and association rules mining models without having to type a single line of code. Besides providing user-friendly interface, our application also incorporates an interactive graph visualisation method to enhance the interpretability of the outputs of frequent itemsets mining and association rules mining algorithms. | ||
+ | |||
+ | This presentation consists of five main sections. Firstly, the motivation and objectives of the project will be discussed. This is followed by a detailed discussion on the principles and concepts of association rule mining and the R packages used to perform association rules mining, the arules family of packages. Thirdly, the application and visualization design with respect to the improvements made to the arules visualization packages will be discussed. Following which, we will demonstrate the flexible use of our application with two different use cases. The presentation will conclude with a sharing of valuable insights gained through working on the project and potential application areas of our application. | ||
==Background on Time Series Clustering== | ==Background on Time Series Clustering== | ||
+ | |||
===What is Time Series Clustering?=== | ===What is Time Series Clustering?=== | ||
Clustering is a data analysis technique for organizing observed data (e.g. people, things, events, brands, companies) into meaningful taxonomies, groups or clusters without advanced knowledge of the groups’ definition. Clusters are formed based on combinations of input variable, which maximizes the similarity of cases within each cluster while maximizing the dissimilarity between groups that are initially unknows. | Clustering is a data analysis technique for organizing observed data (e.g. people, things, events, brands, companies) into meaningful taxonomies, groups or clusters without advanced knowledge of the groups’ definition. Clusters are formed based on combinations of input variable, which maximizes the similarity of cases within each cluster while maximizing the dissimilarity between groups that are initially unknows. | ||
− | + | Time-series clustering is a type of clustering algorithm made to handle dynamic data. It is a special type of clustering is time-series clustering, which is essentially dynamic data as its feature values changes as a function of time. They pose some challenging issues due to large size and high dimensionality commonly associated with time-series. | |
+ | |||
===Key Parameters of Time-Series Clustering=== | ===Key Parameters of Time-Series Clustering=== | ||
+ | |||
{| class="wikitable" | {| class="wikitable" | ||
|- | |- | ||
Line 43: | Line 51: | ||
|- | |- | ||
| Distances|| | | Distances|| | ||
− | * Dynamic Time | + | * Dynamic Time Warping (DTW) |
* Global Alignment Kernels (GAK) | * Global Alignment Kernels (GAK) | ||
* Shape-Based Distance (SBD) | * Shape-Based Distance (SBD) | ||
Line 52: | Line 60: | ||
* Shape Averaging (Shape) | * Shape Averaging (Shape) | ||
|} | |} | ||
+ | |||
+ | ===Key Methods of Hierarchical Clustering=== | ||
+ | |||
+ | {| class="wikitable" | ||
+ | |- | ||
+ | ! Agglomeration Method !! Method in R | ||
+ | |- | ||
+ | | Single-Linkage (single)|| | ||
+ | * single - Nearest Neighbour clustering | ||
+ | | Complete-Linkage (complete)|| | ||
+ | * complete - Furthest Neighbour Sorting | ||
+ | |Average Agglomerative Clustering || | ||
+ | * average - Unweighted Arithmetic Average Clustering (UPGMA) | ||
+ | * mcquitty - Weighted Pair Group Method with Arithmetic Mean (WPGMA) | ||
+ | * centroid - Unweighted Centroid Clustering (UPGMC) | ||
+ | * median - Weighted Centroid Clustering (WPGMC) | ||
+ | | Ward’s Minimum Variance || | ||
+ | * ward.D – Does not implement Ward’s (1963) clustering criterion | ||
+ | * ward.D2 – Implements that criterion (Murtagh and Legendre 2014) | ||
+ | |} | ||
+ | |||
==Packages Used== | ==Packages Used== | ||
+ | |||
This dashboard mainly uses dtwclust package from R. | This dashboard mainly uses dtwclust package from R. | ||
<b><big>dtwclust:</big></b><br> | <b><big>dtwclust:</big></b><br> | ||
− | + | ||
+ | The <b>dtwclust</b> package provides the functionality to choose the time-series representation, preprocessing and clustering algorithm, and includes implementations of recently developed time-series clustering algorithms and optimizations. It serves as a bridge between classical clustering algorithms and time-series data, additionally providing visualization and evaluation routines that can handle time-series. | ||
==Objective== | ==Objective== | ||
− | ===To build an application for Time Series Clustering=== | + | |
+ | ===To build an application for Time-Series Clustering=== | ||
Time-series data are of interest due to their ubiquity in various areas ranging from science, engineering, business, economics, healthcare, to government. This dashboard aims to allow user to do time series clustering on time series related data to uncover patterns which have potential use case in the respective domain. | Time-series data are of interest due to their ubiquity in various areas ranging from science, engineering, business, economics, healthcare, to government. This dashboard aims to allow user to do time series clustering on time series related data to uncover patterns which have potential use case in the respective domain. | ||
+ | |||
==Reference== | ==Reference== | ||
− | |||
[[File:Time Series Clustering A Decade Review.pdf|none|A journal by Saeed Aghabozorgi, Ali Seyed Shirkhorshidi and Teh Ying Wah]] | [[File:Time Series Clustering A Decade Review.pdf|none|A journal by Saeed Aghabozorgi, Ali Seyed Shirkhorshidi and Teh Ying Wah]] | ||
+ | [[ISSS608_2016-17_T3_Group8_Arules_Project Proposal|The Arules]] |
Revision as of 18:59, 20 November 2018
Abstract
Time series clustering partitions time series data into groups based on similarity or distance, so that time series in the same cluster are similar. Time-series datasets contain valuable information that can be obtained through pattern discovery, and clustering is a common way to uncover these patterns. Representing the time-series cluster structures as visual images (visualization of time-series data) can help users quickly understand the structure of the data, clusters, anomalies, and other regularities in datasets.
Time series clustering has a wide variety of strategies, including a family specific to the Dynamic Time Warping (DTW) distance. dtwclust is an R package that implements many algorithms specifically tailored to DTW. A great amount of effort went into implementing them as efficiently as possible, and the functions were designed with flexibility and extensibility in mind. As such, the functions of dtwclust are comparable, if not superior, to those of expensive commercial off-the-shelf analytical toolkits such as SAS Enterprise Miner. However, to date, use of the dtwclust package tends to be confined to academic research because it requires intermediate R programming skills.
In view of this limitation, our project seeks to provide a user interface to the dtwclust package using the R Shiny framework. The user-friendly interface design allows casual users to manage, explore, calibrate and visualise complex items mining and association rules mining models without having to type a single line of code. Besides providing a user-friendly interface, our application also incorporates an interactive graph visualisation method to enhance the interpretability of the outputs of frequent itemsets mining and association rules mining algorithms.
This presentation consists of five main sections. First, the motivation and objectives of the project will be discussed. Second, we give a detailed discussion of the principles and concepts of association rule mining and of the R packages used to perform it, the arules family of packages. Third, the application and visualization design will be discussed with respect to the improvements made to the arules visualization packages. Fourth, we will demonstrate the flexible use of our application with two different use cases. The presentation will conclude by sharing insights gained through working on the project and potential application areas of our application.
Background on Time Series Clustering
What is Time Series Clustering?
Clustering is a data analysis technique for organizing observed data (e.g. people, things, events, brands, companies) into meaningful taxonomies, groups or clusters without advance knowledge of the groups’ definitions. Clusters are formed from combinations of input variables so as to maximize the similarity of cases within each cluster while maximizing the dissimilarity between groups, which are initially unknown. Time-series clustering is a special type of clustering designed to handle dynamic data, that is, data whose feature values change as a function of time. Time series pose some challenging issues because of the large size and high dimensionality commonly associated with them.
Key Parameters of Time-Series Clustering
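Parameters | Algorithm |
---|---|
Type | |
Distances | Dynamic Time Warping (DTW); Global Alignment Kernels (GAK); Shape-Based Distance (SBD) |
Centroid | Shape Averaging (Shape) |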
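To make the idea of a warping-based distance concrete, the following is a minimal sketch of the classic dynamic-programming formulation of Dynamic Time Warping (DTW), the distance that dtwclust is built around. It is written in plain Python for illustration only and is not the dtwclust implementation; the function name `dtw_distance`, the absolute-difference local cost, and the example series are assumptions made for this sketch.

```python
from math import inf


def dtw_distance(a, b):
    """Unconstrained DTW between two numeric sequences.

    Illustrative sketch only: uses |x - y| as the local cost and no
    warping window, which may differ from what a real package does.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the cheapest cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its point
                cost[i][j - 1],      # b advances, a repeats its point
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# Hypothetical example: the second series is the first with its opening
# value held for one extra time step.
original = [0, 1, 2, 1, 0]
stretched = [0, 0, 1, 2, 1, 0]
print(dtw_distance(original, stretched))  # 0.0
```

The two example series have different lengths, so a point-by-point comparison is not even defined, yet DTW reports a distance of zero because the warping path absorbs the repeated value. This tolerance to shifts and stretches in time is the reason DTW-based distances are popular for clustering time series.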
Key Methods of Hierarchical Clustering
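Agglomeration Method | Method in R |
---|---|
Single-Linkage (single) | single - Nearest Neighbour Clustering |
Complete-Linkage (complete) | complete - Furthest Neighbour Sorting |
Average Agglomerative Clustering | average - Unweighted Arithmetic Average Clustering (UPGMA); mcquitty - Weighted Pair Group Method with Arithmetic Mean (WPGMA); centroid - Unweighted Centroid Clustering (UPGMC); median - Weighted Centroid Clustering (WPGMC) |
Ward’s Minimum Variance | ward.D - Does not implement Ward’s (1963) clustering criterion; ward.D2 - Implements that criterion (Murtagh and Legendre 2014) |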
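The difference between the single, complete and average agglomeration methods can be sketched in a few lines. The Python below is a naive illustration under stated assumptions, not the code used by R or dtwclust: the helper names `linkage_distance` and `agglomerate`, the toy `points`, and the precomputed `dist` matrix are all hypothetical, and the Ward, centroid and median methods are deliberately left out.

```python
from itertools import combinations


def linkage_distance(cluster_a, cluster_b, dist, method="average"):
    """Distance between two clusters of item indices under a linkage rule."""
    pairwise = [dist[i][j] for i in cluster_a for j in cluster_b]
    if method == "single":    # nearest neighbour
        return min(pairwise)
    if method == "complete":  # furthest neighbour
        return max(pairwise)
    if method == "average":   # unweighted average (UPGMA)
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unsupported method: {method}")


def agglomerate(dist, k, method="average"):
    """Naively merge the two closest clusters until only k remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        # Pick the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage_distance(
                clusters[p[0]], clusters[p[1]], dist, method
            ),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because b > a
    return clusters


# Hypothetical toy data: four items on a line, with a precomputed
# pairwise distance matrix (any distance, such as DTW, could fill it).
points = [0, 1, 5, 6]
dist = [[abs(p - q) for q in points] for p in points]
print(agglomerate(dist, k=2, method="average"))  # [[0, 1], [2, 3]]
```

On this toy data all three rules agree on the final grouping, but they measure the gap between the two groups differently: 4 for single, 6 for complete and 5 for average linkage. On less clearly separated data these differences change which clusters are merged first and therefore the shape of the resulting hierarchy.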
Packages Used
This dashboard mainly uses the dtwclust package for R.
dtwclust:
The dtwclust package lets users choose the time-series representation, preprocessing and clustering algorithm, and includes implementations of recently developed time-series clustering algorithms and optimizations. It serves as a bridge between classical clustering algorithms and time-series data, and additionally provides visualization and evaluation routines that can handle time series.
Objective
To build an application for Time-Series Clustering
Time-series data are of interest because of their ubiquity in areas ranging from science, engineering, business, economics and healthcare to government. This dashboard aims to let users perform time-series clustering on their data to uncover patterns that have potential use cases in the respective domain.