Difference between revisions of "Group10 Overview"

From Visual Analytics and Applications
 
 
=== Clustering ===
 
 
Clustering is the grouping of similar variables. Time series clustering groups variables that behave similarly over a period of time. Unlike most time series clustering approaches, which use the Euclidean distance, the algorithm used here is dynamic time warping (DTW), which is based on a distance measure between the variables over a time period.
 
==== Mathematics of DTW Clustering ====
 
DTW stands for Dynamic Time Warping, a popular distance measure for time series clustering.
 
In time series clustering analysis, the time series to be compared might not have the same length. They might also differ only by a displacement along the timeline, which means that once the displacement is removed, the series being compared might be the same. In these complex cases, the traditional Euclidean distance cannot effectively measure the distance or similarity between two time series. DTW calculates the similarity between two time series by stretching and shrinking them along the time axis.
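The contrast between the two distances can be illustrated with a small Python sketch. It is not the dtwclust implementation: the function names, the absolute-difference local cost and the toy series are all assumptions made for illustration.

```python
import math


def euclidean(x, y):
    """Lock-step Euclidean distance; only defined for equal lengths."""
    if len(x) != len(y):
        raise ValueError("Euclidean distance needs series of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dtw_distance(x, y):
    """Unconstrained DTW with absolute difference as the local cost."""
    n, m = len(x), len(y)
    inf = float("inf")
    # acc[i][j] = cheapest cost of aligning x[:i] with y[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


# Hypothetical toy series: the same peak, displaced or shortened.
peak = [0, 0, 1, 3, 1, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 3, 1, 0]  # same shape, two steps later
shorter = [0, 1, 3, 1, 0]           # same shape, fewer observations

euclid_shifted = euclidean(peak, shifted)
dtw_shifted = dtw_distance(peak, shifted)
dtw_shorter = dtw_distance(peak, shorter)
```

In this toy case the Euclidean distance treats the displaced peak as a large difference, while DTW aligns the two peaks and also accepts the shorter series.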
 
Most clustering approaches need three components:

1. A measure of similarity or dissimilarity, i.e. a distance measure

2. A clustering algorithm

3. A method to evaluate clusters.

DTW is a way to measure the distance between time series. The application uses the DTW distance to calculate the similarity and then applies a usual clustering method, such as partitional clustering, to form the clusters. The application does not yet provide a method to evaluate the clusters, which is one direction for our future work.
 
 
In R, the ‘dtwclust’ package uses the dynamic time warping distance for time series clustering. It also provides many distance calculation methods and clustering methods that users can choose from. In this analysis, we discuss 4 distance methods and 3 clustering methods.
 
Similarity calculation description
 
{| class="wikitable"
 
|-
 
! Similarity method !! Description
 
|-
 
| dtw_basic || Basic DTW distance. A custom version of DTW with less functionality, but faster.
 
|-
 
| dtw || DTW distance, optionally with a Sakoe-Chiba/slanted-band constraint.
 
|-
 
| gak || Fast global alignment kernels: Distance based on (triangular) global alignment kernels
 
|-
 
| lbk || Keogh’s lower bound for DTW with either L1 or L2 norm for the Sakoe-Chiba constraint
 
|}
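Two ideas from the table, the Sakoe-Chiba constraint and Keogh’s lower bound, can be sketched in a few lines of Python. This is only an illustration under stated assumptions (absolute-difference cost, L1 norm, equal-length series, hypothetical function names and data); it is not how dtwclust implements them.

```python
def dtw_windowed(x, y, window):
    """DTW restricted to a Sakoe-Chiba band: only cells with |i - j| <= window.

    Returns infinity if the band is too narrow to connect both ends.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def lb_keogh(x, y, window):
    """Keogh's lower bound (L1 norm) of x against the envelope of y."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal-length series")
    total = 0.0
    for i, value in enumerate(x):
        segment = y[max(0, i - window): i + window + 1]
        upper, lower = max(segment), min(segment)
        # Only the part of x that leaves the envelope contributes.
        if value > upper:
            total += value - upper
        elif value < lower:
            total += lower - value
    return total


# Hypothetical toy series.
x = [0, 0, 1, 3, 1, 0, 0, 0]
y = [0, 0, 0, 0, 1, 3, 1, 0]

bound = lb_keogh(x, y, window=1)
constrained = dtw_windowed(x, y, window=1)
unconstrained = dtw_windowed(x, y, window=len(x))
```

The lower bound is cheap to compute and never exceeds the DTW distance for the same window, which is why it is useful for skipping expensive DTW computations.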
 
 
Clustering method description
 
{| class="wikitable"
 
|-
 
! Clustering method !! Description
 
|-
 
| Fuzzy || Fuzzy clustering creates a fuzzy or soft partition in which each member belongs to each cluster to a certain degree
 
|-
 
| Partitional || In this case, the data is explicitly assigned to one and only one cluster out of k total clusters. The total number of desired clusters must be specified beforehand, which can be a limiting factor, although this can be ameliorated by using validity indices.
 
|-
 
| Hierarchical  || An algorithm that tries to create a hierarchy of groups in which, as the level in the hierarchy increases, clusters are created by merging the clusters from the next lower level, such that an ordered sequence of groupings is obtained.
 
|}
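The difference between a fuzzy and a partitional assignment can be sketched for a single series, given its distances to each cluster centroid. The membership formula below is the standard fuzzy c-means one; the function names, the fuzziness value of 2 and the distances are assumptions for illustration and not taken from the application.

```python
def fuzzy_memberships(distances, fuzziness=2.0):
    """Fuzzy c-means style membership of one series in each cluster."""
    # A series that coincides with a centroid belongs fully to that cluster.
    zeros = [d == 0 for d in distances]
    if any(zeros):
        share = 1.0 / sum(zeros)
        return [share if is_zero else 0.0 for is_zero in zeros]
    exponent = 2.0 / (fuzziness - 1.0)
    return [
        1.0 / sum((d_own / d_other) ** exponent for d_other in distances)
        for d_own in distances
    ]


def hard_assignment(distances):
    """Partitional assignment: the single closest cluster."""
    return min(range(len(distances)), key=lambda c: distances[c])


# Hypothetical distances from one series to two centroids.
distances_to_centroids = [1.0, 3.0]
memberships = fuzzy_memberships(distances_to_centroids)
cluster = hard_assignment(distances_to_centroids)
```

Here the series belongs mostly, but not entirely, to the first cluster under the fuzzy view, while the partitional view assigns it to the first cluster only.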
 
Compared with SAS and other tools that have built-in time series clustering but use only the Euclidean distance, this application is more flexible and convenient, and it can serve as an alternative to expensive software.
 
Let’s use the ‘dtw_basic’ distance and the partitional clustering method to describe the time series clustering process.
 
 
==== Clustering Process ====
 
The first step in DTW(X, Y) is creating a local cost matrix (LCM or lcm), which has n × m dimensions and stores all pairwise distances between the points of X and the points of Y.

The second step is to step through the matrix from LCM(1,1) to LCM(T,T), where n = m = T for series of equal length, moving horizontally, vertically or diagonally and adding up the cost of each step. At each step, the aim is to find the direction that increases the cost the least. A pre-specified step pattern assigns a weight to each direction. For an optimum path a={(1,1),…,(T, T)}, the DTW distance is computed as:<br>
 
[[File:Clustering function.png|center]]<br>
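The two steps above can be sketched in Python as follows. This is a minimal illustration, assuming an absolute-difference local cost and equal weights for all three step directions; the function names and index values are hypothetical, and the step pattern used by the application may differ.

```python
def local_cost_matrix(x, y):
    """n x m matrix of pairwise distances between points of x and y."""
    return [[abs(a - b) for b in y] for a in x]


def dtw_with_path(x, y):
    """Return the DTW distance and one optimal warping path."""
    lcm = local_cost_matrix(x, y)
    n, m = len(x), len(y)
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = lcm[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            # Cheapest way to reach (i, j): from above, from the left, or diagonally.
            best_prev = min(
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            acc[i][j] = lcm[i][j] + best_prev
    # Walk back from the last cell to the first to recover the path.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], (i - 1, j - 1)))
        if i > 0:
            candidates.append((acc[i - 1][j], (i - 1, j)))
        if j > 0:
            candidates.append((acc[i][j - 1], (i, j - 1)))
        _, (i, j) = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[n - 1][m - 1], path


# Hypothetical index values: series_b repeats its first value, then follows series_a.
series_a = [100, 102, 105, 103]
series_b = [100, 100, 102, 105, 103]
distance, path = dtw_with_path(series_a, series_b)
```

The recovered path matches the first point of `series_a` to the first two points of `series_b` and then proceeds diagonally, which is the "stretching" that lets DTW absorb the delay.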
 
Demo of DTW distance between cities <br>
 
[[File:Demo of DTW distance between cities.png|center]]<br>
 
The application then uses partitional clustering to form the clusters. First, the number of clusters k needs to be set. Then k objects are chosen at random as the initial centroids, and each object is assigned to the cluster of its closest centroid. A prototyping function is applied to each cluster to update the corresponding centroid. Then, distances and centroids are updated iteratively until a certain number of iterations have elapsed or no object changes clusters anymore. This gives the cluster results.
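The loop described above can be sketched in Python. This is an illustration, not the application's code: it assumes a medoid (the member with the smallest total distance to the others) as the prototyping function, an unconstrained DTW distance with absolute-difference cost, and hypothetical names and index values.

```python
import random


def dtw_distance(x, y):
    """Unconstrained DTW with absolute difference as the local cost."""
    n, m = len(x), len(y)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def partitional_cluster(series, k, max_iter=20, seed=0):
    """Assign each series to one of k clusters using medoid centroids."""
    rng = random.Random(seed)
    count = len(series)
    dist = [[dtw_distance(a, b) for b in series] for a in series]
    medoids = rng.sample(range(count), k)  # k random objects as initial centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: closest centroid wins.
        new_labels = [
            min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(count)
        ]
        if new_labels == labels:  # no object changed cluster
            break
        labels = new_labels
        # Update step: the new centroid is the member closest to all other members.
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if members:
                medoids[c] = min(
                    members, key=lambda cand: sum(dist[cand][o] for o in members)
                )
    return labels, medoids


# Hypothetical price index series: two rising, two falling.
series = [
    [100, 101, 103, 106, 110],
    [100, 100, 101, 103, 106],
    [100, 98, 95, 93, 90],
    [100, 99, 98, 95, 93],
]
labels, medoids = partitional_cluster(series, k=2)
```

With these four series the two rising ones end up in one cluster and the two falling ones in the other, whichever objects are drawn as initial centroids.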
 
 
==== Clustering with geography ====
 
Because the data is about the cities’ property price index, we can further ask whether cities in the same cluster are located near each other. So, we label each city with its cluster and plot it on a map of China built with the ‘rechart’ package.
 
 
==== Insight from the advanced clustering ====
 
If we set the number of clusters to 4, cluster 2 contains only one city, Wenzhou, whose price index keeps going down during the 5 years, which means that house prices in Wenzhou keep decreasing. All other cities show an increase in the price index in March 2014 and then a decline from April 2014 to February 2015. After February 2015, the price index increases slowly, except for one city in cluster 1, Shenzhen. From the map, it can be found that the cities in cluster 4 are gathered along the coastline. The other clusters do not show much geographic similarity.
 
[[File:Cluster.png|400px|center]]
 
[[File:Clusters in China map.png|400px|center]]
 
  
 
=== Forecasting ===
 

Latest revision as of 15:54, 10 December 2017



Background

There are over 10,000 packages in R that support many kinds of economic and financial analysis. Many analysis methods and algorithms out there fail to be utilised or optimised by users. They are either poorly derived with great visualization or accurately derived with poor visualization. One such analysis is time series analysis, so we have taken up the housing price index of the China housing market over 5 years. The analysis is a time series analysis of the housing price data over 5 years using state-of-the-art time series clustering, thus allowing better grouping. The analysis is also presented in an efficient way.

Time Series Analysis

Time series analysis is about analyzing time series data to understand its characteristics and derive conclusions based on statistical results from the data. The methodologies we have used are clustering and forecasting.

Clustering

Clustering is the grouping of similar variables. Time series clustering groups variables that behave similarly over a period of time. Unlike most time series clustering approaches, which use the Euclidean distance, the algorithm used here is dynamic time warping (DTW), which is based on a distance measure between the variables over a time period.

Forecasting

The forecasting is based on an ARIMA model that predicts the next two years, and the forecast is also compared to the actual results to validate the model. First, we build the ARIMA forecasting model, then convert it to “tidy” data frames with the sweep package, and finally we use a grid we built ourselves to visualize the trend and forecast for each city.
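The idea of validating a forecast against held-back actual values can be sketched in Python. The sketch uses ARIMA(0,1,0) with drift, the simplest member of the ARIMA family, purely for illustration; it is not the model fitted in the application, and the function names and index values are hypothetical.

```python
def drift_forecast(history, horizon):
    """ARIMA(0,1,0) with drift: last value plus h times the mean first difference."""
    diffs = [b - a for a, b in zip(history, history[1:])]
    drift = sum(diffs) / len(diffs)
    return [history[-1] + drift * h for h in range(1, horizon + 1)]


def mape(actual, predicted):
    """Mean absolute percentage error between actual and predicted values."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical monthly index values; the last two are held back for validation.
index = [100.0, 101.0, 102.5, 103.0, 104.5, 106.0, 107.0, 108.5]
train, holdout = index[:6], index[6:]

forecast = drift_forecast(train, horizon=len(holdout))
error = mape(holdout, forecast)
```

A small error on the held-back months suggests the model captures the trend; a large one suggests a different model order should be tried.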

Case Application

The Housing Price Index is a major macroeconomic factor. It reflects not just the housing market but also the economy as a whole. The housing prices of each city are analysed, and a comparative analysis is provided to derive further insights from them.

Data Preparation

The data used is from the CEIC Data of the Housing Price Index of cities in China. A total of 48 cities are selected, covering first-tier, second-tier and third-tier cities. There are 3 datasets in total.

{| class="wikitable"
|-
! Name !! Description
|-
| property price index (2010=100)_New constructed || The property price index for 48 cities. The data is monthly, from 2010 to 2015. The index is based on 2010, with the 2010 price taken as 100.
|-
| Geolocation || The longitude and latitude of the 48 cities.
|-
| Geofacet || The location information for the 48 cities, which is used in the ‘Geofacet’ package.
|}

Application and Analysis

The application allows the user to conduct different time series clustering and forecasting for the cities that they wish to see.