Group 8 Report

From Visual Analytics and Applications
Revision as of 21:42, 28 November 2017 by Guoteng.fam.2016


Abstract

Time-series analysis is time-consuming and labour-intensive. As budding data analysts, we spent considerable resources experimenting with many parameter configurations to analyse time-series data. This difficulty stems from the lack of automatic tools that help calculate optimized time-series parameters during model training. To tackle this challenge, we created an easy-to-use time-series exploration system that is accessible even to the uninitiated analyst. The system decomposes time-series data into its constituent parts, namely Seasonality, Trend and Random (Noise). It generates several forecasting models, using Exponential Smoothing and ARIMA techniques with optimized parameters, to predict future time periods. The system also allows other time-series datasets, provided they follow the supported formats, to be displayed and their forecasts compared using the same forecasting methods. To test the system's capabilities, we adopted the Singapore Consumer Price Index (CPI) as our use case. The CPI, together with its short-term forecasts, is often used by governments such as Singapore's to tune policies that steer inflation, and by foreign investors deciding whether to allocate investment funds to a country. This paper documents the final system functionality and its underlying design principles.

1. Motivation of the application

During our own data analysis work, we found few freely available tools that help optimize the parameter settings of time-series models. As a result, much time and effort is spent trying different combinations of parameters and waiting for models to be trained in order to find the most accurate one. An ideal time-series system would estimate each model's accuracy automatically while the analysis is performed, so that the best models can easily be chosen for further scrutiny. Such a system would also offer data exploration features that are easy to understand, and would accept generic forms of time-series data.

Besides meeting the need for a better time-series modelling application, we are also motivated to explore the complex time-related trends of Singapore's CPI across different categories over the years (1990-2017). We would like to use time-series visualization techniques, such as tables and line charts representing Trend, Seasonality and Random, to uncover potential insights and to display forecasts to the audience. The team was curious about how different categories of goods and services make up a country's CPI, and how these categories of data would look when visualized. Hence, we find the Singapore CPI data a suitable use case for our application.

Our desire to work on this project is encouraged by the Singapore Government's 'SmartNation.sg' initiative, under which data is made readily available to the general public for in-depth analysis. These datasets often provide interesting observations and opportunities for anyone who cares to investigate.

2. Review and critique of past works

Professional visualization tools, such as Tableau, offer specialized features for visualization. However, their primary goal is to visualize time-series data, not to analyze it, so we did not expect them to provide model training functionality.

Specialized data analysis tools such as SAS JMP and SAS Enterprise Miner offer comprehensive analytical features that allow complex time-series models to be built. However, the sheer number of functions can intimidate even seasoned data analysts. The user interface can be cluttered, and the visualizations produced by the analyses are lackluster compared to the graphs generated by Tableau.

Both types of tools are well appreciated by the data science community for different reasons, and we hope that our proposed application can fill the niche between these two extremes.

3. Design framework

3.1 Interface

Our system's look and feel is designed in accordance with Shneiderman's mantra: "Overview first, zoom and filter, then details-on-demand". This is apparent in the way we organize the steps and our UI elements.

Overview

In the first tab, "Upload Data File", the user can upload a time-series CSV file and preview the data that is going to be analyzed. A table in the side panel describes the metadata of the uploaded data. These features give a general feel for the data before any actual analysis.

Zoom and filter

In the second tab, "Exploration", the left panel provides a comprehensive set of filters that allow the user to narrow down the records in the dataset. Selecting different values automatically updates the individual time-series charts in the main screen area, showing the zoomed-in trend and seasonality data derived from the observed data.

Details on Demand

In the third tab, "Forecasting", the data has been narrowed down and the user can now perform forecasting. For convenience, models with optimized parameters are generated from the start and displayed in a sortable data table. After model selection, the forecast charts expand and contract according to the number of models selected, to ease visual comparison. The forecasted time periods are drawn as a stark red line to denote their importance, with two shades of blue denoting the confidence intervals. Holdout data is drawn in cyan, and this information is present in the legend. Because the charts may be small when several are selected, title bars and background gridlines are provided to guide the user.

Our general system color scheme uses a calming shade of blue, with large areas of white and light grey. This color scheme is meant to reduce anxiety for data analysts performing time-series analysis. We also used consistent UI elements (from the shinythemes package) throughout the pages to give users a sense of familiarity. UI controls are almost always placed on the left and visualizations on the right.

3.2 Functionality Design

To ensure that the system has the necessary functions to perform accurate time series analysis without being too cluttered and overwhelming, the system functions are grouped into three simple steps.

Data Manipulation

Users can upload their own data via the first interface tab. The data files need to be pre-processed into certain formats before the upload can commence. The interface supports some data transformations, such as the transposition of columns, the generation of an index column as a substitute for a missing datetime column, and the indication of missing time-series periods. These functions are required because time-series analysis needs data to be indexed by some form of datetime field. Metadata about the uploaded dataset is displayed for easy viewing, and the dataset can be previewed in the main panel.
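
The transformations described above can be sketched in a few lines. The sketch below is illustrative only: it is written in Python with the standard library, whereas the application itself is built with R packages, and the function names (month_range, missing_months, add_index_column, transpose) are hypothetical and not taken from the application. It assumes monthly data, the format the report says is currently supported.

```python
from datetime import date


def month_range(start, end):
    """Yield the first day of every month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def missing_months(observed):
    """Return the months absent between the earliest and latest observation."""
    seen = {(d.year, d.month) for d in observed}
    first, last = min(observed), max(observed)
    return [m for m in month_range(first, last) if (m.year, m.month) not in seen]


def add_index_column(rows):
    """Prefix each row with a 1-based index, standing in for a missing datetime column."""
    return [[i] + list(row) for i, row in enumerate(rows, start=1)]


def transpose(rows):
    """Swap the rows and columns of a rectangular table."""
    return [list(col) for col in zip(*rows)]
```

For example, calling missing_months on observations for January, February, April and June of one year reports March and May as the gaps.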

Data Exploration

Users can then explore the uploaded data by interacting with the provided controls, such as specifying whether the data follows an additive or multiplicative model, adjusting the start and end dates of the data, and setting the frequency of the time-series periods. The system decomposes the time series into its constituent parts: Observation, Seasonal, Trend and Random (Noise). From the separate parts, users can understand the different time-series patterns and derive insights.
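
To make the decomposition step concrete, here is a minimal sketch of a classical additive decomposition using centred moving averages. The report does not state which decomposition routine the application calls, so this is an assumption chosen for illustration, written in standard-library Python rather than the R used by the application. The function name decompose_additive is hypothetical. A multiplicative variant would divide by the trend and seasonal parts instead of subtracting them.

```python
def decompose_additive(series, period):
    """Split a series into trend, seasonal and random parts (additive model).

    Trend values are None at both ends, where the moving-average
    window does not fit inside the series.
    """
    n = len(series)
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")
    half = period // 2

    # Trend: centred moving average over one full period.
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        if period % 2 == 0:
            # Even period: the window has period + 1 points,
            # so the two end points each get half weight.
            total = sum(window) - 0.5 * (window[0] + window[-1])
        else:
            total = sum(window)
        trend[t] = total / period

    # Seasonal: average the detrended values at each position in the cycle,
    # then centre the cycle so that it sums to zero.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    raw = [sum(b) / len(b) for b in buckets]
    mean_raw = sum(raw) / period
    cycle = [s - mean_raw for s in raw]
    seasonal = [cycle[t % period] for t in range(n)]

    # Random: whatever is left after removing trend and seasonal parts.
    random_part = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, random_part
```

With monthly CPI data, period would be 12, which corresponds to the frequency setting mentioned above.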

Forecasting

Finally, users can forecast the time-series data selected in the Data Exploration step. Forecasts are produced using Exponential Smoothing and ARIMA techniques. An optimization algorithm is used along with existing packages to find the best set of parameters, and the top three models of each technique are selected based on their AIC and BIC values. Once selected, the models can be graphed on the page for comparison.
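
The idea of searching over parameters and keeping the top three models by information criterion can be sketched as follows. This is not the application's optimization algorithm, which the report does not describe in detail. It is a simplified grid search over two exponential smoothing variants (simple smoothing and Holt's linear method) in standard-library Python, with ARIMA left out for brevity. The function names and the parameter grid are hypothetical. As a further simplification, the AIC and BIC are computed from the one-step-ahead squared errors under a Gaussian error assumption, and only the smoothing parameters are counted as model parameters.

```python
import math
from itertools import product


def one_step_sse(series, alpha, beta=None):
    """Sum of squared one-step-ahead forecast errors.

    beta=None gives simple exponential smoothing (level only);
    otherwise Holt's linear method (level plus trend).
    """
    level = series[0]
    trend = 0.0 if beta is None else series[1] - series[0]
    sse = 0.0
    for y in series[1:]:
        forecast = level + trend
        sse += (y - forecast) ** 2
        new_level = alpha * y + (1 - alpha) * forecast
        if beta is not None:
            trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return sse


def information_criteria(sse, n, k):
    """Return (AIC, BIC) up to an additive constant, for k parameters."""
    sigma2 = max(sse / n, 1e-12)  # guard against log(0) on a perfect fit
    fit_term = n * math.log(sigma2)
    return fit_term + 2 * k, fit_term + k * math.log(n)


def rank_models(series, grid=(0.1, 0.3, 0.5, 0.7, 0.9), top=3):
    """Fit every candidate on the grid and return the `top` lowest-AIC models."""
    if len(series) < 3:
        raise ValueError("need at least three observations")
    n = len(series) - 1  # number of one-step-ahead errors
    candidates = []
    for alpha in grid:
        aic, bic = information_criteria(one_step_sse(series, alpha), n, k=1)
        candidates.append(
            {"model": "SES", "alpha": alpha, "beta": None, "aic": aic, "bic": bic}
        )
    for alpha, beta in product(grid, grid):
        aic, bic = information_criteria(one_step_sse(series, alpha, beta), n, k=2)
        candidates.append(
            {"model": "Holt", "alpha": alpha, "beta": beta, "aic": aic, "bic": bic}
        )
    return sorted(candidates, key=lambda c: c["aic"])[:top]
```

Because both criteria penalize extra parameters, a flat series favours the simpler level-only model, while a series with a steady trend favours Holt's method. The returned list plays the role of the sortable table of candidate models described in the interface section.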

3.3 Usage of R Packages

4. Demonstration

To demonstrate our application, we adopted Singapore's Consumer Price Index (CPI) data as our use case. CPI is an economic measure that foreign investors often use when considering investments in a particular country, and that potential migrants use to assess the standard of living of the country's citizens. The data describes how affordable or unaffordable goods and services are, in the form of weighted average scores.

The Singapore CPI data that we are exploring is extracted from data.gov.sg, a Singapore government website that houses public data in support of the nation's smart data initiative, 'SmartNation.sg'. The data is monthly, covering January 1961 to August 2017, with 2014 as the index reference period. An overall index represents changes in the price level of the whole basket of items, and it can be drilled down into sub-indices for different categories and sub-categories of goods and services. For our system analysis, we plan to use filtered data from 1990 onwards.

- Sample test cases -

5. Discussion

What has the audience learned from your work? What new insights or practices has your system enabled? A full blown user study is not expected, but informal observations of use that help evaluate your system are encouraged.

6. Future Work

Given time constraints, the current application can only process limited types of time-series data formats, such as monthly data or datasets with system-generated indices. In the future, it would be good to cater to data in other formats, such as weekly and quarterly data. If possible, the system should also accept data captured at very fine time scales, such as seconds or minutes.

The system should also have a comprehensive error message system that can provide customized instructions or error handling features when one tries to perform operations that exceed the system's abilities.

More forecasting methods can be incorporated into the system. Techniques such as Extrapolation, Linear Prediction, Trend Estimation, Growth Curve, and Neural Network can be further explored to provide potential analytical features to users.

7. Installation guide

Including hardware configuration and software integration.

8. User Guide

Step-by-step guide on how to use the data visualisation functions designed.

References

1. http://www.codingthearchitecture.com/2015/01/08/shneidermans_mantra.html

2. https://data.gov.sg/dataset/consumer-price-index-monthly?view_id=0063aa5a-c5de-4c74-94be-b9ec443878be&resource_id=67d08d6b-2efa-4825-8bdb-667d23b7285e

3. https://secure.mas.gov.sg/msb/ExchangeRates.aspx

4. https://insights-ceicdata-com.libproxy.smu.edu.sg/Untitled-insight/views

5. http://tralvex.com/pub/cars/coe.htm