Group08 Report
Revision as of 20:20, 8 December 2018
Yu’e Bao (余额宝) is an investment product offered by Alipay (支付宝), a mobile and online payment platform established by China’s multinational conglomerate Alibaba Group. In June 2013, Alibaba Group launched Yu’e Bao in collaboration with Tianhong Asset Management Co., Ltd., forming the first internet fund in China. Since then, Yu’e Bao has become the nation’s largest money market fund and, by Feb 2018, had US$251 billion under management. In Chinese, Yu’e Bao means “Leftover Treasure”. Alipay users can deposit their spare cash, for example money left over from online shopping, into this investment product. The money is invested via a money market fund with no minimum amount or exit charges, and interest is paid daily. While major banks offer 0.35% annual interest on deposits, Yu’e Bao may offer users 6% interest, with the convenience and freedom to deposit and withdraw at any time via the Alipay mobile app. As a result, Yu’e Bao became extremely popular in China.
Using various data visualization methodologies and techniques, coupled with survival analysis at the user transaction level and time-series clustering, this project aims to build an interactive tool on the R Shiny framework to uncover associations between Yu’e Bao user profiles, cash flow behaviour, time and other financial factors. This will help us understand how people in China invest their money through Yu'e Bao and yield insights that are valuable to the internet money market fund industry.
This report is divided into 9 sections. After this introduction, section 2 discusses the motivation and objectives of the project. Section 3 reviews related literature and presents our critique. Sections 4 and 5 describe the dataset and our data preparation process respectively, followed by the application introduction, installation and user guide in section 6. Section 7 explains in detail the data exploration, analysis and insights gained through the Yu'e Bao dashboard. Finally, sections 8 and 9 conclude the report by highlighting the key challenges faced during the project and possible future work.
Motivation and Objectives
The dataset used in this project was released for a competition (The Purchase and Redemption Forecasts-Challenge the Baseline Competition) organized by Alibaba Cloud, TIANCHI Aliyun. The competition challenges its participants to train models that predict the future cash flow of Yu’e Bao users, based on historical financial data from the government, Yu’e Bao and its users, and on user profiles. The results can aid Ant Financial Services Group, the Alibaba Group affiliate that operates Alipay, in managing cash inflow and outflow. Hence, most of the work done on this dataset focuses only on achieving the best score in predictive modelling. At the time of this project, no work offering other kinds of data analysis or insights had been published.
In view of this, we chose to offer an alternative analytical approach to the dataset by building a Shiny app with interactive features that applies data visualization methodologies to present the data and its insights interactively. We also perform survival analysis and time-series clustering, and dynamically generate visualizations of the analytical results. The visualization platform is built with RStudio and the R programming language, with its rich libraries. Our final objectives are to:
- Provide interactive visualizations that enable users to explore the dataset along various dimensions with different chart types and to gain the corresponding insights
- Dynamically generate different customer segmentations to analyze customer deposit and withdrawal behaviour, enabling users to explore and visualize how Yu'e Bao user behaviour differs across customer segments
- Provide interactive visualizations for time-series clustering and survival analysis, and enable users to perform the analysis with different input parameters
Literature Review
Despite the popularity of internet crowdfunding in China, there is little scholarly research in this area. Shen Lin Bing’s research study (2018) reviews the history of marketplace lending globally, with an emphasis on China, and further explores the industry's development and driving forces in China. A small part of the article relates to Yu'e Bao as an innovative financial investment service, with reference to the visualization figure on the left. The figure plots the yield rate of Yu’e Bao, the interest rate for current deposits at banks, and the fund share of Yu’e Bao from May 2013 to May 2017. The Yu'e Bao Big Data Report posted by a big data forum (2014) provides a financial summary of Yu'e Bao and its customer demographics in general; the webpage presents its content as a comical and simple summary. In the report posted by SINA Financial (2014), a plot of the total count of Yu'e Bao deposit and withdrawal transactions from 2013 to 2014 leads to the conclusion that the average daily count of withdrawal transactions is three times higher than that of deposits.
None of the analyses and visualizations mentioned above is interactive. Although they provide a general summary of Yu'e Bao customer behaviour, there is little or no detailed analysis of the relationship between customer profiles and behaviour. With a growing proportion of funds flowing into such investment tools, a centralized and interactive data visualization platform for analyzing Yu’e Bao customer segmentation and behaviour would be very helpful for the healthy growth of its ecosystem.
Dataset & R Libraries Used
The data source is the Alibaba Cloud TIANCHI competition "The Purchase and Redemption Forecasts - Challenge the Baseline". The dataset from this competition comprises Yu'e Bao user profiles, transaction behaviour, and financial interest rates over time, in 4 CSV tables as follows:
Table Name | Description |
---|---|
user_balance_table.csv | 2,840,421 observations of the time series cash flow data from 28,041 Yu’e Bao users for 14 months, from 1st Jul 2013 to 31st Aug 2014. Cash flow data includes 18 variables of account balances, different types of deposits, withdrawals, interest earned and, if funds are used to make online purchases, categories of purchase. |
user_profile_table.csv | 28,041 rows of user profile data that describes the user’s gender, zodiac sign, and registered city, based on each user ID, in 4 columns. |
mfd_day_share_interest.csv | 427 observed dates from 1st Jul 2013 to 31st Aug 2014 and the corresponding Yu’e Bao’s daily and 7-daily interest rates. |
mfd_bank_shibor.csv | 294 observed dates from 1st Jul 2013 to 31st Aug 2014 and the corresponding 8 types of Shibor interest rates, from overnight interest rates to yearly interest rates. Although the time frame is the same as above, the number of observations is less here because there are no Shibor interest rates data for weekends and public holidays. |
The dataset poses some challenges that require cleaning and recoding. As the dataset is relatively large, and for the purpose of our analysis, we also need to pre-process the data differently for each visualization, e.g. by reshaping, filtering and aggregating. R libraries such as tidyverse, dplyr, plyr, cluster, chron, grid, lubridate, xts and data.table were considered for this purpose.
R Library | Description |
---|---|
Plotly | Plotly's R graphing library makes interactive, publication-quality graphs online. |
Treemap | A treemap is a space-filling visualization of hierarchical structures. This function offers great flexibility to draw treemaps. |
Corrplot | A graphical display of a correlation matrix or general matrix. It also contains some algorithms to do matrix reordering. |
Lattice | A powerful and elegant high-level data visualization system inspired by Trellis graphics, with an emphasis on multivariate data. Lattice is sufficient for typical graphics needs, and is also flexible enough to handle most nonstandard requirements. |
Survival | A formidable package that lets users perform survival analysis. |
Ranger | Ranger is a fast implementation of random forests (Breiman 2001) or recursive partitioning, particularly suited for high dimensional data. Classification, regression, and survival forests are supported. |
Ggfortify | Unified plotting tools for statistics commonly used, such as GLM, time series, PCA families, clustering and survival analysis. The package offers a single plotting interface for these analysis results and plots in a unified style using 'ggplot2'. |
Survminer | Provides functions for facilitating survival analysis and visualization. |
Shiny | Shiny is an R package that makes it easy to build interactive web apps straight from R. |
Shinydashboard | This package lets users create dashboards with 'Shiny' and provides a theme on top of 'Shiny', making it easy to create attractive dashboards. |
ShinyWidgets | Some custom inputs widgets to use in Shiny applications, like a toggle switch to replace checkboxes. And other components to pimp your apps. |
Dashboardthemes | The dashboardthemes package provides two main features: using new pre-defined themes and logos for dashboards, and creating custom themes and logos for dashboards. |
Data Cleaning and Preparation
The data is generally clean but requires some preparation to be meaningful. In the user profile table, the column ‘constellation’ is renamed to ‘Zodiac’ to clarify that it contains the user’s Zodiac sign. The Zodiac signs are also translated from Chinese characters (e.g. 狮子座) to English Zodiac names (e.g. Leo). Next, the column ‘sex’ with values ‘1’ and ‘0’ is translated into the column ‘Gender’ with values ‘Male’ and ‘Female’ respectively. As there was no way to decipher the actual city name from the 7-digit code in ‘city’, the codes are replaced with the labels ‘City A’, ‘City B’, … ‘City G’.
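The recoding described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the sample rows and city codes are made up, the letter assigned to each city code (sorted order) is our own choice, and only the 狮子座 to Leo mapping comes from the text, with the other entries being the standard translations.

```r
# Hypothetical rows mimicking user_profile_table.csv (all values are made up)
profile <- data.frame(
  user_id = c(1, 2, 3),
  sex = c(1, 0, 1),
  city = c(6411949, 6081949, 6411949),
  constellation = c("狮子座", "双鱼座", "白羊座"),
  stringsAsFactors = FALSE
)

# Lookup from Chinese Zodiac names to English names
zodiac_map <- c(
  "白羊座" = "Aries",   "金牛座" = "Taurus",  "双子座" = "Gemini",
  "巨蟹座" = "Cancer",  "狮子座" = "Leo",     "处女座" = "Virgo",
  "天秤座" = "Libra",   "天蝎座" = "Scorpio", "射手座" = "Sagittarius",
  "摩羯座" = "Capricorn", "水瓶座" = "Aquarius", "双鱼座" = "Pisces"
)

# 'constellation' becomes 'Zodiac' with English names
profile$Zodiac <- unname(zodiac_map[profile$constellation])

# 'sex' (1/0) becomes 'Gender' (Male/Female)
profile$Gender <- ifelse(profile$sex == 1, "Male", "Female")

# Replace each distinct city code with a letter label.
# Assumption: letters are assigned in sorted order of the codes.
city_codes <- sort(unique(profile$city))
profile$City <- paste("City", LETTERS[match(profile$city, city_codes)])

# Keep only the recoded columns
profile <- profile[, c("user_id", "Gender", "City", "Zodiac")]
print(profile)
```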
Next, user_id, report_date and mfd_date are converted to factors in R. The user balance table is then joined with the user profile table by user_id, and the result is further joined with the bank Shibor interest table and the day share interest table by matching report_date (in the user tables) with mfd_date (in the interest rate tables).
At this point, it becomes obvious that the bank Shibor interest rate columns have missing data on weekends and public holidays, because no official interest rates are released on these days. This is not the case for Yu’e Bao interest rates, as it is a 24/7 platform. To make it possible to compare Yu’e Bao interest rates with bank Shibor interest rates on all days, every missing Shibor rate adopts the most recent available rate from the preceding days. For example, 6 July 2013 (Saturday) and 7 July 2013 (Sunday) adopt the interest rates of 5 July 2013 (Friday), which are available. If 5 July 2013 were a public holiday, all 3 days would adopt the most recent available rate before it.
Finally, to allow us to analyze time as categorical data, we extract the year, month, day and day of week from report_date to form 4 new variable columns. Each subsequent analysis and visualization uses its own data manipulation techniques to prepare the data; these are discussed separately in the subsequent sections.
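As a rough illustration of the joins described above, the sketch below uses base R `merge()` on tiny made-up tables. Apart from user_id, report_date and mfd_date, the column names and all values are assumptions for illustration, and the keys are kept as character strings for simplicity.

```r
# Tiny made-up stand-ins for the four CSV tables
balance <- data.frame(
  user_id = c("1", "1", "2"),
  report_date = c("20130705", "20130706", "20130705"),
  purchase_amt = c(100, 0, 50),
  stringsAsFactors = FALSE
)
profile <- data.frame(
  user_id = c("1", "2"),
  Gender = c("Male", "Female"),
  stringsAsFactors = FALSE
)
day_share <- data.frame(
  mfd_date = c("20130705", "20130706"),
  daily_yield = c(1.21, 1.20),
  stringsAsFactors = FALSE
)
shibor <- data.frame(
  mfd_date = "20130705",          # no Shibor row for Saturday 6 July
  overnight_rate = 3.38,
  stringsAsFactors = FALSE
)

# Balance + profile by user_id, then both interest tables by date.
# all.x = TRUE keeps every balance row even when a rate is missing.
joined <- merge(balance, profile, by = "user_id", all.x = TRUE)
joined <- merge(joined, day_share, by.x = "report_date", by.y = "mfd_date", all.x = TRUE)
joined <- merge(joined, shibor, by.x = "report_date", by.y = "mfd_date", all.x = TRUE)
print(joined)
```

The left joins make the weekend gap in the Shibor columns visible as `NA` values in the joined table.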
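The fill rule above is a "last observation carried forward" operation. A minimal base R sketch follows, assuming the rows are already sorted by date; the helper name `locf` and the rate values are made up for illustration. Packages such as zoo (`na.locf`) offer the same operation.

```r
# Carry the last non-missing value forward over NA gaps.
# Leading NAs (nothing earlier to copy from) stay NA.
locf <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # index of latest non-NA so far
  idx[idx == 0] <- NA
  x[idx]
}

# Made-up overnight rates: Thu 4 Jul to Mon 8 Jul 2013, weekend missing
rates <- data.frame(
  date = as.Date("2013-07-04") + 0:4,
  overnight_rate = c(3.30, 3.38, NA, NA, 3.45)
)
rates$overnight_rate <- locf(rates$overnight_rate)
print(rates)
```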
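The four date-derived columns can be sketched in base R as below. We assume here that report_date is stored as a `yyyymmdd` string; the new column names are our own choice.

```r
# Assumption: report_date arrives as a yyyymmdd string
report_date <- c("20130705", "20130706", "20140831")
d <- as.Date(report_date, format = "%Y%m%d")

# Locale-independent weekday names (POSIXlt wday: 0 = Sunday ... 6 = Saturday)
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

parts <- data.frame(
  report_date = report_date,
  Year = as.integer(format(d, "%Y")),
  Month = as.integer(format(d, "%m")),
  Day = as.integer(format(d, "%d")),
  DayOfWeek = factor(day_names[as.POSIXlt(d)$wday + 1], levels = day_names),
  stringsAsFactors = FALSE
)
print(parts)
```

Storing the day of week as a factor with fixed levels keeps the categories in calendar order when they are used on a plot axis.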
Application User Guide
Local Installation and Execution
The Yu’e Bao Shiny dashboard app was built using R version 3.5.1 (Feather Spray) and RStudio version 1.1.463 for Windows Vista/7/8/10, which were the latest versions available at the time of the project. It is therefore recommended to use the same versions to run the app. The app also requires the latest versions of the R libraries mentioned earlier that are compatible with this version of R. Before running the app, it is advisable to update all R libraries. When the app starts, any required libraries that are not already installed are installed automatically without manual intervention. Please be patient while the app loads, as there may be quite a few libraries to install and this may take some time. Once all required libraries are installed, the app dashboard should appear as shown below.
It is also recommended to run the app in an internet browser, e.g. Google Chrome, for the best visualization experience. Additionally, to enhance the user experience, most data has been pre-processed to minimize data processing at the front end, so a separate dataset or data object has been pre-generated for each functionality. The pre-processed data is configured to load automatically on the initial load, which may take slightly longer depending on the data size. Please wait for the data to load completely, and try not to close the app before then, to avoid data corruption. Once loaded, the data is automatically saved into the global environment, and subsequent use of the functionality will be smooth.
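One common way to get this install-on-startup behaviour is sketched below. It is not necessarily how the app does it: the helper names are hypothetical and the package list is only an excerpt of the libraries named in this report.

```r
# Excerpt of the libraries named in this report (adjust to the app's real list)
required <- c("shiny", "shinydashboard", "shinyWidgets", "dashboardthemes", "plotly")

# Return the subset of 'pkgs' that cannot be loaded on this machine
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install whatever is missing; returns the installed names invisibly
install_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) install.packages(to_install)
  invisible(to_install)
}

# Base packages ship with R, so nothing is reported as missing here
print(missing_packages(c("stats", "utils")))
```

At the top of the app script, a call such as `install_missing(required)` followed by the usual `library()` calls would reproduce the behaviour described above.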
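The load-once behaviour can be sketched as a small caching helper. This is an assumed pattern, not the app's actual code: `load_once`, the object name and the stand-in loader are hypothetical, and the demo uses a private environment rather than the global one.

```r
# Load an object only if it is not already present in 'env'
load_once <- function(name, loader, env = globalenv()) {
  if (!exists(name, envir = env, inherits = FALSE)) {
    assign(name, loader(), envir = env)
  }
  get(name, envir = env, inherits = FALSE)
}

# Demo: a stand-in for an expensive read of pre-processed data
cache <- new.env()
load_count <- 0
slow_loader <- function() {
  load_count <<- load_count + 1   # count how often the loader really runs
  data.frame(x = 1:3)
}

first <- load_once("survival_data", slow_loader, env = cache)
second <- load_once("survival_data", slow_loader, env = cache)
print(load_count)
```

In the app, the loader would typically read a pre-generated file, for instance with `readRDS()`, and `env` would be left at its default so that the object lands in the global environment.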
General Usage
The shiny and shinydashboard R libraries are used as the main framework to create a dashboard-like user interface. A custom theme, Blue Gradient, is applied with the dashboardthemes R library to give the app a unified look. The shinyWidgets R library is used to create more professional-looking dashboard controls such as material switch buttons and radio group buttons.
As shown in the image above, the overall layout of the app is based on 3 main functionalities, which are listed in the left sidebar menu (red box). Each main functionality opens a new page to its right. Where there are sub-functionalities, they are listed at the top as tabs (pink box). In the main page area, the control inputs (green box) are placed to the left and right, or at the top of the page, while the visualizations and plots (blue box) occupy most of the remaining area. If the visualization feels squeezed, users can click the button beside the dashboard title “Yu’e Bao Dashboard” (purple arrow) to hide the sidebar menu, as shown in the image below.
Additionally, some controls include a Refresh button or an Update View button. Where there is a Refresh button, the visualization regenerates automatically with every change in inputs. However, the visualization may occasionally not plot properly or not fit the plot frame (e.g. when the browser window is resized while the visualization is being generated); in that case it can be redrawn with the Refresh button. Where there is an Update View button, changing the inputs does not regenerate the visualization automatically; users must click Update View to apply the changes.
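The layout and the Update View behaviour described above can be sketched with a minimal shinydashboard skeleton. The menu names, input ids and the built-in `faithful` dataset are placeholders of our own; only the dashboard title, the three-item sidebar, the tabs, the Blue Gradient theme and the Update View button come from the description.

```r
library(shiny)
library(shinydashboard)
library(dashboardthemes)

ui <- dashboardPage(
  dashboardHeader(title = "Yu'e Bao Dashboard"),
  dashboardSidebar(
    sidebarMenu(                      # three main functionalities (names assumed)
      menuItem("Exploration", tabName = "explore"),
      menuItem("Clustering", tabName = "cluster"),
      menuItem("Survival", tabName = "survival")
    )
  ),
  dashboardBody(
    shinyDashboardThemes(theme = "blue_gradient"),
    tabItems(
      tabItem(
        tabName = "explore",
        tabBox(                       # sub-functionalities appear as tabs
          width = 12,
          tabPanel(
            "Overview",
            fluidRow(
              column(3,
                     sliderInput("bins", "Bins", min = 5, max = 50, value = 20),
                     actionButton("update", "Update View")),
              column(9, plotOutput("hist"))
            )
          )
        )
      ),
      tabItem(tabName = "cluster", h3("Clustering page placeholder")),
      tabItem(tabName = "survival", h3("Survival page placeholder"))
    )
  )
)

server <- function(input, output, session) {
  # The plot only picks up a new slider value when Update View is clicked;
  # ignoreNULL = FALSE lets it draw once on the initial load.
  bins <- eventReactive(input$update, input$bins, ignoreNULL = FALSE)
  output$hist <- renderPlot(hist(faithful$waiting, breaks = bins()))
}

app <- shinyApp(ui, server)
```

Dropping the `eventReactive()` wrapper and reading `input$bins` directly inside `renderPlot()` would give the other behaviour, where the plot regenerates on every input change.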
Input Controls
Various input control types are used in the app:
- Radio Buttons (red box) – a set of radio buttons where only 1 option can be selected. Usually used when the list of text choices is mid-sized: too long for Radio Group Buttons but not long enough to need a dropdown list. The selected choice has a dot beside it.
- Radio Group Buttons (green box) – a set of radio buttons in the form of selection buttons joined as a group. They function in the same way as radio buttons and are typically used for shorter lists where each choice also has a short description. The selected button is darkened.
- Slider Text Input / Slider Input, with 2 selections (pink box) – a slider that allows users to select a minimum and maximum limit from a range of values. The selected range, if shorter than the allowed range, can also be dragged along the slider to change the min and max while keeping the interval length between them. Typically used to filter a specific time period.
- Select Input – a dropdown list of selections. Only one choice can be selected. Typically used for a longer list of text choices. If the list is longer than the dropdown window, scrolling is enabled to browse up and down the list.
- Checkbox Group Input – a group of checkbox selections. Allows more than one selection, or none at all.
- Material Switch (red box) – a toggle switch for choosing between 2 options.
- Slider Input, single selection (green box) – a bar slider that allows the user to select one value along the slider bar. Typically used for a range of continuous values.
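For reference, the control types listed above correspond to the following shiny and shinyWidgets input functions. The ids, labels and choices in this sketch are placeholders, not the app's actual ones.

```r
library(shiny)
library(shinyWidgets)

# Monthly labels covering the dataset period (Jul 2013 to Aug 2014)
months <- format(seq(as.Date("2013-07-01"), as.Date("2014-08-01"), by = "month"), "%Y-%m")

controls <- tagList(
  # Radio Buttons: exactly one choice
  radioButtons("group_by", "Group by", choices = c("Gender", "City", "Zodiac")),
  # Radio Group Buttons: same behaviour, rendered as joined buttons
  radioGroupButtons("flow", "Cash flow", choices = c("Deposit", "Withdraw")),
  # Slider Text Input with two handles: a min/max range over text labels
  sliderTextInput("period", "Period", choices = months,
                  selected = c(months[1], months[length(months)])),
  # Select Input: dropdown, single choice
  selectInput("rate", "Interest rate", choices = c("Overnight", "1 week", "1 year")),
  # Checkbox Group Input: zero or more choices
  checkboxGroupInput("cities", "Cities", choices = paste("City", LETTERS[1:7])),
  # Material Switch: binary toggle
  materialSwitch("log_scale", "Log scale", value = FALSE),
  # Slider Input with a single handle
  sliderInput("k", "Number of clusters", min = 2, max = 10, value = 4)
)
```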
Interactive Visualizations
Some visualizations in the app are static plots, while others are interactive visualizations generated by the R library plotly. In an interactive visualization, the corresponding tooltip appears on mouse over. A single click on a category in the legend hides that category from the plot. In contrast, a double click hides all categories except the selected one. Plotly interactive visualizations also allow users to zoom in on a selected range of values by dragging to select it.
Additionally, the animated bubble plot has an animation feature that shows how values change over time (green). The animation controls allow users to play the changes over time automatically by clicking the Play button, or to move through time manually by clicking and dragging the slider.
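An animated bubble plot of this kind can be sketched with plotly's `frame` argument, which adds the Play button and the time slider. The data below is randomly generated for illustration and the variable names are our own.

```r
library(plotly)

# Made-up monthly figures for three cities
set.seed(1)
df <- expand.grid(city = paste("City", LETTERS[1:3]), month = 1:6)
df$deposit <- runif(nrow(df), 10, 100)
df$withdraw <- runif(nrow(df), 10, 100)
df$users <- sample(50:500, nrow(df))

# One bubble per city; 'frame' steps the animation through the months
p <- plot_ly(
  df,
  x = ~deposit, y = ~withdraw,
  size = ~users, color = ~city, frame = ~month,
  type = "scatter", mode = "markers",
  text = ~paste(city, "month", month), hoverinfo = "text"
)

# Label the slider with the current frame value
p <- animation_slider(p, currentvalue = list(prefix = "Month "))
```

Printing `p` in RStudio or returning it from `renderPlotly()` displays the plot with the legend, tooltip and zoom interactions described above.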
Visualization Methodologies and Analysis
References
Banner image credit to: China Money Network