Difference between revisions of "Group08 Report"

Revision as of 20:09, 8 December 2018

Visualizing Future of Crowd Funding with Yu’e Bao

Yu’e Bao (余额宝) is an investment product offered by Alipay (支付宝), a mobile and online payment platform established by China’s multinational conglomerate Alibaba Group. In June 2013, Alibaba Group launched Yu’e Bao in collaboration with Tianhong Asset Management Co., Ltd., forming the first internet fund in China. Since then, Yu’e Bao has become the nation’s largest money market fund and, by Feb 2018, had US$251 billion under its management. In Chinese, Yu’e Bao means “Leftover Treasure”. Alipay users can deposit their extra cash, for example, money left over from online shopping, into this investment product. The money is invested via a money market fund with no minimum amount or exit charges, and interest is paid daily. While major banks offer 0.35% annual interest on deposits, Yu’e Bao may offer users 6% interest, with the convenience and freedom to deposit and withdraw at any time via the Alipay mobile app. As a result, Yu’e Bao became extremely popular in China.

Using various data visualization methodologies and techniques, coupled with transaction-level survival analysis and time-series clustering, this project aims to build an interactive tool on the R Shiny framework that uncovers the associations between Yu’e Bao user profiles, cash flow behaviour, time and other financial factors. This will help us understand how people in China invest their money through Yu'e Bao and yield insights that are valuable to the internet money market fund industry.

This report is separated into 9 sections. After this introduction, section 2 discusses the motivation and objectives of the project. Section 3 reviews related literature and presents our critiques of it. Sections 4 and 5 describe the dataset and our data preparation process respectively, followed by the application introduction, installation and user guide in section 6. Section 7 explains in detail the data exploration, analysis and insights gained through the Yu'e Bao dashboard. Finally, sections 8 and 9 conclude the report by highlighting the key challenges faced during the project and possible future work respectively.

Motivation and Objectives

The dataset used in this project was released for a competition (The Purchase and Redemption Forecasts - Challenge the Baseline) organized by Alibaba Cloud, TIANCHI Aliyun. The competition challenges its participants to train models that predict the future cash flow of Yu’e Bao users, based on historical financial data from the government, Yu’e Bao and its users, together with the users' profiles. The results can aid Ant Financial Services Group, the Alibaba Group affiliate that operates Alipay, in its business of processing cash inflow and outflow. Hence, most of the work done on this dataset focuses only on achieving the best score for predictive modelling. At the time of this project, no published work offered other kinds of data analysis or insights.

In view of this, we have chosen to provide an alternative analytical approach to the dataset by building a Shiny app with interactive features that applies data visualization methodologies to present the data and its insights interactively. We also want to perform additional survival analysis and time-series clustering, and to generate visualizations of the analytical results dynamically. This visualization platform is built with RStudio, using the R programming language and its rich libraries. Our final objectives are to:

  1. Provide interactive visualizations that enable users to explore the dataset along various dimensions with different chart types and to gain corresponding insights
  2. Dynamically generate different customer segmentations to analyze customer deposit and withdrawal behaviour, enabling users to explore and visualize how Yu'e Bao user behaviour differs across customer segments
  3. Provide interactive visualization for time-series clustering and survival analysis, and enable users to perform the analyses with different input parameters

Literature Review

Despite the popularity of internet crowdfunding in China, there is little scholarly research in this area. Shen Lin Bing's research study [http://jai.iijournals.com/content/20/3/95 (2018)] reviews the history of marketplace lending globally, with an emphasis on China, and further explores the industry's development and driving forces in China. A small part of the article relates to Yu'e Bao as an innovative financial investment service, with reference to the visualization figure on the left. The figure plots the Yield Rate of Yu’e Bao, the Interest Rate for Current Deposit of Banks, and the Fund Share of Yu’e Bao from May 2013 to May 2017. The Yu'e Bao Big Data Report posted by a big data forum [https://cloud.tencent.com/developer/article/1131664 (2014)] provides a financial summary of Yu'e Bao and its customer demographics in general. The webpage presents this summary in a comical and simple style. The report posted by SINA Financial [http://finance.sina.com.cn/money/fund/20140714/152319697476.shtml (2014)] plots the total counts of Yu'e Bao deposit and withdrawal transactions from 2013 to 2014 and concludes that the average daily count of withdrawal transactions is three times higher than that of deposits.

None of the analyses and visualizations mentioned above is interactive. Although they provide a general summary of Yu'e Bao customer behaviour, they offer little or no detailed analysis of the relationship between customer profile and behaviour. With a growing proportion of funds flowing into such investment tools, a centralized and interactive data visualization platform for analyzing Yu’e Bao customer segmentation and behaviour will be very helpful for the healthy growth of its ecosystem.

Dataset & R Libraries Used

The source of the data is the Alibaba Cloud TIANCHI competition [https://tianchi.aliyun.com/getStart/introduction.htm?spm=5176.11409106.5678.1.12b13a01KYqb3C&raceId=231573&_lang=en_US The Purchase and Redemption Forecasts - Challenge the Baseline]. The dataset from this competition comprises Yu'e Bao users' profiles, transaction behaviour, and financial interest rates over time, in 4 CSV tables as follows:

{| class="wikitable"
|-
! Table Name !! Description
|-
| <b><i>user_balance_table.csv</i></b> || 2,840,421 observations of the time series cash flow data from 28,041 Yu’e Bao users for 14 months, from 1<sup>st</sup> Jul 2013 to 31<sup>st</sup> Aug 2014. Cash flow data includes 18 variables of account balances, different types of deposits, withdrawals, interest earned and, if funds are used to make online purchases, categories of purchase.
|-
| <b><i>user_profile_table.csv</i></b> || 28,041 rows of user profile data that describes the user’s gender, zodiac sign, and registered city, based on each user ID, in 4 columns.
|-
| <b><i>mfd_day_share_interest.csv</i></b> || 427 observed dates from 1<sup>st</sup> Jul 2013 to 31<sup>st</sup> Aug 2014 and the corresponding Yu’e Bao’s daily and 7-daily interest rates.
|-
| <b><i>mfd_bank_shibor.csv</i></b> || 294 observed dates from 1<sup>st</sup> Jul 2013 to 31<sup>st</sup> Aug 2014 and the corresponding 8 types of [http://www.shibor.org/shibor/web/html/index_e.html Shibor] interest rates, from overnight interest rates to yearly interest rates. Although the time frame is the same as above, the number of observations is less here because there are no Shibor interest rates data for weekends and public holidays.
|}


The dataset poses some challenges that require cleaning and recoding. As the dataset is relatively large, and for the purposes of our analysis, we also need to pre-process the data differently for each of our visualizations, e.g. by reshaping, filtering and aggregating. R libraries such as tidyverse, dplyr, plyr, cluster, chron, grid, lubridate, xts and data.table will be considered for this purpose.
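The filtering, joining and aggregating steps described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the toy data frames stand in for the competition CSV files, and every column name (`user_id`, `report_date`, `purchase_amt`, `redeem_amt`, `gender`, `overnight_rate`) is hypothetical. The last step also shows why joining against the Shibor table leaves gaps: 2 Aug 2014 was a Saturday, so it has no Shibor rate.

```r
# Toy stand-ins for user_balance_table.csv, user_profile_table.csv and
# mfd_bank_shibor.csv. All column names here are assumed for illustration.
balance <- data.frame(
  user_id      = c(1, 1, 2, 2, 3, 3),
  report_date  = as.Date(c("2014-08-01", "2014-08-02", "2014-08-01",
                           "2014-08-02", "2014-08-01", "2014-08-02")),
  purchase_amt = c(100, 0, 50, 20, 0, 0),
  redeem_amt   = c(0, 30, 0, 10, 0, 5)
)
profile <- data.frame(user_id = c(1, 2, 3), gender = c("F", "M", "F"))
shibor  <- data.frame(report_date    = as.Date("2014-08-01"),
                      overnight_rate = 3.2)

# Filtering: keep only user-days with some cash flow.
active <- subset(balance, purchase_amt > 0 | redeem_amt > 0)

# Joining: attach the profile attributes to each transaction row.
joined <- merge(active, profile, by = "user_id")

# Aggregating: total deposits and withdrawals per date and gender.
daily <- aggregate(cbind(purchase_amt, redeem_amt) ~ report_date + gender,
                   data = joined, FUN = sum)

# Left join with the Shibor rates; all.x = TRUE keeps weekend and holiday
# dates, which have no Shibor rate, as NA instead of dropping them.
daily_rates <- merge(daily, shibor, by = "report_date", all.x = TRUE)
print(daily_rates)
```

Keeping the missing rates as `NA` rather than dropping those dates is one possible design choice; it preserves the full cash flow series and leaves the treatment of non-trading days to each visualization.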

[[Image:g8_Metadata.png|thumb|Full data variables description can be found here|none]]
{| class="wikitable"
|-
! R Library !! Description
|-
| [https://plot.ly/r/ Plotly] || Plotly's R graphing library makes interactive, publication-quality graphs online.
|-
| [https://www.rdocumentation.org/packages/treemap/versions/2.4-2/topics/treemap Treemap] || A treemap is a space-filling visualization of hierarchical structures. This function offers great flexibility to draw treemaps.
|-
| [https://www.rdocumentation.org/packages/corrplot/versions/0.84 Corrplot] || A graphical display of a correlation matrix or general matrix. It also contains some algorithms to do matrix reordering.
|-
| [https://www.rdocumentation.org/packages/lattice/versions/0.20-38 Lattice] || A powerful and elegant high-level data visualization system inspired by Trellis graphics, with an emphasis on multivariate data. Lattice is sufficient for typical graphics needs, and is also flexible enough to handle most nonstandard requirements.
|-
| [https://www.rdocumentation.org/packages/survival/versions/2.43-3 Survival] || A formidable package that allows users to do survival analysis.
|-
| [https://www.rdocumentation.org/packages/ranger/versions/0.10.1/topics/ranger Ranger] || Ranger is a fast implementation of random forests (Breiman 2001) or recursive partitioning, particularly suited for high dimensional data. Classification, regression, and survival forests are supported.
|-
| [https://www.rdocumentation.org/packages/ggfortify/versions/0.4.5 Ggfortify] || Unified plotting tools for commonly used statistics, such as GLM, time series, PCA families, clustering and survival analysis. The package offers a single plotting interface for these analysis results and plots in a unified style using 'ggplot2'.
|-
| [https://www.rdocumentation.org/packages/survminer/versions/0.2.1 Survminer] || Provides functions for facilitating survival analysis and visualization.
|-
| [https://shiny.rstudio.com/ Shiny] || Shiny is an R package that makes it easy to build interactive web apps straight from R.
|-
| [https://www.rdocumentation.org/packages/shinydashboard/versions/0.7.1 Shinydashboard] || This package allows users to create dashboards with 'Shiny' and provides a theme on top of 'Shiny', making it easy to create attractive dashboards.
|-
| [https://cran.r-project.org/web/packages/shinyWidgets/index.html ShinyWidgets] || Some custom input widgets to use in Shiny applications, like a toggle switch to replace checkboxes, and other components to pimp your apps.
|-
| [https://rdrr.io/github/nik01010/dashboardthemes/man/dashboardthemes.html Dashboardthemes] || The dashboardthemes package provides two main features:
* Using new pre-defined themes and logos for dashboards.
* Creating custom themes and logos for dashboards.
|}
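As a rough illustration of how the survival package listed above might be applied to transaction data, the sketch below fits a Kaplan-Meier curve to a toy table. The event definition (days until a user's first withdrawal, with users who never withdrew treated as censored) and the column names `days_to_withdraw` and `withdrew` are assumptions made for this example; the report does not specify how the project defines its survival event.

```r
library(survival)

# Hypothetical per-user data: time in days until the first withdrawal.
# withdrew = 1 means the withdrawal was observed, 0 means the user was
# still invested at the end of the observation window (censored).
users <- data.frame(
  days_to_withdraw = c(5, 12, 12, 30, 45, 60),
  withdrew         = c(1, 1, 0, 1, 0, 1)
)

# Kaplan-Meier estimate of the probability that a user has not yet
# withdrawn by each observed time point.
fit <- survfit(Surv(days_to_withdraw, withdrew) ~ 1, data = users)
print(summary(fit))
```

Replacing `~ 1` with a grouping variable such as a customer segment would produce one curve per group, which packages like survminer or ggfortify can then plot.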



References

Banner image credit to: China Money Network