Analysis of User and Merchant Dropoff for Sugar App Time Series

From Analytics Practicum
Revision as of 19:40, 17 April 2016 by Long.kang.2012

Abstract

A two-sided market is a platform that enables two groups of end-users to interact with each other while the platform charges for transactions made between the groups. In such a market, each additional User should attract more Merchants and vice versa, a phenomenon known as the network effect. While many papers on two-sided markets exist, few of them test for network effects in ecommerce. This paper tests for the presence of network effects and solves the chicken-or-egg problem by examining the relationship between Users, Merchants and Revenue, using a unique dataset provided by a location-based deals mobile app.

Aggregating transactional data into 97 rows of weekly numbers of Users, Merchants and Revenue, we constructed three multiple linear regression models on time series data to test the hypotheses. Results show that network effects are not present: additional Merchants are associated with increased Users, but additional Users are not associated with additional Merchants. Solving the chicken-or-egg problem, results show that additional Users, not Merchants, are associated with more Revenue. From this, we derived two recommendations for the app. Firstly, it can allocate more resources to user acquisition than to merchant acquisition. Secondly, the app should attempt to increase its network effect by reducing information asymmetry. In addition, we created a prediction model that predicts end-of-month revenue with an R-squared value of 0.87 given user data.


Business Motivations and Objectives

Literature Review

Two-Sided Market

The definition of a two-sided market is broadly about getting two sides “on board”, but such a definition may not be restrictive enough (Rochet & Tirole, 2004). Current definitions of multi-sided platforms may be too inclusive or broad (Hagiu & Wright, 2015). However, two key features are present in multi-sided platforms and not in other businesses: (1) the platform enables direct interaction between two or more sides, and (2) each side has a relationship with the platform (Hagiu & Wright, 2015). Rochet & Tirole (2004) defined a market as two-sided if the platform can change the volume of transactions by charging one side of the market and subsidizing the price paid by the other side by an equal amount.

Network effects can be positive or negative, and same-sided or cross-sided. Same-sided network effects affect the group they originate from; for example, an additional fax machine makes the whole network more valuable for everyone who owns a fax machine. Cross-sided network effects arise when one side of a multi-sided market affects the other side, and they can be either positive or negative. An example of a negative cross-sided network effect can be found in the media industry, where advertisers exert a negative effect on the number of users because users are averse to additional advertisements (Reisinger, 2004). A positive cross-sided network effect is one where one group fuels demand from the other, creating a virtuous cycle (Eisenmann, Parker & Van Alstyne, 2006).

Various papers have demonstrated network effects in two-sided markets. However, papers on two-sided markets usually explore pricing choices, whereas papers on network effects usually look into user adoption and network size (Rysman, 2009). Furthermore, Rysman (2009) found that papers on two-sided markets focus more on media, payment systems, and matching markets, while papers on network effects look into technology and telecommunications markets. Kim, Lee & Park (2012) tested the existence of cross-sided network effects using a novel approach, examining the advantage of the incumbent (Groupon) over the new entrant (Living Social). Rysman (2007) found a positive feedback loop between consumers and merchants using payment card data, suggesting cross-sided network effects. However, from our research, few or no papers test network effects for ecommerce mobile apps. Furthermore, papers testing network effects on online platforms usually do not have direct data on Users and Merchants. Lastly, few papers have used a time-regression model to examine network effects.

As such, this paper will utilize a unique ecommerce data set to test the existence of cross-sided network effects with empirical User and Merchant data using a time-regression model approach.

Methodology

The ecommerce data set is provided by a location-based ecommerce mobile app. The mobile app offers local deals tailored to the user’s location. It helps local small businesses such as cafes, restaurants, and small retail shops get discovered and market deals to users in close proximity. Whenever users open the app, they see a list of discounted items from nearby merchants. By definition, the app is a platform in a two-sided market. As users do not interact with other users in the app, and merchants do not interact with other merchants, same-sided network effects do not apply. Its data on user growth and merchant growth therefore allow us to test our hypotheses on cross-sided network effects.

Data

The app company provided the dataset. It consists of transactional data from Feb 2014 to Jan 2016 and has 3 tables:

  • Users table – over 46,000 rows of user IDs and their join dates
  • Merchants table – over 1,440 rows of merchants and their join dates
  • Order table – over 54,000 rows of orders and the margin of each order


Data Preparation

The first stage of data preparation restricts the data set to Singapore. First, the Users table was cleaned to contain only Singapore users. Second, the Merchants table was cleaned to contain only Singapore businesses and to eliminate duplicates. Third, the Users and Merchants tables were inner joined with the Order table to eliminate all non-Singaporean users and businesses. We further cleaned the data using each order’s latitude and longitude to eliminate all orders made outside Singapore.

The second stage removes invalid orders and outliers. Using the order status, we removed orders that were cancelled, unpaid, or refunded, since they do not represent revenue. Orders that were part of the app’s promotional giveaways were removed as well.

Finally, data was aggregated into weeks and three time series were formed:
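
The filtering steps above can be sketched in a few lines of Python. This is only an illustration, not the authors' workflow (the analysis itself was done in JMP). Every field name (`user_id`, `merchant_id`, `lat`, `lon`, `status`, `is_promo`), the status labels, and the latitude/longitude bounding box for Singapore are assumptions, since the source does not list the table schema.

```python
# Hypothetical schema and thresholds; none of these names come from the source.
SG_LAT = (1.15, 1.48)      # assumed rough latitude bounds for Singapore
SG_LON = (103.59, 104.10)  # assumed rough longitude bounds for Singapore
INVALID_STATUSES = {"cancelled", "unpaid", "refunded"}


def clean_orders(orders, sg_user_ids, sg_merchant_ids):
    """Keep valid, non-promotional orders placed in Singapore.

    orders: list of dicts, one per order.
    sg_user_ids / sg_merchant_ids: sets of IDs that survived the
    Singapore-only cleaning of the Users and Merchants tables.
    """
    kept = []
    for order in orders:
        # Inner join: drop orders whose user or merchant is not in the cleaned tables.
        if order["user_id"] not in sg_user_ids:
            continue
        if order["merchant_id"] not in sg_merchant_ids:
            continue
        # Drop orders placed outside the assumed Singapore bounding box.
        in_lat = SG_LAT[0] <= order["lat"] <= SG_LAT[1]
        in_lon = SG_LON[0] <= order["lon"] <= SG_LON[1]
        if not (in_lat and in_lon):
            continue
        # Drop invalid orders and promotional giveaways.
        if order["status"] in INVALID_STATUSES or order.get("is_promo", False):
            continue
        kept.append(order)
    return kept


sample = [
    {"user_id": 1, "merchant_id": 10, "lat": 1.30, "lon": 103.85, "status": "paid"},
    {"user_id": 1, "merchant_id": 10, "lat": 1.30, "lon": 103.85, "status": "refunded"},
    {"user_id": 2, "merchant_id": 10, "lat": 1.30, "lon": 103.85, "status": "paid"},
    {"user_id": 1, "merchant_id": 10, "lat": 3.14, "lon": 101.69, "status": "paid"},
    {"user_id": 1, "merchant_id": 10, "lat": 1.30, "lon": 103.85, "status": "paid",
     "is_promo": True},
]
cleaned = clean_orders(sample, sg_user_ids={1}, sg_merchant_ids={10})
```

Only the first sample order survives: the others are refunded, belong to a user outside the cleaned set, fall outside the bounding box, or are flagged as promotional.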

  • numUsers: The number of new users for that week
  • numMerchants: The number of new merchants for that week
  • Revenue: The revenue for that week.

After the data preparation step, the aggregated dataset consists of 97 rows of numUsers, numMerchants and Revenue. The first week of Jan 2016 was excluded because its data was incomplete.
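
As a rough sketch of the weekly aggregation, the snippet below counts new users and new merchants per week and sums order margins per week. It rests on two assumptions not stated in the source: weeks start on Monday, and weekly Revenue is the sum of order margins. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta


def week_start(d):
    """Return the Monday of the week containing date d (assumed week convention)."""
    return d - timedelta(days=d.weekday())


def weekly_series(user_joins, merchant_joins, orders):
    """Build rows of (week, numUsers, numMerchants, Revenue).

    user_joins / merchant_joins: iterables of join dates.
    orders: iterable of (order_date, margin) pairs.
    Weeks with no activity in any series are omitted from the output.
    """
    num_users = Counter(week_start(d) for d in user_joins)
    num_merchants = Counter(week_start(d) for d in merchant_joins)
    revenue = defaultdict(float)
    for order_date, margin in orders:
        revenue[week_start(order_date)] += margin
    weeks = sorted(set(num_users) | set(num_merchants) | set(revenue))
    # Counter returns 0 for missing weeks; revenue.get avoids creating new keys.
    return [(w, num_users[w], num_merchants[w], revenue.get(w, 0.0)) for w in weeks]


rows = weekly_series(
    user_joins=[date(2014, 2, 3), date(2014, 2, 5), date(2014, 2, 10)],
    merchant_joins=[date(2014, 2, 4)],
    orders=[(date(2014, 2, 6), 1.5), (date(2014, 2, 9), 2.0), (date(2014, 2, 11), 4.0)],
)
```

A full pipeline would also drop the incomplete final week, as described above.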


Tools Used

SAS JMP Pro 12 is used to perform the time series analysis. It is an analytics package that can handle large volumes of data efficiently, which is essential since the app’s data is too large to be handled by software such as Microsoft Excel.

Constructing the Population Regression Model

Each hypothesis is tested with a multiple linear regression model. Multiple rather than simple linear regression is used to avoid omitted variable bias, in which the coefficient estimates are wrong on average because a related variable is not included.
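
A small simulation can illustrate omitted variable bias. It uses synthetic data, not the app's data, and all coefficients are made up for the illustration. The outcome depends on two correlated regressors. Regressing it on the first regressor alone inflates that regressor's slope. Controlling for the second regressor recovers a value close to the true coefficient; this is done here by partialling out, using the Frisch-Waugh-Lovell result, so the sketch needs no matrix library.

```python
import random


def slope(x, y):
    """OLS slope of y on x, with an intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals from regressing y on x, with an intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [(yi - mean_y) - b * (xi - mean_x) for xi, yi in zip(x, y)]


rng = random.Random(42)
n = 5000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + rng.gauss(0, 1) for a in x1]          # x2 is correlated with x1
y = [2.0 * a + 3.0 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

# Omitting x2: the slope on x1 absorbs 3.0 * 0.8 and lands near 4.4, not 2.0.
biased = slope(x1, y)

# Controlling for x2: regress y on the part of x1 that x2 does not explain.
# This equals the coefficient on x1 in the regression of y on both x1 and x2.
x1_unexplained = residuals(x2, x1)
controlled = slope(x1_unexplained, y)

print(f"simple regression slope on x1:   {biased:.2f}")
print(f"slope on x1 controlling for x2:  {controlled:.2f}")
```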

With time series data, the definition of the population regression model to estimate is:
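
The equation is not shown here. A standard form consistent with the notation used in this section would be the following; the intercept β_0 and the error term ε_t are assumptions, as the text does not name them.

```latex
y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \dots + \beta_k x_{k,t} + \varepsilon_t
```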

Depending on the hypothesis, the dependent variable y_t and the k independent variables x_(k,t) will differ. The parameter estimates indicate which variables significantly explain the dependent variable and which are more important. However, a plot of the series suggests that a trend is present for Revenue and Users but not for Merchants.

To detect a trend, an extra term, β(k+1)·t, is added, where the time index t increases by one each week (9, 10, 11, 12, …). This results in the following regression model:
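
The equation is not shown here. Assuming a standard linear form with an intercept β_0 and an error term ε_t (neither is named in the text), adding the trend term to the k independent variables would give:

```latex
y_t = \beta_0 + \beta_1 x_{1,t} + \dots + \beta_k x_{k,t} + \beta_{k+1}\, t + \varepsilon_t
```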

Including the trend variable is important to avoid the third-variable problem; otherwise, it is difficult to isolate the true effects of the independent variables.

Furthermore, it may take time for an independent variable to affect the dependent variable, so it is important to check whether lagging a variable improves the fit of the model. A maximum lag of eight weeks, roughly two months, was chosen arbitrarily, as it was deemed sufficient to detect such effects if they were present. To detect lagged effects, extra independent variables are constructed by lagging each variable by 1 to 8 periods, i.e. (t-1) to (t-8).
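
Lagged regressors and a trend column can be constructed and fitted as in the sketch below. This is a pure-Python illustration on synthetic data, not the JMP procedure used in the study. The function names are hypothetical, and the normal-equations solver is a stand-in for whatever least-squares routine is available. Building lags 1 to L drops the first L observations, so with a 1-based week index and L = 8 the first usable row has t = 9.

```python
import random


def make_lags(series, max_lag):
    """Return {lag: values of series at t - lag} for rows t = max_lag .. n-1."""
    n = len(series)
    return {lag: [series[t - lag] for t in range(max_lag, n)]
            for lag in range(1, max_lag + 1)}


def solve(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (no perfectly collinear columns).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_lagged_ols(y, x, max_lag):
    """OLS of y_t on an intercept, x_(t-1) .. x_(t-max_lag), and a trend t.

    Returns [intercept, coefficient for lag 1, ..., lag max_lag, trend].
    """
    lags = make_lags(x, max_lag)
    rows = []
    for i, t in enumerate(range(max_lag, len(y))):
        lagged = [lags[lag][i] for lag in range(1, max_lag + 1)]
        rows.append([1.0] + lagged + [float(t + 1)])  # t + 1: 1-based week index
    target = y[max_lag:]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, target)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic check: y depends only on x lagged by two periods plus a trend.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(60)]
y = [0.0, 0.0] + [2.0 + 3.0 * x[t - 2] + 0.5 * (t + 1) for t in range(2, 60)]
coef = fit_lagged_ols(y, x, max_lag=3)
print([round(c, 3) for c in coef])
```

On this noise-free synthetic series the fit should place essentially all weight on the second lag and the trend. Real data would call for standard errors and significance tests, which this sketch omits.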


Method for Hypothesis 1

After constructing the lag and time variables, to test hypothesis 1 (whether User_t is a function of Merchants), we estimate the population regression model of Users to be:

Method for Hypothesis 2

To test hypothesis 2, we also need to test the effects of Users on Merchants. As such, we estimate the population regression model for Merchant_t to be:

Method for Hypothesis 3

To test hypotheses 2 and 3, we estimate the population regression model of Revenue_t to be:

Results

Hypothesis 1: Merchant Growth (IV) is associated with User Growth (DV)


Hypothesis 2: User Growth (IV) is associated with Merchant Growth (DV)


Hypothesis 3: Revenue Growth is a function of User and Merchant Growth

Discussion

Implications

Prediction Model

Univariate Prediction Model

Multivariate Prediction Model

Conclusion

References