Analysis of User and Merchant Dropoff for Sugar App: Data Source


Data Source

We have four years of data from Sugar’s database and dashboards, spanning 2013 to 2016. However, for our analysis we have chosen to focus on only two full years of data, so as to ensure uniformity and consistency, especially with regard to temporal comparisons. From our dataset, we have identified certain variables that we will use in our analysis.

However, before embarking on our analysis, we have to do extensive data cleaning. For example, we have 45,000 users, but a number of them are throwaway test accounts, and some are dead on arrival, with no orders to their name.

The users and merchants also belong to different regions, namely Singapore, Jakarta and Hong Kong, so we have to extract and focus on the Singapore region’s users and merchants.

Data Preparation

Data Cleaning

As the data for Sugar is massive, cleaning the data took several stages.

SugarBranch.png

First, the Users table was cleaned to contain only Singapore users with the following steps (a sketch of the logic follows the list):

  1. Clean out all Users that are not from Singapore by deleting rows where [Country_id] != 1
  2. Clean out user_id 99280 (invalid mobile number of 10000 and balance of 1200)
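
A minimal pandas sketch of these two steps, assuming the table has been exported to CSV; the file name and exact column capitalisation are assumptions:

```python
import pandas as pd

# Load the raw Users table; the file name is an assumption, as the
# actual source was an export from Sugar's database.
users = pd.read_csv("users_raw.csv")

# Step 1: keep only Singapore users (Country_id == 1).
users = users[users["Country_id"] == 1]

# Step 2: drop the known invalid account, user_id 99280.
users = users[users["User_id"] != 99280]
```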

Secondly, the Branch table was cleaned to contain only Singapore, non-Sugar businesses and to eliminate duplicates (see the sketch after this list):

  1. Clean out all Merchants that are not from Singapore by deleting rows where [Country_id] != 1
  2. Clean out all Merchants that are Sugar’s administrative accounts
  3. Clean out all Merchants that are accounts created for specific campaigns (e.g. Blackout Friday, SugarGSS, etc.)
  4. Clean out Merchant duplicates
  5. Correct the [Area] field for Merchants that have invalid or blank values
  6. Correct the [Lng, Lat] coordinates for branches with incorrect coordinates (i.e. [Country_id] == 1 but coordinates that correspond to another country)
  7. Clean out all Merchants that have invalid [Created_At] dates
  8. Clean out all Merchants whose [Created_At] dates are not in 2014 or 2015
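
A hedged pandas sketch of the filtering steps; the file name and the name patterns used to spot administrative and campaign accounts are assumptions (the real accounts were identified manually), and the hand corrections of steps 5 and 6 are omitted:

```python
import pandas as pd

branches = pd.read_csv("branches_raw.csv")  # file name is an assumption

# Steps 1-4: keep Singapore branches, drop Sugar's own administrative
# and campaign accounts, then drop duplicates. The name patterns below
# are hypothetical placeholders for the manual identification.
branches = branches[branches["Country_id"] == 1]
is_internal = branches["Name"].str.contains(
    "Sugar Admin|Blackout Friday|SugarGSS", case=False, na=False
)
branches = branches[~is_internal]
branches = branches.drop_duplicates(subset=["Name", "Lng", "Lat"])

# Steps 7-8: coerce invalid [Created_At] dates to NaT, then keep only
# branches created in 2014 or 2015 (NaT rows drop out automatically).
branches["Created_At"] = pd.to_datetime(branches["Created_At"], errors="coerce")
branches = branches[branches["Created_At"].dt.year.isin([2014, 2015])]
```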

Next, the Users table and Branch table were inner joined with the Order table to eliminate all non-Singaporean users and businesses. We further cleaned the data using latitude and longitude to eliminate all orders made outside Singapore.
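
A sketch of the joins, assuming the cleaned tables from the steps above have been saved to CSV; the key columns and the rough Singapore bounding box are assumptions (the precise coastline filter comes later in QGIS):

```python
import pandas as pd

users = pd.read_csv("users_clean.csv")        # cleaned tables from above
branches = pd.read_csv("branches_clean.csv")  # (file names are assumptions)
orders = pd.read_csv("orders_raw.csv")

# Inner joins keep only orders whose user and branch both survived the
# cleaning, eliminating non-Singaporean users and businesses.
orders = orders.merge(users[["User_id"]], on="User_id", how="inner")
orders = orders.merge(branches[["Branch_id"]], on="Branch_id", how="inner")

# Rough bounding box around Singapore to drop orders created abroad;
# the extents are approximate.
in_sg = orders["Lat"].between(1.15, 1.48) & orders["Lng"].between(103.6, 104.1)
orders = orders[in_sg]
```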

SugarMerchant.png

Lastly, the Branch table was inner joined with the Orders table to form the Merchant table (for merchant analysis). It was further cleaned to include only paid orders and decomposed further for both the funnel plot analysis and the time-series data mining analysis (a sketch follows the list):

  1. Clean out all unpaid orders (i.e. retain only orders where [Status] is 1, 2 or 4)
  2. Clean out all orders where [Price] == 0
  3. Clean out all orders gifted during a Sugar event (i.e. [Is_Reward] == 1)
  4. Clean out all orders that are lucky draws
  5. Decompose [Redeemed_Date] into Hour, Day of Week and Month
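
A pandas sketch of steps 1 to 3 and 5, assuming the Singapore-only orders have been saved to CSV; step 4 is omitted because the lucky draw flag is not named in our schema notes:

```python
import pandas as pd

orders = pd.read_csv("orders_sg.csv")  # Singapore-only orders (name assumed)

# Steps 1-3: retain paid orders ([Status] 1, 2 or 4), drop zero-priced
# orders and orders gifted during Sugar events.
orders = orders[orders["Status"].isin([1, 2, 4])]
orders = orders[orders["Price"] != 0]
orders = orders[orders["Is_Reward"] != 1]

# Step 5: decompose the redemption timestamp for time-series mining.
orders["Redeemed_Date"] = pd.to_datetime(orders["Redeemed_Date"])
orders["Hour"] = orders["Redeemed_Date"].dt.hour
orders["Day of Week"] = orders["Redeemed_Date"].dt.day_name()
orders["Month"] = orders["Redeemed_Date"].dt.month
```
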
Geospatial Data Preparation

Segmentation of data: Subzone Planning Area

One interesting piece of information that our data contains is the longitude and latitude of orders and branches, which means that we are able to pinpoint the location from which a user searches for and commits to an order, and the branch at which the user will be redeeming his or her order.

Sugar has predefined areas that are tagged to each order and branch, and we initially wanted to use these predefined areas before realising that they were divided rather arbitrarily, with some areas being a single shopping mall, and others being as large as a region (e.g. Ang Mo Kio).

GS1.png

As none of us had any geospatial analytics background, we were reluctant to start on this aspect of our project, but eventually caved in after realising the potential insights that could be gained by moving in the geospatial direction.

For our planning areas, instead of using Sugar’s predefined areas, we decided to use the planning boundaries as defined by Singapore’s Urban Redevelopment Authority (URA). The various planning areas are defined and segmented depending on land use, proximity to the Central Area, and existing physical separators like expressways, rivers, major open spaces and other demarcators, which we believe is far more objective than the unknown metrics behind the predefined areas. The shape file for URA’s Master Plan 2014 was obtained from our professor, Professor Kam Tin Seong, and it is what we use for our subsequent geographical segmentation.

However, the coordinate reference system (CRS) of the Master Plan 2014 shape file was a custom projected CRS, whilst our latitude and longitude values were in the WGS84 standard. As such, we had to convert and save the shape file in the WGS84 CRS format before launching any further analysis.
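
The conversion itself is a one-step reprojection. A sketch in geopandas, with the shapefile name as an assumption (the custom CRS is read automatically from the shapefile’s .prj sidecar):

```python
import geopandas as gpd

# Load the Master Plan 2014 subzone shapefile; the file name is an
# assumption. GeoPandas picks up the custom CRS from the .prj file.
mp2014 = gpd.read_file("MP14_SUBZONE.shp")

# Reproject to WGS84 (EPSG:4326) so the polygons line up with the
# longitude/latitude columns of our orders and branches, then save.
mp2014_wgs84 = mp2014.to_crs(epsg=4326)
mp2014_wgs84.to_file("MP14_SUBZONE_WGS84.shp")
```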

Upon adding our cleaned orders dataset (only those with status 1, 2 and 4), we realised that despite our best attempts at identifying only Singaporean branches and users, some orders were created off the coastline, which would cause a large skew should we compare distances between orders.

GS2.png

As such, we decided to only include orders that were created within the geographical boundaries of Singapore’s coastline. This is done using the data management features of QGIS, which allow our orders to be joined to the master plan shape file by location, ensuring that only orders that fall within the master plan’s shapes are included.

GS3.png

By joining attributes by location, we are able to link the orders to the planning areas (orders within the polygon of a planning area are tagged with that planning area). We performed the same location-based join for the branches. A sketch of the equivalent operation follows.
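
QGIS’s “join attributes by location” can be reproduced in geopandas; a sketch, with file and column names as assumptions:

```python
import geopandas as gpd
import pandas as pd

mp2014 = gpd.read_file("MP14_SUBZONE_WGS84.shp")  # reprojected plan
orders = pd.read_csv("orders_paid.csv")           # file name assumed

# Build point geometries from the order coordinates (WGS84).
pts = gpd.GeoDataFrame(
    orders,
    geometry=gpd.points_from_xy(orders["Lng"], orders["Lat"]),
    crs="EPSG:4326",
)

# Inner spatial join with the "within" predicate tags each order with
# its subzone and, as a side effect, drops orders falling outside the
# coastline polygons (predicate= requires geopandas >= 0.10).
orders_tagged = gpd.sjoin(pts, mp2014, how="inner", predicate="within")
```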

Using the orders (with status 1, 2 and 4 only) and branches datasets that have been combined with the planning area data, we created a basic data table indexed by the name of the subzones. The table looks as follows:

GS4.png

For each subzone, we calculated the total number and percentage of orders, as well as of branches. For example, 20 orders are from Marina South, which forms 0.02% of all orders, whilst 2 branches are in Marina South, which forms 0.18% of all branches.

Next, we calculated the percentage of orders over the percentage of branches, multiplied by 100. The rationale is as follows: if x percent of orders are in a specific subzone, then under a uniform distribution we should expect x percent of branches in that subzone as well. In an allocatively optimal situation, every subzone should therefore score 100.
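
A worked toy example of the metric, using the Marina South counts from above; the overall totals are hypothetical round numbers chosen to reproduce the quoted percentages:

```python
import pandas as pd

# Per-subzone counts; the Marina South numbers come from the text.
tbl = pd.DataFrame({"subzone": ["Marina South"], "orders": [20], "branches": [2]})
total_orders, total_branches = 100_000, 1_111  # hypothetical totals

tbl["pct_orders"] = tbl["orders"] / total_orders * 100        # 0.02
tbl["pct_branches"] = tbl["branches"] / total_branches * 100  # ~0.18

# The exploratory metric: % of orders over % of branches, times 100;
# 100 is the allocatively optimal benchmark for every subzone.
tbl["ratio"] = tbl["pct_orders"] / tbl["pct_branches"] * 100  # ~11.1
```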

However, this is not the case. In some areas, such as Marina South, we can see that these subzones are underperforming (under 100%), while other areas like Telok Blangah Drive are doing well (over 100%).

With this data preparation, we are able to segment our data according to the subzone planning area, and thus we are able to begin our preliminary analysis.

Dominant Store Analysis via Shortest Distance

After realising the limitations of our subzone planning area exploratory metric, we decided to implement a shortest-distance approach. For each order, we identify all the branches of the brand, calculate the distance to the branch that the order is going to, and calculate the distance to the closest possible branch from the order location. Assuming there are n orders and m branches per order’s brand, we are facing a run time of potentially n x m distance computations.
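
For reference, the core computation (which we ultimately delegated to QGIS, as described below) can be sketched with a haversine distance in NumPy; the function names are ours:

```python
import numpy as np

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between WGS84 points (vectorised)."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def nearest_branch(order_lat, order_lng, branch_lats, branch_lngs):
    """One order against all m branches of its brand: the inner step
    of the n x m computation described above."""
    d = haversine_km(order_lat, order_lng,
                     np.asarray(branch_lats), np.asarray(branch_lngs))
    return int(np.argmin(d)), float(d.min())
```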

As such, we decided to shortlist the brands to those that have more than 3 outlets, leaving us with 16 brands in total. Out of the 16 brands, we shortlisted further by taking the top 5 in terms of number of branches, as well as the top 5 in terms of orders. We were then left with 7 brands.

GS5.png

Using JMP, we then extracted the branches for each brand, as well as the orders for each brand.

Following that, we used QGIS to calculate the shortest possible distance by installing a plugin called MMQGIS, which provides an inbuilt tool called Hub Distance that finds the closest branch to each order.

Next, we used another inbuilt MMQGIS tool, Hub Lines, to join each branch to its respective orders. The plugin creates a vector from each order node to the branch node, whose length we can then calculate using QGIS’s field calculator.

GS6.png

Using the above two methods, we were able to export the current branch distance and nearest branch distance for each order.

However, we still do not know whether the nearest branch node assignment is wrong, or whether the distance calculation is faulty. Hence, we decided to implement our own check: if the nearest branch node assignment is wrong, we expect to see anomalies where the nearest branch distance is greater than the current branch distance.

As such, we decided to verify it with a new column (Check Nearest) that contains the following logic: if the nearest branch distance is less than or equal to the current branch distance, assign a value of 1; otherwise assign a value of 0.

GS8.png

With this check, we are able to verify that the distance to every nearest branch node is less than or equal to the distance to the current branch, i.e. the nearest branch node assignment is always at least as close as the current node.

We then created another column (named Samebranch?) which contains the following logic: if the current branch node and the nearest branch node are the same, assign a value of 1; otherwise assign a value of 0. This checks whether the user has placed an order that brings him or her to the closest possible branch. Both flag columns are sketched below.
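
A pandas sketch of the two flags; the exported file name and distance column names are assumptions:

```python
import pandas as pd

# Exported current and nearest branch distances per order.
orders = pd.read_csv("orders_with_distances.csv")

# Check Nearest: 1 when the nearest-branch distance does not exceed
# the distance to the branch actually ordered from, else 0; any 0
# flags a faulty assignment or distance calculation.
orders["Check Nearest"] = (
    orders["nearest_dist"] <= orders["current_dist"]
).astype(int)

# Samebranch?: 1 when the order's branch is also the closest branch,
# i.e. the user redeemed at the nearest possible outlet.
orders["Samebranch?"] = (
    orders["nearest_branch_id"] == orders["current_branch_id"]
).astype(int)
```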

With this data preparation, we are then able to assess the effectiveness of a store’s retail location through dominant store analysis.