Difference between revisions of "AY1516 T2 Sport Betting at Singapore Pools Project Overview Midterm"

From Analytics Practicum
Jump to navigation Jump to search
Line 201: Line 201:
 
iii. Platinum players would see lower overall profits or greater losses for their winnings and losings for returns are less salient.
 
iii. Platinum players would see lower overall profits or greater losses for their winnings and losings for returns are less salient.
 
   
 
   
Variable Stake Amount SD of Stake Amount
+
{| class="wikitable"
 
+
|-
Group Gold Platinum Gold Platinum
+
! Variable !! Stake Amount!!  !! SD of Stake Amount !! 
 
+
|-
Maximum 4,000 5,000 2,828 3,655
+
| Group || Gold || Platinum || Gold || Platinum
 
+
|-
Upper Quartile 270 425 76 107
+
| Maximum || 4,000 || 5,000 || 2,828 || 3,655
 
+
|-
Median 103 140 31 40
+
| Upper Quartile || 270 || 425 || 76 || 107
 
+
|-
Lower Quartile 50 46 13 15
+
| Median || 103 || 140 || 31 || 40
 
+
|-
Minimum 5 5 0 0
+
| Lower Quartile || 50 || 46 || 13 || 15
 +
|-
 +
| Minimum || 5 || 5 || 0 || 0 |
 +
|-
 +
}
  
 
The above data shows summary statistics of the distribution of the median stake amount for each player and the standard deviation (SD) of stake amount for each player. Our data findings from the above statistics agrees with our first hypothesis. Platinum players do bet larger amounts as shown by greater median, upper and lower quartile figures than Gold players. However, Platinum players have higher standard deviations of stake amounts than Gold, suggesting that the high stake amount of Platinum players could be driven by players who place infrequent high bet amounts. Thus, we hope that our clustering analysis would be able to single these erratic high bettors.
 
The above data shows summary statistics of the distribution of the median stake amount for each player and the standard deviation (SD) of stake amount for each player. Our data findings from the above statistics agrees with our first hypothesis. Platinum players do bet larger amounts as shown by greater median, upper and lower quartile figures than Gold players. However, Platinum players have higher standard deviations of stake amounts than Gold, suggesting that the high stake amount of Platinum players could be driven by players who place infrequent high bet amounts. Thus, we hope that our clustering analysis would be able to single these erratic high bettors.
  
Variable Odds SD of Odds
+
{| class="wikitable"
 
+
|-
Group Gold Platinum Gold Platinum
+
! Variable !! Odds !!  !! SD of Odds !! 
 
+
|-
Maximum 85.00 75.00 54.71 47.99
+
| Group || Gold || Platinum || Gold || Platinum
 
+
|-
Upper Quartile 5.06 5.65 7.69 7.55
+
| Maximum || 85.00 || 75.00 || 54.71 || 47.99
 
+
|-
Median 2.91 3.05 3.80 3.25
+
| Upper Quartile || 5.06 || 5.65 || 7.69 || 7.55
 
+
|-
Lower Quartile 1.93 1.98 1.31 1.02
+
| Median || 2.91 || 3.05 || 3.80 || 3.25
 
+
|-
Minimum 1.01 1.02 0 0
+
| Lower Quartile || 1.93 || 1.98 || 1.31 || 1.02
 +
|-
 +
| Minimum || 1.01 || 1.02 || 0 || 0 |
 +
|-
 +
}
  
 
Based on the above table with regards to the “odds” variable, it can be seen that Platinum players do prefer bet selections with higher odds as shown by their larger median, upper and lower quartile. In addition, the standard deviation of odds for Platinum players are less volatile than that of Gold, hence further suggesting that Platinum players largely prefer higher odds as compared to Gold players.
 
Based on the above table with regards to the “odds” variable, it can be seen that Platinum players do prefer bet selections with higher odds as shown by their larger median, upper and lower quartile. In addition, the standard deviation of odds for Platinum players are less volatile than that of Gold, hence further suggesting that Platinum players largely prefer higher odds as compared to Gold players.
  
Bet losses across account types
+
{| class="wikitable"
 
+
|-
Group Gold Platinum
+
! Variable !! Bet losses across account types !! 
 
+
|-
Maximum -1 0
+
| Group || Gold || Platinum  
 
+
|-
Upper Quartile -237 -289
+
| Maximum || -1 || 0
 
+
|-
Median -686 -960
+
| Upper Quartile || -237 || -289
 
+
|-
Lower Quartile -1,864 -3,099
+
| Median || -686 || -960
 
+
|-
Minimum -71,654 -80,285
+
| Lower Quartile || -1,864 || -3,099
 +
|-
 +
| Minimum || -71,654 || -80,285
 +
|-
 +
}
  
 
The statistics above only consists of unprofitable users in each account type group. Based on the table above, Platinum players as a whole suffers larger bet losses as compared to the Gold players, probably driven by their higher stake amounts and preference for higher odds.
 
The statistics above only consists of unprofitable users in each account type group. Based on the table above, Platinum players as a whole suffers larger bet losses as compared to the Gold players, probably driven by their higher stake amounts and preference for higher odds.

Revision as of 22:41, 28 February 2016


THE SPONSOR

 

THE TEAM

 

THE OVERVIEW

 

THE MANAGEMENT

 

THE DOCUMENTS

 


Proposal Midterm Final


Project Background

In today’s globalized world, the Internet has transformed the gambling environment into a multifaceted, non-physical, multi-platform, environment without boundaries. This presents loopholes for illegal gambling operators to enter the market and draw our customers away, into their unregulated arena that is susceptible to the creation of gambling addiction issues.

Singapore Pools offers a safer outlet, one where players can bet responsibly, within their means. Attrition rates have be raising over the years, and this could meant that Singapore Pools’ customers are seeking other avenues to participate in gambling activities such as illegal online-gambling sites, which may lead to irresponsible betting. Therefore, within the next few years, our sponsor seeks to undertake a data-driven approach to promote responsible gambling by monitoring the player's’ betting behaviour and performance, in hopes of highlighting alarming patterns that could indicate signs of irresponsible gambling.

Our sponsor has actually been collecting user data for the past several years, but has yet put it to good use. Just a year ago, Singapore Pools had set up a customer insights division to better understand their customers through the analysis of these user data and their first step towards a data-driven approach to promote responsible gambling was to understand the gambling behavioural patterns of their customers.

Project Objectives

The aim of our project is to allow Singapore Pools to better understand the gambling behaviours of their customers through the identification of gambling patterns. Each cluster might have their own specific ways of splitting their bets, preference for a league, different decision making process, and ways of selecting their bet selections. Such behavioural patterns could possibly be linked back to certain demographics pertaining to the cluster, allowing us to further infer reasons behind their gambling habits, and hopefully could help us identify those irresponsible gamblers too. And based on the characteristics of each cluster, the client’s end objective is to have a customized business action so as to enhance the gambling experience of the players in that particular cluster. The scope of our project is limited to the Sports Betting segment of customers who have opened betting accounts with Singapore Pools and the given dataset given involves only transactions that lies within the time period from January 2015 to March 2015.

The overall objectives of our project are to:

(1) Profile their existing pool of customers through clustering analysis

(2) Create a data visualization of the consumer betting activity

(3) Build a dashboard to visualize the profiling and data points

Data Cleaning

The original sample data set that we worked on contained over 930,000 unique observations (transactions) in a worksheet also known as TransactionList. There were two types of noise in our data set: (1) irrelevant fills – some of which are variables or observations that do not pertain to the soccer betting products, rejected bets which have no odds indicated, and ‘Championship Winner’ bet types for which is considered atypical to regular bet patterns; (2) outliers – these observations lie beyond the 99th percentile of selected parameters, the thresholds were identify based on scatterplots of the distribution on Tableau.

The following observations were removed from the original data set:

Capture.JPG
Capture2.JPG

The dilemma we faced was that if we removed the outlier transactions, we would be artificially changing the users’ bet preference when we aggregate these transaction data into one user’s overall bet pattern. The other option was to aggregate all transactions (including extreme observations) into a user’s overall bets, and then to filter the users from the population. Given that the outlier transactions would impede of analysis of transaction data, we had no choice but to remove these outliers. And in effect, we had to remove the affected users (users who made those extreme transactions) to maintain the integrity of the user data. Therefore, after data cleaning, we are left with 529,678 unique transactions in TransactionList for further analysis.

Data Transformation

To gather deeper insights in our data exploration phase, our team created new metrics that reflect certain attributes of bet behaviours. This then allowed us to test some of the hypotheses about betting patterns that our sponsor highlighted to us. The new metrics would be used to test difference in betting preference between gender, age, account types, players of different risk profiles. If the metrics do not differ between the segments of players, or does not significantly affect one’s bet placement – as we were to discover during the data exploration phase – they would then be removed.

Data classification

Given the large amount due to amalgamation of many different users, there were a lot of noise in the data and initial data exploration showed plenty of insignificant relationships. Therefore, we created categorical metrics to segment transactions and users into smaller subsets to reduce variation within these groups, thus allowing more significant observations. For data regarding transactions, the segmentation is based on the bet odds of individual transactions, whereas for data regarding users, the segmentation is based on the total number of betting transactions for the individual player. Classes were determined using the lower and upper quartiles of the distribution of those parameters.

Untitled.png

For example, based on the distribution of betting odds across all the transactions, we have identified two proxies to segregate the odds into three different groups – low, medium or high odds (as shown in the figure above). The two proxies are odds of 3.4 and 7.5, which have been identified through a box plot diagram depicting the distribution of odds for all transactions. Transactions with odds less than or equals to 3.4 will be classed as low odds, transactions with odds more than 3.4 and less than or equals to 7.5 will be classed as medium odds and the rest of the transactions will be classed as high odds. Furthermore, we have reaffirmed these classifications with the client in order to verify the fittingness of the proxies.

Creation of metrics (refer to Data Dictionary for full list of data variables)

The next role of the newly created metrics are for descriptive statistics to be displayed on the dashboard; this pertains to the user data, where aggregated statistics (i.e. total bets, probability or preference for certain bet days) would make up the overview of a user’s profile on the dashboard.

The original data was transformed through various methods – splitting of original data into small subsets (e.g. data time into separate data and time for individual analysis), find difference in time points to determine duration gap, deriving the probability of each option within a parameter, summation of parameters, finding the average, median and standard deviation of certain parameters, and many others. The transformed metrics falls under two categories, one for transactions, and the other for the users.

The following are some of the interesting metrics that deserve highlight:

i) Bet Time Before Match – this duration gap between bet placed and match could indicate whether planning is involve when making bets

ii) Bet Time – the time when bets was place could indicate if the player was a football fan who catches the game at wee hours or simply betting in the day for a later match

iii) Median (Stake/Odds/Time/Returns) – players may sometimes make one off bets that are extremely large this would skew the average, hence median who be more representative of that user parameter if the deviation of that parameter was high

iv) Standard Deviation (Stake/Odds/Time/Returns) – adding the SD of each parameter would tell us if user preference was stable in that parameter (i.e. on whether the player is an erratic bettor)

Data Consolidation

After data cleaning and data transformation, we started to consolidate all the transactions in TransactionList to form list of unique users in another worksheet also known as UserList. Using the transaction data, we summed up variable such as profit/loss, returns, transaction count, and number of bets of each league or market type. Besides that, for each user, we calculated their individual mean, median and standard deviation for variables such as odds, stake amounts and returns. At the end of data consolidation, we have altogether 5,562 unique user accounts in UserList for further analysis.

Transaction Findings

The following few points pertain to initial findings regarding transactions that were derived from the TransactionList during the exploratory data analysis phase and we will mainly be using median values for basis of comparison to reduce the influence from extreme values.

Time (over the entire period)

Capture3.jpg

The line graph above shows the number of daily transactions across the period from January 2015 to March 2015. From the graph, there is a cyclical pattern of transaction count peaking over the weekends and being at its lowest during the weekdays.

Time (over a week)

The bar graph above shows the total number of transactions per day across a week. It can be clearly seen that weekends dominate at least 50% of the betting transactions, due to the large amount of matches played all over the world during the weekend.

Time (over a day)

The line graph above shows the total number of transactions for each hour in a day. The betting transactions peaks at certain timings with the highest number being at 2200hrs to 2259hrs, which is probably due to the large number of matches following that hour over the weekends. Besides that, the period between 0600hrs and 0759hrs have the lowest transactions as minimal matches are played during those hours. Using these high and low peaks, we have drawn up five different segments across the 24 hours in a day as shown below.

Using these different time segments in day, we can use them to derive the probability of a user’s bet that lies within each segment. This probability metric can be further used in the later stage of the project for our clustering analysis.

Low-medium-high odds (L-M-H)

Based on the segregation of transaction odds as mentioned earlier on, there are about 103,000 (20%) low-odds transactions, about 304,000 (57%) medium-odds transactions and 122,000 (23%) high-odds transactions.

Based on the table above, it can be seen that as the odds move from a low range to a high range, stake amounts of bettors decrease accordingly. This is an observation we would expect as high odds is synonymous with high risk, therefore bettors would most likely to bet less on a high risk selection. The minimum stake amount of $5 is the same for all groups as it is the minimum bet rule as imposed by Singapore Pools.
Stake amount across L-M-H odds groups
Group A (L) B (M) C (H)
Maximum 10,000 10,000 6,000
Upper Quartile 1,677 644.5 400.25
Median 796 307 191
Lower Quartile 372 144.5 89.75
Minimum 5 5
Based on the table above, a transaction with high odds would mean a shorter betting time before the match starts which means that the player places his/her bets closer to the kick-off time of the match. This is probably due to the longer thinking time that a player would need to have before placing his bet on a high-odds selection.
Bet time before match across L-M-H odds groups (in minutes)
Group A (L) B (M) C (H)
Maximum 6,581 6,619 6,253
Upper Quartile 2,074 1,614 1,506
Median 1,261 892 755
Lower Quartile 608 424 359
Minimum 0 0
The statistics above only consists of unprofitable transactions in each group. Based on the table above, transactions with low odds have a much higher median in terms of bet losses as compared to the other two groups. This is also partly driven by the above analysis where it showed that low odds transactions would likely to have high stake amounts, hence potentially larger losses.

User Findings

The following few points pertain to initial findings regarding users that were derived from the UserList during the exploratory data analysis phase and we will mainly be using median values for basis of comparison to reduce the influence from extreme values.

Account type

There are two types of membership or accounts, Platinum or Gold, the main difference is that for Platinum members, their betting accounts are linked to their bank accounts through GIRO which means that bets would be deducted automatically from their bank accounts, whereas for Gold members, they would always need to top-up their accounts (which essentially acts like an e-wallet) manually via service machines such as AXS. Gold users accounted for majority (67.55%) of the transactions across the three months period.

With insights from our client, our group hypothesized the followings:

i. Platinum players would bet larger stake amounts for spending is more inconspicuous, or/and they could have larger spending power.

ii. Platinum players would prefer bets with higher odds for they may be more active and risker gamblers.

iii. Platinum players would see lower overall profits or greater losses for their winnings and losings for returns are less salient.

Bet losses across L-M-H odds groups (in minutes)
Group A (L) B (M) C (H)
Maximum -5 -5 -5
Upper Quartile -261 -132 -88
Median -549 -279 -191
Lower Quartile -1,157 -571 -401
Minimum -10,000 -10,000
} The above data shows summary statistics of the distribution of the median stake amount for each player and the standard deviation (SD) of stake amount for each player. Our data findings from the above statistics agrees with our first hypothesis. Platinum players do bet larger amounts as shown by greater median, upper and lower quartile figures than Gold players. However, Platinum players have higher standard deviations of stake amounts than Gold, suggesting that the high stake amount of Platinum players could be driven by players who place infrequent high bet amounts. Thus, we hope that our clustering analysis would be able to single these erratic high bettors.
Variable Stake Amount SD of Stake Amount
Group Gold Platinum Gold Platinum
Maximum 4,000 5,000 2,828 3,655
Upper Quartile 270 425 76 107
Median 103 140 31 40
Lower Quartile 50 46 13 15
Minimum 5 5 0
} Based on the above table with regards to the “odds” variable, it can be seen that Platinum players do prefer bet selections with higher odds as shown by their larger median, upper and lower quartile. In addition, the standard deviation of odds for Platinum players are less volatile than that of Gold, hence further suggesting that Platinum players largely prefer higher odds as compared to Gold players.
Variable Odds SD of Odds
Group Gold Platinum Gold Platinum
Maximum 85.00 75.00 54.71 47.99
Upper Quartile 5.06 5.65 7.69 7.55
Median 2.91 3.05 3.80 3.25
Lower Quartile 1.93 1.98 1.31 1.02
Minimum 1.01 1.02 0
} The statistics above only consists of unprofitable users in each account type group. Based on the table above, Platinum players as a whole suffers larger bet losses as compared to the Gold players, probably driven by their higher stake amounts and preference for higher odds.

Gender

Male users make up the majority of our transactions across the three month period at 95.92%, while female users account for the remaining 4.08% of total transactions. There have been many consumer studies in various fields with regards to difference in male and female psychological mechanisms, as such we hypothesized gender to have an effect on betting preference.

With insights from our client, our group hypothesized the followings:

i. Male players would bet larger stake amounts given the general observation that males would hold more purchasing power.

ii. Male players would prefer bets with large odds given their innate riskier psychological predisposition.

iii. Female players would bet later for they would take a longer time to make their bet decision.

iv. Female players would have overall greater profits given that they are more incline to make calculated risks.

Variable Stake Amount SD of Stake Amount

Group Male Female Male Female

Maximum 5,000 1,500 3,655 1,253

Upper Quartile 312 147 84 43

Median 111 48 34 19

Lower Quartile 49 20 13 8

Minimum 5 5 0 0

Data shown in the table above validates our prediction that male players would bet more per bet as stake amounts for male players are relatively higher than that of females (over 100% more than females on each quartile). Additionally, standard deviation of stake amounts for male players are also wider than that of female players.

Variable Odds SD of Odds

Group Male Female Male Female

Maximum 85.00 52.50 54.71 34.04

Upper Quartile 5.20 5.75 7.61 8.71

Median 2.98 3.12 3.60 4.38

Lower Quartile 1.96 20.00 1.19 1.31

Minimum 1.02 1.01 0 0

Unlike what we hypothesized, females appear to be the gender that prefers higher odds as shown in the table above, whereas male players tend to prefer lower odds. Linking back to the previous table, general betting behaviour of male players perhaps is to focus on low odds and bet large amounts to reap bigger profits and for the female players on the other hand, their general strategy is to bet smaller amounts but leveraging on higher odds to achieve good returns.

Bet losses between gender

Group Male Female

Maximum 0 -5

Upper Quartile -248 -86

Median -780 -315

Lower Quartile -2,143 -1,378

Minimum -80,285 -16,821

The statistics above only consists of unprofitable users in each gender group. Based on the table above, male players record larger bet losses in spite of their preference for lower odds as compared to their female counterparts, probably due to the larger bet amounts placed by the male players.


Looking at the distribution of odds across the two gender in the above diagrams, we can see that female players have a higher spike in the transaction counts for higher odds (between the odd range from 6 to 10) as compared to the male players, whose overall demand for lower odds seemed to be higher than their overall demand for higher odds.

Age group (only for Platinum customers due to limited information)

In the group of Platinum account holders, about 91% of the account holders are aged between 30 and 59 years old with the highest number in the range of 40 to 49 years old as shown on the bar graph above. Due to the significantly low number of account holders in the 20 to 29 years old and 70 to 79 years old group, any of the following statistics that is pertaining to these two groups can be ignored.

With insights from our client, our group hypothesized the followings:

i. Older and middle-aged players would bet larger stake amounts given that they generally have more resources thus more purchasing power.

ii. Older players would prefer bets with large odds given that they are less interested in saving small incremental wins.

Meanwhile, there are lesser previous findings about age related bet behaviour on the other parameters, and we would then gather insights from the data to understand more about the variation across age groups.

Stake amounts between different age groups (in S$)

Age Group 20-29 30-39 40-49 50-59 60-69 >=70

Maximum 1,000 5,000 2,450 2,000 2,000 1,000

Upper Quartile 100 100 194 100 90 475

Median 50 50 63 50 50 65

Lower Quartile 23 25 30 25 23 18

Minimum 5 5 5 5 5 10

Based on the table above, the age group of 40 to 49 years old has higher betting amounts than the rest of the groups, probably due to the high number of working class adults that fall within the 40 to 49 years old category. This observation supports the former part of our first hypothesis, where middle-aged people – the more financially stable stage of career and life – holds largest spending, making up the majority of the larger stakes.

Odds between different age groups

Age Group 20-29 30-39 40-49 50-59 60-69 >=70

Maximum 10.00 42.00 52.50 55.00 18.00 7.00

Upper Quartile 3.86 4.35 5.20 5.68 5.80 4.55

Median 2.20 2.79 2.90 3.19 3.42 3.56

Lower Quartile 1.82 2.05 2.05 2.20 2.28 2.15

Minimum 1.53 1.05 1.15 1.13 1.18 1.68

Based on the table above, data supports our hypothesis that older people tend to prefer bets with higher odds, as it can be clearly seen that the older the age group. Looking at the lower quartile, median and upper quartile, the odds at every level increases as the age group increases.

Bet losses between different age groups (in S$)

Age Group 20-29 30-39 40-49 50-59 60-69 >=70

Maximum -50 -1 0 -2 -3 -29

Upper Quartile -178 -175 -187 -200 -165 -394

Median -508 -623 -660 -660 -525 -1,163

Lower Quartile -1,995 -2,175 -2,221 -2,169 -1,931 -3,899

Minimum -21,995 -39,157 -80,285 -43,049 -14,700 -7,500

The statistics above only consists of unprofitable users in each age group. Based on the table above, the age group of 60 to 69 years old has lower bet losses than the rest of the groups, despite the fact that this group prefers higher odds as compared to the other groups. Feedback from our client suggested that betting knowledge may come with experience, thus making them the “smartest gamblers” in the sample.

Bet time before match between different age groups (in minutes)

Age Group 20-29 30-39 40-49 50-59 60-69 >=70

Maximum 1,437 1,920 1,991 2,651 3,021 445

Upper Quartile 233 217 212 264 355 181

Median 142 103 117 131 214 129

Lower Quartile 66 44 55 55 71 85

Minimum 14 0 0 0 5 48

Based on the table above, the account holders within the 60 to 69 years old age group place their bets much earlier than the rest of the groups. For example, comparing the median across the age groups, the 60 to 69 years old age group bets about two hours earlier than the rest of the groups.

Based on transaction count

Besides segregating the customers based on their own demographic information, we also divided the customers into three groups (low, medium and high frequency) based on their frequency of betting transactions. Before deciding on the boundaries of the division, we looked at the distribution of the users’ transaction count which is seen in the table below.

Users’ transaction count

Maximum 4,408

Upper Quartile 109

Median 34

Lower Quartile 9

Minimum 1

Based on the table above, we divided the users based on the lower and upper quartile values. The first group (called ‘low frequency’, in short, LF) comprises of users with transaction count less than 10, second group (called ‘medium frequency’, in short, MF) comprises of users with transaction count more than 9 but less than 110 and third group (called ‘high frequency’, in short, HF) comprises of users with transaction count more than 109.

With insights from our client, our group hypothesized the followings:

i. Less frequent players or one-off players would tend to prefer placing higher stake bets for a quick and large win.

ii. Frequent players would prefer bets with lower odds for this safeguards themselves against hefty losses which could affect their opportunity of making future bets.

Stake amount across groups divided based on transaction frequency

Group A (LF) B (MF) C (HF)

Maximum 5,000 5,000 4,000

Upper Quartile 451 279 194

Median 133 96 81

Lower Quartile 51 44 39

Minimum 5 5 5

Based on the table above, data supports our hypothesis about the preference for stake amount; it can be clearly seen that the frequency of betting transaction is inversely related to the stake amount. The HF group records lower stake amount, whereas the LF group records higher stake amount, therefore showing the inverse relationship between transaction frequency and stake amount.

Odds across groups divided based on transaction frequency

Group A (LF) B (MF) C (HF)

Maximum 85.00 42.50 30.00

Upper Quartile 5.94 4.81 4.70

Median 3.07 2.90 2.93

Lower Quartile 1.99 1.91 2.01

Minimum 1.01 1.05 1.18

Based on the findings, it can be seen that the LF group prefers higher betting odds as compared to the other groups with users that have higher transaction frequency. Supporting our hypothesis that LF players are in it for a large win, or either that the low frequency purchase could be due to cash flow issues, which we would later look into in terms of their winnings or losses statistics.

Bet losses across groups divided based on transaction frequency

Group A (LF) B (MF) C (HF)

Maximum -2 0 -1

Upper Quartile -67 -204 -724

Median -176 -516 -1,809

Lower Quartile -538 -1,408 -4,305

Minimum -9,710 -48,955 -80,285

The statistics above only consists of unprofitable users in each group, and the HF group shows significantly higher bet losses than the other groups. Meanwhile, the LF group who bet are large bets and prefer higher odds performed least badly. Despite the preference of HF group towards lower betting odds and lower stake amounts, it should record the least losses out of the three groups. However, based on the high frequency of betting transactions coupled with majority bet losses, over time, it can sum up to a largest overall bet loss among the three groups.

Going Forward

Cluster Analysis

Based on our findings from the exploratory data analysis, we are now able to understand the relationship between variables and how these variables differ across different types of grouping (e.g. male & female, gold & platinum, age groups). Hence, for the next step before we start on our clustering analysis, we will have to normalize certain data points in order to nullify the extreme differences between data points.

As for the clustering analysis itself, we will be using SAS Enterprise Guide to process the clustering and conduct a few iterations of the clustering process before finalizing the clusters. After finalizing the clusters, we will proceed to review these clusters with the client to assess on its commercial usefulness. However, our group noted that the lack of demographic information (such as income range, occupation, education level) of each user account might render our analysis limited.

Automated Data Cleaning

Since the given dataset is a fixed dataset relating to the period from January to March 2015, if we only were to use “manual” analytical tools to generate the analytical outcomes for that dataset and statically visualize it by D3.js, it would not be beneficial for the client moving forward as they would not be able to use it for the other time periods. Therefore, we decided to make it flexible such that they can upload their own dataset (assuming that it is the same format) and the final dashboard still be able to reflect the results of the new dataset. Therefore, we need a tool/library/programming language that can allow us to do the “automated data cleaning” so that the final version of our dashboard can still generate desired outcomes without us having to manually perform any analysis.

There are some languages that can do this job such as Python or MATLAB but we decided to use R due to its open-source nature, having strong community support with an enormous number of different libraries/packages and also, since some of our team members already have prior experience with the language.

In our R code, we follow exactly the same steps as described in our Data Cleaning and Data Transformation segments to generate our analytical data cubes automatically.

First, we use the Read CSV package to read the raw dataset into the R environment:

transactions = read.csv("/DIRECTORY_OF_THE_FILE/FY14 SPORTS TRANSACTIONS TSOPENED_01012015_31032015.csv")

Secondly, we generated new variables which represents the “filtered” datasets after removing irrelevant fills the same criteria as stated in the Data Cleaning part.

filteredTransactions <- transactions[!is.na(users$ODDS),]

filteredTransaction

Similarly we also removed the outliers:

nonNullOdds = na.exclude(filteredTransactions$ODDS)

outlierOddsThreshold = quantile(nonNullOdds,c(.99))

nonNullStakeAmount = na.exclude(filteredTransactions$STAKEAMOUNT)

outlierStakeAmountThreshold = quantile(nonNullStakeAmount,c(.99))

filteredTransactions <- subset(filteredTransactions, filteredTransactions$ODDS < outlierOddsThresHold)

filteredTransactions <- subset(filteredTransactions, filteredTransactions$STAKEAMOUNT < outlierStakeAmountThresHold)

For the data transformation part first we also need to determine thresholds for HF-MF-LF Customers as well as Low – Medium – High Odds:

lowOddQuantile = quantile(transactions$STAKEAMOUNT,c(.25))

highOddQuantile = quantile(transactions$STAKEAMOUNT,c(.75))

transactions$RISK_TYPE[transactions$STAKEAMOUNT < lowOddQuantile] <- 'L'

transactions$RISK_TYPE[transactions$STAKEAMOUNT >= lowOddQuantile && transactions$STAKEAMOUNT <= highOddQuantile] <- 'M'

transactions$RISK_TYPE[transactions$STAKEAMOUNT > highOddQuantile] <- 'H'

Finally data aggregation from per-transaction to per-user is done using R function “aggregate” and “merge”. First, for each of a user metric, create a new dataframe that has two columns – the dummy account id and the aggregating value (median, mean, SD or probability). For example:

MEDSTAKEAMOUNT <- aggregate(x = transaction.clean$STAKEAMOUNT, by = list(transaction.clean$ACCOUNT_DUMMY), median)

This will show us the median stake amount of each account.

Finally, to merge all of those data frames into a single data frame, use the merge function to eventually reduce the number of data frames, for example:

aggregatedData <- merge(MEDSTAKEAMOUNT,MEDODD,by="Group.1")

Dashboard

Since we use R for our automated data cleaning process as well as to generate visualization data to be displayed by D3.js, we need to have a mechanism to “integrate” R into the workflow of a web application. “Full stack” solutions such as Shiny framework which allow you to write the whole web application – both frontend and backend – in R does not cut it due to performance issue – it would take very long to process our given data of nearly 1 million records let alone the possibly bigger datasets imported by Singapore Pools in the future. To resolve that issue we decided to keep our frontend purely HTML/CSS/Javascript and build only our backend side by R by using rApache. Our backend side will generate APIs (in which parameters are basically user inputs) which will be consumed by our frontend side. We will cater the format of the response of our APIs to be “friendly” to D3.js so that it can directly visualize the response without doing much data transformation. With rApache, it is also possible to host our web application on a standard Ubuntu server or cloud instance so it also offer flexibility on deployment which will be helpful in case if our client needs more processing power to process bigger datasets in future.

Wireframe

Our dashboard will consist of 3 views:

1) The Primary Transaction Overview

2) The User Segment (Cluster) Selection View

3) The Specific User Profile Overview

The administrative user will first need to log in into the dashboard, upon doing so, he will land on “Transaction Overview” page. This page allows users to preview the entire set of transaction data. There will be a trend line graph to display certain parameters over a selected period of months, which users can select from the drop-down list button. There will also be a report summary of this month’s statistics at the end of the page (parameters will be determined by the client at a later stage).

Some parameters to be display will include the following (to be revised with client):

- Total revenue

- Winning ratio of SG Pools

- Number of transactions for the last month

- New member sign-ups for the last month

- Total profits

- Total pay outs

- Number of unique players

- Top 10 players with the highest stake amount for the month

- Top selling product for the month

- Top selling event type for the month

- Most popular league for the month

The side bar will aid navigation through the dashboard, users can easily switch between “Transaction Overview” and “User Profiles”. Secondary functions like calendar features to jot notes might be added later on during the prototype development.

Upon clicking on “User Profiles” tab on the side bar, users will land on “Segment Selection” page, where they can narrow down on the segment of players (based on their clusters, gender, or account type) that they want to view. They can future narrow in on a particular player by clicking on their ID the list of player names from the segment group. Else, they can use the search bar to find a particular player ID.

As seen in the screenshot below, on “User Profile” overview page, users can view several popular parameters about their purchase, bet pattern type, amount of returns, favourite bet preferences, etc. Like “Transaction Overview” there will be a trend line graph to display certain parameters over a selected period of months, which users can select from the drop-down list button. Something different here is that there will be a secondary trend line that charts the overall population parameters over that of the selected user, allowing comparison of the individual bet patterns with others. Final set of parameters to be displayed will be confirm upon sponsor’s request.

There will also be a summary report of the user at the bottom of the page, which allows the user to easily export the info out by copying and pasting the statistics.

Variable Bet losses across account types
Group Gold Platinum
Maximum -1 0
Upper Quartile -237 -289
Median -686 -960
Lower Quartile -1,864 -3,099
Minimum -71,654 -80,285