Difference between revisions of "AY1516 T2 Sport Betting at Singapore Pools Project Overview Final"

From Analytics Practicum
Line 36: Line 36:
 
<!--END OF Sub-Navigation-->
 
  
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Project Background</font></div></div>==
+
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Abstract</font></div></div>==
In today’s globalized world, the Internet has transformed the gambling environment into a multifaceted, non-physical, multi-platform environment without boundaries. This creates loopholes for illegal gambling operators to enter the market and draw our customers away into an unregulated arena that is prone to gambling addiction.
 
  
Singapore Pools offers a safer outlet, one where players can bet responsibly, within their means. Attrition rates have been rising over the years, which could mean that Singapore Pools’ customers are seeking other avenues for gambling, such as illegal online gambling sites, which may lead to irresponsible betting. Therefore, within the next few years, our sponsor seeks to take a data-driven approach to promoting responsible gambling by monitoring players’ betting behaviour and performance, in the hope of highlighting alarming patterns that could indicate irresponsible gambling.
+
In today’s interconnected world, the gambling environment has become a multifaceted playing field without boundaries, exposing more people, and younger people, to the games, and creating loopholes for illegal gambling operators to enter the market. The result is greater public concern about the social ills of irresponsible gambling.
  
Our sponsor has been collecting user data for the past several years, but has yet to put it to good use. A year ago, Singapore Pools set up a customer insights division to better understand its customers through the analysis of this user data. Its first step towards a data-driven approach to promoting responsible gambling is to understand the gambling behavioural patterns of its customers.
+
Our sponsor, Singapore Pools, takes a strong stand on responsible gaming and wants to offer a safer outlet for the public to play. This paper explores gambling transaction data (n=930,000) to identify and better understand betting patterns that would eventually allow us to flag players who engage in, or are susceptible to, irresponsible gambling, and in turn to suggest ways to promote responsible gambling. This paper also draws on past literature to guide our methodological approach and to cross-compare hypotheses and findings.
 
   
 
   
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Project Objectives</font></div></div>==
+
The methodological flow of this project begins with exploratory data analysis, where the dataset is cleaned and transformed for further analysis. The large set of transactions is aggregated into a list of user data. We then proceed to a relationship analysis of the parameters and bet preferences of players of different demographics. Using clustering analysis, we then profile players into four main segments: (1) Masses (2) High-rollers (3) Players-at-risk (4) Habituals.
  
The aim of our project is to allow Singapore Pools to better understand the gambling behaviours of its customers through the identification of gambling patterns. Each cluster might have its own ways of splitting bets, league preferences, decision-making processes and ways of choosing bet selections. Such behavioural patterns could possibly be linked back to certain demographics of the cluster, allowing us to infer reasons behind their gambling habits and, hopefully, to identify irresponsible gamblers. Based on the characteristics of each cluster, the client’s end objective is to have a customized business action that enhances the gambling experience of the players in that cluster. The scope of our project is limited to the Sports Betting segment of customers who have opened betting accounts with Singapore Pools, and the given dataset covers only transactions within the period from January 2015 to March 2015.
+
This segmentation would allow our sponsor to identify players who are at risk of irresponsible gambling, and to devise strategies to reach out to these segments, alert them to their betting behaviour and educate them about responsible betting. To ensure project continuity and support future analyses, our team has created a dynamic dashboard to visualise monthly transaction trends, highlight popular events and players who are at risk, and allow exploration of each individual player’s profile and betting patterns (i.e. their betting intensity and transaction history).
  
<center>
+
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Introduction</font></div></div>==
The overall objectives of our project are to:
 
  
<b>(1) Profile their existing pool of customers through clustering analysis
+
Gambling is often seen as a problem in society, and gambling addiction undoubtedly poses a grave societal problem. However, banning gambling is not a viable solution, for it would simply drive these activities underground. Our sponsor, Singapore Pools, was set up by the Singapore government in 1968 to place gambling on legal grounds and to deal with the social ills tied to gambling. Ever since, Singapore Pools (SG Pools) has been the sole legal operator of lotteries and sports betting in Singapore.
  
(2) Create a data visualization of the consumer betting activity
+
Unlike in most countries, where gambling houses are privately owned organizations, Singapore Pools is a state-owned organization registered under Singapore’s Ministry of Finance. Singapore Pools offers four main products to the public (TOTO, Singapore Sweep, 4D and Sports Betting), whose operations and product configurations are all regulated by Singapore’s Ministry of Home Affairs, Ministry of Finance and Ministry of Social & Family Development.
  
(3) Build a dashboard to visualize the profiling and data points
+
Our sponsor takes a strong stand on responsible gaming and wants to offer a safer outlet where players can bet responsibly within their financial means. Attrition rates have been rising over the years, which could mean that Singapore Pools’ customers are seeking other avenues for gambling, such as illegal online gambling sites, which may lead to irresponsible betting. Therefore, within the next few years, our sponsor seeks to take a data-driven approach to promoting responsible gambling by monitoring players’ betting behaviour and performance, in the hope of highlighting alarming patterns that could indicate irresponsible gambling, and to use this data to help usher in its online betting platform, which is scheduled to launch in the upcoming year.
</b>
 
</center>
 
  
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Data Cleaning</font></div></div>==
+
Our sponsor has been collecting user and transaction data for the past several years, but has yet to put it to good use. Singapore Pools set up a customer insights division about a year ago to better understand its customers through the analysis of this user data. This is its first step towards a data-driven approach to promoting responsible gambling and understanding the gambling behavioural patterns of its customers, and this is where our team comes in.
  
The original sample data set that we worked on contained over 930,000 unique observations (transactions) in a worksheet known as TransactionList. There were two types of noise in our data set: (1) irrelevant fields – variables or observations that do not pertain to the soccer betting products, rejected bets that have no odds indicated, and ‘Championship Winner’ bet types, which are considered atypical of regular bet patterns; (2) outliers – observations that lie beyond the 99th percentile of selected parameters, with thresholds identified from scatterplots of the distributions in Tableau.
+
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Project Objectives</font></div></div>==
 
 
The following observations were removed from the original data set:
 
  
[[File: Capture.JPG|center]]
+
The aim of this project is to provide Singapore Pools with a better understanding of the gambling behaviours of its customers through the identification of betting preferences and patterns. Clusters of players may be identified based on their betting behaviour: ways of splitting bets, league preferences, decision-making processes and ways of choosing bet selections. Such behavioural patterns could possibly be linked back to certain demographics of the cluster, allowing us to infer reasons behind their gambling habits and, hopefully, to identify irresponsible gamblers.
  
[[File: Capture2.JPG|center]]
+
The scope of our project is limited to the Sports Betting segment of customers who have opened betting accounts with Singapore Pools. The data provided are confined to line betting transactions made by ‘Gold’ and ‘Platinum’ members.
  
The dilemma we faced was that if we removed only the outlier transactions, we would be artificially changing the users’ bet preferences when we aggregate the transaction data into each user’s overall bet pattern. The other option was to aggregate all transactions (including extreme observations) into each user’s overall bets, and then to filter users from the population. Given that the outlier transactions would impede our analysis of the transaction data, we had no choice but to remove them. In effect, we also had to remove the affected users (those who made the extreme transactions) to maintain the integrity of the user data. After data cleaning, we are left with 529,678 unique transactions in TransactionList for further analysis.
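The user-level removal described above can be sketched in R as follows. This is a minimal illustration on toy data, not our production cleaning script: the column names user_id and stake are assumptions, and the threshold here comes from quantile() rather than from the Tableau scatterplots we actually used.

```r
# Toy transactions; the column names (user_id, stake) are illustrative assumptions.
transactions <- data.frame(
  user_id = c("A", "A", "B", "B", "C", "C"),
  stake   = c(10, 20, 15, 5000, 30, 25),
  stringsAsFactors = FALSE
)

# Threshold: 99th percentile of the selected parameter (here, stake).
threshold <- quantile(transactions$stake, probs = 0.99)

# Users who made at least one transaction beyond the threshold.
flagged_users <- unique(transactions$user_id[transactions$stake > threshold])

# Drop every transaction of a flagged user, not only the extreme rows,
# so that the remaining users' aggregated bet patterns are not distorted.
cleaned <- transactions[!transactions$user_id %in% flagged_users, ]
```

Removing whole users rather than single rows trades sample size for consistency between the transaction-level and user-level data.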
+
<center>
 
+
The overall objectives of our project are as follows:
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Data Transformation</font></div></div>==
+
(1) Provide insights into gambling behavioural patterns
 
+
(2) Profile their existing pool of customers into meaningful segments
To gather deeper insights in our data exploration phase, our team created new metrics that reflect certain attributes of bet behaviour. This allowed us to test some of the hypotheses about betting patterns that our sponsor highlighted to us. The new metrics are used to test for differences in betting preference across gender, age, account type and risk profile. Metrics that do not differ between the segments of players, or do not significantly affect bet placement (as determined during the data exploration phase), are then removed.
+
(3) Build a dashboard to visualize betting patterns and trends on a macro and individual level   
+
</center>
=== Data classification ===
 
 
 
Because the data set amalgamates many different users, there was a lot of noise in the data, and initial data exploration showed plenty of insignificant relationships. Therefore, we created categorical metrics to segment transactions and users into smaller subsets, reducing variation within these groups and allowing more significant observations. For transaction data, the segmentation is based on the bet odds of individual transactions, whereas for user data, the segmentation is based on the total number of betting transactions of the individual player. Classes were determined using the lower and upper quartiles of the distributions of those parameters.
 
 
 
[[File:Untitled.png|center|frameless|upright=2.5]]
 
 
 
For example, based on the distribution of betting odds across all the transactions, we identified two cut-off values that separate the odds into three groups: low, medium and high odds (as shown in the figure above). The two cut-offs are odds of 3.4 and 7.5, identified from a box plot of the distribution of odds across all transactions. Transactions with odds less than or equal to 3.4 are classed as low odds, transactions with odds greater than 3.4 and less than or equal to 7.5 are classed as medium odds, and the rest are classed as high odds. We also confirmed these classifications with the client to verify that the cut-offs are appropriate.
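As a minimal sketch, the low-medium-high rule above can be written in R with cut(). The values 3.4 and 7.5 come from the text, while the sample odds vector is made up for illustration.

```r
# Made-up odds values chosen to sit on and around the two boundaries.
odds <- c(1.5, 3.4, 3.41, 7.5, 7.51, 12)

# right = TRUE closes each interval on the right, i.e. (a, b],
# so 3.4 is "low" and 7.5 is "medium", matching the rule in the text.
odds_class <- cut(odds,
                  breaks = c(-Inf, 3.4, 7.5, Inf),
                  labels = c("low", "medium", "high"),
                  right  = TRUE)

# Count of transactions per class.
class_counts <- table(odds_class)
```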
 
 
 
=== Creation of metrics (refer to Data Dictionary for full list of data variables) ===
 
 
 
The second role of the newly created metrics is to provide descriptive statistics for the dashboard. This pertains to the user data, where aggregated statistics (i.e. total bets, probability of or preference for certain bet days) make up the overview of a user’s profile on the dashboard.
 
 
 
The original data was transformed through various methods: splitting original fields into smaller parts (e.g. date-time into separate date and time for individual analysis), finding the difference between time points to determine the duration gap, deriving the probability of each option within a parameter, summing parameters, finding the average, median and standard deviation of certain parameters, and many others. The transformed metrics fall under two categories, one for transactions and the other for users.
 
 
 
The following are some of the interesting metrics that deserve highlighting:
 
 
 
i) Bet Time Before Match – the duration gap between bet placement and the match could indicate whether planning is involved when making bets
 
 
 
ii) Bet Time – the time when bets were placed could indicate whether the player is a football fan who catches the game in the wee hours or is simply betting in the day for a later match
 
 
iii) Median (Stake/Odds/Time/Returns) – players may sometimes make one-off bets that are extremely large, which would skew the average; hence the median would be more representative of that user parameter if the deviation of that parameter is high
 
 
 
iv) Standard Deviation (Stake/Odds/Time/Returns) – the SD of each parameter tells us whether the user’s preference is stable in that parameter (i.e. whether the player is an erratic bettor)
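The four metrics above could be derived along the following lines. This is a sketch on toy data; the column names (user_id, bet_time, match_time, stake) are assumptions and not the names in our Data Dictionary, and UTC is used only to keep the example reproducible.

```r
# Toy transactions; all column names are illustrative assumptions.
transactions <- data.frame(
  user_id    = c("A", "A", "A", "B", "B"),
  bet_time   = as.POSIXct(c("2015-01-03 20:00:00", "2015-01-10 21:30:00",
                            "2015-01-17 19:00:00", "2015-01-04 02:00:00",
                            "2015-01-11 03:15:00"), tz = "UTC"),
  match_time = as.POSIXct(c("2015-01-03 22:00:00", "2015-01-10 22:00:00",
                            "2015-01-17 23:00:00", "2015-01-04 03:00:00",
                            "2015-01-11 03:30:00"), tz = "UTC"),
  stake      = c(10, 10, 500, 20, 30),
  stringsAsFactors = FALSE
)

# i) Bet Time Before Match: gap between bet placement and kick-off, in hours.
transactions$hours_before_match <- as.numeric(
  difftime(transactions$match_time, transactions$bet_time, units = "hours"))

# ii) Bet Time: hour of day at which the bet was placed.
transactions$bet_hour <- as.integer(format(transactions$bet_time, "%H"))

# iii) and iv) per-user median and standard deviation of stake.
stake_median <- tapply(transactions$stake, transactions$user_id, median)
stake_sd     <- tapply(transactions$stake, transactions$user_id, sd)
stake_mean   <- tapply(transactions$stake, transactions$user_id, mean)
```

In this toy data, user A's single 500 stake pulls the mean far above the median, which is the skew that motivates metric iii).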
 
 
 
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Data Consolidation</font></div></div>==
 
 
 
After data cleaning and data transformation, we consolidated all the transactions in TransactionList to form a list of unique users in another worksheet known as UserList. Using the transaction data, we summed variables such as profit/loss, returns, transaction count, and number of bets on each league or market type. In addition, for each user, we calculated the mean, median and standard deviation of variables such as odds, stake amounts and returns. At the end of data consolidation, we have 5,562 unique user accounts in UserList for further analysis.
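A minimal sketch of this consolidation step, using base R aggregate() and merge() on toy data, is shown below. The column names, and the definition of profit/loss as returns minus stake, are assumptions made for illustration.

```r
# Toy TransactionList; column names are illustrative assumptions.
transactions <- data.frame(
  user_id = c("A", "A", "A", "B", "B"),
  stake   = c(10, 20, 30, 5, 15),
  odds    = c(2.0, 3.5, 8.0, 1.8, 2.2),
  returns = c(20, 0, 0, 9, 0),
  stringsAsFactors = FALSE
)
# Assumed definition: profit/loss is the payout minus the amount staked.
transactions$profit_loss <- transactions$returns - transactions$stake

# Per-user totals.
totals <- aggregate(cbind(stake, returns, profit_loss) ~ user_id,
                    data = transactions, FUN = sum)
names(totals)[-1] <- c("total_stake", "total_returns", "total_profit_loss")

# Per-user transaction count.
counts <- aggregate(stake ~ user_id, data = transactions, FUN = length)
names(counts)[2] <- "transaction_count"

# Per-user medians.
medians <- aggregate(cbind(odds, stake) ~ user_id,
                     data = transactions, FUN = median)
names(medians)[-1] <- c("median_odds", "median_stake")

# One row per user: the toy equivalent of UserList.
user_list <- merge(merge(totals, counts, by = "user_id"),
                   medians, by = "user_id")
```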
 
 
 
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Transaction Findings</font></div></div>==
 
 
 
The following points are initial findings regarding transactions, derived from the TransactionList during the exploratory data analysis phase. We mainly use median values as the basis of comparison to reduce the influence of extreme values.
 
 
 
=== Time (over the entire period) ===
 
<br/>
 
[[File: Capture3.jpg|center|frameless|upright=4.1]]
 
The line graph above shows the number of daily transactions across the period from January 2015 to March 2015. From the graph, there is a cyclical pattern of transaction count peaking over the weekends and being at its lowest during the weekdays.
 
 
 
=== Time (over a week) ===
 
<br/>
 
[[File: Capture4.jpg|center|frameless|upright=3.5]]
 
The bar graph above shows the total number of transactions per day across a week. It can be clearly seen that weekends account for at least 50% of the betting transactions, due to the large number of matches played all over the world during the weekend.
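For reference, a weekend share of this kind can be computed in R as sketched below. The dates are made up, so the resulting share is not our actual figure.

```r
# Made-up bet dates: 2 Jan 2015 is a Friday, 3 and 10 Jan are Saturdays,
# 4 Jan is a Sunday and 5 Jan is a Monday.
bet_dates <- as.Date(c("2015-01-02", "2015-01-03", "2015-01-03",
                       "2015-01-04", "2015-01-05", "2015-01-10"))

# wday is locale-independent: 0 = Sunday, 1 = Monday, ..., 6 = Saturday.
wday <- as.POSIXlt(bet_dates)$wday

# Share of transactions on each day of the week.
daily_share <- prop.table(table(factor(wday, levels = 0:6)))

# Share of transactions placed on Saturday or Sunday.
weekend_share <- mean(wday %in% c(0, 6))
```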
 
 
 
=== Time (over a day) ===
 
<br/>
 
[[File: Capture5.jpg|center|frameless|upright=3.1]]
 
The line graph above shows the total number of transactions for each hour in a day. Betting transactions peak at certain times, with the highest number between 2200hrs and 2259hrs, which is probably due to the large number of matches following that hour over the weekends. The period between 0600hrs and 0759hrs has the fewest transactions, as few matches are played during those hours. Using these peaks and troughs, we have drawn up five different segments across the 24 hours in a day as shown below.
 
<br/>
 
[[File: Capture6.jpg|center|frameless|upright=3.1]]
 
Using these time segments, we can derive the probability that a user’s bet falls within each segment. This probability metric can be used later in the project for our clustering analysis.
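The per-user segment probabilities could be derived as in the sketch below. The segment boundaries and the labels S1 to S5 are hypothetical placeholders, since the actual five segments are defined only in the figure above; the column names are likewise assumptions.

```r
# Toy data: the hour of day (0-23) at which each bet was placed.
transactions <- data.frame(
  user_id  = c("A", "A", "A", "A", "B", "B"),
  bet_hour = c(22, 23, 14, 22, 7, 20),
  stringsAsFactors = FALSE
)

# HYPOTHETICAL boundaries and labels, not the segments from the figure.
segment_breaks <- c(0, 6, 8, 18, 22, 24)
segment_labels <- c("S1", "S2", "S3", "S4", "S5")

# right = FALSE gives intervals of the form [start, end).
transactions$segment <- cut(transactions$bet_hour,
                            breaks = segment_breaks,
                            labels = segment_labels,
                            right  = FALSE)

# Rows are users, columns are segments; each row sums to 1.
segment_counts <- table(transactions$user_id, transactions$segment)
segment_prob   <- prop.table(segment_counts, margin = 1)
```

Each row of segment_prob is one user's distribution of bets over the day, which is the form needed as a clustering input.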
 
 
 
=== Low-medium-high odds (L-M-H) ===
 
Based on the segregation of transaction odds as mentioned earlier on, there are about 103,000 (20%) low-odds transactions, about 304,000 (57%) medium-odds transactions and 122,000 (23%) high-odds transactions.
 
<br/>
 
[[File: Capture7.JPG|center|frameless|upright=4.1]]
 
Based on the table above, it can be seen that as the odds move from a low range to a high range, bettors’ stake amounts decrease accordingly. This is what we would expect, as high odds are synonymous with high risk, so bettors would most likely bet less on a high-risk selection. The minimum stake amount of $5 is the same for all groups, as it is the minimum bet imposed by Singapore Pools.
 
<br/>
 
[[File: Capture8.JPG|center|frameless|upright=4.1]]
 
Based on the table above, transactions with high odds have a shorter betting time before the match starts, which means that the player places his/her bet closer to the kick-off time of the match. This is probably due to the longer thinking time that a player needs before placing a bet on a high-odds selection.
 
<br/>
 
[[File: Capture9.JPG|center|frameless|upright=4.1]]
 
The statistics above consist only of unprofitable transactions in each group. Based on the table above, transactions with low odds have a much higher median bet loss than the other two groups. This is partly driven by the finding above that low-odds transactions are likely to have high stake amounts, and hence potentially larger losses.
 
 
 
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">User Findings</font></div></div>==
 
 
 
The following points are initial findings regarding users, derived from the UserList during the exploratory data analysis phase. We mainly use median values as the basis of comparison to reduce the influence of extreme values.
 
 
 
=== Account type ===
 
 
There are two types of membership or account, Platinum and Gold. The main difference is that Platinum members’ betting accounts are linked to their bank accounts through GIRO, which means that bets are deducted automatically from their bank accounts, whereas Gold members always need to top up their accounts (which essentially act like an e-wallet) manually via service machines such as AXS. Gold users accounted for the majority (67.55%) of the transactions across the three-month period.
 
 
 
With insights from our client, our group hypothesized the following:
 
 
 
i. Platinum players would bet larger stake amounts because their spending is more inconspicuous, and/or because they have greater spending power.
 
 
 
ii. Platinum players would prefer bets with higher odds because they may be more active and riskier gamblers.
 
 
 
iii. Platinum players would see lower overall profits or greater losses because their winnings and losses are less salient.
 
<br/>
 
[[File: Capture10.JPG|center|frameless|upright=4.1]]
 
The above data shows summary statistics of the distribution of the median stake amount for each player and the standard deviation (SD) of stake amount for each player. These findings agree with our first hypothesis. Platinum players do bet larger amounts, as shown by greater median, upper quartile and lower quartile figures than Gold players. However, Platinum players also have higher standard deviations of stake amounts than Gold players, suggesting that the high stake amounts of Platinum players could be driven by players who place infrequent high bets. Thus, we hope that our clustering analysis will be able to single out these erratic high bettors.
 
<br/>
 
[[File: Capture11.JPG|center|frameless|upright=4.1]]
 
Based on the above table for the “odds” variable, it can be seen that Platinum players do prefer bet selections with higher odds, as shown by their larger median, upper quartile and lower quartile. In addition, the standard deviation of odds is lower for Platinum players than for Gold players, further suggesting that Platinum players largely prefer higher odds compared to Gold players.
 
<br/>
 
[[File: Capture12.JPG|center|frameless|upright=4.1]]
 
The statistics above consist only of unprofitable users in each account type group. Based on the table above, Platinum players as a whole suffer larger bet losses than Gold players, probably driven by their higher stake amounts and preference for higher odds.
 
 
 
=== Gender ===
 
Male users account for the majority of transactions across the three-month period at 95.92%, while female users account for the remaining 4.08%. Many consumer studies in various fields have examined differences in male and female psychological mechanisms; as such, we hypothesized that gender has an effect on betting preference.
 
 
 
With insights from our client, our group hypothesized the following:
 
 
 
i. Male players would bet larger stake amounts given the general observation that males would hold more purchasing power.
 
 
 
ii. Male players would prefer bets with large odds given their innate riskier psychological predisposition.
 
 
 
iii. Female players would bet later for they would take a longer time to make their bet decision.
 
 
 
iv. Female players would have greater overall profits given that they are more inclined to take calculated risks.
 
<br/>
 
[[File: Capture13.JPG|center|frameless|upright=4.1]]
 
The data in the table above validate our prediction that male players stake more per bet: stake amounts for male players are markedly higher than those of females (over 100% more than females at each quartile). Additionally, the standard deviation of stake amounts is also larger for male players than for female players.
 
<br/>
 
[[File: Capture14.JPG|center|frameless|upright=4.1]]
 
Contrary to our hypothesis, females appear to be the gender that prefers higher odds, as shown in the table above, whereas male players tend to prefer lower odds. Linking back to the previous table, the general betting behaviour of male players is perhaps to focus on low odds and bet large amounts to reap bigger profits, while the general strategy of female players is to bet smaller amounts but leverage higher odds to achieve good returns.
 
<br/>
 
[[File: Capture15.JPG|center|frameless|upright=4.1]]
 
The statistics above consist only of unprofitable users in each gender group. Based on the table above, male players record larger bet losses than their female counterparts in spite of their preference for lower odds, probably due to the larger bet amounts they place.
 
<br/>
 
[[File: Capture16.JPG|center|frameless|upright=14.1]]
 
Looking at the distribution of odds across the two genders in the above diagrams, we can see that female players have a higher spike in transaction counts for higher odds (in the odds range from 6 to 10) than male players, whose overall demand for lower odds appears to be higher than their demand for higher odds.
 
 
 
=== Age group (only for Platinum customers due to limited information) ===
 
<br/>
 
[[File: Capture17.JPG|center|frameless|upright=4.1]]
 
Among Platinum account holders, about 91% are aged between 30 and 59 years old, with the highest number in the range of 40 to 49 years old, as shown in the bar graph above. Due to the very low number of account holders in the 20 to 29 and 70 to 79 years old groups, the following statistics pertaining to these two groups can be ignored.
 
 
 
With insights from our client, our group hypothesized the following:
 
 
 
i. Older and middle-aged players would bet larger stake amounts given that they generally have more resources thus more purchasing power.
 
 
 
ii. Older players would prefer bets with large odds given that they are less interested in saving small incremental wins.
 
 
 
There are fewer previous findings about age-related bet behaviour for the other parameters, so we gather insights from the data to understand more about the variation across age groups.
 
<br/>
 
[[File: Capture18.JPG|center|frameless|upright=4.1]]
 
Based on the table above, the age group of 40 to 49 years old has higher betting amounts than the rest of the groups, probably due to the high number of working adults in this category. This observation supports the first part of our first hypothesis: middle-aged people, who are at a more financially stable stage of career and life, spend the most and make up the majority of the larger stakes.
 
<br/>
 
[[File: Capture19.JPG|center|frameless|upright=4.1]]
 
Based on the table above, the data support our hypothesis that older people tend to prefer bets with higher odds. At the lower quartile, median and upper quartile alike, the odds increase as the age group increases.
 
<br/>
 
[[File: Capture20.JPG|center|frameless|upright=4.1]]
 
The statistics above consist only of unprofitable users in each age group. Based on the table above, the age group of 60 to 69 years old has lower bet losses than the rest of the groups, despite the fact that this group prefers higher odds. Feedback from our client suggested that betting knowledge may come with experience, making this group the “smartest gamblers” in the sample.
 
<br/>
 
[[File: Capture21.JPG|center|frameless|upright=4.1]]
 
Based on the table above, the account holders within the 60 to 69 years old age group place their bets much earlier than the rest of the groups. For example, comparing the median across the age groups, the 60 to 69 years old age group bets about two hours earlier than the rest of the groups.
 
 
 
=== Based on transaction count ===
 
 
 
Besides segregating the customers based on their own demographic information, we also divided the customers into three groups (low, medium and high frequency) based on their frequency of betting transactions. Before deciding on the boundaries of the division, we looked at the distribution of the users’ transaction count which is seen in the table below.
 
<br/>
 
[[File: Capture22.JPG|center|frameless|upright=4.1]]
 
Based on the table above, we divided the users using the lower and upper quartile values. The first group (‘low frequency’, or LF) comprises users with a transaction count of less than 10, the second group (‘medium frequency’, or MF) comprises users with a transaction count of more than 9 but less than 110, and the third group (‘high frequency’, or HF) comprises users with a transaction count of more than 109.
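The grouping rule above can be sketched in R with findInterval(). The boundaries of 10 and 110 come from the text, while the sample counts are made up for illustration.

```r
# Made-up per-user transaction counts, including the boundary values.
transaction_count <- c(1, 9, 10, 57, 109, 110, 450)

# findInterval() returns 0 below 10, 1 for counts in [10, 110)
# and 2 for counts of 110 or more.
group_index <- findInterval(transaction_count, c(10, 110))

# Map 0/1/2 to the LF/MF/HF labels used in the text.
freq_group <- c("LF", "MF", "HF")[group_index + 1]
```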
 
 
 
With insights from our client, our group hypothesized the following:
 
 
 
i.  Less frequent players or one-off players would tend to prefer placing higher stake bets for a quick and large win.
 
 
 
ii. Frequent players would prefer bets with lower odds, as this safeguards them against hefty losses that could affect their ability to make future bets.
 
<br/>
 
[[File: Capture23.JPG|center|frameless|upright=4.1]]
 
Based on the table above, the data support our hypothesis about stake amount preference: the frequency of betting transactions is inversely related to the stake amount. The HF group records lower stake amounts, whereas the LF group records higher stake amounts.
 
<br/>
 
[[File: Capture24.JPG|center|frameless|upright=4.1]]
 
Based on the findings, it can be seen that the LF group prefers higher betting odds than the groups with higher transaction frequency. This supports our hypothesis that LF players are in it for a large win; alternatively, the low purchase frequency could be due to cash flow issues, which we will look into later through their winnings and losses statistics.
 
<br/>
 
[[File: Capture25.JPG|center|frameless|upright=4.1]]
 
The statistics above consist only of unprofitable users in each group, and the HF group shows significantly higher bet losses than the other groups. Meanwhile, the LF group, who place large bets and prefer higher odds, performed least badly. Given the HF group’s preference for lower betting odds and lower stake amounts, one would expect it to record the smallest losses of the three groups. However, with a high frequency of betting transactions and a majority of losing bets, the losses add up over time to the largest overall bet loss among the three groups.
 
 
 
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Going Forward</font></div></div>==
 
 
 
=== Cluster Analysis ===
 
 
 
Based on our findings from the exploratory data analysis, we are now able to understand the relationship between variables and how these variables differ across different types of grouping (e.g. male & female, gold & platinum, age groups). Hence, for the next step before we start on our clustering analysis, we will have to normalize certain data points in order to nullify the extreme differences between data points.
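One common way to do this is z-score standardization, sketched below with base R scale() on toy data. This is an assumption about the form the normalization could take, not a record of the method finally used, and the column names are illustrative.

```r
# Toy user-level metrics; the 500 stands in for an extreme player.
user_list <- data.frame(
  median_stake = c(10, 20, 30, 500),
  median_odds  = c(2.0, 3.0, 4.0, 3.0)
)

# scale() subtracts each column's mean and divides by its standard deviation,
# so every metric ends up on a comparable scale (mean 0, SD 1).
scaled <- scale(user_list)

column_means <- colMeans(scaled)
column_sds   <- apply(scaled, 2, sd)
```

Standardizing keeps a large-valued metric such as stake from dominating the distance calculations used in clustering.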
 
 
 
For the clustering analysis itself, we will use SAS Enterprise Guide and run a few iterations of the clustering process before finalizing the clusters. We will then review the clusters with the client to assess their commercial usefulness. However, our group noted that the lack of demographic information for each user account (such as income range, occupation and education level) might limit our analysis.
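The iterate-then-finalize workflow can be illustrated in R as well. The sketch below uses base R's `kmeans` purely as a stand-in; it is not the SAS Enterprise Guide procedure the team plans to use, and the data, column names and cluster counts are hypothetical.

```r
set.seed(42)  # k-means starts from random centres; fix the seed for repeatability

# Hypothetical, already-normalized per-user metrics (illustrative values only).
userMetrics <- data.frame(
  MEDSTAKEAMOUNT = c(-1.0, -0.9, -1.1,  1.0,  1.1,  0.9),
  TXN_COUNT      = c( 1.0,  1.1,  0.9, -1.0, -0.9, -1.1)
)

# Iterate over a few candidate cluster counts and record the total
# within-cluster sum of squares of each solution for comparison.
candidateK <- 2:3
withinSS <- sapply(candidateK, function(k) {
  kmeans(userMetrics, centers = k, nstart = 10)$tot.withinss
})

# Fit the chosen solution (k = 2 here, purely for illustration) and
# attach the cluster label to each user.
fit <- kmeans(userMetrics, centers = 2, nstart = 10)
userMetrics$CLUSTER <- fit$cluster
```

Comparing `withinSS` across candidate values of k is one common way to choose the number of clusters before reviewing the resulting segments with the client.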
 
 
 
=== Automated Data Cleaning ===
 
 
 
The given dataset is fixed and covers only the period from January to March 2015. If we used only “manual” analytical tools to generate results for that dataset and visualized them statically with D3.js, the client would not be able to reuse the work for other time periods. We therefore decided to make the dashboard flexible: the client can upload their own dataset (assuming it is in the same format) and the final dashboard will reflect the results of the new dataset. This requires a tool, library or programming language that allows us to do “automated data cleaning”, so that the final version of our dashboard can generate the desired outcomes without us having to perform any analysis manually.
 
 
 
Languages such as Python or MATLAB could do this job, but we decided to use R because it is open source, has strong community support with an enormous number of libraries and packages, and some of our team members already have experience with it.
 
 
 
In our R code, we follow exactly the same steps as described in our Data Cleaning and Data Transformation segments to generate our analytical data cubes automatically.
 
 
 
First, we use the read.csv function to read the raw dataset into the R environment:
 
 
 
transactions = read.csv("/DIRECTORY_OF_THE_FILE/FY14 SPORTS TRANSACTIONS TSOPENED_01012015_31032015.csv")
 
 
 
Second, we generate new variables that hold the “filtered” dataset after removing irrelevant records, using the same criteria as stated in the Data Cleaning part.
 
 
 
filteredTransactions <- transactions[!is.na(transactions$ODDS),]
 
filteredTransaction
 
  
Similarly, we also remove the outliers:
  
nonNullOdds = na.exclude(filteredTransactions$ODDS)
outlierOddsThreshold = quantile(nonNullOdds,c(.99))
 
nonNullStakeAmount = na.exclude(filteredTransactions$STAKEAMOUNT)
 
outlierStakeAmountThreshold = quantile(nonNullStakeAmount,c(.99))
 
filteredTransactions <- subset(filteredTransactions, filteredTransactions$ODDS < outlierOddsThreshold)

filteredTransactions <- subset(filteredTransactions, filteredTransactions$STAKEAMOUNT < outlierStakeAmountThreshold)
 
  
For the data transformation part, we first need to determine thresholds for HF, MF and LF customers, as well as for Low, Medium and High odds:
  
lowOddQuantile = quantile(transactions$ODDS, c(.25), na.rm = TRUE)

highOddQuantile = quantile(transactions$ODDS, c(.75), na.rm = TRUE)

transactions$RISK_TYPE[transactions$ODDS < lowOddQuantile] <- 'L'

transactions$RISK_TYPE[transactions$ODDS >= lowOddQuantile & transactions$ODDS <= highOddQuantile] <- 'M'

transactions$RISK_TYPE[transactions$ODDS > highOddQuantile] <- 'H'
 
  
Finally, data aggregation from per-transaction to per-user level is done using the R functions “aggregate” and “merge”. First, for each user metric, we create a new data frame with two columns: the dummy account ID and the aggregated value (median, mean, SD or probability). For example:
  
MEDSTAKEAMOUNT <- aggregate(x = transaction.clean$STAKEAMOUNT, by = list(transaction.clean$ACCOUNT_DUMMY), median)
  
This will show us the median stake amount of each account.
  
Finally, to combine all of those data frames into a single data frame, we apply the merge function repeatedly, reducing the number of data frames each time. For example:
  
aggregatedData <- merge(MEDSTAKEAMOUNT,MEDODD,by="Group.1")
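The aggregate-then-merge steps can be sketched end to end on toy data. In the example below, the transaction values, the helper `perUser` and the metric name `MEANSTAKEAMOUNT` are hypothetical additions for illustration. Folding the list with `Reduce` is one possible way to merge repeatedly and is not necessarily how the team's script does it.

```r
# Hypothetical cleaned transactions; values are illustrative only.
transaction.clean <- data.frame(
  ACCOUNT_DUMMY = c("A1", "A1", "A1", "A2", "A2"),
  STAKEAMOUNT   = c(10, 20, 30, 5, 15),
  ODDS          = c(1.5, 2.0, 2.5, 3.0, 4.0)
)

# Hypothetical helper: aggregate one column per account and give the result
# a descriptive name instead of the default "x".
perUser <- function(values, fun, name) {
  out <- aggregate(x = values, by = list(transaction.clean$ACCOUNT_DUMMY), FUN = fun)
  names(out) <- c("Group.1", name)
  out
}

metrics <- list(
  perUser(transaction.clean$STAKEAMOUNT, median, "MEDSTAKEAMOUNT"),
  perUser(transaction.clean$ODDS,        median, "MEDODD"),
  perUser(transaction.clean$STAKEAMOUNT, mean,   "MEANSTAKEAMOUNT")
)

# Fold the list into a single data frame, merging pairwise on the account key.
aggregatedData <- Reduce(function(a, b) merge(a, b, by = "Group.1"), metrics)
```

Renaming each aggregated column before merging avoids the `x.x` and `x.y` suffixes that `merge` would otherwise generate for columns sharing the default name.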
  
=== Dashboard ===
 
  
Since we use R both for our automated data cleaning process and for generating the data to be visualized by D3.js, we need a mechanism to integrate R into the workflow of a web application. “Full stack” solutions such as the Shiny framework, which allow the whole web application (both frontend and backend) to be written in R, are not suitable because of performance concerns: processing our given data of nearly 1 million records would take very long, let alone the possibly bigger datasets imported by Singapore Pools in the future.
 
To resolve this, we decided to keep our frontend purely in HTML/CSS/JavaScript and build only the backend in R, using rApache. The backend will expose APIs (whose parameters are essentially user inputs) that are consumed by the frontend. We will shape the API responses to be “friendly” to D3.js so that it can visualize them directly without much data transformation.
 
With rApache, it is also possible to host our web application on a standard Ubuntu server or cloud instance. This offers flexibility in deployment, which will be helpful if our client needs more processing power for bigger datasets in the future.
 
  
=== Wireframe ===
  
Our dashboard will consist of 3 views:
 
  
1) The Primary Transaction Overview
 
  
2) The User Segment (Cluster) Selection View
  
3) The Specific User Profile Overview
 
  
The administrative user first logs in to the dashboard and lands on the “Transaction Overview” page. This page allows users to preview the entire set of transaction data. A trend line graph will display certain parameters over a selected period of months, which users can choose from a drop-down list. There will also be a summary report of the current month’s statistics at the end of the page (parameters will be determined by the client at a later stage).
 
  
Some parameters to be displayed include the following (to be revised with the client):
  
- Total revenue
 
  
- Winning ratio of SG Pools
 
  
- Number of transactions for the last month
  
- New member sign-ups for the last month
 
  
- Total profits
 
  
- Total pay outs
  
- Number of unique players
 
  
- Top 10 players with the highest stake amount for the month
 
  
- Top selling product for the month
  
- Top selling event type for the month
 
  
- Most popular league for the month
 
  
The side bar aids navigation through the dashboard: users can easily switch between “Transaction Overview” and “User Profiles”. Secondary functions, such as a calendar feature for jotting notes, might be added later during prototype development.
<br/>
 
[[File: Capture26.jpg|center|frameless|upright=4.1]]
 
Upon clicking the “User Profiles” tab on the side bar, users will land on the “Segment Selection” page, where they can narrow down the segment of players (based on cluster, gender, or account type) that they want to view. They can further narrow in on a particular player by clicking on their ID in the list of players from the segment group. Alternatively, they can use the search bar to find a particular player ID.
 
<br/>
 
[[File: Capture27.jpg|center|frameless|upright=4.1]]
 
As seen in the screenshot below, on the “User Profile” overview page, users can view several key parameters about a player's purchases, bet pattern type, amount of returns, favourite bet preferences, etc. As in “Transaction Overview”, a trend line graph will display certain parameters over a selected period of months, which users can choose from a drop-down list. The difference here is a secondary trend line that charts the overall population's parameters against those of the selected user, allowing comparison of the individual's bet patterns with others. The final set of parameters to be displayed will be confirmed with the sponsor.
 
<br/>
 
[[File: Capture28.jpg|center|frameless|upright=4.1]]
 
There will also be a summary report of the user at the bottom of the page, which allows the statistics to be exported easily by copying and pasting.
 

Revision as of 15:07, 17 April 2016




Abstract

In today’s interconnected world, the gambling environment has transformed into a multifaceted playing field without boundaries, exposing more people, and younger people, to the games, and creating loopholes for illegal gambling operators to enter the market. The result is greater public worry about the social ills of irresponsible gambling.

Our sponsor, Singapore Pools, takes a strong stand on responsible gaming and wants to offer a safer outlet for the public to play. This paper explores gambling transaction data (n=930,000) to identify and better understand betting patterns that would eventually allow us to flag players who engage in, or are susceptible to, irresponsible gambling, and in turn to suggest ways to promote responsible gambling. This paper also draws on past literature to guide our methodological approach and to cross-compare hypotheses and findings.

The methodological flow of this project begins with exploratory data analysis, where the dataset is cleaned and transformed for further analysis. The large set of transactions is aggregated into a list of user data. We then proceed to relationship analysis of the parameters and bet preferences of players of different demographics. Using clustering analysis, we then profile players into four main segments: (1) Masses (2) High-rollers (3) Players-at-risk (4) Habituals.

This segmentation allows our sponsor to identify players who are at risk of irresponsible gambling, and to devise strategies to reach out to these segments, alert them to their betting behaviour and educate them about responsible betting. To ensure project continuity and support future analyses, our team has created a dynamic dashboard that visualises monthly transaction trends, highlights popular events and players who are at risk, and allows exploration of each individual player’s profile and betting patterns (i.e. their betting intensity and transaction history).

Introduction

Gambling is often seen as a problem in society, and gambling addiction no doubt poses a grave societal problem. However, banning gambling is not a viable solution, for it would simply drive these activities underground. Our sponsor, Singapore Pools, was set up by the Singapore government in 1968 to place gambling on legal grounds and to deal with the social ills tied to gambling. Ever since, Singapore Pools (SG Pools) has been the sole legalized operator of lotteries and sports betting in Singapore.

Unlike in most countries, where gambling houses are privately owned organizations, Singapore Pools is a state-owned organization registered under Singapore’s Ministry of Finance. Singapore Pools offers four main products to the public (TOTO, Singapore Sweep, 4D, Sports Betting), whose operations and product configurations are all regulated by Singapore’s Ministry of Home Affairs, Ministry of Finance, and Ministry of Social & Family Development.

Our sponsor takes a strong stand on responsible gaming and wants to offer a safer outlet where players can bet responsibly within their financial means. Attrition rates have been rising over the years, which could mean that Singapore Pools’ customers are seeking other avenues for gambling, such as illegal online gambling sites, which may lead to irresponsible betting. Therefore, within the next few years, our sponsor seeks to take a data-driven approach to promoting responsible gambling by monitoring players’ betting behaviour and performance, in the hope of highlighting alarming patterns that could indicate irresponsible gambling. The sponsor also intends to use this data to help usher in their online betting platform, which is scheduled to launch in the upcoming year.

Our sponsor has been collecting user and transaction data for the past several years, but has yet to put it to good use. Singapore Pools set up a customer insights division about a year ago to better understand their customers through the analysis of this user data. This is their first step towards a data-driven approach to promoting responsible gambling and understanding the gambling behavioural patterns of their customers, and this is where our team comes in.

Project Objectives

The aim of this project is to provide Singapore Pools with a better understanding of the gambling behaviours of their customers through the identification of betting preferences and patterns. Clusters of players may be identified based on their betting behaviour: ways of splitting their bets, preference for a league, different decision-making processes, and ways of choosing their bet selections. Such behavioural patterns could possibly be linked back to certain demographics of the cluster, allowing us to infer reasons behind their gambling habits and, hopefully, to identify irresponsible gamblers as well.

The scope of our project is limited to the Sports Betting segment of customers who have opened betting accounts with Singapore Pools. The data provided are confined to line betting transactions made by their ‘Gold’ and ‘Platinum’ members.

The overall objectives of our project are as follows: (1) provide insights into gambling behavioural patterns; (2) profile the existing pool of customers into meaningful segments; (3) build a dashboard to visualize betting patterns and trends at a macro and individual level.

Based on the characteristics of each cluster, the sponsor’s end objective is to (1) flag players who display alarming patterns that could lead to irresponsible betting, and (2) tailor business actions that target the derived clusters of players to enhance their gambling experience while ensuring that they bet in a responsible fashion.

Literature Review

Gambling is a widely researched topic across the world, from survey polls of gambling participation and perception, to gambling risk and pathology, to thorough statistical analyses of gambling behaviours.

According to a survey done by Singapore’s Ministry of Community Development, Youth and Sports (MCYS), within a one-year period, 58% of Singaporeans over 18 years of age participated in at least one gambling activity. A further study on pathological gamblers by the MCYS found that players at risk of developing a gambling addiction would gamble at least once a week, and this pool of susceptible players made up 70% of the sample population involved in the study (2005).

Behavioural or betting patterns are another popular area of study, for they provide cues to possible pathological gambling behaviours. A common finding in most studies is an evident difference between the betting behaviour of regular players and that of players at risk (problem gamblers). One study revealed that gamblers at risk are more likely to bet more frequently, with increasing bet amounts, regardless of their bet outcome (Mizerski, 2011). It also found that less frequent players are more likely than regular or frequent players to put more effort into decision-making when betting, to allow for future betting possibilities, and that certain betting games and game arrangements may actually prompt reckless betting that could lead to irresponsible gambling.

Several other papers provided insights into a more analytical approach to segmenting gamblers and identifying those at risk. A study by Faregh and Leth-Steensen (2011) discovered clusters of players with variations in terms of their bet activity level (frequency), bet variability (spread of stakes and odds), time spent on making the bets, and the games played. Relationship and predictive analysis between selected parameters may reveal variables that best predict returns, and reflect bet strategies that are less sophisticated (Gainsbury & Russell, 2013). Suggestions on data collection procedures and on the selection of metrics and parameters for clustering players are some of the secondary insights from these papers that have guided our choice of methodology and analysis, including ways of profiling our resulting clusters, which will be elaborated on later in this report.

Besides researching the field of gambling and the analytical methodologies, we examined past data visualization papers to learn about the pitfalls and best practices of data visualization. “Different types of graphs are designed to communicate different types of messages”, notes data visualization expert Stephen Few, whose papers range from the effective use of points and lines to show data trends to the principles of colour selection for data visualization, such as the use of contrasting or analogous colours for different purposes (Few, 2004; 2006; 2007). Some graphs are best avoided, such as alluring 3-D graphs or pie charts, which can be rendered better in a two-dimensional plane, because the added depth and angle make interpretation more difficult (Few, 2005). Returning to the dashboard, two guiding principles for the dashboard layout that we took from Few’s recommendations were to (1) find a balance between being information-rich and not oversimplifying and (2) remove clutter or any distractions that do not add value (Few, 2005).

Leveraging this prior knowledge, our team hopes to deliver actionable insights into the betting behaviour of our sponsor’s pool of customers, and to present the findings on a dashboard that is visually appealing, intuitive, practical, and accurately data-driven.

Data Cleaning & Transformation

Exploratory Data Analysis

Methodology for Clustering

Results & Findings

Recommendations

Project Dashboard

Limitations & Challenges

Further Development