Difference between revisions of "AY1516 T2 Sport Betting at Singapore Pools Project Overview Final"

From Analytics Practicum
<br>
 
<font face="Century Gothic">
 
{| style="background-color:#FFFFFF; color:#007BBD padding: 5px 0 0 0;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0" |
 
| style="padding:0.3em; font-family:Helvetica; font-size:120%; border-bottom:2px solid #007BBD; border-top:2px solid #007BBD; border-left:2px solid #007BBD; background:#007BBD; text-align:center;" width="20%" |
 
[[AY1516_T2_Sport_Betting_at_Singapore_Pools|<font face ="Century Gothic" color="#FFFFFF"><strong>THE SPONSOR</strong></font>]]
 
| style="border-bottom:2px solid #007BBD; border-top:2px solid #007BBD; background:#007BBD;" width="1%" | &nbsp;
 
| style="padding:0.3em; font-family:Helvetica; font-size:120%; border-bottom:2px #007BBD; border-top:2px solid #007BBD; background:#007BBD; text-align:center;" width="20%" | 
 
[[AY1516_T2_Sport_Betting_at_Singapore_Pools_Team|<font  face ="Century Gothic" color="#FFFFFF"><strong> THE TEAM </strong></font>]]
 
| style="border-bottom:2px solid #007BBD; border-top:2px solid #007BBD; background:#007BBD;" width="1%" | &nbsp;
 
| style="padding:0.3em; font-family:Helvetica; font-size:120%; border-bottom:2px solid #007BBD; border-top:2px solid #007BBD; background:#FFFFFF; text-align:center;" width="20%" | 
 
[[AY1516_T2_Sport_Betting_at_Singapore_Pools_Project_Overview_Proposal|<font face ="Century Gothic" color="#007BBD"><strong> THE OVERVIEW</strong></font>]]
 
| style="border-bottom:2px solid #007BBD; border-top:2px solid #007BBD; background:#007BBD;" width="1%" | &nbsp;
 
| style="padding:0.3em; font-family:Helvetica; font-size:120%; border-bottom:2px #007BBD; border-top:2px solid #007BBD; background:#007BBD; text-align:center;" width="20%" |
 
[[AY1516_T2_Sport_Betting_at_Singapore_Pools_Project_Management|<font  face ="Century Gothic" color="#FFFFFF"><strong> THE MANAGEMENT </strong></font>]]
 
| style="border-bottom:2px solid #007BBD; border-top:2px solid #007BBD; background:#007BBD;" width="1%" | &nbsp;
 
| style="padding:0.3em; font-family:Helvetica; font-size:120%; border-bottom:2px solid #007BBD; border-top:2px solid #007BBD; background:#007BBD; text-align:center;" width="20%" |
 
[[AY1516_T2_Sport_Betting_at_Singapore_Pools_Documentation|<font  face ="Century Gothic" color="#FFFFFF"><strong> THE DOCUMENTS </strong></font>]]
 
| style="border-bottom:2px solid #007BBD; border-top:2px solid #007BBD; background:#007BBD;" width="1%" | &nbsp;
 
| style="padding:0.3em; font-family:Helvetica; font-size:120%; border-bottom:2px solid #007BBD; border-top:2px solid #007BBD; background:#007BBD; text-align:center;" width="20%" |
 
|}
 
<br>
 
  
<!--Sub-Navigation-->
 
{| style="background-color:white; color:white ; border: 0px solid #007BBD; margin-left: auto; margin-right: auto;" width="800px" height=50px cellspacing="0" cellpadding="0" valign="top"  |
 
 
| style="padding:0 .3em;  solid #000000; padding: 10px; text-align:center; border: 1px solid grey;" width="33%" | [[AY1516_T2_Sport_Betting_at_Singapore_Pools_Project_Overview_Proposal| <font face = "Arial" color="#101010"><b>Proposal</b></font>]]
 
 
| style="padding:0 .3em;  solid #00000;  padding: 10px; text-align:center; border: 1px solid grey; " width="33%" | [[AY1516_T2_Sport_Betting_at_Singapore_Pools_Project_Overview_Midterm| <font face = "Arial" color="#101010"><b>Midterm </b></font>]]
 
 
| style="padding:0 .3em;  solid #000000; padding: 10px; text-align:center; background-color: grey; border: 1px solid grey;" width="33%" | [[AY1516_T2_Sport_Betting_at_Singapore_Pools_Project_Overview_Final| <font face = "Arial" color="white"><b>Final</b></font>]]
 
 
|
 
 
|}
 
 
<!--END OF Sub-Navigation-->
 
 
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Abstract</font></div></div>==
 
 
<center><b>An Analysis of Sports Betting Behaviour in Singapore</b></center>
 
 
In today’s interconnected world, the gambling environment has become a multifaceted playing field without boundaries. This exposes more people, and younger people, to the games, and it creates loopholes for illegal gambling operators to enter the market. The result is greater public concern about the social ills of irresponsible gambling.
 
 
Our sponsor, Singapore Pools, takes a strong stand on responsible gaming and wants to offer a safer outlet for the public to play. This paper explores gambling transaction data to identify and better understand betting patterns, so that we can flag players who engage in, or are susceptible to, irresponsible gambling, and in turn suggest ways to promote responsible gambling. We also consult past literature to guide our methodological approach and to cross-compare hypotheses and findings.
 
 
The methodological flow of this project begins with exploratory data analysis, in which the dataset is cleaned and transformed for further analysis. The large set of transactions is aggregated into a list of user data. We then proceed to a relationship analysis of the parameters and of the bet preferences of players of different demographics. Using a clustering analysis, we then profile players into four main segments: (1) Masses, (2) High-rollers, (3) Players-at-risk and (4) Habituals.
 
 
This segmentation allows our sponsor to identify players who are at risk of irresponsible gambling, and to devise strategies to reach out to these segments, alert them to their betting behaviour and educate them about responsible betting. To ensure project continuity and support future analyses, our team has created a dynamic dashboard that visualises monthly transaction trends, highlights popular events and players who are at risk, and allows exploration of each individual player’s profile and betting patterns (i.e. their betting intensity and transaction history).
 
 
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Introduction</font></div></div>==
 
 
Gambling is often seen as a problem in society. Gambling addiction no doubt poses a grave societal problem; however, banning gambling is not a viable solution, for it would simply drive these activities underground. Our sponsor, Singapore Pools, was set up by the Singapore government in 1968 to place gambling on legal grounds and to deal with the social ills tied to gambling. Ever since, Singapore Pools (SG Pools) has been the sole legalized operator of lotteries and sports betting in Singapore.
 
 
Unlike in most countries, where gambling houses are privately owned organizations, Singapore Pools is a state-owned organization, registered under Singapore’s Ministry of Finance. Singapore Pools offers four main products to the public (TOTO, Singapore Sweep, 4D and Sports Betting), all of which are regulated, in both operations and product configuration, by Singapore’s Ministry of Home Affairs, Ministry of Finance and Ministry of Social & Family Development.
 
 
Our sponsor takes a strong stand on responsible gaming, wanting to offer a safer outlet where players can bet responsibly within their financial means. Attrition rates have been rising over the years, which could mean that Singapore Pools’ customers are seeking other avenues for gambling, such as illegal online gambling sites, which may lead to irresponsible betting. Therefore, within the next few years, our sponsor seeks to take a data-driven approach to promoting responsible gambling by monitoring players’ betting behaviour and performance, in the hope of highlighting alarming patterns that could indicate irresponsible gambling. The sponsor also intends to use this data to help usher in its online betting platform, which is scheduled to launch in the upcoming year.
 
 
Fortunately, our sponsor has been collecting user and transaction data for the past several years, but has yet to put it to good use. Singapore Pools set up a customer insights division about a year ago to better understand its customers through the analysis of this user data. This is its first step towards a data-driven approach to promoting responsible gambling and to understanding the gambling behavioural patterns of its customers, and this is where our team comes in.
 
 
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Project Objectives</font></div></div>==
 
 
The aim of this project is to provide Singapore Pools with a better understanding of the gambling behaviours of their customers through the identification of betting preferences and patterns. Clusters of players may be identified based on their betting behaviour: ways of splitting their bets, preference for a league, decision-making processes, and ways of choosing their bet selections. Such behavioural patterns could possibly be linked back to certain demographics within the cluster, allowing us to infer reasons behind their gambling habits and, we hope, to identify irresponsible gamblers as well.
 
 
The scope of our project is limited to the Sports Betting segment of customers who have opened betting accounts with Singapore Pools. The data provided are confined to line betting transactions made by their ‘Gold’ and ‘Platinum’ members.
 
 
<center>
 
<b>The overall objectives of our project are as follows:
 
 
(1) Provide insights with regard to gambling behavioural patterns
 
 
(2) Profile their existing pool of customers into meaningful segments
 
 
(3) Build a dashboard to visualize betting patterns and trends on a macro and individual level</b>
 
</center>
 
 
Based on the characteristics of each cluster, the sponsor’s end objective is to (1) flag players who display alarming patterns that could lead to irresponsible betting, and (2) tailor business actions that target the derived clusters of players, to enhance their gambling experience while ensuring that they bet responsibly.
 
 
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Literature Review</font></div></div>==
 
 
Gambling is a widely researched topic across the world, with work ranging from survey polls of gambling participation and perception, through gambling risk and pathology, to thorough statistical analysis of gambling behaviours.
 
 
According to a survey by Singapore’s Ministry of Community Development, Youth and Sports (MCYS), within a one-year period, 58% of Singaporeans over 18 years of age participated in at least one gambling activity. A further study on pathological gamblers by the MCYS found that players at risk of developing a gambling addiction gamble at least once a week, and that this pool of susceptible players made up 70% of the sample population involved in the study (2005).
 
 
Behavioural or betting patterns are another popular area of study, for they provide cues to possible pathological gambling behaviours. A common finding in most studies is an evident difference between the betting behaviour of regular players and that of players at risk (problem gamblers). One study revealed that gamblers at risk are more likely to bet more frequently, with increasing bet amounts, regardless of their bet outcome (Mizerski, 2011). It also found that less frequent players are more likely than regular or frequent players to put effort into decision-making when making bets, to allow for future betting possibilities. The same study found that certain betting games and game arrangements may actually prompt reckless betting that could lead to irresponsible gambling.
 
 
Several other papers provided insights into a more analytical approach to segmenting gamblers and identifying those at risk. A study by Faregh and Leth-Steensen (2011) discovered clusters of players that varied in their bet activity level (frequency), bet variability (spread of stakes and odds), time spent on making the bets, and the games played. Relationship and predictive analysis between selected parameters may reveal the variables that best predict returns, and may reflect bet strategies that are less sophisticated (Gainsbury & Russell, 2013). Suggestions in these papers on data collection procedures and on the selection of metrics and parameters for clustering players are some of the secondary insights that aided our choice of methodology and analysis, including the way we profile our resulting clusters, which is elaborated on later in this report.
 
 
Besides researching the field of gambling and the analytical methodologies, we examined past data visualization papers to learn about the pitfalls and best practices of data visualization. “Different types of graphs are designed to communicate different types of messages”, notes data visualization expert Stephen Few, as he demonstrates in papers ranging from the effective use of points and lines to shape data trends to the principles of colour selection for data visualization, such as the use of contrasting or analogous colours for varying purposes (Few, 2004; 2006; 2007). Some graphs are best avoided, such as alluring 3-D graphs or pie charts that could be rendered better in a two-dimensional plane, for the added depth and angle make interpretation more difficult (Few, 2005). For the dashboard itself, the two guiding layout principles that we took from Few’s recommendations were (1) to find a balance between being information-rich and not oversimplifying, and (2) to remove clutter or any distractions that do not add value (Few, 2005).
 
 
Leveraging this prior knowledge, our team hopes to deliver actionable insights with regard to the betting behaviour of our sponsor’s pool of customers, and to present the findings on a dashboard that is visually appealing, intuitive, practical and accurately data-driven.
 
 
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Methodology for Clustering</font></div></div>==
 
 
===Choice of Tools===
 
 
The two tools that we used for the clustering analysis phase were SAS Enterprise Guide and JMP Pro 12. SAS Enterprise Guide’s process map offers a convenient way to organize and track our procedures, from early data exploration tasks to the subsequent multivariate analysis. It also allows quick modification of the tasks in each module, which makes it much easier to repeat an analysis with slight variations in input. The main drawback of the software is its lack of (or poor) visualization of the analysis results, so we complemented it with JMP Pro 12 to produce the necessary data plots and charts.
 
 
===Standardization of Variables===
 
 
Our data preparation for clustering analysis involved standardizing and normalizing the input parameters. Two methods were used for the standardization process: (1) taking Z-scores of the variables, and (2) transforming the parameters into a probability of being selected as an event.
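
The two standardization methods can be illustrated with a minimal sketch. It is not the SAS or JMP procedure used in the project: the function names, the use of the population standard deviation, and the reading of method (2) as the share of a player's bets in each odds category (H, M, L) are all assumptions made for illustration, and the sample values are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - mu) / sigma for v in values]


def category_shares(labels, categories=("H", "M", "L")):
    """Share of one player's bets that falls in each category (sums to 1)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: (counts[c] / total if total else 0.0) for c in categories}


# Hypothetical per-player average stakes and one player's odds types.
stakes = [10.0, 20.0, 30.0, 40.0]
print(z_scores(stakes))
print(category_shares(["H", "H", "L", "M"]))
```

Both transformations put the inputs on comparable scales, so that no single variable dominates the distance calculations in the clustering step.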
 
 
To identify the need to normalize the data variables, we plotted their distributions in JMP to check for skewness in the spread. Instead of applying the commonly used statistical methods for removing skewness from input variables, such as Winsorizing the variables or taking their logarithm, we decided to take a different approach. Normalizing the data does not always improve the results; it may pose problems for clustering algorithms by transforming spherical clusters into elliptical ones, which affects the density zoning. Returning to our EDA step, we found that we could remove the long tails of our input variables by filtering out certain “volatile” users who made up the outliers.
 
 
===Filtering of Players===
 
 
Our aggregated dataset retains users with few transaction records, who may be new customers or one-off bettors. As such, they do not have a stable betting pattern; in addition, those single (or few) bets alone would misrepresent certain parameters. For example, for a player with a single transaction, one winning bet would mean a winning ratio of 100%.
 
 
In light of this finding, we conducted three variations of the clustering analysis. For the first, we filtered out single-transaction users; the second included only users with more than 5 transactions (a threshold determined from the 10th percentile of the spread of transaction counts); and the last covered users with more than 30 transactions (a statistical rule of thumb to accommodate the t-test). Indeed, users with more than 30 transactions showed a more stable betting pattern.
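
The filtering step can be sketched as follows. This is only an illustration under stated assumptions: the record layout (a player ID and a won/lost flag), the function names and the sample data are hypothetical, and the thresholds 1, 5 and 30 are the ones described above.

```python
from collections import defaultdict


def aggregate_players(transactions):
    """Collapse (player_id, won) transaction rows into per-player counts."""
    stats = defaultdict(lambda: {"bets": 0, "wins": 0})
    for player_id, won in transactions:
        stats[player_id]["bets"] += 1
        stats[player_id]["wins"] += 1 if won else 0
    return dict(stats)


def filter_by_min_bets(stats, threshold):
    """Keep players with strictly more than `threshold` transactions."""
    return {p: s for p, s in stats.items() if s["bets"] > threshold}


def win_ratio(player_stats):
    """Fraction of a player's bets that were won."""
    return player_stats["wins"] / player_stats["bets"]


# Hypothetical data: player "A" has one winning bet, player "B" has 40 bets.
transactions = [("A", True)] + [("B", i % 4 == 0) for i in range(40)]
stats = aggregate_players(transactions)
for threshold in (1, 5, 30):
    kept = filter_by_min_bets(stats, threshold)
    print(threshold, {p: round(win_ratio(s), 2) for p, s in kept.items()})
```

Player "A" would have a win ratio of 100% from a single bet, which is the kind of misleading value the filters are meant to exclude.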
 
 
The spread of parameters followed a normal distribution more closely, and the hierarchical clustering analysis returned a favourable result for selecting the k-value of the k-means clustering that determines our final cluster results.
 
 
===Hierarchical Clustering===
 
 
[[File:Image228.png|frameless|upright=2.5]]
 
 
The next phase was to carry out a hierarchical clustering to determine the number of clusters to use for our K-means clustering analysis. To select the K-value we used the elbow criterion, comparing the percentage of variance explained with each additional cluster. We identified several candidate points (K = 4, 6), as shown above, where the marginal gain drops and adding another cluster would not noticeably improve the model.
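
The elbow criterion can be expressed as a short sketch. The explained-variance figures below are hypothetical, chosen only so that the example has elbows at K = 4 and K = 6; they are not the project's results, and the 5% cut-off for a "small" marginal gain is likewise an assumption.

```python
def marginal_gains(explained):
    """Gain in explained variance from adding each cluster.

    `explained` maps k to the fraction of variance explained with k clusters.
    """
    ks = sorted(explained)
    return {k: explained[k] - explained[prev] for prev, k in zip(ks, ks[1:])}


def elbow_candidates(explained, min_gain=0.05):
    """Return every k for which the next cluster adds less than `min_gain`."""
    gains = marginal_gains(explained)
    ks = sorted(explained)
    return [k for k, nxt in zip(ks, ks[1:]) if gains[nxt] < min_gain]


# Hypothetical explained-variance curve, for illustration only.
explained = {1: 0.0, 2: 0.30, 3: 0.45, 4: 0.60, 5: 0.62, 6: 0.72, 7: 0.73}
print(elbow_candidates(explained))  # prints [4, 6]
```

When more than one candidate appears, as here, the final choice of K rests on other considerations, such as how many segments are practical to act on.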
 
 
===K-means Clustering===
 
 
Given the limit on the number of clusters set by our sponsor, our team selected a K-value of 4 for the subsequent K-means clustering.
 
 
The input variables for the K-means clustering were the same as those used for the hierarchical clustering:
 
 
- Bet Time Before Match (Z-score)
 
 
- Odds Type H-M-L (%)
 
 
- Stake Amount (Z-score)
 
 
- Spread of Stake Amount (SD Z-score)
 
 
- Profits or Losses (Z-score)
 
 
- Win-Loss Ratio (%)
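

For readers unfamiliar with K-means, the following is a minimal sketch of the algorithm (Lloyd's iterations) in plain Python. It is not the SAS procedure used in the project. The farthest-first initialization is an assumption chosen to make the example deterministic, and the two-dimensional sample points are hypothetical stand-ins for the six standardized variables listed above.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic start: repeatedly pick the point farthest from all picks."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def k_means(points, k, iterations=100):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    centroids = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
        if new_labels == labels:
            break  # assignments stopped changing, so the result is stable
        labels = new_labels
    return centroids, labels


# Hypothetical 2-D points forming four well-separated groups of three.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 0), (10, 1), (11, 0),
    (0, 10), (0, 11), (1, 10),
    (10, 10), (10, 11), (11, 10),
]
centroids, labels = k_means(points, 4)
print(labels)
```

Each player would be represented by one tuple of standardized inputs, and the returned label is that player's segment.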
 
 
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Project Dashboard</font></div></div>==
 
 
<b>Client's Requirements for Dashboard:</b>
 
 
• Ability to view latest performance updates
 
 
• Ability to identify top performing products
 
 
• Ability to highlight new overall monthly transaction trends
 
 
• Ability to flag out players who are at risk of irresponsible gambling
 
 
• Ability to explore betting patterns and preference of specific players
 
 
• Ability to easily switch or input new datasets
 
 
 
<b>Objectives of our Project Dashboard</b>
 
 
[[File:Image202.png|frameless|upright=3.5]]
 
 
<b>Automated Data Cleaning</b>
 
 
Our dashboard is designed to provide continued monitoring capabilities by accommodating future data inputs. To support that feature, and for the convenience of our sponsor, we had to automate the data cleaning process so that the resulting format can be read and displayed by the dashboard. Given the large dataset Singapore Pools accumulates each year, using Microsoft Excel formulas and VBA scripts to clean and transform the data would be inefficient.
 
 
There are many languages suited to the job (such as Python or MATLAB), but we chose R for its open-source nature and strong community support, with an enormous number of libraries and packages. The steps implemented in our R code are similar to those described in our manual data cleaning and transformation procedures, with the final parameters as listed in our analytical data cubes, but output in JSON format to reduce load times.
 
 
To “integrate” R into the workflow of a web application, “full stack” solutions such as the Shiny framework, which covers both frontend and backend, would not be feasible because of their performance with such a large dataset.
 
 
To resolve that issue we decided to keep our frontend purely HTML/CSS/JavaScript and to build only our backend in R, using rApache. Our backend exposes APIs (whose parameters are essentially user inputs) that are consumed by our frontend. We shape the API responses to be “friendly” to D3.js so that it can visualize them directly without much data transformation.
 
 
With rApache, it is also possible to host our web application on a standard Ubuntu server or cloud instance. This offers flexibility in deployment, which will be helpful if our sponsor needs more processing power to handle bigger datasets in future.
 
 
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Limitations & Challenges</font></div></div>==
 
 
<b>1) Large Dataset</b>
 
 
Load time was one of the key considerations when creating the dashboard. Given the large dataset, our team had to veer away from using loops and iterations over arrays to reduce load time. Making the most of JavaScript’s native functions and of libraries such as Underscore.js were some of the ways we optimized performance.
 
 
<b>2) Rigid Data Format</b>
 
 
As a result of the automated data cleaning process, the final data to be bootstrapped into the dashboard is in JSON format. This dataset is thus read as an object, and not as an array, which is the commonly accepted input for most of the D3.js charts used in the dashboard. As such, we had to iterate through the index array [0,...,length-1].
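
The conversion can be illustrated with a short sketch. The dashboard itself performs this step in JavaScript; Python is used here only for illustration. The JSON shape (an object keyed by "0", "1", ...) and the field names are assumptions, since the exact output of the cleaning script is not shown.

```python
import json


def object_to_rows(payload):
    """Turn {"0": row, "1": row, ...} into [row, row, ...] in index order."""
    return [payload[str(i)] for i in range(len(payload))]


# Hypothetical cleaned output, keyed by row index rather than stored as a list.
raw = '{"0": {"player": "A", "stake": 10}, "1": {"player": "B", "stake": 25}}'
rows = object_to_rows(json.loads(raw))
print(rows)
```

Walking the indices 0 to length-1 preserves row order, which iterating over an object's keys does not always guarantee.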
 
 
<b>3) Limiting Sample Date Range</b>
 
 
The sample data given to us covered a historical three-month date range. Our initial implementation was coded to match that range. To accommodate future datasets of varying date ranges, our team had to fine-tune our code to support dynamic start and end dates.
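
One way to avoid hard-coding the range is to derive it from the data itself, as in the minimal sketch below. The `date` field name and the ISO date format are assumptions; the project's own implementation is in JavaScript and R.

```python
from datetime import date


def date_range(transactions):
    """Return the earliest and latest transaction dates found in the data."""
    dates = [date.fromisoformat(t["date"]) for t in transactions]
    return min(dates), max(dates)


def months_covered(start, end):
    """List (year, month) pairs from start to end, inclusive."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append((year, month))
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return months


# Hypothetical transactions spanning a year boundary.
transactions = [
    {"date": "2015-11-03"},
    {"date": "2016-01-20"},
    {"date": "2015-12-15"},
]
start, end = date_range(transactions)
print(start, end, months_covered(start, end))
```

Monthly charts can then be built from `months_covered`, whatever period a new dataset spans.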
 
 
<b>4) Visualization of Clusters</b>
 
 
Given that k-means clustering analysis is an unsupervised learning method, we are unable to automate the analysis and visualize the clustering results on our dashboard. Interpreting the cluster profiles would also require some degree of statistics knowledge. In light of this, our team will present the clustering analysis results as a separate deliverable.
 
 
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Further Developments</font></div></div>==
 
 
Moving forward, to gain a deeper understanding of consumers, our team proposes a more extensive data collection procedure. Singapore Pools has collected basic demographic data on its Platinum users; however, this was not extended to Gold users. With richer demographic data (such as a customer’s salary, occupation, address and family background) we may discover new betting patterns, and new relationships between existing betting behaviour parameters and player demographics. This would offer a better understanding of what responsible gambling should be at an individual level, rather than a general take across the entire population.
 
 
Besides demographic data, there are other transaction data, such as account top-ups, that could also provide greater insight into a player’s betting behaviour. Historical trends in the frequency and amount of a player’s top-ups, coupled with his or her record of past winnings, could reveal patterns of irresponsible gambling.
 
 
[[File:Image201.png|center|frameless|upright=2.5]]
 
 
Many papers and studies have examined the use of rule-based classifiers and pre-learning data clustering, such as association rule clustering and automated genetic clustering, but this process is unlikely to be viable in the near future. Such machine learning approaches are still imperfect, given the vast number of rules that need to be considered and the inability of the machine to learn beyond the training data. Another obvious limitation is that interpretation of the clusters’ profiles would be left to the dashboard user; subjective as that is, the user would require some degree of statistics knowledge to make meaningful inferences.
 
 
Our team would therefore suggest that a workable methodology or guide on how to perform the clustering analysis would be more feasible. The results of the current clustering analysis provide only a one-off understanding of the current market context; the insights cannot be replicated for future reference. The clustering analysis needs to be revised from time to time, updated with new datasets representative of the latest trends and context. A more sustainable solution would be to have a trained analyst conduct the clustering analysis each round.
 
 
Lastly, as mentioned above, the dataset that our team has been working on covers merely three months. We will therefore continue to work with our sponsor to carry out load testing with a larger set of data. Client user testing of the dashboard is also ongoing, and we will continue to provide support to update the dashboard charts and interactive tools upon further feedback.
 

Latest revision as of 23:58, 7 September 2016