AY1516 T2 Sport Betting at Singapore Pools Project Overview Midterm
Project Background
In today’s globalized world, the Internet has transformed the gambling environment into a multifaceted, non-physical, multi-platform environment without boundaries. This presents loopholes for illegal gambling operators to enter the market and draw customers away into an unregulated arena that is prone to gambling addiction issues.
Singapore Pools offers a safer outlet, one where players can bet responsibly, within their means. Attrition rates have been rising over the years, which could mean that Singapore Pools’ customers are seeking other avenues for gambling, such as illegal online gambling sites, which may lead to irresponsible betting. Therefore, within the next few years, our sponsor seeks to undertake a data-driven approach to promote responsible gambling by monitoring players’ betting behaviour and performance, in the hope of highlighting alarming patterns that could indicate irresponsible gambling.
Our sponsor has been collecting user data for the past several years but has yet to put it to good use. Just a year ago, Singapore Pools set up a customer insights division to better understand its customers through the analysis of this user data. Its first step towards a data-driven approach to promoting responsible gambling was to understand the gambling behavioural patterns of its customers.
Project Objectives
The aim of our project is to allow Singapore Pools to better understand the gambling behaviours of its customers by identifying gambling patterns and grouping customers into clusters. Each cluster might have its own specific ways of splitting bets, preference for a league, decision-making process, and ways of choosing bet selections. Such behavioural patterns could possibly be linked back to the demographics of the cluster, allowing us to infer reasons behind its gambling habits and, hopefully, to identify irresponsible gamblers. Based on the characteristics of each cluster, the client’s end objective is to have a customized business action that enhances the gambling experience of the players in that cluster. The scope of our project is limited to the Sports Betting segment of customers who have opened betting accounts with Singapore Pools, and the given dataset covers only transactions from January 2015 to March 2015.
The overall objectives of our project are to:
(1) Profile their existing pool of customers through clustering analysis
(2) Create a data visualization of the consumer betting activity
(3) Build a dashboard to visualize the profiling and data points
Data Cleaning
The original sample data set that we worked on contained over 930,000 unique observations (transactions) in a worksheet known as TransactionList. There were two types of noise in our data set: (1) irrelevant entries – variables or observations that do not pertain to the soccer betting products, rejected bets which have no odds indicated, and ‘Championship Winner’ bet types, which are considered atypical of regular bet patterns; (2) outliers – observations that lie beyond the 99th percentile of selected parameters, with thresholds identified from scatterplots of the distributions in Tableau.
The following observations were removed from the original data set:
The dilemma we faced was that if we removed only the outlier transactions, we would be artificially changing the users’ bet preferences when aggregating the transaction data into each user’s overall bet pattern. The other option was to aggregate all transactions (including extreme observations) into each user’s overall bets, and then to filter users out of the population. Given that the outlier transactions would impede our analysis of the transaction data, we had no choice but to remove them. In effect, we also had to remove the affected users (users who made those extreme transactions) to maintain the integrity of the user data. After data cleaning, we are left with 529,678 unique transactions in TransactionList for further analysis.
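The removal rule described above can be illustrated with a minimal Python sketch. It is not the team's actual procedure: the thresholds in the project were read off scatterplots in Tableau, whereas this sketch computes a nearest-rank 99th percentile. The field names `user_id` and `stake` and the sample data are hypothetical.

```python
import math

def percentile_nearest_rank(values, pct):
    """Return the nearest-rank percentile (pct in 0-100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def drop_users_with_outliers(transactions, field="stake", pct=99):
    """Remove every transaction of any user who has a value above the cut-off."""
    cutoff = percentile_nearest_rank([t[field] for t in transactions], pct)
    # Users who made at least one extreme transaction (assumed rule: value > cut-off).
    affected = {t["user_id"] for t in transactions if t[field] > cutoff}
    kept = [t for t in transactions if t["user_id"] not in affected]
    return kept, affected, cutoff

# Hypothetical sample: 99 ordinary bets split between "u1" and "u2",
# plus one extreme bet and one ordinary bet from "u3".
sample = [{"user_id": "u1" if i % 2 else "u2", "stake": 10 + i} for i in range(99)]
sample.append({"user_id": "u3", "stake": 50_000})
sample.append({"user_id": "u3", "stake": 20})

kept, affected, cutoff = drop_users_with_outliers(sample)
```

In this sample, the ordinary bet from "u3" is dropped together with the extreme one, which mirrors the decision to remove affected users entirely rather than only their extreme transactions.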
Data Transformation
To gather deeper insights in our data exploration phase, our team created new metrics that reflect certain attributes of betting behaviour. This allowed us to test some of the hypotheses about betting patterns that our sponsor highlighted to us. The new metrics would be used to test for differences in betting preference between genders, ages, account types and players of different risk profiles. If a metric did not differ between the segments of players, or did not significantly affect bet placement (as we were to discover during the data exploration phase), it would be removed.
Data classification
Given the large amount of data resulting from the amalgamation of many different users, there was a lot of noise in the data, and initial data exploration showed plenty of insignificant relationships. Therefore, we created categorical metrics to segment transactions and users into smaller subsets, reducing the variation within these groups and allowing more significant observations. For transaction data, the segmentation is based on the bet odds of individual transactions, whereas for user data, the segmentation is based on the total number of betting transactions of the individual player. Classes were determined using the lower and upper quartiles of the distribution of those parameters.
For example, based on the distribution of betting odds across all the transactions, we have identified two proxies (cut-off points) to segregate the odds into three different groups – low, medium or high odds (as shown in the figure above). The two proxies are odds of 3.4 and 7.5, which have been identified through a box plot depicting the distribution of odds for all transactions. Transactions with odds less than or equal to 3.4 are classed as low odds, transactions with odds more than 3.4 and less than or equal to 7.5 are classed as medium odds, and the rest of the transactions are classed as high odds. Furthermore, we have confirmed these classifications with the client to verify that the proxies are appropriate.
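As an illustration only, the classification rule above can be written as a short Python function. The cut-offs 3.4 and 7.5 come from the text; the function and constant names are hypothetical.

```python
LOW_MAX = 3.4     # odds <= 3.4 are classed as low (cut-off from the text)
MEDIUM_MAX = 7.5  # odds > 3.4 and <= 7.5 are classed as medium; anything above is high

def classify_odds(odds):
    """Map the odds of a single transaction to its low/medium/high class."""
    if odds <= LOW_MAX:
        return "low"
    if odds <= MEDIUM_MAX:
        return "medium"
    return "high"

labels = [classify_odds(o) for o in (1.8, 3.4, 3.5, 7.5, 9.0)]
# labels == ["low", "low", "medium", "medium", "high"]
```

Both cut-offs are inclusive on the lower class, so odds of exactly 3.4 are low and odds of exactly 7.5 are medium.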
Creation of metrics (refer to Data Dictionary for full list of data variables)
The next role of the newly created metrics is to provide descriptive statistics for display on the dashboard; this pertains to the user data, where aggregated statistics (e.g. total bets, probability or preference for certain bet days) would make up the overview of a user’s profile on the dashboard.
The original data was transformed through various methods: splitting original data into smaller parts (e.g. date-time into separate date and time fields for individual analysis), finding the difference between time points to determine a duration gap, deriving the probability of each option within a parameter, summing parameters, and finding the average, median and standard deviation of certain parameters, among others. The transformed metrics fall under two categories, one for transactions and the other for users.
The following are some of the interesting metrics that deserve highlighting:
i) Bet Time Before Match – the duration gap between the bet being placed and the match could indicate whether planning is involved when making bets
ii) Bet Time – the time when bets were placed could indicate whether the player is a football fan who catches the game in the wee hours or is simply betting in the day for a later match
iii) Median (Stake/Odds/Time/Returns) – players may sometimes make one-off bets that are extremely large, which would skew the average; hence the median would be more representative of that user parameter if its deviation is high
iv) Standard Deviation (Stake/Odds/Time/Returns) – the SD of each parameter tells us whether the user’s preference is stable in that parameter (i.e. whether the player is an erratic bettor)
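A small Python sketch can make metrics (i), (iii) and (iv) concrete. The stakes and timestamps below are made up for illustration, and the population standard deviation (`pstdev`) is an assumption, since the source does not say whether the population or sample formula was used.

```python
from datetime import datetime
from statistics import mean, median, pstdev

def hours_before_match(bet_time, match_time):
    """Duration gap between placing the bet and the match start, in hours."""
    return (match_time - bet_time).total_seconds() / 3600

# Hypothetical stakes for one user: four routine bets and one large one-off bet.
stakes = [10, 10, 20, 20, 1000]
summary = {
    "mean": mean(stakes),      # pulled upwards by the one-off bet
    "median": median(stakes),  # stays at the typical stake
    "sd": pstdev(stakes),      # large value signals erratic stake sizes
}

# Hypothetical bet placed at 20:00 for a match starting at 22:30 on the same day.
gap = hours_before_match(datetime(2015, 1, 3, 20, 0), datetime(2015, 1, 3, 22, 30))
```

Here the mean stake is 212 while the median is 20, which shows why the median is the more representative figure when a single large bet is present.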
Data Consolidation
After data cleaning and data transformation, we consolidated all the transactions in TransactionList to form a list of unique users in another worksheet known as UserList. Using the transaction data, we summed up variables such as profit/loss, returns, transaction count, and the number of bets for each league or market type. In addition, for each user, we calculated the individual mean, median and standard deviation of variables such as odds, stake amounts and returns. At the end of data consolidation, we have 5,562 unique user accounts in UserList for further analysis.
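The consolidation step can be sketched as a group-by over user accounts. This is a simplified illustration rather than the worksheet logic actually used: the field names `user_id`, `stake` and `returns` are hypothetical, profit/loss is assumed to be returns minus stake, and only stake statistics are shown.

```python
from collections import defaultdict
from statistics import mean, median, pstdev

def build_user_list(transactions):
    """Aggregate transaction rows into one summary record per user."""
    by_user = defaultdict(list)
    for t in transactions:
        by_user[t["user_id"]].append(t)

    users = {}
    for user_id, rows in by_user.items():
        stakes = [r["stake"] for r in rows]
        users[user_id] = {
            "transaction_count": len(rows),
            "total_returns": sum(r["returns"] for r in rows),
            # Assumption: profit/loss of a bet is its returns minus its stake.
            "profit_loss": sum(r["returns"] - r["stake"] for r in rows),
            "mean_stake": mean(stakes),
            "median_stake": median(stakes),
            "sd_stake": pstdev(stakes),
        }
    return users

# Hypothetical transactions for two users.
transaction_list = [
    {"user_id": "u1", "stake": 10, "returns": 0},
    {"user_id": "u1", "stake": 30, "returns": 60},
    {"user_id": "u2", "stake": 5, "returns": 0},
]
user_list = build_user_list(transaction_list)
```

The same pattern extends to odds and returns, or to counts per league and market type, by adding further keys to each user record.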
Transaction Findings
The following points are initial findings about transactions, derived from TransactionList during the exploratory data analysis phase. We mainly use median values as the basis of comparison to reduce the influence of extreme values.
Time (over the entire period)
The line graph above shows the number of daily transactions across the period from January 2015 to March 2015. The graph shows a cyclical pattern, with the transaction count peaking over the weekends and falling to its lowest during the weekdays.
Time (over a week)
The bar graph above shows the total number of transactions for each day of the week. It can be clearly seen that weekends account for at least 50% of the betting transactions, due to the large number of matches played all over the world during the weekend.
Time (over a day)
The line graph above shows the total number of transactions for each hour in a day. Betting transactions peak at certain times, with the highest number between 2200hrs and 2259hrs, which is probably due to the large number of matches following that hour over the weekends. The period between 0600hrs and 0759hrs has the fewest transactions, as few matches are played during those hours. Using these peaks and troughs, we have drawn up five different segments across the 24 hours in a day as shown below.
Using these time segments, we can derive the probability that a user’s bet falls within each segment. This probability metric can be used in the later stage of the project for our clustering analysis.
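The per-segment probability can be sketched in Python as follows. The five segment boundaries and labels below are placeholders chosen purely for illustration, since the actual segments appear only in a figure that is not reproduced here; the probability is taken to be the share of a user's bets placed in each segment.

```python
from collections import Counter

# Hypothetical segments: (label, start_hour, end_hour_exclusive).
# These boundaries are NOT the ones used in the project.
SEGMENTS = [
    ("early_morning", 0, 6),
    ("morning", 6, 8),
    ("daytime", 8, 18),
    ("evening", 18, 22),
    ("late_night", 22, 24),
]

def segment_of(hour):
    """Return the label of the segment containing the given hour (0-23)."""
    for label, start, end in SEGMENTS:
        if start <= hour < end:
            return label
    raise ValueError(f"hour out of range: {hour}")

def segment_probabilities(bet_hours):
    """Share of a user's bets in each segment; bet_hours must be non-empty."""
    counts = Counter(segment_of(h) for h in bet_hours)
    total = len(bet_hours)
    return {label: counts[label] / total for label, _, _ in SEGMENTS}

# Hypothetical user who placed five bets at these hours of the day.
probs = segment_probabilities([22, 22, 23, 19, 9])
```

Each user then contributes one vector of five probabilities summing to 1, which is a convenient input shape for clustering.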
Low-medium-high odds (L-M-H)
Based on the segregation of transaction odds as mentioned earlier on, there are about 103,000 (20%) low-odds transactions, about 304,000 (57%) medium-odds transactions and 122,000 (23%) high-odds transactions.
Stake amount across L-M-H odds groups

Group | A (L) | B (M) | C (H)
---|---|---|---
Maximum | 10,000 | 10,000 | 6,000
Upper Quartile | 1,677 | 644.5 | 400.25
Median | 796 | 307 | 191
Lower Quartile | 372 | 144.5 | 89.75
Minimum | 5 | 5 |