Difference between revisions of "AY1516 T2 Sport Betting at Singapore Pools Project Overview Midterm"

From Analytics Practicum
<br>
 
<font face="Century Gothic">
 
{| style="background-color:#FFFFFF; color:#007BBD padding: 5px 0 0 0;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0" |
 
| style="padding:0.3em; font-family:Helvetica; font-size:120%; border-bottom:2px solid #007BBD; border-top:2px solid #007BBD; border-left:2px solid #007BBD; background:#007BBD; text-align:center;" width="20%" |
 
[[AY1516_T2_Sport_Betting_at_Singapore_Pools|<font face ="Century Gothic" color="#FFFFFF"><strong>THE SPONSOR</strong></font>]]
 
| style="border-bottom:2px solid #007BBD; border-top:2px solid #007BBD; background:#007BBD;" width="1%" | &nbsp;
 
| style="padding:0.3em; font-family:Helvetica; font-size:120%; border-bottom:2px #007BBD; border-top:2px solid #007BBD; background:#007BBD; text-align:center;" width="20%" | 
 
[[AY1516_T2_Sport_Betting_at_Singapore_Pools_Team|<font  face ="Century Gothic" color="#FFFFFF"><strong> THE TEAM </strong></font>]]
 
| style="border-bottom:2px solid #007BBD; border-top:2px solid #007BBD; background:#007BBD;" width="1%" | &nbsp;
 
| style="padding:0.3em; font-family:Helvetica; font-size:120%; border-bottom:2px solid #007BBD; border-top:2px solid #007BBD; background:#FFFFFF; text-align:center;" width="20%" | 
 
[[AY1516_T2_Sport_Betting_at_Singapore_Pools_Project_Overview_Proposal|<font face ="Century Gothic" color="#007BBD"><strong> THE OVERVIEW</strong></font>]]
 
| style="border-bottom:2px solid #007BBD; border-top:2px solid #007BBD; background:#007BBD;" width="1%" | &nbsp;
 
| style="padding:0.3em; font-family:Helvetica; font-size:120%; border-bottom:2px #007BBD; border-top:2px solid #007BBD; background:#007BBD; text-align:center;" width="20%" |
 
[[AY1516_T2_Sport_Betting_at_Singapore_Pools_Project_Management|<font  face ="Century Gothic" color="#FFFFFF"><strong> THE MANAGEMENT </strong></font>]]
 
| style="border-bottom:2px solid #007BBD; border-top:2px solid #007BBD; background:#007BBD;" width="1%" | &nbsp;
 
| style="padding:0.3em; font-family:Helvetica; font-size:120%; border-bottom:2px solid #007BBD; border-top:2px solid #007BBD; background:#007BBD; text-align:center;" width="20%" |
 
[[AY1516_T2_Sport_Betting_at_Singapore_Pools_Documentation|<font  face ="Century Gothic" color="#FFFFFF"><strong> THE DOCUMENTS </strong></font>]]
 
| style="border-bottom:2px solid #007BBD; border-top:2px solid #007BBD; background:#007BBD;" width="1%" | &nbsp;
 
| style="padding:0.3em; font-family:Helvetica; font-size:120%; border-bottom:2px solid #007BBD; border-top:2px solid #007BBD; background:#007BBD; text-align:center;" width="20%" |
 
|}
 
<br>
 
  
<!--Sub-Navigation-->
 
{| style="background-color:white; color:white ; border: 0px solid #007BBD; margin-left: auto; margin-right: auto;" width="800px" height=50px cellspacing="0" cellpadding="0" valign="top"  |
 
 
| style="padding:0 .3em;  solid #000000; padding: 10px; text-align:center; border: 1px solid grey;" width="33%" | [[AY1516_T2_Sport_Betting_at_Singapore_Pools_Project_Overview_Proposal| <font face = "Arial" color="#101010"><b>Proposal</b></font>]]
 
 
| style="padding:0 .3em;  solid #00000;  padding: 10px; text-align:center; background-color: grey; border: 1px solid grey; " width="33%" | [[AY1516_T2_Sport_Betting_at_Singapore_Pools_Project_Overview_Midterm| <font face = "Arial" color="white"><b>Midterm </b></font>]]
 
 
| style="padding:0 .3em;  solid #000000; padding: 10px; text-align:center; border: 1px solid grey;" width="33%" | [[AY1516_T2_Sport_Betting_at_Singapore_Pools_Project_Overview_Final| <font face = "Arial" color="#101010"><b>Final</b></font>]]
 
 
|
 
 
|}
 
 
<!--END OF Sub-Navigation-->
 
 
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Project Background</font></div></div>==
 
In today’s globalized world, the Internet has transformed the gambling environment into a multifaceted, non-physical, multi-platform environment without boundaries. This creates loopholes through which illegal gambling operators can enter the market and draw our sponsor’s customers away into an unregulated arena where gambling addiction can readily develop.
 
 
Singapore Pools offers a safer outlet, one where players can bet responsibly and within their means. Attrition rates have been rising over the years, which could mean that Singapore Pools’ customers are seeking other avenues for gambling, such as illegal online gambling sites, which may lead to irresponsible betting. Therefore, within the next few years, our sponsor seeks to take a data-driven approach to promoting responsible gambling by monitoring players’ betting behaviour and performance, in the hope of highlighting alarming patterns that could indicate irresponsible gambling.
 
 
Our sponsor has been collecting user data for the past several years, but has yet to put it to good use. Just a year ago, Singapore Pools set up a customer insights division to better understand its customers through the analysis of this user data. Its first step towards a data-driven approach to promoting responsible gambling is to understand the gambling behavioural patterns of its customers.
 
 
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Project Objectives</font></div></div>==
 
 
The aim of our project is to help Singapore Pools better understand the gambling behaviours of their customers by identifying gambling patterns among clusters of customers. Each cluster might have its own way of splitting bets, its own preferred league, a different decision-making process and a different way of choosing bet selections. Such behavioural patterns could be linked back to the demographics of the cluster, allowing us to infer reasons behind its gambling habits and, hopefully, to identify irresponsible gamblers. Based on the characteristics of each cluster, the client’s end objective is to devise a customized business action that enhances the gambling experience of the players in that cluster. The scope of our project is limited to the Sports Betting segment of customers who have opened betting accounts with Singapore Pools, and the given dataset covers only transactions made between January 2015 and March 2015.
 
 
<center>
 
The overall objectives of our project are to:
 
 
<b>(1) Profile their existing pool of customers through clustering analysis
 
 
(2) Create a data visualization of the consumer betting activity
 
 
(3) Build a dashboard to visualize the profiling and data points
 
</b>
 
</center>
 
 
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Data Transformation</font></div></div>==
 
 
To gather deeper insights in our data exploration phase, our team created new metrics that reflect certain attributes of betting behaviour. This allowed us to test some of the hypotheses about betting patterns that our sponsor highlighted to us. The new metrics are used to test for differences in betting preference across gender, age, account type and risk profile. Metrics that turn out, during the data exploration phase, not to differ between player segments or not to significantly affect bet placement are removed.
 
 
=== Data classification ===
 
 
Because the dataset amalgamates the transactions of many different users, there was a lot of noise in the data, and initial data exploration showed plenty of insignificant relationships. We therefore created categorical metrics to segment transactions and users into smaller subsets, reducing the variation within each group and allowing more significant observations. For transaction data, the segmentation is based on the bet odds of individual transactions, whereas for user data, the segmentation is based on the total number of betting transactions of the individual player. Classes were determined using the lower and upper quartiles of the distribution of those parameters.
 
 
=== Creation of metrics (refer to Data Dictionary for full list of data variables) ===
 
 
A second role of the newly created metrics is to provide descriptive statistics for display on the dashboard. This pertains to the user data, where aggregated statistics (e.g. total bets, or the probability of or preference for certain bet days) make up the overview of a user’s profile on the dashboard.
 
 
The original data was transformed through various methods: splitting original fields into smaller parts (e.g. date-time into separate date and time fields for individual analysis), finding the difference between time points to determine the duration gap, deriving the probability of each option within a parameter, summing parameters, finding the average, median and standard deviation of certain parameters, and many others. The transformed metrics fall under two categories, one for transactions and the other for users.
 
 
The following are some of the more interesting metrics:
 
 
i) Bet Time Before Match – the duration gap between the time a bet is placed and the match could indicate whether planning is involved when making bets
 
 
ii) Bet Time – the time when bets were placed could indicate whether the player is a football fan who catches the game in the wee hours or is simply betting during the day for a later match
 
 
iii) Median (Stake/Odds/Time/Returns) – players may sometimes make one-off bets that are extremely large, which would skew the average; hence the median would be more representative of that user parameter when its deviation is high
 
 
iv) Standard Deviation (Stake/Odds/Time/Returns) – the SD of each parameter tells us whether the user’s preference in that parameter is stable (i.e. whether the player is an erratic bettor)
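
The metrics above could be derived along the following lines. This is only a minimal sketch on made-up data: the columns BET_TIME and MATCH_TIME and the derived names HOURS_BEFORE_MATCH and BET_HOUR are assumptions for illustration and may not match the names in the Data Dictionary.

```r
# Hypothetical toy transactions; BET_TIME and MATCH_TIME are assumed column names
toy <- data.frame(
  ACCOUNT_DUMMY = c("A", "A", "A", "B", "B", "B"),
  BET_TIME   = as.POSIXct(c("2015-01-10 09:00:00", "2015-01-17 20:30:00",
                            "2015-01-24 03:15:00", "2015-01-10 18:00:00",
                            "2015-01-11 18:00:00", "2015-01-12 18:00:00"),
                          tz = "UTC"),
  MATCH_TIME = as.POSIXct(c("2015-01-10 23:00:00", "2015-01-17 23:00:00",
                            "2015-01-24 03:45:00", "2015-01-10 23:00:00",
                            "2015-01-11 23:00:00", "2015-01-12 23:00:00"),
                          tz = "UTC"),
  STAKEAMOUNT = c(10, 10, 500, 20, 25, 30),
  stringsAsFactors = FALSE
)

# i) Bet Time Before Match: gap between bet placement and match start, in hours
toy$HOURS_BEFORE_MATCH <- as.numeric(difftime(toy$MATCH_TIME, toy$BET_TIME,
                                              units = "hours"))

# ii) Bet Time: hour of the day at which the bet was placed
toy$BET_HOUR <- as.integer(format(toy$BET_TIME, "%H"))

# iii) and iv) per-account mean, median and SD of the stake amount
stakeByAccount <- split(toy$STAKEAMOUNT, toy$ACCOUNT_DUMMY)
stakeStats <- data.frame(
  ACCOUNT_DUMMY = names(stakeByAccount),
  MEAN   = sapply(stakeByAccount, mean),
  MEDIAN = sapply(stakeByAccount, median),
  SD     = sapply(stakeByAccount, sd),
  row.names = NULL
)
```

In this toy data, account A places one very large bet, so its mean stake is pulled far above its median while its SD is large; account B, with steady stakes, has a mean equal to its median and a small SD.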
 
 
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Transaction Findings</font></div></div>==
 
 
The following points are initial findings about transactions, derived from the TransactionList during the exploratory data analysis phase. We mainly use median values as the basis of comparison to reduce the influence of extreme values.
 
 
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">User Findings</font></div></div>==
 
 
The following points are initial findings about users, derived from the UserList during the exploratory data analysis phase. We mainly use median values as the basis of comparison to reduce the influence of extreme values.
 
 
=== Age group (only for Platinum customers due to limited information) ===
 
 
=== Based on transaction count ===
 
 
Besides segmenting the customers by their demographic information, we also divided them into three groups (low, medium and high frequency) based on their frequency of betting transactions. Before deciding on the boundaries of the division, we looked at the distribution of the users’ transaction counts, which is shown in the table below.
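
The three-way split by transaction count could be sketched as follows. This is an illustration on made-up account IDs, and it assumes the boundaries are the lower and upper quartiles mentioned under Data classification; TXN_COUNT and FREQ_TYPE are hypothetical column names.

```r
# Hypothetical per-transaction account ids: one entry per betting transaction
accountIds <- rep(c("A", "B", "C", "D", "E", "F", "G", "H"),
                  times = c(1, 2, 3, 4, 5, 6, 7, 40))

# Count the transactions of each account
txnCount <- as.data.frame(table(ACCOUNT_DUMMY = accountIds),
                          responseName = "TXN_COUNT",
                          stringsAsFactors = FALSE)

# Look at the distribution before fixing the boundaries
summary(txnCount$TXN_COUNT)

# Assumed boundaries: lower and upper quartiles of the count distribution
lowCut  <- quantile(txnCount$TXN_COUNT, 0.25, names = FALSE)
highCut <- quantile(txnCount$TXN_COUNT, 0.75, names = FALSE)

# LF below the lower quartile, HF above the upper quartile, MF in between
txnCount$FREQ_TYPE <- ifelse(txnCount$TXN_COUNT < lowCut, "LF",
                      ifelse(txnCount$TXN_COUNT > highCut, "HF", "MF"))
```

Using quartiles keeps the boundaries data-driven, so they adapt when a new dataset with a different level of activity is loaded.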
 
 
==<div style="background: #007BBD; line-height: 0.3em; border-left: #007BBD solid 13px;"><div style="border-left: #45E98F solid 5px; padding:15px;"><font face ="Century Gothic" color= "white" size="5">Going Forward</font></div></div>==
 
 
=== Cluster Analysis ===
 
 
Based on our findings from the exploratory data analysis, we now understand the relationships between the variables and how these variables differ across different groupings (e.g. male and female, gold and platinum, age groups). As the next step, before we start on our clustering analysis, we will normalize certain data points so that extreme differences in scale between them do not dominate the results.
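
The normalization step could look something like the sketch below. The choice of method is not fixed at this stage, so the z-score standardization (and the optional log transform) shown here is an assumption, and the values are made up. The clustering itself is to be run in SAS Enterprise Guide; the R code only illustrates the idea.

```r
# Hypothetical per-user metrics on very different scales
userMetrics <- data.frame(
  MEDSTAKEAMOUNT = c(5, 10, 20, 500),
  MEDODD         = c(1.5, 2.0, 2.5, 3.0),
  TXN_COUNT      = c(3, 12, 40, 7)
)

# Optional: log transform to dampen the heavy right tail of the stake amounts
userMetrics$MEDSTAKEAMOUNT <- log1p(userMetrics$MEDSTAKEAMOUNT)

# z-score standardization: every column ends up with mean 0 and SD 1,
# so no single variable dominates a distance-based clustering
normalised <- as.data.frame(scale(userMetrics))
```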
 
 
For the clustering analysis itself, we will use SAS Enterprise Guide and run a few iterations of the clustering process before finalizing the clusters. We will then review the clusters with the client to assess their commercial usefulness. However, our group noted that the lack of demographic information (such as income range, occupation and education level) for each user account might limit our analysis.
 
 
=== Automated Data Cleaning ===
 
 
Since the given dataset is fixed and covers only the period from January to March 2015, using “manual” analytical tools to generate the analytical outcomes for that dataset and visualizing them statically with D3.js would not benefit the client going forward, as they would be unable to reuse the work for other time periods. We therefore decided to make the solution flexible, so that the client can upload their own dataset (assuming it is in the same format) and the final dashboard will still reflect the results for the new dataset. This requires a tool, library or programming language that lets us perform “automated data cleaning”, so that the final version of our dashboard can generate the desired outcomes without us having to perform any analysis manually.
 
 
Languages such as Python or MATLAB could do this job, but we decided to use R because it is open source, has strong community support with an enormous number of libraries and packages, and some of our team members already have experience with it.
 
 
In our R code, we follow exactly the same steps as described in our Data Cleaning and Data Transformation segments to generate our analytical data cubes automatically.
 
 
First, we use R’s built-in read.csv function to read the raw dataset into the R environment:
 
 
transactions = read.csv("/DIRECTORY_OF_THE_FILE/FY14 SPORTS TRANSACTIONS TSOPENED_01012015_31032015.csv")
 
 
Secondly, we generate new variables that hold the “filtered” datasets after removing irrelevant records, using the same criteria as stated in the Data Cleaning part.


filteredTransactions <- transactions[!is.na(transactions$ODDS),]
 
 
Similarly, we also remove the outliers, here taken as values at or above the 99th percentile:


nonNullOdds <- na.exclude(filteredTransactions$ODDS)

outlierOddsThreshold <- quantile(nonNullOdds, 0.99)

nonNullStakeAmount <- na.exclude(filteredTransactions$STAKEAMOUNT)

outlierStakeAmountThreshold <- quantile(nonNullStakeAmount, 0.99)

filteredTransactions <- subset(filteredTransactions, filteredTransactions$ODDS < outlierOddsThreshold)

filteredTransactions <- subset(filteredTransactions, filteredTransactions$STAKEAMOUNT < outlierStakeAmountThreshold)
 
 
For the data transformation part, we first need to determine the thresholds for high-, medium- and low-frequency (HF, MF, LF) customers as well as for low, medium and high odds. The odds classes are assigned from the lower and upper quartiles of the odds:


lowOddQuantile <- quantile(transactions$ODDS, 0.25, na.rm = TRUE)

highOddQuantile <- quantile(transactions$ODDS, 0.75, na.rm = TRUE)

transactions$RISK_TYPE[transactions$ODDS < lowOddQuantile] <- 'L'

transactions$RISK_TYPE[transactions$ODDS >= lowOddQuantile & transactions$ODDS <= highOddQuantile] <- 'M'

transactions$RISK_TYPE[transactions$ODDS > highOddQuantile] <- 'H'
 
 
Finally, the data is aggregated from per-transaction to per-user level using the R functions “aggregate” and “merge”. First, for each user metric, we create a new data frame with two columns: the dummy account ID and the aggregated value (median, mean, SD or probability). For example:
 
 
MEDSTAKEAMOUNT <- aggregate(x = transaction.clean$STAKEAMOUNT, by = list(transaction.clean$ACCOUNT_DUMMY), median)
 
 
This will show us the median stake amount of each account.
 
 
The per-metric data frames are then combined into a single data frame with the merge function, one pair at a time, for example:
 
 
aggregatedData <- merge(MEDSTAKEAMOUNT,MEDODD,by="Group.1")
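
When there are many per-user metrics, the pairwise merge can be generalized. The sketch below is one possible way to do it, on made-up data: the helper aggregateBy and the column names MEDSTAKEAMOUNT, MEDODD and SDSTAKEAMOUNT are illustrative assumptions, and the grouping column is named ACCOUNT_DUMMY here instead of the default Group.1.

```r
# Hypothetical cleaned transactions, reusing the column names from the snippets above
transaction.clean <- data.frame(
  ACCOUNT_DUMMY = c("A", "A", "A", "B", "B"),
  STAKEAMOUNT   = c(10, 20, 300, 5, 15),
  ODDS          = c(1.5, 2.5, 2.0, 3.0, 4.0),
  stringsAsFactors = FALSE
)

# Illustrative helper: aggregate one column per account and name the result column
aggregateBy <- function(column, fun, name) {
  out <- aggregate(x = transaction.clean[[column]],
                   by = list(ACCOUNT_DUMMY = transaction.clean$ACCOUNT_DUMMY),
                   FUN = fun)
  names(out)[2] <- name
  out
}

# One two-column data frame per user metric
metrics <- list(
  aggregateBy("STAKEAMOUNT", median, "MEDSTAKEAMOUNT"),
  aggregateBy("ODDS",        median, "MEDODD"),
  aggregateBy("STAKEAMOUNT", sd,     "SDSTAKEAMOUNT")
)

# Fold the list into a single per-user data frame, merging on the account id
aggregatedData <- Reduce(function(a, b) merge(a, b, by = "ACCOUNT_DUMMY"), metrics)
```

Naming each value column before merging avoids the x.x and x.y suffixes that merge would otherwise generate for the default column name x.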
 
 
=== Dashboard ===
 
 
Since we use R both for our automated data cleaning process and to generate the visualization data displayed by D3.js, we need a mechanism to “integrate” R into the workflow of a web application. “Full stack” solutions such as the Shiny framework, which allow the whole web application (both frontend and backend) to be written in R, do not meet our needs because of performance: processing our given dataset of nearly 1 million records would take very long, let alone the possibly bigger datasets that Singapore Pools may import in the future.
 
To resolve that issue, we decided to keep our frontend purely HTML/CSS/JavaScript and to build only the backend in R, using rApache. The backend will expose APIs (whose parameters are essentially the user inputs) that the frontend consumes. We will shape the API responses to be “friendly” to D3.js so that it can visualize them directly without much further data transformation.
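
As a rough idea of what a D3.js-friendly response could look like, the sketch below turns a small summary table into a JSON array of row objects, a layout that D3.js can bind to directly. The helper toJsonRows and the sample data are hypothetical, it uses base R only, and it does not escape special characters; in practice a JSON package such as jsonlite would likely be used. How the string is returned through rApache is not shown.

```r
# Illustrative helper: serialize a data frame as a JSON array of objects.
# Handles only numeric and plain character columns, with no escaping.
toJsonRows <- function(df) {
  rows <- vapply(seq_len(nrow(df)), function(i) {
    fields <- vapply(names(df), function(col) {
      value <- df[[col]][i]
      if (is.numeric(value)) {
        sprintf('"%s":%s', col, format(value, digits = 15))
      } else {
        sprintf('"%s":"%s"', col, as.character(value))
      }
    }, character(1))
    paste0("{", paste(fields, collapse = ","), "}")
  }, character(1))
  paste0("[", paste(rows, collapse = ","), "]")
}

# Hypothetical API payload: median stake amount per odds class
summaryByRisk <- data.frame(
  RISK_TYPE      = c("L", "M", "H"),
  MEDSTAKEAMOUNT = c(10, 20, 5),
  stringsAsFactors = FALSE
)

payload <- toJsonRows(summaryByRisk)
```

On the frontend, each object in the array would map to one data point, so no reshaping is needed before binding the data to chart elements.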
 
With rApache, it is also possible to host our web application on a standard Ubuntu server or cloud instance, which offers deployment flexibility that will be helpful if our client needs more processing power to handle bigger datasets in the future.
 
 
=== Wireframe ===
 
 
Our dashboard will consist of 3 views:
 
 
1) The Primary Transaction Overview
 
 
2) The User Segment (Cluster) Selection View
 
 
3) The Specific User Profile Overview
 
 
The administrative user will first need to log in to the dashboard. Upon doing so, they will land on the “Transaction Overview” page, which allows users to preview the entire set of transaction data. There will be a trend line graph to display certain parameters over a selected period of months, which users can choose from a drop-down list. There will also be a summary report of the current month’s statistics at the end of the page (parameters will be determined by the client at a later stage).
 
 
Some of the parameters to be displayed include the following (to be revised with the client):
 
 
- Total revenue
 
 
- Winning ratio of SG Pools
 
 
- Number of transactions for the last month
 
 
- New member sign-ups for the last month
 
 
- Total profits
 
 
- Total pay outs
 
 
- Number of unique players
 
 
- Top 10 players with the highest stake amount for the month
 
 
- Top selling product for the month
 
 
- Top selling event type for the month
 
 
- Most popular league for the month
 
 
The side bar will aid navigation through the dashboard, letting users easily switch between “Transaction Overview” and “User Profiles”. Secondary functions, such as a calendar feature for jotting notes, might be added later during prototype development.
 
<br/>
 
[[File: Capture26.jpg|center|frameless|upright=4.1]]
 
Upon clicking on the “User Profiles” tab on the side bar, users will land on the “Segment Selection” page, where they can narrow down the segment of players (based on cluster, gender or account type) that they want to view. They can further narrow in on a particular player by clicking on the player’s ID in the list of players from the segment group. Alternatively, they can use the search bar to find a particular player ID.
 
<br/>
 
[[File: Capture27.jpg|center|frameless|upright=4.1]]
 
As seen in the screenshot below, on the “User Profile” overview page, users can view several popular parameters about a player’s purchases, bet pattern type, amount of returns, favourite bet preferences, etc. As in “Transaction Overview”, there will be a trend line graph to display certain parameters over a selected period of months, which users can choose from a drop-down list. The difference here is a secondary trend line that overlays the overall population parameters on those of the selected user, allowing the individual’s bet patterns to be compared with those of other players. The final set of parameters to be displayed will be confirmed with the sponsor.
 
<br/>
 
[[File: Capture28.jpg|center|frameless|upright=4.1]]
 
There will also be a summary report of the user at the bottom of the page, from which the statistics can easily be exported by copying and pasting.
 

Latest revision as of 00:15, 8 September 2016