Difference between revisions of "Red Dot Payment Data Source"

 
(3 intermediate revisions by the same user not shown)
Line 33: Line 33:
 
{| style="background-color:#ffffff; margin: 3px auto 0 auto" width="55%"
 
{| style="background-color:#ffffff; margin: 3px auto 0 auto" width="55%"
 
|-  
 
|-  
! style="font-size:15px; text-align: center; border-top:solid #ffffff; border-bottom:solid #2e2e2e" width="150px"| [[Red Dot Payment_-_Project Overview| <span style="color:#3d3d3d">Background</span>]]
+
! style="font-size:15px; text-align: center; border-top:solid #ffffff; border-bottom:solid #2e2e2e" width="150px"| [[Project Overview| <span style="color:#3d3d3d">Background</span>]]
 
! style="font-size:15px; text-align: center; border-top:solid #ffffff; border-bottom:solid #ffffff" width="20px"|
 
! style="font-size:15px; text-align: center; border-top:solid #ffffff; border-bottom:solid #ffffff" width="20px"|
  
! style="font-size:15px; text-align: center; border-top:solid #ffffff; border-bottom:solid #ffffff" width="150px"| [[Red Dot Payment_Data Source| <span style="color:#3d3d3d">Data Source</span>]]
+
! style="font-size:15px; text-align: center; border-top:solid #ffffff; border-bottom:solid #ffffff" width="150px"| [[Data Source| <span style="color:#3d3d3d">Data Source</span>]]
 
! style="font-size:15px; text-align: center; border-top:solid #ffffff; border-bottom:solid #ffffff" width="20px"|
 
! style="font-size:15px; text-align: center; border-top:solid #ffffff; border-bottom:solid #ffffff" width="20px"|
  
! style="font-size:15px; text-align: center; border-top:solid #ffffff; border-bottom:solid #ffffff" width="150px"| [[Red Dot Payment_Methodology| <span style="color:#3d3d3d">Methodology</span>]]
+
! style="font-size:15px; text-align: center; border-top:solid #ffffff; border-bottom:solid #ffffff" width="150px"| [[Methodology| <span style="color:#3d3d3d">Methodology</span>]]
 
! style="font-size:15px; text-align: center; border-top:solid #ffffff; border-bottom:solid #ffffff" width="20px"|
 
! style="font-size:15px; text-align: center; border-top:solid #ffffff; border-bottom:solid #ffffff" width="20px"|
 
|}
 
|}
Line 45: Line 45:
 
==<div style="background: #DD597D; line-height: 0.3em; font-family:calibri;  border-left: #CFCFCF solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF"><strong>Sample Data</strong></font></div></div>==
 
==<div style="background: #DD597D; line-height: 0.3em; font-family:calibri;  border-left: #CFCFCF solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF"><strong>Sample Data</strong></font></div></div>==
 
'''''*Due to sensitivity of data, sample data will not be shown here. Please refer to the submitted reports for more details.'''''
 
'''''*Due to sensitivity of data, sample data will not be shown here. Please refer to the submitted reports for more details.'''''
<br>
 
==<div style="background: #DD597D; line-height: 0.3em; font-family:calibri;  border-left: #CFCFCF solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF"><strong>Metadata</strong></font></div></div>==
 
Previously, we only have the transaction data of 2017 provided by RDP.
 
Moving on, we managed to obtain transaction data of 2016 as well.
 
 
<br><br>
 
<br><br>
Below are the definitions of common terms used in their business model:
+
In our study, we incorporated 2 years of data, from 2016 to 2017. This data consists of the details of all online transactions processed by our sponsor from January 2016 to December 2017, such as:<br>
<br><br>
+
• Date and time<br>
'''''MERCHANT''''' - A merchant is a business, often a retailer, that operates online. Each merchant has appointed RDP to be its payment processor to handle online transactions between them and their customers.
+
• Monetary value<br>
<br><br>
+
• Whether transaction is approved/rejected <br>
''<b>CUSTOMER</b>'' - A customer is an entity that makes a transaction with the merchant. For a transaction to be made, a customer must make contact with a merchant and provide their payment details (e.g. Credit card number) to the merchant through RDP’s payment processing gateway.
+
• Reason for rejected transaction (i.e. reason code description)<br>
<br><br>
+
• Currency type<br>
The following table shows the list of variables in the data and their associated description:
+
 
<br><br>
 
{| class="wikitable"
 
|-
 
! NAME !! TYPE !! DESCRIPTION
 
|-
 
| date_created || Date Time || Date and time at which the transaction was carried out between merchant and customer
 
|-
 
| period || Integer || Month of the year the transaction was carried out between merchant and customer
 
|-
 
| hour || Integer || Hour in a day the transaction was carried out between merchant and customer
 
|-
 
| day || Integer || Day of the week the transaction was carried out between merchant and customer
 
|-
 
| time || Time || Time at which the transaction was carried out between merchant and customer
 
|-
 
| name || Alphanumeric || Name of merchant
 
|-
 
| amount of money || Decimal || Total transaction amount - unstandardised currency
 
|-
 
| currency || CHAR(3) || Currency of the transaction amount - usually in SGD or USD
 
|-
 
| converted value || Decimal || Total transaction amount - standardised in SGD
 
|-
 
| reason_code || Alphanumeric || A unique code tagged to each reason for each approved or rejected transaction made by customer
 
|-
 
| reason_code_description || Alphanumeric || Description of the reason for each approved or rejected transaction made by customer
 
|-
 
| card_data1 || Integer || First 6 digits of customer’s card number - reveals details about the issuing bank and card type
 
|-
 
| card_data2 || Integer || Last 4 digits of customer’s card number
 
|-
 
| contact_ip_address || Alphanumeric||  IP address of the customer
 
|-
 
| contact_ip_country || Alphanumeric || IP country of the customer
 
|-
 
| Bin [based on log10] || Integer || Bin number each merchant belongs to
 
|}
 
<br>
 
 
==<div style="background: #DD597D; line-height: 0.3em; font-family:calibri;  border-left: #CFCFCF solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF"><strong>Data Cleaning</strong></font></div></div>==
 
==<div style="background: #DD597D; line-height: 0.3em; font-family:calibri;  border-left: #CFCFCF solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF"><strong>Data Cleaning</strong></font></div></div>==
 
For data cleaning, we carried out the following steps:<br><br>
 
For data cleaning, we carried out the following steps:<br><br>
 
'''<big>General Cleaning</big>'''
 
'''<big>General Cleaning</big>'''
<br><br>
+
<br>
 
1. Compiled 2016-2017 data <br>
 
1. Compiled 2016-2017 data <br>
 
2. Standardised all merchants’ names and currency codes to uppercase, because JMP Pro treats uppercase and lowercase characters as distinct values<br>
 
2. Standardised all merchants’ names and currency codes to uppercase, because JMP Pro treats uppercase and lowercase characters as distinct values<br>
Line 112: Line 70:
 
   - Removal of 1563 rows with ambiguous or illegible characters;  
 
   - Removal of 1563 rows with ambiguous or illegible characters;  
 
   - Removal of 2 rows with negative transaction value
 
   - Removal of 2 rows with negative transaction value
 +
10. Added Merchant Category Codes (MCC), which were provided at a later stage, to identify the industry sector of each merchant
 
<br><br>
 
<br><br>
 
'''<big>Removal of Outliers</big>'''
 
'''<big>Removal of Outliers</big>'''
 +
<br>
 +
The raw data provided by our sponsor required extensive cleaning. This included standardising all monetary values to Singapore dollars (SGD), standardising merchant names, and removing deactivated merchants. Furthermore, several confounding factors could affect our analysis, such as test transactions carried out by merchants. To reduce the risk of inaccuracy, we excluded outliers from our data: transactions whose monetary value fell in the top or bottom 0.5%, and transactions below $1. As seen in Figures 1 and 2, this resulted in a less skewed distribution.
 
<br><br>
 
<br><br>
There were transactions with transaction value as high as 5.41212e+15. Such transaction, as shown in Figure 1, represents top 0.5% of all transactions, and has a very large difference in value compared to the bottom 99.5%. This would result in a very skewed distribution of transaction value (see Figure 2a), leading to inaccurate results during exploratory data analysis. Thus, it is essential that we remove these outliers. In addition, through our client meetings, we realised that our data includes test cases - i.e. False transactions carried out by merchants to test RDP payment modes; While we are not able to fully determine whether a transaction is a test case or not (as our client cannot give us a benchmark for a transaction value that would define a test case), we can reduce the number of test cases by eliminating the bottom 0.5% of all transactions.
+
<center>[[File:General3.jpg|400px]]<br>
<br><br>
+
<small>Figure 1: Distribution graph of transaction monetary values (transformed using log10), before removing outliers</small></center>
<center>[[File:Figure1RDP.jpg|200px]]<br>
 
<small>Figure 1: Summary Statistics for ‘Converted Value’</small></center>
 
<br/>
 
 
<br>
 
<br>
* Remove transactions with transaction value belonging to top 0.5% and bottom 0.5% of all transactions
+
<center>[[File:General4.jpg|400px]]<br>
* In addition, we also remove transactions with transaction value less than $1, as we realised that there were still many transactions with value <$1. Based on our team’ perspectives, these transactions are likely to be test cases, as it is rather impossible to have transaction value <$1 across two years. Thus, these transactions are necessary to be removed.
+
<small>Figure 2: Distribution graph of transaction monetary values (transformed using log10), after removing outliers</small><br></center><br><br>
<br><br>
 
<center>[[File:Figure2RDP.jpg|400px]]<br>
 
<small>Figure 2a: Johnson SI Transformed Converted value distribution before removing outliers</small></center>
 
<br/>
 
<center>[[File:Figure3RDP.jpg|400px]]<br>
 
<small>Figure 2b:  Johnson SI Transformed Converted value distribution after removing outliers</small><br></center><br><br>
 
- '''Before Removing Outliers'''<br>
 
- Total number of transactions: 2,310,781<br><br>
 
  
- '''After Removing Outliers'''<br>
 
- Total number of transactions: 2,282,080<br>
 
- Excluded number of transactions: 28,701<br>
 
<br>
 
 
==<div style="background: #DD597D; line-height: 0.3em; font-family:calibri;  border-left: #CFCFCF solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF"><strong>Data Preparation</strong></font></div></div>==
 
==<div style="background: #DD597D; line-height: 0.3em; font-family:calibri;  border-left: #CFCFCF solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF"><strong>Data Preparation</strong></font></div></div>==
We have identified the most common types of transaction in the following:<br>
+
The cleaned data was loaded into JMP Pro for data preparation and exploration. This analytical software handles large volumes of data efficiently, which is essential given the size of our sponsor's dataset (approximately 2.2 million transactions).
1. Approved or Completed Transaction<br>
+
 
2. Bank Rejected Transaction<br>
+
Data preparation was a time-consuming but necessary task in ensuring that our data is transformed into a suitable form for further analysis.
3. Authentication Failed<br>
+
 
4. Duplicate Merchant_TranID Detected<br>
+
Using JMP, we have identified the top 3 types of rejected transactions:
 +
1) Bank Rejected Transaction
 +
2) Authentication Failed
 +
3) Duplicate Merchant Tran_ID Detected
 +
 
 +
These values, termed ‘reason code description’, are defined by each merchant’s registered bank. Moving forward, we analyse only these transactions and the approved transactions, to narrow the scope of work.
 +
<br><br>
 +
<big>'''Interactive Binning'''</big><br>
 +
From our data, we found a huge discrepancy in the total number of transactions carried out by each merchant from 2016 to 2017, ranging from 1 to 1,212,019. As shown in Figure 3, plotting the proportion of approved transactions for all 269 merchants with line-of-best-fit graphs produces heavily overlapping data in the scatter plot, which makes the plot less useful for assessing the performance of different merchants.
 +
<br><br>
 +
<center>[[File:General1.jpg|400px]]<br>
 +
<small>Figure 3: High number of overlapping data points when plotting the proportion of approved transactions for 269 merchants</small></center>
 
<br>
 
<br>
Moving forward, we will only be analysing the approved transactions, and the top 3 types of rejected transaction, in order to narrow down the scope of our work.
+
To compare the performance of merchants on a fairer basis, we grouped the merchants into 5 bins (i.e. with cut-off points at every 20th percentile) based on the total number of transactions per merchant. To achieve this, we installed the Interactive Binning plugin by Jeff Perkinson.
 
<br><br>
 
<br><br>
<big>'''Interactive Binning'''</big>
+
<center>[[File:General2.jpg|400px]]<br>
 +
<small>Figure 4: Interactive binning results (I) (Cut-off point: 20th percentile)</small></center>
 
<br>
 
<br>
From our data, we realised that there is huge discrepancy in the total number of transactions carried out by each merchant from 2016-2017; The range varies from 1 to 1212019. As stated in our refined objectives, we hope to identify the over-performing (star) and underperforming (laggard) merchants through the use of Line of Fit graphs. We have decided to group the merchants into 5 bins (cut-points for binning at 20 percentiles) based on total number of transactions per merchant, as we realised that compared to the Line of Fit graphs for all merchants, grouping them into 5 bins provide better model fit.  
+
We first applied a log10 transformation to the variable ‘Number of transactions per merchant’. As seen in Figure 4, setting the cut-off points at every 20th percentile gave us 5 groups of merchants. Each group has a ‘Bin Number’, shown in the table below, so each merchant is assigned a bin number.
<br>Next, we rename each bin into the following:<br>
 
 
{| class="wikitable"
 
{| class="wikitable"
 
|-
 
|-
 
! Bin [Based on Log10]  (Group) !! Number of transactions per merchant (Range) !! Number of merchants
 
! Bin [Based on Log10]  (Group) !! Number of transactions per merchant (Range) !! Number of merchants
 
|-
 
|-
| 1 || <= 7 || 72
+
| 1 || 1 - 10 || 53
 
|-
 
|-
| 2 || 8 - 52 || 72
+
| 2 || 11 - 60 || 54
 
|-
 
|-
| 3 || 53 - 286 || 73
+
| 3 || 61 - 327 || 54
 
|-
 
|-
| 4 || 287 - 1479 || 69
+
| 4 || 328 - 2516 || 53
 
|-
 
|-
| 5 || > = 1480 || 75
+
| 5 || 2517 - 1,212,019 || 55
 
|}
 
|}

Latest revision as of 15:24, 15 April 2018


Sample Data

*Due to sensitivity of data, sample data will not be shown here. Please refer to the submitted reports for more details.

In our study, we incorporated 2 years of data, from 2016 to 2017. This data consists of the details of all online transactions processed by our sponsor from January 2016 to December 2017, such as:
• Date and time
• Monetary value
• Whether transaction is approved/rejected
• Reason for rejected transaction (i.e. reason code description)
• Currency type

Data Cleaning

For data cleaning, we carried out the following steps:

General Cleaning
1. Compiled 2016-2017 data
2. Standardised all merchants’ names and currency codes to uppercase, because JMP Pro treats uppercase and lowercase characters as distinct values
3. Standardised currency of all transaction values: Converted all non-SGD values into SGD values, based on average monthly historical exchange rates found on OANDA.com
4. Split ‘date_created’ column into ‘date’ and ‘time’ columns.
5. Created a new column ‘period’ to indicate the respective month of each date in the ‘date’ column (e.g. 01/01/16 → Period 1; 01/01/17 → Period 12)
6. Created a new column ‘hour’ to indicate respective hour of each time in the ‘time’ column (e.g. 1:00:00 AM → 1; 1:00:00 PM → 13)
7. Created a new column ‘day’ to indicate the respective day of the week for each date in the ‘date’ column (e.g. Monday is assigned a value of 1, Sunday is assigned a value of 7)
8. Removed deactivated or terminated merchants

  - 2016: 41 transactions removed;
  - 2017: 841 transactions removed

9. Removed rows with ambiguous or illegible characters or with negative transaction values, using JMP Pro:

  - Removal of 1563 rows with ambiguous or illegible characters; 
  - Removal of 2 rows with negative transaction value

10. Added Merchant Category Codes (MCC), which were provided at a later stage, to identify the industry sector of each merchant
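
Steps 2 to 7 above can be sketched in Python as follows. This is an illustration only and is not taken from the project's actual workflow. The input field name `amount`, the `dd/mm/yy` date format, and the single placeholder rate in `RATES_TO_SGD` are assumptions made for the example, and ‘period’ is assumed to be the calendar month (1 to 12).

```python
from datetime import datetime

# Hypothetical average monthly rates to SGD, keyed by (currency, year, month).
# The figure below is a placeholder, not an actual OANDA rate.
RATES_TO_SGD = {
    ("USD", 2016, 1): 1.43,
}


def clean_row(row):
    """Apply steps 2 to 7 to one raw transaction record (a dict)."""
    # Assumed timestamp layout: day/month/2-digit year, 12-hour clock.
    created = datetime.strptime(row["date_created"], "%d/%m/%y %I:%M:%S %p")
    currency = row["currency"].strip().upper()
    # SGD amounts need no conversion; other currencies use the monthly rate.
    if currency == "SGD":
        rate = 1.0
    else:
        rate = RATES_TO_SGD[(currency, created.year, created.month)]
    return {
        "name": row["name"].strip().upper(),         # step 2
        "currency": currency,                        # step 2
        "converted_value": round(row["amount"] * rate, 2),  # step 3
        "date": created.date(),                      # step 4
        "time": created.time(),                      # step 4
        "period": created.month,                     # step 5: month, 1 to 12
        "hour": created.hour,                        # step 6: 1:00:00 PM becomes 13
        "day": created.isoweekday(),                 # step 7: Monday = 1, Sunday = 7
    }


example = clean_row({
    "date_created": "04/01/16 1:00:00 PM",
    "name": "Acme Store",
    "currency": "usd",
    "amount": 10.0,
})
```

Keeping the rate lookup keyed by currency, year and month mirrors the use of average monthly exchange rates in step 3; a missing key raises an error rather than silently leaving a value unconverted.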

Removal of Outliers
The raw data provided by our sponsor required extensive cleaning. This included standardising all monetary values to Singapore dollars (SGD), standardising merchant names, and removing deactivated merchants. Furthermore, several confounding factors could affect our analysis, such as test transactions carried out by merchants. To reduce the risk of inaccuracy, we excluded outliers from our data: transactions whose monetary value fell in the top or bottom 0.5%, and transactions below $1. As seen in Figures 1 and 2, this resulted in a less skewed distribution.

General3.jpg
Figure 1: Distribution graph of transaction monetary values (transformed using log10), before removing outliers


General4.jpg
Figure 2: Distribution graph of transaction monetary values (transformed using log10), after removing outliers
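
The outlier rule described in this section can be sketched as below. This is a minimal illustration on synthetic values, assuming the percentile trim is applied first and the $1 floor second; the function name and the exact way the 0.5% cut is rounded are assumptions for the example.

```python
def remove_outliers(values, tail=0.005, minimum=1.0):
    """Drop the top and bottom `tail` share of values, then values below `minimum`.

    Returns the surviving values in ascending order.
    """
    ordered = sorted(values)
    cut = int(len(ordered) * tail)  # number of rows trimmed from each end
    trimmed = ordered[cut:len(ordered) - cut]
    return [v for v in trimmed if v >= minimum]


# 1,000 synthetic values: 0.5% of 1,000 is 5 rows trimmed from each end.
sample = [0.5] * 10 + [float(v) for v in range(1, 990)] + [5.41212e15]
kept = remove_outliers(sample)
```

In this sample the percentile trim alone leaves some sub-$1 values in place, which is why the separate $1 floor is applied afterwards.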



Data Preparation

The cleaned data was loaded into JMP Pro for data preparation and exploration. This analytical software handles large volumes of data efficiently, which is essential given the size of our sponsor's dataset (approximately 2.2 million transactions).

Data preparation was a time-consuming but necessary task to ensure that our data was transformed into a form suitable for further analysis.

Using JMP, we have identified the top 3 types of rejected transactions:
1) Bank Rejected Transaction
2) Authentication Failed
3) Duplicate Merchant Tran_ID Detected

These values, termed ‘reason code description’, are defined by each merchant’s registered bank. Moving forward, we analyse only these transactions and the approved transactions, to narrow the scope of work.
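
The scoping step above amounts to counting reason code descriptions and keeping the approved transactions plus the most frequent rejection types. A minimal sketch follows, assuming approved rows carry the description ‘Approved or Completed Transaction’; the function names and the synthetic sample are hypothetical.

```python
from collections import Counter

# Assumed description used for approved transactions.
APPROVED = "Approved or Completed Transaction"


def top_rejections(descriptions, n=3):
    """Return the n most frequent rejection descriptions, most common first."""
    counts = Counter(d for d in descriptions if d != APPROVED)
    return [reason for reason, _ in counts.most_common(n)]


def in_scope(descriptions, n=3):
    """Keep approved transactions and the top n rejection types only."""
    keep = set(top_rejections(descriptions, n)) | {APPROVED}
    return [d for d in descriptions if d in keep]


# Synthetic sample for illustration; the counts are not project figures.
sample = (
    [APPROVED] * 5
    + ["Bank Rejected Transaction"] * 4
    + ["Authentication Failed"] * 3
    + ["Duplicate Merchant Tran_ID Detected"] * 2
    + ["Some Rare Reason"]
)
top = top_rejections(sample)
scoped = in_scope(sample)
```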

Interactive Binning
From our data, we found a huge discrepancy in the total number of transactions carried out by each merchant from 2016 to 2017, ranging from 1 to 1,212,019. As shown in Figure 3, plotting the proportion of approved transactions for all 269 merchants with line-of-best-fit graphs produces heavily overlapping data in the scatter plot, which makes the plot less useful for assessing the performance of different merchants.

General1.jpg
Figure 3: High number of overlapping data points when plotting the proportion of approved transactions for 269 merchants


To compare the performance of merchants on a fairer basis, we grouped the merchants into 5 bins (i.e. with cut-off points at every 20th percentile) based on the total number of transactions per merchant. To achieve this, we installed the Interactive Binning plugin by Jeff Perkinson.

General2.jpg
Figure 4: Interactive binning results (I) (Cut-off point: 20th percentile)


We first applied a log10 transformation to the variable ‘Number of transactions per merchant’. As seen in Figure 4, setting the cut-off points at every 20th percentile gave us 5 groups of merchants. Each group has a ‘Bin Number’, shown in the table below, so each merchant is assigned a bin number.

{| class="wikitable"
|-
! Bin [Based on Log10] (Group) !! Number of transactions per merchant (Range) !! Number of merchants
|-
| 1 || 1 - 10 || 53
|-
| 2 || 11 - 60 || 54
|-
| 3 || 61 - 327 || 54
|-
| 4 || 328 - 2516 || 53
|-
| 5 || 2517 - 1,212,019 || 55
|}
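
The binning above was produced with the Interactive Binning plugin in JMP. As a rough equivalent, the sketch below applies log10 to each merchant's transaction count and splits the merchants at every 20th percentile. It uses Python's `statistics.quantiles`, whose percentile definition may differ from the plugin's, so cut-off values will not necessarily match the table; the function name and sample counts are hypothetical.

```python
import math
from bisect import bisect_left
from statistics import quantiles


def assign_bins(counts, n_bins=5):
    """Assign each merchant a bin number from 1 to n_bins.

    Bins are split at equally spaced percentiles of log10(transaction count).
    """
    logs = [math.log10(c) for c in counts]
    cuts = quantiles(logs, n=n_bins)  # n_bins - 1 cut points
    # A value equal to a cut point falls into the lower bin.
    return [bisect_left(cuts, x) + 1 for x in logs]


# Synthetic per-merchant transaction counts, for illustration only.
counts = [1, 3, 10, 25, 60, 150, 327, 900, 2516, 1212019]
bins = assign_bins(counts)
```

Because log10 is monotonic, the transformation does not change which merchants land in which percentile band; it mainly makes the heavily skewed counts easier to inspect when choosing cut-off points.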