Red Dot Payment Data Source


Sample Data

*Due to the sensitivity of the data, sample data is not shown here. Please refer to the submitted reports for more details.

Metadata

Previously, we had only the 2017 transaction data provided by RDP. We have since obtained the 2016 transaction data as well.

Below are the definitions of common terms used in RDP's business model:

MERCHANT - A merchant is a business, often a retailer, that operates online. Each merchant has appointed RDP as its payment processor to handle online transactions between the merchant and its customers.

CUSTOMER - A customer is an entity that makes a transaction with the merchant. For a transaction to be made, a customer must make contact with a merchant and provide payment details (e.g. credit card number) to the merchant through RDP’s payment processing gateway.

The following table shows the list of variables in the data and their associated description:

NAME	TYPE	DESCRIPTION
date_created	Date Time	Date and time at which the transaction was carried out between merchant and customer
period	Integer	Month of the year the transaction was carried out between merchant and customer
hour	Integer	Hour in a day the transaction was carried out between merchant and customer
day	Integer	Day of the week the transaction was carried out between merchant and customer
time	Time	Time at which the transaction was carried out between merchant and customer
name	Alphanumeric	Name of merchant
amount of money	Decimal	Total transaction amount - unstandardised currency
currency	CHAR(3)	Currency of the transaction amount - usually in SGD or USD
converted value	Decimal	Total transaction amount - standardised in SGD
reason_code	Alphanumeric	A unique code tagged to each reason for each approved or rejected transaction made by customer
reason_code_description	Alphanumeric	Description of the reason for each approved or rejected transaction made by customer
card_data1	Integer	First 6 digits of customer’s card number - reveals details about the issuing bank and card type
card_data2	Integer	Last 4 digits of customer’s card number
contact_ip_address	Alphanumeric	IP address of the customer
contact_ip_country	Alphanumeric	IP country of the customer
Bin [based on log10]	Integer	Bin number each merchant belongs to

Data Cleaning

For data cleaning, we carried out the following steps:

General Cleaning

1. Compiled 2016-2017 data
2. Standardised all merchant names and currency codes to uppercase, because JMP Pro treats uppercase and lowercase characters as distinct values
3. Standardised the currency of all transaction values: converted all non-SGD values into SGD, based on average monthly historical exchange rates from OANDA.com
4. Split ‘date_created’ column into ‘date’ and ‘time’ columns.
5. Created a new column ‘period’ to indicate the respective month of each date in the ‘date’ column (e.g. 01/01/16 → Period 1; 01/01/17 → Period 12)
6. Created a new column ‘hour’ to indicate respective hour of each time in the ‘time’ column (e.g. 1:00:00 AM → 1; 1:00:00 PM → 13)
7. Created a new column ‘day’ to indicate the respective day of the week for each date in the ‘date’ column (e.g. Monday is assigned a value of 1, Sunday is assigned a value of 7)
8. Removed deactivated or terminated merchants

  - 2016: 41 transactions removed;
  - 2017: 841 transactions removed

9. Removed rows with ambiguous or illegible characters, or with a negative transaction value, using JMP Pro:

  - Removal of 1563 rows with ambiguous or illegible characters;
  - Removal of 2 rows with negative transaction value
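The per-row steps above (2, 3, 4, 6, 7 and the negative-value check in step 9) were carried out in JMP Pro; the Python sketch below only illustrates the same logic on a single row. The timestamp format, the `RATES_TO_SGD` table and its values, and the function name `clean_row` are assumptions made for illustration and are not taken from the RDP data.

```python
from datetime import datetime

# Placeholder monthly average rates into SGD, keyed by (currency, "YYYY-MM").
# The figure below is illustrative only; it is NOT a real OANDA rate.
RATES_TO_SGD = {
    ("USD", "2016-01"): 1.40,
}


def clean_row(row):
    """Return a cleaned copy of one transaction row, or None if it is dropped."""
    # Assumed timestamp layout; the real 'date_created' format may differ.
    created = datetime.strptime(row["date_created"], "%Y-%m-%d %H:%M:%S")

    amount = float(row["amount of money"])
    if amount < 0:
        return None  # step 9: drop rows with a negative transaction value

    currency = row["currency"].upper()  # step 2: uppercase currency code
    if currency == "SGD":
        rate = 1.0
    else:
        # step 3: monthly average rate for the transaction's month
        rate = RATES_TO_SGD[(currency, created.strftime("%Y-%m"))]

    return {
        "name": row["name"].upper(),           # step 2: uppercase merchant name
        "currency": currency,
        "converted value": round(amount * rate, 2),
        "date": created.date().isoformat(),    # step 4: split date and time
        "time": created.time().isoformat(),
        "hour": created.hour,                  # step 6: 0-23
        "day": created.isoweekday(),           # step 7: Monday = 1, Sunday = 7
    }
```

For example, a row created at `2016-01-04 13:05:00` (a Monday) would get `hour` 13 and `day` 1.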

Removal of Outliers

Some transactions had values as high as 5.41212e+15. As shown in Figure 1, such transactions fall within the top 0.5% of all transactions and are far larger in value than the bottom 99.5%. They produce a heavily skewed distribution of transaction value (see Figure 2a), which would lead to inaccurate results during exploratory data analysis, so it is essential to remove these outliers. In addition, through our client meetings, we learned that our data includes test cases, i.e. false transactions carried out by merchants to test RDP payment modes. We cannot fully determine whether a given transaction is a test case, as our client could not give us a benchmark transaction value that would define one, but we can reduce the number of test cases by eliminating the bottom 0.5% of all transactions.

Figure 1: Summary Statistics for ‘Converted Value’

We therefore removed transactions whose value fell in the top 0.5% or the bottom 0.5% of all transactions.
In addition, we removed transactions with a value of less than $1, as many such transactions still remained. In our team's view, these are likely to be test cases, since genuine transactions of less than $1 are very unlikely over the two-year period. These transactions were therefore removed as well.
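The trimming rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the percentile uses linear interpolation, which may differ from the quantile definition JMP Pro applies, and values exactly at a cut-off are kept. The function names and parameters are hypothetical.

```python
def percentile(sorted_values, p):
    """Linear-interpolation percentile (p in [0, 100]) of a sorted, non-empty list."""
    rank = (len(sorted_values) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def trim_outliers(values, lower_pct=0.5, upper_pct=99.5, minimum=1.0):
    """Keep values inside the percentile band that are also at least `minimum`."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    # Assumption: boundary values are kept (inclusive comparison).
    return [v for v in values if low <= v <= high and v >= minimum]
```

Applied to the converted values in SGD, `trim_outliers(values)` would drop the top and bottom 0.5% and anything below $1 in one pass.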

Figure 2a: Johnson SI Transformed Converted value distribution before removing outliers

Figure 2b: Johnson SI Transformed Converted value distribution after removing outliers

- Before Removing Outliers
  - Total number of transactions: 2,310,781

- After Removing Outliers
  - Total number of transactions: 2,282,080
  - Excluded number of transactions: 28,701

Data Preparation

We identified the following as the most common types of transaction:
1. Approved or Completed Transaction
2. Bank Rejected Transaction
3. Authentication Failed
4. Duplicate Merchant_TranID Detected

Moving forward, we will analyse only the approved transactions and the top 3 types of rejected transaction, in order to narrow the scope of our work.
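One way such a selection could be reproduced is to count the values of ‘reason_code_description’ and keep only the rows belonging to the most frequent ones. The sketch below assumes each row is a dictionary with that key and that the description text identifies the transaction type; the function names are hypothetical.

```python
from collections import Counter


def most_common_types(rows, n=4):
    """Return the n most frequent 'reason_code_description' values."""
    counts = Counter(row["reason_code_description"] for row in rows)
    return [label for label, _ in counts.most_common(n)]


def keep_types(rows, types):
    """Keep only the rows whose description is one of the given types."""
    wanted = set(types)
    return [row for row in rows if row["reason_code_description"] in wanted]
```

With `n=4`, this would keep the approved transactions plus the three most frequent rejection types, provided approvals are among the four most frequent descriptions.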

Interactive Binning
From our data, we found a large disparity in the total number of transactions carried out by each merchant over 2016-2017: the count ranges from 1 to 1,212,019. As stated in our refined objectives, we hope to identify the over-performing (star) and underperforming (laggard) merchants through the use of Line of Fit graphs. We decided to group the merchants into 5 bins (with cut points at every 20th percentile) based on the total number of transactions per merchant, because fitting within 5 bins provided a better model fit than Line of Fit graphs across all merchants.
Next, we renamed the bins as follows:

Bin [Based on Log10] (Group)	Number of transactions per merchant (Range)	Number of merchants
1	<= 7	72
2	8 - 52	72
3	53 - 286	73
4	287 - 1479	69
5	>= 1480	75
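The binning itself was done in JMP Pro. As a rough illustration, the sketch below assigns each merchant to one of 5 bins using cut points at the 20th, 40th, 60th and 80th percentiles of the log10 transaction count. The log10 transform is inferred from the column name, the percentile uses linear interpolation, and the function names are hypothetical, so the resulting cut points may not match the table exactly.

```python
import math
from bisect import bisect_left


def percentile(sorted_values, p):
    """Linear-interpolation percentile (p in [0, 100]) of a sorted, non-empty list."""
    rank = (len(sorted_values) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def assign_bins(counts_by_merchant, n_bins=5):
    """Map each merchant to a bin number from 1 to n_bins.

    counts_by_merchant: dict of merchant name -> transaction count (must be >= 1).
    """
    logs = {m: math.log10(c) for m, c in counts_by_merchant.items()}
    ordered = sorted(logs.values())
    # Cut points at the 20th, 40th, 60th and 80th percentiles when n_bins is 5.
    cuts = [percentile(ordered, 100 * k / n_bins) for k in range(1, n_bins)]
    # A value equal to a cut point falls in the lower bin, matching "<= 7" for bin 1.
    return {m: bisect_left(cuts, v) + 1 for m, v in logs.items()}
```

Because the cut points are percentiles, each bin ends up with roughly the same number of merchants, which is consistent with the 69 to 75 merchants per bin in the table.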
