Difference between revisions of "AY1516 T2 Team Hew - Documentation"

From Analytics Practicum
Jump to navigation Jump to search
Line 41: Line 41:
 
<br/>
 
<br/>
 
The final merged dataset contains 92 columns and 349,648 rows, with duplicate rows removed.  
 
The final merged dataset contains 92 columns and 349,648 rows, with duplicate rows removed.  
 +
<br/><br/>
 +
===Further Cleaning & Processing===
 +
* We hypothesized that there would be differences in claim amounts for different vehicle types, so we split the dataset into Motorcycle policies VS Vehicle policies for further analysis.
 +
* Likewise, we segregated the data into Corporate Accounts VS Personal Accounts as we felt that Corporate customers would have different demographics and behaviours.
 +
* We deleted ''Driver Ages'' from 0 -16 (630 entries) as this was below the legal age of 17 to drive a vehicle or ride a motorcycle in Indonesia.  For high ''Driver Age'' limits, we used a boxplot to identify obvious outliers of age 103 and 114, which were excluded.
 +
* We studied data above 2 standard deviations away from the mean for ''Driver Age''.
 +
* We excluded Policy “SBD/AORF/12-A0374288”, as its claim amount is more than 900 Million Rupiah (extreme outlier in dataset), resulting from the vehicle insured being sunk alongside the ship which was carrying it.
 +
* We excluded rows where ''SUM(Claim Paid)''=0 and ''Sum(Claim OS)''=0, as these are rejected claims and are excluded from the main dataset and saved as a separate table for further analysis (~1000 rows).
 +
* We removed several rows where ''SUM(GWP)'' was negative, which probably arose from data entry errors
 
<br/><br/>
 
<br/><br/>
 
</div>
 
</div>

Revision as of 22:07, 28 February 2016


Data processing

Dataset Merging

Initially, we tried merging the 2 provided datasets directly but were unsuccessful as our computers ran out of memory, since the end result was a combined dataset of over 6GB in file size and 218 variables.

Our Approach:
1. We tried a different approach by first filtering both datasets to only select Policies with Orig_InceptionDate between 2012 to 2015.
2. Deleted variables in both datasets which were either duplicate columns or not very useful in our analysis.

Deleted Variables (not exhaustive):
VEH_TYPE, VEH_Reg_NO, VEH_Chassis, VEH_Engine, Category, AGENT_Code, Client_CODE, Person_in_Charge, DEPT, INWARD, IND, ACT_Policy Duration, LongTERM_POL, Inception_Date, Expiry_Date, UW_YR, UW_Quarter, Eff_Yr, EFF_Month, EFF_Quarter, Last_EffDate, VEH_Code, VEH_CLASS, Contract_Type, Anniversary, Account_Type, Intermediaries, Intermediares_Area, Intermediares_Island, Vehicle_SI_Acceptance, Cause_Code, Loss_Code, Repairer_Code, Transaction_Code, CLM_NO, Repairier_Status2, Transaction_Desc, TransCode_Desc, Payer_name

3. Merged both datasets by matching Policy and Risk_NO in both datasets
4. Grouped by the Policy, Risk_NO and Claim_NO to sum up the transactional-level data for each group to derive the total for the quantitative variables of GWP, ClaimPaid, Disc and Comm.

The final merged dataset contains 92 columns and 349,648 rows, with duplicate rows removed.

Further Cleaning & Processing

  • We hypothesized that there would be differences in claim amounts for different vehicle types, so we split the dataset into Motorcycle policies VS Vehicle policies for further analysis.
  • Likewise, we segregated the data into Corporate Accounts VS Personal Accounts as we felt that Corporate customers would have different demographics and behaviours.
  • We deleted Driver Ages from 0 -16 (630 entries) as this was below the legal age of 17 to drive a vehicle or ride a motorcycle in Indonesia. For high Driver Age limits, we used a boxplot to identify obvious outliers of age 103 and 114, which were excluded.
  • We studied data above 2 standard deviations away from the mean for Driver Age.
  • We excluded Policy “SBD/AORF/12-A0374288”, as its claim amount is more than 900 Million Rupiah (extreme outlier in dataset), resulting from the vehicle insured being sunk alongside the ship which was carrying it.
  • We excluded rows where SUM(Claim Paid)=0 and Sum(Claim OS)=0, as these are rejected claims and are excluded from the main dataset and saved as a separate table for further analysis (~1000 rows).
  • We removed several rows where SUM(GWP) was negative, which probably arose from data entry errors



Findings