Difference between revisions of "AY1516 T2 Team Hew - Documentation"

From Analytics Practicum
Jump to navigation Jump to search
Line 27: Line 27:
 
<br/><br/>
 
<br/><br/>
 
'''Our Approach'''
 
'''Our Approach'''
# We tried a different approach by first filtering both datasets to only select Policies with ''Orig_InceptionDate'' between 2012 to 2015.
+
1. We tried a different approach by first filtering both datasets to only select Policies with ''Orig_InceptionDate'' between 2012 to 2015.
# Deleted variables in both datasets which were either duplicate columns or not very useful in our analysis.
+
2. Deleted variables in both datasets which were either duplicate columns or not very useful in our analysis.
  
 
{| class="wikitable"
 
{| class="wikitable"
Line 37: Line 37:
 
|}
 
|}
  
# Merged both datasets by matching Policy and Risk_NO in both datasets
+
3. Merged both datasets by matching Policy and Risk_NO in both datasets
# Grouped by the ''Policy'', ''Risk_NO'' and ''Claim_NO'' to sum up the transactional-level data for each group to derive the total for the quantitative variables of ''GWP'', ''ClaimPaid'', ''Disc'' and ''Comm''.
+
4. Grouped by the ''Policy'', ''Risk_NO'' and ''Claim_NO'' to sum up the transactional-level data for each group to derive the total for the quantitative variables of ''GWP'', ''ClaimPaid'', ''Disc'' and ''Comm''.
 +
 
 +
The final merged dataset contains 92 columns and 349,648 rows, with duplicate rows removed.  
 
<br/><br/>
 
<br/><br/>
 
</div>
 
</div>

Revision as of 22:04, 28 February 2016


Data processing

Dataset Merging

Initially, we tried merging the 2 provided datasets directly but were unsuccessful as our computers ran out of memory, since the end result was a combined dataset of over 6GB in file size and 218 variables.

Our Approach 1. We tried a different approach by first filtering both datasets to only select Policies with Orig_InceptionDate between 2012 to 2015. 2. Deleted variables in both datasets which were either duplicate columns or not very useful in our analysis.

Deleted Variables (not exhaustive):
VEH_TYPE, VEH_Reg_NO, VEH_Chassis, VEH_Engine, Category, AGENT_Code, Client_CODE, Person_in_Charge, DEPT, INWARD, IND, ACT_Policy Duration, LongTERM_POL, Inception_Date, Expiry_Date, UW_YR, UW_Quarter, Eff_Yr, EFF_Month, EFF_Quarter, Last_EffDate, VEH_Code, VEH_CLASS, Contract_Type, Anniversary, Account_Type, Intermediaries, Intermediares_Area, Intermediares_Island, Vehicle_SI_Acceptance, Cause_Code, Loss_Code, Repairer_Code, Transaction_Code, CLM_NO, Repairier_Status2, Transaction_Desc, TransCode_Desc, Payer_name

3. Merged both datasets by matching Policy and Risk_NO in both datasets 4. Grouped by the Policy, Risk_NO and Claim_NO to sum up the transactional-level data for each group to derive the total for the quantitative variables of GWP, ClaimPaid, Disc and Comm.

The final merged dataset contains 92 columns and 349,648 rows, with duplicate rows removed.

Findings