Difference between revisions of "AY1516 T2 Team Hew - Prediction"

 
(17 intermediate revisions by the same user not shown)
Line 20: Line 20:
 
{| style="background-color:white; color:white ; border:0px solid #4690cd; margin-left: auto; margin-right: auto;" width="100%" height=50px cellspacing="0" cellpadding="0" valign="top"  |
 
{| style="background-color:white; color:white ; border:0px solid #4690cd; margin-left: auto; margin-right: auto;" width="100%" height=50px cellspacing="0" cellpadding="0" valign="top"  |
  
| style="padding:0 .3em;  solid #000000;  padding: 10px; text-align:center; background-color:white; border-bottom:3px solid #E0E0E0" width="50%" | [[AY1516 T2 Team Hew - Documentation| <font color="#3C2415" family="Trebuchet MS"><b>Exploratory Analysis</b></font>]]
+
| style="padding:0 .3em;  solid #000000;  padding: 10px; text-align:center; background-color:white; border-bottom:3px solid #E0E0E0" width="50%" | [[AY1516 T2 Team Hew - Documentation| <font color="#3C2415" family="Trebuchet MS"><b>Exploratory Analysis (Interim)</b></font>]]
  
 
| style="padding:0 .3em;  solid #000000; padding: 10px; text-align:center; background-color:#F0F0F0; border-left:3px solid #E0E0E0; " width="50%" | [[AY1516 T2 Team Hew - Prediction| <font family="Trebuchet MS" color="#101010"><b>Prediction</b></font>]]
 
| style="padding:0 .3em;  solid #000000; padding: 10px; text-align:center; background-color:#F0F0F0; border-left:3px solid #E0E0E0; " width="50%" | [[AY1516 T2 Team Hew - Prediction| <font family="Trebuchet MS" color="#101010"><b>Prediction</b></font>]]
Line 36: Line 36:
 
== <p style="font-family:Trebuchet MS; border-left: 6px solid #62b762; padding-left:10px; line-height:40px; height:40px"><b>Claim Amount Distribution Fitting</b></p>==
 
== <p style="font-family:Trebuchet MS; border-left: 6px solid #62b762; padding-left:10px; line-height:40px; height:40px"><b>Claim Amount Distribution Fitting</b></p>==
 
=== Analyze Distribution===
 
=== Analyze Distribution===
Vaughn (1996) and a number of actuaries have mentioned that ClaimPaid amounts have to factor in inflation for greater accuracy. However, as the highest inflation rate attained by Indonesia for the time period of study is about 7%, this is rather negligible. Using JMP Pro 12, we first excluded rows where ClaimPaid = 0 and derived the distribution statistics of:
+
Vaughn<ref name="vaughn">Vaughn, Trent R. (1996), Simulation Models for Self-Insurance</ref> and a number of actuaries have mentioned that Claim amounts have to factor in inflation for greater accuracy. However, as the highest inflation rate attained by Indonesia for the time period of study is about 7%, this is rather negligible. Using JMP Pro 12, we first excluded rows where Claim amount = 0 and derived the distribution statistics of:
 
<br />
 
<br />
 
{| class="wikitable" style="text-align: center; width: 70%; margin-left: auto; margin-right: auto; border: none;"
 
{| class="wikitable" style="text-align: center; width: 70%; margin-left: auto; margin-right: auto; border: none;"
Line 47: Line 47:
 
[[File:ClaimPaidOverall.png|center|Overall Claim amount distribution]]
 
[[File:ClaimPaidOverall.png|center|Overall Claim amount distribution]]
 
<br />
 
<br />
From figure above, we can see that there are many instances of small claim amounts, with few instances of extreme values, and the distribution shape is right skewed. Similar distribution shapes for Claim amounts have also been derived from insurance companies based in other regions, as seen in figure below taken from Eling’s research (2011).
+
From figure above, we can see that there are many instances of small claim amounts, with few instances of extreme values, and the distribution shape is right skewed. Similar distribution shapes for Claim amounts have also been derived from insurance companies based in other regions, as seen in figure below taken from Eling's Research<ref name="eling">Eling, M. (2011), Fitting Insurance Claims to Skewed Distributions, Working Papers on Risk Management and Insurance, No. 98, November 2011</ref>.
 
<br /><br />
 
<br /><br />
 
[[File:Eling's_Research.png|center|Eling's Research]]
 
[[File:Eling's_Research.png|center|Eling's Research]]
Line 55: Line 55:
 
We then attempt to fit several distributions on the Claim amounts column, using JMP Pro’s Fit Distribution function. The most suitable distribution selected will be the one which minimizes the -2Log(Likelihood), taken to be a measure of variation or uncertainty in the sample.
 
We then attempt to fit several distributions on the Claim amounts column, using JMP Pro’s Fit Distribution function. The most suitable distribution selected will be the one which minimizes the -2Log(Likelihood), taken to be a measure of variation or uncertainty in the sample.
 
<br /><br />
 
<br /><br />
[[File:Fit_distri.JPG|center|Fit Distribution Results]]
+
[[File:Fit_distri.JPG|center|800px|Fit Distribution Results]]
 
<br /><br />
 
<br /><br />
 
+
We can see that -2Log(Likelihood) is minimized with the LogNormal distribution. According to the JMP Pro documentation, the LogNormal parameters are estimated by maximum likelihood, and maximizing the likelihood function is equivalent to minimizing the negative of its natural logarithm. The reported -2Log(Likelihood) is therefore the smallest value attainable within the LogNormal family, and since it is also the lowest among the candidate distributions, the LogNormal is the best-fitting of those we tried.
 +
<br /><br />
 +
We ran a further Goodness-of-Fit test, Kolmogorov’s D test, with the null hypothesis “H0: The data is from the LogNormal Distribution”, obtaining a D-statistic of 0.048569 and a p-value of 0.01. The p-value is the probability of observing a statistic at least as extreme as the one obtained if the null hypothesis were true, so a small p-value is evidence against the null hypothesis rather than for it. Taken strictly, the p-value of 0.01 therefore indicates that the Claim amounts deviate from an exact LogNormal distribution. With N = 49,066 observations, however, the test is sensitive to even small departures, and the D-statistic shows that the empirical and fitted cumulative distributions differ by less than 0.05 at their widest point. We therefore treat the LogNormal as a reasonable approximation rather than an exact fit.
 +
<br /><br />
 +
Using JMP Pro’s Diagnostic Plot function, we can visually affirm the goodness-of-fit, as seen in the figure below. The actual Claim amount is plotted against its corresponding estimated LogNormal probability on the Y-axis. We can see that most of the data points fall on the Line of Fit (red line) for Claim amounts within the mid-range, thus depicting a good fit.
 +
<br /><br />
 +
[[File:Quantile_plot.png|center|Diagnostic Plot]]
 +
<br /><br />
 +
Several other papers have adopted to fit Claim amounts with the LogNormal distribution because of its right-skewed shape<ref name="adeleke">Adeleke, Ismail A., Ibiwoye, A. (2011), Modelling claim sizes in Personal Line non-life insurance, International Business & Economics Research Journal - February 2011, Vol. 10, No. 2, pp. 21-38</ref>, and this is in line with our findings. We therefore fit a LogNormal distribution to the Claim amounts, which allows us to use it for prediction or modelling purposes.
 +
<br /><br />
 +
===Segmentation===
 +
We hypothesized that there would be differences in claim amounts for different vehicle types, due to the varying costs of each vehicle, so we proceeded to segment the dataset by the Vehicle Type (Passenger Car, Bus, Truck, Motorcycle) and plot the Claim amount distributions for each segment:
 +
<br /><br />
 +
[[File:Segment1.png|center|Claim amounts for Bus & Motorcycle]]
 +
<br />
 +
[[File:Segment2.png|center|Claim amounts for Passenger Car & Truck]]
 +
<br />
 +
We can see that the distribution shape for Motorcycles is different as compared to the other Vehicle Types, as about 90% of the claims were theft-related. The other Vehicle Types have a similar right-skewed distribution which we can fit a LogNormal distribution, albeit with different parameter values due to the differences in Claim  amounts. Claim amount is generally higher for larger Vehicle Types.
 +
<br /><br /><br />
 
</div>
 
</div>
  
 
<div style="margin:20px; padding-top:0px; padding-right:20px; padding-left:20px; background: #ffffff; font-family: Trebuchet MS, sans-serif; font-size: 100%; -webkit-border-radius: 10px;-webkit-box-shadow: 1px 1px 2px rgba(0, 0, 0, 0.08); -moz-box-shadow:    1px 1px 2px rgba(0, 0, 0, 0.08);box-shadow: 1px 1px 2px rgba(0, 0, 0, 0.08);">
 
<div style="margin:20px; padding-top:0px; padding-right:20px; padding-left:20px; background: #ffffff; font-family: Trebuchet MS, sans-serif; font-size: 100%; -webkit-border-radius: 10px;-webkit-box-shadow: 1px 1px 2px rgba(0, 0, 0, 0.08); -moz-box-shadow:    1px 1px 2px rgba(0, 0, 0, 0.08);box-shadow: 1px 1px 2px rgba(0, 0, 0, 0.08);">
  
== <p style="font-family:Trebuchet MS; border-left: 6px solid #daad25; padding-left:10px; line-height:40px; height:40px"><b>Linear Regression</b></p>==
+
== <p style="font-family:Trebuchet MS; border-left: 6px solid #daad25; padding-left:10px; line-height:40px; height:40px"><b>Multiple Linear Regression</b></p>==
Coming Soon
+
===Applying Log Transformation===
 +
We first attempted a prediction of Claim amounts on the Passenger Car Segment of the dataset. A natural logarithm transformation was applied to normalize the Claim amounts as shown below:
 +
<br />
 +
{| class="wikitable" style="text-align:center;"
 +
|-
 +
! Initial Distribution !! After Applying Log Transformation
 +
|-
 +
|
 +
[[File:Passenger initial.png|center|500px]]
 +
|| [[File:Passenger_after_log.png|center|500px]]
 +
|}
 +
We can see that the Mean of 15.0673 is now approximately equal to the Median of 15.0318, which indicates a roughly symmetric distribution and is consistent with a Normal distribution. This is further supported by the Normal Quantile plot below, which shows that the data points lie close to the line of best fit:
 +
<br />
 +
[[File:Normal quantile plot.png|center]]
 +
<br /><br />
 +
 
 +
===Multiple Linear Regression===
 +
We then set out to predict the Log(Claim amount) of the Passenger Car segment, using the following independent variables listed below. Our approach was to first include all possible variables that were useful in the predictive model, then analyze the results to remove the insignificant variables and improve model accuracy.
 +
 
 +
{| class="wikitable"
 +
|-
 +
! Variable !! Values !! Description
 +
|-
 +
| VehAge_Current || Continuous Numerical || Current Vehicle Age
 +
|-
 +
| Bus_Ind || New Business / Renewal Business || Business Indicator
 +
|-
 +
| Driver_Age || Continuous Numerical || Only recorded when a claim is made
 +
|-
 +
| Corppers || Corporate / Personal || Account Type
 +
|-
 +
| At_Fault || At Fault / Not At Fault || Whether the Claimant caused the accident that resulted in the claim
 +
|-
 +
| Veh_Make || Audi / BMW / Chevrolet / Chrysler / Daihatsu / Ford / Honda / Hyundai / Isuzu / Jaguar / Land rover / Mazda / Mercedes Benz/ Mitsubishi / Nissan / Opel / Peugeot / Proton / Subaru / Suzuki / Timor / Toyota / Volkswagen || Vehicle Brand
 +
|-
 +
| Log(SumInsured) || Continuous Numerical || Natural Logarithm Transformation of the vehicle's Sum Insured amount
 +
|-
 +
| Coverage_period || Categorical Numerical || Policy coverage period in years
 +
|-
 +
| Jap_Ind || Japanese / Local || Japanese Customer Indicator
 +
|-
 +
| Cover Type || Comprehensive / Total Loss Order || Policy Coverage Type
 +
|-
 +
| Channel || Agent / Bank / Broker / Coinsurance / Dealer / Direct / Leasing  || Sales Channel
 +
|-
 +
| Cause_Type || Collision / Fire / Flood / Grazed (humans) / Grazed (non-vehicle) / Grazed (vehicle) / Impact (humans) / Impact (non-vehicle) / Impact (vehicle) / Others || Cause of the claim
 +
|}
 +
<br /><br />
 +
Splitting the dataset by a mix of 60% (Training Set), 20% (Validation Set), 20% (Test Set), the following results were obtained:
 +
 
 +
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"
 +
|-
 +
!  !! Training Set !! Validation Set !! Test Set
 +
|-
 +
| R-Square || 0.288 || 0.274 || 0.223
 +
|}
 +
 
 +
Even with all the possible variables used as predictors, the R-Square stands at a low value of '''0.288'''. The same approach was taken on other predictive methods like Decision Trees and Neural Networks, but yielded a similar low R-Square value of between 0.3-0.4 thus highlighting the dataset's lack of predictive ability.
 +
<br /><br />
 
</div>
 
</div>
  
 
<div style="margin:20px; padding-top:0px; padding-right:20px; padding-left:20px; background: #ffffff; font-family: Trebuchet MS, sans-serif; font-size: 100%; -webkit-border-radius: 10px;-webkit-box-shadow: 1px 1px 2px rgba(0, 0, 0, 0.08); -moz-box-shadow:    1px 1px 2px rgba(0, 0, 0, 0.08);box-shadow: 1px 1px 2px rgba(0, 0, 0, 0.08);">
 
<div style="margin:20px; padding-top:0px; padding-right:20px; padding-left:20px; background: #ffffff; font-family: Trebuchet MS, sans-serif; font-size: 100%; -webkit-border-radius: 10px;-webkit-box-shadow: 1px 1px 2px rgba(0, 0, 0, 0.08); -moz-box-shadow:    1px 1px 2px rgba(0, 0, 0, 0.08);box-shadow: 1px 1px 2px rgba(0, 0, 0, 0.08);">
  
== <p style="font-family:Trebuchet MS; border-left: 6px solid #cf657b; padding-left:10px; line-height:40px; height:40px"><b>Research & Methodology </b></p>==
+
== <p style="font-family:Trebuchet MS; border-left: 6px solid #cf657b; padding-left:10px; line-height:40px; height:40px"><b>Recommendations</b></p>==
Coming Soon
+
===Suggested improvements to dataset:===
 +
* The ''Bus_Ind'' variable allows us to see which policy is a renewal, however, it does not tell us which particular customer has made a renewal, as individual customer names are not recorded, but rather, the Leasing Agent names are recorded. Having the actual customer names in the dataset can open up many possibilities for Predictive Analytics - e.g. to identify hidden customer segments/clusters and also to predict customer churn rate.
 +
* This can be supplemented with further customer demographic data like the age, gender, income level, address, etc.
 +
* Record claim cause in the form of an additional accident report instead of just classifying it in categories. This can allow for text mining to be applied in conjunction with other predictive methods to detect fraud.
 +
<br /><br />
 
</div>
 
</div>
  

Latest revision as of 14:32, 17 April 2016



Claim Amount Distribution Fitting

Analyze Distribution

Vaughn[1] and a number of actuaries have noted that Claim amounts should be adjusted for inflation for greater accuracy. However, as the highest inflation rate recorded in Indonesia during the period of study was about 7%, we treated its effect as negligible and did not adjust the amounts. Using JMP Pro 12, we first excluded rows where Claim amount = 0 and derived the following distribution statistics:

{| class="wikitable"
|-
! N !! Min !! Max !! Median !! Mean !! Standard Deviation
|-
| 49,066 || 2,570 || 3,295,056,701 || 4,297,418 || 21,090,896 || 90,927,485
|}
Overall Claim amount distribution


From the figure above, we can see that there are many instances of small Claim amounts and few instances of extreme values, so the distribution is right-skewed. Similar distribution shapes for Claim amounts have also been reported for insurance companies based in other regions, as seen in the figure below, taken from Eling's research[2].

Eling's Research



Fit Distribution

We then fitted several distributions to the Claim amounts column using JMP Pro’s Fit Distribution function. The most suitable distribution is taken to be the one that minimizes -2Log(Likelihood); a lower value means that the fitted distribution assigns a higher likelihood to the observed data.

Fit Distribution Results



We can see that -2Log(Likelihood) is minimized with the LogNormal distribution. According to the JMP Pro documentation, the LogNormal parameters are estimated by maximum likelihood, and maximizing the likelihood function is equivalent to minimizing the negative of its natural logarithm. The reported -2Log(Likelihood) is therefore the smallest value attainable within the LogNormal family, and since it is also the lowest among the candidate distributions, the LogNormal is the best-fitting of those we tried.
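The comparison of -2Log(Likelihood) across candidate distributions can be sketched in a few lines of Python. This is only an illustration and not the JMP Pro procedure: the data are synthetic, the parameter values 15.0 and 1.5 are arbitrary assumptions, and the Exponential distribution is used as a stand-in second candidate. Both fits use the closed-form maximum likelihood estimates.

```python
import math
import random


def neg2loglik_lognormal(xs):
    """Fit a LogNormal by maximum likelihood; return (mu, sigma, -2*log-likelihood)."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    # The maximum likelihood estimate of sigma divides by n, not n - 1.
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    loglik = sum(
        -v - math.log(sigma) - 0.5 * math.log(2 * math.pi)
        - (v - mu) ** 2 / (2 * sigma ** 2)
        for v in logs
    )
    return mu, sigma, -2.0 * loglik


def neg2loglik_exponential(xs):
    """Fit an Exponential by maximum likelihood; return (rate, -2*log-likelihood)."""
    n = len(xs)
    rate = n / sum(xs)
    loglik = n * math.log(rate) - rate * sum(xs)
    return rate, -2.0 * loglik


# Synthetic right-skewed "claim amounts" (hypothetical, not the project data).
random.seed(0)
claims = [random.lognormvariate(15.0, 1.5) for _ in range(2000)]

mu_hat, sigma_hat, m2ll_lognormal = neg2loglik_lognormal(claims)
rate_hat, m2ll_exponential = neg2loglik_exponential(claims)

print("LogNormal   -2LogLik:", round(m2ll_lognormal, 1))
print("Exponential -2LogLik:", round(m2ll_exponential, 1))
```

The candidate with the lower value fits the sample better. When candidates have different numbers of parameters, a penalized criterion such as AIC is a common alternative to the raw -2Log(Likelihood).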

We ran a further Goodness-of-Fit test, Kolmogorov’s D test, with the null hypothesis “H0: The data is from the LogNormal Distribution”, obtaining a D-statistic of 0.048569 and a p-value of 0.01. The p-value is the probability of observing a statistic at least as extreme as the one obtained if the null hypothesis were true, so a small p-value is evidence against the null hypothesis rather than for it. Taken strictly, the p-value of 0.01 therefore indicates that the Claim amounts deviate from an exact LogNormal distribution. With N = 49,066 observations, however, the test is sensitive to even small departures, and the D-statistic shows that the empirical and fitted cumulative distributions differ by less than 0.05 at their widest point. We therefore treat the LogNormal as a reasonable approximation rather than an exact fit.
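To make the D-statistic concrete, the sketch below computes it as the largest vertical gap between the empirical cumulative distribution and a fitted LogNormal cumulative distribution. It uses synthetic data with assumed parameters (15.0 and 1.5) and computes only D; it does not reproduce the p-value reported by JMP Pro.

```python
import math
import random


def lognormal_cdf(x, mu, sigma):
    """Cumulative distribution function of a LogNormal(mu, sigma)."""
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))


def ks_d_statistic(xs, mu, sigma):
    """Largest gap between the empirical CDF of xs and the LogNormal CDF."""
    xs_sorted = sorted(xs)
    n = len(xs_sorted)
    d = 0.0
    for i, x in enumerate(xs_sorted):
        f = lognormal_cdf(x, mu, sigma)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Synthetic sample (hypothetical, not the project data).
random.seed(1)
sample = [random.lognormvariate(15.0, 1.5) for _ in range(5000)]

# Estimate the LogNormal parameters from the sample, then measure the gap.
logs = [math.log(x) for x in sample]
mu_hat = sum(logs) / len(logs)
sigma_hat = math.sqrt(sum((v - mu_hat) ** 2 for v in logs) / len(logs))
d_stat = ks_d_statistic(sample, mu_hat, sigma_hat)

print("D-statistic:", round(d_stat, 4))
```

A D close to 0 means the two curves nearly coincide everywhere; a D of about 0.05 means they are at most about five percentage points apart.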

Using JMP Pro’s Diagnostic Plot function, we can visually check the goodness-of-fit, as seen in the figure below. The actual Claim amount is plotted against its corresponding estimated LogNormal probability on the Y-axis. We can see that most of the data points fall on the Line of Fit (red line) for Claim amounts within the mid-range, indicating a good fit over that range.

Diagnostic Plot



Several other papers have also fitted Claim amounts with the LogNormal distribution because of its right-skewed shape[3], which is in line with our findings. We therefore fit a LogNormal distribution to the Claim amounts, so that it can be used for prediction or modelling purposes.

Segmentation

We hypothesized that there would be differences in claim amounts for different vehicle types, due to the varying costs of each vehicle, so we proceeded to segment the dataset by the Vehicle Type (Passenger Car, Bus, Truck, Motorcycle) and plot the Claim amount distributions for each segment:

Claim amounts for Bus & Motorcycle


Claim amounts for Passenger Car & Truck


We can see that the distribution shape for Motorcycles differs from that of the other Vehicle Types, as about 90% of its claims were theft-related. The other Vehicle Types have similar right-skewed distributions, to each of which we can fit a LogNormal distribution, albeit with different parameter values due to the differences in Claim amounts. Claim amounts are generally higher for larger Vehicle Types.
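Fitting a separate LogNormal to each segment could look like the sketch below. The records are made-up values for illustration only, and the function name `fit_lognormal_by_segment` is hypothetical; zero claims are dropped first, mirroring the exclusion described earlier.

```python
import math
from collections import defaultdict


def fit_lognormal_by_segment(records):
    """Return {vehicle_type: (mu, sigma, n)} from (vehicle_type, claim) pairs."""
    logs_by_segment = defaultdict(list)
    for vehicle_type, claim in records:
        if claim > 0:  # rows with a zero Claim amount are excluded
            logs_by_segment[vehicle_type].append(math.log(claim))

    params = {}
    for vehicle_type, logs in logs_by_segment.items():
        n = len(logs)
        mu = sum(logs) / n
        sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
        params[vehicle_type] = (mu, sigma, n)
    return params


# Made-up example records, not taken from the project dataset.
records = [
    ("Passenger Car", 3_000_000),
    ("Passenger Car", 5_000_000),
    ("Truck", 8_000_000),
    ("Truck", 12_000_000),
    ("Bus", 0),
]

segment_params = fit_lognormal_by_segment(records)
for vehicle_type, (mu, sigma, n) in segment_params.items():
    print(vehicle_type, "mu =", round(mu, 3), "sigma =", round(sigma, 3), "n =", n)
```

A larger fitted mu for a segment corresponds to a higher median Claim amount, since the median of a LogNormal is exp(mu).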


Multiple Linear Regression

Applying Log Transformation

We first attempted a prediction of Claim amounts on the Passenger Car Segment of the dataset. A natural logarithm transformation was applied to normalize the Claim amounts as shown below:

[Figures: Initial Distribution (Passenger initial.png); After Applying Log Transformation (Passenger after log.png)]

We can see that the Mean of 15.0673 is now approximately equal to the Median of 15.0318, which indicates a roughly symmetric distribution and is consistent with a Normal distribution. This is further supported by the Normal Quantile plot below, which shows that the data points lie close to the line of best fit:
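The effect of the log transformation on the mean, median and skewness can be checked with a short sketch. The data are synthetic and the parameters (15.0 and 1.2) are assumptions chosen only to give values on a similar scale; the `skewness` helper is a plain moment-based estimate.

```python
import math
import random
import statistics


def skewness(values):
    """Moment-based sample skewness; near 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Synthetic right-skewed amounts (hypothetical, not the project data).
random.seed(2)
raw = [random.lognormvariate(15.0, 1.2) for _ in range(5000)]
logged = [math.log(v) for v in raw]

for label, values in (("raw", raw), ("log", logged)):
    print(
        label,
        "mean =", round(statistics.fmean(values), 4),
        "median =", round(statistics.median(values), 4),
        "skewness =", round(skewness(values), 3),
    )
```

Before the transformation the mean sits well above the median; afterwards the two nearly coincide and the skewness is close to zero. Agreement of mean and median is a check for symmetry, so a quantile plot remains useful for judging Normality.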

Normal quantile plot.png



Multiple Linear Regression

We then set out to predict the Log(Claim amount) of the Passenger Car segment using the independent variables listed below. Our approach was to first include all variables that could be useful in the predictive model, then analyze the results to remove the insignificant variables and improve model accuracy.

{| class="wikitable"
|-
! Variable !! Values !! Description
|-
| VehAge_Current || Continuous Numerical || Current Vehicle Age
|-
| Bus_Ind || New Business / Renewal Business || Business Indicator
|-
| Driver_Age || Continuous Numerical || Only recorded when a claim is made
|-
| Corppers || Corporate / Personal || Account Type
|-
| At_Fault || At Fault / Not At Fault || Whether the Claimant caused the accident that resulted in the claim
|-
| Veh_Make || Audi / BMW / Chevrolet / Chrysler / Daihatsu / Ford / Honda / Hyundai / Isuzu / Jaguar / Land Rover / Mazda / Mercedes Benz / Mitsubishi / Nissan / Opel / Peugeot / Proton / Subaru / Suzuki / Timor / Toyota / Volkswagen || Vehicle Brand
|-
| Log(SumInsured) || Continuous Numerical || Natural Logarithm Transformation of the vehicle's Sum Insured amount
|-
| Coverage_period || Categorical Numerical || Policy coverage period in years
|-
| Jap_Ind || Japanese / Local || Japanese Customer Indicator
|-
| Cover Type || Comprehensive / Total Loss Order || Policy Coverage Type
|-
| Channel || Agent / Bank / Broker / Coinsurance / Dealer / Direct / Leasing || Sales Channel
|-
| Cause_Type || Collision / Fire / Flood / Grazed (humans) / Grazed (non-vehicle) / Grazed (vehicle) / Impact (humans) / Impact (non-vehicle) / Impact (vehicle) / Others || Cause of the claim
|}



After splitting the dataset into 60% (Training Set), 20% (Validation Set) and 20% (Test Set), we obtained the following results:

{| class="wikitable"
|-
!  !! Training Set !! Validation Set !! Test Set
|-
| R-Square || 0.288 || 0.274 || 0.223
|}

Even with all the possible variables used as predictors, the R-Square on the Training Set is only 0.288, and it falls to 0.223 on the Test Set. The same approach was taken with other predictive methods such as Decision Trees and Neural Networks, but these yielded similarly low R-Square values of between 0.3 and 0.4, highlighting the dataset's limited ability to predict Claim amounts.
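The 60/20/20 split and the R-Square measure can be sketched as follows. This is a generic illustration rather than the JMP Pro workflow: the helper names `split_60_20_20` and `r_squared` are hypothetical, and the rows are placeholder integers standing in for policy records.

```python
import random


def split_60_20_20(rows, seed=0):
    """Shuffle rows and split them into training, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 60 // 100
    n_valid = n * 20 // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


def r_squared(actual, predicted):
    """R-Square = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Placeholder rows standing in for the Passenger Car records.
rows = list(range(100))
train, valid, test = split_60_20_20(rows)
print(len(train), len(valid), len(test))

# Predicting the mean for every row explains none of the variance.
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```

An R-Square of 0.288 therefore means the model leaves roughly 71% of the variance in Log(Claim amount) unexplained on the Training Set. Comparing the value across the three sets shows how well the fit carries over to unseen records.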

Recommendations

Suggested improvements to dataset:

  • The Bus_Ind variable shows which policies are renewals, but it does not tell us which customer made the renewal, because the dataset records the Leasing Agent names rather than individual customer names. Having the actual customer names in the dataset would open up many possibilities for Predictive Analytics, e.g. identifying hidden customer segments/clusters and predicting customer churn.
  • This could be supplemented with further customer demographic data such as age, gender, income level and address.
  • Record the claim cause in the form of an additional accident report instead of only classifying it into categories. This would allow text mining to be applied in conjunction with other predictive methods to detect fraud.



References

  1. Vaughn, Trent R. (1996), Simulation Models for Self-Insurance
  2. Eling, M. (2011), Fitting Insurance Claims to Skewed Distributions, Working Papers on Risk Management and Insurance, No. 98, November 2011
  3. Adeleke, Ismail A., Ibiwoye, A. (2011), Modelling claim sizes in Personal Line non-life insurance, International Business & Economics Research Journal - February 2011, Vol. 10, No. 2, pp. 21-38