ANLY482 AY2016-17 T1 Group4: Project Findings
Mid-Term Findings
Exploratory Data Analysis
For our exploratory data analysis, we analyzed two aspects of the data, namely the clients and the policies.
Distribution By Age
From the data, the median age is around 45, with a standard deviation of around 13 years. Age is missing for about 9% of the entries, and those entries have been excluded from this analysis.
The graph also shows that most of the customers are between 31 and 45 years old. The distribution is right-skewed, with a longer tail toward the older ages.
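The summary statistics quoted in this subsection can be reproduced along the following lines. This is only a minimal sketch, not the procedure actually used in the project: the sample ages are made up, missing ages are assumed to be stored as `None`, and the sample standard deviation is assumed (the text does not say which variant was reported).

```python
import statistics

def age_summary(ages):
    """Summarize ages, excluding missing values (assumed to be None)."""
    known = [a for a in ages if a is not None]
    return {
        # Share of entries without an age; these are left out of the statistics.
        "missing_share": (len(ages) - len(known)) / len(ages),
        "median": statistics.median(known),
        # Sample standard deviation (assumption; the population variant is pstdev).
        "std_dev": statistics.stdev(known),
    }

# Made-up ages for illustration only; not taken from the client data.
sample_ages = [28, 35, 41, None, 45, 52, 67, 45, 38, None]
summary = age_summary(sample_ages)
print(summary)
```

Reporting the missing share alongside the median makes it clear how much of the data the statistics actually cover.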
Distribution By Gender
From the data, around 53% of the valid clients are male, while around 46% are female. A very small percentage of individuals (0.53%) did not disclose their gender.
Distribution By Age & Gender
Below are two separate age distribution graphs, grouped by gender. The graph on the top is for females, while the graph on the bottom is for males. Each bar shows the number of clients of that particular age.
For Females:
For Males:
There are slightly more women than men among clients aged 50 to their late 60s. Comparing the two peaks, which both fall within the 30 to 45 age range, there are slightly more men than women among clients in their late 30s.
Distribution of Customers By Occupation
From the above analysis, the largest group of customers is managers, followed by “OTHR”, which could possibly consist of other businesspeople such as entrepreneurs. Next is a group that did not disclose their occupation, followed by engineers, housewives and executives.
Understanding Policy-Based Data
For policy-based data, we look at a few fields: CHDRNUM, which is the unique policy number; RIDER, which is the rider index number; and INSTPREM, which is the premium that has been paid up for that particular record. We also look at the columns LIFE and COVERAGE, which define the index of each contract’s coverage. The following is a visual representation of how each contract is modelled:
Each CHDRNUM has many LIFE entries, each LIFE has many COVERAGE entries, and each COVERAGE has many RIDERs. To illustrate, the table below shows a small sample of the data:
As seen from the small sample above, the contract with CHDRNUM 00240907 contains two LIFE coverages, 01 and 02. Within each LIFE, COVERAGE denotes each coverage plan, and each coverage plan can have various riders. One thing to note is that all life coverages that come after 01 are nested under the original coverage defined in the row where LIFE = 01, COVERAGE = 01 and RIDER = 00. Therefore, if the product defined for that row is “Legacy Plan”, then the basic plan for the entire contract 00240907 is “Legacy Plan”. Also, whenever LIFE is not 01, that record is a rider for the basic plan defined by LIFE = 01, COVERAGE = 01, RIDER = 00 in that contract.
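The nesting and the basic-plan rule described above can be sketched in code as follows. This is an illustrative sketch under stated assumptions, not the project's actual processing: the rows are invented (only the contract number 00240907, the “Legacy Plan” example and the field names CHDRNUM, LIFE, COVERAGE and RIDER come from the text), and the `PRODUCT` field name and the rider product names are hypothetical.

```python
from collections import defaultdict

# Invented sample rows; PRODUCT is a hypothetical column name.
records = [
    {"CHDRNUM": "00240907", "LIFE": "01", "COVERAGE": "01", "RIDER": "00", "PRODUCT": "Legacy Plan"},
    {"CHDRNUM": "00240907", "LIFE": "01", "COVERAGE": "01", "RIDER": "01", "PRODUCT": "Rider A"},
    {"CHDRNUM": "00240907", "LIFE": "02", "COVERAGE": "01", "RIDER": "00", "PRODUCT": "Rider B"},
]

def nest(rows):
    """Group rows as CHDRNUM -> LIFE -> COVERAGE -> list of RIDER indices."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for r in rows:
        tree[r["CHDRNUM"]][r["LIFE"]][r["COVERAGE"]].append(r["RIDER"])
    return tree

def is_basic(r):
    """The basic plan is the row with LIFE = 01, COVERAGE = 01, RIDER = 00."""
    return (r["LIFE"], r["COVERAGE"], r["RIDER"]) == ("01", "01", "00")

def basic_plans(rows):
    """Map each contract number to the product of its basic-plan row."""
    return {r["CHDRNUM"]: r["PRODUCT"] for r in rows if is_basic(r)}

def classify(r):
    """Apply only the rules stated in the text; other rows are left as 'other'."""
    if is_basic(r):
        return "basic"
    if r["LIFE"] != "01":
        return "rider"  # any record with LIFE other than 01 is a rider
    return "other"      # remaining LIFE = 01 rows: not specified in the text

tree = nest(records)
plans = basic_plans(records)
labels = [classify(r) for r in records]
```

Keeping the index fields as zero-padded strings, rather than integers, avoids losing the leading zeros in values such as 00240907 and 01.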
Policy's Uptake Vs Time
We plotted graphs for each of the basic plans and also for the riders, to allow the client company to understand the uptake of each basic plan over the years.
From this, they will be able to understand which plans have an increasing uptake rate, and which plans have a decreasing uptake rate. These graphs have been put into a separate PDF, for their perusal.
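As a rough illustration of how uptake over time could be tabulated, the sketch below counts new policies per plan per year and labels each plan's trend by comparing its first and last observed years. Everything here is an assumption made for illustration: the field names `plan` and `start_year`, the plan names and the sample values are hypothetical, and the first-versus-last comparison is a simplification rather than the method behind the graphs.

```python
from collections import defaultdict

# Hypothetical policies; field names and values are invented for illustration.
policies = (
    [{"plan": "Plan A", "start_year": y} for y in (2013, 2014, 2014, 2015, 2015, 2015)]
    + [{"plan": "Plan B", "start_year": y} for y in (2013, 2013, 2015)]
)

def uptake_by_year(rows):
    """Count policies per plan per start year, with years in ascending order."""
    counts = defaultdict(lambda: defaultdict(int))
    for p in rows:
        counts[p["plan"]][p["start_year"]] += 1
    return {plan: dict(sorted(years.items())) for plan, years in counts.items()}

def trend(yearly):
    """Label a plan by comparing its last observed year with its first."""
    years = sorted(yearly)
    first, last = yearly[years[0]], yearly[years[-1]]
    if last > first:
        return "increasing"
    if last < first:
        return "decreasing"
    return "flat"

uptake = uptake_by_year(policies)
trends = {plan: trend(yearly) for plan, yearly in uptake.items()}
print(trends)
```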
Final Findings
Due to the NDA that we have signed, the findings will not be displayed here, but in our privately submitted reports. Please contact our project supervisor, Professor Kam Tin Seong, if you wish to have access to them. Thank you for your kind understanding.