This notebook is part of a GitHub repository: https://github.com/pessini/insurance-claim-prediction/
MIT Licensed
Author: Leandro Pessini

Porto Seguro’s Safe Driver Prediction - Kaggle

Predict if a driver will file an insurance claim next year


Contents:

1- Introduction

Porto Seguro

Porto Seguro is one of the largest insurance companies in Brazil, specializing in car and home insurance. Located in São Paulo, Porto Seguro has been one of the leading insurers in Brazil since its foundation in 1945.

A key challenge faced by all major car insurers is how to treat good drivers fairly and avoid penalizing those with a good driving history on account of a few bad drivers. Inaccuracies in car insurance claim predictions tend to raise the cost for good drivers and reduce the price for bad ones.

Porto Seguro has been applying Machine Learning for more than 20 years and intends to make car insurance more accessible to everyone.

Kaggle

Kaggle is an online community of data scientists that allows users to find and publish data sets, explore and build ML models, and enter competitions to solve data science challenges.

In this competition, the challenge is to build a model that predicts the probability that a car insurance policyholder will file a claim next year.

Data Description

In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc). Feature names include the postfix bin for binary features and cat for categorical features; features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policyholder.

Libraries
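The notebook's import cell is not reproduced here; a minimal sketch of the libraries this analysis relies on might look like:

```python
# Core data handling
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling and evaluation
import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
```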

Loading Dataset
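A sketch of loading the competition files, assuming the usual Kaggle train.csv/test.csv layout under a local data/ folder (the paths are illustrative):

```python
# Paths are assumptions; adjust to wherever the Kaggle files were downloaded
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

print(train.shape, test.shape)
train.head()
```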


As per the competition description, there are a few calculated features. In one of the Kaggle discussions, it was highlighted that some kind of transformation was applied to generate these features. I will drop them and apply transformations using my own best judgment.
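A sketch of dropping these calculated features, assuming they are the columns prefixed with ps_calc_ as in the competition's naming convention:

```python
# Columns prefixed with 'ps_calc_' are the pre-computed features mentioned above
calc_cols = [c for c in train.columns if c.startswith('ps_calc_')]
train = train.drop(columns=calc_cols)
test = test.drop(columns=[c for c in calc_cols if c in test.columns])
```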

2- Preprocessing & Feature Engineering

Target variable distribution

In the target variable, 1 means that a claim was filed and 0 means that no claim was filed.


The target feature has a severely imbalanced distribution: only 3.6% of policyholders filed a claim and 96.4% did not.

This will be handled by the algorithm through the hyperparameter is_unbalance = True.
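A quick check of the class distribution (assuming the target column is named target, as in the competition data):

```python
# Relative frequency of each class in the target
print(train['target'].value_counts(normalize=True))
# Roughly 0.964 for class 0 and 0.036 for class 1
```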

Missing values

Values of -1 indicate that the feature was missing from the observation.


Only ps_car_03_cat and ps_car_05_cat have a large proportion (roughly 50% or more) of missing values.
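A sketch of how this can be checked by counting the -1 placeholders per column:

```python
# Share of -1 ("missing") values per feature, largest first
missing_share = (train == -1).mean().sort_values(ascending=False)
print(missing_share[missing_share > 0])
```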

Metadata

To make data management easier, meta-information about the variables is added to a DataFrame. It will help with handling those variables later in the analysis, data visualization and modeling.

We do not have information on which features are ordinal, so a numerical meta-level will be added in order to apply normalization later.
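A minimal sketch of building such a metadata table, inferring the level from the competition's column-name suffixes (_bin, _cat); the role/level/keep column names are my own choices:

```python
meta_rows = []
for col in train.columns:
    if col == 'target':
        role = 'target'
    elif col == 'id':
        role = 'id'
    else:
        role = 'input'

    if col.endswith('_bin') or col == 'target':
        level = 'binary'
    elif col.endswith('_cat') or col == 'id':
        level = 'nominal'
    else:
        level = 'numerical'   # ordinal vs. interval is unknown, so treat as numerical

    meta_rows.append({'varname': col, 'role': role, 'level': level, 'keep': True})

meta = pd.DataFrame(meta_rows).set_index('varname')

# Counts per role and level (next subsection)
print(meta.groupby(['role', 'level']).size())
```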

Number of variables per role and level

Exploratory Analysis

There are strong correlations between some of the variables:

The heatmap shows a low number of correlated variables; we will look at three highly correlated pairs separately.

NOTE: sampling was applied to speed up the process.
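A sketch of the sampled correlation heatmap, assuming the numerical input columns come from the metadata table above and a 10% sample is used for speed (the sample fraction is arbitrary):

```python
# Correlation heatmap on a sample of the numerical features
num_cols = meta[(meta['level'] == 'numerical') & (meta['role'] == 'input')].index
sample = train.sample(frac=0.1, random_state=42)

corr = sample[num_cols].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap='coolwarm', center=0, annot=False)
plt.title('Correlation between numerical features (10% sample)')
plt.show()
```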

ps_car_12 x ps_car_13

ps_reg_01 x ps_reg_03

ps_reg_02 x ps_reg_03

As the number of correlated variables is rather low, dimensionality reduction will not be applied and the model will do the heavy lifting.

Binary features

Distribution of the binary features and the corresponding values of the target variable.

Feature Importance

As the categorical variables are already numerical, there is no need to apply LabelEncoding.
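A minimal sketch of ranking features with a tree-based model's impurity importances (in the spirit of the random-forest approach described in the reference below); the estimator settings are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier

X = train.drop(columns=['id', 'target'])
y = train['target']

# Fit a random forest purely to rank features by impurity-based importance
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X, y)

importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(20))
```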

Reference:

Raschka, S., & Mirjalili, V. (2019). Python Machine Learning (3rd ed.). Birmingham, UK: Packt Publishing.

Loading prefit model

Data transformation and normalization

Combining train and test data

Now that we have the feature importances, let's join the train and test data in order to perform the same transformations on both.
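A sketch of stacking the train and test frames so the same transformations are applied to both; keeping the test ids and the train row count for later is my own bookkeeping choice:

```python
# Keep the test ids for the submission file and mark where train ends
test_ids = test['id']
n_train = len(train)

combined = pd.concat(
    [train.drop(columns=['target']), test],
    axis=0, ignore_index=True
)
y = train['target'].values
```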

Dropping less important features

Handling missing data

The study of missing data was formalized by Donald Rubin through the concept of the missing-data mechanism, in which missing-data indicators are treated as random variables and assigned a distribution. The missing-data mechanism describes the underlying process that generates missing data.

It is important to consider the missing-data mechanism when deciding how to deal with missing data. Because this mechanism is unknown here, I will keep the missing values as part of the dataset (as their own category) and simply create a new feature counting the total number of missing values per observation.
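A minimal sketch of the choice described above: leave -1 in place as its own category and add a per-row count of missing values (the new column name is my own):

```python
# New feature: how many values are missing (-1) in each row
feature_cols = [c for c in combined.columns if c != 'id']
combined['missing_total'] = (combined[feature_cols] == -1).sum(axis=1)
```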

Rubin, D. B. (1975). Inference and missing data. ETS Research Bulletin Series, 1975(1), i–19. https://doi.org/10.1002/j.2333-8504.1975.tb01053.x

Feature scaling using StandardScaler
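A sketch of scaling the remaining numerical columns with StandardScaler; selecting columns by suffix and fitting on the combined frame are assumptions:

```python
num_cols = [c for c in combined.columns
            if not (c.endswith('_bin') or c.endswith('_cat') or c == 'id')]

scaler = StandardScaler()
combined[num_cols] = scaler.fit_transform(combined[num_cols])
```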

One-hot encoding categorical features
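A sketch of one-hot encoding the categorical (_cat) columns with pandas:

```python
cat_cols = [c for c in combined.columns if c.endswith('_cat')]
combined = pd.get_dummies(combined, columns=cat_cols, prefix=cat_cols)
```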

Split train and test data
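Splitting the combined frame back into train and test using the row count recorded earlier (a sketch continuing the snippets above):

```python
X_train = combined.iloc[:n_train].drop(columns=['id'])
X_test = combined.iloc[n_train:].drop(columns=['id'])
# y (the target) was kept aside when the frames were combined
```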

3- Model

Normalized Gini - Kaggle Evaluation
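Since Gini = 2·AUC − 1 (derived in the Evaluation section below), the competition metric can be computed from scikit-learn's ROC AUC. A minimal sketch of the metric plus a feval wrapper in the format lightgbm.train expects:

```python
def gini_normalized(y_true, y_prob):
    """Normalized Gini coefficient, derived from ROC AUC."""
    return 2 * roc_auc_score(y_true, y_prob) - 1

def gini_lgb(y_prob, dataset):
    """Custom eval for lightgbm.train: returns (name, value, is_higher_better)."""
    y_true = dataset.get_label()
    return 'gini', gini_normalized(y_true, y_prob), True
```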

LightGBM

NOTE

Sets the weight of the dominant label to 1, and the weight of the dominated (minority) label to the ratio count(dominant) / count(dominated).
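A sketch of the LightGBM parameter dictionary with the imbalance handling mentioned above; all values other than is_unbalance are illustrative defaults, not the tuned settings:

```python
lgb_params = {
    'objective': 'binary',
    'metric': 'auc',
    'is_unbalance': True,   # re-weights the minority class as described above
    'learning_rate': 0.05,
    'num_leaves': 31,
    'verbose': -1,
}
```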

K-fold cross validation

K-fold cross-validation reports on the performance of a model on several (k) samples from your training set, which provides a less biased evaluation of the model. However, it is more computationally expensive than slicing your data into three parts: the model is re-fitted and tested k times, once per fold, as opposed to one time.

It can be beneficial for many reasons. For example, when your data set is not large enough to slice into three representative parts, cross-validation prevents overfitting to your test data without further reducing the size of your training set.
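A sketch of what the stratified k-fold loop around LightGBM might look like, reusing the parameters and Gini eval defined above; the number of folds, boosting rounds and early-stopping patience are assumptions, and the callback API assumes a recent LightGBM version:

```python
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_pred = np.zeros(len(X_train))

for fold, (trn_idx, val_idx) in enumerate(skf.split(X_train, y)):
    dtrain = lgb.Dataset(X_train.iloc[trn_idx], label=y[trn_idx])
    dvalid = lgb.Dataset(X_train.iloc[val_idx], label=y[val_idx])

    model = lgb.train(
        lgb_params, dtrain,
        num_boost_round=1000,
        valid_sets=[dvalid],
        feval=gini_lgb,
        callbacks=[lgb.early_stopping(100), lgb.log_evaluation(100)],
    )
    # Out-of-fold predictions for an unbiased estimate of performance
    oof_pred[val_idx] = model.predict(X_train.iloc[val_idx])
    print(f'Fold {fold}: gini = {gini_normalized(y[val_idx], oof_pred[val_idx]):.4f}')

print(f'Overall out-of-fold gini: {gini_normalized(y, oof_pred):.4f}')
```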

LightGBM total: 6min 43s

Saving the results

XGBoost (eXtreme Gradient Boosting)

XGBoost total: 3h 34min 42s

Random Search for Hyper-Parameter Optimization


https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html


NOTE: Due to computing resource limitations, I did not perform RandomizedSearchCV locally.

Parameter tuning was performed on Kaggle and yielded the output below:


RandomizedSearchCV
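A sketch of how the randomized search might be set up with LightGBM's scikit-learn wrapper; the parameter distributions, n_iter and cv values are illustrative, not the ones used on Kaggle:

```python
from scipy.stats import randint, uniform

param_distributions = {
    'num_leaves': randint(16, 128),
    'learning_rate': uniform(0.01, 0.2),
    'min_child_samples': randint(20, 500),
    'subsample': uniform(0.5, 0.5),
    'colsample_bytree': uniform(0.5, 0.5),
}

lgb_clf = lgb.LGBMClassifier(objective='binary', is_unbalance=True, n_estimators=500)

random_search = RandomizedSearchCV(
    lgb_clf,
    param_distributions=param_distributions,
    n_iter=25,
    scoring='roc_auc',
    cv=3,
    random_state=42,
    n_jobs=-1,
)
random_search.fit(X_train, y)
print(random_search.best_params_)
```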

random_search.best_params_

Best-Params

LightGBM Tuned

LightGBM Tuned total: 1h 6min 3s

Saving the results

4- Evaluation

Normalized Gini Coefficient

The Gini index or Gini coefficient is a statistical measure of distribution that was developed by the Italian statistician Corrado Gini in 1912. It is used as a gauge of economic inequality, measuring income distribution among a population.

The Gini coefficient is equal to the area below the line of perfect equality (0.5 by definition) minus the area below the Lorenz curve, divided by the area below the line of perfect equality. In other words, it is double the area between the Lorenz curve and the line of perfect equality.

Gini is a vital metric for insurers because the main concern is separating high and low risks rather than predicting losses. This information is used to price insurance risk; charging customers exactly their predicted loss would not account for expenses and profit.

Since this is a classification task, AUC was used as the metric; it is equivalent to using Gini since:

$$ Gini = 2 * AUC - 1 $$

In order to compute the Gini coefficient, we apply two integrals over the cumulative proportion of the positive class:

$$A=\int_{0}^{1}F(x)\,dx\approx \sum_{i=1}^{n}F_{i}(x)\times \frac{1}{n}$$

$$B=\int_{0}^{1}x \, dx\approx \sum_{i=1}^{n}\frac{i}{n} \times \frac{1}{n}= \frac{1}{n^{2}}\times \frac{n\times (n+1)}{2}\approx 0.5$$

$$Gini\ coeff = A-B=\frac{1}{n}\left(\sum_{i=1}^{n}F_{i}(x) -\frac{n+1}{2} \right)$$
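For illustration, the formula translates directly into code (a sketch, assuming y_true holds the 0/1 labels and y_prob the predicted probabilities; the normalized version agrees with 2·AUC − 1 up to ties in the predictions):

```python
def gini_from_ranking(y_true, y_prob):
    """Gini coefficient computed from the cumulative positive-class curve."""
    order = np.argsort(-np.asarray(y_prob))          # sort by decreasing score
    y_sorted = np.asarray(y_true)[order]
    n = len(y_sorted)
    cum_pos = np.cumsum(y_sorted) / y_sorted.sum()   # F_i(x) in the formula
    return cum_pos.sum() / n - (n + 1) / (2 * n)     # A - B

# Normalizing by the Gini of a perfect ranking recovers 2*AUC - 1:
# gini_from_ranking(y, p) / gini_from_ranking(y, y) ~= 2*roc_auc_score(y, p) - 1
```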

Reference:

https://theblog.github.io/post/gini-coefficient-intuitive-explanation/

Because both classes matter, given the risk-segregation aspect of the Gini coefficient mentioned above, the evaluation metric chosen is AUC.

LightGBM

https://en.wikipedia.org/wiki/Youden%27s_J_statistic
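The link above points to Youden's J statistic, which can be used to choose a probability threshold from the ROC curve; a minimal sketch, assuming the out-of-fold labels and predictions from the cross-validation loop are available:

```python
from sklearn.metrics import roc_curve

# Youden's J = TPR - FPR; pick the threshold where it is maximal
fpr, tpr, thresholds = roc_curve(y, oof_pred)
best_idx = np.argmax(tpr - fpr)
best_threshold = thresholds[best_idx]
print(f"Best threshold by Youden's J: {best_threshold:.3f}")
```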

XGBoost

LightGBM - Tuned Model

ROC Area Under Curve (AUC) Score

All three models' performance (AUC) was very similar, and Random Search hyper-parameter optimization only increased the AUC by 0.01. One interesting point: LightGBM (first version) and XGBoost achieved the same score, but with almost a 3.5-hour difference in training time.

Comparative table

| Model | Normalized Gini | AUC | Training time |
|---|---|---|---|
| LightGBM | 0.33 | 0.67 | 6min 43s |
| XGBoost | 0.33 | 0.67 | 3h 34min 42s |
| LightGBM Tuned | 0.35 | 0.68 | 1h 6min 3s |

5- Kaggle Submission


GitHub repository: https://github.com/pessini/insurance-claim-prediction/
Author: Leandro Pessini