This notebook is part of a GitHub repository: https://github.com/pessini/insurance-claim-prediction/
MIT Licensed
Author: Leandro Pessini

Porto Seguro’s Safe Driver Prediction - Kaggle

Predict if a driver will file an insurance claim next year


1- Introduction

Porto Seguro

Porto Seguro is one of the largest insurance companies in Brazil, specializing in car and home insurance. Headquartered in São Paulo, Porto Seguro has been one of the leading insurers in Brazil since its foundation in 1945.

A key challenge faced by all major car insurers is fairness towards good drivers: those with a clean driving history should not be penalized on account of a few bad drivers. Inaccurate claim predictions raise the cost of car insurance for good drivers and reduce it for bad ones.

Porto Seguro has been applying Machine Learning for more than 20 years and intends to make car insurance more accessible to everyone.

Kaggle

Kaggle is an online community of data scientists where users can find and publish data sets, explore and build machine learning models, and enter competitions to solve data science challenges.

In this competition, the challenge is to build a model that predicts the probability that a car insurance policyholder will file a claim next year.

Data Description

In the train and test data:


Loading Dataset

As per the competition's data description, there are a few calculated features. One of the Kaggle discussions highlighted that some kind of transformation was applied to generate them. I will drop these features and apply my own transformations using my best judgment.
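The drop described above can be sketched as follows. This is a minimal illustration assuming the dataset's actual naming convention, where calculated features carry `_calc_` in the column name; the toy columns stand in for the real training data.

```python
import pandas as pd

# Toy frame mimicking the Porto Seguro column naming convention.
df = pd.DataFrame({
    'ps_ind_01': [1, 2],
    'ps_calc_01': [0.5, 0.7],
    'ps_calc_02_bin': [0, 1],
    'target': [0, 1],
})

# Calculated features are flagged by '_calc_' in the name; drop them all at once.
calc_cols = [c for c in df.columns if '_calc_' in c]
df = df.drop(columns=calc_cols)
print(list(df.columns))  # ['ps_ind_01', 'target']
```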

2- Preprocessing & Feature Engineering

Target variable distribution

For the target variable, 1 means that a claim was filed and 0 that it was not.

The target feature has a severely imbalanced distribution: only 3.6% of policyholders filed a claim, while 96.4% did not.

This will be handled by the algorithm via the hyperparameter is_unbalance = True.
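A minimal sketch of such a parameter set is shown below. `is_unbalance` is a genuine LightGBM option that re-weights the minority class instead of resampling; the other values are illustrative assumptions, not the notebook's tuned settings.

```python
# Sketch of a LightGBM parameter set for the imbalanced target.
# 'is_unbalance' is a real LightGBM option; the remaining values are
# illustrative placeholders, not the notebook's actual configuration.
params = {
    'objective': 'binary',    # binary classification: claim / no claim
    'metric': 'auc',          # ranking metric, robust to class imbalance
    'is_unbalance': True,     # re-weight classes instead of resampling
    'learning_rate': 0.05,
}

# Usage (assuming `train_set` is a lightgbm.Dataset):
#   booster = lgb.train(params, train_set, num_boost_round=500)
```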

Missing values

Values of -1 indicate that the feature was missing from the observation.

Only ps_car_03_cat and ps_car_05_cat have a large proportion (roughly 50% or more) of missing values.
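Since missing values are encoded as -1, the missing fraction per column can be computed with a simple comparison. A minimal sketch on toy data (the real counts come from the full training set):

```python
import pandas as pd

# Toy frame: -1 encodes a missing value, as in the competition data.
df = pd.DataFrame({
    'ps_car_03_cat': [-1, -1, 1, -1],
    'ps_car_05_cat': [0, -1, -1, 1],
    'ps_ind_02_cat': [1, 2, 1, -1],
})

# Fraction of -1 entries per column, highest first.
missing_frac = (df == -1).mean().sort_values(ascending=False)
print(missing_frac)
```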


To make data management easier, meta-information about the variables is added to a DataFrame. It will help with handling those variables later during analysis, data visualization, and modeling.

We do not have information on which features are ordinal, so a meta-level of numerical will be added in order to apply normalization later.
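The meta-info table can be built from the dataset's naming convention (`_bin` suffix for binary, `_cat` for nominal, everything else treated as numerical, per the paragraph above). This is a hypothetical sketch; the column list is a small stand-in and the role/level labels are one reasonable scheme, not necessarily the notebook's exact one.

```python
import pandas as pd

# Stand-in column list following the dataset's naming convention.
df = pd.DataFrame(columns=['id', 'target', 'ps_ind_06_bin',
                           'ps_car_01_cat', 'ps_reg_01'])

rows = []
for col in df.columns:
    # Role: target, id, or model input.
    if col == 'target':
        role = 'target'
    elif col == 'id':
        role = 'id'
    else:
        role = 'input'
    # Level inferred from the suffix; ordinal vs interval is unknown,
    # so everything unsuffixed is labeled 'numerical'.
    if col.endswith('_bin') or col == 'target':
        level = 'binary'
    elif col.endswith('_cat'):
        level = 'nominal'
    else:
        level = 'numerical'
    rows.append({'varname': col, 'role': role, 'level': level})

meta = pd.DataFrame(rows).set_index('varname')
```

A per-role/level count then falls out of `meta.groupby(['role', 'level']).size()`.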

Number of variables per role and level

Exploratory Analysis

There are strong correlations between the variables:

The heatmap showed a low number of correlated variables; we'll look at three highly correlated pairs separately.

NOTE: sampling was applied to speed up the process.
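The sampled correlation computation can be sketched like this. The toy frame below is synthetic (two columns built from a shared signal so they correlate strongly, mirroring pairs such as ps_car_12/ps_car_13); the sampling fraction is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data: ps_car_12 and ps_car_13
# share a latent signal, so they are strongly correlated by construction.
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
df = pd.DataFrame({
    'ps_car_12': x + rng.normal(scale=0.2, size=n),
    'ps_car_13': x + rng.normal(scale=0.2, size=n),
    'ps_reg_01': rng.normal(size=n),
})

# Sample rows to speed up plotting/correlation on the full dataset.
sample = df.sample(frac=0.5, random_state=42)
corr = sample.corr()
print(corr.loc['ps_car_12', 'ps_car_13'])
```

With the real data, `corr` would feed directly into a seaborn heatmap or pairwise scatter plots.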

ps_car_12 x ps_car_13

ps_reg_01 x ps_reg_03

ps_reg_02 x ps_reg_03

As the number of correlated variables is rather low, dimensionality reduction will not be applied; the model will do the heavy lifting.

Binary features

Distribution of the binary features and the corresponding values of the target variable.
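The quantity behind these plots is the claim rate within each value of a binary feature, which a groupby mean captures directly. A minimal sketch on toy data (the feature name matches the dataset's convention; the values are made up):

```python
import pandas as pd

# Toy data: one binary feature and the binary target.
df = pd.DataFrame({
    'ps_ind_06_bin': [0, 0, 1, 1, 1, 0],
    'target':        [0, 1, 0, 0, 1, 0],
})

# Mean of a 0/1 target within each feature value = share that filed a claim.
claim_rate = df.groupby('ps_ind_06_bin')['target'].mean()
print(claim_rate)
```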