This notebook is part of a GitHub repository: https://github.com/pessini/moby-bikes
MIT Licensed
Author: Leandro Pessini
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import numpy as np
import scipy.stats as stats
from patsy import dmatrices
# statsmodels
from statsmodels.stats import diagnostic as diag
import statsmodels.api as sm
from statsmodels.formula.api import glm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
%matplotlib inline
import warnings
warnings.simplefilter('ignore', FutureWarning)
%reload_ext watermark
%watermark -a "Leandro Pessini" -n -u -v -iv -w
df_train = pd.read_csv('../data/processed/df_train.csv')
df_test = pd.read_csv('../data/processed/df_test.csv')
all_data = pd.read_csv('../data/processed/rentals_data.csv')
Influential Points (not all unusual observations have much of an impact on the model) - https://www.kirenz.com/post/2021-11-14-linear-regression-diagnostics-in-python/linear-regression-diagnostics-in-python/
expr = \
""" count ~ temp_bin + wind_speed_group + rhum_bin + rainfall_intensity + holiday + timesofday"""
y_train, X_train = dmatrices(expr, df_train, return_type = 'dataframe')
nb_training_results = sm.GLM(y_train, X_train, family = sm.families.NegativeBinomial(alpha = 1)).fit(maxiter=5000, method='nm', cov_type='HC3')
We can use influence plots to identify observations whose independent-variable values are unusual compared to the rest of the data.
Influence plots show the (externally) studentized residuals vs. the leverage of each observation:
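A minimal sketch of that leverage vs. studentized residuals view, assuming the GLMInfluence object exposes hat_matrix_diag and resid_studentized (as in recent statsmodels releases); infl_sketch is a name introduced here for illustration:
# Sketch: scatter leverage (hat values) against studentized residuals
infl_sketch = nb_training_results.get_influence(observed=False)
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(infl_sketch.hat_matrix_diag, infl_sketch.resid_studentized, alpha=0.5)
ax.set_xlabel('Leverage (hat values)')
ax.set_ylabel('Studentized residuals')
ax.set_title("Influence plot - leverage vs. studentized residuals")
plt.show()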
infl = nb_training_results.get_influence(observed=False)
fig, ax = plt.subplots(figsize=(10, 8))
infl.plot_index(y_var="cooks", threshold=3 * infl.cooks_distance[0].mean(),
ax=ax, title="Influence plots - Cook's Distance")
fig.tight_layout(pad=1.0)
plt.axhline(4 / df_train.shape[0], color='r', linestyle='--')
plt.show()
A data point with a large Cook’s Distance strongly influences the fitted values. A common rule of thumb is that any point with a Cook’s Distance over 4/n (where n is the total number of data points) is a potential outlier worth investigating.
# obtain Cook's distance
model_cooksd = infl.cooks_distance[0]
threshold_c = 4 / df_train.shape[0]
out_d = model_cooksd > threshold_c
subset_data = df_train[~out_d]
print(f'Number of possible outliers: {df_train.shape[0] - subset_data.shape[0]}')
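To see which rows drive those exceedances, the influence measures can be collected into a single table; this is a sketch assuming GLMInfluence.summary_frame() with a 'cooks_d' column, as in recent statsmodels releases:
# Sketch: tabulate influence measures and inspect the most influential rows
infl_summary = infl.summary_frame()
print(infl_summary.sort_values('cooks_d', ascending=False).head(10))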
fig, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(data=df_train,
y="count",
orient="v",ax=ax)
plt.show()
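The hourly cutoff used below (14 rentals) sits around the boxplot's upper whisker; as a sketch (not the cutoff's documented origin), the Tukey upper fence can be computed explicitly:
# Sketch: Tukey upper fence (Q3 + 1.5*IQR) for hourly rental counts
q1, q3 = df_train['count'].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
print(f'Tukey upper fence: {upper_fence}')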
hourlyDataOutliers = df_train[df_train['count'] > 14]
hourlyDataWithoutOutliers = df_train[df_train['count'] <= 14]
print(f"Outliers (#): {round(hourlyDataOutliers.shape[0], 2)}")
print(f"Outliers (%): {round((hourlyDataOutliers.shape[0] / df_train.shape[0])*100, 2)}%")
hourlyDataOutliers['holiday'].value_counts()
hourlyDataOutliers['working_day'].value_counts(normalize=True)
hourlyDataOutliers['season'].value_counts()
The unusual observations do not seem to be related to holidays or weekends (working_day).
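One quick check is to compare the working_day split among the flagged hours with the split in the full training set:
# Compare working_day proportions: flagged hours vs. all training hours
print(hourlyDataOutliers['working_day'].value_counts(normalize=True))
print(df_train['working_day'].value_counts(normalize=True))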
daily_count = df_train.groupby('date')['count'].sum().reset_index()
daily_count_no_outliers = hourlyDataWithoutOutliers.groupby('date')['count'].sum().reset_index()
sns.boxplot(data=daily_count,y="count",orient="v")
daily_count[daily_count['count'] < 20]
The "outlier" showed on the boxplot is on Christmas Day.
daily_count.loc[295:302]
mean_daily = daily_count['count'].mean()
std_dev_daily = daily_count['count'].std()
slice_daily = mean_daily + (2*std_dev_daily)
daily_count[daily_count['count'] > slice_daily]
march_07 = df_train[df_train['date'] == '2021-03-07']
march_07[['hour', 'rain', 'temp', 'rhum', 'wdsp', 'season', 'timesofday','working_day', 'count']]
april_25 = df_train[df_train['date'] == '2021-04-25']
april_25[['hour', 'rain', 'temp', 'rhum', 'wdsp', 'season', 'timesofday', 'count']]
daily_count.describe()
print(f"Mean of daily rentals = {round(daily_count['count'].mean(), 2)}")
print(f"Mean of daily rentals without Outliers = {round(daily_count_no_outliers['count'].mean(), 2)}")
hourlyDataOutliers = hourlyDataOutliers.join(daily_count.set_index('date'), on='date', lsuffix='_hour', rsuffix='_day')
outliers_df = hourlyDataOutliers[hourlyDataOutliers['count_day'] > 110].sort_values(by='count_day', ascending=False)
outliers_df.head()
hourlyDataOutliers.to_csv('../data/interim/outliers.csv')