This notebook is part of a GitHub repository: https://github.com/pessini/moby-bikes
MIT Licensed
Author: Leandro Pessini

Time Series

In [25]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import numpy as np
import scipy.stats as stats
from patsy import dmatrices

# statsmodels
from statsmodels.stats import diagnostic as diag
import statsmodels.api as sm
from statsmodels.formula.api import glm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

%matplotlib inline
import warnings
warnings.simplefilter('ignore', FutureWarning)
In [26]:
%reload_ext watermark
%watermark -a "Leandro Pessini" -n -u -v -iv -w
Author: Leandro Pessini

Last updated: Tue May 17 2022

Python implementation: CPython
Python version       : 3.9.6
IPython version      : 8.3.0

matplotlib : 3.4.2
seaborn    : 0.11.1
statsmodels: 0.13.2
numpy      : 1.21.1
sys        : 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:36:15) 
[Clang 11.1.0 ]
scipy      : 1.8.0
pandas     : 1.3.0

Watermark: 2.3.0

In [39]:
df_train = pd.read_csv('../data/processed/df_train.csv', parse_dates=['date'])
df_test = pd.read_csv('../data/processed/df_test.csv', parse_dates=['date'])

Hourly number of rentals

In [28]:
fig, ax = plt.subplots(nrows=2,ncols=2,figsize=(26, 12))

sns.lineplot(data=df_train, x='hour', y='count', hue='season', hue_order=['Winter','Spring','Summer','Autumn'], ax=ax[0][0], ci=None)
ax[0][0].set(xlabel='Hour', ylabel='', title='Seasons')

sns.lineplot(data=df_train, x='hour', y='count', hue='dayofweek', hue_order=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'], ax=ax[1][0], ci=None)
ax[1][0].set(xlabel='Hour', ylabel='', title='Days of week')

sns.pointplot(data=df_train, x="hour", y="count", hue="working_day", ax=ax[0][1], ci=None)
ax[0][1].set(xlabel='Hour', ylabel='', title='Working Days')

sns.pointplot(data=df_train, x='hour', y='count', hue='timesofday', hue_order=['Morning','Afternoon','Evening','Night'], ax=ax[1][1], ci=None)
ax[1][1].set(xlabel='Hour', ylabel='', title='Times of the Day')

plt.show()

When we plot the number of rentals for each hour of the day, we notice that no discrepancy is found across different Seasons or Weekends/Holidays, and even weekdays follow a similar pattern throughout the day.

Monthly Data

In [29]:
monthly_data = df_train.groupby('month')['count'].agg('sum').reset_index(name='total_rentals')
month_map = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun', 7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'}
monthly_data['monthly_data_str'] = monthly_data['month'].map(lambda d : month_map[d])
monthly_stats = f"Mean = {round(monthly_data['total_rentals'].mean(), 2)} \nStd. Dev = {round(monthly_data['total_rentals'].std(), 2)}"
monthly_data = monthly_data[['monthly_data_str', 'total_rentals']]
fig, ax = plt.subplots(1, 2, gridspec_kw={'width_ratios': [1, 3]}, figsize=(10,6), tight_layout=True)
ax[0].axis('off')
ax[0].axis('tight')
table = ax[0].table(cellText=monthly_data.values, colLabels=['Month','Total Rentals'], cellLoc='center', loc='center', colColours =["#eb826b"] * 2)
table.auto_set_font_size(False)
table.set_fontsize(12)
table.auto_set_column_width(col=list(range(len(monthly_data.columns)))) 
table.scale(3,2)
monthly_plot = sns.pointplot(x=monthly_data["monthly_data_str"], y=monthly_data["total_rentals"], ax=ax[1])
ax[1].set(ylim=(0,max(monthly_data["total_rentals"]+100)), xlabel='Month', ylabel='Total Rentals')
monthly_plot.annotate(monthly_stats, xy=(3, 1500), fontsize=16, fontfamily='monospace')
fig.tight_layout()
plt.show()
In [38]:
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   rain                8760 non-null   float64
 1   temp                8760 non-null   float64
 2   rhum                8760 non-null   int64  
 3   wdsp                8760 non-null   int64  
 4   date                8760 non-null   object 
 5   hour                8760 non-null   int64  
 6   day                 8760 non-null   int64  
 7   month               8760 non-null   int64  
 8   year                8760 non-null   int64  
 9   count               8760 non-null   int64  
 10  holiday             8760 non-null   bool   
 11  dayofweek_n         8760 non-null   int64  
 12  dayofweek           8760 non-null   object 
 13  working_day         8760 non-null   bool   
 14  season              8760 non-null   object 
 15  peak                8760 non-null   bool   
 16  timesofday          8760 non-null   object 
 17  rainfall_intensity  8760 non-null   object 
 18  wind_bft            8760 non-null   int64  
 19  wind_speed_group    8760 non-null   object 
 20  temp_r              8760 non-null   int64  
 21  temp_bin            8760 non-null   float64
 22  rhum_bin            8760 non-null   float64
dtypes: bool(3), float64(4), int64(10), object(6)
memory usage: 1.4+ MB

Daily Data - Time Series

In [56]:
def daily_data(df):
    # keep the day-level attributes to re-attach after aggregating
    slice_daily_data = df[['date', 'holiday', 'dayofweek', 'working_day', 'season']]
    # total rentals per day
    df = df.groupby('date')['count'].agg('sum').reset_index(name='daily_rentals')
    time_series_df = pd.merge(df, slice_daily_data, on='date', how='left')
    # the merge repeats each date once per hourly row; keep one row per day
    time_series_df.drop_duplicates(subset='date', keep='first', inplace=True)
    time_series_df.set_index('date', inplace=True)
    return time_series_df
In [57]:
daily_rentals_train = df_train.copy()
daily_rentals_test = df_test.copy()
time_series_df_train = daily_data(daily_rentals_train)
time_series_df_test = daily_data(daily_rentals_test)
# flag each split so the combined frame can be separated again below
time_series_df_train['type'] = 'train'
time_series_df_test['type'] = 'test'
final_df = pd.concat([time_series_df_train, time_series_df_test], axis=0)
train_df, test_df = final_df[final_df['type'] == 'train'], final_df[final_df['type'] == 'test']
In [71]:
fig, ax = plt.subplots(figsize=(22, 12))
sns.lineplot(x=train_df.index, 
              y=train_df["daily_rentals"], ax=ax)
sns.lineplot(x=test_df.index, 
              y=test_df["daily_rentals"], ax=ax)
ax.axhline(45, ls='--', color='grey')
ax.set(xlabel='Time', ylabel='Rentals Count')
ax.set(ylim=(10,max(final_df["daily_rentals"])+10))
ax.set_title("Time Series on Daily Rentals", fontsize=20)
plt.legend(['train','test'])
plt.show()
In [73]:
train_df[train_df['daily_rentals'] < 45]
Out[73]:
            daily_rentals  holiday  dayofweek  working_day  season   type
date
2021-03-31             42    False  Wednesday         True  Spring  train
2021-05-20             44    False   Thursday         True  Spring  train
2021-07-31             38    False   Saturday        False  Summer  train
2021-08-31             42    False    Tuesday         True  Summer  train
2021-10-31             32    False     Sunday        False  Autumn  train
2021-11-30             31    False    Tuesday         True  Autumn  train
2021-12-06             44    False     Monday         True  Autumn  train
2021-12-08             39    False  Wednesday         True  Autumn  train
2021-12-10             40    False     Friday         True  Autumn  train
2021-12-19             42    False     Sunday        False  Autumn  train
2021-12-24             44    False     Friday         True  Winter  train
2021-12-25             14     True   Saturday        False  Winter  train
2021-12-26             33     True     Sunday        False  Winter  train
2021-12-27             34    False     Monday         True  Winter  train
2021-12-28             42    False    Tuesday         True  Winter  train
2021-12-30             40    False   Thursday         True  Winter  train
2021-12-31             31    False     Friday         True  Winter  train
2022-01-01             43    False   Saturday        False  Winter  train
2022-01-02             38    False     Sunday        False  Winter  train
2022-01-08             39    False   Saturday        False  Winter  train
2022-01-09             44    False     Sunday        False  Winter  train
2022-01-31             38    False     Monday         True  Winter  train
2022-02-20             34    False     Sunday        False  Winter  train
2022-02-28             34    False     Monday         True  Winter  train
2022-02-28 34 False Monday True Winter train

Poisson/NB vs Time series of count data

If a time series of count data is generated by a Poisson point process, then event occurrences in successive time intervals are independent. Independence is a reasonable assumption when the underlying stochastic process for events, conditional on covariates, has no memory; in that case there is no need for special time series models. For example, the number of deaths (or births) in a region may be uncorrelated over time, while the population, which accumulates births and deaths, will be very highly correlated over time.

The first step for time series count data is therefore to test for serial correlation.

In [32]:
expr = """ count ~ temp_r + wdsp + rhum + rain + holiday"""
y_train, X_train = dmatrices(expr, df_train, return_type = 'dataframe')
nb_training_results = sm.GLM(y_train, X_train, family = sm.families.NegativeBinomial()).fit(maxiter=5000, method='nm', cov_type='HC3')
print(nb_training_results.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  count   No. Observations:                 8760
Model:                            GLM   Df Residuals:                     8754
Model Family:        NegativeBinomial   Df Model:                            5
Link Function:                    Log   Scale:                          1.0000
Method:                            nm   Log-Likelihood:                -20689.
Date:                Tue, 17 May 2022   Deviance:                       7382.9
Time:                        09:48:08   Pearson chi2:                 5.82e+03
No. Iterations:                    88   Pseudo R-squ. (CS):             0.1612
Covariance Type:                  HC3                                         
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           3.3304      0.087     38.351      0.000       3.160       3.501
holiday[T.True]    -0.2301      0.079     -2.909      0.004      -0.385      -0.075
temp_r              0.0399      0.002     19.343      0.000       0.036       0.044
wdsp                0.0004      0.002      0.168      0.866      -0.004       0.005
rhum               -0.0305      0.001    -34.769      0.000      -0.032      -0.029
rain               -0.0834      0.035     -2.371      0.018      -0.152      -0.014
===================================================================================
In [33]:
# print(f'AIC: {round(nb_training_results.aic,2)} \nBIC: {round(nb_training_results.bic,2)}')

Durbin-Watson statistic

https://www.statology.org/durbin-watson-test-python/

The Durbin-Watson statistic always has a value between 0 and 4. A value of 2.0 indicates that no autocorrelation is detected in the sample. Values from 0 to less than 2 point to positive autocorrelation, and values from 2 to 4 point to negative autocorrelation.
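For reference, the statistic can be computed directly from successive residual differences; this minimal sketch (not part of the original notebook) reproduces statsmodels' durbin_watson:

```python
import numpy as np

# DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2 -- values near 2 mean little autocorrelation
def durbin_watson_manual(resid):
    resid = np.asarray(resid)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# e.g. durbin_watson_manual(nb_training_results.resid_response) should match
# statsmodels.stats.stattools.durbin_watson on the same residuals
```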

In [34]:
# The response residuals are defined as endog - fittedvalues

# The hypotheses followed for the Durbin Watson statistic:
# H(0) = First-order autocorrelation does not exist.
# H(1) = First-order autocorrelation exists.

print(f'Durbin-Watson test for serial correlation: {round(durbin_watson(nb_training_results.resid_response),2)}')
Durbin-Watson test for serial correlation: 1.08

Possible positive autocorrelation

A rule of thumb is that DW test statistic values in the range of 1.5 to 2.5 are relatively normal. Values outside this range could, however, be a cause for concern. Field (2009) suggests that values under 1 or above 3 are a definite cause for concern.

Field, A. (2009). Discovering Statistics Using SPSS, 3rd Edition (Introducing Statistical Methods) (3rd ed.). SAGE Publications Ltd.

In [35]:
hourlyTM = df_train[['date','hour','count']].copy()
# combine the date and hour columns into a single datetime index
hourlyTM['datetime'] = pd.to_datetime(hourlyTM.date) + pd.to_timedelta(hourlyTM.hour, unit='h')
hourlyTM.set_index('datetime', inplace=True)
# drop the last 25 hourly observations
hourlyTM.drop(hourlyTM.tail(25).index, inplace=True)

Residual Autocorrelation and Partial autocorrelation

Autocorrelation occurs when the residuals are not independent of each other.

In [36]:
fig = plt.figure(figsize=(18,10))
gs = fig.add_gridspec(2, 2)
ax0 = fig.add_subplot(gs[0, :])
ax1 = fig.add_subplot(gs[1, :])
plot_acf(nb_training_results.resid_response, ax=ax0)
plot_pacf(nb_training_results.resid_response, ax=ax1)
plt.show()
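The visual ACF/PACF check can be complemented with a formal test. Below is a minimal sketch (not part of the original notebook) using the Ljung-Box test from statsmodels.stats.diagnostic, which is already imported above as diag:

```python
# Ljung-Box test: H0 = the residuals are independently distributed
# (no autocorrelation up to the tested lag)
lb_test = diag.acorr_ljungbox(nb_training_results.resid_response, lags=[24])
print(lb_test)  # a small lb_pvalue suggests serial correlation remains
```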

How to Handle Autocorrelation

If you reject the null hypothesis and conclude that autocorrelation is present in the residuals, then you have a few different options to correct this problem if you deem it to be serious enough:

  1. For positive serial correlation, consider adding lags of the dependent and/or independent variable to the model (a sketch follows this list).

  2. For negative serial correlation, check to make sure that none of your variables are overdifferenced.

  3. For seasonal correlation, consider adding seasonal dummy variables to the model.
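As a hedged illustration of option 1 (not part of the original analysis), a lagged copy of the dependent variable can be added as a regressor; count_lag1 is a hypothetical feature name, and the sketch assumes df_train is in chronological order:

```python
# Add a 1-hour lag of the target as a regressor (assumes chronological order)
df_lagged = df_train.sort_values(['date', 'hour']).copy()
df_lagged['count_lag1'] = df_lagged['count'].shift(1)  # hypothetical lag feature
df_lagged = df_lagged.dropna(subset=['count_lag1'])

expr_lag = """count ~ count_lag1 + temp_r + wdsp + rhum + rain + holiday"""
y_lag, X_lag = dmatrices(expr_lag, df_lagged, return_type='dataframe')
nb_lag = sm.GLM(y_lag, X_lag, family=sm.families.NegativeBinomial()).fit()
print(durbin_watson(nb_lag.resid_response))  # closer to 2 if the lag absorbs the correlation
```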


Time series modeling allows us to replicate every element of the process by decomposing the mathematical process into a combination of signals (e.g. growth in demand, seasonal variability, etc.) and noise (random probabilistic processes), without necessarily knowing the underlying causes for each.
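A minimal sketch of that decomposition (not part of the original notebook; period=7 assumes a weekly cycle in the daily series):

```python
from statsmodels.tsa.seasonal import seasonal_decompose

# Split daily rentals into trend + seasonal + residual components (additive model)
decomposition = seasonal_decompose(train_df['daily_rentals'], model='additive', period=7)
decomposition.plot()
plt.show()
```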

Considerations

  • Count has a fixed range due to the limited number of bikes available (e.g. we cannot have more than 30 rentals if Moby Bikes only has 30 bikes, and for a new rental to begin a bike needs to become available, i.e. the current rental must finish).
  • The duration of rentals is relevant and directly impacts the number of rentals per day.
  • Because our target feature has a limited range of occurrence, decision trees can be applied, as they can only predict within the training range.
  • Hourly data requires an approach that can account for autocorrelation. Temperature, Humidity and Wind Speed do not vary randomly across hours: if the temperature is 25C in one hour, the next hour's temperature is very likely to be close to that. (time series?)
  • Capturing the relationship in these variables as continuous values can be hard for machine learning algorithms because the range is too wide. Temperatures of 13.4C and 13.9C, or 13C and 15C, are practically the same when deciding whether to go cycling or not. The same applies to humidity and wind speed. Transforming these variables into categoricals can make them more relevant to the algorithms.
  • Consider stacking models, e.g. taking a weighted average of Random Forest and GBM predictions:

```python
# Stacking Random Forest and GBM
rf_pred = predict_on_test_set(rf_model, rf_cols)
gbm_pred = predict_on_test_set(gbm_model, gbm_cols)
# taking weighted average of output from two models
y_pred = np.round(.20*rf_pred + .80*gbm_pred)
```

  • Consider transforming all features into categories, or merging several into one (e.g. good weather = high temperature + weak wind speed + light or no rain):

```python
df['ideal'] = df[['temp', 'windspeed']].apply(lambda x: (0, 1)[x['temp'] > 27 and x['windspeed'] < 30], axis=1)
df['sticky'] = df[['humidity', 'workingday']].apply(lambda x: (0, 1)[x['workingday'] == 1 and x['humidity'] >= 60], axis=1)
```

References:

https://blog.paperspace.com/introduction-time-series-analysis/

https://www.machinelearningplus.com/time-series/time-series-analysis-python/

https://towardsdatascience.com/how-to-predict-your-step-count-for-next-week-a16b7800b408
