This notebook is part of a GitHub repository: https://github.com/pessini/moby-bikes
MIT Licensed
Author: Leandro Pessini
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from datetime import datetime
%matplotlib inline
import warnings
warnings.simplefilter('ignore', FutureWarning)
The rental dataset is provided by Moby Bikes through a public API, and the weather dataset by Met Éireann (the Irish meteorological service) through a public API.
- HarvestTime: Data retrieval timestamp
- BikeID: Unique bike ID used to rent a bike
- Battery: Battery status (maximum distance in km)
- BikeIdentifier: Bike identifier (generally contains only digits, but may contain characters)
- BikeTypeName: Bike type name
- EBikeProfileID: E-bike profile ID (every e-bike profile has a defined geofence, i.e. allowed riding areas)
- EBikeStateID: E-bike state ({1: 'Warning - is in move and not rented', 2: 'Normal', 3: 'Switched Off', 4: 'Firmware Upgrade', 5: 'Laying on the ground'})
- IsEBike: Whether it is an electric bike (the bike sends messages to the backend if it is equipped with electronics, Bluetooth, etc.)
- IsMotor: Whether the bike has an engine
- IsSmartLock: Whether the bike has a smart lock
- LastGPSTime: Last valid GPS message
- LastRentalStart: Last time the bike was rented
- Latitude: Bike coordinates if the bike is locked out of a station
- Longitude: Bike coordinates if the bike is locked out of a station
- SpikeID: Might be used instead of BikeID to rent a bike

date_columns = ['HarvestTime','LastGPSTime', 'LastRentalStart']
historical_data = pd.read_csv('../data/raw/historical_data.csv', parse_dates=date_columns)
historical_data.info()
historical_data.columns = historical_data.columns.str.lower()
historical_data.head()
print(f'Total number of rows: {historical_data.shape[0]}')
print(f'Total number of columns: {historical_data.shape[1]}')
historical_data.isnull().sum()
historical_data['isebike'].value_counts()
historical_data['ismotor'].value_counts()
historical_data['issmartlock'].value_counts()
`isebike`, `ismotor` and `issmartlock` each contain only a single value. For that reason, they will not be used further in the analysis.
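A quick, generic way to confirm this is to look for columns with a single unique value. The snippet below is a sketch added for illustration, not part of the original pipeline:

# columns whose values never vary carry no signal for the analysis
constant_cols = [col for col in historical_data.columns
                 if historical_data[col].nunique(dropna=False) <= 1]
print(constant_cols)  # expected to include isebike, ismotor and issmartlock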
EBike State indicates:

1. Warning - bike is in move and not rented
2. Normal
3. Switched Off
4. Firmware Upgrade
5. Laying on the ground

Of all the states a bike can report, state 1 is the most alarming, because it can indicate that a rental is over, that the signal failed, or that the rental did not start at all.
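For readability in later steps, the state IDs can be mapped to labels taken from the data dictionary above (an optional helper, not used elsewhere in this notebook):

# label mapping for EBikeStateID, from the data dictionary
ebike_state_labels = {
    1: 'Warning - is in move and not rented',
    2: 'Normal',
    3: 'Switched Off',
    4: 'Firmware Upgrade',
    5: 'Laying on the ground',
}
# e.g. historical_data['ebikestateid'].map(ebike_state_labels).value_counts()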
historical_data['ebikestateid'].value_counts()
# timestamps of rentals that ever reported state 1 (warning) or state 5 (laying on the ground)
idx_state1 = historical_data[historical_data['ebikestateid'] == 1]['lastrentalstart'].tolist()
idx_state5 = historical_data[historical_data['ebikestateid'] == 5]['lastrentalstart'].tolist()
# all rows belonging to rentals that reported state 1 at some point
nrented_df = historical_data[historical_data['lastrentalstart'].isin(idx_state1)]
# 2021-03-31 08:52:13
historical_data[historical_data['lastrentalstart'] == '2021-04-02 12:51:50']
# 2021-04-02 13:00:13
nrented_df[nrented_df['lastrentalstart'] == '2021-04-02 13:00:13']
bikenotparked_df = historical_data[historical_data['lastrentalstart'].isin(idx_state5)]
bikenotparked_df.sort_values(by=['lastrentalstart','lastgpstime'], ascending=False).head()
There are rows with `ebikestateid == 1` (Warning - is in move and not rented) that form a rental with only one row, and others that occur 'within' a rental, which could be due to a malfunction.

If `ebikestateid == 5` is present, only a fine is applied to the customer according to the Terms of Service, which does not affect the analysis of the rental.

When `ebikestateid == 1`, only rentals with a single `lastrentalstart` row will be removed. The logic is that if a bike reports state 1 and has only one row, the rental did not start at all.
# filter out the ones with bike state id 1 and only 1 row
not_rented_list = nrented_df.groupby('lastrentalstart').filter(lambda x: len(x) == 1).index.tolist()
len(not_rented_list)
historical_data_clean = historical_data[~historical_data.index.isin(not_rented_list)]
historical_data.shape[0] - historical_data_clean.shape[0]
Sanity check
assert (historical_data.shape[0] - historical_data_clean.shape[0]) == len(not_rented_list), \
f"Bike state id logic is wrong. Expected {len(not_rented_list)} rows to be removed, but got {historical_data.shape[0] - historical_data_clean.shape[0]}"
historical_data['biketypename'].value_counts()
workshop_df = historical_data_clean[historical_data_clean['biketypename'] == 'Workshop']
private_df = historical_data_clean[historical_data_clean['biketypename'] == 'Private']
healthcare_df = historical_data_clean[historical_data_clean['biketypename'] == 'Healthcare']
date_range = f"{workshop_df['lastrentalstart'].dt.date.min()} to {workshop_df['lastrentalstart'].dt.date.max()}"
date_range
workshop_rentals_dt = workshop_df['lastrentalstart'].drop_duplicates()
workshop_rentals_dt = workshop_rentals_dt.groupby(workshop_rentals_dt.dt.date).count()
fig, ax = plt.subplots(figsize=(10, 6), dpi=100)
sns.lineplot(x=workshop_rentals_dt.index, y=workshop_rentals_dt.values, ax=ax, ci=None)
ax.set(xlabel='Date', ylabel='Total rentals', title='Workshop rentals')
ax.text(datetime(2020,10,20), 7.8, "Tuesday, October 6, 2020", fontsize=12, color='red')
plt.show()
healthcare_df_dt = healthcare_df['lastrentalstart'].drop_duplicates()
healthcare_df_dt = healthcare_df_dt.groupby(healthcare_df_dt.dt.date).count()
fig, ax = plt.subplots(figsize=(10, 6), dpi=100)
sns.pointplot(x=healthcare_df_dt.index, y=healthcare_df_dt.values, ax=ax, ci=None)
ax.set(xlabel='Date', ylabel='Total rentals', title='Healthcare rentals')
plt.xticks(rotation = 45) # Rotates X-Axis Ticks by 45-degrees
plt.show()
workshop_df['bikeid'].value_counts()
private_df['bikeid'].value_counts()
healthcare_df['bikeid'].value_counts()
Moby has one bike with ID 83 which appears to be `Private`, as well as two special rental types called `Workshop` and `Healthcare`. Workshop relates to events in which bikes are allocated either to a specific time period or to an event organized by Moby Bikes to promote its services. The Healthcare type is an e-bike scheme that was offered free of charge to frontline workers.
general_hist_data = historical_data_clean[historical_data_clean['biketypename'] == 'DUB-General']
print(f'Private BIKE ID 83: {general_hist_data[general_hist_data["bikeid"] == 83].shape[0]} observations used on DUB-General')
print(f'Workshop BIKE ID 58: {general_hist_data[general_hist_data["bikeid"] == 58].shape[0]} observations used on DUB-General')
print(f'Healthcare BIKE ID 48: {general_hist_data[general_hist_data["bikeid"] == 48].shape[0]} observations used on DUB-General')
We can see that the bikes used as `Private`, `Workshop` and `Healthcare` are later used again by regular customers. Workshop days also do not have a big impact overall: the maximum number of rentals on a Workshop day was 8. Because the workshop rental volume is not significantly high, it is not worth labelling each rental as workshop/healthcare (true/false). For the reasons discussed above, the data from those rental types will be removed.
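As a quick check (a sketch added here, not part of the original pipeline), the `DUB-General` filter defined above already excludes the special rental types:

# assumption: Workshop, Private and Healthcare are the only special types
special_types = ['Workshop', 'Private', 'Healthcare']
assert not general_hist_data['biketypename'].isin(special_types).any()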
battery_hist_df = general_hist_data.copy()
battery_hist_df.isnull().sum()
battery_hist_df.loc[battery_hist_df['battery'] > 100]
From the battery records there are a few cases to consider. Only one record has `battery > 100`, and there are a few negative ones. To simplify the analysis, the records will be capped to the range 0 ≤ x ≤ 100.

Missing values will not be imputed, as they could simply result from a malfunction while transmitting the data, and guessing them could mislead the analysis.
battery_hist_df.loc[battery_hist_df['battery'] > 100, 'battery'] = 100
negatives_b = battery_hist_df.loc[battery_hist_df['battery'] < 0]
negatives_b.shape[0]
There is a lot of negative or missing battery information, and likewise for Latitude/Longitude. Because battery information is not used for modelling, this feature will not be handled for now. All Latitude/Longitude values of 0 will be replaced with NaN, as shown in the sketch below.

When grouping the data, the full history of these two features will be stored as lists.
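The coordinate cleaning described above can be done with a simple replacement; a minimal sketch, assuming zeros always mark invalid GPS fixes:

# replace placeholder zeros in the coordinates with NaN
battery_hist_df[['latitude','longitude']] = battery_hist_df[['latitude','longitude']].replace(0, np.nan)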
# Transforming null values and negative battery values to -1
battery_hist_df['battery'] = battery_hist_df['battery'].fillna(-1)
battery_hist_df.loc[battery_hist_df['battery'] < 0, 'battery'] = -1
negatives_b_n = battery_hist_df.loc[battery_hist_df['battery'] < 0]
print(f'Previous # of negative battery values: {negatives_b.shape[0]}')
print(f'# of negative values after transformation: {negatives_b_n.shape[0]}')
print(f'Difference: {negatives_b_n.shape[0]-negatives_b.shape[0]}')
rentals = battery_hist_df.copy()
columns_to_drop = ['harvesttime','bikeidentifier','biketypename','ebikeprofileid',
'ebikestateid','isebike','ismotor','issmartlock','spikeid']
rentals.drop(columns_to_drop, axis=1, inplace=True)
rentals.shape
def group_as_list(x):
    d = {'coordinates': x[['latitude','longitude']].values.tolist()}
    d['battery'] = list(x['battery']) # battery readings, newest first (rows pre-sorted by lastgpstime descending)
    d['lastgpstime'] = list(x['lastgpstime'])[0] # most recent GPS time (data previously sorted)
    return pd.Series(d, index=['coordinates', 'battery', 'lastgpstime'])
# also sorting data by lastgpstime
grouped_rentals = rentals.sort_values("lastgpstime", ascending=False).groupby(['lastrentalstart', 'bikeid']).apply(group_as_list).reset_index()
grouped_rentals.tail()
def get_valid_battery(battery_list):
    '''
    Return the first valid (positive) battery value in chronological order, or -1 if none exists.
    The list is stored newest-first, hence the reversed() call.
    '''
    return next((item for item in reversed(battery_list) if item > 0), -1)
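A quick illustration of the helper with hypothetical values (readings stored newest-first):

assert get_valid_battery([20, -1, 95]) == 95  # 95 is the earliest valid reading
assert get_valid_battery([-1, -1]) == -1      # no valid reading at all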
grouped_rentals['start_battery'] = grouped_rentals['battery'].apply(get_valid_battery) # battery level at the start of each rental
grouped_rentals.to_csv('../data/interim/rentals.csv', index=False)
grouped_rentals.shape
Regarding the weather data, there are two important decisions to make: which station to use, and whether to work with daily or hourly data.
# Hourly data from Phoenix Park Station
phoenixpark_weather_hourly = pd.read_csv('../data/raw/hly175.csv', low_memory=False)
phoenixpark_weather_hourly.head()
# Daily data from Phoenix Park Station
phoenixpark_weather_daily = pd.read_csv('../data/raw/dly175.csv')
phoenixpark_weather_daily.head()
# Hourly data from Dublin Airport Station
dublin_airport_weather_hourly = pd.read_csv('../data/raw/hly532.csv', low_memory=False, parse_dates=['date'])
dublin_airport_weather_hourly.head()
print(f'Total number of rows: {dublin_airport_weather_hourly.shape[0]}')
Geographically, the station at Phoenix Park would be the most suitable choice, but unfortunately it does not collect wind information, which in Ireland plays an important role in deciding whether or not to cycle. For those unfamiliar with Irish weather: it rains a lot and there is usually not much choice about it, but wind can keep you from going outside or push you toward a different mode of transport. Heavy rain, though, is not that common.
Daily data might make more sense for the business, but because the weather in Ireland is so unpredictable (it can completely change within an hour), hourly data is the better option from a historical perspective. It is also important to note that the Weather API provides its forecast hourly. For simplicity and better planning, the predicted results can always be aggregated by day.
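As a small illustration of that daily aggregation (the series below is hypothetical, not data from this notebook):

# aggregate hourly predictions into daily totals with resample
hourly_preds = pd.Series(
    [3, 5, 2],
    index=pd.to_datetime(['2022-03-01 08:00', '2022-03-01 09:00', '2022-03-02 10:00'])
)
daily_preds = hourly_preds.resample('D').sum()  # 8 on 2022-03-01, 2 on 2022-03-02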
dublin_airport_weather_hourly.info()
dublin_airport_weather_hourly.tail()
recent_dubairport_data = dublin_airport_weather_hourly.copy()
# drop Met Éireann indicator ('ind*') columns and variables not needed for the analysis
columns_to_drop = ['ind','ind.1','ind.2','ind.3','vappr','msl','ind.4','wddir','ww','w','sun','vis','clht','clamt','wetb','dewpt']
weather_data = recent_dubairport_data.drop(columns=columns_to_drop)
weather_data.isnull().sum()
weather_data.to_csv('../data/interim/hist_weather_data.csv', index=False)
weather_data.tail()
weather_data['dt'] = pd.to_datetime(weather_data['date'].dt.date)
weather_data['hour'] = weather_data['date'].dt.hour
weather_data['day'] = weather_data['date'].dt.day
weather_data['month'] = weather_data['date'].dt.month
weather_data['year'] = weather_data['date'].dt.year
weather_data.drop(columns='date', axis=1, inplace=True)
weather_data.rename(columns={'dt':'date'},inplace=True)
grouped_rentals.shape, weather_data.shape
rentals = grouped_rentals.copy()
weather = weather_data.copy()
rentals['date'] = pd.to_datetime(rentals['lastrentalstart'].dt.date)
rentals['hour'] = rentals['lastrentalstart'].dt.hour
all_data = pd.merge(rentals, weather, on=['date', 'hour'])
all_data.to_csv('../data/interim/all_data.csv', index=False)
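Note that `pd.merge` defaults to an inner join, so hours without any rental drop out of `all_data`; the zero-filled dataset built below restores them. A rough check, assuming the weather table has at most one row per date/hour:

# an inner join on unique weather keys cannot produce more rows than rentals
assert all_data.shape[0] <= rentals.shape[0]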
hourly_rentals = rentals.groupby(['date', 'hour']).size().reset_index(name='count')
hourly_rentals.tail(3)
hourly_data = pd.merge(hourly_rentals, weather, on=['date','hour'])
hourly_data.to_csv('../data/interim/hourly_rentals.csv', index=False)
hourly_data.tail(3)
hourly_data_withzeros = pd.merge(weather, hourly_rentals, on=['date','hour'], how='left')
hourly_data_withzeros['count'] = hourly_data_withzeros['count'].fillna(0).astype(int)
hourly_data_withzeros.to_csv('../data/interim/hourly_data.csv', index=False)
hourly_data_withzeros.tail(3)
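A brief sanity check on the zero-filled dataset (a sketch, assuming each weather hour appears exactly once):

# the left join keeps every weather hour; hours without rentals carry an explicit 0
assert hourly_data_withzeros.shape[0] == weather_data.shape[0]
assert (hourly_data_withzeros['count'] >= 0).all()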
start_train_dt, end_train_dt = datetime(2021, 3, 1), datetime(2022, 2, 28) # train sample
start_test_dt, end_test_dt = datetime(2022, 3, 1), datetime(2022, 4, 30) # test sample
df_train = hourly_data_withzeros[(hourly_data_withzeros['date'] >= start_train_dt) & (hourly_data_withzeros['date'] <= end_train_dt)]
df_test = hourly_data_withzeros[(hourly_data_withzeros['date'] >= start_test_dt) & (hourly_data_withzeros['date'] <= end_test_dt)]
df_train.to_csv('../data/interim/dfsample_train.csv', index=False)
df_test.to_csv('../data/interim/dfsample_test.csv', index=False)
df_train.shape, df_test.shape
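Finally, a quick check (added here, not in the original notebook) that the train and test windows do not overlap:

assert df_train['date'].max() < df_test['date'].min()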
GitHub repository: https://github.com/pessini/moby-bikes
Author: Leandro Pessini