Analyzing Bike Sharing Demand

Recently, I worked on an assignment to analyze the data from bikesharing system to predict its demand. In this post, we will see how the given data can be analyzed using statistical machine learning methods.

The original dataset can be found here on Kaggle.

https://www.kaggle.com/c/bike-sharing-demand

In this notebook, we would be using a small subset of this dataset and just focus on the methodology. The accuracies can’t be expected to be high for such a small subset of the dataset. Feel free to use the complete dataset for your analysis.

You can download the CSV data used in this noteboot from Google drive.

https://drive.google.com/file/d/1GWbrrsPe6B-0ZKn4Dxd7D8pNvfK6IsQZ/view?usp=sharing

   season month  holiday day_of_week  ...  atemp humidity  windspeed   count
0  Spring   May        0         Tue  ...   26.0  76.5833   0.118167  6073.0
1    Fall   Dec        0         Tue  ...   19.0  73.3750   0.174129  6606.0
2  Spring   Jun        0         Thu  ...   28.0  56.9583   0.253733  7363.0
3    Fall   Dec        0         Sun  ...    4.0  58.6250   0.169779  2431.0
4  Summer   Sep        0         Wed  ...   23.0  91.7083   0.097021  1996.0

[5 rows x 11 columns]

Data Exploration

Bikeshares for weekend vs weekdays

The plot uses the workingday column to get information about weekday or weekend. Clearly lot more bikeshares happen on weekdays as compared to weekends.

Also, on a more detailed level the factor plot is plotted to see the variations in bikeshares based on the day of the week.

The factor plot reaffirms our observation from the pie chart and tells us that during weekends Sunday sees the least number of bikeshares.

png

png

Bikeshares during different weather conditions

The plot below shows that most bikeshares happen when the weather is clear to cloudy. People also seem to be using bikeshares in misty weather conditions. Quite intuitively, the number of bikeshares on a rainy or snowy day is quite less.

png

Bikeshares during different seasons

Firstly a pie chart is plotted for visualizing the number of bikeshares in different seasons. The pie chart doesn’t provide a clear picture so next a bar chart is used to get a clearer picture.

The bar chart shows that most bikeshares happen during Summer while Winter accounts for the least number of bikeshares.

png

png

Exploring linear non-linear dependencies

As it is apparent from the plots below, some of the variables have linear dependencies with others while some don’t have a linear relationship.

VariableLinearInverse-linearNon-Linear
Tempatemp, count-humidity, windspeed
atemptemp, count-humidity, windspeed
humidity-windspeedtemp, atemp, count
windspeed-humiditytemp, atemp, count
counttemp, atemp-humidity, windspeed

png

Moreover, lets find the correlation between the variables to further analyse the dependencies between variables.

In the correlation heatmap, the blue shades represent higher level of correlation between the variables.

Note: temp, atemp variables show good amount of positive correlation with count.

Index(['season', 'month', 'holiday', 'day_of_week', 'workingday', 'weather',
       'temp', 'atemp', 'humidity', 'windspeed', 'count'],
      dtype='object')

png

Should polynomial transformation be performed?

Clearly not all features exhibit a linear dependency with the predictor variable ie. count so it might be worthwhile to try a polynomial transformation and visualize the dependencies again.

Before that lets do some more data exploration to further improve the data for linear regression.

Bikeshares sum and mean plots grouped based on humidity

The idea is to see the variations in bikeshares for different humidity levels and then bin the values into multiple groups.

Text(0.5, 1.0, 'Average counts')

png

png

Bikeshares sum and mean plots grouped based on atemp

Similar to humidity, the idea is to see the variations in bikeshares for different atemp levels and then bin the values into multiple groups.

Moreover, temp and atemp is compared to see how these values vary.

Text(0.5, 1.0, 'Average counts')

png

png

temptemp_no
024.037.0
115.020.0
226.048.0
30.012.0
423.046.0
.........
72626.048.0
72733.025.0
72830.049.0
7298.017.0
73016.030.0

731 rows × 2 columns

<matplotlib.axes._subplots.AxesSubplot at 0x7fe857347240>

png

seasonmonthholidayday_of_weekworkingdayweathertempatempwindspeedcounttemp_noatemp_bin_coldatemp_bin_mildatemp_bin_warmhumidity_bin_lowhumidity_bin_mediumhumidity_bin_high
0SpringMay0Tue1Mist24.026.00.1181676073.037.0010001
1FallDec0Tue1Clear to Cloudy15.019.00.1741296606.020.0010001
2SpringJun0Thu1Clear to Cloudy26.028.00.2537337363.048.0010010
3FallDec0Sun0Clear to Cloudy0.04.00.1697792431.012.0000010
4SummerSep0Wed1Rain/Snow23.023.00.0970211996.046.0010001
......................................................
726SummerSep0Fri1Mist26.027.00.1399294727.048.0010010
727SummerAug0Wed1Clear to Cloudy33.032.00.2002584780.025.0001100
728SummerSep0Sun0Clear to Cloudy30.031.00.2064674940.049.0001001
729WinterJan0Sun0Clear to Cloudy8.013.00.1921672294.017.0100010
730SpringApr0Thu1Clear to Cloudy16.020.00.0659296565.030.0010010

731 rows × 17 columns

Encoding categorical variables

   month  holiday  ...  weather_Mist  weather_Rain/Snow
0      8        0  ...             1                  0
1      2        0  ...             0                  0
2      6        0  ...             0                  0
3      2        0  ...             0                  0
4     11        0  ...             0                  1

[5 rows x 22 columns]

Checking for positive/negative correlation & statisitical significance

Some of the variables have negative correlation like windspeed, atemp_bin_cold, humidity_bin_high, season_Winter, weather_Mist, weather_Rain/Snow. Other variables with -ve sign have a very low -ve value and can be safely ignored.

Similarly, the table shows variables with positive correlation.

------ Positive/negative correlation with predictor variable(count) --------
month                      0.186275
holiday                   -0.004103
day_of_week               -0.045716
workingday                 0.042159
temp                       0.516475
atemp                      0.516624
windspeed                 -0.195128
count                      1.000000
temp_no                    0.429402
atemp_bin_cold            -0.225357
atemp_bin_mild             0.388137
atemp_bin_warm             0.161776
humidity_bin_low          -0.076209
humidity_bin_medium        0.201935
humidity_bin_high         -0.153773
season_Fall                0.071844
season_Spring              0.101960
season_Summer              0.294050
season_Winter             -0.471729
weather_Clear to Cloudy    0.219172
weather_Mist              -0.140658
weather_Rain/Snow         -0.233985
Name: count, dtype: float64
------------------------------

Statistical significance

Statistical significance can be measured using P-values.

  • more the number of *(stars) and higher the value, the statistical significance would be higher.

Note: Statistical significance assumes linear dependencies.

monthholidayday_of_weekworkingdaytempatempwindspeedcounttemp_noatemp_bin_coldatemp_bin_mildatemp_bin_warmhumidity_bin_lowhumidity_bin_mediumhumidity_bin_highseason_Fallseason_Springseason_Summerseason_Winterweather_Clear to Cloudyweather_Mistweather_Rain/Snow
month1.0***0.020.01-0.010.11***0.12***-0.12***0.19***0.15***-0.08**0.29***-0.12***-0.16***0.030.1***0.4***-0.11***-0.05-0.23***-0.040.010.06*
holiday0.021.0***-0.12***-0.25***-0.03-0.030.01-0.0-0.020.02-0.07*0.05-0.030.03-0.010.02-0.02-0.030.030.03-0.02-0.03
day_of_week0.01-0.12***1.0***0.2***0.030.030.0-0.050.010.010.00.030.02-0.030.020.00.00.01-0.01-0.0-0.030.1***
workingday-0.01-0.25***0.2***1.0***0.050.05-0.020.040.05-0.020.060.01-0.030.05-0.02-0.010.010.02-0.03-0.060.050.03
temp0.11***-0.030.030.051.0***0.99***-0.16***0.52***0.55***-0.47***0.47***0.58***-0.15***0.11***0.01-0.22***0.16***0.68***-0.62***0.12***-0.1***-0.06
atemp0.12***-0.030.030.050.99***1.0***-0.18***0.52***0.54***-0.44***0.47***0.58***-0.17***0.12***0.01-0.21***0.16***0.66***-0.62***0.11***-0.09**-0.07*
windspeed-0.12***0.010.0-0.02-0.16***-0.18***1.0***-0.2***-0.13***0.09**-0.1***-0.11***0.28***-0.13***-0.1***-0.14***0.1***-0.14***0.18***-0.0-0.040.12***
count0.19***-0.0-0.050.040.52***0.52***-0.2***1.0***0.43***-0.23***0.39***0.16***-0.08**0.2***-0.15***0.07*0.1***0.29***-0.47***0.22***-0.14***-0.23***
temp_no0.15***-0.020.010.050.55***0.54***-0.13***0.43***1.0***-0.17***0.5***0.07*-0.16***0.030.11***-0.030.19***0.29***-0.45***0.010.0-0.03
atemp_bin_cold-0.08**0.020.01-0.02-0.47***-0.44***0.09**-0.23***-0.17***1.0***-0.55***-0.29***-0.01-0.030.040.34***-0.08**-0.41***0.16***-0.1***0.09**0.03
atemp_bin_mild0.29***-0.07*0.00.060.47***0.47***-0.1***0.39***0.5***-0.55***1.0***-0.32***-0.1***-0.030.12***-0.08**0.31***0.15***-0.38***-0.00.01-0.02
atemp_bin_warm-0.12***0.050.030.010.58***0.58***-0.11***0.16***0.07*-0.29***-0.32***1.0***-0.060.16***-0.13***-0.23***-0.13***0.58***-0.23***0.15***-0.13***-0.07*
humidity_bin_low-0.16***-0.030.02-0.03-0.15***-0.17***0.28***-0.08**-0.16***-0.01-0.1***-0.061.0***-0.51***-0.25***-0.15***0.08**-0.07*0.14***0.3***-0.28***-0.07**
humidity_bin_medium0.030.03-0.030.050.11***0.12***-0.13***0.2***0.03-0.03-0.030.16***-0.51***1.0***-0.69***0.07**-0.12***0.11***-0.06*0.24***-0.17***-0.2***
humidity_bin_high0.1***-0.010.02-0.020.010.01-0.1***-0.15***0.11***0.040.12***-0.13***-0.25***-0.69***1.0***0.050.08**-0.07*-0.06-0.52***0.43***0.27***
season_Fall0.4***0.020.0-0.01-0.22***-0.21***-0.14***0.07*-0.030.34***-0.08**-0.23***-0.15***0.07**0.051.0***-0.33***-0.33***-0.33***-0.06*0.030.09**
season_Spring-0.11***-0.020.00.010.16***0.16***0.1***0.1***0.19***-0.08**0.31***-0.13***0.08**-0.12***0.08**-0.33***1.0***-0.34***-0.33***-0.020.04-0.04
season_Summer-0.05-0.030.010.020.68***0.66***-0.14***0.29***0.29***-0.41***0.15***0.58***-0.07*0.11***-0.07*-0.33***-0.34***1.0***-0.34***0.11***-0.1***-0.03
season_Winter-0.23***0.03-0.01-0.03-0.62***-0.62***0.18***-0.47***-0.45***0.16***-0.38***-0.23***0.14***-0.06*-0.06-0.33***-0.33***-0.34***1.0***-0.020.03-0.02
weather_Clear to Cloudy-0.040.03-0.0-0.060.12***0.11***-0.00.22***0.01-0.1***-0.00.15***0.3***0.24***-0.52***-0.06*-0.020.11***-0.021.0***-0.94***-0.23***
weather_Mist0.01-0.02-0.030.05-0.1***-0.09**-0.04-0.14***0.00.09**0.01-0.13***-0.28***-0.17***0.43***0.030.04-0.1***0.03-0.94***1.0***-0.12***
weather_Rain/Snow0.06*-0.030.1***0.03-0.06-0.07*0.12***-0.23***-0.030.03-0.02-0.07*-0.07**-0.2***0.27***0.09**-0.04-0.03-0.02-0.23***-0.12***1.0***

VariableCorrelation & Significance
0month0.19***
1holiday-0.0
2day_of_week-0.05
3workingday0.04
4temp0.52***
5atemp0.52***
6windspeed-0.2***
7count1.0***
8temp_no0.43***
9atemp_bin_cold-0.23***
10atemp_bin_mild0.39***
11atemp_bin_warm0.16***
12humidity_bin_low-0.08**
13humidity_bin_medium0.2***
14humidity_bin_high-0.15***
15season_Fall0.07*
16season_Spring0.1***
17season_Summer0.29***
18season_Winter-0.47***
19weather_Clear to Cloudy0.22***
20weather_Mist-0.14***
21weather_Rain/Snow-0.23***

Linear regression without polynomial transformation

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7fe851c2a828>

png

png

png

Evaluating predictions

We can evaluate the predictions using Root mean Squared Log Error (RMSLE)

R2: 44.029165944830574 %
RMSE: 0.5088707035845853
RMSLE: 0.5071272452091635

Back to polynomial transformation

Now, that we have seen that without polynomial transformation we get a RMSLE score of 0.507 lets try with the polynomial transformation.

     1  month  ...  weather_Mist weather_Rain/Snow  weather_Rain/Snow^2
0  1.0    8.0  ...                             0.0                  0.0
1  1.0    2.0  ...                             0.0                  0.0
2  1.0    6.0  ...                             0.0                  0.0
3  1.0    2.0  ...                             0.0                  0.0
4  1.0   11.0  ...                             0.0                  1.0

[5 rows x 276 columns]
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

png

/usr/local/lib/python3.6/dist-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

png

R2: 22.13254677543377 %
RMSE: 0.6002118212313721
RMSLE: 0.5991302674566322

Random forest regressor

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

png

/usr/local/lib/python3.6/dist-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

png

R2: 40.83093585563251 %
RMSE: 0.5232074381024822
RMSLE: 0.5214938879406389

Gradient boosting regressor

png

/usr/local/lib/python3.6/dist-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

png

R2: 41.92546239525904 %
RMSE: 0.5183456277546589
RMSLE: 0.5167765321375696

XG Boost Regressor

[06:03:21] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

png

/usr/local/lib/python3.6/dist-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

png

R2: 35.83080957789135 %
RMSE: 0.5448661642271247
RMSLE: 0.5432712054784864

Conclusion

The results for different models are shown in the table below. Clearly linear regression with polynomial transformation performs the best in terms of RMSE score.

ModelR2 Score(%)RMSERMSLE
Linear regression44.020.5080.507
Linear regression (polynomial)22.130.600.59
Random forest regressor43.280.5120.510
Gradient boosting regressor43.4451.151.0
XG Boost regressor25.830.5440.5432
Vivek Maskara
Vivek Maskara
SDE @ Remitly

SDE @ Remitly | Graduated from MS CS @ ASU | Ex-Morgan, Amazon, Zeta

Related