limeunhee/bike-demand-prediction

Bike Sharing Demand Prediction

https://www.kaggle.com/c/bike-sharing-demand

A bike sharing program is a transportation system that provides affordable access to bikes in urban areas. The data generated by these programs can help researchers understand the behavior of bike share users and their mobility patterns. The goal of this Kaggle project is to combine historical bike share usage and weather data to predict bikeshare demand in Washington, DC.

The data

Data source

  • The data was downloaded from the "Bike Sharing Demand" competition on kaggle.com
  • There are 10,886 rows of data in total, collected over a 24-month period in Washington, DC
  • Each row contains the bike usage count for a 1-hour period and the corresponding weather information
  • The days from the 20th to the end of every month are held out by Kaggle as the test set, and predictions on the test set are evaluated by RMSLE

EDA

First, the number of observations for each categorical variable is explored. There are approximately equal numbers of observations for each season. In terms of weather, "Clear" has the most observations, followed by "Misty + Cloudy", "Wet", and very few "Extreme". The majority of observations are from non-holidays and working days.


Figure 1. Categorical features in dataset

Next, the distributions of features were explored using barplots. Mean bike share count per hour ranked Summer > Fall > Winter > Spring. By month, June and July had the highest bikeshare demand, and January, February, and March had the lowest. While the average bikeshare demand was similar for holidays vs. non-holidays and workdays vs. non-workdays, there were more outliers on non-holidays and workdays. In terms of weather conditions, clear days had the highest average bikeshare demand.


Figure 2. Distribution of features from barplots

By plotting the bike share count over time, it is evident that overall bikeshare demand increased from 2011 to 2012, and that summer months have the highest total number of bike shares. The 0 values in the weekly and daily time series correspond to the test set that Kaggle sets aside for evaluation.


Figure 3. Bike share count by time

By plotting the bike share count by hour, the impact of features on hourly demand is clear. In terms of season, spring has the lowest demand across all hours. During morning commute hours (7-9am), demand in summer, fall, and winter was similar, whereas during afternoon commute hours (4-7pm) demand varied across those seasons. On holidays and non-work days, there was less demand during commute hours and more demand during work hours compared to non-holidays and work days. Lastly, day of week (0-Monday, 6-Sunday) shows that the demand pattern is drastically different on weekdays vs. weekends. Additionally, among weekdays, Monday and Friday have slightly different demand patterns from the other weekdays.


Figure 4. Hourly and monthly averages of bike share demand
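The hourly comparison above depends on time features derived from each row's timestamp. The following is a minimal sketch of that step, not the repository's actual code: it uses only the Python standard library, and the sample rows, the `time_features` helper, and the `mean_count_by` helper are hypothetical names introduced for illustration. It assumes timestamps in the `YYYY-MM-DD HH:MM:SS` form.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (timestamp string, hourly bike share count).
rows = [
    ("2011-01-03 08:00:00", 120),  # Monday morning commute
    ("2011-01-03 17:00:00", 150),  # Monday afternoon commute
    ("2011-01-08 08:00:00", 20),   # Saturday morning
    ("2011-01-09 14:00:00", 90),   # Sunday afternoon
]


def time_features(stamp):
    """Derive year, month, hour, and day of week from a timestamp string."""
    ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    day = ts.weekday()  # 0 = Monday ... 6 = Sunday
    return {
        "year": ts.year,
        "month": ts.month,
        "hour": ts.hour,
        "dayofweek": day,
        "is_weekend": day >= 5,
    }


def mean_count_by(rows, keys):
    """Average the count over groups defined by the given feature names."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum, number of rows]
    for stamp, count in rows:
        feats = time_features(stamp)
        bucket = totals[tuple(feats[k] for k in keys)]
        bucket[0] += count
        bucket[1] += 1
    return {group: s / n for group, (s, n) in totals.items()}


# Mean demand per (is_weekend, hour) group.
hourly = mean_count_by(rows, ("is_weekend", "hour"))
print(hourly)
```

With pandas, the same features are available through the `.dt` accessor on a datetime column (for example `.dt.hour` and `.dt.dayofweek`), which uses the same Monday = 0 convention.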

From scatter plots of demand count vs. weather variables, we see slight positive correlations with temperature (both absolute and felt temperature) and slight negative correlations with windspeed and humidity. Since temp and atemp have a correlation close to 1, one of the two features can be dropped for modeling.


Figure 5. Weather conditions and bike share counts
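Dropping one of two nearly collinear features, as suggested for temp and atemp, can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the feature values are made up, the 0.95 threshold is an arbitrary choice, and `pearson` and `drop_redundant` are hypothetical helpers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up values for illustration only.
features = {
    "temp": [9.8, 14.8, 20.5, 25.4, 30.3],
    "atemp": [14.4, 17.4, 24.2, 29.5, 34.1],
    "humidity": [81, 40, 75, 48, 60],
}


def drop_redundant(features, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


selected = drop_redundant(features)
print(list(selected))
```

Because features are visited in insertion order, the first of a correlated pair is the one that survives; which of temp or atemp to keep is a modeling choice the source leaves open.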

Lastly, correlations between features were visualized using a heatmap. The features hour, temperature, year, and month have the highest positive correlations with the count (the target variable), and humidity shows a strong negative correlation with the count.


Figure 6. Correlation plot of features

Evaluation Metrics and model comparison

Root mean squared logarithmic error (RMSLE) is used as the model evaluation metric. RMSLE penalizes underestimation more heavily than overestimation. (In comparison, RMSE penalizes over- and underestimation equally.) Linear regression, linear regression with lasso, linear regression with ridge, random forest, and gradient boosted models were optimized through grid search, and their RMSLE values were [0.5878, 0.5896, 0.578828, 0.3324, 0.2964], respectively.


Figure 7. Final model RMSLE comparisons
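To make the asymmetry of RMSLE concrete, here is a small sketch using only the standard library. It assumes the usual definition of RMSLE based on log(1 + x); the `rmsle` and `rmse` functions and the example numbers are illustrative and are not taken from the repository.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x)."""
    squared = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared) / len(squared))


def rmse(y_true, y_pred):
    """Root mean squared error, for comparison."""
    squared = [(pred - true) ** 2 for true, pred in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


actual = [100]
under = [50]   # underestimates by 50
over = [150]   # overestimates by 50

# RMSE treats both errors identically; RMSLE penalizes the underestimate more.
print("RMSE :", rmse(actual, under), rmse(actual, over))
print("RMSLE:", rmsle(actual, under), rmsle(actual, over))
```

The same absolute error of 50 bikes gives a larger RMSLE when the prediction is too low, because the logarithm stretches differences at the low end of the scale.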

Best Performing Model and Evaluation on Kaggle

The best-performing GBR model was used to make predictions on the Kaggle test set. The RMSLE value was 0.38922, equivalent to a top 5% score on the leaderboard. Below is a daily bikeshare demand plot that combines the train set and the values predicted by the best model on the Kaggle test set.


Figure 8. Final test set prediction from Gradient Boosted Model
