This RMarkdown file contains the report of the data analysis done for the project on forecasting daily bike rental demand using time series models in R. It covers data exploration, summary statistics, and the construction of time series models. The final report was completed on Fri Oct 24 04:35:21 2025.
## Data Description
This dataset contains the daily count of rental bike transactions between 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information.
Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Relevant Paper:
Fanaee-T, Hadi, and Gama, Joao. Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg
## Import required packages
# Install timetk (provides the dataset) and dplyr (data manipulation)
install.packages(c("timetk", "dplyr"))
## Installing packages into '/usr/local/lib/R/site-library'
## (as 'lib' is unspecified)
# Load the packages
library(timetk)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the built-in dataset
data("bike_sharing_daily")
# Create a working copy
bike_data <- bike_sharing_daily
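As a quick sanity check (not part of the original analysis), the structure of the working copy can be inspected before computing any statistics:

# Inspect the variables and their types
str(bike_data)
# Summary of the daily rental counts
summary(bike_data$cnt)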
# Correlation between temp/atemp and total rentals (cnt)
cor(bike_data$temp, bike_data$cnt)
## [1] 0.627494
cor(bike_data$atemp, bike_data$cnt)
## [1] 0.6310657
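To see the relationship behind these correlation values, a simple scatter plot with a least-squares line (a base-graphics sketch; the original analysis proceeds directly to the seasonal summary):

# Normalized temperature vs. total rentals, with a fitted regression line
plot(bike_data$temp, bike_data$cnt,
     xlab = "Normalized temperature (temp)", ylab = "Total rentals (cnt)",
     main = "Temperature vs. Daily Rentals", pch = 16, col = "grey40")
abline(lm(cnt ~ temp, data = bike_data), col = "red", lwd = 2)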
# Mean & median temperature by season (temp is normalized to [0, 1] in this dataset)
bike_data %>%
  mutate(season_label = case_when(
    season == 1 ~ "Winter",
    season == 2 ~ "Spring",
    season == 3 ~ "Summer",
    season == 4 ~ "Fall"
  )) %>%
  group_by(season_label) %>%
  summarise(
    mean_temp = mean(temp),
    median_temp = median(temp),
    mean_rentals = mean(cnt)
  )
## # A tibble: 4 × 4
## season_label mean_temp median_temp mean_rentals
## <chr> <dbl> <dbl> <dbl>
## 1 Fall 0.423 0.409 4728.
## 2 Spring 0.544 0.562 4992.
## 3 Summer 0.706 0.715 5644.
## 4 Winter 0.298 0.286 2604.
library(dplyr)
library(ggplot2)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
bike_data %>%
  mutate(year = year(dteday)) %>%
  ggplot(aes(x = dteday, y = cnt, color = factor(year))) +
  geom_line() +
  labs(
    title = "Daily Bike Rentals (2011-2012)",
    x = "Date",
    y = "Total Rentals (cnt)",
    color = "Year"
  ) +
  theme_minimal()
# Load required packages
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(TTR)
# Step 1: Create time series object
ts_cnt <- ts(bike_data$cnt, start = c(2011, 1), frequency = 365)
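# Note: 2011-2012 spans 731 days (2012 is a leap year), so frequency = 365
# is a slight approximation of the annual cycle.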
# Step 2: Clean the series (tsclean() replaces outliers and fills missing values)
ts_clean <- tsclean(ts_cnt)
# Step 3: Compute a 10-day simple moving average (the first 9 values are NA)
ts_sma_raw <- SMA(as.numeric(ts_clean), n = 10)
# Step 4: Re-wrap the SMA with the original time index so it overlays the series
# (the leading NAs are simply not drawn by lines())
ts_sma <- ts(ts_sma_raw, start = start(ts_clean), frequency = frequency(ts_clean))
# Step 5: Plot
plot(ts_clean, main = "Bike Rentals: Original vs. SMA(10)", col = "black", lwd = 1)
lines(ts_sma, col = "red", lwd = 2)
legend("topright", legend = c("Original", "SMA(10)"), col = c("black", "red"), lwd = 2)
# Load required packages
library(forecast)
library(tseries) # Needed for adf.test()
# Step 1: Decompose the time series using STL
decomp <- stl(ts_clean, s.window = "periodic")
plot(decomp)
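# (Optional check, not in the original analysis) The STL components are also
# available numerically for inspection:
head(decomp$time.series)  # columns: seasonal, trend, remainder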
# Step 2: Check stationarity using ADF test
adf_result1 <- adf.test(ts_clean)
print(adf_result1)
##
## Augmented Dickey-Fuller Test
##
## data: ts_clean
## Dickey-Fuller = -1.4435, Lag order = 9, p-value = 0.8138
## alternative hypothesis: stationary
# Step 3: Difference the series once (to remove trend)
ts_diff <- diff(ts_clean, differences = 1)
# Step 4: Re-check stationarity on differenced series
adf_result2 <- adf.test(ts_diff)
## Warning in adf.test(ts_diff): p-value smaller than printed p-value
print(adf_result2)
##
## Augmented Dickey-Fuller Test
##
## data: ts_diff
## Dickey-Fuller = -13.298, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
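As an optional cross-check on the two ADF tests, forecast::ndiffs() estimates the number of differences required for stationarity; given the results above, it would be expected to return 1 here:

# Estimate the required differencing order (KPSS-based by default)
ndiffs(ts_clean)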
# Load required packages
library(forecast)
library(tseries)
# Check the structure of ts_diff
print("Length of ts_diff:")
## [1] "Length of ts_diff:"
print(length(ts_diff))
## [1] 730
print("Number of NA values in ts_diff:")
## [1] "Number of NA values in ts_diff:"
print(sum(is.na(ts_diff)))
## [1] 0
print("First 10 values of ts_diff:")
## [1] "First 10 values of ts_diff:"
print(head(ts_diff, 10))
## Time Series:
## Start = c(2011, 2)
## End = c(2011, 11)
## Frequency = 365
## [1] -184 548 213 38 6 -96 -551 -137 499 -58
print("Is ts_diff a time series object?")
## [1] "Is ts_diff a time series object?"
print(is.ts(ts_diff))
## [1] TRUE
# Remove any remaining NA values (if any)
ts_diff_clean <- na.omit(ts_diff)
print("Length after removing NAs:")
## [1] "Length after removing NAs:"
print(length(ts_diff_clean))
## [1] 730
# Try fitting ARIMA on the cleaned version
if (length(ts_diff_clean) > 10) {
  fit_auto <- auto.arima(ts_diff_clean, seasonal = FALSE, stepwise = TRUE, approximation = TRUE)
  print("Model fitted successfully!")
  print(summary(fit_auto))
  # Forecast the next 25 daily changes
  fc <- forecast(fit_auto, h = 25)
  plot(fc, main = "Forecast: Next 25 Days")
} else {
  stop("Not enough data to fit ARIMA model.")
}
## [1] "Model fitted successfully!"
## Series: ts_diff_clean
## ARIMA(1,0,1) with zero mean
##
## Coefficients:
## ar1 ma1
## 0.3692 -0.8751
## s.e. 0.0437 0.0213
##
## sigma^2 = 663543: log likelihood = -5928.18
## AIC=11862.37 AICc=11862.4 BIC=11876.15
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 11.14636 813.4647 588.9245 157.0058 233.5991 0.6555386 0.0110962
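Before drawing conclusions, an optional residual check (not part of the original output) helps confirm the fit is adequate. forecast::checkresiduals() plots the residuals, their ACF, and a histogram, and reports a Ljung-Box test:

# Residual diagnostics for the fitted ARIMA(1,0,1)
checkresiduals(fit_auto)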
## Conclusion

Throughout this project, I analyzed daily bike rental demand in Washington, DC (2011-2012) using time series modeling techniques. Here are my key findings:

- Temperature is strongly associated with demand: cor(temp, cnt) = 0.627 and cor(atemp, cnt) = 0.631.
- Rentals peak in summer (mean ≈ 5644 per day) and bottom out in winter (mean ≈ 2604 per day).
- The raw daily series is non-stationary (ADF p-value = 0.81); differencing once makes it stationary (ADF p-value < 0.01).
- An ARIMA(1,0,1) model with zero mean was fitted to the differenced series.

This project reinforced how real-world data such as bike rentals can be modeled with time series methods. Because the model was fitted to the differenced series, the forecast describes day-to-day changes rather than absolute rental counts, but the overall process, from exploration to modeling, mirrors real business analytics workflows. For future work, I would explore adding external variables (e.g., holidays, events) to improve accuracy.