This RMarkdown file contains the report of the data analysis done for the project on forecasting daily bike rental demand using time series models in R. It covers data exploration, summary statistics, and the construction of time series models. The final report was completed on Fri Oct 24 04:35:21 2025.
## Data Description
This dataset contains the daily count of rental bike transactions between 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information.
Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Relevant Paper:
Fanaee-T, Hadi, and Gama, Joao. Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg
## Import required packages
# Install timetk (provides the dataset) and dplyr (data manipulation)
install.packages(c("timetk", "dplyr"))
## Installing packages into '/usr/local/lib/R/site-library'
## (as 'lib' is unspecified)
# Load the packages
library(timetk)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the built-in dataset
data("bike_sharing_daily")
# Create a working copy
bike_data <- bike_sharing_daily
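As a quick sanity check (not part of the original analysis), the structure of the working copy can be inspected before computing any statistics:

# Inspect the variables and their types
str(bike_data)
# Summary of the daily rental counts
summary(bike_data$cnt)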
# Correlation between temp/atemp and total rentals (cnt)
cor(bike_data$temp, bike_data$cnt)
## [1] 0.627494
cor(bike_data$atemp, bike_data$cnt)
## [1] 0.6310657
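To see the relationship behind these correlation values, a simple scatter plot with a least-squares line (a base-graphics sketch; the original analysis proceeds directly to the seasonal summary):

# Normalized temperature vs. total rentals, with a fitted regression line
plot(bike_data$temp, bike_data$cnt,
     xlab = "Normalized temperature (temp)", ylab = "Total rentals (cnt)",
     main = "Temperature vs. Daily Rentals", pch = 16, col = "grey40")
abline(lm(cnt ~ temp, data = bike_data), col = "red", lwd = 2)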
# Mean & median temperature by season (temp is normalized to [0, 1] in this dataset)
bike_data %>%
  mutate(season_label = case_when(
    season == 1 ~ "Winter",
    season == 2 ~ "Spring",
    season == 3 ~ "Summer",
    season == 4 ~ "Fall"
  )) %>%
  group_by(season_label) %>%
  summarise(
    mean_temp = mean(temp),
    median_temp = median(temp),
    mean_rentals = mean(cnt)
  )
## # A tibble: 4 × 4
## season_label mean_temp median_temp mean_rentals
## <chr> <dbl> <dbl> <dbl>
## 1 Fall 0.423 0.409 4728.
## 2 Spring 0.544 0.562 4992.
## 3 Summer 0.706 0.715 5644.
## 4 Winter 0.298 0.286 2604.
library(dplyr)
library(ggplot2)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
bike_data %>%
  mutate(year = year(dteday)) %>%
  ggplot(aes(x = dteday, y = cnt, color = factor(year))) +
  geom_line() +
  labs(
    title = "Daily Bike Rentals (2011-2012)",
    x = "Date",
    y = "Total Rentals (cnt)",
    color = "Year"
  ) +
  theme_minimal()
# Load required packages
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(TTR)
# Step 1: Create time series object
ts_cnt <- ts(bike_data$cnt, start = c(2011, 1), frequency = 365)
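# Note: 2011-2012 spans 731 days (2012 is a leap year), so frequency = 365
# is a slight approximation of the annual cycle.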
# Step 2: Clean the series (tsclean() replaces outliers and fills missing values)
ts_clean <- tsclean(ts_cnt)
# Step 3: Compute a 10-day simple moving average (the first 9 values are NA)
ts_sma_raw <- SMA(as.numeric(ts_clean), n = 10)
# Step 4: Re-wrap the SMA with the original time index so it overlays the series
# (the leading NAs are simply not drawn by lines())
ts_sma <- ts(ts_sma_raw, start = start(ts_clean), frequency = frequency(ts_clean))
# Step 5: Plot
plot(ts_clean, main = "Bike Rentals: Original vs. SMA(10)", col = "black", lwd = 1)
lines(ts_sma, col = "red", lwd = 2)
legend("topright", legend = c("Original", "SMA(10)"), col = c("black", "red"), lwd = 2)
# Load required packages
library(forecast)
library(tseries) # Needed for adf.test()
# Step 1: Decompose the time series using STL
decomp <- stl(ts_clean, s.window = "periodic")
plot(decomp)
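# (Optional check, not in the original analysis) The STL components are also
# available numerically for inspection:
head(decomp$time.series)  # columns: seasonal, trend, remainder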
# Step 2: Check stationarity using ADF test
adf_result1 <- adf.test(ts_clean)
print(adf_result1)
##
## Augmented Dickey-Fuller Test
##
## data: ts_clean
## Dickey-Fuller = -1.4435, Lag order = 9, p-value = 0.8138
## alternative hypothesis: stationary
# Step 3: Difference the series once (to remove trend)
ts_diff <- diff(ts_clean, differences = 1)
# Step 4: Re-check stationarity on differenced series
adf_result2 <- adf.test(ts_diff)
## Warning in adf.test(ts_diff): p-value smaller than printed p-value
print(adf_result2)
##
## Augmented Dickey-Fuller Test
##
## data: ts_diff
## Dickey-Fuller = -13.298, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
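As an optional cross-check on the two ADF tests, forecast::ndiffs() estimates the number of differences required for stationarity; given the results above, it would be expected to return 1 here:

# Estimate the required differencing order (KPSS-based by default)
ndiffs(ts_clean)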
# Load required packages
library(forecast)
library(tseries)
# Check the structure of ts_diff
print("Length of ts_diff:")
## [1] "Length of ts_diff:"
print(length(ts_diff))
## [1] 730
print("Number of NA values in ts_diff:")
## [1] "Number of NA values in ts_diff:"
print(sum(is.na(ts_diff)))
## [1] 0
print("First 10 values of ts_diff:")
## [1] "First 10 values of ts_diff:"
print(head(ts_diff, 10))
## Time Series:
## Start = c(2011, 2)
## End = c(2011, 11)
## Frequency = 365
## [1] -184 548 213 38 6 -96 -551 -137 499 -58
print("Is ts_diff a time series object?")
## [1] "Is ts_diff a time series object?"
print(is.ts(ts_diff))
## [1] TRUE
# Remove any remaining NA values (if any)
ts_diff_clean <- na.omit(ts_diff)
print("Length after removing NAs:")
## [1] "Length after removing NAs:"
print(length(ts_diff_clean))
## [1] 730
# Try fitting ARIMA on the cleaned version
if (length(ts_diff_clean) > 10) {
  fit_auto <- auto.arima(ts_diff_clean, seasonal = FALSE, stepwise = TRUE, approximation = TRUE)
  print("Model fitted successfully!")
  print(summary(fit_auto))
  # Forecast the next 25 daily changes
  fc <- forecast(fit_auto, h = 25)
  plot(fc, main = "Forecast: Next 25 Days")
} else {
  stop("Not enough data to fit ARIMA model.")
}
## [1] "Model fitted successfully!"
## Series: ts_diff_clean
## ARIMA(1,0,1) with zero mean
##
## Coefficients:
## ar1 ma1
## 0.3692 -0.8751
## s.e. 0.0437 0.0213
##
## sigma^2 = 663543: log likelihood = -5928.18
## AIC=11862.37 AICc=11862.4 BIC=11876.15
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 11.14636 813.4647 588.9245 157.0058 233.5991 0.6555386 0.0110962
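Before drawing conclusions, an optional residual check (not part of the original output) helps confirm the fit is adequate. forecast::checkresiduals() plots the residuals, their ACF, and a histogram, and reports a Ljung-Box test:

# Residual diagnostics for the fitted ARIMA(1,0,1)
checkresiduals(fit_auto)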
## Conclusion

Throughout this project, I analyzed daily bike rental demand in Washington, DC (2011-2012) using time series modeling techniques. Here are my key findings:

- Temperature is strongly associated with demand: cor(temp, cnt) = 0.627 and cor(atemp, cnt) = 0.631.
- Rentals peak in summer (mean ≈ 5644 per day) and bottom out in winter (mean ≈ 2604 per day).
- The raw daily series is non-stationary (ADF p-value = 0.81); differencing once makes it stationary (ADF p-value < 0.01).
- An ARIMA(1,0,1) model with zero mean was fitted to the differenced series.

This project reinforced how real-world data such as bike rentals can be modeled with time series methods. Because the model was fitted to the differenced series, the forecast describes day-to-day changes rather than absolute rental counts, but the overall process, from exploration to modeling, mirrors real business analytics workflows. For future work, I would explore adding external variables (e.g., holidays, events) to improve accuracy.