---
title: "Assessing Air Quality in Lombardia, Italy through Time Series Analysis: Implications for Public Health and Policy"
subtitle: "Time Series Analysis project"
author: "Giovanni Costa - 880892"
date: "AY 2023/24"
geometry: "left=1cm,right=2cm,top=1cm,bottom=1cm"
output:
  html_document:
    toc: true
    number_sections: true
    toc_depth: 2
    toc_float:
      smooth_scroll: false
    fig_caption: yes
    theme: flatly
    highlight: pygments
    css: "assets/css/styles.css"
editor_options:
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(include = TRUE)
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(fig.align = "center")
options(digits = 4)
options(error = recover)
```

```{r environment setup, include=FALSE}
rm(list = ls())
Sys.setlocale("LC_TIME", "en_US.UTF-8")
```


# Introduction
This project aims to evaluate the air quality in Lombardia, Italy, using a comprehensive time series analysis. The data for this study will be derived from sensors located at various stations across the region, which can be accessed via the regional website.

The primary focus is to underscore the importance of air quality for public health. By analyzing trends and patterns in air quality data over time, it is possible to identify periods of high pollution and relate them to potential health risks. This analysis provides insight into how air quality fluctuations may affect the respiratory health of Lombardia's inhabitants.

Considering the data provided by the region, the following pollutants will be analyzed across three different stations placed in different areas:

- **Benzene** is a volatile organic compound (VOC) commonly used in the production of plastics, resins, and synthetic fibers, as well as in gasoline. It is released into the air through emissions from motor vehicles, industrial processes, and the evaporation of benzene-containing products. Exposure to benzene primarily occurs through inhalation and can lead to serious health effects, including bone marrow damage, which can cause blood disorders such as anemia and increase the risk of leukemia, a type of cancer.
- **Carbon Monoxide (CO)** is a colorless, odorless gas produced by incomplete combustion of carbon-containing fuels, such as gasoline, natural gas, oil, and wood. Major sources include motor vehicles, industrial processes, and residential heating systems. CO interferes with the body's ability to transport oxygen by binding to hemoglobin in the blood, forming carboxyhemoglobin. High levels of exposure can lead to symptoms such as headaches, dizziness, and even death due to oxygen deprivation.
- **Nitrogen Dioxide (NO₂)** is a reddish-brown gas with a sharp, biting odor. It is a significant air pollutant formed primarily from the combustion of fossil fuels in vehicles, power plants, and industrial processes. NO₂ can irritate the respiratory system, exacerbate asthma, and reduce lung function. It also contributes to the formation of ground-level ozone and fine particulate matter, which can have further adverse effects on human health and the environment.
- **Nitrogen Oxides (NOₓ)** encompass a group of gases, including nitrogen dioxide (NO₂) and nitrogen monoxide (NO), produced during combustion processes, particularly at high temperatures. Major sources include motor vehicles, power plants, and industrial facilities. NOₓ gases can cause respiratory problems, contribute to the formation of smog and acid rain, and lead to the secondary formation of fine particulate matter (PM2.5), all of which can have severe health and environmental impacts.
- **Particulate Matter 10 (PM10)** refers to airborne particles with a diameter of 10 micrometers or less. These particles can originate from a variety of sources, including construction sites, road dust, industrial emissions, and combustion processes. PM10 can be inhaled into the respiratory system, leading to health issues such as respiratory infections, lung inflammation, and aggravation of existing heart and lung diseases. Long-term exposure can decrease lung function and increase mortality from cardiovascular and respiratory diseases.

In particular, PM10 is selected for the main part of this analysis because it serves as a comprehensive indicator of particulate pollution from a wide range of sources, such as traffic emissions, industrial processes, and the secondary formation of particles from gaseous pollutants. Its direct link to adverse health effects, including respiratory and cardiovascular diseases, makes PM10 a crucial measure of air quality. Additionally, other pollutants like Benzene, CO, NO₂, and NOₓ can contribute to PM10 levels through both primary emissions and secondary reactions, making them valuable predictor variables for analyzing fluctuations in PM10 concentrations.

For this study, daily data are considered. They are first aggregated into monthly averages to support a long-term analysis and give a clearer view of the underlying trends. Subsequently, forecasting models are developed directly on the daily data to predict air quality in the region.

```{r}
# Pre-requirements
set.seed(123) # set pseudorandom generator for reproducibility
requirements <- c("dplyr", "ggplot2", "kableExtra", "ARPALData", "imputeTS", "xts", "forecast", "car")
for (library_name in requirements) {
  if (!require(library_name, character.only = TRUE, exclude = if (library_name == "dplyr") c("lag", "filter") else NULL)) {
    install.packages(library_name, repos = "https://cloud.r-project.org")
    library(library_name, character.only = TRUE, exclude = if (library_name == "dplyr") c("lag", "filter") else NULL)
  }
}
lag_dplyr <- dplyr::lag
filter_dplyr <- dplyr::filter
daily_ts_freq <- 365.25
monthly_ts_freq <- 12

# Import user-defined functions
source("src/utils.R")
source("src/plotting.R")
source("src/smoothing.R")
source("src/forecasting.R")

# library(styler)
# style_file("analysis_report.Rmd")
```

# Dataset description
The air quality dataset consists of daily observations of Benzene, CO, NO₂, NOₓ, and PM10 from three stations located in different areas of the region:

1. Station 548 - Milano v.Senato refers to the metropolitan area of Milan
1. Station 571 - Bormio v.Monte Braulio is located in the mountain zone
1. Station 703 - Schivenoglia v.Malpasso is placed in the rural plain of Lombardia

The analysis covers the period from January 1, 2014, to December 31, 2023. The original air quality data, which were recorded hourly, were aggregated to daily values by calculating the mean and excluding NA values. For additional details, refer to the ARPALData library manual ^[https://cran.r-project.org/web/packages/ARPALData/index.html].
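
The hourly-to-daily aggregation described above can be illustrated with a minimal sketch. It assumes a plain arithmetic mean over the available hours; the exact rules applied inside ARPALData are not reproduced here, and the `hourly` data frame and its values are hypothetical.

```r
# Hypothetical hourly readings for two days (values are made up for illustration)
hourly <- data.frame(
  timestamp = seq(
    from = as.POSIXct("2023-01-01 00:00:00", tz = "UTC"),
    by = "hour", length.out = 48
  ),
  PM10 = c(rep(20, 12), rep(NA, 12), rep(30, 20), rep(NA, 4))
)

# Calendar day of each reading
hourly$day <- as.Date(hourly$timestamp, tz = "UTC")

# Daily mean computed on the available hours only (NA values are dropped)
daily_mean <- tapply(hourly$PM10, hourly$day, mean, na.rm = TRUE)
daily_mean
```

A day keeps a value as long as at least one hourly reading is available; a day with no readings at all would yield `NaN` and would have to be treated as missing.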

A dataset with detailed information about the monitoring stations is also available. It includes the sensor ID, the pollutant measured by each sensor, as well as the location, altitude, and the start and stop dates of station operation. Since the stations measure different pollutants, columns in the dataset with all `NA` values indicate that the station does not have sensors for measuring that particular pollutant in the air.

```{r}
# Data retrieval
start_date <- as.Date("2014-01-01")
end_date <- as.Date("2023-12-31")
date_range <- seq(from = start_date, to = end_date, by = 1)

# Milano v.Senato, Bormio v.Monte Braulio, Schivenoglia v. Malpasso
aq_station_ids <- c(548, 571, 703)
aq_station_names_code <- c("Station 548", "Station 571", "Station 703")
aq_station_names_short <- c("Milano v.Senato", "Bormio v.Monte Braulio", "Schivenoglia v. Malpasso")
aq_station_names_full <- c("Station 548 - Milano v.Senato", "Station 571 - Bormio v.Monte Braulio", "Station 703 - Schivenoglia v. Malpasso")
station_colors <- c("steelblue", "darkorange", "#009E73")
filtering_colors <- c("red", "blue", "green")

# pollutant: unit of measure mapping
pollutant_units <- list(
  "Benzene" = "µg/m³",
  "CO" = "mg/m³",
  "NO2" = "µg/m³",
  "NOx" = "µg/m³",
  "PM10" = "µg/m³"
)

df_aq_daily <- NULL
df_aq_stations <- NULL

if (length(list.files("data/", pattern = "\\.rds$")) == 0) { # pattern is a regular expression, not a glob
  # Air quality data
  df_aq_daily <- get_ARPA_Lombardia_AQ_data(
    ID_station = aq_station_ids, Date_begin = format(start_date),
    Date_end = format(end_date), Frequency = "daily", parallel = TRUE
  )
  # Station data
  df_aq_stations <- get_ARPA_Lombardia_AQ_registry()
  # Filter stations
  df_aq_stations <- df_aq_stations[df_aq_stations$IDStation %in% aq_station_ids, ]
  # Save data
  saveRDS(df_aq_daily, "data/df_aq_daily.rds")
  saveRDS(df_aq_stations, "data/df_aq_stations.rds")
} else {
  df_aq_daily <- readRDS("data/df_aq_daily.rds")
  df_aq_stations <- readRDS("data/df_aq_stations.rds")
}
```
```{r echo=FALSE,out.width="50%", out.height="20%", fig.show='hold', fig.cap="Stations zoning information"}
# plot_zoning_map(
#  title = "ARPA Lombardia zoning",
#  line_type = 1,
#  line_size = 1,
#  xlab = "Longitude",
#  ylab = "Latitude"
# )
# plot_AQ_stations(data_aq = df_aq_stations, title = "Map of ARPA stations in Lombardy", col_points = station_colors)
knitr::include_graphics(c("assets/images/zoning.png", "assets/images/stations_map.png"))
```
```{r}
missing_dates <- check_missing_dates(df_aq_daily, start_date, end_date, "day")
print(paste("The number of missing dates in the dataset from",
  start_date, "to", end_date, "is:", length(missing_dates),
  sep = " "
))
missing_dates <- NULL
```
```{r}
df_aq_m <- df_aq_daily %>% filter_dplyr(IDStation == 548)
df_aq_b <- df_aq_daily %>% filter_dplyr(IDStation == 571)
df_aq_s <- df_aq_daily %>% filter_dplyr(IDStation == 703)

com_poll_names <- find_common_pollutants(
  df_aq_m %>% dplyr::select(-IDStation, -NameStation, -Date),
  df_aq_b %>% dplyr::select(-IDStation, -NameStation, -Date),
  df_aq_s %>% dplyr::select(-IDStation, -NameStation, -Date)
)
print(paste("Common pollutants among the stations:", paste(com_poll_names, collapse = ", ")))
```
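
The idea of treating all-`NA` columns as "no sensor" and keeping only the pollutants shared by every station can be sketched as follows. This is an illustrative assumption about the logic: the helper `measured_pollutants` and the toy tables are hypothetical, and the actual `find_common_pollutants` function in `src/utils.R` may work differently.

```r
# Hypothetical helper: names of the columns with at least one observed value
measured_pollutants <- function(df) {
  names(df)[colSums(!is.na(df)) > 0]
}

# Toy station tables (made-up values); an all-NA column means "no sensor"
station_a <- data.frame(PM10 = c(30, NA, 25), NO2 = c(40, 38, NA), CO = c(NA, NA, NA))
station_b <- data.frame(PM10 = c(10, 12, NA), NO2 = c(NA, NA, NA), CO = c(0.3, NA, 0.2))
station_c <- data.frame(PM10 = c(NA, 35, 50), NO2 = c(20, NA, 22), CO = c(NA, 0.4, NA))

# Pollutants measured by every station
common <- Reduce(
  intersect,
  lapply(list(station_a, station_b, station_c), measured_pollutants)
)
common
# "PM10"
```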

# Data overview
## Raw data
As previously mentioned, the air quality stations dataset provides details on the locations, sensors, and their operational status. Below is the table for the station located in Milan, while similar information is available for other stations, which may monitor different types of pollutants.
```{r}
# Station 548 - Milano v.Senato
print_table_custom(
  df_aq_stations %>%
    filter_dplyr(IDStation == 548) %>%
    select(c(
      IDSensor, Pollutant, Province, City,
      Latitude, Longitude, Altitude, DateStart, DateStop
    )) %>%
    distinct() %>%
    arrange(Pollutant),
  title = paste(aq_station_names_full[1], "information")
)
```
```{r}
# Station 571 - Bormio v.Monte Braulio
# print_table_custom(df_aq_stations %>%
#  filter_dplyr(IDStation == 571) %>%
#  select(c(
#    IDSensor, Pollutant, Province, City,
#    Latitude, Longitude, Altitude, DateStart, DateStop
#  )) %>%
#  distinct() %>%
#  arrange(Pollutant), title = paste(aq_station_names_full[2], "information"))
```
```{r}
# Station 703 - Schivenoglia v. Malpasso
# print_table_custom(df_aq_stations %>%
#  filter_dplyr(IDStation == 703) %>%
#  select(c(
#    IDSensor, Pollutant, Province, City,
#    Latitude, Longitude, Altitude, DateStart, DateStop
#  )) %>%
#  distinct() %>%
#  arrange(Pollutant), title = paste(aq_station_names_full[3], "information"))
```

Beyond the stations' details, the summary tables of the air quality measurements are more informative. First, the number of missing values within the selected 10-year daily interval is significant, and these gaps must be filled through imputation to obtain regularly spaced time series. Additionally, while the selected pollutants generally show a small spread (judged by the range between the 1st and 3rd quartiles), some pollutants, such as PM10 and NOx, exhibit maximum values that are considerably higher than the 3rd quartile.

```{r}
print_table_custom(
  df_aq_m %>%
    select(c(-IDStation, -NameStation, -Date)),
  is_summary = TRUE, title = paste(aq_station_names_full[1], "daily data statistics"),
  highlight_rows = com_poll_names
)
```
```{r}
print_table_custom(
  df_aq_b %>%
    select(c(-IDStation, -NameStation, -Date)),
  is_summary = TRUE, title = paste(aq_station_names_full[2], "daily data statistics"),
  highlight_rows = com_poll_names
)
```
```{r}
print_table_custom(
  df_aq_s %>%
    select(c(-IDStation, -NameStation, -Date)),
  is_summary = TRUE, title = paste(aq_station_names_full[3], "daily data statistics"),
  highlight_rows = com_poll_names
)
```

The distribution of PM10 is easier to read in a boxplot: the median values of the Milan and Schivenoglia stations are quite similar, and all the stations present observations that lie far from the bulk of the data (in particular station 703). Indeed, PM10 concentrations can be higher in rural areas than in urban metropolitan areas for several reasons. Agricultural activities, such as harvesting and livestock operations, contribute to elevated PM10 levels by generating bioaerosols rich in plant and animal matter. Additionally, rural dust often contains higher concentrations of crustal metals, a situation exacerbated by drier climates and open spaces. Furthermore, rural areas typically experience more stable atmospheric conditions, which reduce the dispersion of particulates compared to the more unstable, windier conditions often found in urban environments due to intense heat islands.
```{r}
# for (i in 1:length(com_poll_names)) {
#  boxplot(
#    df_aq_daily[[com_poll_names[i]]] ~ as.factor(df_aq_daily$IDStation),
#    main = "Boxplot of common pollutants among the stations",
#    xlab = "Station ID",
#    ylab = paste0(com_poll_names[i], " (", pollutant_units[[com_poll_names[i]]], ")"),
#    col = station_colors,
#    las = 2
#  )
# }
boxplot(
  df_aq_daily[["PM10"]] ~ as.factor(df_aq_daily$IDStation),
  main = "Boxplot of PM10 among the stations",
  xlab = "Station ID",
  ylab = paste0("PM10", " (", pollutant_units[["PM10"]], ")"),
  col = station_colors,
  las = 2
)
```

Standard regulatory levels for PM10 are as follows: the **Acceptable Level** is a daily mean of 50 µg/m³, which may be exceeded on at most 35 days per year. The **Information Level** is set at 200 µg/m³, triggering public notifications about potential health risks. The **Alarm Level** is 300 µg/m³, prompting immediate public health actions, such as advising vulnerable groups to limit outdoor activities. Additionally, the **Annual Average** acceptable level is 40 µg/m³: the yearly mean concentration should not exceed this value in order to protect public health.
As indicated in the following table, the annual average PM10 concentration exceeds the annual limit only slightly, if at all, across the years.

```{r}
aq_m_pm10_by_year <- df_aq_m %>%
  dplyr::mutate(year = format(Date, "%Y")) %>%
  dplyr::group_by(year) %>%
  dplyr::summarize(mean_value = mean(PM10, na.rm = TRUE))
aq_b_pm10_by_year <- df_aq_b %>%
  dplyr::mutate(year = format(Date, "%Y")) %>%
  dplyr::group_by(year) %>%
  dplyr::summarize(mean_value = mean(PM10, na.rm = TRUE))
aq_s_pm10_by_year <- df_aq_s %>%
  dplyr::mutate(year = format(Date, "%Y")) %>%
  dplyr::group_by(year) %>%
  dplyr::summarize(mean_value = mean(PM10, na.rm = TRUE))
aq_pm10_by_year <- cbind(aq_m_pm10_by_year, aq_b_pm10_by_year$mean_value, aq_s_pm10_by_year$mean_value)
colnames(aq_pm10_by_year) <- c("Year", aq_station_names_short)
print_table_custom(aq_pm10_by_year, title = "Mean PM10 values by year")
aq_m_pm10_by_year <- NULL
aq_b_pm10_by_year <- NULL
aq_s_pm10_by_year <- NULL
aq_pm10_by_year <- NULL
```
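
Besides the annual mean, the daily limit of 50 µg/m³ with at most 35 exceedance days per year can be checked by counting the days above the limit in each calendar year. The sketch below uses made-up data; the variable names (`pm10`, `daily_limit`, `max_exceedances`) are illustrative and not part of the report's code.

```r
# Made-up daily PM10 values for two years (illustration only)
dates <- seq(as.Date("2022-01-01"), as.Date("2023-12-31"), by = "day")
pm10 <- rep(30, length(dates))
pm10[format(dates, "%Y-%m") == "2022-01"] <- 60 # 31 days above the limit in 2022
pm10[format(dates, "%Y-%m") == "2023-01"] <- 70 # 31 days above the limit in 2023
pm10[format(dates, "%Y-%m") == "2023-12"] <- 55 # 31 more days in 2023

daily_limit <- 50 # µg/m³
max_exceedances <- 35 # allowed days per year

# Count the days above the daily limit in each calendar year
exceed_days <- tapply(pm10 > daily_limit, format(dates, "%Y"), sum, na.rm = TRUE)
exceedance_table <- data.frame(
  year = names(exceed_days),
  days_above_limit = as.integer(exceed_days),
  within_allowance = as.integer(exceed_days) <= max_exceedances
)
exceedance_table
```

In this toy example 2022 stays within the allowance (31 days), while 2023 does not (62 days). The same counting could be applied to each station's daily PM10 column.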


## Time series data
From this point on, the data are analyzed as time series rather than as raw values, and further considerations are based on this interpretation. The plot displaying the time series from all stations confirms the previous observations about the data distribution: the Milan station generally records higher PM10 values, whereas the Bormio station reports the lowest levels, and the Schivenoglia station shows numerous peaks.

```{r}
ts_m <- xts(df_aq_m$PM10, order.by = df_aq_m$Date)
ts_b <- xts(df_aq_b$PM10, order.by = df_aq_b$Date)
ts_s <- xts(df_aq_s$PM10, order.by = df_aq_s$Date)
```
```{r fig.height=6, fig.width=10}
plot_3_ts(
  ts1 = ts_m, ts2 = ts_b, ts3 = ts_s,
  ts_colors = station_colors,
  main = "PM10 time series for the stations",
  ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
  legend_names = aq_station_names_full
)
```

When the time series are plotted individually, their patterns become clearer. Each series exhibits variability and is apparently non-stationary, with indications of some seasonal patterns across all stations.

```{r fig.height=8, fig.width=10}
plot_ts_grid(
  ts_list = list(ts_m, ts_b, ts_s),
  ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
  ts_colors = station_colors,
  n_row = 3,
  ts_names = aq_station_names_full
)
```
```{r}
ts_m <- xts2ts(ts_m, daily_ts_freq)
ts_b <- xts2ts(ts_b, daily_ts_freq)
ts_s <- xts2ts(ts_s, daily_ts_freq)
```

# Preprocessing
## Missing values imputation
As previously noted, the data contain many missing values, and time series require consistent spacing without gaps. The following statistics provide insight into these missing values:

- "Number of Gaps" indicates the count of NA gaps, which are sequences of one or more consecutive missing values.
- "Average Gap Size" represents the average length of these consecutive NA gaps.
- "Longest NA Gap" shows the longest sequence of consecutive missing values in the time series.
- "Most Frequent Gap Size" identifies the most commonly occurring length of missing value sequences.

```{r}
stats_na_ts_m <- imputeTS::statsNA(ts_m, print_only = FALSE)
stats_na_ts_b <- imputeTS::statsNA(ts_b, print_only = FALSE)
stats_na_ts_s <- imputeTS::statsNA(ts_s, print_only = FALSE)
stats_na_table <- as.data.frame(
  rbind(
    stats_na_ts_m,
    stats_na_ts_b,
    stats_na_ts_s
  )
)
stats_na_table <- stats_na_table[, c(-dim(stats_na_table)[2], -(dim(stats_na_table)[2] - 1))]
colnames(stats_na_table) <- c(
  "Length TS", "Number NAs", "Number Gaps", "Average Gap Size",
  "Percentage NAs", "Longest NA gap", "Most frequent gap size"
)
rownames(stats_na_table) <- aq_station_names_code
print_table_custom(stats_na_table, title = "Missing values statistics")
```

Fortunately, the most frequent gap length is one, and the average gap size is relatively small, likely resulting from occasional sensor malfunctions. This suggests that simple imputation techniques should be reasonably accurate and close to the actual values.
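
To make the gap statistics concrete, the following sketch recomputes them on a toy vector with base R's run-length encoding. It only illustrates the definitions given above and is not the implementation used by `imputeTS::statsNA`.

```r
# Toy series with three NA gaps of lengths 1, 3 and 1
x <- c(5, NA, 7, 8, NA, NA, NA, 9, 10, NA, 12)

# Run-length encoding of the NA indicator: each run of TRUE is one gap
runs <- rle(is.na(x))
gap_sizes <- runs$lengths[runs$values]

n_gaps <- length(gap_sizes) # number of gaps
avg_gap <- mean(gap_sizes) # average gap size
longest_gap <- max(gap_sizes) # longest NA gap
gap_counts <- table(gap_sizes)
most_frequent_gap <- as.integer(names(gap_counts)[which.max(gap_counts)])

c(
  n_gaps = n_gaps, avg_gap = avg_gap,
  longest_gap = longest_gap, most_frequent_gap = most_frequent_gap
)
```

Here there are 3 gaps, the longest spans 3 values, and the most frequent gap size is 1.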

```{r}
# imputeTS::ggplot_na_distribution2(ts_m,
#  title = paste(aq_station_names_code[1], "-", "missing values ratio per interval"),
# )
# imputeTS::ggplot_na_distribution2(ts_b,
#  title = paste(aq_station_names_code[2], "-", "missing values ratio per interval"),
# )
# imputeTS::ggplot_na_distribution2(ts_s,
#  title = paste(aq_station_names_code[3], "-", "missing values ratio per interval"),
# )
```

To address missing values in the time series, linear interpolation is used. This method assumes that missing values can be estimated by drawing a straight line between the known values on either side. For time series data, this means using the timestamps and values of the adjacent non-missing points to calculate the missing values.
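
For a missing value at time $t$ lying between observed neighbours at times $a < t < b$, the interpolated value is $\hat{y}_t = y_a + \frac{t - a}{b - a}(y_b - y_a)$. A minimal sketch of this rule with base R's `approx()` is shown below, on made-up values; the internals of `imputeTS::na_interpolation` are not reproduced here.

```r
# Toy series with two gaps (values are made up)
y <- c(10, NA, 14, 20, NA, NA, 26)
idx <- seq_along(y)
observed <- !is.na(y)

# Linear interpolation between the nearest observed neighbours
y_filled <- approx(x = idx[observed], y = y[observed], xout = idx)$y
y_filled
# 10 12 14 20 22 24 26
```

Observed values are left untouched, and each gap is filled with equally spaced values between its two endpoints.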

As shown in the following plots, the imputed values appear coherent across all the stations.

```{r fig.height=4, fig.width=8}
ts_m_imputed <- imputeTS::na_interpolation(ts_m)
ts_b_imputed <- imputeTS::na_interpolation(ts_b)
ts_s_imputed <- imputeTS::na_interpolation(ts_s)

imputeTS::ggplot_na_imputations(
  window_ts_xts(ts_m, df_aq_m$Date, "2021-01-01", "2022-12-31"),
  window_ts_xts(ts_m_imputed, df_aq_m$Date, "2021-01-01", "2022-12-31"),
  title = paste(aq_station_names_short[1], "-", "Linear imputation"),
  x_axis_labels = seq(as.Date("2021-01-01"), as.Date("2022-12-31"), by = "day"),
  color_points = station_colors[1],
  color_lines = rgb2hex_custom(col2rgb_custom(station_colors[1], 0.6)),
  color_imputations = "red",
  ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = "")
)
imputeTS::ggplot_na_imputations(
  window_ts_xts(ts_b, df_aq_b$Date, "2014-01-01", "2015-12-31"),
  window_ts_xts(ts_b_imputed, df_aq_b$Date, "2014-01-01", "2015-12-31"),
  title = paste(aq_station_names_short[2], "-", "Linear imputation"),
  x_axis_labels = seq(as.Date("2014-01-01"), as.Date("2015-12-31"), by = "day"),
  color_points = station_colors[2],
  color_lines = rgb2hex_custom(col2rgb_custom(station_colors[2], 0.6)),
  color_imputations = "red",
  ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = "")
)
imputeTS::ggplot_na_imputations(
  window_ts_xts(ts_s, df_aq_s$Date, "2016-01-01", "2017-12-31"),
  window_ts_xts(ts_s_imputed, df_aq_s$Date, "2016-01-01", "2017-12-31"),
  title = paste(aq_station_names_short[3], "-", "Linear imputation"),
  x_axis_labels = seq(as.Date("2016-01-01"), as.Date("2017-12-31"), by = "day"),
  color_points = station_colors[3],
  color_lines = rgb2hex_custom(col2rgb_custom(station_colors[3], 0.6)),
  color_imputations = "red",
  ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = "")
)
ts_m <- ts_m_imputed
ts_b <- ts_b_imputed
ts_s <- ts_s_imputed
df_aq_m$PM10 <- as.numeric(ts_m_imputed)
df_aq_b$PM10 <- as.numeric(ts_b_imputed)
df_aq_s$PM10 <- as.numeric(ts_s_imputed)
```
```{r}
# Sanity check: no NA values should remain in PM10 after imputation
print(
  paste(
    "Number of NA values in PM10 across the stations:",
    sum(
      sum(is.na(df_aq_m$PM10)),
      sum(is.na(df_aq_b$PM10)),
      sum(is.na(df_aq_s$PM10))
    )
  )
)
```

## Outliers detection
The previously identified outliers are also evident in the time series data, particularly as prominent peaks at the stations. These outliers have been left unaltered to preserve the integrity and semantics of the data.

```{r}
outliers_ts_m <- tsoutliers(ts_m)
outliers_ts_b <- tsoutliers(ts_b)
outliers_ts_s <- tsoutliers(ts_s)

tmp_ts_m <- ts(ts_m)
tmp_ts_m[outliers_ts_m$index] <- NA
imputeTS::ggplot_na_imputations(
  tmp_ts_m, ts_m,
  title = paste(aq_station_names_code[1], "-", "outliers detection"),
  x_axis_labels = date_range,
  color_lines = station_colors[1],
  color_imputations = "red",
  ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
  size_points = NA,
  size_imputations = 3,
  legend = FALSE
)
tmp_ts_b <- ts(ts_b)
tmp_ts_b[outliers_ts_b$index] <- NA
imputeTS::ggplot_na_imputations(
  tmp_ts_b, ts_b,
  title = paste(aq_station_names_code[2], "-", "outliers detection"),
  x_axis_labels = date_range,
  color_lines = station_colors[2],
  color_imputations = "red",
  ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
  size_points = NA,
  size_imputations = 3,
  legend = FALSE
)
tmp_ts_s <- ts(ts_s)
tmp_ts_s[outliers_ts_s$index] <- NA
imputeTS::ggplot_na_imputations(
  tmp_ts_s, ts_s,
  title = paste(aq_station_names_code[3], "-", "outliers detection"),
  x_axis_labels = date_range,
  color_lines = station_colors[3],
  color_imputations = "red",
  ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
  size_points = NA,
  size_imputations = 3,
  legend = FALSE
)
outliers_ts_m <- NULL
outliers_ts_b <- NULL
outliers_ts_s <- NULL
tmp_ts_m <- NULL
tmp_ts_b <- NULL
tmp_ts_s <- NULL
```

# Time series data analysis
In this section, the daily data are averaged by month to perform a long-term analysis. With fewer observations, the time series are easier to read. However, the trend and the seasonal pattern do not emerge clearly from a plot of the monthly values alone.

```{r}
agg_df_aq_m <- ARPALData::Time_aggregate(
  df_aq_m, "monthly",
  Var_vec = "PM10", Fns_vec = "mean"
)
agg_df_aq_b <- ARPALData::Time_aggregate(
  df_aq_b, "monthly",
  Var_vec = "PM10", Fns_vec = "mean"
)
agg_df_aq_s <- ARPALData::Time_aggregate(
  df_aq_s, "monthly",
  Var_vec = "PM10", Fns_vec = "mean"
)
ts_m_monthly <- xts(agg_df_aq_m$PM10, order.by = agg_df_aq_m$Date)
ts_b_monthly <- xts(agg_df_aq_b$PM10, order.by = agg_df_aq_b$Date)
ts_s_monthly <- xts(agg_df_aq_s$PM10, order.by = agg_df_aq_s$Date)

plot_ts_grid(
  ts_list = list(ts_m_monthly, ts_b_monthly, ts_s_monthly),
  ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
  ts_colors = station_colors,
  n_row = 3,
  ts_names = aq_station_names_full
)

agg_df_aq_m <- NULL
agg_df_aq_b <- NULL
agg_df_aq_s <- NULL
ts_m_monthly <- xts2ts(ts_m_monthly, monthly_ts_freq)
ts_b_monthly <- xts2ts(ts_b_monthly, monthly_ts_freq)
ts_s_monthly <- xts2ts(ts_s_monthly, monthly_ts_freq)
```

## Autocorrelation and partial autocorrelation
Autocorrelation (ACF) and partial autocorrelation (PACF) function plots provide additional insights into the data. The ACF plots reveal a sinusoidal pattern across all stations, with a pronounced peak every six lags, suggesting a recurring pattern approximately every six months. The PACF plots also show spikes around this period, indicating significant partial correlation at these lags. Additionally, the plots suggest with reasonable confidence that the time series are non-stationary.

```{r}
plot_acf_pacf(ts_m_monthly, aq_station_names_code[1])
```
```{r}
plot_acf_pacf(ts_b_monthly, aq_station_names_code[2])
```
```{r}
plot_acf_pacf(ts_s_monthly, aq_station_names_code[3])
```
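
As a reading aid for these plots, the sketch below computes the sample ACF of a synthetic monthly series with a purely annual cycle (an assumption made only for illustration, not the station data). For such a series the autocorrelation is expected to be negative around lag 6 and positive around lag 12, so extrema spaced six lags apart are also compatible with a 12-month seasonal period.

```r
set.seed(1)

# Synthetic monthly series: 10 years with a 12-month cycle plus noise
n <- 120
x <- ts(10 * cos(2 * pi * seq_len(n) / 12) + rnorm(n), frequency = 12)

# Sample autocorrelations up to 24 months
rho <- acf(x, lag.max = 24, plot = FALSE)$acf[, 1, 1]

# rho[1] is lag 0, so lag k is stored in rho[k + 1]
round(c(lag6 = rho[7], lag12 = rho[13]), 2)
```

Comparing the sign of the autocorrelation at lags 6 and 12 is a quick way to distinguish a half-yearly cycle (both positive) from an annual one (negative, then positive).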

## Monthplot
The `monthplot` function is a helpful tool for visualizing and analyzing the monthly patterns in a time series. It plots the sub-series of each calendar month together with its mean, making it easier to identify seasonal patterns.
The resulting graphs confirm the expected pattern: the observations show a pronounced monthly seasonality, with higher values typically occurring during the winter months and lower values during the summer. Additionally, the variability throughout the year is significant across all stations, indicating that the seasonal fluctuations are consistent yet varied in magnitude.

```{r}
monthplot(
  ts_m_monthly,
  main = paste("Monthly plot of PM10 for", aq_station_names_short[1]),
  xlab = "Month",
  ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = "")
)
monthplot(
  ts_b_monthly,
  main = paste("Monthly plot of PM10 for", aq_station_names_short[2]),
  xlab = "Month",
  ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = "")
)
monthplot(
  ts_s_monthly,
  main = paste("Monthly plot of PM10 for", aq_station_names_short[3]),
  xlab = "Month",
  ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = "")
)
```
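
The per-month means that a month plot draws as horizontal reference lines can also be computed directly. The sketch below uses the built-in `nottem` series (monthly temperatures at Nottingham, 1920-1939) purely as a stand-in for the PM10 series.

```r
# Built-in monthly series used only for illustration
x <- nottem

# Mean of each calendar month across all years
monthly_means <- tapply(x, cycle(x), mean)
names(monthly_means) <- month.abb
round(monthly_means, 1)

# The corresponding month plot
monthplot(x, ylab = "Temperature (°F)", main = "Month plot of nottem")
```

Applied to a monthly PM10 series, the same `tapply()` call would give one average per calendar month, which makes the winter/summer contrast easy to quantify.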

## Smoothing and decomposition
To better understand the trend component of the time series, this section employs smoothing techniques to estimate the underlying trend. Later, the time series from different stations will be decomposed using Seasonal and Trend decomposition using Loess (STL) to further isolate and highlight the trend, seasonal, and residual components.

A widely used and straightforward method for smoothing is the **simple moving average** filter, which computes the arithmetic mean over a centered time window of size \(2p + 1\). The filtered trend estimate at time \(t\) is given by:
$$
\hat{f}_t = \frac{1}{2p + 1} \sum_{i=-p}^p y_{t+i}
$$
The choice of the half-width \(p\) (the window contains \(2p + 1\) observations) is crucial as it directly influences the degree of smoothing: larger values of \(p\) result in a smoother trend, while smaller values retain more of the original variability. Various values of \(p\) are tested to explore different levels of smoothing, each with a specific purpose:

- **\(p = 3\):** A smaller window that applies minimal smoothing, allowing short-term fluctuations to be visible while still reducing noise.
- **\(p = 6\):** This value is chosen to remove the seasonal effect identified in the Auto-Correlation Function (ACF), particularly smoothing out variations that span over a half-year period.
- **\(p = 12\):** A larger window size aimed at providing more significant smoothing, potentially eliminating yearly patterns and offering a clearer view of long-term trends.

In addition, a moving average filter for seasonal data is applied to estimate the trend, given that the monthly time series exhibit a significant seasonal component.
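
Both filters can be sketched with `stats::filter()`. The functions below are illustrative stand-ins with hypothetical names (`simple_ma_sketch`, `seasonal_ma_sketch`); the report's own `simple_ma` and `ma_for_seasonal` live in `src/smoothing.R` and may be implemented differently. The seasonal version assumes the usual centered moving average for an even period \(d\), which gives half weight to the two extreme observations.

```r
# Built-in monthly series used only for illustration
x <- nottem

# Simple centered moving average with half-width p (window of 2p + 1 values)
simple_ma_sketch <- function(x, p) {
  w <- rep(1 / (2 * p + 1), 2 * p + 1)
  stats::filter(x, filter = w, method = "convolution", sides = 2)
}

# Centered moving average for seasonal data with an even period d:
# the window spans d + 1 values, with half weights at both ends
seasonal_ma_sketch <- function(x, d) {
  w <- c(0.5, rep(1, d - 1), 0.5) / d
  stats::filter(x, filter = w, method = "convolution", sides = 2)
}

trend_p3 <- simple_ma_sketch(x, p = 3)
trend_seasonal <- seasonal_ma_sketch(x, d = 12)

# The first and last p (or d / 2) values are NA because the window is incomplete
c(na_simple = sum(is.na(trend_p3)), na_seasonal = sum(is.na(trend_seasonal)))
```

Because the seasonal weights sum to one and cover exactly one full period, each calendar month contributes equally to the average, which is what removes a 12-month seasonal pattern.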

```{r}
plot_filtered_ts(
  original_ts = ts_m_monthly,
  filtered_ts_list = list(simple_ma(ts_m_monthly, p = 3), simple_ma(ts_m_monthly, p = 6), simple_ma(ts_m_monthly, p = 12)),
  main = paste(aq_station_names_code[1], "-", "simple moving average filter"),
  ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
  legend_names = c("p = 3", "p = 6", "p = 12"),
  line_colors = filtering_colors
)
plot_filtered_ts(
  original_ts = ts_m_monthly,
  filtered_ts_list = list(ma_for_seasonal(ts_m_monthly, monthly_ts_freq)),
  main = paste(aq_station_names_code[1], "-", "moving average for seasonal data"),
  ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
  legend_names = c("MA seasonal"),
  line_colors = filtering_colors
)
```
```{r}
plot_filtered_ts(
  original_ts = ts_b_monthly,
  filtered_ts_list = list(simple_ma(ts_b_monthly, p = 3), simple_ma(ts_b_monthly, p = 6), simple_ma(ts_b_monthly, p = 12)),
  main = paste(aq_station_names_code[2], "-", "simple moving average filter"),
  ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
  legend_names = c("p = 3", "p = 6", "p = 12"),
  line_colors = filtering_colors
)
plot_filtered_ts(
  original_ts = ts_b_monthly,
  filtered_ts_list = list(ma_for_seasonal(ts_b_monthly, monthly_ts_freq)),
  main = paste(aq_station_names_code[2], "-", "moving average for seasonal data"),
  ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
  legend_names = c("MA seasonal"),
  line_colors = filtering_colors
)
```
```{r}
plot_filtered_ts(
  original_ts = ts_s_monthly,
  filtered_ts_list = list(simple_ma(ts_s_monthly, p = 3), simple_ma(ts_s_monthly, p = 6), simple_ma(ts_s_monthly, p = 12)),
  main = paste(aq_station_names_code[3], "-", "simple moving average filter"),
  ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
  legend_names = c("p = 3", "p = 6", "p = 12"),
  line_colors = filtering_colors
)
plot_filtered_ts(
  original_ts = ts_s_monthly,
  filtered_ts_list = list(ma_for_seasonal(ts_s_monthly, monthly_ts_freq)),
  main = paste(aq_station_names_code[3], "-", "moving average for seasonal data"),
  ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
  legend_names = c("MA seasonal"),
  line_colors = filtering_colors
)
```

Overall, the estimated trend appears to be slowly decreasing. The data also indicate that during the COVID-19 restrictions in Italy from 2020 to 2022, the average PM10 levels did not decrease significantly. There were even periods with increased PM10 values, such as at Station 571 in Bormio in 2021. This suggests that sources of particulate emissions unrelated to mobility, such as industrial activities or other local sources, played a substantial role in sustaining PM10 concentrations during this time.

To further inspect the behavior of the time series, STL (Seasonal and Trend decomposition using Loess) decomposition is applied. Given the previously observed anomalies, the robust version of the algorithm is used to mitigate their impact. STL also offers flexibility in defining the rate of change for the seasonal component. Since the seasonal pattern appears consistent over time, the seasonal window is set to "periodic" to ensure that the entire dataset is utilized for a comprehensive seasonal analysis.
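
Two properties of this configuration can be verified on a small example: the components add up to the original series, and with `s.window = "periodic"` the seasonal component is identical in every year. The sketch uses the built-in `nottem` series as a stand-in for the PM10 data.

```r
# Built-in monthly series used only for illustration
fit <- stl(nottem, s.window = "periodic", robust = TRUE)
components <- fit$time.series
head(components)

# Seasonal + trend + remainder reconstructs the original series
reconstructed <- rowSums(components)
additive_ok <- isTRUE(all.equal(as.numeric(reconstructed), as.numeric(nottem)))

# With a periodic seasonal window, the seasonal pattern repeats every year
seasonal <- as.numeric(components[, "seasonal"])
periodic_ok <- isTRUE(all.equal(seasonal[1:12], seasonal[13:24]))

c(additive = additive_ok, periodic = periodic_ok)
```

If the seasonal pattern were expected to evolve over the years, an odd numeric `s.window` could be used instead, at the cost of a less stable seasonal estimate.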

```{r}
stl_ts_m_monthly <- stl(ts_m_monthly, s.window = "periodic", robust = TRUE)
stl_ts_b_monthly <- stl(ts_b_monthly, s.window = "periodic", robust = TRUE)
stl_ts_s_monthly <- stl(ts_s_monthly, s.window = "periodic", robust = TRUE)
plot(
  stl_ts_m_monthly,
  main = paste(aq_station_names_code[1], "-", "PM10 - STL decomposition")
)
plot(
  stl_ts_b_monthly,
  main = paste(aq_station_names_code[2], "-", "PM10 - STL decomposition")
)
plot(
  stl_ts_s_monthly,
  main = paste(aq_station_names_code[3], "-", "PM10 - STL decomposition")
)
```
All stations exhibit a strong seasonal component, with an overall decreasing trend. However, Station 548 shows a notable exception, as it experienced significant peaks in PM10 levels at the end of 2021 and throughout 2022.
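
The strength of the seasonal and trend components can also be quantified from an STL fit. The sketch below is only illustrative: it uses the built-in `nottem` dataset rather than the PM10 series, and the helper `component_strength()` is a hypothetical name implementing one common measure, $\max(0, 1 - \mathrm{Var}(R) / \mathrm{Var}(C + R))$, where $C$ is the component of interest and $R$ the remainder.

```r
# Illustrative data: monthly air temperatures at Nottingham (built-in dataset),
# not the PM10 series analysed in this report
fit <- stl(nottem, s.window = "periodic", robust = TRUE)
comp <- fit$time.series

# Hypothetical helper: share of variance explained by a component,
# relative to the remainder (values close to 1 mean a strong component)
component_strength <- function(component, remainder) {
  max(0, 1 - var(remainder) / var(component + remainder))
}

f_seasonal <- component_strength(comp[, "seasonal"], comp[, "remainder"])
f_trend <- component_strength(comp[, "trend"], comp[, "remainder"])
round(c(seasonal = f_seasonal, trend = f_trend), 3)
```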

The Ljung-Box test on the remainder component supports the validity of the decompositions: for most stations, there is no evidence to reject the null hypothesis that the remainder is white noise. For Station 571, however, the p-value is notably low, indicating that the parameters of the STL function may need adjusting to improve the decomposition.

```{r}
ljung_box_noise_ts_m <- Box.test(stl_ts_m_monthly$time.series[, "remainder"],
  lag = ceiling(length(stl_ts_m_monthly$time.series[, "remainder"]) * 0.1),
  type = "Ljung-Box"
)$p.value
ljung_box_noise_ts_b <- Box.test(stl_ts_b_monthly$time.series[, "remainder"],
  lag = ceiling(length(stl_ts_b_monthly$time.series[, "remainder"]) * 0.1),
  type = "Ljung-Box"
)$p.value
ljung_box_noise_ts_s <- Box.test(stl_ts_s_monthly$time.series[, "remainder"],
  lag = ceiling(length(stl_ts_s_monthly$time.series[, "remainder"]) * 0.1),
  type = "Ljung-Box"
)$p.value
ljung_box_noise <- data.frame(
  p_value = c(ljung_box_noise_ts_m, ljung_box_noise_ts_b, ljung_box_noise_ts_s)
)
rownames(ljung_box_noise) <- aq_station_names_code
colnames(ljung_box_noise) <- "p-value"
print_table_custom(ljung_box_noise, title = "Ljung-Box test for the noise component")

stl_ts_m_monthly <- NULL
stl_ts_b_monthly <- NULL
stl_ts_s_monthly <- NULL
ljung_box_noise_ts_m <- NULL
ljung_box_noise_ts_b <- NULL
ljung_box_noise_ts_s <- NULL
ljung_box_noise <- NULL
```
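
As a reference for reading the table above, the following sketch applies the same test, with the same rule of 10% of the series length for the number of lags, to two simulated series. The data are synthetic and serve only to show how the p-value behaves for white noise compared with an autocorrelated AR(1) process.

```r
set.seed(10)
n <- 120

white <- rnorm(n)                                    # white noise
ar1 <- as.numeric(arima.sim(list(ar = 0.7), n = n))  # autocorrelated series

# Same lag rule as in the report: 10% of the series length
lag_10pct <- ceiling(n * 0.1)

p_white <- Box.test(white, lag = lag_10pct, type = "Ljung-Box")$p.value
p_ar1 <- Box.test(ar1, lag = lag_10pct, type = "Ljung-Box")$p.value

# A small p-value is evidence against the white noise hypothesis
c(white_noise = p_white, ar1 = p_ar1)
```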


# Models development
As previously mentioned, this section of the analysis focuses on daily data to develop forecasting models. Given the need to predict future PM10 values for implementing preventive health measures, a short-term analysis is essential.

To simplify visualization and focus on recent data, the time series for this part of the study is limited to the period from January 1, 2022, to December 31, 2023. As shown previously, choosing a time window that starts at the end of the COVID-19 emergency period in Italy does not affect the pattern of PM10 values.

For an accurate evaluation, the observations used to assess the forecasts must not be used for training. Therefore, the time series is divided into training and test sets, with the test period spanning from December 1, 2023, to the end of the period. A one-month test set is selected to assess how the model performs with a relatively long forecast horizon.
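
A minimal sketch of such a split on a toy daily series is given below. It assumes a frequency of 365 and a 31-day hold-out, and it splits on observation positions taken from `time()`; this is an illustrative alternative, not the helper-based split used in this report.

```r
# Toy daily series: 730 observations starting on day 1 of 2022 (assumed frequency 365)
set.seed(1)
y <- ts(rnorm(730), start = c(2022, 1), frequency = 365)

h <- 31            # size of the hold-out period
n <- length(y)
t_all <- time(y)   # time index of every observation

# The training set ends at observation n - h; the test set starts at the next one
y_train <- window(y, end = t_all[n - h])
y_test <- window(y, start = t_all[n - h + 1])

c(train = length(y_train), test = length(y_test))
```

Splitting on positions makes it easy to check that the two sets do not share any observation.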

## Stochastic models
```{r}
# Train-test split
start_train <- as.Date("2022-01-01")
date_split <- as.Date("2023-12-01")

start_train_float <- date_to_float(start_train, daily_ts_freq)
end_train_float <- date_to_float(date_split - 1, daily_ts_freq)
start_test_float <- end_train_float

train_date_range <- seq(start_train, date_split - 1, by = "day")
test_date_range <- seq(date_split, end_date, by = "day")
```
### Station 548 - Milano v.Senato {.unlisted .unnumbered}
```{r}
ts_m_train <- window(ts_m, start_train_float, end_train_float)
ts_m_test <- window(ts_m, start_test_float)
tsdisplay(ts_m_train, lag.max = 40, main = paste(aq_station_names_code[1], "-", "original"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
tsdisplay(log_my(ts_m_train), lag.max = 40, main = paste(aq_station_names_code[1], "-", "logarithmic"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
```

The plots indicate that the time series is not stationary, as evidenced by the slow decay of the autocorrelations in the ACF plot. Even after applying a logarithmic transformation to stabilize the variance, the overall situation remains largely unchanged. The ACF plot displays up to 40 lags because, beyond roughly a month of daily observations, the correlations are expected to become insignificant.

```{r}
tsdisplay(diff(ts_m_train), lag.max = 40, main = paste(aq_station_names_code[1], "-", "differenced"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
tsdisplay(diff(ts_m_train, lag = 7), lag.max = 40, main = paste(aq_station_names_code[1], "-", "differenced by lag 7"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
tsdisplay(diff(ts_m_train, lag = as.integer(daily_ts_freq)), lag.max = 40, main = paste(aq_station_names_code[1], "-", "differenced by lag 365"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
```

Differencing the time series appears to be highly effective, as the data now seem better aligned with the stationarity assumption. However, differencing by lag 7 proves unhelpful for improving stationarity and introduces an artificial seasonal pattern into the data. Similarly, differencing by the time series frequency results in the appearance of a seasonal pattern, with sinusoidal oscillations visible in the ACF plot.
Moreover, differencing daily observations to remove potential periodic factors can be impractical and risky. It overlooks coarser time scales, such as weeks or months, where specific patterns are expected over longer periods. The variability between the same days in different years, driven by factors such as weather or the day of the week, makes this kind of differencing uninformative and difficult to apply effectively.
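
The effect of first differencing on the ACF can be reproduced on a simulated random walk. The sketch below uses synthetic data only: the level series shows the slow decay typical of non-stationarity, while the differenced series has autocorrelations that mostly stay within the approximate 95% band.

```r
set.seed(123)
n <- 500
random_walk <- cumsum(rnorm(n))   # non-stationary by construction

# Sample autocorrelations for lags 1..40 (lag 0 is dropped)
acf_level <- acf(random_walk, lag.max = 40, plot = FALSE)$acf[-1]
acf_diff <- acf(diff(random_walk), lag.max = 40, plot = FALSE)$acf[-1]

band <- 1.96 / sqrt(n)   # approximate 95% band under white noise

c(
  level_lag1 = acf_level[1],
  diff_lag1 = acf_diff[1],
  share_outside_band_level = mean(abs(acf_level) > band),
  share_outside_band_diff = mean(abs(acf_diff) > band)
)
```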

Given that the ACF plot of the differenced time series shows more pronounced spikes up to lag 5 and the PACF exhibits a relatively fast decay, a pattern consistent with a moving average component of order 5, an ARIMA model with order (0, 1, 5) is fitted. The residuals are then examined to confirm they resemble white noise, ensuring the model's adequacy. Finally, this model is compared with an automatically selected ARIMA model to assess its performance.

```{r}
ma5_ts_m <- Arima(ts_m_train, order = c(0, 1, 5))
checkresiduals(ma5_ts_m)
```

```{r}
auto_ts_m <- auto.arima(ts_m_train, ic = "aicc")
checkresiduals(auto_ts_m)
```
The residuals of the automatically selected ARIMA model appear preferable. The Ljung-Box test yields a higher p-value, and both the residuals plot and ACF suggest that the residuals are closer to white noise.

This procedure is also applied to the other stations. The time series are differenced once to improve stationarity, and the plots are inspected to determine the most suitable ARIMA model for each. In all cases, the models are compared with those selected automatically. While the residuals for both methods generally align with the assumptions, the automatically selected models consistently perform slightly better.

### Station 571 - Bormio v.Monte Braulio {.unlisted .unnumbered}
```{r}
ts_b_train <- window(ts_b, start_train_float, end_train_float)
ts_b_test <- window(ts_b, start_test_float)
tsdisplay(ts_b_train, lag.max = 40, main = paste(aq_station_names_code[2], "-", "original"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
# tsdisplay(log_my(ts_b_train), lag.max = 40, main = paste(aq_station_names_code[2], "-", "logarithmic"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
```
```{r}
tsdisplay(diff(ts_b_train), lag.max = 40, main = paste(aq_station_names_code[2], "-", "differenced"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
# tsdisplay(diff(ts_b_train, lag = 7), lag.max = 40, main = paste(aq_station_names_code[2], "-", "differenced by lag 7"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
# tsdisplay(diff(ts_b_train, lag = as.integer(daily_ts_freq)), lag.max = 40, main = paste(aq_station_names_code[2], "-", "differenced by lag 365"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
```

For the data from Station 571, more pronounced spikes are observed up to lag 2, along with indications of an autoregressive pattern in the additional spikes. As a result, an ARIMA(1,1,2) model is fitted to capture these characteristics.

```{r}
ar1_ma2_ts_b <- Arima(ts_b_train, order = c(1, 1, 2))
# checkresiduals(ar1_ma2_ts_b)
```
```{r}
auto_ts_b <- auto.arima(ts_b_train, ic = "aicc")
# checkresiduals(auto_ts_b)
```

### Station 703 - Schivenoglia v. Malpasso {.unlisted .unnumbered}
```{r}
ts_s_train <- window(ts_s, start_train_float, end_train_float)
ts_s_test <- window(ts_s, start_test_float)
tsdisplay(ts_s_train, lag.max = 40, main = paste(aq_station_names_code[3], "-", "original"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
# tsdisplay(log_my(ts_s_train), lag.max = 40, main = paste(aq_station_names_code[3], "-", "logarithmic"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
```
```{r}
tsdisplay(diff(ts_s_train), lag.max = 40, main = paste(aq_station_names_code[3], "-", "differenced"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
# tsdisplay(diff(ts_s_train, lag = 7), lag.max = 40, main = paste(aq_station_names_code[3], "-", "differenced by lag 7"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
# tsdisplay(diff(ts_s_train, lag = as.integer(daily_ts_freq)), lag.max = 40, main = paste(aq_station_names_code[3], "-", "differenced by lag 365"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
```
A similar pattern of spikes is observed in the data from Station 703, for which an ARIMA(1,1,3) model is fitted.

```{r}
ar1_ma3_ts_s <- Arima(ts_s_train, order = c(1, 1, 3))
# checkresiduals(ar1_ma3_ts_s)
```
```{r}
auto_ts_s <- auto.arima(ts_s_train, ic = "aicc")
# checkresiduals(auto_ts_s)
```

## Dynamic regression

This section explores the potential benefits of including other pollutants in predicting PM10. NO₂ is a key precursor in particulate matter formation, as it reacts in the atmosphere to create secondary particles. NOₓ (which includes both NO and NO₂) also significantly contributes to secondary particulate matter, originating mainly from combustion processes like vehicle and industrial emissions. While CO is primarily a gas, it can indirectly affect PM10 levels through atmospheric reactions that produce secondary pollutants, including particulate matter.

The relationship between NO₂ and PM10 is inspected, since NO₂ appears to be the most relevant predictor. As before, missing values are imputed using linear interpolation.

```{r}
no2_ts_m <- xts2ts(xts(df_aq_m$NO2, order.by = df_aq_m$Date), daily_ts_freq)
no2_ts_b <- xts2ts(xts(df_aq_b$NO2, order.by = df_aq_b$Date), daily_ts_freq)
no2_ts_s <- xts2ts(xts(df_aq_s$NO2, order.by = df_aq_s$Date), daily_ts_freq)
no2_ts_m <- imputeTS::na_interpolation(no2_ts_m)
no2_ts_b <- imputeTS::na_interpolation(no2_ts_b)
no2_ts_s <- imputeTS::na_interpolation(no2_ts_s)
no2_ts_m_train <- window(no2_ts_m, start_train_float, end_train_float)
no2_ts_b_train <- window(no2_ts_b, start_train_float, end_train_float)
no2_ts_s_train <- window(no2_ts_s, start_train_float, end_train_float)

## Convert milligrams per cubic meter to micrograms per cubic meter
# co_ts_m <- xts2ts(xts(df_aq_m$CO, order.by = df_aq_m$Date), daily_ts_freq) * 1000
# co_ts_b <- xts2ts(xts(df_aq_b$CO, order.by = df_aq_b$Date), daily_ts_freq) * 1000
# co_ts_s <- xts2ts(xts(df_aq_s$CO, order.by = df_aq_s$Date), daily_ts_freq) * 1000
# co_ts_m <- imputeTS::na_interpolation(co_ts_m)
# co_ts_b <- imputeTS::na_interpolation(co_ts_b)
# co_ts_s <- imputeTS::na_interpolation(co_ts_s)
# co_ts_m_train <- window(co_ts_m, start_train_float, end_train_float)
# co_ts_b_train <- window(co_ts_b, start_train_float, end_train_float)
# co_ts_s_train <- window(co_ts_s, start_train_float, end_train_float)
#
# nox_ts_m <- xts2ts(xts(df_aq_m$NOx, order.by = df_aq_m$Date), daily_ts_freq)
# nox_ts_b <- xts2ts(xts(df_aq_b$NOx, order.by = df_aq_b$Date), daily_ts_freq)
# nox_ts_s <- xts2ts(xts(df_aq_s$NOx, order.by = df_aq_s$Date), daily_ts_freq)
# nox_ts_m <- imputeTS::na_interpolation(nox_ts_m)
# nox_ts_b <- imputeTS::na_interpolation(nox_ts_b)
# nox_ts_s <- imputeTS::na_interpolation(nox_ts_s)
# nox_ts_m_train <- window(nox_ts_m, start_train_float, end_train_float)
# nox_ts_b_train <- window(nox_ts_b, start_train_float, end_train_float)
# nox_ts_s_train <- window(nox_ts_s, start_train_float, end_train_float)
```
```{r}
plot_pollutant_XY_lin(
  x = no2_ts_m_train,
  y = ts_m_train,
  station_name = aq_station_names_code[1],
  unit_measure_x = pollutant_units[["NO2"]],
  unit_measure_y = pollutant_units[["PM10"]],
  xlab = "NO2",
  ylab = "PM10"
)
# plot_pollutant_XY_lin(
#  x = no2_ts_s_train,
#  y = ts_s_train,
#  station_name = aq_station_names_code[3],
#  unit_measure_x = pollutant_units[["NO2"]],
#  unit_measure_y = pollutant_units[["PM10"]],
#  xlab = "NO2",
#  ylab = "PM10"
# )
# plot_pollutant_XY_lin(
#  x = no2_ts_b_train,
#  y = ts_b_train,
#  station_name = aq_station_names_code[2],
#  unit_measure_x = pollutant_units[["NO2"]],
#  unit_measure_y = pollutant_units[["PM10"]],
#  xlab = "NO2",
#  ylab = "PM10"
# )
```
For Station 548, a linear relationship between NO₂ and PM10 is evident, supported by a linear regression model where the predictor is highly significant. However, the residuals violate the white noise assumption. Similar results are observed at other stations, though for brevity, these are not presented here.

```{r}
# plot_pollutant_XY_lin(
#  x = co_ts_m_train,
#  y = ts_m_train,
#  station_name = aq_station_names_code[1],
#  unit_measure_x = pollutant_units[["PM10"]],
#  unit_measure_y = pollutant_units[["PM10"]],
#  xlab = "CO",
#  ylab = "PM10"
# )
# plot_pollutant_XY_lin(
#  x = co_ts_b_train,
#  y = ts_b_train,
#  station_name = aq_station_names_code[2],
#  unit_measure_x = pollutant_units[["PM10"]],
#  unit_measure_y = pollutant_units[["PM10"]],
#  xlab = "CO",
#  ylab = "PM10"
# )
# plot_pollutant_XY_lin(
#  x = co_ts_s_train,
#  y = ts_s_train,
#  station_name = aq_station_names_code[3],
#  unit_measure_x = pollutant_units[["PM10"]],
#  unit_measure_y = pollutant_units[["PM10"]],
#  xlab = "CO",
#  ylab = "PM10"
# )
```
```{r}
# plot_pollutant_XY_lin(
#  x = nox_ts_m_train,
#  y = ts_m_train,
#  station_name = aq_station_names_code[1],
#  unit_measure_x = pollutant_units[["NOx"]],
#  unit_measure_y = pollutant_units[["PM10"]],
#  xlab = "NOx",
#  ylab = "PM10"
# )
# plot_pollutant_XY_lin(
#  x = nox_ts_b_train,
#  y = ts_b_train,
#  station_name = aq_station_names_code[2],
#  unit_measure_x = pollutant_units[["NOx"]],
#  unit_measure_y = pollutant_units[["PM10"]],
#  xlab = "NOx",
#  ylab = "PM10"
# )
# plot_pollutant_XY_lin(
#  x = nox_ts_s_train,
#  y = ts_s_train,
#  station_name = aq_station_names_code[3],
#  unit_measure_x = pollutant_units[["NOx"]],
#  unit_measure_y = pollutant_units[["PM10"]],
#  xlab = "NOx",
#  ylab = "PM10"
# )
```
A dynamic regression model is now fitted to the time series data using NO₂ as a predictor. The model is estimated using the `auto.arima` function, which selects the ARIMA structure by minimizing the corrected Akaike Information Criterion (AICc). Residuals are then assessed for white noise using the `checkresiduals` function. Results for the other stations are not shown for brevity, but their residuals appear satisfactory.

```{r}
xreg_no2_ts_m <- auto.arima(ts_m_train, xreg = no2_ts_m_train, ic = "aicc")
checkresiduals(xreg_no2_ts_m)
```
```{r}
xreg_no2_ts_b <- auto.arima(ts_b_train, xreg = no2_ts_b_train, ic = "aicc")
# checkresiduals(xreg_no2_ts_b)
```
```{r}
xreg_no2_ts_s <- auto.arima(ts_s_train, xreg = no2_ts_s_train, ic = "aicc")
# checkresiduals(xreg_no2_ts_s)
```

In this case, the residuals resemble white noise, and the Ljung-Box test provides no evidence to reject this hypothesis.

## ARIMA comparison
```{r}
ts_m_final_model_list <- list(ma5_ts_m, auto_ts_m, xreg_no2_ts_m)
ts_b_final_model_list <- list(ar1_ma2_ts_b, auto_ts_b, xreg_no2_ts_b)
ts_s_final_model_list <- list(ar1_ma3_ts_s, auto_ts_s, xreg_no2_ts_s)

IC_values_ts_m <- compute_arima_IC(arima_models = ts_m_final_model_list, suffixes = list("", "[auto]", "[auto]"))
IC_values_ts_b <- compute_arima_IC(arima_models = ts_b_final_model_list, suffixes = list("", "[auto]", "[auto]"))
IC_values_ts_s <- compute_arima_IC(arima_models = ts_s_final_model_list, suffixes = list("", "[auto]", "[auto]"))

print_table_custom(IC_values_ts_m[[1]], title = paste(aq_station_names_code[1], " - ARIMA models IC values"))
print_table_custom(IC_values_ts_b[[1]], title = paste(aq_station_names_code[2], " - ARIMA models IC values"))
print_table_custom(IC_values_ts_s[[1]], title = paste(aq_station_names_code[3], " - ARIMA models IC values"))
```

At all stations, the models that use nitrogen dioxide (NO₂) as a predictor exhibit lower AICc values. Additionally, the BIC, which imposes a greater penalty for model complexity, is also lower for these models.
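
For reference, the small-sample correction behind the AICc can be sketched as follows. The function `aicc_sketch()` is a hypothetical helper, not the `compute_arima_IC` function used above. It assumes the usual definition $\mathrm{AICc} = \mathrm{AIC} + 2k(k+1)/(n-k-1)$, where $k$ counts the estimated coefficients plus the innovation variance and $n$ is the number of observations left after differencing. The data are simulated, and base `arima()` is used in place of `Arima()` to keep the example dependency-free.

```r
# Hypothetical helper: AICc for a fit returned by stats::arima()
aicc_sketch <- function(fit) {
  n_par <- length(coef(fit)) + 1   # +1 for the innovation variance
  m <- fit$arma[5]                 # seasonal period
  d <- fit$arma[6]                 # non-seasonal differences
  D <- fit$arma[7]                 # seasonal differences
  n_eff <- length(residuals(fit)) - d - D * m
  AIC(fit) + 2 * n_par * (n_par + 1) / (n_eff - n_par - 1)
}

# Simulated ARIMA(0,1,1) series (illustrative data only)
set.seed(5)
y <- cumsum(as.numeric(arima.sim(list(ma = 0.6), n = 300)))

fit_small <- arima(y, order = c(0, 1, 1))
fit_large <- arima(y, order = c(0, 1, 3))

# Lower values are preferred; the correction grows with the number of parameters
c(small = aicc_sketch(fit_small), large = aicc_sketch(fit_large))
```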


## Nonlinear models
A nonlinear model is used to fit the data, specifically a neural network autoregressive model. This model is a feedforward neural network with a single hidden layer, estimated using the `nnetar()` function, which automatically selects the optimal neural network configuration.
For non-seasonal data, the fitted model is represented as an $NNAR(p,k)$ model, where $p$ denotes the number of lagged inputs and $k$ the number of hidden nodes. This model is analogous to an $AR(p)$ model but incorporates nonlinear functions. For seasonal data, the model is denoted as an $NNAR(p,P,k)[m]$, analogous to an $ARIMA(p,0,0)(P,0,0)[m]$ model but with nonlinear components. According to the *Universal Approximation Theorem* ^[G. Cybenko. "Approximation by superpositions of a sigmoidal function". In: Mathematics of Control, Signals and Systems 2 (1989), pp. 303-314], a neural network with a single hidden layer can approximate any continuous function on a compact subset to arbitrary accuracy, given enough hidden nodes, making it a powerful tool despite its reduced interpretability.
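
To make the structure of an $NNAR(p,k)$ model concrete, the sketch below fits a single network with $p = 2$ lagged inputs and $k = 3$ hidden nodes to a simulated series, using `nnet::nnet()` on a lag matrix built with `embed()`. It is a simplified illustration under these assumptions and not a reimplementation of `nnetar()`, which, among other things, averages several networks.

```r
library(nnet)  # recommended package shipped with standard R installations

# Simulated nonlinear autoregressive series (illustrative data only)
set.seed(42)
n <- 300
y <- numeric(n)
for (t in 3:n) {
  y[t] <- 0.6 * y[t - 1] - 0.3 * y[t - 2]^2 / (1 + y[t - 2]^2) + rnorm(1, sd = 0.5)
}

p <- 2  # number of lagged inputs
k <- 3  # number of hidden nodes

# embed() puts y_t in column 1 and y_{t-1}, ..., y_{t-p} in the next columns
lagged <- embed(y, p + 1)
target <- lagged[, 1]
inputs <- lagged[, -1, drop = FALSE]

# Feedforward network with one hidden layer and a linear output unit
fit <- nnet(inputs, target, size = k, linout = TRUE, trace = FALSE, maxit = 500)

# In-sample residuals and a one-step-ahead forecast (most recent lag first)
res <- target - as.numeric(fitted(fit))
one_step <- predict(fit, matrix(rev(tail(y, p)), nrow = 1))
```

The first $p$ observations have no complete set of lagged inputs, so the residual series is shorter than the original one.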

For nonlinear models, traditional residual diagnostics using autocorrelation functions may not be sufficient to assess model validity. Therefore, additional types of correlation are examined to ensure a comprehensive evaluation of the model's performance.

```{r}
nn_ts_m_auto <- nnetar(ts_m_train)
plot_nn_residuals(nn_ts_m_auto, aq_station_names_code[1], ts_m_train, lag_max = 40)
```

The ACF plots of the residuals suggest that they align with the white noise assumption. Additionally, the p-values from the Ljung-Box test indicate no significant evidence to reject the null hypothesis of white noise. However, the ACF of the squared residuals and the cross-correlation function plots reveal potential remaining information, as some lags appear correlated. This suggests that a more advanced neural network model might be needed. The same procedure applied to the other stations yields similar results.
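
The reason for looking beyond the plain ACF can be illustrated with a simulated series whose dependence sits in the variance rather than in the level. The sketch below uses synthetic ARCH-type data, with parameters chosen only for illustration, and applies the Ljung-Box test to both the series and its squares.

```r
set.seed(2024)
n <- 2000
e <- rnorm(n)
x <- numeric(n)
x[1] <- e[1]

# Today's variance depends on yesterday's squared value (ARCH(1)-type process)
for (t in 2:n) {
  x[t] <- e[t] * sqrt(0.5 + 0.5 * x[t - 1]^2)
}

p_level <- Box.test(x, lag = 20, type = "Ljung-Box")$p.value
p_squared <- Box.test(x^2, lag = 20, type = "Ljung-Box")$p.value

# Dependence left in the variance shows up clearly in the squared series
c(level = p_level, squared = p_squared)
```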

```{r}
nn_ts_b_auto <- nnetar(ts_b_train)
# plot_nn_residuals(nn_ts_b_auto, aq_station_names_code[2], ts_b_train, lag_max = 40)
```
```{r}
nn_ts_s_auto <- nnetar(ts_s_train)
# plot_nn_residuals(nn_ts_s_auto, aq_station_names_code[3], ts_s_train, lag_max = 40)
```


# Forecasting
This section addresses the forecasting of PM10 levels using various models. An illustrative example of the train-test split for the Milan station data is presented in the plot below. 

For convenience, the forecasting analysis is demonstrated using data from Station 548 in Milan. However, the same methodology can be extended to other stations.

```{r}
plot_train_test_ts(ts_m_train, ts_m_test,
  c("steelblue", "darkorange"),
  main = paste(aq_station_names_code[1]),
  ylab = paste("PM10", pollutant_units[["PM10"]]),
  train_date_range = train_date_range,
  test_date_range = test_date_range
)
# plot_train_test_ts(ts_b_train, ts_b_test,
#  c("steelblue", "darkorange"),
#  main = paste(aq_station_names_code[2]),
#  ylab = paste("PM10", pollutant_units[["PM10"]]),
#  train_date_range = train_date_range,
#  test_date_range = test_date_range
# )
# plot_train_test_ts(ts_s_train, ts_s_test,
#  c("steelblue", "darkorange"),
#  main = paste(aq_station_names_code[3]),
#  ylab = paste("PM10", pollutant_units[["PM10"]]),
#  train_date_range = train_date_range,
#  test_date_range = test_date_range
# )
```

In the dynamic regression approach, future NO₂ values would not be known at forecast time, so the mean of the NO₂ observations in the training set is used as the predictor value when forecasting PM10.
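
This forecasting step can be sketched as follows. The example uses simulated stand-ins for NO₂ and PM10, a fixed AR(1) error structure and base `arima()` instead of `auto.arima()`; all names and parameter values are assumptions made for illustration.

```r
set.seed(3)
n <- 400
h <- 31

# Simulated stand-ins for the predictor (NO2) and the response (PM10)
x <- 40 + as.numeric(arima.sim(list(ar = 0.7), n = n, sd = 5))
y <- 5 + 0.6 * x + as.numeric(arima.sim(list(ar = 0.4), n = n, sd = 3))

# Regression on the predictor with AR(1) errors
fit <- arima(y, order = c(1, 0, 0), xreg = cbind(no2 = x))

# Future predictor values are unknown: use the training mean for every step
future_x <- cbind(no2 = rep(mean(x), h))
fc <- predict(fit, n.ahead = h, newxreg = future_x)

head(fc$pred)
```

With a constant predictor, the forecasts move only through the ARIMA error term, which is one reason for them to flatten as the horizon grows.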

## Prediction performance
```{r}
largest_horizon <- 31
# Index, in the original order of the model list, of the model with the lowest AICc
best_idx <- IC_values_ts_m[[2]][which.min(IC_values_ts_m[[1]]$AICc)]
# Best model according to AICc (the NN model is not part of this list)
best_arima_ts_m <- ts_m_final_model_list[[best_idx]]
plot_forecast(best_arima_ts_m, largest_horizon, ts_m_train, ts_m_test,
  ylab = paste("PM10", pollutant_units[["PM10"]]),
  main = paste("Best ARIMA", "-", aq_station_names_code[1])
)
forecast_residuals_analysis(best_arima_ts_m, largest_horizon, model_name = "Best ARIMA")
```

Forecasting over a 31-day horizon with the best ARIMA model according to the AICc, which is the dynamic regression model, gives suboptimal results for longer-term predictions. The residuals appear uncorrelated, as indicated by the ACF plots and the Ljung-Box test, which supports the hypothesis of white noise residuals, but some issues persist. While the residuals have a mean close to zero, their variability is not constant, and the Q-Q plot shows unsatisfactory behavior in the tails.

```{r}
plot_forecast(nn_ts_m_auto, largest_horizon, ts_m_train, ts_m_test,
  ylab = paste("PM10", pollutant_units[["PM10"]]),
  main = paste(nn_ts_m_auto$method, "-", aq_station_names_code[1])
)
forecast_residuals_analysis(nn_ts_m_auto, largest_horizon)
```
The forecasts generated by the neural network autoregressive model exhibit a distinctly different pattern compared to those produced by the previous model. The residual plot indicates that many values are NA. Despite this, the ACF is highly satisfactory, and the residuals appear to meet the assumptions of normality and a zero mean.

## Forecasts comparison
A deeper evaluation of forecasting ability is performed by comparing the two implemented methods against simpler benchmark techniques.

```{r}
ts_m_forecast_model_list <- list(
  meanf(ts_m_train, largest_horizon),
  rwf(ts_m_train, largest_horizon, drift = TRUE),
  ses(ts_m_train, largest_horizon),
  holt(ts_m_train, largest_horizon),
  best_arima_ts_m,
  nn_ts_m_auto
)
plot_forecast_multiple_models(ts_m_forecast_model_list,
  largest_horizon,
  ts_m_train,
  ts_m_test,
  ylab = paste("PM10", pollutant_units[["PM10"]]),
  main = paste("Forecast comparison -", aq_station_names_code[1]),
  colors = c("darkgray", "blue", "green", "purple", "orange", "#5f3e00"),
  skip_n_train_obs = length(ts_m_train) - 1
)
```
For clearer visualization, the plot above shows only the test set, covering a 31-day forecast horizon. Initially, the drift method, simple exponential smoothing, and Holt's method perform similarly. However, over the long term, their predictions diverge.

The dynamic regression model with the NO₂ predictor initially underestimates values but then predicts accurately before eventually aligning more closely with the simpler methods. In contrast, the neural network-based model exhibits an oscillating behavior.

Tables with various metrics are provided to evaluate the models. Specifically, the models are tested across forecast horizons of 1, 3, 7, 14, and 31 days to assess their performance depending on the desired prediction length. As also seen above, the uncertainty in predictions increases with the length of the forecast horizon. Notably, for models based on daily data, very short-term predictions are particularly crucial.

```{r}
table <- get_forecast_metric_table(ts_m_forecast_model_list, 1, ts_m_train, ts_m_test, largest_horizon)
print_table_custom(table, title = paste("Forecast metrics -", aq_station_names_code[1], " - 1 day ahead"))

table <- get_forecast_metric_table(ts_m_forecast_model_list, 3, ts_m_train, ts_m_test, largest_horizon)
print_table_custom(table, title = paste("Forecast metrics -", aq_station_names_code[1], " - 3 days ahead"))

table <- get_forecast_metric_table(ts_m_forecast_model_list, 7, ts_m_train, ts_m_test, largest_horizon)
print_table_custom(table, title = paste("Forecast metrics -", aq_station_names_code[1], " - 7 days ahead"))

table <- get_forecast_metric_table(ts_m_forecast_model_list, 14, ts_m_train, ts_m_test, largest_horizon)
print_table_custom(table, title = paste("Forecast metrics -", aq_station_names_code[1], " - 14 days ahead"))

table <- get_forecast_metric_table(ts_m_forecast_model_list, 31, ts_m_train, ts_m_test, largest_horizon)
print_table_custom(table, title = paste("Forecast metrics -", aq_station_names_code[1], " - 31 days ahead"))
```
For a 1-day ahead forecast, the simplest methods yield the lowest errors, with simple exponential smoothing achieving the best performance. The MASE score confirms that this method outperforms the naive approach. When the forecast window extends to 3 days, the dynamic regression model with ARIMA errors provides the best results, although simpler methods still perform reasonably well. As the forecast horizon increases beyond 7-14 days, prediction accuracy generally declines due to growing uncertainty and the challenges of capturing long-term trends. In these longer-term scenarios, the average method tends to perform better because it is less sensitive to short-term fluctuations.
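
For reference, the MASE compares the mean absolute forecast error with the in-sample mean absolute error of a naive forecast, so values below 1 indicate an improvement over that benchmark. The sketch below assumes the non-seasonal version, scaled by one-step naive errors; a seasonal version would replace the lag of 1 with the seasonal period. The helper `mase_sketch()` and the numbers are hypothetical.

```r
# Hypothetical helper: MASE scaled by in-sample one-step naive errors
mase_sketch <- function(train, actual, forecast) {
  scale <- mean(abs(diff(train)))   # MAE of the naive forecast on the training set
  mean(abs(actual - forecast)) / scale
}

# Toy numbers: naive in-sample MAE is 1, forecast errors are 1 and 2
mase_sketch(train = c(1, 2, 3, 4, 5), actual = c(6, 7), forecast = c(5, 5))
# 1.5
```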


# Conclusions

In summary, this analysis highlights several aspects of the air pollution data for Lombardia, Italy, even though meteorological information is not considered.
The exploratory analysis of the monthly data shows that, in some periods, rural areas may experience higher PM10 values than other zones and that, in general, particulate matter exhibits strong seasonality. The overall trend of this pollutant at the analyzed stations appears to be decreasing, and PM10 values did not drop substantially during the COVID-19 emergency period in Italy. Within the selected date range, the mean value of the pollutant across the zones did not exceed the annual threshold, so the overall situation does not appear to be alarming.

In terms of forecasting daily data for Station 548 (Milano v.Senato), the implemented models do not seem to bring much improvement over the simplest methods, even though the white noise assumption for the residuals is satisfied. In particular, the specified neural network autoregressive model does not seem appropriate, given the scores obtained on the test set. Using external predictors, by contrast, appears to be beneficial.

In conclusion, handling daily data likely requires more advanced models to better capture the underlying patterns. In addition, to account for the seasonality, which was not fully addressed in this study due to its complexity, incorporating Fourier terms could be a promising approach for improving forecast accuracy^[https://robjhyndman.com/hyndsight/longseasonality/].
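
As an indication of how the Fourier approach could look, the sketch below builds sine and cosine regressors for an annual period of 365 days and uses them as external regressors in an ARIMA model. It relies on simulated data, a hypothetical helper `fourier_terms()`, a fixed AR(1) error structure and base `arima()`; the number of harmonics `K` would need to be chosen, for example by minimizing the AICc.

```r
# Hypothetical helper: sine/cosine pairs for harmonics 1..K of a given period
fourier_terms <- function(t, period, K) {
  X <- do.call(cbind, lapply(seq_len(K), function(k) {
    cbind(sin(2 * pi * k * t / period), cos(2 * pi * k * t / period))
  }))
  colnames(X) <- paste0(rep(c("S", "C"), K), rep(seq_len(K), each = 2))
  X
}

# Simulated daily series with an annual cycle (illustrative data only)
set.seed(7)
n <- 730
h <- 31
period <- 365
t_train <- seq_len(n)
y <- 30 + 10 * cos(2 * pi * t_train / period) +
  as.numeric(arima.sim(list(ar = 0.5), n = n, sd = 2))

# Seasonality is handled by the regressors, short-term dynamics by the AR(1) errors
X_train <- fourier_terms(t_train, period, K = 2)
fit <- arima(y, order = c(1, 0, 0), xreg = X_train)

# Unlike other predictors, Fourier terms are known exactly for future dates
X_future <- fourier_terms(n + seq_len(h), period, K = 2)
fc <- predict(fit, n.ahead = h, newxreg = X_future)
```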