---
title: "Assessing Air Quality in Lombardia, Italy through Time Series Analysis: Implications for Public Health and Policy"
subtitle: "Time Series Analysis project"
author: "Giovanni Costa - 880892"
date: "AY 2023/24"
geometry: "left=1cm,right=2cm,top=1cm,bottom=1cm"
output:
  html_document:
    toc: true
    number_sections: true
    toc_depth: 2
    toc_float:
      smooth_scroll: false
    fig_caption: yes
    theme: flatly
    highlight: pygments
    css: "assets/css/styles.css"
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(include = TRUE)
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(fig.align = "center")
options(digits = 4)
options(error = recover)
```
```{r environment setup, include=FALSE}
rm(list = ls())
Sys.setlocale("LC_TIME", "en_US.UTF-8")
```
# Introduction
This project aims to evaluate the air quality in Lombardia, Italy, using a comprehensive time series analysis. The data for this study will be derived from sensors located at various stations across the region, which can be accessed via the regional website.
The primary focus is to underscore the significance of air quality on public health. By analyzing the trends and patterns in air quality data over time, it's possible to identify periods of high pollution and correlate these with potential health risks. This analysis will provide valuable insights into how air quality fluctuations may impact the respiratory health of Lombardia's inhabitants.
Considering the data provided by the region, the following pollutants will be analyzed across three different stations placed in different areas:
- **Benzene** is a volatile organic compound (VOC) commonly used in the production of plastics, resins, and synthetic fibers, as well as in gasoline. It is released into the air through emissions from motor vehicles, industrial processes, and the evaporation of benzene-containing products. Exposure to benzene primarily occurs through inhalation and can lead to serious health effects, including bone marrow damage, which can cause blood disorders such as anemia and increase the risk of leukemia, a type of cancer.
- **Carbon Monoxide (CO)** is a colorless, odorless gas produced by incomplete combustion of carbon-containing fuels, such as gasoline, natural gas, oil, and wood. Major sources include motor vehicles, industrial processes, and residential heating systems. CO interferes with the body's ability to transport oxygen by binding to hemoglobin in the blood, forming carboxyhemoglobin. High levels of exposure can lead to symptoms such as headaches, dizziness, and even death due to oxygen deprivation.
- **Nitrogen Dioxide (NO₂)** is a reddish-brown gas with a sharp, biting odor. It is a significant air pollutant formed primarily from the combustion of fossil fuels in vehicles, power plants, and industrial processes. NO₂ can irritate the respiratory system, exacerbate asthma, and reduce lung function. It also contributes to the formation of ground-level ozone and fine particulate matter, which can have further adverse effects on human health and the environment.
- **Nitrogen Oxides (NOₓ)** encompass a group of gases, including nitrogen dioxide (NO₂) and nitrogen monoxide (NO), produced during combustion processes, particularly at high temperatures. Major sources include motor vehicles, power plants, and industrial facilities. NOₓ gases can cause respiratory problems, contribute to the formation of smog and acid rain, and lead to the secondary formation of fine particulate matter (PM2.5), all of which can have severe health and environmental impacts.
- **Particulate Matter 10 (PM10)** refers to airborne particles with a diameter of 10 micrometers or less. These particles can originate from a variety of sources, including construction sites, road dust, industrial emissions, and combustion processes. PM10 can be inhaled into the respiratory system, leading to health issues such as respiratory infections, lung inflammation, and aggravation of existing heart and lung diseases. Long-term exposure can decrease lung function and increase mortality from cardiovascular and respiratory diseases.
In particular, PM10 is selected for the main part of this analysis because it serves as a comprehensive indicator of particulate pollution from a wide range of sources, such as traffic emissions, industrial processes, and the secondary formation of particles from gaseous pollutants. Its direct link to adverse health effects, including respiratory and cardiovascular diseases, makes PM10 a crucial measure of air quality. Additionally, other pollutants like Benzene, CO, NO₂, and NOₓ can contribute to PM10 levels through both primary emissions and secondary reactions, making them valuable predictor variables for analyzing fluctuations in PM10 concentrations.
For this study, daily data are considered first and then aggregated into monthly averages to support a long-term analysis and give a clearer view of the underlying trends. Subsequently, forecasting models will be developed on the unaggregated daily data to predict air quality in the region.
```{r}
# Pre-requirements
set.seed(123) # set pseudorandom generator for reproducibility
requirements <- c("dplyr", "ggplot2", "kableExtra", "ARPALData", "imputeTS", "xts", "forecast", "car")
for (library_name in requirements) {
  # dplyr is attached without lag() and filter() so that stats::lag() and stats::filter() stay unmasked
  excluded <- if (library_name == "dplyr") c("lag", "filter") else NULL
  if (!require(library_name, character.only = TRUE, exclude = excluded)) {
    install.packages(library_name, repos = "https://cloud.r-project.org")
    library(library_name, character.only = TRUE, exclude = excluded)
  }
}
lag_dplyr <- dplyr::lag
filter_dplyr <- dplyr::filter
daily_ts_freq <- 365.25
monthly_ts_freq <- 12
# Import user-defined functions
source("src/utils.R")
source("src/plotting.R")
source("src/smoothing.R")
source("src/forecasting.R")
# library(styler)
# style_file("analysis_report.Rmd")
```
# Dataset description
The air quality dataset consists of daily observations of Benzene, CO, NO2, NOx, and PM10 from three stations located in different settings:
1. Station 548 - Milano v.Senato refers to the metropolitan area of Milan
1. Station 571 - Bormio v.Monte Braulio is located in the mountain zone
1. Station 703 - Schivenoglia v.Malpasso is placed in the rural plain of Lombardia
The analysis covers the period from January 1, 2014, to December 31, 2023. The original air quality data, which were recorded hourly, were aggregated to daily values by calculating the mean and excluding NA values. For additional details, refer to the ARPALData library manual ^[https://cran.r-project.org/web/packages/ARPALData/index.html].
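The hourly-to-daily aggregation described above can be sketched in base R. This is only an illustration on made-up values, not the code path used by `ARPALData`; the data frame `hourly` and its columns are hypothetical.

```r
# Toy hourly series for one station over two days (values are made up).
hourly <- data.frame(
  DateTime = seq(as.POSIXct("2014-01-01 00:00", tz = "UTC"),
    by = "hour", length.out = 48
  ),
  PM10 = c(rep(20, 23), NA, rep(40, 24))
)

# Calendar day of each hourly record.
hourly$Date <- as.Date(hourly$DateTime, tz = "UTC")

# Daily mean that ignores missing hours (na.rm = TRUE).
daily_mean <- tapply(hourly$PM10, hourly$Date, mean, na.rm = TRUE)
print(daily_mean)
```

Note that a day in which every hourly value is missing would yield `NaN` with this approach, which is one way daily gaps can arise.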
A dataset with detailed information about the monitoring stations is also available. It includes the sensor ID, the pollutant measured by each sensor, as well as the location, altitude, and the start and stop dates of station operation. Since the stations measure different pollutants, columns in the dataset with all `NA` values indicate that the station does not have sensors for measuring that particular pollutant in the air.
```{r}
# Data retrieval
start_date <- as.Date("2014-01-01")
end_date <- as.Date("2023-12-31")
date_range <- seq(from = start_date, to = end_date, by = 1)
# Milano v.Senato, Bormio v.Monte Braulio, Schivenoglia v. Malpasso
aq_station_ids <- c(548, 571, 703)
aq_station_names_code <- c("Station 548", "Station 571", "Station 703")
aq_station_names_short <- c("Milano v.Senato", "Bormio v.Monte Braulio", "Schivenoglia v. Malpasso")
aq_station_names_full <- c("Station 548 - Milano v.Senato", "Station 571 - Bormio v.Monte Braulio", "Station 703 - Schivenoglia v. Malpasso")
station_colors <- c("steelblue", "darkorange", "#009E73")
filtering_colors <- c("red", "blue", "green")
# pollutant: unit of measure mapping
pollutant_units <- list(
"Benzene" = "µg/m³",
"CO" = "mg/m³",
"NO2" = "µg/m³",
"NOx" = "µg/m³",
"PM10" = "µg/m³"
)
df_aq_daily <- NULL
df_aq_stations <- NULL
# Make sure the cache directory exists before reading from or writing to it
dir.create("data", showWarnings = FALSE)
# `pattern` is a regular expression, so match file names ending in ".rds"
if (length(list.files("data/", pattern = "\\.rds$")) == 0) {
  # Air quality data
  df_aq_daily <- get_ARPA_Lombardia_AQ_data(
    ID_station = aq_station_ids, Date_begin = format(start_date),
    Date_end = format(end_date), Frequency = "daily", parallel = TRUE
  )
  # Station data
  df_aq_stations <- get_ARPA_Lombardia_AQ_registry()
  # Filter stations
  df_aq_stations <- df_aq_stations[df_aq_stations$IDStation %in% aq_station_ids, ]
  # Save data
  saveRDS(df_aq_daily, "data/df_aq_daily.rds")
  saveRDS(df_aq_stations, "data/df_aq_stations.rds")
} else {
  df_aq_daily <- readRDS("data/df_aq_daily.rds")
  df_aq_stations <- readRDS("data/df_aq_stations.rds")
}
```
```{r echo=FALSE,out.width="50%", out.height="20%", fig.show='hold', fig.cap="Stations zoning information"}
# plot_zoning_map(
# title = "ARPA Lombardia zoning",
# line_type = 1,
# line_size = 1,
# xlab = "Longitude",
# ylab = "Latitude"
# )
# plot_AQ_stations(data_aq = df_aq_stations, title = "Map of ARPA stations in Lombardy", col_points = station_colors)
knitr::include_graphics(c("assets/images/zoning.png", "assets/images/stations_map.png"))
```
```{r}
missing_dates <- check_missing_dates(df_aq_daily, start_date, end_date, "day")
print(paste("The number of missing dates in the dataset from",
start_date, "to", end_date, "is:", length(missing_dates),
sep = " "
))
missing_dates <- NULL
```
```{r}
df_aq_m <- df_aq_daily %>% filter_dplyr(IDStation == 548)
df_aq_b <- df_aq_daily %>% filter_dplyr(IDStation == 571)
df_aq_s <- df_aq_daily %>% filter_dplyr(IDStation == 703)
com_poll_names <- find_common_pollutants(
df_aq_m %>% dplyr::select(-IDStation, -NameStation, -Date),
df_aq_b %>% dplyr::select(-IDStation, -NameStation, -Date),
df_aq_s %>% dplyr::select(-IDStation, -NameStation, -Date)
)
print(paste("Common pollutants among the stations:", paste(com_poll_names, collapse = ", ")))
```
# Data overview
## Raw data
As previously mentioned, the air quality stations dataset provides details on the locations, sensors, and their operational status. Below is the table for the station located in Milan, while similar information is available for other stations, which may monitor different types of pollutants.
```{r}
# Station 548 - Milano v.Senato
print_table_custom(
df_aq_stations %>%
filter_dplyr(IDStation == 548) %>%
select(c(
IDSensor, Pollutant, Province, City,
Latitude, Longitude, Altitude, DateStart, DateStop
)) %>%
distinct() %>%
arrange(Pollutant),
title = paste(aq_station_names_full[1], "information")
)
```
```{r}
# Station 571 - Bormio v.Monte Braulio
# print_table_custom(df_aq_stations %>%
# filter_dplyr(IDStation == 571) %>%
# select(c(
# IDSensor, Pollutant, Province, City,
# Latitude, Longitude, Altitude, DateStart, DateStop
# )) %>%
# distinct() %>%
# arrange(Pollutant), title = paste(aq_station_names_full[2], "information"))
```
```{r}
# Station 703 - Schivenoglia v. Malpasso
# print_table_custom(df_aq_stations %>%
# filter_dplyr(IDStation == 703) %>%
# select(c(
# IDSensor, Pollutant, Province, City,
# Latitude, Longitude, Altitude, DateStart, DateStop
# )) %>%
# distinct() %>%
# arrange(Pollutant), title = paste(aq_station_names_full[3], "information"))
```
Beyond the station details, the air quality summary tables for the different stations are more informative. First, the number of missing values within the selected 10-year daily interval is significant, and these gaps must be addressed through imputation so that the series are regularly spaced. Additionally, while the selected pollutants generally show limited dispersion (judging by the interquartile range, i.e. the distance between the 1st and 3rd quartiles), some pollutants, such as PM10 and NOx, exhibit maximum values that are considerably higher than the 3rd quartile.
```{r}
print_table_custom(
df_aq_m %>%
select(c(-IDStation, -NameStation, -Date)),
is_summary = TRUE, title = paste(aq_station_names_full[1], "daily data statistics"),
highlight_rows = com_poll_names
)
```
```{r}
print_table_custom(
df_aq_b %>%
select(c(-IDStation, -NameStation, -Date)),
is_summary = TRUE, title = paste(aq_station_names_full[2], "daily data statistics"),
highlight_rows = com_poll_names
)
```
```{r}
print_table_custom(
df_aq_s %>%
select(c(-IDStation, -NameStation, -Date)),
is_summary = TRUE, title = paste(aq_station_names_full[3], "daily data statistics"),
highlight_rows = com_poll_names
)
```
The distribution of PM10 is better highlighted with a boxplot: the median values of the Milan and Schivenoglia stations are quite similar, and all stations show outlying observations far from the bulk of the data (station 703 in particular). Indeed, PM10 concentrations can be higher in rural areas compared to urban metropolitan areas for several reasons. Agricultural activities, such as harvesting and livestock operations, contribute to elevated PM10 levels by generating bioaerosols rich in plant and animal matter. Additionally, rural dust often contains higher concentrations of crustal metals, exacerbated by drier climates and open spaces. Furthermore, rural areas typically experience more stable atmospheric conditions, leading to reduced dispersion of particulates compared to the more unstable, windier conditions often found in urban environments due to intense heat islands.
```{r}
# for (i in 1:length(com_poll_names)) {
# boxplot(
# df_aq_daily[[com_poll_names[i]]] ~ as.factor(df_aq_daily$IDStation),
# main = "Boxplot of common pollutants among the stations",
# xlab = "Station ID",
# ylab = paste0(com_poll_names[i], " (", pollutant_units[[com_poll_names[i]]], ")"),
# col = station_colors,
# las = 2
# )
# }
boxplot(
df_aq_daily[["PM10"]] ~ as.factor(df_aq_daily$IDStation),
main = "Boxplot of PM10 among the stations",
xlab = "Station ID",
ylab = paste0("PM10", " (", pollutant_units[["PM10"]], ")"),
col = station_colors,
las = 2
)
```
Standard regulatory levels for PM10 are as follows: the **Acceptable Level** is a daily mean of 50 µg/m³, which may be exceeded on at most 35 days per year. The **Information Level** is set at 200 µg/m³, triggering public notifications about potential health risks. The **Alarm Level** is 300 µg/m³, prompting immediate public health actions, such as advising vulnerable groups to limit outdoor activities. Additionally, the **Annual Average** acceptable level is 40 µg/m³, ensuring that the yearly average concentration does not exceed this value to protect public health.
Fortunately, as indicated in this table, the annual average concentration of Particulate Matter 10 remains only slightly above the limit across the years.
```{r}
aq_m_pm10_by_year <- df_aq_m %>%
dplyr::mutate(year = format(Date, "%Y")) %>%
dplyr::group_by(year) %>%
dplyr::summarize(mean_value = mean(PM10, na.rm = TRUE))
aq_b_pm10_by_year <- df_aq_b %>%
dplyr::mutate(year = format(Date, "%Y")) %>%
dplyr::group_by(year) %>%
dplyr::summarize(mean_value = mean(PM10, na.rm = TRUE))
aq_s_pm10_by_year <- df_aq_s %>%
dplyr::mutate(year = format(Date, "%Y")) %>%
dplyr::group_by(year) %>%
dplyr::summarize(mean_value = mean(PM10, na.rm = TRUE))
aq_pm10_by_year <- cbind(aq_m_pm10_by_year, aq_b_pm10_by_year$mean_value, aq_s_pm10_by_year$mean_value)
colnames(aq_pm10_by_year) <- c("Year", aq_station_names_short)
print_table_custom(aq_pm10_by_year, title = "Mean PM10 values by year")
aq_m_pm10_by_year <- NULL
aq_b_pm10_by_year <- NULL
aq_s_pm10_by_year <- NULL
aq_pm10_by_year <- NULL
```
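The daily threshold quoted above (50 µg/m³, with up to 35 exceedance days per year) can be checked with a short count per year. The following is a minimal sketch on made-up values; the vectors `dates` and `pm10` are hypothetical stand-ins for the `Date` and `PM10` columns of a station's data frame.

```r
# Toy daily PM10 values (made up): 60 in winter months, 25 otherwise.
dates <- seq(as.Date("2022-01-01"), as.Date("2023-12-31"), by = "day")
pm10 <- ifelse(format(dates, "%m") %in% c("12", "01", "02"), 60, 25)
pm10[1] <- NA # a missing day is not counted as an exceedance

daily_limit <- 50 # daily threshold quoted in the text (µg/m³)
allowed_days <- 35 # exceedance days tolerated per year, as quoted in the text

# Number of days above the daily threshold, per calendar year.
exceed_by_year <- tapply(pm10 > daily_limit, format(dates, "%Y"), sum, na.rm = TRUE)

# Years in which the allowance is used up.
over_allowance <- exceed_by_year > allowed_days
print(exceed_by_year)
print(over_allowance)
```

Counting exceedance days complements the annual means in the table, since a year can respect the annual average while still having many days above the daily threshold.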
## Time series data
From this point on, the data are analyzed as time series rather than as raw values, and further considerations are based on this interpretation. The plot displaying the time series from all stations confirms the previous observations about the data distribution: the Milan station generally records higher PM10 values, the Bormio station reports the lowest levels, and the Schivenoglia station shows numerous peaks.
```{r}
ts_m <- xts(df_aq_m$PM10, order.by = df_aq_m$Date)
ts_b <- xts(df_aq_b$PM10, order.by = df_aq_b$Date)
ts_s <- xts(df_aq_s$PM10, order.by = df_aq_s$Date)
```
```{r fig.height=6, fig.width=10}
plot_3_ts(
ts1 = ts_m, ts2 = ts_b, ts3 = ts_s,
ts_colors = station_colors,
main = "PM10 time series for the stations",
ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
legend_names = aq_station_names_full
)
```
When the time series are plotted individually, their patterns become clearer. Each series exhibits variability and is apparently non-stationary, with indications of some seasonal patterns across all stations.
```{r fig.height=8, fig.width=10}
plot_ts_grid(
ts_list = list(ts_m, ts_b, ts_s),
ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
ts_colors = station_colors,
n_row = 3,
ts_names = aq_station_names_full
)
```
```{r}
ts_m <- xts2ts(ts_m, daily_ts_freq)
ts_b <- xts2ts(ts_b, daily_ts_freq)
ts_s <- xts2ts(ts_s, daily_ts_freq)
```
# Preprocessing
## Missing values imputation
As previously noted, the data contain many missing values, and time series require consistent spacing without gaps. The following statistics provide insight into these missing values:
- "Number of Gaps" indicates the count of NA gaps, which are sequences of one or more consecutive missing values.
- "Average Gap Size" represents the average length of these consecutive NA gaps.
- "Longest NA Gap" shows the longest sequence of consecutive missing values in the time series.
- "Most Frequent Gap Size" identifies the most commonly occurring length of missing value sequences.
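To make these definitions concrete, the same four quantities can be computed on a toy vector with base R's run-length encoding. This is a sketch for illustration only and is not how `imputeTS::statsNA` is implemented; the vector `x` is made up.

```r
# Toy series with three gaps: one NA, three NAs, one NA.
x <- c(5, NA, 7, 8, NA, NA, NA, 12, NA, 14)

# Run-length encoding of the missingness indicator.
runs <- rle(is.na(x))
gap_sizes <- runs$lengths[runs$values] # lengths of consecutive-NA runs

n_gaps <- length(gap_sizes) # number of gaps
avg_gap <- mean(gap_sizes) # average gap size
longest_gap <- max(gap_sizes) # longest NA gap

# Most frequent gap size (the first one in case of ties).
size_counts <- table(gap_sizes)
most_frequent_gap <- as.integer(names(size_counts)[which.max(size_counts)])
```

Here the gaps have sizes 1, 3 and 1, so there are three gaps, the longest has length 3, and the most frequent size is 1.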
```{r}
stats_na_ts_m <- imputeTS::statsNA(ts_m, print_only = FALSE)
stats_na_ts_b <- imputeTS::statsNA(ts_b, print_only = FALSE)
stats_na_ts_s <- imputeTS::statsNA(ts_s, print_only = FALSE)
stats_na_table <- as.data.frame(
rbind(
stats_na_ts_m,
stats_na_ts_b,
stats_na_ts_s
)
)
stats_na_table <- stats_na_table[, c(-dim(stats_na_table)[2], -(dim(stats_na_table)[2] - 1))]
colnames(stats_na_table) <- c(
"Length TS", "Number NAs", "Number Gaps", "Average Gap Size",
"Percentage NAs", "Longest NA gap", "Most frequent gap size"
)
rownames(stats_na_table) <- aq_station_names_code
print_table_custom(stats_na_table, title = "Missing values statistics")
```
Fortunately, the most frequent gap length is one, and the average gap size is relatively small, likely resulting from occasional sensor malfunctions. This suggests that simple imputation techniques should be reasonably accurate and close to the actual values.
```{r}
# imputeTS::ggplot_na_distribution2(ts_m,
# title = paste(aq_station_names_code[1], "-", "missing values ratio per interval"),
# )
# imputeTS::ggplot_na_distribution2(ts_b,
# title = paste(aq_station_names_code[2], "-", "missing values ratio per interval"),
# )
# imputeTS::ggplot_na_distribution2(ts_s,
# title = paste(aq_station_names_code[3], "-", "missing values ratio per interval"),
# )
```
To address missing values in the time series, linear interpolation is used. This method assumes that missing values can be estimated by drawing a straight line between the known values on either side. For time series data, this means using the timestamps and values of the adjacent non-missing points to calculate the missing values.
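As a rough illustration of the idea (not of the internals of `imputeTS::na_interpolation`), base R's `approx()` fills interior gaps along the straight line joining the neighbouring observed points. The vector `y` below is made up.

```r
# Toy series with two gaps.
y <- c(10, NA, NA, 16, 18, NA, 22)
t <- seq_along(y) # equally spaced time index
known <- !is.na(y)

# Interpolate linearly at every time point from the observed values only.
filled <- approx(x = t[known], y = y[known], xout = t)$y
print(filled)
```

Observed values are returned unchanged, and each missing value is placed proportionally between its two neighbours.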
As shown in the following plots, the imputed values appear coherent with the surrounding observations at all stations.
```{r fig.height=4, fig.width=8}
ts_m_imputed <- imputeTS::na_interpolation(ts_m)
ts_b_imputed <- imputeTS::na_interpolation(ts_b)
ts_s_imputed <- imputeTS::na_interpolation(ts_s)
imputeTS::ggplot_na_imputations(
window_ts_xts(ts_m, df_aq_m$Date, "2021-01-01", "2022-12-31"),
window_ts_xts(ts_m_imputed, df_aq_m$Date, "2021-01-01", "2022-12-31"),
title = paste(aq_station_names_short[1], "-", "Linear imputation"),
x_axis_labels = seq(as.Date("2021-01-01"), as.Date("2022-12-31"), by = "day"),
color_points = station_colors[1],
color_lines = rgb2hex_custom(col2rgb_custom(station_colors[1], 0.6)),
color_imputations = "red",
ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = "")
)
imputeTS::ggplot_na_imputations(
window_ts_xts(ts_b, df_aq_b$Date, "2014-01-01", "2015-12-31"),
window_ts_xts(ts_b_imputed, df_aq_b$Date, "2014-01-01", "2015-12-31"),
title = paste(aq_station_names_short[2], "-", "Linear imputation"),
x_axis_labels = seq(as.Date("2014-01-01"), as.Date("2015-12-31"), by = "day"),
color_points = station_colors[2],
color_lines = rgb2hex_custom(col2rgb_custom(station_colors[2], 0.6)),
color_imputations = "red",
ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = "")
)
imputeTS::ggplot_na_imputations(
window_ts_xts(ts_s, df_aq_s$Date, "2016-01-01", "2017-12-31"),
window_ts_xts(ts_s_imputed, df_aq_s$Date, "2016-01-01", "2017-12-31"),
title = paste(aq_station_names_short[3], "-", "Linear imputation"),
x_axis_labels = seq(as.Date("2016-01-01"), as.Date("2017-12-31"), by = "day"),
color_points = station_colors[3],
color_lines = rgb2hex_custom(col2rgb_custom(station_colors[3], 0.6)),
color_imputations = "red",
ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = "")
)
ts_m <- ts_m_imputed
ts_b <- ts_b_imputed
ts_s <- ts_s_imputed
df_aq_m$PM10 <- as.numeric(ts_m_imputed)
df_aq_b$PM10 <- as.numeric(ts_b_imputed)
df_aq_s$PM10 <- as.numeric(ts_s_imputed)
```
```{r}
# Sanity check: after imputation no NA should remain in any of the three stations
print(
  paste(
    "Number of NA values in PM10 across the stations:",
    sum(
      sum(is.na(df_aq_m$PM10)),
      sum(is.na(df_aq_b$PM10)),
      sum(is.na(df_aq_s$PM10))
    )
  )
)
```
## Outliers detection
The previously identified outliers are also evident in the time series data, particularly as prominent peaks at the stations. These outliers have been left unaltered to preserve the integrity and semantics of the data.
```{r}
outliers_ts_m <- tsoutliers(ts_m)
outliers_ts_b <- tsoutliers(ts_b)
outliers_ts_s <- tsoutliers(ts_s)
tmp_ts_m <- ts(ts_m)
tmp_ts_m[outliers_ts_m$index] <- NA
imputeTS::ggplot_na_imputations(
tmp_ts_m, ts_m,
title = paste(aq_station_names_code[1], "-", "outliers detection"),
x_axis_labels = date_range,
color_lines = station_colors[1],
color_imputations = "red",
ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
size_points = NA,
size_imputations = 3,
legend = FALSE
)
tmp_ts_b <- ts(ts_b)
tmp_ts_b[outliers_ts_b$index] <- NA
imputeTS::ggplot_na_imputations(
tmp_ts_b, ts_b,
title = paste(aq_station_names_code[2], "-", "outliers detection"),
x_axis_labels = date_range,
color_lines = station_colors[2],
color_imputations = "red",
ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
size_points = NA,
size_imputations = 3,
legend = FALSE
)
tmp_ts_s <- ts(ts_s)
tmp_ts_s[outliers_ts_s$index] <- NA
imputeTS::ggplot_na_imputations(
tmp_ts_s, ts_s,
title = paste(aq_station_names_code[3], "-", "outliers detection"),
x_axis_labels = date_range,
color_lines = station_colors[3],
color_imputations = "red",
ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
size_points = NA,
size_imputations = 3,
legend = FALSE
)
outliers_ts_m <- NULL
outliers_ts_b <- NULL
outliers_ts_s <- NULL
tmp_ts_m <- NULL
tmp_ts_b <- NULL
tmp_ts_s <- NULL
```
# Time series data analysis
In this section, the daily data are averaged by month to perform a long-term analysis. With fewer observations, the time series now appear clearer. However, the trend and the seasonal pattern do not emerge clearly from a plot of the monthly values alone.
```{r}
agg_df_aq_m <- ARPALData::Time_aggregate(
df_aq_m, "monthly",
Var_vec = "PM10", Fns_vec = "mean"
)
agg_df_aq_b <- ARPALData::Time_aggregate(
df_aq_b, "monthly",
Var_vec = "PM10", Fns_vec = "mean"
)
agg_df_aq_s <- ARPALData::Time_aggregate(
df_aq_s, "monthly",
Var_vec = "PM10", Fns_vec = "mean"
)
ts_m_monthly <- xts(agg_df_aq_m$PM10, order.by = agg_df_aq_m$Date)
ts_b_monthly <- xts(agg_df_aq_b$PM10, order.by = agg_df_aq_b$Date)
ts_s_monthly <- xts(agg_df_aq_s$PM10, order.by = agg_df_aq_s$Date)
plot_ts_grid(
ts_list = list(ts_m_monthly, ts_b_monthly, ts_s_monthly),
ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
ts_colors = station_colors,
n_row = 3,
ts_names = aq_station_names_full
)
agg_df_aq_m <- NULL
agg_df_aq_b <- NULL
agg_df_aq_s <- NULL
ts_m_monthly <- xts2ts(ts_m_monthly, monthly_ts_freq)
ts_b_monthly <- xts2ts(ts_b_monthly, monthly_ts_freq)
ts_s_monthly <- xts2ts(ts_s_monthly, monthly_ts_freq)
```
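The monthly aggregation used above relies on `ARPALData::Time_aggregate` and the user-defined `xts2ts`. A base R sketch of the same idea, on made-up values and with hypothetical variable names, might look like this:

```r
# Toy daily series for one year: each day's value equals its month number.
dates <- seq(as.Date("2014-01-01"), as.Date("2014-12-31"), by = "day")
pm10 <- as.numeric(format(dates, "%m"))

# Average the daily values within each year-month.
monthly_mean <- tapply(pm10, format(dates, "%Y-%m"), mean)

# Regular monthly time series with 12 observations per year.
ts_monthly <- ts(as.numeric(monthly_mean), start = c(2014, 1), frequency = 12)
print(ts_monthly)
```

Setting `frequency = 12` is what lets functions such as `acf()`, `monthplot()` and `stl()` interpret the series as having a yearly cycle.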
## Autocorrelation and partial autocorrelation
Autocorrelation (ACF) and partial autocorrelation (PACF) function plots provide additional insights into the data. The ACF plots reveal a sinusoidal pattern across all stations, with a pronounced peak every six lags, suggesting a recurring pattern approximately every six months. The PACF plots also show spikes around this period, indicating the presence of significant information during these intervals. Additionally, the analysis confirms with reasonable confidence that the time series is non-stationary.
```{r}
plot_acf_pacf(ts_m_monthly, aq_station_names_code[1])
```
```{r}
plot_acf_pacf(ts_b_monthly, aq_station_names_code[2])
```
```{r}
plot_acf_pacf(ts_s_monthly, aq_station_names_code[3])
```
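For intuition about the sinusoidal ACF shape, a synthetic monthly series with a purely annual cycle can be examined. This is a sketch on simulated data with arbitrary parameters, not a reproduction of the station results: for such a series the ACF oscillates with extremes every six lags, negative around lag 6 and positive around lag 12.

```r
set.seed(42)

# Ten years of synthetic monthly data: annual cosine cycle plus noise.
n <- 120
month_index <- seq_len(n)
y <- ts(30 + 10 * cos(2 * pi * month_index / 12) + rnorm(n, sd = 2),
  start = c(2014, 1), frequency = 12
)

# Sample autocorrelations up to lag 24; position k + 1 holds lag k.
acf_vals <- acf(y, lag.max = 24, plot = FALSE)$acf[, 1, 1]
acf_lag6 <- acf_vals[7]
acf_lag12 <- acf_vals[13]
print(c(lag6 = acf_lag6, lag12 = acf_lag12))
```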
## Monthplot
The `monthplot` function is a helpful tool for visualizing and analyzing the monthly patterns in a time series. For each calendar month, it draws the sub-series of that month's values across the years together with their mean, making it easier to identify seasonal patterns.
The resulting graph confirms the expected pattern: the observations show a pronounced monthly seasonality, with higher values typically occurring during the winter months and lower values during the summer. Additionally, the variability throughout the year is significant across all stations, indicating that the seasonal fluctuations are consistent yet varied in magnitude.
```{r}
monthplot(
ts_m_monthly,
main = paste("Monthly plot of PM10 for", aq_station_names_short[1]),
xlab = "Month",
ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = "")
)
monthplot(
ts_b_monthly,
main = paste("Monthly plot of PM10 for", aq_station_names_short[2]),
xlab = "Month",
ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = "")
)
monthplot(
ts_s_monthly,
main = paste("Monthly plot of PM10 for", aq_station_names_short[3]),
xlab = "Month",
ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = "")
)
```
## Smoothing and decomposition
To better understand the trend component of the time series, this section employs smoothing techniques to estimate the underlying trend. Later, the time series from different stations will be decomposed using Seasonal and Trend decomposition using Loess (STL) to further isolate and highlight the trend, seasonal, and residual components.
A widely used and straightforward method for smoothing is the **simple moving average** filter, which computes the arithmetic mean over a centered time window of size \(2p + 1\). The filtered trend estimate at time \(t\) is given by:
$$
\hat{f}_t = \frac{1}{2p + 1} \sum_{i=-p}^p y_{t+i}
$$
The choice of the half-width \(p\) (the window spans \(2p + 1\) observations) is crucial as it directly influences the degree of smoothing: larger values of \(p\) result in a smoother trend, while smaller values retain more of the original variability. Various values of \(p\) are tested to explore different levels of smoothing, each with a specific purpose:
- **\(p = 3\):** A smaller window that applies minimal smoothing, allowing short-term fluctuations to be visible while still reducing noise.
- **\(p = 6\):** This value is chosen to remove the seasonal effect identified in the Auto-Correlation Function (ACF), particularly smoothing out variations that span over a half-year period.
- **\(p = 12\):** A larger window size aimed at providing more significant smoothing, potentially eliminating yearly patterns and offering a clearer view of long-term trends.
In addition, a moving average filter for seasonal data is applied to estimate the trend, given that the monthly time series exhibit a significant seasonal component.
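Both filters can be sketched with `stats::filter`. The functions below are hypothetical stand-ins, since the report's own `simple_ma` and `ma_for_seasonal` live in `src/smoothing.R` and are not shown here; the seasonal version assumes the classical centered moving average for an even period, with half weights at the two ends.

```r
# Simple centered moving average over 2p + 1 observations (equal weights).
simple_ma_sketch <- function(y, p) {
  w <- rep(1 / (2 * p + 1), 2 * p + 1)
  stats::filter(y, filter = w, method = "convolution", sides = 2)
}

# Centered moving average for an even seasonal period m (assumed form):
# m + 1 weights, with the first and last halved so that they sum to 1.
seasonal_ma_sketch <- function(y, m = 12) {
  w <- c(0.5, rep(1, m - 1), 0.5) / m
  stats::filter(y, filter = w, method = "convolution", sides = 2)
}

# Toy monthly series: a pure seasonal pattern 1..12 repeated for three years.
y <- ts(rep(1:12, 3), frequency = 12)
trend_p1 <- simple_ma_sketch(y, p = 1)
trend_seasonal <- seasonal_ma_sketch(y, m = 12)
```

On this toy series the seasonal filter returns a constant wherever it is defined, because each window covers exactly one full period. The first and last observations are `NA` for both filters, since a centered window is incomplete at the edges.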
```{r}
plot_filtered_ts(
original_ts = ts_m_monthly,
filtered_ts_list = list(simple_ma(ts_m_monthly, p = 3), simple_ma(ts_m_monthly, p = 6), simple_ma(ts_m_monthly, p = 12)),
main = paste(aq_station_names_code[1], "-", "simple moving average filter"),
ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
legend_names = c("p = 3", "p = 6", "p = 12"),
line_colors = filtering_colors
)
plot_filtered_ts(
original_ts = ts_m_monthly,
filtered_ts_list = list(ma_for_seasonal(ts_m_monthly, monthly_ts_freq)),
main = paste(aq_station_names_code[1], "-", "moving average for seasonal data"),
ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
legend_names = c("MA seasonal"),
line_colors = filtering_colors
)
```
```{r}
plot_filtered_ts(
original_ts = ts_b_monthly,
filtered_ts_list = list(simple_ma(ts_b_monthly, p = 3), simple_ma(ts_b_monthly, p = 6), simple_ma(ts_b_monthly, p = 12)),
main = paste(aq_station_names_code[2], "-", "simple moving average filter"),
ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
legend_names = c("p = 3", "p = 6", "p = 12"),
line_colors = filtering_colors
)
plot_filtered_ts(
original_ts = ts_b_monthly,
filtered_ts_list = list(ma_for_seasonal(ts_b_monthly, monthly_ts_freq)),
main = paste(aq_station_names_code[2], "-", "moving average for seasonal data"),
ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
legend_names = c("MA seasonal"),
line_colors = filtering_colors
)
```
```{r}
plot_filtered_ts(
original_ts = ts_s_monthly,
filtered_ts_list = list(simple_ma(ts_s_monthly, p = 3), simple_ma(ts_s_monthly, p = 6), simple_ma(ts_s_monthly, p = 12)),
main = paste(aq_station_names_code[3], "-", "simple moving average filter"),
ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
legend_names = c("p = 3", "p = 6", "p = 12"),
line_colors = filtering_colors
)
plot_filtered_ts(
original_ts = ts_s_monthly,
filtered_ts_list = list(ma_for_seasonal(ts_s_monthly, monthly_ts_freq)),
main = paste(aq_station_names_code[3], "-", "moving average for seasonal data"),
ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""),
legend_names = c("MA seasonal"),
line_colors = filtering_colors
)
```
Overall, the estimated trend appears to be slowly decreasing. The data also indicate that average PM10 levels did not decrease significantly during the COVID-19 restrictions in Italy from 2020 to 2022. There were even periods with increased PM10 values, such as at Station 571 in Bormio in 2021. This suggests that sources of particulate emissions unrelated to mobility, such as industrial activities or other local sources, played a substantial role in sustaining PM10 concentrations during this time.
To further inspect the behavior of the time series, STL (Seasonal and Trend decomposition using Loess) is applied. Given the previously observed anomalies, the robust version of the algorithm is used to mitigate their impact. STL also allows the rate of change of the seasonal component to be controlled. Since the seasonal pattern appears consistent over time, the seasonal window is set to `"periodic"`, so that the seasonal component is estimated from the entire series and is identical from year to year.
```{r}
stl_ts_m_monthly <- stl(ts_m_monthly, s.window = "periodic", robust = TRUE)
stl_ts_b_monthly <- stl(ts_b_monthly, s.window = "periodic", robust = TRUE)
stl_ts_s_monthly <- stl(ts_s_monthly, s.window = "periodic", robust = TRUE)
plot(
stl_ts_m_monthly,
main = paste(aq_station_names_code[1], "-", "PM10 - STL decomposition")
)
plot(
stl_ts_b_monthly,
main = paste(aq_station_names_code[2], "-", "PM10 - STL decomposition")
)
plot(
stl_ts_s_monthly,
main = paste(aq_station_names_code[3], "-", "PM10 - STL decomposition")
)
```
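The components returned by `stl()` can also be inspected numerically. The sketch below uses the built-in `nottem` series (monthly temperatures at Nottingham) instead of the PM10 data, purely as an illustration: it checks that the decomposition is additive and that a periodic seasonal window yields the same seasonal pattern every year.

```r
# nottem ships with base R: 20 years of monthly observations.
fit <- stl(nottem, s.window = "periodic", robust = TRUE)
components <- fit$time.series  # columns: seasonal, trend, remainder

# STL is additive: the three components add back up to the original series.
reconstructed <- rowSums(components)
max_abs_error <- max(abs(reconstructed - as.numeric(nottem)))

# One column per year; with a periodic window all columns coincide.
seasonal_by_year <- matrix(components[, "seasonal"], nrow = 12)
seasonal_spread <- max(abs(seasonal_by_year - seasonal_by_year[, 1]))
```

The `remainder` column is what the Ljung-Box test is applied to when judging whether the decomposition has captured the structure of the series.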
All stations exhibit a strong seasonal component, with an overall decreasing trend. However, Station 548 shows a notable exception, as it experienced significant peaks in PM10 levels at the end of 2021 and throughout 2022.
The Ljung-Box test broadly supports the decompositions: for most stations it gives no evidence against the null hypothesis that the remainder is uncorrelated (white noise). For Station 571, however, the p-value is notably low, indicating that the parameters of the STL function may need adjusting to improve the decomposition.
```{r}
ljung_box_noise_ts_m <- Box.test(stl_ts_m_monthly$time.series[, "remainder"],
lag = ceiling(length(stl_ts_m_monthly$time.series[, "remainder"]) * 0.1),
type = "Ljung-Box"
)$p.value
ljung_box_noise_ts_b <- Box.test(stl_ts_b_monthly$time.series[, "remainder"],
lag = ceiling(length(stl_ts_b_monthly$time.series[, "remainder"]) * 0.1),
type = "Ljung-Box"
)$p.value
ljung_box_noise_ts_s <- Box.test(stl_ts_s_monthly$time.series[, "remainder"],
lag = ceiling(length(stl_ts_s_monthly$time.series[, "remainder"]) * 0.1),
type = "Ljung-Box"
)$p.value
ljung_box_noise <- data.frame(
p_value = c(ljung_box_noise_ts_m, ljung_box_noise_ts_b, ljung_box_noise_ts_s)
)
rownames(ljung_box_noise) <- aq_station_names_code
colnames(ljung_box_noise) <- "p-value"
print_table_custom(ljung_box_noise, title = "Ljung-Box test for the noise component")
stl_ts_m_monthly <- NULL
stl_ts_b_monthly <- NULL
stl_ts_s_monthly <- NULL
ljung_box_noise_ts_m <- NULL
ljung_box_noise_ts_b <- NULL
ljung_box_noise_ts_s <- NULL
ljung_box_noise <- NULL
```
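To make the reading of the Ljung-Box p-values concrete, the following sketch applies `Box.test()` to two simulated series: pure white noise and an autocorrelated AR(1) process. The data are synthetic and the seed is arbitrary; only the lag rule (10% of the series length) mirrors the chunk above.

```r
set.seed(123)
n <- 300
white_noise <- rnorm(n)
ar1_series <- arima.sim(model = list(ar = 0.8), n = n)

# Same rule as above: number of lags equal to 10% of the series length.
lag_used <- ceiling(n * 0.1)

p_white <- Box.test(white_noise, lag = lag_used, type = "Ljung-Box")$p.value
p_ar1 <- Box.test(ar1_series, lag = lag_used, type = "Ljung-Box")$p.value
```

A very small p-value, as expected for the AR(1) series, signals autocorrelation left in the data, which is the situation reported for Station 571.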
# Models development
As previously mentioned, this section of the analysis focuses on daily data to develop forecasting models. Given the need to predict future PM10 values for implementing preventive health measures, a short-term analysis is essential.
To simplify visualization and focus on recent data, the time series for this part of the study is limited to the period from January 1, 2022, to December 31, 2023. As shown previously, choosing a time window that starts at the end of the COVID-19 emergency period in Italy does not appear to affect the pattern of PM10 values.
For an accurate evaluation, it is crucial not to evaluate the models on data that were used for training. Therefore, the time series is divided into training and test sets, with the test period spanning from December 1, 2023, to the end of the period. A one-month test set is selected to assess how the model performs with a relatively long forecast horizon.
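A minimal sketch of such a split with `window()`, using the built-in monthly `AirPassengers` series rather than the PM10 data. The point it illustrates is that `window()` includes both end points, so the training and test ranges should not share a time stamp.

```r
# AirPassengers ships with base R: monthly data from 1949 to 1960.
train <- window(AirPassengers, end = c(1959, 12))
test <- window(AirPassengers, start = c(1960, 1))

# No observation should appear in both sets.
n_total <- length(train) + length(test)
last_train_time <- max(time(train))
first_test_time <- min(time(test))
```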
## Stochastic models
```{r}
# Train-test split
start_train <- as.Date("2022-01-01")
date_split <- as.Date("2023-12-01")
start_train_float <- date_to_float(start_train, daily_ts_freq)
end_train_float <- date_to_float(date_split - 1, daily_ts_freq)
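# Note: the test window starts at the last training time stamp; since window() includes both end points, that observation may be shared by the training and test sets
start_test_float <- end_train_float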
train_date_range <- seq(start_train, date_split - 1, by = "day")
test_date_range <- seq(date_split, end_date, by = "day")
```
### Station 548 - Milano v.Senato {.unlisted .unnumbered}
```{r}
ts_m_train <- window(ts_m, start_train_float, end_train_float)
ts_m_test <- window(ts_m, start_test_float)
tsdisplay(ts_m_train, lag.max = 40, main = paste(aq_station_names_code[1], "-", "original"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
tsdisplay(log_my(ts_m_train), lag.max = 40, main = paste(aq_station_names_code[1], "-", "logarithmic"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
```
The plots indicate that the time series is not stationary, as evidenced by the slow decay of the autocorrelations in the ACF plot. Applying a logarithmic transformation to stabilize the variance leaves the situation largely unchanged. The ACF is displayed up to lag 40 because correlations between daily observations more than about a month apart are expected to be negligible.
```{r}
tsdisplay(diff(ts_m_train), lag.max = 40, main = paste(aq_station_names_code[1], "-", "differenced"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
tsdisplay(diff(ts_m_train, lag = 7), lag.max = 40, main = paste(aq_station_names_code[1], "-", "differenced by lag 7"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
tsdisplay(diff(ts_m_train, lag = as.integer(daily_ts_freq)), lag.max = 40, main = paste(aq_station_names_code[1], "-", "differenced by lag 365"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
```
Differencing the time series appears to be highly effective, as the data now seem better aligned with the stationarity assumption. However, differencing by lag 7 proves unhelpful for improving stationarity and introduces an artificial seasonal pattern into the data. Similarly, differencing by the time series frequency results in the appearance of a seasonal pattern, with sinusoidal oscillations visible in the ACF plot.
More generally, differencing daily observations to remove potential periodic factors can be impractical and risky. It overlooks time scales such as weeks or months, where specific patterns are expected over longer periods, and the variability between the same days in different years (driven by factors such as weather or the day of the week) makes such differencing uninformative and difficult to apply effectively.
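The effect of first differencing on the ACF can be reproduced on synthetic data. The sketch below simulates a random walk (a textbook non-stationary series, not the PM10 data) and compares the lag-1 autocorrelation before and after differencing; the seed and series length are arbitrary.

```r
set.seed(42)
random_walk <- ts(cumsum(rnorm(500)))

# Element 1 of $acf is lag 0, so element 2 is the lag-1 autocorrelation.
acf_level <- acf(random_walk, lag.max = 40, plot = FALSE)$acf[2]
acf_diff <- acf(diff(random_walk), lag.max = 40, plot = FALSE)$acf[2]
```

For the undifferenced series the autocorrelation stays close to one and decays slowly, while after differencing it is close to zero, which is the pattern looked for in the `tsdisplay()` plots above.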
Given that the ACF plot of the differenced time series shows more pronounced spikes up to lag 5 and the PACF exhibits a relatively fast decay, an ARIMA model with order (0, 1, 5) is fitted. The residuals are then examined to confirm they resemble white noise, ensuring the model's adequacy. Finally, this model is compared with an automatically selected ARIMA model to assess its performance.
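Written out, and assuming no drift term is included, an ARIMA(0,1,5) model states that the first difference of the series follows an MA(5) process. The symbols \(\theta_1, \dots, \theta_5\) (moving average coefficients) and \(\varepsilon_t\) (white noise) are introduced here for illustration:
$$
y_t - y_{t-1} = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \theta_3 \varepsilon_{t-3} + \theta_4 \varepsilon_{t-4} + \theta_5 \varepsilon_{t-5}
$$
The theoretical ACF of the differenced series under this model is zero beyond lag 5, which is why spikes concentrated in the first five lags point to this order.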
```{r}
ma5_ts_m <- Arima(ts_m_train, order = c(0, 1, 5))
checkresiduals(ma5_ts_m)
```
```{r}
auto_ts_m <- auto.arima(ts_m_train, ic = "aicc")
checkresiduals(auto_ts_m)
```
The residuals of the automatically selected ARIMA model appear preferable. The Ljung-Box test yields a higher p-value, and both the residuals plot and ACF suggest that the residuals are closer to white noise.
This procedure is also applied to the other stations. The time series are differenced once to improve stationarity, and the plots are inspected to determine the most suitable ARIMA model for each. In all cases, the models are compared with those selected automatically. While the residuals for both methods generally align with the assumptions, the automatically selected models consistently perform slightly better.
### Station 571 - Bormio v.Monte Braulio {.unlisted .unnumbered}
```{r}
ts_b_train <- window(ts_b, start_train_float, end_train_float)
ts_b_test <- window(ts_b, start_test_float)
tsdisplay(ts_b_train, lag.max = 40, main = paste(aq_station_names_code[2], "-", "original"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
# tsdisplay(log_my(ts_b_train), lag.max = 40, main = paste(aq_station_names_code[2], "-", "logarithmic"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
```
```{r}
tsdisplay(diff(ts_b_train), lag.max = 40, main = paste(aq_station_names_code[2], "-", "differenced"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
# tsdisplay(diff(ts_b_train, lag = 7), lag.max = 40, main = paste(aq_station_names_code[2], "-", "differenced by lag 7"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
# tsdisplay(diff(ts_b_train, lag = as.integer(daily_ts_freq)), lag.max = 40, main = paste(aq_station_names_code[2], "-", "differenced by lag 365"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
```
For the data from Station 571, more pronounced spikes are observed up to lag 2, along with indications of an autoregressive pattern in the additional spikes. As a result, an ARIMA(1,1,2) model is fitted to capture these characteristics.
```{r}
ar1_ma2_ts_b <- Arima(ts_b_train, order = c(1, 1, 2))
# checkresiduals(ar1_ma2_ts_b)
```
```{r}
auto_ts_b <- auto.arima(ts_b_train, ic = "aicc")
# checkresiduals(auto_ts_b)
```
### Station 703 - Schivenoglia v. Malpasso {.unlisted .unnumbered}
```{r}
ts_s_train <- window(ts_s, start_train_float, end_train_float)
ts_s_test <- window(ts_s, start_test_float)
tsdisplay(ts_s_train, lag.max = 40, main = paste(aq_station_names_code[3], "-", "original"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
# tsdisplay(log_my(ts_s_train), lag.max = 40, main = paste(aq_station_names_code[3], "-", "logarithmic"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
```
```{r}
tsdisplay(diff(ts_s_train), lag.max = 40, main = paste(aq_station_names_code[3], "-", "differenced"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
# tsdisplay(diff(ts_s_train, lag = 7), lag.max = 40, main = paste(aq_station_names_code[3], "-", "differenced by lag 7"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
# tsdisplay(diff(ts_s_train, lag = as.integer(daily_ts_freq)), lag.max = 40, main = paste(aq_station_names_code[3], "-", "differenced by lag 365"), ylab = paste("PM10 ", "(", pollutant_units[["PM10"]], ")", sep = ""))
```
A similar reading of the spikes applies to the data from Station 703, for which an ARIMA(1,1,3) model is fitted.
```{r}
ar1_ma3_ts_s <- Arima(ts_s_train, order = c(1, 1, 3))
# checkresiduals(ar1_ma3_ts_s)
```
```{r}
auto_ts_s <- auto.arima(ts_s_train, ic = "aicc")
# checkresiduals(auto_ts_s)
```
## Dynamic regression
This section explores the potential benefits of including other pollutants in predicting PM10. NO₂ is a key precursor in particulate matter formation, as it reacts in the atmosphere to create secondary particles. NOₓ (which includes both NO and NO₂) also significantly contributes to secondary particulate matter, originating mainly from combustion processes like vehicle and industrial emissions. While CO is primarily a gas, it can indirectly affect PM10 levels through atmospheric reactions that produce secondary pollutants, including particulate matter.
The analysis focuses on the relationship between NO₂ and PM10, as NO₂ appears to be the most relevant predictor. As before, missing values are imputed using linear interpolation.
```{r}
no2_ts_m <- xts2ts(xts(df_aq_m$NO2, order.by = df_aq_m$Date), daily_ts_freq)
no2_ts_b <- xts2ts(xts(df_aq_b$NO2, order.by = df_aq_b$Date), daily_ts_freq)
no2_ts_s <- xts2ts(xts(df_aq_s$NO2, order.by = df_aq_s$Date), daily_ts_freq)
no2_ts_m <- imputeTS::na_interpolation(no2_ts_m)
no2_ts_b <- imputeTS::na_interpolation(no2_ts_b)
no2_ts_s <- imputeTS::na_interpolation(no2_ts_s)
no2_ts_m_train <- window(no2_ts_m, start_train_float, end_train_float)
no2_ts_b_train <- window(no2_ts_b, start_train_float, end_train_float)
no2_ts_s_train <- window(no2_ts_s, start_train_float, end_train_float)
## Convert milligrams per cubic meter to micrograms per cubic meter
# co_ts_m <- xts2ts(xts(df_aq_m$CO, order.by = df_aq_m$Date), daily_ts_freq) * 1000
# co_ts_b <- xts2ts(xts(df_aq_b$CO, order.by = df_aq_b$Date), daily_ts_freq) * 1000
# co_ts_s <- xts2ts(xts(df_aq_s$CO, order.by = df_aq_s$Date), daily_ts_freq) * 1000
# co_ts_m <- imputeTS::na_interpolation(co_ts_m)
# co_ts_b <- imputeTS::na_interpolation(co_ts_b)
# co_ts_s <- imputeTS::na_interpolation(co_ts_s)
# co_ts_m_train <- window(co_ts_m, start_train_float, end_train_float)
# co_ts_b_train <- window(co_ts_b, start_train_float, end_train_float)
# co_ts_s_train <- window(co_ts_s, start_train_float, end_train_float)
#
# nox_ts_m <- xts2ts(xts(df_aq_m$NOx, order.by = df_aq_m$Date), daily_ts_freq)
# nox_ts_b <- xts2ts(xts(df_aq_b$NOx, order.by = df_aq_b$Date), daily_ts_freq)
# nox_ts_s <- xts2ts(xts(df_aq_s$NOx, order.by = df_aq_s$Date), daily_ts_freq)
# nox_ts_m <- imputeTS::na_interpolation(nox_ts_m)
# nox_ts_b <- imputeTS::na_interpolation(nox_ts_b)
# nox_ts_s <- imputeTS::na_interpolation(nox_ts_s)
# nox_ts_m_train <- window(nox_ts_m, start_train_float, end_train_float)
# nox_ts_b_train <- window(nox_ts_b, start_train_float, end_train_float)
# nox_ts_s_train <- window(nox_ts_s, start_train_float, end_train_float)
```
```{r}
plot_pollutant_XY_lin(
x = no2_ts_m_train,
y = ts_m_train,
station_name = aq_station_names_code[1],
unit_measure_x = pollutant_units[["NO2"]],
unit_measure_y = pollutant_units[["PM10"]],
xlab = "NO2",
ylab = "PM10"
)
# plot_pollutant_XY_lin(
# x = no2_ts_s_train,
# y = ts_s_train,
# station_name = aq_station_names_code[3],
# unit_measure_x = pollutant_units[["NO2"]],
# unit_measure_y = pollutant_units[["PM10"]],
# xlab = "NO2",
# ylab = "PM10"
# )
# plot_pollutant_XY_lin(
# x = no2_ts_b_train,
# y = ts_b_train,
# station_name = aq_station_names_code[2],
# unit_measure_x = pollutant_units[["NO2"]],
# unit_measure_y = pollutant_units[["PM10"]],
# xlab = "NO2",
# ylab = "PM10"
# )
```
For Station 548, a linear relationship between NO₂ and PM10 is evident, supported by a linear regression model where the predictor is highly significant. However, the residuals violate the white noise assumption. Similar results are observed at other stations, though for brevity, these are not presented here.
```{r}
# plot_pollutant_XY_lin(
# x = co_ts_m_train,
# y = ts_m_train,
# station_name = aq_station_names_code[1],
# unit_measure_x = pollutant_units[["PM10"]],
# unit_measure_y = pollutant_units[["PM10"]],
# xlab = "CO",
# ylab = "PM10"
# )
# plot_pollutant_XY_lin(
# x = co_ts_b_train,
# y = ts_b_train,
# station_name = aq_station_names_code[2],
# unit_measure_x = pollutant_units[["PM10"]],
# unit_measure_y = pollutant_units[["PM10"]],
# xlab = "CO",
# ylab = "PM10"
# )
# plot_pollutant_XY_lin(
# x = co_ts_s_train,
# y = ts_s_train,
# station_name = aq_station_names_code[3],
# unit_measure_x = pollutant_units[["PM10"]],
# unit_measure_y = pollutant_units[["PM10"]],
# xlab = "CO",
# ylab = "PM10"
# )
```
```{r}
# plot_pollutant_XY_lin(
# x = nox_ts_m_train,
# y = ts_m_train,
# station_name = aq_station_names_code[1],
# unit_measure_x = pollutant_units[["NOx"]],
# unit_measure_y = pollutant_units[["PM10"]],
# xlab = "NOx",
# ylab = "PM10"
# )
# plot_pollutant_XY_lin(
# x = nox_ts_b_train,
# y = ts_b_train,
# station_name = aq_station_names_code[2],
# unit_measure_x = pollutant_units[["NOx"]],
# unit_measure_y = pollutant_units[["PM10"]],
# xlab = "NOx",
# ylab = "PM10"
# )
# plot_pollutant_XY_lin(
# x = nox_ts_s_train,
# y = ts_s_train,
# station_name = aq_station_names_code[3],
# unit_measure_x = pollutant_units[["NOx"]],
# unit_measure_y = pollutant_units[["PM10"]],
# xlab = "NOx",
# ylab = "PM10"
# )
```
A dynamic regression model is now fitted to the time series data using NO₂ as a predictor. The model is estimated using the `auto.arima` function, which selects the ARIMA structure of the errors by minimizing the corrected Akaike Information Criterion (AICc). Residuals are then assessed for white noise using the `checkresiduals` function. Results for the other stations are not shown for brevity, but their residuals appear satisfactory.
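The structure being estimated can be written compactly as a regression with ARIMA errors. The notation below (\(y_t\) for PM10, \(x_t\) for NO₂, \(\eta_t\) for the regression error, \(B\) for the backshift operator) is introduced here for illustration:
$$
y_t = \beta_0 + \beta_1 x_t + \eta_t, \qquad \phi(B)\,(1 - B)^d\,\eta_t = \theta(B)\,\varepsilon_t
$$
Here \(\varepsilon_t\) is white noise, and the orders of the polynomials \(\phi(B)\) and \(\theta(B)\) and the degree of differencing \(d\) are the quantities chosen by the automatic selection. When \(d > 0\), the intercept \(\beta_0\) is removed by the differencing.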
```{r}
xreg_no2_ts_m <- auto.arima(ts_m_train, xreg = no2_ts_m_train, ic = "aicc")
checkresiduals(xreg_no2_ts_m)
```
```{r}
xreg_no2_ts_b <- auto.arima(ts_b_train, xreg = no2_ts_b_train, ic = "aicc")
# checkresiduals(xreg_no2_ts_b)
```
```{r}
xreg_no2_ts_s <- auto.arima(ts_s_train, xreg = no2_ts_s_train, ic = "aicc")
# checkresiduals(xreg_no2_ts_s)
```
In this case the residuals resemble white noise, and the Ljung-Box test provides no evidence against this hypothesis.
## ARIMA comparison
```{r}
ts_m_final_model_list <- list(ma5_ts_m, auto_ts_m, xreg_no2_ts_m)
ts_b_final_model_list <- list(ar1_ma2_ts_b, auto_ts_b, xreg_no2_ts_b)
ts_s_final_model_list <- list(ar1_ma3_ts_s, auto_ts_s, xreg_no2_ts_s)
IC_values_ts_m <- compute_arima_IC(arima_models = ts_m_final_model_list, suffixes = list("", "[auto]", "[auto]"))
IC_values_ts_b <- compute_arima_IC(arima_models = ts_b_final_model_list, suffixes = list("", "[auto]", "[auto]"))
IC_values_ts_s <- compute_arima_IC(arima_models = ts_s_final_model_list, suffixes = list("", "[auto]", "[auto]"))
print_table_custom(IC_values_ts_m[[1]], title = paste(aq_station_names_code[1], " - ARIMA models IC values"))
print_table_custom(IC_values_ts_b[[1]], title = paste(aq_station_names_code[2], " - ARIMA models IC values"))
print_table_custom(IC_values_ts_s[[1]], title = paste(aq_station_names_code[3], " - ARIMA models IC values"))
```
At all stations, the models that use nitrogen dioxide (NO₂) as a predictor exhibit lower AICc values than the models without it. The BIC, which imposes a greater penalty for model complexity, is also lower for these models.
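For reference, the criteria compared above can be computed by hand from the log-likelihood. The sketch below does so for a base R `arima()` fit on a simulated AR(1) series; it is not the `compute_arima_IC()` helper used in this report, and the parameter count `k` (coefficients plus the innovation variance) is the convention assumed here.

```r
set.seed(1)
y <- arima.sim(model = list(ar = 0.5), n = 200)
fit <- arima(y, order = c(1, 0, 0))

k <- length(coef(fit)) + 1  # AR coefficient, mean, innovation variance
n <- length(y)

aic <- -2 * fit$loglik + 2 * k
aicc <- aic + 2 * k * (k + 1) / (n - k - 1)  # small-sample correction
bic <- -2 * fit$loglik + k * log(n)          # heavier penalty per parameter
```

Because `log(n)` exceeds 2 for any series of this length, each additional parameter costs more under the BIC than under the AIC or AICc, so a model that wins on both criteria is not being favoured merely for its extra complexity.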
## Nonlinear models
A nonlinear model is also fitted to the data, specifically a neural network autoregressive (NNAR) model. This is a feedforward neural network with a single hidden layer, estimated using the `nnetar()` function, which chooses the number of lagged inputs and hidden nodes automatically when they are not specified.
For non-seasonal data, the fitted model is represented as an $NNAR(p,k)$ model, where \(p\) is the number of lagged inputs and \(k\) denotes the number of hidden nodes. This model is analogous to an AR(p) model but incorporates nonlinear functions. For seasonal data, the model is denoted as an $NNAR(p,P,k)[m]$, analogous to an $ARIMA(p,0,0)(P,0,0)[m]$ model but with nonlinear components. According to the *Universal Approximation Theorem* ^[G. Cybenko. "Approximation by superpositions of a sigmoidal function". In: Mathematics of Control, Signals and Systems 2 (1989), pp. 303-314], a neural network with a single hidden layer can approximate any continuous function on a compact subset of \(\mathbb{R}^n\) to arbitrary accuracy, given enough hidden nodes, making it a powerful tool despite its reduced interpretability.
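As an illustration of what a single network of this kind computes in the non-seasonal case, the one-step prediction of an $NNAR(p,k)$ model with a sigmoid activation and a linear output can be written as follows. The weights \(w_j\), \(w_{ij}\) and biases \(b_0\), \(b_j\) are notation introduced here:
$$
\hat{y}_t = b_0 + \sum_{j=1}^{k} w_j \, \sigma\!\left( b_j + \sum_{i=1}^{p} w_{ij} \, y_{t-i} \right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$
Setting \(k = 0\) and connecting the inputs directly to the output would recover a linear AR(p) model, which is the sense in which the two are analogous.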
For nonlinear models, residual diagnostics based only on the autocorrelation function may not be sufficient to assess model validity. Therefore, the autocorrelation of the squared residuals and cross-correlation functions are also examined to give a more complete evaluation of the model.
```{r}
nn_ts_m_auto <- nnetar(ts_m_train)
plot_nn_residuals(nn_ts_m_auto, aq_station_names_code[1], ts_m_train, lag_max = 40)
```
The ACF plot of the residuals suggests that they are consistent with the white noise assumption, and the p-values from the Ljung-Box test give no significant evidence against it. However, the ACF of the squared residuals and the cross-correlation plots reveal that some information may remain, as some lags appear correlated. This suggests that a more advanced neural network model might be needed. The same procedure applied to the other stations yields similar results.
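Why the squared residuals are worth checking can be shown on synthetic data. The sketch below simulates a series whose values are uncorrelated in theory but whose variance depends on the previous value (an ARCH-type process, chosen here only as an example and not claimed to describe the PM10 residuals), and applies the Ljung-Box test to the series and to its squares.

```r
set.seed(7)
n <- 2000
z <- rnorm(n)
e <- numeric(n)
e[1] <- z[1]
# Conditional variance depends on the previous squared value.
for (t in 2:n) {
  e[t] <- z[t] * sqrt(0.5 + 0.5 * e[t - 1]^2)
}

p_levels <- Box.test(e, lag = 10, type = "Ljung-Box")$p.value
p_squares <- Box.test(e^2, lag = 10, type = "Ljung-Box")$p.value
```

The test on the squares detects the dependence that the ordinary ACF is not designed to reveal, which is the kind of remaining structure referred to above.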
```{r}
nn_ts_b_auto <- nnetar(ts_b_train)
# plot_nn_residuals(nn_ts_b_auto, aq_station_names_code[2], ts_b_train, lag_max = 40)
```
```{r}
nn_ts_s_auto <- nnetar(ts_s_train)
# plot_nn_residuals(nn_ts_s_auto, aq_station_names_code[3], ts_s_train, lag_max = 40)
```
# Forecasting
This section addresses the forecasting of PM10 levels using various models. An illustrative example of the train-test split for the Milan station data is presented in the plot below.
For convenience, the forecasting analysis is demonstrated using data from Station 548 in Milan. However, the same methodology can be extended to other stations.
```{r}
plot_train_test_ts(ts_m_train, ts_m_test,
c("steelblue", "darkorange"),
main = paste(aq_station_names_code[1]),
ylab = paste("PM10", pollutant_units[["PM10"]]),
train_date_range = train_date_range,
test_date_range = test_date_range
)
# plot_train_test_ts(ts_b_train, ts_b_test,
# c("steelblue", "darkorange"),
# main = paste(aq_station_names_code[2]),
# ylab = paste("PM10", pollutant_units[["PM10"]]),
# train_date_range = train_date_range,
# test_date_range = test_date_range
# )
# plot_train_test_ts(ts_s_train, ts_s_test,
# c("steelblue", "darkorange"),
# main = paste(aq_station_names_code[3]),
# ylab = paste("PM10", pollutant_units[["PM10"]]),
# train_date_range = train_date_range,
# test_date_range = test_date_range
# )
```
In the dynamic regression approach, the mean of the NO₂ observations in the training set is used as the value of the predictor over the forecast horizon when forecasting PM10.
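A minimal sketch of this idea in base R, on simulated data: a regression with AR(1) errors is fitted with `stats::arima()`, and the forecast is produced by holding the regressor at its training mean. The variable names, the simulated coefficients, and the use of `stats::arima()` and `predict()` instead of the `forecast` package functions used in this report are all assumptions made to keep the example self-contained.

```r
set.seed(2024)
n <- 500
# Synthetic "NO2" regressor and "PM10" response with autocorrelated errors.
no2 <- 20 + as.numeric(arima.sim(model = list(ar = 0.7), n = n, sd = 3))
noise <- as.numeric(arima.sim(model = list(ar = 0.6), n = n))
pm10 <- 5 + 0.5 * no2 + noise

# Regression with AR(1) errors; a named column keeps the coefficient name stable.
xreg_train <- cbind(no2 = no2)
fit <- arima(pm10, order = c(1, 0, 0), xreg = xreg_train)

# Future regressor values are replaced by the training mean.
h <- 31
future_no2 <- cbind(no2 = rep(mean(no2), h))
fc <- predict(fit, n.ahead = h, newxreg = future_no2)
```

With a constant regressor, the forecast path is driven only by the ARMA part of the errors, so it converges to a flat level as the horizon grows; this is worth keeping in mind when reading the one-month forecasts.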
## Prediction performance
```{r}
largest_horizon <- 31
# Map the row with the lowest AICc back to its position in the original model list
best_idx <- IC_values_ts_m[[2]][which.min(IC_values_ts_m[[1]]$AICc)]
# Best model according to AICc (the neural network model is not in this list)
best_arima_ts_m <- ts_m_final_model_list[[best_idx]]
plot_forecast(best_arima_ts_m, largest_horizon, ts_m_train, ts_m_test,
ylab = paste("PM10", pollutant_units[["PM10"]]),
main = paste("Best ARIMA", "-", aq_station_names_code[1])
)