Add KS-statistic example #61

Merged
merged 19 commits into from
Dec 17, 2023
7 changes: 7 additions & 0 deletions Project.toml
@@ -16,13 +16,20 @@ Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
TimeseriesSurrogates = "c804724b-8c18-5caa-8579-6025a0767c70"

[weakdeps]
Makie = "ee78f7c6-11fb-53f2-987a-cfe4a2b5a57a"

[extensions]
TITVisualizations = "Makie"

[compat]
DelimitedFiles = "1"
Downloads = "1"
FFTW = "^1.6"
HypothesisTests = "0.11"
InteractiveUtils = "1"
LinearAlgebra = "1"
Makie = "≥ 0.19"
Random = "1"
Reexport = "1.2"
Statistics = "1"
2 changes: 2 additions & 0 deletions docs/src/api.md
@@ -18,6 +18,8 @@ SegmentedWindowResults
significant_transitions
TransitionsSignificance
SurrogatesSignificance
ThresholdSignificance
SigmaSignificance
QuantileSignificance
```

65 changes: 58 additions & 7 deletions docs/src/examples/do-events.jl
@@ -1,16 +1,36 @@
#=
# Dansgaard-Oescher events and Critical Slowing Down

The $\delta^{18}O$ timeseries of the North Greenland Ice Core Project
([NGRIP](https://en.wikipedia.org/wiki/North_Greenland_Ice_Core_Project)) are,
to this date, the best proxy record for the Dansgaard-Oeschger events
([DO-events](https://en.wikipedia.org/wiki/Dansgaard%E2%80%93Oeschger_event)).
DO-events are sudden warming episodes of the North Atlantic, reaching 10 degrees
of regional warming within 100 years. They happened quasi-periodically over the
last glacial cycle due to transitions between strong and weak states of the Atlantic
Meridional Overturning Circulation and might therefore be the most prominent
examples of abrupt transitions in the field of climate science. We here propose
to hindcast these events by applying the theory of Critical Slowing Down (CSD)
on the NGRIP data, which can be found [here](https://www.iceandclimate.nbi.ku.dk/data/)
in its raw format. This analysis has already been done in [boers-early-warning-2018](@cite)
and we here try to reproduce Figure 2.d-f.

## Preprocessing NGRIP

Data pre-processing is not part of TransitionsInTimeseries.jl,
but a step the user has to do before using the package.
To present an example with a complete scientific workflow,
we showcase typical data pre-processing here, which consists of the following steps:
1. Load the data, reverse and offset it to have time vector = time before 2000 AD.
2. Filter non-unique points in time and sort the data.
3. Regrid the data from uneven to even sampling.

The time and $\delta^{18}O$ vectors resulting from the $i$-th preprocessing step are
respectively called $t_i$ and $x_i$. The final step consists in obtaining a residual
$r$, i.e. the fluctuations of the system around the attractor that, within CSD
theory, the system is assumed to track. As this example will show,
TransitionsInTimeseries.jl is convenient enough that the bulk of the code ends up
being plotting and preprocessing.
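
Steps 2 and 3 of the list above can be sketched on a tiny synthetic record as follows.
This is only an illustration under simplifying assumptions: `dedup_sort` and
`regrid_linear` are hypothetical helper names (not part of TransitionsInTimeseries.jl
or of the NGRIP script), and plain linear interpolation is assumed for the regridding.

```julia
# Hypothetical helper: drop repeated time stamps, then sort by time.
function dedup_sort(t, x)
    keep = unique(i -> t[i], eachindex(t))   # first index of each distinct time
    p = keep[sortperm(t[keep])]
    return t[p], x[p]
end

# Hypothetical helper: linear interpolation onto an evenly spaced grid.
function regrid_linear(t, x, dt)
    tgrid = collect(first(t):dt:last(t))
    xgrid = map(tgrid) do tq
        j = clamp(searchsortedlast(t, tq), 1, length(t) - 1)
        w = (tq - t[j]) / (t[j + 1] - t[j])
        (1 - w) * x[j] + w * x[j + 1]
    end
    return tgrid, xgrid
end

t_raw = [3.0, 0.0, 1.0, 1.0, 4.5]   # uneven, unsorted, one duplicated time stamp
x_raw = [6.0, 0.0, 2.0, 9.0, 9.0]
t2, x2 = dedup_sort(t_raw, x_raw)    # t2 == [0.0, 1.0, 3.0, 4.5]
t3, x3 = regrid_linear(t2, x2, 1.5)  # evenly sampled every 1.5 time units
```

Keeping the first occurrence of a duplicated time stamp is an arbitrary choice made
for this sketch; averaging the duplicates would be an equally valid convention.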

### Step 1:
=#
@@ -84,7 +104,9 @@ fcutoff = 0.95 * 0.01 # cutoff ≃ 0.01 yr^-1 as in (Boers 2018)
t, x, xtrend, r = chebyshev_filter(t3, x3, fcutoff)

#=
Let's now visualize our data in what will become our main figure.
For the segmentation of the DO-events, we rely on the tabulated
data from [rasmussen-stratigraphic-2014](@cite) (which will soon be available for download):
=#

using CairoMakie, Loess
@@ -140,7 +162,15 @@ fig
#=
## Hindcast on NGRIP data

As one can see... there is not much to see so far.
Residuals are hard to assess by eye, so we use
TransitionsInTimeseries.jl to study how the residual's variance and lag-1
autocorrelation (AC1) evolve over time, as measured by their ridge-regression slope.
In many examples in the literature, including [boers-early-warning-2018](@cite),
the CSD analysis is performed over segments (sometimes only one) of the timeseries,
such that a significance value is obtained for each segment. With
`SegmentedWindowConfig`, dealing with segments is easy in
TransitionsInTimeseries.jl, as demonstrated here:
=#

using TransitionsInTimeseries, StatsBase
@@ -184,11 +214,32 @@ plot_segment_analysis!(axs, results, signif)
fig
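
#=
For intuition about what the indicators and their trend measure, the following sketch
computes the variance and AC1 over sliding windows of a synthetic AR(1) series and then
estimates their linear trend. It deliberately does not use the package API: `ac1`,
`sliding` and `ols_slope` are hypothetical helper names, and an ordinary least-squares
slope is assumed here as a simple stand-in for the ridge regression used above.

```julia
using Statistics, Random

# Lag-1 autocorrelation of a window (Pearson correlation of the shifted copies).
ac1(w) = cor(view(w, 1:length(w)-1), view(w, 2:length(w)))

# Apply `f` over sliding windows of `width` samples, moving one sample at a time.
sliding(f, x, width) = [f(view(x, i:i+width-1)) for i in 1:length(x)-width+1]

# Ordinary least-squares slope of `y` against its index.
function ols_slope(y)
    k = 1:length(y)
    return cov(k, y) / var(k)
end

# Synthetic residual: AR(1) noise whose memory increases over time.
rng = MersenneTwister(1)
n = 2000
r = zeros(n)
for i in 2:n
    ϕ = 0.2 + 0.7 * i / n          # AR(1) coefficient drifts from 0.2 to 0.9
    r[i] = ϕ * r[i-1] + randn(rng)
end

var_indicator = sliding(var, r, 200)
ac1_indicator = sliding(ac1, r, 200)
trend_var = ols_slope(var_indicator)
trend_ac1 = ols_slope(ac1_indicator)
```

Both slopes should come out positive for this series, since its memory (and with it
its variance) grows over time. Whether such a slope is significant is a separate
question, which the surrogate testing above addresses.
=#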

#=
In [boers-early-warning-2018](@cite), 13/16 and 7/16 true positives are respectively
found for the variance and AC1, with 16 referring to the total number of transitions.
The timeseries actually includes 18 transitions but, in
[boers-early-warning-2018](@cite), some segments are considered too small to be analysed.
In contrast, we here respectively find 9/16 true positives
for the variance and 3/16 for AC1. We can trace the discrepancies back to the
surrogate testing, since the indicator timeseries computed here are almost
identical to those of [boers-early-warning-2018](@cite). This mismatch shows
that packages like TransitionsInTimeseries.jl are desirable for research to be
reproducible, especially since CSD is gaining attention - not only within the
scientific community but also in popular media.
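
For context on why variance and AC1 serve as indicators, here is a standard textbook
linearisation (not taken from the cited paper). Assume a one-dimensional system close
to a stable equilibrium $x^*$ of a potential $U$, so that the residual $r = x - x^*$
approximately follows an Ornstein-Uhlenbeck process with restoring rate
$\lambda = U''(x^*) > 0$ and noise amplitude $\sigma$ (both symbols are introduced
here purely for illustration):

```math
\mathrm{d}r = -\lambda \, r \, \mathrm{d}t + \sigma \, \mathrm{d}W, \qquad
\mathrm{Var}(r) = \frac{\sigma^2}{2 \lambda}, \qquad
\mathrm{AC1} = e^{-\lambda \Delta t}
```

As the potential flattens, $\lambda \to 0$, so the stationary variance grows and the
lag-1 autocorrelation at sampling step $\Delta t$ approaches 1.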

## CSD: only a necessary condition, only in some cases

For codimension-1 systems, approaching a fold, Hopf or transcritical bifurcation implies
a widening of the potential $U$, which defines the deterministic term $f = -∇U$ of the
SDE's right-hand-side. In the presence of noise, this leads to CSD, which is therefore
a **necessary condition** for crossing one of these bifurcations - although it is not
always assessable by analysing the timeseries due to practical limitations (e.g. sparse
data subject to large measurement noise). It is nonetheless not given that DO-events,
like many other real-life applications, can be seen as a codimension-1 fold, Hopf or
transcritical bifurcation. Besides this, we emphasise that CSD is **not a sufficient
condition** for concluding that a transition lies ahead in the near future, since a resilience
loss can happen without actually crossing any bifurcation. This can be illustrated on
the present example by performing the same analysis only until a few hundred years before
the transition:
=#

tseg_end = t_rasmussen[2:end] .- 700 # stop analysis 500 years earlier than before
168 changes: 168 additions & 0 deletions docs/src/examples/ks_paleojump.jl
@@ -0,0 +1,168 @@
# # Kolmogorov-Smirnov test for detecting transitions in paleoclimate timeseries

# The goal of this example is to show how simple it is to re-create an analysis _similar_
# to what was done in the paper
# "Automatic detection of abrupt transitions in paleoclimate records",
# [Bagniewski2021](@cite). The same analysis was then used to create a database
# of transitions in paleoclimate records in [Bagniewski2023](@cite).
# Using TransitionsInTimeseries.jl and HypothesisTests.jl,
# the analysis becomes a 10-lines-of-code script (for a given timeseries).

# ## Scientific background

# The approach of [Bagniewski2021](@cite) is based on the [two-sample
# Kolmogorov Smirnov test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov%E2%80%93Smirnov_test).
# It tests whether the samples from two datasets or timeseries
# are distributed according to the same cumulative distribution function or not.
# This is assessed by comparing the value of the _KS-statistic_
# against some threshold that depends on the required confidence.
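
#=
To make the statistic concrete, here is a minimal hand-written sketch of the two-sample
KS statistic and of the usual large-sample threshold. The helper names `ecdf_at`,
`ks_statistic` and `ks_threshold` are hypothetical and only serve this illustration;
the actual example below relies on HypothesisTests.jl instead.

```julia
# Empirical cumulative distribution function of sample `v`, evaluated at `p`.
ecdf_at(v, p) = count(<=(p), v) / length(v)

# Two-sample KS statistic: largest gap between the two empirical CDFs.
# The maximum is attained at one of the sample points, so checking those suffices.
ks_statistic(a, b) = maximum(abs(ecdf_at(a, p) - ecdf_at(b, p)) for p in vcat(a, b))

# Large-sample critical value at significance level α.
function ks_threshold(na, nb; α = 0.05)
    n = na * nb / (na + nb)          # effective sample size
    return sqrt(-log(α / 2) / (2n))
end

a = [0.1, 0.4, 0.7, 1.0]
b = [1.2, 1.5, 1.8, 2.1]           # no overlap with `a`
c = [0.2, 0.5, 0.8, 1.1]           # interleaved with `a`

D_ab = ks_statistic(a, b)          # 1.0: the empirical CDFs separate completely
D_ac = ks_statistic(a, c)          # 0.25: the empirical CDFs stay close
```

With only four samples per set the large-sample threshold (about 0.96 here) is merely
indicative, but it already separates the two cases: `D_ab` exceeds it and `D_ac` does not.
=#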

# The application of this test for identifying transitions in timeseries is simple:

# 1. A sliding window analysis is performed on the timeseries.
# 1. In each window, the KS statistic is estimated between the first half and the second
# half of the timeseries within this window.
# 1. Transitions are defined by when the KS statistic exceeds a particular value
# based on some confidence. The transition occurs in the middle of the window.
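
#=
The three steps above can be sketched in a few lines of plain Julia. This is a toy
version on a synthetic step function, not the package implementation: `ks` and
`sliding_ks` are hypothetical helper names, and the convention of assigning each
value to the first sample of the second half of the window is an assumption of this sketch.

```julia
# Largest gap between the empirical CDFs of `a` and `b` (two-sample KS statistic).
ks(a, b) = maximum(abs(count(<=(p), a) / length(a) - count(<=(p), b) / length(b)) for p in vcat(a, b))

# Slide a window of `width` samples over `x`, compare its two halves, and
# attribute the result to the time at the middle of the window.
function sliding_ks(t, x, width)
    half = width ÷ 2
    starts = 1:(length(x) - width + 1)
    tmid = [t[i + half] for i in starts]
    D = [ks(view(x, i:i+half-1), view(x, i+half:i+width-1)) for i in starts]
    return tmid, D
end

# Synthetic series with a single abrupt jump at t = 50.
t = collect(1.0:100.0)
x = [ti < 50 ? 0.0 : 1.0 for ti in t]
tmid, D = sliding_ks(t, x, 20)
t_jump = tmid[argmax(D)]           # 50.0: the statistic peaks at the jump
```

Windows that only partially straddle the jump also yield large values of the statistic,
which is why detected transitions tend to cluster around the true one.
=#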

# We should point out that in [Bagniewski2021](@cite) the authors did a
# more detailed analysis: they analyzed many different window widths, added a
# conditional clause to exclude transitions that do not exceed a predefined minimum
# "jump" in the data, and added another conditional clause that
# filters out transitions that are grouped in time (such grouping is a natural consequence
# of using the Kolmogorov-Smirnov test for detecting transitions).
#
# Here we skip that post-processing, mainly because it is rather simple
# to add these conditional clauses and filter the transitions after they are found.
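
#=
As a hedged illustration of what such post-processing could look like, the sketch below
merges flagged samples that are adjacent in time and then discards candidates whose
jump is too small. It is not the procedure of the paper: `group_peaks` and `big_enough`
are hypothetical helpers, and the jump is assumed to be measured as a difference of means.

```julia
using Statistics

# Merge consecutive flagged samples into groups and keep, for each group,
# the index with the largest change metric.
function group_peaks(flags, metric)
    peaks = Int[]
    i = 1
    while i <= length(flags)
        if flags[i]
            j = i
            while j < length(flags) && flags[j + 1]
                j += 1
            end
            push!(peaks, (i:j)[argmax(view(metric, i:j))])
            i = j + 1
        else
            i += 1
        end
    end
    return peaks
end

# Keep a candidate only if the mean of `x` changes by at least `minjump`
# between the `half` samples before index `k` and the `half` samples from `k` on.
function big_enough(x, k, half, minjump)
    before = view(x, max(1, k - half):k - 1)
    after = view(x, k:min(length(x), k + half - 1))
    return abs(mean(after) - mean(before)) >= minjump
end

x = [0.0, 0.1, 0.0, 1.0, 1.1, 1.0, 1.05, 1.0]
metric = [0.1, 0.6, 0.8, 0.9, 0.7, 0.2, 0.6, 0.1]
flags = metric .> 0.5
peaks = group_peaks(flags, metric)                      # [4, 7]
kept = filter(k -> big_enough(x, k, 3, 0.5), peaks)     # [4]
```

The candidate at index 7 is dropped because the data barely change around it,
whereas the one at index 4 sits on a jump of about one unit.
=#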

# ## Steps for TransitionsInTimeseries.jl

# Doing this kind of work with TransitionsInTimeseries.jl is so easy you won't even trip!
# This analysis follows the same sliding window approach showcased in our [Tutorial](@ref),
# and it even excludes the "indicator" aspect: the change metric is estimated directly
# from the input data!

# As such, we really only need to define/do these things before we have finished the analysis:

# 1. Load the input data (we will use the same example as the NGRIP data of
# [Dansgaard-Oescher events and Critical Slowing Down](@ref) example)
# and set the appropriate time window.
# 2. Define the function that estimates the change metric (i.e., the KS-statistic).
# 3. Perform the sliding window analysis as in the [Tutorial](@ref) with [`estimate_indicator_changes`](@ref).
# 4. Estimate the "confident" transitions in the data by comparing the estimated
# KS-statistic with a predefined threshold.

# ## Load timeseries and window length
# Following the Dansgaard-Oescher events example, we load
# the data after all the processing steps done in that example:

using DelimitedFiles, CairoMakie

tmp = Base.download("https://raw.githubusercontent.com/JuliaDynamics/JuliaDynamics/"*
"master/timeseries/NGRIP_processed.csv")
data = readdlm(tmp)
t, xtrend, xresid, xloess = collect.(eachcol(data))

fig, ax = lines(t, xtrend; axis = (ylabel = "NGRIP (processed)", xlabel = "time"))
lines!(ax, t, xloess; linewidth = 2)
fig

# Since we are using a sliding window here, we choose a
# window of length 500 (which is approximately 1/2 to 1/4 the span between typical
# transitions found by [rasmussen-stratigraphic-2014](@cite)).

window = 500

# ## Defining the change metric function

# HypothesisTests.jl implements the Kolmogorov-Smirnov test, however here we are interested
# in the value of the test itself (the so-called KS-statistic), rather than a p-value.
# To this end, we define the following function to compute the statistic,
# which also normalizes it as in [Bagniewski2021](@cite).
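
#=
For reference, the rescaling performed by the function below can be written as follows.
The formula is transcribed from the code rather than from the paper, and the symbol
$\tilde{D}_{KS}$ for the rescaled statistic is notation introduced here:

```math
\tilde{D}_{KS} = 1 - \frac{1 - D_{KS}}{1 - \sqrt{1/n}}, \qquad
n = \frac{n_x n_y}{n_x + n_y}
```

This maps $D_{KS} = \sqrt{1/n}$ to 0 and $D_{KS} = 1$ to 1, so the rescaled value is
negative whenever the raw statistic falls below $\sqrt{1/n}$.
=#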

using HypothesisTests

function normalized_KS_statistic(timeseries)
    N = length(timeseries)
    i = N÷2
    x = view(timeseries, 1:i)
    y = view(timeseries, (i+1):N)
    kstest = ApproximateTwoSampleKSTest(x, y)
    nx, ny = length(x), length(y) # the two halves differ by one sample if `N` is odd
    n = nx*ny/(nx + ny) # written fully for concreteness
    D_KS = kstest.δ # can be compared directly with sqrt(-log(α/2)/(2n))
    ## Rescale according to eq. (5) of the paper
    rescaled = 1 - ((1 - D_KS)/(1 - sqrt(1/n)))
    return rescaled
end

N = 1000 # the statistic is independent of `N` for large enough `N`!
x = randn(N)
y = 1.8randn(N) .+ 1.0
z = randn(N)
w = 0.6randn(N) .- 2.0

fig, ax = density(x; color = ("black", 0.5), strokewidth = 4.0, label = "reference distribution")
ax.title = "showcase of normalized KS-statistic"
for q in (y, z, w)
    D_KS = normalized_KS_statistic(vcat(x, q))
    density!(ax, q; label = "D_KS = $(D_KS)")
end
axislegend(ax)
fig

# ## Perform the sliding window analysis

# This is just a straightforward call to [`estimate_indicator_changes`](@ref).
# In fact, it is even simpler than the tutorial. Here we skip completely
# the "indicator" estimation step, and we evaluate the change metric directly
# on input data. We do this by simply passing `nothing` as the indicators.

using TransitionsInTimeseries

config = SlidingWindowConfig(nothing, normalized_KS_statistic; width_cha = 500)

results = estimate_indicator_changes(config, xtrend, t)

# Which we can visualize
function visualize_results(results)
    fig, ax1 = lines(t, xtrend; axis = (ylabel = "NGRIP (processed)",))
    ax2, = lines(fig[2, 1], results.t_change, vec(results.x_change), axis = (ylabel = "D_KS (normalized)", xlabel = "time"))
    linkxaxes!(ax1, ax2)
    hidexdecorations!(ax1; grid = false)
    xloess_normed = (xloess .- minimum(xloess))./(maximum(xloess) - minimum(xloess))
    lines!(ax2, t, xloess_normed; color = ("gray", 0.5))
    fig
end

visualize_results(results)

# By overplotting the (smoothed) NGRIP timeseries and the
# normalized KS-statistic, it already becomes pretty clear
# that the statistic peaks when transitions occur.

# The same thing happens if we alter the window duration:

config = SlidingWindowConfig(nothing, normalized_KS_statistic; width_cha = 200)
results = estimate_indicator_changes(config, xtrend, t)
visualize_results(results)

# So one can easily obtain extra confidence by varying window
# size as in [Bagniewski2021](@cite).

# ## Identifying "confident" transitions

# As the identification is done here via a simple threshold,
# finding the transitions is a nearly trivial call
# to [`significant_transitions`](@ref) with [`ThresholdSignificance`](@ref):

signif = ThresholdSignificance(0.5) # or any other threshold
flags = significant_transitions(results, signif)

fig = visualize_results(results)
axDKS = content(fig[2,1])
vlines!(axDKS, results.t_change[vec(flags)], color = ("red", 0.25))
fig

# We could proceed with a lot of post-processing as in [Bagniewski2021](@cite)
# but we skip this here for the sake of simplicity.