time windows in statistics #2948

petrelharp · 2024-05-09T23:07:02Z

Here @tforest and I are starting in on adding time windows to statistics. We're starting with what was sketched out in #683, and will explain things in more detail here when we're farther along (ignore this for now).

codecov · 2024-05-09T23:46:18Z

Codecov Report

Attention: Patch coverage is 88.57143% with 12 lines in your changes missing coverage. Please review.

Project coverage is 89.83%. Comparing base (16de381) to head (0d48891).
Report is 25 commits behind head on main.

Files with missing lines	Patch %	Lines
python/tskit/trees.py	75.00%	5 Missing and 5 partials ⚠️
c/tskit/trees.c	96.00%	0 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2948      +/-   ##
==========================================
- Coverage   89.85%   89.83%   -0.03%     
==========================================
  Files          29       29              
  Lines       32128    32222      +94     
  Branches     5763     5784      +21     
==========================================
+ Hits        28868    28946      +78     
- Misses       1859     1868       +9     
- Partials     1401     1408       +7

Flag	Coverage Δ
c-tests	`86.71% <96.07%> (+0.01%)`	⬆️
lwt-tests	`80.78% <ø> (ø)`
python-c-tests	`89.06% <100.00%> (+<0.01%)`	⬆️
python-tests	`98.80% <75.00%> (-0.18%)`	⬇️

Files with missing lines	Coverage Δ
c/tskit/core.c	`95.83% <100.00%> (ø)`
python/_tskitmodule.c	`89.06% <100.00%> (+<0.01%)`	⬆️
c/tskit/trees.c	`90.70% <96.00%> (+0.02%)`	⬆️
python/tskit/trees.py	`98.24% <75.00%> (-0.57%)`	⬇️

... and 1 file with indirect coverage changes

petrelharp · 2024-05-17T21:50:36Z

Note: it is not clear how to do this for site statistics, since the site stat is of the form
$$\sum_a f(w_a)$$
where the sum is over alleles, and $w_a$ is the weight of all samples with allele $a$;
however, it is mutations that have times, not alleles.

The proposal will probably be to compute a site stat that sums over mutations, not alleles, but we'll start with branch stats only for now.

petrelharp · 2024-05-17T22:44:43Z

Next step:

do the AFS first, since it's less tangled up

Also maybe:

allow ts.decapitate() to take inf as an argument (which would do nothing)?

andrewkern · 2024-05-17T23:49:18Z

a small nudge here that i mentioned to @petrelharp in passing-- it would be great to have an expectation from theory as to what time stratified quantities like the SFS should be under the (standard, neutral) coalescent

tforest · 2024-07-15T22:35:22Z

Some thoughts after working on time windows.

After these edits, the output of, say, the AFS is still a 2D array when using either windows or time_windows on its own. However, when using windows and time_windows at the same time, the output is a 3D array with shape [num_windows][num_time_windows][sample_size]. When windows or time_windows is None, the corresponding dimension is dropped.
Since there are now two types of windows, it is no longer obvious that the existing "windows" parameter refers specifically to windows along the genome. We have not renamed it for now, though, as that would break existing behavior.
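To make the dimension-dropping rule concrete, here is a minimal sketch of the shape logic described above. The function name `afs_output_shape` and its arguments are hypothetical (they are not part of tskit), and the case where both arguments are None is my reading of the rule rather than something confirmed by the code in this PR.

```python
def afs_output_shape(num_windows, num_time_windows, afs_dims,
                     windows_is_none, time_windows_is_none):
    """Return the shape of the result after dropping unused window dimensions.

    Hypothetical helper: afs_dims is the shape of a single spectrum, for
    example (sample_size,) for one sample set.
    """
    shape = []
    if not windows_is_none:
        # Genomic windows were given explicitly: keep that dimension.
        shape.append(num_windows)
    if not time_windows_is_none:
        # Time windows were given explicitly: keep that dimension.
        shape.append(num_time_windows)
    shape.extend(afs_dims)
    return tuple(shape)


# Both given: [num_windows][num_time_windows][sample_size]
print(afs_output_shape(3, 2, (5,), False, False))
# Only time windows given: [num_time_windows][sample_size]
print(afs_output_shape(1, 2, (5,), True, False))
```

Writing the rule this way treats the two arguments symmetrically, so neither dimension is dropped as a side effect of the other.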

Some ideas:

Add new benchmarks for summary stats to see if the implemented features are optimized both in terms of computational space and time complexity.
Add some plots for summary stats to observe how time windows impact them.

petrelharp · 2024-07-16T22:03:51Z

A note on the potential confusion between windows and time_windows - often one endpoint of the time_windows will be Inf, so if we make sure we produce an informative error if the windows aren't finite, we'll help people avoid the mistake.
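A rough sketch of what such a check could look like, assuming a standalone validator; the name `check_genomic_windows` and the wording of the error message are made up for illustration and are not taken from tskit.

```python
import math


def check_genomic_windows(windows):
    """Raise an informative error if any genomic window breakpoint is not finite.

    Hypothetical helper: time windows often end at infinity, so a non-finite
    breakpoint here suggests the two arguments were swapped.
    """
    for w in windows:
        if not math.isfinite(w):
            raise ValueError(
                "Genomic windows must be finite; "
                "did you mean to pass these values as time_windows?"
            )


check_genomic_windows([0.0, 10.0, 20.0])  # passes silently
try:
    check_genomic_windows([0.0, math.inf])
except ValueError as err:
    print(err)
```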

petrelharp

Looks good! One question about a possible refactor, and suggesting moving the "general stat" stuff to a different PR.

petrelharp · 2024-07-16T21:57:13Z

python/tests/test_tree_stats.py

+                    for u in tree.nodes()
+                )
+            sigma[tree.index, j, :] = s * tree.span
+    for j in range(1, len(time_windows) - 1):


Suggested change

for j in range(1, len(time_windows) - 1):

for j in range(1, tw):

petrelharp · 2024-07-16T21:58:28Z

python/tests/test_tree_stats.py

-        return windowed_tree_stat(ts, sigma, windows, span_normalise=span_normalise)
+        out = windowed_tree_stat(ts, sigma, windows, span_normalise=span_normalise)
+    if drop_time_windows:
+        # beware: this assumes the first dimension is windows


I think this comment can be removed?

hm but perhaps replaced by

assert len(out.shape) == 3

petrelharp · 2024-07-16T21:59:27Z

python/tests/test_tree_stats.py

@@ -144,39 +144,93 @@ def windowed_tree_stat(ts, stat, windows, span_normalise=True):
    return A


+# Timewindows test
 def naive_branch_general_stat(


This function looks good; however, I think this PR is going to be just about the AFS, not general stats - so, maybe this code should be put aside in a separate PR?

petrelharp · 2024-07-16T22:06:09Z

python/tests/test_tree_stats.py

-        if polarised:
-            s = sum(tree.branch_length(u) * f(x[u]) for u in tree.nodes())
+    sigma = np.zeros((ts.num_trees, tw, m))
+    for j, upper_time in enumerate(time_windows[1:]):


This assumes that time_windows[0] is 0, I think.

This could be fixed by setting sigma not to 0 but to -1 times the value calculated from ts.decapitate(time_windows[0]).
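As a toy illustration of why the first breakpoint needs to be subtracted too, the sketch below uses total branch length as a stand-in for the statistic computed on a decapitated tree sequence. The helpers `length_below` and `windowed_lengths` and the `(t_child, t_parent)` branch representation are assumptions made for this example; nothing here calls tskit.

```python
import math


def length_below(branches, t):
    """Total branch length below time t.

    Stand-in for a branch statistic computed after decapitating at time t;
    branches is a list of (t_child, t_parent) pairs.
    """
    return sum(
        max(0.0, min(t_parent, t) - t_child) for t_child, t_parent in branches
    )


def windowed_lengths(branches, time_windows):
    """Per-window values as differences of cumulative values at every breakpoint.

    The cumulative value at time_windows[0] is subtracted as well, so the
    result is correct even when time_windows[0] > 0.
    """
    cumulative = [length_below(branches, t) for t in time_windows]
    return [hi - lo for lo, hi in zip(cumulative[:-1], cumulative[1:])]


# Three samples: two coalesce at time 1, the third joins at time 3.
branches = [(0, 1), (0, 1), (1, 3), (0, 3)]
print(windowed_lengths(branches, [0.5, 2, math.inf]))  # [3.5, 2.0]
```

If the value at `time_windows[0]` were not subtracted, the first window here would report 5.0, because it would also count the branch length between time 0 and time 0.5.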

petrelharp · 2024-07-16T22:13:19Z

python/tests/test_tree_stats.py

+    # Warning: when using Windows and TimeWindows,
+    # the output has three dimensions


can delete this

petrelharp · 2024-07-16T22:25:33Z

python/tests/test_tree_stats.py

-                c = fold(c, out_dim)
-            index = tuple([window_index] + list(c))
-            result[index] += x
+    def update_result(window_index, u, right, time_windows):


time_windows isn't being changed by this function

Suggested change

def update_result(window_index, u, right, time_windows):

def update_result(window_index, u, right):

petrelharp · 2024-07-16T22:35:53Z

python/tests/test_tree_stats.py

+                # interval between child and parent inside the window
+                t_v = branch_length[u] + time[u]
+                tw_branch_length = min(time_windows[k_tw + 1], t_v) - max(
+                    time_windows[0], time[u]


Whoops?

Suggested change

time_windows[0], time[u]

time_windows[k_tw], time[u]
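For reference, a small self-contained sketch of the overlap calculation with the corrected lower bound. The function name is hypothetical, and clamping at zero for windows that do not overlap the branch is my assumption; the code under review may handle that case elsewhere.

```python
import math


def branch_length_in_window(t_child, t_parent, time_windows, k):
    """Length of the branch (t_child, t_parent) that falls inside time window k."""
    lower = time_windows[k]      # corrected: the lower edge of window k
    upper = time_windows[k + 1]
    # Clamp at zero so windows that miss the branch contribute nothing.
    return max(0.0, min(upper, t_parent) - max(lower, t_child))


time_windows = [0, 1, 2, math.inf]
pieces = [branch_length_in_window(0.5, 2.5, time_windows, k) for k in range(3)]
print(pieces)       # [0.5, 1, 0.5]
print(sum(pieces))  # 2.0, the full branch length
```

With `time_windows[0]` in place of `time_windows[k]`, the pieces for `k = 1` and `k = 2` would come out as 1.5 and 2.0, so they would no longer sum to the branch length.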

petrelharp · 2024-07-16T22:40:44Z

python/tests/test_tree_stats.py

+        for k_tw, _ in enumerate(time_windows[:-1]):
+            if 0 < count[u, -1] < ts.num_samples:
+                # interval between child and parent inside the window
+                t_v = branch_length[u] + time[u]


Here we're losing the advantage of caching branch_length[u]; this might as well be time[v] (with v passed in also).

Alternatives:

let branch_length be a (num nodes x num time windows) array instead of just a vector, so that we'd have

tw_branch_length = branch_length[u, k_tw]

Do the calculation down a few lines, something like this (this is not right):

u = edge.child
v = edge.parent
t_c = time[u]
t_p = time[v]
time_window_index = 0
while t_p < time_windows[time_window_index + 1]:
    while v != -1:
        tw_branch_length = min(time_windows[k_tw + 1], t_p) - max(time_windows[k_tw], t_c)
        update_result(window_index, time_window_index, v, t_left, tw_branch_length)
        count[v] -= count[u]
        t_c = t_p
        v = parent[v]
        t_p = time[v]
    time_window_index += 1

The advantage to this is that computation isn't increased by a factor of (num time windows). The disadvantage might be that the code is harder to understand?
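A minimal sketch of the first alternative (a per-node, per-time-window table), assuming a single static tree described by `time` and `parent` lists. In the real incremental algorithm the table would have to be updated as edges are added and removed, which this simplified example does not attempt; the function name is hypothetical.

```python
import math


def branch_length_table(time, parent, time_windows):
    """Build a (num_nodes x num_time_windows) table of clipped branch lengths.

    table[u][k] is the length of the branch above node u that lies inside
    time window k; nodes without a parent get a row of zeros.
    """
    num_tw = len(time_windows) - 1
    table = []
    for u, p in enumerate(parent):
        row = [0.0] * num_tw
        if p != -1:
            for k in range(num_tw):
                lower, upper = time_windows[k], time_windows[k + 1]
                row[k] = max(0.0, min(upper, time[p]) - max(lower, time[u]))
        table.append(row)
    return table


# Samples 0, 1, 2; node 3 (time 1) joins 0 and 1; node 4 (time 3) is the root.
time = [0, 0, 0, 1, 3]
parent = [3, 3, 4, 4, -1]
table = branch_length_table(time, parent, [0, 2, math.inf])
print(table[2])  # branch 2 -> 4 split across the two windows: [2, 1]
```

The lookup in `update_result` then becomes `table[u][k_tw]`, at the cost of a larger cache.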

petrelharp · 2024-07-16T22:41:12Z

python/tests/test_tree_stats.py

            window_index += 1
        tree_index += 1

    assert window_index == windows.shape[0] - 1
    if span_normalise:
        for j in range(num_windows):
            result[j] /= windows[j + 1] - windows[j]
+
+    if drop_time_windows:


see suggestions above

benjeffery · 2024-09-23T10:47:18Z

I've added this work to the next release milestone. Hoping to get a release out in a week or two, if that is too ambitious for this let me know.

petrelharp · 2024-09-23T19:18:36Z

Probably too ambitious, but we might have something in by then.

petrelharp

This looks great! Some suggestions, mostly minor; let's discuss getting the tests in there.

petrelharp · 2024-10-31T19:00:24Z

python/tskit/trees.py

@@ -7637,6 +7637,7 @@ def parse_windows(self, windows):
        # Note: need to make sure windows is a string or we try to compare the
        # target with a numpy array elementwise.
        if windows is None:
+            # initiate default spanning windows


Suggested change

# initiate default spanning windows

petrelharp · 2024-10-31T19:01:50Z

python/tskit/trees.py

+        if strip_win:
+            stat = stat[0, :, :]
+        elif strip_timewin:
+            stat = stat[:, 0, :]


This looks like it can't handle both at once, i.e. windows=None and time_windows=None?
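One way to handle both cases is to use two independent checks instead of `if`/`elif`. The sketch below uses nested lists indexed as `[window][time_window][afs_entry]` to stay dependency-free; with a numpy array the equivalent steps would be `stat[:, 0]` followed by `stat[0]`. The function name `strip_dims` is hypothetical.

```python
def strip_dims(stat, strip_win, strip_timewin):
    """Drop the window and/or time-window dimension from a nested-list result."""
    if strip_timewin:
        # Drop axis 1 first, so that axis 0 still means "genomic window".
        stat = [per_window[0] for per_window in stat]
    if strip_win:
        # Independent check (not elif), so both dimensions can be dropped.
        stat = stat[0]
    return stat


print(strip_dims([[[1, 2, 3]]], True, True))           # [1, 2, 3]
print(strip_dims([[[1, 2], [3, 4]]], True, False))     # [[1, 2], [3, 4]]
print(strip_dims([[[1, 2]], [[3, 4]]], False, True))   # [[1, 2], [3, 4]]
```

Dropping the later axis first avoids having to renumber axes after the first one has been removed.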

petrelharp · 2024-10-31T19:08:28Z

python/tskit/trees.py

+            if (stat.shape == () and windows is None) or (
+                stat.shape == () and time_windows is None
+            ):


Suggested change

if (stat.shape == () and windows is None) or (

stat.shape == () and time_windows is None

):

if (stat.shape == () and windows is None and time_windows is None):

I think the intention of this rule is so that if you do like

ts.diversity([0,1,2])

then you get a single number, not a length-1 array, but if anyone is supplying windows explicitly (or time windows!) then they should get an array with the number of dimensions they expect.

We should write the bit in the docs that includes time windows, so we've got this clear?

petrelharp · 2024-10-31T19:10:04Z

python/_tskitmodule.c

@@ -9077,7 +9077,7 @@ parse_windows(
    npy_intp *shape;

    windows_array = (PyArrayObject *) PyArray_FROMANY(
-        windows, NPY_FLOAT64, 1, 1, NPY_ARRAY_IN_ARRAY);
+				      windows, NPY_FLOAT64, 1, 1, NPY_ARRAY_IN_ARRAY);


Was this change (and others like it) done by linting?

Yes, or probably an indentation that I made by mistake. I put it back how it was before.

petrelharp · 2024-10-31T19:11:00Z

python/_tskitmodule.c

        "span_normalise", "polarised", NULL };
    PyObject *sample_set_sizes = NULL;
    PyObject *sample_sets = NULL;
    PyObject *windows = NULL;
-    char *mode = NULL;
+    PyObject *time_windows = NULL;
+    char *mode = "NULL";


Suggested change

char *mode = "NULL";

char *mode = NULL;

petrelharp · 2024-11-03T14:46:17Z

c/tskit/trees.c

-        }
-        increment_nd_array_value(afs, num_sample_sets, result_dims, coordinate, x);
+    if (parent[u] != -1){
+	t_v = time[parent[u]];


perhaps also t_u here?

There's no t_u variable, do you think that's necessary? I'm using time[u] here.

petrelharp · 2024-11-03T14:46:48Z

c/tskit/trees.c

+		if (!polarised){
+		    fold(coordinate, result_dims, num_sample_sets);
+		}
+		tw_branch_length = MIN(time_windows[time_window_index + 1], t_v) - MAX(time_windows[0], time[u]);


shouldn't this be

Suggested change

tw_branch_length = MIN(time_windows[time_window_index + 1], t_v) - MAX(time_windows[0], time[u]);

tw_branch_length = MIN(time_windows[time_window_index + 1], t_v) - MAX(time_windows[time_window_index], time[u]);

?

Hm - the tests below should be catching this if it is indeed wrong, but it sure looks wrong to me - I'm not sure what's going on?

petrelharp · 2024-11-03T14:52:18Z

c/tskit/trees.c

+    if (parent[u] != -1){
+	t_v = time[parent[u]];
+	if (0 < all_samples && all_samples < self->num_samples) {
+	    for (time_window_index = 0; time_window_index < num_time_windows; time_window_index++){


A lot of edges are recent, so we might avoid substantial work if we do like

time_window_index = 0;
while (time_window_index < num_time_windows && time_windows[time_window_index] < t_v){
    ...
    time_window_index++;
}
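The same early-exit idea, sketched in Python so it can be run on its own. Here `num_time_windows` is taken to be the number of windows (one fewer than the number of breakpoints), which is an assumption about the C code, and the function name is made up for the example.

```python
import math


def windows_overlapping_branch(t_u, t_v, time_windows):
    """Return (window index, clipped length) pairs for the branch (t_u, t_v).

    The loop stops at the first window that starts at or above t_v, so a
    recent edge never visits the older windows.
    """
    num_time_windows = len(time_windows) - 1
    pieces = []
    k = 0
    while k < num_time_windows and time_windows[k] < t_v:
        length = min(time_windows[k + 1], t_v) - max(time_windows[k], t_u)
        if length > 0:
            pieces.append((k, length))
        k += 1
    return pieces


time_windows = [0, 1, 2, 5, math.inf]
print(windows_overlapping_branch(0, 0.5, time_windows))  # [(0, 0.5)]
print(windows_overlapping_branch(1.5, 3, time_windows))  # [(1, 0.5), (2, 1)]
```

The loop still starts from window 0, so windows entirely below `t_u` are visited and skipped; a binary search for the first overlapping window would avoid that too.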

petrelharp · 2024-11-03T15:05:58Z

python/tests/test_tree_stats.py

+
+
+class TestTimeWindows(TestBranchAlleleFrequencySpectrum):
+    def test_four_taxa_test_case(self):


This shouldn't really be in this class, since it's testing general_stat, not the AFS; perhaps leave a comment? Or move it along with the general_stat code above to a new PR for future work?

petrelharp · 2024-11-03T15:14:46Z

python/tests/test_tree_stats.py

+        )
+        self.assertArrayAlmostEqual(x, true_x)
+
+    def test_afs_branch(self):


This seems very useful, but it's hard to tell exactly what's being tested. For instance, there's no call to ts.allele_frequency_spectrum here, I think? Perhaps this could be rearranged? Simplified? Commented? There's also some references to self.mode, which might be confusing since this is branch-only?

petrelharp · 2024-11-21T18:53:57Z

c/tests/test_stats.c

@@ -1258,23 +1270,24 @@ verify_afs(tsk_treeseq_t *ts)
    sample_set_sizes[0] = n - 2;
    sample_set_sizes[1] = 2;
    ret = tsk_treeseq_allele_frequency_spectrum(
-	ts, 2, sample_set_sizes, samples, 0, NULL, 0, NULL, 0, result);
+        ts, 2, sample_set_sizes, samples, 0, NULL, 0, NULL, 0, result);


uh-oh, are these tabs?

petrelharp · 2024-11-21T19:01:27Z

c/tskit/trees.c

-        ret = tsk_treeseq_check_time_windows(
-            num_time_windows, time_windows);
+        if (stat_site
+            && tsk_memcmp(time_windows, default_time_windows, sizeof(double)) != 0) {


Hm, this is a bit awkward - what if instead we used num_time_windows=0 to mean "default/no time windows"?

But time_windows are always initialized by default as [0, inf], so num_time_windows=2, comparing to the default was the clearest I found for now. But maybe the problem lies in the initialization caused by the parsing of the windows in the first place.

Oh, wait - we're already in the else clause where we know that time_windows != NULL. So I don't think we need to check this at all - just throw the error?

Suggested change

&& tsk_memcmp(time_windows, default_time_windows, sizeof(double)) != 0) {

) {

So, if someone explicitly specifies time_windows = [0, np.inf], mode="node" then they'll get the error, but that's okay - the error message says "you can't specify time windows", not "time windows must be [0, Inf)" (I think that's what it says anyhow).
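At the Python level, the behaviour described here might look something like the sketch below. The function name, the set of supported modes, and the error text are all assumptions for illustration; the actual check in this PR lives in the C code and may be worded differently.

```python
import math


def check_time_windows_allowed(mode, time_windows):
    """Reject any explicitly supplied time_windows outside branch mode.

    Hypothetical helper: the check does not compare against a default such
    as [0, inf]; supplying time_windows at all is what triggers the error.
    """
    if time_windows is not None and mode != "branch":
        raise ValueError(
            f"time_windows cannot be specified in {mode!r} mode"
        )


check_time_windows_allowed("branch", [0, 10, math.inf])  # fine
check_time_windows_allowed("node", None)                 # fine: nothing specified
try:
    check_time_windows_allowed("node", [0, math.inf])
except ValueError as err:
    print(err)
```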

tforest force-pushed the time_windows branch from 8f5fa02 to f7f679a Compare July 18, 2024 22:55

nspope mentioned this pull request Aug 6, 2024

Use reduction function in pair_coalescence_stat #2975

Merged

petrelharp mentioned this pull request Sep 6, 2024

"mutation" mode for statistics #2982

Open

benjeffery added this to the Python 0.5.9 milestone Sep 23, 2024

benjeffery modified the milestones: Python 0.5.9, Python 0.5.10 Oct 16, 2024

petrelharp and others added 13 commits October 21, 2024 12:13

example test for time windows

aeda4dc

a little more flailing at beginning tests

0a30696

first simple test for AFS in branch mode with time windows

1a988c0

AFS non-naive version with time windows + Some tests

18ffda0

Improve AFS branch mode

c6f9562

fix AFS tests

7a3149f

Fix naive branch general stat

71da7ad

intermediate changes to C AFS implementation with time windows

2a44909

tw afs C implementation

4460db9

Better dimension drop with time windows

bbec6a9

AFS branch mode with time windows to review

da5f205

Adapting some tests with new time_windows parameter

6b3ab4f

time_windows addition after code rebase

f2d857b

tforest force-pushed the time_windows branch from d3e17a9 to f2d857b Compare October 30, 2024 20:21

tforest added 3 commits October 31, 2024 12:05

Fix some tests failing because of time_windows

b8f4ba5

Remove unused parameters in afs function

c9f6c06

Fixing a memory issue in tsk_treeseq_update_branch_afs

59ea266

tforest added 2 commits November 18, 2024 16:39

Fixing some time_windows issues with AFS

96ac0ce

Adjusting some stats tests using time_windows

26a7f09

tforest added 2 commits December 10, 2024 16:58

Adjusting some tests trying to pass codecov

8a8c05b

Updating naive_branch_... and branch_allele_freq... in tests

0d48891
