Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unclear Cooler.pixels().fetch behavior #455

Open
StefanoCretti opened this issue Jan 26, 2025 · 0 comments
Open

Unclear Cooler.pixels().fetch behavior #455

StefanoCretti opened this issue Jan 26, 2025 · 0 comments

Comments

@StefanoCretti
Copy link

Just wanted to report a behavior which I assume is intentional but might be misleading (or at least it was for me).

When using Cooler.pixels().fetch("<some_genomic_region>"), an instance of the class RangeSelector1D is returned. This selector returns all the pixels where the first bin id belongs to the provided interval. As an example, using the yeast.10kb.cool file in the tests/data folder:

from cooler import Cooler

handle = Cooler("yeast.10kb.cool")
print(handle.pixels().fetch("chrII"))

returns:

        bin1_id  bin2_id  count
36311        33       33   2433
36312        33       34   7410
36313        33       35   3444
36314        33       36    475
36315        33       37   1741
...         ...      ...    ...
128531      114     1209      1
128532      114     1221      2
128533      114     1223      2
128534      114     1224      2
128535      114     1225     12

[92225 rows x 3 columns]

which checks out since:

print(handle.extent("chrII"))  # results in (33, 115)

The issue (in my opinion) is that it does not return all pixels with at least one bin in the region. There can be pixels where the second bin belongs to the region and yet they are not returned.

print(handle.pixels()[:].query("bin2_id >= 33 and bin2_id < 115 and bin1_id < 33"))

returns:

       bin1_id  bin2_id  count
28           0       33      3
29           0       34     41
30           0       35     48
31           0       36      9
32           0       37     61
...        ...      ...    ...
36110       32       65      1
36111       32       75      1
36112       32       90      2
36113       32      101      1
36114       32      113      1

[2460 rows x 3 columns]

I understand why this might be the case. There is no easy way of indexing the bin2_id column, therefore finding these pixels would require iterating and filtering all pixels prior to the retrieved slice. This would clash with the lazy fetching and most likely be slow.

Understanding why this behavior might not change, I believe it should be clearly stated in the documentation (I did not find anything about it but I might have missed it), since one could get misleading results. I would be happy to open a PR to either change or document the behavior if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant