Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] dandi fsspec filesystem #1395

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

balbasty
Copy link

Hi team!

This PR contains a fsspec file system to navigate files on dandiarchive. Under the hood, it uses things like get_assets_with_path_prefix, get_assets_by_glob, get_asset_by_path to figure out files and their directory structure.

It also automatically finds the s3 url of any asset.

It allows doing things like:

import fsspec
import nibabel
from indexed_gzip import IndexedGzipFile

uri = 'dandi://dandi/000026@draft/sub-I46/ses-MRI/anat/sub-I46_ses-MRI_echo-1_flip-1_VFA.nii.gz'
with fsspec.open(uri) as gzippedstream:
    with IndexedGzipFile(gzippedstream) as bytestream:
        mri = nibabel.Nifti1Image.from_stream(bytestream)
        dat = mri.get_fdata()

However,

  1. There may be ways to make this implementation more efficient. To check if a directory exists, I have to list all assets that start with its path, and if there is at least one, the directory exists. If there are many assets under this directory, this is quite slow. For example this is very slow:
RemoteDandiFileSystem().isdir('dandi://dandi/000026@draft/sub-I46/ses-MRI')
  1. If there is no faster way to parse directories, it would make sense to add a caching mechanism.
  2. I have my own parser for the different URI types (i.e. to split instance/dandiset/path), and they may not be perfect. I did not find such parsers in the current codebase, but I've surely missed them.
  3. There are currently no tests. I would welcome guidance on which tests you find necessary.
  4. I haven't made fsspec a dependency. I raise an intelligible error when dandifs is imported but fsspec is missing. However, fsspec[http] is a dependency of zarr, so it's likely to be present.

Here are a few examples that work, assuming fs = RemoteDandiFileSystem():

>> fs.ls('dandi://dandi/000026@draft/sub-I46/ses-MRI/anat/')
[{'name': 'sub-I46/ses-MRI/anat/sub-I46_ses-MRI_echo-1_flip-1_VFA.nii.gz',
  'size': 2636077950,
  'created': '2022-07-22T00:42:53.878917+00:00',
  'modified': '2022-07-22T00:42:58.099819+00:00',
  'type': 'file'},
 {'name': 'sub-I46/ses-MRI/anat/sub-I46_ses-MRI_echo-1_flip-2_VFA.nii.gz',
  'size': 2635075935,
  'created': '2022-07-22T00:42:58.028799+00:00',
  'modified': '2022-07-22T00:43:02.026999+00:00',
  'type': 'file'},
...
 {'name': 'sub-I46/ses-MRI/anat/sub-I46_ses-MRI_echo-2_flip-1_VFA.json',
  'size': 247,
  'created': '2023-04-06T16:18:55.008784+00:00',
  'modified': '2023-04-06T16:18:55.054241+00:00',
  'type': 'file'},
 {'name': 'sub-I46/ses-MRI/anat/sub-I46_ses-MRI_echo-1_flip-1_VFA.json',
  'size': 247,
  'created': '2023-04-06T16:18:56.232298+00:00',
  'modified': '2023-04-06T16:18:56.266695+00:00',
  'type': 'file'},
...
]
>> fs.ls('dandi://dandi/000026@draft/sub-I46/ses-MRI')
[{'name': 'sub-I46/ses-MRI/anat', 'size': 0, 'type': 'directory'}]
>> fs.isdir('dandi://dandi/000026@draft/sub-I46/ses-MRI')
True
>> fs.isdir('dandi://dandi/000026@draft/sub-I46/ses-MR')
False
>> fs.info('dandi://dandi/000026@draft/sub-I46/ses-MRI')
{'name': 'dandi://dandi/000026@draft/sub-I46/ses-MRI',
 'size': 0,
 'type': 'directory'}

I understand there may be a bit of work to do before it gets merged, but I hope it can be useful.

@jwodder jwodder added the minor Increment the minor version when merged label Jan 25, 2024
info = requests.request(url=asset.api_url, method='get').json()
url = ''
for url in info['contentUrl']:
if url.startswith('https://dandiarchive.s3.amazonaws.com'):

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High

The string
https://dandiarchive.s3.amazonaws.com
may be at an arbitrary position in the sanitized URL.
for url in info['contentUrl']:
if url.startswith('https://dandiarchive.s3.amazonaws.com'):
break
if not url.startswith('https://dandiarchive.s3.amazonaws.com'):

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High

The string
https://dandiarchive.s3.amazonaws.com
may be at an arbitrary position in the sanitized URL.

def _maybe_to_s3(self, url):
url = stringify_path(url)
is_s3 = url.startswith('https://dandiarchive.s3.amazonaws.com')

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High

The string
https://dandiarchive.s3.amazonaws.com
may be at an arbitrary position in the sanitized URL.
@jwodder
Copy link
Member

jwodder commented Jan 25, 2024

@balbasty

  • Regarding checking whether a directory exists, you can make a request to $API_URL/dandisets/{dandiset_id}/versions/{version_id}/assets/paths?path_prefix={dirpath} and see whether it returns 200 with nonempty "results".
  • The code for parsing DANDI URLs is all in dandi/dandiarchive.py.

@jwodder jwodder requested review from yarikoptic and jwodder January 25, 2024 17:26
Copy link

codecov bot commented Jan 25, 2024

Codecov Report

Attention: 230 lines in your changes are missing coverage. Please review.

Comparison is base (96c3953) 88.50% compared to head (e72ae2b) 86.70%.

Files Patch % Lines
dandi/dandifs.py 0.00% 230 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1395      +/-   ##
==========================================
- Coverage   88.50%   86.70%   -1.81%     
==========================================
  Files          77       78       +1     
  Lines       10492    10722     +230     
==========================================
+ Hits         9286     9296      +10     
- Misses       1206     1426     +220     
Flag Coverage Δ
unittests 86.70% <0.00%> (-1.81%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@yarikoptic
Copy link
Member

relates to

@balbasty balbasty changed the title Balbasty dandifs [FEAT] dandi fsspec filesystem Jan 25, 2024
Copy link
Member

@jwodder jwodder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Address the lint & type-checking failures.

"""
A `fsspec` File System for (remote) DANDI
"""
__all__ = ['RemoteDandiFileSystem']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this after the imports.

'Please install with: pip install fsspec[http]')
import re
import requests
from typing import Optional, Union, Tuple
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add from __future__ import annotations to the top of the file so that Optional[T], Union[T, U], and Tuple[T, ...] can be written T | None, T | U, and tuple[T, ...] instead.

import requests
from typing import Optional, Union, Tuple
from urllib.parse import unquote as url_unquote
from dandi.dandiapi import (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the formatting, it appears that you did not run black et alii via pre-commit. Install pre-commit, run pre-commit install in your clone of this repository, and then run pre-commit run -a to format your PR.

dandiset: Optional[Union[str, RemoteDandiset]] = None,
version: Optional[str] = None,
client: Optional[Union[str, DandiInstance, DandiAPIClient]] = None,
**http_kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if a user supplies a RemoteDandiset with one version, a version string with a different version, and a client that doesn't match the RemoteDandiset's client? Such mismatches should either result in immediate errors or else be prevented in the first place via dedicated constructor classmethods.

return self._dandiset

@dandiset.setter
def dandiset(self, x: Optional[RemoteDandiset]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we want to support replacing the dandiset attribute to begin with?

instance, dandiset, version, *_ = split_dandi_url(url)
return cls(dandiset, version, instance)

def get_dandiset(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to rewrite this using navigate from dandiarchive.py.

Comment on lines +193 to +195
info = requests.request(url=asset.api_url, method='get').json()
url = ''
for url in info['contentUrl']:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
info = requests.request(url=asset.api_url, method='get').json()
url = ''
for url in info['contentUrl']:
for url in asset.get_raw_metadata().get('contentUrl', []):

info = requests.request(url=asset.api_url, method='get').json()
url = ''
for url in info['contentUrl']:
if url.startswith('https://dandiarchive.s3.amazonaws.com'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that dandiarchive.s3.amazonaws.com is only used by the production dandi instance; the staging instance uses dandi-api-staging-dandisets.s3.amazonaws.com.

Comment on lines +197 to +200
break
if not url.startswith('https://dandiarchive.s3.amazonaws.com'):
return None
return url
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
break
if not url.startswith('https://dandiarchive.s3.amazonaws.com'):
return None
return url
return url
return None

return self._httpfs.open(s3url, *args, **kwargs)


def split_dandi_url(url: str) -> Tuple[str, str, str, str, str]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace this with the types from dandiarchive.py.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
minor Increment the minor version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants