Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-41365: Add findDataset/getDataset/getDatasetType to Butler API #899

Merged
merged 13 commits into from
Nov 3, 2023

Conversation

timj
Copy link
Member

@timj timj commented Oct 26, 2023

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@codecov
Copy link

codecov bot commented Oct 26, 2023

Codecov Report

Attention: 19 lines in your changes are missing coverage. Please review.

Comparison is base (e89626a) 87.66% compared to head (17bc74e) 87.68%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #899      +/-   ##
==========================================
+ Coverage   87.66%   87.68%   +0.01%     
==========================================
  Files         276      276              
  Lines       36165    36326     +161     
  Branches     7547     7581      +34     
==========================================
+ Hits        31705    31851     +146     
- Misses       3305     3313       +8     
- Partials     1155     1162       +7     
Files Coverage Δ
python/lsst/daf/butler/__init__.py 100.00% <100.00%> (ø)
python/lsst/daf/butler/_butler.py 90.71% <100.00%> (+0.48%) ⬆️
python/lsst/daf/butler/registry/_registry.py 98.23% <100.00%> (ø)
...n/lsst/daf/butler/remote_butler/server/__init__.py 100.00% <100.00%> (ø)
.../daf/butler/remote_butler/server/_server_models.py 100.00% <100.00%> (+100.00%) ⬆️
python/lsst/daf/butler/script/ingest_files.py 92.59% <100.00%> (ø)
tests/test_butler.py 97.94% <100.00%> (ø)
tests/test_server.py 95.77% <100.00%> (+3.88%) ⬆️
tests/test_simpleButler.py 98.43% <100.00%> (ø)
tests/test_testRepo.py 100.00% <100.00%> (ø)
... and 3 more

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines +85 to +88
# Populate the test server.
server_butler.import_(filename=os.path.join(TESTDIR, "data", "registry", "base.yaml"))
server_butler.import_(filename=os.path.join(TESTDIR, "data", "registry", "datasets-uuid.yaml"))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't matter yet but you may want to move this above the call to _make_remote_butler, in case we decide to read some stuff from the server during RemoteButler initialization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hoping that it doesn't make a difference here either way because the clients won't be downloading caches of dataset type and collection before this runs will they?

@timj timj force-pushed the tickets/DM-41365 branch from 7b59d29 to 9f27ffd Compare October 27, 2023 22:40
Copy link
Member

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just took a look at the APIs, and while I left a lot of comments they're all minor. Did not look at the implementations (I gather that's not needed from me).

python/lsst/daf/butler/_butler.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/_butler.py Show resolved Hide resolved
*,
collections: CollectionArgType | None = None,
timespan: Timespan | None = None,
datastore_records: bool = False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we provide this as an option, we should also provide a way to expand the data ID (i.e. add dimension records) at the same time, and we should think about whether we want just one option to do both or allow the two kinds of expansion to happen separately.

Copy link
Member Author

@timj timj Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had completely failed to notice that that flag was in there now. One option is for find_dataset to always expand and add everything. People shouldn't be calling it in bulk.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always expanding could be inefficient, I think many clients do not care about datastore records at all, we should not be pessimizing things for them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably get_dataset should have the datastore_records and dimension_records booleans as well for consistency?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also need to add datastore records to the SerializedDatasetRef because that was never updated and so it's impossible to return datastore records from the server at the moment...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jim asked me to not add them to SerializedDatasetRef, as it may increase graph size on disk (but I don't think we give records to refs that go into the graph, at least not yet).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are sending SerializedDatasetRef over the wire so we can only reconstruct things in the far end that were put in at the other end. We need to add them somehow (but obviously they should be optional). For now I've turned the feature off in client/server so you can't get the datastore records.

python/lsst/daf/butler/_butler.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/_butler.py Show resolved Hide resolved
python/lsst/daf/butler/_butler.py Outdated Show resolved Hide resolved
@@ -799,6 +799,23 @@ def get_dataset_type(self, name: str) -> DatasetType:
"""
raise NotImplementedError()

@abstractmethod
def get_dataset(self, id: DatasetId) -> DatasetRef | None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll want a vectorized version of this in the future, but it's a pain to implement now. I don't think this interface needs to change in order to live alongside a vectorized one, but I wanted to raise it in case you think otherwise.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two open questions:

  • If we have vectorized version, get_datasets do we keep get_dataset around?
  • What should get_datasets return? list[DatasetRef] or DatasetRefContainer?

I think we likely want a single request to return a single ref so it seems like we always want Butler.get_dataset even if we later add Butler.get_datasets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, do we ever expose bare DatasetId to butler clients (do we need this method at butler level)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It certainly makes client/server testing a lot easier to begin with since it's the simplest possible API to implement. We do have VO services that are given a UUID (via datalink from, eg, ObsTAP) and need to convert that to a dataset ref so it's not impossible that the server will need to do that (depending on how services connect to butlers).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll want to start exposing dataset IDs in more places, too; I've been thinking about how we might to get them into the --where expression language, for instance. They appear pretty prominently in QGs already.

Copy link
Contributor

@andy-slac andy-slac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, few minor questions.

python/lsst/daf/butler/_butler.py Outdated Show resolved Hide resolved
@@ -799,6 +799,23 @@ def get_dataset_type(self, name: str) -> DatasetType:
"""
raise NotImplementedError()

@abstractmethod
def get_dataset(self, id: DatasetId) -> DatasetRef | None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, do we ever expose bare DatasetId to butler clients (do we need this method at butler level)?

*,
collections: CollectionArgType | None = None,
timespan: Timespan | None = None,
datastore_records: bool = False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always expanding could be inefficient, I think many clients do not care about datastore records at all, we should not be pessimizing things for them.

python/lsst/daf/butler/_butler.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/_butler.py Outdated Show resolved Hide resolved
Comment on lines +245 to +248
# Temporary hack. Assume strings for collections. In future
# want to construct CollectionWildcard and filter it through collection
# cache to generate list of collection names.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working on caching ticket now, I would probably prefer to not have a complete list of collections on every client in the long term.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I thought we said we would last week. I thought @TallJimbo wanted the client to know all possible collections when it was building the query object before sending it to the server. The caching story between client and server is a bit tricky.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I know, but we may need to rethink that part, otherwise we may have scaling issue. I looked at Butler initialization time today, the simplest butler query-data-ids --collections=HSC/raw/all /repo/main exposure takes 10 seconds, large fraction of that comes from pre-loading of collections cache. We have 300k collections today, and it will only grow with time. There are ways to reduce that probably, but in my ideal world it would be better if we cache almost nothing on client side. 🙂

Copy link
Member

@TallJimbo TallJimbo Oct 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having the caches in the client has worked well for us so far (i.e. with the server just being the DB), and I like the idea of being able to optimize server calls away entirely in the client sometimes. And I do think caching on the server is a lot harder, since we can't just punt cache misses to the user there.

But I do hesitate a bit because I'm not sure if "cache all the collections the user can see" continues to be in the "small enough to be irrelevant" category after DR1, both in terms of downloading the cache and holding it in memory. And I have no idea if my "do more on the client to keep the server load light" intuition is better than @dhirving 's "do nothing on the client to keep version management easier" intuition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Downloading a million collection records just to figure out that pattern does not match any of them looks like a lot of overhead. It is of course about the right balance between what we do on server vs client, we need to find something that works and scales.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'm sold on not trying to aggressively fetch all collections the user can access. It's the dataset type definitions, chained collection members, and collection summaries that are needed for query contruction, and we don't need all of any of those.

I think what may break if we drop the complete collection-name cache is our regex/glob collection query implementation - we'll need to start translating/limiting those patterns to something that can be run inside the DB. Yes, we could do a full cache in the RemoteButler server, but this scaling problem will hit DirectButler, too.

I'll do a brain dump on Slack later today about the places we use the caches in the query system so somebody other than me can do some conceptual design work on what it should look like. I think we need to have a better sense of where this is all going, even if we don't decide to switch fully from the current approach yet, but I want to stay off of at least this part of the critical path.

# Temporary hack. Assume strings for collections. In future
# want to construct CollectionWildcard and filter it through collection
# cache to generate list of collection names.
wildcards = CollectionWildcard.from_expression(collections)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe do a quick check here and raise if patterns specifies something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like I have to change the API to something a bit simpler for collection list anyhow.

) -> SerializedDatasetType:
"""Return the dataset type."""
butler = factory.create_butler()
datasetType = butler.get_dataset_type(dataset_type_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is exception transported transparently back to client?

tests/test_server.py Show resolved Hide resolved
@timj timj force-pushed the tickets/DM-41365 branch from 9f27ffd to 13d3bff Compare October 31, 2023 18:09
The underscores are the preferred way to add new APIs now.

The collections parameter should not support wildcards so explicitly
declare it should be a sequence of str.
@timj timj force-pushed the tickets/DM-41365 branch from 13d3bff to b22a5f3 Compare October 31, 2023 18:54
@timj timj force-pushed the tickets/DM-41365 branch 3 times, most recently from 07b0946 to 9afece7 Compare November 2, 2023 22:00
timj added 2 commits November 2, 2023 15:58
…aset

Datastore records do not work in remote butler because they
were never added to SerializedDatasetRef.
Demonstrate that MissingDatasetTypeError can be caught in
the server and raised again in the client.
@timj timj force-pushed the tickets/DM-41365 branch from 9afece7 to 17bc74e Compare November 2, 2023 22:58
@timj timj merged commit 5baa637 into main Nov 3, 2023
16 checks passed
@timj timj deleted the tickets/DM-41365 branch November 3, 2023 04:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants