Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update to pydantic 2 #625

Merged
merged 44 commits into from
Apr 26, 2024
Merged

Conversation

thomas-maschler
Copy link
Contributor

Related Issue(s):

Description:
This PR updates dependencies to Pydantic 2 and Stac-Pydantic 3
It removes the internal Stac, Search and Opertator Types in favor of Types from Stac Pydantic

PR Checklist:

  • pre-commit hooks pass locally
  • Tests pass (run make test)
  • Documentation has been updated to reflect changes, if applicable, and docs build successfully (run make docs)
  • Changes are added to the CHANGELOG.

@thomas-maschler thomas-maschler mentioned this pull request Feb 12, 2024
4 tasks
@lossyrob
Copy link
Member

@thomas-maschler , moving the STAC types away from the TypeDict implementations in stac_fastapi.types to Pydantic types could have performance implications. Specifically I'm worried about the STAC types encoding Client output. These were originally moved away from Pydantic models as they tended to be slow for large payloads, e.g. returning an ItemCollection from search that contains 1000 large Sentinel 2 Items. Running Pydantic validation against known valid Items that come out of PgSTAC as part of response causes additional latency without a lot of value. I know the Pydantic 2 backend changes means validation will be faster, but are we introducing the potential for increased latency in the API routes because of the move away from TypedDicts in e.g. BaseCoreClient in this PR? If so, could the upgrade be done for but leave the TypeDict usage as is? Or have you seen that these changes don't introduce additional latency due to the magic of Rust 🪄?

@thomas-maschler
Copy link
Contributor Author

thomas-maschler commented Feb 21, 2024

I can see your concerns regarding latency. I did not benchmark this myself. Others reported a speed improvement of x5 between pydantic 1 and 2. I was assuming that this would be enough to address latency concerns.

The benefits I saw in keeping the pydandic models are peace of mind, due to basic checks and setting of default values you wouldn't have to store in the DB. I have a use case, where data are stored in a legacy system and I have to translate them to STAC at runtime. Being able to push part of this into Pydantic makes the job easier.

If latency is still the main concern I could see several routes

  1. Keep TypedDicts
    Be explicit in the documentation that responses are not type checked and it is the responsibility of the user to do so.

  2. Allow to skip validation for ItemCollections and Collections

from pydantic import SkipValidation
from stac_pydantic import api
from typing import Sequence

class ItemCollectionUnsafe(api.ItemCollection):
    features: Sequence[SkipValidation[api.Item]]
 
class CollectionsUnsafe(api.Collections):
    collections: Sequence[SkipValidation[api.Collection]]

With minimal code changes, users can use the Unsafe subclasses as their return model, telling pydantic to skip validation for items and collections when making bulk requests.

  1. Add some configuration options to let the user choose between current implementation and options 1 or 2 above.

@rhysrevans3
Copy link
Contributor

Couple of notes from my pull request:

  1. If typing List and Dict are being used this part will need to be changed as in it's current form it will miss off the Optional typing of nested Lists & Dicts.
  2. The types.links seems to be covered by stac-pydantic links and link_factory.
  3. The stac-pydantic Item/Collection/ItemCollection types are not json serialisable so will need to be json encoded/converted to dictionaries before they're returned from the API.

@thomas-maschler
Copy link
Contributor Author

@rhysrevans3, regarding your first point, can you give me an example of when this fails? I cannot replicate it. I tried this, where the filter attribute is defined as Optional[Dict]

@pytest.mark.parametrize(
    "filter,passes",
    [(None, True), ({"test": "test"}, True), ("test==test", False), ([], False)],
)
def test_create_post_request_model(filter, passes):
    extensions = [FilterExtension()]
    request_model = create_post_request_model(extensions, BaseSearchPostRequest)

    if not passes:
        with pytest.raises(ValidationError):
            model = request_model(filter=filter)
    else:
        model = request_model(
            collections=["test1", "test2"],
            ids=["test1", "test2"],
            bbox=[0, 0, 1, 1],
            datetime="2020-01-01T00:00:00Z",
            limit=10,
            filter=filter,
            **{"filter-crs": "epsg:4326", "filter-lang": "cql2-text"},
        )

        assert model.collections == ["test1", "test2"]
        assert model.filter_crs == "epsg:4326"
        assert model.filter == filter

@lossyrob
Copy link
Member

@thomas-maschler understood about the value of validation in the response classes for other use cases.

I think a solution where users could skip validation based on configuration would be great. In my previous testing (a year or so ago) it was the Pydantic validation that was causing slowdowns. I'd still want to validate that there's no other performance hits besides validation for using Pydantic models over TypeDicts before integrating these changes, but I have enough confidence that this would address the performance implications, and any unforeseen issues could be addressed in follow up work.

Would you be willing to add that SkipValidation configuration to this PR?

@vincentsarago
Copy link
Member

about performances, in TiPg we also went for the TypedDict way because pydantic validation was really slowing down the response (even with pydantic 2.0)

@rhysrevans3
Copy link
Contributor

@thomas-maschler I found it was an issue for nested List/Dicts types: for example the sortby field in the SortExtensionPostRequest is

sortby: Optional[List[PostSortModel]]

But when I check the FieldInfo shows required=True

>>> from stac_fastapi.extensions.core.sort.request import SortExtensionPostRequest
>>>SortExtensionPostRequest.model_fields['sortby']
FieldInfo(annotation=Union[List[SortExtension], NoneType], required=True)

@thomas-maschler
Copy link
Contributor Author

@rhysrevans3 this is b/c we didn't set a default value. Right now the model requires either None or a list of PostSortModel. If you set the default to None the FieldInfo will say required=False. However, None is filled in when sortby is not provided so it still works even without a default value.

Regarding your point 2.
The type.Links are no Pydantic Models and are not used anywhere in the package. I prefer to leave them in right now to not break too much. I am not sure if any of the backends currently rely on it.

Regarding your point 3.
All pydantic models are json serializable. either with .model_dump_json() (output as str) or .model_dump(mode="json") (output as dict). I added some test, directly returning the pydantic Item, Collection and ItemCollection and this works for me.

I am still working on the switch between TypeDict and Pydantic response models and will probably push my updates later today.

@rhysrevans3
Copy link
Contributor

@thomas-maschler ah that makes more sense thank you for the explanation. I guess this just need to be filled in the backend I'm using.

Point 2 that sounds like a good idea. But don't the links in the api.landing_page used here need to be stac-pydantic Links? I guess with the validation removed this won't be an issue.

Point 3 sorry yes I meant json.dumps or orjson won't serialise them which I think fastapi uses for returning JSON/GEOJSON content. So was more of a note that the other repos will need to serialise or convert to a dict before returning the api routes. I think this effect the landing pages. however, I might be missing something obvious.

@thomas-maschler
Copy link
Contributor Author

@lossyrob can you take a look at my recent changes, in particular 8118f10.

  • In types.config I added a parameter validate_response that defaults to False (same as now).
  • I added types.stac back in
  • I added types.response_model which implements the switch
  • In api.core I made sure we always use response_model classes and added some extra type-checking
  • I added additional tests to make sure this works correctly, in particular, test_response_model and test_app

I stumbled over some other bugs regarding field aliases in the filter extension that I didn't manage to fix. I will file an issue for that.
I also noticed that there are still some pydantic extension models that are either out of sync with stac-pydantic or not present in stac-pydantic. I will port them over and open another PR once they make their way through the system.

@vincentsarago vincentsarago added this to the 3.0.0 milestone Apr 9, 2024
@jonhealy1
Copy link
Collaborator

@vincentsarago I thought we had results here that seemed to make sense?

@thomas-maschler
Copy link
Contributor Author

I believe we now test for correct validation.

@vincentsarago
Copy link
Member

I believe we now test for correct validation.

@thomas-maschler Yes

@vincentsarago I thought we had results #650 (comment) that seemed to make sense?

@jonhealy1 absolutely, so this mean that validation does not impact the performance 🤯. I would still keep this optional but it might be great to mention this somewhere

@thomas-maschler
Copy link
Contributor Author

Ok, I'll update the readme

@vincentsarago
Copy link
Member

thanks @thomas-maschler I was going update this branch tomorrow morning 🙏

let see if we can get this merged before the end of the week 😄

Copy link
Collaborator

@jonhealy1 jonhealy1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. The readme looks great - just a couple of spelling things

CHANGES.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
README.md Show resolved Hide resolved
stac_fastapi/api/setup.py Outdated Show resolved Hide resolved
stac_fastapi/api/setup.py Outdated Show resolved Hide resolved
stac_fastapi/extensions/setup.py Outdated Show resolved Hide resolved
stac_fastapi/extensions/setup.py Outdated Show resolved Hide resolved
stac_fastapi/types/setup.py Outdated Show resolved Hide resolved
stac_fastapi/types/stac_fastapi/types/version.py Outdated Show resolved Hide resolved
@jonhealy1 jonhealy1 self-requested a review April 26, 2024 06:55
Copy link
Collaborator

@jonhealy1 jonhealy1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

Copy link
Member

@vincentsarago vincentsarago left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 thanks @thomas-maschler

Notes:

@jonhealy1
Copy link
Collaborator

@vincentsarago when you say wait do you mean wait to merge?

@vincentsarago
Copy link
Member

@jonhealy1 sorry I meant, wait for the 3.0 release :-)

I'll merge this now 🥳

@vincentsarago vincentsarago merged commit 63cac39 into stac-utils:main Apr 26, 2024
7 checks passed
@vincentsarago
Copy link
Member

vincentsarago commented Apr 26, 2024

🤯 😍
Screenshot 2024-04-26 at 9 27 17 AM

Screenshot 2024-04-26 at 9 27 15 AM

There is almost no big difference when there is or is not model validation ❤️

https://stac-utils.github.io/stac-fastapi/benchmarks.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

7 participants