This repository has been archived by the owner on Dec 7, 2023. It is now read-only.

recipe_run SQLModel -> FastAPI -> CRUD Client & CLI #6

Merged
merged 49 commits into from
Jan 7, 2022

Conversation

cisaacstern
Member

Yesterday @rabernat and I discussed the fact that #1 contains a lot of cobbled together workarounds precipitated by the fact that we have not yet implemented pangeo-forge/roadmap#31. Rather than building upon the un-scalable, unsteady footing of CSV logs (as #1 suggests), we are going to take a step back and implement an actual database + API for Pangeo Forge.

The first table in our database will be for recipe_run logs: records of which recipes were run, when, deposited where, etc. These are the first priority because unlike the Bakery database or recipe metadata, they don't currently exist anywhere else.
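For a rough sense of shape (the field names here are illustrative, borrowing from the fields sketched later in this PR, not a final schema), a recipe_run record might carry something like:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class RecipeRunRecord:
    """Illustrative shape of a recipe_run log entry (plain dataclass stand-in)."""

    feedstock_id: int  # which feedstock the recipe came from (eventually a foreign key)
    commit: str        # commit SHA the recipe was run at
    version: str       # feedstock/recipe version
    status: str        # e.g. "queued", "in_progress", "completed"
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    dataset_url: Optional[str] = None  # where the resulting dataset was deposited


run = RecipeRunRecord(feedstock_id=1, commit="abc123", version="1.0", status="queued")
```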

This PR currently just mirrors the SQLModel documentation example of a "Heroes" sqlite database plus FastAPI layer, with the addition of sketches for server and cli modules. Opening this as a draft now; I'll mark it Ready for review once it actually is ready. Unlike #1, I'm going to aim to keep this one truly minimal. 😄

@cisaacstern
Member Author

cisaacstern commented Nov 19, 2021

An update on my current thinking here. I've now got a reasonable handle on how SQLModel works. Before building out our recipe_run SQL table, I'm taking a moment to consider if there's an approachable way to write factory functions to generate derived classes for the multiple models with inheritance approach.

This multiple model approach is clearly a robust style which is more DRY than the alternatives. The one maintainability issue I foresee, however, is that as we scale it beyond recipe_run to numerous interrelated tables, a lot of boilerplate classes will be defined (each of which is then referenced multiple times in the tests, API functions, etc.). So if there were a way to automate generation of derived models with factories, our implementation would be a lot more readable and maintainable.

The relationship of the Base and Create classes is the simplest example, as they are effectively the same class under different names. In this case, instead of

```python
from typing import Optional

from sqlmodel import SQLModel


class HeroBase(SQLModel):
    name: str
    secret_name: str
    age: Optional[int] = None


class HeroCreate(HeroBase):
    pass
```

we can do

```python
from typing import Optional, Type

from sqlmodel import SQLModel


def make_subclass(base: Type[SQLModel], rename_base_to: str, attrs: Optional[dict] = None):
    """Derive a renamed subclass, e.g. HeroBase -> HeroCreate."""
    # NB: a mutable default (attrs: dict = {}) would be shared across calls
    cls_name = base.__name__.replace("Base", rename_base_to)
    return type(cls_name, (base,), attrs or {})


HeroCreate = make_subclass(HeroBase, "Create")
```

The real conciseness gains will come from using a similar approach for the other derived classes, but this is a bit more involved, as it requires changing attributes, defaults, and/or annotations.
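For example (plain Python only; the `annotations` parameter is a hypothetical extension of `make_subclass`, and whether SQLModel's metaclass would accept classes built this way is exactly the open question):

```python
from typing import Optional


class HeroBase:
    name: str
    secret_name: str
    age: Optional[int] = None


def make_subclass(base, rename_base_to, attrs=None, annotations=None):
    """Hypothetical factory: derive a renamed subclass, optionally
    overriding attribute defaults and/or type annotations."""
    attrs = dict(attrs or {})
    if annotations:
        # merge the base annotations with the overrides
        attrs["__annotations__"] = {
            **getattr(base, "__annotations__", {}),
            **annotations,
        }
    cls_name = base.__name__.replace("Base", rename_base_to)
    return type(cls_name, (base,), attrs)


# e.g. a "Read" model adds an `id` field, while an "Update" model
# makes every field optional with a None default
HeroRead = make_subclass(HeroBase, "Read", annotations={"id": int})
HeroUpdate = make_subclass(
    HeroBase,
    "Update",
    attrs={"name": None, "secret_name": None},
    annotations={"name": Optional[str], "secret_name": Optional[str]},
)
```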

The next step would be, for a given set of dynamically generated Models, to also dynamically generate the associated API functions and tests. This seems doable, given that for a given CRUD API function (e.g. create), the basic syntax is quite standardized, with the only real differences being the arguments and/or type hints passed to the function and its decorator.
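As a rough stdlib-only sketch of the idea (no FastAPI here; the route-key format and the `make_crud_routes` helper are hypothetical, standing in for decorator-wrapped endpoint functions):

```python
def make_crud_routes(table_name, store):
    """Hypothetical sketch: generate the four standardized CRUD handlers
    for one table. `store` (a dict of id -> record) stands in for the
    database session; in the real thing these functions would be wrapped
    in FastAPI decorators, e.g. app.post(f"/{table_name}/")(create)."""
    counter = {"next_id": 1}

    def create(record):
        record = {**record, "id": counter["next_id"]}
        store[record["id"]] = record
        counter["next_id"] += 1
        return record

    def read(id):
        return store[id]

    def update(id, changes):
        store[id] = {**store[id], **changes}
        return store[id]

    def delete(id):
        del store[id]

    return {
        f"POST /{table_name}/": create,
        f"GET /{table_name}/{{id}}": read,
        f"PATCH /{table_name}/{{id}}": update,
        f"DELETE /{table_name}/{{id}}": delete,
    }


db = {}
routes = make_crud_routes("recipe_runs", db)
run = routes["POST /recipe_runs/"]({"commit": "abc123"})
```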

Perhaps it seems like premature optimization to be concerned about this before implementing our own minimal database; however, I'm quite concerned a real implementation will quickly become cluttered with hundreds of lines of boilerplate that will make it difficult to understand the core of what we are trying to achieve.

Following the multiple models approach proposed in the SQLModel documentation, each SQL table requires at least: five model definitions + four API functions + four client functions + at least eight tests (if you include error condition tests) + (optionally) four CLI functions = 25 objects, almost all of which conform to a known (class or function) template of some sort.

If we're able to find a reasonable way to implement them, it seems like some of these ideas may be worth proposing upstream. SQLModel is a relatively recently released project, the core of which is very elegant (unsurprisingly, given its author). I'm cautiously optimistic we may be able to come up with some convenience methods that make it even easier to use than it currently is.

@cisaacstern
Member Author

cisaacstern commented Nov 21, 2021

> the real conciseness gains will be in using a similar approach for the other derived classes, but this is a bit more involved, as it requires changing attributes, defaults, and/or annotations.

891372e (without diff here) reflects a bit more work on this. We're now able to define 3 of the 5 model types for the multiple models with inheritance style using factory functions. The models defined by these factories pass all of the API tests described in the SQLModel docs (which are copied in this PR's tests/test_api.py).

I'm still unclear as to whether or not the remaining two model types will be factory-able, given that both of them require (non-default) fields, and I'm not sure if there's a reliable way to dynamically specify that via type() or types.new_class(). (Feels like it might require overwriting __init__ or __init_subclass__ or some such thing, but with all the metaclass wizardry going on in SQLModel, that seems inadvisable.) Even if the proposed factory production of 3 out of 5 models is where we land, though, this feels like a meaningful boilerplate reduction + readability boost to me. Not sure if there are any reliability factors I'm overlooking in this style of dynamic class definition, but it seems pretty straightforward to me.

@cisaacstern
Member Author

MultipleModels is a new container for holding associated models generated by the factory functions described above.

Today's work was focused on adding and testing GenerateEndpoints, which streamlines creation of the FastAPI endpoints for a MultipleModels collection.

Taken together, these objects let us abstract the process of adding arbitrary numbers of tables to our SQL database while greatly reducing boilerplate, and with it the number of places for bugs to creep in.

I've got a bit more work remaining to add the full complement of CRUD commands to the client and cli interfaces, and then verify that all this works with relationship attributes (i.e., foreign keys).

Then it's off to the races with our actual database implementation (presumably by later this week).

@rabernat
Contributor

It's great to see all the progress here.

I can sense you have gone down a bit of a rabbit hole with SQLModel, but my naive impression is that what you have come up with is pretty reasonable. My one recommendation before committing to this path (factory functions, automatic CRUD test generation) would be to try to socialize this approach with someone who really groks SQLModel / Pydantic, etc. I don't see any red flags, but I am not that type of python developer.

@cisaacstern
Member Author

> try to socialize this approach with someone who really groks SQLModel / Pydantic, etc.

Great suggestion.

Working up the reproducer was itself a helpful exercise, prompting further refinement of the approach. Hopefully we'll be able to get some good feedback on this via the Issue and/or other avenues, now that we have a distilled reference for it.

@cisaacstern cisaacstern mentioned this pull request Dec 6, 2021
@cisaacstern cisaacstern mentioned this pull request Dec 7, 2021
@rabernat
Contributor

Looks like great progress happening here.

One minor suggestion: rename entrypoints to interfaces. "Entry points" has a very specific meaning in python.

@cisaacstern cisaacstern marked this pull request as ready for review December 16, 2021 20:39
@cisaacstern
Member Author

cisaacstern commented Dec 16, 2021

@rabernat:

My comments are basically in three categories:

  • Let's make sure everything is documented with docstrings

I believe I've now filled all of the missing docstrings you found, with the exception of .models.RecipeRunBase. I've left this unfinished to avoid redoing the work once we change the fields to match GitHub check runs, as you propose in #6 (comment).

That fields rewrite is beyond the scope of what we'll be able to finish before the holiday break, so in the interest of merging this week (as a bit of a pre-holiday morale boost 🚀 ), I suggest we do the check runs work and associated docstring as its own PR.

  • I'm a bit on-the-fence about just passing the urls + json directly to the CLI. At that point, do we even need the CLI? We could accomplish almost the same thing with curl. However, this is minor and I think we can leave things as they are for now. Eventually, we should try to make the CLI more verbose about the options supported. We will learn a lot from using the CLI in our infrastructure.

Totally agree, and as you suggest here, this is probably most fruitful as its own PR(s) once we've had experience running this package in the wild.

  • The test suite is very comprehensive but has some ugliness. Let's see if we can find a more elegant and also more readable solution. I think it will involve defining custom classes for use in the test suite.

There's a lot of new organization now, some of which I've highlighted in responses to your inline comments, but all of which I hope will be self-explanatory from a fresh browse through the tests directory. Your feedback on this point was very helpful in pushing us toward a more readable and maintainable organization for the test suite.

Contributor

@rabernat left a comment


Ok, we are getting really close with this. Thanks for all of your hard work on refactoring the test suite, which I agree is much improved.

I have left a few requests for clarifications, but those are all pretty minor.

My only major-ish issue is that I still find the structure of the test suite to be pretty byzantine. The big win in this version is that you have eliminated lots of the repetition: a class like TestUpdate is now pretty compact, with the core test logic (e.g. evaluate data, test_update) fairly straightforward and readable.

The cost of this is several layers of inherited classes, spanning conftest.py, interfaces.py, and test_database.py. This code is hard to follow; it was hard for me to figure out which arguments were fixtures and where those fixtures were defined. And it uses some sketchy anti-patterns, such as encoding key information inside function names (cf. get_interface).

If you are not totally sick of this yet, I'd like to propose one more iteration, inspired by what is here, but hopefully with more simplicity and fewer hacks. The key is to use class mixins and inheritance smartly.

Imagine the following code in test_database.py.

```python
class CreateLogic:
    # contains only the "business logic" of creation
    def test_create(self, model_to_create):
        models, request, blank_opts = model_to_create
        data = self.create(models)  # `create` is defined in a mixin
        # now do specific tests of correctness, e.g.
        assert data == request


# by combining a Logic class with a CRUD class, we get a complete test
class TestCreateDatabase(CreateLogic, DatabaseCRUD):
    pass


# this is so easy and simple we don't even need parameterization
class TestCreateCLI(CreateLogic, CLICRUD):
    pass


...


class TestUpdateDatabase(UpdateLogic, DatabaseCRUD):
    pass

# etc
```
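To make the proposed composition concrete, here is a runnable toy version of the same pattern (the CRUD mixin bodies below are hypothetical stand-ins, not the real database/CLI interfaces):

```python
class CreateLogic:
    """Business logic of the create test, shared across backends."""

    def check_create(self, request):
        data = self.create(request)  # `create` is supplied by a CRUD mixin
        assert data == request
        return data


class DatabaseCRUD:
    """Stand-in: issues CRUD calls directly against a fake database."""

    def __init__(self):
        self.db = []

    def create(self, request):
        self.db.append(request)
        return request


class CLICRUD:
    """Stand-in: issues the same CRUD calls through a fake CLI layer."""

    def __init__(self):
        self.invocations = []

    def create(self, request):
        self.invocations.append(("post", request))
        return request


# combining a Logic class with a CRUD mixin yields a complete test class
class TestCreateDatabase(CreateLogic, DatabaseCRUD):
    pass


class TestCreateCLI(CreateLogic, CLICRUD):
    pass
```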

```python
feedstock_id: int  # TODO: Foreign key
commit: str
version: str
status: str  # TODO: Enum or categorical
```
Contributor


So are we going to punt on this suggestion for now?

To be more explicit, here are all the fields from the GitHub check runs API:

| Name | Type | In | Description |
| --- | --- | --- | --- |
| accept | string | header | Setting to `application/vnd.github.v3+json` is recommended. |
| owner | string | path | |
| repo | string | path | |
| name | string | body | **Required.** The name of the check. For example, "code-coverage". |
| head_sha | string | body | **Required.** The SHA of the commit. |
| details_url | string | body | The URL of the integrator's site that has the full details of the check. If the integrator does not provide this, then the homepage of the GitHub app is used. |
| external_id | string | body | A reference for the run on the integrator's system. |
| status | string | body | The current status. Can be one of `queued`, `in_progress`, or `completed`. Default: `queued` |
| started_at | string | body | The time that the check run began. This is a timestamp in ISO 8601 format: `YYYY-MM-DDTHH:MM:SSZ`. |
| conclusion | string | body | **Required** if you provide `completed_at` or a `status` of `completed`. The final conclusion of the check. Can be one of `action_required`, `cancelled`, `failure`, `neutral`, `success`, `skipped`, `stale`, or `timed_out`. When the conclusion is `action_required`, additional details should be provided on the site specified by `details_url`. **Note:** Providing `conclusion` will automatically set the `status` parameter to `completed`. You cannot change a check run conclusion to `stale`; only GitHub can set this. |
| completed_at | string | body | The time the check completed. This is a timestamp in ISO 8601 format: `YYYY-MM-DDTHH:MM:SSZ`. |
| output | object | body | Check runs can accept a variety of data in the `output` object, including a title and summary, and can optionally provide descriptive details about the run. See the output object description. |

@rabernat
Contributor

rabernat commented Jan 7, 2022

Noting an idea from our discussion today.

In order to eliminate all of the calls to clear_table, we could create a fixture that does this for us. To do this, we would chain together two fixtures:

  • The current database / API session-scoped fixtures
  • Function-scoped fixture that takes those as inputs but automatically clears the database tables before each invocation
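The same chaining pattern, sketched with stdlib context managers instead of pytest fixtures (names here are illustrative, not the real fixture names):

```python
from contextlib import contextmanager


@contextmanager
def session_db():
    # stand-in for the session-scoped database/API fixture:
    # the resource is created once and reused across tests
    db = {"recipe_runs": []}
    yield db


@contextmanager
def clean_db(db):
    # stand-in for the function-scoped fixture: it receives the
    # session-scoped resource and clears the tables before yielding
    db["recipe_runs"].clear()
    yield db


with session_db() as db:
    db["recipe_runs"].append("stale row from a previous test")
    with clean_db(db) as fresh:
        assert fresh["recipe_runs"] == []  # cleared automatically
```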

@cisaacstern
Member Author

@rabernat, I believe I've covered everything we discussed today in the last few commits.

> to eliminate all of the calls to clear_table, we could create a fixture that does this for us

3e0f540 shows what was needed to do this. It was slightly more involved than we'd initially guessed, but ultimately quite straightforward. And it eliminated the need to call clear_table at the test level, a definite win for test simplicity.

It's certainly possible there's something I've overlooked, so definitely take a look, but AFAICT we're ready to merge.

@rabernat
Contributor

rabernat commented Jan 7, 2022

Just wanted to note that I followed the development guide and everything worked perfectly. The auto-generated API docs are a thing of beauty 🤩

[screenshot: auto-generated API docs]
