Skip to content
This repository has been archived by the owner on Dec 5, 2023. It is now read-only.

Task class #43

Closed
wants to merge 29 commits into from
Closed

Task class #43

wants to merge 29 commits into from

Conversation

matthewhanson
Copy link
Member

This draft Task class (see #42) is derived from the Cirrus task template: https://github.com/cirrus-geo/cirrus-task-template

@matthewhanson
Copy link
Member Author

I was creating a new task for Cirrus, so I decided to just expand on the Task class here and get a working version.

I've also added a task example in the tests directory that show a basic copy task, now far simpler than having all the previous boilerplate.

@jkeifer
Copy link
Member

jkeifer commented Mar 29, 2022

As you know I really like this idea. I wish I had more time to review and test out some of my thoughts, but here's a couple high-level comments:

  • I think the base class here is not Task, but Lambda, and Task should be a subclass of Lambda. Then we'd also have Feeder and Function subclasses. I say that because things like logging, event parsing, and args/cli functions would be common across all of these "lambda" types.
  • I would look into using the __init_subclass__ method on that base Lambda class to register the handler method as the function lambda_handler within the subclass's module. I might have this totally wrong, but it could be something like this:
    def __init_subclass__(cls, abstract=False, **kwargs):
        super().__init_subclass__(**kwargs)
        if abstract:
            return
        module = cls.__module__
        module.lambda_handler = cls.handler

For abstract classes like Task and Feeder, you could declare them like class Task(Lambda, abstract=True) and those would skip the handler registration.

@matthewhanson
Copy link
Member Author

Good point on having another base class, although I really don't like calling it Lambda though, because this Task would be used for Batch as well. So not sure what to call the base class, but like the idea of having one and then adding a Feeder class (which also does not need to be a Lambda).

@jkeifer
Copy link
Member

jkeifer commented Mar 29, 2022

In my mind this was only for use for tasks within a cirrus project, i.e., for things packaged as a lambda. But you're suggesting it could be used for tasks built into docker containers as well? I suppose that makes sense.

Then maybe we want to have a LambdaHandler mixin that does the handler registration, and we could expose both a Task and TaskLambda class? Or perhaps a Lambda class decorator that does the registration?

I don't know what is best here, I just want to make sure it is super easy for a task author to ensure the task class will actually work with the default lambda definition.yml config.

@matthewhanson
Copy link
Member Author

Oh I see, I was wondering what the value was with the registration. But yeah, you could use the Task class when creating a new Batch task as well, it provides the CLI for local testing and all the wrapping/fetching/parsing/payload return.

We could also change the default generated definition.yml file for new tasks to not use lambda_function.lambda_handler, which as you know I've never cared for.

@jkeifer
Copy link
Member

jkeifer commented Mar 29, 2022

I suppose cirrus could import the .py file in the component dir to inspect it to see if it has a subclass of the base (whatever we're calling it), an explicit lambda_handler definition, or any other patterns we would consider supporting here. This seems a little more "magic" than I'd normally like, because it comes from a build-time inspection and isn't something you can "see" externally, but advantage over the current just-set-a-default approach is that we could error the build command if no handler is set in the definition.

I guess we also need to know what .py file to import; that could be handled in a similar discovery-with-error process as above.

Though we also have to consider with the lambda-as-batch runner, to make sure whatever is set here is compatible with the patterns it supports. If we are actually modifying the module state on class instantiation to set a well-known lambda_handler value, then we get that compatibility for free.

Also, maybe setting lambda_handler in the module namespace isn't a problem for batch-only applications. So perhaps we don't need different behavior for lambda vs not.

self.original_items = payload.pop('features')
self.items = deepcopy(self.original_items)

self.process_definition = payload.pop('process')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For compatibility with workflow chaining, I think you want to use payload.process here, as that will figure out the current process definition in the payload for you. Otherwise you could get a list of process definitions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I say that, but it seems this attribute is used two different ways. In light of that realization, I think maybe you want something like this:

Suggested change
self.process_definition = payload.pop('process')
self.process_definition = payload.process
self._process_definition = payload.pop('process')

Then in output_payload you would use `self._process_definition.

That said, I'd be interested in learning more about the need here and what problem(s) this whole pull-it-apart-and-rebuild-later strategy solves for us.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good points here.

The motivation here was to try and keep things clear for developers as to what should be mutable and what shouldn't.

The input is items (features) which we want to store the original version, but then also a separate output version of that users would modify (they would never modify the original items, it is for reference and adding source links and such). I'm also wondering if it makes sense to copy the input items by default into the output structure.

The process definition on the other hand should generally not be changed by the user (although I've hit a couple use cases where I modify the parameters for future tasks).

So basically the reason for splitting them up is because the user interacts with the Items and the Process definition in different ways. Although it probably makes sense to treat them in a similar way. Make a deepcopy of the original paylod then a copy that can be changed by the user.

src/cirrus/lib/task.py Outdated Show resolved Hide resolved
src/cirrus/lib/task.py Outdated Show resolved Hide resolved
@matthewhanson
Copy link
Member Author

@jkeifer The testing error is also happening in your PR #46 , which is in the transfer tests. Any idea what could be happening here?

@matthewhanson matthewhanson changed the title initial draft of a Task class Task class Apr 12, 2022
@matthewhanson matthewhanson marked this pull request as ready for review April 12, 2022 04:06
@matthewhanson
Copy link
Member Author

I added some tests, although there are still failing tests from test_transfer unrelated to these additions

@matthewhanson
Copy link
Member Author

This also introduces using VCR for testing, will open new ticket

src/cirrus/lib/task.py Outdated Show resolved Hide resolved
src/cirrus/lib/task.py Outdated Show resolved Hide resolved
@codecov-commenter
Copy link

codecov-commenter commented May 13, 2022

Codecov Report

Merging #43 (9da9121) into main (64f7cd1) will decrease coverage by 0.14%.
The diff coverage is 63.38%.

@@            Coverage Diff             @@
##             main      #43      +/-   ##
==========================================
- Coverage   69.77%   69.63%   -0.15%     
==========================================
  Files           5        6       +1     
  Lines         622      764     +142     
==========================================
+ Hits          434      532      +98     
- Misses        188      232      +44     
Impacted Files Coverage Δ
src/cirrus/lib/task.py 63.38% <63.38%> (ø)
src/cirrus/lib/transfer.py 88.23% <0.00%> (+9.41%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 64f7cd1...9da9121. Read the comment docs.

@matthewhanson
Copy link
Member Author

I was trying to fix the tests, started changing from mocking to vcr when I updated all the dev dependencies and the tests all work now. So ready for review so we can get this merged into main @jkeifer

CHANGELOG.md Show resolved Hide resolved
src/cirrus/lib/task.py Outdated Show resolved Hide resolved
src/cirrus/lib/task.py Outdated Show resolved Hide resolved
Comment on lines 28 to 30
_name = 'task'
_description = 'A task for doing things'
_version = '0.1.0'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems weird to me to default these things. Shouldn't they be required parameters on any subclasses? That said, validating that they are set on subclasses likely would require using the __new__ method on a custom metaclass. Perhaps we make an issue to follow up on this instead of trying to resolve it now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are static attributes and this was the easiest way to define them, but yes they should be required. Every base class should define them, so open to options on how to better do that.

src/cirrus/lib/task.py Outdated Show resolved Hide resolved

@property
def process_definition(self) -> Dict:
return self._payload['process']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have an incompatibility here with workflow chaining. You should use self._payload.process instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth adding a workflow chaining test case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.
Test for workflow chaining test case sounds like a good idea. Not it.

@matthewhanson
Copy link
Member Author

This work has been replaced by new library stac-task

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants