
Mutate Decorator #1160

Merged: 4 commits merged into DAGWorks-Inc:main on Oct 7, 2024

Conversation

jernejfrank
Contributor

@jernejfrank jernejfrank commented Sep 28, 2024

So I figured out that if we pass function pointers into the decorator, we can append to them from within the decorator.

Addressing: #922
I think this makes life much easier, but let me know if I have simplified my life too much and this could lead to trouble down the line.

Changes

  • Added new decorator mutate (see the sketch below)
  • Applicable supports delayed use by inputting fn=None; the target function can be set via target_fn
  • pipe_output: you can specify which end nodes get which transforms applied
  • Light refactoring in the same module
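
As a rough illustration of the new decorator, here is a minimal sketch (names are illustrative only; the concrete API is discussed further down in this thread):

import pandas as pd
from hamilton.function_modifiers import mutate

def data_1() -> pd.DataFrame:
    return pd.DataFrame({"col_1": [3, 2, None, 0]})

# mutate builds a pipe_output for data_1 behind the scenes and appends _filter to it,
# so the data_1 node ends up returning the filtered frame.
@mutate(data_1)
def _filter(some_data: pd.DataFrame) -> pd.DataFrame:
    return some_data.dropna()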

How I tested this

  • new example has a smoke test
  • added as an override module to the scikit-learn example species_distribution_modeling
  • unit tests

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment

👍 Looks good to me! Reviewed everything up to 754964f in 12 seconds

More details
  • Looked at 500 lines of code in 9 files
  • Skipped 2 files when reviewing.
  • Skipped posting 4 drafted comments based on config settings.
1. hamilton/function_modifiers/macros.py:1167
  • Draft comment:
    Ensure that the transform attribute is initialized for the target function if it doesn't already exist before appending to it.
  • Reason this comment was not posted:
    Comment did not seem useful.
2. hamilton/function_modifiers/macros.py:1168
  • Draft comment:
    Ensure that the transform attribute is initialized for the target function if it doesn't already exist before appending to it.
  • Reason this comment was not posted:
    Marked as duplicate.
3. docs/reference/decorators/pipe.rst:30
  • Draft comment:
    Ensure that the mutate decorator is fully documented here, as it is a new addition. The PR already adds documentation for it, which is good.
  • Reason this comment was not posted:
    Confidence changes required: 20%
    The mutate decorator is being introduced, and it should be documented in the sphinx documentation. The PR already adds documentation for the mutate decorator in the docs/reference/decorators/pipe.rst file, which is appropriate.
4. hamilton/function_modifiers/macros.py:1070
  • Draft comment:
    The mutate class is a new addition and should be documented in the sphinx documentation. Ensure that it is included in the relevant documentation files.
  • Reason this comment was not posted:
    Confidence changes required: 20%
    The mutate decorator is being introduced, and it should be documented in the sphinx documentation. The PR already adds documentation for the mutate decorator in the docs/reference/decorators/pipe.rst file, which is appropriate.

Workflow ID: wflow_nny9Up2c9Q0OqvXU


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@jernejfrank
Contributor Author

Just documenting some thoughts while implementing to keep a record:

  1. Using function pointers forces the @mutate functions to be placed after the definitions of the target functions --> nice because we already get info from the pointer if it is decorated with some adapter lifecycle (see the sketch below).
  2. For now it will only check for pipe_output as a decorator, since others are a bit tricky to capture.
    We have the following base classes:
  • NodeResolver
  • NodeCreator
  • NodeExpander
  • NodeTransformer
  • NodeInjector
  • NodeDecorator
    and the problem is that they don't necessarily force the output to be a single node or multiple nodes. (For example, an expander is easy to block from being used with mutate, but the creator has child classes with_columns, does, datasaver, dataloader.) --> we would need to filter by each decorator separately, but that means this list needs to be maintained... --> need to think about other solutions for capturing a property that identifies whether a decorator can be used in conjunction with mutate.
  3. I know we discussed something like pipe_output_builder; this is still on the table with the additional functionality I haven't implemented yet (namespace, all the things we can add to step, i.e. named, when...). But for a first working version it seems excessive.
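
To make point 1 concrete, a minimal sketch of the ordering constraint (function names are illustrative, not from this PR):

from hamilton.function_modifiers import mutate

def raw_data() -> int:   # the target function has to be defined first
    return 10

@mutate(raw_data)        # the function pointer must already exist at this point
def _add_ten(x: int) -> int:
    return x + 10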

Collaborator

@skrawcz skrawcz left a comment

Fun! Great stuff!

Edge cases / things to double check or think through:

  1. Does this work with both variants of using Ray? i.e. checking SERDE.
  2. What should we do if someone does:

@mutate
def no_leading_underscore(...) -> ...:

  3. I assume the Hamilton UI has no issue with this?
  4. How could someone screw this up? Do we need anything for error handling?
  5. What guardrails do we have? Or should we think about adding some?

Otherwise, the commit for this should include any assumptions/constraints so we can catalog design decisions and anything that might help our future selves --> if it's in this PR message, it's then easy to squash-merge with it...


def __init__(
self,
*target_functions: Callable,
Collaborator

I think we might need to support a string -- since it could come from extract_fields for instance.

Contributor Author

That is an interesting edge case and also one for pipe_output that we haven't addressed; right now if we try

def _pre_step(something:int)->int:
        return something + 10

def _post_step(something:int)->int:
        return something + 100

def a()->int:
        return 10

@pipe_output(
        step(_pre_step)
)
@extract_fields(
        {"field_1":int, "field_2":int}
)
@pipe_input(
        step(_pre_step)
)
def foo(a:int)->Dict[str,int]:
    return {"field_1":1, "field_2":2}

we run into an InvalidDecoratorException: Expected a single node to transform, but got 2. when trying to use post_pipe. Since mutate is an extension of post_pipe we get the same issue.

I think this should be solved upstream in post_pipe and added as an allowed option, what do you think?

Contributor Author

@jernejfrank jernejfrank Sep 29, 2024

This also opens up the discussion from my notes: when using it with other decorators, we need to decide on an individual basis when mutate is meant to be applied. For example, check_output should be applied to the result of mutate, but here mutate should be applied to the result of extract_fields. Sometimes you may even want to mutate the function first and then extract fields?

This ordering is somewhat tricky since we are not stacking decorators in the usual way on top of a single function where the order is clear.

Collaborator

This is difficult to get right and I'd prefer not to support it for that reason. @jernejfrank your original implementation would have made this easier (it worked on strings), but we're really just reusing the pieces we have so as not to add additional concerns inside the function graph (as we have not really thought about the implications).

Specifically:

  1. Decorators operate on a per-function basis
  2. This adds the function-level decorators to functions
  3. We don't know the nodes until after we process the functions

While we could change that, it gets really tricky, and I think it'll spaghettify the code.

To unblock this I'd say we add a target for @mutate -- similar to how we do others:

@mutate(other_function, target_=...)
def ...

Collaborator

@skrawcz skrawcz Sep 30, 2024

Hmm, just to make sure we're on the same wavelength, I'm thinking about this:

@extract_columns("foo", "bar")
def data_set(...) -> pd.DataFrame:
     ...

# but I want to use mutate on `foo` and ` bar` -- how do I do that?
@mutate("foo", "bar")
def _normalize(col: pd.Series) -> pd.Series:
  ...

@extract_fields({"train": pd.DataFrame, "test": pd.DataFrame})
def training_data( ...) -> dict:
   ...

# but I want to apply the same transforms to each data set
@mutate("train", "test")
def _transforms(df: pd.DataFrame) -> pd.DataFrame:
   ...

That is -- you want to apply mutate to some prior valid Hamilton function/node.

Contributor Author

@jernejfrank jernejfrank Sep 30, 2024

I like @skrawcz's suggestion: @mutate((function_pointer, node_string), ...).

It allows us to get out of the spaghetti mess with pure strings. I couldn't find a reasonable way of doing it without either adding another mutate_functions collector at the graph.py level or letting the DAG be generated as is and then adding a second traversal through the nodes, searching for the specific string names. The main problem is that without the function pointer, we lose any locality of where the node will be generated, so it is impossible to add them to the current node resolution without the above workarounds.

API-wise we can have @mutate(Callable | Tuple[Callable, str]), and if it is only a function pointer it gets added as a classical pipe_output; if it is a Tuple, it delays construction until we reach the node level it can be added to (now we get this info, because the function pointer has it stored!)

Re: config -- I think that's fine. We just say this is the API

# but I want to use mutate on `foo` and ` bar` -- how do I do that?
@mutate(
    (data_set__y, "foo"),
    (data_set__y, "bar"),
    (data_set__z, "foo"),
    (data_set__z, "bar")
    )
def _normalize(col: pd.Series) -> pd.Series:
    return ...

or to make it less verbose:

@mutate(
    (data_set__y, ["foo","bar"], some_dataset_specific_variable=value(10)),
    (data_set__z, ["foo","bar"], some_dataset_specific_variable=source('upstream_node'))
    )
def _normalize(col: pd.Series, some_dataset_specific_variable: int) -> pd.Series:
    return col * some_dataset_specific_variable

and the fully flexible option:

@mutate(
    (data_set__y, "foo", some_dataset_specific_variable=value(10)),
    (data_set__y, "bar", some_dataset_specific_variable=value(20)),
    (data_set__z, "foo", some_dataset_specific_variable=source('upstream_node1')),
    (data_set__z, "bar", some_dataset_specific_variable=source('upstream_node2'))
    )
def _normalize(col: pd.Series, some_dataset_specific_variable: int) -> pd.Series:
    return ...

Practically, I think this would then add a tag at the decorator level, and we just need to be sure that it is the last NodeTransformer decorator during node resolution. Then the above use-cases can be handled easily, since we get the nodes passed in and have all the info we need / the config stuff has already been resolved.

What do you think?

Collaborator

@elijahbenizzy target has another meaning with other decorators, so I wouldn't use that here.

One attempt at your point:

@extract_columns("foo", "bar")
def data_set(...) -> pd.DataFrame:
     ...

# but I want to use mutate on `foo` and ` bar` -- how do I do that?
@mutate((dataset, "foo"), (dataset, "bar"))
def _normalize(col: pd.Series) -> pd.Series:
  ...

@extract_fields({"train": pd.DataFrame, "test": pd.DataFrame})
def training_data( ...) -> dict:
   ...

# but I want to apply the same transforms to each data set
@mutate((training_data, "train"), (training_data, "test"))
def _transforms(df: pd.DataFrame) -> pd.DataFrame:
   ...

but function pointers break down with @config.when which is a common decorator:

@extract_columns("foo", "bar")
@config.when(x="y")
def data_set__y(...) -> pd.DataFrame:
     ...

@extract_columns("foo", "bar")
@config.when(x="z")
def data_set__z(...) -> pd.DataFrame:
     ...

# but I want to use mutate on `foo` and ` bar` -- how do I do that?
@mutate((???, "foo"), (???, "bar"))
def _normalize(col: pd.Series) -> pd.Series:
  ...

Otherwise Hamilton should process the nodes in the order they were encountered. So I don't think it needs to be two-pass if that assumption is true. This feature is only 50% useful to Hamilton if it can only depend on function pointers.

Edge cases to consider:

  • what happens if the node it depends on doesn't exist? e.g. via @config.when or user error, etc.

@skrawcz @jernejfrank target is explicitly meant for this. Semantically it's identical. If we append a chain, target provides the node on which the chain is appended/nodes are modified.

Furthermore, it's partially wired in. @post_pipe is a SingleNodeNodeTransformer. This, by default, means that it will apply to all the "sink" nodes (target=None). We limit it to just a single sink node, so we don't have to have target.

Instead, we should be able to alter these decorators to be a NodeTransformer (superclass of SingleNodeNodeTransformer) and thus take in target. And wiring through the target is then the exact same parameter. See this.

So IMO target is more consistent, and pretty much the same as tuples above. Should be straightforward to get this to work, if we want to support this.

Then there's the question of what to do if a target isn't applied -- my guess (and we'd have to test) is that if we make pipe_input and pipe_output a NodeTransformer, and use the default target (of None -- implies all "sink" nodes get decorated), then we'll actually be able to just apply the same pipe operation to every column. Cool!

E.G.

@extract_fields({"train": pd.DataFrame, "test": pd.DataFrame})
def training_data( ...) -> dict:
   ...

# but I want to apply the same transforms to each data set
@mutate(training_data) # will, by default, apply to each field
def _transforms(df: pd.DataFrame) -> pd.DataFrame:
   ...

Regarding config, I think we just need to specify both functions. We may be able to use a string for a function and look for anything in the same module to see if it matches (E.G. without the double underscore), but we should save this edge case for a bit later.

Otherwise, re: second pass, it's subtle as to why it has to be done. It's not because it carries references to other nodes, it's because running this decorator requires changing the other node's names. E.G. you have a node foo. Then you pipe_output foo so you have foo_raw -> foo. Changing the name of foo to foo_raw is something that has to be done outside of the current decorator framework.

Conceptually, this is equivalent to allowing decorators to modify any part of the entire DAG, rather than individual subdags produced by functions, which makes it tougher to reason about, and requires n passes (one for each decorator). I'd prefer to avoid this for now, and I actually quite like the function-pointer approach.
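
To illustrate the renaming, here is a minimal sketch (the exact internal node names may differ; this only shows the idea):

from hamilton.function_modifiers import pipe_output, step

def _post_step(something: int) -> int:
    return something + 100

@pipe_output(step(_post_step))
def foo(a: int) -> int:
    return a
# Conceptually: the original function becomes a node named foo_raw, _post_step runs on its
# result, and the final node keeps the name foo, so downstream consumers of foo are unaffected.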

Collaborator

Another thought -- target is actually overused in Hamilton. We have:

  1. target being the parameter we apply to (see data loaders)
  2. target being the output node we use (see data savers, data validators, check_output, tag, etc.)

(2) is the more used case, and corresponds internally, but we don't have to name it target if it's confusing. Just that we should use the internal concept of target to implement, and that'll make it easier.

An idea, re: config. I think this should work:

@extract_columns("foo", "bar")
@config.when(x="y")
def data_set__y(...) -> pd.DataFrame:
     ...

@extract_columns("foo", "bar")
@config.when(x="z")
def data_set__z(...) -> pd.DataFrame:
     ...

# but I want to use mutate on `foo` and ` bar` -- how do I do that?
@mutate('data_set', "foo", "bar") # however you want to specify foo/bar, we have a few different patterns
def _normalize(col: pd.Series) -> pd.Series:
  ...

Specifically, if we have a string, it can refer to:

  1. The name of the function in the same module as the current one
  2. The name of the function without __ (I think we might even track it...)

Contributor Author

@skrawcz @jernejfrank target is explicitly meant for this. Semantically it's identical. If we append a chain, target provides the node on which the chain is appended/nodes are modified.

Furthermore, it's partially wired in. @post_pipe is a SingleNodeNodeTransformer. This, by default, means that it will apply to all the "sink" nodes (target=None). We limit it to just a single sink node, so we don't have to have target.

Instead, we should be able to alter these decorators to be a NodeTransformer (superclass of SingleNodeNodeTransformer) and thus take in target. And wiring through the target is then the exact same parameter. See this.

So IMO target is more consistent, and pretty much the same as tuples above. Should be straightforward to get this to work, if we want to support this.

Awesome, thanks @elijahbenizzy for drawing my attention to the superclass. That is a very clean way to resolve this dilemma and does exactly what we want to achieve with tuples. I'll go ahead and implement it quickly to have a concrete case that we can further discuss if something feels off and we initially missed.

Then there's the question of what to do if a target isn't applied -- my guess (and we'd have to test) is that if we make pipe_input and pipe_output a NodeTransformer, and use the default target (of None -- implies all "sink" nodes get decorated"), then we'll actually be able to just apply the same pipe operation to every column. Cool!

E.G.

@extract_fields({"train": pd.DataFrame, "test": pd.DataFrame})
def training_data( ...) -> dict:
   ...

# but I want to apply the same transforms to each data set
@mutate(training_data) # will, by default, apply to each field
def _transforms(df: pd.DataFrame) -> pd.DataFrame:
   ...

pipe_input I don't think we need to touch (it's a NodeInjector now), and I am not sure how we would use it with mutate -- can you expand on where it comes into play, @elijahbenizzy?

But I did change pipe_output to the superclass NodeTransformer, and happy to report that your above hunch is correct. In case no target is specified, at least for extract_fields, it gets applied to both "train" and "test" (some minor caveats around adopting the name from the node instead of the function in case the namespace is "...", but that should be ok).

Regarding config, I think we just need to specify both functions. We may be able to use a string for a function and look for anything in the same module to see if it matches (E.G. without the double underscore), but we should save this edge case for a bit later.

I will start off simple: you need to input both functions, and if everything works / we are happy with the API, this can be changed at the end, if that's alright with you?

Collaborator

Great! You are correct, @pipe_input does not apply to this, although it does apply to the other definition of target -- adding a parameter target would make sense here IMO.

But that's a separate issue, so I created this: #1161.

LMK what you need/when you want a review!

@jernejfrank
Contributor Author

jernejfrank commented Sep 29, 2024

Re Edge cases -- 1., 3. are ok!

For 2., 4., 5. -- we could add the flag allow_experimental_mode. Then we have one version that offers basic, limited, and safe functionality, and another where the user accepts dealing with caveats -- as long as we document it, it should be ok, no? It also gives us a way out to change things and signals that the API is still bound to change.

Basic:

  • you can use it with check_output, probably some others
  • pass in function pointers and additional arguments
  • all targeting functions are mutated the same way
  • single module
  • transforms are automatically hidden (no_leading_underscore edge case)
  • all the limits of pipe_output, just allows you to implement it "remotely"

Advanced

  • pass in strings
  • can be across multiple modules
  • have an additional config, can do stuff like naming, namespace, when
  • there's a default order, but have the option to overwrite it
  • combining with other decorators and affecting things on the node level?

@elijahbenizzy
Collaborator

I think it's OK to document caveats here. This is new functionality, and as long as the API is stable I'm happy. Cross-module, to me, might be another decorator @ad_hoc_mutate, as the intended use-case is next to the driver. E.G. for making quick mutations.

Agreed that transforms should be automatically hidden, whether you use underscores or not (E.G. marked to ignore by hamilton).

Re: additional config, that should be easy enough to add -- we do the same thing with step and the transform list, but you just need an API. IMO this would be nice to ship here, but curious about the actual challenge in implementation. Not sure of the value of overriding the order (it feels like it'll just get confusing).

This should work the same way pipe_output should layer decorators -- but re: decorating the actual step nodes, @jernejfrank and I talked about this. We'll have to call out to the node resolution tool, which could be used by pipe_input, pipe_output, etc... I think that's a natural next step, but it shouldn't block this.

@elijahbenizzy
Collaborator

Nice work! Given that this is an in-progress PR, I'll leave the following feedback:

  1. Structure/approach looks good -- we want to ensure the API is good. Means deciding whether we're going to support strings (I think you'll find it harder without hacking up the code more). Can be a TODO for later.
  2. Worth thinking about the use-cases this unlocks. IMO the coolest is actually the cross-module case. So might be worth adding that validation as optional? Or a separate decorator that changes this? I think this allows for quick changes of dataflows before you put it in.
  3. Think about how you want to do tests. I'd recommend testing the individual pieces of this then having a few end-to-end tests to tie it together!

@jernejfrank jernejfrank changed the title from "Mutate Decorator" to "WIP - Mutate Decorator" on Oct 1, 2024
@jernejfrank
Contributor Author

jernejfrank commented Oct 1, 2024

Ok, this is very much WIP and my stab at this -- very open to just hearing that what I did is ridiculous and I shouldn't complicate things so much :)

Effectively, I tried to mimic the pipe API and wire things in the background. Before I go deeper down the rabbit hole with edge cases, tests, docs, etc., it would be good to hear your thoughts on it. No real decision around naming yet, I just put in some placeholders that allow me to keep things apart.

@elijahbenizzy
Collaborator

@jernejfrank happy to look at the implementation next; first took a look at the API. I really like it!

Only question I have -- the targeted is clever, but my worry is that it muddles things a bit:

To me, it would be clearer as two separate decorators (this should be allowed, right?). Think you had it. OTOH, it's a bit weird if it's split out between decorators like it is here (field_1 is modified twice...).

Thoughts?

@pipe_output(
        step(_pre_step).named(name="b"),
        target=["field_1"]
)
@pipe_output(
        step(_pre_step).named(name="b").targeted("field_1"),
        target=["field_3", "field_2"]
)
@extract_fields(
        {"field_1":int, "field_2":int, "field_3":int}
)
def foo(a:int)->Dict[str,int]:
    return {"field_1":1, "field_2":2, "field_3":3}

@jernejfrank
Contributor Author

jernejfrank commented Oct 1, 2024

Nice, glad it is going in the right direction.


Re: two separate decorators -- actually we cannot stack two pipe_output decorators like this ATM. It comes down to renaming the original node to node_raw, and then Hamilton thinks we have two nodes with the same name and throws an error. To do this we would need to keep a global state of how many pipe_output decorators we apply and wire them together internally (skipping the in-between identity nodes). Not exactly sure how to do this at that level, since we only get the current node to look at...

Also note that this would not work (added comment to your code):

@pipe_output(
        step(_pre_step).named(name="b"),
        target=["field_1"]
)
@pipe_output(
        step(_pre_step).named(name="b").targeted("field_1"),
        target=["field_3", "field_2"]  # this makes the global subset "field_2", "field_3", which means that _pre_step.targeted("field_1") gets ignored (no global "field_1")
)
@extract_fields(
        {"field_1":int, "field_2":int, "field_3":int}
)
def foo(a:int)->Dict[str,int]:
    return {"field_1":1, "field_2":2, "field_3":3}

The functionality ATM is the following: the global target gets applied first and creates a subset of all available nodes; then the individual steps have their own target that makes a subset of the already created subset from the global one. (If the global one is empty, it's the whole set.)

@pipe_output(
        step(_foo),
        step(_bar),
        # no global target here means `_foo`, `_bar` get applied to all output nodes "field_1", "field_2", "field_3"
)
@extract_fields(
        {"field_1":int, "field_2":int, "field_3":int}
)
def fn(a:int)->Dict[str,int]:
    return {"field_1":1, "field_2":2, "field_3":3}


@pipe_output(
        step(_foo),
        step(_bar),
        target=["field_1","field_2"]
        # global target here means `_foo`, `_bar` get applied to "field_1" and "field_2"
)
@extract_fields(
        {"field_1":int, "field_2":int, "field_3":int}
)
def fn(a:int)->Dict[str,int]:
    return {"field_1":1, "field_2":2, "field_3":3}


@pipe_output(
        step(_foo).targeted("field_1"), # this makes `_foo` only applied to "field_1"
        step(_bar), # this gets applied to "field_1", "field_2", "field_3"
        # no global target here means steps without their own target get applied to all output nodes
)
@extract_fields(
        {"field_1":int, "field_2":int, "field_3":int}
)
def fn(a:int)->Dict[str,int]:
    return {"field_1":1, "field_2":2, "field_3":3}


@pipe_output(
        step(_foo).targeted("field_1"), # this makes `_foo` only applied to "field_1"
        step(_bar).targeted("field_2"), # this makes `_bar` only applied to "field_2"
        target=["field_1","field_2"]
        # global target here means `_foo`, `_bar` can only be applied to "field_1" and "field_2"
)
@extract_fields(
        {"field_1":int, "field_2":int, "field_3":int}
)
def fn(a:int)->Dict[str,int]:
    return {"field_1":1, "field_2":2, "field_3":3}


@pipe_output(
        step(_foo).targeted("field_1"), # `_foo` targets "field_1", but only "field_3" is available, so it is skipped
        step(_bar).targeted("field_2"), # `_bar` targets "field_2", but only "field_3" is available, so it is skipped
        target="field_3"
        # global target here means `_foo`, `_bar` get applied to "field_3"
)
@extract_fields(
        {"field_1":int, "field_2":int, "field_3":int}
)
def fn(a:int)->Dict[str,int]:
    return {"field_1":1, "field_2":2, "field_3":3}

I hope this is not too confusing.

@jernejfrank
Contributor Author

jernejfrank commented Oct 1, 2024

These two functionalities achieve the same thing

%%cell_to_module -m pipe_output_with_target --display
from typing import Dict
from hamilton.function_modifiers import extract_fields, pipe_input, pipe_output, step

def _pre_step(something:int)->int:
        return something + 10

def _post_step(something:int)->int:
        return something + 100

def _something_else(something:int)->int:
        return something + 1000

def a()->int:
        return 10

@pipe_output(
        step(_something_else),
        step(_pre_step).named(name="b").targeted("field_1"),
        step(_post_step).named(name="a").targeted(["field_1", "field_3"]),
)
@extract_fields(
        {"field_1":int, "field_2":int, "field_3":int}
)
def foo(a:int)->Dict[str,int]:
    return {"field_1":1, "field_2":2, "field_3":3}


@pipe_output(
        step(_something_else),
        target="b_field_2"
)
@pipe_output(
        step(_something_else),
        step(_post_step).named(name="a"),
        target="b_field_3"
)
@pipe_output(
        step(_something_else),
        step(_pre_step).named(name="b"),
        step(_post_step).named(name="a"),
        target="b_field_1"
)
@extract_fields(
        {"b_field_1":int, "b_field_2":int, "b_field_3":int}
)
def bar(a:int)->Dict[str,int]:
    return {"b_field_1":1, "b_field_2":2, "b_field_3":3}

Either we have one pipe_output and target on multiple levels, or we need to stack pipe_output decorators, each grouping steps with the same target node in mind. However, in the second case with a single target, the user really needs to pay attention to which output nodes still don't have a post_pipe on them, otherwise it errors out that we have duplicate nodes.

@elijahbenizzy
Collaborator


Interesting... So I'm a fan of the second, but I'm curious if we can do these modifications:

  1. Allow target to be a list -- thus we just apply to all (this could just be a for-loop internally)
  2. Apply the target as part of the namespace so we don't get collisions.
  3. Maybe allow target to be Ellipsis (...), which would mean it applies to everything (or have that be the default)?

I'm not 100% following the point about erroring out with duplicates. Do you have an example of where that would happen?

@jernejfrank
Contributor Author

jernejfrank commented Oct 1, 2024

Interesting... So I'm a fan of the second, but I'm curious if we can do these modifications:

Funny, I prefer the first one :) I would rename targeted to target_node or something, but to me it looks slicker, and I like that we can gather the building blocks separately but then construct a single pipe in one go, as opposed to constructing multiple pipes.

I also added a restriction: we can either use target at the level of the pipe, where it applies globally, or we apply it per step, in which case we cannot have the global target anymore.

  1. Allow target to be a list -- thus we just apply to all (this could just be a for-loop internally)

This is already allowed -- you can pass in a list of nodes, or if target=None it gets applied to all sink nodes. The same rules as for NodeTransformer apply.

  2. Apply the target as part of the namespace so we don't get collisions.

I am not sure what you mean by that. I did have to change namespace already a bit (not fully tested) and use node_.name to avoid collisions (maybe we are thinking the same thing here).

I think there are two collision issues: one at the level of nodes, and the other, if we call multiple pipe_outputs, is how to handle the original_node_raw --> this one I find problematic because you don't know that there is another pipe coming, since they get applied sequentially.

  3. Maybe allow target to be Ellipsis (...), which would mean it applies to everything (or have that be the default)?

This breaks down with extract_fields because of a type mismatch: the original node will give you a dict and the extracted field is a different type.
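
For concreteness, a minimal sketch of that mismatch (illustrative only, not code from this PR):

from typing import Dict
from hamilton.function_modifiers import extract_fields

@extract_fields({"field_1": int, "field_2": int})
def foo(a: int) -> Dict[str, int]:   # the foo node itself returns a dict
    return {"field_1": 1, "field_2": 2}

def _add_ten(x: int) -> int:         # a transform written for the extracted int fields
    return x + 10                    # cannot also be applied to the dict-returning foo node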

I'm not 100% following the point about erroring out with duplicates. Do you have an example of where that would happen?

This would not work wherever we place something_else. If we want to apply it to all fields 1, 2, 3, we need to put something_else into all three pipe_output decorators (each targeting its own field).

@pipe_output(
        step(_something_else),
)
@pipe_output(
        step(_post_step).named(name="a"),
        target="b_field_3"
)
@pipe_output(
        step(_pre_step).named(name="b"),
        step(_post_step).named(name="a"),
        target="b_field_1"
)
# @pipe_output(
#         step(_something_else),
# )
@extract_fields(
        {"b_field_1":int, "b_field_2":int, "b_field_3":int}
)
def bar(a:int)->Dict[str,int]:
    return {"b_field_1":1, "b_field_2":2, "b_field_3":3}

But here you get it applied to all fields without needing repetition, and then the other ones are just limited to the nodes they should target:

@pipe_output(
        step(_something_else),
        step(_pre_step).named(name="b").targeted("field_1"),
        step(_post_step).named(name="a").targeted(["field_1", "field_3"]),
)

Collaborator

@elijahbenizzy elijahbenizzy left a comment

Looking good -- I think it's worth thinking about how to subclass the Applicable stuff so we can share code -- a lot of commonalities between them.

examples/mutate/README (resolved)
hamilton/function_modifiers/macros.py (resolved)
"""

@classmethod
def _single_target_level(cls, target: base.TargetType, transforms: Tuple[Applicable]):
Collaborator

I'd call this validate_single_target_level

hamilton/function_modifiers/macros.py (resolved)
hamilton/function_modifiers/macros.py (resolved)
applicable = step(mutating_fn, **remote_applicable_builder.mutating_fn_kwargs)

if remote_applicable_builder.when_key_value_pairs is not None:
applicable = applicable.when(**remote_applicable_builder.when_key_value_pairs)
Collaborator

Yep, so I think if we allow Applicable to have a None function, it might just solve this? Otherwise it's the same, right?

hamilton/function_modifiers/macros.py (resolved)
@jernejfrank jernejfrank changed the title from "WIP - Mutate Decorator" to "Mutate Decorator" on Oct 3, 2024
Collaborator

@elijahbenizzy elijahbenizzy left a comment

I need to take another look at the implementation with fresh eyes, but it's looking good. My general take is:

  1. The example is a bit abstract -- it serves more as documentation. Not sure what time you have, but if you come up with a motivating example (E.G. feature engineering) it might ring true. It's a good test/dev env/smoke test, but I think you may want to come up with a narrative that helps justify it.
  2. Worth documenting assumptions -- the code is clean but there are some clever things (E.G. the Applicable stuff) that we want to make sure are clear as to why.
  3. Related -- I'd try to highlight the value in the comments/docstring/have them relate to the code examples (E.G. draw on your own experience). I think you can use the scikit-learn example you did for post-pipe.
  4. Also, focus on the fact that we can do cool interactive, ad-hoc stuff now!

hamilton/function_modifiers/macros.py (resolved)
hamilton/function_modifiers/macros.py (resolved)
hamilton/function_modifiers/macros.py (resolved)
tests/function_modifiers/test_macros.py (resolved)
examples/mutate/mutate_on_output.py (resolved)
examples/mutate/mutate_on_output.py (resolved)
examples/mutate/mutate_on_output.py (resolved)
hamilton/function_modifiers/macros.py (resolved)
@jernejfrank
Contributor Author

  2. Worth documenting assumptions -- the code is clean but there are some clever things (E.G. the Applicable stuff) that we want to make sure are clear as to why

Should this just be added into the code via comments/docstrings?

  3. Related -- I'd try to highlight the value in the comments/docstring/have them relate to the code examples (E.G. draw on your own experience). I think you can use the scikit-learn example you did for post-pipe

Here is a crazy idea: I feel mutate is much more about the dev stage than production (I would expect production code to have this as pipe_output or normal nodes). Would a notebook documenting how to run feature engineering experiments step by step be more appropriate for this functionality since it allows for a time evolution of the code?

  4. Also, focus on the fact that we can do cool interactive, ad-hoc stuff now!

I played around a bit with cross-module and I think it will take a bit longer to get there (some hiccups around the namespace/naming stage of nodes in pipe_output) and a bit more thought about how we want that API to be (like, would mutate from another module always be appended at the end, or can we push it somewhere between other existing pipe steps).

What are your thoughts on limiting this to the same module and having cross-module as a separate PR? (We already tinkered with on_output in pipe_output, so it feels like the scope keeps growing.)

@skrawcz
Collaborator

skrawcz commented Oct 4, 2024

@elijahbenizzy @jernejfrank my thoughts:

Should this just be added into the code via comments/docstrings?

So commit messages & docstrings -- where appropriate. Commits are nice because they can cover multiple places in code, whereas docstrings cover just that point.

Here is a crazy idea: I feel mutate is much more about the dev stage than production (I would expect production code to have this as pipe_output or normal nodes). Would a notebook documenting how to run feature engineering experiments step by step be more appropriate for this functionality since it allows for a time evolution of the code?

Yes more dev --> but the idea is that we can provide prod introspection, so ideally you wouldn't need to change things. So I wouldn't expect people to translate it to pipe_output. So any example I think would just focus on how you might use this functionality to get to a place you're happy with quickly and easily.

Side note: we need to think through how this would impact being used with caching -- I don't think it should be just something to test.

I played around a bit with cross-module and I think it will take a bit longer to get there (some hickups around namespace/naming stage of nodes in pipe_output) and a bit more thought how we want that API to be (like would mutate from another module always be appended at the end or can we push it somewhere between other existing pipe steps).

What are your thought on limiting this to same module and having the cross-module a seperate PR? (We alrady tinkered with on_output in pipe_output so feels like the scope keeps growing)

Agreed. Let's limit scope to just within a module. We can always expand it later; easier to add things than to remove them.

@jernejfrank
Contributor Author

Yes more dev --> but the idea is that we can provide prod introspection, so ideally you wouldn't need to change things. So I wouldn't expect people to translate it to pipe_output. So any example I think would just focus on how you might use this functionality to get to a place you're happy with quickly and easily.

@skrawcz I mostly agree with it, but you have some limits with mutate that made me think it might not be a bad idea to suggest that once mutate achieves a satisfactory result, it is rational to refactor it into pipe_output to remain flexible; two cases come to mind:

  • you cannot re-apply the same function with mutate: normalise-->do something --> normalise again
  • if you have foo and bar and both are forced to have the mutate pipelines ordered the same way
@pipe_output(
        normalise,
        something_funky,
        normalise
)
def foo():
    ...

@pipe_output(
        something_funky,
        normalise
)
def bar():
    ...

@skrawcz
Collaborator

skrawcz commented Oct 5, 2024


Right -- but the following should work though right?

@mutate(foo)
def _normalize(df: pd.DataFrame) -> pd.DataFrame:
   ...

@mutate(foo)
def _something_funky(df: pd.DataFrame) -> pd.DataFrame:
   ...

@mutate(foo)
def _normalize(df: pd.DataFrame) -> pd.DataFrame:
   ...

If you haven't been keeping track already (haven't reviewed the code sorry), there should be a set of tests/examples showing edge cases and how we want to handle them...

@elijahbenizzy
Collaborator


So I think the edge case of re-applying the same function can be supported -- happy to help you figure this out. But I would argue that this is generally a good dev feature (for the use-case of mutating inside a notebook), though it should also work for prod as well.

That said, for anything that we don't expect to change, it's good to push people towards @post_pipe, purely for readability.

@jernejfrank
Contributor Author


Oh wow @skrawcz, this actually works, nice!

If you haven't been keeping track already (haven't reviewed the code sorry), there should be a set of tests/examples showing edge cases and how we want to handle them...

There's an abstract example under the new mutate folder at the moment, where all these edge cases are in separate scripts / a notebook, and I add to it as we go along.

There's a more realistic example where I converted the pipe_output usage to mutate in the scikit-learn script I used last time. Also planning to play around more with this on some ML things; I'll open another PR to get it added once finished/polished.

When multiple end nodes are available we can choose which ones
pipe_output gets applied to. Can be multiple.

Changes:
- Applicable gets the method on_output that allows specifying target nodes
- pipe_output inherits from NodeTransformer
- pipe_output: the namespace is resolved into the node-level name to avoid
  collisions
- if on_output is global, it limits all the steps to the same subset; in
  that case we additionally prohibit individual steps from specifying end
  nodes, to keep the specification of the end nodes clear

Tested and documented.
We can apply functions to nodes without touching them.

Mutate receives functions as arguments and builds a pipe_output for those
target functions. If the target function already has a pipe_output, it will
append the mutate function as the last step.
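
A hedged sketch of that behaviour (function and step names here are illustrative, not from the PR):

from hamilton.function_modifiers import mutate, pipe_output, step

def _scale(x: int) -> int:
    return x * 2

@pipe_output(step(_scale))   # the target function already has a pipe_output
def foo(a: int) -> int:
    return a

@mutate(foo)                 # appended as the last step of foo's existing pipe_output
def _shift(x: int) -> int:
    return x + 1
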
examples/mutate/abstract functionality blueprint/README (resolved)
Collaborator

Worth adding a quick header to this notebook to demo what you'll learn (also saying it's a bit more abstract). That's repeated info from before, but definitely valuable to have.

hamilton/function_modifiers/macros.py (resolved)
@@ -883,7 +922,11 @@ def __init__(
# super(flow, self).__init__(*transforms, collapse=collapse, _chain=False)


class pipe_output(base.SingleNodeNodeTransformer):
class SingleTargetError(Exception):
Collaborator

Add a comment on this to note when it should be raised.

hamilton/function_modifiers/macros.py (resolved)
return Applicable(fn=None, args=(), kwargs=mutating_fn_kwargs, target_fn=fn_, _resolvers=[])


class NotSameModuleError(Exception):
Collaborator

ditto -- docstrings to show where it's being used

.. code-block:: python
:name: Simple @mutate example with multiple arguments

@mutate(A,B,arg2=step('upstream_node'),arg3=value(some_literal),...)
Collaborator

nit -- spaces in parameters for formatting for @mutate call

_chain: bool = False,
**mutating_function_kwargs: Union[SingleDependency, Any],
):
"""Instantiates a `\@mutate` decorator.
Collaborator

@elijahbenizzy elijahbenizzy Oct 4, 2024

I think that RST wants double-backtick

hamilton/function_modifiers/macros.py (resolved)
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Looking good! A few minor points but let's merge tomorrow + ship in the next release.

Something to think about/a caveat:

  1. If @mutate runs once it's good
  2. If you run that again, it might double-up

So in the Jupyter notebook/cross-module case (which, admittedly, is not supported now), we'll want to guard against it.

The problem is how we remove the old one once we change it... Thoughts? Not critical (although document now), but will be for the cool ad-hoc stuff we want to do.

hamilton/function_modifiers/macros.py (resolved)
hamilton/function_modifiers/macros.py (resolved)
hamilton/function_modifiers/macros.py (resolved)
hamilton/function_modifiers/macros.py (resolved)
tests/function_modifiers/test_macros.py (resolved)
@skrawcz
Collaborator

skrawcz commented Oct 7, 2024


So just to clarify, will this impact the notebook dev experience? E.g. I use the same cell with multiple mutates and continually add to it -- would this impact things? If so I think that should be addressed before we release, since that's a core way people use Hamilton...

@jernejfrank
Contributor Author


I tried this and didn't run into any duplication issues, but I don't dev in Jupyter so not sure I replicated a typical workflow. I tried the following:

%%cell_to_module -m mutate_on_output_2
from typing import Any,Dict, List
import pandas as pd
from hamilton.function_modifiers import extract_fields, extract_columns, apply_to, mutate, value, source

def data_1() -> pd.DataFrame:
    df = pd.DataFrame.from_dict({"col_1": [3, 2, pd.NA, 0], "col_2": ["a", "b", pd.NA, "d"]})
    return df


# # data1 and data2
# @mutate(data_1)
# def _filter(some_data:pd.DataFrame)->pd.DataFrame:
#     return some_data.dropna()


# @mutate(data_1, missing_row=value(['c', 145]))
# def add_missing_value(some_data:pd.DataFrame, missing_row:List[Any])->pd.DataFrame:
#     some_data.loc[-1] = missing_row
#     return some_data


# @mutate(data_1,)
# def sort(some_data:pd.DataFrame)->pd.DataFrame:
#     columns = some_data.columns
#     return some_data.sort_values(by=columns[0])

I uncommented one at a time, re-ran the cell (didn't restart the notebook or re-import the module), and re-built the driver in the next cell.

This makes me think it is contained to the cross-module case.

If I try to have mutate in a different cell (like it is a different module) I get blocked by the error that it has to be in the same module.

@elijahbenizzy
Collaborator


Yep, agreed, definitely contained to cross-module. The reason that works is because data_1 gets redefined, so the state is lost. If it's in a different cell/module, it'll maintain state when you redefine the functions. So yeah, something to think about for the next iteration (which I think is high-value enough to be worth it).

Collaborator

@elijahbenizzy elijahbenizzy left a comment

Looks good! Only nit is that the docs formatting is a bit off, DMed @jernejfrank

Testing: unit tests for the individual pieces, and the examples have a new smoke
test / abstract example of how to use mutate. That example also covers edge
cases and how they are handled or whether they are supported.

Added a more user-friendly example by converting the existing
pipe_output example and replacing the pipe_output implementation with
mutate.

Light refactoring / improved naming and added TODOs.
@elijahbenizzy elijahbenizzy merged commit 69c94cc into DAGWorks-Inc:main Oct 7, 2024
24 checks passed
@jernejfrank jernejfrank deleted the feat/mutate branch October 8, 2024 12:08