Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

python: capture flow through comprehensions #17577

Merged
merged 21 commits into from
Oct 4, 2024

Conversation

yoff
Copy link
Contributor

@yoff yoff commented Sep 25, 2024

  • add comprehension functions as DataFlowCallables
  • add comprehension call as DataFlowCall
  • create capture argument node for comprehension calls

Pull Request checklist

All query authors

Internal query authors only

  • Autofixes generated based on these changes are valid, only needed if this PR makes significant changes to .ql, .qll, or .qhelp files. See the documentation (internal access required).
  • Changes are validated at scale (internal access required).
  • Adding a new query? Consider also adding the query to autofix.

- add comprehension functions as `DataFlowCallable`s
- add comprehension call as `DataFlowCall`
- create capture argument node for comprehension calls
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is still in draft, but since I looked it over now anyway: For the final review, let's get in some more docs 😊

yoff added 12 commits September 27, 2024 09:44
For a comprehension `[x for x in l]
- `l` is now a legal argument (in DataFlowPublic)
- `l` is the argument of the comprehension function (in DataFlowDispatch)
- the parameter of the comprehension function is being read rather than `l` (in IterableUnpacking)
Thus the read that used to cross callable boundaries is now split into a arg-param edge and a read from that param.
We used to use the CfgNode for the comprehension itself.
In cases where that is also an argument, say
```python
",".join([x for x in l])
```
that would be an argument to two different calls causing a dataflow consistency violation.
- add yield as a dataflow return
- replace comprehension store step
   with a store step to the yield
- adjust scope of argument, the argument is outside the called function
- add missing post-update nodes for the new arguments
We now have a new callable, yielding new enclosing callables
Using the comprehension store step meant that all comprehensions would receive taint.
This because comprehension flow now goes via a callable, meaning they share the return node.
- also adjust test expectations in experimental
More doc is needed, but this should turn the tests green
@yoff yoff added the Awaiting evaluation Do not merge yet, this PR is waiting for an evaluation to finish label Sep 30, 2024
@yoff yoff marked this pull request as ready for review September 30, 2024 14:25
@yoff yoff requested a review from a team as a code owner September 30, 2024 14:25
tausbn
tausbn previously approved these changes Oct 1, 2024
Copy link
Contributor

@tausbn tausbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor typo fix, but otherwise this looks sensible to me. Solid stuff! 👍

@yoff yoff requested a review from tausbn October 1, 2024 10:57
@yoff
Copy link
Contributor Author

yoff commented Oct 1, 2024

There is a small perfomance penalty, but we also find new flow, including new alerts such as this one:

def generate(hello):
                yield hello
                yield flask.request.args["name"]
                yield "!"

return flask.Response(generate("Hello "))

@yoff yoff removed the Awaiting evaluation Do not merge yet, this PR is waiting for an evaluation to finish label Oct 1, 2024
We had accidentally used precise content leadingto blowup
tausbn
tausbn previously approved these changes Oct 1, 2024
Copy link
Contributor

@tausbn tausbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved pending the new performance check. 🙂

@yoff yoff added the Awaiting evaluation Do not merge yet, this PR is waiting for an evaluation to finish label Oct 1, 2024
@yoff
Copy link
Contributor Author

yoff commented Oct 1, 2024

Thanks for the pairing @tausbn, I agree that it was useful to investigate. For the public record, we did a performance investigation pairing session and found a problematic CP which would blow up whenever there are many different dictionary keys. In this case the DB had 17k.

Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice to see the improvements in our tests 👍 I have a few minor questions around the code though 🤔

result.(Flow::ExprNode).getExpr().getNode() = comp
)
or
// TODO: Should the `Comp`s above be excluded here?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you resolve this TODO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not. My intention was to leave it for later if we did not observe weird flow in the tests or bad performance.

ExtractedReturnNode() { node = any(Return ret).getValue().getAFlowNode() }
ExtractedReturnNode() {
node = any(Return ret).getValue().getAFlowNode() or
node = any(Yield yield).getAFlowNode()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised that we could just add yields as a ExtractedReturnNode... Did you consider using different ReturnKind to model return/yield statements? (my impression was we would need to do that)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new yield store step, where the yielded element is stored to the yield expression, it seems fine to simply use the yield as a normal return. We are returning a new thing and we can control its content. I have seen C# use a different return kind for variables that already exist but are written to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood that as: since I got yield working without messing with the ReturnKinds, I didn't really investigate that path much.

I guess that's fine 👍 I'm still a little curious how the approach would have worked out 🤔 (but it's not super important so 🤷)

Comment on lines 201 to 208
not exists(Comp comp | func = comp.getFunction()) and
(
c instanceof ListElementContent
or
c instanceof SetElementContent
or
c instanceof DictionaryElementAnyContent
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we really need SetElementContent or DictionaryElementAnyContent here?

My mental model is that when you call a generator function you get an iterable back, so from my understanding you would always need to turn those elements into a set/dict before use anyway.

That is, you could do set(my_generator_func()) but doing my_set | my_generator_func() results in an error. (and likewise for dictionaries).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our model of the dict constructor does not convert content, so that would have to change in order to model something like

dict([k, v for v in l])

I did check if we were always just adding list content and the constructor pulled it out and I would also be open to that regime.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that out, it seems a lot nicer and also improved our coverage slightly..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised that 38b1eb7 is enough. Do we already have the logic in place for sets?

I'm also surprised that dict-comprehensions and set-comprehensions just works 🤔 -- is that because a they are transformed into dict(_comp_function()) and set(_comp_function())?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add a comment saying that this might need to be revised when adding support for dict-comprehensions then? 🙏

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we discussed together that it might make sense to do some followup work on how dict-comprehensions are handled by the extractor, and what it would take to support them better in our analysis.

@yoff
Copy link
Contributor Author

yoff commented Oct 2, 2024

Evaluation looks much more comfortable this time :-)

Median (excl. partials)         0 0
Overall (excl. partials) 3972   3959   -13 -0.00327

@yoff yoff removed the Awaiting evaluation Do not merge yet, this PR is waiting for an evaluation to finish label Oct 2, 2024
@RasmusWL
Copy link
Member

RasmusWL commented Oct 3, 2024

Suggestion for followup work: Handle yield from ... as well 👍

@yoff
Copy link
Contributor Author

yoff commented Oct 3, 2024

Suggestion for followup work: Handle yield from ... as well 👍

Agreed 👍

@yoff yoff requested a review from RasmusWL October 3, 2024 12:20
RasmusWL
RasmusWL previously approved these changes Oct 4, 2024
note that we do retain precision in
`test_dict_from_keyword()`
@yoff yoff merged commit 6bb98b0 into github:main Oct 4, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants