python: capture flow through comprehensions #17577

yoff · 2024-09-25T08:03:06Z

add comprehension functions as DataFlowCallables
add comprehension call as DataFlowCall
create capture argument node for comprehension calls

Pull Request checklist

All query authors

A change note is added if necessary. See the documentation in this repository.
All new queries have appropriate .qhelp. See the documentation in this repository.
QL tests are added if necessary. See Testing custom queries in the GitHub documentation.
New and changed queries have correct query metadata. See the documentation in this repository.

Internal query authors only

Autofixes generated based on these changes are valid, only needed if this PR makes significant changes to .ql, .qll, or .qhelp files. See the documentation (internal access required).
Changes are validated at scale (internal access required).
Adding a new query? Consider also adding the query to autofix.

RasmusWL

I know this is still in draft, but since I looked it over now anyway: For the final review, let's get in some more docs 😊

For a comprehension `[x for x in l]`:

- `l` is now a legal argument (in DataFlowPublic)
- `l` is the argument of the comprehension function (in DataFlowDispatch)
- the parameter of the comprehension function is being read rather than `l` (in IterableUnpacking)

Thus the read that used to cross callable boundaries is now split into an arg-param edge and a read from that param.

We used to use the CfgNode for the comprehension itself. In cases where that is also an argument, say

```python
",".join([x for x in l])
```

that would be an argument to two different calls, causing a dataflow consistency violation.
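To picture the arg-param split, here is a rough plain-Python sketch of a comprehension rewritten as a call to a helper function. The name `comprehension_function` is made up for illustration and is not what the library calls it; the sketch only mirrors the idea that `l` is passed as an argument and the read happens on the parameter inside the callee.

```python
def comprehension_function(iterable):
    # Hypothetical stand-in for the synthetic comprehension function.
    # `iterable` is the parameter; elements are read from it, not from `l`.
    result = []
    for x in iterable:
        result.append(x)
    return result


l = ["a", "b", "c"]

# The comprehension and the explicit call produce the same list.
direct = [x for x in l]
via_call = comprehension_function(l)

# The result of the comprehension call is itself an argument to `join`,
# so `l` (argument of the comprehension call) and the comprehension
# (argument of `join`) are two distinct arguments.
joined = ",".join(comprehension_function(l))
```

In this picture `l` is the argument to the inner call and the comprehension result is the argument to `join`, so no single node has to play both roles.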

tausbn

One minor typo fix, but otherwise this looks sensible to me. Solid stuff! 👍

python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatch.qll

yoff · 2024-10-01T11:04:51Z

There is a small performance penalty, but we also find new flow, including new alerts such as this one:

def generate(hello):
    yield hello
    yield flask.request.args["name"]
    yield "!"

return flask.Response(generate("Hello "))
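For readers without a Flask setup at hand, the same shape can be reproduced with plain strings. This is only a sketch: `untrusted` stands in for `flask.request.args["name"]`, and `render` is a hypothetical consumer standing in for `flask.Response`.

```python
def generate(hello, name):
    yield hello
    yield name  # stands in for flask.request.args["name"]
    yield "!"


def render(chunks):
    # Hypothetical consumer: concatenates whatever the generator yields.
    return "".join(chunks)


untrusted = "<script>alert(1)</script>"
body = render(generate("Hello ", untrusted))
```

The untrusted value only reaches `body` by being yielded, which is the path that flow through `yield` makes visible.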

tausbn

Approved pending the new performance check. 🙂

yoff · 2024-10-01T12:10:44Z

Thanks for the pairing @tausbn, I agree that it was useful to investigate. For the public record, we did a performance investigation pairing session and found a problematic CP which would blow up whenever there are many different dictionary keys. In this case the DB had 17k.
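As a back-of-the-envelope illustration of why many distinct keys hurt, assuming "CP" here means a Cartesian product between some set of nodes and the dictionary keys being tracked: the node names and counts below are made up, and the sketch only compares the number of facts with one content per key against a single "any key" content.

```python
from itertools import product


def precise_pairs(nodes, keys):
    """One fact per (node, dictionary key) combination."""
    return set(product(nodes, keys))


def imprecise_pairs(nodes):
    """One fact per node, using a single 'any key' placeholder."""
    return {(node, "<any key>") for node in nodes}


nodes = [f"node_{i}" for i in range(4)]    # made-up node names
keys = [f"key_{i}" for i in range(1000)]   # stand-in for many distinct keys

precise_count = len(precise_pairs(nodes, keys))
imprecise_count = len(imprecise_pairs(nodes))
```

With per-key content the count scales with the number of keys, so a database with 17k keys multiplies every affected node by 17k; with the imprecise content it stays at one fact per node.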

RasmusWL

Really nice to see the improvements in our tests 👍 I have a few minor questions around the code though 🤔

RasmusWL · 2024-10-01T11:51:40Z

python/ql/lib/semmle/python/dataflow/new/internal/VariableCapture.qll

+    result.(Flow::ExprNode).getExpr().getNode() = comp
+  )
+  or
+  // TODO: Should the `Comp`s above be excluded here?


did you resolve this TODO?

I did not. My intention was to leave it for later if we did not observe weird flow in the tests or bad performance.

RasmusWL · 2024-10-01T11:55:22Z

python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatch.qll

-  ExtractedReturnNode() { node = any(Return ret).getValue().getAFlowNode() }
+  ExtractedReturnNode() {
+    node = any(Return ret).getValue().getAFlowNode() or
+    node = any(Yield yield).getAFlowNode()


I'm surprised that we could just add yields as an ExtractedReturnNode... Did you consider using a different ReturnKind to model return/yield statements? (my impression was we would need to do that)

With the new yield store step, where the yielded element is stored to the yield expression, it seems fine to simply use the yield as a normal return. We are returning a new thing and we can control its content. I have seen C# use a different return kind for variables that already exist but are written to.

I understood that as: since I got yield working without messing with the ReturnKinds, I didn't really investigate that path much.

I guess that's fine 👍 I'm still a little curious how the approach would have worked out 🤔 (but it's not super important so 🤷)
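One way to read the "yield as a normal return" approach is the following plain-Python analogy. It is a sketch only: `numbers_modeled` is a hypothetical illustration in which each yielded value is stored as an element of the returned value, and it says nothing about how the QL library represents this internally.

```python
def numbers():
    yield 1
    yield 2


def numbers_modeled():
    # Analogy: every `yield` stores its operand as an element of the
    # value that the call returns.
    returned = []
    returned.append(1)
    returned.append(2)
    return returned


from_generator = list(numbers())
from_model = numbers_modeled()
```

Under that reading the caller gets a fresh value whose element content is fully controlled by the callee, which is why no separate return kind was needed to get the tests working.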

python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatch.qll

RasmusWL · 2024-10-01T12:12:23Z

python/ql/lib/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

+    not exists(Comp comp | func = comp.getFunction()) and
+    (
+      c instanceof ListElementContent
+      or
+      c instanceof SetElementContent
+      or
+      c instanceof DictionaryElementAnyContent
+    )


I wonder if we really need SetElementContent or DictionaryElementAnyContent here?

My mental model is that when you call a generator function you get an iterable back, so from my understanding you would always need to turn those elements into a set/dict before use anyway.

That is, you could do set(my_generator_func()) but doing my_set | my_generator_func() results in an error. (and likewise for dictionaries).
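A quick check of that claim in plain Python (the generator body is made up for the example):

```python
def my_generator_func():
    yield 1
    yield 2


my_set = {0}

# Converting the generator first works.
converted = my_set | set(my_generator_func())

# Using the generator directly as an operand does not.
try:
    my_set | my_generator_func()
    union_error = None
except TypeError as exc:
    union_error = exc
```

So the elements always pass through a constructor such as `set(...)` before they behave like set elements.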

Our model of the dict constructor does not convert content, so that would have to change in order to model something like

dict([(k, v) for v in l])

I did check if we were always just adding list content and the constructor pulled it out and I would also be open to that regime.
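For context, this is the runtime behaviour such a model would have to capture: the `dict` constructor takes list elements that are pairs and turns them into keys and values. The data below is made up for illustration.

```python
l = ["a", "b"]

# A list comprehension producing (key, value) pairs ...
pairs = [(k, v) for k, v in enumerate(l)]

# ... which the dict constructor converts into dictionary entries.
from_constructor = dict(pairs)

# The equivalent dict comprehension, for comparison.
from_comprehension = {k: v for k, v in enumerate(l)}
```

What starts as list element content (the tuples) ends up as dictionary content, which is the conversion step the current constructor model does not perform.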

I tried that out, it seems a lot nicer and also improved our coverage slightly.

I'm surprised that 38b1eb7 is enough. Do we already have the logic in place for sets?

I'm also surprised that dict-comprehensions and set-comprehensions just work 🤔 -- is that because they are transformed into dict(_comp_function()) and set(_comp_function())?

Yes the set constructor already handles list elements: https://github.com/github/codeql/blob/main/python/ql/lib/semmle/python/frameworks/Stdlib.qll#L4315-L4317

I think that raw dict comprehensions do not actually work: https://github.com/github/codeql/blob/main/python/ql/test/library-tests/dataflow/coverage/test.py#L182-L184

could we add a comment saying that this might need to be revised when adding support for dict-comprehensions then? 🙏

we discussed together that it might make sense to do some followup work on how dict-comprehensions are handled by the extractor, and what it would take to support them better in our analysis.

yoff · 2024-10-02T12:06:45Z

Evaluation looks much more comfortable this time :-)

Median (excl. partials)					0	0
Overall (excl. partials)	3972		3959		-13	-0.00327

RasmusWL · 2024-10-03T08:31:32Z

Suggestion for followup work: Handle yield from ... as well 👍

yoff · 2024-10-03T08:35:49Z

> Suggestion for followup work: Handle yield from ... as well 👍

Agreed 👍
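For reference, a small plain-Python sketch of what `yield from` does in the simple iteration case, which is the behaviour any followup would need to cover. The function names are made up.

```python
def inner():
    yield "a"
    yield "b"


def delegating():
    yield "start"
    yield from inner()  # re-yields every element of inner()


def explicit_loop():
    # Equivalent for plain iteration (ignoring send/throw semantics).
    yield "start"
    for item in inner():
        yield item


delegated = list(delegating())
looped = list(explicit_loop())
```

Elements yielded by `inner()` surface in the consumer of `delegating()`, so flow would have to pass from the inner generator's elements to the outer one.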

python: capture flow through comprehensions

fc2dc28

- add comprehension functions as `DataFlowCallable`s
- add comprehension call as `DataFlowCall`
- create capture argument node for comprehension calls

github-actions bot added the Python label Sep 25, 2024

RasmusWL requested changes Sep 26, 2024

yoff added 12 commits September 27, 2024 09:44

Python: flow through yield

d4ea62e

- add yield as a dataflow return
- replace comprehension store step with a store step to the yield

Python: fix dataflow inconsistencies

310819d

- adjust scope of argument, the argument is outside the called function
- add missing post-update nodes for the new arguments

Python: add location to node

3ef05a6

Python: update test expectations

f9f46f0

We now have a new callable, yielding new enclosing callables

Python: allow comp arg as argumentnode

ded3974

Python: adjust test expectations

fb07a56

Python: use yield step also for taint

7392d18

Using the comprehension store step meant that all comprehensions would receive taint. This is because comprehension flow now goes via a callable, meaning they share the return node.

Python: use known sanitiser

a22ea6c

- also adjust test expectations in experimental

Python: add missing qldoc

438e664

More doc is needed, but this should turn the tests green

Python: docs and a simplification

dacc0ab

yoff added the "Awaiting evaluation" label (Do not merge yet, this PR is waiting for an evaluation to finish) Sep 30, 2024

yoff marked this pull request as ready for review September 30, 2024 14:25

yoff requested a review from a team as a code owner September 30, 2024 14:25

sidshank assigned RasmusWL and tausbn Oct 1, 2024

Python: add change note

e0a3c8a

github-actions bot added the documentation label Oct 1, 2024

tausbn previously approved these changes Oct 1, 2024

python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatch.qll (outdated)

Update python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDisp…

2b6aab1

…atch.qll Co-authored-by: Taus <[email protected]>

yoff dismissed tausbn’s stale review via 2b6aab1 October 1, 2024 10:36

Python: valid change note

64890a1

yoff requested a review from tausbn October 1, 2024 10:57

yoff removed the "Awaiting evaluation" label (Do not merge yet, this PR is waiting for an evaluation to finish) Oct 1, 2024

Python: use imprecise content in cp

f39dc41

We had accidentally used precise content, leading to a blowup

tausbn previously approved these changes Oct 1, 2024

yoff added the "Awaiting evaluation" label (Do not merge yet, this PR is waiting for an evaluation to finish) Oct 1, 2024

RasmusWL reviewed Oct 1, 2024

Python: just use ListElementContent for iterables

38b1eb7

yoff dismissed tausbn’s stale review via 38b1eb7 October 1, 2024 14:24

yoff removed the "Awaiting evaluation" label (Do not merge yet, this PR is waiting for an evaluation to finish) Oct 2, 2024

yoff requested a review from RasmusWL October 3, 2024 12:20

Python: comment around dictionary comprehensions

977767b

RasmusWL previously approved these changes Oct 4, 2024

Merge branch 'main' of https://github.com/github/codeql into python/a…

a4c1a62

…dd-comprehension-capture-flow

yoff dismissed RasmusWL’s stale review via a4c1a62 October 4, 2024 12:53

Python: adjust test expectations

6f5b949

note that we do retain precision in `test_dict_from_keyword()`

RasmusWL approved these changes Oct 4, 2024

yoff merged commit 6bb98b0 into github:main Oct 4, 2024
14 checks passed
