-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
python: capture flow through comprehensions #17577
Conversation
- add comprehension functions as `DataFlowCallable`s - add comprehension call as `DataFlowCall` - create capture argument node for comprehension calls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is still in draft, but since I looked it over now anyway: For the final review, let's get in some more docs 😊
For a comprehension `[x for x in l] - `l` is now a legal argument (in DataFlowPublic) - `l` is the argument of the comprehension function (in DataFlowDispatch) - the parameter of the comprehension function is being read rather than `l` (in IterableUnpacking) Thus the read that used to cross callable boundaries is now split into a arg-param edge and a read from that param.
We used to use the CfgNode for the comprehension itself. In cases where that is also an argument, say ```python ",".join([x for x in l]) ``` that would be an argument to two different calls causing a dataflow consistency violation.
- add yield as a dataflow return - replace comprehension store step with a store step to the yield
- adjust scope of argument, the argument is outside the called function - add missing post-update nodes for the new arguments
We now have a new callable, yielding new enclosing callables
Using the comprehension store step meant that all comprehensions would receive taint. This because comprehension flow now goes via a callable, meaning they share the return node.
- also adjust test expectations in experimental
More doc is needed, but this should turn the tests green
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor typo fix, but otherwise this looks sensible to me. Solid stuff! 👍
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatch.qll
Outdated
Show resolved
Hide resolved
…atch.qll Co-authored-by: Taus <[email protected]>
There is a small perfomance penalty, but we also find new flow, including new alerts such as this one: def generate(hello):
yield hello
yield flask.request.args["name"]
yield "!"
return flask.Response(generate("Hello ")) |
We had accidentally used precise content leadingto blowup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved pending the new performance check. 🙂
Thanks for the pairing @tausbn, I agree that it was useful to investigate. For the public record, we did a performance investigation pairing session and found a problematic CP which would blow up whenever there are many different dictionary keys. In this case the DB had 17k. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice to see the improvements in our tests 👍 I have a few minor questions around the code though 🤔
result.(Flow::ExprNode).getExpr().getNode() = comp | ||
) | ||
or | ||
// TODO: Should the `Comp`s above be excluded here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you resolve this TODO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not. My intention was to leave it for later if we did not observe weird flow in the tests or bad performance.
ExtractedReturnNode() { node = any(Return ret).getValue().getAFlowNode() } | ||
ExtractedReturnNode() { | ||
node = any(Return ret).getValue().getAFlowNode() or | ||
node = any(Yield yield).getAFlowNode() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised that we could just add yield
s as a ExtractedReturnNode
... Did you consider using different ReturnKind
to model return
/yield
statements? (my impression was we would need to do that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the new yield store step, where the yielded element is stored to the yield
expression, it seems fine to simply use the yield as a normal return. We are returning a new thing and we can control its content. I have seen C# use a different return kind for variables that already exist but are written to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understood that as: since I got yield
working without messing with the ReturnKinds, I didn't really investigate that path much.
I guess that's fine 👍 I'm still a little curious how the approach would have worked out 🤔 (but it's not super important so 🤷)
not exists(Comp comp | func = comp.getFunction()) and | ||
( | ||
c instanceof ListElementContent | ||
or | ||
c instanceof SetElementContent | ||
or | ||
c instanceof DictionaryElementAnyContent | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we really need SetElementContent
or DictionaryElementAnyContent
here?
My mental model is that when you call a generator function you get an iterable back, so from my understanding you would always need to turn those elements into a set/dict before use anyway.
That is, you could do set(my_generator_func())
but doing my_set | my_generator_func()
results in an error. (and likewise for dictionaries).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our model of the dict
constructor does not convert content, so that would have to change in order to model something like
dict([k, v for v in l])
I did check if we were always just adding list content and the constructor pulled it out and I would also be open to that regime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that out, it seems a lot nicer and also improved our coverage slightly..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised that 38b1eb7 is enough. Do we already have the logic in place for sets?
I'm also surprised that dict-comprehensions and set-comprehensions just works 🤔 -- is that because a they are transformed into dict(_comp_function())
and set(_comp_function())
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the set
constructor already handles list elements: https://github.com/github/codeql/blob/main/python/ql/lib/semmle/python/frameworks/Stdlib.qll#L4315-L4317
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that raw dict comprehensions do not actually work: https://github.com/github/codeql/blob/main/python/ql/test/library-tests/dataflow/coverage/test.py#L182-L184
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we add a comment saying that this might need to be revised when adding support for dict-comprehensions then? 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we discussed together that it might make sense to do some followup work on how dict-comprehensions are handled by the extractor, and what it would take to support them better in our analysis.
Evaluation looks much more comfortable this time :-)
|
Suggestion for followup work: Handle |
Agreed 👍 |
…dd-comprehension-capture-flow
note that we do retain precision in `test_dict_from_keyword()`
DataFlowCallable
sDataFlowCall
Pull Request checklist
All query authors
.qhelp
. See the documentation in this repository.Internal query authors only
.ql
,.qll
, or.qhelp
files. See the documentation (internal access required).