-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
job-list: limit constraint comparisons to avoid DoS #5681
job-list: limit constraint comparisons to avoid DoS #5681
Conversation
b711a4b
to
ee198ac
Compare
This seems reasonable as a first proposal, but it feels like it may be tricky to thwart all attacks using these kinds of limits. This problem feels similar to sandboxing in embedded languages (e.g. Lua), and one approach often used there is to set an "iteration limit" to avoid infinite (or very long) loops. I wonder if that would be possible here. That is, instead of putting limits on individual components of the constraint object, you carry a counter (or even timer) along with the match operation, and abort it if it hits some limit. E.g. if using a counter, each "op" would increase the count by the number of "values" (where in a hostlist each host is considered a value), or you just increment and check the counter for each match. To allow long matches, you could make the limit configurable and/or disable it for instance owner. This might be slightly more work, and would allow large matches to proceed instead of catching them before they are even attempted, but would be more reliable in that it would place a hard limit on the amount of time a match is allowed to take. Just an idea, and perhaps it could even be used in concert with the constraint limits already proposed in this PR. |
Hmmmm I like this idea. Lemme give it a whirl see how it turns out. Gotta carry around some counting stuffs while refactoring things. |
ee198ac
to
d3870ee
Compare
re-pushed a WIP of limiting the number of "comparisons" approach. I choose 100K as the random "max" comparisons for the time being and adding a with the rest mostly refactoring or config stuff. I haven't tested yet or added tests, but wanted to post to see what people thought of the general approach. Edit: oh yeah, built on top of #5712 ... just b/c of the config file reading infrastructure added. |
b16d7de
to
a589a5c
Compare
re-pushed, add some tests and removing WIP. I didn't add back the max elements and max recursion depth of the constraint, as I felt this handled the overriding case. |
3ce51d8
to
62e4b58
Compare
2290879
to
f55f2a4
Compare
f55f2a4
to
c24ac88
Compare
c24ac88
to
511c9c1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In testing, I noticed garbage returned in error messages where job_match()
fails, e.g. when the max comparisons limit is exceeded.
Issue noted inline.
@@ -26,11 +26,13 @@ | |||
|
|||
typedef int (*match_f) (struct list_constraint *, | |||
const struct job *, | |||
unsigned int *, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit message typo: comma instead of period in the second to last sentence.
src/modules/job-list/list.c
Outdated
if ((ret = job_match (job, c, &error)) < 0) | ||
return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function uses a local flux_error_t
to handle the error message from job_match()
, but doesn't copy it to errp->text
on error. Should errp
be used here and the local flux_error_t error
be dropped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this also indicates that there's no testing of a proper error message in the response when job_match()
fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, I have tests to ensure the error occurs but didn't check the message
0c88077
to
ea300b7
Compare
re-pushed, fixing per comments above. The main change was having |
src/modules/job-list/match.c
Outdated
@@ -54,6 +56,20 @@ struct timestamp_value { | |||
match_comparison_t t_comp; | |||
}; | |||
|
|||
#define CONSTRAINT_COMPARISON_MAX 100000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This limit feels a bit low, for example:
--------------------------------------
> 0.00000
> job-list.list
> flags=topic,payload,route userid=unknown rolemask=none nodeid=any matchtag=1
> {"max_entries": 1000, "attrs": ["dependencies", "ntasks", "t_run", "state", "result", "t_cleanup", "duration", "name", "nodelist", "nnodes", "queue", "annotations", "userid"], "since": 0.0, "constraint": {"and": [{"hostlist": ["corona[1-1000]"]}, {"userid": [6885]}]}}
--------------------------------------
< 0.04141
< job-list.list
< flags=topic,payload,route userid=6885 rolemask=owner,local errnum=22 matchtag=1
< Excessive comparisons made, limit search via states or since
The limit was hit in 40ms on corona. If we say on this system the limit should prevent job-list getting stuck doing matches for more than say 200ms, then we can probably raise the limit by 4-5x.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahhh i set this limit before the hostlist support (which is in the follow up PR).
Maybe just upping it to like 1-2 million would even be fine. I think the main goal is to just not DoS job-list an obscene amount.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me!
8e3137e
to
f9dcf2b
Compare
re-pushed, upping default limit to 1 million. |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #5681 +/- ##
==========================================
+ Coverage 83.28% 83.32% +0.03%
==========================================
Files 514 514
Lines 82802 82868 +66
==========================================
+ Hits 68964 69048 +84
+ Misses 13838 13820 -18
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! LGTM.
Problem: The job_match() function currently returns a bool, indicating if a job matches a constraint or does not. In the future, we will need to be able to return an error from job_match(), but that is not currently possible. Refactor job_match() to return an int instead of a bool. It now returns 1 for a match, 0 for not, and -1 on error. In addition, have job_match() take a flux_error_t pointer as an argument, in preparation for future error message returns.
Problem: Job-list constraints are not limited in their size. A nefarious user could cause a DoS of the job-list service by sending an extremely large constraint request. Solution: To limit constraints, introduce the concept of constraint comparisons. All comparison "checks", such as seeing if a job was submitted by a specific user, are counted. When the maximum number of comparisons is reached, an error is returned to the user. The user can try again by limiting the number of jobs they are searching through, such as only searching active vs inactive jobs or if jobs were submitted since a certain time. For the time being, hard code the maximum number of comparisons to 100 thousand. Fixes flux-framework#5669
Problem: In the near future, some initialization will have to be done for constraint matching. However, no information can currently be passed to the list_constraint_create() function. Add infrastructure to allow a match context to be initialized on job-list load and can be passed to list_constraint_create(). While not presently used, it will allow easier adjustments to constraint matching in the future.
Problem: Currently the max number of comparisons that can be done in a job-list constraint match is hard coded to 100 thousand. There may be circumstances where we'd like that number to be higher or lower. Support configuration of this maximum via a new job-list.max_comparisons configuration.
Problem: The match_context struct is not initialized with the new max_comparisons field for unit tests. In addition, now that the values in the ctx will actually be used, it should not sit on the stack, otherwise we might read corrupted data. Create a global match ctx for all list constraint creations. Set the max_comparisons to 0 for unlimited in the unit tests.
Problem: The job-list module will return descriptive text about errors it has hit. The "flux job list" command does not output these descriptive errors, it just outputs the errno string. Call future_strerror() when the job-list module returns an error, to get the more descriptive error. Update several tests in t2260-job-list.t that have error messages changed as a result.
Problem: A few tests were mixed together within another test's section. Move "stress tests" out of the "stats & corner case" tests section. Split the "stats & corner case" tests into two obviously different chunks.
Problem: The new job-list max_comparisons config is not covered. Add tests in t2260-job-list.t.
f9dcf2b
to
af105ea
Compare
great, thanks! setting MWP |
Problem: Job-list constraints are not limited in their size. A nefarious user could cause a DoS of the job-list service by sending an extremely large constraint request.
Solution: Set a reasonable max on constraint sizes. The maximum length a constraint array can be is 256 elements and the maximum recursive depth (via and/or/not operators) is set to 16.
Add unit tests.
Fixes #5669
This was something I came up with ... good idea? bad idea? the limit sizes ok?