Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-list: limit constraint comparisons to avoid DoS #5681

Merged

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Jan 13, 2024

Problem: Job-list constraints are not limited in their size. A nefarious user could cause a DoS of the job-list service by sending an extremely large constraint request.

Solution: Set a reasonable max on constraint sizes. The maximum length a constraint array can be is 256 elements and the maximum recursive depth (via and/or/not operators) is set to 16.

Add unit tests.

Fixes #5669


This was something I came up with ... good idea? bad idea? the limit sizes ok?

@chu11 chu11 force-pushed the issue5669_job_list_constraint_limit branch from b711a4b to ee198ac Compare January 13, 2024 23:07
@grondo
Copy link
Contributor

grondo commented Jan 16, 2024

This seems reasonable as a first proposal, but it feels like it may be tricky to thwart all attacks using these kinds of limits.
For example, the 256 value array length would let me create an or of 256 values, each of which could be a non-matching hostlist of 1024 elements, and I could or that together 16 times for a total of 4M non-matching hosts. Depending on the size of the job lists, this could still easily cause a DoS (though some limit is better than nothing)

This problem feels similar to sandboxing in embedded languages (e.g. Lua), and one approach often used there is to set an "iteration limit" to avoid infinite (or very long) loops. I wonder if that would be possible here. That is, instead of putting limits on individual components of the constraint object, you carry a counter (or even timer) along with the match operation, and abort it if it hits some limit. E.g. if using a counter, each "op" would increase the count by the number of "values" (where in a hostlist each host is considered a value), or you just increment and check the counter for each match. To allow long matches, you could make the limit configurable and/or disable it for instance owner. This might be slightly more work, and would allow large matches to proceed instead of catching them before they are even attempted, but would be more reliable in that it would place a hard limit on the amount of time a match is allowed to take. Just an idea, and perhaps it could even be used in concert with the constraint limits already proposed in this PR.

@chu11
Copy link
Member Author

chu11 commented Jan 16, 2024

To allow long matches, you could make the limit configurable and/or disable it for instance owner. This might be slightly more work, and would allow large matches to proceed instead of catching them before they are even attempted, but would be more reliable in that it would place a hard limit on the amount of time a match is allowed to take. Just an idea, and perhaps it could even be used in concert with the constraint limits already proposed in this PR.

Hmmmm I like this idea. Lemme give it a whirl see how it turns out. Gotta carry around some counting stuffs while refactoring things.

@chu11 chu11 force-pushed the issue5669_job_list_constraint_limit branch from ee198ac to d3870ee Compare January 31, 2024 20:56
@chu11 chu11 changed the title job-list: limit constraint sizes to avoid DoS [WIP] job-list: limit constraint sizes to avoid DoS Jan 31, 2024
@chu11
Copy link
Member Author

chu11 commented Jan 31, 2024

re-pushed a WIP of limiting the number of "comparisons" approach. I choose 100K as the random "max" comparisons for the time being and adding a job-list.max_comparisons config option. The key commit is:

76015e5

with the rest mostly refactoring or config stuff. I haven't tested yet or added tests, but wanted to post to see what people thought of the general approach.

Edit: oh yeah, built on top of #5712 ... just b/c of the config file reading infrastructure added.

@chu11 chu11 force-pushed the issue5669_job_list_constraint_limit branch 2 times, most recently from b16d7de to a589a5c Compare February 1, 2024 00:23
@chu11 chu11 changed the title [WIP] job-list: limit constraint sizes to avoid DoS job-list: limit constraint sizes to avoid DoS Feb 1, 2024
@chu11
Copy link
Member Author

chu11 commented Feb 1, 2024

re-pushed, add some tests and removing WIP.

I didn't add back the max elements and max recursion depth of the constraint, as I felt this handled the overriding case.

@chu11 chu11 force-pushed the issue5669_job_list_constraint_limit branch 2 times, most recently from 3ce51d8 to 62e4b58 Compare February 1, 2024 04:21
@chu11 chu11 changed the title job-list: limit constraint sizes to avoid DoS job-list: limit constraint comparisons to avoid DoS Feb 1, 2024
@chu11 chu11 force-pushed the issue5669_job_list_constraint_limit branch 2 times, most recently from 2290879 to f55f2a4 Compare February 1, 2024 18:24
@chu11 chu11 force-pushed the issue5669_job_list_constraint_limit branch from f55f2a4 to c24ac88 Compare February 2, 2024 00:22
@chu11 chu11 force-pushed the issue5669_job_list_constraint_limit branch from c24ac88 to 511c9c1 Compare March 20, 2024 16:54
Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In testing, I noticed garbage returned in error messages where job_match() fails, e.g. when the max comparisons limit is exceeded.

Issue noted inline.

@@ -26,11 +26,13 @@

typedef int (*match_f) (struct list_constraint *,
const struct job *,
unsigned int *,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit message typo: comma instead of period in the second to last sentence.

if ((ret = job_match (job, c, &error)) < 0)
return -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function uses a local flux_error_t to handle the error message from job_match(), but doesn't copy it to errp->text on error. Should errp be used here and the local flux_error_t error be dropped?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this also indicates that there's no testing of a proper error message in the response when job_match() fails.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, I have tests to ensure the error occurs but didn't check the message

@chu11 chu11 force-pushed the issue5669_job_list_constraint_limit branch 2 times, most recently from 0c88077 to ea300b7 Compare April 8, 2024 17:22
@chu11
Copy link
Member Author

chu11 commented Apr 8, 2024

re-pushed, fixing per comments above.

The main change was having flux job list call future_strerror() so it would output the descriptive error vs the generic "errno error". So we can now test that the correct error message is returned.

@@ -54,6 +56,20 @@ struct timestamp_value {
match_comparison_t t_comp;
};

#define CONSTRAINT_COMPARISON_MAX 100000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This limit feels a bit low, for example:

--------------------------------------
> 0.00000
> job-list.list
> flags=topic,payload,route userid=unknown rolemask=none nodeid=any matchtag=1
> {"max_entries": 1000, "attrs": ["dependencies", "ntasks", "t_run", "state", "result", "t_cleanup", "duration", "name", "nodelist", "nnodes", "queue", "annotations", "userid"], "since": 0.0, "constraint": {"and": [{"hostlist": ["corona[1-1000]"]}, {"userid": [6885]}]}}
--------------------------------------
< 0.04141
< job-list.list
< flags=topic,payload,route userid=6885 rolemask=owner,local errnum=22 matchtag=1
< Excessive comparisons made, limit search via states or since

The limit was hit in 40ms on corona. If we say on this system the limit should prevent job-list getting stuck doing matches for more than say 200ms, then we can probably raise the limit by 4-5x.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahhh i set this limit before the hostlist support (which is in the follow up PR).

Maybe just upping it to like 1-2 million would even be fine. I think the main goal is to just not DoS job-list an obscene amount.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me!

@chu11 chu11 force-pushed the issue5669_job_list_constraint_limit branch 2 times, most recently from 8e3137e to f9dcf2b Compare April 18, 2024 18:15
@chu11
Copy link
Member Author

chu11 commented Apr 18, 2024

re-pushed, upping default limit to 1 million.

Copy link

codecov bot commented Apr 18, 2024

Codecov Report

Merging #5681 (f9dcf2b) into master (091fc1f) will increase coverage by 0.03%.
The diff coverage is 92.37%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5681      +/-   ##
==========================================
+ Coverage   83.28%   83.32%   +0.03%     
==========================================
  Files         514      514              
  Lines       82802    82868      +66     
==========================================
+ Hits        68964    69048      +84     
+ Misses      13838    13820      -18     
Files Coverage Δ
src/cmd/job/list.c 87.82% <100.00%> (+0.86%) ⬆️
src/modules/job-list/job-list.c 80.73% <100.00%> (+8.18%) ⬆️
src/modules/job-list/list.c 72.22% <100.00%> (+0.20%) ⬆️
src/modules/job-list/match.c 90.88% <91.34%> (-1.30%) ⬇️

... and 15 files with indirect coverage changes

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! LGTM.

Problem: The job_match() function currently returns a bool, indicating
if a job matches a constraint or does not.  In the future, we will need
to be able to return an error from job_match(), but that is not currently
possible.

Refactor job_match() to return an int instead of a bool.  It now returns
1 for a match, 0 for not, and -1 on error.  In addition, have job_match()
take a flux_error_t pointer as an argument, in preparation for future
error message returns.
chu11 added 7 commits April 18, 2024 21:28
Problem: Job-list constraints are not limited in their size.  A nefarious
user could cause a DoS of the job-list service by sending an extremely
large constraint request.

Solution: To limit constraints, introduce the concept of constraint
comparisons.  All comparison "checks", such as seeing if a job was submitted
by a specific user, are counted.  When the maximum number of comparisons is
reached, an error is returned to the user.  The user can try again by
limiting the number of jobs they are searching through, such as only
searching active vs inactive jobs or if jobs were submitted since a certain time.

For the time being, hard code the maximum number of comparisons to 100
thousand.

Fixes flux-framework#5669
Problem: In the near future, some initialization will have to be
done for constraint matching.  However, no information can currently
be passed to the list_constraint_create() function.

Add infrastructure to allow a match context to be initialized on
job-list load and can be passed to list_constraint_create().  While
not presently used, it will allow easier adjustments to constraint
matching in the future.
Problem: Currently the max number of comparisons that can be
done in a job-list constraint match is hard coded to 100 thousand.
There may be circumstances where we'd like that number to be higher
or lower.

Support configuration of this maximum via a new job-list.max_comparisons
configuration.
Problem: The match_context struct is not initialized with
the new max_comparisons field for unit tests.  In addition,
now that the values in the ctx will actually be used, it should
not sit on the stack, otherwise we might read corrupted data.

Create a global match ctx for all list constraint creations.  Set
the max_comparisons to 0 for unlimited in the unit tests.
Problem: The job-list module will return descriptive text about
errors it has hit.  The "flux job list" command does not output
these descriptive errors, it just outputs the errno string.

Call future_strerror() when the job-list module returns an error,
to get the more descriptive error.

Update several tests in t2260-job-list.t that have error messages
changed as a result.
Problem: A few tests were mixed together within another test's
section.

Move "stress tests" out of the "stats & corner case" tests section.
Split the "stats & corner case" tests into two obviously different
chunks.
Problem: The new job-list max_comparisons config is not covered.

Add tests in t2260-job-list.t.
@chu11 chu11 force-pushed the issue5669_job_list_constraint_limit branch from f9dcf2b to af105ea Compare April 19, 2024 04:29
@chu11
Copy link
Member Author

chu11 commented Apr 19, 2024

great, thanks! setting MWP

@mergify mergify bot merged commit 20ca0b8 into flux-framework:master Apr 19, 2024
33 checks passed
@chu11 chu11 deleted the issue5669_job_list_constraint_limit branch April 19, 2024 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

job-list: limit size of job constraint
2 participants