Properly handle 413 Payload Too Large errors #1199
Conversation
Still manually testing this. I'm a bit confused by the results currently. I configured ES to have a very low max message size and sent it a payload that exceeds it. Without the change it appears the pipeline just fails:
With this change it appears to attempt the retries:
It seems like preventing the adapter from raising an error results in exponential-backoff retries, which from your description is not desirable. Do we want to update the retry logic to not attempt to retry 413 errors? I'm not sure what expectations there are around the DLQ (now there is at least an event that will carry error details). Essentially my question is: do we want to DLQ 413 errors in logstash-output-elasticsearch/lib/logstash/plugin_mixins/elasticsearch/common.rb (lines 287 to 302 at de61518)?
AFAIK, we only route events to the DLQ in two situations:
When we fail to make a successful HTTP request, we are supposed to retry indefinitely (but in your unpatched test, since the pipeline is being shut down due to all inputs closing, it bails on the retries). By not raising the 413, the request is considered a "success" and doc-level handling gets invoked (we expect a valid bulk-API response with one entry per sent document). BUT: what happens when we have a batch of two events, one of which exceeds the limit? Do we get an HTTP 413, or do we get an HTTP 2XX with doc-level 413's? I've seen situations where ES batch-of-one responses have the HTTP status of the doc-level response.
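For reference, the two shapes being contrasted look roughly like the Ruby literals below. This is illustrative only; whether ES ever emits document-level 413s is exactly the open question, so the 413 item in the second shape is hypothetical.

# Shape 1: the whole request is rejected -- an HTTP 413 with no per-document items to inspect.
whole_request_rejected = { http_status: 413, body: nil }

# Shape 2: an HTTP 2xx whose body carries one entry per sent document, so any
# failure surfaces as a doc-level status (the 413 item here is hypothetical).
per_document_statuses = {
  http_status: 200,
  body: {
    "errors" => true,
    "items"  => [
      { "index" => { "status" => 201 } },
      { "index" => { "status" => 413, "error" => { "type" => "payload_too_large" } } }
    ]
  }
}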
From my testing it certainly looks like message size within a batch is not handled differently. If the batch is too large, it is wholesale rejected with a 413, even if the batch contains events that individually would have been under the size limit. From this comment in logstash-output-elasticsearch/lib/logstash/outputs/elasticsearch/http_client.rb (lines 10 to 23 at de61518):
Assuming we are sticking with the 20MiB batch size (and assuming it is reasonable that all ES deployments would accept this size), the only time you would be dealing with 413 errors is when you are generating a single message that is too big. I don't think we would want to keep retrying that large message, but putting it in the DLQ would probably be useful for users to find the problematic messages.
We definitely have some issues with the BadResponseCodeError abstraction. A suggestion is to move BadResponseCodeError a layer up so that perform_request no longer raises BadResponseCodeError. This leaves the caller to decide what counts as a bad response code depending on the type of request:
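A rough sketch of the shape being suggested, using hypothetical names and a stubbed transport rather than the plugin's actual methods: the low-level request helper stops deciding which codes are fatal, and each caller applies its own policy.

# Sketch only: hypothetical names and a stubbed transport, not the plugin's real API.
Response = Struct.new(:code, :body)

class BadResponseCodeError < StandardError
  attr_reader :response_code
  def initialize(code)
    @response_code = code
    super("Got response code '#{code}'")
  end
end

# Low-level helper: never raises on HTTP status, just returns the response.
def perform_request(transport, method, path)
  transport.call(method, path) # assumed to return a Response
end

# Existence checks: 404 means "not there", any other non-2xx is an error.
def exists?(transport, path)
  resp = perform_request(transport, :head, path)
  return false if resp.code == 404
  raise BadResponseCodeError.new(resp.code) unless (200..299).cover?(resp.code)
  true
end

# Bulk sends: let the bulk error handling see a 413 instead of raising past it.
def bulk_send(transport, path)
  resp = perform_request(transport, :post, path)
  return resp if resp.code == 413 || (200..299).cover?(resp.code)
  raise BadResponseCodeError.new(resp.code)
end

In the specs, callers like exists? would then be exercised against a transport double, which is where most of the mocking churn mentioned later in this thread would come from.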
This way we can restore the previous behaviour of allowing the HTTP client to choose to reinterpret the bulk-level 413 as a document-level error, since we know there's only a single document in the bulk. Not sure if this change carries other consequences.
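A minimal sketch of that single-document reinterpretation, with a hypothetical helper rather than the actual http_client code: when the bulk contained exactly one action and the whole request came back 413, synthesize a bulk-style body so the existing document-level handling applies.

require 'json'

# Sketch: reinterpret a request-level 413 as a document-level error when the
# bulk held exactly one document (hypothetical helper, not the plugin's code).
def interpret_bulk_response(response, actions)
  if response.code == 413 && actions.size == 1
    return { "errors" => true,
             "items"  => [{ "index" => { "status" => 413,
                                         "error"  => { "type" => "payload_too_large" } } }] }
  end
  JSON.parse(response.body)
end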
Been experimenting with refactoring the BadResponseCodeError handling. I think the surface area in the code itself won't be too large; the majority of the changes will be in ensuring the test suite mocking works with the refactor. As I'm working on getting the tests together, I'll push a commit showing what I'm thinking for the refactor: 8427d29. Take a look and let me know if that is way off base before I go too far down the test refactor :)
Force-pushed from 8427d29 to 82872d0
Current state: the current change (82872d0) has:
Open questions: Is it desirable to not retry 413s and dump them to the DLQ if possible? Is the refactor moving the BadResponseCode generation out of the adapter worthwhile? With a config like:
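(The original config isn't captured here; a hypothetical reproduction along these lines is assumed — a generator emitting a single event larger than the configured limit, sent straight to a local ES.)

input {
  generator {
    count   => 1
    # any string comfortably larger than the low max request size configured in ES
    message => "<several kilobytes of filler text>"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "payload-too-large-test"
  }
}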
And ES configured to have a very low message size (1kb)
We see
I would probably need to add some conditional logging to ensure that we are not claiming to retry when we are in fact not going to do that. Before I continue, though, I want to make sure this approach is desirable.
cc @jsvd WDYT about #1199 (comment)?
Previously, when Elasticsearch responds with a 413 (Payload Too Large) status, the manticore adapter raises an error before the response can be processed by the bulk_send error handling. This commit refactors the way `BadResponseCodeError` codes are handled. Previously we had logic in the manticore adapter which special-cased raising errors on some codes. This commit refactors such that the adapter raises on any error status and the caller is now responsible for special-case handling of the code.
Force-pushed from 82872d0 to c903a4b
After speaking with @jsvd, we agreed to take the approach sketched out in #1203. I've rebased this PR on that, and fixed up the test assertions and a bug in that sketch. I have added documentation suggesting what to do if you do not want 413s to continue to retry, and confirmed that workflow works by building it into Logstash. From my observations, 413s retry by default; if you add that code to the custom DLQ codes setting, then they are not retried and are instead added to the DLQ for further analysis.
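For context, the documented opt-out described above would look something like the following — assuming the plugin's dlq_custom_codes option is the setting being referenced (treat this as a sketch, not the added docs verbatim):

output {
  elasticsearch {
    hosts            => ["http://localhost:9200"]
    # Assumption: listing 413 here routes oversized events to the DLQ
    # instead of retrying them indefinitely.
    dlq_custom_codes => [413]
  }
}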
CI seems to be failing just because
response = use_get ? @pool.get(path) : @pool.head(path)
response.code >= 200 && response.code <= 299
rescue ::LogStash::Outputs::ElasticSearch::HttpClient::Pool::BadResponseCodeError => e
  return true if e.code == 404
  raise e
Should be false if the code is 404, right? Interesting that the tests didn't catch this.
Suggested change:
- response = use_get ? @pool.get(path) : @pool.head(path)
- response.code >= 200 && response.code <= 299
- rescue ::LogStash::Outputs::ElasticSearch::HttpClient::Pool::BadResponseCodeError => e
-   return true if e.code == 404
-   raise e
+ use_get ? @pool.get(path) : @pool.head(path)
+ rescue ::LogStash::Outputs::ElasticSearch::HttpClient::Pool::BadResponseCodeError => e
+   return false if e.code == 404
+   raise e
+ else
+   true
I was confused by that! I had copied it from https://github.com/logstash-plugins/logstash-output-elasticsearch/pull/1203/files#diff-4de2e59688ce60550472a447d9534d75298d14b845b5c77293f5a255359efac1R413
Previous behavior was that on a 404 nothing was raised and exists? returned false. Your suggestion preserves this behavior. I'll look into where we have a test hole here.
response = @pool.put(path, nil, LogStash::Json.dump(template))
rescue ::LogStash::Outputs::ElasticSearch::HttpClient::Pool::BadResponseCodeError => e
  return response if e.code == 404
  raise e
We're not doing anything with the return value, and we can't guarantee response will be defined in the rescue block:
Suggested change:
- response = @pool.put(path, nil, LogStash::Json.dump(template))
- rescue ::LogStash::Outputs::ElasticSearch::HttpClient::Pool::BadResponseCodeError => e
-   return response if e.code == 404
-   raise e
+ @pool.put(path, nil, LogStash::Json.dump(template))
+ rescue ::LogStash::Outputs::ElasticSearch::HttpClient::Pool::BadResponseCodeError => e
+   raise e if e.code != 404
Totally, I had misunderstood something earlier and added that by mistake. Updated in 93eddbc.
Previously, a few bugs spotted in code review were being obfuscated by the combination of tests not running in CI and the incorrect method for retrieving a code from a BadResponseCodeError. This commit updates the method names and addresses the feedback from code review.
CI is green finally!
LGTM, I've tested the behaviour and it works as expected.
We should add a follow-up PR to change the log entry at https://github.com/logstash-plugins/logstash-output-elasticsearch/blob/main/lib/logstash/plugin_mixins/elasticsearch/common.rb#L299-L303
# only log what the user whitelisted
@document_level_metrics.increment(:retryable_failures)
@logger.info "Retrying failed action", status: status, action: action, error: error if log_failure_type?(error)
actions_to_retry << action
end
so that the large, > 100MB document isn't logged in its entirety, at least not by default. Maybe something like:
# only log what the user whitelisted
@document_level_metrics.increment(:retryable_failures)
loggable_action = @logger.debug? ? action.to_s[0..10000] : action
@logger.info "Retrying failed action", status: status, action: loggable_action, error: error if log_failure_type?(error)
actions_to_retry << action
end
Filed #1205 for the proposed follow-on work.
* Properly handle `413` Payload Too Large errors

Previously, when Elasticsearch responds with a 413 (Payload Too Large) status, the manticore adapter raises an error before the response can be processed by the bulk_send error handling. This commit refactors the way `BadResponseCodeError` codes are handled. Previously we had logic in the manticore adapter which special-cased raising errors on some codes. This commit refactors such that the adapter raises on any error status and the caller is now responsible for special-case handling of the code.

* 12.0.2 release prep

* Use `error_code` instead of `code` when handling BadResponseCodeError

Previously, a few bugs spotted in code review were being obfuscated by the combination of tests not running in CI and the incorrect method for retrieving a code from a BadResponseCodeError. This commit updates the method names and addresses the feedback from code review.

---------

Co-authored-by: João Duarte <[email protected]>