Replies: 3 comments 1 reply
-
Can you confirm what HTTP status code Elasticsearch returns when returning a "429" response? We do have some logic that looks for "429"s in the response body, but only if the overall HTTP response is reported as successful (2xx): see `vector/src/sinks/elasticsearch/retry.rs`, lines 123 to 143 at commit `acea5ae`. Also, can you share your sink configuration?
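For context, the check being referenced amounts to roughly the following. This is a simplified sketch, not the actual `retry.rs` code (the real logic differs in structure and types); it assumes a `serde_json` dependency:

```rust
use serde_json::Value;

/// Simplified sketch (not Vector's actual retry logic): decide whether
/// an Elasticsearch bulk response should be retried. The bulk API can
/// return HTTP 200 while individual items failed with a 429, so the
/// body has to be inspected even on a "successful" response, e.g.:
/// {"errors": true, "items": [{"index": {"status": 429, ...}}, ...]}
fn should_retry(http_status: u16, body: &str) -> bool {
    // Non-2xx statuses take their own path; a top-level 429 is retriable.
    if !(200..300).contains(&http_status) {
        return http_status == 429;
    }
    // On 2xx, only dig into the body if the bulk response flags errors.
    let Ok(parsed) = serde_json::from_str::<Value>(body) else {
        return false;
    };
    if parsed["errors"] != Value::Bool(true) {
        return false;
    }
    // Each item is an object keyed by its operation ("index", "create",
    // ...) whose value carries the per-item status code.
    parsed["items"]
        .as_array()
        .map(|items| {
            items.iter().any(|item| {
                item.as_object()
                    .and_then(|ops| ops.values().next())
                    .and_then(|op| op["status"].as_u64())
                    == Some(429)
            })
        })
        .unwrap_or(false)
}

fn main() {
    // The case under discussion: HTTP 200 carrying an item-level 429.
    let body = r#"{"errors":true,"items":[{"index":{"status":429}}]}"#;
    assert!(should_retry(200, body));
}
```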
-
Hey @jszwedko. Thanks for getting back on this. Yes, the status code on these is 200. See below for the elasticsearch sink config:
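A representative sketch with placeholder names, endpoints, and index (not the actual values from this deployment):

```yaml
# Representative sketch only: the sink name, inputs, endpoints, and
# index below are placeholders, not the real values from this setup.
sinks:
  elasticsearch_out:
    type: elasticsearch
    inputs: ["my_pipeline"]            # placeholder
    endpoints:
      - "https://es-node-1:9200"       # placeholder hosts
      - "https://es-node-2:9200"
    mode: bulk
    bulk:
      index: "logs-%Y.%m.%d"           # placeholder index pattern
    request_retry_partial: true        # mentioned in a later comment
    request:
      concurrency: adaptive            # ARC
```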
-
Hey @jszwedko. So these are possibly two different issues(?) and I'm sorry for conflating them.
Here is an extract from the logs related to the elasticsearch sink around that time: [screenshot]. I can try to get them to you in text format, but it isn't easy because of this issue: [link]. You can also see several connection refused errors; this is because one of the hosts specified in the [...] until we again get to another period where the requests vastly outnumber responses (this also coincided with the buffers filling up): [screenshot]. As for [...]. Let me know if you'd like to see anything else, and thanks again for taking a look.
-
vector-repro.zip
See the attached docker config to reproduce this, but essentially I have a pipeline with the `demo_logs` source and the `elasticsearch` sink. I'm intentionally setting the write thread pool queue size to 0 in elasticsearch (opensearch) so that 429 errors are returned, or at least so that 429s are returned in the JSON response body, which is how elasticsearch reports 429s.

At this point, given that I have set `request_retry_partial: true`, I'm expecting to see the `vector_buffer_events` metric go up, but it remains at 0. However, if I then stop the opensearch container so that vector can no longer connect, `vector_buffer_events` does start to go up. Why is this?

Additionally, when our elasticsearch cluster is under load and starts returning 429 errors, we're seeing `rate(vector_http_client_requests_sent_total[5m])` for the `elasticsearch` sink continue to rise to huge numbers (see charts below), which isn't what I would expect given that the cluster is under load. From what I can tell, this is then causing bulk indexing tasks to pile up on the cluster, which is something ARC (adaptive request concurrency) should be helping to avoid.
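For reference, a minimal sketch of the pipeline just described; the attached vector-repro.zip is authoritative, and the names, endpoint, and layout below are assumptions:

```yaml
# Sketch only: component names and the endpoint are assumptions, not
# taken from the attached repro. On the opensearch side, the repro sets
# the write thread pool queue size to 0 (thread_pool.write.queue_size: 0)
# so that bulk items come back as per-item 429s inside HTTP 200 responses.
sources:
  demo:
    type: demo_logs
    format: json

sinks:
  es:
    type: elasticsearch
    inputs: ["demo"]
    endpoints: ["http://opensearch:9200"]  # assumed container hostname
    mode: bulk
    request_retry_partial: true            # retry item-level 429s
    request:
      concurrency: adaptive                # ARC
```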