Replies: 11 comments 12 replies
-
If I'm going to hazard a guess, it's the metrics collection for Kubernetes logs that's doing all the allocating. If that can't be optimized we would happily give up per-pod metrics; we already remove that cardinality because our monitoring system can't handle the sheer amount of data for the cluster.
-
Generally speaking, we're aware that Vector allocates more than it would need to given an optimal design of each subsystem. As the authors, we don't love this fact. Vector could definitely perform better and use resources more efficiently. That said, many customers/users often report increased performance and better resource efficiency after switching to Vector. We try to focus our time where there seems to be the most demand/desire, which has been firmly in the "features" camp for a while now.

Internal telemetry has always been an annoying thorn in our side due to the allocations required to get all of the labels/tags into the metrics we need to exist. We have one long-term initiative to port as much of our internal telemetry as possible over to a new design that eschews most, if not all, of the runtime allocations you might currently see happening for internal telemetry. This work has to happen on a component-by-component basis, though.

We're always open to PRs, or detailed issue reports, for things that could clearly be improved. Please feel free to open an issue for this specifically.
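For illustration, here is a minimal sketch of the "register once, increment cheaply" idea described above. This is not Vector's actual telemetry API; the registry, handle, and metric/label names are hypothetical stand-ins for the real design:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Handle to a counter whose name and labels were resolved once at
/// registration time, so the per-event hot path is just an atomic add.
#[derive(Clone)]
struct CounterHandle(Arc<AtomicU64>);

impl CounterHandle {
    fn increment(&self, n: u64) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }
}

/// Hypothetical registry: allocates the label strings exactly once per
/// (metric, label set) instead of on every event.
#[derive(Default)]
struct Registry {
    counters: HashMap<(String, Vec<(String, String)>), CounterHandle>,
}

impl Registry {
    fn register_counter(&mut self, name: &str, labels: &[(&str, &str)]) -> CounterHandle {
        let key = (
            name.to_owned(),
            labels
                .iter()
                .map(|(k, v)| (k.to_string(), v.to_string()))
                .collect(),
        );
        self.counters
            .entry(key)
            .or_insert_with(|| CounterHandle(Arc::new(AtomicU64::new(0))))
            .clone()
    }
}

fn main() {
    let mut registry = Registry::default();

    // Registration happens once, when the component is built.
    let events_in = registry
        .register_counter("component_received_events_total", &[("component", "kubernetes_logs")]);

    // Hot path: no string formatting or label cloning, just an atomic add.
    for _ in 0..1_000 {
        events_in.increment(1);
    }

    assert_eq!(events_in.0.load(Ordering::Relaxed), 1_000);
}
```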
-
Threw together a hackish build today and installed it on our development cluster. All this build does is remove the per-pod cardinality from within the Kubernetes metrics. I know, not an upstreamable solution. However I believe the experiment proves the resource-waste point well. master...X4BNet:vector:exp/remove-kuberentes-pod-metrics

The nodes in this cluster tend to send more log messages on average as they run development builds and can be a bit overloaded at times (especially with Vector typically eating a third of each node's resources).

Before (600 - 800m, saturated CPU):
After (19m - 60m over 20 minutes, 29m avg):
Same or higher log volume (I tested by stressing some loggy services). And with that, the development cluster is no longer saturated just handling Vector :)

@tobz you mentioned an alternative telemetry API. Are there details on this, or an example component for which it has been implemented?
-
To note - I achieved a 2.5x performance boost in the "File source -> Blackhole sink" scenario just by commenting out two metrics in the File source. For us the File source is the main use case, so we are very interested in improvements in this area as well.
-
For anyone curious what the next big consumers of resources are after metrics:
So something doing memory copies in place of memmoves / pointer logic? I think in order to get data on that, a Rust-level capture would need to be used.
-
Is it just me or does this profile result not look like the typical output of a release build?
It's compiled in release mode. Why does a release build spend 19% of its time (after the Kubernetes metrics waste is removed) capturing short stack traces? Did I screw up and somehow build for debug?
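One hedged explanation for stack-trace capture showing up even in an optimized build, independent of debug vs. release, is a value that calls std::backtrace::Backtrace::capture() every time it is constructed; whether that is what this particular profile shows would need the symbols from the capture. A self-contained sketch of the pattern, with a hypothetical error type:

```rust
use std::backtrace::Backtrace;

// `Backtrace::capture()` walks the stack even in an optimized build whenever
// RUST_BACKTRACE / RUST_LIB_BACKTRACE enable it, so constructing a value like
// this on a hot path can show up in release-mode profiles too.
#[derive(Debug)]
struct TracedError {
    message: String,
    backtrace: Backtrace,
}

impl TracedError {
    fn new(message: impl Into<String>) -> Self {
        Self {
            message: message.into(),
            backtrace: Backtrace::capture(),
        }
    }
}

fn main() {
    let err = TracedError::new("example failure");
    // `status()` reports whether a backtrace was actually captured,
    // which depends on the environment variables above.
    println!("{}: {:?}", err.message, err.backtrace.status());
}
```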
-
I found this thread while looking into Vector performance. If there are any tools or instructions for profiling Vector, I would be happy to contribute some reports as well.
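As a starting point, one way to get a Rust-level CPU profile is an in-process sampler such as the pprof crate; this is an assumption about useful tooling rather than an official Vector workflow, and it needs the crate's "flamegraph" feature for the SVG output:

```rust
// A minimal in-process CPU profiling sketch using the `pprof` crate with its
// "flamegraph" feature. Treat the exact setup as an assumption; a running
// Vector process is usually profiled externally, but this shows the idea.
use std::fs::File;

fn busy_work() -> u64 {
    // Stand-in workload so the profiler has something to sample.
    (0..5_000_000u64).map(|i| i.wrapping_mul(31).rotate_left(7)).sum()
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Sample the process at 100 Hz while the workload runs.
    let guard = pprof::ProfilerGuard::new(100)?;

    let result = busy_work();
    println!("workload result: {result}");

    // Write an SVG flamegraph of the collected samples.
    let report = guard.report().build()?;
    report.flamegraph(File::create("flamegraph.svg")?)?;
    Ok(())
}
```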
-
@splitice, Hi! I wanted to write some Rust code for this project over my last few weekends. Moreover, Vector is transferring logs in my production right now, and its memory consumption is quite big, much more than I would expect. So I found this issue and tried to implement a new internal metrics flow with registered handles: I created a HashMap of registered counters and started calling them directly, as @zamazan4ik did in his PR to the File source. This change, I thought, should improve memory consumption for the kubernetes_logs source. I finally started the patched code, and literally zero change showed up on my Grafana dashboard, which showed container_memory_working_set_bytes at the same level as the build without any changes.

But I continued to investigate the memory issue. After that, I added heaptrack to the Dockerfile to track heap memory. First thing I noticed: heaptrack showed that Vector uses only 100MB of heap, while Grafana showed me that Vector uses more than 900MB of memory :) Second: I found that the container_memory_working_set_bytes metric may include memory used for file caching in the kernel. Third: I was so disappointed and sad that I tried to just use your changes (removing any pod_info labels in metrics).

So, @splitice, could you please check your hacky build again on a fresh master version? I just want to know what I've done wrong, or whether some optimizations were added to the Vector code and this behaviour is expected.

I want to add some context: I was generating logs in my minikube cluster using a simple log generator written in Go, with the log rate set to 7k/sec. The Vector configuration was very simple: only one input with the kubernetes_logs type, and one output to blackhole. Looking at the heaptrack flamegraph, I haven't found any mention of metrics, and that was confusing. The most memory was consumed by cloning strings from the Kubernetes annotators, but that seems OK, because that is event content that should be cloned (maybe Rc or Arc would help here, but then we would need to use Rc in all usages of the event struct).

And I want to tackle this performance issue, so I'm ready to work on it on my weekends, but I need some help. I know about the Discord channel, but maybe the participants of this issue could give me some feedback / help about this performance issue? Maybe I'm missing something?
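On the Rc/Arc thought above, here is a minimal sketch (using hypothetical types, not Vector's actual event or annotator structs) of sharing per-pod metadata behind Arc<str>, so that annotating each event bumps refcounts instead of cloning strings:

```rust
use std::sync::Arc;

// Hypothetical per-pod metadata that an annotator attaches to every event.
// Cloning an `Arc<str>` bumps a refcount instead of copying the string, so
// the per-event cost stays constant regardless of metadata size.
#[derive(Clone)]
struct PodMetadata {
    namespace: Arc<str>,
    pod_name: Arc<str>,
    container_name: Arc<str>,
}

struct LogEvent {
    message: String,
    metadata: PodMetadata,
}

fn annotate(message: String, shared: &PodMetadata) -> LogEvent {
    LogEvent {
        message,
        // Cheap: three refcount increments instead of three heap copies.
        metadata: shared.clone(),
    }
}

fn main() {
    let shared = PodMetadata {
        namespace: Arc::from("default"),
        pod_name: Arc::from("my-app-7c9f"),
        container_name: Arc::from("app"),
    };

    let events: Vec<LogEvent> = (0..3)
        .map(|i| annotate(format!("log line {i}"), &shared))
        .collect();

    assert_eq!(events.len(), 3);
    assert_eq!(&*events[0].metadata.pod_name, "my-app-7c9f");
}
```

The trade-off mentioned above still applies: if downstream code needs owned Strings, the clone just moves later in the pipeline rather than disappearing.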
-
Unfortunately, actually fixing this requires someone with real Rust knowledge. I'm definitely not a Rust developer. There's certainly a need for better handling here than what this hack does. The way metrics are handled currently is absurdly bad for anyone running cronjobs or jobs on their cluster.

I'm still running my hacky patch for the cluster collector. Most of those instances run at around 100MB of RAM, but that's all those Vector instances do: sink to another Vector cluster for processing and storage, which is more up to date. I haven't tested my hack against latest master. Hopefully some development will eventually be done on Vector performance and I won't need to rebase it.
-
Just wanted to chime in on this thread as I'm currently doing a proof-of-concept to replace the fluent-bit agent + a custom Golang app as aggregator with Vector as both agent and aggregator in our Kubernetes clusters. People on the team were very excited because of the findings of this blog post (bare metal), but currently our real-world experience in Kubernetes is that while throughput does seem better than fluent-bit, it comes at a considerable resource cost.

I don't have much else to add at this point. Just wanted to chime in that we noticed this and would love it if something could be done to improve performance in Kubernetes.
-
I am still hoping at some point for a performance iteration; the performance of Vector can still be quite low at times. There is definitely a lot of low-hanging fruit in general.
-
How are people finding Vector's performance?
We have been running Vector for quite a while now and quite like it. However, we can't help but note it's not exactly performant.
I've created a few issues over the past year noting areas for improvement. Is the current focus strictly on new features, or is there room for a performance-oriented milestone release? It seems like one wouldn't go amiss.
Would it be helpful to receive some perf reports?
My first finding from that (if I'm reading it right): on a system where Vector is sitting at 41% CPU, memory allocation is costing 11.5%?
This is a simple kubernetes_logs -> vector configuration with no real processing, plus prometheus_metrics (cardinality limited, not that metric cardinality seems to be the issue here).
A configuration sample can be provided.