
[Performance] DA Bridge Node Not Utilising Full Storage/Network Capacity During Sync #4108

Open · aWN4Y25pa2EK opened this issue Feb 12, 2025 · 7 comments
Labels: external (Issues created by non node team members), needs:triage

Comments


aWN4Y25pa2EK commented Feb 12, 2025

Description

During performance testing of the DA bridge node, we discovered that the node is significantly underutilizing available system resources during synchronization, particularly when syncing from scratch.

Network - 32MB-100k

  • ODS Block Size -> ~32 MB
  • Q4 Block Size -> 128 MB

Current Behavior

  • DA/BN node performs at a flat ~800 write IOPS
  • Average network ingress: ~62–63 Mb/s
  • TCP congestion control: BBR

Existing Capabilities

DA Bridge Node

  • CPU: 32 cores
  • Memory: 124.0 GiB
  • Network: 10 Gbps
  • Storage: 16 TB / 16k IOPS, 1000 MiB/s throughput

Validator

  • CPU: 32 cores
  • Memory: 126.0 GiB
  • Network: 3.2 Gbps
  • Storage: 15k IOPS

DA Configuration used

config.toml
[Node]
  StartupTimeout = "2m0s"
  ShutdownTimeout = "2m0s"
[Core]
  IP = ""
  Port = "9090"
[State]
  DefaultKeyName = "my_celes_key.info"
  DefaultBackendName = "test"
[P2P]
  ListenAddresses = ["/ip4/0.0.0.0/udp/2121/quic-v1/webtransport", "/ip6/::/udp/2121/quic-v1/webtransport", "/ip4/0.0.0.0/tcp/2121"]
  AnnounceAddresses = []
  NoAnnounceAddresses = ["/ip4/127.0.0.1/udp/2121/quic-v1/webtransport", "/ip4/0.0.0.0/udp/2121/quic-v1/webtransport", "/ip6/::/udp/2121/quic-v1/webtransport", "/ip4/0.0.0.0/udp/2121/quic-v1", "/ip4/127.0.0.1/udp/2121/quic-v1", "/ip6/::/udp/2121/quic-v1", "/ip4/0.0.0.0/tcp/2121", "/ip4/127.0.0.1/tcp/2121", "/ip6/::/tcp/2121"]
  MutualPeers = []
  PeerExchange = true
  RoutingTableRefreshPeriod = "1m0s"
  [P2P.ConnManager]
    Low = 800
    High = 1000
    GracePeriod = "1m0s"
[RPC]
  Address = "0.0.0.0"
  Port = "26658"
[Gateway]
  Address = "0.0.0.0"
  Port = "26659"
  Enabled = false
[Share]
  UseShareExchange = true
  [Share.EDSStoreParams]
    GCInterval = "0s"
    RecentBlocksCacheSize = 10
    BlockstoreCacheSize = 128
  [Share.ShrExEDSParams]
    ServerReadTimeout = "5s"
    ServerWriteTimeout = "1m0s"
    HandleRequestTimeout = "1m0s"
    ConcurrencyLimit = 10
    BufferSize = 32768
  [Share.ShrExNDParams]
    ServerReadTimeout = "5s"
    ServerWriteTimeout = "1m0s"
    HandleRequestTimeout = "1m0s"
    ConcurrencyLimit = 10
  [Share.PeerManagerParams]
    PoolValidationTimeout = "2m0s"
    PeerCooldown = "3s"
    GcInterval = "30s"
    EnableBlackListing = false
  [Share.Discovery]
    PeersLimit = 5
    AdvertiseInterval = "1h0m0s"
[Header]
  TrustedHash = ""
  TrustedPeers = []
  [Header.Store]
    StoreCacheSize = 4096
    IndexCacheSize = 16384
    WriteBatchSize = 2048
  [Header.Syncer]
    TrustingPeriod = "336h0m0s"
  [Header.Server]
    WriteDeadline = "8s"
    ReadDeadline = "1m0s"
    RangeRequestTimeout = "10s"
  [Header.Client]
    MaxHeadersPerRangeRequest = 64
    RangeRequestTimeout = "8s"

Investigation Points

  • Increase daser parallel workers count
  • Tune ConcurrencyLimit for network bandwidth utilization
  • Adjust BlockstoreCacheSize for memory usage
  • Review WriteBatchSize vs IOPS capacity
  • Evaluate BufferSize for throughput optimization

Impact

This significantly affects node operators who need to:

  • Relocate nodes
  • Perform full sync from scratch
  • Recover from data loss scenarios

It would be great to be able to fine-tune the DA node configuration parameters to match the hardware capacity and achieve faster synchronisation.

github-actions bot added the needs:triage and external (Issues created by non node team members) labels on Feb 12, 2025
aWN4Y25pa2EK (Author) commented Feb 12, 2025

Network IRQ distribution seems unbalanced on the DA side during the sync process, even with irqbalance enabled (CPU usage constantly at 99%).


aWN4Y25pa2EK (Author) commented Feb 12, 2025

Storage (gp3)

~190 MiB/s throughput and ~800 IOPS, out of 1000 MiB/s / 16k IOPS capacity

Network

~62–63 Mb/s out of 10 Gbps


aWN4Y25pa2EK (Author) commented:

2025-02-12T15:34:00.969Z	DEBUG	core	core/exchange.go:171	fetched signed block from core	{"height": 14570}
2025-02-12T16:32:19.566Z	DEBUG	core	core/exchange.go:171	fetched signed block from core	{"height": 18607}

Start: 2025-02-12 15:34:00.969Z, height = 14570
End: 2025-02-12 16:32:19.566Z, height = 18607

Blocks synced: 18607 − 14570 = 4037
Elapsed time: 58 min 18.597 s = 3498.597 s

Blocks per second = 4037 ÷ 3498.597 s ≈ 1.15 blocks/s
Blocks per minute = 1.15 × 60 ≈ 69 blocks/min
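The same calculation as a small, self-contained Go snippet (values hard-coded from the two log lines above):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Timestamps and heights taken from the quoted log lines.
	start, _ := time.Parse(time.RFC3339Nano, "2025-02-12T15:34:00.969Z")
	end, _ := time.Parse(time.RFC3339Nano, "2025-02-12T16:32:19.566Z")
	startHeight, endHeight := 14570, 18607

	elapsed := end.Sub(start).Seconds()        // ~3498.6 s
	blocks := float64(endHeight - startHeight) // 4037 blocks
	bps := blocks / elapsed                    // ~1.15 blocks/s

	fmt.Printf("elapsed: %.3f s, blocks: %.0f\n", elapsed, blocks)
	fmt.Printf("rate: %.2f blocks/s (%.0f blocks/min)\n", bps, bps*60)
}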

walldiss (Member) commented:

Thank you for reporting this detailed performance issue. From your data, the Bridge Node (BN) appears to sync at approximately 1.15 blocks/s for 32MB blocks, which translates to around 36.8 MB/s of throughput. Considering the 10 Gbps network on the BN side (and 3.2 Gbps on the validator), this is roughly 9.2% utilization of the validator’s network capacity—far below what the hardware should support. Additionally, the node has 32 CPU cores and 128 GB RAM, indicating that system resources should not be the limiting factor.

We need to run benchmarks of BN sync in a controlled environment and identify the bottleneck behind such low bandwidth utilisation.

walldiss (Member) commented:

Hypothesis

One possible explanation for the performance bottleneck is how worker parallelization is currently managed in the BN sync process. Unlike the DASer approach, where a fixed number of workers continuously process tasks, the BN sync uses an errgroup that splits the work among workers and waits for all of them to finish. If one task (e.g., fetching or processing a single block) is slow, the entire group can stall, leaving multiple workers idle and underutilizing available CPU and network resources. Ensuring that work is distributed in a way that prevents a single slow task from blocking others could significantly improve overall sync throughput.
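To make the difference concrete, here is a minimal, illustrative Go sketch (not the actual celestia-node sync code; fetchAndStore and the height range are hypothetical stand-ins): an errgroup that waits batch-by-batch stalls on its slowest height, while a fixed worker pool pulling heights from a channel keeps every worker busy.

package main

import (
	"context"
	"time"

	"golang.org/x/sync/errgroup"
)

// fetchAndStore is a hypothetical stand-in for fetching a block from core
// and persisting it; here it only sleeps.
func fetchAndStore(ctx context.Context, height uint64) error {
	select {
	case <-time.After(10 * time.Millisecond):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// syncBatched mirrors the batch-and-wait pattern: the next batch of heights
// cannot start until the slowest height in the current batch completes.
func syncBatched(ctx context.Context, from, to uint64, workers int) error {
	for h := from; h <= to; h += uint64(workers) {
		g, gctx := errgroup.WithContext(ctx)
		for i := 0; i < workers && h+uint64(i) <= to; i++ {
			height := h + uint64(i)
			g.Go(func() error { return fetchAndStore(gctx, height) })
		}
		if err := g.Wait(); err != nil { // stalls on the slowest height
			return err
		}
	}
	return nil
}

// syncPooled is the DASer-like pattern: workers continuously pull the next
// height, so a single slow block never idles the remaining workers.
func syncPooled(ctx context.Context, from, to uint64, workers int) error {
	heights := make(chan uint64)
	g, gctx := errgroup.WithContext(ctx)
	g.Go(func() error {
		defer close(heights)
		for h := from; h <= to; h++ {
			select {
			case heights <- h:
			case <-gctx.Done():
				return gctx.Err()
			}
		}
		return nil
	})
	for i := 0; i < workers; i++ {
		g.Go(func() error {
			for h := range heights {
				if err := fetchAndStore(gctx, h); err != nil {
					return err
				}
			}
			return nil
		})
	}
	return g.Wait()
}

func main() {
	ctx := context.Background()
	_ = syncBatched(ctx, 1, 128, 8)
	_ = syncPooled(ctx, 1, 128, 8)
}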

Test

A blocking profile benchmark is needed.
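For reference, a generic way to expose a blocking profile from a Go binary (standard runtime and net/http/pprof only; the port and the idea of wiring this into the node are illustrative, not an existing celestia-node flag):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on the default mux
	"runtime"
)

func main() {
	// Record every blocking event (mutex/channel waits); rate 1 is the most
	// detailed and also the most expensive setting.
	runtime.SetBlockProfileRate(1)

	// Expose the profiles; inspect with e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/block
	log.Println(http.ListenAndServe("localhost:6060", nil))
}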

Potential solution

Use DASer for core sync

renaynay (Member) commented:

Using DASer for core sync will be a very complex change.

Wondertan (Member) commented Feb 13, 2025

Another way to test the hypothesis is to run with a concurrency limit of 1-2 and see if throughput changes. If it doesn't, then we should be looking at the TCP connection itself, like enabling MPTCP or BBR. (Or socket IPC if on the same proc.)
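As a rough illustration of the MPTCP part of that experiment (Go 1.21+; the peer address is a placeholder, and BBR is a host-level setting such as sysctl net.ipv4.tcp_congestion_control=bbr rather than something set from the application):

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Hypothetical peer address; replace with the actual core endpoint.
	const peer = "validator.example.com:9090"

	d := &net.Dialer{Timeout: 10 * time.Second}
	// Request Multipath TCP for this connection; the kernel falls back to
	// plain TCP if the host or the peer does not support it.
	d.SetMultipathTCP(true)

	conn, err := d.DialContext(context.Background(), "tcp", peer)
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()

	if tc, ok := conn.(*net.TCPConn); ok {
		used, _ := tc.MultipathTCP()
		fmt.Println("multipath TCP in use:", used)
	}
}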
