
[Performance] DA Bridge Node Not Utilising Full Storage/Network Capacity During Sync #4108

Open · aWN4Y25pa2EK opened this issue Feb 12, 2025 · 7 comments
Labels: external (Issues created by non node team members), needs:triage

Comments


aWN4Y25pa2EK commented Feb 12, 2025

Description

During performance testing of the DA bridge node, we discovered that the node is significantly underutilizing available system resources during synchronization, particularly when syncing from scratch.

Network - 32MB-100k

  • ODS Block Size -> ~32 MB
  • Q4 Block Size -> 128 MB

Current Behavior

  • DA/BN node performs at a flat ~800 write IOPS
  • Average network ingress: ~62–63 Mb/s
  • TCP congestion control: BBR

Existing Capabilities

DA Bridge Node

  • CPU: 32 cores
  • Memory: 124.0 GiB
  • Network: 10 Gbps
  • Storage: 16 TB / 16k IOPS, 1000 MiB/s throughput

Validator

  • CPU: 32 cores
  • Memory: 126.0 GiB
  • Network: 3.2 Gbps
  • Storage: 15k IOPS

DA Configuration used

config.toml
[Node]
  StartupTimeout = "2m0s"
  ShutdownTimeout = "2m0s"
[Core]
  IP = ""
  Port = "9090"
[State]
  DefaultKeyName = "my_celes_key.info"
  DefaultBackendName = "test"
[P2P]
  ListenAddresses = ["/ip4/0.0.0.0/udp/2121/quic-v1/webtransport", "/ip6/::/udp/2121/quic-v1/webtransport", "/ip4/0.0.0.0/tcp/2121"]
  AnnounceAddresses = []
  NoAnnounceAddresses = ["/ip4/127.0.0.1/udp/2121/quic-v1/webtransport", "/ip4/0.0.0.0/udp/2121/quic-v1/webtransport", "/ip6/::/udp/2121/quic-v1/webtransport", "/ip4/0.0.0.0/udp/2121/quic-v1", "/ip4/127.0.0.1/udp/2121/quic-v1", "/ip6/::/udp/2121/quic-v1", "/ip4/0.0.0.0/tcp/2121", "/ip4/127.0.0.1/tcp/2121", "/ip6/::/tcp/2121"]
  MutualPeers = []
  PeerExchange = true
  RoutingTableRefreshPeriod = "1m0s"
  [P2P.ConnManager]
    Low = 800
    High = 1000
    GracePeriod = "1m0s"
[RPC]
  Address = "0.0.0.0"
  Port = "26658"
[Gateway]
  Address = "0.0.0.0"
  Port = "26659"
  Enabled = false
[Share]
  UseShareExchange = true
  [Share.EDSStoreParams]
    GCInterval = "0s"
    RecentBlocksCacheSize = 10
    BlockstoreCacheSize = 128
  [Share.ShrExEDSParams]
    ServerReadTimeout = "5s"
    ServerWriteTimeout = "1m0s"
    HandleRequestTimeout = "1m0s"
    ConcurrencyLimit = 10
    BufferSize = 32768
  [Share.ShrExNDParams]
    ServerReadTimeout = "5s"
    ServerWriteTimeout = "1m0s"
    HandleRequestTimeout = "1m0s"
    ConcurrencyLimit = 10
  [Share.PeerManagerParams]
    PoolValidationTimeout = "2m0s"
    PeerCooldown = "3s"
    GcInterval = "30s"
    EnableBlackListing = false
  [Share.Discovery]
    PeersLimit = 5
    AdvertiseInterval = "1h0m0s"
[Header]
  TrustedHash = ""
  TrustedPeers = []
  [Header.Store]
    StoreCacheSize = 4096
    IndexCacheSize = 16384
    WriteBatchSize = 2048
  [Header.Syncer]
    TrustingPeriod = "336h0m0s"
  [Header.Server]
    WriteDeadline = "8s"
    ReadDeadline = "1m0s"
    RangeRequestTimeout = "10s"
  [Header.Client]
    MaxHeadersPerRangeRequest = 64
    RangeRequestTimeout = "8s"

Investigation Points

  • Increase daser parallel workers count
  • Tune ConcurrencyLimit for network bandwidth utilization
  • Adjust BlockstoreCacheSize for memory usage
  • Review WriteBatchSize vs IOPS capacity
  • Evaluate BufferSize for throughput optimization

Impact

This significantly affects node operators who need to:

  • Relocate nodes
  • Perform full sync from scratch
  • Recover from data loss scenarios

It would be great to be able to fine-tune the DA node configuration parameters to match the hardware capacity and achieve faster synchronisation.

github-actions bot added the needs:triage and external (Issues created by non node team members) labels on Feb 12, 2025
aWN4Y25pa2EK (Author) commented Feb 12, 2025

Network IRQ distribution seems unbalanced on the DA side during the sync process, even with irqbalance enabled (CPU usage constantly at 99%).


aWN4Y25pa2EK (Author) commented Feb 12, 2025

Storage (gp3)

~190 MiB/s throughput and ~800 IOPS, out of 1000 MiB/s / 16k IOPS capacity

Network

~62–63 Mb/s out of 10 Gbps


aWN4Y25pa2EK (Author) commented:

2025-02-12T15:34:00.969Z	DEBUG	core	core/exchange.go:171	fetched signed block from core	{"height": 14570}
2025-02-12T16:32:19.566Z	DEBUG	core	core/exchange.go:171	fetched signed block from core	{"height": 18607}

Start: 2025-02-12 15:34:00.969Z, height = 14570
End: 2025-02-12 16:32:19.566Z, height = 18607

Blocks synced: 18607 − 14570 = 4037
Elapsed time: 58 min 18.597 s = 3498.597 s

Blocks per second = 4037 ÷ 3498.597 s ≈ 1.15 blocks/s
Blocks per minute = 1.15 × 60 ≈ 69 blocks/min
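The same calculation as a small, self-contained Go snippet (values hard-coded from the two log lines above):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Timestamps and heights taken from the quoted log lines.
	start, _ := time.Parse(time.RFC3339Nano, "2025-02-12T15:34:00.969Z")
	end, _ := time.Parse(time.RFC3339Nano, "2025-02-12T16:32:19.566Z")
	startHeight, endHeight := 14570, 18607

	elapsed := end.Sub(start).Seconds()        // ~3498.6 s
	blocks := float64(endHeight - startHeight) // 4037 blocks
	bps := blocks / elapsed                    // ~1.15 blocks/s

	fmt.Printf("elapsed: %.3f s, blocks: %.0f\n", elapsed, blocks)
	fmt.Printf("rate: %.2f blocks/s (%.0f blocks/min)\n", bps, bps*60)
}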

walldiss (Member) commented:

Thank you for reporting this detailed performance issue. From your data, the Bridge Node (BN) appears to sync at approximately 1.15 blocks/s for 32MB blocks, which translates to around 36.8 MB/s of throughput. Considering the 10 Gbps network on the BN side (and 3.2 Gbps on the validator), this is roughly 9.2% utilization of the validator’s network capacity—far below what the hardware should support. Additionally, the node has 32 CPU cores and 128 GB RAM, indicating that system resources should not be the limiting factor.

We need to run benchmarks of BN sync in a controlled environment and identify the bottleneck behind such low bandwidth utilisation.

walldiss (Member) commented:

Hypothesis

One possible explanation for the performance bottleneck is how worker parallelization is currently managed in the BN sync process. Unlike the DASer approach, where a fixed number of workers continuously process tasks, the BN sync uses an errgroup that splits the work among workers and waits for all of them to finish. If one task (e.g., fetching or processing a single block) is slow, the entire group can stall, leaving multiple workers idle and underutilizing available CPU and network resources. Ensuring that work is distributed in a way that prevents a single slow task from blocking others could significantly improve overall sync throughput.
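To make the difference concrete, here is a minimal, illustrative Go sketch (not the actual celestia-node sync code; fetchAndStore and the height range are hypothetical stand-ins): an errgroup that waits batch-by-batch stalls on its slowest height, while a fixed worker pool pulling heights from a channel keeps every worker busy.

package main

import (
	"context"
	"time"

	"golang.org/x/sync/errgroup"
)

// fetchAndStore is a hypothetical stand-in for fetching a block from core
// and persisting it; here it only sleeps.
func fetchAndStore(ctx context.Context, height uint64) error {
	select {
	case <-time.After(10 * time.Millisecond):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// syncBatched mirrors the batch-and-wait pattern: the next batch of heights
// cannot start until the slowest height in the current batch completes.
func syncBatched(ctx context.Context, from, to uint64, workers int) error {
	for h := from; h <= to; h += uint64(workers) {
		g, gctx := errgroup.WithContext(ctx)
		for i := 0; i < workers && h+uint64(i) <= to; i++ {
			height := h + uint64(i)
			g.Go(func() error { return fetchAndStore(gctx, height) })
		}
		if err := g.Wait(); err != nil { // stalls on the slowest height
			return err
		}
	}
	return nil
}

// syncPooled is the DASer-like pattern: workers continuously pull the next
// height, so a single slow block never idles the remaining workers.
func syncPooled(ctx context.Context, from, to uint64, workers int) error {
	heights := make(chan uint64)
	g, gctx := errgroup.WithContext(ctx)
	g.Go(func() error {
		defer close(heights)
		for h := from; h <= to; h++ {
			select {
			case heights <- h:
			case <-gctx.Done():
				return gctx.Err()
			}
		}
		return nil
	})
	for i := 0; i < workers; i++ {
		g.Go(func() error {
			for h := range heights {
				if err := fetchAndStore(gctx, h); err != nil {
					return err
				}
			}
			return nil
		})
	}
	return g.Wait()
}

func main() {
	ctx := context.Background()
	_ = syncBatched(ctx, 1, 128, 8)
	_ = syncPooled(ctx, 1, 128, 8)
}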

Test

A blocking profile benchmark is needed.
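For reference, a generic way to expose a blocking profile from a Go binary (standard runtime and net/http/pprof only; the port and the idea of wiring this into the node are illustrative, not an existing celestia-node flag):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on the default mux
	"runtime"
)

func main() {
	// Record every blocking event (mutex/channel waits); rate 1 is the most
	// detailed and also the most expensive setting.
	runtime.SetBlockProfileRate(1)

	// Expose the profiles; inspect with e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/block
	log.Println(http.ListenAndServe("localhost:6060", nil))
}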

Potential solution

Use DASer for core sync

renaynay (Member) commented:

Using DASer for core sync will be a very complex change.

Wondertan (Member) commented Feb 13, 2025

Another way to test the hypothesis is to run with a concurrency limit of 1-2 and see if throughput changes. If it doesn't, then we should be looking at the TCP connection itself, like enabling MPTCP or BBR. (Or socket IPC if on the same proc.)
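As a rough illustration of the MPTCP part of that experiment (Go 1.21+; the peer address is a placeholder, and BBR is a host-level setting such as sysctl net.ipv4.tcp_congestion_control=bbr rather than something set from the application):

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Hypothetical peer address; replace with the actual core endpoint.
	const peer = "validator.example.com:9090"

	d := &net.Dialer{Timeout: 10 * time.Second}
	// Request Multipath TCP for this connection; the kernel falls back to
	// plain TCP if the host or the peer does not support it.
	d.SetMultipathTCP(true)

	conn, err := d.DialContext(context.Background(), "tcp", peer)
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()

	if tc, ok := conn.(*net.TCPConn); ok {
		used, _ := tc.MultipathTCP()
		fmt.Println("multipath TCP in use:", used)
	}
}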
