For larger ring-sizes, long startup times can be seen for individual nodes.
There is some confusion between parameters as well, in particular riak_core.vnode_rolling_start and riak_core.vnode_parallel_start; both default to 16.
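Assuming both are read from the riak_core application environment with the stated defaults (a guess at the mechanism; the real code may use a helper module), a minimal sketch of the lookups would be:

-module(vnode_start_settings).
-export([parallel_start/0, rolling_start/0]).

%% Maximum number of owned vnodes started in parallel per batch.
parallel_start() ->
    application:get_env(riak_core, vnode_parallel_start, 16).

%% Number of serial start attempts per management tick.
rolling_start() ->
    application:get_env(riak_core, vnode_rolling_start, 16).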
There are comments about races and hacks in riak_core/src/riak_core_ring_handler.erl (Lines 126 to 130 in 25d9a6f):
%% NOTE: This following is a hack. There's a basic
%% dependency/race between riak_core (want to start vnodes
%% right away to trigger possible handoffs) and riak_kv
%% (needed to support those vnodes). The hack does not fix
%% that dependency: internal techdebt todo list #A7 does.
It appears that the tick on the riak_core_vnode_manager prompts two code paths which both result in starting vnodes (a simplified sketch is given below the list):
a call to maybe_ring_changed/4 which will result in a call to riak_core_ring_handler:ensure_vnodes_started/1;
a call to maybe_start_vnodes/2.
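A simplified sketch of that fan-out, assuming a single tick handler in riak_core_vnode_manager; the function bodies here are stand-ins rather than the real implementation:

-module(vnode_mgr_tick_sketch).
-export([handle_management_tick/1]).

handle_management_tick(State) ->
    %% Path 1: react to any ring change by ensuring owned vnodes are
    %% started (via riak_core_ring_handler:ensure_vnodes_started/1).
    State1 = maybe_ring_changed(State),
    %% Path 2: independently attempt a limited batch of never-started
    %% vnodes, constrained by riak_core.vnode_rolling_start.
    maybe_start_vnodes(State1).

%% Stand-in for riak_core_vnode_manager:maybe_ring_changed/4.
maybe_ring_changed(State) ->
    State.

%% Stand-in for riak_core_vnode_manager:maybe_start_vnodes/2.
maybe_start_vnodes(State) ->
    State.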
The behaviour of maybe_ring_changed is dependent on the existence of a start_vnodes/1 function - which exists in riak_kv, but not in riak_pipe. Within riak_kv, this will call riak_core_vnode_manager:get_vnode_pid/2 with a list of all the vnodes to start, but starting those vnodes will be constrained by the number set in riak_core.vnode_parallel_start. For riak_pipe, this will call riak_core_vnode_manager:get_vnode_pid/2 for each vnode in turn. Either way, this path will only prompt the start of vnodes for which this node is the designated owner in the ring - in parallel (in batches of riak_core.vnode_parallel_start) for riak_kv, and in series (for all of them) for riak_pipe.
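A rough sketch of that export check, under the assumption that ensure_vnodes_started/1 behaves broadly as described above (module and function names below are stand-ins, not the exact riak_core code):

-module(ensure_vnodes_sketch).
-export([start_owned_vnodes/2]).

%% Mod is the per-application vnode module (e.g. riak_kv_vnode);
%% Startable is the list of partition indexes this node owns.
start_owned_vnodes(Mod, Startable) ->
    case erlang:function_exported(Mod, start_vnodes, 1) of
        true ->
            %% riak_kv-style path: one call with the whole list, started
            %% in parallel batches limited by riak_core.vnode_parallel_start.
            Mod:start_vnodes(Startable);
        false ->
            %% riak_pipe-style path: one call per index, in series.
            [Mod:start_vnode(Idx) || Idx <- Startable]
    end.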
The maybe_start_vnodes/2 function is constrained by riak_core.vnode_rolling_start. It receives that many tokens and then schedules that many attempts (16 by default) to start a not-yet-started vnode in a loop, where each attempt must wait for the previous attempt to clear. There is no parallel starting through this path: riak_core.vnode_rolling_start controls the number of attempts per 10s management tick, and each attempt is triggered only after the last has completed. This path will try to start every vnode that has never been started on this node before - regardless of whether this node owns the vnode in the ring (as it may need to hand off a vnode it doesn't own if it has started with some historic data present under that vnode).
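A sketch of that serial loop, intended only to show the token behaviour and not the actual maybe_start_vnodes/2 implementation (start_if_not_running/2 below is a hypothetical helper):

-module(rolling_start_sketch).
-export([rolling_start/2]).

%% NeverStarted is a list of {Mod, Idx} pairs that have never been
%% started on this node; Tokens is riak_core.vnode_rolling_start.
rolling_start(NeverStarted, 0) ->
    NeverStarted;
rolling_start([], _Tokens) ->
    [];
rolling_start([{Mod, Idx} | Rest], Tokens) ->
    %% Each attempt is synchronous, so the next attempt is not made
    %% until this one has cleared - one slow vnode start stalls the
    %% whole batch for this tick.
    ok = start_if_not_running(Mod, Idx),
    rolling_start(Rest, Tokens - 1).

%% Hypothetical helper: a vnode that is already running is a null
%% event, detected before any startup is prompted.
start_if_not_running(_Mod, _Idx) ->
    ok.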
The net effect for riak_kv is that (possibly) the same 16 vnodes are requested to start each management tick by two different paths (assuming startup is very fast because the vnodes are empty). If the first vnode triggered via maybe_start_vnodes/2 is slow to start up (e.g. because it has a lot of data, or is repairing), then the 16 parallel requests will be blocked behind it in the message queue, and no parallel starts will be prompted until that first request is released.
Consider a node that is already in the ring but is restarted. If the ring-size is 512 and the node owns 1/8th of the ring, then its 64 owned vnodes will be started in 4 loops by the maybe_ring_changed/4 call triggered on the management tick - so the vnodes it needs in order to function are started within 4 ticks (assuming all start within 10s).
However, all the other vnodes must wait to be started 16 at a time via maybe_start_vnodes/2. This applies to both riak_pipe and riak_kv - so it will take at least 2 * (512 - 64) / 16 loops, or around ten minutes.
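Spelling that arithmetic out with the defaults of 16 and a 10s tick:
owned vnodes: 512 / 8 = 64, started in 64 / 16 = 4 ticks (roughly 40 seconds);
remaining vnodes per application: 512 - 64 = 448;
rolling starts needed: 2 * 448 / 16 = 56 ticks, i.e. about 560 seconds, or a little over 9 minutes.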
This creates confusion when looking at riak admin transfers. Each node will start the (512 - 64) vnodes it does not own, and then have to hand them off to the genuine owner of each partition. So even as handoffs occur, new vnodes are being started every 10 seconds - making it look as if transfers are not progressing.
It is necessary to start every vnode, even if this node does not own the vnode in the ring, as it may have data to hand off.
It would be preferable if the parallel starting of owned vnodes occurred before the non-parallel starting of never-started vnodes (to avoid the issue of one vnode first being started outside of the parallel path, e.g. after a ledger clean). However, the parallel starting is only prompted by a ring change - and perhaps it is necessary to start a vnode via the other path in order to trigger a ring change (as handoff will then complete).
It would be preferable if the non-parallel starting of never-started vnodes was either parallel, or at least faster. This could be fixed simply by allowing riak_core.vnode_rolling_start to be configured via riak.conf to a much higher value.
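For illustration, if the setting were honoured from the riak_core application environment, an advanced.config override might look like the following (the value 64 is purely illustrative, and it is an assumption that the vnode manager reads vnode_rolling_start from the application environment at all):

%% advanced.config sketch - illustrative only.
[
 {riak_core, [
     {vnode_rolling_start, 64}
 ]}
].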
Note that every vnode must have a start attempt via maybe_start_vnodes/2, as only calls to this function lead to the vnode being removed from the never_started list in the riak_core_vnode_manager state. However, for a vnode that is already running, an attempt to start it through this method will be a null event - the fact that it is already started is detected before any startup is prompted.