For larger ring-sizes, long startup times can be seen for individual nodes.
There is some confusion between parameters as well, in particular riak_core.vnode_rolling_start and riak_core.vnode_parallel_start; both default to 16.
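Assuming both are read from the riak_core application environment with the stated defaults (a guess at the mechanism; the real code may use a helper module), a minimal sketch of the lookups would be:

-module(vnode_start_settings).
-export([parallel_start/0, rolling_start/0]).

%% Maximum number of owned vnodes started in parallel per batch.
parallel_start() ->
    application:get_env(riak_core, vnode_parallel_start, 16).

%% Number of serial start attempts per management tick.
rolling_start() ->
    application:get_env(riak_core, vnode_rolling_start, 16).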
There are comments about races and hacks in riak_core/src/riak_core_ring_handler.erl (Lines 126 to 130 in 25d9a6f):
%% NOTE: This following is a hack. There's a basic
%% dependency/race between riak_core (want to start vnodes
%% right away to trigger possible handoffs) and riak_kv
%% (needed to support those vnodes). The hack does not fix
%% that dependency: internal techdebt todo list #A7 does.
It appears that the tick on the riak_core_vnode_manager prompts two code paths which both result in starting vnodes (a simplified sketch is given below the list):
a call to maybe_ring_changed/4 which will result in a call to riak_core_ring_handler:ensure_vnodes_started/1;
a call to maybe_start_vnodes/2.
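A simplified sketch of that fan-out, assuming a single tick handler in riak_core_vnode_manager; the function bodies here are stand-ins rather than the real implementation:

-module(vnode_mgr_tick_sketch).
-export([handle_management_tick/1]).

handle_management_tick(State) ->
    %% Path 1: react to any ring change by ensuring owned vnodes are
    %% started (via riak_core_ring_handler:ensure_vnodes_started/1).
    State1 = maybe_ring_changed(State),
    %% Path 2: independently attempt a limited batch of never-started
    %% vnodes, constrained by riak_core.vnode_rolling_start.
    maybe_start_vnodes(State1).

%% Stand-in for riak_core_vnode_manager:maybe_ring_changed/4.
maybe_ring_changed(State) ->
    State.

%% Stand-in for riak_core_vnode_manager:maybe_start_vnodes/2.
maybe_start_vnodes(State) ->
    State.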
The behaviour of maybe_ring_changed is dependent on the existence of a start_vnodes/1 function - which exists in riak_kv, but not in riak_pipe. Within riak_kv, this will call riak_core_vnode_manager:get_vnode_pid/2 with a list of all the vnodes to start, but starting those vnodes will be constrained by the number set in riak_core.vnode_parallel_start. For riak_pipe, this will call riak_core_vnode_manager:get_vnode_pid/2 for each vnode in turn. Either way, this path will only prompt the start of vnodes for which this node is the designated owner in the ring - in parallel (in batches of riak_core.vnode_parallel_start) for riak_kv, and in series (for all of them) for riak_pipe.
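A rough sketch of that export check, under the assumption that ensure_vnodes_started/1 behaves broadly as described above (module and function names below are stand-ins, not the exact riak_core code):

-module(ensure_vnodes_sketch).
-export([start_owned_vnodes/2]).

%% Mod is the per-application vnode module (e.g. riak_kv_vnode);
%% Startable is the list of partition indexes this node owns.
start_owned_vnodes(Mod, Startable) ->
    case erlang:function_exported(Mod, start_vnodes, 1) of
        true ->
            %% riak_kv-style path: one call with the whole list, started
            %% in parallel batches limited by riak_core.vnode_parallel_start.
            Mod:start_vnodes(Startable);
        false ->
            %% riak_pipe-style path: one call per index, in series.
            [Mod:start_vnode(Idx) || Idx <- Startable]
    end.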
The maybe_start_vnodes/2 function is constrained by riak_core.vnode_rolling_start. It receives that many tokens and then schedules that many attempts (16 by default) to start a not-yet-started vnode in a loop, where each attempt must wait for the previous attempt to clear. There is no parallel starting through this path: riak_core.vnode_rolling_start controls the number of attempts per 10s management tick, and each attempt is triggered only after the last has completed. This path will try to start every vnode that has never been started on this node before - regardless of whether this node owns the vnode in the ring (as it may need to hand off a vnode it doesn't own if it has started with some historic data present under that vnode).
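A sketch of that serial loop, intended only to show the token behaviour and not the actual maybe_start_vnodes/2 implementation (start_if_not_running/2 below is a hypothetical helper):

-module(rolling_start_sketch).
-export([rolling_start/2]).

%% NeverStarted is a list of {Mod, Idx} pairs that have never been
%% started on this node; Tokens is riak_core.vnode_rolling_start.
rolling_start(NeverStarted, 0) ->
    NeverStarted;
rolling_start([], _Tokens) ->
    [];
rolling_start([{Mod, Idx} | Rest], Tokens) ->
    %% Each attempt is synchronous, so the next attempt is not made
    %% until this one has cleared - one slow vnode start stalls the
    %% whole batch for this tick.
    ok = start_if_not_running(Mod, Idx),
    rolling_start(Rest, Tokens - 1).

%% Hypothetical helper: a vnode that is already running is a null
%% event, detected before any startup is prompted.
start_if_not_running(_Mod, _Idx) ->
    ok.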
The net effect for riak_kv is that (possibly) the same 16 vnodes are requested to start each management tick by two different paths (assuming startup is very fast because the vnodes are empty). If the first vnode triggered via maybe_start_vnodes/2 is slow to start up (e.g. because it has a lot of data, or is repairing), then the 16 parallel requests will be blocked behind it in the message queue, and no parallel starts will be prompted until that first request is released.
Consider a node that is already in the ring but is restarted. If the ring-size is 512 and the node owns 1/8th of the ring, then its 64 owned vnodes will be started in 4 loops by the maybe_ring_changed/4 call triggered on the management tick - so the vnodes it needs in order to function are started within 4 ticks (assuming all start within 10s).
However, all the other vnodes must wait to be started 16 at a time via maybe_start_vnodes/2. This applies to both riak_pipe and riak_kv - so it will take at least 2 * (512 - 64) / 16 loops, or around ten minutes.
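Spelling that arithmetic out with the defaults of 16 and a 10s tick:
owned vnodes: 512 / 8 = 64, started in 64 / 16 = 4 ticks (roughly 40 seconds);
remaining vnodes per application: 512 - 64 = 448;
rolling starts needed: 2 * 448 / 16 = 56 ticks, i.e. about 560 seconds, or a little over 9 minutes.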
This creates confusion when looking at riak admin transfers. Each node will start the (512 - 64) vnodes it does not own, and then have to hand them off to the genuine owner of each partition. So even as handoffs occur, new vnodes are being started every 10 seconds - making it look as if transfers are not progressing.
It is necessary to start every vnode, even if this node does not own the vnode in the ring, as it may have data to hand off.
It would be preferable if the parallel starting of owned vnodes occurred before the non-parallel starting of never-started vnodes (to avoid the issue of one vnode first being started outside of the parallel path, e.g. after a ledger clean). However, the parallel starting is only prompted by a ring change - and perhaps it is necessary to start a vnode via the other path in order to trigger a ring change (as handoff will then complete).
It would be preferable if the non-parallel starting of never-started vnodes was either parallel, or at least faster. This could be fixed simply by allowing riak_core.vnode_rolling_start to be configured via riak.conf to a much higher value.
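For illustration, if the setting were honoured from the riak_core application environment, an advanced.config override might look like the following (the value 64 is purely illustrative, and it is an assumption that the vnode manager reads vnode_rolling_start from the application environment at all):

%% advanced.config sketch - illustrative only.
[
 {riak_core, [
     {vnode_rolling_start, 64}
 ]}
].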
Note that every vnode must have a start attempt via maybe_start_vnodes/2, as only calls to this function lead to the vnode being removed from the never_started list in the riak_core_vnode_manager state. However, for a vnode that is already running, an attempt to start it through this method will be a null event - the fact that it is already started is detected before any startup is prompted.