This is a perspective on the issues with the current online documentation. This issue exists as a bit of a brain dump, as much as anything to emphasise the scale of the problem should the project try to edit its way out of the documentation issues - as opposed to starting again.
The problems are divided into three categories:
the confusion of duplication merging with legacy;
some general gripes;
large sections that need a rewrite.
Confusion of duplication/legacy
Information is duplicated within the documentation in too many places, without any auto-generation from source, and this complicates the job of keeping things up-to-date. Also, as we maintain legacy information even where the tag is selected for the latest release, if an update is missing from one place then only legacy information is seen there, and the result for the reader is commonly misleading.
Example 1 - node_confirms
node_confirms is six years old, so one of the oldest post-Basho features. The feature was designed to resolve confusion over write properties, and to replace the confused use of pw.
node_confirms is documented in the context of bucket properties. There is a box with more detail, which is largely accurate, although it doesn’t refer to the use of node_confirms on read. There is a lack of context about the key difference between pw=2 and node_confirms=2.
But if you go to the page on replication properties there is no mention of node_confirms, nor in PBC Store Object, nor in PBC Set Bucket Properties etc.
Elsewhere there are references to blog posts and other artefacts that all pre-date node_confirms, and so make no mention of it. In all these places readers are instead pointed towards pw.
In order to use node_confirms, you have to be lucky to land in the only place where it is documented. Landing anywhere else will give you misleading information.
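The missing context on pw=2 versus node_confirms=2 can be sketched with a small model. This is an illustration only, not Riak code: the Ack type and the two helper functions are hypothetical names, and the model assumes that pw counts acknowledgements from primary vnodes while node_confirms counts distinct physical nodes that acknowledged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ack:
    """One write acknowledgement (hypothetical model, not a Riak type)."""
    vnode: int      # partition that acknowledged the write
    node: str       # physical node hosting that vnode
    primary: bool   # True for a primary vnode, False for a fallback


def pw_met(acks, pw):
    # Assumption: pw is satisfied by counting primary vnodes only.
    return sum(1 for a in acks if a.primary) >= pw


def node_confirms_met(acks, node_confirms):
    # Assumption: node_confirms is satisfied by counting distinct physical nodes.
    return len({a.node for a in acks}) >= node_confirms


# Two primary vnodes that happen to live on the same physical node.
same_node = [Ack(0, "riak1", True), Ack(1, "riak1", True)]

# One primary and one fallback, on two different physical nodes.
with_fallback = [Ack(0, "riak1", True), Ack(2, "riak3", False)]

print(pw_met(same_node, 2), node_confirms_met(same_node, 2))
print(pw_met(with_fallback, 2), node_confirms_met(with_fallback, 2))
```

In this model the first case passes pw=2 even though a single node failure would lose both copies, while the second case fails pw=2 but does have the write on two separate machines.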
Example 2 - OTP version for 3.2
For Riak 3.2 we recommend the use of OTP 24. In the docs there are release notes that refer to OTP 24 but also OTP 25, and state that OTP 25 is recommended. This statement went into the first release notes for 3.2.0, and was never corrected when we went to even numbers only and removed the OTP 25 recommendation because of the 22 -> 25 uplift issue.
In a separate page the recommended OTP versions for 3.2.0 are given as OTP 20, 21 and 22 - but with best performance in OTP 22.
In all cases, if the reader looks at the installation guide for this version, the guidance is to download OTP R16B02-basho10.
Example 3 - Participate in Coverage
This is documented in the developers guide for using secondary indexes, where it is incorrectly stated that a node restart is required to change it (it can be changed at run-time from remote_console).
This is not documented anywhere in the configuration reference, although it is a key item of configuration. Neither is it documented in the guidelines for managing recovery in clusters (the primary purpose of the configuration is to have it disabled while a node is recovering from failure, and only re-enabled when AAE confirms recovery is complete). It is documented in the guidelines for 2i, but not in the documentation of other services which depend on coverage queries (nextgenrepl full-sync, and aae_fold).
Some general gripes
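There are a significant number of house-keeping issues, there is a lot of out-of-date guidance, and some of the docs were always bad and misleading, so that debt remains: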
Numerous pages with a broken [use admin riak admin] link
In some places readers are directed to use the mailing list
References to EE and other legacy names (even when using the tag of the latest release)
References to services basho provide
Lack of clarity of where to go for community support, various references to raising tickets with TI Tokyo, even a reference to Erlang Solutions.
None of the OS-specific riak operation guides mention systemd
There are various references to old 2.0 upgrades, using vm.args
Overly conservative recommendations re large objects (100KB + average is fine, > 2MB not such a big deal) and disk space (upgrade at only 50%-60%?)
Handoff configuration tuning is still based on older versions (there is no mention of the actual config parameters that help with transfer batch control, or of the secret forced_ownership_handoff)
References to riak attach, not riak remote_console
Pages for JMX and SNMP
Links to third-party monitoring plugins that haven’t been touched for ten years
Section re-writes required
There are some significant re-writes required:
Backend documentation needs an overhaul. In 3.2.0 only bitcask and leveled are tested prior to release (plus a minimal pass over eleveldb if we made any changes we have concerns about). Leveled is mentioned, but not included in things like backend comparison tables. There is a recommendation in one place to start with multi-backend, whereas I would suggest avoiding multi-backend unless you have no other choice. The in-memory backend we’ve not formally deprecated, I think, but that’s a mistake: it really shouldn’t be used, and isn’t tested.
The leveled configuration reference is missing some very important items, most notably the reload_recalc option, but snapshot timeout configuration and zstd compression are also important omissions (zstd can have significant performance advantages).
The recommendation now should be to default to leveled, and use bitcask only when object mutation is rare (and anti-entropy or multiple clusters are not required). Other backends, including multi-backend support, exist for backwards compatibility only.
For deletes, the issues related to expiry and AAE don’t appear to be touched upon at all. The delete_mode of keep is recommended, which is good (although we don’t make it our default), but I don’t think there is anywhere near enough information to explain the consequences of this - the resurrection of tombstones is not covered at all, and there is no reference to kv679 and why delete mode is important. The reaper and eraser are not referred to directly (just the aae_folds), so there is nothing on how to use them sensibly, or how to monitor reap and erase jobs. There is nothing about tombstone_pause, handoff_deletes, dollarkey_readtombs.
The behaviour of deletes really confuses operators. The best resource for explaining it used to be a post on the mailing list; it would be much nicer to have this summarised in the documents.
The FAQ is wrong on so many questions, there's a need to start again.
For AWS guidance there is an installation page, and a tuning page which is more of a general guide. The installation page recommends Riak instances from the VM market place … I suspect they may not give a user an up-to-date version of Riak. The performance guide is very old, gives bad advice, and fails to advertise a key strength of Riak. There are now AWS instances optimised for Riak-like databases (im4gn), and it is a key feature of Riak that it runs on these low-cost instances reliably (with ARM CPUs, and able to deal with ephemeral storage). Support for placement groups in Riak is a key feature (it is hard to find anything in the documents about v2 vs v3 vs v4 cluster claim), as are readable backups to S3. There is no guidance about working with AZs, regions etc.