Issues with Riak documentation #14

martinsumner · 2024-11-29T18:26:32Z

This is a perspective on the issues with the current online documentation. This issue is a bit of a brain dump, written mainly to emphasise the scale of the problem should the project try to edit its way out of the documentation issues rather than start again.

The problems are divided into three categories:

the confusion of duplication merging with legacy;
some general gripes;
large sections that need a rewrite.

Confusion of duplication/legacy

Information is duplicated within the documentation in too many places, without any auto-generation from source, and this complicates the job of keeping things up-to-date. Also, because we maintain legacy information even where the tag is selected for the latest release, if an update is missing from one place then only legacy information is seen, and the result for the reader is commonly misleading.

Example 1 - node_confirms

node_confirms is six years old, so one of the oldest post-basho features. The feature was designed to resolve confusion over write properties, and to replace the confused use of pw.

node_confirms is documented in the context of bucket properties. There is a box with more detail, which is largely accurate, although it doesn’t refer to the use of node_confirms on read. There is a lack of context about the key difference between pw=2 and node_confirms=2.

But if you go to the page on replication properties there is no mention of node_confirms, nor in PBC Store Object, nor in PBC Set Bucket Properties etc.

Elsewhere there are references to blog posts and other artefacts that all pre-date node_confirms, and so make no mention of it. In all these places readers are instead pointed to use pw.

In order to use node_confirms, you have to be lucky to land in the only place where it is documented. Landing anywhere else will give you misleading information.
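As a rough illustration of the distinction the docs could draw, here is a toy model (not Riak's implementation). It assumes that pw counts acknowledgements from primary vnodes, while node_confirms counts the distinct physical nodes that acknowledged. The `Ack` type and both helper functions are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ack:
    """One successful vnode write acknowledgement (toy model)."""
    node: str      # physical node hosting the vnode
    primary: bool  # True for a primary vnode, False for a fallback


def pw_met(acks, pw):
    """Assumption: pw counts acknowledgements from primary vnodes."""
    return sum(1 for ack in acks if ack.primary) >= pw


def node_confirms_met(acks, node_confirms):
    """Assumption: node_confirms counts distinct physical nodes."""
    return len({ack.node for ack in acks}) >= node_confirms


# Two primary vnodes that happen to live on the same physical node.
same_node = [Ack("node_a", True), Ack("node_a", True)]
# One primary and one fallback vnode on different physical nodes.
two_nodes = [Ack("node_a", True), Ack("node_b", False)]

print(pw_met(same_node, 2), node_confirms_met(same_node, 2))  # True False
print(pw_met(two_nodes, 2), node_confirms_met(two_nodes, 2))  # False True
```

Under these assumptions pw=2 can be satisfied by a single physical node, whereas node_confirms=2 cannot, which is the kind of context a reader needs when choosing between them.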

Example 2 - OTP version for 3.2

For Riak 3.2 we recommend the use of OTP 24. In the docs there are release notes that refer to OTP 24 but also OTP 25, and state that OTP 25 is recommended. This statement went into the first release notes for 3.2.0, and was never corrected when we went to even numbers only and removed the OTP 25 recommendation because of the 22 -> 25 uplift issue.

In a separate page the recommended OTP versions for 3.2.0 are given as OTP 20, 21 and 22 - but with best performance in OTP 22.

In all cases, if the reader looks at the installation guide for this version, the guidance is to download OTP R16B02-basho10.
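Contradictions like this could in principle be caught mechanically. The following is a minimal sketch, assuming the documentation pages are available as plain text. The page names and sample wording are hypothetical, loosely paraphrasing the three pages described above, and the regular expression is only a first guess at how OTP versions are written.

```python
import re

# Matches "OTP 24" as well as older-style names such as "OTP R16B02-basho10".
OTP_PATTERN = re.compile(r"\bOTP\s+(R?\d+\w*(?:-\w+)*)")


def otp_versions(text):
    """Return the set of OTP versions named in a page's text."""
    return set(OTP_PATTERN.findall(text))


def conflicting_pages(pages):
    """Return {page: versions} if pages name different OTP versions, else {}."""
    found = {name: otp_versions(body) for name, body in pages.items()}
    distinct = {frozenset(versions) for versions in found.values() if versions}
    return found if len(distinct) > 1 else {}


# Hypothetical page names and wording, for illustration only.
pages = {
    "release-notes": "Tested with OTP 24 and OTP 25; OTP 25 is recommended.",
    "otp-versions": "Supports OTP 20, OTP 21 and OTP 22.",
    "installing": "Download OTP R16B02-basho10.",
}

for name, versions in conflicting_pages(pages).items():
    print(name, sorted(versions))
```

A check like this only reports that pages disagree. Deciding which page is right still needs someone who knows the release.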

Example 3 - Participate in Coverage

This is documented in the developers guide for using secondary indexes, where it is incorrectly stated that it requires a node restart to manage (it can be changed at run-time from remote_console).

This is not documented anywhere in the configuration reference, although it is a key item of configuration. Neither is it documented in the guidelines for managing recovery in clusters (the primary purpose of the configuration is to have it disabled while a node is recovering from failure - and only re-enable it when AAE confirms recovery is complete). It is documented in the guidelines for 2i, but not in the documentation of other services which depend on coverage queries (nextgenrepl full-sync, and aae_fold).

Some general gripes

There are a significant number of house-keeping issues:

Numerous pages with a broken [use admin riak admin] link
In some places readers are directed to use the mailing list
References to EE and other legacy names (even when using the tag of the latest release)
References to services basho provide
Lack of clarity about where to go for community support: various references to raising tickets with TI Tokyo, even a reference to Erlang Solutions.
Some obviously broken bits, such as a Troubleshooting page where the only content is an explanation of a 204 http response, and a bucket types usage page which talks only of the _dont_index sentinel

There is a lot of out-of-date guidance.

Apparently we don’t test on any version of Ubuntu in active support
Incorrectly recommending non-default Erlang VM tuning (for leveled anyway)
Recommending ZFS for storage, but not on any OS that is supported
There are kernel, network and vm tuning guidelines which I suspect haven’t been tested for ten years
We still document riak_control, and say it is installed as part of the release
Even for 3.2.0 all the guidance for logging is based on lager
None of the OS-specific riak operation guides mention systemd
There are various references to old 2.0 upgrades, using vm.args
Overly conservative recommendations re large objects (100KB + average is fine, > 2MB not such a big deal) and disk space (upgrade at only 50%-60%?)
Handoff configuration tuning still based on older versions (none of the actual config parameters that help with transfer batch control, or the secret forced_ownership_handoff)
Pages for JMX and SNMP
Links to third-party monitoring plugins that haven’t been touched for ten years
Runtime interaction: what even is this?
All Erlang commands are still run via riak attach, not riak remote_console
Can't use security unless you use R16B
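Many of the items above are findable by text search. The following is a minimal sketch of such a scan. The term list is drawn from the gripes in this issue and is an assumption about what is worth flagging, and the sample text is hypothetical. A hit is a prompt for review, not proof of an error.

```python
# Terms taken from the lists above; adjust to taste.
LEGACY_TERMS = (
    "riak attach",
    "lager",
    "vm.args",
    "riak_control",
    "mailing list",
    "R16B",
)


def find_legacy_terms(text, terms=LEGACY_TERMS):
    """Return (line_number, term) for each term found, case-insensitively."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for term in terms:
            if term.lower() in lowered:
                hits.append((number, term))
    return hits


# Hypothetical documentation excerpt, for illustration only.
sample = """Run riak attach to open a console.
Logging is configured through lager.
Use riak remote_console for run-time changes."""

for hit in find_legacy_terms(sample):
    print(hit)
```

On the sample this flags lines 1 and 2 and leaves line 3 alone.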

Some of the docs were always bad and misleading, and that debt remains:

explanation of lww = true/false, notfound_ok
The difficulty of discovering conflicts between features, i.e. what does write_once actually break, what will mess with AAE, break full-sync etc
No mention of term_regex, and/or overloading terms in 2i, misleading statements about problems of scale, confusion over recommending alternatives; the docs at one place even state that Map/Reduce is the preferred method of non primary key lookup
Almost no mention of scaling issues in CRDTs; including in a recommendation to keep a set as an alternative to using list keys

Section re-writes required

There are some significant re-writes required on:

backend choices
deletes
FAQ
AWS/Cloud

Backend documentation needs an overhaul. In 3.2.0 only bitcask and leveled are tested prior to release (plus a minimal pass over eleveldb if we made any changes we have concerns about). Leveled is mentioned, but not included in things like backend comparison tables. There is a recommendation in one place to start with multi-backend, whereas I would suggest avoiding multi-backend unless you have no other choice. I don’t think we’ve formally deprecated the in-memory backend, but that’s a mistake: it really shouldn’t be used, and isn’t tested.

The leveled configuration reference is missing some very important items, most noticeably the reload_recalc option, but snapshot timeout configuration and zstd compression are also important omissions (zstd can have significant performance advantages).

The recommendation now should be to default to leveled, and use bitcask only when object mutation is rare (and anti-entropy or multiple clusters are not required). Other backends, including multi-backend support, exist for backwards compatibility only.

For deletes, the issues related to expiry and AAE don’t appear to be touched upon at all. The delete_mode of keep is recommended, which is good (although we don’t make it our default), but I don’t think there is anywhere near enough information to explain the consequences of this: the resurrection of tombstones is not covered at all, and there is no reference to kv679 or why delete mode is important. The reaper and eraser are not referred to directly (just the aae_folds), so there is nothing on how to use them sensibly, or how to monitor reap and erase jobs. There is nothing about tombstone_pause, handoff_deletes or dollarkey_readtombs.

The behaviour of deletes really confuses operators. The best resource for explaining it used to be a post on the mailing list; it would be much nicer to have this summarised in the documents.
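One thing such a summary would need to convey is why removing tombstones early is risky. Below is a deliberately simplified toy model, not Riak's code: integer versions stand in for vector clocks, a read simply takes the highest version and repairs every replica, and `reap_tombstones=False` only loosely corresponds to a delete_mode of keep. All names are hypothetical.

```python
def read_with_repair(replicas):
    """Return the winning value and copy it to every replica (toy read repair)."""
    present = [value for value in replicas.values() if value is not None]
    if not present:
        return None
    winner = max(present)  # tuples compare by version first
    for name in replicas:
        replicas[name] = winner
    return winner


def delete_then_read(reap_tombstones):
    """Delete an object while one replica is unreachable, then read it back."""
    # Each value is (version, is_tombstone); version 1 is on all three replicas.
    replicas = {"a": (1, False), "b": (1, False), "c": (1, False)}
    # The delete reaches a and b only; c is unreachable and keeps version 1.
    replicas["a"] = replicas["b"] = (2, True)
    if reap_tombstones:
        # Tombstones are removed before c has seen the delete.
        replicas["a"] = replicas["b"] = None
    return read_with_repair(replicas)


print(delete_then_read(reap_tombstones=True))   # (1, False): the old object is back
print(delete_then_read(reap_tombstones=False))  # (2, True): the tombstone wins
```

In this model, keeping the tombstone lets the delete win when the stale replica returns, while reaping it too early lets the stale copy win the read.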

The FAQ is wrong on so many questions that there's a need to start again.

For AWS guidance there is an installation page and a tuning page which is more of a general guide. The installation page recommends Riak instances from the VM market place … I suspect they may not give a user an up-to-date version of Riak. The performance guide is very old, gives bad advice, and fails to advertise a key strength of Riak. There are now AWS instances optimised for Riak-like databases (im4gn), and it is a key feature of Riak that it runs on these low-cost instances reliably (with ARM CPUs, and able to deal with ephemeral storage). Support for placement groups in Riak is a key feature (it is hard to find anything in the documents about v2 vs v3 vs v4 cluster claim), as are readable backups to S3. There is no guidance about working with AZs, regions etc.

tburghart · 2024-12-04T12:43:47Z

Depressing as it is, thanks for writing this.
