Issues with Riak documentation #14

Open
martinsumner opened this issue Nov 29, 2024 · 1 comment
martinsumner commented Nov 29, 2024

This is a perspective on the issues with the current online documentation. This issue exists as a bit of a brain dump, as much as anything to emphasise the scale of the problem should the project try to edit its way out of the documentation issues - as opposed to starting again.

The problems are divided into three categories:

  • the confusion of duplication merging with legacy;
  • some general gripes;
  • large sections that need a rewrite.

Confusion of duplication/legacy

Information is duplicated within the documentation in too many places, without any auto-generation from source, and this complicates the job of keeping things up-to-date. Also, as we maintain legacy information even where the tag is selected for the latest release, if an update is missing from one place then only legacy information is seen, and the result for the reader is commonly misleading.

Example 1 - node_confirms

node_confirms is six years old, so one of the oldest post-Basho features. The feature was designed to resolve confusion over write properties, and to replace the confused use of pw.

node_confirms is documented in the context of bucket properties. There is a box with more detail; it is largely accurate, although it doesn’t refer to the use of node_confirms on read. There is a lack of context about the key difference between pw=2 and node_confirms=2.

But if you go to the page on replication properties there is no mention of node_confirms, nor is there in PBC Store Object, PBC Set Bucket Properties etc.

Elsewhere there are references to blog posts and other artefacts that all pre-date node_confirms, and so make no mention of it. In all these places readers are instead pointed to pw.

In order to use node_confirms, you have to be lucky to land in the only place where it is documented. Landing anywhere else will give you misleading information.
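
As a rough illustration of the pw=2 versus node_confirms=2 difference mentioned above, here is a toy model in Python. It is not Riak code: the `Ack` type and both helper functions are hypothetical, and it assumes that pw counts acks from primary vnodes while node_confirms counts distinct physical nodes that acked.

```python
from typing import List, NamedTuple


class Ack(NamedTuple):
    """One write acknowledgement (hypothetical type, for illustration only)."""
    node: str      # physical node that hosts the acking vnode
    primary: bool  # True if the acking vnode is a primary for the key


def pw_met(acks: List[Ack], pw: int) -> bool:
    # Assumption: pw is satisfied by acks from primary vnodes,
    # regardless of which physical nodes host them.
    return sum(1 for a in acks if a.primary) >= pw


def node_confirms_met(acks: List[Ack], node_confirms: int) -> bool:
    # Assumption: node_confirms is satisfied by acks from distinct
    # physical nodes, regardless of primary or fallback status.
    return len({a.node for a in acks}) >= node_confirms


# Two primary vnodes that happen to live on the same physical node.
same_node = [Ack("riak1", True), Ack("riak1", True)]
# One primary and one fallback, on two different physical nodes.
two_nodes = [Ack("riak1", True), Ack("riak2", False)]
```

In this model `same_node` passes pw=2 while the data still sits on a single machine, whereas node_confirms=2 rejects it; `two_nodes` shows the opposite case.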

Example 2 - OTP version for 3.2

For Riak 3.2 we recommend the use of OTP 24. In the docs there are release notes that refer to OTP 24 but also OTP 25, and state that OTP 25 is recommended. This is a statement that went in the first release notes for 3.2.0, and was never corrected when we went to even numbers only and removed the OTP 25 recommendation because of the 22 -> 25 uplift issue.

In a separate page the recommended OTP versions for 3.2.0 are given as OTP 20, 21 and 22 - but with best performance in OTP 22.

In any case, if the reader looks at the installation guide for this version, the guidance is to download OTP R16B02-basho10.

Example 3 - Participate in Coverage

This is documented in the developers guide for using secondary indexes, where it is incorrectly stated that it requires a node restart to manage (it can be changed at run-time from remote_console).

This is not documented anywhere in the configuration reference, although it is a key item of configuration. Neither is it documented in the guidelines for managing recovery in clusters (the primary purpose of the configuration is to have it disabled while a node is recovering from failure - and only re-enable it when AAE confirms recovery is complete). It is documented in the guidelines for 2i, but not in the documentation of other services which depend on coverage queries (nextgenrepl full-sync, and aae_fold).

Some general gripes

There are a significant number of house-keeping issues:

There is a lot of out-of-date guidance.

  • Apparently we don’t test on any version of Ubuntu in active support
  • Incorrectly recommending non-default Erlang VM tuning (for leveled anyway)
  • Recommending ZFS for storage, but not on any OS that is supported
  • There are kernel, network and vm tuning guidelines which I suspect haven’t been tested for ten years
  • We still document riak_control, and say it is installed as part of the release
  • Even for 3.2.0 all the guidance for logging is based on lager
  • None of the OS-specific riak operation guides mention systemd
  • There are various references to old 2.0 upgrades, using vm.args
  • Overly conservative recommendations re large objects (100KB + average is fine, > 2MB not such a big deal) and disk space (upgrade at only 50%-60%?)
  • Handoff configuration tuning still based on older versions (with none of the actual config parameters that help with transfer batch control, or the secret forced_ownership_handoff)
  • Pages for JMX and SNMP
  • Links to third-party monitoring plugins that haven’t been touched for ten years
  • Runtime interaction: what even is this?
  • All Erlang commands still being run by riak attach, not riak remote_console
  • Can't use security unless you use R16B

Some of the docs were always bad and misleading, and that debt remains.

Section re-writes required

There are some significant re-writes required on:

  • backend choices
  • deletes
  • FAQ
  • AWS/Cloud

Backend documentation needs an overhaul. In 3.2.0 only bitcask and leveled are tested prior to release (plus a minimal pass over eleveldb if we made any changes we have concerns about). Leveled is mentioned, but not included in things like backend comparison tables. There is a recommendation in one place to start with multi-backend, whereas I would suggest avoiding multi-backend unless you have no other choice. I think we’ve not formally deprecated the in-memory backend, but that’s a mistake: it really shouldn’t be used, and isn’t tested.

The leveled configuration reference is missing some very important items, most notably the reload_recalc option, but snapshot timeout configuration and zstd compression are also important omissions (zstd can have significant performance advantages).

The recommendation now should be to default to leveled, and use bitcask only when object mutation is rare (and anti-entropy or multiple clusters are not required). Other backends, including multi-backend support, exist for backwards compatibility only.
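
The recommendation above can be written down as a small decision helper. This is only a sketch of the rule as stated in this issue; the function name and its three boolean parameters are hypothetical and not part of any Riak tooling.

```python
from typing import List


def acceptable_backends(
    mutation_is_rare: bool,
    needs_anti_entropy: bool,
    multiple_clusters: bool,
) -> List[str]:
    """Return the backends the recommendation would allow (hypothetical helper)."""
    # leveled is always the default choice.
    backends = ["leveled"]
    # bitcask is only an option when objects rarely change and neither
    # anti-entropy nor multi-cluster replication is needed.
    if mutation_is_rare and not (needs_anti_entropy or multiple_clusters):
        backends.append("bitcask")
    return backends
```

Multi-backend, eleveldb and the in-memory backend are deliberately never returned, reflecting the view that they exist for backwards compatibility only.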

For deletes, the issues related to expiry and AAE don’t appear to be touched upon at all. The delete_mode of keep is recommended, which is good (although we don’t make it our default), but I don’t think there is anywhere near enough information to explain the consequences of this: the resurrection of tombstones is not covered at all, and there is no reference to kv679 and why delete mode is important. The reaper and eraser are not referred to directly (just the aae_folds), so there is nothing on how to use them sensibly, or how to monitor reap and erase jobs. There is nothing about tombstone_pause, handoff_deletes or dollarkey_readtombs.

The behaviour of deletes really confuses operators. The best resource for explaining it used to be a post on the mailing list; it would be much nicer to have this summarised in the documents.

The FAQ is wrong on so many questions that there's a need to start again.

For AWS guidance there is an installation page, and a tuning page which is more of a general guide. The installation page recommends Riak instances from the VM market place … I suspect they may not give a user an up-to-date version of Riak. The performance guide is very old, gives bad advice, and fails to advertise a key strength of Riak. There are now AWS instances optimised for Riak-like databases (im4gn), and it is a key feature of Riak that it runs on these low-cost instances reliably (with ARM CPUs, and able to deal with ephemeral storage). Support for placement groups in Riak is a key feature (it is hard to find anything in the documents about v2 vs v3 vs v4 cluster claim), as are readable backups to S3. There is no guidance about working with AZs, regions etc.

@tburghart
Member

Depressing as it is, thanks for writing this.
