From 5b469185c60721d947a6f9dd025904cdbf71f046 Mon Sep 17 00:00:00 2001
From: Lorenzo Mangani
Date: Mon, 18 Sep 2023 23:03:37 +0200
Subject: [PATCH] Create garbage_collector.md

Signed-off-by: Lorenzo Mangani
---
 docs/garbage_collector.md | 64 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)
 create mode 100644 docs/garbage_collector.md

diff --git a/docs/garbage_collector.md b/docs/garbage_collector.md
new file mode 100644
index 0000000..e825716
--- /dev/null
+++ b/docs/garbage_collector.md
@@ -0,0 +1,64 @@
+# Job of Garbage Collector
+
+The `garbage_collector` service is responsible for cleaning up the
+`parquet_file` table in the catalog, as well as eventually removing
+old, unneeded parquet files in the object store.
+
+## Background
+As IOx ingests data, it stores that data as parquet files in the object
+store. Each parquet file created is recorded in a row of the
+`parquet_file` table in the catalog.
+
+Over time, the [compactor](compactor.md) creates new, more optimized
+parquet files by combining several pre-existing files. When a new file
+is successfully created in the object store, a new row is added to
+`parquet_file` and the previous files with the same data are "soft
+deleted" by setting the value of `parquet_file.to_delete` to the
+current time.
+
+Without the garbage collector, the size of the `parquet_file` table
+and the number of files in the object store would grow without bound.
+
+# Interaction with Querier
+
+The querier caches entries from the `parquet_file` table to answer
+queries and periodically refreshes this cache. This cache means that
+even after a row in `parquet_file` has been marked as `to_delete` or
+actually deleted, the querier may still attempt to read the underlying
+parquet file from the object store until its cache is refreshed.
+
+# Configuration
+
+There are two key configuration knobs that control the behavior of the
+garbage collector:
+
+* `INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF`: this setting controls when
+  rows are deleted from the `parquet_file` table. Any row whose
+  `to_delete` timestamp is older than this interval will be
+  deleted from the catalog.
+
+* `INFLUXDB_IOX_GC_OBJECTSTORE_CUTOFF`: this setting controls when
+  objects are actually deleted from the object store. Any object that
+  was created (according to the object store timestamp) longer than
+  this interval ago and is not referenced in the catalog's `parquet_file` table
+  will be deleted.
+
+# Frequently Asked Questions
+
+Q: Why do we need two cutoffs?
+
+A: The querier relies on objects not being deleted until its caches
+are refreshed. For example, if `INFLUXDB_IOX_GC_OBJECTSTORE_CUTOFF` is
+set to `90 days` but a parquet file is a year old, as soon as the row
+is removed from `parquet_file` the object may be deleted from the
+object store, even though it is still referred to in the querier
+cache. Thus `INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF` must be set
+sufficiently high to ensure the querier cache is refreshed before
+objects are candidates for deletion.
+
+Q: Why not delete objects immediately when
+`INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF` expires?
+
+A: Database backups contain references to parquet files. To ensure
+that no file referred to by these backups is deleted,
+`INFLUXDB_IOX_GC_OBJECTSTORE_CUTOFF` can be set sufficiently high.
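
Note on the patch: the two cutoffs described under Configuration can be read as two independent checks, one against the catalog row and one against the object itself. The following is a minimal sketch of that reading, not the IOx implementation. The function names `catalog_row_expired` and `object_deletable`, the constants, and the cutoff values are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical stand-ins for the two settings; the values are illustrative only.
PARQUETFILE_CUTOFF = timedelta(days=14)
OBJECTSTORE_CUTOFF = timedelta(days=30)


def catalog_row_expired(to_delete, now, cutoff=PARQUETFILE_CUTOFF):
    """True if a soft-deleted row was marked longer ago than the cutoff."""
    # Rows that were never soft deleted (to_delete is None) are always kept.
    return to_delete is not None and now - to_delete > cutoff


def object_deletable(created_at, referenced, now, cutoff=OBJECTSTORE_CUTOFF):
    """True if an object is older than the cutoff and no catalog row references it."""
    return not referenced and now - created_at > cutoff


now = datetime(2023, 9, 18, tzinfo=timezone.utc)

# Soft deleted 20 days ago: the catalog row is past the 14 day cutoff.
row_gone = catalog_row_expired(now - timedelta(days=20), now)

# This object was created 25 days ago. Even with its row gone it survives,
# because it is younger than the 30 day object store cutoff.
young_object = object_deletable(now - timedelta(days=25), referenced=False, now=now)

# An object created a year ago becomes deletable as soon as its row is removed.
old_object = object_deletable(now - timedelta(days=365), referenced=False, now=now)

print(row_gone, young_object, old_object)
```

The last case mirrors the first FAQ answer: for an old file, the object store cutoff offers no protection once the row is gone, so the catalog cutoff is what gives the querier time to refresh its cache.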