-
Notifications
You must be signed in to change notification settings - Fork 2
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Signed-off-by: Lorenzo Mangani <[email protected]>
- Loading branch information
Showing
1 changed file
with
64 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
# Job of Garbage Collector | ||
|
||
The `garbage_collector` service is responsible for cleaning up the | ||
`parquet_file` table in the catalog, as well as eventually removing | ||
old uneeded parquet files in the object store. | ||
|
||
## Background | ||
As IOx ingests data, it is stored as parquet files on object | ||
store. Each parquet file created is recorded in a row of the | ||
`parquet_file` table in the catalog. | ||
|
||
Over time, the [compactor](compactor.md) creates new, more optimized | ||
parquet files by combining several pre-existing files. When a new file | ||
is successfully created on object_store, a new row is added to | ||
`parquet_file` and the previous files with the same data are "soft | ||
deleted" by setting the value of the `parquet_file.to_delete` to the | ||
current time. | ||
|
||
Without the garbage collector, the size of the `parquet_file` table | ||
and the number of files in object store would grow without bound. | ||
|
||
# Interaction with Querier | ||
|
||
The Querier caches entries from the `parquet_file` table to answer | ||
queries and periodically refreshes this cache. This cache means that | ||
even after a row in `parquet_file` has been marked as `to_delete` or | ||
actually deleted, the querier may still attempt to read the underlying | ||
parquet file from object store until its cache is refreshed. | ||
|
||
# Configuration | ||
|
||
There are two key configuration knobs that control the behavior of the | ||
garbage collector: | ||
|
||
* `INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF`: this setting controls when | ||
rows are deleted from the `parquet_file` table. Any row that has a | ||
value of `to_delete` that is greater than this setting will be | ||
deleted from the catalog. | ||
|
||
* `INFLUXDB_IOX_GC_OBJECTSTORE_CUTOFF`: this setting controls when | ||
objects are actually deleted from the object store. Any object that | ||
was created (according to the object store timestamp) longer than | ||
this interval ago and is not referenced in the catalog's `parquet_file` table | ||
will be deleted. | ||
|
||
# Frequently Asked Questions | ||
|
||
Q: Why do we need two cutoffs? | ||
|
||
A: The querier relies on objects not being deleted until its caches | ||
are refreshed. For example, if `INFLUXDB_IOX_GC_OBJECTSTORE_CUTOFF` is | ||
set to `90 days` but a parquet file is a year old, as soon as the row | ||
is removed from `parquet_file` the object may be deleted from the | ||
object store, even though it is still referred to in the Querier | ||
cache. Thus `INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF` must be set | ||
sufficiently high to ensure the querier cache is refreshed before | ||
objects are candidates for deletion. | ||
|
||
Q: Why not delete objects immediately when | ||
`INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF` expires? | ||
|
||
A: Database backups contain references to parquet files. In order to | ||
ensure all files referred to by these backups are not deleted, | ||
`INFLUXDB_IOX_GC_OBJECTSTORE_CUTOFF` can be set sufficiently high. |