Reimplement BRIN internals for AO/CO tables
Motivation:

For AO/CO tables, we have the revmap explosion problem, caused by the
massive gaps in logical heap block numbers across physical segment
boundaries. The problem is articulated with an example in the README.
Earlier, we solved this problem with the help of UPPER pages, which
acted like a lookup table to find the revmap page, given a logical heap
block number.

One of the biggest shortcomings of that design was that even an empty
BRIN index would take up ~3.2MB at rest. This is because upper pages were
always pre-allocated, to cover all possible heap block numbers. This
space would be consumed on a per-segment basis, given GPDB's MPP nature.

Further, every operation involving the revmap had to access one
additional (upper) page, which added overhead.

Highlights:

(1) We removed the UPPER page design in a prior commit and now have
replaced it with a chaining design.

We completely break away from the restriction that the revmap pages
follow one another right after the metapage, in contiguous block
numbers. Instead, we now have them point to one another in a singly
linked list.

Furthermore, there are up to MAX_AOREL_CONCURRENCY such linked lists of
revmap pages. There is one list per block sequence. The heads and tails
of these lists (or chains) are maintained in the metapage (and cached in
the revmap access struct).

Since revmap pages are no longer contiguous for AO/CO tables, we have to
additionally maintain logical page numbers (in the BrinSpecialSpace)
for all revmap pages (depicted in the diagram in the README). These
logical page numbers are used for both iterating over the revmap during
scans and also while extending the revmap.

We traverse these lists in order within a block sequence and block
sequence by block sequence.

We never have to lock more than 1 revmap page at a time during chain
traversal. Only for revmap extension, do we have to lock two revmap
pages: the last revmap page in the chain and the new revmap page being
added.

For operations such as insert, we make use of the chain tail pointer in
the metapage. Due to the append-only nature of AO/CO tables, we always
write to the last logical heap block within a block sequence. Thus,
unlike for heap, blocks other than the last block are never summarized
as a result of an insert. So, we can safely position the revmap iterator
at the end of the chain (instead of traversing the chain unnecessarily
from the front).

(2) pageinspect and waldump have been modified in accordance with these
changes.

(3) Whitebox tests have been added for all BRIN operations, with the
exception of desummarize. These tests utilize pageinspect.

(4) WAL changes: We bump the catalog version, as we can't change
XLOG_PAGE_MAGIC (changing it would invite future merge conflicts).

(5) Created 202_wal_consistency_brin.pl under src/test/recovery as a
replica of src/test/modules/brin/t/02_wal_consistency.pl, with added
tests for AO/CO tables (since src/test/modules is excluded from CI).

Note: Please refer to the updated README for more details.
soumyadeep2007 authored and reshke committed Jan 11, 2025
1 parent f9455b1 commit d06063d
Showing 28 changed files with 1,843 additions and 854 deletions.
2 changes: 1 addition & 1 deletion contrib/pageinspect/Makefile
@@ -14,7 +14,7 @@ OBJS = \
rawpage.o

EXTENSION = pageinspect
DATA = pageinspect--1.8--1.9.sql \
DATA = pageinspect--1.8--1.9.sql \
pageinspect--1.7--1.8.sql pageinspect--1.6--1.7.sql \
pageinspect--1.5.sql pageinspect--1.5--1.6.sql \
pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \
114 changes: 112 additions & 2 deletions contrib/pageinspect/brinfuncs.c
@@ -22,16 +22,21 @@
#include "lib/stringinfo.h"
#include "miscadmin.h"
#include "pageinspect.h"
#include "storage/bufmgr.h"
#include "utils/array.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
#include "miscadmin.h"

PG_FUNCTION_INFO_V1(brin_page_type);
PG_FUNCTION_INFO_V1(brin_page_items);
PG_FUNCTION_INFO_V1(brin_metapage_info);
PG_FUNCTION_INFO_V1(brin_revmap_data);

/* GPDB specific */
PG_FUNCTION_INFO_V1(brin_revmap_chain);

#define IS_BRIN(r) ((r)->rd_rel->relam == BRIN_AM_OID)

typedef struct brin_column_state
@@ -361,8 +366,11 @@ brin_metapage_info(PG_FUNCTION_ARGS)
Page page;
BrinMetaPageData *meta;
TupleDesc tupdesc;
Datum values[4];
bool nulls[4];
Datum values[8];
bool nulls[8];
Datum *firstrevmappages;
Datum *lastrevmappages;
Datum *lastrevmappagenums;
HeapTuple htup;

if (!superuser())
@@ -388,6 +396,41 @@ brin_metapage_info(PG_FUNCTION_ARGS)
values[2] = Int32GetDatum(meta->pagesPerRange);
values[3] = Int64GetDatum(meta->lastRevmapPage);

/* GPDB specific fields */
values[4] = BoolGetDatum(meta->isAo);	/* isAo is declared boolean in SQL */
if (!meta->isAo)
{
nulls[5] = true;
nulls[6] = true;
nulls[7] = true;
}
else
{
firstrevmappages = palloc(sizeof(Datum) * MAX_AOREL_CONCURRENCY);
lastrevmappages = palloc(sizeof(Datum) * MAX_AOREL_CONCURRENCY);
lastrevmappagenums = palloc(sizeof(Datum) * MAX_AOREL_CONCURRENCY);

for (int i = 0; i < MAX_AOREL_CONCURRENCY; i++)
{
firstrevmappages[i] = UInt32GetDatum(meta->aoChainInfo[i].firstPage);
lastrevmappages[i] = UInt32GetDatum(meta->aoChainInfo[i].lastPage);
lastrevmappagenums[i] = UInt32GetDatum(meta->aoChainInfo[i].lastLogicalPageNum);
}

/* int8 elements: pass-by-value per FLOAT8PASSBYVAL, double alignment */
values[5] = PointerGetDatum(construct_array(firstrevmappages,
MAX_AOREL_CONCURRENCY,
INT8OID,
sizeof(int64), FLOAT8PASSBYVAL, 'd'));
values[6] = PointerGetDatum(construct_array(lastrevmappages,
MAX_AOREL_CONCURRENCY,
INT8OID,
sizeof(int64), FLOAT8PASSBYVAL, 'd'));
values[7] = PointerGetDatum(construct_array(lastrevmappagenums,
MAX_AOREL_CONCURRENCY,
INT8OID,
sizeof(int64), FLOAT8PASSBYVAL, 'd'));
}

htup = heap_form_tuple(tupdesc, values, nulls);

PG_RETURN_DATUM(HeapTupleGetDatum(htup));
@@ -449,3 +492,70 @@ brin_revmap_data(PG_FUNCTION_ARGS)

SRF_RETURN_DONE(fctx);
}

/*
* GPDB: Returns the chain of revmap block numbers for a given segno (aka block
* sequence).
*/
Datum
brin_revmap_chain(PG_FUNCTION_ARGS)
{
bytea *raw_page = PG_GETARG_BYTEA_P(0);
Oid indexRelid = PG_GETARG_OID(1);
int segno = PG_GETARG_INT32(2);	/* SQL argument is declared as int */
Page metapage;
BrinMetaPageData *meta;
ArrayBuildState *astate = NULL;
BlockNumber currRevmapBlk;

Relation indexRel = index_open(indexRelid, AccessShareLock);

if (!superuser())
ereport(ERROR,
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
(errmsg("must be superuser to use raw page functions"))));

if (!IS_BRIN(indexRel))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("\"%s\" is not a %s index",
RelationGetRelationName(indexRel), "BRIN")));

if (segno < 0 || segno > AOTupleId_MaxSegmentFileNum)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("\"%d\" is not a valid segno value (valid values are in [0,127])",
segno)));

metapage = verify_brin_page(raw_page, BRIN_PAGETYPE_META, "metapage");

if (PageIsNew(metapage))
{
index_close(indexRel, AccessShareLock);
PG_RETURN_NULL();
}

meta = (BrinMetaPageData *) PageGetContents(metapage);
currRevmapBlk = meta->aoChainInfo[segno].firstPage;
while (currRevmapBlk != InvalidBlockNumber)
{
/* Look at the chain link to see what the next revmap blknum is */
Buffer curr;

astate = accumArrayResult(astate, UInt32GetDatum(currRevmapBlk), false,
INT8OID, CurrentMemoryContext);

curr = ReadBuffer(indexRel, currRevmapBlk);
LockBuffer(curr, BUFFER_LOCK_SHARE);
currRevmapBlk = BrinNextRevmapPage(BufferGetPage(curr));
UnlockReleaseBuffer(curr);
}

index_close(indexRel, AccessShareLock);

if (astate)
PG_RETURN_DATUM(makeArrayResult(astate,
CurrentMemoryContext));
else
PG_RETURN_NULL();
}
21 changes: 21 additions & 0 deletions contrib/pageinspect/pageinspect--1.8--1.9.sql
@@ -135,3 +135,24 @@ CREATE FUNCTION brin_page_items(IN page bytea, IN index_oid regclass,
RETURNS SETOF record
AS 'MODULE_PATHNAME', 'brin_page_items'
LANGUAGE C STRICT PARALLEL SAFE;
-- brin_metapage_info()
--
DROP FUNCTION brin_metapage_info(IN page bytea, OUT magic text,
OUT version integer, OUT pagesperrange integer, OUT lastrevmappage bigint);
CREATE FUNCTION brin_metapage_info(IN page bytea, OUT magic text,
OUT version integer, OUT pagesperrange integer, OUT lastrevmappage bigint,
/* GPDB specific for AO/CO tables */
OUT isAo boolean,
OUT firstrevmappages bigint[],
OUT lastrevmappages bigint[],
OUT lastrevmappagenums bigint[])
AS 'MODULE_PATHNAME', 'brin_metapage_info'
LANGUAGE C STRICT PARALLEL SAFE;

--
-- brin_revmap_chain()
--
CREATE FUNCTION brin_revmap_chain(IN page bytea, IN indexrelid regclass, IN segno int)
RETURNS bigint[]
AS 'MODULE_PATHNAME', 'brin_revmap_chain'
LANGUAGE C STRICT PARALLEL SAFE;
180 changes: 131 additions & 49 deletions src/backend/access/brin/README
@@ -191,6 +191,38 @@ Future improvements

GPDB:

(1) Main design problem:

BRIN needs special handling for append-optimized tables. The revmap relies on
the assumption that block numbers are consecutive, i.e. that there are no gaps
in the sequence of block numbers for a given relation. This assumption does not
hold for append-optimized tables. The AO tid is comprised of
<segment file number, row number>. Concurrent inserts into an AO table result in
multiple segment files, one per insert, being populated.

The existing revmap structure is simple in the sense that it is easy to
calculate the block number for a revmap page (the block layout is always:
{meta page, [revmap pages], [data pages]}). The number of revmap pages is
directly proportional to the logical heap block numbers we are covering in the
index.

If we continue with this representation, we will have to create revmap entries
for all the nonexistent TIDs in this gap, leading to large amounts of wasted
space. For example, in a simple AO table with segment 1, having 10 logical heap
blocks: [33554432, 33554441], we would have to create revmap pages covering the
range [0, 33554431], and if pages_per_range = 1, that would mean creating close
to (33554432 / REVMAP_PAGE_MAXITEMS) = (33554432 / 5456) ~= 6150 revmap pages!
And an AO/CO table can have 128 such segments!
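
The arithmetic above can be reproduced with a small sketch. It is not code from
the BRIN sources: contiguous_revmap_pages() and AO_SEG_FIRST_BLOCK() are
hypothetical helpers, and the values 5456 and 33554432 (2^25) are simply the
figures quoted in this README.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, taken from the README example (not from the real headers). */
#define REVMAP_PAGE_MAXITEMS 5456u
/* First logical heap block of a segment: 33554432 (2^25) * segno. */
#define AO_SEG_FIRST_BLOCK(segno) ((uint32_t) (segno) << 25)

/*
 * Hypothetical helper: number of revmap pages that a contiguous revmap would
 * need in order to cover logical heap blocks [0, last_heap_blk].
 */
static uint64_t
contiguous_revmap_pages(uint32_t last_heap_blk, uint32_t pages_per_range)
{
    uint64_t ranges;

    assert(pages_per_range > 0);
    /* Number of block ranges needed to cover [0, last_heap_blk]. */
    ranges = (uint64_t) last_heap_blk / pages_per_range + 1;
    /* Round up to whole revmap pages. */
    return (ranges + REVMAP_PAGE_MAXITEMS - 1) / REVMAP_PAGE_MAXITEMS;
}
```

Under these assumptions, covering the ten blocks of segment 1 with
pages_per_range = 1 needs 6151 contiguous revmap pages, whereas the same ten
blocks placed at the start of the block number space would need just one.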

We discuss how we change the internal structure for the metapage and revmap to
tackle this problem (See Section (3)).

There is also the question of how we can ensure that most of the code between
heap and AO/CO tables is unified. Section (2) describes how we tackle that
through the introduction of new table AM APIs and BlockSequences.

(2) BlockSequences and Table AM APIs:

We have introduced a new table AM API relation_get_block_sequences() that helps
unify code for block-based iteration for BRIN scan and summarization, in a
table AM agnostic manner.
@@ -216,52 +248,102 @@ Sometimes, an alternative API is also needed: to get the block sequence, given
a logical heap block number. For that purpose, we have introduced
relation_get_block_sequence().

BRIN on append only tables
--------------------------

Cloudberry has a new kind of table - append only table. BRIN needs special
handling for append-optimized tables. The revmap relies on the assumption
that block numbers are consecutive, there are no gaps in the sequence of block
numbers for a given relation. This assumption does not hold for append-optimized
tables. The AO tid is comprised of <segment file number, row number>. Concurrent
inserts into an AO table result in multiple segment files, one per insert, being
populated. When mapped to heap TIDs, there is a large gap between the block
number of the last TID on segment number 1 and the first TID on segment
number 2. If we continue to represent this using just the revmap, we will have
to create revmap entries for all the nonexistent TIDs in this gap, leading to
large amount of wasted space. The structure of revmap has been improved to adapt
to append only table. An upper block on top of revmap is introduced to avoid
wasting space due to non-existent AO TIDs.

The Ao table is logically composed of 128 aosegs to support concurrent inserts.
Each tuple in the Ao table corresponds to a virtual tid. The virtual tid of
the first tuple of each Aoseg is equal to (2^48/128)*segnum, then the first
virtual block number of each Aoseg is equal to (2^32/128) * segnum.

If there are three blocks in aoseg0, aoseg1, and aoseg127, their block numbers
are 0x0000 0000 0x0000 0001 0x0000 0002, 0x0200 0000 0x0200 0001 0x0200 0002,
0xFE00 0000 0xFE00 0001 0xFE00 0002. Then the largest index in the revmap array
is 0xFE00 0002. In this way, the revmap array contains 4,261,412,866 tids,
taking up 24GB of space. This is clearly unacceptable.

So we added an extra upper level on top of the revmap. In this way, at the
level of revmap, tid and the corresponding block are initialized only when
the corresponding block number has data. The upper level block stores the
revmap level block number. In this way, the revmap level will only store the
tid corresponding to the block that has been filled with data. The upper
level will initialize all the blocks corresponding to the block number at
one time. But because the upper level only stores the block number of the
revmap, the number of records in the upper level is 2^32/TidNumPerPage which
is approximately equal to 800,000. Takes up 3.2MB of space.

The corresponding relationship between the block number and the upper level
array index is:
upper_index=blocknum/TidNumPerPage
Stored in the upper level array is the block number of the revmap, and the
offset in the block of the revmap tid is:
revmap_offset=blocknum%TidNumPerPage
TidNumPerPage: The number of tids that each revmap page can hold.
All the discussions above have ignored the pagesPerRange variable.
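
As a rough check of the figures above, the following sketch recomputes the
size of the upper level and the two lookup formulas. It assumes
TidNumPerPage = 5456 (the REVMAP_PAGE_MAXITEMS figure quoted elsewhere in this
README) and 4 bytes per stored revmap block number; all names are hypothetical
and none of this is code from the removed implementation.

```c
#include <assert.h>
#include <stdint.h>

#define TID_NUM_PER_PAGE 5456u  /* assumption: tids per revmap page */
#define UPPER_ENTRY_SIZE 4u     /* assumption: one 32-bit block number per entry */

/* Entries needed so that every possible 32-bit block number is covered. */
static uint64_t
upper_entry_count(void)
{
    uint64_t all_blocks = (uint64_t) 1 << 32;

    return (all_blocks + TID_NUM_PER_PAGE - 1) / TID_NUM_PER_PAGE;
}

/* Space taken by the pre-allocated upper level, in bytes. */
static uint64_t
upper_level_bytes(void)
{
    return upper_entry_count() * UPPER_ENTRY_SIZE;
}

/* upper_index = blocknum / TidNumPerPage */
static uint32_t
upper_index(uint32_t blocknum)
{
    return blocknum / TID_NUM_PER_PAGE;
}

/* revmap_offset = blocknum % TidNumPerPage */
static uint32_t
revmap_offset(uint32_t blocknum)
{
    return blocknum % TID_NUM_PER_PAGE;
}
```

With these assumed values the upper level holds 787,201 entries, or about
3.15MB, which matches the approximate 800,000 entries and 3.2MB quoted above.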



(3) Changes to the internal page structure:

BRIN data pages remain unchanged. Only the metapage and revmap pages undergo a
change in structure, in order to deal with the main design problem highlighted
in Section (1). Also, these changes are made only for AO/CO tables - for heap
tables, the fields added to the internal structures are unused.

We completely break away from the restriction that the revmap pages follow one
another right after the metapage, in contiguous block numbers. Instead, we now
have them point to one another in a singly linked list. We have introduced the
nextRevmapPage pointer in BrinSpecialSpace to this end.

Note: Since revmap pages are not contiguous, we don't have to follow the page
evacuation protocol (that we have to follow for indexes on heap tables), which
had to move data pages to the end of the index relation, to make room for
revmap pages.

Furthermore, there are up to MAX_AOREL_CONCURRENCY such linked lists of revmap
pages. There is one list per block sequence. The heads and tails of these lists
(or chains) are maintained in the metapage (and cached in the revmap access
struct).

We have depicted the logical chain structure below:

+----------+
| meta |
| |
| |
+-----+----+
|
+----------------+------------------+
seq0| seq1| ... seqN|
| | |
+----v-----+ +-----v----+ +-----v----+
| rev | | rev | | rev |
| +--+--+ | +--+--+ | +--+--+
| | 1| | | | 1| | | | 1| |
+----+--++-+ +----+--++-+ +----+--++-+
| | |
| | |
+--------v-+ +--------v-+ +--------v-+
| rev | | rev | | rev |
| +--+--+ | +--+--+ | +--+--+
| | 2| | | | 2| | | | 2| |
+----+--++-+ +----+--++-+ +----+--++-+
| | |
v v v
...
+----------+ +----------+ +----------+
| rev | | rev | | rev |
| +--+--+ | +--+--+ | +--+--+
| |n1| | | |n2| | | |nN| |
+----+--+--+ +----+--+--+ +----+--+--+

Omitted from the diagram are the tail pointers to the revmap chains and the
data pages, for clarity.

Since revmap pages are no longer contiguous for AO/CO tables, we have to
additionally maintain logical page numbers (in the BrinSpecialSpace) for all
revmap pages (depicted in the diagram above). The need can be highlighted with
the following example:

For heap tables, let's say we have metapage: Block0 and revmap pages: Block1,2,3
and let's say we have pages_per_range = 1. If we wanted to look up the summary
info for heapBlk=6000, that would map to Block2 (we know that from simple math.
See: HEAPBLK_TO_REVMAP_BLK()). However, for AO/CO tables, we have no idea what
revmap block number this would map to since revmap pages are not contiguous.
This is where the 1-based logical page number comes in. With it we can say,
heapBlk 6000 maps to the 2nd revmap page for block sequence 0 (seg0)
(See: HEAPBLK_TO_REVMAP_PAGENUM_AO()). We can then traverse the revmap chain for
seg0 until we find the revmap page with pagenum=2.
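
A minimal sketch of that mapping follows. It is not the actual
HEAPBLK_TO_REVMAP_PAGENUM_AO() macro, whose definition is not shown here:
ao_segno() and ao_revmap_pagenum() are hypothetical helpers, and the constants
(5456 items per revmap page, 2^25 logical heap blocks per segment) are taken
from the figures in this README.

```c
#include <assert.h>
#include <stdint.h>

#define REVMAP_PAGE_MAXITEMS 5456u  /* assumption: README figure */
#define AO_SEG_BLOCK_BITS    25     /* assumption: 2^25 logical blocks per segment */

/* Segment (block sequence) to which a logical heap block belongs. */
static uint32_t
ao_segno(uint32_t heap_blk)
{
    return heap_blk >> AO_SEG_BLOCK_BITS;
}

/* 1-based logical revmap page number within that segment's chain. */
static uint32_t
ao_revmap_pagenum(uint32_t heap_blk, uint32_t pages_per_range)
{
    /* Offset of the block from the first block of its segment. */
    uint32_t blk_in_seg = heap_blk & ((1u << AO_SEG_BLOCK_BITS) - 1);

    assert(pages_per_range > 0);
    return (blk_in_seg / pages_per_range) / REVMAP_PAGE_MAXITEMS + 1;
}
```

For heapBlk 6000 and pages_per_range = 1 this yields segment 0 and logical
page number 2, while the first block of segment 1 (33554432) yields segment 1
and logical page number 1.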

These logical page numbers are used for both iterating over the revmap during
scans and also while extending the revmap (see revmap_extend_and_get_blkno_ao()).
The logical revmap page number for a given logical heap block is calculated by
paying attention to the segment to which the logical heap block belongs and the
fixed number of items that can fit in a revmap page (See
HEAPBLK_TO_REVMAP_PAGENUM_AO()). The logical page numbers of the last chain
members are also stored in the metapage (and cached in the revmap access struct).

For operations such as scan, build and summarize:
We always traverse each chain in order, which is why a singly linked list is
sufficient. Also, these chains are always traversed in block sequence order -
the chain for seg0 is traversed, then the chain for seg1, and so on. We use a
revmap iterator to attain this goal. Before traversing each chain, we position
the iterator at the start of the chain.

We never have to lock more than 1 revmap page at a time during chain traversal.
Only for revmap extension, do we have to lock two revmap pages: the last revmap
page in the chain and the new revmap page being added.
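
The traversal can be pictured with a small in-memory model. This is only a
sketch: ModelRevmapPage and chain_find() are hypothetical stand-ins for the
chain fields in BrinSpecialSpace and for the real buffer-based code, an array
index plays the role of a block number, and no actual locking is done.

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_BLOCK 0xFFFFFFFFu   /* stand-in for InvalidBlockNumber */

/* Hypothetical model of the chain fields kept in each revmap page. */
typedef struct ModelRevmapPage
{
    uint32_t logical_pagenum;   /* 1-based position within its chain */
    uint32_t next_blk;          /* next revmap block, or INVALID_BLOCK */
} ModelRevmapPage;

/*
 * Walk one chain from its head until the page with the target logical page
 * number is found. Only one page is examined at a time, mirroring the rule
 * that traversal never holds more than one revmap page.
 */
static uint32_t
chain_find(const ModelRevmapPage *index_pages, uint32_t head_blk,
           uint32_t target_pagenum)
{
    uint32_t blk = head_blk;

    while (blk != INVALID_BLOCK)
    {
        const ModelRevmapPage *page = &index_pages[blk];

        if (page->logical_pagenum == target_pagenum)
            return blk;
        /* Read the link, then move on; the previous page is no longer needed. */
        blk = page->next_blk;
    }
    return INVALID_BLOCK;       /* chain has no such page */
}
```

In this model a chain whose pages live at blocks 1, 4 and 7 is searched by
calling chain_find(pages, 1, 2), which returns block 4.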

For operations such as insert, we make use of the chain tail pointer in the
metapage. Due to the append-only nature of AO/CO tables, we always write to
the last logical heap block within a block sequence. Thus, unlike for heap,
blocks other than the last block are never summarized as a result of an
insert. So, we can safely position the revmap iterator at the end of the chain
(instead of traversing the chain unnecessarily from the front).
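
The role of the tail pointer during extension can be sketched with the same
kind of in-memory model. The struct below only mirrors the per-chain fields
visible in the pageinspect changes (firstPage, lastPage, lastLogicalPageNum);
chain_init(), chain_extend() and the use of INVALID_BLOCK to mark an empty
chain are assumptions made for illustration, not the real
revmap_extend_and_get_blkno_ao() logic.

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_BLOCK 0xFFFFFFFFu   /* stand-in for InvalidBlockNumber */

typedef struct ModelRevmapPage
{
    uint32_t logical_pagenum;   /* 1-based position within its chain */
    uint32_t next_blk;          /* next revmap block, or INVALID_BLOCK */
} ModelRevmapPage;

/* Hypothetical model of one chain's entry in the metapage. */
typedef struct ModelChainInfo
{
    uint32_t first_page;
    uint32_t last_page;
    uint32_t last_logical_pagenum;
} ModelChainInfo;

/* Assumption: an empty chain is marked with INVALID_BLOCK head and tail. */
static void
chain_init(ModelChainInfo *chain)
{
    chain->first_page = INVALID_BLOCK;
    chain->last_page = INVALID_BLOCK;
    chain->last_logical_pagenum = 0;
}

/*
 * Append the page at new_blk to the chain. Only the old tail and the new
 * page are touched; the rest of the chain is never visited.
 */
static void
chain_extend(ModelRevmapPage *index_pages, ModelChainInfo *chain,
             uint32_t new_blk)
{
    ModelRevmapPage *new_page = &index_pages[new_blk];

    new_page->logical_pagenum = chain->last_logical_pagenum + 1;
    new_page->next_blk = INVALID_BLOCK;

    if (chain->first_page == INVALID_BLOCK)
        chain->first_page = new_blk;                        /* first page of chain */
    else
        index_pages[chain->last_page].next_blk = new_blk;   /* link old tail */

    chain->last_page = new_blk;
    chain->last_logical_pagenum = new_page->logical_pagenum;
}
```

Because the tail block and its logical page number are kept alongside the
head, an insert can reach the last revmap page of a chain directly, and an
extension costs the same regardless of how long the chain has grown.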

Note: Multiple revmap pages across chains can map to the same data page.