Bio submission overview
=======================
Since btrfs has support for various raid levels, its submission path is more
involved than that of a traditional filesystem. For the purposes of this
document a raid1-based filesystem is going to be considered.
Submission
----------
The upper layers of the filesystem create a single bio for a particular
logical address. However, due to the way btrfs manages its space, the first
thing that needs to happen is mapping this logical address to a physical one.
Additionally, depending on the raid profile of the portion of the disk where
the requested data falls, it's possible that the single io is actually going to
turn into multiple io requests.
This mapping occurs in btrfs_map_bio. First a btrfs_bio struct is created - it
holds information about the underlying properties of the range that we want to
interact with. This includes details such as how many stripes (i.e. copies of
the data) there are on disk, the length of the stripe, and copies of some values
of the initial bio which are going to be replaced/hooked during the submission
process. Then for every stripe a bio is going to be created. Those multiple bios
constitute the io for reading/writing the data. They will also share the
btrfs_bio structure and use it to record errors. During submission of each bio
(including the original one) the values of bi_iter.bi_sector, bi_end_io and
bi_private are modified to correspond to the location of the requested data on
each stripe.
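As a rough illustration, here is a minimal sketch of the control structure the
cloned stripe bios share. The field names loosely mirror the kernel's (older)
struct btrfs_bio / btrfs_bio_stripe, but the layout shown here is simplified
and partly assumed for illustration only:

    /* Hypothetical, simplified model - not the actual kernel definition. */
    #include <stdint.h>

    struct model_stripe {
        int      devid;     /* device this copy lives on              */
        uint64_t physical;  /* byte offset of the copy on that device */
    };

    struct model_btrfs_bio {
        int num_stripes;      /* e.g. 2 copies for RAID1               */
        int stripes_pending;  /* per-stripe bios still in flight       */
        int error;            /* errors accumulated across stripe bios */
        /* saved values from the original bio, restored on completion */
        void (*orig_end_io)(void *bio);
        void *orig_private;
        struct model_stripe stripes[2];
    };
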
Bio mapping
-----------
Bio mapping is the process during which the filesystem calculates in what way
exactly the given io request has to be submitted. It essentially translates a
high-level request such as "write 128k from logical address 1234 on a RAID0
block-group" into:
1. Split the request into stripes (assuming a default stripe size of 64k); this
means the request will have to be split into 2 writes, each 64k in length.
1.1 The first 64k goes to disk0 - create the necessary structures to describe
this information.
1.2 The second 64k goes to disk1 - create the necessary structures to describe
this.
In this section we are going to look in detail at how this process happens,
since it might look a bit daunting at first. To fully comprehend it, it's
important to describe a few terms as used in btrfs bio mapping parlance:
* Chunk - A chunk is the logical unit from which btrfs allocates space. Chunks
usually come in sizes of 1gb (but can be bigger). When we have a RAID1 chunk
it will be backed by 2 disk stripes (more on that below). A chunk has a
starting logical address and a size, so data writes will fall within the
boundaries of a particular chunk.
* Stripe - this can really mean two things. The first is a stripe as per the
allocation profile of a block group (BG)/chunk. For example a BG with a RAID1
profile will have 2 stripes - because it's mirrored. So a 1gb block group will
be backed by 2x1gb stripes from two devices.
The other meaning of stripe is the logical chunks into which btrfs groups
writes. If we take for example the default 64k stripe size and have an address,
say, 1104412672, then it falls within stripe number 16852, counting from 0 (see
the short computation right after this list).
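To make the second meaning concrete, here is the division behind that example
(a trivial standalone computation, not btrfs code):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t logical    = 1104412672ULL;
        uint64_t stripe_len = 65536;          /* default 64k stripe size */

        /* prints 16852 - the 0-based logical stripe the address falls in */
        printf("%llu\n", (unsigned long long)(logical / stripe_len));
        return 0;
    }
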
For the sake of simplicity this section will look at how a RAID0 write of 128k
is mapped by btrfs. If you understand this then you won't have problems
understanding the rest of __btrfs_map_block. For this example we'll consider a
single write of 128k; the relevant chunk configuration follows as an excerpt
from the btrfs inspect-internal command:
    item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 1104150528) itemoff 15751 itemsize 112
        length 2147483648 owner 2 stripe_len 65536 type DATA|RAID0
        io_align 65536 io_width 65536 sector_size 4096
        num_stripes 2 sub_stripes 0
            stripe 0 devid 2 offset 1083179008
            dev_uuid a92967cd-a78a-4c2a-831f-6ae681d445a7
            stripe 1 devid 1 offset 1104150528
            dev_uuid 92876f31-1811-4119-aa04-a3b7b7fcaf01
From this we can conclude that this chunk is backed by 2 physical stripes (one
on each disk, starting at offsets 1083179008 and 1104150528 respectively - note
those are actual physical addresses on the raw devices). I/O is going to be
split into stripes of 64k each and since this is RAID0 the stripes will be
written to alternating disks. Also, let's assume the starting logical address of
the write is 1104150528, as generated by btrfs' allocator.
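The calculations below read their parameters (stripe_len, num_stripes and the
per-stripe physical offsets) from the in-memory description of this chunk. A
minimal sketch of such a description, filled in with the values from the excerpt
above (the struct layout is an illustrative assumption, not the kernel's
map_lookup verbatim):

    #include <stdint.h>

    struct model_map_stripe {
        int      devid;
        uint64_t physical;    /* start of this stripe on the device */
    };

    struct model_chunk_map {
        uint64_t start;       /* logical start of the chunk */
        uint64_t length;
        uint64_t stripe_len;
        int      num_stripes;
        struct model_map_stripe stripes[2];
    };

    static const struct model_chunk_map raid0_chunk = {
        .start       = 1104150528ULL,
        .length      = 2147483648ULL,
        .stripe_len  = 65536,
        .num_stripes = 2,
        .stripes     = {
            { .devid = 2, .physical = 1083179008ULL },
            { .devid = 1, .physical = 1104150528ULL },
        },
    };
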
The mapping process starts with a call to __btrfs_map_block, which is usually
wrapped by the version without the leading underscores. In this case 'logical'
will be the address we want to map - 1104150528 - and 'length' will be the
number of bytes we want to map, i.e. the request length. The first step is to
find the in-memory data about the chunk, represented by an instance of the
extent_map structure. Then the following calculations take place:
1. The chunk offset is calculated by subtracting the chunk start address from
the logical address passed:
`chunk_offset = 1104150528 - 1104150528 = 0;`
2. The number of 64k stripes we need to skip from the beginning of the chunk
to arrive at our address:
`stripe_nr = 0 / 64k = 0`
3. The starting offset of that stripe within the chunk:
`stripe_offset = stripe_nr * stripe_len = 0 * 64k = 0`
4. The offset, in bytes, within the resulting stripe at which this address
starts (note that the stripe_offset variable is reused here):
`stripe_offset = chunk_offset - stripe_offset = 0 - 0 = 0;`
Since it so happens that this address aligns perfectly with the beginning of the
chunk, all of these values are going to be 0.
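The same four steps written out as a tiny, self-contained computation (a sketch
that mirrors the arithmetic above, not the kernel code itself):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t logical     = 1104150528ULL;  /* address to map           */
        uint64_t chunk_start = 1104150528ULL;  /* start of the RAID0 chunk */
        uint64_t stripe_len  = 65536;          /* 64k                      */

        uint64_t chunk_offset  = logical - chunk_start;        /* 0 */
        uint64_t stripe_nr     = chunk_offset / stripe_len;    /* 0 */
        uint64_t stripe_offset = stripe_nr * stripe_len;       /* 0 */
        stripe_offset          = chunk_offset - stripe_offset; /* 0 */

        printf("stripe_nr=%llu stripe_offset=%llu\n",
               (unsigned long long)stripe_nr,
               (unsigned long long)stripe_offset);
        return 0;
    }
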
Afterwards there is code which ensures that the mapped length, i.e. the number
of bytes allowed to be written, does not span a stripe (64k) boundary. Following
this there is a rather large if {} else if {} construct that applies the mapping
logic for each respective raid mode. In this example we'll look at RAID0. The
value that has to be calculated is the index of the physical stripe, i.e. the
device to which this write is going to be directed. This is facilitated by the
following code:
`stripe_nr = div_u64_rem(stripe_nr, map->num_stripes, &stripe_index);`
So what happens here is that the logical stripe number (stripe_nr) is divided
by the physical number of stripes (i.e. disk stripes, see the definition of
stripe above if you are confused) and the remainder is assigned to stripe_index.
So stripe_index will actually hold the disk number this write is going to
(either 0 or 1) depending on whether stripe_nr is even or odd. Also stripe_nr is
now halved. This is because each disk holds only half of the logical stripes, so
for example if we are writing stripe 127 (i.e. address 8323072, 64k * 127) in
the logical space then only half of those stripes will be on this disk, i.e. the
correct value of stripe_nr will be 63.
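In plain C terms div_u64_rem() is just a division that also hands back the
remainder, so for RAID0 over two disks the snippet above boils down to the
following sketch (using the stripe-127 example from the previous paragraph):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t     stripe_nr   = 127;  /* logical stripe, address 64k * 127 */
        unsigned int num_stripes = 2;    /* RAID0 over two devices            */

        unsigned int stripe_index = stripe_nr % num_stripes;  /* 1 -> 2nd disk */
        stripe_nr                 = stripe_nr / num_stripes;  /* 63            */

        printf("stripe_index=%u stripe_nr=%llu\n",
               stripe_index, (unsigned long long)stripe_nr);
        return 0;
    }
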
Finally, after all calculations are done, the btrfs_bio struct is filled with
the device corresponding to the physical stripe as well as the address at which
the write has to happen:
`bbio->stripes[i].physical = map->stripes[stripe_index].physical + stripe_offset + stripe_nr * map->stripe_len;`
So here we first take the physical address of the corresponding physical stripe;
in this case stripe_index is 0, so the equation becomes:
`1083179008 + 0(stripe_offset) + 0(stripe_nr)*64k(map->stripe_len) = 1083179008`
And indeed, the first ever data write to a new filesystem will land right at the
physical start of its backing stripe on the device.
For the sake of completeness, here is a worked example for the 2nd physical
stripe. The logical address will be `1104216064 (1104150528 + 64k)`:
`chunk_offset = 1104216064 - 1104150528 = 65536 (64k)`
`stripe_nr = 65536 / 65536 = 1`
`stripe_offset = 1 * 65536 = 65536`
`stripe_offset = 65536 - 65536 = 0`
`stripe_nr = 1 / 2 = 0 and stripe_index = 1`
`physical_addr = 1104150528 + 0(stripe_offset) + 0(stripe_nr)*64k = 1104150528`
So the next 64k will be written at physical address 1104150528 on the 2nd device.
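Putting the whole calculation together, here is a compact standalone sketch that
maps a logical address inside the RAID0 chunk above to a (device, physical
address) pair. It follows the steps described in this section; it is not
__btrfs_map_block itself, just a model of its RAID0 arithmetic:

    #include <stdint.h>
    #include <stdio.h>

    #define STRIPE_LEN  65536ULL       /* 64k                        */
    #define CHUNK_START 1104150528ULL  /* logical start of the chunk */

    /* per-stripe device ids and physical offsets, from the chunk item */
    static const int      stripe_devid[2]    = { 2, 1 };
    static const uint64_t stripe_physical[2] = { 1083179008ULL, 1104150528ULL };

    static void map_raid0(uint64_t logical)
    {
        uint64_t chunk_offset  = logical - CHUNK_START;
        uint64_t stripe_nr     = chunk_offset / STRIPE_LEN;
        uint64_t stripe_offset = chunk_offset - stripe_nr * STRIPE_LEN;
        unsigned stripe_index  = stripe_nr % 2;  /* which device          */
        stripe_nr             /= 2;              /* stripe on that device */

        uint64_t physical = stripe_physical[stripe_index] + stripe_offset +
                            stripe_nr * STRIPE_LEN;
        printf("logical %llu -> devid %d physical %llu\n",
               (unsigned long long)logical, stripe_devid[stripe_index],
               (unsigned long long)physical);
    }

    int main(void)
    {
        map_raid0(1104150528ULL);  /* -> devid 2, physical 1083179008 */
        map_raid0(1104216064ULL);  /* -> devid 1, physical 1104150528 */
        return 0;
    }
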
End IO handling
---------------
Due to the raid support in btrfs, end io handling is somewhat complicated. It
currently involves at least 2 layers. For the sake of simplicity this section
will consider a read completion scenario in which no error is encountered and
btrfs doesn't perform any repair.
Initially a read bio is created with `end_bio_extent_readpage` as the
io handler in `__do_readpage`, but during the submission process the endio
handler is changed to `end_workqueue_bio` in `btrfs_bio_wq_end_io`. This is
necessary because btrfs' readpage endio handler is not softirq safe, and since
bios in the kernel are completed in softirq context, the first order of business
is to schedule the necessary completion work in a normal task context. This is
done by saving the original end io handler and the original private context of
the bio into another structure - `btrfs_end_io_wq` - and setting that structure
as the `bi_private` information for the bio.
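The handler swap itself is a simple "wrap and remember" pattern. A minimal model
of it (the type and helper names below are invented for illustration; the real
work is done in btrfs_bio_wq_end_io):

    /* Illustration of saving the original completion hooks before redirecting
     * bi_end_io/bi_private; this is not the actual kernel code. */
    struct model_bio;
    typedef void (*model_end_io_t)(struct model_bio *);

    struct model_bio {
        model_end_io_t end_io;
        void          *private;
    };

    struct model_end_io_wq {
        model_end_io_t    orig_end_io;   /* e.g. end_bio_extent_readpage */
        void             *orig_private;
        struct model_bio *bio;
    };

    static void wrap_end_io(struct model_bio *bio, struct model_end_io_wq *wq,
                            model_end_io_t new_end_io)
    {
        wq->orig_end_io  = bio->end_io;  /* remember the real handler      */
        wq->orig_private = bio->private;
        wq->bio          = bio;
        bio->end_io      = new_end_io;   /* e.g. end_workqueue_bio         */
        bio->private     = wq;           /* so the wrapper can find its wq */
    }
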
Subsequently the bio is sent to be mapped as described above and, right
before submission, in `submit_stripe_bio` the `bi_end_io`/`bi_private` are
changed once again. Namely, the endio function is now set to `btrfs_end_bio`
and bi_private is set to the `struct btrfs_bio`. Given this, the first function
run on bio completion is `btrfs_end_bio`. Since it's called for every stripe bio
it first performs some generic bookkeeping, ensuring that for a request
involving multiple bios (i.e. a write to RAID1) only the final bio runs further
code. It also deals with incrementing error counters if the bio has encountered
an error. If all is well, eventually `btrfs_end_bbio` is called. It simply
copies the original values of `bi_private`/`bi_end_io` back to the bio and calls
`bio_endio`, which invokes the restored endio handler. The original value in
this case was `end_workqueue_bio`. That function in turn is responsible for
queueing the final execution of the real endio handler on the appropriate
workqueue. So the real execution chain is
`btrfs_end_bio->end_workqueue_bio->end_bio_extent_readpage`.
Lastly, in `end_bio_extent_readpage` the checksum of every segment is checked
(if checksums are enabled) and finally the bio is completed.
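The resulting chain can be modeled as three calls, each one peeling off a layer
of saved state. The functions below are simplified stand-ins for the real ones
(the workqueue hop is collapsed into a direct call), meant only to show the
order of execution:

    #include <stdio.h>

    static void end_bio_extent_readpage(void)
    {
        /* the "real" handler: verify checksums, unlock pages, finish the read */
        printf("verify checksums and complete the read\n");
    }

    static void end_workqueue_bio(void)
    {
        /* in the kernel this queues work to run in task context; the model
         * simply calls the real handler directly */
        end_bio_extent_readpage();
    }

    static void btrfs_end_bio(int *stripes_pending, int errors)
    {
        if (--(*stripes_pending) > 0)
            return;              /* wait for the remaining stripe bios */
        if (errors)
            printf("at least one stripe bio failed\n");
        end_workqueue_bio();     /* the restored original bi_end_io */
    }

    int main(void)
    {
        int pending = 1;         /* a simple read touches a single stripe */
        btrfs_end_bio(&pending, 0);
        return 0;
    }
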
Bio repair
------------
Since BTRFS supports raid functionality it's able to seamlessly repair broken
reads. The exact details vary depending on the raid level recovery has to be
performed from. For clarity, the next section discusses the RAID1 repair
scenario for data reads.
RAID1
---------
In RAID1 mode every data block has a copy, so when a failure is detected it's
possible to retry the read using the 2nd copy. The journey of bio repair starts
in the endio handler - `end_bio_extent_readpage`. As the code iterates through
the bio vectors in the bio it performs checksum validation by calling the
`readpage_end_io_hook`. If there is an error during that validation,
`btrfs_submit_read_repair` will be called. It's important to note that the
repair code always works at filesystem block size granularity. Currently the
filesystem block size is always equal to the page size, so the repair function
always knows the exact page where the failure has occurred. First a failure
record is created and stored in the per-inode io failure tree. That record holds
information such as the failed page from the original bio, the logical/physical
offsets and the length.
Subsequently `btrfs_check_repairable` is called to see if the bio can indeed
be repaired, i.e. there's more than 1 copy. More importantly, it also
initializes `failrec::this_mirror` - this member indicates which mirror is
going to be used by the repair bio. Finally the repair bio is created in
`btrfs_submit_read_repair` and submitted as usual by calling `submit_bio_hook`,
using the mirror as initialized by `btrfs_check_repairable`. For RAID1, mirrors
are simply iterated from 1 and the failed one is skipped (see the sketch below).
For RAID5/6, `this_mirror` is used to implement a state machine and certain
values have special meaning.
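For RAID1 that mirror selection boils down to "try the next copy, skipping the
one that just produced bad data". A tiny illustrative helper (not a kernel
function) capturing that rule:

    /* Mirrors are numbered from 1; returns 0 when no copies are left to try. */
    static int next_repair_mirror(int prev_mirror, int failed_mirror,
                                  int num_copies)
    {
        int mirror = prev_mirror + 1;

        if (mirror == failed_mirror)
            mirror++;
        return mirror <= num_copies ? mirror : 0;
    }
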
Once the repair bio is submitted it becomes responsible for filling in the
passed page with valid data. This allows a bio consisting of multiple pages
to have only some of them corrupted. In this scenario the pages which are fine
are unlocked by the original bio, whilst each corrupted page is filled in and
unlocked by its respective repair bio.
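Schematically, the per-page handling in the original read completion looks like
the loop below (all names are placeholders for illustration, following the
paragraph above rather than the exact kernel code):

    /* For each page (filesystem block) carried by the completed read bio:
     * good pages are finished right away, bad ones get their own repair bio. */
    struct page_state { int csum_ok; };

    static void finish_page(struct page_state *p)        { (void)p; /* unlock */ }
    static void submit_read_repair(struct page_state *p) { (void)p; /* new bio */ }

    static void readpage_endio_model(struct page_state *pages, int nr)
    {
        for (int i = 0; i < nr; i++) {
            if (pages[i].csum_ok)
                finish_page(&pages[i]);         /* unlocked by the original bio */
            else
                submit_read_repair(&pages[i]);  /* repair bio fills and unlocks */
        }
    }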