Explore io_uring capabilities #141

Closed
40 tasks done
cevans87 opened this issue Sep 29, 2021 · 3 comments
cevans87 commented Sep 29, 2021

Prep for #142

Literature

Questions

  • Can we efficiently poll multiple iorings?
  • Can we control memory location of ioring?
    • No. io_uring_setup returns an fd, and we must use that fd to mmap all of the data structures for an ioring as shared memory. There is no option in the io_uring_setup params that indicates we can deviate from this behavior.
    • This implies that we would have two VMAs for every ioring (one for each mmap). We can't do this for every actor (potentially 100k+).
  • How does cancelling a submission work?
    • How do we reference the submission we want to cancel?
      • Via the user_data field. We use it as the key.
    • Does 1 completion imply 1 submission? Basically, can we count completions to know we have no outstanding submissions?
      • Not always, such as with poll with IORING_POLL_ADD_MULTI enabled. However, it's always possible to tell when there will be no more completions by checking for IORING_CQE_F_MORE in the CQE.
      • We're going to have associated memory allocations for every submission. We're going to have to track the submitted-but-not-completed work in each of those allocations, so answering this question doesn't inform any design decisions.
    • Can we bulk cancel submissions based on some form of key (like actor id)?
      • No. The call chain eventually reaches io_async_cancel in io_uring.c, which iterates through every task, calls io_async_cancel_one on each, and returns on the first non-ENOENT result (i.e., one matching task was found).
      • It turns out that we're going to have unique memory addresses for associated allocated memory for every single submission. The user_data field is going to point to that memory, so every submission will be unique.
    • Can we cancel a submission that is already dispatched, but not fulfilled (e.g. a network request)?
  • How big are our urings? Say for high throughput socket operations.
    • Not really sure how to answer this now. It's not critical.
  • Can we allocate a large number of urings in a single shared mapping?
    • No. The kernel maps memory for just one uring instance during the io_uring_setup system call. There's no way to change this behavior.
  • Can we start a kernel io thread, but never wake it up unless we start hitting a bottleneck with submissions?
    • This question is irrelevant. We realized that the kernel thread is only useful for very specialized IO-intense services that have fairly constant, heavy workloads. It's not something we want to use for a general-use language like Hemlock.
  • Where might we keep registered buffers so that initial kernel->hemlock completions do not require a copy?
    • We don't want to use registered buffers, since they are backed by pinned physical memory. However, the non-moving part of the question is still relevant: each actor will need a non-moving IO region. Our best design so far is to use a simple size-class span allocation scheme to stage data for reads/writes. To avoid fragmentation, we'll need to aggressively move data out of that area and into the actor's major/minor heap as quickly as possible.
  • For polling a single uring, is it better to poll the CQ ring tail, or is registering an fd with eventfd more efficient?
    • Intuition says polling the CQ ring tail is more efficient, as long as there is only one ring for each executor to poll. It involves fewer steps for the kernel, and polling the CQ ring tail only requires a read barrier.
  • Do linked submissions have to be directly next to each other in the uring? What if we want to continue the chain much later?
    • Yes
    • This has some ugly implications in our executor runtime. If an actor submits part of a chain and requires executor intervention to deal with completions before submission can happen again, the actor MUST be allowed to finish submitting its chain. Nothing can be allowed to use the uring until the actor does so.
  • If two uring instances share the same WQ backend, can we start an I/O from one uring instance and then successfully cancel it from the other? If yes, this makes some actor cleanup easier since we don't need to track which uring still has some of its outstanding I/O. We can just use the current executor's uring instance.
    • No, it's not possible. The match function checks that the cancelling ring context is the same as the submitted ring context.

Tests

  • Basic functionality
    • NOP test. Ensure I can submit multiple NOPs with user_data filled in and can read corresponding successful (res == 0) completions.
    • Call io_uring_enter and then immediately spin for a bit. See if our spinning delays the kernel from dispatching the submission.
      • Test wasn't needed. I found documentation that said that io_uring may try to complete submissions inline, but only if results are already cached or take a constant amount of time. Otherwise, they're submitted to a queue in the kernel and io_uring_enter will return. It will not block.
    • Set up two urings with a shared WQ backend. Submit a timer in one, and cancel it in the other. Show that cancellation succeeds.
      • Tested. Cancellation from a different ring is not possible.
@cevans87 cevans87 self-assigned this Sep 29, 2021
cevans87 commented Oct 5, 2021

I'm abandoning my experiment with IORING_OP_NOP for the moment. I'm saving the current state and
moving on to the next experiment (simple read/write).

The current test setup I have is that I'm creating two NOP submissions, one with 42 in the
user_data field, and one with 43. I get two completions with res set to 0, as expected. However,
both completions have user_data set to 42. I expected one of them to be 43.

At this point, I want to take a known working example and inspect user_data there instead. NOP
seems like too much of a special case in the io_uring implementation to show what real behavior
looks like. I may be doing something wrong, but beating my head against this specific test is the
wrong way to find out.

Here's the print-out of my test before calling io_uring_enter and after.

hemlock@3c401cfc335c:~/Hemlock/experiments/io_uring_bench/build$ ./io_uring_bench
ioring: 0x7fff1d4203e0
  fd:           3
  vm:           0x7f41f4d95000
  cqring: 0x7fff1d4204a0
    head:         0
    tail:         0
    ring_entries: 64
    ring_mask:    63
    cqes:
  sqring: 0x7fff1d420468
    head:         0
    tail:         2
    ring_entries: 32
    ring_mask:    31
    flags:        0
    array:        0
    sqes:
      pos: 0
        opcode:    0
        user_data: 42
        flags:     0
        ioprio:    0
        fd:        0
      pos: 1
        opcode:    14
        user_data: 43
        flags:     0
        ioprio:    0
        fd:        0
ioring: 0x7fff1d4203e0
  fd:           3
  vm:           0x7f41f4d95000
  cqring: 0x7fff1d4204a0
    head:         0
    tail:         2
    ring_entries: 64
    ring_mask:    63
    cqes:
      pos: 0
        user_data: 42
        res:       0
        flags:     0
      pos: 1
        user_data: 42
        res:       0
        flags:     0
  sqring: 0x7fff1d420468
    head:         2
    tail:         2
    ring_entries: 32
    ring_mask:    31
    flags:        0
    array:        0
    sqes:
io_uring_bench: ../main.c:399: test_ioring_enter: Assertion `43ULL == cqe->user_data' failed.
Aborted (core dumped)

cevans87 commented Oct 5, 2021

After pursuing the next test of simply writing twice to stdout via IORING_OP_WRITE, I noticed that the first write was happening twice, and the second not at all. It turns out I'd missed the importance of the sqring->array field: it, not sqring->sqes, is the real ring. array is an indirection layer over the sqes. In multithreaded applications, it makes it possible for threads to atomically reserve an sqe, fill it in, and then atomically write the sqe index into the array without blocking each other.

I wasn't updating the array, so it effectively had

[0, 0, 0, 0, 0...]

and I wanted it to look more like

[0, 1, 2, 3, 4...]

I fixed it and now both tests work.

cevans87 commented
We're in a good place with the literature and experiments. Moving on to design and implementation.
