Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adapt to new (v2) VFIO migration interface #741

Closed
wants to merge 9 commits into from
304 changes: 291 additions & 13 deletions docs/vfio-user.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ vfio-user Protocol Specification
********************************

--------------
Version_ 0.9.1
Version_ 0.9.2
--------------

.. contents:: Table of Contents
Expand Down Expand Up @@ -366,6 +366,9 @@ Name Command Request Direction
``VFIO_USER_DMA_WRITE`` 12 server -> client
``VFIO_USER_DEVICE_RESET`` 13 client -> server
``VFIO_USER_REGION_WRITE_MULTI`` 15 client -> server
``VFIO_USER_DEVICE_FEATURE`` 16 client -> server
``VFIO_USER_MIG_DATA_READ`` 17 client -> server
``VFIO_USER_MIG_DATA_WRITE`` 18 client -> server
====================================== ========= =================

Header
Expand Down Expand Up @@ -515,18 +518,7 @@ Capabilities:
| | | are supported if the value is ``true``. |
+--------------------+---------+------------------------------------------------+

The migration capability contains the following name/value pairs:

+-----------------+--------+--------------------------------------------------+
| Name | Type | Description |
+=================+========+==================================================+
| pgsize | number | Page size of dirty pages bitmap. The smallest |
| | | between the client and the server is used. |
+-----------------+--------+--------------------------------------------------+
| max_bitmap_size | number | Maximum bitmap size in ``VFIO_USER_DIRTY_PAGES`` |
| | | and ``VFIO_DMA_UNMAP`` messages. Optional, |
| | | with a default value of 256MB. |
+-----------------+--------+--------------------------------------------------+
The migration capability is currently an empty object.
w-henderson marked this conversation as resolved.
Show resolved Hide resolved

Reply
^^^^^
Expand Down Expand Up @@ -1468,6 +1460,292 @@ Reply

* *wr_cnt* is the number of device writes completed.

``VFIO_USER_DEVICE_FEATURE``
----------------------------

This command is analogous to ``VFIO_DEVICE_FEATURE``. It is used to get, set, or
probe feature data of the device.

Request
^^^^^^^

The request payload for this message is a structure of the following format.

+-------+--------+--------------------------------+
| Name | Offset | Size |
+=======+========+================================+
| argsz | 0 | 4 |
+-------+--------+--------------------------------+
| flags | 4 | 4 |
+-------+--------+--------------------------------+
| | +---------+---------------------------+ |
| | | Bit | Definition | |
| | +=========+===========================+ |
| | | 0 to 15 | VFIO_DEVICE_FEATURE_MASK | |
| | +---------+---------------------------+ |
| | | 16 | VFIO_DEVICE_FEATURE_GET | |
| | +---------+---------------------------+ |
| | | 17 | VFIO_DEVICE_FEATURE_SET | |
| | +---------+---------------------------+ |
| | | 18 | VFIO_DEVICE_FEATURE_PROBE | |
| | +---------+---------------------------+ |
+-------+--------+--------------------------------+
| data | 8 | variable |
+-------+--------+--------------------------------+

* *argsz* is the size of the above structure.

* *flags* defines the action to be performed by the server and upon which
feature:

* ``VFIO_DEVICE_FEATURE_MASK`` is the 16-bit feature index of the relevant
w-henderson marked this conversation as resolved.
Show resolved Hide resolved
feature.

* ``VFIO_DEVICE_FEATURE_GET`` instructs the server to get the feature data.
w-henderson marked this conversation as resolved.
Show resolved Hide resolved

* ``VFIO_DEVICE_FEATURE_SET`` instructs the server to set the feature data to
that given in the ``data`` field of the payload.

* ``VFIO_DEVICE_FEATURE_PROBE`` instructs the server to probe for feature
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is probing for one or more of these three things:

  1. is the flag supported at all
  2. can I do a "get" for this flag
  3. can I do a "set" for this flag

right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea is if you don't set GET/SET, it is successful if the feature is supported at all, but if you set GET/SET it is only successful if whatever commands you probed for are also supported. How would you suggest I word this more clearly?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a whole section on probe, saying something like:

Probing

If just VFIO_DEVICE_FEATURE_PROBE is set, the server should return the corresponding flags that are supported.

If VFIO_DEVICE_FEATURE_PROBE is set along with VFIO_DEVICE_FEATURE_GET, VFIO_DEVICE_FEATURE_SET, or both, the server should return whether the corresponding operation(s) are supported for the given flags.

support. If ``VFIO_DEVICE_FEATURE_GET/SET`` are also set, the probe will
w-henderson marked this conversation as resolved.
Show resolved Hide resolved
only return success if all of the indicated methods are supported.

``VFIO_DEVICE_FEATURE_GET`` and ``VFIO_DEVICE_FEATURE_SET`` are mutually
exclusive, except for use with ``VFIO_DEVICE_FEATURE_PROBE``.

* *data* is specific to the particular feature. It is not used for probing.

This part of the request is analogous to VFIO's ``struct vfio_device_feature``.

Reply
^^^^^

The reply payload must be the same as the request payload for setting or
probing a feature. For getting a feature's data, the data is added in the data
section and its length is added to ``argsz``.
w-henderson marked this conversation as resolved.
Show resolved Hide resolved

Device Features
^^^^^^^^^^^^^^^

The only device features supported by vfio-user are those related to migration.
w-henderson marked this conversation as resolved.
Show resolved Hide resolved

``VFIO_DEVICE_FEATURE_MIGRATION``
"""""""""""""""""""""""""""""""""

This feature indicates that the device can support the migration API through
``VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE``. If ``GET`` succeeds, the ``RUNNING``
and ``ERROR`` states are always supported. Support for additional states is
indicated via the flags field; at least ``VFIO_MIGRATION_STOP_COPY`` must be
set.

There is no data field of the request message.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit of the spec is pretty much just lifted from VFIO, I don't believe you can call SET on this feature...?

https://elixir.bootlin.com/linux/v6.3.8/source/include/uapi/linux/vfio.h#L814


The data field of the reply message is structured as follows:

+-------+--------+---------------------------+
| Name | Offset | Size |
+=======+========+===========================+
| flags | 0 | 8 |
+-------+--------+---------------------------+
| | +-----+--------------------------+ |
| | | Bit | Definition | |
| | +=====+==========================+ |
| | | 0 | VFIO_MIGRATION_STOP_COPY | |
| | +-----+--------------------------+ |
| | | 1 | VFIO_MIGRATION_P2P | |
| | +-----+--------------------------+ |
| | | 2 | VFIO_MIGRATION_PRE_COPY | |
| | +-----+--------------------------+ |
+-------+--------+---------------------------+

``VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE``
""""""""""""""""""""""""""""""""""""""""

Upon ``VFIO_DEVICE_FEATURE_SET``, execute a migration state change on the VFIO
device. The new state is supplied in ``device_state``. The state transition must
fully complete before the reply is sent.

The data field of the reply message, as well as the ``SET`` request message, is
structured as follows:

+--------------+--------+------+
| Name | Offset | Size |
+==============+========+======+
| device_state | 0 | 4 |
+--------------+--------+------+
| data_fd | 4 + 4 |
w-henderson marked this conversation as resolved.
Show resolved Hide resolved
+--------------+--------+------+

* *device_state* is the current state of the device (for ``GET``) or the
state to transition to (for ``SET``). It is defined by the
``vfio_device_mig_state`` enum as detailed below. These states are the states
of the device migration Finite State Machine.

+--------------------------------+-------+---------------------------------------------------------------------+
| Name | State | Description |
+================================+=======+=====================================================================+
| VFIO_DEVICE_STATE_ERROR | 0 | The device has failed and must be reset. |
+--------------------------------+-------+---------------------------------------------------------------------+
| VFIO_DEVICE_STATE_STOP | 1 | The device does not change the internal or external state. |
+--------------------------------+-------+---------------------------------------------------------------------+
| VFIO_DEVICE_STATE_RUNNING | 2 | The device is running normally. |
+--------------------------------+-------+---------------------------------------------------------------------+
| VFIO_DEVICE_STATE_STOP_COPY | 3 | The device internal state can be read out. |
+--------------------------------+-------+---------------------------------------------------------------------+
| VFIO_DEVICE_STATE_RESUMING | 4 | The device is stopped and is loading a new internal state. |
+--------------------------------+-------+---------------------------------------------------------------------+
| VFIO_DEVICE_STATE_RUNNING_P2P | 5 | (not used in vfio-user) |
+--------------------------------+-------+---------------------------------------------------------------------+
| VFIO_DEVICE_STATE_PRE_COPY | 6 | The device is running normally but tracking internal state changes. |
+--------------------------------+-------+---------------------------------------------------------------------+
| VFIO_DEVICE_STATE_PRE_COPY_P2P | 7 | (not used in vfio-user) |
+--------------------------------+-------+---------------------------------------------------------------------+

* *data_fd* is unused in vfio-user, as the ``VFIO_USER_MIG_DATA_READ`` and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as this is always going to be completely meaningless, we should discuss whether we should diverge here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, the only reason why we could keep it is to slightly simplify QEMU implementation, though I'm not sure that's a strong reason.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we haven't discussed whether we're going to support memory mapped data transfers for migration (this FD is something we could use), @w-henderson have you given any thought to this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't thought about this - I quite like the idea of following VFIO as closely as we can so my preference is probably to leave it in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

support memory mapped data transfers for migration (this FD is something we could use)

What I said doens't make sense, the FD needs to be passed via ancillary data. I don't have a strong opinion about keeping it, perhaps qemu-devel may help us decide.

``VFIO_USER_MIG_DATA_WRITE`` messages are used instead for migration data
transport.

Direct State Transitions
""""""""""""""""""""""""

The device migration FSM is a Mealy machine, so actions are taken upon the arcs
between FSM states. The following transitions need to be supported by the
server, a subset of those defined in ``<linux/vfio.h>``
(``enum vfio_device_mig_state``).

* ``RUNNING -> STOP``, ``STOP_COPY -> STOP``: Stop the operation of the device. The ``STOP_COPY`` arc terminates the data transfer session.

* ``RESUMING -> STOP``: Terminate the data transfer session. Complete processing
of the migration data. Stop the operation of the device. If the delivered data
is found to be incomplete, inconsistent, or otherwise invalid, fail the
``SET`` command and optionally transition to the ``ERROR`` state.

* ``PRE_COPY -> RUNNING``: Terminate the data transfer session. The device is
now fully operational.

* ``STOP -> RUNNING``: Start the operation of the device.

* ``RUNNING -> PRE_COPY``, ``STOP -> STOP_COPY``: Begin the process of saving
the device state. The device operation is unchanged, but data transfer begins.
``PRE_COPY`` and ``STOP_COPY`` are referred to as the "saving group" of
states.

* ``PRE_COPY -> STOP_COPY``: Continue to transfer migration data, but stop
device operation.

* ``STOP -> RESUMING``: Start the process of restoring the device state. The
internal device state may be changed to prepare the device to receive the
migration data.

The ``STOP_COPY -> PRE_COPY`` transition is explicitly not allowed and should
return an error if requested.

``ERROR`` cannot be specified as a device state, but any transition request can
be failed and then move the state into ``ERROR`` if the server was unable to
execute the requested arc AND was unable to restore the device into any valid
state. To recover from ``ERROR``, ``VFIO_USER_DEVICE_RESET`` must be used to
return back to ``RUNNING``.

If ``PRE_COPY`` is not supported, arcs touching it are removed.

Complex State Transitions
"""""""""""""""""""""""""

The remaining possible transitions are to be implemented as combinations of the
above FSM arcs. As there are multiple paths, the path should be selected based
on the following rules:

* Select the shortest path.

* The path cannot have saving group states as interior arcs, only start/end
states.

``VFIO_USER_MIG_DATA_READ``
---------------------------

This command is used to read data from the source migration server while it is
in a saving group state (``PRE_COPY`` or ``STOP_COPY``).

This command, and ``VFIO_USER_MIG_DATA_WRITE``, are used in place of the
``data_fd`` file descriptor in ``<linux/vfio.h>``
(``struct vfio_device_feature_mig_state``) to enable all data transport to use
the single already-established UNIX socket. Hence, the migration data is
treated like a stream, so the client must continue reading until no more
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this look in vfio? Is a stream "safe", or would we ever want to support sparse read/writes in some way?

I'm not clear how this would even work if we wanted to do device state dirty tracking in PRE_COPY state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe in VFIO we never "seek", only read/write until no more reading/writing can be done. I'm also unsure about how we'd do device state dirty tracking in PRE_COPY state - I think the idea is we track the dirty pages in PRE_COPY so we can only send the necessary migration data once we get to STOP_COPY but I'm not too sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe in VFIO we never "seek", only read/write until no more reading/writing can be done.

That's correct.

I'm not sure I understand the rest of the discussion though, especially how dirty page tracking is related to device state tracking.

I have the following question (perhaps we're talking about the same thing?), regarding PRE_COPY, from VFIO:

 * RUNNING -> PRE_COPY
 * RUNNING_P2P -> PRE_COPY_P2P
 * STOP -> STOP_COPY
 *   PRE_COPY, PRE_COPY_P2P and STOP_COPY form the "saving group" of states
 *   which share a data transfer session. Moving between these states alters
 *   what is streamed in session, but does not terminate or otherwise affect
 *   the associated fd.
 *
 *   These arcs begin the process of saving the device state and will return a
 *   new data_fd. The migration driver may perform actions such as enabling
 *   dirty logging of device state when entering PRE_COPY or PER_COPY_P2P.
 *
 *   Each arc does not change the device operation, the device remains
 *   RUNNING, P2P quiesced or in STOP. The STOP_COPY state is described below
 *   in PRE_COPY_P2P -> STOP_COPY.

IIUC PRE_COPY returns an FD as this starts a new copy session, so in vfio-user this means that the client can start sending VFIO_USER_MIG_DATA_READ messages until the server returns 0 length read (equivalnt to the data_fd returning no more data in VFIO). While in PRE_COPY state, the device migth generate more device data, how does the VFIO client know this? Does reading from the data_fd return data (it would previously return EOF).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VFIO defines PRE_COPY to mean

The device is running normally but tracking internal state changes

which maybe implies that we don't actually transfer any data in that state, rather just do whatever we can in the background to prepare for transfer and track the changes to keep this up to date? I'm not sure if this makes sense though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These arcs begin the process of saving the device state and will return a new data_fd.

This suggests otherwise, but I agree it's not clear at all.

Each arc does not change the device operation, the device remains RUNNING, P2P quiesced or in STOP.

Can you comment on this? Does the device stay in RUNNING state or does it transition to RUNNING | PRE_COPY?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It definitely doesn't stay in the exact same state though.

By state do you mean RUNNING etc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, RUNNING and PRE_COPY are distinct states that have the same behaviour from the client's perspective but do things differently behind the scenes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, the documentation says:

Each arc does not change the device operation, the device remains RUNNING, P2P quiesced or in STOP.

while you say:

It definitely doesn't stay in the exact same state though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused too, I had been thinking of them as distinct states but I guess it's supposed to be more like different behaviour in the same state? That sounds a bit dodgy?

Copy link
Member

@tmakatos tmakatos Jul 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does, and I was hoping you'd look at the kernel code to figure it out 😄.

migration data remains.

Request
^^^^^^^

The request payload for this message is a structure of the following format.

+-------+--------+------+
| Name | Offset | Size |
+=======+========+======+
| argsz | 0 | 4 |
+-------+--------+------+
| size | 4 | 8 |
+-------+--------+------+

* *argsz* is the size of the above structure.

* *size* is the size of the migration data to read.

Reply
^^^^^

The reply payload for this message is a structure of the following format.

+-------+--------+----------+
| Name | Offset | Size |
+=======+========+==========+
| argsz | 0 | 4 |
+-------+--------+----------+
| size | 4 | 8 |
+-------+--------+----------+
| data | 12 | variable |
+-------+--------+----------+

* *argsz* is the size of the above structure.

* *size* indicates the size of returned migration data. If this is less than the
requested size, there is no more migration data to read.

* *data* contains the migration data.

``VFIO_USER_MIG_DATA_WRITE``
----------------------------

This command is used to write data to the destination migration server while it
is in the ``RESUMING`` state.

As above, this replaces the ``data_fd`` file descriptor for transport of
migration data, and as such, the migration data is treated like a stream.

Request
^^^^^^^

The request payload for this message is a structure of the following format.

+-------+--------+----------+
| Name | Offset | Size |
+=======+========+==========+
| argsz | 0 | 4 |
+-------+--------+----------+
| size | 4 | 8 |
+-------+--------+----------+
| data | 12 | variable |
+-------+--------+----------+

* *argsz* is the size of the above structure.

* *size* is the size of the migration data to be written.

* *data* contains the migration data.

Reply
^^^^^

There is no reply payload for this message.

Appendices
==========
Expand Down