From e31b36ae78e0bad298fcb06bdaaef4ddfaa46d62 Mon Sep 17 00:00:00 2001 From: William Henderson Date: Thu, 6 Jul 2023 08:37:06 +0000 Subject: [PATCH 1/9] update to latest version of the spec Signed-off-by: William Henderson --- docs/vfio-user.rst | 469 +++++++++------------------------------------ 1 file changed, 88 insertions(+), 381 deletions(-) diff --git a/docs/vfio-user.rst b/docs/vfio-user.rst index 9e1314e3..0d96477a 100644 --- a/docs/vfio-user.rst +++ b/docs/vfio-user.rst @@ -1,5 +1,4 @@ .. include:: - ******************************** vfio-user Protocol Specification ******************************** @@ -322,9 +321,9 @@ usual ``msg_size`` field in the header, not the ``argsz`` field. In a reply, the server sets ``argsz`` field to the size needed for a full payload size. This may be less than the requested maximum size. This may be -larger than the requested maximum size: in that case, the payload reply header -is returned, but the ``argsz`` field in the reply indicates the needed size, -allowing a client to allocate a larger buffer for holding the reply before +larger than the requested maximum size: in that case, the full payload is not +included in the reply, but the ``argsz`` field in the reply indicates the needed +size, allowing a client to allocate a larger buffer for holding the reply before trying again. In addition, during negotiation (see `Version`_), the client and server may @@ -337,8 +336,9 @@ Protocol Specification To distinguish from the base VFIO symbols, all vfio-user symbols are prefixed with ``vfio_user`` or ``VFIO_USER``. In this revision, all data is in the -little-endian format, although this may be relaxed in future revisions in cases -where the client and server are both big-endian. +endianness of the host system, although this may be relaxed in future +revisions in cases where the client and server run on different hosts +with different endianness. Unless otherwise specified, all sizes should be presumed to be in bytes. 
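The ``argsz`` retry contract described above lends itself to a simple client-side loop. The sketch below is illustrative only: ``fetch_with_argsz`` and ``fake_server`` are stand-ins for a real vfio-user transport, and the only assumptions made are what the text states, namely that a reply payload begins with a 4-byte ``argsz`` field (host byte order) holding the full needed size, and that a truncated reply is returned when that exceeds the requested maximum.

```python
import struct

def fetch_with_argsz(send_request, max_size=4096):
    # Ask with our current maximum; if the reply's argsz reports a larger
    # needed size, grow the buffer and retry, per the contract above.
    argsz = max_size
    while True:
        reply = send_request(argsz)
        needed = struct.unpack_from("=I", reply, 0)[0]
        if needed <= argsz:
            return reply
        argsz = needed

def fake_server(argsz):
    # Stand-in transport: the full payload is 6000 bytes; replies are
    # truncated to the client's argsz but always carry the needed size.
    full = struct.pack("=I", 6000) + b"\x00" * (6000 - 4)
    return full[:argsz]

assert len(fetch_with_argsz(fake_server)) == 6000
```

The loop terminates because a second request with ``argsz`` equal to the reported size is, by the rules above, large enough to hold the full payload.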
@@ -365,7 +365,7 @@ Name Command Request Direction ``VFIO_USER_DMA_READ`` 11 server -> client ``VFIO_USER_DMA_WRITE`` 12 server -> client ``VFIO_USER_DEVICE_RESET`` 13 client -> server -``VFIO_USER_DIRTY_PAGES`` 14 client -> server +``VFIO_USER_REGION_WRITE_MULTI`` 15 client -> server ====================================== ========= ================= Header @@ -488,30 +488,45 @@ format: Capabilities: -+--------------------+--------+------------------------------------------------+ -| Name | Type | Description | -+====================+========+================================================+ -| max_msg_fds | number | Maximum number of file descriptors that can be | -| | | received by the sender in one message. | -| | | Optional. If not specified then the receiver | -| | | must assume a value of ``1``. | -+--------------------+--------+------------------------------------------------+ -| max_data_xfer_size | number | Maximum ``count`` for data transfer messages; | -| | | see `Read and Write Operations`_. Optional, | -| | | with a default value of 1048576 bytes. | -+--------------------+--------+------------------------------------------------+ -| migration | object | Migration capability parameters. If missing | -| | | then migration is not supported by the sender. | -+--------------------+--------+------------------------------------------------+ ++--------------------+---------+------------------------------------------------+ +| Name | Type | Description | ++====================+=========+================================================+ +| max_msg_fds | number | Maximum number of file descriptors that can be | +| | | received by the sender in one message. | +| | | Optional. If not specified then the receiver | +| | | must assume a value of ``1``. | ++--------------------+---------+------------------------------------------------+ +| max_data_xfer_size | number | Maximum ``count`` for data transfer messages; | +| | | see `Read and Write Operations`_. 
Optional, | +| | | with a default value of 1048576 bytes. | ++--------------------+---------+------------------------------------------------+ +| pgsizes | number | Page sizes supported in DMA map operations | +| | | or'ed together. Optional, with a default value | +| | | of supporting only 4k pages. | ++--------------------+---------+------------------------------------------------+ +| max_dma_maps | number | Maximum number of DMA map windows that can be | +| | | valid simultaneously. Optional, with a | +| | | default value of 65535 (64k-1). | ++--------------------+---------+------------------------------------------------+ +| migration | object | Migration capability parameters. If missing | +| | | then migration is not supported by the sender. | ++--------------------+---------+------------------------------------------------+ +| write_multiple | boolean | ``VFIO_USER_REGION_WRITE_MULTI`` messages | +| | | are supported if the value is ``true``. | ++--------------------+---------+------------------------------------------------+ The migration capability contains the following name/value pairs: -+--------+--------+-----------------------------------------------+ -| Name | Type | Description | -+========+========+===============================================+ -| pgsize | number | Page size of dirty pages bitmap. The smallest | -| | | between the client and the server is used. | -+--------+--------+-----------------------------------------------+ ++-----------------+--------+--------------------------------------------------+ +| Name | Type | Description | ++=================+========+==================================================+ +| pgsize | number | Page size of dirty pages bitmap. The smallest | +| | | between the client and the server is used. | ++-----------------+--------+--------------------------------------------------+ +| max_bitmap_size | number | Maximum bitmap size in ``VFIO_USER_DIRTY_PAGES`` | +| | | and ``VFIO_USER_DMA_UNMAP`` messages.
Optional, | +| | | with a default value of 256MB. | ++-----------------+--------+--------------------------------------------------+ Reply ^^^^^ @@ -607,56 +622,18 @@ The request payload for this message is a structure of the following format: +--------------+--------+------------------------+ | flags | 4 | 4 | +--------------+--------+------------------------+ -| | +-----+-----------------------+ | -| | | Bit | Definition | | -| | +=====+=======================+ | -| | | 0 | get dirty page bitmap | | -| | +-----+-----------------------+ | -| | | 1 | unmap all regions | | -| | +-----+-----------------------+ | -+--------------+--------+------------------------+ | address | 8 | 8 | +--------------+--------+------------------------+ | size | 16 | 8 | +--------------+--------+------------------------+ * *argsz* is the maximum size of the reply payload. -* *flags* contains the following DMA region attributes: - - * *get dirty page bitmap* indicates that a dirty page bitmap must be - populated before unmapping the DMA region. The client must provide a - `VFIO Bitmap`_ structure, explained below, immediately following this - entry. - * *unmap all regions* indicates to unmap all the regions previously - mapped via `VFIO_USER_DMA_MAP`. This flag cannot be combined with - *get dirty page bitmap* and expects *address* and *size* to be 0. - +* *flags* is unused in this version. * *address* is the base DMA address of the DMA region. * *size* is the size of the DMA region. The address and size of the DMA region being unmapped must match exactly a -previous mapping. The size of request message depends on whether or not the -*get dirty page bitmap* bit is set in Flags: - -* If not set, the size of the total request message is: 16 + 24. - -* If set, the size of the total request message is: 16 + 24 + 16. - -.. 
_VFIO Bitmap: - -VFIO Bitmap Format -"""""""""""""""""" - -+--------+--------+------+ -| Name | Offset | Size | -+========+========+======+ -| pgsize | 0 | 8 | -+--------+--------+------+ -| size | 8 | 8 | -+--------+--------+------+ - -* *pgsize* is the page size for the bitmap, in bytes. -* *size* is the size for the bitmap, in bytes, excluding the VFIO bitmap header. +previous mapping. Reply ^^^^^ @@ -665,14 +642,8 @@ Upon receiving a ``VFIO_USER_DMA_UNMAP`` command, if the file descriptor is mapped then the server must release all references to that DMA region before replying, which potentially includes in-flight DMA transactions. -The server responds with the original DMA entry in the request. If the -*get dirty page bitmap* bit is set in flags in the request, then -the server also includes the `VFIO Bitmap`_ structure sent in the request, -followed by the corresponding dirty page bitmap, where each bit represents -one page of size *pgsize* in `VFIO Bitmap`_ . +The server responds with the original DMA entry in the request. -The total size of the total reply message is: -16 + 24 + (16 + *size* in `VFIO Bitmap`_ if *get dirty page bitmap* is set). ``VFIO_USER_DEVICE_GET_INFO`` ----------------------------- @@ -888,7 +859,7 @@ VFIO region info cap sparse mmap +----------+--------+------+ | offset | 8 | 8 | +----------+--------+------+ -| size | 16 | 9 | +| size | 16 | 8 | +----------+--------+------+ | ... | | | +----------+--------+------+ @@ -902,39 +873,6 @@ The VFIO sparse mmap area is defined in ``<linux/vfio.h>`` (``struct vfio_region_info_cap_sparse_mmap``).
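The sparse mmap capability payload can be decoded mechanically. The sketch below is an assumption-laden illustration, not part of the spec: it presumes the leading ``nr_areas``/``reserved`` 32-bit words of ``struct vfio_region_info_cap_sparse_mmap`` have already been located (after the generic capability header), followed by ``nr_areas`` pairs of 64-bit ``offset``/``size`` values at the offsets shown in the table; the helper name is invented.

```python
import struct

def parse_sparse_mmap(data):
    # nr_areas and a reserved word, then nr_areas (offset, size) pairs,
    # each pair 16 bytes, starting at offset 8 as in the table above.
    nr_areas, _reserved = struct.unpack_from("=II", data, 0)
    return [struct.unpack_from("=QQ", data, 8 + 16 * i)
            for i in range(nr_areas)]

# Two mmap'able windows: 4 KiB at offset 0 and 8 KiB at offset 0x2000.
blob = (struct.pack("=II", 2, 0)
        + struct.pack("=QQ", 0x0, 0x1000)
        + struct.pack("=QQ", 0x2000, 0x2000))
assert parse_sparse_mmap(blob) == [(0, 0x1000), (0x2000, 0x2000)]
```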
-VFIO region type cap header -""""""""""""""""""""""""""" - -+------------------+---------------------------+ -| Name | Value | -+==================+===========================+ -| id | VFIO_REGION_INFO_CAP_TYPE | -+------------------+---------------------------+ -| version | 0x1 | -+------------------+---------------------------+ -| next | | -+------------------+---------------------------+ -| region info type | VFIO region info type | -+------------------+---------------------------+ - -This capability is defined when a region is specific to the device. - -VFIO region info type cap -""""""""""""""""""""""""" - -The VFIO region info type is defined in ``<linux/vfio.h>`` -(``struct vfio_region_info_cap_type``). - -+---------+--------+------+ -| Name | Offset | Size | -+=========+========+======+ -| type | 0 | 4 | -+---------+--------+------+ -| subtype | 4 | 4 | -+---------+--------+------+ - -The only device-specific region type and subtype supported by vfio-user is -``VFIO_REGION_TYPE_MIGRATION`` (3) and ``VFIO_REGION_SUBTYPE_MIGRATION`` (1). ``VFIO_USER_DEVICE_GET_REGION_IO_FDS`` -------------------------------------- @@ -1000,7 +938,7 @@ Reply * *argsz* is the size of the region IO FD info structure plus the total size of the sub-region array. Thus, each array entry "i" is at offset - i * ((argsz - 16) / count). Note that currently this is 40 bytes for both IO + i * ((argsz - 32) / count). Note that currently this is 40 bytes for both IO FD types, but this is not to be relied on. As elsewhere, this indicates the full reply payload size needed. * *flags* must be zero @@ -1016,8 +954,8 @@ Note that it is the client's responsibility to verify the requested values (for example, that the requested offset does not exceed the region's bounds).
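The entry-stride arithmetic quoted above is easy to get wrong, so a small helper can make it explicit. This is only a sketch of the formula in the reply description (a 32-byte fixed info structure followed by ``count`` equally sized entries); the helper name is illustrative, and the 40-byte entry size is, as the text warns, not to be relied on.

```python
def subregion_offset(argsz, count, i):
    # Offset of sub-region entry i, relative to the start of the
    # sub-region array, using the stride (argsz - 32) / count.
    assert count > 0 and (argsz - 32) % count == 0
    return i * ((argsz - 32) // count)

# A reply describing 3 sub-regions with the current 40-byte entries.
argsz = 32 + 3 * 40
assert [subregion_offset(argsz, 3, i) for i in range(3)] == [0, 40, 80]
```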
Each sub-region given in the response has one of two possible structures, -depending whether *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD`` (0) or -``VFIO_USER_IO_FD_TYPE_IOREGIONFD`` (1): +depending whether *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD`` or +``VFIO_USER_IO_FD_TYPE_IOREGIONFD``: Sub-Region IO FD info format (ioeventfd) """""""""""""""""""""""""""""""""""""""" @@ -1477,290 +1415,59 @@ Reply This command message is sent from the client to the server to reset the device. Neither the request or reply have a payload. -``VFIO_USER_DIRTY_PAGES`` ------------------------- - -This command is analogous to ``VFIO_IOMMU_DIRTY_PAGES``. It is sent by the client -to the server in order to control logging of dirty pages, usually during a live -migration. +``VFIO_USER_REGION_WRITE_MULTI`` +-------------------------------- -Dirty page tracking is optional for server implementation; clients should not -rely on it. +This message can be used to coalesce multiple device write operations +into a single message. It is only used as an optimization when the +outgoing message queue is relatively full.
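Packing a coalesced payload follows directly from the ``wr_cnt``-plus-entries layout this section goes on to define (an 8-byte ``offset``, 4-byte ``region``, 4-byte ``count``, and a fixed 8-byte ``data`` field per write). The sketch below is illustrative; the function name and the tuple convention are inventions for the example.

```python
import struct

def pack_write_multi(writes):
    # writes: iterable of (region, offset, data) tuples. Each entry is a
    # fixed 24 bytes; writes larger than 8 bytes cannot be described.
    msg = struct.pack("=Q", len(writes))
    for region, offset, data in writes:
        if len(data) > 8:
            raise ValueError("each coalesced write is limited to 8 bytes")
        msg += struct.pack("=QII", offset, region, len(data))
        msg += data.ljust(8, b"\x00")
    return msg

payload = pack_write_multi([(0, 0x10, b"\x01"),
                            (2, 0x40, b"\xff\xff\xff\xff")])
assert len(payload) == 8 + 2 * 24
assert struct.unpack_from("=Q", payload, 0)[0] == 2
```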
Request ^^^^^^^ -+-------+--------+-----------------------------------------+ -| Name | Offset | Size | -+=======+========+=========================================+ -| argsz | 0 | 4 | -+-------+--------+-----------------------------------------+ -| flags | 4 | 4 | -+-------+--------+-----------------------------------------+ -| | +-----+----------------------------------------+ | -| | | Bit | Definition | | -| | +=====+========================================+ | -| | | 0 | VFIO_IOMMU_DIRTY_PAGES_FLAG_START | | -| | +-----+----------------------------------------+ | -| | | 1 | VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP | | -| | +-----+----------------------------------------+ | -| | | 2 | VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP | | -| | +-----+----------------------------------------+ | -+-------+--------+-----------------------------------------+ - -* *argsz* is the size of the VFIO dirty bitmap info structure for - ``START/STOP``; and for ``GET_BITMAP``, the maximum size of the reply payload - -* *flags* defines the action to be performed by the server: - - * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_START`` instructs the server to start logging - pages it dirties. Logging continues until explicitly disabled by - ``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP``. - - * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP`` instructs the server to stop logging - dirty pages. - - * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP`` requests the server to return - the dirty bitmap for a specific IOVA range. The IOVA range is specified by - a "VFIO Bitmap Range" structure, which must immediately follow this - "VFIO Dirty Pages" structure. See `VFIO Bitmap Range Format`_. - This operation is only valid if logging of dirty pages has been previously - started. - - These flags are mutually exclusive with each other. - -This part of the request is analogous to VFIO's ``struct -vfio_iommu_type1_dirty_bitmap``. - -.. 
_VFIO Bitmap Range Format: - -VFIO Bitmap Range Format -"""""""""""""""""""""""" - -+--------+--------+------+ -| Name | Offset | Size | -+========+========+======+ -| iova | 0 | 8 | -+--------+--------+------+ -| size | 8 | 8 | -+--------+--------+------+ -| bitmap | 16 | 24 | -+--------+--------+------+ ++---------+--------+----------+ +| Name | Offset | Size | ++=========+========+==========+ +| wr_cnt | 0 | 8 | ++---------+--------+----------+ +| wrs | 8 | variable | ++---------+--------+----------+ -* *iova* is the IOVA offset +* *wr_cnt* is the number of device writes coalesced in the message +* *wrs* is an array of device writes defined below -* *size* is the size of the IOVA region +Single Device Write Format +"""""""""""""""""""""""""" -* *bitmap* is the VFIO Bitmap explained in `VFIO Bitmap`_. ++--------+--------+----------+ +| Name | Offset | Size | ++========+========+==========+ +| offset | 0 | 8 | ++--------+--------+----------+ +| region | 8 | 4 | ++--------+--------+----------+ +| count | 12 | 4 | ++--------+--------+----------+ +| data | 16 | 8 | ++--------+--------+----------+ -This part of the request is analogous to VFIO's ``struct -vfio_iommu_type1_dirty_bitmap_get``. +* *offset* into the region being accessed. +* *region* is the index of the region being accessed. +* *count* is the size of the data to be transferred. This format can + only describe writes of 8 bytes or less. +* *data* is the data to write. Reply ^^^^^ -For ``VFIO_IOMMU_DIRTY_PAGES_FLAG_START`` or -``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP``, there is no reply payload. 
- -For ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP``, the reply payload is as follows: - -+--------------+--------+-----------------------------------------+ -| Name | Offset | Size | -+==============+========+=========================================+ -| argsz | 0 | 4 | -+--------------+--------+-----------------------------------------+ -| flags | 4 | 4 | -+--------------+--------+-----------------------------------------+ -| | +-----+----------------------------------------+ | -| | | Bit | Definition | | -| | +=====+========================================+ | -| | | 2 | VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP | | -| | +-----+----------------------------------------+ | -+--------------+--------+-----------------------------------------+ -| bitmap range | 8 | 40 | -+--------------+--------+-----------------------------------------+ -| bitmap | 48 | variable | -+--------------+--------+-----------------------------------------+ - -* *argsz* is the size required for the full reply payload (dirty pages structure - + bitmap range structure + actual bitmap) -* *flags* is ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP`` -* *bitmap range* is the same bitmap range struct provided in the request, as - defined in `VFIO Bitmap Range Format`_. -* *bitmap* is the actual dirty pages bitmap corresponding to the range request - -VFIO Device Migration Info --------------------------- - -A device may contain a migration region (of type -``VFIO_REGION_TYPE_MIGRATION``). The beginning of the region must contain -``struct vfio_device_migration_info``, defined in ````. This -subregion is accessed like any other part of a standard vfio-user region -using ``VFIO_USER_REGION_READ``/``VFIO_USER_REGION_WRITE``. 
- -+---------------+--------+--------------------------------+ -| Name | Offset | Size | -+===============+========+================================+ -| device_state | 0 | 4 | -+---------------+--------+--------------------------------+ -| | +-----+-------------------------------+ | -| | | Bit | Definition | | -| | +=====+===============================+ | -| | | 0 | VFIO_DEVICE_STATE_V1_RUNNING | | -| | +-----+-------------------------------+ | -| | | 1 | VFIO_DEVICE_STATE_V1_SAVING | | -| | +-----+-------------------------------+ | -| | | 2 | VFIO_DEVICE_STATE_V1_RESUMING | | -| | +-----+-------------------------------+ | -+---------------+--------+--------------------------------+ -| reserved | 4 | 4 | -+---------------+--------+--------------------------------+ -| pending_bytes | 8 | 8 | -+---------------+--------+--------------------------------+ -| data_offset | 16 | 8 | -+---------------+--------+--------------------------------+ -| data_size | 24 | 8 | -+---------------+--------+--------------------------------+ - -* *device_state* defines the state of the device: - - The client initiates device state transition by writing the intended state. - The server must respond only after it has successfully transitioned to the new - state. If an error occurs then the server must respond to the - ``VFIO_USER_REGION_WRITE`` operation with the Error field set accordingly and - must remain at the previous state, or in case of internal error it must - transition to the error state, defined as - ``VFIO_DEVICE_STATE_V1_RESUMING | VFIO_DEVICE_STATE_V1_SAVING``. The client - must re-read the device state in order to determine it afresh. - - The following device states are defined: - - +-----------+---------+----------+-----------------------------------+ - | _RESUMING | _SAVING | _RUNNING | Description | - +===========+=========+==========+===================================+ - | 0 | 0 | 0 | Device is stopped. 
| - +-----------+---------+----------+-----------------------------------+ - | 0 | 0 | 1 | Device is running, default state. | - +-----------+---------+----------+-----------------------------------+ - | 0 | 1 | 0 | Stop-and-copy state | - +-----------+---------+----------+-----------------------------------+ - | 0 | 1 | 1 | Pre-copy state | - +-----------+---------+----------+-----------------------------------+ - | 1 | 0 | 0 | Resuming | - +-----------+---------+----------+-----------------------------------+ - | 1 | 0 | 1 | Invalid state | - +-----------+---------+----------+-----------------------------------+ - | 1 | 1 | 0 | Error state | - +-----------+---------+----------+-----------------------------------+ - | 1 | 1 | 1 | Invalid state | - +-----------+---------+----------+-----------------------------------+ - - Valid state transitions are shown in the following table: - - +-------------------------+---------+---------+---------------+----------+----------+ - | |darr| From / To |rarr| | Stopped | Running | Stop-and-copy | Pre-copy | Resuming | - +=========================+=========+=========+===============+==========+==========+ - | Stopped | \- | 1 | 0 | 0 | 0 | - +-------------------------+---------+---------+---------------+----------+----------+ - | Running | 1 | \- | 1 | 1 | 1 | - +-------------------------+---------+---------+---------------+----------+----------+ - | Stop-and-copy | 1 | 1 | \- | 0 | 0 | - +-------------------------+---------+---------+---------------+----------+----------+ - | Pre-copy | 0 | 0 | 1 | \- | 0 | - +-------------------------+---------+---------+---------------+----------+----------+ - | Resuming | 0 | 1 | 0 | 0 | \- | - +-------------------------+---------+---------+---------------+----------+----------+ - - A device is migrated to the destination as follows: - - * The source client transitions the device state from the running state to - the pre-copy state. 
This transition is optional for the client but must be - supported by the server. The source server starts sending device state data - to the source client through the migration region while the device is - running. - - * The source client transitions the device state from the running state or the - pre-copy state to the stop-and-copy state. The source server stops the - device, saves device state and sends it to the source client through the - migration region. - - The source client is responsible for sending the migration data to the - destination client. - - A device is resumed on the destination as follows: - - * The destination client transitions the device state from the running state - to the resuming state. The destination server uses the device state data - received through the migration region to resume the device. - - * The destination client provides saved device state to the destination - server and then transitions the device to back to the running state. - -* *reserved* This field is reserved and any access to it must be ignored by the - server. - -* *pending_bytes* Remaining bytes to be migrated by the server. This field is - read only. - -* *data_offset* Offset in the migration region where the client must: - - * read from, during the pre-copy or stop-and-copy state, or - - * write to, during the resuming state. - - This field is read only. - -* *data_size* Contains the size, in bytes, of the amount of data copied to: - - * the source migration region by the source server during the pre-copy or - stop-and copy state, or - - * the destination migration region by the destination client during the - resuming state. - -Device-specific data must be stored at any position after -``struct vfio_device_migration_info``. Note that the migration region can be -memory mappable, even partially. In practise, only the migration data portion -can be memory mapped. 
- -The client processes device state data during the pre-copy and the -stop-and-copy state in the following iterative manner: - - 1. The client reads ``pending_bytes`` to mark a new iteration. Repeated reads - of this field is an idempotent operation. If there are no migration data - to be consumed then the next step depends on the current device state: - - * pre-copy: the client must try again. - - * stop-and-copy: this procedure can end and the device can now start - resuming on the destination. - - 2. The client reads ``data_offset``; at this point the server must make - available a portion of migration data at this offset to be read by the - client, which must happen *before* completing the read operation. The - amount of data to be read must be stored in the ``data_size`` field, which - the client reads next. - - 3. The client reads ``data_size`` to determine the amount of migration data - available. - - 4. The client reads and processes the migration data. - - 5. Go to step 1. - -Note that the client can transition the device from the pre-copy state to the -stop-and-copy state at any time; ``pending_bytes`` does not need to become zero. ++---------+--------+----------+ +| Name | Offset | Size | ++=========+========+==========+ +| wr_cnt | 0 | 8 | ++---------+--------+----------+ -The client initializes the device state on the destination by setting the -device state in the resuming state and writing the migration data to the -destination migration region at ``data_offset`` offset. The client can write the -source migration data in an iterative manner and the server must consume this -data before completing each write operation, updating the ``data_offset`` field. -The server must apply the source migration data on the device resume state. The -client must write data on the same order and transaction size as read. +* *wr_cnt* is the number of device writes completed. -If an error occurs then the server must fail the read or write operation. 
It is -an implementation detail of the client how to handle errors. Appendices ========== From f7538867d4d3b1fc58658200b685984351f7afaa Mon Sep 17 00:00:00 2001 From: William Henderson Date: Thu, 6 Jul 2023 08:39:35 +0000 Subject: [PATCH 2/9] start updating spec for migration v2 Signed-off-by: William Henderson --- docs/vfio-user.rst | 304 +++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 291 insertions(+), 13 deletions(-) diff --git a/docs/vfio-user.rst b/docs/vfio-user.rst index 0d96477a..53327f2c 100644 --- a/docs/vfio-user.rst +++ b/docs/vfio-user.rst @@ -4,7 +4,7 @@ vfio-user Protocol Specification ******************************** -------------- -Version_ 0.9.1 +Version_ 0.9.2 -------------- .. contents:: Table of Contents @@ -366,6 +366,9 @@ Name Command Request Direction ``VFIO_USER_DMA_WRITE`` 12 server -> client ``VFIO_USER_DEVICE_RESET`` 13 client -> server ``VFIO_USER_REGION_WRITE_MULTI`` 15 client -> server +``VFIO_USER_DEVICE_FEATURE`` 16 client -> server +``VFIO_USER_MIG_DATA_READ`` 17 client -> server +``VFIO_USER_MIG_DATA_WRITE`` 18 client -> server ====================================== ========= ================= Header @@ -515,18 +518,7 @@ Capabilities: | | | are supported if the value is ``true``. | +--------------------+---------+------------------------------------------------+ -The migration capability contains the following name/value pairs: - -+-----------------+--------+--------------------------------------------------+ -| Name | Type | Description | -+=================+========+==================================================+ -| pgsize | number | Page size of dirty pages bitmap. The smallest | -| | | between the client and the server is used. | -+-----------------+--------+--------------------------------------------------+ -| max_bitmap_size | number | Maximum bitmap size in ``VFIO_USER_DIRTY_PAGES`` | -| | | and ``VFIO_DMA_UNMAP`` messages. Optional, | -| | | with a default value of 256MB. 
| -+-----------------+--------+--------------------------------------------------+ +The migration capability is currently an empty object. Reply ^^^^^ @@ -1468,6 +1460,292 @@ Reply * *wr_cnt* is the number of device writes completed. +``VFIO_USER_DEVICE_FEATURE`` +---------------------------- + +This command is analogous to ``VFIO_DEVICE_FEATURE``. It is used to get, set, or +probe feature data of the device. + +Request +^^^^^^^ + +The request payload for this message is a structure of the following format. + ++-------+--------+--------------------------------+ +| Name | Offset | Size | ++=======+========+================================+ +| argsz | 0 | 4 | ++-------+--------+--------------------------------+ +| flags | 4 | 4 | ++-------+--------+--------------------------------+ +| | +---------+---------------------------+ | +| | | Bit | Definition | | +| | +=========+===========================+ | +| | | 0 to 15 | VFIO_DEVICE_FEATURE_MASK | | +| | +---------+---------------------------+ | +| | | 16 | VFIO_DEVICE_FEATURE_GET | | +| | +---------+---------------------------+ | +| | | 17 | VFIO_DEVICE_FEATURE_SET | | +| | +---------+---------------------------+ | +| | | 18 | VFIO_DEVICE_FEATURE_PROBE | | +| | +---------+---------------------------+ | ++-------+--------+--------------------------------+ +| data | 8 | variable | ++-------+--------+--------------------------------+ + +* *argsz* is the size of the above structure. + +* *flags* defines the action to be performed by the server and upon which + feature: + + * ``VFIO_DEVICE_FEATURE_MASK`` is the 16-bit feature index of the relevant + feature. + + * ``VFIO_DEVICE_FEATURE_GET`` instructs the server to get the feature data. + + * ``VFIO_DEVICE_FEATURE_SET`` instructs the server to set the feature data to + that given in the ``data`` field of the payload. + + * ``VFIO_DEVICE_FEATURE_PROBE`` instructs the server to probe for feature + support. 
If ``VFIO_DEVICE_FEATURE_GET/SET`` are also set, the probe will + only return success if all of the indicated methods are supported. + + ``VFIO_DEVICE_FEATURE_GET`` and ``VFIO_DEVICE_FEATURE_SET`` are mutually + exclusive, except for use with ``VFIO_DEVICE_FEATURE_PROBE``. + +* *data* is specific to the particular feature. It is not used for probing. + +This part of the request is analogous to VFIO's ``struct vfio_device_feature``. + +Reply +^^^^^ + +The reply payload must be the same as the request payload for setting or +probing a feature. For getting a feature's data, the data is added in the data +section and its length is added to ``argsz``. + +Device Features +^^^^^^^^^^^^^^^ + +The only device features supported by vfio-user are those related to migration. + +``VFIO_DEVICE_FEATURE_MIGRATION`` +""""""""""""""""""""""""""""""""" + +This feature indicates that the device can support the migration API through +``VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE``. If ``GET`` succeeds, the ``RUNNING`` +and ``ERROR`` states are always supported. Support for additional states is +indicated via the flags field; at least ``VFIO_MIGRATION_STOP_COPY`` must be +set. + +There is no data field of the request message. 
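As a concrete illustration of the flag encoding used by ``VFIO_USER_DEVICE_FEATURE``, a client might probe whether the migration feature supports ``GET`` as sketched below. The bit positions come from the flags table above; the feature index value (``VFIO_DEVICE_FEATURE_MIGRATION`` = 1) is taken from ``<linux/vfio.h>`` and should be treated as an assumption of this example.

```python
import struct

VFIO_DEVICE_FEATURE_MIGRATION = 1        # assumed index from <linux/vfio.h>
VFIO_DEVICE_FEATURE_MASK = 0xFFFF        # bits 0..15
VFIO_DEVICE_FEATURE_GET = 1 << 16
VFIO_DEVICE_FEATURE_SET = 1 << 17
VFIO_DEVICE_FEATURE_PROBE = 1 << 18

def feature_request(flags, data=b""):
    # argsz covers the 8-byte argsz/flags header plus any feature data.
    return struct.pack("=II", 8 + len(data), flags) + data

# Probe for migration GET support: no data field is used when probing.
req = feature_request(VFIO_DEVICE_FEATURE_MIGRATION
                      | VFIO_DEVICE_FEATURE_GET
                      | VFIO_DEVICE_FEATURE_PROBE)
flags = struct.unpack_from("=I", req, 4)[0]
assert flags & VFIO_DEVICE_FEATURE_MASK == VFIO_DEVICE_FEATURE_MIGRATION
assert flags & VFIO_DEVICE_FEATURE_PROBE
```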
+ +The data field of the reply message is structured as follows: + ++-------+--------+---------------------------+ +| Name | Offset | Size | ++=======+========+===========================+ +| flags | 0 | 8 | ++-------+--------+---------------------------+ +| | +-----+--------------------------+ | +| | | Bit | Definition | | +| | +=====+==========================+ | +| | | 0 | VFIO_MIGRATION_STOP_COPY | | +| | +-----+--------------------------+ | +| | | 1 | VFIO_MIGRATION_P2P | | +| | +-----+--------------------------+ | +| | | 2 | VFIO_MIGRATION_PRE_COPY | | +| | +-----+--------------------------+ | ++-------+--------+---------------------------+ + +``VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE`` +"""""""""""""""""""""""""""""""""""""""" + +Upon ``VFIO_DEVICE_FEATURE_SET``, execute a migration state change on the VFIO +device. The new state is supplied in ``device_state``. The state transition must +fully complete before the reply is sent. + +The data field of the reply message, as well as the ``SET`` request message, is +structured as follows: + ++--------------+--------+------+ +| Name | Offset | Size | ++==============+========+======+ +| device_state | 0 | 4 | ++--------------+--------+------+ +| data_fd | 4 | 4 | ++--------------+--------+------+ + +* *device_state* is the current state of the device (for ``GET``) or the + state to transition to (for ``SET``). It is defined by the + ``vfio_device_mig_state`` enum as detailed below. These states are the states + of the device migration Finite State Machine. + ++--------------------------------+-------+---------------------------------------------------------------------+ +| Name | State | Description | ++================================+=======+=====================================================================+ +| VFIO_DEVICE_STATE_ERROR | 0 | The device has failed and must be reset.
| ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_STOP | 1 | The device does not change the internal or external state. | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_RUNNING | 2 | The device is running normally. | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_STOP_COPY | 3 | The device internal state can be read out. | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_RESUMING | 4 | The device is stopped and is loading a new internal state. | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_RUNNING_P2P | 5 | (not used in vfio-user) | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_PRE_COPY | 6 | The device is running normally but tracking internal state changes. | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_PRE_COPY_P2P | 7 | (not used in vfio-user) | ++--------------------------------+-------+---------------------------------------------------------------------+ + +* *data_fd* is unused in vfio-user, as the ``VFIO_USER_MIG_DATA_READ`` and + ``VFIO_USER_MIG_DATA_WRITE`` messages are used instead for migration data + transport. + +Direct State Transitions +"""""""""""""""""""""""" + +The device migration FSM is a Mealy machine, so actions are taken upon the arcs +between FSM states. The following transitions need to be supported by the +server, a subset of those defined in ```` +(``enum vfio_device_mig_state``). 
+ +* ``RUNNING -> STOP``, ``STOP_COPY -> STOP``: Stop the operation of the device. The ``STOP_COPY`` arc terminates the data transfer session. + +* ``RESUMING -> STOP``: Terminate the data transfer session. Complete processing + of the migration data. Stop the operation of the device. If the delivered data + is found to be incomplete, inconsistent, or otherwise invalid, fail the + ``SET`` command and optionally transition to the ``ERROR`` state. + +* ``PRE_COPY -> RUNNING``: Terminate the data transfer session. The device is + now fully operational. + +* ``STOP -> RUNNING``: Start the operation of the device. + +* ``RUNNING -> PRE_COPY``, ``STOP -> STOP_COPY``: Begin the process of saving + the device state. The device operation is unchanged, but data transfer begins. + ``PRE_COPY`` and ``STOP_COPY`` are referred to as the "saving group" of + states. + +* ``PRE_COPY -> STOP_COPY``: Continue to transfer migration data, but stop + device operation. + +* ``STOP -> RESUMING``: Start the process of restoring the device state. The + internal device state may be changed to prepare the device to receive the + migration data. + +The ``STOP_COPY -> PRE_COPY`` transition is explicitly not allowed and should +return an error if requested. + +``ERROR`` cannot be specified as a device state, but any transition request can +be failed and then move the state into ``ERROR`` if the server was unable to +execute the requested arc AND was unable to restore the device into any valid +state. To recover from ``ERROR``, ``VFIO_USER_DEVICE_RESET`` must be used to +return back to ``RUNNING``. + +If ``PRE_COPY`` is not supported, arcs touching it are removed. + +Complex State Transitions +""""""""""""""""""""""""" + +The remaining possible transitions are to be implemented as combinations of the +above FSM arcs. As there are multiple paths, the path should be selected based +on the following rules: + +* Select the shortest path. 
+ +* The path cannot have saving group states as interior arcs, only start/end + states. + +``VFIO_USER_MIG_DATA_READ`` +--------------------------- + +This command is used to read data from the source migration server while it is +in a saving group state (``PRE_COPY`` or ``STOP_COPY``). + +This command, and ``VFIO_USER_MIG_DATA_WRITE``, are used in place of the +``data_fd`` file descriptor in ```` +(``struct vfio_device_feature_mig_state``) to enable all data transport to use +the single already-established UNIX socket. Hence, the migration data is +treated like a stream, so the client must continue reading until no more +migration data remains. + +Request +^^^^^^^ + +The request payload for this message is a structure of the following format. + ++-------+--------+------+ +| Name | Offset | Size | ++=======+========+======+ +| argsz | 0 | 4 | ++-------+--------+------+ +| size | 4 | 8 | ++-------+--------+------+ + +* *argsz* is the size of the above structure. + +* *size* is the size of the migration data to read. + +Reply +^^^^^ + +The reply payload for this message is a structure of the following format. + ++-------+--------+----------+ +| Name | Offset | Size | ++=======+========+==========+ +| argsz | 0 | 4 | ++-------+--------+----------+ +| size | 4 | 8 | ++-------+--------+----------+ +| data | 12 | variable | ++-------+--------+----------+ + +* *argsz* is the size of the above structure. + +* *size* indicates the size of returned migration data. If this is less than the + requested size, there is no more migration data to read. + +* *data* contains the migration data. + +``VFIO_USER_MIG_DATA_WRITE`` +---------------------------- + +This command is used to write data to the destination migration server while it +is in the ``RESUMING`` state. + +As above, this replaces the ``data_fd`` file descriptor for transport of +migration data, and as such, the migration data is treated like a stream. 
+ +Request +^^^^^^^ + +The request payload for this message is a structure of the following format. + ++-------+--------+----------+ +| Name | Offset | Size | ++=======+========+==========+ +| argsz | 0 | 4 | ++-------+--------+----------+ +| size | 4 | 8 | ++-------+--------+----------+ +| data | 12 | variable | ++-------+--------+----------+ + +* *argsz* is the size of the above structure. + +* *size* is the size of the migration data to be written. + +* *data* contains the migration data. + +Reply +^^^^^ + +There is no reply payload for this message. Appendices ========== From 2d38f2fb4aadd762be71266efa8a977c787d0dc5 Mon Sep 17 00:00:00 2001 From: William Henderson Date: Mon, 26 Jun 2023 13:07:06 +0000 Subject: [PATCH 3/9] start implementing migration v2 (WIP) Signed-off-by: William Henderson --- include/vfio-user.h | 39 +++++++++++++++++++++++++++++ lib/libvfio-user.c | 43 ++++++++++++++++++++++++++++++++ lib/migration.c | 54 +++++++++++++++++++++++++++++++++++++++++ lib/migration.h | 9 +++++++ test/py/libvfio_user.py | 20 ++++++++++++++- 5 files changed, 164 insertions(+), 1 deletion(-) diff --git a/include/vfio-user.h b/include/vfio-user.h index 52f08707..8b7697e7 100644 --- a/include/vfio-user.h +++ b/include/vfio-user.h @@ -67,6 +67,7 @@ enum vfio_user_command { VFIO_USER_DMA_WRITE = 12, VFIO_USER_DEVICE_RESET = 13, VFIO_USER_DIRTY_PAGES = 14, + VFIO_USER_DEVICE_FEATURE = 15, VFIO_USER_MAX, }; @@ -228,6 +229,44 @@ struct vfio_user_bitmap_range { #endif /* VFIO_REGION_TYPE_MIGRATION */ +/* Analogous to vfio_device_feature */ +struct vfio_user_device_feature { + uint32_t argsz; + uint32_t flags; +#define VFIO_DEVICE_FEATURE_MASK (0xffff) /* 16-bit feature index */ +#define VFIO_DEVICE_FEATURE_GET (1 << 16) /* Get feature into data[] */ +#define VFIO_DEVICE_FEATURE_SET (1 << 17) /* Set feature from data[] */ +#define VFIO_DEVICE_FEATURE_PROBE (1 << 18) /* Probe feature support */ + uint8_t data[]; +} __attribute__((packed)); + +/* Analogous to 
vfio_device_feature_migration */ +struct vfio_user_device_feature_migration { + uint64_t flags; +#define VFIO_MIGRATION_STOP_COPY (1 << 0) +#define VFIO_MIGRATION_P2P (1 << 1) +#define VFIO_MIGRATION_PRE_COPY (1 << 2) +} __attribute__((packed)); +#define VFIO_DEVICE_FEATURE_MIGRATION 1 + +/* Analogous to vfio_device_feature_mig_state */ +struct vfio_user_device_feature_mig_state { + uint32_t device_state; + uint32_t data_fd; +} __attribute__((packed)); +#define VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE 2 + +enum vfio_device_mig_state { + VFIO_DEVICE_STATE_ERROR = 0, + VFIO_DEVICE_STATE_STOP = 1, + VFIO_DEVICE_STATE_RUNNING = 2, + VFIO_DEVICE_STATE_STOP_COPY = 3, + VFIO_DEVICE_STATE_RESUMING = 4, + VFIO_DEVICE_STATE_RUNNING_P2P = 5, + VFIO_DEVICE_STATE_PRE_COPY = 6, + VFIO_DEVICE_STATE_PRE_COPY_P2P = 7, +}; + #ifdef __cplusplus } #endif diff --git a/lib/libvfio-user.c b/lib/libvfio-user.c index 94524a2f..01a12579 100644 --- a/lib/libvfio-user.c +++ b/lib/libvfio-user.c @@ -1052,6 +1052,45 @@ handle_dirty_pages(vfu_ctx_t *vfu_ctx, vfu_msg_t *msg) return ret; } +static int +handle_device_feature(vfu_ctx_t *vfu_ctx, vfu_msg_t *msg) +{ + struct vfio_user_device_feature *req = msg->in.iov.iov_base; + + if (vfu_ctx->migration == NULL) { + return -EINVAL; + } + + if (!migration_feature_supported(req->flags)) { + // FIXME what error code to return? we really want "not supported" + // instead of "not permitted"? 
+ return -EINVAL; + } + + ssize_t ret; + + if (req->flags & VFIO_DEVICE_FEATURE_PROBE) { + msg->out.iov.iov_base = msg->in.iov.iov_base; + msg->out.iov.iov_len = msg->in.iov.iov_len; + + ret = 0; + } else if (req->flags & VFIO_DEVICE_FEATURE_GET) { + msg->out.iov.iov_base = calloc(8, 1); + msg->out.iov.iov_len = 8; + + ret = migration_feature_get(vfu_ctx, req->flags, + msg->out.iov.iov_base); + } else if (req->flags & VFIO_DEVICE_FEATURE_SET) { + msg->out.iov.iov_base = msg->in.iov.iov_base; + msg->out.iov.iov_len = msg->in.iov.iov_len; + + ret = migration_feature_set(vfu_ctx, req->flags, + msg->out.iov.iov_base); + } + + return ret; +} + static vfu_msg_t * alloc_msg(struct vfio_user_header *hdr, int *fds, size_t nr_fds) { @@ -1221,6 +1260,10 @@ handle_request(vfu_ctx_t *vfu_ctx, vfu_msg_t *msg) } break; + case VFIO_USER_DEVICE_FEATURE: + ret = handle_device_feature(vfu_ctx, msg); + break; + default: msg->processed_cmd = false; vfu_log(vfu_ctx, LOG_ERR, "bad command %d", msg->hdr.cmd); diff --git a/lib/migration.c b/lib/migration.c index 794e7b81..ead121e5 100644 --- a/lib/migration.c +++ b/lib/migration.c @@ -534,6 +534,60 @@ migration_region_access(vfu_ctx_t *vfu_ctx, char *buf, size_t count, return count; } +bool +migration_feature_supported(uint32_t flags) { + switch (flags & VFIO_DEVICE_FEATURE_MASK) { + case VFIO_DEVICE_FEATURE_MIGRATION: + return !(flags & VFIO_DEVICE_FEATURE_SET); // not supported for set + case VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE: + return true; + default: + return false; + }; +} + +ssize_t +migration_feature_get(vfu_ctx_t *vfu_ctx, uint32_t flags, void *buf) +{ + struct vfio_user_device_feature_migration *res; + struct vfio_user_device_feature_mig_state *state; + + switch (flags & VFIO_DEVICE_FEATURE_MASK) { + case VFIO_DEVICE_FEATURE_MIGRATION: + res = buf; + // FIXME are these always supported? Can we consider to be + // "supported" if said support is just an empty callback? 
+ res->flags = VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_PRE_COPY; + + return 0; + + case VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE: + state = buf; + state->device_state = vfu_ctx->migration->info.device_state; + + return 0; + + default: + return -EINVAL; + + }; +} + +ssize_t +migration_feature_set(UNUSED vfu_ctx_t *vfu_ctx, uint32_t flags, + UNUSED void *buf) +{ + if (flags & VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE) { + // struct vfio_user_device_feature_mig_state *res = buf; + + // TODO migrate to res->device_state state if allowed to + + return 0; + } + + return -EINVAL; +} + bool MOCK_DEFINE(device_is_stopped_and_copying)(struct migration *migr) { diff --git a/lib/migration.h b/lib/migration.h index 26fd744f..73916465 100644 --- a/lib/migration.h +++ b/lib/migration.h @@ -52,6 +52,15 @@ ssize_t migration_region_access(vfu_ctx_t *vfu_ctx, char *buf, size_t count, loff_t pos, bool is_write); +bool +migration_feature_supported(uint32_t flags); + +ssize_t +migration_feature_get(vfu_ctx_t *vfu_ctx, uint32_t flags, void *buf); + +ssize_t +migration_feature_set(vfu_ctx_t *vfu_ctx, uint32_t flags, void *buf); + bool migration_available(vfu_ctx_t *vfu_ctx); diff --git a/test/py/libvfio_user.py b/test/py/libvfio_user.py index 8848dbff..2261951e 100644 --- a/test/py/libvfio_user.py +++ b/test/py/libvfio_user.py @@ -124,6 +124,16 @@ VFIO_DEVICE_STATE_V1_RESUMING = (1 << 2) VFIO_DEVICE_STATE_MASK = ((1 << 3) - 1) +VFIO_DEVICE_FEATURE_MIGRATION = (1) +VFIO_DEVICE_FEATURE_MASK = (0xffff) +VFIO_DEVICE_FEATURE_GET = (1 << 16) +VFIO_DEVICE_FEATURE_SET = (1 << 17) +VFIO_DEVICE_FEATURE_PROBE = (1 << 18) + +VFIO_MIGRATION_STOP_COPY = (1 << 0) +VFIO_MIGRATION_P2P = (1 << 1) +VFIO_MIGRATION_PRE_COPY = (1 << 2) + # libvfio-user defines @@ -171,7 +181,8 @@ def is_32bit(): VFIO_USER_DMA_WRITE = 12 VFIO_USER_DEVICE_RESET = 13 VFIO_USER_DIRTY_PAGES = 14 -VFIO_USER_MAX = 15 +VFIO_USER_DEVICE_FEATURE = 15 +VFIO_USER_MAX = 16 VFIO_USER_F_TYPE_COMMAND = 0 VFIO_USER_F_TYPE_REPLY = 1 @@ -569,6 +580,13 
@@ def __str__(self): (hex(self.dma_addr), self.region, hex(self.length), hex(self.offset), self.writeable) +class vfio_user_device_feature(Structure): + _pack_ = 1 + _fields_ = [ + ("argsz", c.c_uint32), + ("flags", c.c_uint32) + ] + class vfio_user_migration_info(Structure): _pack_ = 1 From 4bf29bc6a2852084dad63a99670fe7914a821ed7 Mon Sep 17 00:00:00 2001 From: William Henderson Date: Mon, 26 Jun 2023 16:02:08 +0000 Subject: [PATCH 4/9] first draft libvfio-user migration v2 implementation Signed-off-by: William Henderson --- include/libvfio-user.h | 128 +----------- include/vfio-user.h | 7 - lib/libvfio-user.c | 102 ++-------- lib/migration.c | 440 +++-------------------------------------- lib/migration.h | 9 +- lib/migration_priv.h | 95 ++------- 6 files changed, 63 insertions(+), 718 deletions(-) diff --git a/include/libvfio-user.h b/include/libvfio-user.h index 0a8af2b4..8a9e237c 100644 --- a/include/libvfio-user.h +++ b/include/libvfio-user.h @@ -583,21 +583,8 @@ typedef enum { VFU_MIGR_STATE_RESUME } vfu_migr_state_t; -#define VFU_MIGR_CALLBACKS_VERS 1 +#define VFU_MIGR_CALLBACKS_VERS 2 -/* - * Callbacks during the pre-copy and stop-and-copy phases. - * - * The client executes the following steps to copy migration data: - * - * 1. get_pending_bytes: device must return amount of migration data - * 2. prepare_data: device must prepare migration data - * 3. read_data: device must provide migration data - * - * The client repeats the above steps until there is no more migration data to - * return (the device must return 0 from get_pending_bytes to indicate that - * there are no more migration data to be consumed in this iteration). - */ typedef struct { /* @@ -615,83 +602,13 @@ typedef struct { * FIXME maybe we should create a single callback and pass the state? */ int (*transition)(vfu_ctx_t *vfu_ctx, vfu_migr_state_t state); - - /* Callbacks for saving device state */ - - /* - * Function that is called to retrieve the amount of pending migration - * data. 
If migration data were previously made available (function - * prepare_data has been called) then calling this function signifies that - * they have been read (e.g. migration data can be discarded). If the - * function returns 0 then migration has finished and this function won't - * be called again. - * - * The amount of pending migration data returned by the device does not - * necessarily have to monotonically decrease over time and does not need - * to match the amount of migration data returned via the @size argument in - * prepare_data. It can completely fluctuate according to the needs of the - * device. These semantics are derived from the pending_bytes register in - * VFIO. Therefore the value returned by get_pending_bytes must be - * primarily regarded as boolean, either 0 or non-zero, as far as migration - * completion is concerned. More advanced vfio-user clients can make - * assumptions on how migration is progressing on devices that guarantee - * that the amount of pending migration data decreases over time. - */ - uint64_t (*get_pending_bytes)(vfu_ctx_t *vfu_ctx); - - /* - * Function that is called to instruct the device to prepare migration data - * to be read when in pre-copy or stop-and-copy state, and to prepare for - * receiving migration data when in resuming state. - * - * When in pre-copy and stop-and-copy state, the function must return only - * after migration data are available at the specified offset. This - * callback is called once per iteration. The amount of data available - * pointed to by @size can be different that the amount of data returned by - * get_pending_bytes in the beginning of the iteration. - * - * In VFIO, the data_offset and data_size registers can be read multiple - * times during an iteration and are invariant, libvfio-user simplifies - * this by caching the values and returning them when read, guaranteeing - * that prepare_data() is called only once per migration iteration. 
- * - * When in resuming state, @offset must be set to where migration data must - * written. @size points to NULL. - * - * The callback should return -1 on error, setting errno. - */ - int (*prepare_data)(vfu_ctx_t *vfu_ctx, uint64_t *offset, uint64_t *size); - - /* - * Function that is called to read migration data. offset and size can be - * any subrange on the offset and size previously returned by prepare_data. - * The function must return the amount of data read or -1 on error, setting - * errno. - * - * This function can be called even if the migration data can be memory - * mapped. - */ + ssize_t (*read_data)(vfu_ctx_t *vfu_ctx, void *buf, uint64_t count, uint64_t offset); - /* Callbacks for restoring device state */ - - /* - * Fuction that is called for writing previously stored device state. The - * function must return the amount of data written or -1 on error, setting - * errno. - */ ssize_t (*write_data)(vfu_ctx_t *vfu_ctx, void *buf, uint64_t count, uint64_t offset); - /* - * Function that is called when client has written some previously stored - * device state. - * - * The callback should return -1 on error, setting errno. - */ - int (*data_written)(vfu_ctx_t *vfu_ctx, uint64_t count); - } vfu_migration_callbacks_t; /** @@ -722,45 +639,11 @@ _Static_assert(VFIO_DEVICE_STATE_STOP == 0, #endif /* VFIO_REGION_TYPE_MIGRATION_DEPRECATED */ -/* - * The currently defined migration registers; if using migration callbacks, - * these are handled internally by the library. - * - * This is analogous to struct vfio_device_migration_info. - */ -struct vfio_user_migration_info { - /* VFIO_DEVICE_STATE_* */ - uint32_t device_state; - uint32_t reserved; - uint64_t pending_bytes; - uint64_t data_offset; - uint64_t data_size; -}; - -/* - * Returns the size of the area needed to hold the migration registers at the - * beginning of the migration region; guaranteed to be page aligned. 
- */ -size_t -vfu_get_migr_register_area_size(void); - -/** - * vfu_setup_device_migration provides an abstraction over the migration - * protocol: the user specifies a set of callbacks which are called in response - * to client accesses of the migration region; the migration region read/write - * callbacks are not called after this function call. Offsets in callbacks are - * relative to @data_offset. - * - * @vfu_ctx: the libvfio-user context - * @callbacks: migration callbacks - * @data_offset: offset in the migration region where data begins. - * - * @returns 0 on success, -1 on error, sets errno. - */ int vfu_setup_device_migration_callbacks(vfu_ctx_t *vfu_ctx, - const vfu_migration_callbacks_t *callbacks, - uint64_t data_offset); + uint64_t flags, + const vfu_migration_callbacks_t + *callbacks); /** * Triggers an interrupt. @@ -903,7 +786,6 @@ enum { VFU_PCI_DEV_ROM_REGION_IDX, VFU_PCI_DEV_CFG_REGION_IDX, VFU_PCI_DEV_VGA_REGION_IDX, - VFU_PCI_DEV_MIGR_REGION_IDX, VFU_PCI_DEV_NUM_REGIONS, }; diff --git a/include/vfio-user.h b/include/vfio-user.h index 8b7697e7..e11ca6fa 100644 --- a/include/vfio-user.h +++ b/include/vfio-user.h @@ -222,13 +222,6 @@ struct vfio_user_bitmap_range { struct vfio_user_bitmap bitmap; } __attribute__((packed)); -#ifndef VFIO_REGION_TYPE_MIGRATION - -#define VFIO_REGION_TYPE_MIGRATION (3) -#define VFIO_REGION_SUBTYPE_MIGRATION (1) - -#endif /* VFIO_REGION_TYPE_MIGRATION */ - /* Analogous to vfio_device_feature */ struct vfio_user_device_feature { uint32_t argsz; diff --git a/lib/libvfio-user.c b/lib/libvfio-user.c index 01a12579..176a8115 100644 --- a/lib/libvfio-user.c +++ b/lib/libvfio-user.c @@ -83,21 +83,16 @@ vfu_log(vfu_ctx_t *vfu_ctx, int level, const char *fmt, ...) 
} static size_t -get_vfio_caps_size(bool is_migr_reg, vfu_reg_info_t *reg) +get_vfio_caps_size(vfu_reg_info_t *reg) { - size_t type_size = 0; size_t sparse_size = 0; - if (is_migr_reg) { - type_size = sizeof(struct vfio_region_info_cap_type); - } - if (reg->nr_mmap_areas != 0) { sparse_size = sizeof(struct vfio_region_info_cap_sparse_mmap) + (reg->nr_mmap_areas * sizeof(struct vfio_region_sparse_mmap_area)); } - return type_size + sparse_size; + return sparse_size; } /* @@ -106,7 +101,7 @@ get_vfio_caps_size(bool is_migr_reg, vfu_reg_info_t *reg) * points accordingly. */ static int -dev_get_caps(vfu_ctx_t *vfu_ctx, vfu_reg_info_t *vfu_reg, bool is_migr_reg, +dev_get_caps(vfu_ctx_t *vfu_ctx, vfu_reg_info_t *vfu_reg, struct vfio_region_info *vfio_reg, int **fds, size_t *nr_fds) { struct vfio_info_cap_header *header; @@ -120,16 +115,6 @@ dev_get_caps(vfu_ctx_t *vfu_ctx, vfu_reg_info_t *vfu_reg, bool is_migr_reg, header = (struct vfio_info_cap_header*)(vfio_reg + 1); - if (is_migr_reg) { - type = (struct vfio_region_info_cap_type *)header; - type->header.id = VFIO_REGION_INFO_CAP_TYPE; - type->header.version = 1; - type->header.next = 0; - type->type = VFIO_REGION_TYPE_MIGRATION; - type->subtype = VFIO_REGION_SUBTYPE_MIGRATION; - vfio_reg->cap_offset = sizeof(struct vfio_region_info); - } - if (vfu_reg->mmap_areas != NULL) { int i, nr_mmap_areas = vfu_reg->nr_mmap_areas; if (type != NULL) { @@ -218,14 +203,6 @@ region_access(vfu_ctx_t *vfu_ctx, size_t region, char *buf, if (ret == -1) { goto out; } - } else if (region == VFU_PCI_DEV_MIGR_REGION_IDX) { - if (vfu_ctx->migration == NULL) { - vfu_log(vfu_ctx, LOG_ERR, "migration not enabled"); - ret = ERROR_INT(EINVAL); - goto out; - } - - ret = migration_region_access(vfu_ctx, buf, count, offset, is_write); } else { vfu_region_access_cb_t *cb = vfu_ctx->reg_info[region].cb; @@ -293,8 +270,7 @@ is_valid_region_access(vfu_ctx_t *vfu_ctx, size_t size, uint16_t cmd, return false; } - if 
(unlikely(device_is_stopped_and_copying(vfu_ctx->migration) && - index != VFU_PCI_DEV_MIGR_REGION_IDX)) { + if (unlikely(device_is_stopped_and_copying(vfu_ctx->migration))) { vfu_log(vfu_ctx, LOG_ERR, "cannot access region %zu while device in stop-and-copy state", index); @@ -421,8 +397,7 @@ handle_device_get_region_info(vfu_ctx_t *vfu_ctx, vfu_msg_t *msg) vfu_reg = &vfu_ctx->reg_info[in_info->index]; if (vfu_reg->size > 0) { - caps_size = get_vfio_caps_size(in_info->index == VFU_PCI_DEV_MIGR_REGION_IDX, - vfu_reg); + caps_size = get_vfio_caps_size(vfu_reg); } msg->out.iov.iov_len = MIN(sizeof(*out_info) + caps_size, in_info->argsz); @@ -457,9 +432,8 @@ handle_device_get_region_info(vfu_ctx_t *vfu_ctx, vfu_msg_t *msg) /* Only actually provide the caps if they fit. */ if (in_info->argsz >= out_info->argsz) { out_info->flags |= VFIO_REGION_INFO_FLAG_CAPS; - ret = dev_get_caps(vfu_ctx, vfu_reg, - in_info->index == VFU_PCI_DEV_MIGR_REGION_IDX, - out_info, &msg->out.fds, &msg->out.nr_fds); + ret = dev_get_caps(vfu_ctx, vfu_reg, out_info, &msg->out.fds, + &msg->out.nr_fds); if (ret < 0) { return ret; } @@ -1393,8 +1367,7 @@ static bool access_needs_quiesce(const vfu_ctx_t *vfu_ctx, size_t region_index, uint64_t offset) { - return access_migration_needs_quiesce(vfu_ctx, region_index, offset) - || access_is_pci_cap_exp(vfu_ctx, region_index, offset); + return access_is_pci_cap_exp(vfu_ctx, region_index, offset); } static bool @@ -1892,38 +1865,6 @@ copyin_mmap_areas(vfu_reg_info_t *reg_info, return 0; } -static bool -ranges_intersect(size_t off1, size_t size1, size_t off2, size_t size2) -{ - /* - * For two ranges to intersect, the start of each range must be before the - * end of the other range. - * TODO already defined in lib/pci_caps.c, maybe introduce a file for misc - * utility functions? 
- */ - return (off1 < (off2 + size2) && off2 < (off1 + size1)); -} - -static bool -maps_over_migr_regs(struct iovec *iov) -{ - return ranges_intersect(0, vfu_get_migr_register_area_size(), - (size_t)iov->iov_base, iov->iov_len); -} - -static bool -validate_sparse_mmaps_for_migr_reg(vfu_reg_info_t *reg) -{ - int i; - - for (i = 0; i < reg->nr_mmap_areas; i++) { - if (maps_over_migr_regs(®->mmap_areas[i])) { - return false; - } - } - return true; -} - EXPORT int vfu_setup_region(vfu_ctx_t *vfu_ctx, int region_idx, size_t size, vfu_region_access_cb_t *cb, int flags, @@ -1969,12 +1910,6 @@ vfu_setup_region(vfu_ctx_t *vfu_ctx, int region_idx, size_t size, return ERROR_INT(EINVAL); } - if (region_idx == VFU_PCI_DEV_MIGR_REGION_IDX && - size < vfu_get_migr_register_area_size()) { - vfu_log(vfu_ctx, LOG_ERR, "invalid migration region size %zu", size); - return ERROR_INT(EINVAL); - } - for (i = 0; i < nr_mmap_areas; i++) { struct iovec *iov = &mmap_areas[i]; if ((size_t)iov_end(iov) > size) { @@ -2006,15 +1941,6 @@ vfu_setup_region(vfu_ctx_t *vfu_ctx, int region_idx, size_t size, } } - if (region_idx == VFU_PCI_DEV_MIGR_REGION_IDX) { - if (!validate_sparse_mmaps_for_migr_reg(reg)) { - vfu_log(vfu_ctx, LOG_ERR, - "migration registers cannot be memory mapped"); - errno = EINVAL; - goto err; - } - } - return 0; err: @@ -2093,27 +2019,21 @@ vfu_setup_irq_state_callback(vfu_ctx_t *vfu_ctx, enum vfu_dev_irq_type type, } EXPORT int -vfu_setup_device_migration_callbacks(vfu_ctx_t *vfu_ctx, - const vfu_migration_callbacks_t *callbacks, - uint64_t data_offset) +vfu_setup_device_migration_callbacks(vfu_ctx_t *vfu_ctx, uint64_t flags, + const vfu_migration_callbacks_t *callbacks) { int ret = 0; assert(vfu_ctx != NULL); assert(callbacks != NULL); - if (vfu_ctx->reg_info[VFU_PCI_DEV_MIGR_REGION_IDX].size == 0) { - vfu_log(vfu_ctx, LOG_ERR, "no device migration region"); - return ERROR_INT(EINVAL); - } - if (callbacks->version != VFU_MIGR_CALLBACKS_VERS) { vfu_log(vfu_ctx, LOG_ERR, 
"unsupported migration callbacks version %d", callbacks->version); return ERROR_INT(EINVAL); } - vfu_ctx->migration = init_migration(callbacks, data_offset, &ret); + vfu_ctx->migration = init_migration(callbacks, flags, &ret); if (vfu_ctx->migration == NULL) { vfu_log(vfu_ctx, LOG_ERR, "failed to initialize device migration"); return ERROR_INT(ret); diff --git a/lib/migration.c b/lib/migration.c index ead121e5..f3013c58 100644 --- a/lib/migration.c +++ b/lib/migration.c @@ -42,14 +42,7 @@ bool MOCK_DEFINE(vfio_migr_state_transition_is_valid)(uint32_t from, uint32_t to) { - return migr_states[from].state & (1 << to); -} - -EXPORT size_t -vfu_get_migr_register_area_size(void) -{ - return ROUND_UP(sizeof(struct vfio_user_migration_info), - sysconf(_SC_PAGE_SIZE)); + return transitions[from][to]; } /* @@ -57,22 +50,19 @@ vfu_get_migr_register_area_size(void) * in vfu_ctx_t. */ struct migration * -init_migration(const vfu_migration_callbacks_t * callbacks, - uint64_t data_offset, int *err) +init_migration(const vfu_migration_callbacks_t *callbacks, + uint64_t flags, int *err) { struct migration *migr; - if (data_offset < vfu_get_migr_register_area_size()) { - *err = EINVAL; - return NULL; - } - migr = calloc(1, sizeof(*migr)); if (migr == NULL) { *err = ENOMEM; return NULL; } + migr->flags = flags; + /* * FIXME: incorrect, if the client doesn't give a pgsize value, it means "no * migration support", handle this @@ -81,13 +71,10 @@ init_migration(const vfu_migration_callbacks_t * callbacks, migr->pgsize = sysconf(_SC_PAGESIZE); /* FIXME this should be done in vfu_ctx_realize */ - migr->info.device_state = VFIO_DEVICE_STATE_V1_RUNNING; - migr->data_offset = data_offset; + migr->state = VFIO_DEVICE_STATE_RUNNING; migr->callbacks = *callbacks; if (migr->callbacks.transition == NULL || - migr->callbacks.get_pending_bytes == NULL || - migr->callbacks.prepare_data == NULL || migr->callbacks.read_data == NULL || migr->callbacks.write_data == NULL) { free(migr); @@ -100,35 
+87,30 @@ init_migration(const vfu_migration_callbacks_t * callbacks, void MOCK_DEFINE(migr_state_transition)(struct migration *migr, - enum migr_iter_state state) + enum vfio_device_mig_state state) { assert(migr != NULL); /* FIXME validate that state transition */ - migr->iter.state = state; + migr->state = state; } vfu_migr_state_t -MOCK_DEFINE(migr_state_vfio_to_vfu)(uint32_t device_state) +MOCK_DEFINE(migr_state_vfio_to_vfu)(enum vfio_device_mig_state state) { - switch (device_state) { - case VFIO_DEVICE_STATE_V1_STOP: + switch (state) { + case VFIO_DEVICE_STATE_STOP: return VFU_MIGR_STATE_STOP; - case VFIO_DEVICE_STATE_V1_RUNNING: + case VFIO_DEVICE_STATE_RUNNING: return VFU_MIGR_STATE_RUNNING; - case VFIO_DEVICE_STATE_V1_SAVING: - /* - * FIXME How should the device operate during the stop-and-copy - * phase? Should we only allow the migration data to be read from - * the migration region? E.g. Access to any other region should be - * failed? This might be a good question to send to LKML. 
- */ + case VFIO_DEVICE_STATE_STOP_COPY: return VFU_MIGR_STATE_STOP_AND_COPY; - case VFIO_DEVICE_STATE_V1_RUNNING | VFIO_DEVICE_STATE_V1_SAVING: - return VFU_MIGR_STATE_PRE_COPY; - case VFIO_DEVICE_STATE_V1_RESUMING: + case VFIO_DEVICE_STATE_RESUMING: return VFU_MIGR_STATE_RESUME; + case VFIO_DEVICE_STATE_PRE_COPY: + return VFU_MIGR_STATE_PRE_COPY; + default: + return -1; } - return -1; } /** @@ -165,8 +147,7 @@ MOCK_DEFINE(migr_trans_to_valid_state)(vfu_ctx_t *vfu_ctx, struct migration *mig return ret; } } - migr->info.device_state = device_state; - migr_state_transition(migr, VFIO_USER_MIGR_ITER_STATE_INITIAL); + migr_state_transition(migr, device_state); // TODO confused return 0; } @@ -180,360 +161,12 @@ MOCK_DEFINE(handle_device_state)(vfu_ctx_t *vfu_ctx, struct migration *migr, assert(migr != NULL); - if (!vfio_migr_state_transition_is_valid(migr->info.device_state, - device_state)) { + if (!vfio_migr_state_transition_is_valid(migr->state, device_state)) { return ERROR_INT(EINVAL); } return migr_trans_to_valid_state(vfu_ctx, migr, device_state, notify); } -/** - * Returns 0 on success, -1 on error setting errno. - */ -static ssize_t -handle_pending_bytes(vfu_ctx_t *vfu_ctx, struct migration *migr, - uint64_t *pending_bytes, bool is_write) -{ - assert(migr != NULL); - assert(pending_bytes != NULL); - - if (is_write) { - return ERROR_INT(EINVAL); - } - - if (migr->iter.state == VFIO_USER_MIGR_ITER_STATE_FINISHED) { - *pending_bytes = 0; - return 0; - } - - switch (migr->iter.state) { - case VFIO_USER_MIGR_ITER_STATE_INITIAL: - case VFIO_USER_MIGR_ITER_STATE_DATA_PREPARED: - /* - * FIXME what happens if data haven't been consumed in the previous - * iteration? Check https://www.spinics.net/lists/kvm/msg228608.html. 
- */ - *pending_bytes = migr->iter.pending_bytes = migr->callbacks.get_pending_bytes(vfu_ctx); - - if (*pending_bytes == 0) { - migr_state_transition(migr, VFIO_USER_MIGR_ITER_STATE_FINISHED); - } else { - migr_state_transition(migr, VFIO_USER_MIGR_ITER_STATE_STARTED); - } - break; - case VFIO_USER_MIGR_ITER_STATE_STARTED: - /* - * FIXME We might be wrong returning a cached value, check - * https://www.spinics.net/lists/kvm/msg228608.html - * - */ - *pending_bytes = migr->iter.pending_bytes; - break; - default: - return ERROR_INT(EINVAL); - } - return 0; -} - -/* - * FIXME reading or writing migration registers with the wrong device state or - * out of sequence is undefined, but should not result in EINVAL, it should - * simply be ignored. However this way it's easier to catch development errors. - * Make this behavior conditional. - */ - -/** - * Returns 0 on success, -1 on error setting errno. - */ -static ssize_t -handle_data_offset_when_saving(vfu_ctx_t *vfu_ctx, struct migration *migr, - bool is_write) -{ - int ret = 0; - - assert(migr != NULL); - - if (is_write) { - vfu_log(vfu_ctx, LOG_ERR, "data_offset is RO when saving"); - return ERROR_INT(EINVAL); - } - - switch (migr->iter.state) { - case VFIO_USER_MIGR_ITER_STATE_STARTED: - ret = migr->callbacks.prepare_data(vfu_ctx, &migr->iter.offset, - &migr->iter.size); - if (ret != 0) { - return ret; - } - /* - * FIXME must first read data_offset and then data_size. They way we've - * implemented it now, if data_size is read before data_offset we - * transition to state VFIO_USER_MIGR_ITER_STATE_DATA_PREPARED without - * calling callbacks.prepare_data, which is wrong. Maybe we need - * separate states for data_offset and data_size. - */ - migr_state_transition(migr, VFIO_USER_MIGR_ITER_STATE_DATA_PREPARED); - break; - case VFIO_USER_MIGR_ITER_STATE_DATA_PREPARED: - /* - * data_offset is invariant during a save iteration. 
- */ - break; - default: - vfu_log(vfu_ctx, LOG_ERR, - "reading data_offset out of sequence is undefined"); - return ERROR_INT(EINVAL); - } - - return 0; -} - -/** - * Returns 0 on success, -1 on error setting errno. - */ -static ssize_t -handle_data_offset(vfu_ctx_t *vfu_ctx, struct migration *migr, - uint64_t *offset, bool is_write) -{ - int ret; - - assert(migr != NULL); - assert(offset != NULL); - - switch (migr->info.device_state) { - case VFIO_DEVICE_STATE_V1_SAVING: - case VFIO_DEVICE_STATE_V1_RUNNING | VFIO_DEVICE_STATE_V1_SAVING: - ret = handle_data_offset_when_saving(vfu_ctx, migr, is_write); - if (ret == 0 && !is_write) { - *offset = migr->iter.offset + migr->data_offset; - } - return ret; - case VFIO_DEVICE_STATE_V1_RESUMING: - if (is_write) { - /* TODO writing to read-only registers should be simply ignored */ - vfu_log(vfu_ctx, LOG_ERR, "bad write to migration data_offset"); - return ERROR_INT(EINVAL); - } - ret = migr->callbacks.prepare_data(vfu_ctx, offset, NULL); - if (ret != 0) { - return ret; - } - *offset += migr->data_offset; - return 0; - } - /* TODO improve error message */ - vfu_log(vfu_ctx, LOG_ERR, - "bad access to migration data_offset in state %s", - migr_states[migr->info.device_state].name); - return ERROR_INT(EINVAL); -} - -/** - * Returns 0 on success, -1 on failure setting errno. - */ -static ssize_t -handle_data_size_when_saving(vfu_ctx_t *vfu_ctx, struct migration *migr, - bool is_write) -{ - assert(migr != NULL); - - if (is_write) { - /* TODO improve error message */ - vfu_log(vfu_ctx, LOG_ERR, "data_size is RO when saving"); - return ERROR_INT(EINVAL); - } - - if (migr->iter.state != VFIO_USER_MIGR_ITER_STATE_STARTED && - migr->iter.state != VFIO_USER_MIGR_ITER_STATE_DATA_PREPARED) { - vfu_log(vfu_ctx, LOG_ERR, - "reading data_size ouf of sequence is undefined"); - return ERROR_INT(EINVAL); - } - return 0; -} - -/** - * Returns 0 on success, -1 on error setting errno. 
- */ -static ssize_t -handle_data_size_when_resuming(vfu_ctx_t *vfu_ctx, struct migration *migr, - uint64_t size, bool is_write) -{ - assert(migr != NULL); - - if (is_write) { - return migr->callbacks.data_written(vfu_ctx, size); - } - return 0; -} - -/** - * Returns 0 on success, -1 on failure setting errno. - */ -static ssize_t -handle_data_size(vfu_ctx_t *vfu_ctx, struct migration *migr, - uint64_t *size, bool is_write) -{ - int ret; - - assert(vfu_ctx != NULL); - assert(size != NULL); - - switch (migr->info.device_state){ - case VFIO_DEVICE_STATE_V1_SAVING: - case VFIO_DEVICE_STATE_V1_RUNNING | VFIO_DEVICE_STATE_V1_SAVING: - ret = handle_data_size_when_saving(vfu_ctx, migr, is_write); - if (ret == 0 && !is_write) { - *size = migr->iter.size; - } - return ret; - case VFIO_DEVICE_STATE_V1_RESUMING: - return handle_data_size_when_resuming(vfu_ctx, migr, *size, is_write); - } - /* TODO improve error message */ - vfu_log(vfu_ctx, LOG_ERR, "bad access to data_size"); - return ERROR_INT(EINVAL); -} - -/** - * Returns 0 on success, -1 on failure setting errno. 
- */ -ssize_t -MOCK_DEFINE(migration_region_access_registers)(vfu_ctx_t *vfu_ctx, char *buf, - size_t count, loff_t pos, - bool is_write) -{ - struct migration *migr = vfu_ctx->migration; - int ret; - uint32_t *device_state, old_device_state; - - assert(migr != NULL); - - switch (pos) { - case offsetof(struct vfio_user_migration_info, device_state): - if (count != sizeof(migr->info.device_state)) { - vfu_log(vfu_ctx, LOG_ERR, - "bad device_state access size %zu", count); - return ERROR_INT(EINVAL); - } - device_state = (uint32_t *)buf; - if (!is_write) { - *device_state = migr->info.device_state; - return 0; - } - old_device_state = migr->info.device_state; - vfu_log(vfu_ctx, LOG_DEBUG, - "migration: transitioning from state %s to state %s", - migr_states[old_device_state].name, - migr_states[*device_state].name); - - ret = handle_device_state(vfu_ctx, migr, *device_state, true); - if (ret == 0) { - vfu_log(vfu_ctx, LOG_DEBUG, - "migration: transitioned from state %s to state %s", - migr_states[old_device_state].name, - migr_states[*device_state].name); - } else { - vfu_log(vfu_ctx, LOG_ERR, - "migration: failed to transition from state %s to state %s", - migr_states[old_device_state].name, - migr_states[*device_state].name); - } - break; - case offsetof(struct vfio_user_migration_info, pending_bytes): - if (count != sizeof(migr->info.pending_bytes)) { - vfu_log(vfu_ctx, LOG_ERR, - "bad pending_bytes access size %zu", count); - return ERROR_INT(EINVAL); - } - ret = handle_pending_bytes(vfu_ctx, migr, (uint64_t *)buf, is_write); - break; - case offsetof(struct vfio_user_migration_info, data_offset): - if (count != sizeof(migr->info.data_offset)) { - vfu_log(vfu_ctx, LOG_ERR, - "bad data_offset access size %zu", count); - return ERROR_INT(EINVAL); - } - ret = handle_data_offset(vfu_ctx, migr, (uint64_t *)buf, is_write); - break; - case offsetof(struct vfio_user_migration_info, data_size): - if (count != sizeof(migr->info.data_size)) { - vfu_log(vfu_ctx, LOG_ERR, - 
"bad data_size access size %zu", count); - return ERROR_INT(EINVAL); - } - ret = handle_data_size(vfu_ctx, migr, (uint64_t *)buf, is_write); - break; - default: - vfu_log(vfu_ctx, LOG_ERR, - "bad migration region register offset %#llx", - (ull_t)pos); - return ERROR_INT(EINVAL); - } - return ret; -} - -ssize_t -migration_region_access(vfu_ctx_t *vfu_ctx, char *buf, size_t count, - loff_t pos, bool is_write) -{ - struct migration *migr = vfu_ctx->migration; - ssize_t ret; - - assert(migr != NULL); - assert(buf != NULL); - - /* - * FIXME don't call the device callback if the migration state is in not in - * pre-copy/stop-and-copy/resuming state, since the behavior is undefined - * in that case. - */ - - if (pos + count <= sizeof(struct vfio_user_migration_info)) { - ret = migration_region_access_registers(vfu_ctx, buf, count, - pos, is_write); - if (ret != 0) { - return ret; - } - } else { - - if (pos < (loff_t)migr->data_offset) { - /* - * TODO we can simply ignore the access to that part and handle - * any access to the data region properly. - */ - vfu_log(vfu_ctx, LOG_WARNING, - "bad access to dead space %#llx - %#llx in migration region", - (ull_t)pos, - (ull_t)(pos + count - 1)); - return ERROR_INT(EINVAL); - } - - pos -= migr->data_offset; - if (is_write) { - ret = migr->callbacks.write_data(vfu_ctx, buf, count, pos); - if (ret < 0) { - return -1; - } - } else { - /* - * FIXME says: - * - * d. Read data_size bytes of data from (region + data_offset) from the - * migration region. - * - * Does this mean that partial reads are not allowed? 
- */ - ret = migr->callbacks.read_data(vfu_ctx, buf, count, pos); - if (ret < 0) { - return -1; - } - } - } - - return count; -} - bool migration_feature_supported(uint32_t flags) { switch (flags & VFIO_DEVICE_FEATURE_MASK) { @@ -563,26 +196,23 @@ migration_feature_get(vfu_ctx_t *vfu_ctx, uint32_t flags, void *buf) case VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE: state = buf; - state->device_state = vfu_ctx->migration->info.device_state; + state->device_state = vfu_ctx->migration->state; return 0; default: return -EINVAL; - }; } ssize_t -migration_feature_set(UNUSED vfu_ctx_t *vfu_ctx, uint32_t flags, - UNUSED void *buf) +migration_feature_set(vfu_ctx_t *vfu_ctx, uint32_t flags, void *buf) { if (flags & VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE) { - // struct vfio_user_device_feature_mig_state *res = buf; - - // TODO migrate to res->device_state state if allowed to - - return 0; + struct vfio_user_device_feature_mig_state *res = buf; + struct migration *migr = vfu_ctx->migration; + + return handle_device_state(vfu_ctx, migr, res->device_state, true); } return -EINVAL; @@ -591,13 +221,13 @@ migration_feature_set(UNUSED vfu_ctx_t *vfu_ctx, uint32_t flags, bool MOCK_DEFINE(device_is_stopped_and_copying)(struct migration *migr) { - return migr != NULL && migr->info.device_state == VFIO_DEVICE_STATE_V1_SAVING; + return migr != NULL && migr->state == VFIO_DEVICE_STATE_STOP_COPY; } bool MOCK_DEFINE(device_is_stopped)(struct migration *migr) { - return migr != NULL && migr->info.device_state == VFIO_DEVICE_STATE_V1_STOP; + return migr != NULL && migr->state == VFIO_DEVICE_STATE_STOP; } size_t @@ -622,18 +252,4 @@ migration_set_pgsize(struct migration *migr, size_t pgsize) return 0; } -bool -access_migration_needs_quiesce(const vfu_ctx_t *vfu_ctx, size_t region_index, - uint64_t offset) -{ - /* - * Writing to the migration state register with an unaligned access won't - * trigger this check but that's not a problem because - * migration_region_access_registers will fail the access. 
- */ - return region_index == VFU_PCI_DEV_MIGR_REGION_IDX - && vfu_ctx->migration != NULL - && offset == offsetof(struct vfio_user_migration_info, device_state); -} - /* ex: set tabstop=4 shiftwidth=4 softtabstop=4 expandtab: */ diff --git a/lib/migration.h b/lib/migration.h index 73916465..449f8412 100644 --- a/lib/migration.h +++ b/lib/migration.h @@ -46,11 +46,7 @@ struct migration * init_migration(const vfu_migration_callbacks_t *callbacks, - uint64_t data_offset, int *err); - -ssize_t -migration_region_access(vfu_ctx_t *vfu_ctx, char *buf, size_t count, - loff_t pos, bool is_write); + uint64_t flags, int *err); bool migration_feature_supported(uint32_t flags); @@ -74,6 +70,9 @@ migration_get_pgsize(struct migration *migr); int migration_set_pgsize(struct migration *migr, size_t pgsize); +uint64_t +migration_get_flags(struct migration *migr); + MOCK_DECLARE(bool, vfio_migr_state_transition_is_valid, uint32_t from, uint32_t to); diff --git a/lib/migration_priv.h b/lib/migration_priv.h index d5643af2..be31bc32 100644 --- a/lib/migration_priv.h +++ b/lib/migration_priv.h @@ -33,93 +33,28 @@ #include -/* - * FSM to simplify saving device state. - */ -enum migr_iter_state { - VFIO_USER_MIGR_ITER_STATE_INITIAL, - VFIO_USER_MIGR_ITER_STATE_STARTED, - VFIO_USER_MIGR_ITER_STATE_DATA_PREPARED, - VFIO_USER_MIGR_ITER_STATE_FINISHED -}; - struct migration { - /* - * TODO if the user provides an FD then should mmap it and use the migration - * registers in the file - */ - struct vfio_user_migration_info info; + uint64_t flags; + enum vfio_device_mig_state state; size_t pgsize; vfu_migration_callbacks_t callbacks; - uint64_t data_offset; - - /* - * This is only for the saving state. The resuming state is simpler so we - * don't need it. 
- */ - struct { - enum migr_iter_state state; - uint64_t pending_bytes; - uint64_t offset; - uint64_t size; - } iter; }; -struct migr_state_data { - uint32_t state; - const char *name; +/* valid migration state transitions + indexed by vfio_device_mig_state enum */ +static const bool transitions[8][8] = { + {0, 0, 0, 0, 0, 0, 0, 0}, + {0, 0, 0, 1, 1, 0, 0, 0}, + {0, 0, 0, 0, 0, 0, 1, 0}, + {0, 1, 0, 0, 0, 0, 0, 0}, + {0, 1, 0, 0, 0, 0, 0, 0}, + {0, 0, 0, 0, 0, 0, 0, 0}, + {0, 0, 1, 0, 0, 0, 0, 0}, + {0, 0, 0, 0, 0, 0, 0, 0} }; -#define VFIO_DEVICE_STATE_V1_ERROR (VFIO_DEVICE_STATE_V1_SAVING | VFIO_DEVICE_STATE_V1_RESUMING) - -/* valid migration state transitions */ -static const struct migr_state_data migr_states[(VFIO_DEVICE_STATE_MASK + 1)] = { - [VFIO_DEVICE_STATE_V1_STOP] = { - .state = - (1 << VFIO_DEVICE_STATE_V1_STOP) | - (1 << VFIO_DEVICE_STATE_V1_RUNNING), - .name = "stopped" - }, - [VFIO_DEVICE_STATE_V1_RUNNING] = { - .state = - (1 << VFIO_DEVICE_STATE_V1_STOP) | - (1 << VFIO_DEVICE_STATE_V1_RUNNING) | - (1 << VFIO_DEVICE_STATE_V1_SAVING) | - (1 << (VFIO_DEVICE_STATE_V1_RUNNING | VFIO_DEVICE_STATE_V1_SAVING)) | - (1 << VFIO_DEVICE_STATE_V1_RESUMING) | - (1 << VFIO_DEVICE_STATE_V1_ERROR), - .name = "running" - }, - [VFIO_DEVICE_STATE_V1_SAVING] = { - .state = - (1 << VFIO_DEVICE_STATE_V1_STOP) | - (1 << VFIO_DEVICE_STATE_V1_RUNNING) | - (1 << VFIO_DEVICE_STATE_V1_SAVING) | - (1 << VFIO_DEVICE_STATE_V1_ERROR), - .name = "stop-and-copy" - }, - [VFIO_DEVICE_STATE_V1_RUNNING | VFIO_DEVICE_STATE_V1_SAVING] = { - .state = - (1 << VFIO_DEVICE_STATE_V1_STOP) | - (1 << VFIO_DEVICE_STATE_V1_SAVING) | - (1 << VFIO_DEVICE_STATE_V1_RUNNING | VFIO_DEVICE_STATE_V1_SAVING) | - (1 << VFIO_DEVICE_STATE_V1_ERROR), - .name = "pre-copy" - }, - [VFIO_DEVICE_STATE_V1_RESUMING] = { - .state = - (1 << VFIO_DEVICE_STATE_V1_RUNNING) | - (1 << VFIO_DEVICE_STATE_V1_RESUMING) | - (1 << VFIO_DEVICE_STATE_V1_ERROR), - .name = "resuming" - } -}; - -MOCK_DECLARE(ssize_t, 
migration_region_access_registers, vfu_ctx_t *vfu_ctx, - char *buf, size_t count, loff_t pos, bool is_write); - MOCK_DECLARE(void, migr_state_transition, struct migration *migr, - enum migr_iter_state state); + enum vfio_device_mig_state state); MOCK_DECLARE(vfu_migr_state_t, migr_state_vfio_to_vfu, uint32_t device_state); @@ -129,4 +64,4 @@ MOCK_DECLARE(int, state_trans_notify, vfu_ctx_t *vfu_ctx, #endif -/* ex: set tabstop=4 shiftwidth=4 softtabstop=4 expandtab: */ +/* ex: set tabstop=4 shiftwidth=4 softtabstop=4 expandtab: */ \ No newline at end of file From ddf1f0aea2fb6ddd45baa5fb7222c2575be8e777 Mon Sep 17 00:00:00 2001 From: William Henderson Date: Tue, 27 Jun 2023 09:14:21 +0000 Subject: [PATCH 5/9] implement migration v2 read/write Signed-off-by: William Henderson --- include/libvfio-user.h | 6 ++--- include/vfio-user.h | 17 +++++++++++++ lib/libvfio-user.c | 7 ++++++ lib/migration.c | 55 ++++++++++++++++++++++++++++++++++++++++++ lib/migration.h | 6 +++++ 5 files changed, 87 insertions(+), 4 deletions(-) diff --git a/include/libvfio-user.h b/include/libvfio-user.h index 8a9e237c..f4d302ae 100644 --- a/include/libvfio-user.h +++ b/include/libvfio-user.h @@ -603,11 +603,9 @@ typedef struct { */ int (*transition)(vfu_ctx_t *vfu_ctx, vfu_migr_state_t state); - ssize_t (*read_data)(vfu_ctx_t *vfu_ctx, void *buf, - uint64_t count, uint64_t offset); + ssize_t (*read_data)(vfu_ctx_t *vfu_ctx, void *buf, uint64_t count); - ssize_t (*write_data)(vfu_ctx_t *vfu_ctx, void *buf, uint64_t count, - uint64_t offset); + ssize_t (*write_data)(vfu_ctx_t *vfu_ctx, void *buf, uint64_t count); } vfu_migration_callbacks_t; diff --git a/include/vfio-user.h b/include/vfio-user.h index e11ca6fa..405810fa 100644 --- a/include/vfio-user.h +++ b/include/vfio-user.h @@ -68,6 +68,8 @@ enum vfio_user_command { VFIO_USER_DEVICE_RESET = 13, VFIO_USER_DIRTY_PAGES = 14, VFIO_USER_DEVICE_FEATURE = 15, + VFIO_USER_MIG_DATA_READ = 16, + VFIO_USER_MIG_DATA_WRITE = 17, VFIO_USER_MAX, }; @@ 
-260,6 +262,21 @@ enum vfio_device_mig_state {
     VFIO_DEVICE_STATE_PRE_COPY_P2P = 7,
 };
 
+// FIXME awful names below
+
+// used for read request and write response
+struct vfio_user_mig_data_without_data {
+    uint32_t argsz;
+    uint32_t size;
+} __attribute__((packed));
+
+// used for write request and read response
+struct vfio_user_mig_data_with_data {
+    uint32_t argsz;
+    uint32_t size;
+    uint8_t data[];
+} __attribute__((packed));
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/libvfio-user.c b/lib/libvfio-user.c
index 176a8115..ec161adc 100644
--- a/lib/libvfio-user.c
+++ b/lib/libvfio-user.c
@@ -1238,6 +1238,14 @@ handle_request(vfu_ctx_t *vfu_ctx, vfu_msg_t *msg)
         ret = handle_device_feature(vfu_ctx, msg);
         break;
 
+    case VFIO_USER_MIG_DATA_READ:
+        ret = handle_mig_data_read(vfu_ctx, msg);
+        break;
+
+    case VFIO_USER_MIG_DATA_WRITE:
+        ret = handle_mig_data_write(vfu_ctx, msg);
+        break;
+
     default:
         msg->processed_cmd = false;
         vfu_log(vfu_ctx, LOG_ERR, "bad command %d", msg->hdr.cmd);
diff --git a/lib/migration.c b/lib/migration.c
index f3013c58..3a35987c 100644
--- a/lib/migration.c
+++ b/lib/migration.c
@@ -218,6 +218,61 @@ migration_feature_set(vfu_ctx_t *vfu_ctx, uint32_t flags, void *buf)
     return -EINVAL;
 }
 
+ssize_t
+handle_mig_data_read(vfu_ctx_t *vfu_ctx, vfu_msg_t *msg)
+{
+    struct migration *migr = vfu_ctx->migration;
+    struct vfio_user_mig_data_without_data *req = msg->in.iov.iov_base;
+
+    if (vfu_ctx->migration == NULL) {
+        return -EINVAL;
+    }
+
+    if (migr->state != VFIO_DEVICE_STATE_PRE_COPY
+        && migr->state != VFIO_DEVICE_STATE_STOP_COPY) {
+        return -EINVAL;
+    }
+
+    msg->out.iov.iov_len = sizeof(*req) + req->size;
+    msg->out.iov.iov_base = calloc(1, msg->out.iov.iov_len);
+
+    struct vfio_user_mig_data_with_data *res = msg->out.iov.iov_base;
+
+    ssize_t ret = migr->callbacks.read_data(vfu_ctx, &res->data, req->size);
+
+    if (ret < 0) {
+        return -1;
+    }
+
+    res->size = ret;
+    res->argsz = ret + sizeof(*req);
+
+    return ret;
+}
+
+ssize_t
+handle_mig_data_write(vfu_ctx_t 
*vfu_ctx, vfu_msg_t *msg) +{ + struct migration *migr = vfu_ctx->migration; + struct vfio_user_mig_data_with_data *req = msg->in.iov.iov_base; + + if (vfu_ctx->migration == NULL) { + return -EINVAL; + } + + if (migr->state != VFIO_DEVICE_STATE_RESUMING) { + return -EINVAL; + } + + ssize_t ret = migr->callbacks.write_data(vfu_ctx, &req->data, req->size); + + if (ret < 0) { + return -1; + } + + return ret; +} + bool MOCK_DEFINE(device_is_stopped_and_copying)(struct migration *migr) { diff --git a/lib/migration.h b/lib/migration.h index 449f8412..fb5cfe66 100644 --- a/lib/migration.h +++ b/lib/migration.h @@ -57,6 +57,12 @@ migration_feature_get(vfu_ctx_t *vfu_ctx, uint32_t flags, void *buf); ssize_t migration_feature_set(vfu_ctx_t *vfu_ctx, uint32_t flags, void *buf); +ssize_t +handle_mig_data_read(vfu_ctx_t *vfu_ctx, vfu_msg_t *msg); + +ssize_t +handle_mig_data_write(vfu_ctx_t *vfu_ctx, vfu_msg_t *msg); + bool migration_available(vfu_ctx_t *vfu_ctx); From 3228e38bfa2b87fd15de7a4bf5d54289213f20b3 Mon Sep 17 00:00:00 2001 From: William Henderson Date: Tue, 27 Jun 2023 10:37:08 +0000 Subject: [PATCH 6/9] move samples to migration v2 Signed-off-by: William Henderson --- samples/client.c | 227 ++++++++++++++++++++----------------- samples/gpio-pci-idio-16.c | 58 ++-------- samples/server.c | 108 ++---------------- 3 files changed, 141 insertions(+), 252 deletions(-) diff --git a/samples/client.c b/samples/client.c index d4abd219..998a9aaa 100644 --- a/samples/client.c +++ b/samples/client.c @@ -217,9 +217,7 @@ static bool get_region_vfio_caps(struct vfio_info_cap_header *header, struct vfio_region_info_cap_sparse_mmap **sparse) { - struct vfio_region_info_cap_type *type; unsigned int i; - bool migr = false; while (true) { switch (header->id) { @@ -234,16 +232,6 @@ get_region_vfio_caps(struct vfio_info_cap_header *header, (ull_t)(*sparse)->areas[i].size); } break; - case VFIO_REGION_INFO_CAP_TYPE: - type = (struct vfio_region_info_cap_type*)header; - if (type->type != 
VFIO_REGION_TYPE_MIGRATION ||
-                type->subtype != VFIO_REGION_SUBTYPE_MIGRATION) {
-                errx(EXIT_FAILURE, "bad region type %d/%d", type->type,
-                     type->subtype);
-            }
-            migr = true;
-            printf("client: migration region\n");
-            break;
         default:
             errx(EXIT_FAILURE, "bad VFIO cap ID %#x", header->id);
         }
@@ -252,7 +240,7 @@ get_region_vfio_caps(struct vfio_info_cap_header *header,
         header = (struct vfio_info_cap_header*)((char*)header + header->next -
                                                 sizeof(struct vfio_region_info));
     }
-    return migr;
+    return *sparse != NULL;
 }
 
 static void
@@ -347,8 +335,7 @@ get_device_region_info(int sock, uint32_t index)
     if (get_region_vfio_caps((struct vfio_info_cap_header*)(region_info + 1),
                              &sparse)) {
         if (sparse != NULL) {
-            assert((index == VFU_PCI_DEV_BAR1_REGION_IDX && nr_fds == 2) ||
-                   (index == VFU_PCI_DEV_MIGR_REGION_IDX && nr_fds == 1));
+            assert(index == VFU_PCI_DEV_BAR1_REGION_IDX && nr_fds == 2);
             assert(nr_fds == sparse->nr_areas);
             mmap_sparse_areas(fds, region_info, sparse);
         }
@@ -386,7 +373,7 @@ get_device_info(int sock, struct vfio_user_device_info *dev_info)
         err(EXIT_FAILURE, "failed to get device info");
     }
 
-    if (dev_info->num_regions != 10) {
+    if (dev_info->num_regions != 9) {
         errx(EXIT_FAILURE, "bad number of device regions %d",
              dev_info->num_regions);
     }
@@ -526,6 +513,100 @@ access_region(int sock, int region, bool is_write, uint64_t offset,
     return 0;
 }
 
+static int
+set_migration_state(int sock, uint32_t state)
+{
+    static int msg_id = 0xfab1;
+    struct vfio_user_device_feature req = {
+        .argsz = 16,
+        .flags = VFIO_DEVICE_FEATURE_SET | VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE
+    };
+    struct vfio_user_device_feature_mig_state change_state = {
+        .device_state = state,
+        .data_fd = 0
+    };
+    struct iovec send_iovecs[3] = {
+        [1] = {
+            .iov_base = &req,
+            .iov_len = sizeof(req)
+        },
+        [2] = {
+            .iov_base = &change_state,
+            .iov_len = sizeof(change_state)
+        }
+    };
+
+    int ret = tran_sock_msg_iovec(sock, msg_id--, VFIO_USER_DEVICE_FEATURE,
+                                  send_iovecs, 3, NULL, 0, NULL,
+                                  &req, 
sizeof(req), NULL, 0);
+
+    if (ret < 0) {
+        return -1;
+    }
+
+    return ret;
+}
+
+static int
+read_migr_data(int sock, void *buf, size_t len)
+{
+    static int msg_id = 0x6904;
+    struct vfio_user_mig_data_without_data req = {
+        .argsz = 8,
+        .size = len
+    };
+    struct iovec send_iovecs[2] = {
+        [1] = {
+            .iov_base = &req,
+            .iov_len = sizeof(req)
+        }
+    };
+    struct vfio_user_mig_data_with_data *res = calloc(1, sizeof(req) + len);
+
+    int ret = tran_sock_msg_iovec(sock, msg_id--, VFIO_USER_MIG_DATA_READ,
+                                  send_iovecs, 2, NULL, 0, NULL,
+                                  res, sizeof(req) + len, NULL, 0);
+
+    if (ret < 0) {
+        free(res);
+        return -1;
+    }
+    int size = res->size;
+    memcpy(buf, res->data, size);
+    free(res);
+    return size;
+}
+
+static int
+write_migr_data(int sock, void *buf, size_t len)
+{
+    static int msg_id = 0x2023;
+    struct vfio_user_mig_data_with_data req = {
+        .argsz = 8 + len,
+        .size = len
+    };
+    struct iovec send_iovecs[3] = {
+        [1] = {
+            .iov_base = &req,
+            .iov_len = sizeof(req)
+        },
+        [2] = {
+            .iov_base = buf,
+            .iov_len = len
+        }
+    };
+
+    int ret = tran_sock_msg_iovec(sock, msg_id--, VFIO_USER_MIG_DATA_WRITE,
+                                  send_iovecs, 3, NULL, 0, NULL,
+                                  &req, sizeof(req), NULL, 0);
+
+    if (ret < 0) {
+        return -1;
+    }
+
+    return ret;
+}
+
 static void
 access_bar0(int sock, time_t *t)
 {
@@ -753,61 +834,34 @@ static size_t
 do_migrate(int sock, size_t nr_iters, struct iovec *migr_iter)
 {
     int ret;
-    uint64_t pending_bytes, data_offset, data_size;
+    bool is_more = true;
     size_t i = 0;
 
     assert(nr_iters > 0);
 
-    /* XXX read pending_bytes */
-    ret = access_region(sock, VFU_PCI_DEV_MIGR_REGION_IDX, false,
-                        offsetof(struct vfio_user_migration_info, pending_bytes),
-                        &pending_bytes, sizeof(pending_bytes));
-    if (ret < 0) {
-        err(EXIT_FAILURE, "failed to read pending_bytes");
-    }
-
-    for (i = 0; i < nr_iters && pending_bytes > 0; i++) {
-
-        /* XXX read data_offset and data_size */
-        ret = access_region(sock, VFU_PCI_DEV_MIGR_REGION_IDX, false,
-                            offsetof(struct vfio_user_migration_info, data_offset),
-                            &data_offset, 
sizeof(data_offset)); - if (ret < 0) { - err(EXIT_FAILURE, "failed to read data_offset"); - } + for (i = 0; i < nr_iters && is_more; i++) { - ret = access_region(sock, VFU_PCI_DEV_MIGR_REGION_IDX, false, - offsetof(struct vfio_user_migration_info, data_size), - &data_size, sizeof(data_size)); - if (ret < 0) { - err(EXIT_FAILURE, "failed to read data_size"); - } + migr_iter[i].iov_len = 1024; + migr_iter[i].iov_base = malloc(migr_iter[i].iov_len); - migr_iter[i].iov_len = data_size; - migr_iter[i].iov_base = malloc(data_size); if (migr_iter[i].iov_base == NULL) { err(EXIT_FAILURE, "failed to allocate migration buffer"); } /* XXX read migration data */ - ret = access_region(sock, VFU_PCI_DEV_MIGR_REGION_IDX, false, - data_offset, - (char *)migr_iter[i].iov_base, data_size); + ret = read_migr_data(sock, migr_iter[i].iov_base, migr_iter[i].iov_len); if (ret < 0) { err(EXIT_FAILURE, "failed to read migration data"); } - /* FIXME send migration data to the destination client process */ - - /* - * XXX read pending_bytes again to indicate to the server that the - * migration data have been consumed. - */ - ret = access_region(sock, VFU_PCI_DEV_MIGR_REGION_IDX, false, - offsetof(struct vfio_user_migration_info, pending_bytes), - &pending_bytes, sizeof(pending_bytes)); - if (ret < 0) { - err(EXIT_FAILURE, "failed to read pending_bytes"); + if (ret < (int)migr_iter[i].iov_len) { + // FIXME is it pointless shuffling stuff around? + void* buf = malloc(ret); + memcpy(buf, migr_iter[i].iov_base, ret); + free(migr_iter[i].iov_base); + migr_iter[i].iov_base = buf; + migr_iter[i].iov_len = ret; + is_more = false; } } return i; @@ -884,10 +938,8 @@ migrate_from(int sock, size_t *nr_iters, struct iovec **migr_iters, * XXX set device state to pre-copy. This is technically optional but any * VMM that cares about performance needs this. 
*/ - device_state = VFIO_DEVICE_STATE_V1_SAVING | VFIO_DEVICE_STATE_V1_RUNNING; - ret = access_region(sock, VFU_PCI_DEV_MIGR_REGION_IDX, true, - offsetof(struct vfio_user_migration_info, device_state), - &device_state, sizeof(device_state)); + device_state = VFIO_DEVICE_STATE_PRE_COPY; + ret = set_migration_state(sock, device_state); if (ret < 0) { err(EXIT_FAILURE, "failed to write to device state"); } @@ -905,10 +957,8 @@ migrate_from(int sock, size_t *nr_iters, struct iovec **migr_iters, printf("client: setting device state to stop-and-copy\n"); - device_state = VFIO_DEVICE_STATE_V1_SAVING; - ret = access_region(sock, VFU_PCI_DEV_MIGR_REGION_IDX, true, - offsetof(struct vfio_user_migration_info, device_state), - &device_state, sizeof(device_state)); + device_state = VFIO_DEVICE_STATE_STOP_COPY; + ret = set_migration_state(sock, device_state); if (ret < 0) { err(EXIT_FAILURE, "failed to write to device state"); } @@ -921,10 +971,8 @@ migrate_from(int sock, size_t *nr_iters, struct iovec **migr_iters, } /* XXX read device state, migration must have finished now */ - device_state = VFIO_DEVICE_STATE_V1_STOP; - ret = access_region(sock, VFU_PCI_DEV_MIGR_REGION_IDX, true, - offsetof(struct vfio_user_migration_info, device_state), - &device_state, sizeof(device_state)); + device_state = VFIO_DEVICE_STATE_STOP; + ret = set_migration_state(sock, device_state); if (ret < 0) { err(EXIT_FAILURE, "failed to write to device state"); } @@ -941,8 +989,7 @@ migrate_to(char *old_sock_path, int *server_max_fds, int ret, sock; char *sock_path; struct stat sb; - uint32_t device_state = VFIO_DEVICE_STATE_V1_RESUMING; - uint64_t data_offset, data_len; + uint32_t device_state = VFIO_DEVICE_STATE_RESUMING; size_t i; uint32_t dst_crc; char buf[bar1_size]; @@ -993,54 +1040,26 @@ migrate_to(char *old_sock_path, int *server_max_fds, negotiate(sock, server_max_fds, server_max_data_xfer_size, pgsize); /* XXX set device state to resuming */ - ret = access_region(sock, 
VFU_PCI_DEV_MIGR_REGION_IDX, true, - offsetof(struct vfio_user_migration_info, device_state), - &device_state, sizeof(device_state)); + ret = set_migration_state(sock, device_state); if (ret < 0) { err(EXIT_FAILURE, "failed to set device state to resuming"); } for (i = 0; i < nr_iters; i++) { - /* XXX read data offset */ - ret = access_region(sock, VFU_PCI_DEV_MIGR_REGION_IDX, false, - offsetof(struct vfio_user_migration_info, data_offset), - &data_offset, sizeof(data_offset)); - if (ret < 0) { - err(EXIT_FAILURE, "failed to read migration data offset"); - } - /* XXX write migration data */ - /* - * TODO write half of migration data via regular write and other half via - * memopy map. - */ - printf("client: writing migration device data %#llx-%#llx\n", - (ull_t)data_offset, - (ull_t)(data_offset + migr_iters[i].iov_len - 1)); - ret = access_region(sock, VFU_PCI_DEV_MIGR_REGION_IDX, true, - data_offset, migr_iters[i].iov_base, - migr_iters[i].iov_len); + printf("client: writing migration device data iter %zu\n", i); + ret = write_migr_data(sock, migr_iters[i].iov_base, + migr_iters[i].iov_len); if (ret < 0) { err(EXIT_FAILURE, "failed to write device migration data"); } - - /* XXX write data_size */ - data_len = migr_iters[i].iov_len; - ret = access_region(sock, VFU_PCI_DEV_MIGR_REGION_IDX, true, - offsetof(struct vfio_user_migration_info, data_size), - &data_len, sizeof(data_len)); - if (ret < 0) { - err(EXIT_FAILURE, "failed to write migration data size"); - } } /* XXX set device state to running */ - device_state = VFIO_DEVICE_STATE_V1_RUNNING; - ret = access_region(sock, VFU_PCI_DEV_MIGR_REGION_IDX, true, - offsetof(struct vfio_user_migration_info, device_state), - &device_state, sizeof(device_state)); + device_state = VFIO_DEVICE_STATE_RUNNING; + ret = set_migration_state(sock, device_state); if (ret < 0) { err(EXIT_FAILURE, "failed to set device state to running"); } diff --git a/samples/gpio-pci-idio-16.c b/samples/gpio-pci-idio-16.c index b50f407b..b3232491 
100644 --- a/samples/gpio-pci-idio-16.c +++ b/samples/gpio-pci-idio-16.c @@ -77,49 +77,23 @@ migration_device_state_transition(vfu_ctx_t *vfu_ctx, vfu_migr_state_t state) return 0; } -static uint64_t -migration_get_pending_bytes(UNUSED vfu_ctx_t *vfu_ctx) -{ - if (dirty) { - return sizeof(pin); - } - return 0; -} - -static int -migration_prepare_data(UNUSED vfu_ctx_t *vfu_ctx, - uint64_t *offset, uint64_t *size) -{ - *offset = 0; - if (size != NULL) { /* null means resuming */ - *size = sizeof(pin); - } - return 0; -} - static ssize_t -migration_read_data(UNUSED vfu_ctx_t *vfu_ctx, void *buf, - uint64_t size, uint64_t offset) +migration_read_data(UNUSED vfu_ctx_t *vfu_ctx, void *buf, uint64_t size) { - assert(offset == 0); assert(size == sizeof(pin)); - memcpy(buf, &pin, sizeof(pin)); - dirty = false; - return 0; -} -static int -migration_data_written(UNUSED vfu_ctx_t *vfu_ctx, uint64_t count) -{ - assert(count == sizeof(pin)); - return 0; + if (dirty) { + memcpy(buf, &pin, sizeof(pin)); + dirty = false; + return sizeof(pin); + } else { + return 0; + } } static ssize_t -migration_write_data(UNUSED vfu_ctx_t *vfu_ctx, void *buf, - uint64_t size, uint64_t offset) +migration_write_data(UNUSED vfu_ctx_t *vfu_ctx, void *buf, uint64_t size) { - assert(offset == 0); assert(size == sizeof(pin)); memcpy(&pin, buf, sizeof(pin)); return 0; @@ -145,16 +119,10 @@ main(int argc, char *argv[]) int opt; struct sigaction act = { .sa_handler = _sa_handler }; vfu_ctx_t *vfu_ctx; - size_t migr_regs_size = vfu_get_migr_register_area_size(); - size_t migr_data_size = sysconf(_SC_PAGE_SIZE); - size_t migr_size = migr_regs_size + migr_data_size; const vfu_migration_callbacks_t migr_callbacks = { .version = VFU_MIGR_CALLBACKS_VERS, .transition = &migration_device_state_transition, - .get_pending_bytes = &migration_get_pending_bytes, - .prepare_data = &migration_prepare_data, .read_data = &migration_read_data, - .data_written = &migration_data_written, .write_data = &migration_write_data }; 
@@ -214,13 +182,7 @@ main(int argc, char *argv[]) } if (enable_migr) { - ret = vfu_setup_region(vfu_ctx, VFU_PCI_DEV_MIGR_REGION_IDX, migr_size, - NULL, VFU_REGION_FLAG_RW, NULL, 0, -1, 0); - if (ret < 0) { - err(EXIT_FAILURE, "failed to setup migration region"); - } - ret = vfu_setup_device_migration_callbacks(vfu_ctx, &migr_callbacks, - migr_regs_size); + ret = vfu_setup_device_migration_callbacks(vfu_ctx, 0, &migr_callbacks); if (ret < 0) { err(EXIT_FAILURE, "failed to setup device migration"); } diff --git a/samples/server.c b/samples/server.c index e059d9ca..62b79319 100644 --- a/samples/server.c +++ b/samples/server.c @@ -285,37 +285,11 @@ migration_device_state_transition(vfu_ctx_t *vfu_ctx, vfu_migr_state_t state) return 0; } -static uint64_t -migration_get_pending_bytes(vfu_ctx_t *vfu_ctx) -{ - struct server_data *server_data = vfu_get_private(vfu_ctx); - return server_data->migration.pending_bytes; -} - -static int -migration_prepare_data(vfu_ctx_t *vfu_ctx, uint64_t *offset, uint64_t *size) -{ - struct server_data *server_data = vfu_get_private(vfu_ctx); - - *offset = 0; - if (size != NULL) { - *size = server_data->migration.pending_bytes; - } - return 0; -} - static ssize_t -migration_read_data(vfu_ctx_t *vfu_ctx, void *buf, - uint64_t size, uint64_t offset) +migration_read_data(vfu_ctx_t *vfu_ctx, void *buf, uint64_t size) { struct server_data *server_data = vfu_get_private(vfu_ctx); - if (server_data->migration.state != VFU_MIGR_STATE_PRE_COPY && - server_data->migration.state != VFU_MIGR_STATE_STOP_AND_COPY) - { - return size; - } - /* * For ease of implementation we expect the client to read all migration * data in one go; partial reads are not supported. This is allowed by VFIO @@ -328,7 +302,7 @@ migration_read_data(vfu_ctx_t *vfu_ctx, void *buf, * more complex state tracking which exceeds the scope of this sample. 
*/ - if (offset != 0 || size != server_data->migration.pending_bytes) { + if (size != server_data->migration.pending_bytes) { errno = EINVAL; return -1; } @@ -344,8 +318,7 @@ migration_read_data(vfu_ctx_t *vfu_ctx, void *buf, } static ssize_t -migration_write_data(vfu_ctx_t *vfu_ctx, void *data, - uint64_t size, uint64_t offset) +migration_write_data(vfu_ctx_t *vfu_ctx, void *data, uint64_t size) { struct server_data *server_data = vfu_get_private(vfu_ctx); char *buf = data; @@ -354,10 +327,10 @@ migration_write_data(vfu_ctx_t *vfu_ctx, void *data, assert(server_data != NULL); assert(data != NULL); - if (offset != 0 || size < server_data->bar1_size) { + if (size < server_data->bar1_size) { vfu_log(vfu_ctx, LOG_DEBUG, "XXX bad migration data write %#llx-%#llx", - (unsigned long long)offset, - (unsigned long long)offset + size - 1); + (unsigned long long)0, + (unsigned long long)size - 1); errno = EINVAL; return -1; } @@ -379,33 +352,6 @@ migration_write_data(vfu_ctx_t *vfu_ctx, void *data, return 0; } - -static int -migration_data_written(UNUSED vfu_ctx_t *vfu_ctx, UNUSED uint64_t count) -{ - /* - * We apply migration state directly in the migration_write_data callback, - * so we don't need to do anything here. We would have to apply migration - * state in this callback if the migration region was memory mappable, in - * which case we wouldn't know when the client wrote migration data. 
- */ - - return 0; -} - -static size_t -nr_pages(size_t size) -{ - return (size / sysconf(_SC_PAGE_SIZE) + - (size % sysconf(_SC_PAGE_SIZE) > 1)); -} - -static size_t -page_align(size_t size) -{ - return nr_pages(size) * sysconf(_SC_PAGE_SIZE); -} - int main(int argc, char *argv[]) { char template[] = "/tmp/libvfio-user.XXXXXX"; @@ -414,7 +360,6 @@ int main(int argc, char *argv[]) int opt; struct sigaction act = {.sa_handler = _sa_handler}; const size_t bar1_size = 0x3000; - size_t migr_regs_size, migr_data_size, migr_size; struct server_data server_data = { .migration = { .state = VFU_MIGR_STATE_RUNNING @@ -426,10 +371,7 @@ int main(int argc, char *argv[]) const vfu_migration_callbacks_t migr_callbacks = { .version = VFU_MIGR_CALLBACKS_VERS, .transition = &migration_device_state_transition, - .get_pending_bytes = &migration_get_pending_bytes, - .prepare_data = &migration_prepare_data, .read_data = &migration_read_data, - .data_written = &migration_data_written, .write_data = &migration_write_data }; @@ -488,9 +430,6 @@ int main(int argc, char *argv[]) * are mappable. The client can still mmap the 2nd page, we can't prohibit * this under Linux. If we really want to prohibit it we have to use * separate files for the same region. - * - * We choose to use a single file which contains both BAR1 and the migration - * registers. They could also be completely different files. */ if ((tmpfd = mkstemp(template)) == -1) { err(EXIT_FAILURE, "failed to create backing file"); @@ -500,16 +439,7 @@ int main(int argc, char *argv[]) server_data.bar1_size = bar1_size; - /* - * The migration registers aren't memory mappable, so in order to make the - * rest of the migration region memory mappable we must effectively reserve - * an entire page. 
- */ - migr_regs_size = vfu_get_migr_register_area_size(); - migr_data_size = page_align(bar1_size + sizeof(time_t)); - migr_size = migr_regs_size + migr_data_size; - - if (ftruncate(tmpfd, server_data.bar1_size + migr_size) == -1) { + if (ftruncate(tmpfd, server_data.bar1_size) == -1) { err(EXIT_FAILURE, "failed to truncate backing file"); } server_data.bar1 = mmap(NULL, server_data.bar1_size, PROT_READ | PROT_WRITE, @@ -529,29 +459,7 @@ int main(int argc, char *argv[]) err(EXIT_FAILURE, "failed to setup BAR1 region"); } - /* setup migration */ - - struct iovec migr_mmap_areas[] = { - [0] = { - .iov_base = (void *)migr_regs_size, - .iov_len = migr_data_size - }, - }; - - /* - * The migration region comes after bar1 in the backing file, so offset is - * server_data.bar1_size. - */ - ret = vfu_setup_region(vfu_ctx, VFU_PCI_DEV_MIGR_REGION_IDX, migr_size, - NULL, VFU_REGION_FLAG_RW, migr_mmap_areas, - ARRAY_SIZE(migr_mmap_areas), tmpfd, - server_data.bar1_size); - if (ret < 0) { - err(EXIT_FAILURE, "failed to setup migration region"); - } - - ret = vfu_setup_device_migration_callbacks(vfu_ctx, &migr_callbacks, - migr_regs_size); + ret = vfu_setup_device_migration_callbacks(vfu_ctx, 0, &migr_callbacks); if (ret < 0) { err(EXIT_FAILURE, "failed to setup device migration"); } From 45e4246eef6c25cb1160006c8248aa8157347e42 Mon Sep 17 00:00:00 2001 From: William Henderson Date: Wed, 5 Jul 2023 15:11:57 +0000 Subject: [PATCH 7/9] fix: make client/server sample work with migration v2 Signed-off-by: William Henderson --- include/libvfio-user.h | 6 +- include/vfio-user.h | 4 +- lib/libvfio-user.c | 23 ++++--- lib/migration.c | 10 ++- lib/migration.h | 3 + lib/migration_priv.h | 19 +++--- samples/client.c | 96 ++++++++++++++------------- samples/server.c | 146 +++++++++++++++++++++++++++++------------ 8 files changed, 194 insertions(+), 113 deletions(-) diff --git a/include/libvfio-user.h b/include/libvfio-user.h index f4d302ae..e99584db 100644 --- a/include/libvfio-user.h +++ 
b/include/libvfio-user.h @@ -86,7 +86,11 @@ dma_sg_size(void); * reads or writes of the socket connection will not need to block, since both * APIS are synchronous. */ -#define LIBVFIO_USER_FLAG_ATTACH_NB (1 << 0) +#define LIBVFIO_USER_FLAG_ATTACH_NB (1) + +// Should the server start in the migration "RESUMING" state? Given to +// init_migration +#define LIBVFIO_USER_MIG_FLAG_START_RESUMING (1 << 0) typedef enum { VFU_TRANS_SOCK, diff --git a/include/vfio-user.h b/include/vfio-user.h index 405810fa..50cfb4ad 100644 --- a/include/vfio-user.h +++ b/include/vfio-user.h @@ -267,13 +267,13 @@ enum vfio_device_mig_state { // used for read request and write response struct vfio_user_mig_data_without_data { uint32_t argsz; - uint32_t size; + uint64_t size; } __attribute__((packed)); // used for write request and read response struct vfio_user_mig_data_with_data { uint32_t argsz; - uint32_t size; + uint64_t size; uint8_t data[]; } __attribute__((packed)); diff --git a/lib/libvfio-user.c b/lib/libvfio-user.c index ec161adc..581fc87a 100644 --- a/lib/libvfio-user.c +++ b/lib/libvfio-user.c @@ -894,15 +894,14 @@ static int device_reset(vfu_ctx_t *vfu_ctx, vfu_reset_type_t reason) { int ret; - + ret = call_reset_cb(vfu_ctx, reason); if (ret < 0) { return ret; } if (vfu_ctx->migration != NULL) { - return handle_device_state(vfu_ctx, vfu_ctx->migration, - VFIO_DEVICE_STATE_V1_RUNNING, false); + migr_state_transition(vfu_ctx->migration, VFIO_DEVICE_STATE_RUNNING); } return 0; } @@ -1044,8 +1043,10 @@ handle_device_feature(vfu_ctx_t *vfu_ctx, vfu_msg_t *msg) ssize_t ret; if (req->flags & VFIO_DEVICE_FEATURE_PROBE) { - msg->out.iov.iov_base = msg->in.iov.iov_base; + msg->out.iov.iov_base = malloc(msg->in.iov.iov_len); msg->out.iov.iov_len = msg->in.iov.iov_len; + memcpy(msg->out.iov.iov_base, msg->in.iov.iov_base, + msg->out.iov.iov_len); ret = 0; } else if (req->flags & VFIO_DEVICE_FEATURE_GET) { @@ -1055,11 +1056,12 @@ handle_device_feature(vfu_ctx_t *vfu_ctx, vfu_msg_t *msg) ret 
= migration_feature_get(vfu_ctx, req->flags, msg->out.iov.iov_base); } else if (req->flags & VFIO_DEVICE_FEATURE_SET) { - msg->out.iov.iov_base = msg->in.iov.iov_base; + msg->out.iov.iov_base = malloc(msg->in.iov.iov_len); msg->out.iov.iov_len = msg->in.iov.iov_len; + memcpy(msg->out.iov.iov_base, msg->in.iov.iov_base, + msg->out.iov.iov_len); - ret = migration_feature_set(vfu_ctx, req->flags, - msg->out.iov.iov_base); + ret = migration_feature_set(vfu_ctx, req->flags, req->data); } return ret; @@ -1244,6 +1246,7 @@ handle_request(vfu_ctx_t *vfu_ctx, vfu_msg_t *msg) case VFIO_USER_MIG_DATA_WRITE: ret = handle_mig_data_write(vfu_ctx, msg); + break; default: msg->processed_cmd = false; @@ -1348,7 +1351,9 @@ MOCK_DEFINE(cmd_allowed_when_stopped_and_copying)(uint16_t cmd) { return cmd == VFIO_USER_REGION_READ || cmd == VFIO_USER_REGION_WRITE || - cmd == VFIO_USER_DIRTY_PAGES; + cmd == VFIO_USER_DIRTY_PAGES || + cmd == VFIO_USER_DEVICE_FEATURE || + cmd == VFIO_USER_MIG_DATA_READ; } bool @@ -2046,7 +2051,7 @@ vfu_setup_device_migration_callbacks(vfu_ctx_t *vfu_ctx, uint64_t flags, return ERROR_INT(ret); } - return 0; + return ret; } #ifdef DEBUG diff --git a/lib/migration.c b/lib/migration.c index 3a35987c..4ab91812 100644 --- a/lib/migration.c +++ b/lib/migration.c @@ -71,7 +71,11 @@ init_migration(const vfu_migration_callbacks_t *callbacks, migr->pgsize = sysconf(_SC_PAGESIZE); /* FIXME this should be done in vfu_ctx_realize */ - migr->state = VFIO_DEVICE_STATE_RUNNING; + if (flags & LIBVFIO_USER_MIG_FLAG_START_RESUMING) { + migr->state = VFIO_DEVICE_STATE_RESUMING; + } else { + migr->state = VFIO_DEVICE_STATE_RUNNING; + } migr->callbacks = *callbacks; if (migr->callbacks.transition == NULL || @@ -233,7 +237,7 @@ handle_mig_data_read(vfu_ctx_t *vfu_ctx, vfu_msg_t *msg) return -EINVAL; } - msg->out.iov.iov_len = sizeof(req) + req->size; + msg->out.iov.iov_len = msg->in.iov.iov_len + req->size; msg->out.iov.iov_base = calloc(1, msg->out.iov.iov_len); struct 
vfio_user_mig_data_with_data *res = msg->out.iov.iov_base; @@ -241,7 +245,7 @@ handle_mig_data_read(vfu_ctx_t *vfu_ctx, vfu_msg_t *msg) ssize_t ret = migr->callbacks.read_data(vfu_ctx, &res->data, req->size); res->size = ret; - res->argsz = ret + sizeof(req); + res->argsz = ret + msg->in.iov.iov_len; if (ret < 0) { return -1; diff --git a/lib/migration.h b/lib/migration.h index fb5cfe66..f817f66f 100644 --- a/lib/migration.h +++ b/lib/migration.h @@ -79,6 +79,9 @@ migration_set_pgsize(struct migration *migr, size_t pgsize); uint64_t migration_get_flags(struct migration *migr); +MOCK_DECLARE(void, migr_state_transition, struct migration *migr, + enum vfio_device_mig_state state); + MOCK_DECLARE(bool, vfio_migr_state_transition_is_valid, uint32_t from, uint32_t to); diff --git a/lib/migration_priv.h b/lib/migration_priv.h index be31bc32..ac58b1fc 100644 --- a/lib/migration_priv.h +++ b/lib/migration_priv.h @@ -43,19 +43,16 @@ struct migration { /* valid migration state transitions indexed by vfio_device_mig_state enum */ static const bool transitions[8][8] = { - {0, 0, 0, 0, 0, 0, 0, 0}, - {0, 0, 0, 1, 1, 0, 0, 0}, - {0, 0, 0, 0, 0, 0, 1, 0}, - {0, 1, 0, 0, 0, 0, 0, 0}, - {0, 1, 0, 0, 0, 0, 0, 0}, - {0, 0, 0, 0, 0, 0, 0, 0}, - {0, 0, 1, 0, 0, 0, 0, 0}, - {0, 0, 0, 0, 0, 0, 0, 0} + {0, 0, 0, 0, 0, 0, 0, 0}, // ERROR + {0, 0, 1, 1, 1, 0, 0, 0}, // STOP + {0, 1, 0, 0, 0, 0, 1, 0}, // RUNNING + {0, 1, 0, 0, 0, 0, 0, 0}, // STOP_COPY + {0, 1, 0, 0, 0, 0, 0, 0}, // RESUMING + {0, 0, 0, 0, 0, 0, 0, 0}, // RUNNING_P2P + {0, 0, 1, 1, 0, 0, 0, 0}, // PRE_COPY + {0, 0, 0, 0, 0, 0, 0, 0} // PRE_COPY_P2P }; -MOCK_DECLARE(void, migr_state_transition, struct migration *migr, - enum vfio_device_mig_state state); - MOCK_DECLARE(vfu_migr_state_t, migr_state_vfio_to_vfu, uint32_t device_state); MOCK_DECLARE(int, state_trans_notify, vfu_ctx_t *vfu_ctx, diff --git a/samples/client.c b/samples/client.c index 998a9aaa..121122aa 100644 --- a/samples/client.c +++ b/samples/client.c @@ -62,6 
+62,8 @@ static char const *irq_to_str[] = {
     [VFU_DEV_REQ_IRQ] = "REQ"
 };
 
+static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
+
 void
 vfu_log(UNUSED vfu_ctx_t *vfu_ctx, UNUSED int level,
         const char *fmt, ...)
@@ -458,7 +460,6 @@ access_region(int sock, int region, bool is_write, uint64_t offset,
             .iov_len = data_len
         }
     };
-    static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
     struct vfio_user_region_access *recv_data;
     size_t nr_send_iovecs, recv_data_len;
     int op, ret;
@@ -535,15 +536,25 @@ set_migration_state(int sock, uint32_t state)
             .iov_len = sizeof(change_state)
         }
     };
 
+    char *response = malloc(sizeof(req) + sizeof(change_state));
+    pthread_mutex_lock(&mutex);
     int ret = tran_sock_msg_iovec(sock, msg_id--, VFIO_USER_DEVICE_FEATURE,
                                   send_iovecs, 3, NULL, 0, NULL,
-                                  &req, sizeof(req), NULL, 0);
+                                  response, sizeof(req) + sizeof(change_state),
+                                  NULL, 0);
+    pthread_mutex_unlock(&mutex);
 
     if (ret < 0) {
+        free(response);
         return -1;
     }
 
+    assert(memcmp(&req, response, sizeof(req)) == 0);
+    assert(memcmp(&change_state, response + sizeof(req),
+                  sizeof(change_state)) == 0);
+    free(response);
+
     return ret;
 }
 
@@ -552,7 +563,7 @@ read_migr_data(int sock, void *buf, size_t len)
 {
     static int msg_id = 0x6904;
     struct vfio_user_mig_data_without_data req = {
-        .argsz = 8,
+        .argsz = 12,
         .size = len
     };
     struct iovec send_iovecs[2] = {
@@ -563,9 +574,11 @@ read_migr_data(int sock, void *buf, size_t len)
     };
 
     struct vfio_user_mig_data_with_data *res = calloc(1, sizeof(req) + len);
 
+    pthread_mutex_lock(&mutex);
     int ret = tran_sock_msg_iovec(sock, msg_id--, VFIO_USER_MIG_DATA_READ,
                                   send_iovecs, 2, NULL, 0, NULL,
                                   res, sizeof(req) + len, NULL, 0);
+    pthread_mutex_unlock(&mutex);
     if (ret < 0) {
         free(res);
@@ -574,6 +587,9 @@
 
     memcpy(buf, res->data, res->size);
 
-    return res->size;
+    ssize_t nread = res->size;
+    free(res);
+
+    return nread;
 }
 
@@ -582,7 +598,7 @@ write_migr_data(int sock, void *buf, size_t len)
 {
     static int msg_id = 0x2023;
     struct vfio_user_mig_data_with_data req = {
-        .argsz = 8 + len,
+        .argsz = 12 + len,
.size = len }; struct iovec send_iovecs[3] = { @@ -596,9 +609,11 @@ write_migr_data(int sock, void *buf, size_t len) } }; + pthread_mutex_lock(&mutex); int ret = tran_sock_msg_iovec(sock, msg_id--, VFIO_USER_MIG_DATA_WRITE, send_iovecs, 3, NULL, 0, NULL, &req, sizeof(req), NULL, 0); + pthread_mutex_unlock(&mutex); if (ret < 0) { return -1; @@ -816,32 +831,17 @@ usage(char *argv0) basename(argv0)); } -/* - * Normally each time the source client (QEMU) would read migration data from - * the device it would send them to the destination client. However, since in - * our sample both the source and the destination client are the same process, - * we simply accumulate the migration data of each iteration and apply it to - * the destination server at the end. - * - * Performs as many migration loops as @nr_iters or until the device has no - * more migration data (pending_bytes is zero), which ever comes first. The - * result of each migration iteration is stored in @migr_iter. @migr_iter must - * be at least @nr_iters. 
- * - * @returns the number of iterations performed - */ static size_t -do_migrate(int sock, size_t nr_iters, struct iovec *migr_iter) +do_migrate(int sock, size_t max_iters, size_t max_iter_size, + struct iovec *migr_iter) { int ret; + size_t i; bool is_more = true; - size_t i = 0; - assert(nr_iters > 0); + for (i = 0; i < max_iters && is_more; i++) { - for (i = 0; i < nr_iters && is_more; i++) { - - migr_iter[i].iov_len = 1024; + migr_iter[i].iov_len = max_iter_size; migr_iter[i].iov_base = malloc(migr_iter[i].iov_len); if (migr_iter[i].iov_base == NULL) { @@ -898,7 +898,7 @@ fake_guest(void *arg) if (ret != 0) { err(EXIT_FAILURE, "fake guest failed to write garbage to BAR1"); } - crc = rte_hash_crc(buf, fake_guest_data->bar1_size, crc); + crc = rte_hash_crc(buf, fake_guest_data->bar1_size, 0); __sync_synchronize(); } while (!fake_guest_data->done); @@ -913,7 +913,6 @@ migrate_from(int sock, size_t *nr_iters, struct iovec **migr_iters, { uint32_t device_state; int ret; - size_t _nr_iters; pthread_t thread; struct fake_guest_data fake_guest_data = { .sock = sock, @@ -922,13 +921,15 @@ migrate_from(int sock, size_t *nr_iters, struct iovec **migr_iters, .crcp = crcp }; + size_t max_iter_size = 4096; + ret = pthread_create(&thread, NULL, fake_guest, &fake_guest_data); if (ret != 0) { errno = ret; err(EXIT_FAILURE, "failed to create pthread"); } - *nr_iters = 2; + *nr_iters = 8; *migr_iters = malloc(sizeof(struct iovec) * *nr_iters); if (*migr_iters == NULL) { err(EXIT_FAILURE, NULL); @@ -944,8 +945,11 @@ migrate_from(int sock, size_t *nr_iters, struct iovec **migr_iters, err(EXIT_FAILURE, "failed to write to device state"); } - _nr_iters = do_migrate(sock, 1, *migr_iters); - assert(_nr_iters == 1); + ret = do_migrate(sock, *nr_iters, max_iter_size, *migr_iters); + if (ret < 0) { + err(EXIT_FAILURE, "failed to do migration in pre-copy state"); + } + printf("client: stopping fake guest thread\n"); fake_guest_data.done = true; __sync_synchronize(); @@ -963,11 +967,9 @@ 
migrate_from(int sock, size_t *nr_iters, struct iovec **migr_iters,
         err(EXIT_FAILURE, "failed to write to device state");
     }
 
-    _nr_iters += do_migrate(sock, 1, (*migr_iters) + _nr_iters);
-    if (_nr_iters != 2) {
-        errx(EXIT_FAILURE,
-             "expected 2 iterations instead of %zu while in stop-and-copy state",
-             _nr_iters);
+    size_t iters = do_migrate(sock, *nr_iters, max_iter_size, *migr_iters);
+    if (iters == 0) {
+        errx(EXIT_FAILURE, "failed to do migration in stop-and-copy state");
     }
 
     /* XXX read device state, migration must have finished now */
@@ -977,7 +979,7 @@ migrate_from(int sock, size_t *nr_iters, struct iovec **migr_iters,
         err(EXIT_FAILURE, "failed to write to device state");
     }
 
-    return _nr_iters;
+    return iters;
 }
 
 static int
@@ -1007,9 +1009,10 @@ migrate_to(char *old_sock_path, int *server_max_fds,
     if (ret == -1) {
        err(EXIT_FAILURE, "failed to fork");
     }
-    if (ret > 0) { /* child (destination server) */
+    if (ret == 0) { /* child (destination server) */
         char *_argv[] = {
             path_to_server,
+            (char *)"-r", // start in VFIO_DEVICE_STATE_RESUMING
            (char *)"-v",
            sock_path,
            NULL
@@ -1039,12 +1042,6 @@ migrate_to(char *old_sock_path, int *server_max_fds,
 
     negotiate(sock, server_max_fds, server_max_data_xfer_size, pgsize);
 
-    /* XXX set device state to resuming */
-    ret = set_migration_state(sock, device_state);
-    if (ret < 0) {
-        err(EXIT_FAILURE, "failed to set device state to resuming");
-    }
-
     for (i = 0; i < nr_iters; i++) {
 
         /* XXX write migration data */
@@ -1057,11 +1054,11 @@ migrate_to(char *old_sock_path, int *server_max_fds,
         }
     }
 
-    /* XXX set device state to running */
-    device_state = VFIO_DEVICE_STATE_RUNNING;
+    /* XXX set device state to stop to finish the transfer */
+    device_state = VFIO_DEVICE_STATE_STOP;
     ret = set_migration_state(sock, device_state);
     if (ret < 0) {
-        err(EXIT_FAILURE, "failed to set device state to running");
+        err(EXIT_FAILURE, "failed to set device state to stop");
     }
 
     /* validate contents of BAR1 */
@@ -1075,6 +1072,15 @@ migrate_to(char
*old_sock_path, int *server_max_fds, if (dst_crc != src_crc) { fprintf(stderr, "client: CRC mismatch: %u != %u\n", src_crc, dst_crc); abort(); + } else { + fprintf(stdout, "client: CRC match, we did it! :)\n"); + } + + /* XXX set device state to running */ + device_state = VFIO_DEVICE_STATE_RUNNING; + ret = set_migration_state(sock, device_state); + if (ret < 0) { + err(EXIT_FAILURE, "failed to set device state to running"); } return sock; diff --git a/samples/server.c b/samples/server.c index 62b79319..ce9acc44 100644 --- a/samples/server.c +++ b/samples/server.c @@ -62,7 +62,8 @@ struct server_data { size_t bar1_size; struct dma_regions regions[NR_DMA_REGIONS]; struct { - uint64_t pending_bytes; + uint64_t pending_read; + uint64_t pending_write; vfu_migr_state_t state; } migration; }; @@ -134,7 +135,7 @@ bar1_access(vfu_ctx_t *vfu_ctx, char * const buf, if (is_write) { if (server_data->migration.state == VFU_MIGR_STATE_PRE_COPY) { /* dirty the whole thing */ - server_data->migration.pending_bytes = server_data->bar1_size; + server_data->migration.pending_read = server_data->bar1_size; } memcpy(server_data->bar1 + offset, buf, count); } else { @@ -260,19 +261,20 @@ migration_device_state_transition(vfu_ctx_t *vfu_ctx, vfu_migr_state_t state) if (setitimer(ITIMER_REAL, &new, NULL) != 0) { err(EXIT_FAILURE, "failed to disable timer"); } - server_data->migration.pending_bytes = server_data->bar1_size + sizeof(time_t); /* FIXME BAR0 region size */ + server_data->migration.pending_read = server_data->bar1_size + sizeof(time_t); /* FIXME BAR0 region size */ break; case VFU_MIGR_STATE_PRE_COPY: - /* TODO must be less than size of data region in migration region */ - server_data->migration.pending_bytes = server_data->bar1_size; + server_data->migration.pending_read = server_data->bar1_size; break; case VFU_MIGR_STATE_STOP: /* FIXME should gracefully fail */ - assert(server_data->migration.pending_bytes == 0); + assert(server_data->migration.pending_read == 0); break; 
case VFU_MIGR_STATE_RESUME: + server_data->migration.pending_write = server_data->bar1_size + sizeof(time_t); break; case VFU_MIGR_STATE_RUNNING: + assert(server_data->migration.pending_write == 0); ret = arm_timer(vfu_ctx, server_data->bar0); if (ret < 0) { return ret; @@ -291,30 +293,59 @@ migration_read_data(vfu_ctx_t *vfu_ctx, void *buf, uint64_t size) struct server_data *server_data = vfu_get_private(vfu_ctx); /* - * For ease of implementation we expect the client to read all migration - * data in one go; partial reads are not supported. This is allowed by VFIO - * however we don't yet support it. Similarly, when resuming, partial - * writes are supported by VFIO, however we don't in this sample. - * * If in pre-copy state we copy BAR1, if in stop-and-copy state we copy * both BAR1 and BAR0. Since we always copy BAR1 in the stop-and-copy state, * copying BAR1 in the pre-copy state is pointless. Fixing this requires * more complex state tracking which exceeds the scope of this sample. 
 */
-    if (size != server_data->migration.pending_bytes) {
-        errno = EINVAL;
-        return -1;
+    if (server_data->migration.pending_read == 0 || size == 0) {
+        return 0;
     }
 
-    memcpy(buf, server_data->bar1, server_data->bar1_size);
+    uint32_t total_read = server_data->bar1_size;
+
     if (server_data->migration.state == VFU_MIGR_STATE_STOP_AND_COPY) {
-        memcpy(buf + server_data->bar1_size, &server_data->bar0,
-               sizeof(server_data->bar0));
+        total_read += sizeof(server_data->bar0);
+    }
+
+    uint32_t read_start = total_read - server_data->migration.pending_read;
+    uint32_t read_end = MIN(read_start + size, total_read); // exclusive
+    assert(read_end > read_start);
+
+    uint32_t bytes_read = read_end - read_start;
+
+    if (read_end <= server_data->bar1_size) {
+        // case 1: entire read lies within bar1
+        // TODO check the following is always allowed
+
+        memcpy(buf, server_data->bar1 + read_start, bytes_read);
+    } else if (
+        read_start < server_data->bar1_size // starts in bar1
+        && read_end > server_data->bar1_size // ends in bar0
+    ) {
+        // case 2: part of the read in bar1 and part of the read in bar0
+        // TODO check the following is always allowed
+
+        uint32_t length_in_bar1 = server_data->bar1_size - read_start;
+        uint32_t length_in_bar0 = read_end - server_data->bar1_size;
+        assert(length_in_bar1 + length_in_bar0 == bytes_read);
+
+        memcpy(buf, server_data->bar1 + read_start, length_in_bar1);
+        memcpy(buf + length_in_bar1, &server_data->bar0, length_in_bar0);
+    } else if (read_start >= server_data->bar1_size) {
+        // case 3: entire read lies within bar0
+        // TODO check the following is always allowed
+
+        read_start -= server_data->bar1_size;
+        read_end -= server_data->bar1_size;
+
+        memcpy(buf, (char *)&server_data->bar0 + read_start, bytes_read);
     }
 
-    server_data->migration.pending_bytes = 0;
-    return size;
+    server_data->migration.pending_read -= bytes_read;
+
+    return bytes_read;
 }
 
 static ssize_t
@@ -322,34 +353,53 @@ migration_write_data(vfu_ctx_t *vfu_ctx, void *data, uint64_t size)
 {
     struct server_data *server_data = vfu_get_private(vfu_ctx);
     char *buf = data;
-    int ret;
 
     assert(server_data != NULL);
     assert(data != NULL);
 
-    if (size < server_data->bar1_size) {
-        vfu_log(vfu_ctx, LOG_DEBUG, "XXX bad migration data write %#llx-%#llx",
-                (unsigned long long)0,
-                (unsigned long long)size - 1);
-        errno = EINVAL;
-        return -1;
-    }
-
-    memcpy(server_data->bar1, buf, server_data->bar1_size);
-    buf += server_data->bar1_size;
-    size -= server_data->bar1_size;
-    if (size == 0) {
+    if (server_data->migration.pending_write == 0 || size == 0) {
         return 0;
     }
-    if (size != sizeof(server_data->bar0)) {
-        errno = EINVAL;
-        return -1;
+
+    uint32_t total_write = server_data->bar1_size + sizeof(server_data->bar0);
+
+    uint32_t write_start = total_write - server_data->migration.pending_write;
+    uint32_t write_end = MIN(write_start + size, total_write); // exclusive
+    assert(write_end > write_start);
+
+    uint32_t bytes_written = write_end - write_start;
+
+    if (write_end <= server_data->bar1_size) {
+        // case 1: entire write lies within bar1
+        // TODO check the following is always allowed
+
+        memcpy(server_data->bar1 + write_start, buf, bytes_written);
+    } else if (
+        write_start < server_data->bar1_size // starts in bar1
+        && write_end > server_data->bar1_size // ends in bar0
+    ) {
+        // case 2: part of the write in bar1 and part of the write in bar0
+        // TODO check the following is always allowed
+
+        uint32_t length_in_bar1 = server_data->bar1_size - write_start;
+        uint32_t length_in_bar0 = write_end - server_data->bar1_size;
+        assert(length_in_bar1 + length_in_bar0 == bytes_written);
+
+        memcpy(server_data->bar1 + write_start, buf, length_in_bar1);
+        memcpy(&server_data->bar0, buf + length_in_bar1, length_in_bar0);
+    } else if (write_start >= server_data->bar1_size) {
+        // case 3: entire write lies within bar0
+        // TODO check the following is always allowed
+
+        write_start -= server_data->bar1_size;
+        write_end -= server_data->bar1_size;
+
+        memcpy((char *)&server_data->bar0
+               + write_start, buf, bytes_written);
     }
 
-    memcpy(&server_data->bar0, buf, sizeof(server_data->bar0));
-    ret = bar0_access(vfu_ctx, buf, sizeof(server_data->bar0), 0, true);
-    assert(ret == (int)size); /* FIXME */
-    return 0;
+    server_data->migration.pending_write -= bytes_written;
+
+    return bytes_written;
 }
 
 int main(int argc, char *argv[])
@@ -357,6 +407,7 @@ int main(int argc, char *argv[])
     char template[] = "/tmp/libvfio-user.XXXXXX";
     int ret;
     bool verbose = false;
+    bool destination = false;
     int opt;
     struct sigaction act = {.sa_handler = _sa_handler};
     const size_t bar1_size = 0x3000;
@@ -375,13 +426,19 @@ int main(int argc, char *argv[])
         .write_data = &migration_write_data
     };
 
-    while ((opt = getopt(argc, argv, "v")) != -1) {
+    while ((opt = getopt(argc, argv, "vr")) != -1) {
         switch (opt) {
             case 'v':
                 verbose = true;
                 break;
+            case 'r':
+                destination = true;
+                server_data.migration.state = VFU_MIGR_STATE_RESUME;
+                server_data.migration.pending_write =
+                    bar1_size + sizeof(time_t);
+                break;
            default: /* '?' */
-                errx(EXIT_FAILURE, "Usage: %s [-v] ", argv[0]);
+                errx(EXIT_FAILURE, "Usage: %s [-v] [-r] ", argv[0]);
         }
     }
 
@@ -459,7 +516,12 @@ int main(int argc, char *argv[])
         err(EXIT_FAILURE, "failed to setup BAR1 region");
     }
 
-    ret = vfu_setup_device_migration_callbacks(vfu_ctx, 0, &migr_callbacks);
+    ret = vfu_setup_device_migration_callbacks(
+        vfu_ctx,
+        destination ?
LIBVFIO_USER_MIG_FLAG_START_RESUMING : 0, + &migr_callbacks + ); + if (ret < 0) { err(EXIT_FAILURE, "failed to setup device migration"); } From dcc535fd14822c601e15e7bb535764b327eba878 Mon Sep 17 00:00:00 2001 From: William Henderson Date: Thu, 6 Jul 2023 14:02:11 +0000 Subject: [PATCH 8/9] add DMA dirty page logging to the spec Signed-off-by: William Henderson --- docs/vfio-user.rst | 101 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 101 insertions(+) diff --git a/docs/vfio-user.rst b/docs/vfio-user.rst index 53327f2c..08882ae8 100644 --- a/docs/vfio-user.rst +++ b/docs/vfio-user.rst @@ -1528,6 +1528,22 @@ Device Features ^^^^^^^^^^^^^^^ The only device features supported by vfio-user are those related to migration. +They are a subset of those supported in the VFIO implementation of the Linux +kernel. + ++----------------------------------------+-------+ +| Name | Value | ++========================================+=======+ +| VFIO_DEVICE_FEATURE_MIGRATION | 1 | ++----------------------------------------+-------+ +| VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE | 2 | ++----------------------------------------+-------+ +| VFIO_DEVICE_FEATURE_DMA_LOGGING_START | 6 | ++----------------------------------------+-------+ +| VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP | 7 | ++----------------------------------------+-------+ +| VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT | 8 | ++----------------------------------------+-------+ ``VFIO_DEVICE_FEATURE_MIGRATION`` """"""""""""""""""""""""""""""""" @@ -1604,6 +1620,91 @@ structured as follows: * *data_fd* is unused in vfio-user, as the ``VFIO_USER_MIG_DATA_READ`` and ``VFIO_USER_MIG_DATA_WRITE`` messages are used instead for migration data transport. + +``VFIO_DEVICE_FEATURE_DMA_LOGGING_START`` / ``VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP`` +"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" + +Upon ``VFIO_DEVICE_FEATURE_SET``, start/stop DMA logging. 
These features can +also be probed to determine whether the device supports DMA logging. + +When DMA logging is started, a range of IOVAs to monitor is provided and the +device can optimize its logging to cover only the IOVA range given. Only DMA +writes are logged. + +The data field of the ``SET`` request is structured as follows: + ++------------+--------+----------+ +| Name | Offset | Size | ++============+========+==========+ +| page_size | 0 | 8 | ++------------+--------+----------+ +| num_ranges | 8 | 4 | ++------------+--------+----------+ +| reserved | 12 | 4 | ++------------+--------+----------+ +| ranges | 16 | variable | ++------------+--------+----------+ + +* *page_size* hints what tracking granularity the device should try to achieve. + If the device cannot do the hinted page size then it's the driver's choice + which page size to pick based on its support. On output the device will return + the page size it selected. + +* *num_ranges* is the number of IOVA ranges to monitor. A value of zero + indicates that all writes should be logged. + +* *ranges* is an array of ``vfio_user_device_feature_dma_logging_range`` + entries: + ++--------+--------+------+ +| Name | Offset | Size | ++========+========+======+ +| iova | 0 | 8 | ++--------+--------+------+ +| length | 8 | 8 | ++--------+--------+------+ + + * *iova* is the base IO virtual address + * *length* is the length of the range to log + +Upon success, the response data field will be the same as the request, unless +the page size was changed, in which case this will be reflected in the response. + +``VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT`` +"""""""""""""""""""""""""""""""""""""""""" + +Upon ``VFIO_DEVICE_FEATURE_GET``, returns the dirty bitmap for a specific IOVA +range. This operation is only valid if logging of dirty pages has been +previously started by setting ``VFIO_DEVICE_FEATURE_DMA_LOGGING_START``. 
+ +The data field of the request is structured as follows: + ++-----------+--------+------+ +| Name | Offset | Size | ++===========+========+======+ +| iova | 0 | 8 | ++-----------+--------+------+ +| length | 8 | 8 | ++-----------+--------+------+ +| page_size | 16 | 8 | ++-----------+--------+------+ + +* *iova* is the base IO virtual address + +* *length* is the length of the range + +* *page_size* is the unit of granularity of the bitmap, and must be a power of + two. It doesn't have to match the value given to + ``VFIO_DEVICE_FEATURE_DMA_LOGGING_START`` because the driver will format its + internal logging to match the reporting page size possibly by replicating bits + if the internal page size is lower than requested + +The data field of the response is identical, except with the bitmap added on +the end at offset 24. + +The mapping of IOVA to bits is given by: + +``bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64))`` Direct State Transitions """""""""""""""""""""""" From 7c68798a31254ca82449bd4eff54784939455566 Mon Sep 17 00:00:00 2001 From: William Henderson Date: Thu, 6 Jul 2023 15:34:25 +0000 Subject: [PATCH 9/9] fix: correct command enum for updated spec Signed-off-by: William Henderson --- include/vfio-user.h | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/include/vfio-user.h b/include/vfio-user.h index 50cfb4ad..42ffa296 100644 --- a/include/vfio-user.h +++ b/include/vfio-user.h @@ -66,10 +66,11 @@ enum vfio_user_command { VFIO_USER_DMA_READ = 11, VFIO_USER_DMA_WRITE = 12, VFIO_USER_DEVICE_RESET = 13, - VFIO_USER_DIRTY_PAGES = 14, - VFIO_USER_DEVICE_FEATURE = 15, - VFIO_USER_MIG_DATA_READ = 16, - VFIO_USER_MIG_DATA_WRITE = 17, + VFIO_USER_DIRTY_PAGES = 14, // TODO remove + VFIO_USER_REGION_WRITE_MULTI = 15, // TODO implement + VFIO_USER_DEVICE_FEATURE = 16, + VFIO_USER_MIG_DATA_READ = 17, + VFIO_USER_MIG_DATA_WRITE = 18, VFIO_USER_MAX, };
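A note on the ``VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT`` formula added to the
spec in PATCH 8/9: the conventional reading, as implemented by the Linux kernel
and QEMU, treats the bitmap as an array of little-endian 64-bit words indexed
by page number, i.e. bit ``N = (addr - iova) / page_size`` covers the N-th
page. A minimal sketch of that lookup follows; the helper name is illustrative
and not part of the spec:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Dirty-bitmap lookup sketch: bit N of the report covers the page
 * [iova + N * page_size, iova + (N + 1) * page_size). The bitmap is
 * read as an array of 64-bit words, least significant bit first.
 */
static bool
page_is_dirty(const uint64_t *bitmap, uint64_t iova, uint64_t page_size,
              uint64_t addr)
{
    uint64_t bit = (addr - iova) / page_size;

    return (bitmap[bit / 64] & (UINT64_C(1) << (bit % 64))) != 0;
}
```

With a 4 KiB page size, marking pages 5 and 70 dirty sets bit 5 of word 0 and
bit 6 of word 1 respectively.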
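The chunked ``migration_read_data``/``migration_write_data`` rework in
PATCH 7/9 is, at its core, a sliding window over a virtual stream consisting
of BAR1 followed by BAR0. A self-contained sketch of the same offset
arithmetic, with illustrative names and a byte-wise copy for brevity:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Read up to `size` bytes of the migration stream (bar1 then bar0),
 * resuming from wherever `*pending` says the previous read stopped.
 * Returns the number of bytes produced; 0 once the stream is drained.
 */
static uint64_t
stream_read(const uint8_t *bar1, uint64_t bar1_size,
            const uint8_t *bar0, uint64_t bar0_size,
            uint64_t *pending, uint8_t *buf, uint64_t size)
{
    uint64_t total = bar1_size + bar0_size;
    uint64_t start = total - *pending;   /* next unread stream offset */
    uint64_t end = start + size < total ? start + size : total;

    assert(start <= total);

    for (uint64_t off = start; off < end; off++) {
        buf[off - start] = off < bar1_size ? bar1[off]
                                           : bar0[off - bar1_size];
    }

    *pending -= end - start;
    return end - start;
}
```

Each call consumes the next window, so the bar1/bar0 boundary never has to
line up with the caller's chunk size; this is what lets the client read with
chunks smaller than the total device state.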