diff --git a/docs/overview.html b/docs/overview.html index ec8145ab..59e60a0b 100644 --- a/docs/overview.html +++ b/docs/overview.html @@ -455,23 +455,32 @@

VDI: VEuPathDB Dataset Installer

  • Events
  • Complications and Gotchas @@ -705,6 +718,24 @@

    Update Meta: Metadata Installation and Updates

    A dataset’s actual data cannot be installed into a target system without the meta or 'control' records first being present.

    +
    + + + + + +
    + + +
    +

    As a special case, because the meta/control records are required to be present +in a target system for an install to happen, on successful completion, the +update-meta lane fires a dataset reconciliation event to try and avoid the +possibility of a long wait for a dataset to be installed if the install event +hits before the update-meta event. See Event Ordering for more information.

    +
    +
    +

    Install Data: Dataset Content Installation

    @@ -749,47 +780,286 @@

    Hard-Delete

    Dataset Reconciliation

    - +
    +

    The dataset reconciliation lane is responsible for examining the state of a +dataset in the object store and ensuring that state is accurately reflected in +the internals of VDI as well as in all relevant installation targets. The +dataset reconciliation lane attempts corrections or updates to target system +state by firing events for the other lanes to pick up and process.

    +
    +
    +

    For example, if the dataset reconciler receives a reconciliation event for a +dataset that is marked as deleted in the object store, but is not yet +uninstalled from an installation target, the dataset reconciler will fire an +uninstallation event for the dataset for the +uninstallation lane to process.

    +

    Events

    -
    -

    Event Ordering

    - +
    +

    VDI-internal events are 3-field JSON messages containing only the identifiers +for the relevant dataset and an event source indicator, which records whether the +event originated from an object store bucket event or was fired by a reconciliation +process.

    + +
    +Example Event +
    +
    +
    +
    {
    +  "userID": 123456,
    +  "datasetID": "VNY9UUYo8ZA",
    +  "source": "ObjectStore"
    +}
    -
    -

    Plugins

    -
    -
    -

    The Plugin Server

    - +
    +
    +
    +
    +

    These events do not contain any additional information as the state of the +dataset or object store may have changed by the time the event is processed. +When an event is eventually processed by the relevant lane, that lane is +responsible for validating the status of the dataset before operating on that +dataset.
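    As a rough illustration of the message shape described above, the following Python sketch parses and checks a 3-field event. The field names come from the example event; the `DatasetEvent` class, the `parse_event` helper, and the `"Reconciler"` source value are hypothetical names chosen for this sketch, not part of VDI.

```python
import json
from dataclasses import dataclass

# "ObjectStore" appears in the example event; "Reconciler" is an assumed
# name for the reconciliation-originated source value.
KNOWN_SOURCES = {"ObjectStore", "Reconciler"}
EXPECTED_FIELDS = {"userID", "datasetID", "source"}


@dataclass(frozen=True)
class DatasetEvent:
    user_id: int
    dataset_id: str
    source: str


def parse_event(raw: str) -> DatasetEvent:
    """Parse a 3-field event message, rejecting any other shape."""
    body = json.loads(raw)
    if not isinstance(body, dict) or set(body) != EXPECTED_FIELDS:
        raise ValueError(f"expected exactly the fields {sorted(EXPECTED_FIELDS)}")
    if body["source"] not in KNOWN_SOURCES:
        raise ValueError(f"unknown event source: {body['source']!r}")
    return DatasetEvent(int(body["userID"]), str(body["datasetID"]), body["source"])


event = parse_event(
    '{"userID": 123456, "datasetID": "VNY9UUYo8ZA", "source": "ObjectStore"}'
)
```

    Because the event carries no dataset state, a lane receiving `event` would still need to look up the current state of the dataset before acting on it.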

    -

    Plugin Scripts

    +

    Event Types

    +
    +

    Events themselves are not actually 'typed'; the type is determined by the +Kafka topic to which the event message is submitted.
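    A minimal sketch of what topic-determined typing could look like: the same payload is routed to a different lane purely by the topic it arrived on. The topic names and the `dispatch` and `make_lane` helpers are assumptions for illustration; no Kafka client is used here.

```python
from typing import Callable, Dict, List

# Hypothetical topic names; the real topic names are deployment configuration.
TOPICS = ("import", "update-meta", "install", "uninstall",
          "hard-delete", "share", "reconciliation")

handled: List[str] = []  # records which lane saw which dataset


def make_lane(name: str) -> Callable[[dict], None]:
    """Build a stand-in lane that only records what it was asked to process."""
    def lane(event: dict) -> None:
        handled.append(f"{name}:{event['datasetID']}")
    return lane


LANES: Dict[str, Callable[[dict], None]] = {t: make_lane(t) for t in TOPICS}


def dispatch(topic: str, event: dict) -> None:
    # The payload has no type field; the topic alone selects the lane.
    LANES[topic](event)


payload = {"userID": 123456, "datasetID": "VNY9UUYo8ZA", "source": "ObjectStore"}
dispatch("import", payload)
dispatch("update-meta", payload)
```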

    +
    +

    Import

    - +
    +

    When a dataset is initially uploaded, an event is submitted to the import topic +whenever an import-ready file or a +dataset meta file is put into the object store. This means +that for every upload, 2 import events will be fired.

    +
    +
    +

    The dataset reconciliation lane may also fire import events +if it finds that import-ready files are present in the object store, but install +ready files are not.

    +
    -

    Update Meta

    - +

    Install

    +
    +

    An install event is fired when install-ready files are put +into the object store.

    +
    +
    +

    Install events may also be fired by the dataset reconciliation +lane if it finds that install-ready files are present in the object store but +the dataset is not yet installed into all of its target systems.

    +
    -

    Check Compatibility

    - +

    Update-Meta

    +
    +

    The update meta event is fired when a dataset meta file is put +into the object store.

    +
    -

    Install Data

    +

    Uninstall

    +
    +

    Uninstall events are fired when a delete flag is put into the +object store for a dataset.

    +
    +
    +

    The uninstall event may also be fired by the dataset +reconciliation lane if the dataset is found to have a delete flag in the +object store, but is not yet uninstalled from one or more of the dataset’s +install targets.
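    Taken together, the reconciliation-fired import, install, and uninstall events described above can be sketched as a single decision function. This is an illustrative sketch only: the `DatasetState` fields, the `events_to_fire` name, and the precedence of the delete flag over the other checks are assumptions, not the actual reconciler logic.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class DatasetState:
    """Hypothetical snapshot of what the reconciler found for one dataset."""
    has_import_ready: bool = False
    has_install_ready: bool = False
    has_delete_flag: bool = False
    targets: Set[str] = field(default_factory=set)       # intended install targets
    installed_in: Set[str] = field(default_factory=set)  # targets holding the data


def events_to_fire(state: DatasetState) -> List[str]:
    # Assumption: a delete flag takes precedence over import/install checks.
    if state.has_delete_flag:
        return ["uninstall"] if state.installed_in else []
    # Install-ready files present but at least one target is missing the data.
    if state.has_install_ready:
        return ["install"] if state.targets - state.installed_in else []
    # Import-ready files present but install-ready files are not.
    if state.has_import_ready:
        return ["import"]
    return []
```

    For example, a dataset flagged as deleted but still installed in one target yields `["uninstall"]`, while a fully installed dataset yields an empty list.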

    +
    +
    +
    +

    Hard-Delete

    +
    +

    Hard delete events are fired when objects are actually removed from the object +store by the dataset pruner.

    +
    +
    +
    +

    Share

    -

    Uninstall

    +

    Reconciliation

    +
    +
    +
    +

    Event Ordering

    +
    +

    Given a single, isolated VDI instance under no load, events would happen in a +predictable order:

    +
    +
    +
    + 1. Install Meta
    + 2. Import
    + 3. Install
    +
    +
    +

    In practice, however, multiple VDI instances run simultaneously, so +datasets are replicated in from other instances, and load is +unpredictable. As a result, events may happen in an unpredictable order.

    +
    +
    +

    To illustrate this, imagine a replicated dataset’s install-ready data is made +available before any other dataset files. In this case, the install-data +event may fire before update-meta, resulting in the event being rejected due to +missing control records in the install target. Then, when the metadata is +replicated over, the update-meta event will fire after the install was already +attempted.

    +
    +
    +

    Because event ordering is unpredictable in practice, a few rules are +in place to prevent unnecessary processing and to make sure that +the few events that depend on one another happen in the correct +relative order.

    +
    +
    +

    Additionally, lane operations are idempotent to ensure that if/when events are +processed unnecessarily, the end result is the same.

    +
    +
    +

    Import

    +
    +

    The import event is one of the first events fired for a newly uploaded dataset. +For replicated datasets, however, this event may not be necessary at all.

    +
    +
    +

    To avoid doing extra work, the import process is skipped if the +dataset 'directory' in the object store already contains +install-ready files and a +dataset manifest file.
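    The skip rule above could be sketched as a check over the object keys found under a dataset's 'directory'. The `vdi-manifest.json` name is taken from this document; the `install-ready.zip` file name, the key layout, and the `should_skip_import` helper are assumptions made for illustration.

```python
from typing import Iterable

MANIFEST = "vdi-manifest.json"       # named in this document
INSTALL_READY = "install-ready.zip"  # assumed file name


def should_skip_import(dataset_dir: str, object_keys: Iterable[str]) -> bool:
    """Return True when import output already exists for the dataset."""
    prefix = dataset_dir.rstrip("/") + "/"
    # Keep only the keys inside this dataset's 'directory', relative to it.
    names = {key[len(prefix):] for key in object_keys if key.startswith(prefix)}
    return INSTALL_READY in names and MANIFEST in names


replicated = [
    "123456/VNY9UUYo8ZA/install-ready.zip",
    "123456/VNY9UUYo8ZA/vdi-manifest.json",
    "123456/VNY9UUYo8ZA/vdi-meta.json",
]
fresh_upload = [
    "123456/VNY9UUYo8ZA/import-ready.zip",
    "123456/VNY9UUYo8ZA/vdi-meta.json",
]
```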

    +
    +
    +
    +

    Install Meta

    +
    +

    Along with the import event, the install/update meta event is one of the first +events fired for a new dataset.

    +
    +
    +

    This event being processed is a prerequisite of a dataset being installed into +any target systems. To account for the likelihood that this event will be fired +after an install is attempted in the case of dataset replication, the +update-meta lane fires an additional dataset reconciliation +event to make sure an install event is fired again if one had already been +rejected for the dataset.

    +
    +
    +
    +

    Install Data

    +
    +

    For an install-data event to be processed, an update meta +event must have already been processed to create the control records in the +relevant target systems.
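    The interplay between a rejected install and the follow-up reconciliation event can be illustrated with a small in-memory simulation. Everything here (the queue, the `process` function, the log messages) is a hypothetical stand-in for the real lanes; it only models the out-of-order case described above, in which the install event arrives before update-meta.

```python
from collections import deque

control_records = set()          # datasets with meta/control records in the target
installed = set()                # datasets whose data is installed in the target
install_ready = {"VNY9UUYo8ZA"}  # install-ready data already replicated in
log = []

# Out-of-order arrival: the install event precedes the update-meta event.
queue = deque([("install", "VNY9UUYo8ZA"), ("update-meta", "VNY9UUYo8ZA")])


def process(kind: str, dataset_id: str) -> None:
    if kind == "install":
        if dataset_id not in control_records:
            log.append("install rejected")  # control records are missing
            return
        installed.add(dataset_id)           # set.add makes a repeat install a no-op
        log.append("installed")
    elif kind == "update-meta":
        control_records.add(dataset_id)
        log.append("meta updated")
        queue.append(("reconciliation", dataset_id))
    elif kind == "reconciliation":
        if dataset_id in install_ready and dataset_id not in installed:
            log.append("install re-fired")
            queue.append(("install", dataset_id))


while queue:
    process(*queue.popleft())
```

    After the loop, `log` records the rejection, the metadata update, the re-fired install, and finally the successful install, in that order.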

    +
    +
    +

    import → install-meta → install-data →

    +
    +
    +
    +
    +
    +
    +

    Plugins

    +
    +
    +

    VDI plugins are implemented as a collection of scripts in any language executed +by separate service instances that are wrapped by a standard HTTP API. Plugin +services are registered with the primary VDI instance via environment variables.

    +
    +
    +

    The Plugin Server

    + +
    +

    The plugin server is a small HTTP server exposing 4 RPC-style endpoints that +trigger the execution of one or more scripts that are registered with the plugin +server instance.

    +
    +
    +

    Depending on the endpoint, data may be posted to the plugin to be used by the +plugin script, and data may be returned to VDI to be put into the object store.

    +
    +
    +
    +

    Plugin Scripts

    +
    +
    +
    Import
    +
    +

    The import script accepts the raw upload data and performs syntactic validation +as well as any reformatting necessary to prepare the data for installation.

    +
    +
    +
    +
    +
    +
    Update Meta
    +
    +

    The update meta script is handed the full metadata for a dataset and may be used +to perform custom metadata installation steps beyond those performed by the +VDI service itself.

    +
    +
    +
    +
    +
    +
    Check Compatibility
    +
    +

    The check compatibility script is a pre-install step executed to ensure that the +data in the dataset is compatible with the target system.

    +
    +

    This script is run as part of the install step immediately before the install +data script itself is run. It has access to the install ready +set of files.
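    One way to picture the ordering described above is an install step that runs the compatibility check first and only proceeds to the install-data script when the check passes. The scripts are modeled as plain Python callables here; the `run_install` name, the return values, and the success/failure convention are assumptions, not the plugin server's actual interface.

```python
from typing import Callable, List

# A script stand-in: takes the install-ready file names, returns True on success.
Script = Callable[[List[str]], bool]


def run_install(files: List[str],
                check_compatibility: Script,
                install_data: Script) -> str:
    """Run check-compatibility immediately before install-data."""
    if not check_compatibility(files):
        return "incompatible"  # install-data is never invoked
    return "installed" if install_data(files) else "failed"


calls: List[str] = []


def compat_ok(files: List[str]) -> bool:
    calls.append("check-compatibility")
    return True


def compat_bad(files: List[str]) -> bool:
    calls.append("check-compatibility")
    return False


def install_ok(files: List[str]) -> bool:
    calls.append("install-data")
    return True


first = run_install(["data.tsv"], compat_ok, install_ok)
second = run_install(["data.tsv"], compat_bad, install_ok)
```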

    +
    +
    +
    +
    +
    +
    +
    Install Data
    +
    +

    The install-data script takes the install-ready data and performs the +installation of that data into a target system.

    +
    +
    +
    +
    +
    +
    Uninstall
    +
    +

    The uninstall script is responsible for removing all data for a dataset from a +target system.

    +
    +
    @@ -931,7 +1201,8 @@

    Import Ready Files

    In future versions of VDI the raw user upload would be in a separate file raw-upload.zip which would be replaced by import-ready.zip once the upload -had been sanity and security checked.

    +has been sanity and security checked (a process which is currently in-line in +the REST service).

    @@ -957,7 +1228,8 @@

    Dataset Metadata

    from the user and the initial upload process.

    Metadata Contents
    @@ -1028,7 +1300,7 @@

    Dataset Metadata

    Visibility

    -

    Enum[String]

    +

    github Enum[String]

    A visibility indicator for a dataset that controls who can see the dataset by default, once installed.

    @@ -1153,15 +1425,303 @@

    Dataset Metadata

    +
    +vdi-meta.json Example +
    +
    +
    +
    {
    +  "created": "2024-05-23T16:25:44-04:00",
    +  "dependencies": [
    +    {
    +      "resourceDisplayName": "Some Data",
    +      "resourceIdentifier": "some_data",
    +      "resourceVersion": "20160416"
    +    }
    +  ],
    +  "description": "The description of some dataset that I uploaded.",
    +  "name": "My Dataset",
    +  "origin": "direct-upload",
    +  "owner": 123456789,
    +  "projects": [
    +    "PlasmoDB",
    +    "ClinEpiDB"
    +  ],
    +  "sourceUrl": "https://my.datafile.hosting.site/files/my-data.zip",
    +  "summary": "A short summary.",
    +  "type": {
    +    "name": "genelist",
    +    "version": "1.0"
    +  },
    +  "visibility": "private"
    +}
    +
    +
    +
    +

    Dataset Manifest

    - +
    +

    The vdi-manifest.json file contains a manifest of the input and output files +of the dataset import process.

    +
    + +
    +
    Manifest Contents
    +
    + +++++ + + + + + + + + + + + + +

    Input Files

    Array of File Info

    Array containing the name and size of each of the files that were present in +the import-ready.zip file.

    Output Files

    Array of File Info

    Array containing the name and size of each of the files that were produced by +the relevant plugin’s import script.

    +
    +
    +
    +vdi-manifest.json Schema +
    +
    +
    +
    {
    +  "$schema": "https://json-schema.org/draft-07/schema",
    +  "type": "object",
    +  "definitions": {
    +    "file-info": {
    +      "type": "object",
    +      "properties": {
    +        "filename": {
    +          "type": "string"
    +        },
    +        "fileSize": {
    +          "type": "integer",
    +          "minimum": 0
    +        }
    +      },
    +      "required": [
    +        "filename",
    +        "fileSize"
    +      ]
    +    }
    +  },
    +  "properties": {
    +    "inputFiles": {
    +      "type": "array",
    +      "items": {
    +        "$ref": "#/definitions/file-info"
    +      },
    +      "additionalItems": false
    +    },
    +    "outputFiles": {
    +      "type": "array",
    +      "items": {
    +        "$ref": "#/definitions/file-info"
    +      },
    +      "additionalItems": false
    +    }
    +  },
    +  "required": [
    +    "inputFiles",
    +    "outputFiles"
    +  ]
    +}
    +
    +
    +
    +
    +
    +vdi-manifest.json Example +
    +
    +
    +
    {
    +  "inputFiles": [
    +    {
    +      "filename": "my-upload.biom",
    +      "fileSize": 123124
    +    }
    +  ],
    +  "outputFiles": [
    +    {
    +      "filename": "meta.json",
    +      "fileSize": 10276
    +    },
    +    {
    +      "filename": "data.tsv",
    +      "fileSize": 75021
    +    }
    +  ]
    +}
    +
    +
    +
    +
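    For readers who want to check a manifest without a JSON Schema library, the following standard-library sketch applies the same constraints as the schema above: both arrays are required, and every entry needs a string `filename` and a non-negative integer `fileSize`. The `manifest_errors` helper and its error message format are hypothetical, and the check is slightly stricter than the schema in that it rejects integer-valued floats such as `1.0`.

```python
from typing import Any, List


def _file_info_errors(entry: Any, where: str) -> List[str]:
    if not isinstance(entry, dict):
        return [f"{where}: expected an object"]
    errors = []
    if not isinstance(entry.get("filename"), str):
        errors.append(f"{where}.filename: expected a string")
    size = entry.get("fileSize")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_int = isinstance(size, int) and not isinstance(size, bool)
    if not is_int or size < 0:
        errors.append(f"{where}.fileSize: expected an integer >= 0")
    return errors


def manifest_errors(manifest: Any) -> List[str]:
    """Return a list of problems; an empty list means the manifest passed."""
    if not isinstance(manifest, dict):
        return ["manifest: expected an object"]
    errors: List[str] = []
    for key in ("inputFiles", "outputFiles"):
        files = manifest.get(key)
        if not isinstance(files, list):
            errors.append(f"{key}: expected an array")
            continue
        for i, entry in enumerate(files):
            errors.extend(_file_info_errors(entry, f"{key}[{i}]"))
    return errors


example = {
    "inputFiles": [{"filename": "my-upload.biom", "fileSize": 123124}],
    "outputFiles": [
        {"filename": "meta.json", "fileSize": 10276},
        {"filename": "data.tsv", "fileSize": 75021},
    ],
}
```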

    Shares

    - +
    +

    Shares of datasets from a dataset’s owner to other target users are represented +in the object store as a directory structure. Within an individual dataset’s +'directory' in the object store, if a dataset has at least one share, there will +be a subdirectory named "shares". This "shares" directory contains +one or more additional subdirectories, each named with the user ID of the share +recipient. Inside each recipient directory are 2 flag files. One indicates the +status of the offer from the dataset owner and the other indicates the status +of the receipt from the share recipient.

    +
    +
    +

    This 2-flag system allows the dataset owner to revoke a share after it has been +created, and also allows the share recipient to accept or reject share offers.

    +
    +
    +

    Share Offer

    + +
    +
    Offer Contents
    +
    + +++++ + + + + + + + +

    Action

    Enum[String]

    A string value of "grant" or "revoke" indicating the status of the share +offer.

    +
    +
    +
    +offer.json Schema +
    +
    +
    +
    {
    +  "$schema": "https://json-schema.org/draft-07/schema",
    +  "type": "object",
    +  "properties": {
    +    "action": {
    +      "type": "string",
    +      "enum": [
    +        "grant",
    +        "revoke"
    +      ]
    +    }
    +  },
    +  "required": [
    +    "action"
    +  ],
    +  "additionalProperties": false
    +}
    +
    +
    +
    +
    +
    +offer.json Example +
    +
    +
    +
    {
    +  "action": "grant"
    +}
    +
    +
    +
    +
    +
    +
    +

    Share Receipt

    + +
    +
    Receipt Contents
    +
    + +++++ + + + + + + + +

    Action

    Enum[String]

    A string value of "accept" or "reject" indicating the status of the share +receipt.

    +
    +
    +
    +receipt.json Schema +
    +
    +
    +
    {
    +  "$schema": "https://json-schema.org/draft-07/schema",
    +  "type": "object",
    +  "properties": {
    +    "action": {
    +      "type": "string",
    +      "enum": [
    +        "accept",
    +        "reject"
    +      ]
    +    }
    +  },
    +  "required": [
    +    "action"
    +  ],
    +  "additionalProperties": false
    +}
    +
    +
    +
    +
    +
    +receipt.json Example +
    +
    +
    +
    {
    +  "action": "reject"
    +}
    +
    +
    +
    +
    +
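    The 2-flag share layout can be sketched as follows. The `shares` subdirectory, the per-recipient directory, and the `offer.json` and `receipt.json` names come from this document; the dataset directory path, the helper names, and the rule that a share is only in effect when the offer is `"grant"` and the receipt is `"accept"` are assumptions made for this illustration.

```python
from typing import Optional, Tuple


def share_flag_keys(dataset_dir: str, recipient_id: int) -> Tuple[str, str]:
    """Build the object keys for one recipient's offer and receipt flags."""
    base = f"{dataset_dir.rstrip('/')}/shares/{recipient_id}"
    return f"{base}/offer.json", f"{base}/receipt.json"


def share_in_effect(offer_action: Optional[str],
                    receipt_action: Optional[str]) -> bool:
    # Assumption: both sides must agree; a missing flag (None) counts as "no".
    return offer_action == "grant" and receipt_action == "accept"


offer_key, receipt_key = share_flag_keys("123456/VNY9UUYo8ZA", 987654)
```

    Under this assumed rule, the owner writing `"revoke"` or the recipient writing `"reject"` each independently ends the share, which matches the two capabilities the 2-flag system is described as providing.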
    @@ -1171,8 +1731,20 @@

    Complications and Gotchas

    MinIO and Event Replication

    -

    TODO: minio doesn’t fire events on data replication in spite of their promise of - s3 compatibility.

    +

    TODO: rephrase this

    +
    +
    +

    While MinIO repeatedly promises "unyielding" compatibility with +AWS S3, it unfortunately does not follow through on that promise. MinIO’s +implementers made the decision to disable object events on replication, which +means that VDI’s core driver is non-functional for datasets replicated in from +an external MinIO instance.

    +
    +
    +

    When this change to MinIO was discovered, a new 'slim' mode was added to the +Reconciliation Scheduler. It runs every few minutes to catch +replicated data and fire events that keep the local system up to date without +waiting for the full reconciliation run, which happens much less frequently.

    @@ -1195,12 +1767,6 @@

    Terminology

    -

    Event Ordering

    -
    -

    import → install-meta → install-data →

    -
    -
    -

    Unexpected Hiccups and Outages

    @@ -1231,7 +1797,7 @@

    Document TODOs