Workflow examples and use cases ... #279

Open
pvretano opened this issue Feb 18, 2022 · 71 comments

@pvretano
Contributor

Following on from issue #278, the purpose of this issue is to capture examples of workflows from the various approaches (OpenEO, OAPIP Part 3, etc.), compare them and see where there is commonality and where there are differences. The goal is to converge on some conformance classes for Part 3.

Be specific with the examples, provide code if you can, and try to make them not-too-long! ;)

@fmigneault
Contributor

fmigneault commented Feb 18, 2022

Using CWL, the Workflow process WorkflowStageCopyImages is deployed using this definition:
https://github.com/crim-ca/weaver/blob/master/tests/functional/application-packages/WorkflowStageCopyImages/deploy.json

It encapsulates 2 chained processes (steps), defined using the following deployments respectively:
https://github.com/crim-ca/weaver/blob/master/tests/functional/application-packages/DockerStageImages/deploy.json
https://github.com/crim-ca/weaver/blob/master/tests/functional/application-packages/DockerCopyImages/deploy.json

All 3 processes embed the CWL definition in their executionUnit[0].unit field.

Execution uses the following payload:
https://github.com/crim-ca/weaver/blob/master/tests/functional/application-packages/WorkflowStageCopyImages/execute.json

Once the execution is submitted, the workflow runs the process chain: the first process "generates an image" from the input string, and the second process does a simple pass-through of the file contents.

The chaining logic is all defined by CWL. Because of the in/out entries under steps in the workflow, it is possible to connect, parallelize, aggregate, etc. the I/O however we want, without any need for reprocessing when data sources are duplicated across intermediate steps.

OGC API - Processes itself has no actual knowledge of single Process vs Workflow chaining. The implementer can decide to parse the CWL and execute it as they see fit. From the external user's point of view, atomic and workflow processes are distinguishable only in terms of their inputs/outputs. If need be, intermediate processes can also be executed by themselves.
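
As a minimal sketch of that point (the server URL is hypothetical, and the placeholder payload only stands in for the linked execute.json), the Workflow process is described and executed exactly like an atomic one:

import requests

server = "https://weaver.example.com/ogcapi"  # hypothetical endpoint hosting the deployed processes

# The deployed Workflow process is described like any other (atomic) process
description = requests.get(f"{server}/processes/WorkflowStageCopyImages").json()
print(description["id"], description.get("inputs"))

# ... and executed through the standard execution endpoint; the real payload is the
# execute.json linked above, this dict is only a placeholder for it
placeholder_payload = {"inputs": {}, "outputs": {}}
response = requests.post(
    f"{server}/processes/WorkflowStageCopyImages/execution",
    json=placeholder_payload,
    headers={"Prefer": "respond-async"}
)
print(response.status_code, response.headers.get("Location"))  # job status URL when running async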

Side Notes

  1. Processes use deployment, but they could well be pre-deployed or built into the application. This is irrelevant for Part 3.
  2. The sample workflow uses (e.g. "run": "DockerStageImages") to refer to the chained processes. This can be replaced by a full URL to dispatch executions to distinct OGC API - Processes instances if need be. In this case, "run": "{processId}" assumes the "same instance".
  3. Definitions use the "old" OGC schemas where inputs/outputs were defined as lists of objects rather than the current <id>:definition mapping. This is not an issue; I just haven't converted the samples because our implementation supports both variants.

@jerstlouis
Member

jerstlouis commented Feb 21, 2022

Examples and use cases for OGC API - Processes - Part 3: Workflows & Chaining

(apologies for a complete failure at trying to make it not-too-long)

Scenario 1: Land Cover Classification (collection input / remote collections / collection output)

Say we have a server providing a vast collection of sentinel-2 data to which new scenes captured by the satellites get added continuously. That data is hypothetically available from an OGC API implementation deployed at https://esa.int/ogcapi/collections/sentinel2:level2A which supports a number of OGC API specifications, including Coverages, (coverage) Tiles, EDR and DGGS.

Say we have another server providing MODIS data at https://usgs.gov/ogcapi/collections/modis which has a lower spatial resolution, but higher temporal resolution.

Research center A has developed and trained a Machine Learning model able to classify land cover from MODIS and sentinel-2 data and published it as a Process in an OGC API - Processes implementation with support for Part 3 at https://research-alpha.org/ogcapi/processes/landcover. The process has some degree of flexibility which allows tweaking the results of the classification.

Research center B wants to experiment with land cover classification. Using their favorite OGC API client, they first discover the existence of the land cover classification process by searching for "land cover" keywords in a central OGC catalog of trusted OGC API deployments of certified implementations. The client fetches the process description, and from it can see what types of inputs are expected. Inputs are qualified with a geodata class, which allows the client to easily discover implementations able to supply the data it needs. From the same central OGC catalog, it discovers the MODIS and sentinel-2 data sources as perfect fits, and automatically generates a workflow execution request that looks like this (despite its simplicity, the user still does not need to see it):

{
   "process" : "https://research-alpha.org/ogcapi/processes/landcover",
   "inputs" : {
      "modis_data" : { "collection" : "https://usgs.gov/ogcapi/collections/modis" },
      "sentinel2_data" : { "collection" : "https://esa.int/ogcapi/collections/sentinel2:level2A" }
   }
}

Happy to first try the defaults, the researcher clicks OK. By default the process generates a land cover classification for one specific year.
This results in POSTing the execution request to the process execution end-point at https://research-alpha.org/ogcapi/processes/landcover/execution with a response=collection query parameter indicating that the client wishes to use the Workflows & Chaining collection output conformance class / execution mode. The response will be a collection description document including details such as the spatiotemporal extent, links to the access mechanisms and media types supported for the result.
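
As a (non-normative) sketch of that interaction, using the hypothetical URLs above:

import requests

execution_request = {
    "process": "https://research-alpha.org/ogcapi/processes/landcover",
    "inputs": {
        "modis_data": {"collection": "https://usgs.gov/ogcapi/collections/modis"},
        "sentinel2_data": {"collection": "https://esa.int/ogcapi/collections/sentinel2:level2A"}
    }
}

# POST with collection output: the response describes a (virtual) collection of results
response = requests.post(
    "https://research-alpha.org/ogcapi/processes/landcover/execution",
    params={"response": "collection"},
    json=execution_request
)
collection = response.json()
print(collection.get("extent", {}).get("spatial"))                # spatiotemporal extent
print([link.get("rel") for link in collection.get("links", [])])  # access mechanisms (tilesets, coverage, ...)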

When receiving the request, the first thing the Workflows implementation on research-alpha.org will do is validate those collection URLs as safe and retrieve their collection descriptions to verify that they are proper inputs. This includes parsing information about the spatiotemporal extent of the collections as well as the data access mechanisms (e.g. Coverages, Tiles, DGGS...) and supported media types. The server recognizes the inputs as valid (e.g. it sees that their geodata class is a match) and plans on using OGC API - Tiles to retrieve data from the server, since both data inputs advertise support for coverage tiles. Confident that it can accommodate the workflow being registered, the server responds to the request by generating a collection description document where the spatiotemporal extent spans the intersection of both inputs (e.g. 2016..last year for the whole Earth). The document also declares that the results can be requested either via OGC API - Coverages (with discrete categories), as OGC API - Features, or as OGC API - Tiles (either as coverage tiles or vector tiles).

The client works best with vector tiles (as it uses Vulkan or WebGL to render them client-side), and supports Mapbox Vector Tiles which is one of the media types declared as supported in the response. The response included a link to tilesets of the results of the workflow execution request as Mapbox Vector Tiles. The client selects a tileset using the GNOSISGlobalGrid TileMatrixSet which is suitable for EPSG:4326 / CRS:84 for the whole world (including polar regions). That tileset includes a templated link to trigger processing of a particular resolution and area and request the result for a specific tile: https://research-alpha.org/ogcapi/internal-workflows/600d-c0ffee/tiles/GNOSISGlobalGrid/{tileMatrix}/{tileRow}/{tileCol}.mvt.

The client now requests tiles for the current visualization scale and extent currently displayed on its virtual globe, by replacing the parameter variables with tile matrices, rows and columns. Since the collection also advertised a temporal extent with a yearly resolution and support for the OGC API - Tiles datetime conformance class, the client also specified that it is interested in last year with an additional datetime="2021-01-01T00:00:00Z" query parameter.
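
As a small sketch of how the client fills in that templated link (the tile matrix / row / column values below are arbitrary):

# A sketch of filling in the templated tile link from the collection response;
# the tile indices below are arbitrary.
template = ("https://research-alpha.org/ogcapi/internal-workflows/600d-c0ffee"
            "/tiles/GNOSISGlobalGrid/{tileMatrix}/{tileRow}/{tileCol}.mvt")

def tile_url(tile_matrix, tile_row, tile_col, datetime_value):
    url = template.format(tileMatrix=tile_matrix, tileRow=tile_row, tileCol=tile_col)
    return f"{url}?datetime={datetime_value}"

for row, col in [(2, 1), (2, 2), (3, 1)]:   # tiles covering the current view
    print(tile_url(4, row, col, "2021-01-01T00:00:00Z"))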

The research-alpha.org server receives the requests and starts distributing the work. First it needs to acquire the data from the source collections, so it sends requests to retrieve MODIS and sentinel-2 data tiles.

The sentinel-2 server supports a "filter" query parameter allowing data to be filtered by cloud cover, at both the scene metadata and the cell data value level, to create a cloud-free mosaic of multiple scenes, e.g. "filter=scene.cloud_cover < 50 AND cell.cloud_cover < 15". It also supports returning a flattened GeoTIFF when requesting a temporal interval, together with a "sortby" parameter ordering the cells so that those with the least cloud cover are preserved (on top): "sortby=cell.cloud_cover(desc)".

The trained model requires imagery from different times during the year, so the datetime query parameter is used to request monthly interval images from the sentinel-2 collection, with the least amount of cloud possible.

For the MODIS data, the server supports requesting tiles for a whole month with daily values (preserving the temporal dimension).
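
A sketch of the kind of upstream requests the landcover server might issue for one tile and one month; the tile endpoints, media types and exact parameter values are illustrative, based on the capabilities described above:

import requests

tiles = "tiles/GNOSISGlobalGrid"

def fetch_sentinel2_month(tile_matrix, tile_row, tile_col, year, month):
    # Cloud-filtered, flattened monthly GeoTIFF mosaic (filter / sortby support described above)
    url = f"https://esa.int/ogcapi/collections/sentinel2:level2A/{tiles}/{tile_matrix}/{tile_row}/{tile_col}"
    params = {
        # interval end simplified; a real client would use the month's actual last day
        "datetime": f"{year}-{month:02d}-01T00:00:00Z/{year}-{month:02d}-28T23:59:59Z",
        "filter": "scene.cloud_cover < 50 AND cell.cloud_cover < 15",
        "sortby": "cell.cloud_cover(desc)"
    }
    return requests.get(url, params=params, headers={"Accept": "image/tiff; application=geotiff"})

def fetch_modis_month(tile_matrix, tile_row, tile_col, year, month):
    # Whole month with daily values, preserving the temporal dimension, as netCDF
    url = f"https://usgs.gov/ogcapi/collections/modis/{tiles}/{tile_matrix}/{tile_row}/{tile_col}"
    params = {"datetime": f"{year}-{month:02d}-01T00:00:00Z/{year}-{month:02d}-28T23:59:59Z"}
    return requests.get(url, params=params, headers={"Accept": "application/netcdf"})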

The internal landcover executable behind the process takes as input 12 netCDF coverages of MODIS (with daily values) and 12 monthly GeoTIFF cloud-free sentinel-2 images with raw band values. It supports generating a classified discrete coverage or a multi-polygon feature collection as a result, with one feature per land cover category. It is invoked in parallel for each tile, and may be further accelerated using GPUs with several cores.

As soon as all the necessary input data is available to process one tile, the prediction for that tile is executed (using the model which persists in shared memory as long as it has been used recently). As soon as the prediction is complete for a tile, the result is returned.

Due to the parallel nature of the requests/processing, the small pieces of data being requested and processed, the use of GPU acceleration, and the use of efficient and well optimized technology, the client starts receiving the result tiles within 1 or 2 seconds. The client immediately starts displaying the results with a default style sheet and caches the resulting tiles.

Now the user starts zooming in on an area of interest. The lower resolution tiles are still displayed on the globe while waiting for more refined results to come in (requested for a more detailed zoom level / a smaller scale denominator). Soon those show up on the client display and the user starts seeing interesting classification results. If the user zooms back out, the lower-resolution / larger area results are still cached, so the user does not see a black screen.

The user notices that a classification looks off for a particular land cover category. The user goes back in the execution request / workflow editor and tweaks an input parameter that should correct the situation. The client POSTs a new execution request as a result, which results in a new collection response and a new link to generate tiles of the results. The client invalidates the currently cached tiles which no longer reflect this updated workflow. The server validates the workflow immediately because it still has active connections to the input collections used and does not need to validate them again. The new response comes back quickly and the client can display the result again, which looks good.

The landcover process server had cached responses from the previous MODIS and sentinel-2 requests, so it does not need to make those requests again. It simply needs to re-run the prediction model with the new parameters.

The user explores areas of interest at different resolutions and results keep coming in quickly. The user is satisfied with the results and now selects a large area to export at a detailed scale. A lot of the results required for this operation have already been cached during the exploration phase by the client and / or the landcover server. The "batch process" finishes quickly. The user is very happy with OGC API - Processes workflows after succeeding in producing a land cover map in 15 minutes, from discovery to the resulting map.

We demonstrated a similar scenario in the MOAW project using sentinel-2 data from EuroDataCube / SentinelHub. See the JSON process description.

Scenario 2: Custom map rendering (remote process / nested process)

As a slight twist to Scenario 1, the user wishes to render a map server-side using their own server (but it could just as easily be any server implementing a map rendering process) instead of rendering it client-side.

The server has a RenderMap process that takes in a list of layers as input. The result of the process is available either using OGC API - Maps or as map tiles using OGC API - Tiles, in a variety of CRSes and TileMatrixSets.

The discovery and selection of processes and inputs is very similar to Scenario 1, except this time the RenderMap process is the one to which the client POSTs the execution request. The landcover process becomes a nested process, its output being an input to the RenderMap process, and could be rendered on top of a sentinel-2 mosaic:

{
   "process" : "https://research-beta.org/ogcapi/processes/RenderMap",
   "inputs" : {
      "layers" : [
         {
            "collection" : "https://esa.int/ogcapi/collections/sentinel2:level2A",
            "ogcapiParameters" : {
               "filter" : "scene.cloud_cover < 50 and cell.cloud_cover < 15",
               "sortby": "cell.cloud_cover(desc)"
            }
         },
         {
            "process" : "https://research-alpha.org/ogcapi/processes/landcover",
            "inputs" : {
               "modis_data" : { "collection" : "https://usgs.gov/ogcapi/collections/modis" },
               "sentinel2_data" : { "collection" : "https://esa.int/ogcapi/collections/sentinel2:level2A" }
            }
         }
      ]
   }
}

The RenderMap process may also take in other input parameters, e.g. a style definition.

In a similar manner to Scenario 1, the client will receive a collection description document, this time with links to map tilesets and to a map available for the results. The client decides to trigger the processing and request results using OGC API - Maps, and builds a request specifying a World Mercator (EPSG:3395) CRS, a bounding box, a date & time, and a width for the result (the height is automatically calculated from the natural aspect ratio):

https://research-beta.org/ogcapi/internal-workflows/b357-c0ffee/map.png?crs=EPSG:3395&bbox=-80,40,-70,45&bbox-crs=OGC:CRS84&datetime=2021-01-01T00:00:00Z&width=8192.

Although the client is requesting a WorldMercator map, the RenderMap process implementation might still leverage vector tiles using the GNOSISGlobalGrid tile matrix set, and thus submit multiple requests to the landcover process server, acting in the same way as the client-side renderer in scenario 1.

See JSON process description for our implementation of such a process.

Scenario 3: Publishing the results of a workflow (virtual collections)

The researcher may now want to publish the map as a dedicated and persistent OGC API collection.
Through some "collections" transaction mechanism, the client may POST the workflow definition using a dedicated media type for Processes execution requests, and with proper authentication, e.g. to /collections, to create the collection.
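
A minimal sketch of such a transaction, assuming a hypothetical "application/ogcexec+json" media type for execution requests and a server that accepts them on its /collections endpoint:

import json
import requests

workflow = {
    "process": "https://research-beta.org/ogcapi/processes/RenderMap",
    "inputs": {
        "layers": [
            {"collection": "https://esa.int/ogcapi/collections/sentinel2:level2A"},
            {"process": "https://research-alpha.org/ogcapi/processes/landcover",
             "inputs": {
                 "modis_data": {"collection": "https://usgs.gov/ogcapi/collections/modis"},
                 "sentinel2_data": {"collection": "https://esa.int/ogcapi/collections/sentinel2:level2A"}}}
        ]
    }
}

response = requests.post(
    "https://research-beta.org/ogcapi/collections",
    data=json.dumps(workflow),
    headers={
        "Content-Type": "application/ogcexec+json",   # assumed, not a registered media type
        "Authorization": "Bearer <access-token>"
    }
)
print(response.status_code, response.headers.get("Location"))  # URI of the new virtual collection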

The server can execute the processing based on requests received for that collection, but would also cache results to optimize processing, bandwidth, memory and disk resources.

The collection description may also link to the workflow source, making it easy to reproduce and adapt to similar and derived uses.

As new data gets added to the source collections, caches expire and the virtual collection is always up to date. Rather than providers having to continuously run batch processes, using up a lot of resources for areas / resolutions of interest that will be mostly out of date before any client is interested in the data, they can instead prioritize resources for the latest requests and for the most important ones (e.g. disaster response). When it has free cycles, the server can also dedicate resources to pre-empting requests likely to follow the current request patterns. Such pre-emption could offset the latency in workflows with a larger number of hops.

This can also be done in the backend without users of the API being aware, but offering these explicit capabilities facilitates reproducibility and re-use.

Scenario 4: Backend workflow and EVI expression (nested process / deploy workflow)

For this scenario, let's assume the landcover process is itself a workflow that leverages other processes.
It could e.g. have been deployed using Processes - Part 2: Deploy, Replace, Undeploy by POSTing it to /processes using a dedicated media type for execution requests (implementations can potentially automatically determine inputs and their schemas by parsing the nested processes that are used as well as their inputs, and analyzing the "input" properties defined in the workflow, so uploading a process description is not absolutely necessary).
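
A sketch of that deployment step (the media type is an assumption, and the payload placeholder stands in for the landcover workflow shown below):

import json
import requests

# Placeholder standing in for the landcover workflow execution request shown below
landcover_workflow = {
    "process": "https://research-alpha.org/ogcapi/processes/randomForestPredict",
    "inputs": {}
}

response = requests.post(
    "https://research-alpha.org/ogcapi/processes",
    data=json.dumps(landcover_workflow),
    headers={
        "Content-Type": "application/ogcexec+json",   # assumed media type for execution requests
        "Authorization": "Bearer <access-token>"
    }
)
print(response.status_code, response.headers.get("Location"))  # e.g. .../processes/landcover once deployed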

In addition to the raw sentinel-2 bands, the classification algorithm might for example utilize a pre-computed vegetation index, and specify the filtering logic discussed earlier.

landcover process workflow:
  • inputs: modis_data, sentinel2_data
  • {datetime} refers to the OGC API datetime parameter used when triggering processing
  • coverage_processor creates a new coverage based on band expressions
  • randomForestPredict runs a random forest classification prediction based on a pre-trained model and input coverages

{
   "process" : "https://research-alpha.org/ogcapi/processes/randomForestPredict",
   "inputs" : {
      "trainedModel" : "https://research-alpha.org/ogcapi/models/sentinel2ModisLandCover",
      "data" :
      [
          { "$ref" : "#/components/monthlyInput", "{month}" :  1 },
          { "$ref" : "#/components/monthlyInput", "{month}" :  2 },
          { "$ref" : "#/components/monthlyInput", "{month}" :  3 },
          { "$ref" : "#/components/monthlyInput", "{month}" :  4 },
          { "$ref" : "#/components/monthlyInput", "{month}" :  5 },
          { "$ref" : "#/components/monthlyInput", "{month}" :  6 },
          { "$ref" : "#/components/monthlyInput", "{month}" :  7 },
          { "$ref" : "#/components/monthlyInput", "{month}" :  8 },
          { "$ref" : "#/components/monthlyInput", "{month}" :  9 },
          { "$ref" : "#/components/monthlyInput", "{month}" : 10 },
          { "$ref" : "#/components/monthlyInput", "{month}" : 11 },
          { "$ref" : "#/components/monthlyInput", "{month}" : 12 }
      ]
   },
   "components" :
   {
      "modis":
      {
         "input" : "modis_data",
         "format": { "mediaType": "application/netcdf" },
         "ogcapiParameters" : {
            "datetime" : { "year" : { "{datetime}.year" }, "month" : "{month}" }
         }
      },
      "sentinel2":
      {
         "input" : "sentinel2_data",
         "format": { "mediaType": "image/tiff; application=geotiff" },
         "ogcapiParameters" : {
            "filter" : "scene.cloud_cover < 50 and cell.cloud_cover < 15",
            "sortby": "cell.cloud_cover(desc)",
            "datetime" : { "year" : { "{datetime}.year" }, "month" : "{month}" }
         }
      },
      "monthlyInput":
      {
         { "$ref" : "#/components/modis" },
         { "$ref" : "#/components/sentinel2" },
         {
            "process" : "https://research-alpha.org/ogcapi/processes/coverage_processor",
            "inputs" : {
               "data" : { "$ref" : "#/components/sentinel2" },
               "fields" : { "evi" : "2.5 * (B08 - B04) / (1 + B08 + 6 * B04 + -7.5 * B02)" }
            }
         }
      ]
   }
}

Our implementation of the RFClassify process works in a similar way, but up until now it has been implemented as a single process integrating Python and Scikit-learn. This example introduces new capabilities that would make it easier to implement this as a workflow:

  • Re-usable components to avoid duplicating inputs / nested processes (see the expansion sketch after this list)
  • OGC API parameters allowing more specific control over how an OGC API collection is accessed (they would only apply when an OGC API collection is used as an input, and when those parameters are defined for the specific OGC API access mechanism negotiated)
  • The use of e.g. {datetime} to refer to aspects of how the processing is triggered by the OGC API data access mechanisms
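
To make the "components" idea above more concrete, here is a minimal, non-normative sketch of how a server might expand them: each {"$ref": "#/components/...", "{month}": m} reference is resolved by deep-copying the referenced component and substituting the bound variables.

import copy

def expand(node, components, variables):
    # Resolve {"$ref": "#/components/<name>", "<var>": value} references recursively,
    # substituting bound variables such as "{month}"; composite placeholders like
    # "{datetime}.year" are left for the server's data-access layer to resolve.
    if isinstance(node, dict):
        if "$ref" in node:
            name = node["$ref"].rsplit("/", 1)[-1]
            bound = {**variables, **{k: v for k, v in node.items() if k != "$ref"}}
            return expand(copy.deepcopy(components[name]), components, bound)
        return {key: expand(value, components, variables) for key, value in node.items()}
    if isinstance(node, list):
        return [expand(item, components, variables) for item in node]
    if isinstance(node, str) and node in variables:
        return variables[node]   # e.g. "{month}" -> 1
    return node

def expand_workflow(workflow):
    workflow = copy.deepcopy(workflow)
    components = workflow.pop("components", {})
    return expand(workflow, components, {})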

Scenario 5: Point cloud gridifier (landing page output)

In this scenario, a collection supporting point cloud requests (e.g. as .las using OGC API - Tiles) is provided as an input, and the process generates two outputs for it by gridifying the point cloud: ortho-rectified imagery and a DSM. In order to have access to both outputs, the client uses response=landingPage instead of requesting a collection description.

{
  "process" : "https://example.com/ogcapi/processes/PCGridify",
  "inputs" : {
     "data" : { "collection" : "https://example.com/ogcapi/collections/bigPointCloud" },
     "fillDistance" : 100,
     "classes" : [ "ground", "highVegetation" ]
  }
}

The response is an OGC API landing page, with two collections available (one for the ortho imagery and one for the DSM).
A client wishing to nest this workflow and use one specific output can specify which to include using the usual "outputs" property:

{
  "process" : "https://example.com/ogcapi/processes/RoutingEngine",
  "inputs" : {
     "dataset" : { "collection" : "https://example.com/ogcapi/collections/osm:roads" },
     "elevationModel" :
     {
        "process" : "https://example.com/ogcapi/processes/PCGridify",
        "inputs" : {
           "data" : { "collection" : "https://example.com/ogcapi/collections/bigPointCloud" },
           "fillDistance" : 100,
           "classes" : [ "roads" ]
        },
        "outputs" : { "dsm" : { } }
     },
     "preference" : "shortest",
     "mode" : "pedestrian",
     "waypoints" : { "value" : {
        "type" : "MultiPoint",
        "coordinates" : [
           [ -71.20290940, 46.81266578 ],
           [ -71.20735275, 46.80701663 ]
        ]
        }
     }
   }
}
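
Putting this scenario together on the client side, a minimal sketch using the hypothetical URLs above:

import requests

pcgridify_request = {
    "process": "https://example.com/ogcapi/processes/PCGridify",
    "inputs": {
        "data": {"collection": "https://example.com/ogcapi/collections/bigPointCloud"},
        "fillDistance": 100,
        "classes": ["ground", "highVegetation"]
    }
}

# Request a landing page instead of a single collection, since the process has two outputs
response = requests.post(
    "https://example.com/ogcapi/processes/PCGridify/execution",
    params={"response": "landingPage"},
    json=pcgridify_request
)
landing_page = response.json()
# The landing page links to the result collections (ortho imagery and DSM)
print([link["href"] for link in landing_page.get("links", []) if link.get("rel") == "data"])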

An interesting extension of this use case is to generate the point cloud from a photogrammetry process using a collection of oblique imagery at one end, and to use the process in another workflow doing classification / segmentation and conversion into a mesh, which a client can trigger by requesting 3D content using OGC API - GeoVolumes.

See JSON process descriptions for the Point cloud gridifier and the Routing engine in our implementation of such processes.

Scenario 6: Fewer round-trips (immediate access)

As a way to reduce the number of round-trips, the ability to submit workflows to other end-points has been considered.
e.g. in Scenario 1, the client could submit the execution request to /processes/landcover/tiles/GNOSISGlobalGrid instead of to /processes/landcover/execution to immediately receive a vector tileset of the result (which will already contain the templated URL for the resulting tiles), instead of having to follow links from the returned collection description -> list of vector tilesets -> tileset.
This is also useful for demonstrating Workflows in action by posting a workflow execution request directly to a .../map or .../map/tiles/{tileMatrixSetId}/{tileMatrix}/{tileRow}/{tileCol} resource and receiving the result directly (e.g. in Swagger UI).

For a live example of this capability, try POSTing the following execution request to the following end-points:

  • https://maps.ecere.com/ogcapi/processes/RenderMap/map
  • https://maps.ecere.com/ogcapi/processes/RenderMap/map/tiles/GNOSISGlobalGrid
  • https://maps.ecere.com/ogcapi/processes/RenderMap/map/tiles/GNOSISGlobalGrid/0/0/0
{
  "process": "https://maps.ecere.com/ogcapi/processes/RenderMap",
  "inputs": {
    "layers": [
      { "collection": "https://maps.ecere.com/ogcapi/collections/SRTM_ViewFinderPanorama" }
    ]
  }
}
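
For example, a small sketch of trying the first of those end-points programmatically (the request body is the one shown above):

import requests

body = {
    "process": "https://maps.ecere.com/ogcapi/processes/RenderMap",
    "inputs": {
        "layers": [
            {"collection": "https://maps.ecere.com/ogcapi/collections/SRTM_ViewFinderPanorama"}
        ]
    }
}

# POSTing directly to the .../map resource returns the rendered map without extra round-trips
response = requests.post(
    "https://maps.ecere.com/ogcapi/processes/RenderMap/map",
    json=body,
    headers={"Accept": "image/png"}
)
if response.ok:
    with open("rendered_map.png", "wb") as f:
        f.write(response.content)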

More examples (in Annex B) and additional details in draft MOAW discussion paper currently @ https://maps.ecere.com/moaw/DiscussionPaper-Draft3.pdf.

@m-mohr

m-mohr commented Feb 21, 2022

(Sorry, I forgot to work on examples and was just reminded once Peter opened the issue. As such my contribution is rather short and a bit incomplete for now.)

It seems there are multiple different base "use cases":

  1. Simply retrieve some data
  2. Process data by providing "low-level" processing instructions (e.g. band math + temporal mean + linear stretching)
  3. Process data just by providing some "high-level" processing instructions (e.g. a landcover process as described above)
  4. Provide a complete processing environment + processing instructions (e.g. a Docker container, I think you all call them application packages?)

All this may also include:

A. Publishing results
B. Downloading/Accessing results
C. Interchange results across back-ends

In openEO, the focus is on 3, while it seems the previous posts here focus more on the other parts. This is all not mutually exclusive though. So here is how you can achieve the use cases above in openEO:

Use Case 1 (data retrieval)

You send a load_collection + save_result process graph to the back-end and store the data in the format you wish to get.

[image: visual process graph with load_collection and save_result nodes]

Depending on the execution mode you may get different results:

A. You can publish the data using web services, e.g. WMTS, using openEO's "secondary web service" API.
B. You can download a single file using synchronous processing or create a STAC catalog with your requested data using batch processing.
C. Similarly, you'd create a batch job and then you could load the result from another back-end (using load_result). This can be automated in code, but doesn't happen automagically yet.
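
As a minimal sketch of Use Case 1 with the openEO Python client (back-end URL, collection id and extents are placeholders):

import openeo

# Connect to an openEO back-end (placeholder URL); authentication omitted for brevity
connection = openeo.connect("https://openeo.example.org")

# load_collection: pick a collection and a spatio-temporal subset
cube = connection.load_collection(
    "COPERNICUS/S2",
    spatial_extent={"west": 16.06, "south": 48.06, "east": 16.65, "north": 48.35},
    temporal_extent=["2018-01-01", "2018-01-31"],
    bands=["B8", "B4", "B2"]
)

# save_result: choose the output format, then run as a batch job (option B above)
result = cube.save_result(format="GTiff")
job = result.create_job(title="data retrieval example")
job.start_and_wait()
job.get_results().download_files("results/")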

Use Case 2 (low-level processing instructions)

That's the main goal of openEO and that's where it probably shines most. A substantial amount of work has led to a list of pre-defined processes that can be used for data cube operations, math, etc. See https://processes.openeo.org for a list of processes. These can easily be chained (in a process graph) into a "high-level" process; we call these user-defined processes.

The EVI example mentioned above looks like this in "visual mode" (child process graphs not shown):

[image: EVI process graph in visual mode]

(Please note the code below is auto-generated from the Editor that is used for the visual mode above. As such the code may not be exactly what an experienced user would write.)

This is the corresponding code from Python:

# Loading the data; The order of the specified bands is important for the following reduce operation.
dc = connection.load_collection(collection_id = "COPERNICUS/S2", spatial_extent = {"west": 16.06, "south": 48.06, "east": 16.65, "north": 48.35}, temporal_extent = ["2018-01-01T00:00:00Z", "2018-01-31T23:59:59Z"], bands = ["B8", "B4", "B2"])

# Compute the EVI
B02 = dc.band("B2")
B04 = dc.band("B4")
B08 = dc.band("B8")
evi = (2.5 * (B08 - B04)) / ((B08 + 6.0 * B04 - 7.5 * B02) + 1.0)

# Compute a minimum time composite by reducing the temporal dimension
mintime = evi.reduce_dimension(reducer = "min", dimension = "t")

def fn1(x, context = None):
    datacube2 = process("linear_scale_range", x = x, inputMin = -1, inputMax = 1, outputMax = 255)
    return datacube2

# Stretch range from -1 / 1 to 0 / 255 for PNG visualization.
datacube1 = mintime.apply(process = fn1)
save = datacube1.save_result(format = "GTIFF")

# The process can be executed synchronously (see below), as batch job or as web service now
result = connection.execute(save)

This is the corresponding code in R:

p = processes()

# Loading the data; The order of the specified bands is important for the following reduce operation.
dc = p$load_collection(id = "COPERNICUS/S2", spatial_extent = list("west" = 16.06, "south" = 48.06, "east" = 16.65, "north" = 48.35), temporal_extent = list("2018-01-01T00:00:00Z", "2018-01-31T23:59:59Z"), bands = list("B8", "B4", "B2"))

# Compute the EVI
evi_ <- function(x, context) {
  b2 <- x[1]
  b4 <- x[2]
  b8 <- x[3]
  return((2.5 * (b8 - b4)) / ((b8 + 6 * b4 - 7.5 * b2) + 1))
}

# reduce_dimension bands with the defined formula
evi <- p$reduce_dimension(data = dc, reducer = evi_, dimension = "bands")

mintime = function(data, context = NULL) {
	return(p$min(data = data))
}
# Compute a minimum time composite by reducing the temporal dimension
mintime = p$reduce_dimension(data = evi, reducer = mintime, dimension = "t")

fn1 = function(x, context = NULL) {
	datacube2 = p$linear_scale_range(x = x, inputMin = -1, inputMax = 1, outputMax = 255)
	return(datacube2)
}
# Stretch range from -1 / 1 to 0 / 255 for PNG visualization.
datacube1 = p$apply(data = mintime, process = fn1)
save = p$save_result(data = datacube1, format = "GTIFF")

# The process can be executed synchronously (see below), as batch job or as web service now
result = compute_result(graph = save)

This is the corresponding code in JS:

let builder = await connection.buildProcess();

// Loading the data; The order of the specified bands is important for the following reduce operation.
let dc = builder.load_collection("COPERNICUS/S2", {"west": 16.06, "south": 48.06, "east": 16.65, "north": 48.35}, ["2018-01-01T00:00:00Z", "2018-01-31T23:59:59Z"], ["B8", "B4", "B2"]);

// Compute the EVI.
let evi = builder.reduce_dimension(dc, new Formula("2.5*(($B8-$B4)/(1+$B8+6*$B4+(-7.5)*$B2))"), "bands");

let minReducer = function(data, context = null) {
	let min = this.min(data);
	return min;
}
// Compute a minimum time composite by reducing the temporal dimension
let mintime = builder.reduce_dimension(evi, minReducer, "t");

// Stretch range from -1 / 1 to 0 / 255 for PNG visualization.
let datacube1 = builder.apply(mintime, new Formula("linear_scale_range(x, -1, 1, 0, 255)"));
let save = builder.save_result(datacube1, "GTIFF");

// The process can be executed synchronously (see below), as batch job or as web service now
let result = await connection.computeResult(save);

And this is how it looks in JSON as a process (graph):

{
  "process_graph": {
    "1": {
      "process_id": "apply",
      "arguments": {
        "data": {
          "from_node": "mintime"
        },
        "process": {
          "process_graph": {
            "2": {
              "process_id": "linear_scale_range",
              "arguments": {
                "x": {
                  "from_parameter": "x"
                },
                "inputMin": -1,
                "inputMax": 1,
                "outputMax": 255
              },
              "result": true
            }
          }
        }
      },
      "description": "Stretch range from -1 / 1 to 0 / 255 for PNG visualization."
    },
    "dc": {
      "process_id": "load_collection",
      "arguments": {
        "id": "COPERNICUS/S2",
        "spatial_extent": {
          "west": 16.06,
          "south": 48.06,
          "east": 16.65,
          "north": 48.35
        },
        "temporal_extent": [
          "2018-01-01T00:00:00Z",
          "2018-01-31T23:59:59Z"
        ],
        "bands": [
          "B8",
          "B4",
          "B2"
        ]
      },
      "description": "Loading the data; The order of the specified bands is important for the following reduce operation."
    },
    "evi": {
      "process_id": "reduce_dimension",
      "arguments": {
        "data": {
          "from_node": "dc"
        },
        "reducer": {
          "process_graph": {
            "nir": {
              "process_id": "array_element",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                },
                "index": 0
              }
            },
            "sub": {
              "process_id": "subtract",
              "arguments": {
                "x": {
                  "from_node": "nir"
                },
                "y": {
                  "from_node": "red"
                }
              }
            },
            "div": {
              "process_id": "divide",
              "arguments": {
                "x": {
                  "from_node": "sub"
                },
                "y": {
                  "from_node": "sum"
                }
              }
            },
            "p3": {
              "process_id": "multiply",
              "arguments": {
                "x": 2.5,
                "y": {
                  "from_node": "div"
                }
              },
              "result": true
            },
            "sum": {
              "process_id": "sum",
              "arguments": {
                "data": [
                  1,
                  {
                    "from_node": "nir"
                  },
                  {
                    "from_node": "p1"
                  },
                  {
                    "from_node": "p2"
                  }
                ]
              }
            },
            "red": {
              "process_id": "array_element",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                },
                "index": 1
              }
            },
            "p1": {
              "process_id": "multiply",
              "arguments": {
                "x": 6,
                "y": {
                  "from_node": "red"
                }
              }
            },
            "blue": {
              "process_id": "array_element",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                },
                "index": 2
              }
            },
            "p2": {
              "process_id": "multiply",
              "arguments": {
                "x": -7.5,
                "y": {
                  "from_node": "blue"
                }
              }
            }
          }
        },
        "dimension": "bands"
      },
      "description": "Compute the EVI. Formula: 2.5 * (NIR - RED) / (1 + NIR + 6*RED + -7.5*BLUE)"
    },
    "mintime": {
      "process_id": "reduce_dimension",
      "arguments": {
        "data": {
          "from_node": "evi"
        },
        "reducer": {
          "process_graph": {
            "min": {
              "process_id": "min",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                }
              },
              "result": true
            }
          }
        },
        "dimension": "t"
      },
      "description": "Compute a minimum time composite by reducing the temporal dimension"
    },
    "save": {
      "process_id": "save_result",
      "arguments": {
        "data": {
          "from_node": "1"
        },
        "format": "GTIFF"
      },
      "result": true
    }
  }
}

For details about our data cubes and related processes: https://openeo.org/documentation/1.0/datacubes.html
For details about common smaller "use cases" see the openEO Cookbook: https://openeo.org/documentation/1.0/cookbook/

Use Case 3 (high-level processing instructions)

Any process that you define can also be stored as a high-level process that others can execute and re-use. So the EVI process above could simply be stored and then executed with a single process call. Then your process is as simple as:

[image: visual mode showing a single evi process call]

Which in the three programming languages looks as such:

# Python
datacube = connection.datacube_from_process("evi")
result = connection.execute(datacube)
# R
p = processes()
result = compute_result(graph = p$evi())
// JavaScript
let builder = await connection.buildProcess();
let result = await connection.computeResult(builder.evi());

and in JSON:

{
  "id": "evi",
  "process_graph": {
    "1": {
      "process_id": "evi",
      "arguments": {},
      "result": true
    }
  }
}

This is simplified though; you'd probably want to define parameters (e.g. collection id or extents) and pass them later.
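
For instance, a hedged sketch of calling a stored "evi" process that declares parameters (the back-end URL and parameter names are assumptions for this sketch):

import openeo

connection = openeo.connect("https://openeo.example.org")  # placeholder back-end

# Call the stored user-defined process, passing values for its declared parameters
# ("collection_id" and "spatial_extent" are assumed parameter names)
datacube = connection.datacube_from_process(
    "evi",
    collection_id="COPERNICUS/S2",
    spatial_extent={"west": 16.06, "south": 48.06, "east": 16.65, "north": 48.35}
)
result = connection.execute(datacube)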

Use Case 4 (processing environments)

We only cater partially for this. Right now, back-ends can provide certain pre-configured environments to run user code (so-called UDFs). This is currently implemented for Python and R and the environments usually differ by the software and libraries installed. Then you would send your code using run_udf as part of an openEO process graph.

We could extend the openEO API relatively easily in a way that users could push their own environments to the servers, but ultimately this was never the goal of openEO and as such could be covered by another standard.

What I haven't captured yet

  • Execution of a UDF
  • Parametrization of a user-defined process
  • Creation of batch jobs and secondary web services
  • Loading from external sources
  • probably more?

@m-mohr

m-mohr commented Feb 21, 2022

Sorry, I had the meeting in my calendar for 16:00 CET for whatever reason and thus only heard like the last minutes of the call. Did you conclude on something? Otherwise, happy to join the next telco again.

@pvretano
Contributor Author

pvretano commented Feb 21, 2022

@m-mohr nope ... no conclusions yet. @jerstlouis and @fmigneault presented their examples so it would be good if at the next meeting you could present your examples. One outcome of today's meeting was that @fmigneault will try to cast one of @jerstlouis examples in CWL. There will also be a recording of today's meeting available if you want to listen to the meeting. @bpross-52n can you post the recording somewhere when it is available?

@fmigneault
Contributor

fmigneault commented Feb 21, 2022

Following is the conversion exercise for the Scenario 5 example (RoutingEngine) provided by @jerstlouis.

The first process is PCGridify. It takes all the inputs that were in the nested process from the Scenario 5 example, and produces a DSM file from the input point cloud.

{
    "processDescription": {
        "id": "PCGridify",
        "version": "0.0.1",
        "inputs": {
            "data": {
                "title": "Feature Collection of Point Cloud to gridify",
                "schema": {
                    "type": "object",
                    "properties": {
                        "collection": {
                            "type": "string",
                            "format": "url"
                        }
                    }
                }
            },
            "fillDistance": {
                "schema": {
                    "type": "integer"
                }
            },
            "classes": {
                "schema": {
                    "type": "array",
                    "items": "string"
                }
            }
        },
        "outputs": {
            "dsm": {
                "schema": {
                    "type": "object",
                    "additionalProperties": {}
                }
            }
        }
    },
    "executionUnit": [
        {
            "unit": {
                "cwlVersion": "v1.0",
                "class": "CommandLineTool",
                "baseCommand": ["PCGridify"],
                "arguments": ["-t", "$(runtime.outdir)"],
                "requirements": {
                    "DockerRequirement": {
                        "dockerPull": "example/PCGridify"
                    }
                },
                "inputs": {
                    "data": {
                        "type": "File",
                        "format": "iana:application/json",
                        "inputBinding": {
                            "position": 1
                        }
                    },
                    "fillDistance": {
                        "type": "float",
                        "inputBinding": {
                            "position": 2
                        }
                    },
                    "fillDistance": {
                        "type": "array",
                        "items": "string",
                        "inputBinding": {
                            "position": 3
                        }
                    }
                },
                "outputs": {
                    "dsm": {
                        "type": "File",
                        "outputBinding": {
                            "glob": "*.dsm"
                        }
                    }
                },
                "$namespaces": {
                    "iana": "https://www.iana.org/assignments/media-types/"
                }
            }
        }
    ],
    "deploymentProfileName": "http://www.opengis.net/profiles/eoc/dockerizedApplication"
}

The second process is RouteProcessor. It takes the OpenStreetMap feature collection and "some preprocessed DSM" to generate the estimated route in a plain text file.

{
    "processDescription": {
        "id": "RouteProcessor",
        "version": "0.0.1",
        "inputs": {
            "dataset": {
                "title": "Collection of osm:roads"
                "schema": {
                    "type": "object",
                    "properties": {
                        "collection": {
                            "type": "string",
                            "format": "url"
                        }
                    }
                }
            },
            "elevationModel": {
                "title": "DSM file reference",
                "schema": {
                    "type": "string",
                    "format": "url"
                }
            },
            "preference": {
                "schema": {
                    "type": "string"
                }
            },
            "mode": {
                "schema": {
                    "type": "string"
                }
            },
            "waypoints": {
                "schema": {
                    "type": "object",
                    "required": [
                        "type", 
                        "coordinates"
                    ],
                    "properties": [
                        "type": {
                            "type": "string"
                        }
                        "coordinates": {
                            "type": "array",
                            "items": {
                                "type": "array",
                                "items": "float"
                            }
                        }
                    ]
                }
            }
        },
        "outputs": {
            "route": {
                "format": {
                    "mediaType": "text/plain"
                },
                "schema": {
                    "type": "string",
                    "format": "url"
                }
            }
        }
    },
    "executionUnit": [
        {
            "unit": {
                "cwlVersion": "v1.0",
                "class": "CommandLineTool",
                "baseCommand": ["RoutingEngine"],
                "arguments": ["-t", "$(runtime.outdir)"],
                "requirements": {
                    "DockerRequirement": {
                        "dockerPull": "example/RoutingEngine"
                    }
                },
                "inputs": {
                    "dataset": {
                        "type": "File",
                        "format": "iana:application/json",
                        "inputBinding": {
                            "position": 1
                        }
                    },
                    "elevationModel": {
                        "type": "File",
                        "inputBinding": {
                            "position": 2
                        }
                    },
                    "waypoints": {
                        "doc": "Feature Collection",
                        "type": "File",
                        "format": "iana:application/json",
                        "inputBinding": {
                            "position": 3
                        }
                    },
                    "preference": {
                        "type": "string",
                        "inputBinding": {
                            "prefix": "-P"
                        }
                    },
                    "mode": {
                        "type": "string",
                        "inputBinding": {
                            "prefix": "-M"
                        }
                    }
                },
                "outputs": {
                    "route": {
                        "type": "File",
                        "format": "iana:text/plain",
                        "outputBinding": {
                            "glob": "*.txt"
                        }
                    }
                },
                "$namespaces": {
                    "iana": "https://www.iana.org/assignments/media-types/"
                }
            }
        }
    ],
    "deploymentProfileName": "http://www.opengis.net/profiles/eoc/dockerizedApplication"
}

Finally, the RoutingEngine Workflow is defined as follows. It only takes the point cloud and the routing data as input.
All other intermediate parameters are "hidden away" from the external user using predefined {"default": <value>} entries for this workflow implementation.

In the steps section, I used different names for the Workflow-level vs Application-level inputs to better illustrate how the I/O chaining relationship is accomplished by CWL.

{
    "processDescription": {
        "id": "RoutingEngine",
        "version": "0.0.1",
        "inputs": {
            "point_cloud": {
                "title": "Feature Collection of Point Cloud to gridify",
                "schema": {
                    "type": "object",
                    "properties": {
                        "collection": {
                            "type": "string",
                            "format": "url"
                        }
                    }
                }
            },
            "roads_data": {
                "tite": "Collection of osm:roads",
                "schema": {
                    "type": "object",
                    "properties": {
                        "collection": {
                            "type": "string",
                            "format": "url"
                        }
                    }
                }
            },
            "routing_mode": {
                "schema": {
                    "type": "string"
                }
            }
        },
        "outputs": {
            "estimated_route": {
                "format": {
                    "mediaType": "text/plain"
                },
                "schema": {
                    "type": "string",
                    "format": "url"
                }
            }
        }
    },
    "executionUnit": [
        {
            "unit": {
                "cwlVersion": "v1.0",
                "class": "Workflow",
                "inputs": {
                    "point_cloud": {
                        "doc": "Point cloud that will be gridified",
                        "type": "File"
                    },
                    "roads_data": {
                        "doc": "Feature collection of osm:roads",
                        "type": "File"
                    },
                    "routing_mode": {
                        "schema": {
                            "type": "string", 
                            "enum": [
                                "pedestrian",
                                "car"
                            ]
                        }
                    }
                },
                "outputs": {
                    "estimated_route": {
                        "type": "File",
                        "outputSource": "routing/route"
                    }
                },
                "steps": {
                    "gridify": {
                        "run": "PCGridify",
                        "in": {
                            "data": "point_cloud",
                            "classes": { "default": [ "roads" ] },
                            "fillDistance": { "default": 100 }
                        },
                        "out": [
                            "dsm"
                        ]
                    },
                    "routing": {
                        "run": "RouteProcessor",
                        "in": {
                            "dataset": "roads_data",
                            "elevationModel": "gridify/dsm",
                            "preference": { "default": "shortest"},
                            "mode": "routing_mode"
                        },
                        "out": [
                            "route"
                        ]
                    }
                }
            }
        }
    ],
    "deploymentProfileName": "http://www.opengis.net/profiles/eoc/workflow"
}

I would like to add that these examples are extremely verbose on purpose, simply to demonstrate the complete chaining capabilities. There is no ambiguity whatsoever about how to chain elements, no matter the number of processes and I/O involved in the complete workflow.

At least half of all those definitions could be automatically generated, as we can see there is a lot of repetition between CWL's type definitions and the I/O schemas from the OGC API - Processes definitions. The https://github.com/crim-ca/weaver implementation actually allows inferring OGC API - Processes I/O definitions from CWL using those similarities, and I almost never need to provide any explicit I/O for the OGC API - Processes portion of the payloads.
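
For completeness, a sketch of how these definitions could be deployed through Part 2 and the resulting RoutingEngine executed; the server URL is hypothetical, the three JSON documents above are assumed to be saved locally, and the execution payload is illustrative:

import json
import requests

server = "https://weaver.example.com/ogcapi"  # hypothetical endpoint supporting Part 2 deployment

# Deploy the two applications and the workflow chaining them
# (the three JSON documents above, assumed saved locally)
for path in ("PCGridify.json", "RouteProcessor.json", "RoutingEngine.json"):
    with open(path) as f:
        requests.post(f"{server}/processes", json=json.load(f))

# Execute the deployed workflow like any other process; only the workflow-level inputs are exposed
execute_body = {
    "inputs": {
        "point_cloud": {"collection": "https://example.com/ogcapi/collections/bigPointCloud"},
        "roads_data": {"collection": "https://example.com/ogcapi/collections/osm:roads"},
        "routing_mode": "pedestrian"
    },
    "outputs": {"estimated_route": {"transmissionMode": "reference"}}
}
job = requests.post(f"{server}/processes/RoutingEngine/execution", json=execute_body)
print(job.status_code, job.headers.get("Location"))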

@p3dr0
Member

p3dr0 commented Feb 21, 2022

As mentioned during the telco, here is an example of the fan-out application design pattern using the CWL ScatterFeatureRequirement.

Scatter Crop Application Example

Example of a CWL document that scatters the processing over an array of input values.

cwlVersion: v1.0
$graph:
- class: Workflow
  label: Sentinel-2 product crop
  doc: This application crops bands from a Sentinel-2 product
  id: s2-cropper

  requirements:
  - class: ScatterFeatureRequirement

  inputs:
    product:
      type: Directory
      label: Sentinel-2 input
      doc: Sentinel-2 Level-1C or Level-2A input reference
    bands:
      type: string[]
      label: Sentinel-2 bands
      doc: Sentinel-2 list of bands to crop
    bbox:
      type: string
      label: bounding box
      doc: Area of interest expressed as a bounding box
    proj:
      type: string
      label: EPSG code
      doc: Projection EPSG code for the bounding box
      default: "EPSG:4326"

  outputs:
    results:
      outputSource:
      - node_crop/cropped_tif
      type: Directory[]

  steps:

    node_crop:

      run: "#crop-cl"

      in:
        product: product
        band: bands
        bbox: bbox
        epsg: proj

      out:
        - cropped_tif

      scatter: band
      scatterMethod: dotproduct

- class: CommandLineTool

  id: crop-cl

  requirements:
    DockerRequirement:
      dockerPull: docker.io/terradue/crop-container

  baseCommand: crop
  arguments: []

  inputs:
    product:
      type: Directory
      inputBinding:
        position: 1
    band:
      type: string
      inputBinding:
        position: 2
    bbox:
      type: string
      inputBinding:
        position: 3
    epsg:
      type: string
      inputBinding:
        position: 4

  outputs:
    cropped_tif:
      outputBinding:
        glob: .
      type: Directory

$namespaces:
  s: https://schema.org/
s:softwareVersion: 1.0.0
schemas:
- http://schema.org/version/9.0/schemaorg-current-http.rdf
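
For readers less familiar with CWL, the fan-out performed by ScatterFeatureRequirement in the s2-cropper workflow is conceptually equivalent to this small sketch (run_crop is only a stand-in for launching the containerized crop tool; all values are placeholders):

# Conceptual (non-CWL) illustration of the scatter step in the s2-cropper workflow above:
# the "band" input is fanned out so the crop tool runs once per band, and the per-band
# outputs are collected into an array.

def run_crop(product: str, band: str, bbox: str, epsg: str) -> str:
    # Stand-in for launching the containerized "crop" command (docker.io/terradue/crop-container)
    return f"{product}_cropped_{band}.tif"

def s2_cropper(product: str, bands: list, bbox: str, proj: str = "EPSG:4326") -> list:
    # scatter: band / scatterMethod: dotproduct -> one invocation per element of "bands"
    return [run_crop(product, band, bbox, proj) for band in bands]

print(s2_cropper("S2-product-directory", ["B02", "B03", "B04"], "136.659,-35.96,136.923,-35.791"))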

Composite two-step Workflow Example

This section extends the previous example with an Application Package that is a two-step workflow that crops (using scatter over the bands) and creates a composite image.

cwlVersion: v1.0
$graph:
- class: Workflow
  label: Sentinel-2 RGB composite
  doc: This application generates a Sentinel-2 RGB composite over an area of interest
  id: s2-compositer
  requirements:
  - class: ScatterFeatureRequirement
  - class: InlineJavascriptRequirement
  - class: MultipleInputFeatureRequirement
  inputs:
    product:
      type: Directory
      label: Sentinel-2 input
      doc: Sentinel-2 Level-1C or Level-2A input reference
    red:
      type: string
      label: red channel
      doc: Sentinel-2 band for red channel
    green:
      type: string
      label: green channel
      doc: Sentinel-2 band for green channel
    blue:
      type: string
      label: blue channel
      doc: Sentinel-2 band for blue channel
    bbox:
      type: string
      label: bounding box
      doc: Area of interest expressed as a bounding box
    proj:
      type: string
      label: EPSG code
      doc: Projection EPSG code for the bounding box coordinates
      default: "EPSG:4326"
  outputs:
    results:
      outputSource:
      - node_composite/rgb_composite
      type: Directory
  steps:
    node_crop:
      run: "#crop-cl"
      in:
        product: product
        band: [red, green, blue]
        bbox: bbox
        epsg: proj
      out:
        - cropped_tif
      scatter: band
      scatterMethod: dotproduct
    node_composite:
      run: "#composite-cl"
      in:
        tifs:
          source:  node_crop/cropped_tif
        lineage: product
      out:
        - rgb_composite

- class: CommandLineTool
  id: crop-cl
  requirements:
    DockerRequirement:
      dockerPull: docker.io/terradue/crop-container
  baseCommand: crop
  arguments: []
  inputs:
    product:
      type: Directory
      inputBinding:
        position: 1
    band:
      type: string
      inputBinding:
        position: 2
    bbox:
      type: string
      inputBinding:
        position: 3
    epsg:
      type: string
      inputBinding:
        position: 4
  outputs:
    cropped_tif:
      outputBinding:
        glob: '*.tif'
      type: File

- class: CommandLineTool
  id: composite-cl
  requirements:
    DockerRequirement:
      dockerPull: docker.io/terradue/composite-container
    InlineJavascriptRequirement: {}
  baseCommand: composite
  arguments:
  - $( inputs.tifs[0].path )
  - $( inputs.tifs[1].path )
  - $( inputs.tifs[2].path )
  inputs:
    tifs:
      type: File[]
    lineage:
      type: Directory
      inputBinding:
        position: 4
  outputs:
    rgb_composite:
      outputBinding:
        glob: .
      type: Directory

$namespaces:
  s: https://schema.org/
s:softwareVersion: 1.0.0
schemas:
- http://schema.org/version/9.0/schemaorg-current-http.rdf

Please check OGC 20-089 sections 8.5 (Application Pattern) and 8.6 (Extended Workflows) for more information about these examples.

@ghobona
Contributor

ghobona commented Feb 21, 2022

A 2008 paper listing some workflow languages used in e-Science is at https://www.dcc.ac.uk/guidance/briefing-papers/standards-watch-papers/workflow-standards-e-science

Note that BPMN is also an ISO standard.

Some related engineering reports:

I'm not suggesting that OGC API - Processes - Part 3 should use BPMN instead of CWL. I am pointing out that there is a case for supporting multiple workflow languages, if possible.

@jerstlouis
Member

jerstlouis commented Feb 21, 2022

@fmigneault

The https://github.com/crim-ca/weaver implementation actually allows inferring OGC API - Processes I/O definitions from CWL using those similarities

This is related to what I was suggesting in Scenario 3 above:

implementations can potentially automatically determine inputs and their schemas by parsing the nested processes that are used as well as their inputs, and analyzing the "input" properties defined in the workflow, so uploading a process description is not absolutely necessary

I think it could be possible when creating a process from a workflow (through Part 2: Deploy, Replace, Undeploy) to infer the process description, and then potentially add additional metadata (e.g. a title, input descriptions that cannot be inferred, etc.). In this case, the media type of the payload would be a media type specific to the workflow language... (e.g. CWL, OpenEO, execution request extended with the capabilities I initially proposed for Part 3, i.e. OGC API collections and nested process execution request, and those identified by @ghobona ).

This could be done e.g. by separating out the processDescription from the executionUnit (similar to OGC API - Styles, where we first POST a style and then add metadata to it, with the content-type of the stylesheet being exactly e.g. SLD/SE or MapboxGL Style) -- in this case the media type could be CWL directly. Another example of an execution unit media type could be a Jupyter notebook.
This all works in the context of Part 2 - Deploy, Replace, Undeploy, but technically using different workflow languages / chaining could also potentially be supported directly at /execution, so that a new process does not need to first be "deployed" but could be executed ad-hoc as I suggested in Part 3.

@fmigneault
Copy link
Contributor

@jerstlouis
I agree that workflows could technically be generated on the fly with direct POST on /execution, but I personally don't like this approach too much if the process description is also generated "just-in-time" from different combinations of execution media-type/workflows.

I think it would be a major pain point for OGC API - Processes interoperability, because there would basically be no way to replicate executions since we are not even sure which process description gets executed. Each implementation could parse the contents in a completely different manner and generate different process descriptions. This works against the purpose of the standard being developed, in my opinion. The advantage of deployment, although it needs extra steps, is that at the very least we obtain some kind of standard description prior to execution that allows us to validate whether the process to run was parsed correctly.

Can you please elaborate more on the following part? I'm not sure I understand what you propose.

separating out the processDescription from the executionUnit [...] where we first POST a style, and then add metadata to it

Do you mean that there would be 1 "Workflow Engine" process without any specific inputs/outputs, and that each /execution request would need to submit the full CWL (or whichever else) as the executionUnit each time? What would be the point of the process description in this case, since the core element of the process cannot be known as it would be mostly generated from the submitted executionUnit? It feels like the "POSTing of style" is basically doing a process deployment.

@jerstlouis
Copy link
Member

jerstlouis commented Feb 22, 2022

@fmigneault

I agree that workflows could technically be generated on the fly with direct POST on /execution, but I personally don't like this approach too much if the process description is also generated "just-in-time" from different combinations of execution media-type/workflows.

The original idea for these ad-hoc workflows in Part 3 is to allow clients to discover data and processes and immediately make use of these, without requiring special authentication privileges on any of the servers. In that context, I imagined that this would involve lower level processes already made available (and described) using OGC API - Processes (with support for Part 3, or support only Core and using an adapter like the one we developed). I am not sure how well this capability could extend to CWL or OpenEO as well, but was just throwing it out as a possibility because I think you had mentioned before that this could make sense.

The idea is not to replace deployment either... Processes or virtual persistent collections could still be created with those workflows, but the ad-hoc mechanism can provide a way to test and tweak the workflow before publishing it and making it widely available.

Do you mean that there would be 1 "Workflow Engine" process without any specific inputs/outputs, and that each /execution request would need to submit the full CWL (or whichever else) as the executionUnit each time?

In the current draft of Part 3, there is always a top-level process (the one closest to the client in the chain), and the execution request is POSTed to that process. The "process" property is only required for the nested processes, and actually this has resulted in confusion when specifying one (optional) top-level process but POSTing the workflow to the wrong process execution end-point.

There could be a "workflow engine" process as you suggest that requires the "process" key even for the top-level process, avoiding that potential confusion. This might also make more sense with CWL if there is not always a top-level OGC API - Process involved at the top of the workflow.

What would be the point of the process description in this case, since the core element of the process cannot be known as it would be mostly generated from the submitted executionUnit?

Sorry, I might have been adding confusion by mixing up two separate things:

  • a) (Part 2) Deploy, Replace, Undeploy (deploying a process, potentially using a "workflow" as a payload)
  • b) (Part 3) ad-hoc execution of workflows

In those "Styles" examples and providing executionUnit and processDescription details separately, I was suggesting this mainly for a), i.e. POSTing the executionUnit content directly to /processes (which allows to use different media types specific to its content) in order to deploy a new process without being forced to provide a processDescription (since it can mostly be inferred).

For b) ad-hoc workflows, in the context of Part 3 as originally proposed, it mainly means re-using processes and data collections already deployed, in a more complex high-level workflow. I imagined the same could potentially be done with CWL. Process descriptions are not involved here (except for any processes used internally, whose descriptions are useful to put together the workflow).

@pvretano
Copy link
Contributor Author

Just a gentle reminder, Part 2 is now called "OGC API - Processes - Part 2: Deploy, Replace, Undeploy" (i.e. DRU) ... It is no longer called "Transactions".

@jerstlouis
Copy link
Member

jerstlouis commented Feb 22, 2022

@fmigneault Thank you for adapting those examples in so much detail!

To attempt to get the cross-walk going, some first comments of those 3 JSON snippets:

  • The first two JSON snippets with CWL unit class CommandLineTool allow defining the PCGridify and RouteProcessor processes and cover functionality which had not been considered for Part 3: Workflows and chaining, but which I think fits well within the current Application Package Best Practice and Part 2: Deploy, Replace, Undeploy. The execution unit portion of these is what I suggested could potentially be POSTed directly to /processes with a CWL media type if the process description can be inferred from the CWL directly.
  • The third JSON snippet with CWL unit class Workflow serves a similar purpose to what was proposed for Part 3: Workflows and chaining, e.g. if "run" can refer to local (or potentially remote) OGC API processes. Such a Workflow execution unit (the CWL directly) is what could potentially be POSTed to execute a workflow in an ad-hoc manner. Just like the execution request-based workflows, it also makes sense to deploy such workflows as a blackbox process taking inputs and generating outputs using Part 2: DRU.
  • I wonder to what extent the idea of leaving out some aspects of the execution could also be applied to CWL, to also support processing triggered by data access? E.g. allowing different areas and resolutions of interest to be processed as they are requested by the end-user client, allowing data inputs to come in either from an OGC API collection (which leaves open a lot of flexibility on how to retrieve the data), from a direct URL to a file, or from an embedded payload (not requiring any one of these specifically for the process), leaving open the negotiation of which OGC API should be used to transfer data from a collection (e.g. vector tiles or Features), or which particular format to use at each hop of the workflow (when involving different servers)?

@fmigneault
Copy link
Contributor

fmigneault commented Feb 22, 2022

@jerstlouis

I see. Yes, I was confused about the additional process and POSTing aspect of the Workflows.

I agree with you; the examples I provided, converted to CWL, are the processes that would be dynamically generated if one wants to represent something POSTed as Part 3 on /execution using the Part 2 concepts, which can then be processed equivalently.

Indeed, the run can be a full URI where a remote process is called. This refers more to the ADES/EMS portion of previous testbeds though, since remote deployment of the process might be needed.

Regarding your 3rd point (#279 (comment)),
CWL allows some dynamic values to be resolved at runtime using inline JavaScript definitions. I believe this could be used for doing what you mention, but I would most probably define a separate process instead since that would make things clearer IMO (more details below - point 4).

Looking back at your examples, I understand the various use cases presented a bit better, and I believe it is possible to consider all the newly proposed functionalities separately to better illustrate concerns.

1. Collection data type

e.g.: Inputs that have this kind of definition:

    "layers": [
      { "collection": "https://maps.ecere.com/ogcapi/collections/SRTM_ViewFinderPanorama" }
    ]

In my opinion, this should be a new type in itself, similar to bounding boxes. I don't think this should be part of Part 3 per se (or at least consider it as a separate feature).
It could be something on its own that could work very well in conjunction with Core or any extension.
What executes this definition behind the scenes could very well be some "CollectionFetcher" process that acts like an independent Process using either Part 2 or Part 3 methods, whichever the implementer feels is more appropriate.

I believe more details need to be provided because there are some use cases where some "magic" happens, such as when the ogcapiParameters filter or sortby are provided. This is more than just crs as for bounding box inputs.
I remember @pvretano also highlighting this processing ambiguity when he asked why not simply append those as query parameters after the URL. There is some additional logic handling those parameters that is not easily replicable.
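As a sketch of the kind of definition in question (the filter and sortby values here are only placeholders, and whether they belong under an ogcapiParameters object or directly beside the collection URI is exactly the kind of detail that needs to be specified):

    "dataset" : {
      "collection" : "https://example.com/ogcapi/collections/osm:roads",
      "ogcapiParameters" : { "filter" : "highway = 'primary'", "sortby" : "name" }
    }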

2. Nested processes (the main Part 3 feature)

e.g.: A definition as follows:

{
  "process" : "https://example.com/ogcapi/processes/RoutingEngine",   <---------- optional (POST `/processes/RoutingEngine/execution`)
  "inputs" : {
     "dataset" : { "collection" : "https://example.com/ogcapi/collections/osm:roads" },
     "elevationModel" :
     {
        "process" : "https://example.com/ogcapi/processes/PCGridify",  <--- required, dispatch to this process
        "inputs" : {
           "data" : { "collection" : "https://example.com/ogcapi/collections/bigPointCloud" },   
           "fillDistance" : 100,
           "classes" : [ "roads" ]
        },
        "outputs" : { "dsm" : { } }   <--- (*) this is what chains everything together
     },

(*)
The specification should make it clear that one and only one output ID is allowed there (you can't pick many outputs and plug them into the single parent input this definition is nested under). Given that restriction, this definition seems sufficient IMO to generate the corresponding CWL-based processes in #279 (comment)
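To make the restriction concrete (a sketch only, reusing output names that appear elsewhere in this thread):

    "outputs" : { "dsm" : { } }                      <--- allowed: exactly one output ID is selected
    "outputs" : { "dsm" : { }, "ortho" : { } }       <--- not allowed: two outputs cannot feed the single parent input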

I also think there would be no need for custom-workflow/separate POSTing of executionUnit at execution time if this is what the major portion of Part 3 limits itself to.

3. Components schema to reuse definitions

This refers to the provided Scenario 4. I find the reuse of definitions with #/components/ a very nice feature.
I believe to make this work using an equivalent CWL approach, we need to improve/add some details.

The monthlyInput[2].process that refers to coverage_processor makes sense. It can be handled similarly to point (2) above. It is probably only missing the outputs definition to tell which output to connect to the parent process input. Following that, it would be possible to generate the full workflow automatically.

The other two items from Scenario 4 (modis and sentinel2) are too complicated. Most probably, @jerstlouis, you would be the only one to know how this is handled, because there is nothing providing details on how this is applied. Contrary to the last element under monthlyInput where a process reference is provided, those define more parameters, and there is no way to know how inputs and outputs are connected to each other. Are they supposed to do a collection call as in point (1)?

4. Expressions

I think the expressions {month} and {datetime} should be avoided for Part 3. (Maybe make that a Part 4 extension?)
This is not something that is very obvious nor easy to implement, although it looks conceptually very convenient.

Firstly, datetime is picked from the request query parameter (how about other sources, how to tell?). I don't see why you wouldn't simply substitute the request query value directly in the body when submitting it (since you need to submit it anyway) to avoid the complicated parsing that would otherwise be required.

Second, "datetime" : { "year" : { "{datetime}.year" }, "month" : "{month}" } shows a specific handling for datetime object. In this case, it works because datetime is assumed to be converted to a datetime object with year, month, etc. properties.
What about other kind of handling though? Convert to float, int, split string, etc. There are too many use cases that explodes the scope of Part 3.

For example, in CWL, it is possible to do similar substitutions, but to process them, full inline parsing using JavaScript is required, and there are constant issues related to how parsing must be done for one case or another, for different kinds of objects, whether they are nested under one field or another, how to link them all as variables, etc.
An example of an inline definition is "$(runtime.outdir)" in my examples, but it can get much more convoluted.

I don't think many developers would adopt the Part 3 Workflow/Chaining capabilities (in themselves relatively simple) when such a big implementation requirement must be supported as well. I think OGC API - Processes should keep it simple. I could see the {datetime} and {month} definition easily replaced by another nested process: http... reference that simply runs a CurrentDate process which returns that datetime as output.
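As a sketch of what I mean (CurrentDate is a hypothetical process, and the selected output ID is an assumption):

    "datetime" : {
       "process" : "https://example.com/ogcapi/processes/CurrentDate",
       "outputs" : { "datetime" : { } }
    }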

5. New operations

This mostly revolves around Scenario 1 and Scenario 6.
The addition of endpoints:

  • /processes/RenderMap/map/...
  • /processes/landcover/tiles/...
  • etc.?

In my opinion, this also unreasonably increases the scope of Part 3, which should focus only on Workflows (i.e. nesting/chaining processes).
It feels like a separate feature (somewhat related to other OGC API - XXX standards) that adds extra capabilities to processes which are not themselves relevant for Workflows. It would be possible to combine those features simultaneously, sure, but completely different parsing methodologies are needed, which would warrant a different extension.

This is also the portion for which I see no way to easily convert to a CWL equivalent dynamically, simply because there is no detail about it.
It also seems to be the only use case where POSTing different kinds of Workflows/media types at distinct endpoints is required, which is what brought a lot of confusion in the first place.

@jerstlouis
Copy link
Member

jerstlouis commented Feb 22, 2022

Thanks @fmigneault for the additional feedback. I will try to address everything you commented on, please let me know if I missed something.

First, I think what you are pointing out is that there is a range of functionality covered in those scenarios which makes sense to organize into different conformance classes. Which of these conformance classes make it into Part 3 remains to be agreed upon, and I think, as @pvretano pointed out, that is one of the main points of this exercise (although perhaps that was specifically referring to conformance classes for different workflow languages).

Note that even if these conformance classes are regrouped in one Processes - Part 3 specification, an implementation could decide to implement any number of the conformance classes, and potentially none of them would be required. Therefore I suggest we focus first on the definition of these conformance classes, and worry later about how to regroup those conformance classes in one or more specification / document.

In my scenarios 1-6 above, the names in parentheses are the conformance classes that I had suggested previously.
I presented these at the July 2021 OGC API - Processes Code Sprint and here is the summary from the key slide which might help put things in perspective:

Envisioned conformance classes:

  • CollectionOutput: Collection as output
    ?response=collection
  • LandingPageOutput: Dataset API Landing Page as output
    ?response=landingPage
  • CollectionInput: Collection as input
    { "collection" : "https://server.com/ogcapi/collections/someCollection" }
  • RemoteCollection: Remote collection as input
    { "collection" : "https://example.com/ogcapi/collections/someCollection" }
  • NestedProcess: Nested process as input
    { "process" : "https://server.com/ogcapi/processes/someProcess" }
  • RemoteProcess: Remote process as input
    { "process" : "https://example.com/ogcapi/processes/someProcess" }
  • ImmediateAccess: POST to /processes/{processId}/{accesstype} (Features, Tiles, Coverages, Maps)
  • DeployWorkflow: POST to /processes support for { "input" : "SomeParameter"}

Indeed, the run can be a full URI where a remote process is called. This refers more to the ADES/EMS portion of previous testbeds though, since remote deployment of the process might be needed.

In the conformance classes suggested for Part 3, this refers specifically to NestedProcess and RemoteProcess. With RemoteProcess, there would be no need to first deploy the process, whereas with NestedProcess, a process would need to be deployed first in order to use it in a workflow.

  1. Collection type In my opinion, this should be a new type in itself, similar to bounding boxes. I don't think this should be part of Part 3 per se (or at least consider it as a separate feature).

This is specifically the CollectionInput conformance class. I agree that this bit alone is very useful by itself, but it is also what greatly simplifies the chaining, because it works hand in hand with the CollectionOutput conformance class. CollectionOutput allows accessing the output of a process as an OGC API collection. Any process that accepts a collection input is automatically able to use a nested process (whether local or remote) that can generate a collection output.

I fully agree that CollectionInput is useful by itself, in fact there was a perfect example in Testbed 17 - GeoDataCube where the 52 North team implemented support for a LANDSAT-8 Collection input in their Machine Learning classification process / pygeoapi deployment.

Whether this conformance class is added to OGC API - Processes - Part 1: Core 2.0 or OGC API - Processes - Part 3: Workflows and Chaining however does not really matter.

I believe more details need to be provided because there are some use cases where some "magic" happens, such as when the ogcapiParameters filter or sortby are provided. This is more than just crs as for bounding box inputs.

In full agreement here, as these are details that need to be worked out with more experimentation. Using OGC API collections leaves a lot of flexibility, some of which might be useful to leave up to the hop end-points to negotiate between themselves, but a filter that further qualifies the collection is a good example of wanting to restrict the content of that collection directly within the workflow.

The datetime parameter use case in this scenario, where daily datasets are used to generate a yearly dataset but the process needs to first generate monthly coverages, is another good example where the end-user query datetime (yearly) needs to become monthly requests to the MODIS and Sentinel-2 collections.

The specification should make it clear that one and only one output ID is allowed there (you can't pick many outputs and plug them into the single parent input this definition is nested under).

  1. When a process generates a single output, there should be no need to specify the output (that is already the case in Part 1: Core).
  2. I would be inclined not to completely rule out the possibility of a process accepting as "one" input "multiple" outputs (i.e., a dataset with multiple collections). An example of this might be a process taking in an OpenStreetMap PBF and generating a multi-collection dataset (e.g. roads, buildings, amenities...). The Process Description would need to describe the "input" as multiple feature collections somehow.... Conceptually, it could still be considered "one" input.
  3. In most use cases, this restriction makes sense. But I feel like not having this restriction seems more of a communication / documentation issue in how this would normally be used, vs. a real benefit in preventing the possibility of one process output being a multi-collection dataset.
  4. This is somewhat related to Scenario 5 and the LandingPageOutput conformance class (the ability to access the results of a process as a multi-collection dataset / OGC API landing page).

I also think there would be no need for custom-workflow/separate POSTing of executionUnit at execution time if this is what the major portion of Part 3 limits itself to.

I am a bit confused by that comment. executionUnit is a concept of the Application Package Best Practice and related to Part 2: DRU, if I understand correctly. The proposed Part 3 NestedProcess conformance class defines the possibility to include a nested process as part of submitting an execution request at /processes/{processId}/execution. The RemoteProcess conformance class allows those processes to be on another server (without requiring to first deploy them to the server to which the execution request is submitted).

The monthlyInput[2].process that refers to coverage_processor makes sense. It can be handled similarly to point (2) above. It is probably only missing the outputs definition to tell which output to connect to the parent process input.

The process description for coverage_processor in this case would define a single output (the resulting coverage), therefore it is not necessary to specify it (as in the published Processes - Part 1: Core).

Contrary to the last element under monthlyInput where a process reference is provided, those define more parameters, and there is no way to know how inputs and outputs are connected to each other. Are they supposed to do a collection call as in point (1)?

I think the confusion here is caused by the use of the { "input" : {parameterNameHere} } defined in the DeployWorkflow conformance class. This Scenario 4 workflow is intended to be deployed as a process rather than being submitted as an execution request (similar to your 3rd JSON snippet with the CWL Workflow unit class), and therefore must be supplied the inputs modis_data and sentinel2_data. Those defined inputs would be replaced by the collections supplied in the Scenario 1 example, which presumably invokes the process that Scenario 4 defines using a workflow. These "input" entries are equivalent to the "inputs" in the CWL Workflow unit class, except that they are inferred from wherever they are used rather than being explicitly listed. I.e., if { "input" : "modis_data" } is used in two places in the workflow, that is the same "modis_data" input to the process being defined by the workflow.

Potentially, those inputs could also be supplied as embedded data to the process created by the workflow, and in that case the ogcapiParameters and format would not be meaningful/used -- the filtering and proper format would have had to be done prior to submitting the data as input to the landcover process defined by the Scenario 4 workflow.
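A stripped-down sketch of that pattern (the input name data used here is a placeholder; only the "input" mechanism is the point):

{
   "process" : "https://example.com/ogcapi/processes/coverage_processor",
   "inputs" : {
      "data" : [
         { "input" : "modis_data" },
         { "input" : "sentinel2_data" }
      ]
   }
}

When the process deployed from such a workflow is executed, whatever the caller supplies for modis_data and sentinel2_data (collections, references or embedded data) gets substituted wherever those "input" entries appear.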

I think the expressions {month} and {datetime} should be avoided for Part 3. (Maybe make that a Part 4 extension?)
This is not something that is very obvious nor easy to implement, although it looks conceptually very convenient.

I agree this is more complicated and I just came up with those while trying to put this Scenario 4 example together.
In more typical use cases, the {bbox} or {datetime} from the OGC API collection data requests would just flow through to the nested processes / collection inputs. But in this case, we only needed the "year" portion of the datetime, and wanted to re-use the same modis / sentinel2 / monthlyInput components but changing the {month}.

Some of this capability to reference how the OGC API collection data requests were made would, I think, make sense to include as part of the CollectionOutput conformance class (e.g. {datetime} and {bbox}).

The capability to use e.g. {month} (i.e. an arbitrary {templateVariable}) together with the $ref might make sense as part of a ReusableComponents conformance class.

Second, "datetime" : { "year" : { "{datetime}.year" }, "month" : "{month}" } shows a specific handling for datetime object.

You are right, this is a specific capability here to be able to specify a monthly request using only the year that was provided by the OGC API request triggering the processing, but specifying a different month. What I wished for while writing this was a function to build an ISO 8601 string as the OGC API datetime parameter would expect, which JSON does not have, so there is some assumption that this works somehow.

This Scenario 4 example is testing new ground in terms of the capabilities of execution request-based workflow definitions as explored so far, but despite a few things to iron out I feel it manages to very concisely and clearly express slightly more complex / practical workflows.

I could see the {datetime} and {month} definition easily replaced by another nested process: http... reference that simply runs a CurrentDate process which returns that datetime as output.

I would welcome suggestions on how to better express it. I actually considered whether I needed to define new processes, but this was the best balance I could manage Sunday night in terms of clarity / conciseness / least-hackish / ease of implementation. These aspects are definitely still Work in Progress :) The idea here is that {datetime} referred to the OGC API data access that triggered the process (CollectionOutput), whereas the ogcapiParameters.dateTime refers to the datetime that will be passed to the input collections from which data is being requested (CollectionInput). I'm not sure how a separate process could help with this, unless you mean using a process to do the job that a JSON function could have done? (which is what I also thought of, but discarded as likely worse off in terms of clarity, conciseness, least-hackish AND ease of implementation).

New operations / This mostly revolves around Scenario 1 and Scenario 6.

I have to clarify here that Scenario 6 is completely different from Scenario 1 in this regard.
Scenarios 1 and 2 use the OGC API data access capabilities (e.g. Tiles and Maps) on the collection generated by the process, where making a request to the collection using these access mechanisms triggers processing. This is what the CollectionOutput conformance class defines, and as discussed above is a key thing to make it easy to chain nested processes when processes support both CollectionInput and CollectionOutput.

In my opinion, this also unreasonably increases the scope of Part 3, which should focus only on Workflows (i.e. nesting/chaining processes).

If we are talking about CollectionOutput, I think it does fit well within Workflows and chaining because it provides an easy way to connect the output of any process as an input to any other process, and it enables the use of Tiles and DGGS zones as enablers for parallelism, distributedness, and real-time "just what you need right now" with hot workflows working on small pieces at a time, rather than batched processing ("wait a long time / use up a lot of resources, and what you get in the end might actually not be what you wanted, might never end up being used, or might be outdated by the time it is used").

If we are talking about ImmediateAccess which is covered by Scenario 6, it is a much less essential capability, but as I explained it is quite useful for demonstration purposes (e.g. to demonstrate a PNG response of a workflow directly in SwaggerUI as a single operation), and to some extent to provide fewer server round-trips (e.g. submitting a workflow and getting a templated Tiles URI in a single step).

It also seems to be the only use case where POSTing different kinds of Workflows/media types at distinct endpoints is required, which is what brought a lot of confusion in the first place.

Seems like there is still some confusion about POSTing workflows and media types, so I will try to clear this up :)

  • Scenario 6 / ImmediateAccess proposes the ability to POST workflows to different resource types for demonstration and shortcut purposes -- a much less essential capability which could be defined separately, and I don't mind if we forget about it completely while we discuss all the much more important conformance classes.
  • In conjunction with Part 2: DRU, a workflow could be POSTed to /processes with different media types (e.g. execution request, CWL...) to deploy a new process
  • A workflow / execution request could potentially be POSTed to /collections to create a new persistent virtual collection from a workflow / process execution
  • A workflow / execution request can be POSTed to /processes/{processId}/execution (as in OGC API - Processes - Part 1: Core) to execute it. Different media types could potentially allow using the Workflow CWL unit class here.
  • Also with /execution, the CollectionOutput and LandingPageOutput proposed for Part 3 introduce a new "execution mode" (instead of sync or async) where the immediate response is a collection description or a landing page, and the processing only gets triggered when requesting data from the resulting collection(s).

This is also the portion for which I see no way to easily convert to a CWL equivalent dynamically, simply because there is no detail about it.

Leaving Scenario 6 aside, and focusing on the CollectionOutput capability (e.g. Scenario 1) where making an OGC API data request triggers process execution to generate the data for that response, would there be something equivalent? I don't think there are many details missing other than those in the respective OGC API specifications (e.g. Tiles, Maps, Coverages...). The data access OGC APIs specify how to request data from an OGC API collection, and an implementation of Part 3: CollectionOutput is able to feed the data for that access mechanism to return when it is requested from that virtual collection.

One other nice thing about CollectionOutput is that it makes it much easier (than e.g. Processes - Part 1: Core) for visualization clients to support visualizing the output from workflows, while requiring very little work specifically to implement process / workflow execution. This capability is e.g. implemented in the GDAL OGC API driver (and thus available in QGIS as well). It was also easily implemented in clients by participants in Testbed 17 / GeoDataCube.

Thanks!

@jerstlouis
Copy link
Member

@bpross-52n @pvretano Please add a workflow/chaining label! ;)

@mr-c
Copy link

mr-c commented Feb 22, 2022

A 2008 paper listing some workflow languages used in e-Science is at https://www.dcc.ac.uk/guidance/briefing-papers/standards-watch-papers/workflow-standards-e-science

FYI: A modern list (that is continually being updated) with over 300 workflow systems/languages/frameworks known to be used for data analysis: https://s.apache.org/existing-workflow-systems

There is another list at https://workflows.community/systems that just started. This younger list aims to be a better classified subset of the big list: only the systems that are still being maintained.

@fmigneault
Copy link
Contributor

fmigneault commented Feb 22, 2022

@jerstlouis

Envisioned conformance classes:

Nice. I missed the presentation about those.

DeployWorkflow: POST to /processes support for { "input" : "SomeParameter"}

We must be careful not to overlap with Part 2 here. This is the same method/endpoint to deploy the complete process.

[...] it is also what greatly simplifies the chaining, because it works hand in hand with the CollectionOutput conformance class. CollectionOutput allows accessing the output of a process as an OGC API collection. Any process that accepts a collection input is automatically able to use a nested process (whether local or remote) that can generate a collection output.

This made me think that we must consider some parameter in the payload that will tell the nested process to return the output this way. Maybe for example "outputs" : { "dsm" : { } } could be replaced by "outputs" : { "dsm" : { "response": "collection" } }. Otherwise it is assumed CollectionOutput is returned, which is not the default for all currently existing implementations.

When a process generates a single output, there should be no need to specify the output (that is already the case in Part 1: Core).

I agree that could be allowed if the default was to return CollectionOutput, but since processes are not expected to do so by default (from Core), I think the proposed { "response": "raw|document|collection" } addition above would always be required.

I would be inclined not to completely rule out the possibility of a process accepting as "one" input "multiple" outputs (i.e., a dataset with multiple collections). An example of this might be a process taking in an OpenStreetMap PBF and generating a multi-collection dataset (e.g. roads, buildings, amenities...). The Process Description would need to describe the "input" as multiple feature collections somehow.... Conceptually, it could still be considered "one" input.
In most use cases, this restriction makes sense. But I feel like not having this restriction seems more of a communication / documentation issue in how this would normally be used, vs. a real benefit in preventing the possibility of one process output being a multi-collection dataset.

I agree. By "multiple outputs", I specifically refer to the variable {outputID} that forms the key in the output mapping. If under that key, an array of collections is returned, this is perfectly fine if the parent input that receives it accepts maxOccurs>1.
The reason why I think it should be restricted, is allowing multiple {outputID} at the same time implies there must be a way to concatenate all the outputs together to pass it to the input. Because of the large quantity of different output types, formats and representations, this is not trivial.

I'm not sure how a separate process could help with this, unless you mean using a process to do the job that a JSON function could have done? (which is what I also thought of, but discarded as likely worse off in terms of clarity, conciseness, least-hackish AND ease of implementation).

I think that a process (let's call it DatetimeParser) that receives a value as input (datetime), and returns as output (parsedDatetime) a JSON formed as { "year" : { "{datetime}.year" }, "month" : "{month}" }, would do the trick. The parent process that nests DatetimeParser for one of its inputs would simply chain the returned JSON as the input value. Here, the { "parsedDatetime" : { "response": "raw" } } could be used to highlight that the value is passed as data rather than a document or a collection.
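Sketched as a nested execution (DatetimeParser is hypothetical, and the literal input value is just an example):

    "datetime" : {
       "process" : "https://example.com/ogcapi/processes/DatetimeParser",
       "inputs" : { "datetime" : "2022-02-22T00:00:00Z" },
       "outputs" : { "parsedDatetime" : { "response" : "raw" } }
    }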

If we are talking about CollectionOutput, I think it does fit well within Workflows and chaining because it provides an easy way to connect the output of any process as an input to any other process

For this (Collection[Inputs|Outputs] working hand in hand with Workflows), I totally agree.
It is the /tiles, /map, etc. new features (ImmediateAccess) that IMO are out of scope for Part 3 workflow chaining. It is again highlighted by your clarification regarding POSTing workflows and media types.

Leaving Scenario 6 aside, and focusing on the CollectionOutput capability (e.g. Scenario 1) where making an OGC API data request triggers process execution to generate the data for that response, would there be something equivalent?

I think this is possible to map to CWL definitions dynamically if only Collection[Input|Output] are used. I think I would resolve parsing of a collection input using a CollectionHandler process that takes the collection URL and any other additional parameters as JSON. That process would be in charge of calling the relevant OGC API operation to retrieve the collection, and of returning it as output. All existing Processes/Workflows from Part 2 could then dynamically generate a sub-workflow by inserting this CollectionHandler when { "collection": ... } is specified as execution input.
In the same manner, Tiles, Maps, Coverages, etc. handlers would be distinct CWL parser/handlers.
I prefer to have many sub-processes in a large Workflow chain that accomplish very small tasks to convert data in various manners, rather than having OGC API itself have to embed custom handling for each new input/data variation.
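For example (a sketch only; CollectionHandler is the hypothetical process mentioned above), a { "collection": ... } input could be expanded internally into a nested step along these lines, with any ogcapiParameters such as filter or sortby passed to it as additional inputs:

    {
       "process" : "https://example.com/ogcapi/processes/CollectionHandler",
       "inputs" : {
          "collection" : "https://example.com/ogcapi/collections/osm:roads"
       }
    }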

I think it is important to keep extensions separate for their relevant capabilities, although they can work together afterwards.
This is because, realistically, implementers that try to conform to any new Part should try to implement most of it rather than handpick conformance classes under it. Otherwise, there is no point to have a Part in the first place.

@jerstlouis
Copy link
Member

jerstlouis commented Feb 22, 2022

@fmigneault

We must be careful not to overlap with Part 2 here. This is the same method/endpoint to deploy the complete process.

The POST operation to /processes is defined by Part 2. The DeployWorkflow conformance class would define that a workflow is a valid payload for Part 2, and that { "input" : {someparameter} } is how to define an input to a workflow deployed as a process.

This made me think that we must consider some parameter in the payload that will tell the nested process to return the output this way.

Well the idea here is that the end-points of any particular hop of that workflow would be the ones deciding whether CollectionOutput is used or not, based on conformance support. It is not required that they do so, e.g. if the Processes server does not support CollectionOutput, Processes - Core could be used and requests could be made using sync or async execution mode -- there is no assumption that one or the other is used.

I agree that could be allowed if the default was to return CollectionOutput, but since processes are not expected to do so by default (from Core), I think the proposed { "response": "raw|document|collection" } addition above would always be required.

Not necessarily, as I just pointed out, and raw vs. document is gone with #272 (2.0?).

It is the /tiles, /map, etc. new features (ImmediateAccess) that IMO are out of scope for Part 3 workflow chaining. It is again highlighted by your clarification regarding POSTing workflows and media types.

To make things super clear:

CollectionOutput allows requesting ?response=collection, which will return a collection description with links to access mechanisms; tiles can then potentially be requested to trigger results (if Tiles is supported), e.g. https://research-alpha.org/ogcapi/internal-workflows/600d-c0ffee/tiles/GNOSISGlobalGrid/{tileMatrix}/{tileRow}/{tileCol}.mvt in Scenario 1.
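For illustration, the immediate response to ?response=collection could look roughly like this (a sketch only; the identifier and URIs are taken from the Scenario 1 example, and the link relation for the access mechanism is assumed here to be the OGC API - Tiles one for vector tilesets -- the actual relations depend on the data access APIs supported):

{
   "id" : "600d-c0ffee",
   "links" : [
      {
         "href" : "https://research-alpha.org/ogcapi/internal-workflows/600d-c0ffee",
         "rel" : "self",
         "type" : "application/json"
      },
      {
         "href" : "https://research-alpha.org/ogcapi/internal-workflows/600d-c0ffee/tiles",
         "rel" : "http://www.opengis.net/def/rel/ogc/1.0/tilesets-vector",
         "type" : "application/json"
      }
   ]
}

Requesting tiles from that virtual collection (e.g. using the .mvt URI template above) is then what triggers the actual processing.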

ImmediateAccess allows both POSTing the workflow and requesting a tile at the same time, or POSTing a workflow and getting a tileset right away, as in Scenario 6. E.g. POST the workflow to https://maps.ecere.com/ogcapi/processes/RenderMap/map ,
https://maps.ecere.com/ogcapi/processes/RenderMap/map/tiles/GNOSISGlobalGrid or
https://maps.ecere.com/ogcapi/processes/RenderMap/map/GNOSISGlobalGrid/0/0/0 .

ImmediateAccess is a nice-to-have for demonstration and skipping HTTP roundtrips. I don't mind if it doesn't end up in Part 3.
CollectionOutput is a key capability proposed for Part 3.

I prefer to have many sub-processes in a large Workflow chain that accomplish very small tasks to convert data in various manners, rather than having OGC API itself have to embed custom handling for each new input/data variation.

Well, one of the important ideas with the CollectionInput / CollectionOutput conformance classes is to leave flexibility to make workflows as generic and re-usable as possible with different OGC API implementations. For example, one might re-use the exact same workflow with different servers or data sources, but in practice some will end up exchanging data using DGGS, others with Tiles, others with Coverages; or one will negotiate netCDF, while another will negotiate Zarr, or GRIB. And the workflow does not need to change at all to accommodate all of these.

It also leaves the workflow itself really reflecting exactly what the user is trying to do: apply this process to these data sources, feed its output to this other process, and all the exchange and communication details are left out of the workflow definition for negotiation by the hops.

Of course any implementation of this is free to convert this in the back-end to smaller tasks and sub-process invocations internally.

I think it is important to keep extensions separate for their relevant capabilities, although they can work together afterwards.
This is because, realistically, implementers that try to conform to any new Part should try to implement most of it rather than handpick conformance classes under it. Otherwise, there is no point to have a Part in the first place.

I think there are different opinions about this throughout OGC. With the building blocks approach, I believe that the fundamental granularity that matters for implementation is the conformance classes, whereas the parts are just a necessary organization of the conformance classes into specification documents for publication and other practical reasons. Taking OGC API - Tiles - Part 1: Core as an example, there is definitely no expectation that any implementation will implement all of its conformance classes. So I disagree that handpicking conformance classes to implement is a bad thing, just like handpicking which OGC APIs / parts one implements in an OGC API implementation is not a bad thing.

More importantly, I think the modularity of OGC API building blocks makes it easy to start by implementing one or more conformance class, and gradually add support for additional ones based on practical needs and resources available.

@jerstlouis
Copy link
Member

jerstlouis commented Feb 22, 2022

I was thinking that we could define a WellKnownProcess that allows executing command line tools with an execution request workflow, similar to the approach used in your example using CWL to define base processes @fmigneault :

Scenario 7

This would be POSTed to /processes (Part 2: DRU) to create the PCGridify process in Scenario 5

{
   "process" : "http://example.com/ogcapi/processes/ExecuteCommand",
   "inputs" : {
      "command" : "PCGridify",
      "requirements" : {
         "docker" : { "pull": "example/PCGridify" }
      },
      "stdin" : { "input" : "data", "format": { "mediaType": "application/vnd.las" } },
      "arguments" : [
         "-fillDistance",
         { "input" : "fillDistance", "schema" : { "type" : "number" } },
         "-classes",
         { "input" : "classes", "schema" : { "type" : "array", "items" : { "type" : "string" } } },
         "-orthoOutput",
         "outFile1"
      ]
   },
   "outputs" :
   {
      "stdout" : {
         "output" : "dsm",
         "format": { "mediaType": "image/tiff; application=geotiff" }
      },
      "outFile1" : {
         "output" : "ortho",
         "format": { "mediaType": "image/tiff; application=geotiff" }
      }
   }
}

I realize that we also probably need this { "output" : {outputName} } in the DeployWorkflow conformance class to support returning multiple outputs and naming the outputs from a workflow deployed as a process.

@fmigneault
Copy link
Contributor

@jerstlouis
If the CWL nomenclature is used, I think it would be better to simply embed it directly without modification (similar to executionUnit in my examples).
Using the #279 (comment) representation, I don't see the advantage of placing everything under inputs/outputs with extra input/output sub-keys for each item to tell which are the "real" inputs of the process. It looks like a hybrid of the CWL and traditional process description, which will just make it harder to parse in Part 2.

@jerstlouis
Copy link
Member

jerstlouis commented Feb 22, 2022

@fmigneault This is to allow the Part 3 execution request approach / DeployWorkflow to work as a Content-Type option for deploying workflows as a process with Part 2, with a well-known process that can execute a command line tool.

It is not a process description, but an execution request as currently defined in OGC API - Processes - Part 1: Core, using the extensions defined in the Part 3 DeployWorkflow conformance class ("input" and "output"). The process description for the resulting PCGridify process could be inferred from its inputs and outputs and generated automatically. There is nothing CWL in there except for the inspiration from your example and the docker pull requirements :).
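For example, from the Scenario 7 execution request above, the inferred process description could look roughly like this (a sketch only; the version and the exact schema details are assumptions, and titles or other metadata could be added afterwards):

{
   "id" : "PCGridify",
   "version" : "1.0.0",
   "inputs" : {
      "data" : { "schema" : { "type" : "string", "contentEncoding" : "binary", "contentMediaType" : "application/vnd.las" } },
      "fillDistance" : { "schema" : { "type" : "number" } },
      "classes" : { "schema" : { "type" : "array", "items" : { "type" : "string" } } }
   },
   "outputs" : {
      "dsm" : { "schema" : { "type" : "string", "contentEncoding" : "binary", "contentMediaType" : "image/tiff; application=geotiff" } },
      "ortho" : { "schema" : { "type" : "string", "contentEncoding" : "binary", "contentMediaType" : "image/tiff; application=geotiff" } }
   }
}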

One could still POST CWL instead of this execution request workflow of course to deploy a process, or an application package that bundles a process description + CWL in the executionUnit, as different supported Content-Types to deploy processes.

@pvretano
Copy link
Contributor Author

@jerstlouis something seems wonky here! There should be no need for a "DeployWorkflow". Whether the execution unit of a process is a Docker container, or a Python script or a CWL workflow, that should not matter. All processes should be deployed the same way (i.e. POST to the /processes endpoint as described in Part 2). I am confused.

@jerstlouis
Copy link
Member

jerstlouis commented Feb 22, 2022

@pvretano

Whether the execution unit of a process is a Docker container, or a Python script or a CWL workflow, that should not matter. All processes should be deployed the same way (i.e. POST to the /processes endpoint as described in Part 2).

In full agreement with that. We might need different media types or JSON profiles for OGC API - Processes execution requests, for CWL, and for application packages, for this.

What I call the Part 3 - DeployWorkflow conformance class is:

  • The definition of a particular content type that defines a workflow which can be used in conjunction with Part 2 to deploy a process (execution request which may include some of the extensions defined in the other Part 3 conformance classes).
  • Specifically this "input" and "output" capability allowing to return one or more output from the workflow, and define inputs to the workflow in those execution requests, for workflows which are not meant to be executed directly, but to be used to define a new process.

So it is those new properties to define inputs & outputs, plus a particular Content-Type for a Part 2 Deploy operation.

Will Part 2 have different conformance classes for different Content-Types? (e.g. like the different Tiles encodings conformance classes). There is already one for OGC Application Package.

If not this DeployWorkflow conformance class, which conformance class could define the capability to define the "input" and "output" of the workflow itself for using the workflow as a process, rather than a ready-to-execute execution request? It could potentially be an Execution Request Deployment Package in Part 2 instead.

NOTE: Having different media types for CWL and execution requests than for application packages is in the context of NOT using the OGC Application Package conformance class defined in Part 2. It is also a possibility to include the execution request-style workflow (just like CWL) in the execution unit of an application package.

Personally, I find that the process description is something that the server should generate, not something to be provided as part of a Part 2 deployment, because it includes information about how the process implementation is able to execute things (e.g. sync/async execution mode), it may be able to accept more formats, for example, than the executionUnit being provided, and most of the process description can often be inferred from the executionUnit alone. Therefore I don't like the current application package approach very much, and would prefer directly providing the executionUnit as the payload to the POST.

@fmigneault
Copy link
Contributor

@jerstlouis
I am also confused.
You are referring to an execution request but simultaneously saying that you POST on /processes?
As @pvretano mentions, I don't think Part 3 should POST any differently than Part 2 does, which already accommodates different kinds of execution units. If anything, Part 3 should try to work in concert with Part 2, not redefine similar concepts on its own.

@jerstlouis
Copy link
Member

jerstlouis commented Mar 1, 2022

@pvretano

I am not exactly sure how one could be confused about the difference between deploying a process (i.e. adding it to the API) and describing a process (i.e. getting its current definition)

The ambiguity seems to be between the definition vs. description of a process, as in that statement you made right there.

Using my understanding of those terms, when deploying a process using an OGC application package, a description is provided in the processDescription field, whereas the definition is provided in the executionUnit field.

When retrieving the process description (GET /processes/{processId}), it is the process description that is returned, not its definition (e.g. the CWL executionUnit).

Optionally being able to retrieve the definition of a process as well makes sense if you want to allow users to re-use and adapt a particular workflow, but that would be a different operation (e.g. GET /processes/{processId}/workflow).

I had initially suggested this capability for a workflow deployed as a persistent virtual collection, but it applies to a workflow deployed as a process as well:

A collection document resulting from a workflow may expose its source process (workflow) execution document.

About:

but can you please be a little more specific about where the wording is ambiguous so that I can tighten it up!?

First I want to point out that the README gets it perfectly right:

This extension provides the ability to deploy, replace and undeploy processes, using an OGC Application Package definition containing the execution unit instructions for running deployed processes.

and so does the HTTP PUT description:

The HTTP PUT method is used to replace the definition of a previously, dynamically added processes that is accessible via the API.

Right below is where it gets muddied:

This extension does not mandate that a specific processes description language or vocabulary be used. However, in order to promote interoperability, this extension defines a conformance class, OGC Application Package, that defines a formal process description language encoded using

The OGC Application Package includes BOTH a description and a definition (called executionUnit). My argument is that the executionUnit is the most important piece and as a whole the package should be considered a definition, as in the README and the PUT description. That is because you could often infer most or all of the description from the executionUnit.

Also in the ASCII sequence diagram below:

Body contains a formal description of the process to add (e.g. OGC Application Package)

and the other one.

Note that as we discussed previously, a per-process OpenAPI description of a process would make a lot of sense for Part 1 (e.g. GET /processes/{processId}?f=oas30). Such an OpenAPI document would describe the process to be able to execute it, but does not define the process or the workflow behind it in any way. So it's really important to clearly distinguish between description and definition.

@fmigneault
Copy link
Contributor

@jerstlouis

Optionally being able to retrieve the definition of a process as well makes sense if you want to allow users to re-use and adapt a particular workflow, but that would be a different operation (e.g. GET /processes/{processId}/workflow).

I agree with you on this. For our implementation, we actually use GET /processes/{processId}/package to make it generic since it is not always a workflow. Process description and definition are indeed retrieved by separate requests.

I think the processDescription in the deployment payload is adequately named, as it is intended only for the process description returned later by GET /processes/{processId} (extended with some platform specific metadata).

This extension does not mandate that a specific processes description language or vocabulary be used. However, in order to promote interoperability,

Instead of even saying process definition in that sentence, I suggest explicitly using execution unit definition to avoid the possible description/definition confusion altogether. It is only the execution unit (CWL, etc.) that can be anything.

@jerstlouis
Copy link
Member

jerstlouis commented Mar 1, 2022

@fmigneault

I am also not sure to see where there is confusion between the deployment and description portions.

I've clarified above that the ambiguity is between description and definition.

Similar to a feature catalog, or pretty much any REST API that has POST/GET requests revolving around "some object", the data that you want to be retrieved by GET will be strongly similar to the POSTed one.

Unfortunately here we currently have a clear mismatch between the POST/PUT and the GET.
The GET returns a description, whereas the POST/PUT provide a definition.
For example with ogcapppkg/CWL, the GET returns only the processDescription field, whereas the PUT includes both the processDescription and the executionUnit (CWL).

I wouldn't say they are completely different things as one strongly depends on the other (the POSTed process will be the one eventually described).

The GET does describe the definition of what was POSTed, but it is a description, which is fundamentally different from the definition.

Whenever I mentioned Deployable/Execution Workflows or Workflow Description, I am using the same nomenclature as defined in Core and Part 2. In other words, some schema that allows the user to POST nested processes on /processes to deploy a "workflow", which can then be described on GET /processes/{processId}, and executed with POST /processes/{processId}/execution. In a way, you can see it as a "Process that just so happens to define a workflow chain" instead of a single operation.

I would like to avoid using the word description to refer to this and call that a definition, to avoid confusion with the process description returned by GET /processes/{processid} (which does not include the executionUnit), as I was suggesting to @pvretano that we change those instances where the word description is used in Part 2 to definition.

On the other hand, an Execution Workflow (i.e.: Part 3) that is directly POSTed on the execute endpoint (without prior deploy/describe requests), does not have its "workflow chain" defined yet. It is resolved dynamically when processing the execution contents.

I don't understand why you say that the Part 3 execution workflow does not have its workflow chain defined yet? How I understand it is that the Part 3 execution request workflow is the workflow chain. Some detailed aspects of it are resolved dynamically as part of the ad-hoc execution or deployment of the workflow (e.g. format & API negotiation for data exchange at a particular hop), but the overall chain is already defined.

I have not seen anything regarding that. Maybe you can provide a reference? From my understanding, once the Execution Workflow is POSTed, the result obtained as output is the same as when running an atomic process. Considering that, the workflow definition is effectively lost in the engine that applies the "workflow chain". The only workaround to this would be to deploy that workflow before executing it, but again, this poses a lot more problems as previously mentioned.

I would point out that this also applies to the CWL included in the application package's execution unit. A GET /processes/{processId} does not return the executionUnit/CWL, only the description. As I mentioned above, I had originally suggested that a workflow may be exposed for a persistent virtual collection (example), but it also makes perfect sense for workflows deployed as a process as well, e.g. GET /processes/{processId}/workflow (which could return the CWL or the Part 3 execution request workflow).

I will try to find time to address the other points you touched on in the message, but it's a busy week ;)

@pvretano
Copy link
Contributor Author

pvretano commented Mar 1, 2022

@jerstlouis yup, you are right. I'll clean up the wording a bit. I think the correct statement is that you POST a "description" of a process (i.e. an application package that includes the process's definition) to the /processes endpoint and you GET the definition of a process from the /processes/{processId} endpoint. There is currently no way through the API to get the "description" (i.e. the application package) of a process but perhaps the endpoint @fmigneault proposed (/processes/{processId}/package) would suffice ... or maybe (/packages/{processId}) would be better.

@pvretano
Copy link
Contributor Author

pvretano commented Mar 1, 2022

@jerstlouis GET gets the definition of a process. There is no way to GET the description of the process where description is the definition PLUS other information an OAProc endpoint needs to be able to actually deploy a process. The description of the process is what we call the application package.

Is everyone in agreement with this terminology?

@jerstlouis
Copy link
Member

@pvretano It's the other way around :)

You POST a definition at /processes, and you GET a description at /processes/{processId}.

We don't yet have a GET operation for retrieving the definition, but I had suggested GET /processes/{processid}/workflow and Francis implemented GET /processes/{processId}/package so that it is not specific to workflows.

How about GET /processes/{processId}/definition or GET /processes/{processId}/executionUnit ?

@fmigneault
Copy link
Contributor

fmigneault commented Mar 1, 2022

@jerstlouis

I would like to avoid using the word description to refer to this and call that a definition, to avoid confusion with the process description returned by GET /processes/{processid} (which does not include the executionUnit), as I was suggesting to @pvretano that we change those instances where the word description is used in Part 2 to definition.

Fine with me.

I don't understand why you say that the Part 3 execution workflow does not have its workflow chain defined yet? How I understand it is that the Part 3 execution request workflow is the workflow chain.

At the moment the request is submitted with details on how to chain I/O, the Workflow is not yet defined from the point of view of the API. After the contents are parsed, some workflow definition can then be dumped to a file or database, or held in memory by the runner that will execute it; only then does the workflow exist. I'm just pointing out that when using a deployed workflow, the API doesn't even need to parse the payload, it is already aware of the full workflow definition. Because these different workflow interpretations happen at different times, it is important to properly identify them, to avoid the same confusion as for the process description/definition.

@pvretano
I personally prefer to have package under the /processes/{processId}/package because it is tightly coupled with the process.

@pvretano
Copy link
Contributor Author

pvretano commented Mar 1, 2022

Yikes. Stop! @jerstlouis @fmigneault Please chime in with ONE WORD answers. I don't want an essay! ;)
What do we call what you get from GET /processes/{processId}? A definition or a description? I call it a definition.
What do we call what you POST to /processes to deploy a process? A definition or a description? I call it a description.
What do we call what you would get from /processes/{processId}/package? A definition or a description? I call it a description.

@jerstlouis
Copy link
Member

jerstlouis commented Mar 1, 2022

@pvretano

GET /processes/{processId}? A description -- That's what it is called in Part 1.
POST /processes -- A definition.
GET /processes/{processId}/(package / definition / executionUnit) -- A definition.

@fmigneault
Copy link
Contributor

fmigneault commented Mar 1, 2022

What do we call what you get from GET /processes/{processId}? description
What do we call what you POST to /processes to deploy a process? description + executionUnit (or package :P) aka definition
What do we call what you would get from /processes/{processId}/package? executionUnit/package only

@pvretano
Copy link
Contributor Author

pvretano commented Mar 1, 2022

So, we GET a description and we POST a definition. I will update the terminology in Part 2 accordingly! OK?

@pvretano
Copy link
Contributor Author

pvretano commented Mar 1, 2022

Excellent! Progress ... :)

@pvretano
Copy link
Contributor Author

pvretano commented Mar 1, 2022

Created issue #282 to resolve the definition versus description terminology issue in part 2. Please review and add comments about the question I pose in #282. ... and make them SHORT comments please! ;)

@jerstlouis
Copy link
Member

jerstlouis commented Mar 2, 2022

@fmigneault About:

I also find that POSTing the "workflow chain" each time on the execution endpoint doesn't align with deploy/describe concepts. The whole point of deploy is to persist the process definition and reuse it. Part 3 redefines the workflow dynamically for each execution request, requiring undeploy/re-deploy or replace each time, to make it work with Part 2.

The MOAW workflows (Part 3 execution request-based workflow definitions) can either be used to define deployable workflows deployed with Part 2, or be executed in an ad-hoc manner by POSTing them to an execution end-point -- both options are possible (separate capabilities: a server could support either or both). Both ad-hoc execution and deployed workflows could also make sense with CWL and OpenEO process graphs.

Alternatively, if undeploy/re-deploy/replace is not done each time, and the "workflow chain" remains persisted, then why bother re-POSTing it again as in Part 3 instead of simply re-using the persisted definition? They are not complementary on that aspect.

Part 3 defines the "ac-hoc workflow execution" capability as a way to allow using pre-deployed (local and/or remote) processes (i.e. NestedProcess/RemoteProcess) and (local and/or remote) collections (i.e. CollectionInput/RemoteCollection) which does not require the client to have access to deploy new processes. With the CollectionOutput capability, even "ad-hoc workflow execution" can be POSTed only once, and data can be retrieved from it for many different regions / resolutions without having to POST the workflow for each process-triggering data request.

It is not exactly the same though. For Execution Workflow, we need to add more details such as the outputs in the nested process to tell which one to bubble up to the parent process input. It is not a "big change", but still a difference.

The selection of "outputs" is a capability already in the Core execution request. Nested processes is really the only extension for ad-hoc execution.

A pre-deployed/described Workflow would not need this information, since all details regarding the "workflow chain" already exist. Only in that case, the execution request is exactly the same syntax as for any process execution.

The DeployableWorkflows are what needs the wiring of inputs/outputs of the overall process being deployed to the inputs/outputs of the processes internally, so that is another extension specific to that capability.

Still in both cases, it's the exact same execution request schema with very specific extensions.

@fmigneault
Copy link
Contributor

@jerstlouis
You really need to explain how this deployment with "ad-hoc workflow execution" works, with a concrete example. I don't see how it can happen. If you POST on the execution endpoint (async or sync), you either receive the job status/location or the outputs directly. Where is the deployed workflow information? How do you provide details about which processID to deploy it as? How can the user making that request know where the deployed process is, to describe it or execute it again without re-POSTing the workflow?

@jerstlouis
Copy link
Member

jerstlouis commented Mar 2, 2022

@fmigneault

If you POST on the execution endpoint (async or sync), you either receive the job status/location or the outputs directly

Correct, plus Part 3 introduces the CollectionOutput and LandingPageOutput execution modes returning a collection description and landing page respectively (with client then triggering processing via data access requests, e.g. Coverages or Tiles).
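As a sketch of the CollectionOutput case (collection id, extent and links are placeholders; the exact response schema is not settled here), the response could be a regular collection description whose data access links trigger the processing on demand:

{
    "id": "workflow-result",
    "title": "Virtual collection backed by the workflow",
    "extent": { "spatial": { "bbox": [ [ -180, -90, 180, 90 ] ] } },
    "links": [
        {
            "rel": "http://www.opengis.net/def/rel/ogc/1.0/coverage",
            "href": "https://example.com/ogcapi/collections/workflow-result/coverage"
        },
        {
            "rel": "items",
            "href": "https://example.com/ogcapi/collections/workflow-result/items"
        }
    ]
}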

Where is the deployed workflow information?

I think we are lost in terminology again, because what I mean by "ad-hoc workflow execution" (POSTing directly to an execution end-point) is the polar opposite of "deployed workflow". However, in the case of CollectionOutput and LandingPageOutput, you could include a link to the "workflow definition" in the response. I imagine this link could also be included in the case of a job status / results document response.

How do you provide details about which processID to deploy it as?

The "ad-hoc workflow execution" is to avoid having to deploy it as a process. (e.g. there are fewer safety issue with executing already deployed processes vs. deploying new ones; or an EMS may only execute processes but not have ADES capabilities).

How can the user making that request know where the deployed process is, to describe it or execute it again without re-POSTing the workflow?

In the case of CollectionOutput and LandingPageOutput, the client just makes different OGC API data requests from the links in the response. In Sync / Async mode, the user cannot -- they need to submit another ad-hoc execution (that's why it's an ad-hoc execution: no need to deploy first).

Now in contrast to the "ad-hoc workflow execution", the "deployable workflow" is what you can deploy as a process, using Part 2. That can be done with CWL, or OpenEO, or a MOAW workflow (extended from the Processes - Part 1: Core execution request + nested processes + input/output wiring of the overall process to internal processes) in the execution unit. That execution unit can be included in an "OGC JSON Application Package", or be POSTed directly to /processes with its own Content-Type.
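A minimal sketch of such a Part 2 deployment with a MOAW execution unit (the wrapper fields loosely follow the application package structure, all identifiers are illustrative, and the { "input": ... } notation marking an open input of the deployed process is detailed further below):

{
    "processDescription": {
        "id": "deployed-workflow-example",
        "title": "Workflow deployed as a process"
    },
    "executionUnit": [
        {
            "unit": {
                "process": "https://example.com/ogcapi/processes/ProcessA",
                "inputs": {
                    "data": {
                        "process": "https://example.com/ogcapi/processes/ProcessB",
                        "inputs": { "source": { "input": "deployedInput1" } }
                    }
                }
            }
        }
    ]
}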

Does that make things more clear?

@fmigneault
Copy link
Contributor

@jerstlouis
It brings some clarifications but there are still some items I'm not sure to understand.

So if I follow correctly, the deployment of this ad-hoc workflow could be defined and referenced by a link for description provided in LandingPageOutput, but there is no methodology or schema provided by Part 3 to indicate how this deployment would be done, nor even what a MOAW workflow definition would look like? (note: I don't consider the payload in the execution body a definition itself because it employs values, which cannot be deployed as is to create a process description with I/O types. It's more like making use of the definition, but it would be wrong to have specific execution values in the process description).

If that is the case, I don't think it is fair to say "The MOAW workflows [...] can either be used to define deployable workflows" if an example of a workflow definition inferred from the execution chain is not provided. It seems to contradict "they need to submit another ad-hoc execution". What would a MOAW workflow even look like then when calling GET /processes/{processId}/definition with application/moaw+json? Does it even make sense to have something returned by that request, since it is effectively ignored and re-submitted with a potentially different ad-hoc execution workflow?

@jerstlouis
Copy link
Member

jerstlouis commented Mar 2, 2022

@fmigneault

What would a MOAW workflow even look like then when calling GET /processes/{processId}/definition with application/moaw+json

It would look like the Part 1 execute request, with the following two extensions:

  • The execute request itself is a valid input, and a "process": (processURL) property is added (NestedProcess)
  • An "input" : (idHere) property is added to input, and an "output" : (idHere) property is added to an individual output. This allows to wire the inputs/outputs of the overall process (the black box) to the internal processes inside the box (the constituents processes). This extension only applies to Deployable Workflows, not to ad-hoc workflows which do not leave any open input/output.

I don't consider the payload in the execution body a definition itself because it employs values, which cannot be deployed as is to create a process description with I/O types.

I am not sure I understand your view on this... If you consider a workflow with a single hop, it is identical to a Processes - Part 1: Core execution request. If you have 1 nested process, the top-level process receiving the workflow acts as a Processes - Part 1: Core client with that nested process. So since it works for one hop, why wouldn't it work with any number of hops?

since it is effectively ignored and re-submitted with a potentially different ad-hoc execution workflow?

I don't understand what you mean by this... It seems like you might possibly be mixing up the execution request invoking the blackbox process vs. the execution request defining the workflow that invokes processes internally (not the blackbox process). Could that be the case?

@fmigneault
Copy link
Contributor

fmigneault commented Mar 2, 2022

It is not really about the number of hops. There is no issue with the quantity of nested processes or how they connect to each other.
The issue is about the content of the execution payload.

When the ad-hoc workflow is submitted for execution, the values are embedded in the body (this is fine in itself, no problem).
Very simplified:

{
    "process": "url-top-most",
    "inputs": {
        "input-1": {
            "process": "url-nested",
            "inputs": {
                "some-input": "<some-real-data-here raw|href|collection|...>"
            },
            "outputs": { "that-one": {} }
        }
    }
}

The problem happens when trying to explain the behaviour between Part 2 and Part 3. The above payload is not a direct definition.

Let's say there was a way for the user to indicate they want that exact chain to be process mychain (i.e. POSTing it on /processes) and to deploy it with Part 2 using the MOAW format. The bodies returned by GET /processes/mychain and GET /processes/mychain/definition + application/moaw+json can do 1 of 2 things:

  1. both substitute "<some-real-data-here raw|href|collection|...>" with some { "schema": ... } object to make it a generic input type of the process that can be called with alternative values on the execution endpoint.
    The process description of the full workflow should only indicate "some-input" under "inputs", since this is the only value that can be provided; the others are enforced by the workflow.
  2. the process definition enforces those specific values (workflow not tweakable), but then "some-input" CANNOT be an input listed in process description since it cannot be provided at the execution endpoint.

If this mychain process cannot be executed directly with just { "some-input" : "my-alternative-data" }, but must instead provide the above payload entirely again, then Part 2 deploy using MOAW has no reason to exist. Deploying a Part 3 workflow brings nothing new, because it is resolved ad-hoc on the execution endpoint.

@jerstlouis
Copy link
Member

jerstlouis commented Mar 2, 2022

@fmigneault

If that example workflow is intended to be a DeployableWorkflow, and "some-input" is an input parameter left open to be specified when executing mychain, then it should use the "input": ... extension intended for that as I described above:

{
    "process": "url-top-most",
    "inputs": {
        "input-1": {
            "process": "url-nested",
            "inputs": {
                "some-input": { "input": "myChainInput1" }
            },
            "outputs": { "that-one": { "output": "myChainOutput1" } }
        }
    }
}

That wires the "myChainInput1" input of the myChain blackbox to the "some-input" of the "url-nested" internal process (and same for output).

A process description for myChain can be fully inferred from this, at least in terms of inputs / outputs (but things like title and description cannot be inferred without providing more details).
The process description for myChain will list as inputs "myChainInput1" and as outputs "myChainOutput1".
The type of "myChainInput1" can be inferred from the type of "some-input" in the url-nested's process description, since that is where it is used.
The type of "myChainOutput1" can be inferred from the type of "that-one" in the url-nested's process description, since that is where it is used.

This is a DeployableWorkflow, so nothing to do with the "ad-hoc workflow execution" (which does not leave any input/output open, but would provide values for all inputs).

And to clarify again ad-hoc workflow stands in opposition to deployed workflow:

  • Deployable workflow: POST to /processes with Part 2 (some inputs usually left to be supplied)
  • Ad-hoc workflows: POST to /processes/{processId}/execution where {processId} is the top-level process (or a single workflow execution end-point and top-level "process" property is required) -- values provided for all inputs in execution request

Does that help?

@fmigneault
Copy link
Contributor

Yes, that helped a lot.

My following question is not about if process description can or cannot be inferred (it definitely can), but rather which approach between (1) and (2) in #279 (comment) must be undertaken?

Is it safe to say that if "input" : "myChainInput1" is specified, then the process description would become (case 1):

{
   "id": "myChain",
   "inputs": { 
       "myChainInput1": { "schema" :  { "type": "string (a guess from 'some-input')" } }
   }, 
   "outputs": {
       "myChainOutput1": { "schema": { "type": "string (a guess from 'that-one')" } }
    }
}

But if "input" : "myChainInput1" was omitted (case 2), then the above process description would instead have {"inputs": {}} (ie: the execution request does not take any input, all is constant) ?

Also to make sure, would "outputs" of myChain also contain the outputs of "process": "url-top-most" (not explicitly listed)? Otherwise what was the point to execute this parent process in the workflow chain?

I think DeployableWorkflow and "ad-hoc workflow execution" could be considered as a whole, because I could take advantage of the similar structure to do both deploy+execute using this:

        "inputs": {
            "some-input": { "input" : "myChainInput1 (for deploy)", "value": "<some-data> (for execute)" }
        }

Mapping from/to MOAW/CWL would then be very much possible.
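For instance, a rough sketch of how the myChain example could map to a CWL Workflow (step identifiers, the references used for run, and the guessed input/output types are illustrative only):

{
    "cwlVersion": "v1.0",
    "class": "Workflow",
    "inputs": { "myChainInput1": "string" },
    "outputs": {
        "myChainOutput1": { "type": "File", "outputSource": "nested/that-one" }
    },
    "steps": {
        "nested": {
            "run": "url-nested",
            "in": { "some-input": "myChainInput1" },
            "out": [ "that-one" ]
        },
        "top-most": {
            "run": "url-top-most",
            "in": { "input-1": "nested/that-one" },
            "out": [ ]
        }
    }
}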

@jerstlouis
Copy link
Member

jerstlouis commented Mar 2, 2022

Is it safe to say that if "input" : "myChainInput1" is specified, then the process description would become (case 1):
But if "input" : "myChainInput1" was omitted (case 2), then the above process description would instead have {"inputs": {}} (ie: the execution request does not take any input, all is constant) ?

Correct, but then the workflow is not really intended to be deployed as a process as it does not accept any input. It would make more sense as an ad-hoc workflow execution, or POSTed as a persistent virtual collection to /collections instead.

Also to make sure, would "outputs" of myChain also contain the outputs of "process": "url-top-most" (not explicitly listed)? Otherwise what was the point to execute this parent process in the workflow chain?

My thinking (which is relatively recent since I realized we were missing this "output" while working out this thread's scenarios) is that if any "output" is specified, then there are no implied outputs. If no "output" is specified, then the top-level process's outputs are implied.

You are right that the top-level process would be pointless in this case, so for the example to make sense we should also specify another "output" from url-top-most, which would be a second output from mychain.
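For example (a sketch; the url-top-most output identifier "result" is hypothetical), the deployable workflow would then expose two outputs:

{
    "process": "url-top-most",
    "outputs": { "result": { "output": "myChainOutput2" } },
    "inputs": {
        "input-1": {
            "process": "url-nested",
            "inputs": { "some-input": { "input": "myChainInput1" } },
            "outputs": { "that-one": { "output": "myChainOutput1" } }
        }
    }
}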

I think DeployableWorkflow and "ad-hoc workflow execution" could be considered as a whole, because I could take advantage of the similar structure to do both deploy+execute using this:

Well yes the MOAW syntax is the same in both cases, and much the same as Part 1 as well -- re-usability was definitely the goal.

Mapping from/to MOAW/CWL would then be very much possible.

Awesome :)

         "inputs": {
            "some-input": { "input" : "myChainInput1 (for deploy)", "value": "<some-data> (for execute)" }
        }

Would that ever happen in the same workflow though?
I would think that you either deploy or execute...
At the point where you execute the deployed workflow, you replace the "input" by the "value".

With CollectionInput, "collection": (collectionURL) is also a placeholder for different pieces of data sourced from that collection at different resolutions and areas of interest, using any API+formats combination supported by both ends of the hop.

@m-mohr
Copy link

m-mohr commented Mar 14, 2022

@pvretano I've just seen the MulAdd example in the tiger team recordings. I think it would be a good first step to translate that into openEO to see how it compares. Can you point me to the example? I can't really read the URL in the video. Then I could do a quick crosswalk...

@jerstlouis
Copy link
Member

@m-mohr In the meantime, for Mul and Add processes taking two operand inputs value1 and value2 it would look something like:

{
  "process": "https://example.com/ogcapi/processes/Mul",
  "inputs": {
   "value1": 10.2,
   "value2": {
      "process": "https://example.com/ogcapi/processes/Add",
      "inputs": {
         "value1": 3.14,
         "value2": 5.7
      }
    }
  }
}

@m-mohr
Copy link

m-mohr commented Mar 14, 2022

Thanks, @jerstlouis, but I was looking at another example from @pvretano which had a lot more metadata included. The full example from him would be better to crosswalk as it shows more details.
