
Allow developers to create docker-based tasks through the API #21

Open

hobu opened this issue Jan 11, 2017 · 3 comments
Comments

@hobu
Member

hobu commented Jan 11, 2017

There are many situations where GRiD's API for easing the fetching of data is simply not going to be good enough. The most common is that people want to run things over very large volumes of data when the pipes to transmit that data are not big enough. Consider the following scenarios:

  1. Alpha organization wants to run an automated feature extraction algorithm, tuned with their own parameters and settings, over a city-sized region. At the moment, the only way for them to achieve that task is to fetch a city-sized region from GRiD, which is often hundreds of GB in size, and then run the algorithm themselves on their own servers. Any immediacy requirement in the mix means they end up replicating GRiD's entire data holdings to achieve these tasks rather than offloading them to GRiD as intended.

  2. Beta algorithm researcher wants to test variants of her algorithm on the same 125 GB patch. After a week of iterating through GRiD's API by continually pushing new iterations of the tool, the algorithm is verified to do what it says on the label, and the researcher now wants to let other GRiD users run her algorithm over their own 100+ GB patches of data.

  3. Gamma GRiD developer was tasked with integrating a one-off TDA for a small group of GRiD users.

I would like to propose the following additions to GRiD's APIs:

  • API users are able to create an "API task" that runs a docker container given the following inputs:

    • A JSON dictionary of key-value pairs of arguments
    • A single, Docker COPY'd, PDAL-readable point cloud file at a specific in-container path
    • A -v mounted volume, say /grid, that maps to the task's output directory
    • An ENTRYPOINT defined so that invoking the container runs the task with the JSON dictionary on stdin, stderr mapped for logging, and all output, including stdout-type stuff, written to /grid (a minimal sketch of both sides of this contract follows this list)
    • Security restrictions, e.g. the container has no network access, etc...
  • A per-GRiD-instance "App" registry is created that registered developers can push docker images to. Registered developers would obtain a certificate for access to this registry once they are verified and authenticated through their "application settings" page in GRiD. They push an image to the registry with a name that must match the app-id they created above.

  • An API to allow users to POST/PUT to the task endpoint with the name/id of an existing export file/content, the "API task" ID, and the JSON dictionary of arguments. This item then includes all the celery task tracking, queue management, logging, etc. needed to carry the task to completion or error. The same progress callbacks and other API decoration being added to standard GRiD tasks might also be useful here...

  • "Access tags for Apps" to allow administrators to grant execute permissions of "API tasks" to access tags. After the App is shared, GRiD proper would take a copy of the container and App and place it in its own registry to prevent unsupervised revoke/removal.

I would propose that our first implementation only support API use and consumption. API consumers would be on their own to manage the business logic of the contents of their JSON arguments. The pattern I'm proposing seems rather obvious, and I wonder whether there are existing implementations of it that have been thought through with more care.
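For illustration, consuming such an API from the client side might look like the following; the endpoint paths, field names, and token scheme are placeholders, not an existing GRiD API:

```python
# Hypothetical client-side submission of an "API task" against an
# existing export, followed by a status poll.  All URLs and field
# names are assumptions for the sake of the example.
import requests

GRID = "https://grid.example.com/api/v1"
headers = {"Authorization": "Token <api-key>"}

payload = {
    "app": "alpha-feature-extraction",   # name/id of the registered App
    "export": 12345,                     # id of an existing export file
    "args": {                            # JSON dictionary of arguments
        "resolution": 1.0,
        "classification": [2, 6],
    },
}

resp = requests.post(GRID + "/tasks/", json=payload, headers=headers)
resp.raise_for_status()
task = resp.json()

# poll the same task resource for celery-style progress updates
status = requests.get(GRID + "/tasks/%s/" % task["id"], headers=headers).json()
print(status.get("state"), status.get("progress"))
```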

An implementation of this mechanism would benefit GRiD in a number of important ways. It would further enhance the capabilities of self-service consumers of GRiD data. It would make it even more convenient for the GRiD team to integrate the typical "fetch data, do stuff to it, output data to the user" TDA-like tasks that we are often asked to integrate. Finally, it would open up GRiD's data access, task management, and cloud resources to a much wider audience.

I know there are tons of gotchas here, but I'm interested in hearing credible technical arguments for or against reorienting some of our architecture to support this mode of operation.

@chambbj
Member

chambbj commented Jan 11, 2017

👍 to providing this in some form.

FWIW, you can find some documentation on how DigitalGlobe does this with GBDX here. Not that you'd want to replicate it exactly, but it may raise some additional considerations.

@chambbj
Member

chambbj commented Jan 11, 2017

I think your third scenario is the one I'd always imagined. Although I don't think it has to be a "one-off TDA" or a "small group of GRiD users". I see this as just another means of allowing external developers to provide processing capabilities without requiring them to stand up their own external services. You are no longer forcing them into the PDAL box for integration into the GRiD export workflows either.

@AlexMountain
Contributor

What do you guys think about https://github.com/hydroshare/django_docker_processes ?

It looks like they've tackled a lot of the overhead of dealing with third-party docker containers, and it leverages an architecture similar to what GRiD already has. It does appear to do a bit more than we need, but would you guys say it's similar to what we're looking for here?
