Skip to content
This repository has been archived by the owner on Nov 7, 2024. It is now read-only.

Support using local files if appropriately configured. #35

Merged
merged 4 commits into from
Nov 15, 2018

Conversation

manthey
Copy link
Member

@manthey manthey commented Oct 19, 2018

This requires girder/girder_worker#321.

@kotfic
Copy link
Contributor

kotfic commented Oct 23, 2018

In discussion with others two issues have been raised with this approach.

  1. The assetstore blobs files not maintaining extensions.
  2. Assetstore file paths which could be opened by the task and then mutated

Because these issues require developers to be highly aware of the possible issues and program defensively around them downstream (i.e. in the task) this approach adds a configuration option only allowing this if it is enabled. I take this to be used as a guard rail to protect the unsuspecting against potentially "dangerous" behavior. Is that a fair assessment?

If so, i have concerns with this approach. It introduces cognitive and deploy-time complexity to support a particular implementation of this local file checking optimization. Getting data in and out of Girder is very task and project dependent. The purpose of the transforms is to provide an API that Girder Worker can rely on; the intention has always been that projects would implement their own transforms if they didn't fall into common use cases (e.g. download a file, download all the files in an item etc).

It was my hope that checking for a local file was going to have a clean implementation that could take place behind the API without developers needing to be aware of it. Smarter folks than me have pointed out that's probably not going to be the case which is why i think this transform should be moved into the large image code base where it can implement a solution that works best for large image.

@manthey
Copy link
Member Author

manthey commented Oct 23, 2018

In regards to extensions, girder_worker_utils is already failing to keep them. In the older core downloads, girder_worker downloaded files from girder and gave them the original name of the file (perhaps slightly sanitized). girder_worker_utils downloads to a name which is the file id, losing this information. I can't see how (1) is any worse than this -- for imported files it is markedly better.

In regard to (2), yes, a file could be maliciously or accidentally mutated if the deployment has opted in to use direct files. This functionality was available in the girder_worker / girder_io plugin. If each project has to implement this sort of functionality separately, it seems a shame; large_image isn't the only project that could benefit from direct file access. This will lead to an additional repository (girder_worker_more_utils?) that contain such functionality so multiple projects can use them without unnecessary cross dependencies.

@jeffbaumes
Copy link
Member

Would a possible solution be to remove the setting and have another transform instead of adding functionality to GirderFileId? Something like GirderUnsafeDirectFileId? An appropriate name would cause the user to lookup in what ways it could be "unsafe". This type of transform seems appropriate as a "util": huge potential performance gains at the tradeoff of no data immutability guarantee.

@manthey
Copy link
Member Author

manthey commented Oct 23, 2018

I'd be happy to move it to another transform, though I would suggest still leaving the setting -- it may be deployment specific if direct files can or should be used. Currently, we have opt-in settings on both Girder and Girder-worker.

@kotfic
Copy link
Contributor

kotfic commented Oct 23, 2018

@jeffbaumes my concern is creating another dumping ground for project specific code at varying levels of deprecation that get picked up by other projects who are under the impression that they are stable, useful, and well supported. I would much rather projects implement their own transforms and then after several projects have explored the problem and demonstrated the utility of an approach discuss upstreaming into this repository.

@manthey makes my point exactly. The currently implemented transforms were originally intended as examples, not as a set of backwards compatible implementations or as a set of stable, useful well supported transforms. Their purpose was to exercise the API and help shake out the first set of problems. Despite this people have already integrated them into projects meaning they now have costs associated with "breaking" compatibility.

@manthey
Copy link
Member Author

manthey commented Oct 23, 2018

Why have this repository at all if it isn't to have transforms useful to multiple projects? If these are only examples, this repo should have been named something different -- girder_worker_transform_examples, perhaps. The name of the repo implies that this is where one should place utilities useful to girder_worker (and it is required by girder_worker, implying that these aren't just examples).

We have multiple projects that have illustrated the utility of this already (in the old form). How many projects have to find utility before it is allowed?

@kotfic
Copy link
Contributor

kotfic commented Oct 23, 2018

So for one, this repository contains more than just transforms. And of course its goal is to have simple, clear transforms that are useful to multiple projects. Dumping a bunch of stuff in here of varying utility, rife with wildly different expectations about downstream behavior doesn't seem like the right way to achieve that goal.

I'm hesitant to support any design decisions around the old style of girder worker as evidence for wide adoption of a particular implementation of a feature. the old style of girder worker was riddled with intolerable hacking designed to accommodate a fundamentally bad architecture. I'm not arguing that the feature isn't useful (at least not at the moment). I'm arguing that its implementation is non-trivial and requires making opinionated choices - as even a brief discussion with other engineers has born out. Finally, its i'm not suggesting that there is some kind of counting game for upstreaming transforms, only that there should be consensus on an approach, which there clearly isn't.

@kotfic
Copy link
Contributor

kotfic commented Oct 23, 2018

While that's all well and good it doesn't help us move forward. I'd like to suggest as a compromise a separate transform that accepts a keyword argument local_file in the transform() function it checks for local_file's existence and if it's readable returns that instead of downloading the file. Determining whether to use this transform, including whether that is based on a run-time configuration should be handled in the context that instantiates the transform. Is that a workable path forward?

@danlamanna
Copy link
Member

danlamanna commented Oct 23, 2018 via email

@kotfic
Copy link
Contributor

kotfic commented Oct 25, 2018

Here is the approach i'd like to propose. Where something like get_transform is defined and used inside large_image and the GirderFileIdCheckLocal (please a better name!) is committed to a contrib module under girder_io

For example:

large_image code:

# Utility managed somewhere inside large_image codebase
def get_transform(file_id):

    local_file_path = None
    if Setting().get(PluginSettings.DIRECT_PATH):
        try:
            local_file_path = File().getLocalFilePath(
                File().load(file_id, force=True))
        except FilePathException:
            pass

    if local_file_path is None:
        return GirderFileId(file_id)
    else:
        return GirderFileIdCheckLocal(file_id, local_path=local_file_path)

contrib code:

# Class committed to a contrib module in girder_worker_utils
class GirderFileIdCheckLocal(GirderFileId):
    """
    This transform downloads a Girder File to the local machine and passes its
    local path into the function.
    :param _id: The ID of the file to download.
    :type _id: str
    """
    def __init__(self, _id, local_path=None, **kwargs):
        super(GirderFileIdCheckLocal, self).__init__(**kwargs)
        self.local_path = local_path
        
    def transform(self):
        if self.local_path is not None and\
           os.path.isfile(self.local_path):
            return self.local_path
        return super(GirderFileIdCheckLocal, self).transform()
        

This way the decision to use local file checking is managed purely through the producer and there is no need to have additional configuration on the girder worker side. This also mostly removes the need for girder/girder_worker#321 (with the exception of the tmp_dir stuff)

I know this doesn't address the temporary directory issue but that's somewhat related to what's happening in #37 and honestly it would be nice if we could integrate some of the work @zachmullen and @manthey are doing before moving forward with that feature

@manthey
Copy link
Member Author

manthey commented Oct 25, 2018

The rationale for having a setting on the worker side of this, is that you could have multiple workers where one worker could (and should) use a direct path, and one worker can't or shouldn't. As an explicit example, in testing, I've had computer A run Girder and Girder Worker (and clearly have a shared file system), while, to test scaling, have had computer B run Girder Worker. Girder Worker A should use a direct path, and Girder Worker B shouldn't.

Probably a worker that shouldn't use a direct path wouldn't be able to reach the direct path, so perhaps this is worrying about something that isn't necessary.

@manthey
Copy link
Member Author

manthey commented Oct 25, 2018

It seems like the code to check whether direct paths are allowed should be in the same repo as the transform that can pass along the direct path. Moving it to the large_image code base would mean also moving it to slicer_cli_web, gaia, etc., and, I assume, removing the setting from the remote worker plugin and adding it to each of those plugins. It seems like that will make things more complicated to maintain and more complex to deploy.

@kotfic
Copy link
Contributor

kotfic commented Oct 25, 2018

It isn't necessary to have the setting at all. Each project can decide for themselves whether or not they want to use a transform that includes local file checking.

Local file checking costs nothing (from a performance perspective) compared to network transfer. That's why my original position was to include it in GirderFileId, As others have pointed out it makes some opinionated choices so it probably doesn't belong in GirderFileId.

Great, lets put it in a contrib module. Opting into this behavior is now a matter of using one transform or the other. Considering the fact that the transform i've proposed falls back to default GirderFileId behavior then as long as a project is ok with the choices the local file checking transform has made (e.g. potentially no extension, file mutability) then they should just use that transform. That is a design decision, not a runtime decision. The setting is unnecessary in both the girder and girder worker contexts.

@manthey
Copy link
Member Author

manthey commented Oct 25, 2018

Some deployments can use direct paths, some can't. It isn't project specific, it is deployment specific. For instance, if we want security against mutability, a deployment with read-only mounts on Girder Worker could use this safely, but one with r/w mounts could be set to not use it.

@manthey
Copy link
Member Author

manthey commented Oct 25, 2018

For a contrib module, would having it be at girder_worker_utils/contrib/girder_io.py be the appropriate spot? Or would you prefer a different location/structure?

@kotfic
Copy link
Contributor

kotfic commented Oct 25, 2018

I would put it in girder_worker_utils/transforms/contrib/girder_io.py

@zachmullen
Copy link
Member

There isn't a CONTRIBUTING file in this repo -- what is the purpose of this contrib package? Seems like it should be documented somewhere.

@kotfic
Copy link
Contributor

kotfic commented Oct 25, 2018

@zachmullen It is not documented because it is something being introduced as a compromise in this PR.

@manthey Mutability and extensions are an issue for the tasks that use the transform. If a task never tries to mutate a file read only or r/w mounting is irrelevant. If you really want to protect against mutability wouldn't it be better to require the file to be readonly before returning the path? Wouldn't it be better to just check if there is an extension if your task needs an extension?

@zachmullen
Copy link
Member

Sorry, I had missed @danlamanna 's comment:

One idea that comes to mind is a contrib module which can host experimental
transforms for downstreams to use. This could serve as an incubator for
converging around an official supported API for transforms, while allowing
downstreams to benefit from DRY code.

This is fine, I just want to make sure this is documented, since the expectations of a "contrib" package are not universally known.

@manthey manthey force-pushed the support-local-files branch 4 times, most recently from 2b84e4f to ba8ac80 Compare November 6, 2018 14:46
Add tests for the GirderFileId transform and for the new
contrib.GirderFileIdAllowDirect transform.
@manthey manthey force-pushed the support-local-files branch from ba8ac80 to 0f416e4 Compare November 6, 2018 15:22
@manthey
Copy link
Member Author

manthey commented Nov 6, 2018

@kotfic I've refactored this so that it is in the contrib package and uses an environment variable rather than a girder_worker setting. I also added a test for the transform (and for the existing GirderFileId transform). The tests for this transform mock Girder's file model; perhaps there is a better way to do that (without pulling in Girder).

from girder.models.file import File
from girder.exceptions import FilePathException
self.file_id = _id
file = File().load(self.file_id, force=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One approach would be to take this logic out of the transform and optionally accept a path to check on the consumer side. While this has the disadvantage of replicating a few lines of code in the places where the transform is used, it has the upside of making the transform much more widely applicable (including in contexts outside of Girder), and significantly simplifying the transform and the tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing as this transform module is named girder_io, the transform is not intended to be used outside of Girder. If someone needs that, they can write a different module. Transforms for Girder should be useful in that context, and not add more burden and repetition than necessary. I'd rather have this once than in multiple places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, using this inside or outside the context of Girder wasn't really the point I was trying to make.

My primary technical concerns with the PR as-is are:

  1. import guarding inside __init__ is likely to lead to unexpected behavior and other design problems that increase complexity (e.g. the test mocking).
  2. In many contexts this is going to be called from an environment where the file model is already loaded (e.g. a request handler) and so, as implemented, in those cases there is going to have to be a second trip to the database.

From a more philosophical position:

  1. the File() object is only available in the context of girder-as-application while transforms are intended to be a set of library utilities with minimal dependencies. By tucking imports and the File() object into the __init__ you're relying on the existence of a request context, a database connection and a whole load of other mechanics. Ultimately it subverts an otherwise clean separation between library and application.

With regard to (3) I know the GirderClientTransform class already breaks this separation. To be honest I'm increasingly regretting the decision to implement the transform that way, not only because now the API is fixed across a number of projects that are using it, but also because it appears to have led people to believe that it was a good idea. I am of the opinion that it was not and I want to prevent further spread of the approach.

The purpose of the transforms are to collect state from the producer and make that state available to the consumer so the task can ignore the I/O details. The initialization of the transform is the mechanism by which that state should be collected.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, in the existing girder_io transforms, each and every transform that uses GirderClientTransform instantiates a separate instance of the girder client, so if you need two files, you have two clients. Worse than what you point out in (2), the transform creates an instance of girder client even when it isn't necessary, and then a typical task creates another instance of girder client to return data to girder.

We don't actually want separation and isolation, otherwise we would serialize an entire file and pass that to girder_worker and not use the client at all. I'd argue we want less separation -- to have at most one girder_client when getting multiple files from the same instance of Girder with the same credentials and to reuse that client when returning data to Girder.

As you said, I thought the purpose of the transforms was to collect state from the producer. I think the other purpose of a transform is to provide utility -- the producer shouldn't have to replicate the same code over and over again if it is always going to be used by the transform. Do you recommend that the transform be passed the File model and the File document? Or that there is a separate helper function somewhere else getDataForGirderFileAllowDirectTransform which returns the id, name, and path (if available)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If your transform doesn't need a girder client it shouldn't derive from GirderClientTransform

Tasks should rarely ever create a girder client - they should return whatever values make sense for the domain of that task. Those values should be handled by a ResultsTransform unless they need to use a girder client in which case they should use a GirderClientResultsTransform

Girder Client's maintain a handful of strings as their state, instantiating multiple clients represents negligible overhead and provides the capability for different transforms to have different levels of access. Moreover, keeping transforms separate prevents having to implement complex, highly coupled, logic in the worker to either implicitly or explicitly manage the shared context between transforms. If there is a more compelling reason than instantiation of multiple girder clients to take on that kind of maintenance burden i'm open to it.

Maintaining discipline with regard to the scope of girder worker transforms means developers can rely on a consistent understanding of the role these transforms play in the architecture of the larger system. While that may make this transform marginally less useful it makes all transforms more comprehensible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zachmullen yeah basically -

  • __init__ is called in the producer (usually a request thread or an event binding)
  • transform is called in the consumer (the worker process)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor the transform to avoid using the File model is fine, but it isn't useful by itself. We then want a utility function that can be called to get the transform without having to do the same work in each of the callers. To this end, we could create a girder_worker_utils.girder_utils module that would contain something like:

from girder.exceptions import FilePathException
from girder.models.file import File
from ..transforms.contrib.girder_io import GirderFileIdAllowDirect


def paramsForGirderFileIdAllowDirectTransform(file):
    try:
        local_path = File().getLocalFilePath(file)
    except FilePathException:
        local_path = None
    return {'_id': str(file['_id']), 'name': file['name'], 'local_path': local_path}


def girderFileAllowDirect(file, **kwargs):
    params = kwargs.copy()
    params.update(paramsForGirderFileIdAllowDirectTransform(file))
    return GirderFileIdAllowDirect(**params)

Since the GirderFileIdAllowDirect transform is of no use directly, anything using it would instead of doing

from girder_worker_utils.transforms.contrib import GirderFileIdAllowDirect
inputpath=GirderFileIdAllowDirect(file_id)

you would do

from girder_worker_utils.girder_utils import girderFileAllowDirect
inputpath=girderFileAllowDirect(file)

If you don't think utilities to help communicate between Girder and Girder Worker belong in girder_worker_utils, what common library do you suggest instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I think I've made it clear that I feel the two or three lines of code belongs in the endpoints that are using the transform. if I had to suggest a location where utilities for communication between Girder and Girder Worker belong it would be in girder_worker.girder_plugin package - keeping in mind that that package currently maintains a lot of legacy code that is likely to be removed once the girder_worker.run code is removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While having the code outside of this library reduces the maintenance burden of this library, it increases the cognitive burden of using it and the maintenance burden of all the places that use it. I disagree with this, but will do so in the interests of actually getting something done.

Also, this is the first place where you've suggested where you think such code aught to exist, but then immediately caution that that location is the wrong place, too. This suggests the design of this library and girder worker are ill-conceived, as there seems to be no decent way to not replicate code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point and a larger topic that would be good for a planning meeting discussion IMO.

  • What should go in what library?
  • What is the purpose of girder_worker_utils, in particular why isn't a lot of this stuff just part of the girder_worker library?

@zachmullen
Copy link
Member

My understanding was that the contrib package here was meant to indicate that its contents are somewhat experimental, or may be less maintained and/or stable than the other stuff. I think the transform as written qualifies for that concept (I know of at least two active projects that have explicitly requested this capability). Hence, to me that is equivalent to the question of whether the contrib package itself belongs in this library or somewhere else. I'm personally fine with it here, but I think it demands documentation about its level of maintainedness and stability, just some sort of "use at your own risk" disclaimer that lets us wash our hands of some of the risk/cost associated with features like this that are not as pure or may not prove to be good ideas in the long run.

@kotfic
Copy link
Contributor

kotfic commented Nov 14, 2018

@zachmullen from my perspective this belongs in the contrib module because it is easy for a developer to accidentally use it in a way that corrupts the girder database. That isn't really an excuse to subvert the basic scope and expectations of transforms. I might normally agree that we could "wash our hands" of it except that this is going to be immediately used across a bunch of different projects and so in-so-far as this is going to be the repository that coordinates use across those projects, i want to enforce a clean separation between library and application.

@zachmullen
Copy link
Member

@kotfic that makes sense to me. And I get the concern for wanting to formally separate what is happening on the Girder side from what is happening on the worker side. The de facto separation is by the fact that __init__ runs on the producer and transform runs on the consumer, but I can see why you don't consider that a clean separation in this case, and I guess the real argument might be that dependency injection / separation of concerns is a generally good goal.

After thinking about this more, I'm inclined to agree with you that we should move the path computation logic out of the transform and instead pass the path in. Right away, it would be useful for those cases where someone wants to pass a symlinked version of the file that preserves the actual name/extension of the file. There could be more things like that which we haven't even considered, and externalizing that logic means that won't be an issue.

On the other subject: I'm not that worried about the fact that a worker task might corrupt an assetstore file; to me that's a deploy time concern that can be achieved by the following

  1. Set the file creation mode on the assetstore in question to 0640. (This is a configurable field of filesystem assetstores.)
  2. Run the workers as a different user, but one that is still in the primary group of the girder server user.

Something like imported files might confound that completely, which only strengthens the argument for moving the file path logic out of the transform -- it could get complicated, and different downstreams might want to implement different behaviors for the odd cases.

@manthey manthey force-pushed the support-local-files branch from 789b972 to d579dd1 Compare November 15, 2018 14:19
Copy link
Member

@zachmullen zachmullen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than the typo

def _allowDirectPath(self):
"""
Check if the worker environment permits direct paths. This just checks
if the environment variable GW_DIRECT_PATHS is set to a non-empry
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty*

"""
if os.environ.get('GW_DIRECT_PATHS'):
return True
return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could just be return bool(os.environ.get('GW_DIRECT_PATHS')), but it's up to you which you like better.

@kotfic
Copy link
Contributor

kotfic commented Nov 15, 2018

Thanks for slugging through this with me @manthey. I'm sorry it took so long to get in. I really appreciate you staying with it despite the disagreements

@manthey manthey merged commit d36e09b into master Nov 15, 2018
@manthey manthey deleted the support-local-files branch November 15, 2018 18:55
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants