-
Notifications
You must be signed in to change notification settings - Fork 7
Support using local files if appropriately configured. #35
Conversation
In discussion with others two issues have been raised with this approach.
Because these issues require developers to be highly aware of the possible issues and program defensively around them downstream (i.e. in the task) this approach adds a configuration option only allowing this if it is enabled. I take this to be used as a guard rail to protect the unsuspecting against potentially "dangerous" behavior. Is that a fair assessment? If so, i have concerns with this approach. It introduces cognitive and deploy-time complexity to support a particular implementation of this local file checking optimization. Getting data in and out of Girder is very task and project dependent. The purpose of the transforms is to provide an API that Girder Worker can rely on; the intention has always been that projects would implement their own transforms if they didn't fall into common use cases (e.g. download a file, download all the files in an item etc). It was my hope that checking for a local file was going to have a clean implementation that could take place behind the API without developers needing to be aware of it. Smarter folks than me have pointed out that's probably not going to be the case which is why i think this transform should be moved into the large image code base where it can implement a solution that works best for large image. |
In regards to extensions, girder_worker_utils is already failing to keep them. In the older core downloads, girder_worker downloaded files from girder and gave them the original name of the file (perhaps slightly sanitized). girder_worker_utils downloads to a name which is the file id, losing this information. I can't see how (1) is any worse than this -- for imported files it is markedly better. In regard to (2), yes, a file could be maliciously or accidentally mutated if the deployment has opted in to use direct files. This functionality was available in the girder_worker / girder_io plugin. If each project has to implement this sort of functionality separately, it seems a shame; large_image isn't the only project that could benefit from direct file access. This will lead to an additional repository ( |
Would a possible solution be to remove the setting and have another transform instead of adding functionality to |
I'd be happy to move it to another transform, though I would suggest still leaving the setting -- it may be deployment specific if direct files can or should be used. Currently, we have opt-in settings on both Girder and Girder-worker. |
@jeffbaumes my concern is creating another dumping ground for project specific code at varying levels of deprecation that get picked up by other projects who are under the impression that they are stable, useful, and well supported. I would much rather projects implement their own transforms and then after several projects have explored the problem and demonstrated the utility of an approach discuss upstreaming into this repository. @manthey makes my point exactly. The currently implemented transforms were originally intended as examples, not as a set of backwards compatible implementations or as a set of stable, useful well supported transforms. Their purpose was to exercise the API and help shake out the first set of problems. Despite this people have already integrated them into projects meaning they now have costs associated with "breaking" compatibility. |
Why have this repository at all if it isn't to have transforms useful to multiple projects? If these are only examples, this repo should have been named something different -- We have multiple projects that have illustrated the utility of this already (in the old form). How many projects have to find utility before it is allowed? |
So for one, this repository contains more than just transforms. And of course its goal is to have simple, clear transforms that are useful to multiple projects. Dumping a bunch of stuff in here of varying utility, rife with wildly different expectations about downstream behavior doesn't seem like the right way to achieve that goal. I'm hesitant to support any design decisions around the old style of girder worker as evidence for wide adoption of a particular implementation of a feature. the old style of girder worker was riddled with intolerable hacking designed to accommodate a fundamentally bad architecture. I'm not arguing that the feature isn't useful (at least not at the moment). I'm arguing that its implementation is non-trivial and requires making opinionated choices - as even a brief discussion with other engineers has born out. Finally, its i'm not suggesting that there is some kind of counting game for upstreaming transforms, only that there should be consensus on an approach, which there clearly isn't. |
While that's all well and good it doesn't help us move forward. I'd like to suggest as a compromise a separate transform that accepts a keyword argument |
One idea that comes to mind is a contrib module which can host experimental
transforms for downstreams to use. This could serve as an incubator for
converging around an official supported API for transforms, while allowing
downstreams to benefit from DRY code.
…On Tue, Oct 23, 2018, 8:08 AM Christopher Kotfila ***@***.***> wrote:
While that's all well and good it doesn't help us move forward. I'd like
to suggest as a compromise a separate transform that accepts a keyword
argument local_file in the transform() function it checks for
local_file's existence and if it's readable returns that instead of
downloading the file. Determining whether to use this transform, including
whether that is based on a run-time configuration should be handled in the
context that instantiates the transform. Is that a workable path forward?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#35 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAlH5Wlb2vv5QHx-91uVBV-wB625fW8uks5unzDigaJpZM4Xwk-I>
.
|
Here is the approach i'd like to propose. Where something like For example: large_image code: # Utility managed somewhere inside large_image codebase
def get_transform(file_id):
local_file_path = None
if Setting().get(PluginSettings.DIRECT_PATH):
try:
local_file_path = File().getLocalFilePath(
File().load(file_id, force=True))
except FilePathException:
pass
if local_file_path is None:
return GirderFileId(file_id)
else:
return GirderFileIdCheckLocal(file_id, local_path=local_file_path) contrib code: # Class committed to a contrib module in girder_worker_utils
class GirderFileIdCheckLocal(GirderFileId):
"""
This transform downloads a Girder File to the local machine and passes its
local path into the function.
:param _id: The ID of the file to download.
:type _id: str
"""
def __init__(self, _id, local_path=None, **kwargs):
super(GirderFileIdCheckLocal, self).__init__(**kwargs)
self.local_path = local_path
def transform(self):
if self.local_path is not None and\
os.path.isfile(self.local_path):
return self.local_path
return super(GirderFileIdCheckLocal, self).transform()
This way the decision to use local file checking is managed purely through the producer and there is no need to have additional configuration on the girder worker side. This also mostly removes the need for girder/girder_worker#321 (with the exception of the tmp_dir stuff) I know this doesn't address the temporary directory issue but that's somewhat related to what's happening in #37 and honestly it would be nice if we could integrate some of the work @zachmullen and @manthey are doing before moving forward with that feature |
The rationale for having a setting on the worker side of this, is that you could have multiple workers where one worker could (and should) use a direct path, and one worker can't or shouldn't. As an explicit example, in testing, I've had computer A run Girder and Girder Worker (and clearly have a shared file system), while, to test scaling, have had computer B run Girder Worker. Girder Worker A should use a direct path, and Girder Worker B shouldn't. Probably a worker that shouldn't use a direct path wouldn't be able to reach the direct path, so perhaps this is worrying about something that isn't necessary. |
It seems like the code to check whether direct paths are allowed should be in the same repo as the transform that can pass along the direct path. Moving it to the large_image code base would mean also moving it to slicer_cli_web, gaia, etc., and, I assume, removing the setting from the remote worker plugin and adding it to each of those plugins. It seems like that will make things more complicated to maintain and more complex to deploy. |
It isn't necessary to have the setting at all. Each project can decide for themselves whether or not they want to use a transform that includes local file checking. Local file checking costs nothing (from a performance perspective) compared to network transfer. That's why my original position was to include it in Great, lets put it in a contrib module. Opting into this behavior is now a matter of using one transform or the other. Considering the fact that the transform i've proposed falls back to default |
Some deployments can use direct paths, some can't. It isn't project specific, it is deployment specific. For instance, if we want security against mutability, a deployment with read-only mounts on Girder Worker could use this safely, but one with r/w mounts could be set to not use it. |
For a contrib module, would having it be at |
I would put it in |
There isn't a CONTRIBUTING file in this repo -- what is the purpose of this |
@zachmullen It is not documented because it is something being introduced as a compromise in this PR. @manthey Mutability and extensions are an issue for the tasks that use the transform. If a task never tries to mutate a file read only or r/w mounting is irrelevant. If you really want to protect against mutability wouldn't it be better to require the file to be readonly before returning the path? Wouldn't it be better to just check if there is an extension if your task needs an extension? |
Sorry, I had missed @danlamanna 's comment:
This is fine, I just want to make sure this is documented, since the expectations of a "contrib" package are not universally known. |
2b84e4f
to
ba8ac80
Compare
Add tests for the GirderFileId transform and for the new contrib.GirderFileIdAllowDirect transform.
ba8ac80
to
0f416e4
Compare
@kotfic I've refactored this so that it is in the contrib package and uses an environment variable rather than a girder_worker setting. I also added a test for the transform (and for the existing GirderFileId transform). The tests for this transform mock Girder's file model; perhaps there is a better way to do that (without pulling in Girder). |
from girder.models.file import File | ||
from girder.exceptions import FilePathException | ||
self.file_id = _id | ||
file = File().load(self.file_id, force=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One approach would be to take this logic out of the transform and optionally accept a path to check on the consumer side. While this has the disadvantage of replicating a few lines of code in the places where the transform is used, it has the upside of making the transform much more widely applicable (including in contexts outside of Girder), and significantly simplifying the transform and the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing as this transform module is named girder_io
, the transform is not intended to be used outside of Girder. If someone needs that, they can write a different module. Transforms for Girder should be useful in that context, and not add more burden and repetition than necessary. I'd rather have this once than in multiple places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, using this inside or outside the context of Girder wasn't really the point I was trying to make.
My primary technical concerns with the PR as-is are:
- import guarding inside
__init__
is likely to lead to unexpected behavior and other design problems that increase complexity (e.g. the test mocking). - In many contexts this is going to be called from an environment where the file model is already loaded (e.g. a request handler) and so, as implemented, in those cases there is going to have to be a second trip to the database.
From a more philosophical position:
- the File() object is only available in the context of girder-as-application while transforms are intended to be a set of library utilities with minimal dependencies. By tucking imports and the File() object into the
__init__
you're relying on the existence of a request context, a database connection and a whole load of other mechanics. Ultimately it subverts an otherwise clean separation between library and application.
With regard to (3) I know the GirderClientTransform
class already breaks this separation. To be honest I'm increasingly regretting the decision to implement the transform that way, not only because now the API is fixed across a number of projects that are using it, but also because it appears to have led people to believe that it was a good idea. I am of the opinion that it was not and I want to prevent further spread of the approach.
The purpose of the transforms are to collect state from the producer and make that state available to the consumer so the task can ignore the I/O details. The initialization of the transform is the mechanism by which that state should be collected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, in the existing girder_io transforms, each and every transform that uses GirderClientTransform
instantiates a separate instance of the girder client, so if you need two files, you have two clients. Worse than what you point out in (2), the transform creates an instance of girder client even when it isn't necessary, and then a typical task creates another instance of girder client to return data to girder.
We don't actually want separation and isolation, otherwise we would serialize an entire file and pass that to girder_worker and not use the client at all. I'd argue we want less separation -- to have at most one girder_client when getting multiple files from the same instance of Girder with the same credentials and to reuse that client when returning data to Girder.
As you said, I thought the purpose of the transforms was to collect state from the producer. I think the other purpose of a transform is to provide utility -- the producer shouldn't have to replicate the same code over and over again if it is always going to be used by the transform. Do you recommend that the transform be passed the File model and the File document? Or that there is a separate helper function somewhere else getDataForGirderFileAllowDirectTransform
which returns the id, name, and path (if available)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If your transform doesn't need a girder client it shouldn't derive from GirderClientTransform
Tasks should rarely ever create a girder client - they should return whatever values make sense for the domain of that task. Those values should be handled by a ResultsTransform
unless they need to use a girder client in which case they should use a GirderClientResultsTransform
Girder Client's maintain a handful of strings as their state, instantiating multiple clients represents negligible overhead and provides the capability for different transforms to have different levels of access. Moreover, keeping transforms separate prevents having to implement complex, highly coupled, logic in the worker to either implicitly or explicitly manage the shared context between transforms. If there is a more compelling reason than instantiation of multiple girder clients to take on that kind of maintenance burden i'm open to it.
Maintaining discipline with regard to the scope of girder worker transforms means developers can rely on a consistent understanding of the role these transforms play in the architecture of the larger system. While that may make this transform marginally less useful it makes all transforms more comprehensible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zachmullen yeah basically -
__init__
is called in the producer (usually a request thread or an event binding)transform
is called in the consumer (the worker process)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactor the transform to avoid using the File
model is fine, but it isn't useful by itself. We then want a utility function that can be called to get the transform without having to do the same work in each of the callers. To this end, we could create a girder_worker_utils.girder_utils
module that would contain something like:
from girder.exceptions import FilePathException
from girder.models.file import File
from ..transforms.contrib.girder_io import GirderFileIdAllowDirect
def paramsForGirderFileIdAllowDirectTransform(file):
try:
local_path = File().getLocalFilePath(file)
except FilePathException:
local_path = None
return {'_id': str(file['_id']), 'name': file['name'], 'local_path': local_path}
def girderFileAllowDirect(file, **kwargs):
params = kwargs.copy()
params.update(paramsForGirderFileIdAllowDirectTransform(file))
return GirderFileIdAllowDirect(**params)
Since the GirderFileIdAllowDirect
transform is of no use directly, anything using it would instead of doing
from girder_worker_utils.transforms.contrib import GirderFileIdAllowDirect
inputpath=GirderFileIdAllowDirect(file_id)
you would do
from girder_worker_utils.girder_utils import girderFileAllowDirect
inputpath=girderFileAllowDirect(file)
If you don't think utilities to help communicate between Girder and Girder Worker belong in girder_worker_utils
, what common library do you suggest instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I think I've made it clear that I feel the two or three lines of code belongs in the endpoints that are using the transform. if I had to suggest a location where utilities for communication between Girder and Girder Worker belong it would be in girder_worker.girder_plugin
package - keeping in mind that that package currently maintains a lot of legacy code that is likely to be removed once the girder_worker.run
code is removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While having the code outside of this library reduces the maintenance burden of this library, it increases the cognitive burden of using it and the maintenance burden of all the places that use it. I disagree with this, but will do so in the interests of actually getting something done.
Also, this is the first place where you've suggested where you think such code aught to exist, but then immediately caution that that location is the wrong place, too. This suggests the design of this library and girder worker are ill-conceived, as there seems to be no decent way to not replicate code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point and a larger topic that would be good for a planning meeting discussion IMO.
- What should go in what library?
- What is the purpose of girder_worker_utils, in particular why isn't a lot of this stuff just part of the girder_worker library?
My understanding was that the |
@zachmullen from my perspective this belongs in the contrib module because it is easy for a developer to accidentally use it in a way that corrupts the girder database. That isn't really an excuse to subvert the basic scope and expectations of transforms. I might normally agree that we could "wash our hands" of it except that this is going to be immediately used across a bunch of different projects and so in-so-far as this is going to be the repository that coordinates use across those projects, i want to enforce a clean separation between library and application. |
@kotfic that makes sense to me. And I get the concern for wanting to formally separate what is happening on the Girder side from what is happening on the worker side. The de facto separation is by the fact that After thinking about this more, I'm inclined to agree with you that we should move the path computation logic out of the transform and instead pass the path in. Right away, it would be useful for those cases where someone wants to pass a symlinked version of the file that preserves the actual name/extension of the file. There could be more things like that which we haven't even considered, and externalizing that logic means that won't be an issue. On the other subject: I'm not that worried about the fact that a worker task might corrupt an assetstore file; to me that's a deploy time concern that can be achieved by the following
Something like imported files might confound that completely, which only strengthens the argument for moving the file path logic out of the transform -- it could get complicated, and different downstreams might want to implement different behaviors for the odd cases. |
789b972
to
d579dd1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM other than the typo
def _allowDirectPath(self): | ||
""" | ||
Check if the worker environment permits direct paths. This just checks | ||
if the environment variable GW_DIRECT_PATHS is set to a non-empry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
empty*
""" | ||
if os.environ.get('GW_DIRECT_PATHS'): | ||
return True | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could just be return bool(os.environ.get('GW_DIRECT_PATHS'))
, but it's up to you which you like better.
Thanks for slugging through this with me @manthey. I'm sorry it took so long to get in. I really appreciate you staying with it despite the disagreements |
This requires girder/girder_worker#321.