Opening a GATE XML document results in the wrong document name #62
Comments
This has always been the case; it's the difference between using a datastore (which is conceptually saving a document and then letting you re-load the same document) vs using GATE XML (where you're saving a representation of a document and then later creating a new document from that saved representation). No other document format changes the document name as part of the parse process, indeed by the time the doc format kicks in mid way through
Hmmmmm, I'm all conflicted now as to what it should do. My problem with the current approach is that we always claim that GATE XML is a lossless representation of the document, but it turns out that when we save in GATE XML we throw away the document name (so it isn't actually available when we reload anyway). I'm tempted to say that we should save the name and restore it when using GATE XML so that we are truly lossless, but I can see that might open up a whole other can of worms.
I see where you're coming from, and GATE XML is a lot closer to being lossless now than it used to be, but there are still some things it can't represent. Object graphs are one example: two different annotations with features whose values refer to the same Java object will each end up pointing to a separate instance of the object when re-loaded.
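To make the object-graph point concrete, here is a minimal sketch of the same effect outside GATE. It uses Python and JSON purely as stand-ins (assumptions for illustration, not GATE XML or the GATE API): two annotation-like records share one object, and a save/load cycle that writes each feature value out separately preserves the values but not the shared identity.

```python
import json

# One object referenced from the features of two annotation-like records.
# The record layout here is hypothetical, not GATE's data model.
shared = ["a", "shared", "value"]
annotations = [
    {"id": 1, "features": {"ref": shared}},
    {"id": 2, "features": {"ref": shared}},
]

# Before saving, both features point at the very same object.
same_before = annotations[0]["features"]["ref"] is annotations[1]["features"]["ref"]

# Round trip through a text format that serialises each value independently.
reloaded = json.loads(json.dumps(annotations))
ref1 = reloaded[0]["features"]["ref"]
ref2 = reloaded[1]["features"]["ref"]

equal_after = ref1 == ref2   # the values survive the round trip
same_after = ref1 is ref2    # the shared identity does not
```

After the round trip the two features compare equal but are distinct objects, so mutating one no longer affects the other.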
I do not understand what this is about: isn't the name of a document meta-information rather than content, and so not something that should be saved as part of the document content in the first place? By default the name inside GATE reflects what the name on the file system was, but this is a convention that could be changed by other ways of reading in a representation of a document, and the name can be changed after the document has been loaded.
Maybe, but in theory so are the document features, and we save those (including things like source URL and mimetype, which I think will also be wrong). I spotted this because I changed the names from the filenames to something sensible and that info was lost. If the name isn't saved then we shouldn't claim the GATE XML format is lossless, as I think a lot of people would assume that includes its name. If we aren't saving the name (except in datastores, which see less and less use as we use larger and larger corpora), what's really the point in allowing people to change the name at all?
I disagree: lossless compression of a file saves and restores the content of the file but does not influence your choice of how to name the uncompressed file. That is a feature, not a bug. I think the question about changing the name inside GATE is important: in most GUIs the name of a loaded file always reflects the file and is immutable (and also does NOT contain some auto-generated stuff at the end). But one could argue that allowing the name to be changed inside GATE can be useful in situations where one wants to organize things better. What I think is relevant here is how the GATE GUI should map not just the file name to a document when loading, but also the document name to a file name when saving. Most editors do this properly but GATE does not. So if you rename the document and then save it, the convention from other editors would be to offer the new name as the default when saving.
That assumes that you view a GATE document as having a one-to-one relationship with a file, and I would claim that in many cases that isn't true; tweets, for example. Yes, when you reload a GATE XML document there is a one-to-one relationship, but that is not always the case. As I said, I'm now really not sure what it should do. All I know is that we have an API that allows us to set the name of a Document object, and a method of saving that object which we claim is lossless but which doesn't store the complete set of information about the object. To me that seems an odd situation to be in.
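The situation described above can be sketched with a toy model. This is not the GATE API or the GATE XML format: `ToyDocument`, `save_xml`, `load_xml` and the XML layout are all hypothetical names made up for illustration. The sketch only shows how a save that omits the name forces the loader to fall back to the file name, while a save that stores it can restore it.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a document: a name, text, and features."""
    name: str
    text: str
    features: dict = field(default_factory=dict)


def save_xml(doc: ToyDocument, store_name: bool) -> str:
    """Serialise the document; the name is written only if store_name is True."""
    root = ET.Element("ToyDocument")
    if store_name:
        root.set("name", doc.name)
    feats = ET.SubElement(root, "Features")
    for key, value in doc.features.items():
        ET.SubElement(feats, "Feature", key=key).text = str(value)
    ET.SubElement(root, "Text").text = doc.text
    return ET.tostring(root, encoding="unicode")


def load_xml(xml: str, filename: str) -> ToyDocument:
    """Rebuild a document; fall back to the file name when no name was stored."""
    root = ET.fromstring(xml)
    name = root.get("name") or PurePosixPath(filename).name
    features = {f.get("key"): f.text for f in root.iter("Feature")}
    return ToyDocument(name, root.findtext("Text") or "", features)


original = ToyDocument("Wikipedia document Barack Obama", "Some text.", {"lang": "en"})

# Name not stored: the reloaded document is named after the file.
lossy = load_xml(save_xml(original, store_name=False), "corpus/doc1.xml")
# Name stored: the reloaded document keeps the name set before saving.
kept = load_xml(save_xml(original, store_name=True), "corpus/doc1.xml")
```

In both cases the text and features come back intact; only the name differs, which is the gap between "lossless content" and "lossless object" that the thread is arguing about.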
Hmm, I think I understand this a bit better now and I think I am starting to agree with you. We really have two completely different kinds of names here:
So if we already stored the intrinsic name, we could have a document stored as doc1.xml that has the intrinsic name doc2.xml. Or one with extrinsic name doc1.xml that has the intrinsic name "Wikipedia document Barack Obama". But getting rid of either of the two can be extremely confusing: ignoring the extrinsic name, loading "doc1.xml" and getting doc2.xml would be terrible. But as you said, setting the intrinsic name to "Wikipedia document Barack Obama" before saving and then getting back doc1.xml because we ignore the intrinsic name is also confusing. So I think we are really mixing up two entirely different kinds of names here, aren't we? (I was only talking about the extrinsic name previously because this is the only thing other editors care about.) The whole matter gets even more complicated in GATE since sometimes those random hex strings get appended to the name, making the extrinsic and intrinsic names different even when they could easily be identical. I think we have a number of options for dealing with the fact that there are really two names: e.g. always force the intrinsic one to mirror the extrinsic one (if there is an extrinsic one), have the GUI use and show both (e.g. one as a tooltip), always ignore one or the other, etc. I am not sure how confusing the one I probably like best personally (have the GUI show and use both names) would be for the average user. I think exactly the same problem exists with any Resource that can be serialised: e.g. a pipeline has some intrinsic name and can be saved to a file with a completely different name (I always disliked this and tried to manually make them match to avoid confusion).
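The options listed above can be written down as a small policy function. This is only a sketch: it reads "extrinsic" as the file name and "intrinsic" as the name held by the document itself, as the examples in the comment suggest, and `NamePolicy` and `display_name` are hypothetical names, not part of GATE.

```python
from enum import Enum
from typing import Optional


class NamePolicy(Enum):
    """Hypothetical policies for reconciling the two kinds of name."""
    MIRROR_EXTRINSIC = "mirror"      # file name wins whenever there is one
    INTRINSIC_ONLY = "intrinsic"     # ignore the file name entirely
    SHOW_BOTH = "both"               # show both, e.g. label plus tooltip


def display_name(intrinsic: str, extrinsic: Optional[str], policy: NamePolicy) -> str:
    """Return the name a GUI would show for a resource under the given policy."""
    if policy is NamePolicy.MIRROR_EXTRINSIC:
        return extrinsic if extrinsic is not None else intrinsic
    if policy is NamePolicy.INTRINSIC_ONLY:
        return intrinsic
    # SHOW_BOTH: only add the file name when it carries extra information.
    if extrinsic is None or extrinsic == intrinsic:
        return intrinsic
    return f"{intrinsic} ({extrinsic})"


stored = "Wikipedia document Barack Obama"
mirrored = display_name(stored, "doc1.xml", NamePolicy.MIRROR_EXTRINSIC)
intrinsic_only = display_name(stored, "doc1.xml", NamePolicy.INTRINSIC_ONLY)
both = display_name(stored, "doc1.xml", NamePolicy.SHOW_BOTH)
no_file = display_name(stored, None, NamePolicy.MIRROR_EXTRINSIC)
```

A document with no file behind it (a tweet pulled from an API, say) has no extrinsic name, which is why the mirroring policy has to fall back to the intrinsic one.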
Having thought about this a little further, I think what annoys me the most is the inconsistency. The name of a GATE resource is set through a param in I'm aware that not every LR can hold the name through a save/load cycle, but when it could, I think we should support that, if for no other reason than consistency. I can understand that the approach may lead to weird things happening where the name in GATE looks like a filename but doesn't reflect the name of the file on disc, but that doesn't really bother me. My view is that we should actually avoid making the names of resources look like filenames by default, as it's misleading, especially as in many cases they get out of sync as soon as you do a save/load cycle anyway. You also already have the sourceUrl document feature that shows you where the document was originally created from, although again this is weird with GATE XML as it shows the original URL and not the URL from which it was loaded on this occasion (should they differ). The mime type document feature is nearly always wrong on GATE XML documents as well.
Thinking even further, we already break the link between a document object and the file from which it came by not having a "save" option, only "save as". Even if that option does offer the original filename, semantically it's not the same as simply saving back to where the document came from.
And of course, even more confusingly, documents in a corpus which are stored as part of a pipeline are reloaded with the correct name if it has been changed.
For what it is worth, the Python gatenlp package and the formats implemented by the Format_Bdoc plugin now treat the name of the document as a property of the document (and not as meta-information about the document), so the name does get saved and restored, no matter what file name is used for storing/loading.
If you rename a document to something sensible, save the document as GATE XML and then reload it, the sensible name disappears and you get the name generated from the filename again.