Harmonize image file naming between full-text imports to Wikisource and file import to Commons #8

Closed
Daniel-Mietchen opened this issue Feb 18, 2014 · 6 comments

Comments

@Daniel-Mietchen
Member

Both
https://github.com/konrad/JATS-to-Mediawiki
and
https://github.com/erlehmann/open-access-media-importer
generate file names for images and supplementary files, but the rules used therein are not fully compatible.

Part of #7

@notconfusing added this to the Phase 1 - Wikisource & Selected Articles milestone May 5, 2014
@notconfusing
Member

@DCoetzee, do you have any advice on this? Basically, there is one bot, "Open Access Media Importer" (OAMI), that uploads media to Commons under one naming scheme. We are creating a new bot, "Citation Bot", that may upload what would be duplicates of OAMI's uploads. It would be nice if the naming convention could be changed to suit Citation bot rather than OAMI, since Citation bot uploads journal articles to Wikisource that expect a specific file naming convention, whereas whatever uses the OAMI-uploaded files is not name-sensitive. Moving files, however, requires admin rights. What to do?

@Daniel-Mietchen
Member Author

The issue of duplicates is not so much between OAMI (which uploads only audio and video files) and the Citation bot (which will mainly do images), but between Citation bot's uploads and images that have already been uploaded manually by Commons users.

To give an idea of the scale that we are talking about, there are currently ca. 25k files from Open Access scholarly sources (cf. http://tools.wmflabs.org/glamtools/glamorous.php?doit=1&category=Open+access+%28publishing%29&use_globalusage=1&ns0=1&depth=5 ), of which 16k are audio/video files (basically all from OAMI): http://tools.wmflabs.org/glamtools/glamorous.php?doit=1&category=Uploaded+with+Open+Access+Media+Importer&use_globalusage=1&ns0=1 . This leaves ca. 9k image files that have been uploaded manually.

With Citation bot, we are planning - in the long run - to upload thousands of full-text articles onto Wikisource and their respective images to Commons, of which a fair share will be amongst the 9k (or more by then) that are already there, so we will have to detect duplicates in one way or another - either before upload to avoid duplicates, or after upload to mark duplicates for deletion. The problem is that there is no good way to capture duplicates with all the possible modifications that may have been made on the way (examples in https://meta.wikimedia.org/wiki/Research:Committee/Areas_of_interest/Open-access_policy/Request_for_Information_on_Public_Access_to_Peer-Reviewed_Scholarly_Publications_Resulting_From_Federally_Funded_Research#Additional_comments ).
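For the "before upload" route, one cheap first pass would be to look a file up on Commons by its SHA-1 hash. The sketch below is only an illustration under stated assumptions: the function names are made up for this example, and it relies on the `aisha1` parameter of the MediaWiki API's `list=allimages` query. It only builds the lookup URL and does not send the request. It would catch byte-identical copies only, so files that were cropped, re-encoded or otherwise modified along the way would still slip through.

```python
import hashlib
from urllib.parse import urlencode

# Public API endpoint of Wikimedia Commons.
COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def sha1_hex(data: bytes) -> str:
    """Return the hexadecimal SHA-1 digest of the file contents."""
    return hashlib.sha1(data).hexdigest()


def duplicate_lookup_url(data: bytes) -> str:
    """Build (but do not send) an API query for files with the same SHA-1.

    Hypothetical helper for illustration; a non-empty "allimages" list in
    the JSON response would indicate a byte-identical file already exists.
    """
    params = {
        "action": "query",
        "list": "allimages",
        "aisha1": sha1_hex(data),
        "format": "json",
    }
    return COMMONS_API + "?" + urlencode(params)
```

A bot could run this check on each image before upload and skip or flag any file for which the query returns a match.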

This issue #8 is about harmonizing the way that OAMI and Citation bot name the image and media files they upload.

@Daniel-Mietchen
Member Author

Reopening, assuming that @notconfusing had just hit the wrong button when closing.

@notconfusing
Member

@Daniel-Mietchen, I copied the way that OAMI does naming at the moment and re-uploaded our 11-article test set. It looks like it works, so we have a prototype for this now. It can always be changed.
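To illustrate what harmonizing means in practice: both tools would need to derive a file name from the same metadata through the same rule, so that the name referenced in the Wikisource text matches the name of the file on Commons. The sketch below uses a made-up scheme (title, label, DOI) purely as a placeholder. It is not OAMI's actual rule, and the function name, the length limit and the set of replaced characters are all assumptions.

```python
import re

# Assumed cap on the title part; not taken from either tool.
MAX_TITLE_CHARS = 120


def _clean(text: str) -> str:
    """Replace characters that are awkward in wiki file names, collapse spaces."""
    text = re.sub(r"[#<>\[\]|{}:/\\]", "-", text)
    return re.sub(r"\s+", " ", text).strip()


def harmonized_filename(article_title: str, label: str, doi: str, extension: str) -> str:
    """Hypothetical shared naming rule: '<title> - <label> - <doi>.<ext>'.

    If both the full-text importer and the media importer call the same
    function with the same metadata, they produce the same name.
    """
    title = _clean(article_title)[:MAX_TITLE_CHARS].rstrip()
    ext = extension.lower().lstrip(".")
    return f"{title} - {_clean(label)} - {_clean(doi)}.{ext}"
```

Whatever the real rule is, keeping it in one shared function (or one documented specification) would avoid the two code bases drifting apart again.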

@notconfusing
Member

The filenames are being harmonized now. Sometimes files with the same filename differ in image size, but that's a different bug: wpoa/JATS-to-Mediawiki#20 (comment)
