File extension missing #68

Daniel-Mietchen · 2014-05-17T21:57:48Z

The media files to be embedded should have the correct file name suffix by default, so that edits like
https://en.wikisource.org/w/index.php?title=Biodiversity_Assessment_of_the_Fishes_of_Saba_Bank_Atoll%2C_Netherlands_Antilles&diff=4736919&oldid=4736916
are not necessary any more.

That would be a precursor to #8.

notconfusing · 2014-05-18T19:25:12Z

@Klortho @wrought is it possible to do this in the XSL transform? We currently do not modify the output of the transform, so in order to add file extensions we would have to start post-processing the XSL output in Python. That's fine if that's the only way, but it means the XSL conversion is no longer self-contained.

It is as simple as appending .jpg or .png, except that one has to check which of the two actually exists, since images are not guaranteed to come in any specific format.
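A minimal sketch of that existence check, assuming the image files sit unpacked in one directory next to the .nxml file. The function name `resolve_image_filename` and the candidate list are hypothetical and not taken from the project's code:

```python
import os

# Assumption: only these extensions are tried, in this order of preference.
CANDIDATE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def resolve_image_filename(href, directory, extensions=CANDIDATE_EXTENSIONS):
    """Return href plus the first extension for which a file exists, else None."""
    for ext in extensions:
        candidate = href + ext
        # Check on disk, since the format of each image is not known in advance.
        if os.path.isfile(os.path.join(directory, candidate)):
            return candidate
    return None
```

Returning `None` when nothing matches leaves it to the caller to decide whether to skip the figure or flag it for manual fixing.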

wrought · 2014-05-18T20:05:45Z

Here's a sample figure in the .nxml file:

<fig id="F1" orientation="portrait" position="float"><label>Figure 1.</label><caption><p>Phylogeny of the genus <italic><named-content content-type="taxon-name">Bassaricyon</named-content></italic>. Phylogeny generated from the concatenated <italic>CHRNA1</italic> and cytochrome <italic>b</italic> sequences. All analyses consistently recovered the same relationships with high support. Divergence dating was generated in BEAST; bars show the 95% confidence interval at each node. Branches without support are collapsed and outgroup clades have been collapsed, leaving monophyletic groupings with 100% support. Data for <italic>CHRNA1</italic> are missing for <italic><named-content content-type="taxon-name">Bassaricyon gabbii</named-content></italic>, for which DNA was extracted from a museum skull. All nodes in <italic><named-content content-type="taxon-name">Bassaricyon</named-content></italic> have 1.00 Bayesian posterior probability, except the split between <italic><named-content content-type="taxon-name">Bassaricyon gabbii</named-content></italic> and <italic><named-content content-type="taxon-name">Bassaricyon alleni</named-content></italic>/<italic><named-content content-type="taxon-name">Bassaricyon medius</named-content></italic> (0.97 Bayesian posterior probability). Non-focal and outgroup taxa are shaded in gray, <italic><named-content content-type="taxon-name">Bassaricyon</named-content></italic> species and subspecies are color coded, samples of <italic><named-content content-type="taxon-name">Bassaricyon medius medius</named-content></italic> and <italic><named-content content-type="taxon-name">Bassaricyon neblina neblina</named-content></italic> that were collected within 5 km of each other in Ecuador are shaded.</p></caption><graphic xlink:href="ZooKeys-324-001-g001"/></fig>

Most importantly, the <graphic> element:

<graphic xlink:href="ZooKeys-324-001-g001"/>

As you can see, there is no extension stored in this element for this version of the article. However, image files are provided with the rest of the tarball where we find the .nxml file. So, we have some options:

1. Find out whether there is a standard convention, and use the jats-to-mediawiki XSLT to append .jpg or whatever the standard may be.
2. Implement post-processing in Python, of either the article XML or the MediaWiki markup generated by the XSLT, that checks the tarball files for the correct corresponding file extension (e.g. .jpg, .jpeg or .png).
   - Currently we don't handle any post-processing of the text, so this would be a start, which already breaks convention.
3. Use whatever OAMI does to guess/assign file extensions, or subvert the filenames and file extensions that come with the article XML to use whatever OAMI uses.
4. Update manually.
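To make the post-processing option more concrete, here is a rough sketch that rewrites each `<graphic>` href in the article XML using the list of file names found in the tarball. It assumes the XML declares the standard xlink namespace and parses cleanly with ElementTree. The function name `add_graphic_extensions` and the set of accepted extensions are made up for illustration:

```python
import os
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
# Assumption: only these image formats are considered.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def add_graphic_extensions(nxml_text, member_names):
    """Append the matching file extension to each <graphic xlink:href>."""
    # Map extension-less base name -> full file name, for image files only.
    by_stem = {}
    for name in member_names:
        stem, ext = os.path.splitext(os.path.basename(name))
        if ext.lower() in IMAGE_EXTENSIONS:
            by_stem.setdefault(stem, stem + ext)

    root = ET.fromstring(nxml_text)
    for graphic in root.iter("graphic"):
        href = graphic.get(XLINK_HREF)
        # Leave the href untouched when no matching image file was shipped.
        if href in by_stem:
            graphic.set(XLINK_HREF, by_stem[href])
    return root
```

The `member_names` argument could come from `tarfile.open(path).getnames()`. The same lookup would work on the generated MediaWiki markup instead, at the cost of matching file names in text rather than in parsed XML.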

Daniel-Mietchen · 2014-05-27T21:24:46Z

I think we should go with OAMI here.

Klortho · 2014-05-28T12:52:57Z

I agree with that.

erlehmann · 2014-05-28T12:55:31Z

Can someone elaborate on why file extensions are needed on MediaWiki?

erlehmann · 2014-05-28T12:56:55Z

From OAMI source code:

        #TODO: file extension should be adapted for other file formats
        url_path = urlparse.urlsplit(material.url).path
        source_filename = url_path.split('/')[-1]
        assert(mimetype in ('audio', 'video'))
        if mimetype == 'audio':
            extension = 'oga'
        elif mimetype == 'video':
            extension = 'ogv'
        wiki_filename = path.splitext(source_filename)[0] + '.' + extension

notconfusing · 2014-07-03T18:47:01Z

OK, I've updated our code to search only for .jpg and .png files, per the advice above.

Daniel-Mietchen added this to the Phase 1A - Wikisource & Commons milestone May 17, 2014

Daniel-Mietchen added feature request labels May 17, 2014

Daniel-Mietchen assigned wrought May 17, 2014

wrought changed the title ~~File name suffix missing~~ File extension missing May 18, 2014

notconfusing closed this as completed Jul 3, 2014
