
Automating the collection of examples #94

Closed
Daniel-Mietchen opened this issue Jun 5, 2015 · 4 comments

Comments

@Daniel-Mietchen
Member

So far, most of the examples we have discussed have been identified manually. I am thinking about a systematic approach to collecting examples for sets of tags that we consider.

One way to go about that would be to mine PMC's OA Subset (which can be downloaded in bulk) for uses of specific tags and to condense that (perhaps along with any manually provided examples from outside PMC's OA Subset) into some basic usage patterns (think tag-level dialects) that we could use as a basis for discussing best practices and distilling recommendations.

I can think of a number of effects that this may have:

  1. If we compile these stats on a regular basis, we can track the evolution of tagging patterns and use that to
    • monitor uptake of JATS4R recommendations
    • identify cases where JATS4R recommendations may be useful or in need of revision
  2. It would be possible to create mappings between certain dialects and the JATS4R recommendations for the corresponding tags, such that non-compliant articles could be more easily rendered, analyzed or otherwise used than they can now. Some of these mappings could possibly be crowdsourced (e.g. image-only formulas might be transcribable using CAPTCHA-like mechanisms in places frequented by TeX-savvy users).
  3. The error messages in our schematrons could then point to those tag-level dialects and our accompanying annotations as to why they are compliant with our recommendations or not. This would help inform and educate about tagging standards in general and JATS4R in particular.

I have started to explore this, but the tools I know are not well suited to this kind of analysis on such a corpus (I am running a grep overnight!), so I would welcome your ideas in this regard.
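As a rough sketch of the tag-mining step described above, the following Python script streams every XML file under a corpus directory (e.g. an unpacked PMC OA Subset download) and tallies element and element/attribute usage. The directory layout is an assumption; adapt the path and glob pattern to however the bulk download is organized.

```python
import collections
import pathlib
import xml.etree.ElementTree as ET

def tally_tags(corpus_dir):
    """Count element occurrences and (element, attribute) pairs across *.xml files."""
    elements = collections.Counter()
    attributes = collections.Counter()
    for path in pathlib.Path(corpus_dir).rglob("*.xml"):
        # iterparse streams each document, so memory use stays flat even
        # on a corpus of millions of articles; unparseable files are skipped.
        try:
            for _, elem in ET.iterparse(path):
                elements[elem.tag] += 1
                for attr in elem.attrib:
                    attributes[(elem.tag, attr)] += 1
                elem.clear()  # free the subtree once it has been counted
        except ET.ParseError:
            continue
    return elements, attributes
```

Sorting the resulting counters (`elements.most_common()`) gives a first cut at the corpus-wide usage patterns; the same counts recomputed on successive OA Subset releases would support the longitudinal tracking in point 1.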

@Melissa37

That sounds brilliant

@jats-laura
Contributor

  1. If we compile these stats on a regular basis, we can track the evolution of tagging patterns and use that to
    • monitor uptake of JATS4R recommendations
    • identify cases where JATS4R recommendations may be useful or in need of revision

I don't think this is quite accurate. What you get from the PMC OA subset is what PMC normalized. Nothing in that subset is as it was delivered by the publisher...nothing. PMC converts every single XML document it receives to comply with PMC style...even those submitted to us in the JATS DTD.

If the recommendations are for a part of the document that PMC doesn't need to standardize for archiving or display purposes, then sure, we'll pass it through unchanged and you can monitor the uptake. But so far, I haven't seen much PMC wouldn't make some effort to standardize in our output, so I think all you'll really be monitoring is the degree of PMC's uptake.

@Daniel-Mietchen
Member Author

Good points, Laura. What about making more of those normalization steps (and the accompanying tools) public? At least for XML supplied by CC BY publishers, this would seem possible.

@hubgit
Member

hubgit commented Aug 5, 2015

If you can get it to run (it's 9 years old), Stefano Mazzocchi's Gadget is a nice tool for analysing elements/attributes and their contents in large quantities of XML (e.g. everything in the PMC OA Subset).
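If Gadget proves too old to run, the kind of report it produces can be approximated in a few lines of Python: for a chosen element, the distribution of values taken by one of its attributes across the corpus. The element and attribute names below are illustrative, not prescribed by the thread.

```python
import collections
import pathlib
import xml.etree.ElementTree as ET

def attribute_values(corpus_dir, tag, attr):
    """Distribution of `attr` values on `tag` elements across *.xml files."""
    values = collections.Counter()
    for path in pathlib.Path(corpus_dir).rglob("*.xml"):
        try:
            for _, elem in ET.iterparse(path):
                if elem.tag == tag and attr in elem.attrib:
                    values[elem.attrib[attr]] += 1
                elem.clear()
        except ET.ParseError:
            continue
    return values

# Example: how is @pub-type used on <pub-date> across the corpus?
# attribute_values("oa_subset/", "pub-date", "pub-type")
```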

@Melissa37 Melissa37 reopened this Jul 3, 2017