Mozilla Open Science sprint #143

Daniel-Mietchen · 2015-06-04T13:11:39Z

We are taking part in the Mozilla Open Science sprint (overview) and welcome contributions to any of the software projects here at WikiProject Open Access, in particular to the YouTube exporter (#82) and the Open Access signalling project.

If you are interested in getting involved, please leave a note here, and we will take things from there.

Daniel-Mietchen · 2015-06-04T19:25:10Z

Here is what I plan to do: get an overview of all the <license> statements within the Open Access subset of the articles on PubMed Central.

I will update this comment as I move forward.

Day 1

Download the files that contain the XML
- took about half an hour
- total size: 15 GB
unpack: for f in *.tar.gz; do tar -zxvf "$f"; done
- took over an hour
- total size: 68 GB (after deleting the original gz files)
explored the files in various ways while looking for the use of elements like <permissions>, <inline-formula>, <disp-formula>, <fig>, <ref>, <subj-group>, <kwd-group>
search for license statements:
- grep -ohPR --include="*.nxml" "<license(.*)</license>" .
- with removal of duplicates: grep -ohPR --include="*.nxml" "<license(.*)</license>" | awk '!x[$0]++' > license-statements.txt
- running grep -oHPR --include="*.nxml" "<license(.*)</license>" > license-statements.txt overnight
- I am aware that xmlgrep would be more suited to this, but it's not available on that machine
- noticed a strange way to abbreviate Creative Commons licenses and notified publisher
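The difference between the -h and -H variants above matters for deduplication, and can be sketched on a tiny sample. This is a minimal illustration, assuming GNU grep built with PCRE support (-P); the directory layout, file names and license texts are made up for the example and are not taken from PubMed Central.

```shell
# Hypothetical sample corpus: three tiny .nxml files, two sharing a license.
work=$(mktemp -d)
mkdir -p "$work/journal"
printf '%s\n' '<article><license><p>CC BY</p></license></article>' > "$work/journal/a.nxml"
printf '%s\n' '<article><license><p>CC BY</p></license></article>' > "$work/journal/b.nxml"
printf '%s\n' '<article><license><p>CC0</p></license></article>' > "$work/journal/c.nxml"

pattern='<license(.*?)</license>'

# -h: bare matches, so awk can drop statements it has already seen.
unique_count=$(grep -ohPR --include="*.nxml" "$pattern" "$work" | awk '!x[$0]++' | wc -l | tr -d ' ')

# -H: every match carries its file name, so no two lines are identical.
prefixed_count=$(grep -oHPR --include="*.nxml" "$pattern" "$work" | awk '!x[$0]++' | wc -l | tr -d ' ')

echo "distinct statements with -h: $unique_count"
echo "distinct lines with -H: $prefixed_count"

rm -rf "$work"
```

With -H, the file names have to be stripped again before the awk filter can remove repeated statements.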

Day 2

the grep resulted in a license-statements.txt of over 370 MB, with license statements from over 700k nxml files (it is not clear why not from all of the roughly 800k files)
cut -d ":" -f 2- license-statements.txt | awk '!x[$0]++' | sort > license-statements-without-filenames.txt: removes the file names and deduplicates the license statements
- results in a 17 MB file with 4062 license statements that differ in their XML character sequence.
- needs cleanup
running grep -oHPR --include="*.nxml" "<license(.*?)</license>" > license-statements.txt overnight
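The cut and awk step above can be tried on a small stand-in for license-statements.txt. This is only a sketch: the three input lines are invented, and it assumes that the file names themselves contain no colon.

```shell
# Hypothetical stand-in for the grep -H output: "file name:license statement".
work=$(mktemp -d)
cat > "$work/license-statements.txt" <<'EOF'
./a/1.nxml:<license license-type="open-access"><p>CC BY</p></license>
./a/2.nxml:<license license-type="open-access"><p>CC BY</p></license>
./b/3.nxml:<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>CC BY 4.0</p></license>
EOF

# -f 2- keeps everything after the first colon, so colons inside the
# statement itself (xlink:href, http://) survive.
cut -d ":" -f 2- "$work/license-statements.txt" | awk '!x[$0]++' | sort \
  > "$work/license-statements-without-filenames.txt"

distinct=$(wc -l < "$work/license-statements-without-filenames.txt" | tr -d ' ')
url_lines=$(grep -c 'http://creativecommons.org' "$work/license-statements-without-filenames.txt")

echo "distinct statements: $distinct"

rm -rf "$work"
```

Since the result is sorted anyway, sort -u in place of awk '!x[$0]++' | sort should give the same lines.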

Klortho · 2015-06-05T17:03:51Z

grep -ohPR --include="*.nxml" "<license(.*)</license>" .

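If an article has more than one <license> element on the same line, the greedy (.*) captures everything from the first opening tag to the last closing tag, including whatever lies between the two elements. Use the non-greedy matcher instead: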

grep -ohPR --include="*.nxml" "<license(.*?)</license>" .
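The effect can be checked on a single line. A minimal sketch, assuming GNU grep with PCRE support; the sample markup is invented for the illustration.

```shell
# Hypothetical sample: two <license> elements on one line.
sample='<permissions><license license-type="a"><p>A</p></license><license license-type="b"><p>B</p></license></permissions>'

# Greedy: a single match running from the first <license to the last </license>.
greedy_count=$(printf '%s\n' "$sample" | grep -oP '<license(.*)</license>' | wc -l | tr -d ' ')

# Non-greedy: one match per element.
lazy_count=$(printf '%s\n' "$sample" | grep -oP '<license(.*?)</license>' | wc -l | tr -d ' ')

echo "greedy matches: $greedy_count"
echo "non-greedy matches: $lazy_count"
```

Note that grep works line by line, so neither form will match a <license> element whose opening and closing tags sit on different lines.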

Daniel-Mietchen · 2015-06-06T01:32:40Z

Cool, thanks!

Daniel-Mietchen self-assigned this Jun 4, 2015

Daniel-Mietchen mentioned this issue Jun 5, 2015

Automating the collection of examples JATS4R/JATS4R-Participant-Hub#94

Closed

Daniel-Mietchen added a commit to Daniel-Mietchen/sandbox that referenced this issue Jun 5, 2015

from wpoa/open-access-media-importer#143 (comment)

bc86453
