Mozilla Open Science sprint #143

Open
Daniel-Mietchen opened this issue Jun 4, 2015 · 3 comments
@Daniel-Mietchen
Member

We are taking part in the Mozilla Open Science sprint (overview) and welcome contributions to any of the software projects here at WikiProject Open Access, in particular to the YouTube exporter (#82) and the Open Access signalling project.

If you are interested in getting involved, please leave a note here, and we will take things from there.

@Daniel-Mietchen Daniel-Mietchen self-assigned this Jun 4, 2015
@Daniel-Mietchen
Member Author

Here is what I plan to do: get an overview of all the <license> statements within the Open Access subset of the articles on PubMed Central.

I will update this comment as I move forward.

Day 1

  • Download the files that contain the XML
    • took about half an hour
    • total size: 15 GB
  • unpack: for f in *.tar.gz; do tar -zxvf "$f"; done (tar reads one archive per call, so the glob has to be looped over)
    • took over an hour
    • total size: 68 GB (after deleting the original gz files)
  • explored the files in various ways while looking for the use of elements like <permissions>, <inline-formula>, <disp-formula>, <fig>, <ref>, <subj-group>, <kwd-group>
  • search for license statements:
    • grep -ohPR --include="*.nxml" "<license(.*)</license>" .
    • with removal of duplicates: grep -ohPR --include="*.nxml" "<license(.*)</license>" . | awk '!x[$0]++' > license-statements.txt
    • running grep -oHPR --include="*.nxml" "<license(.*)</license>" . > license-statements.txt overnight (-H keeps the file names)
    • I am aware that xmlgrep would be better suited to this, but it's not available on that machine
    • noticed a strange way of abbreviating Creative Commons licenses and notified the publisher
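
The grep searches above work line by line, so a <license> element that spans several lines is never matched. As a hedged alternative, the same search can be sketched in Python. This is not what was run here: the function names `iter_license_statements` and `unique_in_order` are made up for illustration, and the sketch assumes the .nxml files are UTF-8 encoded and small enough to read into memory one at a time.

```python
import re
from pathlib import Path

# Non-greedy match; DOTALL lets "." cross line breaks, so a <license>
# element that spans several lines is still found.
# The lookahead keeps child elements such as <license-p> from starting a match.
LICENSE_RE = re.compile(r"<license(?=[\s>]).*?</license>", re.DOTALL)


def iter_license_statements(root):
    """Yield (path, statement) for every <license> element in *.nxml files under root."""
    for path in sorted(Path(root).rglob("*.nxml")):
        # Assumption: files are UTF-8; undecodable bytes are replaced, not fatal.
        text = path.read_text(encoding="utf-8", errors="replace")
        for match in LICENSE_RE.finditer(text):
            yield path, match.group(0)


def unique_in_order(items):
    """Drop duplicates while keeping first-seen order, like awk '!x[$0]++'."""
    return list(dict.fromkeys(items))
```

Calling `unique_in_order(s for _, s in iter_license_statements("."))` from the directory that holds the unpacked archives would give the deduplicated statements; keeping the path from each pair corresponds to the -H variant.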

Day 2

  • the overnight grep produced a license-statements.txt of over 370 MB, with license statements from over 700k nxml files (not sure why not from all of the ca. 800k files)
  • cut -d ":" -f 2- license-statements.txt | awk '!x[$0]++' | sort > license-statements-without-filenames.txt removes the file names and deduplicates license statements
    • results in a 17 MB file with 4062 license statements that differ in their XML character sequence.
    • needs cleanup
  • rerunning with the non-greedy pattern overnight: grep -oHPR --include="*.nxml" "<license(.*?)</license>" . > license-statements.txt
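
One possible first step for the cleanup is to collapse whitespace before comparing statements, so that statements differing only in spacing or line wrapping count as one. The sketch below is an assumption about what the cleanup could look like, not what was done here. The helper names `normalize_statement` and `tally_statements` are hypothetical, and it assumes each input line has the `path:statement` shape that grep -H writes, with no colon in the path (the same assumption the cut command makes).

```python
import re
from collections import Counter


def normalize_statement(line):
    """Strip the leading 'path:' added by grep -H and collapse runs of whitespace."""
    # partition splits at the first colon only, so colons inside URLs survive.
    _, _, statement = line.partition(":")
    return re.sub(r"\s+", " ", statement).strip()


def tally_statements(lines):
    """Count how often each normalized license statement occurs."""
    return Counter(normalize_statement(line) for line in lines if line.strip())
```

With `open("license-statements.txt", encoding="utf-8")` passed as `lines`, `tally_statements(...).most_common(20)` would list the most frequent statements first, which shows which variants dominate.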

@Klortho
Member

Klortho commented Jun 5, 2015

grep -ohPR --include="*.nxml" "<license(.*)</license>" .

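If an article has more than one <license> element on the same line, the greedy .* captures everything from the first <license to the last </license>, including whatever lies between the two elements. Use the non-greedy matcher instead: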

grep -ohPR --include="*.nxml" "<license(.*?)</license>" .
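
The difference is easy to see on a toy input. The following is a small illustration in Python, whose `re` module treats `.*` and `.*?` the same way as the Perl-style patterns that grep -P accepts; the sample line is made up, and the capture group from the grep pattern is dropped so that `findall` returns whole matches.

```python
import re

# A single line holding two <license> elements (made-up sample data).
line = "<license>CC BY</license><p>text</p><license>CC0</license>"

# Greedy: .* runs on to the LAST </license> on the line.
greedy = re.findall(r"<license.*</license>", line)

# Non-greedy: .*? stops at the FIRST </license> after each opening tag.
non_greedy = re.findall(r"<license.*?</license>", line)

print(greedy)      # ['<license>CC BY</license><p>text</p><license>CC0</license>']
print(non_greedy)  # ['<license>CC BY</license>', '<license>CC0</license>']
```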

@Daniel-Mietchen
Member Author

Cool, thanks!
