Scraping a PDF #51

psychemedia · 2016-06-08T16:54:33Z

How do I scrape a local PDF?

I'm running:

norma/releases/download/v0.2.26/norma_0.1.SNAPSHOT_all.deb
ami/releases/download/v0.2.24/ami2_0.1.SNAPSHOT_all.deb

and using one of your test files trying:

norma  -i /contentmineself/trialsjournal_15_1_511.pdf -o /contentmineself/test_ct/

but all it seems to do is copy the pdf and rename it fulltext.pdf?

If I add the switch --transform pdf2html, as per #38, I get:

java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.xmlcml.cmine.args.DefaultArgProcessor.instantiateAndRunMethod(DefaultArgProcessor.java:1049)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runMethodsOfType(DefaultArgProcessor.java:946)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runRunMethodsOnChosenArgOptions(DefaultArgProcessor.java:927)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runAndOutput(DefaultArgProcessor.java:1111)
    at org.xmlcml.norma.Norma.run(Norma.java:23)
    at org.xmlcml.norma.Norma.main(Norma.java:18)
Caused by: java.lang.RuntimeException: Input must be reserved file; found: /contentmineself/trialsjournal_15_1_511.pdf
    at org.xmlcml.norma.NormaArgProcessor.checkAndGetInputFile(NormaArgProcessor.java:282)
    at org.xmlcml.norma.NormaTransformer.transform(NormaTransformer.java:114)
    at org.xmlcml.norma.NormaArgProcessor.runTransform(NormaArgProcessor.java:202)
    ... 10 more
0    [main] DEBUG org.xmlcml.cmine.args.DefaultArgProcessor  - option in exception  or --transform; (1,2147483647); parseTransform; STRING: null / []; pdf2html; [pdf2html]
java.lang.RuntimeException: invoke runTransform fails
    at org.xmlcml.cmine.args.DefaultArgProcessor.instantiateAndRunMethod(DefaultArgProcessor.java:1052)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runMethodsOfType(DefaultArgProcessor.java:946)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runRunMethodsOnChosenArgOptions(DefaultArgProcessor.java:927)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runAndOutput(DefaultArgProcessor.java:1111)
    at org.xmlcml.norma.Norma.run(Norma.java:23)
    at org.xmlcml.norma.Norma.main(Norma.java:18)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.xmlcml.cmine.args.DefaultArgProcessor.instantiateAndRunMethod(DefaultArgProcessor.java:1049)
    ... 5 more
Caused by: java.lang.RuntimeException: Input must be reserved file; found: /contentmineself/trialsjournal_15_1_511.pdf
    at org.xmlcml.norma.NormaArgProcessor.checkAndGetInputFile(NormaArgProcessor.java:282)
    at org.xmlcml.norma.NormaTransformer.transform(NormaTransformer.java:114)
    at org.xmlcml.norma.NormaArgProcessor.runTransform(NormaArgProcessor.java:202)
    ... 10 more

My complete install is:

RUN apt-get clean -y && apt-get -y update && apt-get -y upgrade && \
  apt-get -y update && apt-get install -y wget ant unzip openjdk-7-jdk  && \
    apt-get clean -y

RUN wget --no-check-certificate https://github.com/ContentMine/norma/releases/download/v0.2.26/norma_0.1.SNAPSHOT_all.deb

RUN wget --no-check-certificate https://github.com/ContentMine/ami/releases/download/v0.2.24/ami2_0.1.SNAPSHOT_all.deb

RUN dpkg -i norma_0.1.SNAPSHOT_all.deb
RUN dpkg -i ami2_0.1.SNAPSHOT_all.deb

RUN npm install --global getpapers

in a basic linux environment with node installed (Dockerhub image node:4.3.2).

Hmm - is this the issue maybe? #21 (comment)


petermr · 2016-06-08T23:28:27Z

Thanks!

How do I scrape a local PDF?

Wrong terminology: you have already scraped it. The question is "How do I transform a PDF to Foo?"

norma  -i /contentmineself/trialsjournal_15_1_511.pdf -o /contentmineself/test_ct/

but all it seems to do is copy the pdf and rename it fulltext.pdf?

Yes, because no --transform is given, and the only thing it can reasonably do is to normalize the name. Did it create a CTree folder for the fulltext.pdf?

If I add the switch --transform pdf2html, as per #38, I get:

If the first command has set up a CTree with fulltext.pdf it should work on that:

norma  --ctree /contentmineself/trialsjournal_15_1_511_pdf -i fulltext.pdf -o fulltext.pdf.txt --transform pdf2txt

I would use pdf2txt first as it's more self-contained and less experimental.
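
As a rough sketch of the two-step sequence suggested here (normalize first, then transform), the wrapper below chains the two calls using only the flags that appear in this thread. The function name pdf_to_txt, the NORMA variable, and the three-argument shape are illustrative assumptions, not part of norma. The CTree path is passed in explicitly because where the first call creates it should be checked rather than guessed.

```shell
# Illustrative wrapper; NORMA, pdf_to_txt and its arguments are hypothetical names.
# Override NORMA (for example with a stub) to preview the calls without running norma.
NORMA=${NORMA:-norma}

# pdf_to_txt PDF OUTDIR CTREE
#   PDF    path to the local PDF
#   OUTDIR output directory for the first (normalizing) call
#   CTREE  CTree directory that the first call created for this PDF
pdf_to_txt() {
  pdf=$1
  outdir=$2
  ctree=$3

  # Step 1: without --transform, norma only copies the PDF and renames it fulltext.pdf.
  "$NORMA" -i "$pdf" -o "$outdir" || return 1

  # Step 2: transform the fulltext.pdf inside the CTree to plain text.
  "$NORMA" --ctree "$ctree" -i fulltext.pdf -o fulltext.pdf.txt --transform pdf2txt
}
```

A call might look like pdf_to_txt /contentmineself/trialsjournal_15_1_511.pdf /contentmineself/test_ct/ CTREE_DIR, where CTREE_DIR stands for whatever directory the first call actually produced.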

psychemedia · 2016-06-08T23:52:39Z

Re: terminology - I disagree. For me, scraping is the extraction of content in a structured form from a document where the content is not structured in a form useful for processing as data. So I can scrape a table from an HTML document. In the HTML doc, the table is structured as a table, but not in a form I can usefully process. Under your terms, I guess that's just a transformation of the HTML. But in the vernacular, it's table scraping?

Re: the ctree commands - thanks; I'm still not clear on what the pipeline is, what components are available, how to wire them together, and what the intermediate data structures are. Is there something I should read....?

petermr · 2016-06-09T07:46:48Z

"scraping" - I've looked at https://en.wikipedia.org/wiki/Data_scraping and agree that "data scraping" could be aligned with our "extraction". I don't think there is a consistent world view. However - rightly or wrongly - we use "scraping" to mean "web scraping" and "extraction" to mean "information extraction" (https://en.wikipedia.org/wiki/Information_extraction).

In CM we have the phases:

crawl
scrape
transform
extract / index

Generally, transformation means transforming the document itself rather than extracting bits from it, though the boundary is woolly: some transformations remove cruft, and some extract tables.

Anyway the more targeted answer is that it should be relatively easy to run PDF2TXT, whereas PDF2SVG2XML is more involved and less predictable.

As always the question is "what do you want to achieve"?

psychemedia · 2016-06-09T09:22:17Z

What do I want to achieve?

get enough of a clue about how to call the different contentmine tools in an appropriate order so I can get a feel for what they do, how they work together and how I might be able to start appropriating them;
task wise: one is to see how easy it is to then add "filters" for scraping new classes of regular PDFs (eg PDFs from a particular journal, or published in a particular style (eg Parliamentary Library briefing docs, perhaps?)); my feeling is that I should be able to take this quite far?
the other is more general: explore whether those tools help speed up getting data out of a random collection of arbitrarily and independently styled pdf docs, such as reports from across government or the NHS; because of the arbitrary/independent nature of the doc formats, I don't expect this to result in a fully automated pipeline, but I'm interested to see what bits I might be able to usefully do; eg trying to parse data out of charts, other than just running OCR over them, or trying to extract their captions in the report text to act as image metadata for an image gallery generated from the doc, would be a start.

psychemedia · 2016-06-09T09:49:21Z

Re: the pathway. Running:

norma --project /contentmineself/test -i /contentmineself/test/trialsjournal_15_1_511.pdf -o /contentmineself/test/
norma --project /contentmineself/test --ctree /contentmineself/test/trialsjournal_15_1_511 -i fulltext.pdf -o fulltext.pdf.html --transform pdf2html

I get a verbose output log:

.0    [main] DEBUG org.xmlcml.svg2xml.pdf.PDFAnalyzer  - running /contentmineself/test/trialsjournal_15_1_511/fulltext.pdf to target/svg/fulltext
1 = 3275 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - pageSize: null
3275 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - startStream
3289 [main] INFO  org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled operation: i
6268 [main] INFO  org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled operation: BDC
7123 [main] INFO  org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled operation: EMC
7368 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - endStream
2 = 7620 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - pageSize: java.awt.Dimension[width=595,height=793]
7620 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - startStream
9726 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - endStream
3 = 9991 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - pageSize: java.awt.Dimension[width=595,height=793]
9991 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - startStream
10924 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - endStream
4 = 10973 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - pageSize: java.awt.Dimension[width=595,height=793]
10973 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - startStream
11691 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - endStream
5 = 11730 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - pageSize: java.awt.Dimension[width=595,height=793]
11730 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - startStream
12396 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - endStream
6 = 12406 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - pageSize: java.awt.Dimension[width=595,height=793]
12406 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - startStream
12620 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - endStream
7 = 12634 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - pageSize: java.awt.Dimension[width=595,height=793]
12634 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - startStream
13011 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - endStream
8 = 13025 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - pageSize: java.awt.Dimension[width=595,height=793]
13025 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - startStream
13301 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - endStream
9 = 13316 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - pageSize: java.awt.Dimension[width=595,height=793]
13316 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - startStream
13538 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - endStream

13568 [main] DEBUG org.xmlcml.svg2xml.pdf.PDFAnalyzer  - target/svg/fulltext files: 9
0~21759 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - Path: /target/svg/fulltext/page1.svg
1~27272 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - Path: /target/svg/fulltext/page2.svg
2~30831 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - Path: /target/svg/fulltext/page3.svg
3~34574 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - Path: /target/svg/fulltext/page4.svg
4~38269 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - Path: /target/svg/fulltext/page5.svg
5~42727 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - Path: /target/svg/fulltext/page6.svg
6~46832 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - Path: /target/svg/fulltext/page7.svg
7~56291 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - Path: /target/svg/fulltext/page8.svg
8~57855 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - Path: /target/svg/fulltext/page9.svg
57880 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.1.0.svg
58905 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.1.0.svg
59626 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.1.3.svg
60783 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.1.3.svg
<1><2>62996 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.3.2.svg
63044 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.3.2.svg
63188 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.3.12.svg
63228 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.3.12.svg
<3><4>64570 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.5.2.svg
64832 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.5.2.svg
<5><6><7><8>69238 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.9.3.svg
69485 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.9.3.svg
<9>69738 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - writing to target/output/fulltext/TEXT.0.html
.

which looks as if it created some output? But running eg:

find / -name 'TEXT.0.html' 2>/dev/null

to try to find out where the file was placed returns nothing? So where did the output files go? Or do I have a write permissions issue somewhere?

(Running with --transform pdf2txt worked fine, and I could see the extracted text file...)
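
One way to narrow this down, sketched here under assumptions: the log reports relative paths such as target/output/fulltext/TEXT.0.html, which suggests output may be written relative to the directory norma was started from. The helper below (a hypothetical list_files_written_by, not part of any ContentMine tool) runs a command and then lists every file under a chosen root that was modified after the command started, using only standard mktemp, sleep, and find -newer.

```shell
# Hypothetical helper: list files that a command wrote under a directory.
# list_files_written_by SEARCH_ROOT COMMAND [ARGS...]
list_files_written_by() {
  lfw_root=$1
  shift
  lfw_marker=$(mktemp) || return 1   # reference file; its mtime marks "before the run"
  sleep 1                            # make sure later writes are strictly newer
  "$@" >&2                           # run the command; keep its stdout out of the listing
  lfw_status=$?
  # Regular files changed since the marker, staying on one filesystem (-xdev).
  find "$lfw_root" -xdev -type f -newer "$lfw_marker" 2>/dev/null
  rm -f "$lfw_marker"
  return "$lfw_status"
}
```

For instance, after cd /contentmineself/test, calling list_files_written_by "$PWD" followed by the pdf2html command above would show what landed under the working directory; the root can be widened to / if nothing appears. An empty listing everywhere would be consistent with the file never being written (a permissions problem, say) rather than being misplaced.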
