Scraping a PDF #51
Comments
Thanks!
Wrong terminology. You have already scraped it. "How do I transform a PDF to Foo?"
Yes, because no
If the first command has set up a CTree with fulltext.pdf it should work on that:
I would use pdf2txt first as it's more self-contained and less experimental.
Re: terminology - I disagree. For me, scraping is the extraction of content in a structured form from a document where the content is not structured in a form useful for processing as data. So I can scrape a table from an HTML document. In the HTML doc, the table is structured as a table, but not in a form I can usefully process. Under your terms, I guess that's just a transformation of the HTML. But in the vernacular, it's table scraping?

Re: the ctree commands - thanks; I'm still not clear on what the pipeline is, what components are available, how to wire them together, and what the intermediate data structures are. Is there something I should read?
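To make the "table scraping" sense of the word concrete, here is a minimal sketch in Python using only the standard library. It is not part of this project's tooling; the names `TableScraper` and `scrape_table` are made up for illustration. It turns the cells of an HTML table into plain lists that can be processed as data.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the cell text of every <tr> row in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return all table rows found in html as lists of cell strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Calling `scrape_table` on a page's HTML returns one list per table row, header rows included. Whether that counts as "scraping", "extraction" or "transformation" is exactly the terminology question above; the code is the same either way.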
"scraping" - I've looked at https://en.wikipedia.org/wiki/Data_scraping and agree that "data scraping" could be aligned with our "extraction". I don't think there is a consistent world view. However, rightly or wrongly, we use "scraping" to mean "web scraping" and "extraction" to mean "information extraction" (https://en.wikipedia.org/wiki/Information_extraction). In CM we have the phases:
Generally, transformation means transforming the document per se rather than extracting bits, though it's woolly - some transformations remove cruft, and some extract tables. Anyway, the more targeted answer is that it should be relatively easy to run PDF2TXT, whereas PDF2SVG2XML is more involved and less predictable. As always, the question is "what do you want to achieve?"
What do I want to achieve?
Re: the pathway. Running:
which looks as if it created some output? But running eg:
to try to find out where the file was placed returns nothing? So where did the output files go? Or do I have a write permissions issue somewhere? (Running with
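One tool-independent way to chase down missing output is to walk the working directory and list recently modified files. The sketch below is a hypothetical helper, not part of the toolchain discussed here. It assumes output file names start with `fulltext`, a guess based on the `fulltext.pdf` name mentioned in this thread, so adjust `name_prefix` if the tool names its files differently.

```python
import os
import time


def find_recent_files(root, max_age_seconds=600, name_prefix="fulltext"):
    """Return sorted paths under root whose file name starts with
    name_prefix and that were modified in the last max_age_seconds."""
    cutoff = time.time() - max_age_seconds
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.startswith(name_prefix):
                continue
            path = os.path.join(dirpath, filename)
            try:
                if os.path.getmtime(path) >= cutoff:
                    matches.append(path)
            except OSError:
                # The file vanished or cannot be read; skip it.
                continue
    return sorted(matches)


def is_writable(directory):
    """Report whether the current user may create files in directory."""
    return os.access(directory, os.W_OK | os.X_OK)
```

Running `find_recent_files(".")` straight after the command shows whether anything was written at all, and `is_writable(".")` gives a quick answer to the permissions question.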
How do I scrape a local PDF?
I'm running:
and using one of your test files trying:
but all it seems to do is copy the PDF and rename it `fulltext.pdf`?
If I add the switch `--transform pdf2html`, as per #38, I get:
My complete install is:
in a basic Linux environment with node installed (Docker Hub image `node:4.3.2`).
Hmm - is this the issue maybe? #21 (comment)