Extract information from bytes #300

asciidiego · 2019-08-25T21:41:42Z

I have downloaded a PDF, but it is not saved as a file yet. How can I use textract to extract the text without actually saving the file?

jpweytjens · 2019-08-27T08:12:00Z

What do you mean by "downloaded, but not saved as a file yet"?

Textract requires that you specify the path to the PDF file. So far I have only parsed files that have been saved locally. You might try some of the ideas here, but I don't completely understand what you're trying to do.

asciidiego · 2019-08-27T09:58:52Z

I get the PDFs from an HTTP response, so I already have the body as bytes. I should be able to extract the text from those bytes alone. I don't think it should be necessary to save the PDF to a file, parse it to extract the text, and then delete the file, when the data was already in memory as a Python variable.

jpweytjens · 2019-08-27T10:56:23Z

Currently, textract does not support streams. See also #85, #97 and #99. Perhaps this might help you while we work on support for streams.

multinucliated · 2020-08-29T08:11:48Z

Any progress on byte-stream support (file.read()), or can you suggest another way around this?

shzy2012 · 2021-07-06T03:06:14Z

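import tempfile
import textract

# `f` is assumed to be a binary file-like object (e.g. an upload or an HTTP response body).
with tempfile.NamedTemporaryFile(delete=True) as temp:
    temp.write(f.read())
    temp.flush()  # make sure the bytes are on disk before textract reads the file
    context = textract.process(temp.name, encoding='utf-8', extension=".pdf")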
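
For readers who start from raw bytes rather than a file object, the workaround above could be wrapped in a small helper along these lines. This is only a sketch: the name `extract_text_from_bytes` and its `processor` parameter are hypothetical and not part of textract. It assumes `textract.process(path)` picks its parser from the file extension, which is why the temporary file is given a `.pdf` suffix. The file is closed before it is processed, because a `NamedTemporaryFile` that is still open cannot be reopened by name on Windows.

```python
import os
import tempfile


def extract_text_from_bytes(data, suffix=".pdf", processor=None):
    """Write `data` to a short-lived temp file and run a path-based extractor on it.

    `processor` is a hypothetical hook: any callable that takes a file path.
    When omitted, textract.process is used (requires textract to be installed).
    """
    if processor is None:
        import textract  # imported lazily so the helper can be tested without it
        processor = textract.process

    # mkstemp returns an open OS-level descriptor plus the path of the new file.
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # The file is closed here, so the extractor can open it on any platform.
        return processor(path)
    finally:
        # Always clean up, even if extraction raises.
        os.remove(path)
```

With an HTTP client, usage would look like `text = extract_text_from_bytes(response_body)`, where `response_body` is the bytes of the downloaded PDF. Note that this still touches the filesystem briefly; it only avoids managing the file yourself.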

uxtt2000 · 2023-04-08T17:38:21Z

That's the solution. It works like a charm, and it works in a stateless cloud function without persisting any file!
Thanks @shzy2012!
@jpweytjens: Maybe put this workaround in the docs while streams are not yet supported, as it's really useful for cloud-based usage.
Thanks

jpweytjens mentioned this issue Aug 27, 2019

Process from in memory variable #196

Open
