Parsing PDF with and without images #4217

Gautam3AI · 2025-01-10T06:24:52Z

I've been using get_text to extract text from PDFs.

Now I want to extend this to scanned PDFs and to PDFs that have both text and images on a single page.

For this, I started using get_textpage_ocr.

Question
get_textpage_ocr works for all my test PDFs, with and without images. So, is the output of get_textpage_ocr always the same as that of get_text if the PDF doesn't contain any images?

Answered by JorjMcKie

JorjMcKie · 2025-01-10T11:00:29Z

Maintainer

If the page contains no images and you use the parameter full=False, then no OCR is executed. Regular text is then extracted as normal.
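For illustration, here is a minimal sketch of how this could look in code. It assumes a recent PyMuPDF (imported as pymupdf; older releases use fitz) and, for pages that do have images, a working Tesseract installation. The helper names extract_page_text and extract_document_text are hypothetical, and the explicit get_images() check is an extra step of my own, not something the answer requires. It only makes the "no images, no OCR" case visible and keeps image-free pages from depending on Tesseract at all.

```python
def extract_page_text(page, language="eng"):
    """Return the text of one page, OCRing embedded images only if present.

    `page` is expected to be a PyMuPDF Page. This helper is a hypothetical
    sketch, not part of the library.
    """
    if not page.get_images():
        # No images referenced by the page: plain text extraction is enough.
        return page.get_text()

    # full=False: only image areas are OCRed; regular text is extracted
    # as usual and combined with the OCR results in one text page.
    textpage = page.get_textpage_ocr(language=language, full=False)
    return page.get_text(textpage=textpage)


def extract_document_text(path):
    """Extract text from every page of the PDF at `path`."""
    import pymupdf  # imported here so the helper above has no hard dependency

    with pymupdf.open(path) as doc:
        return [extract_page_text(page) for page in doc]
```

Calling extract_document_text("some.pdf") (the file name is a placeholder) returns one string per page. If you want to confirm on your own files that both routes give the same text for image-free pages, compare page.get_text() with page.get_text(textpage=page.get_textpage_ocr(full=False)) directly. Small differences could still come from the extraction flags used by each call, so that is worth checking rather than assuming.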
