Description
Hi!
I've found an IndexError that is raised when the PDF file is relatively large. The necessary information is provided below.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-5.15.0-56-generic-x86_64-with-glibc2.31
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.0.1, crypt_provider=('cryptography', '3.1'), PIL=none
commit 8e1799e
Code + PDF
This is a minimal, complete example that shows the issue:
#! /usr/bin/env python3

import pypdf
from pypdf.errors import EmptyFileError, PdfReadError, PdfStreamError
import sys

def TestOneInput(fname):
    try:
        pdf_reader = pypdf.PdfReader(fname)
        for page_number, page in enumerate(pdf_reader.pages):
            page.extract_text()
    except (EmptyFileError, PdfReadError, PdfStreamError):
        pass

if __name__ == "__main__":
    if len(sys.argv) < 2:
        exit(1)
    TestOneInput(sys.argv[1])
PoC
crash-e8a85d82de01cab5eb44e7993304d8b9d1544970.pdf
Traceback
This is the complete stderr I see:
entry <entry> in Xref table invalid but object found
...
entry <entry> in Xref table invalid; object not found
Traceback (most recent call last):
  File "/fuzz/./poc.py", line 18, in <module>
    TestOneInput(sys.argv[1])
  File "/fuzz/./poc.py", line 9, in TestOneInput
    pdf_reader = pypdf.PdfReader(fname)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 132, in __init__
    self._initialize_stream(stream)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 154, in _initialize_stream
    self.read(stream)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 615, in read
    self._read_xref_tables_and_trailers(stream, startxref, xref_issue_nr)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 871, in _read_xref_tables_and_trailers
    startxref = self._read_xref(stream)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 910, in _read_xref
    self._read_standard_xref_table(stream)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 781, in _read_standard_xref_table
    while line[0] in b"\x0D\x0A":
IndexError: index out of range
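
My guess at the failure mode: the last frame indexes `line[0]`, and indexing an empty bytes object raises exactly this IndexError. That would happen if the read hits the end of the stream while skipping end-of-line bytes. The sketch below reproduces the pattern in isolation. It is not pypdf's code: the helper names `skip_eol_unguarded` and `skip_eol_guarded` are made up for illustration, and the one-byte reads are a simplification of whatever the reader actually does.

```python
import io


def skip_eol_unguarded(stream):
    """Skip CR/LF bytes the way the failing loop condition does (hypothetical helper)."""
    line = stream.read(1)
    # At end of stream, read() returns b"" and line[0] raises IndexError.
    while line[0] in b"\x0D\x0A":
        line = stream.read(1)
    return line


def skip_eol_guarded(stream):
    """Same loop, but stop cleanly when the stream is exhausted (hypothetical helper)."""
    line = stream.read(1)
    # An empty bytes object is falsy, so the index is never evaluated at EOF.
    while line and line[0] in b"\x0D\x0A":
        line = stream.read(1)
    return line


if __name__ == "__main__":
    truncated = b"\r\n"  # data ends right after the end-of-line bytes
    try:
        skip_eol_unguarded(io.BytesIO(truncated))
    except IndexError as exc:
        print("unguarded loop raised IndexError:", exc)
    print("guarded loop returned:", skip_eol_guarded(io.BytesIO(truncated)))
```

If this is the cause, a truthiness check on `line` before indexing (or raising PdfReadError when the data runs out) would presumably let callers handle the malformed file like the other read errors in the PoC's except clause.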