
Using PdfReader causes a crash: IndexError when reading xref table #2886

Closed
@Avgor46

Hi!

I found an IndexError when reading a relatively large PDF file. The necessary information is provided below.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.15.0-56-generic-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.0.1, crypt_provider=('cryptography', '3.1'), PIL=none

commit 8e1799e

Code + PDF

This is a minimal, complete example that shows the issue:

#! /usr/bin/env python3

import pypdf
from pypdf.errors import EmptyFileError, PdfReadError, PdfStreamError
import sys

def TestOneInput(fname):
    try:
        pdf_reader = pypdf.PdfReader(fname)
        for page in pdf_reader.pages:
            page.extract_text()
    except (EmptyFileError, PdfReadError, PdfStreamError):
        pass  # expected pypdf errors for malformed input are ignored

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit(1)
    TestOneInput(sys.argv[1])

PoC

crash-e8a85d82de01cab5eb44e7993304d8b9d1544970.pdf

Traceback

This is the complete stderr I see:

entry <entry> in Xref table invalid but object found
...
entry <entry> in Xref table invalid; object not found
Traceback (most recent call last):
  File "/fuzz/./poc.py", line 18, in <module>
    TestOneInput(sys.argv[1])
  File "/fuzz/./poc.py", line 9, in TestOneInput
    pdf_reader = pypdf.PdfReader(fname)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 132, in __init__
    self._initialize_stream(stream)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 154, in _initialize_stream
    self.read(stream)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 615, in read
    self._read_xref_tables_and_trailers(stream, startxref, xref_issue_nr)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 871, in _read_xref_tables_and_trailers
    startxref = self._read_xref(stream)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 910, in _read_xref
    self._read_standard_xref_table(stream)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 781, in _read_standard_xref_table
    while line[0] in b"\x0D\x0A":
IndexError: index out of range
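
The failing statement indexes `line[0]`, so `line` must have been an empty bytes object at that point, most likely because the stream ran out of data while the xref table was being read. Below is a minimal sketch of a guarded read, not pypdf's actual code: the helper name `read_xref_entry`, the default width of 20 bytes, and the choice of `ValueError` are my assumptions for illustration only.

```python
import io


def read_xref_entry(stream, width=20):
    """Hypothetical helper: read one fixed-width xref entry.

    Leading CR/LF bytes are skipped one at a time. An empty read
    (end of stream) raises ValueError instead of IndexError.
    """
    line = stream.read(width)
    # `line and ...` guards the index: b""[0] would raise IndexError.
    while line and line[0] in b"\x0D\x0A":
        # Step back so the next read starts one byte after the EOL byte.
        stream.seek(-len(line) + 1, 1)
        line = stream.read(width)
    if not line:
        raise ValueError("unexpected end of stream while reading xref entry")
    return line


if __name__ == "__main__":
    # An entry preceded by a stray LF is still read correctly.
    print(read_xref_entry(io.BytesIO(b"\n0000000000 65535 f \n")))
    # A truncated stream produces a controlled error.
    try:
        read_xref_entry(io.BytesIO(b"\n"))
    except ValueError as exc:
        print("handled:", exc)
```

In pypdf itself, a controlled failure would presumably be raised as one of the library's own errors (for example `PdfReadError`) so that callers like the PoC above can catch it.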

Labels

PdfReader: The PdfReader component is affected
is-uncaught-exception: Use this label only for issues caused by broken PDF documents that cannot be recovered.
