
Experiencing small memory leak in save() #2791

Closed
cormier opened this issue Nov 8, 2023 · 10 comments


cormier commented Nov 8, 2023

Please provide all mandatory information!

Describe the bug (mandatory)

I think there's a memory leak in the save() method in fitz.

To Reproduce (mandatory)

My code adds a PDF cover page to an existing PDF. It merges the two PDFs in memory using fitz's insert_pdf method. The merged PDF is converted to bytes using the write method.

The method that reproduces the memory leak is as follows:

def merge_pdf(content: bytes, coverpage: bytes):
    try:
        with fitz.Document(stream=coverpage, filetype="pdf") as coverpage_pdf, fitz.Document(
            stream=content, filetype="pdf"
        ) as content_pdf:
            coverpage_pdf.insert_pdf(content_pdf)
            doc = coverpage_pdf.write()
            return doc
    except:
        pass

The following method takes a snapshot of memory and logs fitz-related statistics by file name and line number.

def take_snapshot():

    snapshot = tracemalloc.take_snapshot()
    for i, filename_stat in enumerate(snapshot.statistics("filename")[:20], 1):
        if "fitz" in str(filename_stat):
            for line_stat in snapshot.statistics("lineno")[:20]:
                if "fitz" in str(line_stat):
                    log.info("line_stat", stat=str(line_stat))
            log.info("filename_stat", i=i, stat=str(filename_stat))

    print("=" * 50)

Finally, I iterate over a set of ~6000 PDFs. For each PDF, I call the merge_pdf method and then the take_snapshot method. During the initial iterations, fitz doesn't appear in the statistics by file name or line number. After a few iterations, fitz does appear, and I can see that the memory attributed to it grows by 1 to 3 KiB per iteration, along with the number of objects.

line_stat                      iteration=51 stat=/home/dcormier/.virtualenvs/eruditorg/lib64/python3.8/site-packages/fitz/fitz.py:4720: size=163 KiB, count=5964, average=28 B
filename_stat                  i=14 iteration=51 stat=/home/dcormier/.virtualenvs/eruditorg/lib64/python3.8/site-packages/fitz/fitz.py:0: size=177 KiB, count=6018, average=30 B
==================================================
line_stat                      iteration=52 stat=/home/dcormier/.virtualenvs/eruditorg/lib64/python3.8/site-packages/fitz/fitz.py:4720: size=166 KiB, count=6052, average=28 B
filename_stat                  i=14 iteration=52 stat=/home/dcormier/.virtualenvs/eruditorg/lib64/python3.8/site-packages/fitz/fitz.py:0: size=178 KiB, count=6097, average=30 B
==================================================
line_stat                      iteration=53 stat=/home/dcormier/.virtualenvs/eruditorg/lib64/python3.8/site-packages/fitz/fitz.py:4720: size=168 KiB, count=6137, average=28 B
filename_stat                  i=14 iteration=53 stat=/home/dcormier/.virtualenvs/eruditorg/lib64/python3.8/site-packages/fitz/fitz.py:0: size=181 KiB, count=6192, average=30 B
==================================================
line_stat                      iteration=54 stat=/home/dcormier/.virtualenvs/eruditorg/lib64/python3.8/site-packages/fitz/fitz.py:4720: size=170 KiB, count=6228, average=28 B
filename_stat                  i=14 iteration=54 stat=/home/dcormier/.virtualenvs/eruditorg/lib64/python3.8/site-packages/fitz/fitz.py:0: size=183 KiB, count=6273, average=30 B
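Beyond per-file and per-line totals, tracemalloc can also diff two snapshots to show exactly which lines grew between iterations. A minimal standalone sketch (not part of the script above; the list allocation merely stands in for the real workload):

```python
import tracemalloc

tracemalloc.start()
snap_a = tracemalloc.take_snapshot()

# Stand-in for a few iterations of the real workload (e.g. merge_pdf calls)
leaky = [bytes(100) for _ in range(1000)]

snap_b = tracemalloc.take_snapshot()

# Each StatisticDiff carries size_diff / count_diff for one source line;
# results are sorted with the largest absolute change first.
stats = snap_b.compare_to(snap_a, "lineno")
for stat in stats[:5]:
    print(stat)
```

If the same source line shows a steadily positive size_diff across iterations, that line is the prime leak suspect.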

Unfortunately, my diagnostic skills are limited at this stage, as I have no experience with SWIG. What could be the next steps to diagnose the memory leak more precisely and eventually solve it? Does the code in Document_save in fitz.py match the contents of lines 1999 to 2081 in fitz.i? Am I right in thinking that the cause of the memory leak lies within this code in fitz.i? What should I look for to solve the problem?

Your configuration (mandatory)

  • Fedora Linux 38 64 bit
  • PyMuPDF 1.23.6: Python bindings for the MuPDF 1.23.5 library. Version date: 2023-11-06 00:00:01. Built for Python 3.8 on linux (64-bit).
@JorjMcKie (Collaborator)

Can you please be so kind and repeat your check with the new, alternative ("rebased") PyMuPDF implementation?
You only need to change the import statement to import fitz_new as fitz.
I am ready to bet that there will be no more memory leaks ... 😎


cormier commented Nov 8, 2023

Thank you for your reply.

I have repeated the test after changing the import statement to import fitz_new as fitz.

I started my script with mprof run; here is the output of mprof plot after 887 iterations:

[image: leak_with_fitz_new]

Here is the output of my logs after 887 iterations:

line_stat                      iteration=884 stat=/home/dcormier/.virtualenvs/eruditorg/lib64/python3.8/site-packages/fitz_new/extra.py:120: size=1726 KiB, count=63137, average=28 B
filename_stat                  i=4 iteration=884 stat=/home/dcormier/.virtualenvs/eruditorg/lib64/python3.8/site-packages/fitz_new/extra.py:0: size=1727 KiB, count=63149, average=28 B
==================================================
line_stat                      iteration=885 stat=/home/dcormier/.virtualenvs/eruditorg/lib64/python3.8/site-packages/fitz_new/extra.py:120: size=1729 KiB, count=63218, average=28 B
filename_stat                  i=4 iteration=885 stat=/home/dcormier/.virtualenvs/eruditorg/lib64/python3.8/site-packages/fitz_new/extra.py:0: size=1729 KiB, count=63230, average=28 B
==================================================
line_stat                      iteration=886 stat=/home/dcormier/.virtualenvs/eruditorg/lib64/python3.8/site-packages/fitz_new/extra.py:120: size=1731 KiB, count=63293, average=28 B
filename_stat                  i=4 iteration=886 stat=/home/dcormier/.virtualenvs/eruditorg/lib64/python3.8/site-packages/fitz_new/extra.py:0: size=1731 KiB, count=63305, average=28 B
==================================================
line_stat                      iteration=887 stat=/home/dcormier/.virtualenvs/eruditorg/lib64/python3.8/site-packages/fitz_new/extra.py:120: size=1733 KiB, count=63377, average=28 B
filename_stat                  i=4 iteration=887 stat=/home/dcormier/.virtualenvs/eruditorg/lib64/python3.8/site-packages/fitz_new/extra.py:0: size=1734 KiB, count=63389, average=28 B

We can see that the count and size values increase on each iteration.

The references are now to fitz_new/extra.py:120

def Document_save(*args):
    return _extra.Document_save(*args)

cormier changed the title from "Experiencing small memory leak in" to "Experiencing small memory leak in save()" on Nov 8, 2023
@julian-smith-artifex-com (Collaborator)

I'd like to reproduce this. Could you post your complete test python code?


cormier commented Nov 13, 2023

Hi @julian-smith-artifex-com,

Thank you for your reply. Please find below a Python script that will hopefully reproduce the problem.

import gc
#import tracemalloc
import fitz_new as fitz

#def take_memory_snapshot(iteration_count):
#
#    snapshot = tracemalloc.take_snapshot()
#    for i, filename_stat in enumerate(snapshot.statistics("filename")[:20], 1):
#        if "fitz" in str(filename_stat):
#            print(f"filename_stat iteration={iteration_count} i={i} stat={str(filename_stat)}")
#
#    print("=" * 50)
#tracemalloc.start()

def merge_pdf(content: bytes, coverpage: bytes):
    with fitz.Document(stream=coverpage, filetype="pdf") as coverpage_pdf, fitz.Document(
        stream=content, filetype="pdf"
    ) as content_pdf:
        coverpage_pdf.insert_pdf(content_pdf)
        doc = coverpage_pdf.write()
        return doc

def reproduce():
    for iteration_count in range(1, 1000):
        with open(f"content.pdf", "rb") as content_pdf, open(f"coverpage.pdf", "rb") as coverpage_pdf:
            content = content_pdf.read()
            coverpage = coverpage_pdf.read()
            merge_pdf(content, coverpage)
            print(iteration_count)
            gc.collect()
    #take_memory_snapshot(iteration_count)

reproduce()

The code will merge coverpage.pdf into content.pdf 1000 times (a normal use case for me). You can use any PDF you want, but if you want to reproduce exactly what I have done, please find attached the following PDFs.

content.pdf
coverpage.pdf

The commented-out code takes a memory snapshot and prints any line related to fitz. I commented it out to be 100% sure it does not create side effects. The code explicitly calls gc.collect() after every iteration.

When I run the code like this:

mprof run --python python reproduce.py

I can run mprof plot to get the following graph:

[image: mprof plot]
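mprof reports the process's resident set size over time; a similar growth check can be done inline from the standard library (a sketch only; resource is POSIX-only, and ru_maxrss is KiB on Linux but bytes on macOS; it is also a peak value, so it can confirm growth but never shrinkage). The large bytes object stands in for the real merge_pdf workload:

```python
import resource

def peak_rss_kib() -> int:
    # Peak resident set size of this process: KiB on Linux, bytes on macOS
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kib()

# Stand-in for 1000 merge_pdf() iterations: touch ~50 MiB of memory
junk = b"x" * (50 * 1024 * 1024)

after = peak_rss_kib()
print(f"peak RSS grew by {after - before} KiB")
```

Sampling this value every N iterations gives a rough equivalent of the mprof curve without an external tool.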

@julian-smith-artifex-com (Collaborator)

Many thanks for your code.

It looks like the leak does not occur with the new rebased implementation (import fitz_new as fitz) with optimisations disabled (PYMUPDF_USE_EXTRA=0, which avoids PyMuPDF's C++ code) on Python 3.11. So the problem might be caused either by Python < 3.11 or by a bug in PyMuPDF's classic/rebased C/C++ code.

Here are mprof graphs showing behaviour with Python 3.10 and 3.11 (different OSes, but I'm assuming this is not significant).

  • *-pymupdf.pdf is classic implementation.
  • *-rebased.pdf is rebased implementation.
  • *-rebased2.pdf is rebased implementation with optimisations disabled.

test_2791-mprof-OpenBSD-3.10.12-rebased2.pdf
test_2791-mprof-OpenBSD-3.10.12-rebased.pdf
test_2791-mprof-OpenBSD-3.10.12-pymupdf.pdf
test_2791-mprof-Linux-3.11.2-rebased2.pdf
test_2791-mprof-Linux-3.11.2-rebased.pdf
test_2791-mprof-Linux-3.11.2-pymupdf.pdf

I'll have a look for a leak in the rebased optimisation code (written in C++, and mostly copied from the original classic implementation).


cormier commented Nov 14, 2023

Thank you for your reply.

I confirm that there is no leak when using PYMUPDF_USE_EXTRA=0 on Python 3.11. I will perform some tests, and if disabling the optimisations does not badly degrade performance I will use this in my production setup. Thanks again!
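For anyone else landing here, the workaround amounts to setting the environment variable before the interpreter starts (PYMUPDF_USE_EXTRA=0 is from the comments above; the reproduce.py / mprof invocation is from earlier in the thread):

```shell
# Disable the rebased implementation's C++ "extra" optimisations
export PYMUPDF_USE_EXTRA=0
echo "PYMUPDF_USE_EXTRA=$PYMUPDF_USE_EXTRA"

# Then run the workload as before, e.g.:
#   mprof run --python python reproduce.py
```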

@julian-smith-artifex-com (Collaborator)

That's good to know, thanks.

Additional testing of the rebased implementation is always very useful.

I have a patch to the rebased implementation's optimisation code that seems to address most of the problem, reducing the increase over 1000 iterations of psutil's RSS measurement (which is apparently what mprof uses) from 1.043x to 1.005x [tested with Python 3.11; haven't tried 3.10 yet]:

diff --git a/src/extra.i b/src/extra.i
index 135cc1e6..ae2681a2 100644
--- a/src/extra.i
+++ b/src/extra.i
@@ -1426,9 +1426,10 @@ static void JM_bytesio_seek(fz_context* ctx, void* opaque, int64_t off, int when
     PyObject* bio = (PyObject*) opaque;
     PyObject* name = PyUnicode_FromString("seek");
     PyObject* pos = PyLong_FromUnsignedLongLong((unsigned long long) off);
-    PyObject_CallMethodObjArgs(bio, name, pos, whence, nullptr);
+    PyObject* rc = PyObject_CallMethodObjArgs(bio, name, pos, whence, nullptr);
+    Py_XDECREF(rc);
     std::string e;
-    PyObject* rc = PyErr_Occurred();
+    rc = PyErr_Occurred();
     if (rc)
     {
         e = "Could not write to Py file obj: " + repr(bio);
@@ -1453,9 +1454,10 @@ static void JM_bytesio_write(fz_context* ctx, void* opaque, const void* data, si
     PyObject* bio = (PyObject*) opaque;
     PyObject* b = PyBytes_FromStringAndSize((const char*) data, (Py_ssize_t) len);
     PyObject* name = PyUnicode_FromString("write");
-    PyObject_CallMethodObjArgs(bio, name, b, nullptr);
+    PyObject* rc = PyObject_CallMethodObjArgs(bio, name, b, nullptr);
+    Py_XDECREF(rc);
     std::string e;
-    PyObject* rc = PyErr_Occurred();
+    rc = PyErr_Occurred();
     if (rc)
     {
         e = "Could not write to Py file obj: " + repr(bio);
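For context on the patch: PyObject_CallMethodObjArgs returns a new reference to the call's result, and the pre-patch code discarded that pointer without Py_XDECREF, so every result object from write()/seek() stayed alive. A loose pure-Python analogy of that ownership mistake (class and variable names are illustrative only, not PyMuPDF code):

```python
class FileLike:
    def write(self, data: bytes) -> bytearray:
        return bytearray(data)  # a fresh result object on every call

f = FileLike()
never_released = []  # plays the role of references that are never Py_XDECREF'ed

for _ in range(100):
    rc = f.write(b"chunk")     # C: rc = PyObject_CallMethodObjArgs(bio, name, b, NULL)
    never_released.append(rc)  # C bug: the new reference was simply dropped,
                               # so the result object could never be freed

# 100 distinct result objects are still alive
print(len(never_released), len(set(map(id, never_released))))
```

In CPython's C API the convention is that call functions return a new (owned) reference, so even a result you don't care about must be released, e.g. Py_XDECREF(rc), exactly as the diff above does.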

julian-smith-artifex-com added a commit to ArtifexSoftware/PyMuPDF-julian that referenced this issue Nov 14, 2023:
We need to call Py_XDECREF() on value returned by PyObject_CallMethodObjArgs().

julian-smith-artifex-com added a commit that referenced this issue Nov 15, 2023:
We need to call Py_XDECREF() on value returned by PyObject_CallMethodObjArgs().

julian-smith-artifex-com commented Nov 16, 2023

Here are results for python-3.11 and 3.10 after the commits above.

I think the leak is fixed on rebased and 3.11:

test_2791-mprof-Linux-3.11.2-rebased

test_2791-mprof-Linux-3.11.2-rebased (unoptimised)

test_2791-mprof-Linux-3.11.2-classic

And the leak is reduced on rebased and 3.10:

test_2791-mprof-OpenBSD-3.10.12-rebased

test_2791-mprof-OpenBSD-3.10.12-rebased (unoptimised)

test_2791-mprof-OpenBSD-3.10.12-classic

@julian-smith-artifex-com (Collaborator)

Fixed in 1.23.7.


cormier commented Dec 1, 2023

Thank you so much for your work @julian-smith-artifex-com

I have tested the new release with my script and can confirm that the memory leak has been fixed. I would be ready to upgrade, but that would require me to start using the rebased version of PyMuPDF in production and I want to be sure that I don't introduce regressions doing that.

To provide context, my use case involves merging PDFs of varying standards and quality. At the following link you can download a PDF that consists of a coverpage and a scholarly article, merged with PyMuPDF: https://www.erudit.org/en/journals/cjsae/2022-v34-n2-cjsae08031/1099792ar/

I've tested the rebased version with a sample of 10,000 PDFs, focusing on two key questions:

  1. Can the rebased version successfully perform the merge without raising exceptions?
  2. If the merge is successful, do the documents merge as expected? Are there any visual differences between the documents produced by the rebased version and the "standard" version?

Regarding the first question, I observed that the rebased version raises an AttributeError for certain documents that the standard version handles without issue. I'll investigate further and open a separate issue for this.

For the second question, I compared each of the 10,000 PDFs with diff-pdf. For my use case on this set of PDFs, there are no visual differences between the PDFs produced by the rebased version and those produced by the standard version.

Finally, I would like to know: would you recommend using the rebased version in production? Is it stable enough?

Thank you !
