Experiencing small memory leak in save() #2791
Can you please be so kind and repeat your check with the new, alternative ("rebased") PyMuPDF implementation?
Thank you for your reply. I have repeated the test after changing the import statement to `import fitz_new as fitz`. I have started my script with […]. Here is the output of my logs after 887 iterations:

[…]

We can see that the count and size values are higher on each iteration. The references now point to:

```python
def Document_save(*args):
    return _extra.Document_save(*args)
```
I'd like to reproduce this. Could you post your complete test Python code?
Thank you for your reply. Please find a Python script that will hopefully reproduce the problem.

```python
import gc
#import tracemalloc

import fitz_new as fitz


#def take_memory_snapshot(iteration_count):
#    snapshot = tracemalloc.take_snapshot()
#    for i, filename_stat in enumerate(snapshot.statistics("filename")[:20], 1):
#        if "fitz" in str(filename_stat):
#            print(f"filename_stat iteration={iteration_count} i={i} stat={str(filename_stat)}")
#    print("=" * 50)

#tracemalloc.start()


def merge_pdf(content: bytes, coverpage: bytes):
    with fitz.Document(stream=coverpage, filetype="pdf") as coverpage_pdf, fitz.Document(
        stream=content, filetype="pdf"
    ) as content_pdf:
        coverpage_pdf.insert_pdf(content_pdf)
        doc = coverpage_pdf.write()
    return doc


def reproduce():
    for iteration_count in range(1, 1000):
        with open("content.pdf", "rb") as content_pdf, open("coverpage.pdf", "rb") as coverpage_pdf:
            content = content_pdf.read()
            coverpage = coverpage_pdf.read()
        merge_pdf(content, coverpage)
        print(iteration_count)
        gc.collect()
        #take_memory_snapshot(iteration_count)


reproduce()
```

The code will merge content.pdf and coverpage.pdf on each of the 999 iterations. Commented out is code that takes a memory snapshot and prints any line related to fitz. When I run the code like this: […], I can run […]
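(The exact profiling command is missing from this capture. As an illustrative stand-in only, one way to watch per-iteration memory growth from inside the script, without an external profiler such as mprof, is a helper like the hypothetical `rss_kib` below; it is not part of the original report.)

```python
# Hypothetical helper, not from the original report: sample the process's
# peak resident set size using the standard-library resource module
# (Unix-only).
import resource

def rss_kib() -> int:
    # On Linux, ru_maxrss is reported in KiB (on macOS it is in bytes).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# e.g. inside reproduce()'s loop:
#     print(iteration_count, rss_kib())
```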
Many thanks for your code. It looks like the leak does not occur with the new rebased implementation (`fitz_new`). Here are mprof graphs showing behaviour with Python 3.10 and 3.11 (different OSes, but I'm assuming this is not significant).

test_2791-mprof-OpenBSD-3.10.12-rebased2.pdf

I'll have a look for a leak in the rebased optimisation code (written in C++, and mostly copied from the original classic implementation).
Thank you for your reply. I confirm that there is no leak when using `fitz_new`.
That's good to know, thanks. Additional testing of the rebased implementation is always very useful. I have a patch to the rebased implementation's optimisation code that seems to address most of the problem, reducing the increase over 1000 iterations of […].
We need to call Py_XDECREF() on the value returned by PyObject_CallMethodObjArgs().
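For illustration, here is a minimal sketch of the reference-count leak pattern described above, in CPython C-API terms. This is not the actual PyMuPDF source, just the general shape of the bug and its fix:

```c
#include <Python.h>

/* Sketch only -- not PyMuPDF code. PyObject_CallMethodObjArgs()
 * returns a *new* reference to the call's result; if the caller
 * discards that pointer without releasing it, the result object
 * (here, potentially a large bytes object) is never freed. */
static void call_and_leak(PyObject *doc, PyObject *method_name)
{
    PyObject_CallMethodObjArgs(doc, method_name, NULL);  /* result leaks */
}

static void call_and_release(PyObject *doc, PyObject *method_name)
{
    PyObject *result = PyObject_CallMethodObjArgs(doc, method_name, NULL);
    /* Py_XDECREF() is NULL-safe, so this also handles the error case
     * where the call returned NULL. */
    Py_XDECREF(result);
}
```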
Fixed in 1.23.7.
Thank you so much for your work @julian-smith-artifex-com. I have tested the new release with my script and can confirm that the memory leak has been fixed.

I would be ready to upgrade, but that would require me to start using the rebased version of PyMuPDF in production, and I want to be sure that I don't introduce regressions by doing so. For context, my use case involves merging PDFs of varying standards and quality. If you follow this link, you can download a PDF that consists of a coverpage and a scholarly article; the merging of the two documents was done with PyMuPDF: https://www.erudit.org/en/journals/cjsae/2022-v34-n2-cjsae08031/1099792ar/

I've tested the rebased version with a sample of 10,000 PDFs, focusing on two key questions: whether the rebased version can process the same documents as the standard version, and whether it produces the same output.

Regarding the first question, I observed that the rebased version raises an AttributeError for certain documents that the standard version handles without issue. I'll investigate further and open a separate issue for this. Regarding the second question, I compared each of the 10,000 PDFs with diff-pdf: for my use case on this set of PDFs, there are no visual differences between the PDFs produced by the rebased version and those produced by the standard version.

Finally, I would like to know: would you recommend using the rebased version in production? Is it stable enough? Thank you!
Please provide all mandatory information!
Describe the bug (mandatory)
I think there's a memory leak in the `save()` method in fitz.

To Reproduce (mandatory)
My code adds a PDF cover page to an existing PDF. It merges the two PDFs in memory using fitz's `insert_pdf` method. The merged PDF is converted to `bytes` using the `write` method.

The method that reproduces the memory leak is as follows:
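(The code block is missing from this capture; it is presumably the same `merge_pdf` function posted in the comments above:)

```python
# Presumably the merge_pdf function shown earlier in the thread.
import fitz

def merge_pdf(content: bytes, coverpage: bytes):
    with fitz.Document(stream=coverpage, filetype="pdf") as coverpage_pdf, fitz.Document(
        stream=content, filetype="pdf"
    ) as content_pdf:
        coverpage_pdf.insert_pdf(content_pdf)
        doc = coverpage_pdf.write()
    return doc
```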
A second method takes a snapshot of memory and displays fitz-related statistics by file name and line number.
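(This is presumably the `take_memory_snapshot` helper that appears, commented out, in the reproduction script above:)

```python
import tracemalloc

def take_memory_snapshot(iteration_count):
    # Print the top tracemalloc statistics, by file name, that mention fitz.
    snapshot = tracemalloc.take_snapshot()
    for i, filename_stat in enumerate(snapshot.statistics("filename")[:20], 1):
        if "fitz" in str(filename_stat):
            print(f"filename_stat iteration={iteration_count} i={i} stat={str(filename_stat)}")
    print("=" * 50)
```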
Finally, I iterate over a set of ~6000 PDFs. For each PDF, I call the merge_pdf method and then the take_snapshot method. During the initial iterations, fitz doesn't appear in the list of statistics by file name or line number. After a few iterations, fitz does appear, and I can see that both the memory attributed to it and the number of objects it refers to increase on each iteration, with the memory growing by 1 to 3 KiB per iteration.
Unfortunately, my diagnostic skills are limited at this stage, as I have no experience with SWIG. What could be the next steps to diagnose the memory leak more precisely and eventually solve it? Does the code in `Document_save` in `fitz.py` match the contents of lines 1999 to 2081 in `fitz.i`? Am I right in thinking that the cause of the memory leak lies within this code in `fitz.i`? What should I look for to solve the problem?

Your configuration (mandatory)