
Memory leak when fetching many #986

Open · krnhotwings opened this issue Jan 23, 2025 · 9 comments

krnhotwings commented Jan 23, 2025

  • Operating System Name: Debian 12 (base of apache/airflow 2.10.4-python3.12 container)
  • Target Db2 Server Version: 11.05.0800
  • Python Version: 3.12.8
  • ibm_db version: 3.2.5
  • For non-Windows, output of below commands:
    uname: Linux
    uname -m: x86_64
  • Value of below environment variables if set:
    PATH: /root/bin:/home/airflow/.local/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    LD_LIBRARY_PATH: /usr/local/lib

Hi folks, there appears to be a memory leak when fetching, introduced in 3.2.4. We noticed it when our Airflow infrastructure ran out of memory, and we were able to pinpoint it to ibm_db 3.2.4+; 3.2.3 does NOT exhibit the issue.

Init stuff

# Shell: start a clean container and install the driver
podman run --rm --name=python -it python:3.12 /bin/bash
pip install ibm_db==3.2.5
python

# Python
import ibm_db
import ibm_db_dbi
dsn = '...'
select_sql = '...' # Select from a very large table

Issues here

# Problems
conn = ibm_db.connect(dsn, '', '')
stmt = ibm_db.exec_immediate(conn, select_sql)
while (x := ibm_db.fetchmany(stmt, 10000)) is not None and len(x) > 0:
    # Memory usage builds up here
    continue

# Also problems
conn = ibm_db_dbi.connect(dsn, '', '')
cur = conn.cursor()
cur.execute(select_sql)
while (x := cur.fetchmany(10000)) is not None and len(x) > 0:
    # Memory usage also builds up here
    print(x)

No issues here

# No issues doing it this way
conn = ibm_db.connect(dsn, '', '')
stmt = ibm_db.prepare(conn, select_sql)
ibm_db.execute(stmt)
while ibm_db.fetch_tuple(stmt):  # fetch_tuple returns False once rows are exhausted
    pass

# No problems here either
conn = ibm_db.connect(dsn, '', '')
stmt = ibm_db.exec_immediate(conn, select_sql)
while ibm_db.fetch_tuple(stmt):
    pass

It seems unusual that the new ibm_db.fetchmany() API and the existing ibm_db_dbi.Cursor.fetchmany() both leak, since they're independent of one another (the latter depends on ibm_db.fetch_tuple() via ibm_db_dbi.Cursor._fetch_helper()).

More background info

What we're doing is getting a sqlalchemy connection and passing it to pandas.read_sql() with a chunksize to stream ETL-related processes. Presumably, pandas.read_sql() calls sqlalchemy.engine.CursorResult.fetchmany(), which in turn presumably calls ibm_db_dbi.Cursor.fetchmany().

Airflow 2.10.4-python3.12 installs sqlalchemy 1.4.54.

Presumably, ibm_db_sa imports ibm_db_dbi.

imavo (Contributor) commented Jan 28, 2025

I suspect a code defect in ibm_db.c (of release 3.2.5) at line numbers 15487 and 15529.
Currently both of these lines contain only a closing brace }.
Changing both lines to read instead:
} else { Py_XDECREF(return_value); }
may resolve the leak. Other solutions may be possible.

bchoudhary6415 (Contributor) commented:

Hello @krnhotwings

Based on your description, I tried running the following script to reproduce the issue:

import pandas as pd
import ibm_db
import ibm_db_dbi

ibm_db_conn = ibm_db.connect("connection_string","", "")

conn = ibm_db_dbi.Connection(ibm_db_conn)

query = "SELECT * FROM animals1"

chunksize = 10000
for chunk in pd.read_sql(query, conn, chunksize=chunksize):
    print(chunk)

conn.close()

I'm getting the expected output without any errors.

As per @imavo's suggestion, I also tried modifying the code as follows:

diff --git a/ibm_db.c b/ibm_db.c
index 5af0bf4..038efbb 100644
--- a/ibm_db.c
+++ b/ibm_db.c
@@ -15488,6 +15488,7 @@ static PyObject* ibm_db_fetchmany(PyObject *self, PyObject *args)
             fetch_count++;
         } else {
             LogMsg(DEBUG, "Fetched value is not a valid row, breaking loop", fileName);
+            Py_XDECREF(return_value);
             break;
         }
     }
@@ -15529,6 +15530,7 @@ static PyObject *ibm_db_fetchall(PyObject *self, PyObject *args)
             }
         } else {
             LogMsg(DEBUG, "Fetched value is not a valid row, breaking loop", fileName);
+            Py_XDECREF(return_value);
             break;
         }
     }

After implementing this change, I'm still getting the expected output without any errors.

Could you please confirm if there's anything I might be missing? Or does the change suggested by @imavo appear to resolve the memory leak issue?

Thank you!

imavo (Contributor) commented Jan 29, 2025

@bchoudhary6415

My approach and code change are both different from yours. I explain further below.

First, my approach was to use the resource module (Linux only; this will not work on MS Windows) to show the memory consumption at each iteration of the loop. On my test hardware there are sufficient resources for garbage collection to happen quickly. My working assumption is that during the loop, memory consumption after each iteration should be fairly constant, or at least not quickly increasing, and should reach some steady-state value after a number of iterations. My select query fetched varchar columns of non-trivial average lengths.

However, if your test environment is MS Windows, you can use a different module to report the memory consumption per loop iteration, as the resource module is Linux-specific.

So my test code to try to recreate the symptom (adjusted with the resource module) is below:

import ibm_db
import resource


def using(point=""):
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return "%s: mem=%s mb" % (point, usage[2] / 1024.0)  # ru_maxrss is reported in KB on Linux


dsn = "..."

select_sql = "select ..."

conn = ibm_db.connect(dsn, '', '') 

stmt = ibm_db.exec_immediate(conn, select_sql)


print(using("before"))

loop_counter=0
while (x := ibm_db.fetchmany(stmt, 10000)) is not None and len(x) > 0:
    # Memory usage builds up here
    loop_counter += 1
    print(using("iteration_"+str(loop_counter)))
    continue

print("\nEnd of loop\n")
print(using("end"))

exit(0)

With ibm_db release 3.2.5, for every iteration (in my case there were 109 iterations) the output of the above script showed memory usage jumping by at least 6 MB per loop iteration, and it never reached a steady state: it went from approximately 45 MB at the start to approximately 625 MB at the end. To me this showed there was a leak, and that the leak size per iteration was similar to the size of the return_value object in ibm_db_fetchmany(). So I suspected that the return_value object in ibm_db_fetchmany() was not being freed.

Next, my code change is different from yours. My intention is that after return_value is successfully appended to the list, Py_XDECREF(return_value) is called and the loop continues. My thinking is that PyList_Append will increment the reference count of return_value (from 1 to 2), so because return_value is not used again in fetchmany until the next call, I want to decrement it to encourage its release at the next call to fetchmany.

After the code change, the output of the script above showed that memory consumption reached a steady state and never increased for the rest of the loop. So decrementing the reference count of the return_value object had a marked effect, but whether this is the correct approach is a question for you to consider.

My code change in ibm_db_fetchmany and ibm_db_fetchall differs from yours and is below:


if (PyList_Append(result_list, return_value) == -1) {
    LogMsg(ERROR, "Failed to append row to result list", fileName);
    Py_XDECREF(result_list);
    return NULL;
} else { Py_XDECREF(return_value); }
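
To make the reference counting concrete, here is a minimal standalone embedded-CPython program (illustrative only, not part of the driver) showing that PyList_Append takes its own reference to the appended object, so the caller's reference must still be released:

#include <Python.h>
#include <stdio.h>

int main(void) {
    Py_Initialize();

    PyObject *list = PyList_New(0);
    PyObject *row = PyUnicode_FromString("row");     /* new reference, refcount 1 */
    printf("before append: %zd\n", Py_REFCNT(row));  /* prints 1 */

    PyList_Append(list, row);                        /* the list adds its OWN reference */
    printf("after append:  %zd\n", Py_REFCNT(row));  /* prints 2 */

    Py_DECREF(row);   /* drop the caller's reference; omitting this line leaks,
                         just as return_value leaked in ibm_db_fetchmany() */
    Py_DECREF(list);  /* destroying the list releases its reference to the row */

    Py_Finalize();
    return 0;
}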

bchoudhary6415 (Contributor) commented:

@imavo
Thank you for the clarifications.

I’m working in a Linux environment. I’ve now applied the correct code change, and after doing so, the output of the script showed that the memory consumption reached a steady state and didn’t increase during the remaining iterations of the loop.

However, when I apply the script below, the memory consumption does not stabilize. Instead, it continues to increase with each loop iteration.

Here’s the script:

import pandas as pd
import ibm_db
import resource
import ibm_db_dbi

def using(point=""):
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return "%s: mem=%s mb" % (point, usage[2] / 1024.0)  # ru_maxrss is reported in KB on Linux

ibm_db_conn = ibm_db.connect("connection_string","", "")

conn = ibm_db_dbi.Connection(ibm_db_conn)

query = "SELECT * FROM animals1"

print(using("before"))

loop_counter=0

chunksize = 10000  
for chunk in pd.read_sql(query, conn, chunksize=chunksize):
    loop_counter += 1
    print(chunk)  # Or do your processing here
    print(using("iteration_"+str(loop_counter)))

conn.close()

print("\nEnd of loop\n")
print(using("end"))

The above script internally uses the fetchmany method of the ibm_db_dbi module from the python-ibmdb library to retrieve the data in chunks.

Could you please let me know if I’m doing anything wrong or if there's a particular reason the memory usage keeps increasing in this case? I would appreciate any further guidance on this.

Thank you!

imavo (Contributor) commented Jan 29, 2025

@bchoudhary6415
To keep the test case as small as possible (i.e. to use as few modules in the venv as possible), I am temporarily avoiding pandas for the moment and concentrating on ibm_db_dbi fetchmany() alone, via the fragment given by @krnhotwings at the start of this ticket, i.e. via while (x := cur.fetchmany(10000)) is not None and len(x) > 0:

The test code below exhibits a leak per loop iteration on the DBI fetchmany():

import ibm_db_dbi
import resource


def using(point=""):
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return "%s: mem=%s mb" % (point, usage[2] / 1024.0)  # ru_maxrss is reported in KB on Linux


dsn = "..."

select_sql = "select ..."

conn = ibm_db_dbi.connect(dsn, '', '')

cur = conn.cursor()

cur.execute( select_sql )

print(using("before"))

loop_counter=0
while (x := cur.fetchmany(10000)) is not None and len(x) > 0:
    loop_counter += 1
    print(using("iteration_"+str(loop_counter)))

print("\nEnd of loop\n")
print(using("end"))

exit(0)

The reason the code change I suggested makes no difference is that the DBI fetchmany() does not use ibm_db.fetchmany(); instead it loops on ibm_db.fetch_tuple().

@krnhotwings states that the 3.2.3 release does not show the leak, but 3.2.4 and 3.2.5 do show a leak with this DBI fetchmany(). I verified this. The biggest change between 3.2.3 and 3.2.4 was the debug logging, but I need to rule out any other changes in the called functions.

So there is more than one leak problem, i.e. more than one code fix may be needed.

Investigation continues.

imavo (Contributor) commented Jan 29, 2025

@bchoudhary6415
There is a pervasive bug introduced by the debug logging added in release 3.2.4: any code that uses the template below will leak (that is to say, every function with arguments that is traced with this template in ibm_db 3.2.4 or 3.2.5 will leak):

   PyObject *argsStr = PyObject_Repr(args); // Get string representation of args
   snprintf(messageStr, sizeof(messageStr), ...

PyObject_Repr() returns a new reference, hence Py_XDECREF(argsStr) is needed everywhere after the representation is logged, or a different approach could be used to keep the logging functionality.
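
As an illustration, a self-contained sketch of the corrected template is below. The names LogMsg, messageStr and fileName mirror those visible in the diff earlier in this thread; the stand-in definitions here exist only so the fragment compiles on its own, and the log message text is invented:

#include <Python.h>
#include <stdio.h>

/* Stand-ins for ibm_db.c's logging globals, only so this sketch compiles alone */
static char messageStr[2048];
static const char *fileName = "ibm_db.c";
static void LogMsg(const char *level, const char *msg, const char *file) {
    fprintf(stderr, "[%s] %s (%s)\n", level, msg, file);
}

/* PyObject_Repr() returns a NEW reference, so it must be released
   once the text has been copied into the log message. */
static void trace_args(PyObject *args) {
    PyObject *argsStr = PyObject_Repr(args);  /* new reference */
    if (argsStr != NULL) {
        snprintf(messageStr, sizeof(messageStr),
                 "called with args: %s", PyUnicode_AsUTF8(argsStr));
        LogMsg("DEBUG", messageStr, fileName);
    }
    Py_XDECREF(argsStr);  /* the missing decref */
}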

This appears to resolve the DBI fetchmany() leak.

I will check if this also resolves the pandas issue, but in the meantime you might want to work on a code change.

imavo (Contributor) commented Jan 29, 2025

@bchoudhary6415 I checked with the pandas example after my second code change (to dispose of the argsStr objects), and found that it also appears to resolve the leak seen with pandas; memory consumption reaches a steady state.

bchoudhary6415 (Contributor) commented:

@imavo
Thank you for the suggestion. I will add the code Py_XDECREF(argsStr) where necessary.
However, I suspect that additional changes might be required. I’ll investigate further and update the code accordingly.

Thank you!

imavo (Contributor) commented Jan 30, 2025

@bchoudhary6415 When debug_mode is zero, it would be wise for no heap allocation to happen that exists exclusively for logging. In other words, the PyObject_Repr() call ought to be conditional on debugging being enabled, as in the sketch below.
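
A minimal sketch of that guard, reusing the hypothetical stand-ins (messageStr, LogMsg, fileName) from the earlier fragment; debug_mode is passed in here for illustration, while the actual flag in ibm_db.c may be a global with a different name:

/* Build the repr only when debug logging is enabled, so that no heap
   allocation happens purely for logging when debug_mode is zero. */
static void trace_args_guarded(PyObject *args, int debug_mode) {
    if (!debug_mode) {
        return;  /* no allocation at all in the common case */
    }
    PyObject *argsStr = PyObject_Repr(args);  /* new reference */
    if (argsStr != NULL) {
        snprintf(messageStr, sizeof(messageStr),
                 "called with args: %s", PyUnicode_AsUTF8(argsStr));
        LogMsg("DEBUG", messageStr, fileName);
        Py_DECREF(argsStr);  /* release immediately after logging */
    }
}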
