
Memory leak when fetching many #986

Open · krnhotwings opened this issue Jan 23, 2025 · 9 comments

krnhotwings commented Jan 23, 2025

  • Operating System Name: Debian 12 (base of apache/airflow 2.10.4-python3.12 container)
  • Target Db2 Server Version: 11.05.0800
  • Python Version: 3.12.8
  • ibm_db version: 3.2.5
  • For non-Windows, output of below commands:
    uname: Linux
    uname -m: x86_64
  • Value of below environment variables if set:
    PATH: /root/bin:/home/airflow/.local/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    LD_LIBRARY_PATH: /usr/local/lib

Hi folks, there appears to be a memory leak when fetching, introduced in 3.2.4. We noticed it when our Airflow infrastructure ran out of memory, and we were able to pinpoint it to ibm_db 3.2.4+; 3.2.3 does NOT exhibit the issue.

Init stuff

# Shell: start a clean container and install the driver
podman run --rm --name=python -it python:3.12 /bin/bash
pip install ibm_db==3.2.5
python

# Python
import ibm_db
import ibm_db_dbi
dsn = '...'
select_sql = '...' # Select from a very large table

Issues here

# Problems
conn = ibm_db.connect(dsn, '', '')
stmt = ibm_db.exec_immediate(conn, select_sql)
while (x := ibm_db.fetchmany(stmt, 10000)) is not None and len(x) > 0:
    # Memory usage builds up here
    continue

# Also problems
conn = ibm_db_dbi.connect(dsn, '', '')
cur = conn.cursor()
cur.execute(select_sql)
while (x := cur.fetchmany(10000)) is not None and len(x) > 0:
    # Memory usage also builds up here
    print(x)

No issues here

# No issues doing it this way
conn = ibm_db.connect(dsn, '', '')
stmt = ibm_db.prepare(conn, select_sql)
ibm_db.execute(stmt)
while ibm_db.fetch_tuple(stmt):  # fetch_tuple returns False once rows are exhausted
    pass

# No problems here either
conn = ibm_db.connect(dsn, '', '')
stmt = ibm_db.exec_immediate(conn, select_sql)
while ibm_db.fetch_tuple(stmt):
    pass

It seems unusual that the new ibm_db.fetchmany() API and the existing ibm_db_dbi.Cursor.fetchmany() both leak, since they're independent of one another (the latter depends on ibm_db.fetch_tuple() via ibm_db_dbi.Cursor._fetch_helper()).

More background info

What we're doing is getting a sqlalchemy connection and passing it to pandas.read_sql() with a chunksize to stream ETL-related processes. Presumably, pandas.read_sql() calls sqlalchemy.engine.CursorResult.fetchmany(), which in turn presumably calls ibm_db_dbi.Cursor.fetchmany().

Airflow 2.10.4-python3.12 installs sqlalchemy 1.4.54.

Presumably, ibm_db_sa imports ibm_db_dbi.

imavo (Contributor) commented Jan 28, 2025

I suspect a code defect in ibm_db.c (of release 3.2.5) at line numbers 15487 and 15529.
Currently both of these lines contain only a closing brace }.
Changing both lines to read instead:
} else { Py_XDECREF(return_value); }
may resolve the leak. Other solutions may be possible.

bchoudhary6415 (Contributor) commented:

Hello @krnhotwings

Based on your description, I tried running the following script to reproduce the issue:

import pandas as pd
import ibm_db
import ibm_db_dbi

ibm_db_conn = ibm_db.connect("connection_string","", "")

conn = ibm_db_dbi.Connection(ibm_db_conn)

query = "SELECT * FROM animals1"

chunksize = 10000
for chunk in pd.read_sql(query, conn, chunksize=chunksize):
    print(chunk)

conn.close()

I'm getting the expected output without any errors.

As per @imavo's suggestion, I also tried modifying the code as follows:

diff --git a/ibm_db.c b/ibm_db.c
index 5af0bf4..038efbb 100644
--- a/ibm_db.c
+++ b/ibm_db.c
@@ -15488,6 +15488,7 @@ static PyObject* ibm_db_fetchmany(PyObject *self, PyObject *args)
             fetch_count++;
         } else {
             LogMsg(DEBUG, "Fetched value is not a valid row, breaking loop", fileName);
+            Py_XDECREF(return_value);
             break;
         }
     }
@@ -15529,6 +15530,7 @@ static PyObject *ibm_db_fetchall(PyObject *self, PyObject *args)
             }
         } else {
             LogMsg(DEBUG, "Fetched value is not a valid row, breaking loop", fileName);
+            Py_XDECREF(return_value);
             break;
         }
     }

After implementing this change, I'm still getting the expected output without any errors.

Could you please confirm if there's anything I might be missing? Or does the change suggested by @imavo appear to resolve the memory leak issue?

Thank you!

imavo (Contributor) commented Jan 29, 2025

@bchoudhary6415

My approach and code change are both different from yours. I explain further below.

First, my approach was to use the resource module (Linux only; this will not work on MS Windows) to show the memory consumption at each iteration of the loop. On my test hardware there are sufficient resources for garbage collection to happen quickly. My working assumption is that during the loop, memory consumption after each iteration should be fairly constant, or at least not quickly increasing, and should reach some steady-state value after a number of iterations. My select query fetched varchar columns of non-trivial average lengths.

However, if your test environment is MS Windows, you can use a different module to report the memory consumption per loop iteration, as the resource module is Linux-specific.

So my test code to try to recreate the symptom (adjusted with the resource module) is below:

import ibm_db
import resource


def using(point=""):
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return "%s: mem=%s mb" % (point, usage[2] / 1024.0)  # ru_maxrss is reported in KB on Linux


dsn = "..."

select_sql = "select ..."

conn = ibm_db.connect(dsn, '', '') 

stmt = ibm_db.exec_immediate(conn, select_sql)


print(using("before"))

loop_counter=0
while (x := ibm_db.fetchmany(stmt, 10000)) is not None and len(x) > 0:
    # Memory usage builds up here
    loop_counter += 1
    print(using("iteration_"+str(loop_counter)))
    continue

print("\nEnd of loop\n")
print(using("end"))

exit(0)

With ibm_db release 3.2.5, for every iteration (in my case there were 109 iterations) the output of the above script showed memory usage jumping by at least 6 MB per loop iteration, and it never reached a steady state: it went from approximately 45 MB at the start to approximately 625 MB at the end. To me this showed there was a leak, and that the leak size per iteration was similar to the size of the return_value object in ibm_db_fetchmany(). So I suspected that the return_value object in ibm_db_fetchmany() was not being freed.

Next, my code change is different from yours. My intention is that after return_value is successfully appended to the list, Py_XDECREF(return_value) is called and the loop continues. My thinking is that PyList_Append will increment the reference count of return_value (from 1 to 2), so because return_value is not used again in fetchmany until the next call, I want to decrement it to encourage its release at the next call to fetchmany.

After the code change, the output of the script above showed that memory consumption reached a steady state and never increased for the rest of the loop. So decrementing the reference count of the return_value object had a marked effect, but whether this is the correct approach is a question for you to consider.

My code change in ibm_db_fetchmany and ibm_db_fetchall differs from yours and is below:


if (PyList_Append(result_list, return_value) == -1) {
    LogMsg(ERROR, "Failed to append row to result list", fileName);
    Py_XDECREF(result_list);
    return NULL;
} else { Py_XDECREF(return_value); }
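
To make the reference counting concrete, here is a minimal standalone embedded-CPython program (illustrative only, not part of the driver) showing that PyList_Append takes its own reference to the appended object, so the caller's reference must still be released:

#include <Python.h>
#include <stdio.h>

int main(void) {
    Py_Initialize();

    PyObject *list = PyList_New(0);
    PyObject *row = PyUnicode_FromString("row");     /* new reference, refcount 1 */
    printf("before append: %zd\n", Py_REFCNT(row));  /* prints 1 */

    PyList_Append(list, row);                        /* the list adds its OWN reference */
    printf("after append:  %zd\n", Py_REFCNT(row));  /* prints 2 */

    Py_DECREF(row);   /* drop the caller's reference; omitting this line leaks,
                         just as return_value leaked in ibm_db_fetchmany() */
    Py_DECREF(list);  /* destroying the list releases its reference to the row */

    Py_Finalize();
    return 0;
}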

bchoudhary6415 (Contributor) commented:

@imavo
Thank you for the clarifications.

I’m working in a Linux environment. I’ve now applied the correct code change, and after doing so, the output of the script showed that the memory consumption reached a steady state and didn’t increase during the remaining iterations of the loop.

However, when I apply the script below, the memory consumption does not stabilize. Instead, it continues to increase with each loop iteration.

Here’s the script:

import pandas as pd
import ibm_db
import resource
import ibm_db_dbi

def using(point=""):
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return "%s: mem=%s mb" % (point, usage[2] / 1024.0)  # ru_maxrss is reported in KB on Linux

ibm_db_conn = ibm_db.connect("connection_string","", "")

conn = ibm_db_dbi.Connection(ibm_db_conn)

query = "SELECT * FROM animals1"

print(using("before"))

loop_counter=0

chunksize = 10000  
for chunk in pd.read_sql(query, conn, chunksize=chunksize):
    loop_counter += 1
    print(chunk)  # Or do your processing here
    print(using("iteration_"+str(loop_counter)))

conn.close()

print("\nEnd of loop\n")
print(using("end"))

The above script internally uses the fetchmany method of the ibm_db_dbi module from the python-ibmdb library to retrieve the data in chunks.

Could you please let me know if I’m doing anything wrong or if there's a particular reason the memory usage keeps increasing in this case? I would appreciate any further guidance on this.

Thank you!

imavo (Contributor) commented Jan 29, 2025

@bchoudhary6415
To keep the test case as small as possible (i.e. to use as few modules in the venv as possible), I am temporarily avoiding pandas for the moment and concentrating on ibm_db_dbi fetchmany() alone, via the fragment given by @krnhotwings at the start of this ticket, i.e. via while (x := cur.fetchmany(10000)) is not None and len(x) > 0:

The test code below exhibits a leak per loop iteration on the DBI fetchmany():

import ibm_db_dbi
import resource


def using(point=""):
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return "%s: mem=%s mb" % (point, usage[2] / 1024.0)  # ru_maxrss is reported in KB on Linux


dsn = "..."

select_sql = "select ..."

conn = ibm_db_dbi.connect(dsn, '', '')

cur = conn.cursor()

cur.execute( select_sql )

print(using("before"))

loop_counter=0
while (x := cur.fetchmany(10000)) is not None and len(x) > 0:
    loop_counter += 1
    print(using("iteration_"+str(loop_counter)))

print("\nEnd of loop\n")
print(using("end"))

exit(0)

The reason the code change I suggested makes no difference is that the DBI fetchmany() does not use ibm_db.fetchmany(); instead it loops on ibm_db.fetch_tuple().

@krnhotwings states that the 3.2.3 release does not show the leak, but 3.2.4 and 3.2.5 do show a leak with this DBI fetchmany(). I verified this. The biggest change between 3.2.3 and 3.2.4 was the debug logging, but I need to rule out any other changes in the called functions.

So there is more than one leak problem, i.e. more than one code fix may be needed.

Investigation continues.

imavo (Contributor) commented Jan 29, 2025

@bchoudhary6415
There is a pervasive bug introduced by the debug logging added in release 3.2.4: any code that uses the template below will leak (that is to say, every function with arguments that is traced with this template in ibm_db 3.2.4 or 3.2.5 will leak):

   PyObject *argsStr = PyObject_Repr(args); // Get string representation of args
   snprintf(messageStr, sizeof(messageStr), ...

PyObject_Repr() returns a new reference, hence Py_XDECREF(argsStr) is needed everywhere after the representation is logged, or a different approach could be used to keep the logging functionality.
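
As an illustration, a self-contained sketch of the corrected template is below. The names LogMsg, messageStr and fileName mirror those visible in the diff earlier in this thread; the stand-in definitions here exist only so the fragment compiles on its own, and the log message text is invented:

#include <Python.h>
#include <stdio.h>

/* Stand-ins for ibm_db.c's logging globals, only so this sketch compiles alone */
static char messageStr[2048];
static const char *fileName = "ibm_db.c";
static void LogMsg(const char *level, const char *msg, const char *file) {
    fprintf(stderr, "[%s] %s (%s)\n", level, msg, file);
}

/* PyObject_Repr() returns a NEW reference, so it must be released
   once the text has been copied into the log message. */
static void trace_args(PyObject *args) {
    PyObject *argsStr = PyObject_Repr(args);  /* new reference */
    if (argsStr != NULL) {
        snprintf(messageStr, sizeof(messageStr),
                 "called with args: %s", PyUnicode_AsUTF8(argsStr));
        LogMsg("DEBUG", messageStr, fileName);
    }
    Py_XDECREF(argsStr);  /* the missing decref */
}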

This appears to resolve the DBI fetchmany() leak.

I will check if this also resolves the pandas issue, but in the meantime you might want to work on a code change.

imavo (Contributor) commented Jan 29, 2025

@bchoudhary6415 I checked with the pandas example after my second code change (to dispose of the argsStr objects), and found that it also appears to resolve the leak seen with pandas; memory consumption reaches a steady state.

bchoudhary6415 (Contributor) commented:

@imavo
Thank you for the suggestion. I will add the code Py_XDECREF(argsStr) where necessary.
However, I suspect that additional changes might be required. I’ll investigate further and update the code accordingly.

Thank you!

imavo (Contributor) commented Jan 30, 2025

@bchoudhary6415 When debug_mode is zero, it would be wise for no heap allocation to happen that exists exclusively for logging. In other words, the PyObject_Repr() call ought to be conditional on debugging being enabled, as in the sketch below.
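
A minimal sketch of that guard, reusing the hypothetical stand-ins (messageStr, LogMsg, fileName) from the earlier fragment; debug_mode is passed in here for illustration, while the actual flag in ibm_db.c may be a global with a different name:

/* Build the repr only when debug logging is enabled, so that no heap
   allocation happens purely for logging when debug_mode is zero. */
static void trace_args_guarded(PyObject *args, int debug_mode) {
    if (!debug_mode) {
        return;  /* no allocation at all in the common case */
    }
    PyObject *argsStr = PyObject_Repr(args);  /* new reference */
    if (argsStr != NULL) {
        snprintf(messageStr, sizeof(messageStr),
                 "called with args: %s", PyUnicode_AsUTF8(argsStr));
        LogMsg("DEBUG", messageStr, fileName);
        Py_DECREF(argsStr);  /* release immediately after logging */
    }
}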
