Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak in scran normalization #155

Closed
scottgigante opened this issue Sep 10, 2020 · 15 comments
Closed

Memory leak in scran normalization #155

scottgigante opened this issue Sep 10, 2020 · 15 comments

Comments

@scottgigante
Copy link
Contributor

scottgigante commented Sep 10, 2020

Seems that scran normalization is a) densifying the data and b) not freeing it afterwards.

import openproblems
import psutil
import gc
import time
import scIB


def mem_available():
    time.sleep(1)
    gc.collect()
    time.sleep(1)
    return psutil.virtual_memory().available


initial_memory = mem_available()
adata = openproblems.data.zebrafish.load_zebrafish(test=True)
post_load_memory = mem_available()
del adata
post_delete_memory = mem_available()

adata = openproblems.data.zebrafish.load_zebrafish(test=True)
pre_norm_memory = mem_available()
scIB.preprocessing.normalize(adata)
post_norm_memory = mem_available()
del adata
post_norm_delete_memory = mem_available()

import matplotlib.pyplot as plt
import scprep
fig, ax = plt.subplots()
ax.plot(
    range(6), 
    (
        initial_memory,
        post_load_memory,
        post_delete_memory,
        pre_norm_memory,
        post_norm_memory,
        post_norm_delete_memory,
    )
)

scprep.plot.tools.label_axis(
    ax.xaxis, 
    ticks=range(6),
    ticklabels=[
        "initial",
        "load",
        "delete",
        "reload",
        "norm",
        "delete",
    ],
    ticklabel_rotation=45, 
    ticklabel_horizontal_alignment='right'
)

fig.tight_layout()

Plot shows available memory at each stage of the process.
mem_usage

Related: openproblems-bio/openproblems#54

@LuckyMD
Copy link
Collaborator

LuckyMD commented Sep 10, 2020

Hmm, thanks for the bug report. This likely has to do with using an R function from python. As scran is run in R, a copy of the data matrix is made and converted into an R matrix object. That object is not deleted at the moment, and therefore the memory is never freed. The R kernel is also probably still running the background.

I'm not sure how to close the R kernel, but we should be able to delete the R object after moving it back at least.

@scottgigante
Copy link
Contributor Author

This seems like a workable solution:

def r_garbage_collection():
    rpy2.robjects.r('rm(list=ls())')
    rpy2.robjects.r('gc()')

image

@LuckyMD
Copy link
Collaborator

LuckyMD commented Sep 10, 2020

I just tested more or less this in #156, but it didn't work for me... Maybe there's more than the data object stored in the R session though....

Ah, you called the python garbage collector afterwards... i thought this would have sufficed in R.

@scottgigante
Copy link
Contributor Author

Worth noting too that I actually had to call the python garbage collector multiple times for the freed memory to register in psutil, but this shouldn't be necessary for real use. See https://stackoverflow.com/questions/5199334/clearing-memory-used-by-rpy2#comment5873738_5201851

@LuckyMD
Copy link
Collaborator

LuckyMD commented Sep 10, 2020

Okay, so there could still be a residual issue with the R kernel still existing even after the function is closed. Maybe an ro.r('q()') works for that as well...

@scottgigante
Copy link
Contributor Author

Just tried -- ro.r('q()') closes python too.

@LuckyMD
Copy link
Collaborator

LuckyMD commented Sep 10, 2020

that's what I feared....

@LuckyMD
Copy link
Collaborator

LuckyMD commented Sep 10, 2020

I have a worse example with python garbage collection:
Screenshot 2020-09-11 at 01 49 34

I'm using sc.datasets.pbmc3k() here, as zebrafish loading doesn't work for me cuz of a HTTP error.

@scottgigante
Copy link
Contributor Author

Did you try running gc.collect() three times in a row? That made a difference for me.

@LuckyMD
Copy link
Collaborator

LuckyMD commented Sep 11, 2020

I already am... trying 10 times:

Screenshot 2020-09-11 at 10 42 02

It's okay, but the loaded R libraries are still there I guess.

@LuckyMD
Copy link
Collaborator

LuckyMD commented Sep 11, 2020

I've now unloaded modules as well in #156, but haven't yet managed to close the session. Still waiting on an answer in rpy2/rpy2#730. Could you take a look and see if it works any better for you with the solution in the PR?

@scottgigante
Copy link
Contributor Author

Looks good to me. Seems releasing memory back to system is a finicky thing so measuring using psutil is imperfect, but I think this is doing the job. The first time I run this it looks back, second time it looks good.

import psutil
import gc
import time
import scIB
import scanpy as sc
import rpy2.robjects as ro


def mem_available():
    time.sleep(1)
    gc.collect()
    gc.collect()
    gc.collect()
    time.sleep(1)
    return psutil.virtual_memory().available


initial_memory = mem_available()
adata = sc.datasets.pbmc3k()
post_load_memory = mem_available()
del adata
post_delete_memory = mem_available()

adata = sc.datasets.pbmc3k()
pre_norm_memory = mem_available()
scIB.preprocessing.normalize(adata)
post_norm_memory = mem_available()
del adata
post_norm_delete_memory = mem_available()
ro.r("ls()")
ro.r("rm(list=ls())")
ro.r("gc()")
post_gc_memory = mem_available()


import matplotlib.pyplot as plt
import scprep
fig, ax = plt.subplots()
ax.plot(
    range(7), 
    (
        initial_memory,
        post_load_memory,
        post_delete_memory,
        pre_norm_memory,
        post_norm_memory,
        post_norm_delete_memory,
        post_gc_memory
    )
)

scprep.plot.tools.label_axis(
    ax.xaxis, 
    ticks=range(7),
    ticklabels=[
        "initial",
        "load",
        "delete",
        "reload",
        "norm",
        "delete",
        "gc",
    ],
    ticklabel_rotation=45, 
    ticklabel_horizontal_alignment='right'
)
ax.set_ylabel("Memory Available")

fig.tight_layout()

mem_usage

@scottgigante
Copy link
Contributor Author

scottgigante commented Sep 11, 2020

Another point: we should probably coerce to dgCMatrix. See MarioniLab/scran#70

@LuckyMD
Copy link
Collaborator

LuckyMD commented Sep 17, 2020

I just saw maybe we can improve further using rpy2.rinterface.endr() to end the R session in the background. The only problem would be that we can't easily restart that session. I guess this might not be necessary though if we reload rpy2 every time the function is called.

@scottgigante
Copy link
Contributor Author

Seems like a good idea. The startup cost is low and the chances the user wants to use R again are pretty slim, right?

Also PR #164 addresses my previous comment and should save a lot of time and memory.

@LuckyMD LuckyMD closed this as completed Oct 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants