Memory leak? #6046
doing this in a loop shows no problem
python 'holds' onto memory even after allocation; it reuses it for the next allocation. A leak would show memory steadily increasing.
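As a rough illustration (not the exact reproduction from the report, which isn't shown here), a loop like the following should plateau rather than grow if there is no leak; the file/key names are hypothetical and `psutil.virtual_memory()` stands in for the memory probe:

```python
import gc
import psutil
import pandas as pd

for i in range(10):
    # 'test.h5' / 'df' are placeholders for whatever file is being read.
    df = pd.read_hdf('test.h5', 'df')
    del df
    gc.collect()
    # Memory held by the interpreter is reused, so 'used' should level off
    # after the first iteration or two; a real leak would keep climbing.
    print(i, psutil.virtual_memory().used // 2**20, 'MB in use')
```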
Cool. Is it possible to release the memory somehow? I have huge memory problems when reading HDFs. When I am done reading the file, the final dataframe takes just about 20% of the memory used for reading.
possibly #2659 has something for you. Closing as not a bug.
Python generally doesn't give memory back to the os; I don't think there is any way to do this (except by exiting the process). If you are processing HDF, use the chunk iterator if possible; that way it won't increase too much. I process HDF this way: I run a process to do a computation (and create an output / new HDF file), then exit the process. (I actually multi-process this, as the computations and output files are independent.) See this recent question (the bottom of my answer) for a nice pattern: http://stackoverflow.com/questions/21295329/fastest-way-to-copy-columns-from-one-dataframe-to-another-using-pandas/21296133?noredirect=1#comment32114620_21296133
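A rough sketch of that pattern, assuming one independent computation per output file; the file names and the body of `process_one` are illustrative, not taken from the thread:

```python
import multiprocessing as mp
import pandas as pd

def process_one(in_path, out_path):
    # Do the whole computation inside a child process; whatever memory the
    # interpreter held onto is returned to the OS when the process exits.
    df = pd.read_hdf(in_path, 'df')            # or a chunked select
    result = df                                 # ... real computation goes here ...
    result.to_hdf(out_path, 'df', format='t')

if __name__ == '__main__':
    jobs = [('in_0.h5', 'out_0.h5'), ('in_1.h5', 'out_1.h5')]
    for in_path, out_path in jobs:
        p = mp.Process(target=process_one, args=(in_path, out_path))
        p.start()
        p.join()   # run serially; use a Pool to run the jobs in parallel
```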
@Marigold if you post your code for what you are doing I can take a look
I am trying to merge HDF files on disk (a very painful experience). I ended up using smaller chunks too; for now it seems quite OK.
@Marigold ok...lmk...as I said I do this a lot; it shouldn't be painful :) You are using
you may find these useful: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore, specifically this:
I saw your post on SO, very nice solution. However, I need to do an outer join. Don't know if it can be done in a similar way. At the moment I have about three types of joins: one sorts the data first (sorting takes ages) and then iterates over both of them, the second merges indices first and selects data with
sounds like an interesting problem....can you put up what you are doing in memory (e.g. an example which is all memory-based)? I can think about it
Here is a slightly more complex example:

```python
import pandas as pd

left = pd.DataFrame({'a': [0] * 5, 'b': [1] * 5})
right = pd.DataFrame({'a': range(5), 'c': ['a' * i for i in range(5)]})

left.merge(right, on='a', how='outer')
```

It shows all the possible problems: the outer join, NaN values for int columns (the column was int, but now has to be float because of NaN), and `min_itemsize` for strings (although this can be found easily in the metadata). I was thinking about how to do the outer join with the example you provided and it seems pretty intuitive: once you have the inner join, just go over both dataframes, look for index values not in the inner join, and append them to the inner join afterwards. Unfortunately, it takes three nested iterations over the dataframes (as in your example).
you might be able to do this by just selecting the index values of the table (that gives you basically a frame with the index values AND an integer index, which is in fact the coordinates of those index values) - call these the coordinates. Then you can do your joins in memory (keeping those coordinates around), then select the final result using those coordinates. That way you don't actually bring in any data until you need it.
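A minimal sketch of that coordinate-based idea, assuming both tables are stored in table format under hypothetical keys 'left' and 'right' in a hypothetical file; it leans on `HDFStore.select_column` for the index values and on passing row coordinates back to `select` as a `where`:

```python
import pandas as pd

with pd.HDFStore('inputs.h5', mode='r') as store:   # hypothetical file and keys
    # Index values of each table, keyed by their on-disk row number
    # (i.e. the coordinates of those rows).
    left_idx = store.select_column('left', 'index')
    right_idx = store.select_column('right', 'index')

    # Do the outer join on the key values alone, in memory,
    # carrying the coordinates along.
    keys = pd.merge(
        left_idx.reset_index(name='key').rename(columns={'index': 'left_coord'}),
        right_idx.reset_index(name='key').rename(columns={'index': 'right_coord'}),
        on='key', how='outer',
    )

    # Only now pull the actual rows, by coordinate, for the rows you need.
    left_rows = store.select('left', where=pd.Index(keys['left_coord'].dropna().astype('int64')))
    right_rows = store.select('right', where=pd.Index(keys['right_coord'].dropna().astype('int64')))
```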
That's what I do now. It is the fastest method so far (still, the selection by index is very slow), slightly faster than looping over the two dataframes as in your example. I don't think it can get any better, so I would close this discussion for now until something "new" appears. Thanks a lot for the help!
Back to the original question, now in relation to HDF itself. I have the following script:

```python
import pandas as pd
import psutil
import gc

def create_hdf():
    d = pd.DataFrame({'a': range(int(2e7)), 'b': range(int(2e7))})
    d.to_hdf('test.h5', 'df', format='t')

def load_hdf(columns):
    # psutil.phymem_usage() is from older psutil; newer releases expose the
    # same information as psutil.virtual_memory().
    before = psutil.phymem_usage()
    print('Before', before)
    d = pd.read_hdf('test.h5', 'df', columns=columns)
    gc.collect()
    print('MB used by dataframe itself: {:.2f}'.format(float(d.values.nbytes) / 2**20))
    after = psutil.phymem_usage()
    print('After', after)
    print('Memory change in MB {:.2f}'.format((after.used - before.used) / float(2**20)))
```

And here are the results for different values of `columns`:

- `columns = None`
- `columns = []`
- `columns = ['a']`

I can limit this "memory leak" to some extent by using
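For reference, a hypothetical driver for the script above; the actual calls and the measured numbers aren't reproduced in this thread:

```python
if __name__ == '__main__':
    create_hdf()
    # The three cases compared above: full load, no columns, a single column.
    for cols in (None, [], ['a']):
        load_hdf(cols)
```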
setting columns only causes a reindex; hdf is row oriented, so it will bring in ALL the columns no matter what u ask and just reindex to give you back what u want

if u want to limit peak memory, definitely use an iterator and concatenate

generally I try to work on smaller parts of my stores at once; if I need the entire thing u can chunk by iterator, or by looping over another axis of the data and selecting (eg you can select the unique values for a particular field, then loop over those)

if u do heavily column oriented stuff you really need a column store, see #4454 - want to contribute on this?
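A minimal sketch of the iterator-and-concatenate approach, reusing the 'test.h5'/'df' names from the script above (the chunk size is arbitrary):

```python
import pandas as pd

pieces = []
with pd.HDFStore('test.h5', mode='r') as store:
    # Iterate over the table in fixed-size chunks; peak memory stays around
    # one chunk (plus whatever reduced pieces you keep) instead of the whole
    # row-oriented table being materialized at once.
    for chunk in store.select('df', chunksize=1000000):
        pieces.append(chunk[['a']])      # keep only what you need from each chunk

result = pd.concat(pieces)
```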
Thanks for the explanation. I'll definitely look at it and see if I can contribute something.
I tried to run the following code (with master)
and got these results
Is it a memory leak or am I doing something wrong?