"segmentation fault" with huge loess stat_smooth (from plotnine) #3

Open
saladpanda opened this issue Jun 1, 2018 · 4 comments

@saladpanda

I'm quite sure this is an issue in scikit-misc, so I am filing it here.
I ran into the following while making plots with https://github.com/has2k1/plotnine.

#!/usr/bin/env python3

import numpy as np
import pandas as pd
from plotnine import *

time_int   = np.array(range(30000))
time_float = np.linspace(0, 500, 30000)
values = np.random.randint(1, 1000, 30000)

df = pd.DataFrame({'time_int': time_int, 'time_float': time_float, 'values': values})
df.info()

plot1 = ggplot(df, aes(x='time_int', y='values')) \
        + stat_smooth(method='loess')

plot2 = ggplot(df, aes(x='time_float', y='values')) \
        + stat_smooth(method='loess')

# print(plot1) # gives 'out of memory'
print(plot2) # crashes with segfault

With print(plot1) this prints:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 3 columns):
time_float    30000 non-null float64
time_int      30000 non-null int64
values        30000 non-null int64
dtypes: float64(1), int64(2)
memory usage: 703.2 KB
[skmisc/loess/src/misc.c:34] Out of memory (7200000000 bytes)

With print(plot2) this prints:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 3 columns):
time_float    30000 non-null float64
time_int      30000 non-null int64
values        30000 non-null int64
dtypes: float64(1), int64(2)
memory usage: 703.2 KB
zsh: segmentation fault (core dumped)  ./test.py
@has2k1
Owner

has2k1 commented Jun 4, 2018

I understand and expect the "Out of memory" error given the size of the data; the loess algorithm is O(n^2) in memory. I do not expect a segfault, though. I think it is related to the low-memory situation (probably an unchecked malloc).

Both tests run on my system. But for 40000 rows, I get segfaults for both plots.
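
To make the O(n^2) estimate concrete, here is a small arithmetic sketch. It assumes the allocation is a single n x n array of 8-byte doubles; that assumption is not confirmed by the source code here, but it reproduces the 7200000000 bytes reported in the log for 30000 rows. The function name `loess_bytes_estimate` is made up for illustration.

```python
def loess_bytes_estimate(n_rows: int, bytes_per_value: int = 8) -> int:
    """Size in bytes of an n x n array, assuming 8-byte doubles (an assumption)."""
    return n_rows * n_rows * bytes_per_value


for n in (30_000, 40_000):
    size = loess_bytes_estimate(n)
    print(f"{n} rows -> {size} bytes ({size / 1e9:.1f} GB)")
# 30000 rows -> 7200000000 bytes (7.2 GB)
# 40000 rows -> 12800000000 bytes (12.8 GB)
```

Under the same assumption, going from 30000 to 40000 rows raises the request from 7.2 GB to 12.8 GB, which would be consistent with the 40000-row case failing on a machine where the 30000-row case still runs.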

@saladpanda
Author

The "Out of memory" is absolutely expected.
The bug I wanted to report is the segfault.

Now that I have tested the above code again, I get segfaults for both plots and cannot find a size where I get only "out of memory".

I noticed this while using plotnine in a Jupyter notebook. I had method set to loess and then increased the size of the dataframe. Suddenly the IPython kernel kept crashing when generating the plot. I think scikit-misc (or plotnine?) should catch this instead of crashing.
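
Until something like that exists in either library, a caller-side guard is one possible workaround. The sketch below is not part of plotnine or scikit-misc: `MAX_LOESS_BYTES`, `max_rows_for_budget`, and `subsample_for_loess` are hypothetical names, and the 8 * n^2 memory estimate is an assumption taken from the figures in this thread. It caps the number of rows passed to the smoother by random subsampling, which changes the fitted curve slightly and does not guarantee that the segfault cannot occur.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical memory budget for the smoother (2 GiB); tune for your machine.
MAX_LOESS_BYTES = 2 * 1024**3


def max_rows_for_budget(budget_bytes: int, bytes_per_value: int = 8) -> int:
    """Largest n such that n * n * bytes_per_value fits in the budget."""
    return math.isqrt(budget_bytes // bytes_per_value)


def subsample_for_loess(df: pd.DataFrame,
                        budget_bytes: int = MAX_LOESS_BYTES,
                        seed: int = 0) -> pd.DataFrame:
    """Return df unchanged if it is small enough, else a random subsample."""
    limit = max_rows_for_budget(budget_bytes)
    if len(df) <= limit:
        return df
    # Sample without replacement, then restore the original row order.
    return df.sample(n=limit, random_state=seed).sort_index()


df = pd.DataFrame({
    "time_float": np.linspace(0, 500, 30000),
    "values": np.random.randint(1, 1000, 30000),
})
small = subsample_for_loess(df)
print(len(df), "->", len(small))
```

The reduced frame can then be passed to `ggplot(...) + stat_smooth(method='loess')` in place of the full one.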

@has2k1
Owner

has2k1 commented Jun 6, 2018

Yes, the segfaults cause the Jupyter kernel to crash.
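
A segfault cannot be caught with `try`/`except` inside the same process, so one way to protect a notebook kernel is to run the risky code in a child process and inspect its exit status. The following is a minimal sketch using only the standard library; `run_isolated` and `describe` are hypothetical helper names, and the negative-return-code convention for signals applies to POSIX systems.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 600.0):
    """Run Python source in a separate interpreter; return (returncode, stdout, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


def describe(returncode: int) -> str:
    """Turn a subprocess return code into a short human-readable message."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative value means the child was killed by that signal.
        return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"


rc, out, err = run_isolated("print('plot would be rendered here')")
print(describe(rc), out.strip())
```

If the child process crashes, the parent (for example the Jupyter kernel) survives and can report the failure instead of dying with it.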

@antschum

Any update on this?
