
Slow event-wise scores evaluation #17

Open
rth opened this issue Mar 28, 2019 · 7 comments

Comments

@rth

rth commented Mar 28, 2019

Evaluating the event-wise scores appears to be very time-consuming, mostly due to the repeated calls to pd.to_datetime within the Eventwise* scores.

At least for ramp_test_submission --quick-test on the starting kit, this accounts for most of the runtime:

[Screenshot: profiling output (Screenshot_2019-03-29 out)]

Running the starting kit with the --quick-test option takes 63 s on my laptop; with event-wise scores disabled this drops to 9 s.

I'm looking for a way to speed it up, but generally such a long runtime (on a dataset that is not that big) makes iterations slower, which is problematic when running events.

@rth rth changed the title Very slow evementwise scores evaluation Very slow element-wise scores evaluation Mar 28, 2019
@glemaitre
Contributor

ping @jorisvandenbossche

@rth rth changed the title Very slow element-wise scores evaluation Very slow event-wise scores evaluation Mar 28, 2019
@rth
Author

rth commented Mar 28, 2019

To give more details: the slow part is pd.to_datetime(y_true[:, 0], unit='m'), where y_true[:, 0] is a numpy array of float64 values in minutes. I think converting it to an integer timestamp (e.g. ns) and then back to a datetime, without going through pd.to_datetime, might be faster, but I have not found a vectorized way to do that yet.
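For reference, such a round trip can be sketched with plain NumPy casts, bypassing pd.to_datetime entirely. This is a minimal sketch, not the kit's actual code; the `minutes` sample array is hypothetical, and it assumes the float64 values hold whole minutes since the epoch:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for y_true[:, 0]: float64 minutes since the epoch
minutes = np.array([0.0, 60.0, 1440.0])

# Vectorized conversion: cast the floats to int64, then view as datetime64[m]
dt_np = minutes.astype('int64').astype('datetime64[m]')

# The same values as a pandas DatetimeIndex, avoiding the slow float path
dt_pd = pd.DatetimeIndex(dt_np)
```

The result matches what pd.to_datetime(minutes, unit='m') produces, as long as the floats carry no fractional minutes.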

@jorisvandenbossche
Contributor

Yes, we recently had an issue about that on the pandas issue tracker (it might be fixed in the latest pandas release).
In older releases, when the input is floats, pandas takes a very slow generic object path, while for ints the conversion is optimized:

In [9]: a = np.arange(100000.)                

In [10]: %timeit pd.to_datetime(a, unit='m') 
593 ms ± 6.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: %timeit pd.to_datetime(a.astype(int), unit='m')   
2.11 ms ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I don't fully recall, but it might be that we can simply cast to ints there? (I mean, the fact that it is float is maybe just because it was concatenated into a 2D array with actual floats, but those dates are originally ints?)

@jorisvandenbossche
Contributor

Yes, they are ints:

arr = y_true.index.values.astype('datetime64[m]').astype(int)

So doing an astype(int) in the to_datetime call should be safe and should speed it up a lot.
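A minimal sketch of the suggested cast (the `y_true_col` array here is made up for illustration; it assumes the floats hold integer-valued minutes, as established above):

```python
import numpy as np
import pandas as pd

# Hypothetical float64 column of minutes, standing in for y_true[:, 0]
y_true_col = np.array([0.0, 30.0, 120.0])

# Float path: hits the slow generic object path in older pandas releases
slow = pd.to_datetime(y_true_col, unit='m')

# Int path: optimized, gives the same result for whole-minute values
fast = pd.to_datetime(y_true_col.astype('int64'), unit='m')
```

Both produce identical DatetimeIndex values, so the cast only changes which code path pandas takes.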

@jorisvandenbossche
Contributor

See #18

@rth
Author

rth commented Mar 29, 2019

Thanks @jorisvandenbossche ! Looks great!

@rth
Author

rth commented Mar 29, 2019

Let's keep this open for now, even if the to_datetime fix is a major improvement -- I'll try to see if caching some of the calculations in the event-wise scores could improve performance further.
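One possible direction for such caching, sketched under the assumption that the Eventwise* scores repeatedly convert the same y_true array: memoize the conversion keyed on the array's raw bytes. The helper name and cache scheme here are hypothetical, not part of the actual codebase:

```python
import numpy as np
import pandas as pd

_dt_cache = {}

def cached_to_datetime(arr):
    """Memoized datetime conversion (hypothetical helper).

    Keyed on the array's raw bytes and dtype, so repeated score
    evaluations on the same y_true reuse a single conversion.
    """
    key = (arr.tobytes(), arr.dtype.str)
    if key not in _dt_cache:
        # Cast to int64 first to stay on the fast integer path
        _dt_cache[key] = pd.to_datetime(arr.astype('int64'), unit='m')
    return _dt_cache[key]
```

Repeated calls with the same array then return the exact same cached object instead of re-running the conversion.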

@rth rth changed the title Very slow event-wise scores evaluation Slow event-wise scores evaluation Mar 29, 2019