Profiling & optimization #11

Open
florian-huber opened this issue Oct 30, 2022 · 2 comments

florian-huber commented Oct 30, 2022

Currently, the key data part is implemented using NumPy structured arrays.
An alternative approach I tried used pandas multi-index DataFrames (#4).

This issue is to explore/discuss different implementations and their performance.


florian-huber commented Dec 2, 2022

I tried several implementations, including NumPy (multiple variants), pandas, Polars, and Numba (multiple variants).

Numpy structured array

Pros:

  • Fast slicing
  • Allows storing different datatypes.

Cons:

  • Requires several fairly complex utility functions.
  • Currently very slow when stacking more layers: the data join step is a serious bottleneck!
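
To illustrate the trade-off, here is a minimal sketch and not the project's actual code. The field names (`row`, `col`, `score_a`, `score_b`) and the helper `add_layer` are hypothetical. Slicing is a single vectorized mask, while adding a layer means allocating a new array with a wider dtype and copying every field:

```python
import numpy as np

# Hypothetical layout: one record per (row, col) pair with a single score field.
base_dtype = np.dtype([("row", np.int64), ("col", np.int64), ("score_a", np.float64)])
data = np.array([(0, 0, 0.9), (0, 1, 0.2), (1, 1, 0.7)], dtype=base_dtype)

# Slicing: a boolean mask on one field is a single vectorized operation.
subset = data[data["score_a"] > 0.5]


def add_layer(arr, name, values):
    """Return a copy of `arr` with an extra float64 field called `name`.

    Assumes `values` is already aligned position by position with `arr`.
    """
    new_dtype = np.dtype(arr.dtype.descr + [(name, np.float64)])
    out = np.empty(arr.shape, dtype=new_dtype)
    for field in arr.dtype.names:  # every existing field is copied
        out[field] = arr[field]
    out[name] = values
    return out


stacked = add_layer(data, "score_b", np.array([0.1, 0.3, 0.5]))
print(stacked.dtype.names)
```

This sketch only handles values that are already aligned. A real join also has to match records by their keys, which is where the extra utility functions and most of the cost would come in.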

Pandas multi-index DataFrames (#4)

Pros:

  • Fast addition of new data (joins/merges)
  • Rather accessible implementation under the hood (when compared to the structured arrays)
  • Allows storing different datatypes.

Cons:

  • Slow at slicing. Since slicing is a very common operation, I consider this critical.
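
As a rough sketch of what this looks like (the index levels `row`/`col` and the column names are assumptions for illustration, not the layout used in #4), adding a layer is a single `join` on the shared multi-index, and slicing goes through `xs` or a boolean mask:

```python
import pandas as pd

# Hypothetical layout: scores indexed by a (row, col) multi-index.
index = pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 1)], names=["row", "col"])
scores = pd.DataFrame({"score_a": [0.9, 0.2, 0.7]}, index=index)

other_index = pd.MultiIndex.from_tuples([(0, 1), (1, 1), (2, 0)], names=["row", "col"])
other = pd.DataFrame({"score_b": [0.3, 0.5, 0.8]}, index=other_index)

# Adding a layer: the join aligns on the shared index, missing pairs become NaN.
merged = scores.join(other, how="outer")

# Slicing: select one value of the first index level, or filter by score.
row_zero = merged.xs(0, level="row")
high = merged[merged["score_a"] > 0.5]
print(merged)
```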

Polars DataFrames

Pros:

  • Very fast addition of new data (joins/merges)
  • Rather accessible implementation under the hood (when compared to the structured arrays)
  • Allows storing different datatypes.

Cons:

  • Notably slower than NumPy at slicing. Since slicing is a very common operation, I consider this critical.

Numba

Pros:

  • Can be used together with NumPy arrays, which keeps the fast slicing
  • Avoids additional dependencies (pandas or Polars)

Cons:

  • I could not get merging to be as fast as with Polars!
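
One possible shape for such a merge, as a sketch only and not necessarily any of the variants tried here: encode each (row, col) pair as a single sorted integer key and walk both key arrays in one compiled loop. The function `inner_join_sorted` and the key encoding are hypothetical, and the fallback decorator is only there so the snippet still runs without Numba installed:

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # plain-Python fallback, much slower but same result
    def njit(func):
        return func


@njit
def inner_join_sorted(keys_a, keys_b):
    """Return index arrays (ia, ib) such that keys_a[ia] == keys_b[ib].

    Assumes both key arrays are sorted ascending and contain unique values.
    """
    ia = np.empty(min(keys_a.size, keys_b.size), dtype=np.int64)
    ib = np.empty_like(ia)
    i = 0
    j = 0
    n = 0
    while i < keys_a.size and j < keys_b.size:
        if keys_a[i] == keys_b[j]:
            ia[n] = i
            ib[n] = j
            n += 1
            i += 1
            j += 1
        elif keys_a[i] < keys_b[j]:
            i += 1
        else:
            j += 1
    return ia[:n], ib[:n]


# Assumed encoding: key = row * 10 + col, so (0,0),(0,1),(1,1) -> 0, 1, 11.
keys_a = np.array([0, 1, 11], dtype=np.int64)
score_a = np.array([0.9, 0.2, 0.7])
keys_b = np.array([1, 11, 20], dtype=np.int64)
score_b = np.array([0.3, 0.5, 0.8])

ia, ib = inner_join_sorted(keys_a, keys_b)
merged_a, merged_b = score_a[ia], score_b[ib]
```

The data stays in plain NumPy arrays, so slicing afterwards works exactly as before.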

@florian-huber

Here are the results from a profiling script I made:
[Image: profiling_merging_implementations]
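
The script itself is not included here. For anyone who wants to run a similar comparison, a minimal timing harness could look like the sketch below. The helper `time_call` is hypothetical, and the two candidates are simple stand-ins rather than the implementations compared above:

```python
import timeit

import numpy as np


def time_call(func, repeat=5, number=10):
    """Return the best per-call time in seconds over `repeat` runs."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


rng = np.random.default_rng(0)
values = rng.random(100_000)

# Stand-in candidates: swap in the merge or slicing variants to be compared.
timings = {
    "boolean mask": time_call(lambda: values[values > 0.5]),
    "flatnonzero + take": time_call(lambda: values.take(np.flatnonzero(values > 0.5))),
}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds * 1e6:.1f} microseconds per call")
```

Taking the minimum over several repeats reduces the influence of background noise on the machine.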
