
perf: Speed up list operations that use amortized_iter() #20964

Open
wants to merge 7 commits into main
Conversation

@itamarst (Contributor) commented Jan 28, 2025

I expect this will speed up most list/array operations that rely on amortized_iter() (except in cases where the actual operation has significant issues, e.g. #19106; won't hurt there, though).

Before the change, the new benchmark has a mean of around 780µs.

After the change to iterator.rs, the benchmark has a mean of ~710µs.

After additional optimization to Series::_get_inner_mut(), mean is as low as ~650µs, but I then reverted this.

Why it helps: typically, expensive operations (atomic ones, in this case) aren't a problem in normal Series usage because they are called rarely. This is a different situation, though, because here they are called a lot.
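The shape of the optimization can be sketched in isolation (this is an illustrative stand-in, not Polars code; the function and names are hypothetical): pay the atomic uniqueness check once before the loop instead of once per element.

```rust
use std::sync::Arc;

// Hypothetical sketch: obtain the mutable reference through the Arc once,
// then reuse it across the whole iteration, rather than re-checking
// uniqueness (an atomic operation) for every element.
fn sum_with_reuse(values: &[i64]) -> i64 {
    let mut scratch: Arc<Vec<i64>> = Arc::new(Vec::with_capacity(1));
    // One atomic uniqueness check, done up front...
    let inner = Arc::get_mut(&mut scratch).expect("freshly created Arc is unique");
    let mut total = 0;
    for &v in values {
        // ...instead of once per element inside the hot loop.
        inner.clear();
        inner.push(v);
        total += inner[0];
    }
    total
}
```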

@github-actions github-actions bot added performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars labels Jan 28, 2025
@itamarst (Contributor Author)

_get_inner_mut() is still pretty expensive, but at least we're only doing it once. Arc::get_mut_unchecked() would help, but it's an unstable feature.

@itamarst (Contributor Author) commented Jan 28, 2025

I suspect switching to triomphe::Arc would also help; it would make get_mut() cheaper, since there's no read-modify-update needed to ensure uniqueness. But that's an extra dependency so perhaps not worth it.

(_get_inner_mut()'s way of checking uniqueness is apparently racy, by the way, but presumably the resulting panics are rare.)
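The reason a weakref-free Arc like triomphe's can check uniqueness more cheaply is that std's Arc must also account for Weak handles. A small stdlib demonstration (illustrative only):

```rust
use std::sync::Arc;

// std's Arc::get_mut refuses to hand out &mut while any Weak exists, so
// its uniqueness check must inspect the weak count too (in std, via a
// read-modify-write). An Arc without weak references only needs to read
// the strong count.
fn unique_before_and_after_weak_drop() -> (bool, bool) {
    let mut a = Arc::new(5);
    let w = Arc::downgrade(&a);
    // One strong reference, but a Weak is alive: not considered unique.
    let before = Arc::get_mut(&mut a).is_some();
    drop(w);
    // Weak gone: now unique.
    let after = Arc::get_mut(&mut a).is_some();
    (before, after)
}
```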

@itamarst itamarst marked this pull request as ready for review January 28, 2025 20:57
@orlp (Collaborator) commented Jan 29, 2025

Can you remove the race condition in _get_inner_mut? Should also be faster:

match Arc::get_mut(&mut self.0) {
    Some(r) => r,
    None => {
        self.0 = self.0.clone_inner();
        unsafe { Arc::get_mut_unchecked(&mut self.0) }
    }
}

@orlp (Collaborator) commented Jan 29, 2025

Oh, get_mut_unchecked is unstable... I suppose you can make it get_mut().unwrap_unchecked(), but it's not quite as fast, as the compiler can't eliminate the atomic op. The clone path's performance isn't as important though.

@ritchie46 (Member)

> Oh, get_mut_unchecked is unstable... I suppose you can make it get_mut().unwrap_unchecked(), but it's not quite as fast, as the compiler can't eliminate the atomic op. The clone path's performance isn't as important though.

We can feature gate it, and use that if we are on the nightly compiler. This is always the case for our Python releases.

@itamarst (Contributor Author)

The proposed change won't pass the borrow checker, I think. And the really expensive atomic operations are the ones in Arc::is_unique(), I believe, which cannot be eliminated (and which rely on private APIs we don't have access to).

After reading in more detail, it seems like the only way to avoid the expensive load-update operation is to switch to something like triomphe::Arc, which lacks weakrefs. Then you're doing an Acquire atomic load instead of a load-update, and the load-update is probably what's causing some of the slowness, given the flamegraph I'm looking at.

Or there may be some way to rearchitect or tweak this abstraction at some higher level of the code.
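The two kinds of atomic operation being contrasted can be sketched directly (illustrative stand-ins, not Polars or triomphe code; the std Arc's actual weak-count locking is more involved):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// A plain Acquire load: roughly the check a weakref-free Arc can afford.
fn is_unique_load(count: &AtomicUsize) -> bool {
    count.load(Ordering::Acquire) == 1
}

// A read-modify-write: roughly the shape of what std must do to also
// lock out concurrent Weak activity. RMWs contend on the cache line
// even when nothing changes, which is what shows up in a flamegraph.
fn is_unique_rmw(count: &AtomicUsize) -> bool {
    count
        .compare_exchange(1, 1, Ordering::Acquire, Ordering::Relaxed)
        .is_ok()
}
```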

@itamarst (Contributor Author)

Oh, I thought no weakrefs were used, but further digging suggests they are. So triomphe won't work.

@itamarst (Contributor Author)

Thinking about this some more: the race condition won't result in a panic, it will result in extra cloning. And that suggests using get_mut_unchecked() as a replacement for get_mut() is actually fine. So we can get the performance boost on nightly compilers. I'll go do that and include a safety comment and a bunch more documentation.
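On stable Rust, where Arc::get_mut_unchecked isn't available, the clone-if-shared pattern can be approximated with unwrap_unchecked; a minimal sketch (hypothetical helper over a concrete payload type, not the Polars implementation), which keeps the full get_mut uniqueness check and only elides the unwrap branch:

```rust
use std::sync::Arc;

// Sketch of a copy-on-write accessor: clone the payload if the Arc is
// shared, then hand out &mut. Checking get_mut twice is the classic
// workaround for the borrow-checker limitation discussed below.
fn get_inner_mut(slot: &mut Arc<Vec<i64>>) -> &mut Vec<i64> {
    if Arc::get_mut(slot).is_none() {
        // Not unique: clone the payload so we are.
        *slot = Arc::new((**slot).clone());
    }
    // SAFETY: `slot` is unique here; either get_mut succeeded above or
    // we just replaced it with a freshly created Arc.
    unsafe { Arc::get_mut(slot).unwrap_unchecked() }
}
```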

@orlp (Collaborator) commented Jan 29, 2025

> The proposed change won't pass the borrow checker, I think.

Ugh... This is a classic case that Polonius would solve, hope we get something similar soon.

@itamarst (Contributor Author)

So after writing this whole comment... if my reasoning is correct, this optimization can actually go into Arc::get_mut(). And if my reasoning is wrong, this shouldn't go into Polars 😁

So I suggest keeping this PR as is; I'll open a PR against Rust and see what they say. If it gets merged, Polars will get faster just by upgrading to a newer Rust.

@itamarst (Contributor Author)

Actually, we're OK with the uniqueness check giving the wrong answer sometimes, so long as that's rare and it's a false negative, because worst case we just do extra work. The stdlib won't be OK with that.

So I'm just going to update this PR.

@itamarst (Contributor Author)

This brings the benchmark down to as low as 650µs per run, so even faster. But someone needs to think very hard about whether my justification is correct.

@itamarst (Contributor Author)

Another small bottleneck I haven't fixed:

// From series_trait.rs:

impl (dyn SeriesTrait + '_) {
    pub fn unpack<N>(&self) -> PolarsResult<&ChunkedArray<N>>
    where
        N: 'static + PolarsDataType<IsLogical = FalseT>,
    {
        // This next line is 5% of runtime, probably because for lists,
        // creating the new dtype allocates (and later frees) a Box :cry:
        // -- N::get_dtype() is DataType::List(Box::new(DataType::Null)).
        polars_ensure!(&N::get_dtype() == self.dtype(), unpack);
        Ok(self.as_ref())
    }
}
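The allocation cost described in that comment can be reproduced with a toy enum (a standalone sketch mirroring the shape of Polars's DataType, not the real type; note the cheap variant matches any list rather than specifically List(Null), so it's a looser check):

```rust
// Toy recursive dtype enum, for illustration only.
#[derive(PartialEq, Debug)]
enum DataType {
    Null,
    Int64,
    List(Box<DataType>),
}

// Builds a fresh List(Box::new(Null)) on every call just to compare:
// one heap allocation and free per invocation.
fn matches_list_allocating(dtype: &DataType) -> bool {
    dtype == &DataType::List(Box::new(DataType::Null))
}

// Comparing only the discriminant avoids the allocation entirely.
fn matches_list_cheap(dtype: &DataType) -> bool {
    matches!(dtype, DataType::List(_))
}
```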

@itamarst (Contributor Author)

Woke up in the middle of the night and realized the unsafe is unsound. I'll remove it later; just noting that now so it doesn't get merged.

@itamarst (Contributor Author)

Filed #21004 explaining why the proposed change would've been unsound. It's not actually a problem in the current codebase, but as soon as someone used a weakref it would've been.
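The hazard with weakrefs can be shown with stdlib Arc alone (an illustrative sketch of the mechanism, not the filed issue's exact scenario): a strong-count-only "unique" answer can be invalidated at any moment, because a live Weak can be upgraded into a new strong reference.

```rust
use std::sync::Arc;

// A Weak can revive aliasing after the strong count read 1, so holding
// &mut on the strength of that read alone would be unsound.
fn weak_can_revive_aliasing() -> (usize, usize) {
    let a = Arc::new(vec![1]);
    let w = Arc::downgrade(&a);
    let before = Arc::strong_count(&a); // 1: looks "unique" by strong count
    let b = w.upgrade().expect("value still alive");
    let after = Arc::strong_count(&a); // 2: aliasing reappeared
    drop(b);
    (before, after)
}
```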

codecov bot commented Jan 30, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.21%. Comparing base (96a2d01) to head (fc6b86c).
Report is 22 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #20964      +/-   ##
==========================================
- Coverage   79.34%   79.21%   -0.13%     
==========================================
  Files        1579     1583       +4     
  Lines      224319   225085     +766     
  Branches     2573     2581       +8     
==========================================
+ Hits       177976   178301     +325     
- Misses      45755    46194     +439     
- Partials      588      590       +2     

☔ View full report in Codecov by Sentry.
