
Speed up creating and extending packed arrays from iterators up to 63× #1023

Merged
merged 1 commit on Feb 8, 2025

Conversation

ttencate (Contributor)

This uses the iterator size hint to pre-allocate, which leads to a 63× speedup in the best case. If the hint is pessimistic, it reads into a buffer to avoid repeated push() calls, which is still 44× as fast as the previous implementation.

@GodotRust

API docs are being generated and will be shortly available at: https://godot-rust.github.io/docs/gdext/pr-1023

@ttencate ttencate force-pushed the perf/packed_array_extend branch from 30677e3 to f2e267d on January 20, 2025 at 20:34
@Bromeon (Member) left a comment

Thanks a lot, this sounds like a great improvement! 🚀

Could you elaborate the role of the intermediate stack buffer? Since it's possible to resize the packed array based on size_hint(), why not do that and write directly from the iterator to self.as_mut_slice()?

Also, ParamType::owned_to_arg() no longer occurs in the resulting code; is that not necessary for genericity?

@Bromeon Bromeon added the labels c: core (Core components) and performance (Performance problems and optimizations) on Jan 21, 2025
@ttencate (Contributor, Author)

Could you elaborate the role of the intermediate stack buffer? Since it's possible to resize the packed array based on size_hint(), why not do that and write directly from the iterator to self.as_mut_slice()?

That's what the "Fast part" does. The buffer is only needed if there are more items after that.

I guess there might be iterators whose size_hint() gets updated intelligently during iteration, but mostly I'd expect it to decrement down to 0 every time next() is called. So once we have processed size_hint() items, we have no idea how many more are coming, and that's when we proceed to the "Slower part" and start using the buffer.
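To illustrate, a small standalone sketch (plain Rust, independent of gdext) of the hints these iterator categories actually report:

```rust
fn main() {
    let v = vec![1, 2, 3, 4];

    // Exact size known: lower and upper bound both equal the length.
    assert_eq!(v.iter().size_hint(), (4, Some(4)));

    // Filter cannot know how many items will pass, so the lower bound is 0.
    assert_eq!(v.iter().filter(|&&x| x > 2).size_hint(), (0, Some(4)));

    // from_fn has no information at all.
    let mut n = 0;
    let it = std::iter::from_fn(move || {
        n += 1;
        (n <= 3).then_some(n)
    });
    assert_eq!(it.size_hint(), (0, None));
}
```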

The alternative (which I implemented initially) is to grow the array in increments of 32 elements, write to as_mut_slice(), and then shrink it again at the end. The problem with that is that you might end up over-allocating, because Godot extends the memory allocation in powers of two. In some cases, this would make the returned array use twice as much memory as in the current implementation – memory that is potentially retained for a very long time. The buffer avoids that nicely, at the expense of some extra copying.
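The overall shape of the algorithm can be sketched in plain Rust, with Vec standing in for the Godot packed array; extend_with_buffer and the buffer size of 32 are illustrative only, not the actual gdext code (which uses a larger buffer and writes through as_mut_slice()):

```rust
// Sketch: trust size_hint() for the fast path, then batch the unknown
// remainder through a small fixed buffer instead of pushing one by one.
fn extend_with_buffer<I: Iterator<Item = i32>>(out: &mut Vec<i32>, mut iter: I) {
    // Fast part: pre-allocate based on the lower bound and write directly.
    let (lower, _) = iter.size_hint();
    out.reserve(lower);
    for _ in 0..lower {
        match iter.next() {
            Some(x) => out.push(x),
            None => return, // hint was optimistic; nothing more to do
        }
    }

    // Slower part: length unknown from here on, so stage elements in a
    // fixed buffer and append them in bulk, avoiding per-element API calls.
    let mut buf = [0i32; 32];
    loop {
        let mut len = 0;
        for slot in buf.iter_mut() {
            match iter.next() {
                Some(x) => {
                    *slot = x;
                    len += 1;
                }
                None => break,
            }
        }
        out.extend_from_slice(&buf[..len]);
        if len < buf.len() {
            break; // iterator exhausted before filling the buffer
        }
    }
}

fn main() {
    let mut out = Vec::new();
    // Filter reports a lower bound of 0, so everything goes through the buffer.
    extend_with_buffer(&mut out, (0..100).filter(|x| x % 2 == 0));
    assert_eq!(out.len(), 50);
}
```

Because the buffer is drained fully into the output each round, the destination never holds more capacity than its final contents require, which is exactly the over-allocation the chunked grow-and-shrink approach could not avoid.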

Also, ParamType::owned_to_arg() no longer occurs in the resulting code; is that not necessary for genericity?

Apparently not. We only implement Extend<$Element>, not something like Extend<E> where E: Into<$Element>.

@ttencate ttencate requested review from Bromeon and Dheatly23 January 21, 2025 15:50
@Bromeon (Member) commented Jan 21, 2025

So once we have processed size_hint() items, we have no idea how many more are coming, and that's when we proceed to the "Slower part" and start using the buffer.

If that's the slow part that only happens on "bad" implementations of size_hint(), is the complexity really necessary or can we fall back to the naive approach for remaining elements (less maintenance, lower risks of bugs)?

Do you know how often this occurs in practice?

@ttencate (Contributor, Author)

There are at least two categories of iterators that are common in the wild, for which we'd want good performance:

  1. Exact size is known, e.g. just iterating over a Vec or BTreeSet. This uses the fast part and does nothing on the slower part, since the iterator is finished.
  2. Exact size is not known and the lower bound is 0, e.g. Filter, FlatMap, FromFn. This does nothing on the fast part and then finishes the iterator through the slower part. Note that the slower path is only 1.5× slower than the fast path, and still 44× as fast as naive repeated push() calls.

This PR is sufficient to handle them both efficiently. We could eliminate the fast part (case 1) and not lose a lot of performance (maybe incur some memory fragmentation), but that's actually the straightforward and obvious part, so the maintainability gain is small.

This PR also happens to deal efficiently with anything in between, i.e. iterators that report a nonzero lower bound but may return more elements. One example of those would be a Chain of the above two cases.
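A quick sketch of such an in-between case: Chain sums the hints of its two halves, so the lower bound reflects only the exactly-sized part, while more elements may still follow.

```rust
fn main() {
    let exact = [1, 2, 3];
    let unknown = (0..10).filter(|x| x % 2 == 0);
    let chained = exact.iter().copied().chain(unknown);
    // Lower bound comes from the exact half only; upper bound adds both.
    assert_eq!(chained.size_hint(), (3, Some(13)));
}
```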

@Bromeon (Member) commented Jan 21, 2025

Sounds good, thanks for elaborating! The 2 kB buffer (512 ints) is probably also not a big issue, even on mobile/Wasm?

@ttencate (Contributor, Author)

A cursory search shows stack sizes of at least 1 MB on all platforms. If it becomes a problem after all, it's easy enough to adjust.

while let Some(item) = iter.next() {
buf[0].write(item);
let mut buf_len = 1;
for (src, dst) in iter::zip(&mut iter, buf.iter_mut().skip(1)) {
@Dheatly23 (Contributor) commented Jan 22, 2025

If the buffer is full, the iterator is still advanced, but the item is discarded.

Reference: https://doc.rust-lang.org/src/core/iter/adapters/zip.rs.html#165-170

Suggested change
for (src, dst) in iter::zip(&mut iter, buf.iter_mut().skip(1)) {
for (dst, src) in iter::zip(buf.iter_mut().skip(1), &mut iter) {
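A minimal standalone demonstration of the pitfall: zip polls its first argument first, so when the second side is exhausted, one extra element has already been pulled from the first and has nowhere to go.

```rust
use std::cell::Cell;

fn main() {
    let consumed = Cell::new(0);
    let src = (0..10).inspect(|_| consumed.set(consumed.get() + 1));
    let mut buf = [0; 3];
    // With the source as the FIRST zip argument, the 4th call pulls an
    // element from `src` before discovering the buffer side is done.
    for (s, d) in std::iter::zip(src, buf.iter_mut()) {
        *d = s;
    }
    assert_eq!(buf, [0, 1, 2]);
    assert_eq!(consumed.get(), 4); // 3 stored + 1 silently discarded
}
```

Swapping the arguments, as in the suggested change, makes zip check the buffer side first, so no source element is consumed without being stored.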

@ttencate (Contributor, Author)

😱 Yikes, great catch! Maybe this is why I intuitively wrote it in a more explicit way to begin with. iter::zip looks so symmetrical (which is why I prefer it over Iterator::zip) but in this case, that's misleading.

I've updated the test to catch this, and rewrote the loop to be more explicit. The new test also caught another bug that all three of us missed: len += buf_len; was missing at the end of the loop. But I'm confident that it is correct now.

@Dheatly23 (Contributor) commented Jan 23, 2025

Looks good to me.

Small issue I noticed: if the iterator panics, data in the buffer will not be dropped. It's not a safety issue, but it would be nice to drop the buffer's contents properly.

struct Buffer<const N: usize, T> {
    buf: [MaybeUninit<T>; N],
    len: usize,
}

impl<const N: usize, T> Default for Buffer<N, T> {
    fn default() -> Self {
        Self {
            buf: [const { MaybeUninit::uninit() }; N],
            len: 0,
        }
    }
}

impl<const N: usize, T> Drop for Buffer<N, T> {
    fn drop(&mut self) {
        assert!(self.len <= N);
        if N > 0 {
            unsafe {
                ptr::drop_in_place(ptr::slice_from_raw_parts_mut(
                    self.buf[0].as_mut_ptr(),
                    self.len,
                ));
            }
        }
    }
}
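For context, a tiny standalone sketch of the panic scenario, using Vec (whose Drop already handles unwinding) as a reference point for the behaviour that the Drop impl above restores for the MaybeUninit buffer:

```rust
use std::panic;

fn main() {
    // If the source iterator panics mid-extend, elements already staged must
    // still be dropped during unwinding. Vec does this automatically; a
    // hand-rolled MaybeUninit buffer only does so if it implements Drop.
    let result = panic::catch_unwind(|| {
        let mut staged: Vec<String> = Vec::with_capacity(4);
        for i in 0..10 {
            if i == 3 {
                panic!("iterator panicked mid-extend");
            }
            staged.push(i.to_string());
        }
    });
    assert!(result.is_err()); // the three staged Strings were freed while unwinding
}
```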

@ttencate (Contributor, Author)

Great catch yet again. This looks like a good opportunity to make that Buffer into a safe-ish abstraction, which makes the logic in extend() a lot easier to follow as well! PTAL.

@Bromeon (Member) left a comment

This PR now adds a significant amount of complexity, so there's quite a chance that we introduce bugs. Given the performance gains, it's probably OK to iron those out over time, but maybe we should bookmark this in case regressions appear in the future 🙂

But I wonder: are there no higher-level ways to achieve this with the standard library? It seems to be a pattern that occurs every now and then when implementing Extend... Does anyone know how std or other crates handle it? Do they also use low-level unsafe all the time?

@ttencate (Contributor, Author)

This PR now adds a significant amount of complexity, so there's quite a chance that we introduce bugs. Given the performance gains, it's probably OK to iron those out over time, but maybe we should bookmark this in case regressions appear in the future 🙂

It'll appear in the changelog, right?

But I wonder: are there no higher-level ways to achieve this with the standard library? It seems to be a pattern that occurs every now and then when implementing Extend... Does anyone know how std or other crates handle it? Do they also use low-level unsafe all the time?

Here's the std::Vec implementation: https://doc.rust-lang.org/src/alloc/vec/mod.rs.html#3512-3534. It polls size_hint() continuously, which makes sense in that implementation.

The problem is that such an implementation would make at least one Godot API call per element, so it wouldn't be any faster than what we currently have.

The high-level way would be to collect the iterator into a Vec first, and then memcpying that into the Packed*Array. I haven't benchmarked that strategy, but I think it would be around the same speed as my implementation. However, it requires an allocation the same size as the output, and possibly intermediate allocations as the Vec grows. If you'd rather have that simplicity, I'd be happy to bench it.
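For concreteness, a hedged sketch of that high-level strategy, with Vec standing in for the packed array; extend_via_vec is a hypothetical name, not gdext API:

```rust
// Collect first, then copy once in bulk. Simple, but the temporary Vec may
// reallocate several times as it grows, and briefly doubles peak memory use.
fn extend_via_vec<I: Iterator<Item = i32>>(out: &mut Vec<i32>, iter: I) {
    let tmp: Vec<i32> = iter.collect();
    out.extend_from_slice(&tmp); // the real code would memcpy into Packed*Array
}

fn main() {
    let mut out = vec![1, 2];
    extend_via_vec(&mut out, (3..6).filter(|_| true));
    assert_eq!(out, vec![1, 2, 3, 4, 5]);
}
```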

@Bromeon (Member) commented Feb 1, 2025

The high-level way would be to collect the iterator into a Vec first, and then memcpying that into the Packed*Array. I haven't benchmarked that strategy, but I think it would be around the same speed as my implementation. However, it requires an allocation the same size as the output, and possibly intermediate allocations as the Vec grows. If you'd rather have that simplicity, I'd be happy to bench it.

If you have the time to do that, it would of course be very appreciated. But you've already done a ton of research on this PR, so I don't want to demand more and more 😉

Related questions (also depending on the outcome of the above):

  • The buffer module seems like a good candidate for a separate file. Maybe its Buffer struct could also be named more precisely?
  • Do you think that buffer is potentially reusable, e.g. for String (contiguous [char] UTF-32 array) or so? It's probably much less common to iterate over individual code points...

@ttencate (Contributor, Author) commented Feb 3, 2025

If you have the time to do that, it would of course be very appreciated. But you've already done a ton of research on this PR, so I don't want to demand more and more 😉

Here are my raw benchmark results:

                                              min       median
   -- naive_known_size           ...      6.095μs      6.511μs
   -- naive_unknown_size         ...      6.140μs      6.371μs
   -- vec_known_size             ...      0.782μs      0.906μs
   -- vec_unknown_size           ...      1.586μs      1.731μs
   -- new_known_size             ...      0.095μs      0.099μs
   -- new_unknown_size           ...      0.488μs      0.491μs

All benchmarks use an iterator of 1000 elements of type i32.

  • The naive implementation is the current one: repeated push() calls.
  • The vec implementation first collect()s the iterator into a Vec, then moves that into the array.
  • The new implementation is this PR.

Each is run with two cases:

  • known_size is an Iterator whose size_hint() returns (len, Some(len)) where len is exactly the number of elements in the iterator.
  • unknown_size is an Iterator whose size_hint() returns (0, None).

In a previous version of this PR, I reported 0.14 µs for the equivalent of new_unknown_size. That was before the Buffer refactor, but there have been no fundamental changes in the algorithm. I tried #[inline(always)] on the Buffer methods but it made no difference. In any case, it's still a respectable 12× speedup.

  • The buffer module seems like a good candidate for a separate file. Maybe its Buffer struct could also be named more precisely?

Sure thing.

  • Do you think that buffer is potentially reusable, e.g. for String (contiguous [char] UTF-32 array) or so? It's probably much less common to iterate over individual code points...

Indeed. It's only useful if you don't know ahead of time how many elements are incoming, for example if we were to implement FromIterator<char> or Extend<char> for GString. But at that point, the entire algorithm should be generalized somehow, not just the Buffer.

That said, using large strings is much less common than using large arrays of core Copy-like types, so the performance gain might not be worth it.

@Bromeon (Member) commented Feb 3, 2025

Thanks a lot! So there's still a factor of 8× and 3× of this PR compared to a Vec impl, which imo justifies the complexity of a dedicated buffer.

With the Buffer being its own file, and some docs that explain the value it adds, this would look great! Do you have a good name? ExtendBuffer or so, since it's specifically for the purpose of extending an unknown number of elements?

@ttencate (Contributor, Author) commented Feb 3, 2025

I like ExtendBuffer. Moved it to a sibling module for now, since there doesn't seem to be a good other place in this crate to put it. Because this makes it more "official", added some improvements:

  • A (somewhat ugly due to compiler limitations) static assertion that the buffer size is nonzero. This saves some runtime asserts.
  • Renamed drain_as_slice() to drain_as_mut_slice() and had it actually return a slice, rather than a pointer/length pair. This has the benefit that the borrow checker validates that the buffer isn't modified while the slice exists. I made it mut so that drop can sensibly be called on its elements.
  • A test for drop behaviour of drain_as_mut_slice().

Ready to squash?

@Bromeon (Member) left a comment

Thanks a lot, the code looks very high quality and is well documented!

Some minor comments, after that you can squash 🙂
(can also be more than 1 commit if you find logical groups that make sense, up to you)

@Dheatly23 (Contributor) commented Feb 3, 2025

🤔 I'm not convinced about the static assertion that the buffer size is non-zero. Rust has been pretty good at eliding conditional statements with constant expressions; adding a static assertion adds complexity and compile time with, at best, dubious runtime benefit.

Is there any difference in the generated assembly? I want to see what Compiler Explorer says.

@Bromeon (Member) commented Feb 3, 2025

Is there any difference in the generated assembly? I want to see what Compiler Explorer says.

Imo it's a bit unfair to move all the work to @ttencate. I'm also guilty of that with the request for benchmark, but there we're at least trading off against a whole buffer class, where this static assertion is mostly 1 statement. As we use static assertions in other places, too, I'm not sure the burden of proof is on him here 😉 (that said, it's a fair question)

Maybe you could look into it?

@ttencate (Contributor, Author) commented Feb 3, 2025

That's okay @Bromeon, I'm also the one creating review burden by adding stuff to this PR all the time ;) Indeed, the check is elided, so I'll use a plain old if instead.

@Dheatly23 (Contributor) commented Feb 3, 2025

Testing with the following code:
use std::hint::black_box;
use std::mem::MaybeUninit;
use std::ptr;

/// A fixed-size buffer that does not do any allocations, and can hold up to `N` elements of type `T`.
///
/// This is used to implement `Packed*Array::extend()` in an efficient way, because it forms a middle ground between
/// repeated `push()` calls (slow) and first collecting the entire `Iterator` into a `Vec` (faster, but takes more memory).
///
/// Note that `N` must not be 0 for the buffer to be useful. This is checked at compile time.
pub struct ExtendBuffer<T, const N: usize> {
    buf: [MaybeUninit<T>; N],
    len: usize,
}

impl<T, const N: usize> Default for ExtendBuffer<T, N> {
    fn default() -> Self {
        Self {
            buf: [const { MaybeUninit::uninit() }; N],
            len: 0,
        }
    }
}

impl<T, const N: usize> ExtendBuffer<T, N> {
    /// Appends the given value to the buffer.
    ///
    /// # Panics
    /// If the buffer is full.
    #[inline(never)]
    pub fn push(&mut self, value: T) {
        self.buf[self.len].write(value);
        self.len += 1;
    }

    /// Returns `true` iff the buffer is full.
    #[inline(never)]
    pub fn is_full(&self) -> bool {
        self.len == N
    }

    /// Returns a slice of all initialized elements in the buffer, and sets the length of the buffer back to 0.
    ///
    /// It is the caller's responsibility to ensure that all elements in the returned slice get dropped!
    pub fn drain_as_mut_slice(&mut self) -> &mut [T] {
        debug_assert!(self.len <= N);
        let len = self.len;
        self.len = 0;
        // MaybeUninit::slice_assume_init_ref could be used here instead, but it's experimental.
        // SAFETY:
        // - The pointer is non-null, valid and aligned.
        // - `len` elements are always initialized.
        // - The memory is not accessed through any other pointer, because we hold a `&mut` reference to `self`.
        // - `len * mem::size_of::<T>()` is no larger than `isize::MAX`, otherwise the `buf` slice could not have existed either.
        unsafe {
            std::slice::from_raw_parts_mut(
                if N > 0 { self.buf[0].as_mut_ptr() } else { ptr::dangling_mut() },
                len,
            )
        }
    }
}

impl<T, const N: usize> Drop for ExtendBuffer<T, N> {
    #[inline(never)]
    fn drop(&mut self) {
        debug_assert!(self.len <= N);
        if N > 0 {
            // SAFETY: `slice_from_raw_parts_mut` by itself is not unsafe, but to make the resulting slice safe to use:
            // - `self.buf[0]` is a valid pointer, exactly `self.len` elements are initialized.
            // - The pointer is not aliased since we have an exclusive `&mut self`.
            let slice = ptr::slice_from_raw_parts_mut(self.buf[0].as_mut_ptr(), self.len);
            // SAFETY: the value is valid because the `slice_from_raw_parts_mut` requirements are met,
            // and there is no other way to access the value.
            unsafe {
                ptr::drop_in_place(slice);
            }
        }
    }
}

#[no_mangle]
pub fn zero_s() {
    let v = <ExtendBuffer<String, 0>>::default();
    black_box(v.is_full());
    drop(black_box(v));
}

#[no_mangle]
pub fn four_s() {
    let mut v = <ExtendBuffer<String, 4>>::default();
    v.push(String::from("abcd"));
    drop(black_box(v));
}

This yields no drop call for zero_s, while drop is called in four_s, proving that the drop is elided for a zero buffer length. With a debug build, both drop impls are still generated, but the if jump is replaced with an unconditional jump, proving it's elided for both N values.

This uses the iterator size hint to pre-allocate, which leads to 63×
speedup in the best case. If the hint is pessimistic, it reads into a
buffer to avoid repeated push() calls, which is still 12× as fast as the
previous implementation.
@ttencate ttencate force-pushed the perf/packed_array_extend branch from 4fbdef2 to b7178ee on February 3, 2025 at 14:44
@Bromeon (Member) commented Feb 8, 2025

Thanks a lot for this thorough pull request and the continuous improvements!

@Bromeon Bromeon added this pull request to the merge queue Feb 8, 2025
Merged via the queue into godot-rust:master with commit 71bbfb8 Feb 8, 2025
15 checks passed
TitanNano pushed a commit to TitanNano/gdext that referenced this pull request Feb 9, 2025
…tend

Speed up creating and extending packed arrays from iterators up to 63×