Speed up creating and extending packed arrays from iterators up to 63× #1023
Conversation
API docs are being generated and will shortly be available at: https://godot-rust.github.io/docs/gdext/pr-1023
Force-pushed from 30677e3 to f2e267d
Thanks a lot, this sounds like a great improvement! 🚀
Could you elaborate on the role of the intermediate stack buffer? Since it's possible to resize the packed array based on `size_hint()`, why not do that and write directly from the iterator to `self.as_mut_slice()`?

Also, `ParamType::owned_to_arg()` no longer occurs in the resulting code; is that not necessary for genericity?
That's what the "fast part" does. The buffer is only needed if there are more items after that. I guess there might be iterators whose `size_hint()` lower bound underestimates how many items they actually yield.

The alternative (which I implemented initially) is to grow the array in increments of 32 elements and write to `self.as_mut_slice()` each time.
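To make the structure concrete, here is a minimal sketch of that fast part, with a plain `Vec` standing in for the packed array (the real code resizes the Godot array and writes through `as_mut_slice()`; names and details here are illustrative, not the PR's code):

```rust
// Hedged sketch of the "fast part": size the array from the iterator's
// lower bound and fill those slots directly, instead of pushing per element.
fn extend_fast_part<T: Default, I: Iterator<Item = T>>(array: &mut Vec<T>, iter: &mut I) {
    let (lower, _upper) = iter.size_hint();
    let old_len = array.len();

    // One resize up front instead of one push per element.
    array.resize_with(old_len + lower, T::default);

    let mut written = 0;
    for dst in &mut array[old_len..] {
        match iter.next() {
            Some(item) => {
                *dst = item;
                written += 1;
            }
            // The hint promised more than the iterator delivered; drop the extra slots.
            None => break,
        }
    }
    array.truncate(old_len + written);

    // Anything the iterator still yields after this goes through the buffered "slow part".
}
```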
Apparently not. We only implement `Extend` for the packed arrays' own element types.
If that's the slow part that only happens on "bad" implementations of `size_hint()`, do you know how often this occurs in practice?
There are at least two categories of iterators that are common in the wild, for which we'd want good performance:

1. Iterators with an exact `size_hint()`, such as those over slices, arrays and `Vec`s.
2. Iterators whose lower bound is 0, such as `filter()` and similar adapters.
This PR is sufficient to handle them both efficiently. We could eliminate the fast part (case 1) and not lose a lot of performance (maybe incur some memory fragmentation), but that's actually the straightforward and obvious part, so the maintainability gain is small.

This PR also happens to deal efficiently with anything in between, i.e. iterators that report a nonzero lower bound but may return more elements. One example of those would be a `chain()` of an exact-size iterator with one of unknown length.
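For reference, the size hints reported by some common iterators (standard-library behavior; this snippet is purely illustrative and not from the PR):

```rust
fn main() {
    let v = vec![1, 2, 3, 4];

    // Exact-size iterators report a precise hint.
    assert_eq!(v.iter().size_hint(), (4, Some(4)));

    // Adapters like `filter` cannot know how many items survive,
    // so their lower bound is 0.
    assert_eq!(v.iter().filter(|&&x| x > 2).size_hint(), (0, Some(4)));

    // "In between": chaining an exact-size iterator with a filtered one
    // gives a nonzero lower bound, but more elements may still arrive.
    let chained = v.iter().chain(v.iter().filter(|&&x| x > 2));
    assert_eq!(chained.size_hint(), (4, Some(8)));
}
```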
Sounds good, thanks for elaborating! The 2 kB buffer (512 ints) is probably also not a big issue, even on mobile/Wasm?
A cursory search shows stack sizes of at least 1 MB on all platforms. If it becomes a problem after all, it's easy enough to adjust.
```rust
while let Some(item) = iter.next() {
    buf[0].write(item);
    let mut buf_len = 1;
    for (src, dst) in iter::zip(&mut iter, buf.iter_mut().skip(1)) {
```
If the buffer is full, the iterator is still advanced, but the item is discarded.
Reference: https://doc.rust-lang.org/src/core/iter/adapters/zip.rs.html#165-170
```diff
- for (src, dst) in iter::zip(&mut iter, buf.iter_mut().skip(1)) {
+ for (dst, src) in iter::zip(buf.iter_mut().skip(1), &mut iter) {
```
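The asymmetry can be checked in isolation (standard-library behavior; this standalone snippet is only for illustration):

```rust
fn main() {
    let mut iter = [1, 2, 3].into_iter();
    let mut buf = [0; 2];

    // Iterator first: `zip` calls `iter.next()` before noticing the buffer is
    // exhausted, so element 3 is pulled out and silently discarded.
    for (src, dst) in std::iter::zip(&mut iter, buf.iter_mut()) {
        *dst = src;
    }
    assert_eq!(buf, [1, 2]);
    assert_eq!(iter.next(), None); // 3 was consumed and lost

    // Buffer first: `zip` stops as soon as the buffer runs out,
    // without touching the iterator again.
    let mut iter = [1, 2, 3].into_iter();
    let mut buf = [0; 2];
    for (dst, src) in std::iter::zip(buf.iter_mut(), &mut iter) {
        *dst = src;
    }
    assert_eq!(buf, [1, 2]);
    assert_eq!(iter.next(), Some(3)); // 3 is still available
}
```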
😱 Yikes, great catch! Maybe this is why I intuitively wrote it in a more explicit way to begin with. `iter::zip` looks so symmetrical (which is why I prefer it over `Iterator::zip`), but in this case that's misleading.

I've updated the test to catch this, and rewrote the loop to be more explicit. The new test also caught another bug that all three of us missed: `len += buf_len;` was missing at the end of the loop. But I'm confident that it is correct now.
Looks good to me. Small issue I noticed: if the iterator panics, the data in the buffer will not be dropped. It's not a safety issue, but it would be nice to drop the buffer's contents properly.

```rust
use std::mem::MaybeUninit;
use std::ptr;

struct Buffer<const N: usize, T> {
    buf: [MaybeUninit<T>; N],
    len: usize,
}

impl<const N: usize, T> Default for Buffer<N, T> {
    fn default() -> Self {
        Self {
            buf: [const { MaybeUninit::uninit() }; N],
            len: 0,
        }
    }
}

impl<const N: usize, T> Drop for Buffer<N, T> {
    fn drop(&mut self) {
        assert!(self.len <= N);
        if N > 0 {
            unsafe {
                ptr::drop_in_place(ptr::slice_from_raw_parts_mut(
                    self.buf[0].as_mut_ptr(),
                    self.len,
                ));
            }
        }
    }
}
```
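For context on why the `Drop` impl matters: a `MaybeUninit` slot never drops whatever was written into it, so a partially filled buffer would simply leak its elements during unwinding. A standalone illustration (not from the PR):

```rust
use std::mem::MaybeUninit;
use std::rc::Rc;

fn main() {
    let tracked = Rc::new(());

    {
        // An initialized MaybeUninit slot is NOT dropped when it goes out of
        // scope; the clone written here leaks unless something drops it manually.
        let mut slot: MaybeUninit<Rc<()>> = MaybeUninit::uninit();
        let _ = slot.write(Rc::clone(&tracked));
    } // `slot` is dropped here, but the Rc inside it is not

    // The leaked clone is still counted: 2 instead of 1.
    assert_eq!(Rc::strong_count(&tracked), 2);
}
```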
Great catch yet again. This looks like a good opportunity to make that buffer a proper struct.
This PR now adds a significant amount of complexity, so there's quite a chance that we introduce bugs. Given the performance gains, it's probably OK to iron those out over time, but maybe we should bookmark this in case regressions appear in the future 🙂

But I wonder -- are there no higher-level ways to achieve this with the standard library? It seems to be a pattern that occurs every now and then when implementing `Extend`... Does anyone know how `std` or other crates handle it -- do they also reach for low-level `unsafe` all the time?
It'll appear in the changelog, right?
Here's the relevant `Extend` implementation for `Vec` in the standard library. The problem is that such an implementation would make at least one Godot API call per element, so it wouldn't be any faster than what we currently have.

The high-level way would be to collect the iterator into a `Vec` first and then copy it over, but that takes more memory.
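To spell out the trade-off, the two straightforward alternatives being weighed here might look like this (hedged sketch; a `Vec` stands in for the packed array and `push`/`extend` for the Godot calls):

```rust
// Alternative 1: one push per element, i.e. at least one cross-FFI call per
// item. Simple, but this is exactly the slow path being replaced.
fn extend_by_push<T, I: Iterator<Item = T>>(array: &mut Vec<T>, iter: I) {
    for item in iter {
        array.push(item); // imagine a Godot API call here
    }
}

// Alternative 2: collect into a temporary Vec first, then append in one go.
// Only a handful of calls, but the whole input is buffered in memory.
fn extend_via_vec<T, I: Iterator<Item = T>>(array: &mut Vec<T>, iter: I) {
    let tmp: Vec<T> = iter.collect();
    array.extend(tmp); // imagine a single resize + bulk copy here
}
```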
If you have the time to do that, it would of course be very appreciated. But you've already done a ton of research on this PR, so I don't want to demand more and more 😉

Related questions (also depending on the outcome of the above):
Here are my raw benchmark results:
All benchmarks use an iterator of 1000 elements.

Each is run with two cases: one where the iterator reports an exact size hint, and one where it reports a pessimistic lower bound of 0.

In a previous version of this PR, I reported 0.14 µs for the equivalent benchmark.
Sure thing.
Indeed. It's only useful if you don't know ahead of time how many elements are incoming, which would also apply if we implemented this for strings. That said, using large strings is much less common than using large arrays of core types.
Thanks a lot! So there's still a factor of 8× and 3× for this PR compared to a `Vec`-based approach.
I like it.

Ready to squash?
Thanks a lot, the code looks very high-quality and is well documented!
Some minor comments, after that you can squash 🙂
(can also be more than 1 commit if you find logical groups that make sense, up to you)
🤔 I'm not convinced about the static assertion that the buffer size is non-zero. Rust has been pretty good at eliding conditional statements on constant expressions; adding a static assertion adds complexity/compile time with at best a dubious runtime benefit. Is there any difference in the generated assembly? I want to see what Compiler Explorer says.
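For reference, the kind of compile-time check being debated can be written without any dependencies, roughly like this (illustrative sketch, not the PR's actual code):

```rust
// The usual no-dependency pattern: an associated const that panics during
// const evaluation, tied to the const generic parameter.
struct AssertNonZero<const N: usize>;

impl<const N: usize> AssertNonZero<N> {
    const OK: () = assert!(N > 0, "buffer size must be non-zero");
}

fn make_buffer<T, const N: usize>() -> [Option<T>; N] {
    // Referencing the constant forces the check for this particular `N`,
    // turning `N == 0` into a compile-time error instead of a runtime branch.
    let () = AssertNonZero::<N>::OK;
    std::array::from_fn(|_| None)
}

fn main() {
    let buf: [Option<String>; 4] = make_buffer();
    assert_eq!(buf.len(), 4);
    // let bad: [Option<String>; 0] = make_buffer(); // would fail to compile
}
```

Whether this is worth it over a plain `if N > 0` that the optimizer removes anyway is exactly the question raised above.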
Imo it's a bit unfair to move all the work to @ttencate. I'm also guilty of that with the request for benchmarks, but there we're at least trading off against a whole buffer class, whereas this static assertion is mostly one statement. As we use static assertions in other places, too, I'm not sure the burden of proof is on him here 😉 (that said, it's a fair question)

Maybe you could look into it?
That's okay @Bromeon, I'm also the one creating review burden by adding stuff to this PR all the time ;) Indeed the check is elided, so I'll use a plain old `if`.
Testing with the following code:

```rust
use std::hint::black_box;
use std::mem::MaybeUninit;
use std::ptr;

/// A fixed-size buffer that does not do any allocations, and can hold up to `N` elements of type `T`.
///
/// This is used to implement `Packed*Array::extend()` in an efficient way, because it forms a middle ground between
/// repeated `push()` calls (slow) and first collecting the entire `Iterator` into a `Vec` (faster, but takes more memory).
///
/// Note that `N` must not be 0 for the buffer to be useful. This is checked at compile time.
pub struct ExtendBuffer<T, const N: usize> {
    buf: [MaybeUninit<T>; N],
    len: usize,
}

impl<T, const N: usize> Default for ExtendBuffer<T, N> {
    fn default() -> Self {
        Self {
            buf: [const { MaybeUninit::uninit() }; N],
            len: 0,
        }
    }
}

impl<T, const N: usize> ExtendBuffer<T, N> {
    /// Appends the given value to the buffer.
    ///
    /// # Panics
    /// If the buffer is full.
    #[inline(never)]
    pub fn push(&mut self, value: T) {
        self.buf[self.len].write(value);
        self.len += 1;
    }

    /// Returns `true` iff the buffer is full.
    #[inline(never)]
    pub fn is_full(&self) -> bool {
        self.len == N
    }

    /// Returns a slice of all initialized elements in the buffer, and sets the length of the buffer back to 0.
    ///
    /// It is the caller's responsibility to ensure that all elements in the returned slice get dropped!
    pub fn drain_as_mut_slice(&mut self) -> &mut [T] {
        debug_assert!(self.len <= N);
        let len = self.len;
        self.len = 0;

        // MaybeUninit::slice_assume_init_ref could be used here instead, but it's experimental.
        // SAFETY:
        // - The pointer is non-null, valid and aligned.
        // - `len` elements are always initialized.
        // - The memory is not accessed through any other pointer, because we hold a `&mut` reference to `self`.
        // - `len * mem::size_of::<T>()` is no larger than `isize::MAX`, otherwise the `buf` slice could not have existed either.
        unsafe {
            std::slice::from_raw_parts_mut(
                if N > 0 { self.buf[0].as_mut_ptr() } else { ptr::dangling_mut() },
                len,
            )
        }
    }
}

impl<T, const N: usize> Drop for ExtendBuffer<T, N> {
    #[inline(never)]
    fn drop(&mut self) {
        debug_assert!(self.len <= N);
        if N > 0 {
            // SAFETY: `slice_from_raw_parts_mut` by itself is not unsafe, but to make the resulting slice safe to use:
            // - `self.buf[0]` is a valid pointer, exactly `self.len` elements are initialized.
            // - The pointer is not aliased since we have an exclusive `&mut self`.
            let slice = ptr::slice_from_raw_parts_mut(self.buf[0].as_mut_ptr(), self.len);

            // SAFETY: the value is valid because the `slice_from_raw_parts_mut` requirements are met,
            // and there is no other way to access the value.
            unsafe {
                ptr::drop_in_place(slice);
            }
        }
    }
}

#[no_mangle]
pub fn zero_s() {
    let v = <ExtendBuffer<String, 0>>::default();
    black_box(v.is_full());
    drop(black_box(v));
}

#[no_mangle]
pub fn four_s() {
    let mut v = <ExtendBuffer<String, 4>>::default();
    v.push(String::from("abcd"));
    drop(black_box(v));
}
```

This yields no drop call for the zero-length buffer case.
This uses the iterator's size hint to pre-allocate, which leads to a 63× speedup in the best case. If the hint is pessimistic, it reads into a buffer to avoid repeated `push()` calls, which is still 12× as fast as the previous implementation.
Force-pushed from 4fbdef2 to b7178ee
Thanks a lot for this thorough pull request and the continuous improvements!
Speed up creating and extending packed arrays from iterators up to 63×
This uses the iterator's size hint to pre-allocate, which leads to a 63× speedup in the best case. If the hint is pessimistic, it reads into a buffer to avoid repeated `push()` calls, which is still 44× as fast as the previous implementation.
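Assuming `PackedInt32Array` keeps the `FromIterator` and `Extend` impls this PR optimizes, caller-side code doesn't change; both paths below are a hypothetical usage sketch, not taken from the PR (and, as with any packed-array construction, they need an initialized Godot engine at runtime):

```rust
use godot::builtin::PackedInt32Array;

fn build_arrays() {
    // Exact size hint: the array can be sized once up front (best case).
    let mut squares: PackedInt32Array = (0..1000).map(|i| i * i).collect();

    // Pessimistic hint (`filter` reports a lower bound of 0): elements go
    // through the stack buffer and are appended in chunks.
    squares.extend((0..1000).filter(|i| i % 3 == 0));
}
```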