-
Notifications
You must be signed in to change notification settings - Fork 227
avr: implement __[u]divmod(h|q)i4 #815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/// Note: GCC implements a [non-standard calling convention](https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention) for this function. | ||
#[naked] | ||
pub unsafe extern "C" fn __udivmodqi4() { | ||
// Derived from: https://github.com/gcc-mirror/gcc/blob/95f10974a9190e345776604480a2df0191104308/libgcc/config/avr/lib1funcs.S#L1365 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These implementations derive from libgcc where the originating file is marked GPLv3+. IANAL and unsure of the implications from including it here.
It would be interesting to attempt to port these functions to Rust. I think a similar implementation may be possible using some of the standard library's u8/u16 methods. I think that would require us to teach rustc
about this weird ABI. Does this implementation need to remain ABI compatible with GCC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe instead of pulling GCC's impl we could simply use the naked function as a trampoline? i.e. like:
#[naked]
pub unsafe extern "C" fn __divmodhi4() {
let args;
naked_asm!(/* move numbers from funny, arbitrary input registers into `args` */);
let out = /* call native-Rust implementation using `args` */;
naked_asm!(/* move `out` to funny, arbitrary output registers */)
}
This would be less efficient and we have to be careful not to overwrite registers we shouldn't touch, but it should be good enough to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replied to this in another comment but unfortunately we can't take these as-is, ports are considered derivative work.
Mentioned here #711 (comment) but adding a new ABI to rustc would be overkill for four functions in the whole ecosystem, maybe not possible based on LLVM IR limitations. You could probably use Rust as a start though, emit the optimized asm and manually adjust for the ABI. Or emit the optimized LLVM IR and adjust the signature to match the ABI if possible (this is effectively what compiler support in rustc would do), and compile that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to go the trampoline route, but wasn't sure how to map the registers. I'm guessing it needs to move the input parameters into the expected registers, call a regular Rust function, then move the return arguments back to the special registers. All while save and restoring those trampled registers.
It would be a start, but likely too much overhead for a long-term solution. I'll try implementing an algorithm in Rust and see what the optimizer spits out.
/// Note: GCC implements a [non-standard calling convention](https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention) for this function. | ||
#[naked] | ||
pub unsafe extern "C" fn __udivmodqi4() { | ||
// Derived from: https://github.com/gcc-mirror/gcc/blob/95f10974a9190e345776604480a2df0191104308/libgcc/config/avr/lib1funcs.S#L1365 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to be very cautious with licensing here; the GCC sources are GPL but this crate needs to stay MIT or Apache-2.0, so unfortunately anything directly derived needs to be removed. If you write it by hand and get similar results that's fine, but these look pretty line-for-line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed. That's why I wanted to be extraordinarily clear that these are lifted from gcc and likely problematic here. It's a starting point. Unfortunately, at the end of the day it's "just math", so re-coding would likely require us to use a different algorithm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's necessary to come up with a different algorithm, as you say there are only so many ways to do small integer division. Our code doesn't need to be different from GCC just to say it's different and overlap is going to happen for these tiny routines, but the way to turn a given algorithm into assembly has to be done from scratch.
I'm not an expert here though, @joshtriplett could probably clarify things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We definitely shouldn't have any routines taken directly from GCC, or written by someone who is working directly from the code in GCC as an example; that's extremely likely to make them effectively derived from GCC.
Is there an implementation in LLVM or similar that we could draw from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wikipedia has a nice entry on division algorithms. I implemented "long division" and after optimizing variable usage came up with
pub fn long_div_u8(mut num_quo: u8, divisor: u8) -> (u8, u8) {
let mut remainder = 0u8;
for _ in 0..8u8 {
let overflow;
(num_quo, overflow) = num_quo.overflowing_add(num_quo);
(remainder, _) = remainder.carrying_add(remainder, overflow);
if let Some(rem) = remainder.checked_sub(divisor) {
remainder = rem;
num_quo += 1;
}
}
(num_quo, remainder)
}
I thought the u8
methods would lead to reasonable code, but the generated code (--release
) is really bad. @Patryk27, do you know if it's possible to teach LLVM to make use of the CPU status register (especially the carry bit) and associated opcodes? Or, maybe we need more AVR-optimized intrinsics?
rustc
isn't able to see that an overflowing_add
(left shift into carry bit) followed by a carrying_add
should have generated an adc
. Instead it generates
// let (a, b) = intrinsics::add_with_overflow(...);
add r24, r24
ldi r20, 0x01
cp r24, r25
ldi r25, 0x01
brcs .+2
mov r25, r1
// let (a, c1) = self.overflowing_add(rhs);
add r19, r19
// let (a, b) = intrinsics::add_with_overflow(...);
or r19, r25
// if let Some(rem) = remainder.checked_sub(divisor)
cp r19, r22
mov r25, r22
brcc .+2
mov r25, 1
// if self < rhs
cp r19, r22
brcc .+2
mov r20, r1
sub r19, r25
add r24, r20
...
The GCC assembly looks almost exactly like what I would expect the Rust code to generate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also tried implementing the "non-restoring division" but some of the math requires an extra bit that I couldn't get working in 8-bits. Here's what I've got thus far:
pub fn non_restoring_u8(numerator: u8, divisor: u8) -> (u8, u8) {
const N: u8 = 8;
let m = divisor;
let mut a = 0u16; // <<=== would like to be u8
let mut q = numerator;
for _i in (0..N).rev() {
let over;
(q, over) = q.overflowing_add(q);
(a, _) = a.carrying_add(a, over);
if a & 0x200 != 0 {
a = a.wrapping_add(u16::from(m));
} else {
a = a.wrapping_sub(u16::from(m));
}
if a & 0x100 == 0 {
q |= 1;
}
}
if a & 0x100 != 0 {
a = a.wrapping_add(u16::from(m));
}
(q, u8::try_from(a).unwrap())
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_non_restoring_u8() {
for n in 0..u8::MAX {
for d in 1..u8::MAX {
//println!("n: {n}, d: {d}");
assert_eq!(non_restoring_u8(n, d), (n / d, n % d));
}
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have some generic algorithms in the crate, at https://github.com/rust-lang/compiler-builtins/blob/9978a8b06b7c1b53a6c503a2bfe7aea9ba6ca98b/compiler-builtins/src/int/specialized_div_rem/mod.rs and
__udivmodsi4, |
rustc
isn't able to see that anoverflowing_add
(left shift into carry bit) followed by acarrying_add
should have generated anadc
. Instead it generates
Rustc doesn't do this kind of optimization, it would be on the LLVM end. If you can reproduce the problem in LLVM IR, it would be filing a missed optimization bug. You can just replace those couple of lines with inline asm.
Unfortunately it's probably going to be pretty tough to get something that optimizes super well here without handwritten assembly. For this initial PR just doing a naked trampoline to a suboptimal Rust function would be fine, it can be further optimized later.
We have a couple of generic divrem
algorithms that might be possible to test and reuse https://github.com/rust-lang/compiler-builtins/blob/9978a8b06b7c1b53a6c503a2bfe7aea9ba6ca98b/compiler-builtins/src/int/specialized_div_rem/mod.rs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to meaningfully add tests as these intrinsics are specific to AVR. I have don't my best to compare the --release
codegen with the GCC output, but haven't tried running the code on real hardware. I'm hoping @Patryk27 might be able to run them through his AVR math fuzzer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do!
GCC uses a special calling convention for integer multiplication and devision. ref: https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention Unless we teach the compiler about this convention we're limited to implementing these functions via naked_asm!. The implementations provided here derive from libcc. ref: https://github.com/gcc-mirror/gcc/blob/95f10974a9190e345776604480a2df0191104308/libgcc/config/avr/lib1funcs.S
I've submitted a from-scratch implementation in #816. The gcc implementation unconditionally shifts bits into the quotient via the carry bit (at the bottom of the loop). In contrast I shift the quotient at the top of loop and conditionally assert its bits. gcc also complements the quotient before returning which is not required in my implementation. The special ABI doesn't leave much room for differentiation, but I feel the mechanics are reasonably unique for such a short code segment. The algorithmic differences are easier to see in the 16-bit solution where gcc starts by jumping toward the end of the loop, requiring an awkward loop count. closing this in favor of #816 |
GCC uses a special calling convention for integer multiplication and devision.
ref: https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention
Unless we teach the compiler about this convention we're limited to implementing these functions via naked_asm!. The implementations provided here derive from libcc.
ref: https://github.com/gcc-mirror/gcc/blob/95f10974a9190e345776604480a2df0191104308/libgcc/config/avr/lib1funcs.S
This effort is in support of #711