avr: implement __[u]divmod(h|q)i4 #815

tones111 · 2025-04-10T18:07:15Z

GCC uses a special calling convention for integer multiplication and devision.
ref: https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention

Unless we teach the compiler about this convention we're limited to implementing these functions via naked_asm!. The implementations provided here derive from libcc.
ref: https://github.com/gcc-mirror/gcc/blob/95f10974a9190e345776604480a2df0191104308/libgcc/config/avr/lib1funcs.S

This effort is in support of #711

compiler-builtins/src/int/udiv.rs

tones111 · 2025-04-10T18:18:57Z

compiler-builtins/src/int/udiv.rs

+    /// Note: GCC implements a [non-standard calling convention](https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention) for this function.
+    #[naked]
+    pub unsafe extern "C" fn __udivmodqi4() {
+        // Derived from: https://github.com/gcc-mirror/gcc/blob/95f10974a9190e345776604480a2df0191104308/libgcc/config/avr/lib1funcs.S#L1365


These implementations derive from libgcc where the originating file is marked GPLv3+. IANAL and unsure of the implications from including it here.

It would be interesting to attempt to port these functions to Rust. I think a similar implementation may be possible using some of the standard library's u8/u16 methods. I think that would require us to teach rustc about this weird ABI. Does this implementation need to remain ABI compatible with GCC?

Maybe instead of pulling GCC's impl we could simply use the naked function as a trampoline? i.e. like:

#[naked] pub unsafe extern "C" fn __divmodhi4() { let args; naked_asm!(/* move numbers from funny, arbitrary input registers into `args` */); let out = /* call native-Rust implementation using `args` */; naked_asm!(/* move `out` to funny, arbitrary output registers */) }

This would be less efficient and we have to be careful not to overwrite registers we shouldn't touch, but it should be good enough to get started.

I replied to this in another comment but unfortunately we can't take these as-is, ports are considered derivative work.

Mentioned here #711 (comment) but adding a new ABI to rustc would be overkill for four functions in the whole ecosystem, maybe not possible based on LLVM IR limitations. You could probably use Rust as a start though, emit the optimized asm and manually adjust for the ABI. Or emit the optimized LLVM IR and adjust the signature to match the ABI if possible (this is effectively what compiler support in rustc would do), and compile that.

I wanted to go the trampoline route, but wasn't sure how to map the registers. I'm guessing it needs to move the input parameters into the expected registers, call a regular Rust function, then move the return arguments back to the special registers. All while save and restoring those trampled registers.

It would be a start, but likely too much overhead for a long-term solution. I'll try implementing an algorithm in Rust and see what the optimizer spits out.

tgross35 · 2025-04-10T18:19:47Z

compiler-builtins/src/int/udiv.rs

+    /// Note: GCC implements a [non-standard calling convention](https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention) for this function.
+    #[naked]
+    pub unsafe extern "C" fn __udivmodqi4() {
+        // Derived from: https://github.com/gcc-mirror/gcc/blob/95f10974a9190e345776604480a2df0191104308/libgcc/config/avr/lib1funcs.S#L1365


We need to be very cautious with licensing here; the GCC sources are GPL but this crate needs to stay MIT or Apache-2.0, so unfortunately anything directly derived needs to be removed. If you write it by hand and get similar results that's fine, but these look pretty line-for-line.

agreed. That's why I wanted to be extraordinarily clear that these are lifted from gcc and likely problematic here. It's a starting point. Unfortunately, at the end of the day it's "just math", so re-coding would likely require us to use a different algorithm.

I don't think it's necessary to come up with a different algorithm, as you say there are only so many ways to do small integer division. Our code doesn't need to be different from GCC just to say it's different and overlap is going to happen for these tiny routines, but the way to turn a given algorithm into assembly has to be done from scratch.

I'm not an expert here though, @joshtriplett could probably clarify things.

We definitely shouldn't have any routines taken directly from GCC, or written by someone who is working directly from the code in GCC as an example; that's extremely likely to make them effectively derived from GCC.

Is there an implementation in LLVM or similar that we could draw from?

Wikipedia has a nice entry on division algorithms. I implemented "long division" and after optimizing variable usage came up with

pub fn long_div_u8(mut num_quo: u8, divisor: u8) -> (u8, u8) { let mut remainder = 0u8; for _ in 0..8u8 { let overflow; (num_quo, overflow) = num_quo.overflowing_add(num_quo); (remainder, _) = remainder.carrying_add(remainder, overflow); if let Some(rem) = remainder.checked_sub(divisor) { remainder = rem; num_quo += 1; } } (num_quo, remainder) }

I thought the u8 methods would lead to reasonable code, but the generated code (--release) is really bad. @Patryk27, do you know if it's possible to teach LLVM to make use of the CPU status register (especially the carry bit) and associated opcodes? Or, maybe we need more AVR-optimized intrinsics?

rustc isn't able to see that an overflowing_add (left shift into carry bit) followed by a carrying_add should have generated an adc. Instead it generates

// let (a, b) = intrinsics::add_with_overflow(...); add r24, r24 ldi r20, 0x01 cp r24, r25 ldi r25, 0x01 brcs .+2 mov r25, r1 // let (a, c1) = self.overflowing_add(rhs); add r19, r19 // let (a, b) = intrinsics::add_with_overflow(...); or r19, r25 // if let Some(rem) = remainder.checked_sub(divisor) cp r19, r22 mov r25, r22 brcc .+2 mov r25, 1 // if self < rhs cp r19, r22 brcc .+2 mov r20, r1 sub r19, r25 add r24, r20 ...

The GCC assembly looks almost exactly like what I would expect the Rust code to generate.

I also tried implementing the "non-restoring division" but some of the math requires an extra bit that I couldn't get working in 8-bits. Here's what I've got thus far:

pub fn non_restoring_u8(numerator: u8, divisor: u8) -> (u8, u8) { const N: u8 = 8; let m = divisor; let mut a = 0u16; // <<=== would like to be u8 let mut q = numerator; for _i in (0..N).rev() { let over; (q, over) = q.overflowing_add(q); (a, _) = a.carrying_add(a, over); if a & 0x200 != 0 { a = a.wrapping_add(u16::from(m)); } else { a = a.wrapping_sub(u16::from(m)); } if a & 0x100 == 0 { q |= 1; } } if a & 0x100 != 0 { a = a.wrapping_add(u16::from(m)); } (q, u8::try_from(a).unwrap()) } #[cfg(test)] mod tests { use super::*; #[test] fn test_non_restoring_u8() { for n in 0..u8::MAX { for d in 1..u8::MAX { //println!("n: {n}, d: {d}"); assert_eq!(non_restoring_u8(n, d), (n / d, n % d)); } } } }

We do have some generic algorithms in the crate, at https://github.com/rust-lang/compiler-builtins/blob/9978a8b06b7c1b53a6c503a2bfe7aea9ba6ca98b/compiler-builtins/src/int/specialized_div_rem/mod.rs and

compiler-builtins/compiler-builtins/src/int/sdiv.rs

Line 119 in 9978a8b

__udivmodsi4,

. I kind of expect that we

rustc isn't able to see that an overflowing_add (left shift into carry bit) followed by a carrying_add should have generated an adc. Instead it generates

Rustc doesn't do this kind of optimization, it would be on the LLVM end. If you can reproduce the problem in LLVM IR, it would be filing a missed optimization bug. You can just replace those couple of lines with inline asm.

Unfortunately it's probably going to be pretty tough to get something that optimizes super well here without handwritten assembly. For this initial PR just doing a naked trampoline to a suboptimal Rust function would be fine, it can be further optimized later.

We have a couple of generic divrem algorithms that might be possible to test and reuse https://github.com/rust-lang/compiler-builtins/blob/9978a8b06b7c1b53a6c503a2bfe7aea9ba6ca98b/compiler-builtins/src/int/specialized_div_rem/mod.rs.

tones111 · 2025-04-10T18:23:16Z

compiler-builtins/src/int/udiv.rs

I'm not sure how to meaningfully add tests as these intrinsics are specific to AVR. I have don't my best to compare the --release codegen with the GCC output, but haven't tried running the code on real hardware. I'm hoping @Patryk27 might be able to run them through his AVR math fuzzer.

compiler-builtins/src/int/udiv.rs

GCC uses a special calling convention for integer multiplication and devision. ref: https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention Unless we teach the compiler about this convention we're limited to implementing these functions via naked_asm!. The implementations provided here derive from libcc. ref: https://github.com/gcc-mirror/gcc/blob/95f10974a9190e345776604480a2df0191104308/libgcc/config/avr/lib1funcs.S

tones111 · 2025-04-15T00:56:13Z

I've submitted a from-scratch implementation in #816.

The gcc implementation unconditionally shifts bits into the quotient via the carry bit (at the bottom of the loop). In contrast I shift the quotient at the top of loop and conditionally assert its bits. gcc also complements the quotient before returning which is not required in my implementation.

The special ABI doesn't leave much room for differentiation, but I feel the mechanics are reasonably unique for such a short code segment. The algorithmic differences are easier to see in the 16-bit solution where gcc starts by jumping toward the end of the loop, requiring an awkward loop count.

closing this in favor of #816

tones111 commented Apr 10, 2025

View reviewed changes

compiler-builtins/src/int/udiv.rs Outdated Show resolved Hide resolved

tones111 commented Apr 10, 2025

View reviewed changes

tgross35 reviewed Apr 10, 2025

View reviewed changes

tones111 commented Apr 10, 2025

View reviewed changes

compiler-builtins/src/int/udiv.rs Outdated Show resolved Hide resolved

tones111 force-pushed the avr_div branch from 892b179 to 15d5245 Compare April 10, 2025 19:30

tones111 closed this Apr 15, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

avr: implement __[u]divmod(h|q)i4 #815

avr: implement __[u]divmod(h|q)i4 #815

tones111 commented Apr 10, 2025

tones111 Apr 10, 2025

Patryk27 Apr 10, 2025 •

edited

Loading

tgross35 Apr 10, 2025 •

edited

Loading

tones111 Apr 10, 2025

tgross35 Apr 10, 2025

tones111 Apr 10, 2025

tgross35 Apr 10, 2025

joshtriplett Apr 10, 2025

tones111 Apr 11, 2025 •

edited

Loading

tones111 Apr 11, 2025 •

edited

Loading

tgross35 Apr 11, 2025

tones111 Apr 10, 2025

Patryk27 Apr 10, 2025

tones111 commented Apr 15, 2025

avr: implement __[u]divmod(h|q)i4 #815

avr: implement __[u]divmod(h|q)i4 #815

Conversation

tones111 commented Apr 10, 2025

Choose a reason for hiding this comment

Patryk27 Apr 10, 2025 • edited Loading

Choose a reason for hiding this comment

tgross35 Apr 10, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tones111 Apr 11, 2025 • edited Loading

Choose a reason for hiding this comment

tones111 Apr 11, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tones111 commented Apr 15, 2025

Patryk27 Apr 10, 2025 •

edited

Loading

tgross35 Apr 10, 2025 •

edited

Loading

tones111 Apr 11, 2025 •

edited

Loading

tones111 Apr 11, 2025 •

edited

Loading