Skip to content

avr: implement __[u]divmod(h|q)i4 #815

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

tones111
Copy link
Contributor

GCC uses a special calling convention for integer multiplication and devision.
ref: https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention

Unless we teach the compiler about this convention we're limited to implementing these functions via naked_asm!. The implementations provided here derive from libcc.
ref: https://github.com/gcc-mirror/gcc/blob/95f10974a9190e345776604480a2df0191104308/libgcc/config/avr/lib1funcs.S

This effort is in support of #711

/// Note: GCC implements a [non-standard calling convention](https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention) for this function.
#[naked]
pub unsafe extern "C" fn __udivmodqi4() {
// Derived from: https://github.com/gcc-mirror/gcc/blob/95f10974a9190e345776604480a2df0191104308/libgcc/config/avr/lib1funcs.S#L1365
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These implementations derive from libgcc where the originating file is marked GPLv3+. IANAL and unsure of the implications from including it here.

It would be interesting to attempt to port these functions to Rust. I think a similar implementation may be possible using some of the standard library's u8/u16 methods. I think that would require us to teach rustc about this weird ABI. Does this implementation need to remain ABI compatible with GCC?

Copy link
Contributor

@Patryk27 Patryk27 Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead of pulling GCC's impl we could simply use the naked function as a trampoline? i.e. like:

#[naked]
pub unsafe extern "C" fn __divmodhi4() {
    let args;

    naked_asm!(/* move numbers from funny, arbitrary input registers into `args` */);

    let out = /* call native-Rust implementation using `args` */;

    naked_asm!(/* move `out` to funny, arbitrary output registers */)
}

This would be less efficient and we have to be careful not to overwrite registers we shouldn't touch, but it should be good enough to get started.

Copy link
Contributor

@tgross35 tgross35 Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replied to this in another comment but unfortunately we can't take these as-is, ports are considered derivative work.

Mentioned here #711 (comment) but adding a new ABI to rustc would be overkill for four functions in the whole ecosystem, maybe not possible based on LLVM IR limitations. You could probably use Rust as a start though, emit the optimized asm and manually adjust for the ABI. Or emit the optimized LLVM IR and adjust the signature to match the ABI if possible (this is effectively what compiler support in rustc would do), and compile that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to go the trampoline route, but wasn't sure how to map the registers. I'm guessing it needs to move the input parameters into the expected registers, call a regular Rust function, then move the return arguments back to the special registers. All while save and restoring those trampled registers.

It would be a start, but likely too much overhead for a long-term solution. I'll try implementing an algorithm in Rust and see what the optimizer spits out.

/// Note: GCC implements a [non-standard calling convention](https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention) for this function.
#[naked]
pub unsafe extern "C" fn __udivmodqi4() {
// Derived from: https://github.com/gcc-mirror/gcc/blob/95f10974a9190e345776604480a2df0191104308/libgcc/config/avr/lib1funcs.S#L1365
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be very cautious with licensing here; the GCC sources are GPL but this crate needs to stay MIT or Apache-2.0, so unfortunately anything directly derived needs to be removed. If you write it by hand and get similar results that's fine, but these look pretty line-for-line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. That's why I wanted to be extraordinarily clear that these are lifted from gcc and likely problematic here. It's a starting point. Unfortunately, at the end of the day it's "just math", so re-coding would likely require us to use a different algorithm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary to come up with a different algorithm, as you say there are only so many ways to do small integer division. Our code doesn't need to be different from GCC just to say it's different and overlap is going to happen for these tiny routines, but the way to turn a given algorithm into assembly has to be done from scratch.

I'm not an expert here though, @joshtriplett could probably clarify things.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely shouldn't have any routines taken directly from GCC, or written by someone who is working directly from the code in GCC as an example; that's extremely likely to make them effectively derived from GCC.

Is there an implementation in LLVM or similar that we could draw from?

Copy link
Contributor Author

@tones111 tones111 Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wikipedia has a nice entry on division algorithms. I implemented "long division" and after optimizing variable usage came up with

pub fn long_div_u8(mut num_quo: u8, divisor: u8) -> (u8, u8) {
     let mut remainder = 0u8;
     for _ in 0..8u8 {
         let overflow;
         (num_quo, overflow) = num_quo.overflowing_add(num_quo);
         (remainder, _) = remainder.carrying_add(remainder, overflow);

         if let Some(rem) = remainder.checked_sub(divisor) {
             remainder = rem;
             num_quo += 1;
         }
     }
     (num_quo, remainder)
 }

I thought the u8 methods would lead to reasonable code, but the generated code (--release) is really bad. @Patryk27, do you know if it's possible to teach LLVM to make use of the CPU status register (especially the carry bit) and associated opcodes? Or, maybe we need more AVR-optimized intrinsics?

rustc isn't able to see that an overflowing_add (left shift into carry bit) followed by a carrying_add should have generated an adc. Instead it generates

// let (a, b) = intrinsics::add_with_overflow(...);
add r24, r24
ldi r20, 0x01
cp r24, r25
ldi r25, 0x01
brcs .+2
mov r25, r1

// let (a, c1) = self.overflowing_add(rhs);
add r19, r19

// let (a, b) = intrinsics::add_with_overflow(...);
or r19, r25

// if let Some(rem) = remainder.checked_sub(divisor)
cp r19, r22
mov r25, r22
brcc .+2
mov r25, 1

// if self < rhs
cp r19, r22
brcc .+2
mov r20, r1
sub r19, r25
add r24, r20

...

The GCC assembly looks almost exactly like what I would expect the Rust code to generate.

Copy link
Contributor Author

@tones111 tones111 Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also tried implementing the "non-restoring division" but some of the math requires an extra bit that I couldn't get working in 8-bits. Here's what I've got thus far:

pub fn non_restoring_u8(numerator: u8, divisor: u8) -> (u8, u8) {
    const N: u8 = 8;

    let m = divisor;
    let mut a = 0u16;    // <<=== would like to be u8
    let mut q = numerator;

    for _i in (0..N).rev() {
        let over;
        (q, over) = q.overflowing_add(q);
        (a, _) = a.carrying_add(a, over);
        if a & 0x200 != 0 {
            a = a.wrapping_add(u16::from(m));
        } else {
            a = a.wrapping_sub(u16::from(m));
        }
        if a & 0x100 == 0 {
            q |= 1;
        }
    }
    if a & 0x100 != 0 {
        a = a.wrapping_add(u16::from(m));
    }
    (q, u8::try_from(a).unwrap())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_non_restoring_u8() {
        for n in 0..u8::MAX {
            for d in 1..u8::MAX {
                //println!("n: {n}, d: {d}");
                assert_eq!(non_restoring_u8(n, d), (n / d, n % d));
            }
        }
    }
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have some generic algorithms in the crate, at https://github.com/rust-lang/compiler-builtins/blob/9978a8b06b7c1b53a6c503a2bfe7aea9ba6ca98b/compiler-builtins/src/int/specialized_div_rem/mod.rs and

. I kind of expect that we

rustc isn't able to see that an overflowing_add (left shift into carry bit) followed by a carrying_add should have generated an adc. Instead it generates

Rustc doesn't do this kind of optimization, it would be on the LLVM end. If you can reproduce the problem in LLVM IR, it would be filing a missed optimization bug. You can just replace those couple of lines with inline asm.

Unfortunately it's probably going to be pretty tough to get something that optimizes super well here without handwritten assembly. For this initial PR just doing a naked trampoline to a suboptimal Rust function would be fine, it can be further optimized later.

We have a couple of generic divrem algorithms that might be possible to test and reuse https://github.com/rust-lang/compiler-builtins/blob/9978a8b06b7c1b53a6c503a2bfe7aea9ba6ca98b/compiler-builtins/src/int/specialized_div_rem/mod.rs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to meaningfully add tests as these intrinsics are specific to AVR. I have don't my best to compare the --release codegen with the GCC output, but haven't tried running the code on real hardware. I'm hoping @Patryk27 might be able to run them through his AVR math fuzzer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do!

GCC uses a special calling convention for integer multiplication and
devision.
ref: https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention

Unless we teach the compiler about this convention we're limited to
implementing these functions via naked_asm!.  The implementations
provided here derive from libcc.
ref: https://github.com/gcc-mirror/gcc/blob/95f10974a9190e345776604480a2df0191104308/libgcc/config/avr/lib1funcs.S
@tones111
Copy link
Contributor Author

I've submitted a from-scratch implementation in #816.

The gcc implementation unconditionally shifts bits into the quotient via the carry bit (at the bottom of the loop). In contrast I shift the quotient at the top of loop and conditionally assert its bits. gcc also complements the quotient before returning which is not required in my implementation.

The special ABI doesn't leave much room for differentiation, but I feel the mechanics are reasonably unique for such a short code segment. The algorithmic differences are easier to see in the 16-bit solution where gcc starts by jumping toward the end of the loop, requiring an awkward loop count.

closing this in favor of #816

@tones111 tones111 closed this Apr 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants