Autodiff batching #137880

Merged (5 commits) on Apr 5, 2025

ZuseZ4
Member

@ZuseZ4 ZuseZ4 commented Mar 2, 2025

Enzyme supports batching, a technique best known from machine learning, where it is used when training neural networks.
There we would normally have a training loop, where in each iteration we pass in some data (e.g. an image) and a target vector. Based on how close our prediction is to the target we compute our loss, and then use backpropagation to compute the gradients and update our weights.
Doing this one sample at a time is quite inefficient, so what you normally do is pass in a batch of 8, 16, or more images and targets and compute the gradients for all of them at once, which allows better optimizations.

Enzyme supports batching in two ways. The first one (which I implemented here) just accepts a batch size N,
and then each Dual/Duplicated argument has not one, but N shadow arguments. So instead of

for i in 0..100 {
   df(x[i], y[i], 1234);
}

you can now do

for i in (0..100).step_by(4) {
    df(x[i], x[i + 1], x[i + 2], x[i + 3], y[i], y[i + 1], y[i + 2], y[i + 3], 1234);
}

which gives the same results but allows better compiler optimizations. See the test case for details.
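
To make the "N shadow arguments" idea concrete, here is a minimal hand-written sketch of what a batched forward-mode derivative computes. It does not use the autodiff macro or Enzyme: `f`, `df`, and `df_batched4` are hypothetical names written by hand for illustration, the batch size of 4 is hard-coded, and the signature is an assumption rather than the one the compiler generates.

```rust
/// Primal function: f(x) = x * x + 3 * x.
fn f(x: f32) -> f32 {
    x * x + 3.0 * x
}

/// Hand-written forward-mode derivative with one shadow (tangent) input.
/// Returns (f(x), f'(x) * dx).
fn df(x: f32, dx: f32) -> (f32, f32) {
    let slope = 2.0 * x + 3.0;
    (f(x), slope * dx)
}

/// Hand-written batched variant: one primal input, four shadow inputs.
/// The primal value and the slope are computed once and shared by all shadows.
fn df_batched4(x: f32, dx: [f32; 4]) -> (f32, [f32; 4]) {
    let slope = 2.0 * x + 3.0;
    (f(x), dx.map(|d| slope * d))
}

fn main() {
    let x = 2.0_f32;
    let seeds = [1.0_f32, 0.5, -1.0, 0.0];

    // Scalar mode: four separate calls, one per shadow.
    let scalar: Vec<f32> = seeds.iter().map(|&dx| df(x, dx).1).collect();

    // Batch mode: a single call carrying all four shadows.
    let (_, batched) = df_batched4(x, seeds);

    assert_eq!(scalar, batched.to_vec());
    println!("{batched:?}");
}
```

The point of the sketch is only that the batched call does the shared work once; any further gains depend on what the optimizer can do with the fused call.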

There is a second variant, where we can mark certain arguments, and instead of having to pass in N shadow arguments, Enzyme assumes that the argument is N times longer. I.e., instead of accepting 4 slices with 12 floats each, we would accept one slice with 48 floats. I'll implement this over the next few days.
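
The difference between the two argument layouts can be sketched in plain Rust. This is a hand-written illustration, not Enzyme's API: `tangent`, `d_separate`, and `d_flat` are hypothetical names, and the function being differentiated (a sum of squares) is chosen only to keep the example small.

```rust
/// Tangent of f(x) = sum(x_i^2) in direction dx: sum(2 * x_i * dx_i).
fn tangent(x: &[f32], dx: &[f32]) -> f32 {
    x.iter().zip(dx).map(|(&xi, &di)| 2.0 * xi * di).sum()
}

/// First layout: four separate shadow slices, each as long as `x`.
fn d_separate(x: &[f32], dx: [&[f32]; 4]) -> [f32; 4] {
    dx.map(|d| tangent(x, d))
}

/// Second layout: one shadow slice that is N times as long as `x`.
/// The batch size is recovered from the two lengths.
fn d_flat(x: &[f32], dx: &[f32]) -> Vec<f32> {
    assert!(!x.is_empty() && dx.len() % x.len() == 0);
    dx.chunks_exact(x.len()).map(|d| tangent(x, d)).collect()
}

fn main() {
    let x: Vec<f32> = (0..12).map(|i| i as f32).collect();

    // One flat shadow buffer holding 4 * 12 = 48 floats.
    let flat: Vec<f32> = (0..48).map(|i| (i % 5) as f32).collect();

    // The same data viewed as four slices of 12 floats each.
    let parts = [&flat[0..12], &flat[12..24], &flat[24..36], &flat[36..48]];

    assert_eq!(d_separate(&x, parts).to_vec(), d_flat(&x, &flat));
}
```

Both layouts carry the same data; the flat one trades a longer argument list for a length convention that caller and callee have to agree on.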

I will also add more tests for both modes.

For anyone who prefers a more interactive explanation, here's a video of Tim's LLVM dev talk, where he presents his work: https://www.youtube.com/watch?v=edvaLAL5RqU
I'll also add more docs to the dev guide and user docs in another PR.

r? ghost

Tracking:

- rust-lang#124509
- rust-lang#135283

@rustbot rustbot added A-attributes Area: Attributes (`#[…]`, `#![…]`) S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Mar 2, 2025
@ZuseZ4
Member Author

ZuseZ4 commented Mar 2, 2025

@rustbot label +F-autodiff

@rustbot rustbot added the F-autodiff `#![feature(autodiff)]` label Mar 2, 2025
@bors
Collaborator

bors commented Mar 8, 2025

☔ The latest upstream changes (presumably #138177) made this pull request unmergeable. Please resolve the merge conflicts.

@ZuseZ4 ZuseZ4 force-pushed the autodiff-batching branch 2 times, most recently from 0243b2b to a1865e2 Compare March 13, 2025 05:53
@ZuseZ4 ZuseZ4 force-pushed the autodiff-batching branch from b76368b to 722b3d0 Compare March 24, 2025 05:48
@ZuseZ4
Member Author

ZuseZ4 commented Apr 3, 2025

OK, so this is enough for one PR.
It adds most of the batching infrastructure, but it only explicitly tests it for forward-mode autodiff. It also adds support for sret in combination with forward-mode batching.
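
For readers unfamiliar with sret: it is the LLVM parameter attribute that marks a hidden pointer argument through which a function writes its return value, typically used when that value is too large to return in registers. The sketch below shows the idea by hand in plain Rust. `value_and_tangents` and `value_and_tangents_sret` are hypothetical names, and whether the compiler actually picks this lowering for a given function depends on the target ABI. The assumption here is that a batched forward-mode derivative returns the primal plus one derivative per batch entry, which is one way a return value can grow large enough for this to matter.

```rust
/// Returns the value and two derivative entries by value. For a large enough
/// return type, the compiler may lower this to the out-pointer form below.
fn value_and_tangents(x: f64) -> [f64; 3] {
    [x * x, 2.0 * x, 2.0]
}

/// Hand-written equivalent of an sret lowering: the caller allocates the
/// result slot and passes a pointer, and the callee writes through it.
fn value_and_tangents_sret(out: &mut [f64; 3], x: f64) {
    *out = [x * x, 2.0 * x, 2.0];
}

fn main() {
    let by_value = value_and_tangents(3.0);

    let mut slot = [0.0; 3];
    value_and_tangents_sret(&mut slot, 3.0);

    // Both calling conventions produce the same result.
    assert_eq!(by_value, slot);
}
```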

There are three cases that I left for a follow-up PR, to keep this PR from getting too large.

  1. Reverse-mode batching
  2. sret handling for reverse mode (scalar/batching) or forward mode (scalar)
  3. The second batching mode. Right now we support batching where each (non-const) arg is passed N times, which allows fusing N function calls (e.g. in a loop) into one call. There is a second mode, which accepts just one shadow arg (similar to scalar mode), but where each arg is N times larger (e.g. a vector now has N times the length).

Now that I have more features implemented, it is also becoming a bit clearer to me what this code should look like, so I did some refactoring, even though I tried to split most of that out into the previous cleanup PR.

I'll replace the todos with proper errors, even though the open items hopefully won't stay open for many days.
Let me know what else you think could be improved. (I also generally assume I'll do another refactor once all of batching is merged, since by then I'll know how much code we have where.)

@ZuseZ4 ZuseZ4 marked this pull request as ready for review April 3, 2025 03:15
@rustbot
Collaborator

rustbot commented Apr 3, 2025

Some changes occurred in compiler/rustc_codegen_ssa/src/codegen_attrs.rs

cc @jdonszelmann

Some changes occurred in compiler/rustc_codegen_ssa

cc @WaffleLapkin

@ZuseZ4 ZuseZ4 requested a review from oli-obk April 3, 2025 03:15
@ZuseZ4 ZuseZ4 closed this Apr 3, 2025
@ZuseZ4 ZuseZ4 reopened this Apr 3, 2025
@ZuseZ4 ZuseZ4 force-pushed the autodiff-batching branch from 5935252 to e16de5d Compare April 3, 2025 06:50
@ZuseZ4
Member Author

ZuseZ4 commented Apr 3, 2025

Thank you for all the feedback! I think I have addressed everything; do you have any other comments?

The GitHub UI thinks I still have some requested changes from you open, but I can't find them.

@ZuseZ4 ZuseZ4 requested a review from oli-obk April 3, 2025 19:58
@oli-obk
Contributor

oli-obk commented Apr 3, 2025

The GitHub UI doesn't care about resolving comments... Re-reviewing now

@oli-obk
Contributor

oli-obk commented Apr 3, 2025

Please squash the review commits. If you don't want to fiddle the review commits into appropriate earlier commits, squashing all of the commits in this PR is fine by me

@ZuseZ4 ZuseZ4 force-pushed the autodiff-batching branch from 51a79e3 to 2898b90 Compare April 3, 2025 21:26
@ZuseZ4
Member Author

ZuseZ4 commented Apr 3, 2025

The individual commits build nicely on their own, so it was easy to clean up the history.

@bors r=@oli-obk

@bors
Collaborator

bors commented Apr 3, 2025

📌 Commit 2898b90 has been approved by oli-obk

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Apr 3, 2025
@ZuseZ4 ZuseZ4 mentioned this pull request Apr 3, 2025
7 tasks
Zalathar added a commit to Zalathar/rust that referenced this pull request Apr 4, 2025
Autodiff batching

bors added a commit to rust-lang-ci/rust that referenced this pull request Apr 4, 2025
Rollup of 14 pull requests

Successful merges:

 - rust-lang#137869 (Demote i686-pc-windows-gnu to Tier 2)
 - rust-lang#137880 (Autodiff batching)
 - rust-lang#138546 (Add integer to string formatting tests)
 - rust-lang#138947 (Refactor Apple version handling in the compiler)
 - rust-lang#138950 (replace extra_filename with strict version hash in metrics file names)
 - rust-lang#139213 (Run coretests and alloctests with cg_clif in CI)
 - rust-lang#139274 (Rustdoc: typecheck settings.js)
 - rust-lang#139295 (Remove creation of duplicate `AnonPipe`)
 - rust-lang#139298 (Allow for missing invisible close delim when reparsing an expression.)
 - rust-lang#139313 (Deduplicate some `rustc_middle` function bodies by calling the `rustc_type_ir` equivalent)
 - rust-lang#139317 (compiletest: Encapsulate all of the code that touches libtest)
 - rust-lang#139322 (Add helper function for checking LLD usage to `run-make-support`)
 - rust-lang#139335 (Pass correct param-env to `error_implies`)
 - rust-lang#139342 (Add a mailmap entry for myself)

Failed merges:

 - rust-lang#138949 (Rename `is_like_osx` to `is_like_darwin`)

r? `@ghost`
`@rustbot` modify labels: rollup
Zalathar added a commit to Zalathar/rust that referenced this pull request Apr 4, 2025
Autodiff batching

@ZuseZ4 ZuseZ4 force-pushed the autodiff-batching branch from 2898b90 to 89d8948 Compare April 4, 2025 18:29
@ZuseZ4
Member Author

ZuseZ4 commented Apr 4, 2025

This isn't part of any rollup right now, so I pushed a 3-line bugfix to compiler/rustc_codegen_llvm/src/builder/autodiff.rs. I discovered the bug during more extended testing while working on the second mode.

@bors r=@oli-obk

@bors
Collaborator

bors commented Apr 4, 2025

📌 Commit 89d8948 has been approved by oli-obk

It is now in the queue for this repository.

bors added a commit to rust-lang-ci/rust that referenced this pull request Apr 5, 2025
Rollup of 11 pull requests

Successful merges:

 - rust-lang#136457 (Expose algebraic floating point intrinsics)
 - rust-lang#137880 (Autodiff batching)
 - rust-lang#137897 (fix pthread-based tls on apple targets)
 - rust-lang#138024 (Allow optimizing out `panic_bounds_check` in Unicode checks.)
 - rust-lang#138546 (Add integer to string formatting tests)
 - rust-lang#138826 (StableMIR: Add `associated_items`.)
 - rust-lang#138950 (replace extra_filename with strict version hash in metrics file names)
 - rust-lang#139274 (Rustdoc: typecheck settings.js)
 - rust-lang#139285 (use lower case to match other error messages)
 - rust-lang#139341 (Apply `Recovery::Forbidden` when reparsing pasted macro fragments.)
 - rust-lang#139389 (make `Arguments::as_statically_known_str` doc(hidden))

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit c6bf3a0 into rust-lang:master Apr 5, 2025
6 checks passed
@rustbot rustbot added this to the 1.88.0 milestone Apr 5, 2025
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Apr 5, 2025
Rollup merge of rust-lang#137880 - EnzymeAD:autodiff-batching, r=oli-obk

Autodiff batching

Tracking:

- rust-lang#124509
- rust-lang#135283
Labels

- A-attributes Area: Attributes (`#[…]`, `#![…]`)
- F-autodiff `#![feature(autodiff)]`
- S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
- T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
5 participants