Pulley Performance Tracking #10102

Open
5 tasks
alexcrichton opened this issue Jan 24, 2025 · 4 comments
Labels
performance pulley Issues related to the Pulley interpreter

Comments

@alexcrichton
Member

I wanted to open a meta-issue about tracking the performance of Pulley. There are a few items I've identified for improving Pulley's performance which I'm unable to tackle myself today, so I'm hoping to use this issue to track the meta-topic of Pulley's performance.

Overall Pulley is in a relatively good spot with respect to performance right now, but Pulley is by no means outstripping other wasm interpreters. Pulley's chief downside, from what I can tell, is that it starts from a much lower level of abstraction than other interpreters: CLIF instead of wasm. This particularly hurts memory accesses, where Pulley fundamentally uses two opcodes per memory access: one for the bounds check and one for the actual load/store. In terms of interpreter performance it's pretty costly to turn the interpreter loop twice per load/store where other interpreters likely only turn the loop once.
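To make the cost concrete, here is a deliberately simplified sketch (these are illustrative opcodes and dispatch code, not Pulley's actual implementation) of why a two-opcode load costs two turns of a match-based interpreter loop:

```rust
// Illustrative only: a wasm load costs two turns of the loop
// (BoundsCheck + Load32) where a fused opcode would cost one.
enum Op {
    BoundsCheck { addr: u32, len: u32 },
    Load32 { addr: u32 },
    Halt,
}

fn run(ops: &[Op], heap: &[u8], bound: u32) -> Result<u32, &'static str> {
    let mut acc = 0;
    for op in ops {
        // Each iteration here is one "turn" of the interpreter loop.
        match op {
            Op::BoundsCheck { addr, len } => {
                if addr.checked_add(*len).map_or(true, |end| end > bound) {
                    return Err("trap: out of bounds");
                }
            }
            Op::Load32 { addr } => {
                let a = *addr as usize;
                acc = u32::from_le_bytes(heap[a..a + 4].try_into().unwrap());
            }
            Op::Halt => break,
        }
    }
    Ok(acc)
}
```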

That's not to say that Pulley is fundamentally less performant than other interpreters, however. Pulley also has the strengths of an optimizing compiler such as Cranelift, which can perform relatively advanced optimizations such as hoisting loads/stores out of loops. Pulley can also relatively easily add macro-ops as necessary to improve performance.

Locally, though, the performance issues I've seen with Pulley are:

  • As mentioned above, loads/stores are two opcodes. This is because one opcode is "rooted" in the CLIF trapnz node for the bounds check. Another node is rooted in the load/store itself. Lowering in ISLE has no way currently to merge these two instructions together into a single trapping instruction. Improving this would need the ability to lower two terms at once (sort of?) or something like a peephole optimization pass after VCode is created.
  • Pulley register moves are not cheap, unlike native platforms, and regalloc2's allocation decisions don't always minimize register moves. One example is Extraneous mov in Pulley bytecode after register allocation #9942. Other decisions I've found hard to isolate into reproducible examples, but I've seen when benchmarking this binary that a hot loop both starts and ends with the same xmov, so I'm not sure why it's there. I've tried to isolate some small examples in tests/disas/pulley/coremark-1.wat and tests/disas/pulley/fib.wat in-tree.
  • In general Cranelift isn't the best at optimizing explicitly-bounds-checked code. Pulley can't rely on signals for catching faults, meaning that all loads/stores are explicitly bounds-checked. One example of suboptimal performance here is that the bound of linear memory, stored in VMContext, isn't hoisted outside of a loop. The bound is reloaded each iteration of the loop despite the loop not doing anything that can mutate the bound (such as calling another function or writing to VMContext). The reason for this is that the base pointer of memory is marked (correctly) as readonly but the bound is not readonly (also correctly).
  • It's not clear what to do with the interpreter loop in terms of the "match loop" vs the "tail loop". This is additionally tracked in pulley: Is the best interpreter loop a "match" loop or a "tail call" loop? #9995 and conventional wisdom is that the "tail loop" should be faster but that doesn't seem to be true for all benchmarks. This probably needs more investigation as I'm not sure what's going on myself.
  • One possible vector for optimization is opcode decoding in the interpreter. Currently all immediates are loaded one-by-one from memory, but it should be possible to load multiple at once by issuing a pointer-size load at the start of an instruction. This in theory reduces the number of loads the native CPU has to pipeline while putting more work on shifts/bitops on the ALU. The downside of this approach is that we have to be more careful as it's possible to read past the end of a function when decoding it. Initial benchmarking additionally showed that this wasn't a speedup on x64. I've got a branch implementing this but it's not in a clean enough state to land in-tree.
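As a rough illustration of the decoding idea in the last bullet (a hypothetical encoding, not Pulley's real bytecode format), here is the one-load-per-immediate approach next to a single wider load split up with shifts and masks:

```rust
// Illustrative only: decode two 16-bit immediates from a bytecode stream.

// One-by-one: the CPU must pipeline two separate loads.
fn decode_one_by_one(pc: &[u8]) -> (u16, u16) {
    let a = u16::from_le_bytes([pc[0], pc[1]]);
    let b = u16::from_le_bytes([pc[2], pc[3]]);
    (a, b)
}

// Wide: one load, then shifts/masks on the ALU. This is only safe if we
// know 4 bytes are readable at `pc`, which is why reading past the end
// of a function during decoding is the hazard.
fn decode_wide(pc: &[u8]) -> (u16, u16) {
    let word = u32::from_le_bytes([pc[0], pc[1], pc[2], pc[3]]);
    ((word & 0xffff) as u16, (word >> 16) as u16)
}
```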

I plan on adding more items here over time as I see them, and/or spinning off these to sub-issues as appropriate.

@alexcrichton alexcrichton added performance pulley Issues related to the Pulley interpreter labels Jan 24, 2025

@fitzgen
Member

fitzgen commented Jan 24, 2025

  • As mentioned above, loads/stores are two opcodes. This is because one opcode is "rooted" in the CLIF trapnz node for the bounds check. Another node is rooted in the load/store itself. Lowering in ISLE has no way currently to merge these two instructions together into a single trapping instruction. Improving this would need the ability to lower two terms at once (sort of?) or something like a peephole optimization pass after VCode is created.

Alternatively, rather than making instruction selection support lowering two root instructions at once (which is hard), we could try something like this:

  • we add to Pulley a bounds_checked_{load,store} dest, addr, bound family of instructions (exact details to vary with what we actually need) that are defined to trap when the address is out of bounds
  • we enable spectre mitigations, or something new but not "spectre" that uses selects instead of trap[n]zs for bounds checks, so that we see the bounds check in the data flow into the memory operation, rather than it being a separate trap[n]z, eg:
    v0 = select is_in_bounds, addr, zero
    v1 = load v0
    
  • we add lowering rules to pattern match on that data flow going into a load/store and emit bounds_checked_{load,store} instructions when possible
    • bonus: figure out how to avoid re-adding bounds checks to every load/store during lowering when we already deduped them in the mid-end, so that the first memory operation lowers to a bounds_checked_* operation but subsequent ones lower to raw loads/stores
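A minimal sketch of the semantics behind this select trick, with hypothetical helper names (`select_addr` and `load_trap_on_null` are illustrative, not real Pulley or Cranelift APIs): the bounds check feeds the address computation, so an out-of-bounds access becomes address 0, and the fused load traps on a null address instead of relying on a separate trap[n]z:

```rust
// The `select` in CLIF: out-of-bounds accesses resolve to address 0,
// keeping the bounds check in the data flow into the memory operation.
fn select_addr(is_in_bounds: bool, addr: u64) -> u64 {
    if is_in_bounds { addr } else { 0 }
}

// The fused load: defined to trap when handed the null address, which
// is the only address the select can produce for an OOB access.
fn load_trap_on_null(heap: &[u8], addr: u64) -> Result<u8, &'static str> {
    if addr == 0 {
        return Err("trap: out-of-bounds access");
    }
    heap.get(addr as usize).copied().ok_or("trap: bad address")
}
```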

@fitzgen
Member

fitzgen commented Jan 25, 2025

cc #10111, since I believe we've seen some pulley benchmarks where a conditional trap was deduplicated within a loop body, but not hoisted up above the loop for a combo of reasons (see that issue for more discussion).

@alexcrichton
Member Author

For the spectre ops, I think that could work, yeah. We currently discard all trap codes in MemFlags on loads/stores but that's mainly because there are none today. That could be used to hook into Pulley where we can semantically have instructions that are "trap if addr==0, otherwise do the load/store". We could then move all macro-ops like *_g32 addressing to these semantics and fold the spectre-check + trapping-load into the new *_g32 ops.

I'll see if I can't play around with that this week.
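Sketched out, a fused bounds-checked load along these lines might behave like the following (a hypothetical function, not the actual *_g32 implementation; it assumes the caller guarantees `bound <= heap.len()`):

```rust
// Hypothetical fused "g32"-style load: bounds check + load in a single
// interpreter step, trapping via an error instead of relying on signals.
fn xload32_g32(
    heap: &[u8],
    bound: u32,
    addr: u32,
    offset: u32,
) -> Result<u32, &'static str> {
    // Effective address and its end, with overflow folded into the trap.
    let ea = addr.checked_add(offset).ok_or("trap: address overflow")?;
    let end = ea.checked_add(4).ok_or("trap: address overflow")?;
    if end > bound {
        return Err("trap: out of bounds");
    }
    let ea = ea as usize;
    Ok(u32::from_le_bytes(heap[ea..ea + 4].try_into().unwrap()))
}
```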

alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Jan 29, 2025
This commit is a large refactoring to reimplement how WebAssembly
loads/stores are translated to Pulley opcodes when using the
interpreter. Additionally the functionality related to memory support
has changed quite a bit with the interpreter as well. This is all based
off comments on bytecodealliance#10102 with the end goal of folding the two Pulley
opcodes today of "do the bounds check" and "do the load" into one
opcode. This is intended to reduce the number of opcodes and overall
improve interpreter throughput by minimizing turns of the interpreter
loop.

The basic idea behind this PR is that a new basic suite of loads/stores
are added to Pulley which trap if the address is zero. This provides a
route to translate trapping loads/stores in CLIF to Pulley bytecode
without actually causing segfaults at runtime. WebAssembly translation
to CLIF is then updated to use the `select` trick for wasm loads/stores
where either 0 is loaded from or the actual address is loaded from.
Basic support for translation and such is added for this everywhere, and
this ensures that all loads/stores for wasm will be translated
successfully with Pulley.

The next step was to extend the "g32" addressing mode preexisting in
Pulley to support a bounds check as well. New pattern-matches were added
to ISLE to search for a bounds check in the address of a trapping
load/store. If found then the entire chain of operations necessary to
compute the address are folded into a single "g32" opcode which ends up
being a fallible load/store at runtime.

To fit all this into Pulley this commit contains a number of
refactorings to shuffle around existing opcodes related to memory and
extend various pieces of functionality here and there:

* Pulley now uses `AddrFoo` types to represent addressing modes as a
  single immediate rather than splitting it up into pieces for each
  method. For example `AddrO32` represents "base + offset32". `AddrZ`
  represents the same thing but traps if the address is zero. The
  `AddrG32` mode represents a bounds-checked 32-bit linear memory access
  on behalf of wasm.

* Pulley loads/stores were reduced to always using an `AddrFoo`
  immediate. This means that the old `offset8` addressing mode was
  removed without replacement here (to be added in the future if
  necessary). Additionally the suite of sign-extension modes supported
  was trimmed down to remove the 8-to-64, 16-to-64, and 32-to-64 extensions
  folded as part of the opcode. These can of course always be re-added
  later but probably want to be added just for the `G32` addressing mode
  as opposed to all addressing modes.

* The interpreter itself was refactored to have an `AddressingMode`
  trait to ensure that all memory accesses, regardless of addressing
  modes, are largely just copy/pastes of each other. In the future it
  might make sense to implement these methods with a macro, but for now
  it's copy/paste.
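  As a rough sketch of what such a trait could look like (hypothetical
  names and signatures, not the in-tree definition):

  ```rust
  // Each addressing mode resolves itself to an in-bounds index, or
  // `None` to trap, so load/store handlers become copy/pastes
  // parameterized over the mode.
  trait AddressingMode {
      fn resolve(&self, heap_len: usize, size: usize) -> Option<usize>;
  }

  struct AddrO32 { base: usize, offset: i32 }
  struct AddrZ { base: usize, offset: i32 }

  impl AddressingMode for AddrO32 {
      fn resolve(&self, heap_len: usize, size: usize) -> Option<usize> {
          let ea = self.base.checked_add_signed(self.offset as isize)?;
          if ea.checked_add(size)? <= heap_len { Some(ea) } else { None }
      }
  }

  impl AddressingMode for AddrZ {
      fn resolve(&self, heap_len: usize, size: usize) -> Option<usize> {
          // Same as O32, but traps when the computed address is zero.
          let ea = self.base.checked_add_signed(self.offset as isize)?;
          if ea != 0 && ea.checked_add(size)? <= heap_len {
              Some(ea)
          } else {
              None
          }
      }
  }

  fn load32<A: AddressingMode>(heap: &[u8], addr: A) -> Option<u32> {
      let ea = addr.resolve(heap.len(), 4)?;
      Some(u32::from_le_bytes(heap[ea..ea + 4].try_into().ok()?))
  }
  ```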

* In ISLE the `XLoad` generic instruction removed its `ext` field to
  have extensions handled exclusively in ISLE instead of partly in
  `emit.rs`.

* Float/vector loads/stores now have "g32" addressing (in addition to
  the "z" that's required for wasm) since it was easy to add them.

* Translation of 1-byte accesses on Pulley from WebAssembly to CLIF no
  longer has a special case for using `a >= b` instead of `a > b - 1` to
  ensure that the same bounds-check instruction can be used for all
  sizes of loads/stores.

* The bounds-check which folded a load-of-the-bound into the opcode is
  now present as a "g32bne" addressing mode, with its own suite of
  instructions to boot.

Overall this PR is not a 1:1 replacement of all previous opcodes with
exactly one opcode. For example loading 8 bits sign-extended to 64-bits
is now two opcodes instead of one. Additionally some previous opcodes
have expanded in size where, for example, the 8-bit offset mode was removed
in favor of only having 32-bit offsets. The goal of this PR is to reboot
how memory is handled in Pulley. All loads/stores now use a specific
addressing mode and currently all operations supported across addressing
modes are consistently supported. In the future it's expected that some
features will be added to some addressing modes and not others as
necessary, for example extending the "g32" addressing mode only instead
of all addressing modes.

For an evaluation of this PR:

* Code size: `spidermonkey.cwasm` file is reduced from 19M to 16M.
* Sightglass: `pulldown-cmark` is improved by 15%
* Sightglass: `bz2` is improved by 20%
* Sightglass: `spidermonkey` is improved by 22%
* Coremark: score improved by 40%

Overall this PR and new design looks to be a large win. This is all
driven by the reduction in opcodes both for compiled code size and
execution speed by minimizing turns of the interpreter loop. In the end
I'm also pretty happy with how this turned out and I think the
refactorings are well worth it.