Constant stack size #2688

Open · wants to merge 3 commits into master

Conversation

@timcassell (Collaborator) commented Jan 16, 2025

Fixes #1120.

  1. Refactored the engine stages so that the stack size is constant for each benchmark invocation.
  2. Applied AggressiveOptimization to engine methods to eliminate tiered JIT as a potential variable in the engine when running iterations (see also Apply AggressiveOptimization to clocks AndreyAkinshin/perfolizer#19). A sketch of the attribute usage follows this list.
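As a rough illustration (not the actual engine code; the method and parameter names here are made up), this is what applying AggressiveOptimization to an engine method looks like. The attribute tells the runtime to skip tiered compilation and fully optimize the method on first use, so tier-up recompilation can't show up mid-measurement:

```csharp
using System;
using System.Runtime.CompilerServices;

public class EngineSketch
{
    // Hypothetical stand-in for an engine iteration method.
    // AggressiveOptimization bypasses tiered JIT, so this method is compiled
    // with full optimizations the first time it runs instead of being
    // re-jitted after it becomes hot.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static void RunIteration(long invokeCount, Action workload)
    {
        for (long i = 0; i < invokeCount; i++)
            workload();
    }
}
```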

I tested this on a Ryzen 7 9800X3D and got the same results on both master and this PR; I could not reproduce the measurement spikes from the issue (the graph was smooth).

I tested on Apple M3 and got these results.

Master: (results screenshot)

PR: (results screenshot)

Observations to note

  • The results with this PR stay much less spiky for longer.
    • I can't say why the last part of the graph is also spiky. It could be that the MacBook Air I ran on throttled, but I can't prove it (the benchmark took a very long time to run, so I only ran it once). I also did not include the changes from Apply AggressiveOptimization to clocks AndreyAkinshin/perfolizer#19 in this test, which could also be a factor.
  • The results with this PR are larger, landing at the upper end of the spikes seen with master.
    • I suspect the benchmarks are sensitive to the alignment of the stack when they are invoked. I'm not sure whether BDN should take this into account by default, and if so, what it should do about it (we already have [MemoryRandomization], which randomizes the stack for each iteration; see the sketch after this list).
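For reference, a minimal sketch of how the existing attribute is applied (the benchmark class and method here are illustrative, not taken from BDN's samples):

```csharp
using BenchmarkDotNet.Attributes;

// [MemoryRandomization] makes BDN randomize the memory/stack layout before
// each iteration, so an alignment-sensitive benchmark shows its variance
// instead of locking onto one particular alignment for the whole run.
[MemoryRandomization]
public class AlignmentSensitiveBenchmarks
{
    private int _field;

    [Benchmark]
    public void Increment() => _field++;
}
```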

@timcassell (Collaborator, Author) commented Feb 13, 2025

Another curiosity I found while working on #2336: unrolling the calls affects performance strangely on ARM (Apple M3), while it does what you'd expect on x86-64. (A conceptual sketch of the two invocation strategies follows the numbers below.)

Benchmark of just _field++ (M3):

| Strategy   | Overhead | Workload | Diff   |
|------------|---------:|---------:|-------:|
| Unroll x16 |    0.911 |    0.635 | -0.275 |
| NoUnroll   |    0.900 |    1.240 |  0.339 |
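To make the comparison concrete, here is a hand-written illustration of the two invocation strategies (not the code the engine actually generates):

```csharp
// NoUnroll calls the workload once per loop iteration; Unroll x16 repeats
// the call 16 times in the loop body so the loop overhead is amortized
// across more workload invocations.
public class InvocationSketch
{
    private int _field;

    private void Workload() => _field++;

    public void NoUnroll(long invokeCount)
    {
        for (long i = 0; i < invokeCount; i++)
            Workload();
    }

    public void UnrollX16(long invokeCount)
    {
        for (long i = 0; i < invokeCount; i += 16)
        {
            Workload(); Workload(); Workload(); Workload();
            Workload(); Workload(); Workload(); Workload();
            Workload(); Workload(); Workload(); Workload();
            Workload(); Workload(); Workload(); Workload();
        }
    }
}
```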

timcassell and others added 3 commits February 15, 2025 15:09
Apply AggressiveOptimization to engine methods.

Successfully merging this pull request may close these issues: Large Spike from WorkloadWarmup to WorkloadActual (#1120)