
Move wasm benchmarks to JS benchmark style scoring #18

Merged
merged 4 commits into main on Dec 21, 2024

Conversation

kmiller68
Contributor

@kmiller68 kmiller68 commented Nov 27, 2024

As discussed with the other browser vendors, the plan for JS3 is to score wasm the same way we score JS (with First Iteration, Average, and Worst N). Each iteration (other than the first) runs in about 60ms on my M4 Max MBP.

Move existing benchmarks to WasmLegacyBenchmark and add a new WasmEMCCBenchmark that supports running iterations. It's currently an AsyncBenchmark subclass since the compile is async.

As a proof of concept this patch converts tsf-wasm to use the new scoring. For a C++ library that doesn't already have its own wasm build script, I roughly expect the process to be: compile with -s MODULARIZE=1 -s EXPORT_NAME=setupModule, then add a benchmark.js with something like:

class Benchmark {
    async runIteration() {
        // The compile/instantiation is async, so set up the module lazily
        // on the first iteration.
        if (!Module._runIteration)
            await setupModule(Module);

        Module._runIteration(150);
    }
}

I sampled both the old test (recompiled) and the new JS-style version, and the samples seemed quite similar. So it seems like this benchmark change doesn't radically alter the characteristics of at least this test.

Lastly, get emcc to output the .symbols file so we know which wasm function corresponds to which index. We could emit a name section into the binary, but my assumption is that most production code doesn't ship with one, so it might change the benchmark in an unrealistic way.

@kmiller68 kmiller68 marked this pull request as draft November 27, 2024 18:29
@kmiller68 kmiller68 force-pushed the wasm-benchmarks-should-be-scored-like-js-benchmarks branch 2 times, most recently from 30ea5a3 to 66e93c8 Compare November 27, 2024 18:42
@kmiller68 kmiller68 force-pushed the wasm-benchmarks-should-be-scored-like-js-benchmarks branch 3 times, most recently from adbb24d to d1534f0 Compare December 18, 2024 15:54
@kmiller68 kmiller68 marked this pull request as ready for review December 18, 2024 16:00
@kmiller68
Contributor Author

See d1534f0 for the breakdowns of time spent in each function. Weirdly, even though the test runs significantly longer, JSC still spends the same proportion of time in our BBQ (template JIT) tier. Maybe that's a bug on our end, though?

@kmiller68
Contributor Author

@camillobruni @danleh Do either of you have thoughts/comments/concerns? I would CC the Mozilla folks too but I don't have their GH handles...

@danleh
Contributor

danleh commented Dec 19, 2024

Sorry for the late reply. I'll take a look later today.

@@ -36,11 +36,17 @@ globalThis.dumpJSONResults ??= false;
globalThis.customTestList ??= [];

let shouldReport = false;
let startDelay;
Contributor

That was simply pulled out into a variable and made configurable, right? And this is essentially independent of the Wasm change, correct? Just out of interest: do you know the original reasoning behind the 4000ms delay? Something like empirically observed loading time?

Contributor Author

Yeah, this should be independent. The delay is there because our internal A/B testing harness restarts the browser between full benchmark runs to reduce variance and caching effects. So we want to wait a few seconds for the browser to calm down before starting the test, at least IIRC.

Contributor

All in on more configurable test params!

I'll probably copy the params.mjs from speedometer as well and then we have a more central place to store all the test parameters.

@@ -330,7 +336,7 @@ class Driver {
} else
globalObject = runString("");

globalObject.console = { log: globalObject.print, warn: (e) => { print("Warn: " + e); /*$vm.abort();*/ } }
globalObject.console = { log: globalObject.print, warn: (e) => { print("Warn: " + e); /*$vm.abort();*/ }, error: (e) => { print("Error: " + e); /*$vm.abort();*/ } }
Contributor

Optional nit: while we're at it, we could also add console.debug here; sqlite3-wasm uses that as well (and reformat to separate lines?).
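A minimal sketch of what the suggested reformatting could look like, with console.debug added. (Illustration only: the record-keeping print stub below stands in for the shell's real print function, which the harness gets from its environment.)

```javascript
// Illustration only: stub a shell-style print() that records lines,
// so the patched console's behavior is easy to inspect.
const lines = [];
const print = (msg) => lines.push(msg);

const globalObject = {};
globalObject.console = {
    log: (e) => print(String(e)),
    warn: (e) => print("Warn: " + e),
    error: (e) => print("Error: " + e),
    debug: (e) => print("Debug: " + e), // sqlite3-wasm logs via console.debug
};

globalObject.console.debug("opening database");
```

With each method on its own line, adding further methods later (or re-enabling the commented-out $vm.abort() calls) stays a one-line change.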

TIMEIT(writeTest(filename, 100, count, TSF_ZIP_NONE));
TIMEIT(readTest(filename, count));
TIMEIT(readMallocTest(filename, count));
TIMEIT(readConvertTest(filename, count));


// TODO: tsf_zlib_supported is false so we should just remove this?
Contributor

SGTM, also makes it more explicit/apparent what is part of the benchmark and what not.

wasm/TSF/tsf.js Outdated
Contributor

@danleh danleh Dec 20, 2024

WDYT about putting all compile outputs/generated files in a wasm/TSF/build/ directory, so that it's clearer what is generated by Emscripten (and that it's not manually modified, as in the old version of TSF-wasm)?

Contributor

And just to clarify: you did rebuild with a more up-to-date Emscripten version, right? So some changes in, say, the execution profile could simply come from toolchain evolution?

Contributor Author

Yeah, the function numbers are completely different. Now we have the .symbols file, so at least you can manually map between emcc versions.

emcc \
-o tsf.html -o tsf.js -O2 -s WASM=1 -s TOTAL_MEMORY=52428800 -g1 \
-o tsf.js -O2 -s MODULARIZE=1 -s EXPORT_NAME=setupModule -s WASM=1 -s TOTAL_MEMORY=52428800 -g1 --emit-symbol-map -s EXPORTED_FUNCTIONS=_runIteration \
Contributor

Much nicer with MODULARIZE, cool! :)

if (!Module._runIteration)
await setupModule(Module);

Module._runIteration(150);
Contributor

@danleh danleh Dec 20, 2024

So just to clarify the workload size before vs. now: It used to be the default of count = 1000 because main was invoked without arguments, and now we are at defaultIterationCount = 120 times 150 per runIteration, right?

It seems the line item runs quite a bit longer now compared to before (~12s vs. ~1.7s, or roughly 130B dynamic instructions vs. 15B on an x64 workstation). I see more functions using the top tier now, which is fine by me. In the old version, quite some time was spent in JS read/write functions and not in Wasm at all, which was certainly not ideal.

We just might have to lower the workload size, in case the overall benchmark becomes too long to run.
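For reference, the nominal workload change discussed above works out as follows (variable names are illustrative, not the harness's actual parameters, and this assumes work scales linearly with the count argument):

```javascript
// Old: one main() invocation with the default count of 1000.
const oldWork = 1 * 1000;
// New: defaultIterationCount = 120 iterations, each calling _runIteration(150).
const newWork = 120 * 150;

console.log(newWork / oldWork); // 18x the nominal work
```

The ~18x nominal increase is larger than the observed ~7x wall-time and ~9x dynamic-instruction increase, which is consistent with later iterations running mostly in optimized tiers rather than the interpreter or baseline.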

Contributor

Excerpt of a bottom-up profile in d8, before vs. now: [two profile screenshots omitted]
I'm happy to land this, we should just be aware that this is now moving more into the direction of measuring peak performance. (Was this a conscious decision?)

Contributor Author

Yeah, I should probably reduce the iteration count. I did that in my followup patches for other benchmarks, I'll do it here too. That said, it seems like your samples include the harness code? That's probably not ideal.

The function numbers also changed for me when I recompiled the test, I ended up just simulating the old benchmark by running one iteration with count = 1000 in the samples I put in my commit message.

Contributor

Correct, I just took a profile of the whole d8 execution with Linux perf (via the helper explained in https://v8.dev/docs/linux-perf#profiling-d8-with-linux-perf-d8.py, which takes care of injecting the samples from JIT-ed code). If startup or the benchmark harness takes a significant portion of the overall profile, I would say the workload is too short/light (as it probably was before).

@@ -1,6 +1,9 @@
#!/bin/sh

set -euo pipefail

Contributor

WDYT about adding a log file with the last build timestamp and Emscripten version, for reproducibility? Something like

BUILD_LOG="$(realpath build.log)"
echo -e "Built on $(date --rfc-3339=seconds)\n" | tee "$BUILD_LOG"

echo "Toolchain versions" | tee -a "$BUILD_LOG"
emcc --version | head -n1 | tee -a "$BUILD_LOG"

(shameless copy from sqlite3/build.sh)

Contributor Author

Yeah, I was thinking we should just have one script for this rather than duplicating the logic. That said, I can do that in follow-up #22. I'll copy your logic for now.

Contributor Author

It seems like date --rfc-3339=seconds isn't supported by macOS's out-of-the-box date. I'm gonna go with date -u '+%Y-%m-%dT%H:%M:%SZ'

Contributor

Ah, good to know and keep in mind for an eventual (shared?) build script.

@@ -986,6 +1007,94 @@ class AsyncBenchmark extends DefaultBenchmark {
}
};

// Meant for wasm benchmarks that are compiled directly with an emcc build script. It might not work for benchmarks built as
// part of a larger project's build system, or for a wasm benchmark compiled from a language whose toolchain doesn't use emcc.
class WasmEMCCBenchmark extends AsyncBenchmark {
Contributor

LGTM for now. Long term it would be nice to get rid of some of the code duplication; e.g., the prerunCode below is mostly (but not fully) equal to the one in WasmLegacyBenchmark, right?

Contributor Author

Yeah, I think the goal is to eventually just delete WasmLegacyBenchmark once we have all the benchmarks we care about in the JS style system. If we want to save them for posterity we can reevaluate.

@danleh
Contributor

danleh commented Dec 20, 2024

@camillobruni @danleh Do either of you have thoughts/comments/concerns? I would CC the Mozilla folks too but I don't have their GH handles...

Overall looks great to me, thanks! Left some small comments :)

In terms of Mozilla folks, I only know of (some of them mentioned their GitHub handles in the WebKit JetStream discussion Slack channel)

Move existing benchmarks to WasmLegacyBenchmark and add a new WasmBenchmark that supports
running iterations. It's currently an async benchmark subclass since the compile is async.

As a proof of concept this patch converts tsf-wasm to use the new scoring. As part of this
process tsf needed to be recompiled so I added a step to the build.sh script which wraps
emcc's output .js file with some glue code. This should make it easier to rebuild in the
future too since we're not directly editing the module loading file. I went with this
route rather than hooking into emcc's API since I wanted to use the js glue as close to
"out of the box" as possible.

Lastly, get emcc to output the .symbols object so we know which wasm function is which index.
We could emit a name section into the binary but my assumption is that most production code
does not ship with that so it could change the benchmark in an unrealistic way.

TODO:
* Profile perf differences
* Generalize the wrapping process
* Remove postIterationHook code for a different PR.
This seems to work pretty nicely and should be future proof.
…he generated code.

Remove some stale logging.

Remove main from TSF-wasm since it's no longer used.

Make the count value an argument rather than a constant in C code. This appears to make the
performance characteristics similar to what it was before this refactor.

Old sampling (after recompile so wasm function numbers are meaningful):

Sampling rate: 1000.000000 microseconds. Total samples: 364
Top functions as <numSamples  'functionName#hash:sourceID'>
   199    '<?>.wasm-function[100]#<nil>:4294967295'
    34    '<?>.wasm-function[238]#<nil>:4294967295'
    19    '<?>.wasm-function[208]#<nil>:4294967295'
    16    '_platform_memmove#<nil>:4294967295'
    12    '<?>.wasm-function[251]#<nil>:4294967295'
    11    '<?>.wasm-function[232]#<nil>:4294967295'
     9    '<?>.wasm-function[209]#<nil>:4294967295'
     8    '<?>.wasm-function[237]#<nil>:4294967295'
     7    '<?>.wasm-function[201]#<nil>:4294967295'
     5    '<?>.wasm-function[250]#<nil>:4294967295'
     4    '<?>.wasm-function[213]#<nil>:4294967295'
     3    '<?>.wasm-function[216]#<nil>:4294967295'

Sampling rate: 1000.000000 microseconds. Total samples: 364

Tier breakdown:
-----------------------------------
LLInt:                     0  (0.000000%)
Baseline:                  1  (0.274725%)
DFG:                       0  (0.000000%)
FTL:                       0  (0.000000%)
js builtin:                0  (0.000000%)
IPInt:                     0  (0.000000%)
WasmLLInt:                 3  (0.824176%)
BBQ:                     211  (57.967033%)
OMG:                     116  (31.868132%)
Wasm:                      0  (0.000000%)
Host:                      0  (0.000000%)
RegExp:                    0  (0.000000%)
C/C++:                    30  (8.241758%)
Unknown Frame:             3  (0.824176%)

Hottest bytecodes as <numSamples   'functionName#hash:JITType:bytecodeIndex'>
    33    '<?>.wasm-function[100]:BBQMode:0x2ed'
    33    '<?>.wasm-function[100]:BBQMode:0x2f0'
    16    '_platform_memmove#<nil>:None:<nil>'
     8    '<?>.wasm-function[100]:BBQMode:0xa733'
     7    '<?>.wasm-function[100]:BBQMode:0x70a2'
     6    '<?>.wasm-function[100]:BBQMode:0x6f8a'
     6    '<?>.wasm-function[100]:BBQMode:0x6f94'
     5    '<?>.wasm-function[208]:OMGMode:nil'
     5    '<?>.wasm-function[100]:BBQMode:0x5b'
     5    '<?>.wasm-function[100]:BBQMode:0x6f9b'
     4    '<?>.wasm-function[100]:BBQMode:0x5df3'
    ...

New sampling with this PR:

Sampling rate: 1000.000000 microseconds. Total samples: 4908
Top functions as <numSamples  'functionName#hash:sourceID'>
  2763    '<?>.wasm-function[100]#<nil>:4294967295'
   464    '<?>.wasm-function[238]#<nil>:4294967295'
   436    '<?>.wasm-function[208]#<nil>:4294967295'
   138    '<?>.wasm-function[251]#<nil>:4294967295'
   128    '<?>.wasm-function[237]#<nil>:4294967295'
   117    '<?>.wasm-function[232]#<nil>:4294967295'
   108    '<?>.wasm-function[250]#<nil>:4294967295'
   107    '<?>.wasm-function[209]#<nil>:4294967295'
   104    '<?>.wasm-function[201]#<nil>:4294967295'
    77    '<?>.wasm-function[216]#<nil>:4294967295'
    54    '<?>.wasm-function[226]#<nil>:4294967295'
    43    '<?>.wasm-function[162]#<nil>:4294967295'

Sampling rate: 1000.000000 microseconds. Total samples: 4908

Tier breakdown:
-----------------------------------
LLInt:                     0  (0.000000%)
Baseline:                  2  (0.040750%)
DFG:                       2  (0.040750%)
FTL:                       5  (0.101874%)
js builtin:                1  (0.020375%)
IPInt:                     0  (0.000000%)
WasmLLInt:                 3  (0.061125%)
BBQ:                    2771  (56.458843%)
OMG:                    1988  (40.505297%)
Wasm:                      0  (0.000000%)
Host:                      0  (0.000000%)
RegExp:                    0  (0.000000%)
C/C++:                    95  (1.935615%)
Unknown Frame:            41  (0.835371%)
Unknown Executable:        1  (0.020375%)

Hottest bytecodes as <numSamples   'functionName#hash:JITType:bytecodeIndex'>
   526    '<?>.wasm-function[100]:BBQMode:0x2ed'
   451    '<?>.wasm-function[100]:BBQMode:0x2f0'
   125    '<?>.wasm-function[208]:OMGMode:0x0'
   105    '<?>.wasm-function[100]:BBQMode:0x6f8a'
    91    '<?>.wasm-function[100]:BBQMode:0xa724'
    84    '<?>.wasm-function[100]:BBQMode:0x5b'
    81    '<?>.wasm-function[100]:BBQMode:0xa733'
    79    '<?>.wasm-function[100]:BBQMode:0x6f94'
    77    '<?>.wasm-function[100]:BBQMode:nil'
    63    '<?>.wasm-function[208]:OMGMode:0x16d'
    ...

It looks like both versions have similar sample rates for the top function.
There are more samples in the top-tier compiler, but that's presumably OK and expected since the test runs longer.
Build products are now built to a build subdirectory.
Log the build info into build.log
Fix `date` command to work on MacOS out of the box.
Reduce iteration count to 15/2.
Remove ZLib check since it's always false in JetStream anyway.
@kmiller68 kmiller68 force-pushed the wasm-benchmarks-should-be-scored-like-js-benchmarks branch from d1534f0 to e89743d Compare December 20, 2024 14:45
@kmiller68
Contributor Author

I'll wait until EoD to merge in case any Mozilla folks have comments.

@kmiller68 kmiller68 merged commit 4e888f5 into main Dec 21, 2024
log: globalObject.print,
warn: (e) => { print("Warn: " + e); },
error: (e) => { print("Error: " + e); },
debug: (e) => { print("Debug: " + e); },
Contributor

We do have a full console object available in our V8 shell (the same goes for the browser).
I can create a follow-up CL to factor out a more defensive patching approach that reuses existing methods where possible?
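A rough sketch of what such a defensive patch might look like (the helper name and shape here are hypothetical, not the actual follow-up): keep any console methods the host already provides and only shim the missing ones on top of a shell-style print function.

```javascript
// Hypothetical helper: reuse existing console methods where available,
// and fall back to print()-based shims otherwise.
function patchConsole(globalObject, print) {
    const existing = globalObject.console ?? {};
    const shim = (prefix) => (e) => print(prefix + e);
    globalObject.console = {
        log: existing.log ?? ((e) => print(String(e))),
        warn: existing.warn ?? shim("Warn: "),
        error: existing.error ?? shim("Error: "),
        debug: existing.debug ?? shim("Debug: "),
    };
    return globalObject.console;
}
```

In a browser or the V8 shell the native methods would be kept as-is; in a bare shell that only exposes print(), everything falls back to the shims.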
