Introduce ARM Neon and SSE2 SIMD. #743

samyron · 2025-02-03T03:45:51Z

Version 2 of the introduction of ARM Neon SIMD.

There are currently two implementations:

* "Rules" based.
* Lookup table based. This is effectively a SIMD-accelerated version of the scalar implementation.
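To make the two approaches concrete, here is a minimal scalar sketch of the predicate each one evaluates per byte. It assumes the basic JSON escape set (control characters, double quote, backslash); the function and table names are hypothetical and are not taken from the PR.

```c
#include <stdbool.h>

/* Rules-based check: a byte needs escaping if it is a control
   character, a double quote, or a backslash (assumed escape set). */
static bool needs_escape_rules(unsigned char c) {
    return c < 0x20 || c == '"' || c == '\\';
}

/* Lookup-table check: the same predicate, precomputed per byte value. */
static unsigned char escape_table[256];

static void build_escape_table(void) {
    for (int c = 0; c < 256; c++) {
        escape_table[c] = needs_escape_rules((unsigned char)c) ? 1 : 0;
    }
}

static bool needs_escape_lut(unsigned char c) {
    return escape_table[c] != 0;
}
```

A SIMD version would evaluate either predicate for 16 bytes at once; the rules variant uses vector comparisons, while the table variant uses vector table lookups.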

Benchmarks (Lookup table)

== Encoding mixed utf8 (5003001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    62.000 i/100ms
          json_coder    67.000 i/100ms
                  oj    30.000 i/100ms
Calculating -------------------------------------
                json    628.035 (±12.7%) i/s    (1.59 ms/i) -      3.162k in   5.118636s
          json_coder    626.843 (±15.8%) i/s    (1.60 ms/i) -      3.082k in   5.079836s
                  oj    352.174 (± 9.4%) i/s    (2.84 ms/i) -      1.740k in   5.005929s

Comparison:
                json:      628.0 i/s
          json_coder:      626.8 i/s - same-ish: difference falls within error
                  oj:      352.2 i/s - 1.78x  slower


== Encoding mostly utf8 (5001001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    50.000 i/100ms
          json_coder    56.000 i/100ms
                  oj    36.000 i/100ms
Calculating -------------------------------------
                json    632.784 (±27.0%) i/s    (1.58 ms/i) -      3.000k in   5.063991s
          json_coder    628.328 (±16.7%) i/s    (1.59 ms/i) -      3.080k in   5.034271s
                  oj    351.466 (± 9.7%) i/s    (2.85 ms/i) -      1.728k in   5.003977s

Comparison:
                json:      632.8 i/s
          json_coder:      628.3 i/s - same-ish: difference falls within error
                  oj:      351.5 i/s - 1.80x  slower

Benchmarks (Rules based)

== Encoding mixed utf8 (5003001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    69.000 i/100ms
          json_coder    78.000 i/100ms
                  oj    33.000 i/100ms
Calculating -------------------------------------
                json    758.135 (±22.7%) i/s    (1.32 ms/i) -      3.657k in   5.114664s
          json_coder    800.957 (±11.5%) i/s    (1.25 ms/i) -      3.978k in   5.044465s
                  oj    343.750 (±11.9%) i/s    (2.91 ms/i) -      1.683k in   5.004571s

Comparison:
                json:      758.1 i/s
          json_coder:      801.0 i/s - same-ish: difference falls within error
                  oj:      343.7 i/s - 2.21x  slower


== Encoding mostly utf8 (5001001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    59.000 i/100ms
          json_coder    53.000 i/100ms
                  oj    37.000 i/100ms
Calculating -------------------------------------
                json    828.807 (±15.1%) i/s    (1.21 ms/i) -      4.071k in   5.060739s
          json_coder    799.688 (±20.1%) i/s    (1.25 ms/i) -      3.816k in   5.019480s
                  oj    364.514 (± 7.1%) i/s    (2.74 ms/i) -      1.850k in   5.100773s

Comparison:
                json:      828.8 i/s
          json_coder:      799.7 i/s - same-ish: difference falls within error
                  oj:      364.5 i/s - 2.27x  slower

I am still working on this but I wanted to share progress.

Edit: Looks like I missed one commit so I'll have to resolve some merge conflicts.

byroot · 2025-02-03T09:12:55Z

The gain seems to be 7% on real-world benchmarks:

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after     2.438k i/100ms
Calculating -------------------------------------
               after     24.763k (± 0.8%) i/s   (40.38 μs/i) -    124.338k in   5.021560s

Comparison:
              before:    23166.2 i/s
               after:    24762.5 i/s - 1.07x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   254.000 i/100ms
Calculating -------------------------------------
               after      2.600k (± 1.3%) i/s  (384.61 μs/i) -     13.208k in   5.080852s

Comparison:
              before:     2439.5 i/s
               after:     2600.0 i/s - 1.07x  faster

Also note that I did one more refactoring to make the introduction of SIMD easier, so you still have a conflict.

byroot · 2025-02-03T09:17:37Z

ext/json/ext/generator/simd.h

+uint8x16x4_t load_uint8x16_4(const unsigned char *table, int offset) {
+  uint8x16x4_t tab;
+  for(int i=0; i<4; i++) {
+    tab.val[i] = vld1q_u8(table+offset+(i*16));
+  }
+  return tab;
+}


Isn't that just vld4q_u8?

https://developer.arm.com/architectures/instruction-sets/intrinsics/vld4q_u8

Unfortunately it's not. vld4q_u8 interleaves the data among the 4 vector registers.

% cat load-test.c
#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>

void print_vec(char *msg, uint8x16_t vec) {
  printf("%s\n[ ", msg);
  uint8_t store[16] = {0};
  vst1q_u8(store, vec);
  for(int i=0; i<16; i++) {
    printf("%3d ", store[i]);
  }
  printf("]\n");
}

uint8x16x4_t load_table(uint8_t *table, int offset) {
  uint8x16x4_t tab;
  for(int i=0; i<4; i++) {
    tab.val[i] = vld1q_u8(table+offset+(i*16));
  }
  return tab;
}

int main(void) {
  uint8_t table[256];
  for(int i=0; i<256; i++) {
    table[i] = i;
  }

  uint8x16x4_t tab1 = load_table(table, 0);
  print_vec("tab1.val[0]", tab1.val[0]);
  print_vec("tab1.val[1]", tab1.val[1]);
  print_vec("tab1.val[2]", tab1.val[2]);
  print_vec("tab1.val[3]", tab1.val[3]);
  printf("\n");

  uint8x16x4_t tab1_2 = vld4q_u8(table);
  print_vec("tab1_2.val[0]", tab1_2.val[0]);
  print_vec("tab1_2.val[1]", tab1_2.val[1]);
  print_vec("tab1_2.val[2]", tab1_2.val[2]);
  print_vec("tab1_2.val[3]", tab1_2.val[3]);
  return 0;
}

% ./load-test
tab1.val[0]
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ]
tab1.val[1]
[ 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 ]
tab1.val[2]
[ 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 ]
tab1.val[3]
[ 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 ]

tab1_2.val[0]
[ 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 ]
tab1_2.val[1]
[ 1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 ]
tab1_2.val[2]
[ 2 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 ]
tab1_2.val[3]
[ 3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 ]

Wow, that's so weird.

Well, maybe that loop should be unrolled then, I suspect the compiler does it, but might as well be explicit.
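For readers without an ARM machine at hand, the difference between the two loads can be modelled in portable C. This is only a scalar stand-in (the struct and function names are hypothetical): the first function mirrors four consecutive `vld1q_u8` loads written out unrolled, the second mirrors the de-interleaving that `vld4q_u8` performs.

```c
#include <stdint.h>

/* Scalar stand-in for uint8x16x4_t: four 16-byte vectors. */
typedef struct { uint8_t val[4][16]; } vec4x16;

/* Models four unrolled vld1q_u8 loads: val[i] holds bytes 16*i .. 16*i+15. */
static vec4x16 model_load_contiguous(const uint8_t *p) {
    vec4x16 t;
    for (int lane = 0; lane < 16; lane++) {
        t.val[0][lane] = p[lane];
        t.val[1][lane] = p[16 + lane];
        t.val[2][lane] = p[32 + lane];
        t.val[3][lane] = p[48 + lane];
    }
    return t;
}

/* Models vld4q_u8: val[i] holds bytes i, i+4, i+8, ... (de-interleaved). */
static vec4x16 model_load_deinterleaved(const uint8_t *p) {
    vec4x16 t;
    for (int lane = 0; lane < 16; lane++) {
        for (int i = 0; i < 4; i++) {
            t.val[i][lane] = p[4 * lane + i];
        }
    }
    return t;
}
```

A lookup table needs the contiguous layout, which is why the interleaving load is not a drop-in replacement here.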

byroot · 2025-02-03T09:21:45Z

Can you just include the implementation for the regular escaping? I'm not sure the script safe version is quite worth it.

…tion. Also store the potential matches directly rather than looking up values in the escape table.

samyron · 2025-02-05T03:14:37Z

ext/json/ext/generator/generator.c

+    if ((ch_len = search_escape_basic_neon_advance_lut(search)) != 0) {
+        return ch_len;
+    }
+
+    // if ((ch_len = search_escape_basic_neon_advance_rules(search)) != 0) {
+    //     return ch_len;
+    // }


Seems like it's a toss-up which one is the best. It might be an artifact of my M1 MacBook Air being passively cooled: it gets warm after I run the benchmark over and over.

samyron · 2025-02-06T02:51:34Z

Comparison between master and this branch in real world benchmarks. This is for the lookup table implementation.

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.027k i/100ms
Calculating -------------------------------------
               after     21.413k (± 1.6%) i/s   (46.70 μs/i) -    107.431k in   5.018339s

Comparison:
              before:    14448.8 i/s
               after:    21412.9 i/s - 1.48x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   110.000 i/100ms
Calculating -------------------------------------
               after      1.098k (± 1.2%) i/s  (910.41 μs/i) -      5.500k in   5.007977s

Comparison:
              before:      993.9 i/s
               after:     1098.4 i/s - 1.11x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   216.000 i/100ms
Calculating -------------------------------------
               after      2.086k (± 8.9%) i/s  (479.31 μs/i) -     10.368k in   5.034983s

Comparison:
              before:     1642.1 i/s
               after:     2086.3 i/s - 1.27x  faster

Running it a second time:

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.042k i/100ms
Calculating -------------------------------------
               after     21.400k (± 1.7%) i/s   (46.73 μs/i) -    108.226k in   5.058877s

Comparison:
              before:    15039.4 i/s
               after:    21399.7 i/s - 1.42x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   109.000 i/100ms
Calculating -------------------------------------
               after      1.094k (± 1.2%) i/s  (913.67 μs/i) -      5.559k in   5.079778s

Comparison:
              before:     1005.4 i/s
               after:     1094.5 i/s - 1.09x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   215.000 i/100ms
Calculating -------------------------------------
               after      2.137k (± 5.5%) i/s  (467.91 μs/i) -     10.750k in   5.050467s

Comparison:
              before:     1639.0 i/s
               after:     2137.1 i/s - 1.30x  faster

…e only need 128 bytes for the lookup table as the top 128 bytes are all zeros.

byroot · 2025-02-06T07:57:51Z

Not sure why but it's way more modest on my machine (Air M3):

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after     2.603k i/100ms
Calculating -------------------------------------
               after     26.544k (± 1.8%) i/s   (37.67 μs/i) -    132.753k in   5.002890s

Comparison:
              before:    23370.1 i/s
               after:    26543.7 i/s - 1.14x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   136.000 i/100ms
Calculating -------------------------------------
               after      1.368k (± 0.7%) i/s  (730.98 μs/i) -      6.936k in   5.070329s

Comparison:
              before:     1369.9 i/s
               after:     1368.0 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   269.000 i/100ms
Calculating -------------------------------------
               after      2.702k (± 0.3%) i/s  (370.11 μs/i) -     13.719k in   5.077550s

Comparison:
              before:     2475.0 i/s
               after:     2701.9 i/s - 1.09x  faster

samyron · 2025-02-10T03:17:25Z

Apologies for going dark for a while. I've been trying to make incremental improvements on a different branch (found here). My hope was using a move mask would be faster than vmaxvq_u8 to determine if any byte needs to be escaped. It also has the benefit of not needing to store all of the candidate matches as all that would be needed is a uint64_t which indicates which bytes need to be escaped. Unfortunately on my machine, it didn't seem to make much of a difference.
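As a rough illustration of the move-mask idea, here is a scalar model in portable C. It assumes the common NEON idiom of narrowing the comparison result so that each of the 16 lanes contributes one nibble to a `uint64_t`; whether the branch uses exactly this idiom is not stated in the thread. The function names are hypothetical and `__builtin_ctzll` is a GCC/Clang builtin.

```c
#include <stdint.h>

/* Scalar model of a nibble-per-lane move mask. Each input lane is
   expected to be 0x00 (no match) or 0xFF (needs escaping). */
static uint64_t model_nibble_mask(const uint8_t matches[16]) {
    uint64_t mask = 0;
    for (int lane = 0; lane < 16; lane++) {
        if (matches[lane]) {
            mask |= (uint64_t)0xF << (4 * lane);
        }
    }
    return mask;
}

/* Index of the first matching lane, or 16 if nothing matched. */
static int first_match_index(uint64_t mask) {
    if (mask == 0) {
        return 16;
    }
    return __builtin_ctzll(mask) / 4;
}
```

With this representation, "does anything need escaping?" is a single comparison against zero, and the position of each match can be recovered from the mask without storing the candidate vector.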

Feel free to try it out though.

byroot · 2025-02-10T08:54:33Z

Apologies for going dark for a while

That's no worries at all. I want to release a 2.10.0 with the current change on master, but I'm pairing with Étienne on making sure we have no blind spots on JSON::Coder. So probably gonna happen this week.

After that I think I can start merging some SIMD stuff. I'd like to go with the smallest possible useful SIMD acceleration to ensure it doesn't cause issues for people. If it works well, we can then go further. So yeah, no rush.

samyron · 2025-02-11T13:26:52Z

@byroot if you have a few minutes, would you be able to check out this branch and benchmark it against master? You'll have to tweak your compare script a bit to compile this branch with cmd("bundle", "exec", "rake", "clean", "compile", "--", "--disable-generator-use-simd"). I want to see how your M3 compares with my M1.

This branch falls back to the bit-twiddling, more or less platform-agnostic SIMD code if the SIMD code is disabled via an extconf.rb flag.
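For context, "bit-twiddling SIMD" usually means SWAR (SIMD within a register): processing 8 bytes at a time in a plain `uint64_t`. Below is a generic sketch of that technique for the basic escape set; it is not necessarily the code in the branch, and all names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define SWAR_ONES  0x0101010101010101ULL
#define SWAR_HIGHS 0x8080808080808080ULL

/* Returns nonzero if any of the 8 bytes at p is < 0x20, '"' or '\\'.
   Only answers "any match?"; it does not locate the matching byte. */
static int swar_chunk_needs_escape(const unsigned char *p) {
    uint64_t x;
    memcpy(&x, p, sizeof x);                       /* safe unaligned load */

    uint64_t ascii  = ~x & SWAR_HIGHS;             /* high bit set where byte < 0x80 */
    uint64_t lt32   = x - SWAR_ONES * 0x20;        /* high bit set where byte < 0x20 */
    uint64_t quote  = (x ^ (SWAR_ONES * '"'))  - SWAR_ONES;  /* byte == '"'  */
    uint64_t bslash = (x ^ (SWAR_ONES * '\\')) - SWAR_ONES;  /* byte == '\\' */

    /* Restricting to ASCII bytes discards hits caused by bytes >= 0x80. */
    return ((lt32 | quote | bslash) & ascii) != 0;
}
```

Borrows between bytes can flag an extra byte, but only above a byte that genuinely matches, so the yes/no answer for the whole chunk stays exact.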

The results on my M1:

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     1.944k i/100ms
Calculating -------------------------------------
               after     19.671k (± 2.5%) i/s   (50.84 μs/i) -     99.144k in   5.043309s

Comparison:
              before:    15135.7 i/s
               after:    19670.9 i/s - 1.30x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   113.000 i/100ms
Calculating -------------------------------------
               after      1.109k (± 2.1%) i/s  (901.49 μs/i) -      5.650k in   5.095561s

Comparison:
              before:     1040.1 i/s
               after:     1109.3 i/s - 1.07x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   204.000 i/100ms
Calculating -------------------------------------
               after      2.006k (± 3.8%) i/s  (498.51 μs/i) -     10.200k in   5.092718s

Comparison:
              before:     1687.4 i/s
               after:     2006.0 i/s - 1.19x  faster

byroot · 2025-02-12T12:46:55Z

With that compilation flag and compared to master:

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after     2.326k i/100ms
Calculating -------------------------------------
               after     23.218k (± 1.6%) i/s   (43.07 μs/i) -    116.300k in   5.010271s

Comparison:
              before:    22460.3 i/s
               after:    23218.0 i/s - 1.03x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   132.000 i/100ms
Calculating -------------------------------------
               after      1.290k (± 1.4%) i/s  (775.38 μs/i) -      6.468k in   5.016121s

Comparison:
              before:     1323.6 i/s
               after:     1289.7 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   242.000 i/100ms
Calculating -------------------------------------
               after      2.495k (± 0.6%) i/s  (400.84 μs/i) -     12.584k in   5.044306s

Comparison:
              before:     2449.6 i/s
               after:     2494.8 i/s - 1.02x  faster

samyron · 2025-02-25T02:39:46Z

From a co-worker with an M4 Pro:

== Encoding activitypub.json (52595 bytes)
ruby 3.2.6 (2024-10-30 revision 63aeb018eb) [arm64-darwin24]
Warming up --------------------------------------
               after     2.876k i/100ms
Calculating -------------------------------------
               after     28.251k (± 3.0%) i/s   (35.40 μs/i) -    143.800k in   5.095128s

Comparison:
              before:    24938.2 i/s
               after:    28251.0 i/s - 1.13x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.2.6 (2024-10-30 revision 63aeb018eb) [arm64-darwin24]
Warming up --------------------------------------
               after   154.000 i/100ms
Calculating -------------------------------------
               after      1.516k (± 2.9%) i/s  (659.57 μs/i) -      7.700k in   5.083078s

Comparison:
              before:     1575.4 i/s
               after:     1516.1 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.2.6 (2024-10-30 revision 63aeb018eb) [arm64-darwin24]
Warming up --------------------------------------
               after   295.000 i/100ms
Calculating -------------------------------------
               after      2.933k (± 3.3%) i/s  (340.94 μs/i) -     14.750k in   5.034796s

Comparison:
              before:     2678.2 i/s
               after:     2933.0 i/s - 1.10x  faster

samyron · 2025-02-26T15:12:29Z

From another co-worker with an M1 Pro:

== Encoding activitypub.json (52595 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.166k i/100ms
Calculating -------------------------------------
               after     21.521k (± 1.2%) i/s   (46.47 μs/i) -    108.300k in   5.032957s

Comparison:
              before:    15231.1 i/s
               after:    21521.3 i/s - 1.41x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   108.000 i/100ms
Calculating -------------------------------------
               after      1.062k (± 5.5%) i/s  (941.69 μs/i) -      5.400k in   5.103989s

Comparison:
              before:     1013.4 i/s
               after:     1061.9 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   219.000 i/100ms
Calculating -------------------------------------
               after      2.061k (±12.8%) i/s  (485.22 μs/i) -     10.074k in   5.040974s

Comparison:
              before:     1677.4 i/s
               after:     2060.9 i/s - 1.23x  faster

radiospiel · 2025-03-16T21:18:22Z

@samyron

I just pushed a PR #769 to this repo which also employs SIMD to speed up string escapes. I am really sorry that we both worked in that area at the same time; after I started my work I didn't check back with this repo for a while (and I should have done that.)

I believe the main difference between my PR and yours seems to be that mine supports x86 as well. It does this by using a cross-platform shim simd.h from Postgres, which comes with implementations on AVX, Neon, (and also on plain C). Still, on Neon I see somewhat higher gains than those reported here; however, I don't understand where that difference comes from.

I'd like to suggest that we collaborate on getting SIMD support in, one way or another. 👋

samyron · 2025-03-18T01:34:03Z

Hi @radiospiel, I'll take a look at #769. I originally started working on #730 which supports Neon, SSE 4.2 and AVX2 with runtime detection support. The PR got a bit big so I decided to close it and implement each instruction set individually.

Additionally, @byroot refactored the code quite a bit to make the SIMD implementation easier. There are two implementations in this PR: one uses a lookup table and the other is rule-based. Both seem to have similar performance on my machine.

On my machine I see an 11%-48% improvement depending on the benchmark. A few of my co-workers saw various speedups depending on their machine.

I should probably mark this PR as "Ready for Review". However, I'm happy to collaborate either on this or your PR.

Edit: oh yeah, there is an old-school bit-twiddling SIMD approach in pure C: #738

radiospiel · 2025-03-18T08:26:44Z

Thank you, @samyron .

I became painfully aware of the work you did when I tried to merge master into my branch, because the interfaces of the escape functions had been changed; my implementation relies on an "escape me a uchar[] array into an fbuffer" function, which is no longer available with what's in master today :)

The main difference between your approach and mine is that you switch out the search functionality, depending on the availability of SIMD, while I switch out the SIMD primitives instead. This allows me to have working implementations for X86, ARM, and bit-twiddling; but only a handful of primitives are available because NEON and AVX are different, so your approach should allow for per-hardware type optimal implementations.

I have a busy week ahead of me, but I will definitely take a look at the end of the week. I will also benchmark on Graviton instances; most ARM server workloads are probably not on an Apple Silicon CPU after all :) Happy to benchmark this PR as well.

Can you share a benchmark script that produces the most useful output for you? I would be especially interested in understanding how you get the "before" and "after" entries in the benchmark output :)

Speaking of benchmarks:

On my machine I see a 11x-48x improvement depending on the benchmark.

This is magnitudes more than the numbers posted here. I have seen a 48% posted above (on the activitypub testcase), so is this a typo x → %?
The activitypub testcase, apparently, lends itself particularly well to SIMD; I see a speedup of ~82% on that (Apple M1)

samyron · 2025-03-18T16:01:30Z

This is magnitudes more than the numbers posted here. I have seen a 48% posted above (on the activitypub testcase), so is this a typo x → %? The activitypub testcase, apparently, lends itself particularly well to SIMD; I see a speedup of ~82% on that (Apple M1)

Apologies, yes, that was a typo. I'll fix it in the comment above

radiospiel · 2025-03-23T19:17:50Z

@samyron I reran benchmarks (link). Both our PRs show a substantial improvement over the baseline, the only significant difference is on short strings.

| Encoding Type | json 2.10.2 | samyron | radiospiel |
| --- | --- | --- | --- |
| strings.ascii | 13.046k (± 1.6%) | 29.681k (± 1.9%) | 33.583k (± 3.0%) |
| strings.escapes | 4.608k (± 1.9%) | 10.765k (± 2.2%) | 9.681k (± 2.5%) |
| strings.mixed | 32.971k (± 1.4%) | 88.580k (± 2.1%) | 90.133k (± 3.2%) |
| strings.multibyte | 32.836k (± 2.0%) | 89.385k (± 3.0%) | 89.475k (± 2.1%) |
| strings.short | 91.819k (± 9.8%) | 95.388k (± 2.5%) | 133.008k (± 2.6%) |
| strings.tests | 21.350k (± 4.1%) | 22.538k (± 2.7%) | 22.600k (± 2.5%) |

strings.short is a test on a 13-byte string ("b" * 5) + "€" + ("a" * 5), which is shorter than the size of the SIMD buffer (which in my case is 16 bytes).

I believe such short strings are relevant, because JSON object keys are probably quite often shorter than 16 bytes; my PR applies SIMD for strings of 8 bytes and more (link). (The value of 8 seemed beneficial and looked nice, but I should probably retest this with smaller values.)

Could you support that as well?

radiospiel · 2025-03-23T19:28:10Z

@byroot we have two competing implementations of the same approach. While mine is probably more beneficial in the short term (because it also supports x86), I believe that @samyron 's approach has more future potential, because it allows handcrafted SIMD implementations that are fundamentally different between NEON and SSE2. (and it certainly can be extended to also support shorter strings, see comment above.)

Also, transplanting a x86 implementation from my PR into @samyron 's shouldn't be too hard to achieve.

I see the following alternatives:

* we scrap mine, @samyron adds support for shorter strings, and, in a follow-up, we transplant SSE2 into @samyron's;
* we merge mine, with the understanding that @samyron's will be merged in at a later point, with SSE2 support right out of the box; mine will then be removed again.

What do you all think about that? ☝️

radiospiel · 2025-03-23T19:32:14Z

ext/json/ext/generator/simd.h

+#ifdef ENABLE_SIMD
+
+#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(__aarch64__) || defined(_M_ARM64)
+#include <arm_neon.h>


The internet is convinced that no ARM64 CPU lacks NEON support, so maybe this check is not necessary. For example here: https://github.com/postgres/postgres/blob/REL_17_4/src/include/port/simd.h#L38

radiospiel · 2025-04-09T12:44:02Z

I figured it out. The issue was with the search_escape_basic_impl function pointer. It was defined as unsigned char (*search_escape_basic_impl)(search_state *);. ...I changed the definition to static unsigned char (*search_escape_basic_impl)(search_state *);

Wow, I had not expected that to make a difference. Great finding!
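For illustration, here is a reduced sketch of dispatching through a `static` function pointer. The `search_state` fields and the scalar implementation are hypothetical simplifications; only the pointer declaration follows the one quoted above. The thread does not explain why `static` helps. One possible explanation, offered purely as an assumption, is that internal linkage lets the compiler address the pointer directly instead of treating it as an externally visible symbol.

```c
#include <stddef.h>

/* Hypothetical, reduced version of the search state. */
typedef struct {
    const unsigned char *ptr;
    const unsigned char *end;
} search_state;

/* Scalar fallback: stops at the first byte that needs escaping. */
static unsigned char search_escape_scalar(search_state *s) {
    while (s->ptr < s->end) {
        unsigned char c = *s->ptr;
        if (c < 0x20 || c == '"' || c == '\\') {
            return 1;
        }
        s->ptr++;
    }
    return 0;
}

/* `static` gives the pointer internal linkage; a SIMD implementation
   could be assigned here once at load time. */
static unsigned char (*search_escape_basic_impl)(search_state *) = search_escape_scalar;

static unsigned char search_escape_basic(search_state *s) {
    return search_escape_basic_impl(s);
}
```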

I was running your latest commit on an Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz. I can confirm that worst-case performance is no longer adversely affected on x86 either.

Here are the numbers for the strings benchmarks; this is on a ruby 3.3. (I can't compare against 2.10.2 or a newer ruby, because that is not available there.)

| JSON Version | Encoding (i/s) |
| --- | --- |
| 2.3.1 | 87.569 |
| 2.7.1 | 83.429 |
| 2.9.1 | 254.117 |
| local | 333.684 |

So that is roughly a 30% speedup (local vs. 2.9.1) on string-focused benchmark cases.

byroot · 2025-04-09T12:48:00Z

Somehow, citm_catalog.json is still 9% slower than master on my machine.

I get that checking that we're not going out of bounds slows things down a bit, but 9% seems a bit surprising.

radiospiel · 2025-04-09T12:51:35Z

As for code review I wanted to ask about two things:

* some of the search_escape_basic_XXX functions are inline, others are not. Is that intentional?
* in each of the implementations we now iterate over vector-sized blocks, and then do the "fill up vector with 'X' and test again" part, and in both locations the SIMD instructions that do the actual check are repeated (e.g. here and here). Can we extract these into individual inlined methods?

…MD-fallback case to a method.

samyron · 2025-04-09T14:01:04Z

As for code review I wanted to ask about two things:

* some of the `search_escape_basic_XXX` functions are inline, others are not. Is that intentional?

* in each of the implementations we now iterate over vector-sized blocks, and then do the "fill up vector with 'X' and test again" part, and in both locations the SIMD instructions that do the actual check are repeated (e.g. [here](https://github.com/ruby/json/pull/743/files#diff-2bb51be932dec14923f6eb515f24b1b593737f0d3f8e76eeecf58cff3052819fR304-R307) and [here](https://github.com/ruby/json/pull/743/files#diff-2bb51be932dec14923f6eb515f24b1b593737f0d3f8e76eeecf58cff3052819fR335-R337). Can we extract these into individual inlined methods?

Thank you! Done.

radiospiel · 2025-04-09T14:26:48Z

Somehow, citm_catalog.json is still 9% slower than master on my machine

is this PR rebased off the current master? maybe the test case benefits from other optimizations that have landed in master but are not present here?

byroot · 2025-04-09T14:27:19Z

ext/json/ext/generator/generator.c

+    memset(s, 'X', len);
+
+    // Optimistically copy the remaining 'len' characters to the output FBuffer. If there are no characters
+    // to escape, then everything ends up in the correct spot. Otherwise it was convenient temporary storage.
+    memcpy(s, search->ptr, len);


I don't understand what the purpose of this memset is, given that we memcpy to the same pointer with the same length right after.

If I comment it out, the test suite still passes on my machine. Is it some sort of leftover?

That's a bug/typo. That was supposed to be memset(s, 'X', vec_len);. The goal is to ensure that anything past len bytes does not need to be escaped.
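A minimal sketch of the intended padding behaviour, using a separate scratch buffer rather than the output FBuffer that the PR writes into. `VEC_LEN` and both function names are assumptions for illustration, and the scan is a scalar stand-in for the vector comparison.

```c
#include <stdint.h>
#include <string.h>

#define VEC_LEN 16  /* assumed width of a 128-bit vector in bytes */

/* Copies `len` (< VEC_LEN) tail bytes into a full-width buffer.
   Filling all VEC_LEN bytes with 'X' first guarantees that the lanes
   past `len` can never look like bytes that need escaping. */
static void load_padded_tail(uint8_t scratch[VEC_LEN], const uint8_t *src, size_t len) {
    memset(scratch, 'X', VEC_LEN);  /* the whole vector, not just `len` bytes */
    memcpy(scratch, src, len);
}

/* Scalar stand-in for the vector check: index of the first byte that
   needs escaping, or VEC_LEN if there is none. */
static int first_escape_index(const uint8_t chunk[VEC_LEN]) {
    for (int i = 0; i < VEC_LEN; i++) {
        uint8_t c = chunk[i];
        if (c < 0x20 || c == '"' || c == '\\') {
            return i;
        }
    }
    return VEC_LEN;
}
```

With `memset(s, 'X', len)` instead, the bytes between `len` and `VEC_LEN` keep whatever was there before, so a stale control character or quote could be reported as a match.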

byroot · 2025-04-09T14:30:45Z

is this PR rebased off the current master?

No. But I tried to merge master into it in case you were right, and I still have a 10% slowdown on citm_catalog.json.

…e escaped.

radiospiel · 2025-04-09T14:51:18Z

I see a 3% performance improvement (but within the error ranges) with this branch. These are the numbers I see on citm_catalog (Apple M1, 3.4 + yjit):

2.10.2: 1.074k (± 1.7%) i/s  (930.68 μs/i) -      5.457k in   5.080131s
master: 1.276k (± 1.7%) i/s  (783.94 μs/i) -      6.477k in   5.079093s
branch: 1.194k (± 1.7%) i/s  (837.32 μs/i) -      6.069k in   5.083191s
branch w/merged master: 1.318k (± 1.7%) i/s  (758.59 μs/i) -      6.681k in   5.069660s

When I change the 8 byte limit to 6, (i.e. we SIMD even strings of only 6 bytes) I get

branch w/merged master: 1.351k (± 1.7%) i/s  (740.25 μs/i) -      6.800k in   5.035171s

and with 4 I get

branch w/merged master: 1.372k (± 1.6%) i/s  (729.12 μs/i) -      6.885k in   5.021282s

Finding the optimum for this limit might be heavily CPU model specific, though, so I don't know how far we want to go there.

vector's width worth of data remaining.

samyron · 2025-04-10T03:11:01Z

In the case there isn't a vector's width worth of data, but at least SIMD_MINIMUM_THRESHOLD's worth of bytes remain (currently set to 6...), the code will now use the neon_next_match to locate the match instead of falling back to search_escape_basic to locate those matches.
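The resulting three-way decision can be sketched as follows. `SIMD_MINIMUM_THRESHOLD` and its value of 6 come from the comment above; `VEC_LEN`, the enum, and the function name are hypothetical.

```c
#include <stddef.h>

#define VEC_LEN 16                 /* assumed 128-bit vector width in bytes */
#define SIMD_MINIMUM_THRESHOLD 6   /* value mentioned in the thread */

typedef enum {
    SCAN_FULL_VECTOR,    /* a whole vector's worth of input remains */
    SCAN_PADDED_VECTOR,  /* short tail: pad to vector width, then use SIMD */
    SCAN_SCALAR          /* too short for SIMD to pay off */
} scan_strategy;

static scan_strategy choose_scan_strategy(size_t remaining) {
    if (remaining >= VEC_LEN) {
        return SCAN_FULL_VECTOR;
    }
    if (remaining >= SIMD_MINIMUM_THRESHOLD) {
        return SCAN_PADDED_VECTOR;
    }
    return SCAN_SCALAR;
}
```

For example, the 13-byte strings.short input discussed earlier would take the padded-vector path under these assumptions.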

Edit: Updated the code for SSE2 too.

The real-world benchmarks on my M1 are now:

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.568k i/100ms
Calculating -------------------------------------
               after     26.384k (± 1.8%) i/s   (37.90 μs/i) -    133.536k in   5.062980s

Comparison:
              before:    15443.5 i/s
               after:    26383.9 i/s - 1.71x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   124.000 i/100ms
Calculating -------------------------------------
               after      1.253k (± 1.0%) i/s  (797.88 μs/i) -      6.324k in   5.046229s

Comparison:
              before:      989.7 i/s
               after:     1253.3 i/s - 1.27x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   269.000 i/100ms
Calculating -------------------------------------
               after      2.628k (± 1.3%) i/s  (380.56 μs/i) -     13.181k in   5.017108s

Comparison:
              before:     1705.5 i/s
               after:     2627.7 i/s - 1.54x  faster

…ector's width worth of data.

samyron · 2025-04-10T14:24:37Z

I now have access to an M4 Pro. Here are the current real-world results compared to master.

ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     3.620k i/100ms
Calculating -------------------------------------
               after     36.214k (± 2.5%) i/s   (27.61 μs/i) -    181.000k in   5.001447s

Comparison:
              before:    27564.7 i/s
               after:    36214.1 i/s - 1.31x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   182.000 i/100ms
Calculating -------------------------------------
               after      1.820k (± 2.1%) i/s  (549.34 μs/i) -      9.282k in   5.101445s

Comparison:
              before:     1717.0 i/s
               after:     1820.4 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   367.000 i/100ms
Calculating -------------------------------------
               after      3.675k (± 2.6%) i/s  (272.11 μs/i) -     18.717k in   5.096566s

Comparison:
              before:     2882.5 i/s
               after:     3675.0 i/s - 1.27x  faster

radiospiel · 2025-04-11T00:02:17Z

Interesting to see that on the M4 Pro the difference between SIMD and non-SIMD is less prominent (but still very much there).

samyron · 2025-04-12T03:05:44Z

This branch is still a work in progress, but it optimizes the worst cases. I don't like that I had to create an ESCAPE_UTF8_CHAR_BASIC macro, but clang wouldn't inline the method. Without the method inlined, while it did help, it wasn't on par with master.

The key idea is to use popcount to count the number of matches within each chunk. If the count is greater than or equal to half the width of the vector (a heuristic), loop over the chunk and execute ESCAPE_UTF8_CHAR_BASIC for each byte.
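A minimal sketch of that heuristic, not the code in this branch. It assumes a 16-byte chunk and the basic JSON escape set (control characters, `"` and `\`), and it builds the match mask with a scalar loop as a stand-in for the vector compare and mask extraction. The names needs_escape, chunk_match_mask and should_escape_bytewise are hypothetical, and __builtin_popcount is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_WIDTH 16 /* assumed vector width in bytes (NEON / SSE2) */

/* Hypothetical helper: nonzero if the byte needs escaping in basic JSON output. */
static int needs_escape(unsigned char c)
{
    return c < 0x20 || c == '"' || c == '\\';
}

/* Scalar stand-in for the vector compare + mask extraction:
 * bit i of the result is set when chunk[i] needs escaping. */
static uint16_t chunk_match_mask(const unsigned char *chunk, size_t len)
{
    assert(len <= CHUNK_WIDTH);
    uint16_t mask = 0;
    for (size_t i = 0; i < len; i++) {
        if (needs_escape(chunk[i])) {
            mask |= (uint16_t)(1u << i);
        }
    }
    return mask;
}

/* The heuristic: when at least half of the chunk matches, escaping the
 * chunk byte by byte is assumed to be cheaper than repeatedly jumping
 * to the next match via the mask. */
static int should_escape_bytewise(uint16_t mask)
{
    return __builtin_popcount(mask) >= CHUNK_WIDTH / 2;
}
```

A caller would compute the mask once per chunk, take the byte-by-byte path when should_escape_bytewise returns nonzero, and otherwise keep using the mask to skip to each match.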

Comparing that branch with master, the worst cases are now no worse.

== Encoding bytes.16.bestcase (190001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   491.000 i/100ms
Calculating -------------------------------------
               after      4.168k (± 9.0%) i/s  (239.91 μs/i) -     21.113k in   5.102554s

Comparison:
              before:     3293.9 i/s
               after:     4168.2 i/s - 1.27x  faster


== Encoding bytes.16.worstcase (350001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   178.000 i/100ms
Calculating -------------------------------------
               after      1.854k (± 9.9%) i/s  (539.49 μs/i) -      9.256k in   5.041858s

Comparison:
              before:     1643.4 i/s
               after:     1853.6 i/s - same-ish: difference falls within error


== Encoding bytes.16.mixed (670001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   167.000 i/100ms
Calculating -------------------------------------
               after      1.603k (±11.6%) i/s  (623.90 μs/i) -      8.016k in   5.086205s

Comparison:
              before:     1625.5 i/s
               after:     1602.8 i/s - same-ish: difference falls within error


== Encoding bytes.15.bestcase (180001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   373.000 i/100ms
Calculating -------------------------------------
               after      3.907k (±10.9%) i/s  (255.94 μs/i) -     19.396k in   5.021881s

Comparison:
              before:     3488.1 i/s
               after:     3907.2 i/s - same-ish: difference falls within error


== Encoding bytes.15.worstcase (330001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   167.000 i/100ms
Calculating -------------------------------------
               after      1.647k (± 9.2%) i/s  (607.14 μs/i) -      8.183k in   5.008542s

Comparison:
              before:     1792.1 i/s
               after:     1647.1 i/s - same-ish: difference falls within error


== Encoding bytes.15.mixed (250001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   100.000 i/100ms
Calculating -------------------------------------
               after      1.235k (±10.3%) i/s  (809.58 μs/i) -      6.200k in   5.075855s

Comparison:
              before:     1878.4 i/s
               after:     1235.2 i/s - 1.52x  slower


== Encoding bytes.32.bestcase (350001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   431.000 i/100ms
Calculating -------------------------------------
               after      3.971k (±12.1%) i/s  (251.82 μs/i) -     19.826k in   5.065091s

Comparison:
              before:     2479.6 i/s
               after:     3971.1 i/s - 1.60x  faster


== Encoding bytes.32.worstcase (670001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   126.000 i/100ms
Calculating -------------------------------------
               after      1.101k (±11.4%) i/s  (908.37 μs/i) -      5.544k in   5.103404s

Comparison:
              before:      917.5 i/s
               after:     1100.9 i/s - same-ish: difference falls within error

Note: At this point my M1 might be thermal throttling. I've been running benchmarks almost constantly for a few hours.

byroot · 2025-04-12T09:29:31Z

I don't like that I had to create an ESCAPE_UTF8_CHAR_BASIC macro but clang wouldn't inline the method.

Interesting. Have you tried __attribute__((always_inline))? But I see how that could have caused the perf degradation.

Also just a quick heads up, I'm gonna be busy with RubyKaigi for the next ~10 days, so don't be surprised if I'm not responsive.

Once you are satisfied with the performance and wish to prepare this PR to be mergeable, you can get rid of the disabled codepath, commented out code, etc. I have a few diff comments that weren't addressed.

samyron · 2025-04-12T15:31:13Z

Interesting. Have you tried __attribute__((always_inline))? But I see how that could have caused the perf degradation.

I haven't but I will try and see what happens.

Once you are satisfied with the performance and wish to prepare this PR to be mergeable, you can get rid of the disabled codepath, commented out code, etc. I have a few diff comments that weren't addressed.

I think now that the worst case is addressed to not be worse than master, I'm going to call it good enough. There are additional changes I'm experimenting with but I should probably put them on the back burner for now and focus on getting this ready to be merged. I'll remove the commented-out code, settle on a NEON implementation and review any outstanding PR comments.

I'll also ensure code style consistency... I know I've been putting my opening {'s on the same line as the method definition but the existing code mostly does not. I'll address that too.

radiospiel · 2025-04-18T17:09:07Z

I have run benchmarks on servers (they shouldn't be affected by thermal throttling as much.) Here are the numbers, comparing 2.3.1 → 2.7.6 → 2.9.1 → current master → this branch (at commit ad995fc), on both Intel X86 and ARM Neon machines.

ruby 3.3.4; 16 core Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz

Test Case	Bytes	arm-neon-simd-v2	master	json 2.9.1	json 2.7.6	json 2.3.1
dumps.activitypub	52595	10.141k	9.092k	7.714k	7.544k	3.559k
dumps.citm_catalog	500298	412.048	469.153	433.180	390.871	244.163
dumps.twitter	466906	941.694	892.179	751.575	719.190	356.065

ruby 3.3.4; 16 core Graviton CPU @ 2.10GHz

Test Case	Bytes	arm-neon-simd-v2	master	json 2.9.1	json 2.7.6	json 2.3.1
dumps.activitypub	52595	17.106k	14.112k	10.298k	10.133k	6.385k
dumps.citm_catalog	500298	681.012	736.664	607.248	541.731	360.880
dumps.twitter	466906	1.497k	1.410k	1.019k	949.443	558.259

There is still quite a drop in performance on the citm_catalog testcase, unfortunately. activitypub benefits nicely, and we have some gains on twitter.

radiospiel · 2025-04-18T17:25:18Z

Note that these numbers ^^^ are on Ruby 3.3.4; Ruby 3.4.1 is not available on those machines.

radiospiel · 2025-04-18T17:58:29Z

I also ran some tests on a JSON dump which only contains the strings from citm_catalog.json, as a pretty-printed array, and I see a 10% speedup there. I wonder what in this PR might have affected the runtime of something else.

But it looks like that effect disappears when the current master is merged into this PR. I'll rerun benchmarks again and post results.

samyron · 2025-04-19T01:49:10Z

@radiospiel thank you so much for the benchmarks. I'm currently focused on the SSE2 implementation, particularly around the worst case performance using this benchmark: benchmark_encoding "bytes.15.worstcase", ([('"' * 15)] * 10000).
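Why this input is a worst case: every byte of every string is a `"` and expands to a two-byte escape, so there is nothing for the vector search to skip. The reported sizes of the worstcase benchmarks are consistent with that, as this small sketch shows (worstcase_json_size is a hypothetical helper for illustration, not part of the benchmark suite):

```c
#include <stddef.h>

/* Size in bytes of the compact JSON array generated for `count` strings
 * of `len` bytes each, when every byte needs a two-byte escape. */
static size_t worstcase_json_size(size_t count, size_t len)
{
    size_t per_string = 2 + 2 * len;            /* surrounding quotes + escaped payload */
    size_t commas = count > 0 ? count - 1 : 0;  /* separators between elements */
    return count * per_string + commas + 2;     /* + the enclosing [ and ] */
}
```

For 10000 strings of 15 bytes this gives 330001, matching the size in the bytes.15.worstcase header.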

With the code as of commit 6f417af4f544e79e2d2864d716686be91a46a740 the performance on my machine is:

== Encoding bytes.15.worstcase (330001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
                json   103.000 i/100ms
Calculating -------------------------------------
                json      1.070k (± 3.2%) i/s  (934.28 μs/i) -      5.356k in   5.009603s

master (commit c472d72fdf1fd8e297cfdaf05b95b1ddcfef63fb) is:

== Encoding bytes.15.worstcase (330001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
                json   191.000 i/100ms
Calculating -------------------------------------
                json      2.087k (± 2.9%) i/s  (479.13 μs/i) -     10.505k in   5.037973s

Instead of using the function pointer search_escape_basic_impl in the loop in convert_UTF8_to_JSON, I call the _sse2 function directly (this isn't runtime detection, I wanted to see the best-case scenario...):

static inline unsigned char search_escape_basic_sse2(search_state *);

static inline void convert_UTF8_to_JSON(search_state *search)
{
    while (search_escape_basic_sse2(search)) {
        escape_UTF8_char_basic(search);
    }
}

== Encoding bytes.15.worstcase (330001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
                json   174.000 i/100ms
Calculating -------------------------------------
                json      1.826k (± 3.4%) i/s  (547.57 μs/i) -      9.222k in   5.056308s

I'd be much happier with a 10-15% performance decrease compared to master rather than a 50% one.

Trying a simple conditional:

static SIMD_Implementation SIMD_IMPL;
static inline unsigned char search_escape_basic_sse2(search_state *);

static inline void convert_UTF8_to_JSON(search_state *search)
{
    if (SIMD_IMPL == SIMD_SSE2) {
        while (search_escape_basic_sse2(search)) {
            escape_UTF8_char_basic(search);
        }
    } else {
        while (search_escape_basic(search)) {
            escape_UTF8_char_basic(search);
        }
    }
}

I get the following performance:

== Encoding bytes.15.worstcase (330001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
                json   173.000 i/100ms
Calculating -------------------------------------
                json      1.774k (± 3.5%) i/s  (563.55 μs/i) -      8.996k in   5.076102s

I need to test this on my M1. Ideally I'd like to keep the implementation the same between platforms.
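For the conditional above to work outside of an experiment, SIMD_IMPL has to be set once before the first call. A minimal sketch of how that could look, assuming GCC or Clang on x86 (__builtin_cpu_init and __builtin_cpu_supports are compiler builtins there). SIMD_NONE and detect_simd_implementation are hypothetical names, and this is not necessarily how the branch performs detection.

```c
typedef enum {
    SIMD_NONE, /* hypothetical name for the scalar fallback */
    SIMD_SSE2
} SIMD_Implementation;

/* Pick an implementation once, e.g. when the extension is loaded.
 * On non-x86 targets or other compilers this sketch always falls
 * back to the scalar path. */
static SIMD_Implementation detect_simd_implementation(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2")) {
        return SIMD_SSE2;
    }
#endif
    return SIMD_NONE;
}
```

The result would be stored in SIMD_IMPL once, so the hot loop only pays for a well-predicted branch rather than an indirect call per search.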

samyron · 2025-04-20T02:51:49Z

As of master commit 92beca803238ac956508da9ccb94fdc609e2209b and this branch commit 479af084038c9f442f896fff1486546bcacac9db on my M1 Air:

Combined Real-world and Synthetic Benchmarks

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.435k i/100ms
Calculating -------------------------------------
               after     25.431k (± 1.7%) i/s   (39.32 μs/i) -    129.055k in   5.076191s

Comparison:
              before:    17550.4 i/s
               after:    25430.9 i/s - 1.45x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   136.000 i/100ms
Calculating -------------------------------------
               after      1.372k (± 1.2%) i/s  (728.88 μs/i) -      6.936k in   5.056303s

Comparison:
              before:     1271.5 i/s
               after:     1372.0 i/s - 1.08x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   267.000 i/100ms
Calculating -------------------------------------
               after      2.647k (± 1.7%) i/s  (377.80 μs/i) -     13.350k in   5.045105s

Comparison:
              before:     2025.1 i/s
               after:     2646.9 i/s - 1.31x  faster


== Encoding bytes.16.bestcase (190001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   785.000 i/100ms
Calculating -------------------------------------
               after      7.881k (± 1.8%) i/s  (126.89 μs/i) -     40.035k in   5.081835s

Comparison:
              before:     5843.0 i/s
               after:     7880.7 i/s - 1.35x  faster


== Encoding bytes.15.bestcase (180001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   721.000 i/100ms
Calculating -------------------------------------
               after      7.379k (± 1.7%) i/s  (135.52 μs/i) -     37.492k in   5.082265s

Comparison:
              before:     5257.0 i/s
               after:     7379.2 i/s - 1.40x  faster


== Encoding bytes.15.singlematch (1310001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   101.000 i/100ms
Calculating -------------------------------------
               after      1.041k (± 1.9%) i/s  (960.95 μs/i) -      5.252k in   5.048865s

Comparison:
              before:      874.9 i/s
               after:     1040.6 i/s - 1.19x  faster


== Encoding bytes.32.bestcase (350001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   629.000 i/100ms
Calculating -------------------------------------
               after      6.234k (± 2.8%) i/s  (160.41 μs/i) -     31.450k in   5.048944s

Comparison:
              before:     3894.6 i/s
               after:     6234.0 i/s - 1.60x  faster


== Encoding bytes.32.worstcase (670001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   167.000 i/100ms
Calculating -------------------------------------
               after      1.665k (± 1.6%) i/s  (600.63 μs/i) -      8.350k in   5.016677s

Comparison:
              before:     1143.3 i/s
               after:     1664.9 i/s - 1.46x  faster

Introduce ARM Neon SIMD.

259090c

samyron mentioned this pull request Feb 3, 2025

WIP ARM Neon/AVX2 SIMD implementation. #730

Closed

byroot reviewed Feb 3, 2025

samyron added 3 commits February 4, 2025 20:23

Use the 'rules' implementation instead of the lookup table implementation. Also store the potential matches directly rather than looking up values in the escape table.

9ad196e

Merge branch 'master' into arm-neon-simd-v2

0c1958a

Refactoring and simplifications.

d8a2e56

samyron commented Feb 5, 2025

Load the SIMD lookup table explicitly without loops.

89ba0be

Use only 2 64-byte lookup tables for the neon escape_table_basic as we only need 128 bytes for the lookup table as the top 128 bytes are all zeros.

a23b84e

samyron added 2 commits February 9, 2025 20:45

Simplifications.

5506091

A few more cleanups.

3ae5677

radiospiel mentioned this pull request Mar 16, 2025

Implement JSON string escaping using SIMD (ARM + X86) #769

Draft

samyron marked this pull request as ready for review March 18, 2025 01:34

radiospiel reviewed Mar 23, 2025

samyron added 2 commits April 9, 2025 08:41

Ensure all search_escape_basic* functions are inlined.

062587e

Refactor the code that copies the last remaining characters in the SIMD-fallback case to a method.

f49af9b

byroot reviewed Apr 9, 2025

Change 'len' to 'vec_len' to ensure bytes past 'len' do not need to be escaped.

15f1887

Added the ability to use the matches_mask in the case there isn't a full vector's width worth of data remaining.

a666f5a

SSE implementation of using the escape mask when there isn't a full vector's width worth of data.

1dc47f8

samyron added 2 commits April 16, 2025 21:36

Optimizations, comments and formatting. Still work in progress.

af822fc

Implemented optimizations in the SSE2 implementation. A few simplifications too.

ad995fc

samyron added 3 commits April 19, 2025 21:34

Updates to better handle escape-heavy workloads on ARM Neon.

9cf63a1

Apply the same optimizations to the SSE2 implementation.

df76269

Merge branch 'master' into arm-neon-simd-v2

479af08
