feat: faster bitpacking filter for selectivities from 5% to 80% #2068
base: develop
It's worth running this on our AVX512 machine too, to see if the switch point depends on SIMD width. M3s only have 128-bit SIMD, IIRC.
On a c2-standard-4 (Cascade Lake), the switch points are slightly different. They appear to be i8: 0.02, i16: 0.03, i32: 0.075, i64: 0.09. This PR uses i8: 0.02, i16: 0.03, i32: 0.04, i64: 0.04. For reference, the SIMD width ratio is 512 / 128 = 4. These tests use 10,000-element arrays, so 0.04 is around 400 elements, whereas 0.075 and 0.09 are around 750 and 900. I'm not sure there's a robust way to pick this threshold without benchmarking on the target machine. I'd be happy to push the 4- and 8-byte types up to 0.075 and 0.09. On an Apple M3 this is 20-35% slower, but we're talking about 2.0 µs vs 1.5 µs.
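To make the threshold arithmetic above concrete, here is a minimal sketch that converts each quoted switch point into a selected-element count. `switch_count`, `SWITCH_POINTS`, and `report` are hypothetical names for illustration only and are not part of this PR; the numbers are copied from the comment above.

```rust
/// Converts a selectivity threshold into an approximate count of selected
/// elements for an array of `len` elements.
/// Hypothetical helper for illustration, not a function from this PR.
fn switch_count(len: usize, threshold: f64) -> usize {
    // Round to the nearest element to absorb floating-point error.
    (len as f64 * threshold).round() as usize
}

/// Switch points quoted in the comment above:
/// (element type, value used in this PR, value observed on Cascade Lake).
const SWITCH_POINTS: [(&str, f64, f64); 4] = [
    ("i8", 0.02, 0.02),
    ("i16", 0.03, 0.03),
    ("i32", 0.04, 0.075),
    ("i64", 0.04, 0.09),
];

/// Builds one report line per element type for an array of `len` elements.
fn report(len: usize) -> Vec<String> {
    SWITCH_POINTS
        .iter()
        .map(|(ty, pr, cascade_lake)| {
            format!(
                "{ty}: PR {} elements, Cascade Lake {} elements",
                switch_count(len, *pr),
                switch_count(len, *cascade_lake)
            )
        })
        .collect()
}
```

Calling `report(10_000)` reproduces the counts mentioned above: 400 elements for the 0.04 threshold, and 750 and 900 elements for the Cascade Lake i32 and i64 thresholds.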
…5% and 80%

This new benchmark demonstrates that the switch point is in [0.02, 0.04]. 8-bit elements switch around 0.02, but 32- and 64-bit elements switch around 0.04. [Google Sheet with one run on my Apple M3 Max](https://docs.google.com/spreadsheets/d/1T4JeSLnpFegA_pRS70iNu4ve9YMjEu-j1vRL7spazoA/edit?gid=624487667#gid=624487667).
Force-pushed from 02cab19 to 369f8cf.
Okay, I went with the Cascade Lake thresholds, as those are best for our benchmarks. I wish I had a more principled way to write them down or some way to tune to the current CPU.
On an Apple M3 Max, the new benchmark in this PR demonstrates that the switch point is in [0.02, 0.04]. 8-bit, 16-bit, 32-bit, and 64-bit arrays switch at 0.02, 0.03, 0.04, and 0.04, respectively. I later ran the benchmarks on a Cascade Lake cloud machine and found that the 8-bit, 16-bit, 32-bit, and 64-bit arrays switch at 0.03, 0.03, 0.075, and 0.09, respectively. In this PR, I use the Cascade Lake values, but I don't have a great answer for how to pick these. Regardless, it is clear that 0.8 is not the correct choice.
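As a rough illustration of per-width switch points, the sketch below picks between two filter code paths based on selectivity. All names here (`Width`, `Strategy`, `switch_point`, `choose_strategy`) are hypothetical and do not come from this PR, which does not name the two paths in this thread. The thresholds are the Cascade Lake values quoted in the description above, and I am assuming selectivity is the fraction of selected elements.

```rust
/// Element width of the array being filtered (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Width {
    W8,
    W16,
    W32,
    W64,
}

/// The two filter code paths, named only by which side of the switch
/// point they are used on (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Strategy {
    BelowSwitch,
    AboveSwitch,
}

/// Cascade Lake switch points as quoted in the PR description.
fn switch_point(width: Width) -> f64 {
    match width {
        Width::W8 => 0.03,
        Width::W16 => 0.03,
        Width::W32 => 0.075,
        Width::W64 => 0.09,
    }
}

/// Chooses a code path from the fraction of selected elements.
fn choose_strategy(width: Width, selected: usize, len: usize) -> Strategy {
    // An empty array has no meaningful selectivity; pick one side arbitrarily.
    if len == 0 {
        return Strategy::BelowSwitch;
    }
    let selectivity = selected as f64 / len as f64;
    if selectivity < switch_point(width) {
        Strategy::BelowSwitch
    } else {
        Strategy::AboveSwitch
    }
}
```

With 400 of 10,000 elements selected (selectivity 0.04), this sketch returns `BelowSwitch` for `Width::W32` and `AboveSwitch` for `Width::W8`. At selectivity 0.5 every width lands on `AboveSwitch`, whereas a single switch point of 0.8 would place 0.5 on the other side.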
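[Google Sheet with one run on my Apple M3 Max](https://docs.google.com/spreadsheets/d/1T4JeSLnpFegA_pRS70iNu4ve9YMjEu-j1vRL7spazoA/edit?gid=624487667#gid=624487667). A sheet with a run on a c2-standard-4 "Cascade Lake".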