
Replace SHA function by something very fast #619

Open · wants to merge 1 commit into master
Conversation

@za3k commented Feb 28, 2025

My estimate is that this should increase total markdown processing speed by ~2.5X.

Fixes issue #618.

Tests pass.

@nicholasserra (Collaborator) commented
Hello! Thanks for this, it looks sane. The only thing I'm wondering about is the collision potential versus the SHAs we were using. I'm guessing it's minimal to none. Do you have any perspective on that? Thank you

@za3k (Author) commented Mar 3, 2025

There are two ways of thinking about it.

  1. This fails as often as a Python dict or set fails, in terms of string collisions. Personally, I have never had that happen.

  2. You can do some math. The hash is a 64-bit number in CPython (on my 64-bit machine, at least). The birthday paradox says we would need about 2**32 ≈ 4 billion strings before reaching one expected collision in a good hash system.
    However, someone did some empirical tests, and CPython's is not an ideal hash, so the real answer might be more like 200K strings.
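The birthday-paradox figure above can be sanity-checked in a couple of lines (a back-of-the-envelope sketch, not markdown2 code; 2**32 is the string count, 2**64 the hash space):

```python
N = 2 ** 64              # size of CPython's 64-bit hash space
n = 2 ** 32              # number of distinct strings hashed
pairs = n * (n - 1) / 2  # pairs of strings that could collide
expected = pairs / N     # expected colliding pairs for an ideal 64-bit hash
print(expected)          # just under 0.5, i.e. ~1 collision around 2**32 strings
```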

If you are worried about users with more than 200K strings, first of all I'd say: improve performance! But to avoid the chance of failure and speed things up, I see three options:

  1. The ideal answer would be to stop doing this, not to find a better hash function. You can escape a segment of HTML code without any hashing. But that would be a lot more rewriting work.

  2. You can use random IDs (I tried this). It breaks something in the URL-escaping logic for images and end-material references, because they rely on the hash being the same when they look it up again later. I forget the details, sorry. But maybe you could fix just that.

  3. You could hash both the string and some variant on the string, which would give you a 128-bit number. That should avoid any collisions, I hope.

```python
>>> s = "hello, world"; (hash(s) << 64) + hash(s + "also")
-42389628753142344245553632286555727257
```
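Wrapped up as a helper, option 3 might look like this (a sketch only; the `hash128` name and the `"also"` salt are arbitrary choices of mine, not anything in markdown2):

```python
def hash128(s: str) -> int:
    # Combine the hashes of the string and a salted variant into one
    # ~128-bit value. Stable within a single interpreter run only, since
    # CPython randomizes str hashing per process (PYTHONHASHSEED).
    return (hash(s) << 64) + hash(s + "also")
```

Because of hash randomization this is only consistent within one process, which should be fine if the placeholder IDs only need to match up within a single conversion.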

@za3k (Author) commented Mar 3, 2025

Oh, one reasonable thing you might want to do is run some kind of before-and-after benchmark. I didn't actually verify that it got a lot faster.
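If it helps, a rough micro-benchmark along these lines would do (a sketch only; I'm assuming SHA-256 stands in for whatever SHA variant the code used, and the sample text is arbitrary):

```python
import hashlib
import timeit

text = "some markdown fragment with *emphasis* and `code` " * 10

def sha_id(s):
    # old approach (assumed): hex digest of a SHA hash
    return hashlib.sha256(s.encode()).hexdigest()

def fast_id(s):
    # new approach: CPython's built-in hash
    return hash(s)

n = 100_000
t_sha = timeit.timeit(lambda: sha_id(text), number=n)
t_hash = timeit.timeit(lambda: fast_id(text), number=n)
print(f"sha: {t_sha:.3f}s  hash: {t_hash:.3f}s  speedup: {t_sha / t_hash:.1f}x")
```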
