
Replace SHA function by something very fast #619

Open · wants to merge 1 commit into master
Conversation

@za3k commented Feb 28, 2025

My estimate is that this should increase total markdown processing speed by ~2.5X.

Fixes issue #618.

Tests pass.

@nicholasserra (Collaborator) commented
Hello! Thanks for this, it looks sane. The only thing I'm wondering about is the collision potential versus the SHAs we were using. I'm guessing it's minimal to none. Do you have any perspective on that? Thank you

@za3k (Author) commented Mar 3, 2025

There are two ways of thinking about it.

  1. This fails as often as a Python dict or set fails, in terms of string collisions. Personally, I have never had that happen.

  2. You can do some math. The hash is a 64-bit number in CPython (on my 64-bit machine, at least). The birthday paradox says we would need about 2**32 ≈ 4 billion strings before reaching one expected collision in a good hash system.
    However, someone did some empirical tests, and CPython's is not an ideal hash, so the real answer might be more like 200K strings.
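The birthday-paradox figure above can be sanity-checked in a couple of lines (a back-of-the-envelope sketch, not markdown2 code; 2**32 is the string count, 2**64 the hash space):

```python
N = 2 ** 64              # size of CPython's 64-bit hash space
n = 2 ** 32              # number of distinct strings hashed
pairs = n * (n - 1) / 2  # pairs of strings that could collide
expected = pairs / N     # expected colliding pairs for an ideal 64-bit hash
print(expected)          # just under 0.5, i.e. ~1 collision around 2**32 strings
```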

If you are worried about users with more than 200K strings, first of all I'd say: improve performance! But to avoid the chance of failure and speed things up, I see three options:

  1. The ideal answer would be to stop doing this, not to find a better hash function. You can escape a segment of HTML code without any hashing. But that would be a lot more rewriting work.

  2. You can use random IDs (I tried this). It breaks something in the URL-escaping logic for images and end-material references, because they rely on the hash being the same when they look it up again later. I forget the details, sorry. But maybe you could fix just that.

  3. You could hash both the string and some variant on the string, which would give you a 128-bit number. That should avoid any collisions, I hope.

```python
>>> s = "hello, world"; (hash(s) << 64) + hash(s + "also")
-42389628753142344245553632286555727257
```
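Wrapped up as a helper, option 3 might look like this (a sketch only; the `hash128` name and the `"also"` salt are arbitrary choices of mine, not anything in markdown2):

```python
def hash128(s: str) -> int:
    # Combine the hashes of the string and a salted variant into one
    # ~128-bit value. Stable within a single interpreter run only, since
    # CPython randomizes str hashing per process (PYTHONHASHSEED).
    return (hash(s) << 64) + hash(s + "also")
```

Because of hash randomization this is only consistent within one process, which should be fine if the placeholder IDs only need to match up within a single conversion.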

@za3k (Author) commented Mar 3, 2025

Oh, one reasonable thing you might want to do is run some kind of before-and-after benchmark. I didn't actually verify that it got a lot faster.
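If it helps, a rough micro-benchmark along these lines would do (a sketch only; I'm assuming SHA-256 stands in for whatever SHA variant the code used, and the sample text is arbitrary):

```python
import hashlib
import timeit

text = "some markdown fragment with *emphasis* and `code` " * 10

def sha_id(s):
    # old approach (assumed): hex digest of a SHA hash
    return hashlib.sha256(s.encode()).hexdigest()

def fast_id(s):
    # new approach: CPython's built-in hash
    return hash(s)

n = 100_000
t_sha = timeit.timeit(lambda: sha_id(text), number=n)
t_hash = timeit.timeit(lambda: fast_id(text), number=n)
print(f"sha: {t_sha:.3f}s  hash: {t_hash:.3f}s  speedup: {t_sha / t_hash:.1f}x")
```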
