Fast hash for strings

As of Python 3 this method does not work:

Python has a built-in hash() function that’s very fast and perfect for most uses:

>>> hash("dfds")
3591916071403198536

You can then make it unsigned:

>>> hashu=lambda word: ctypes.c_uint64(hash(word)).value

You can then turn it into a 16 byte hex string:

>>> hashu("dfds").to_bytes(8,"big").hex()

Or an N*2 byte string, where N is <= 8:

>>> hashn=lambda word, N  : (hashu(word)%(2**(N*8))).to_bytes(N,"big").hex()

..etc. And if you want N to be larger than 8 bytes, you can just hash twice. Python’s built-in is so vastly faster, it’s never worth using hashlib for anything unless you need security… not just collision resistance.

>>> hashnbig=lambda word, N  : ((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big").hex()

And finally, use the urlsafe base64 encoding to make a much better string than “hex” gives you

>>> hashnbigu=lambda word, N  : urlsafe_b64encode(((hashu(word)+2**64*hash(word+"2"))%(2**(N*8))).to_bytes(N,"big")).decode("utf8").rstrip("=")
>>> hashnbigu("foo",16)
'ZblnvrRqHwAy2lnvrR4HrA'

Caveats:

  • Be warned that in Python 3.3 and up, this function is
    randomized and won’t work for some use cases. You can disable this with PYTHONHASHSEED=0

  • See https://github.com/flier/pyfasthash for fast, stable hashes that
    that similarly won’t overload your CPU for non-cryptographic applications.

  • Don’t use this lambda style in real code… write it out! And
    stuffing things like 2**32 in your code, instead of making them
    constants is bad form.

  • In the end 8 bytes of collision resistance is OK for a smaller
    applications…. with less than a million entries, you’ve got
    collision odds of < 0.0000001%. That’s a 12 byte b64 encoded
    string. But it might not be enough for larger apps.

  • 16 bytes is enough for a UUID/OID in a cache, etc.

Speed comparison for producing 300k 16 byte hashes from a bytes-input.

builtin: 0.188
md5: 0.359
fnvhash_c: 0.113

For a complex input (tuple of 3 integers, for example), you have to convert to bytes to use the non-builtin hashes, this adds a lot of conversion overhead, making the builtin shine.

builtin: 0.197
md5: 0.603
fnvhash_c: 0.284

Leave a Comment

Hata!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)