Fastest way to generate a random-like unique string with random length in Python 3

Question

Basic improvements, sets and local names

Use a set, not a list, and testing for uniqueness is much faster; set membership testing takes constant time independent of the set size, while lists take O(N) linear time. Use a set comprehension to produce a series of keys at a time to avoid having to look up and call the set.add() method in a loop; properly random, larger keys have a very small chance of producing duplicates anyway.

Because this is done in a tight loop, it is worth your while optimising away all name lookups as much as possible:

import secrets
import numpy as np
from functools import partial

def produce_amount_keys(amount_of_keys, _randint=np.random.randint):
    keys = set()
    pickchar = partial(secrets.choice, string.ascii_uppercase + string.digits)
    while len(keys) < amount_of_keys:
        keys |= {''.join([pickchar() for _ in range(_randint(12, 20))]) for _ in range(amount_of_keys - len(keys))}
    return keys

The _randint keyword argument binds the np.random.randint name to a local in the function, which are faster to reference than globals, especially when attribute lookups are involved.

The pickchar() partial avoids looking up attributes on modules or more locals; it is a single callable that has all the references in place, so is faster in execute, especially when done in a loop.

The while loop keeps iterating only if there were duplicates produced. We produce enough keys in a single set comprehension to fill the remainder if there are no duplicates.

Timings for that first improvement

For 100 items, the difference is not that big:

>>> timeit('p(100)', 'from __main__ import produce_amount_keys_list as p', number=1000)
8.720592894009314
>>> timeit('p(100)', 'from __main__ import produce_amount_keys_set as p', number=1000)
7.680242831003852

but when you start scaling this up, you’ll notice that the O(N) membership test cost against a list really drags your version down:

>>> timeit('p(10000)', 'from __main__ import produce_amount_keys_list as p', number=10)
15.46253142200294
>>> timeit('p(10000)', 'from __main__ import produce_amount_keys_set as p', number=10)
8.047800761007238

My version is already almost twice as fast as 10k items; 40k items can be run 10 times in about 32 seconds:

>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_list as p', number=10)
138.84072386901244
>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_set as p', number=10)
32.40720253501786

The list version took over 2 minutes, more than ten times as long.

Numpy’s random.choice function, not cryptographically strong

You can make this faster still by forgoing the secrets module and using np.random.choice() instead; this won’t produce a cryptographic level randomness however, but picking a random character is twice as fast:

def produce_amount_keys(amount_of_keys, _randint=np.random.randint):
    keys = set()
    pickchar = partial(
        np.random.choice,
        np.array(list(string.ascii_uppercase + string.digits)))
    while len(keys) < amount_of_keys:
        keys |= {''.join([pickchar() for _ in range(_randint(12, 20))]) for _ in range(amount_of_keys - len(keys))}
    return keys

This makes a huge difference, now 10 times 40k keys can be produced in just 16 seconds:

>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_npchoice as p', number=10)
15.632006907981122

Further tweaks with the itertools module and a generator

We can also take the unique_everseen() function from the itertools module Recipes section to have it take care of the uniqueness, then use an infinite generator and the itertools.islice() function to limit the results to just the number we want:

# additional imports
from itertools import islice, repeat

# assumption: unique_everseen defined or imported

def produce_amount_keys(amount_of_keys):
    pickchar = partial(
        np.random.choice,
        np.array(list(string.ascii_uppercase + string.digits)))
    def gen_keys(_range=range, _randint=np.random.randint):
        while True:
            yield ''.join([pickchar() for _ in _range(_randint(12, 20))])
    return list(islice(unique_everseen(gen_keys()), amount_of_keys))

This is slightly faster still, but only marginally so:

>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_itertools as p', number=10)
14.698191125993617

os.urandom() bytes and a different method of producing strings

Next, we can follow on on Adam Barnes’s ideas for using UUID4 (which is basically just a wrapper around os.urandom()) and Base64. But by case-folding Base64 and replacing 2 characters with randomly picked ones, his method severely limits the entropy in those strings (you won’t produce the full range of unique values possible, a 20-character string only using (256 ** 15) / (36 ** 20) == 1 in every 99437 bits of entropy!).

The Base64 encoding uses both upper and lower case characters and digits but also adds the - and / characters (or + and _ for the URL-safe variant). For only uppercase letters and digits, you’d have to uppercase the output and map those extra two characters to other random characters, a process that throws away a large amount of entropy from the random data provided by os.urandom(). Instead of using Base64, you could also use the Base32 encoding, which uses uppercase letters and the digits 2 through 8, so produces strings with 32 ** n possibilities versus 36 ** n. However, this can speed things up further from the above attempts:

import os
import base64
import math

def produce_amount_keys(amount_of_keys):
    def gen_keys(_urandom=os.urandom, _encode=base64.b32encode, _randint=np.random.randint):
        # (count / math.log(256, 32)), rounded up, gives us the number of bytes
        # needed to produce *at least* count encoded characters
        factor = math.log(256, 32)
        input_length = [None] * 12 + [math.ceil(l / factor) for l in range(12, 20)]
        while True:
            count = _randint(12, 20)
            yield _encode(_urandom(input_length[count]))[:count].decode('ascii')
    return list(islice(unique_everseen(gen_keys()), amount_of_keys))

This is really fast:

>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_b32 as p', number=10)
4.572628145979252

40k keys, 10 times, in just over 4 seconds. So about 75 times as fast; the speed of using os.urandom() as a source is undeniable.

This is, cryptographically strong again; os.urandom() produces bytes for cryptographic use. On the other hand, we reduced the number of possible strings produced by more than 90% (((36 ** 20) - (32 ** 20)) / (36 ** 20) * 100 is 90.5), we are no longer using the 0, 1, 8 and 9 digits in the outputs.

So perhaps we should use the urandom() trick to produce a proper Base36 encoding; we’ll have to produce our own b36encode() function:

import string
import math

def b36encode(b, 
        _range=range, _ceil=math.ceil, _log=math.log, _fb=int.from_bytes, _len=len, _b=bytes,
        _c=(string.ascii_uppercase + string.digits).encode()):
    """Encode a bytes value to Base36 (uppercase ASCII and digits)

    This isn't too friendly on memory because we convert the whole bytes
    object to an int, but for smaller inputs this should be fine.
    """
    b_int = _fb(b, 'big')
    length = _len(b) and _ceil(_log((256 ** _len(b)) - 1, 36))
    return _b(_c[(b_int // 36 ** i) % 36] for i in _range(length - 1, -1, -1))

and use that:

def produce_amount_keys(amount_of_keys):
    def gen_keys(_urandom=os.urandom, _encode=b36encode, _randint=np.random.randint):
        # (count / math.log(256, 36)), rounded up, gives us the number of bytes
        # needed to produce *at least* count encoded characters
        factor = math.log(256, 36)
        input_length = [None] * 12 + [math.ceil(l / factor) for l in range(12, 20)]
        while True:
            count = _randint(12, 20)
            yield _encode(_urandom(input_length[count]))[-count:].decode('ascii')
    return list(islice(unique_everseen(gen_keys()), amount_of_keys))

This is reasonably fast, and above all produces the full range of 36 uppercase letters and digits:

>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_b36 as p', number=10)
8.099918447987875

Granted, the base32 version is almost twice as fast as this one (thanks to an efficient Python implementation using a table) but using a custom Base36 encoder is still twice the speed of the non-cryptographically secure numpy.random.choice() version.

However, using os.urandom() produces bias again; we have to produce more bits of entropy than is required for between 12 to 19 base36 ‘digits’. For 17 digits, for example, we can’t produce 36 ** 17 different values using bytes, only the nearest equivalent of 256 ** 11 bytes, which is about 1.08 times too high, and so we’ll end up with a bias towards A, B, and to a lesser extent, C (thanks Stefan Pochmann for pointing this out).

Picking an integer below `(36 ** length)` and mapping integers to base36

So we need to reach out to a secure random method that can give us values evenly distributed between 0 (inclusive) and 36 ** (desired length) (exclusive). We can then map the number directly to the desired string.

First, mapping the integer to a string; the following has been tweaked to produce the output string the fastest:

def b36number(n, length, _range=range, _c=string.ascii_uppercase + string.digits):
    """Convert an integer to Base36 (uppercase ASCII and digits)"""
    chars = [_c[0]] * length
    while n:
        length -= 1
        chars[length] = _c[n % 36]
        n //= 36
    return ''.join(chars)

Next, we need a fast and cryptographically secure method of picking our number in a range. You can still use os.urandom() for this, but then you’d have to mask the bytes down to a maximum number of bits, and then loop until your actual value is below the limit. This is actually already implemented, by the secrets.randbelow() function. In Python versions < 3.6 you can use random.SystemRandom().randrange(), which uses the exact same method with some extra wrapping to support a lower bound greater than 0 and a step size.

Using secrets.randbelow() the function becomes:

import secrets

def produce_amount_keys(amount_of_keys):
    def gen_keys(_below=secrets.randbelow, _encode=b36number, _randint=np.random.randint):
        limit = [None] * 12 + [36 ** l for l in range(12, 20)]
        while True:
            count = _randint(12, 20)
            yield _encode(_below(limit[count]), count)
    return list(islice(unique_everseen(gen_keys()), amount_of_keys))

and this then is quite close to the (probably biased) base64 solution:

>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_below as p', number=10)
5.135716405988205

This is almost as fast as the Base32 approach, but produces the full range of keys!