Why is Python list slower when sorted?

Question

Cache misses. When N int objects are allocated back-to-back, the memory reserved to hold them tends to be in a contiguous chunk. So crawling over the list in allocation order tends to access the memory holding the ints’ values in sequential, contiguous, increasing order too.

Shuffle it, and the access pattern when crawling over the list is randomized too. Cache misses abound, provided there are enough different int objects that they don’t all fit in cache.

At r==1, and r==2, CPython happens to treat such small ints as singletons, so, e.g., despite that you have 10 million elements in the list, at r==2 it contains only (at most) 100 distinct int objects. All the data for those fit in cache simultaneously.

Beyond that, though, you’re likely to get more, and more, and more distinct int objects. Hardware caches become increasingly useless then when the access pattern is random.

Illustrating:

>>> from random import randint, seed
>>> seed(987987987)
>>> for x in range(1, 9):
...     r = 10 ** x
...     js = [randint(1, r) for _ in range(10_000_000)]
...     unique = set(map(id, js))
...     print(f"{r:12,} {len(unique):12,}")
...     
          10           10
         100          100
       1,000    7,440,909
      10,000    9,744,400
     100,000    9,974,838
   1,000,000    9,997,739
  10,000,000    9,999,908
 100,000,000    9,999,998

Leave a Comment Cancel reply