Why is it impossible to find a specified value in a sorted array faster than O(log n)?

Let’s imagine that you have an array of n items. If you perform a lookup in this array, then that lookup can return one of n + 1 values: either “the item isn’t present,” or “the item is present at index i” for any of the n indices.

Now, suppose that the only way your algorithm is allowed to work with the array is by asking questions of the form "is the item greater than or equal to the item at index i?" for some choice of i, and suppose that it asks k questions of this form in total. There are then 2^k possible ways the comparisons could come back: there are two options for how the first comparison can go (either "yes" or "no"), two options for how the second comparison can go, two options for the third, and so on. Multiplying those k twos together gives 2^k.

We now have two constraints:

  1. Any correct algorithm must be able to return one of n + 1 different options.
  2. With k comparisons, there are 2^k possible ways those comparisons can work out.

This means that we have to have n + 1 ≤ 2^k, since otherwise there aren't enough possible outcomes from the comparisons to cover all n + 1 possible answers. Taking the log base two of both sides gives lg(n + 1) ≤ k, so the number of comparisons made must be Ω(log n).

Stated differently: if your algorithm makes too few queries, there aren't enough possible ways for those comparisons to go to ensure that every possible answer can be produced.
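Binary search matches this lower bound exactly. Here's a minimal sketch (in Python, with instrumentation added for illustration) that counts the "greater than or equal" comparisons it makes; for an array of n items, it makes at most ceil(lg(n + 1)) of them:

```python
# A minimal binary-search sketch that counts its comparisons,
# illustrating that k comparisons can distinguish among n + 1
# outcomes only when n + 1 <= 2**k.

def binary_search(arr, target):
    """Return (index or None, number of comparisons made)."""
    lo, hi = 0, len(arr)
    comparisons = 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if arr[mid] < target:    # the "is it >= arr[mid]?" question
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(arr) and arr[lo] == target
    return (lo if found else None, comparisons)

arr = list(range(0, 200, 2))     # 100 sorted even numbers
idx, k = binary_search(arr, 42)
print(idx, k)   # index 21, using at most ceil(lg(101)) = 7 comparisons
```

Each iteration halves the candidate range, so after k comparisons the algorithm has narrowed n + 1 possible answers down to one, exactly as the counting argument requires.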


Of course, if you aren’t in the comparison model, you can outperform searches in an array using hash tables. Some hash tables give expected O(1) lookups, while others (say, cuckoo hashing) give worst-case O(1) lookups.

Moving outside the comparison model, there are algorithms that, subject to certain assumptions, have expected runtimes lower than O(log n). For example, interpolation search can find items in sorted arrays in expected time O(log log n), provided that the data are sampled from a uniform distribution. It works by making numeric estimates of where in the sequence to choose the next item to probe, and works well in practice.
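A minimal sketch of interpolation search (assuming a sorted array of numbers): rather than probing the midpoint, it uses the endpoint values to make a linear estimate of where the target should sit.

```python
# Interpolation search: probe where the target "should" be if the
# values were spread uniformly between the current endpoints.
# Expected O(log log n) probes on uniformly distributed data;
# degrades to O(n) on adversarial inputs.

def interpolation_search(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi and arr[lo] <= target <= arr[hi]:
        if arr[lo] == arr[hi]:
            return lo if arr[lo] == target else None
        # Linear estimate of the target's position between the endpoints.
        pos = lo + (target - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[pos] == target:
            return pos
        elif arr[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return None

arr = [2, 11, 19, 27, 36, 44, 58, 63, 71, 88, 94]
print(interpolation_search(arr, 58))   # -> 6
```

This doesn't contradict the lower bound above, because the probe positions depend on the numeric values themselves, not just on yes/no comparison outcomes.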

On the more theoretical side of things, fusion trees can perform searches in time O(log n / log w), where w is the machine word size, provided that the values are integers that fit into a single machine word. This can be improved to the surprising runtime of O(sqrt(log n / log log n)). It's also known that this is essentially the best possible: if the n values each fit into a single machine word, the predecessor lower bound says you can't do better than the (very unusual) bound of Ω(min{log w / log log w, sqrt(log n / log log n)}). These algorithms beat the Ω(log n) comparison lower bound by making multiple comparisons in parallel using creative operations on individual machine words.
