Longest equally-spaced subsequence

Question

We can have a solution O(n*m) in time with very little memory needs, by adapting yours. Here n is the number of items in the given input sequence of numbers, and m is the range, i.e. the highest number minus the lowest one.

Call A the sequence of all input numbers (and use a precomputed set() to answer in constant time the question “is this number in A?”). Call d the step of the subsequence we’re looking for (the difference between two numbers of this subsequence). For every possible value of d, do the following linear scan over all input numbers: for every number n from A in increasing order, if the number was not already seen, look forward in A for the length of the sequence starting at n with a step d. Then mark all items in that sequence as already seen, so that we avoid searching again from them, for the same d. Because of this, the complexity is just O(n) for every value of d.

A = [1, 4, 5, 7, 8, 12]    # in sorted order
Aset = set(A)

for d in range(1, 12):
    already_seen = set()
    for a in A:
        if a not in already_seen:
            b = a
            count = 1
            while b + d in Aset:
                b += d
                count += 1
                already_seen.add(b)
            print "found %d items in %d .. %d" % (count, a, b)
            # collect here the largest 'count'

Updates:

This solution might be good enough if you’re only interested in values of d that are relatively small; for example, if getting the best result for d <= 1000 would be good enough. Then the complexity goes down to O(n*1000). This makes the algorithm approximative, but actually runnable for n=1000000. (Measured at 400-500 seconds with CPython, 80-90 seconds with PyPy, with a random subset of numbers between 0 and 10’000’000.)
If you still want to search for the whole range, and if the common case is that long sequences exist, a notable improvement is to stop as soon as d is too large for an even longer sequence to be found.

Leave a Comment Cancel reply