Finding median of large set of numbers too big to fit into memory

There’s a few potential solutions:

  • External merge sort – O(n log n)
    You basically sort the numbers on the first pass, then find the median on the second.
  • Order statistics distributed selection algorithm – O(n)
    Simplify the problem to the original problem of finding the kth number in an unsorted array.
  • Counting sort histogram O(n)
    You have to assume some properties about the range of the numbers – can the range fit in the memory?
  • If anything is known about the distribution of the numbers other
    algorithms can be produced.

For more details and implementation see:
http://www.fusu.us/2013/07/median-in-large-set-across-1000-servers.html

Leave a Comment

Hata!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)