If the data is sorted, you can collect the top 100 in O(n)
where n
is the data’s size. Because the data is sorted, the distinct values are contiguous. Counting them while traversing the data once gives you the global frequency, which is not available to you when the data is not sorted.
See the sample code below on how this can be done. There is also an implementation (in Kotlin) of the entire approach on GitHub
Note: Sorting is not required. What is required is that distinct values are contiguous and so there is no need for ordering to be defined – we get this from sorting but perhaps there is a way of doing this more efficiently.
You can sort the data file using (external) merge sort in roughly O(n log n)
by splitting the input data file into smaller files that fit into your memory, sorting and writing them out into sorted files then merging them.
About this code sample:
-
Sorted data is represented by a
long[]
. Because the logic reads values one by one, it’s an OK approximation of reading the data from a sorted file. -
The OP didn’t specify how multiple values with equal frequency should be treated; consequently, the code doesn’t do anything beyond ensuring that the result is top N values in no particular order and not implying that there aren’t other values with the same frequency.
import java.util.*;
import java.util.Map.Entry;
class TopN {
private final int maxSize;
private Map<Long, Long> countMap;
public TopN(int maxSize) {
this.maxSize = maxSize;
this.countMap = new HashMap(maxSize);
}
private void addOrReplace(long value, long count) {
if (countMap.size() < maxSize) {
countMap.put(value, count);
} else {
Optional<Entry<Long, Long>> opt = countMap.entrySet().stream().min(Entry.comparingByValue());
Entry<Long, Long> minEntry = opt.get();
if (minEntry.getValue() < count) {
countMap.remove(minEntry.getKey());
countMap.put(value, count);
}
}
}
public Set<Long> get() {
return countMap.keySet();
}
public void process(long[] data) {
long value = data[0];
long count = 0;
for (long current : data) {
if (current == value) {
++count;
} else {
addOrReplace(value, count);
value = current;
count = 1;
}
}
addOrReplace(value, count);
}
public static void main(String[] args) {
long[] data = {0, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7};
TopN topMap = new TopN(2);
topMap.process(data);
System.out.println(topMap.get()); // [5, 6]
}
}