What’s the current state-of-the-art suffix array construction algorithm?

Question

Currently, the best known suffix array constructor is libdivsufsort, by Yuta Mori:
http://code.google.com/p/libdivsufsort/

It uses the induced sorting approach: basically, once all suffixes starting with “A*” are sorted, the order of the suffixes starting with “BA*”, “CA*”, “DA*”, etc. can be induced from that result instead of being sorted from scratch.
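To make the induction step concrete, here is a minimal sketch in Python. It is not libdivsufsort's actual code, only an illustration of the principle: the function names `sorted_suffixes_with_prefix` and `induce` are hypothetical, and the already-sorted group is produced by a naive sort purely for demonstration.

```python
def sorted_suffixes_with_prefix(text, prefix):
    """Naive reference: directly sort the suffixes that start with `prefix`."""
    starts = [i for i in range(len(text)) if text.startswith(prefix, i)]
    return sorted(starts, key=lambda i: text[i:])


def induce(text, sorted_a, b):
    """Given the start positions of the suffixes beginning with some
    character a, in sorted order, return the start positions of the
    suffixes beginning with b followed by a, also in sorted order.

    No comparison is needed: two suffixes that share the first character b
    are ordered exactly like the suffixes that follow that character, and
    those are already sorted in `sorted_a`.
    """
    return [j - 1 for j in sorted_a if j > 0 and text[j - 1] == b]


text = "mississippi"
sorted_i = sorted_suffixes_with_prefix(text, "i")   # suffixes starting with "i"
induced_si = induce(text, sorted_i, "s")            # suffixes starting with "si"
print(sorted_i, induced_si)
```

The induced list comes out of a single left-to-right scan, which is why induced sorting is so much cheaper than sorting every group of suffixes independently.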

It is widely praised for its efficiency and its graceful handling of degenerate cases. It is also the fastest in the benchmarks linked below, and uses an optimal amount of memory (5N bytes for an input of N bytes). The license is unobtrusive, so the algorithm has been integrated into several other programs, for example the libbsc compression library by Ilya Grebnov.
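The 5N figure can be checked with simple arithmetic, assuming one byte per input character and a 32-bit (4-byte) index per suffix. The helper below is a hypothetical illustration, not part of any library, and it ignores the small constant overhead for working buffers.

```python
def sa_memory_bytes(n, index_bytes=4):
    """Rough memory needed to hold a text of n bytes plus its suffix array.

    Assumes 1 byte per input character and `index_bytes` per suffix index.
    """
    text_bytes = n                      # the input itself
    index_array_bytes = index_bytes * n  # one index per suffix
    return text_bytes + index_array_bytes


n = 100 * 1024 * 1024                   # a 100 MiB input
ratio = sa_memory_bytes(n) / n          # bytes of memory per input byte
print(ratio)
```

With 64-bit indices (`index_bytes=8`), which are needed once the input exceeds the range of a 32-bit index, the same calculation gives 9N.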
http://libbsc.com/default.aspx

For comparison purposes, you will find suffix array construction benchmarks linked on this page:
http://code.google.com/p/libdivsufsort/wiki/SACA_Benchmarks
and on this page:
http://homepage3.nifty.com/wpage/benchmark/index.html

The benchmarks also list many other worthy SACA implementations.
Nevertheless, for both license and efficiency reasons, I would recommend libdivsufsort among them.

Edit: Apparently, MSufSort is heading towards version 4 soon, and is supposed to become considerably faster than divsufsort. If that is right, it would become the new SACA champion. Still, there is no release date yet, and it will be alpha-quality code at first. So if you need a proven implementation now, libdivsufsort remains the best choice.

Note also that these “best SACA implementations” do not use a single construction algorithm, but combine several optimisation tricks, which makes them difficult to summarize.
