Good algorithm and data structure for looking up words with missing letters?

Question

I believe in this case it is best to just use a flat file with one word per line. That lets you use the power of a regular expression search, which is highly optimized and will probably beat any data structure you can devise yourself for this problem.

Solution #1: Using Regex

This is working Ruby code for this problem:

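# Yields every word in data (one word per line) that matches str,
# where "?" stands for exactly one unknown letter.
def query(str, data)
  # "?" becomes the regex wildcard "."; in Ruby, ^ and $ match at line boundaries
  r = Regexp.new("^#{str.gsub("?", ".")}$")
  idx = 0
  begin
    idx = data.index(r, idx)
    if idx
      yield data[idx, str.size]
      idx += str.size + 1  # skip past the match and its newline
    end
  end while idx
end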

start_time = Time.now
query("?r?te", File.read("wordlist.txt")) do |w|
  puts w
end
puts Time.now - start_time

The file wordlist.txt contains 45425 words (downloadable here). The program’s output for query ?r?te is:

brute
crate
Crete
grate
irate
prate
write
wrote
0.013689

So it takes just about 14 milliseconds to both read the whole file and find all matches in it. It also scales very well to all kinds of query patterns, even ones where a trie is very slow:

query ????????????????e

counterproductive
indistinguishable
microarchitecture
microprogrammable
0.018681

query ?h?a?r?c?l?

theatricals
0.013608

This looks fast enough for me.
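
One caveat with the code above is that only "?" is translated, so a query containing other regex characters (such as "." or "+") would be interpreted as regex syntax. The following is a minimal sketch of a variant that escapes the literal parts and collects matches with String#scan. The name query_scan and the inline sample words are assumptions for illustration, not part of the original program, and it has not been timed against the word list.

```ruby
# Hypothetical helper: returns every word in `data` (one word per line)
# that matches `pattern`, where "?" stands for exactly one letter.
def query_scan(pattern, data)
  # Split on "?" (limit -1 keeps trailing empty parts), escape each literal
  # chunk, and rejoin with "." so only "?" acts as a wildcard.
  body = pattern.split("?", -1).map { |part| Regexp.escape(part) }.join(".")
  # In Ruby, ^ and $ match at line boundaries, so each word is tested whole.
  data.scan(/^#{body}$/)
end

# Small inline sample instead of wordlist.txt
words = "brute\ncrate\nCrete\ncreate\nwrite\nwrote\n"
puts query_scan("?r?te", words)
```

With this sample it prints the five five-letter matches and skips "create", because the anchors force the whole line to match.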

Solution #2: Regex with Prepared Data

If you want to go even faster, you can split the word list into one string per word length and search only the string that matches your query's length. Replace the last 5 lines with this code:

def query_split(str, data)
  query(str, data[str.length]) do |w|
    yield w
  end
end

# prepare data: one newline-separated string per word length
data = Hash.new("")
File.read("wordlist.txt").each_line do |w|
  data[w.length-1] += w  # w still ends with "\n", hence length-1
end

# use prepared data for query
start_time = Time.now
query_split("?r?te", data) do |w|
  puts w
end
puts Time.now - start_time

Building the data structure now takes about 0.4 seconds, but all queries are about 10 times faster (depending on the number of words with that length):

?r?te 0.001112 sec
?h?a?r?c?l? 0.000852 sec
????????????????e 0.000169 sec
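
The preparation step can also be written so that each bucket is appended to in place rather than rebuilt with += for every word. This is only a sketch under assumptions: build_length_buckets is a hypothetical name, the sample text is made up, and no timing is claimed for it. It also strips the line ending explicitly, so a final line without a trailing newline lands in the right bucket.

```ruby
# Hypothetical helper: group words by length, one newline-terminated
# string per length, so a query only scans words of its own size.
def build_length_buckets(text)
  # The block gives every new length its own mutable empty string.
  buckets = Hash.new { |hash, len| hash[len] = +"" }
  text.each_line do |line|
    word = line.chomp
    next if word.empty?
    buckets[word.length] << word << "\n"  # append in place
  end
  buckets
end

# Note that the last word has no trailing newline.
buckets = build_length_buckets("food\ngood\nbrute\nwrite")
puts buckets[5]
```

Each bucket has the same one-word-per-line layout as the original file, so it can be passed to query unchanged.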

Solution #3: One Big Hashtable (Updated Requirements)

Since you have changed your requirements, you can easily expand on your idea of using just one big hashtable that contains all precalculated results. Instead of working around collisions yourself, though, you can rely on the performance of a properly implemented hashtable.

Here I create one big hashtable, where each possible query maps to a list of its results:

def create_big_hash(data)
  h = Hash.new do |h,k|
    h[k] = Array.new
  end    
  data.each_line do |l|
    w = l.strip
    # add all words with one ?
    w.length.times do |i|
      q = String.new(w)
      q[i] = "?"
      h[q].push w
    end
    # add all words with two adjacent ?s ("??")
    (w.length-1).times do |i|
      q = String.new(w)      
      q[i, 2] = "??"
      h[q].push w
    end
  end
  h
end

# prepare data    
t = Time.new
h = create_big_hash(File.read("wordlist.txt"))
puts "#{Time.new - t} sec preparing data\n#{h.size} entries in big hash"

# use prepared data for query
t = Time.new
h["?ood"].each do |w|
  puts w
end
puts Time.new - t

Output is

4.960255 sec preparing data
616745 entries in big hash
food
good
hood
mood
wood
2.0e-05

Query performance is O(1), since it is just a lookup in the hashtable. The measured time of 2.0e-05 seconds is probably below the timer's precision; when running the query 1000 times, I get an average of 1.958e-6 seconds per query. To make it faster still, I would switch to C++ and use Google Sparse Hash, which is fast and extremely memory efficient.
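
Averaging over many runs, as described above, can be sketched with Ruby's standard Benchmark module. The small hash here is a stand-in for the big hash (sample data assumed for illustration), and the measured time will differ by machine and Ruby version.

```ruby
require "benchmark"

# Stand-in for the big hash built by create_big_hash (sample data only)
h = { "?ood" => %w[food good hood mood wood] }

runs = 1000
# Time all lookups together, then divide, so that timer resolution
# matters less than when timing a single lookup.
total = Benchmark.realtime do
  runs.times { h["?ood"] }
end
puts "#{total / runs} sec per query (average over #{runs} runs)"
```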

Solution #4: Get Really Serious

All of the above solutions work and should be good enough for many use cases. If you really want to get serious and have lots of spare time on your hands, read some good papers:

Tries for Approximate String Matching – If well implemented, tries can have very compact memory requirements (50% less space than the dictionary itself), and are very fast.
Agrep – A Fast Approximate Pattern-Matching Tool – Agrep is based on a new efficient and flexible algorithm for approximate string matching.
Google Scholar search for approximate string matching – More than enough to read on this topic.
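
For orientation, here is a very small trie with "?" wildcard lookup. It is a plain nested-hash sketch, not the compact representation from the paper above, and the names trie_insert and trie_query are hypothetical. It also shows why patterns that begin with wildcards are expensive for a trie: every branch has to be followed at each "?".

```ruby
# Minimal trie sketch: each node is a Hash from character to child node;
# the key :end marks that a complete word stops at this node.
def trie_insert(root, word)
  node = root
  word.each_char { |ch| node = (node[ch] ||= {}) }
  node[:end] = true
end

# Collect all words matching `pattern`, where "?" matches any one letter.
def trie_query(node, pattern, prefix = "", results = [])
  if pattern.empty?
    results << prefix if node[:end]
    return results
  end
  ch = pattern[0]
  rest = pattern[1..-1]
  if ch == "?"
    # Wildcard: descend into every child of this node.
    node.each do |key, child|
      next if key == :end
      trie_query(child, rest, prefix + key, results)
    end
  else
    child = node[ch]
    trie_query(child, rest, prefix + ch, results) if child
  end
  results
end

root = {}
%w[food good hood mood wood word].each { |w| trie_insert(root, w) }
puts trie_query(root, "?ood")
```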
