Memory Efficient Alternatives to Python Dictionaries

[*]

Some measurements. I took 10MB of free e-book text and computed trigram frequencies, producing a 24MB file. Storing it in different simple Python data structures took this much space in kB, measured as RSS from running ps, where d is a dict, keys and freqs are lists, and a,b,c,freq are the fields of a trigram record:

295760     S. Lott's answer
237984     S. Lott's with keys interned before passing in
203172 [*] d[(a,b,c)] = int(freq)
203156     d[a][b][c] = int(freq)
189132     keys.append((a,b,c)); freqs.append(int(freq))
146132     d[intern(a),intern(b)][intern(c)] = int(freq)
145408     d[intern(a)][intern(b)][intern(c)] = int(freq)
 83888 [*] d[a+' '+b+' '+c] = int(freq)
 82776 [*] d[(intern(a),intern(b),intern(c))] = int(freq)
 68756     keys.append((intern(a),intern(b),intern(c))); freqs.append(int(freq))
 60320     keys.append(a+' '+b+' '+c); freqs.append(int(freq))
 50556     pair array
 48320     squeezed pair array
 33024     squeezed single array

The entries marked [*] have no efficient way to look up a pair (a,b); they’re listed only because others have suggested them (or variants of them). (I was sort of irked into making this because the top-voted answers were not helpful, as the table shows.)

‘Pair array’ is the scheme below in my original answer (“I’d start with the array with keys
being the first two words…”), where the value table for each pair is
represented as a single string. ‘Squeezed pair array’ is the same,
leaving out the frequency values that are equal to 1 (the most common
case). ‘Squeezed single array’ is like squeezed pair array, but gloms key and value together as one string (with a separator character). The squeezed single array code:

import collections

def build(file):
    pairs = collections.defaultdict(list)
    for line in file:  # N.B. file assumed to be already sorted
        a, b, c, freq = line.split()
        key = ' '.join((a, b))
        pairs[key].append(c + ':' + freq if freq != '1' else c)
    out = open('squeezedsinglearrayfile', 'w')
    for key in sorted(pairs.keys()):
        out.write('%s|%s\n' % (key, ' '.join(pairs[key])))

def load():
    return open('squeezedsinglearrayfile').readlines()

if __name__ == '__main__':
    build(open('freqs'))

I haven’t written the code to look up values from this structure (use bisect, as mentioned below), or implemented the fancier compressed structures also described below.

Original answer: A simple sorted array of strings, each string being a space-separated concatenation of words, searched using the bisect module, should be worth trying for a start. This saves space on pointers, etc. It still wastes space due to the repetition of words; there’s a standard trick to strip out common prefixes, with another level of index to get them back, but that’s rather more complex and slower. (The idea is to store successive chunks of the array in a compressed form that must be scanned sequentially, along with a random-access index to each chunk. Chunks are big enough to compress, but small enough for reasonable access time. The particular compression scheme applicable here: if successive entries are ‘hello george’ and ‘hello world’, make the second entry be ‘6world’ instead. (6 being the length of the prefix in common.) Or maybe you could get away with using zlib? Anyway, you can find out more in this vein by looking up dictionary structures used in full-text search.) So specifically, I’d start with the array with keys being the first two words, with a parallel array whose entries list the possible third words and their frequencies. It might still suck, though — I think you may be out of luck as far as batteries-included memory-efficient options.

Also, binary tree structures are not recommended for memory efficiency here. E.g., this paper tests a variety of data structures on a similar problem (unigrams instead of trigrams though) and finds a hashtable to beat all of the tree structures by that measure.

I should have mentioned, as someone else did, that the sorted array could be used just for the wordlist, not bigrams or trigrams; then for your ‘real’ data structure, whatever it is, you use integer keys instead of strings — indices into the wordlist. (But this keeps you from exploiting common prefixes except in the wordlist itself. Maybe I shouldn’t suggest this after all.)

Leave a Comment Cancel reply