Large memory usage and occasional CPU spikes are almost certainly the GC kicking in. You can check whether this is the case with RTS options such as -B, which makes GHC beep whenever there is a major collection; -t, which prints statistics after the fact (in particular, see whether the GC times are really long); or -Dg, which turns on debugging output for GC calls (though you need to compile with -debug).
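If you would rather read the numbers from inside the program than from the -t output, here is a minimal sketch using GHC.Stats from base. The Map-based table is only a hypothetical stand-in for your real data set, and the helper name nsToMs is mine; the program must be run with +RTS -T so that the RTS actually collects statistics.

```haskell
-- Build: ghc -O2 -rtsopts GcStats.hs
-- Run:   ./GcStats +RTS -T -RTS   (-T collects statistics without printing them)
module Main (main) where

import Control.Monad (unless)
import Data.Int (Int64)
import qualified Data.Map.Strict as Map
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)
import System.Exit (die)

-- | Convert nanoseconds to whole milliseconds (rounds towards zero for
-- the non-negative durations reported by the RTS).
nsToMs :: Int64 -> Int64
nsToMs ns = ns `div` 1000000

main :: IO ()
main = do
  enabled <- getRTSStatsEnabled
  unless enabled $ die "Run with +RTS -T -RTS to enable RTS statistics."

  -- Hypothetical stand-in for the large, mostly static data set.
  let table = Map.fromList [(k, k * 2) | k <- [1 .. 1000000 :: Int]]
  print (Map.size table)

  -- Snapshot of the cumulative RTS counters at this point in the run.
  stats <- getRTSStats
  putStrLn $ "total collections:    " ++ show (gcs stats)
  putStrLn $ "major collections:    " ++ show (major_gcs stats)
  putStrLn $ "GC elapsed (ms):      " ++ show (nsToMs (gc_elapsed_ns stats))
  putStrLn $ "mutator elapsed (ms): " ++ show (nsToMs (mutator_elapsed_ns stats))
  putStrLn $ "max live bytes:       " ++ show (max_live_bytes stats)
```

If GC elapsed time is a large fraction of mutator time, or the number of major collections keeps climbing while the data is not changing, that supports the GC theory. Rerunning with different RTS flags and comparing these numbers is a cheap way to see whether a given setting helps.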
There are several things you can do to alleviate this problem:
- On the initial import of the data, GHC wastes a lot of time growing the heap. You can tell it to grab all of the memory you need at once by specifying a large -H.
- A large heap with stable data will get promoted to an old generation. If you increase the number of generations with -G, you may be able to get the stable data to settle in the oldest, very rarely collected generation, while the more traditional young and old generations sit above it.
- Depending on the memory usage of the rest of the application, you can use -F to tweak how much GHC will let the old generation grow before collecting it again. You may be able to tune this parameter so that the stable data is effectively never collected.
- If there are no writes, and you have a well-defined interface, it may be worthwhile to make this memory unmanaged by GHC (use the C FFI) so that there is no chance of a super-GC ever.
These are all speculation, so please test with your particular application.
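To make the last suggestion concrete, here is a minimal sketch of keeping read-only data in C-malloc'd memory, which the GHC collector never scans or copies. The names OffHeapTable, withOffHeapTable and lookupAt are hypothetical, and a flat array of Int64 is an assumption; your real data will need its own layout and Storable instances.

```haskell
module Main (main) where

import Control.Exception (bracket)
import Data.Int (Int64)
import Foreign.Marshal.Alloc (free)
import Foreign.Marshal.Array (mallocArray, pokeArray)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekElemOff)

-- | A read-only table of Int64 values living in malloc'd memory,
-- outside the GHC heap.
data OffHeapTable = OffHeapTable
  { tableSize :: !Int
  , tablePtr  :: !(Ptr Int64)
  }

-- | Copy a list into malloc'd memory, run the action, then free the buffer.
-- The pointer must not escape the action, since it is freed afterwards.
withOffHeapTable :: [Int64] -> (OffHeapTable -> IO a) -> IO a
withOffHeapTable xs action = bracket acquire (free . tablePtr) action
  where
    n = length xs
    acquire = do
      p <- mallocArray n   -- allocated with C malloc, invisible to the GC
      pokeArray p xs       -- one-time copy of the data out of the GHC heap
      pure (OffHeapTable n p)

-- | Bounds-checked read of element i.
lookupAt :: OffHeapTable -> Int -> IO (Maybe Int64)
lookupAt (OffHeapTable n p) i
  | i < 0 || i >= n = pure Nothing
  | otherwise       = Just <$> peekElemOff p i

main :: IO ()
main =
  withOffHeapTable [k * k | k <- [0 .. 999999]] $ \table -> do
    lookupAt table 12 >>= print    -- prints: Just 144
    lookupAt table (-1) >>= print  -- prints: Nothing
```

Note that this sketch still builds the source list on the GHC heap before copying it, so for a really large data set you would want to fill the buffer incrementally (or mmap a file) instead. The bracket-style interface is one design choice among several; a ForeignPtr with a finalizer is another if the table needs to outlive a single scope.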