Keras: difference between generator and Sequence

The two approaches are roughly equivalent. It is correct to subclass
Sequence when your dataset doesn't fit in memory, but you shouldn't run
any heavy one-off preprocessing inside the class's methods, because it
will be re-executed once per epoch and waste a lot of compute.

It is probably also easier to shuffle the samples themselves rather
than their indices, like this:

from random import shuffle

import numpy as np
from tensorflow.keras.utils import Sequence

class DataGen(Sequence):
    def __init__(self, batch_size, preproc, type, x_set, y_set):
        # Pair each sample with its label so a single shuffle keeps them aligned.
        self.samples = list(zip(x_set, y_set))
        self.batch_size = batch_size
        shuffle(self.samples)
        self.type = type
        self.preproc = preproc

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.samples) / self.batch_size))

    def __getitem__(self, i):
        batch = self.samples[i * self.batch_size:(i + 1) * self.batch_size]
        # zip(*batch) unzips the (x, y) pairs back into an x tuple and a y tuple.
        return self.preproc.process(*zip(*batch))

    def on_epoch_end(self):
        # Reshuffle between epochs so batch composition changes.
        shuffle(self.samples)
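
With that in place you can hand the Sequence straight to fit (or
fit_generator on older Keras versions); model, train, and dev here
stand in for whatever you already have:

train_gen = DataGen(batch_size, preproc, *train)
dev_gen = DataGen(batch_size, preproc, *dev)
model.fit(train_gen, validation_data=dev_gen, epochs=10)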

It is impossible to say for sure why you run out of memory without
knowing more about your data, but my guess would be that your preproc
function is the problem. You can check it by iterating over the
generators outside of training:

# assuming train and dev each unpack to (type, x_set, y_set)
for e in DataGen(batch_size, preproc, *train):
    print(e)
for e in DataGen(batch_size, preproc, *dev):
    print(e)

If preproc is the cause, these loops will run out of memory too, even
with no model involved, which tells you where to look.
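
If you want to watch memory grow batch by batch rather than wait for
the crash, something like this would work (a sketch using only the
standard library's tracemalloc):

import tracemalloc

tracemalloc.start()
for i, e in enumerate(DataGen(batch_size, preproc, *train)):
    current, peak = tracemalloc.get_traced_memory()
    # Steadily growing numbers mean batches are being kept alive somewhere.
    print(i, current // 1024, "KiB now,", peak // 1024, "KiB peak")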
