How to use return_sequences option and TimeDistributed layer in Keras?

The LSTM layer and the TimeDistributed wrapper are two different ways to get the “many to many” relationship that you want. LSTM will eat the words of your sentence one by one; you can choose via “return_sequences” to output something (the state) at each step (after each word processed) or only output something after the … Read more
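A minimal sketch of how the two pieces combine in Keras, assuming toy sizes (timesteps=10, features=8, n_classes=5 are made up, not from the quoted answer): return_sequences=True makes the LSTM emit its state at every step, and TimeDistributed(Dense) then applies the same classifier to each of those steps.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, TimeDistributed

timesteps, features, n_classes = 10, 8, 5  # hypothetical sizes

model = Sequential([
    # return_sequences=True -> output shape (batch, timesteps, 32)
    LSTM(32, return_sequences=True, input_shape=(timesteps, features)),
    # TimeDistributed applies the same Dense layer at every timestep
    TimeDistributed(Dense(n_classes, activation="softmax")),
])
model.summary()  # final output shape: (None, 10, 5)
```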

In Keras, what exactly am I configuring when I create a stateful `LSTM` layer with N `units`?

You can check this question for further information, although it is based on the Keras 1.x API. Basically, units is the dimension of the inner cells in the LSTM, because in an LSTM the dimension of the inner cell (C_t and C_{t-1} in the graph), the output gate (o_t in the graph) and the hidden/output state (h_t in the graph) should … Read more
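A quick way to see what units controls, as a hedged sketch (the layer and input sizes here are arbitrary): both the hidden state h_t and the cell state C_t have dimension units, which is why the layer's output size equals units.

```python
import numpy as np
from tensorflow.keras.layers import LSTM

x = np.zeros((1, 7, 16), dtype="float32")   # (batch, timesteps, input features)

lstm = LSTM(64, return_state=True)          # units=64
output, h_t, c_t = lstm(x)

print(output.shape)  # (1, 64) - last hidden state
print(h_t.shape)     # (1, 64) - hidden state, dimension = units
print(c_t.shape)     # (1, 64) - cell state,   dimension = units
```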

How to stack multiple LSTMs in Keras?

You need to add return_sequences=True to the first layer so that its output tensor has ndim=3 (i.e. batch size, timesteps, hidden state). Please see the following example:

```python
# expected input data shape: (batch_size, timesteps, data_dim)
model = Sequential()
model.add(LSTM(32, return_sequences=True, input_shape=(timesteps, data_dim)))  # returns a sequence of vectors of dimension 32
model.add(LSTM(32, return_sequences=True))  # returns …
```

Read more
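Filling in the truncated example with a runnable sketch (the shapes and the final classification head are assumptions, not part of the quoted answer): every stacked LSTM except the last one needs return_sequences=True so the next layer still receives a 3-D tensor.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

timesteps, data_dim, num_classes = 8, 16, 10  # hypothetical sizes

model = Sequential([
    # intermediate layers must return full sequences: (batch, timesteps, 32)
    LSTM(32, return_sequences=True, input_shape=(timesteps, data_dim)),
    LSTM(32, return_sequences=True),
    # the last LSTM returns only the final hidden state: (batch, 32)
    LSTM(32),
    Dense(num_classes, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
model.summary()
```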

What’s the difference between a bidirectional LSTM and an LSTM?

An LSTM, at its core, preserves information from inputs that have already passed through it using the hidden state. A unidirectional LSTM only preserves information about the past, because the only inputs it has seen are from the past. Using a bidirectional LSTM runs your inputs in two ways, one from past to future and one from future … Read more
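As a hedged illustration (the layer sizes are arbitrary), in Keras a bidirectional LSTM is just the ordinary layer wrapped in Bidirectional, which runs one copy over the sequence forward in time and one backward, then concatenates the two hidden states.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Dense

model = Sequential([
    # forward and backward passes, each with 32 units; their outputs are
    # concatenated, so the result per timestep has dimension 64
    Bidirectional(LSTM(32, return_sequences=True), input_shape=(10, 8)),
    Bidirectional(LSTM(32)),   # final output: (batch, 64)
    Dense(1, activation="sigmoid"),
])
model.summary()
```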

What’s the difference between “hidden” and “output” in PyTorch LSTM?

I made a diagram. The names follow the PyTorch docs, although I renamed num_layers to w. output comprises all the hidden states in the last layer (“last” depth-wise, not time-wise). (h_n, c_n) comprises the hidden states after the last timestep, t = n, so you could potentially feed them into another LSTM. The batch dimension … Read more
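A small shape-only sketch of the difference (the sizes here are made up): output stacks the top layer's hidden state at every timestep, while h_n and c_n hold the final-timestep states for every layer.

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size, num_layers = 5, 3, 10, 20, 2
lstm = nn.LSTM(input_size, hidden_size, num_layers)

x = torch.randn(seq_len, batch, input_size)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # (5, 3, 20): all timesteps, last layer only
print(h_n.shape)     # (2, 3, 20): last timestep, every layer
print(c_n.shape)     # (2, 3, 20): last timestep, every layer

# the last timestep of output equals the top layer's h_n
print(torch.allclose(output[-1], h_n[-1]))  # True
```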

What is the intuition of using tanh in LSTM? [closed]

Sigmoid, specifically, is used as the gating function for the three gates (in, out, and forget) in an LSTM, since it outputs a value between 0 and 1, so it can allow either no flow or complete flow of information through the gates. On the other hand, to overcome the vanishing gradient problem, we need a … Read more
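A toy numpy sketch of the roles the two functions play (the vectors here are arbitrary, not from the quoted answer): sigmoid produces gate values in [0, 1] that scale how much information passes, while tanh squashes the candidate content and the exposed cell state into [-1, 1].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# arbitrary pre-activations for one timestep
z_forget, z_input, z_output, z_candidate = (np.random.randn(4) for _ in range(4))
c_prev = np.random.randn(4)  # previous cell state

f = sigmoid(z_forget)        # in [0, 1]: how much of c_prev to keep
i = sigmoid(z_input)         # in [0, 1]: how much new content to write
o = sigmoid(z_output)        # in [0, 1]: how much of the cell to expose
g = np.tanh(z_candidate)     # in [-1, 1]: candidate cell content

c_t = f * c_prev + i * g     # new cell state
h_t = o * np.tanh(c_t)       # new hidden state, kept in [-1, 1] by tanh
```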

Understanding Keras LSTMs

As a complement to the accepted answer, this answer shows Keras behaviors and how to achieve each picture. General Keras behavior: the standard Keras internal processing is always many-to-many, as in the following picture (where I used features=2, pressure and temperature, just as an example). In this image, I increased the number … Read more
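To make the "always many to many" point concrete, here is a hedged sketch (the batch size and number of steps are invented): Keras LSTM inputs always have the shape (batch, steps, features), so with features=2, such as pressure and temperature, each timestep is a length-2 vector, and return_sequences decides whether you keep the per-step outputs or only the last one.

```python
import numpy as np
from tensorflow.keras.layers import LSTM

batch, steps, features = 4, 20, 2            # features=2: pressure, temperature
x = np.random.rand(batch, steps, features).astype("float32")

many_to_many = LSTM(8, return_sequences=True)(x)   # (4, 20, 8): output per step
many_to_one = LSTM(8)(x)                           # (4, 8): only the last step

print(many_to_many.shape, many_to_one.shape)
```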