How to correctly give inputs to Embedding, LSTM and Linear layers in PyTorch?

Your understanding of most of the concepts is accurate, but there are a few missing points here and there.

Interfacing embedding to LSTM (Or any other recurrent unit)

You have embedding output in the shape of (batch_size, seq_len, embedding_size). Now, there are various ways through which you can pass this to the LSTM.
* You can pass this directly to the LSTM if you tell the LSTM to expect batch-first input. So, while creating your LSTM, pass the argument batch_first=True.
* Or, you can pass the input in the shape of (seq_len, batch_size, embedding_size). To convert your embedding output to this shape, you’ll need to transpose the first and second dimensions using torch.transpose(tensor_name, 0, 1), as you mentioned. A short sketch of both options follows.
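
For example, a minimal sketch of both options might look like this (the vocabulary size, layer sizes and batch here are just placeholder values):

import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128)
tokens = torch.randint(0, 10000, (32, 20))       # (batch_size, seq_len)
embeds = embedding(tokens)                       # (batch_size, seq_len, embedding_size)

# Option 1: tell the LSTM to expect batch-first input
lstm_bf = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
out_bf, _ = lstm_bf(embeds)                      # (batch_size, seq_len, lstm_size)

# Option 2: transpose to the default (seq_len, batch_size, embedding_size) layout
lstm = nn.LSTM(input_size=128, hidden_size=256)
out, _ = lstm(torch.transpose(embeds, 0, 1))     # (seq_len, batch_size, lstm_size)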

Q. I see many examples online which do something like x = embeds.view(len(sentence), self.batch_size , -1) which confuses me.
A. This is wrong. It will mix up the batches and you will be trying to learn a hopeless learning task. Wherever you see this, you can tell the author to change the statement and use transpose instead.
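
To see why, here is a tiny illustration (the tensor sizes are arbitrary):

import torch

x = torch.arange(2 * 3 * 4).view(2, 3, 4)  # pretend (batch_size=2, seq_len=3, embedding_size=4)

ok = x.transpose(0, 1)   # (seq_len, batch_size, embedding_size); each example's timesteps stay together
bad = x.view(3, 2, 4)    # same shape, but timesteps from different examples are now scrambled across the batch dimension

print(torch.equal(ok, bad))  # False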

There is an argument in favor of not using batch_first: the underlying API provided by Nvidia CUDA runs considerably faster with the batch as the second dimension.

Using context size

If you directly feed the embedding output to the LSTM, this fixes the input context size of the LSTM to 1. This means that if your input is words to the LSTM, you will always be giving it one word at a time. But this is not what we want all the time. So, you need to expand the context size. This can be done as follows –

# Assuming that embeds is the embedding output and context_size is a defined variable
embeds = embeds.unfold(1, context_size, 1)  # Keeping the step size to be 1
embeds = embeds.contiguous().view(embeds.size(0), embeds.size(1), -1)  # contiguous() is needed because unfold returns a non-contiguous view

See the unfold documentation for details.
Now, you can proceed as mentioned above to feed this to the LSTM; just remember that seq_len is now changed to seq_len - context_size + 1 and embedding_size (which is the input size of the LSTM) is now changed to context_size * embedding_size.
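
As a quick sanity check of those shapes (with made-up sizes):

import torch

# With batch_size=4, seq_len=10, embedding_size=8 and context_size=3:
embeds = torch.randn(4, 10, 8)
embeds = embeds.unfold(1, 3, 1)                                        # (4, 10 - 3 + 1, 8, 3)
embeds = embeds.contiguous().view(embeds.size(0), embeds.size(1), -1)
print(embeds.shape)                                                    # torch.Size([4, 8, 24]); 24 = context_size * embedding_size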

Using variable sequence lengths

The input size of different instances in a batch will not always be the same. For example, some of your sentences might be 10 words long, some might be 15, and some might be 1000. So, you definitely want variable-length sequence input to your recurrent unit. To do this, there are some additional steps that need to be performed before you can feed your input to the network. You can follow these steps –
1. Sort your batch from largest sequence to the smallest.
2. Create a seq_lengths array that defines the length of each sequence in the batch. (This can be a simple python list)
3. Pad all the sequences to be of equal length to the largest sequence.
4. Create a LongTensor Variable of this batch.
5. Now, after passing the above variable through the embedding layer and creating the proper context-size input, you’ll need to pack your sequence as follows (a small sketch of steps 1–4 is given after the packing call) –

# Assuming embeds to be the proper input to the LSTM
lstm_input = nn.utils.rnn.pack_padded_sequence(embeds, [x - context_size + 1 for x in seq_lengths], batch_first=False)
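
For completeness, steps 1–4 above could look like this minimal sketch (the token indices and the use of 0 as the padding index are just assumptions for illustration):

import torch

batch = [[4, 7, 2, 9, 5], [3, 8, 1], [6, 2, 4, 1]]      # token indices, variable lengths

# 1. Sort from the largest sequence to the smallest
batch = sorted(batch, key=len, reverse=True)

# 2. Record the length of each sequence
seq_lengths = [len(seq) for seq in batch]

# 3. Pad every sequence up to the length of the largest one (0 used as the pad index here)
max_len = seq_lengths[0]
padded = [seq + [0] * (max_len - len(seq)) for seq in batch]

# 4. Create a LongTensor of the padded batch, shaped (batch_size, max_len)
batch_tensor = torch.LongTensor(padded)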

Understanding output of LSTM

Now, once you have prepared your lstm_input according to your needs, you can call the lstm as

lstm_outs, (h_t, h_c) = lstm(lstm_input, (h_t, h_c))

Here, (h_t, h_c) needs to be provided as the initial hidden state, and the lstm will output the final hidden state. You can see why packing variable-length sequences is required; otherwise the LSTM will run over the non-required padded words as well.
Now, lstm_outs will be a packed sequence which is the output of the lstm at every step, and (h_t, h_c) are the final hidden state and the final cell state respectively. h_t and h_c will be of shape (num_layers * num_directions, batch_size, lstm_size), i.e. (1, batch_size, lstm_size) for a single-layer, unidirectional LSTM. You can use these directly for further input, but if you want to use the intermediate outputs as well you’ll need to unpack the lstm_outs first as below

lstm_outs, _ = nn.utils.rnn.pad_packed_sequence(lstm_outs)

Now, your lstm_outs will be of shape (max_seq_len - context_size + 1, batch_size, lstm_size). From this, you can extract the intermediate outputs of the lstm according to your need.

Remember that the unpacked output will have 0s after the actual length of each sequence, which is just padding to match the length of the largest sequence (which is always the first one, as we sorted the input from largest to smallest).

Also note that h_t will always be equal to the last valid element of each sequence’s output.
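
Putting this section together, a small sketch might look like the following. Here lstm_input is the packed sequence constructed above; the layer sizes, batch_size=4, and the zero-initialized hidden state are all just assumptions for illustration (input_size matches the context_size * embedding_size from the earlier shape example).

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=24, hidden_size=64)
batch_size = 4

# Initial hidden and cell states: (num_layers * num_directions, batch_size, lstm_size)
h_t = torch.zeros(1, batch_size, 64)
h_c = torch.zeros(1, batch_size, 64)

lstm_outs, (h_t, h_c) = lstm(lstm_input, (h_t, h_c))

# Unpack to (max_seq_len - context_size + 1, batch_size, lstm_size); lengths holds each sequence's true length
lstm_outs, lengths = nn.utils.rnn.pad_packed_sequence(lstm_outs)

# The last valid output of each sequence equals the final hidden state
last_outs = lstm_outs[lengths - 1, torch.arange(batch_size)]   # (batch_size, lstm_size)
print(torch.allclose(last_outs, h_t[-1]))                      # True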

Interfacing lstm to linear

Now, if you want to use just the final output of the lstm, you can directly feed h_t to your linear layer and it will work. But if you want to use the intermediate outputs as well, you’ll need to figure out how you are going to input them to the linear layer (through some attention network or some pooling). You do not want to input the complete sequence to the linear layer, as different sequences will be of different lengths and you can’t fix the input size of the linear layer. And yes, you’ll need to transpose the output of the lstm to be used further (again, you cannot use view here).
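
As an illustration only, both routes could look like the sketch below. It reuses h_t, lstm_outs and lengths from the previous sketch, the number of classes (10) is a placeholder, and the length-aware mean is just one possible pooling choice.

import torch
import torch.nn as nn

linear = nn.Linear(64, 10)   # (lstm_size, num_classes), placeholder sizes

# Option 1: use just the final hidden state
logits = linear(h_t[-1])     # h_t[-1]: (batch_size, lstm_size) -> (batch_size, num_classes)

# Option 2: a simple length-aware mean pooling over the unpacked intermediate outputs
mask = (torch.arange(lstm_outs.size(0)).unsqueeze(1) < lengths.unsqueeze(0)).float()  # (seq_len, batch_size)
pooled = (lstm_outs * mask.unsqueeze(-1)).sum(0) / lengths.unsqueeze(1).float()       # (batch_size, lstm_size)
logits = linear(pooled)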

Ending Note: I have purposefully left out some points, such as using bidirectional recurrent cells, using a step size in unfold, and interfacing attention, as they can get quite cumbersome and would be out of the scope of this answer.
