Tensorflow 2.0 / Keras - LSTM vs GRU Hidden States
I was going through the Neural Machine Translation with Attention tutorial for Tensorflow 2.0. Having gone through the verbal and visual explanations by Jalammar and also a plethora of other sites, I decided it was time to get my hands dirty with actual Tensorflow code.
I had previously done a bit of coding related to CNNs for my Final Year Project, but this was my first experience with RNNs. I had learnt the theory of RNNs and its mainstream relatives, LSTM and GRU, but this was my first time looking at the code. The first thing that got me stumped was the hidden states in the Encoder class.
The key here is the pair of arguments, return_sequences and return_state, which were both set to True. These let us retrieve the full output sequence and the final hidden state. Initially, I thought that the output and the hidden state were the same thing. That was because my knowledge of LSTM and GRU was lacking.
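As a sketch of what the tutorial's Encoder looks like (the hyperparameter values I pass in below are placeholders of my own, not the tutorial's), the GRU is created with both flags set:

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # return_sequences=True -> emit the output for every timestep
        # return_state=True    -> also emit the final hidden state separately
        self.gru = tf.keras.layers.GRU(self.enc_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))

# Placeholder sizes, just to show the shapes that come out.
encoder = Encoder(vocab_size=100, embedding_dim=8, enc_units=16, batch_sz=4)
sample_input = tf.zeros((4, 5), dtype=tf.int32)  # batch of 4 sequences, 5 tokens each
sample_output, sample_state = encoder(sample_input, encoder.initialize_hidden_state())
print(sample_output.shape)  # (4, 5, 16) -- one output per timestep
print(sample_state.shape)   # (4, 16)    -- the final hidden state
```

Note that the GRU returns exactly two things here: the per-timestep outputs (because return_sequences=True) and the final hidden state (because return_state=True).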
As we can see from those diagrams, an LSTM has two state outputs: the cell state on top and the hidden state at the bottom. A GRU, meanwhile, has only one: the hidden state. After going through Keras' documentation (because Tensorflow 2.0's documentation has not been fully updated), I found that it states:
Output shape
- if return_state: a list of tensors. The first tensor is the output. The remaining tensors are the last states, each with shape (batch_size, units). For example, the number of state tensors is 1 (for RNN and GRU) or 2 (for LSTM).
- if return_sequences: 3D tensor with shape (batch_size, timesteps, units).
A minimal example is available at Understand the Difference Between Return Sequences and Return States for LSTMs in Keras by Jason Brownlee. The code below is extracted from the post linked above and is for LSTM.
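Adapted into tf.keras form, the LSTM example from that post looks roughly like this:

```python
import numpy as np
import tensorflow as tf

# A single LSTM unit over a sequence of 3 timesteps with 1 feature each.
inputs = tf.keras.Input(shape=(3, 1))
lstm_out, state_h, state_c = tf.keras.layers.LSTM(
    1, return_sequences=True, return_state=True)(inputs)
model = tf.keras.Model(inputs=inputs, outputs=[lstm_out, state_h, state_c])

data = np.array([0.1, 0.2, 0.3]).reshape((1, 3, 1))
seq, h, c = model.predict(data)
print(seq.shape, h.shape, c.shape)  # (1, 3, 1) (1, 1) (1, 1)
```

The model returns three arrays: the hidden state at every timestep, the final hidden state, and the final cell state.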
The output for that piece of code is three arrays: the sequence of hidden states across all timesteps, the final hidden state, and the final cell state (the exact values depend on the randomly initialised weights).
However, when we use a GRU, keep in mind that it has no cell state; it only has a hidden state.
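The same experiment with a GRU, as a sketch, needs only two output tensors on the left-hand side:

```python
import numpy as np
import tensorflow as tf

inputs = tf.keras.Input(shape=(3, 1))
# Only two return values here: a GRU has no cell state.
gru_out, state_h = tf.keras.layers.GRU(
    1, return_sequences=True, return_state=True)(inputs)
model = tf.keras.Model(inputs=inputs, outputs=[gru_out, state_h])

data = np.array([0.1, 0.2, 0.3]).reshape((1, 3, 1))
seq, h = model.predict(data)
print(seq.shape, h.shape)  # (1, 3, 1) (1, 1)
# The final hidden state is the last step of the sequence output.
print(np.allclose(seq[:, -1, :], h))  # True
```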
The output for that piece of code is two arrays: the sequence of hidden states across all timesteps and the final hidden state.
Notice that the second array is the final hidden state, which is identical to the last value of the first array. The GRU output also has one fewer array than the LSTM's, because a GRU has no cell state.
Now, pardon me as I get back to completing the tutorial.