How does TensorFlow SparseCategoricalCrossentropy work?

SparseCategoricalCrossentropy and CategoricalCrossentropy both compute categorical cross-entropy. The only difference is in how the targets/labels are encoded. When using SparseCategoricalCrossentropy, the targets are represented by the index of the category (starting from 0). Your outputs have shape 4×2, which means you have two categories. Therefore, the targets should be a 4-dimensional vector with … Read more
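
A minimal sketch of the distinction, with made-up logits for 4 samples and 2 classes (matching the 4×2 shape from the question); both losses give the same value once the labels are encoded to match:

```python
import tensorflow as tf

# Made-up logits for 4 samples and 2 classes (shape 4x2).
logits = tf.constant([[2.0, 1.0],
                      [0.5, 3.0],
                      [1.2, 0.3],
                      [0.1, 2.5]])

# Sparse encoding: one integer class index per sample, shape (4,).
sparse_labels = tf.constant([0, 1, 0, 1])
sparse_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
print(sparse_loss(sparse_labels, logits).numpy())

# One-hot encoding of the same targets, shape (4, 2), for CategoricalCrossentropy.
onehot_labels = tf.one_hot(sparse_labels, depth=2)
onehot_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
print(onehot_loss(onehot_labels, logits).numpy())  # same value as above
```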

How to understand masked multi-head attention in transformer

I had the very same question after reading the Transformer paper. I found no complete and detailed answer to it on the Internet, so I'll try to explain my understanding of Masked Multi-Head Attention. The short answer is that we need masking to make the training parallel, and the parallelization is good as it … Read more
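
To make the idea concrete, here is a minimal sketch (not from the original answer) of the look-ahead mask used in masked self-attention; seq_len and the attention scores below are made up:

```python
import tensorflow as tf

seq_len = 5

# Lower-triangular causal mask: position i may attend to positions <= i.
causal_mask = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)

# Made-up query-key attention scores for one head.
scores = tf.random.normal((seq_len, seq_len))

# Future positions get a large negative value, so softmax drives their
# weights to ~0. Every position can then be trained at once, in parallel,
# without any position seeing the tokens to its right.
masked_scores = scores + (1.0 - causal_mask) * -1e9
weights = tf.nn.softmax(masked_scores, axis=-1)
```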

ValueError: Tensor must be from the same graph as Tensor with Bidirectional RNN in Tensorflow

TensorFlow stores all operations in a computation graph. This graph defines which functions output to where, and it links everything together so that TensorFlow can follow the steps you have set up in the graph to produce your final output. If you try to feed a Tensor or operation from one graph into a … Read more
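
A minimal repro of the error under the TF1-style graph API the answer describes (the names here are made up); the fix is to build every op inside the same graph context:

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

graph_a = tf.Graph()
with graph_a.as_default():
    x = tf.placeholder(tf.float32, shape=(None, 10), name="x")

graph_b = tf.Graph()
with graph_b.as_default():
    w = tf.constant(2.0)
    try:
        y = x * w  # x lives in graph_a, w in graph_b
    except ValueError as e:
        print(e)  # "... must be from the same graph as Tensor ..."

# Fix: create the placeholders, RNN cells, and all other ops inside one
# `with graph.as_default():` block so they share a single graph.
```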

What is the difference between the predict and predict_on_batch methods of a Keras model?

The difference shows up when you pass, as x, data that is larger than one batch. predict goes through all the data, batch by batch, predicting labels; it internally does the splitting into batches and feeds one batch at a time. predict_on_batch, on the other hand, assumes that the data you pass in … Read more
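
A minimal sketch of the behavioural difference (the model and data are made up):

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
x = np.random.rand(1000, 4).astype("float32")

# predict splits x into batches internally (32 samples at a time here)
# and concatenates the per-batch outputs.
preds = model.predict(x, batch_size=32)

# predict_on_batch runs all 1000 samples as a single batch in one
# forward pass, so the whole array must fit in memory at once.
preds_batch = model.predict_on_batch(x)

print(preds.shape, preds_batch.shape)  # both (1000, 1)
```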

Is there any way to get variable importance with Keras?

*Edited to include relevant code to implement permutation importance. I answered a similar question at Feature Importance Chart in neural network using Keras in Python. It implements what Teque5 mentioned above, namely shuffling the variable among your samples, i.e. permutation importance, using the ELI5 package.

from keras.wrappers.scikit_learn import KerasClassifier, KerasRegressor
import eli5
from eli5.sklearn … Read more
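
For reference, here is a framework-agnostic sketch of the same idea (not the answer's ELI5 code): shuffle one feature column at a time and measure how much a higher-is-better score, such as accuracy, drops. `model`, `metric`, and the data are placeholders:

```python
import numpy as np

def permutation_importance(model, x, y, metric, n_repeats=5):
    """Shuffle one column at a time; important features cause a big score drop."""
    baseline = metric(y, model.predict(x))
    importances = np.zeros(x.shape[1])
    rng = np.random.default_rng(0)
    for col in range(x.shape[1]):
        scores = []
        for _ in range(n_repeats):
            x_perm = x.copy()
            rng.shuffle(x_perm[:, col])  # break the feature-target link
            scores.append(metric(y, model.predict(x_perm)))
        # Positive value = score got worse without this feature's signal.
        importances[col] = baseline - np.mean(scores)
    return importances
```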

Understanding accumulated gradients in PyTorch

You are not actually accumulating gradients. Just leaving off optimizer.zero_grad() has no effect if you have a single .backward() call, as the gradients are already zero to begin with (technically None, but they will be automatically initialised to zero). The only difference between your two versions is how you calculate the final loss. The for … Read more
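
For contrast, a minimal sketch of what true gradient accumulation looks like in PyTorch (the model, data, and accum_steps are made up):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
accum_steps = 4

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(8, 10)                     # made-up mini-batch
    y = torch.randn(8, 1)
    loss = loss_fn(model(x), y) / accum_steps  # scale so sums match one big batch
    loss.backward()                            # .grad buffers add up across calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # update with accumulated gradient
        optimizer.zero_grad()                  # reset only after the update
```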