How does TensorFlow SparseCategoricalCrossentropy work?

SparseCategoricalCrossentropy and CategoricalCrossentropy both compute categorical cross-entropy. The only difference is in how the targets/labels are encoded. When using SparseCategoricalCrossentropy, the targets are represented by the index of the category (starting from 0). Your outputs have shape 4×2, which means you have two categories. Therefore, the targets should be a 4-dimensional vector with … Read more
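
A minimal sketch of the distinction, with made-up logits for 4 samples and 2 classes (matching the 4×2 shape from the question); both losses give the same value once the labels are encoded to match:

```python
import tensorflow as tf

# Made-up logits for 4 samples and 2 classes (shape 4x2).
logits = tf.constant([[2.0, 1.0],
                      [0.5, 3.0],
                      [1.2, 0.3],
                      [0.1, 2.5]])

# Sparse encoding: one integer class index per sample, shape (4,).
sparse_labels = tf.constant([0, 1, 0, 1])
sparse_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
print(sparse_loss(sparse_labels, logits).numpy())

# One-hot encoding of the same targets, shape (4, 2), for CategoricalCrossentropy.
onehot_labels = tf.one_hot(sparse_labels, depth=2)
onehot_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
print(onehot_loss(onehot_labels, logits).numpy())  # same value as above
```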

How to understand masked multi-head attention in transformer

I had the very same question after reading the Transformer paper. I found no complete and detailed answer to it on the Internet, so I'll try to explain my understanding of Masked Multi-Head Attention. The short answer is that we need masking to make the training parallel, and the parallelization is good as it … Read more
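
To make the idea concrete, here is a minimal sketch (not from the original answer) of the look-ahead mask used in masked self-attention; seq_len and the attention scores below are made up:

```python
import tensorflow as tf

seq_len = 5

# Lower-triangular causal mask: position i may attend to positions <= i.
causal_mask = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)

# Made-up query-key attention scores for one head.
scores = tf.random.normal((seq_len, seq_len))

# Future positions get a large negative value, so softmax drives their
# weights to ~0. Every position can then be trained at once, in parallel,
# without any position seeing the tokens to its right.
masked_scores = scores + (1.0 - causal_mask) * -1e9
weights = tf.nn.softmax(masked_scores, axis=-1)
```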

ValueError: Tensor must be from the same graph as Tensor with Bidirectional RNN in Tensorflow

TensorFlow stores all operations in a computation graph. This graph defines which functions output to where, and it links everything together so that TensorFlow can follow the steps you have set up in the graph to produce your final output. If you try to feed a Tensor or operation from one graph into a … Read more
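
A minimal repro of the error under the TF1-style graph API the answer describes (the names here are made up); the fix is to build every op inside the same graph context:

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

graph_a = tf.Graph()
with graph_a.as_default():
    x = tf.placeholder(tf.float32, shape=(None, 10), name="x")

graph_b = tf.Graph()
with graph_b.as_default():
    w = tf.constant(2.0)
    try:
        y = x * w  # x lives in graph_a, w in graph_b
    except ValueError as e:
        print(e)  # "... must be from the same graph as Tensor ..."

# Fix: create the placeholders, RNN cells, and all other ops inside one
# `with graph.as_default():` block so they share a single graph.
```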

What is the difference between the predict and predict_on_batch methods of a Keras model?

The difference shows up when you pass, as x, data that is larger than one batch. predict goes through all the data, batch by batch, predicting labels; it internally does the splitting into batches and feeds one batch at a time. predict_on_batch, on the other hand, assumes that the data you pass in … Read more
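
A minimal sketch of the behavioural difference (the model and data are made up):

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
x = np.random.rand(1000, 4).astype("float32")

# predict splits x into batches internally (32 samples at a time here)
# and concatenates the per-batch outputs.
preds = model.predict(x, batch_size=32)

# predict_on_batch runs all 1000 samples as a single batch in one
# forward pass, so the whole array must fit in memory at once.
preds_batch = model.predict_on_batch(x)

print(preds.shape, preds_batch.shape)  # both (1000, 1)
```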

Is there any way to get variable importance with Keras?

*Edited to include relevant code to implement permutation importance. I answered a similar question at Feature Importance Chart in neural network using Keras in Python. It implements what Teque5 mentioned above, namely shuffling the variable among your samples, i.e. permutation importance, using the ELI5 package.

from keras.wrappers.scikit_learn import KerasClassifier, KerasRegressor
import eli5
from eli5.sklearn … Read more
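
For reference, here is a framework-agnostic sketch of the same idea (not the answer's ELI5 code): shuffle one feature column at a time and measure how much a higher-is-better score, such as accuracy, drops. `model`, `metric`, and the data are placeholders:

```python
import numpy as np

def permutation_importance(model, x, y, metric, n_repeats=5):
    """Shuffle one column at a time; important features cause a big score drop."""
    baseline = metric(y, model.predict(x))
    importances = np.zeros(x.shape[1])
    rng = np.random.default_rng(0)
    for col in range(x.shape[1]):
        scores = []
        for _ in range(n_repeats):
            x_perm = x.copy()
            rng.shuffle(x_perm[:, col])  # break the feature-target link
            scores.append(metric(y, model.predict(x_perm)))
        # Positive value = score got worse without this feature's signal.
        importances[col] = baseline - np.mean(scores)
    return importances
```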

Understanding accumulated gradients in PyTorch

You are not actually accumulating gradients. Just leaving off optimizer.zero_grad() has no effect if you have a single .backward() call, as the gradients are already zero to begin with (technically None, but they will be automatically initialised to zero). The only difference between your two versions is how you calculate the final loss. The for … Read more
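
For contrast, a minimal sketch of what true gradient accumulation looks like in PyTorch (the model, data, and accum_steps are made up):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
accum_steps = 4

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(8, 10)                     # made-up mini-batch
    y = torch.randn(8, 1)
    loss = loss_fn(model(x), y) / accum_steps  # scale so sums match one big batch
    loss.backward()                            # .grad buffers add up across calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # update with accumulated gradient
        optimizer.zero_grad()                  # reset only after the update
```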