If there’s a mask in your model, it’ll be propagated layer-by-layer and eventually applied to the loss. So if you’re padding and masking the sequences in a correct way, the loss on the padding placeholders would be ignored.
Some Details:
It’s a bit involved to explain the whole process, so I’ll just break it down to several steps:
- In
compile()
, the mask is collected by callingcompute_mask()
and applied to the loss(es) (irrelevant lines are ignored for clarity).
weighted_losses = [_weighted_masked_objective(fn) for fn in loss_functions]
# Prepare output masks.
masks = self.compute_mask(self.inputs, mask=None)
if masks is None:
masks = [None for _ in self.outputs]
if not isinstance(masks, list):
masks = [masks]
# Compute total loss.
total_loss = None
with K.name_scope('loss'):
for i in range(len(self.outputs)):
y_true = self.targets[i]
y_pred = self.outputs[i]
weighted_loss = weighted_losses[i]
sample_weight = sample_weights[i]
mask = masks[i]
with K.name_scope(self.output_names[i] + '_loss'):
output_loss = weighted_loss(y_true, y_pred,
sample_weight, mask)
- Inside
Model.compute_mask()
,run_internal_graph()
is called. - Inside
run_internal_graph()
, the masks in the model is propagated layer-by-layer from the model’s inputs to outputs by callingLayer.compute_mask()
for each layer iteratively.
So if you’re using a Masking
layer in your model, you shouldn’t worry about the loss on the padding placeholders. The loss on those entries will be masked out as you’ve probably already seen inside _weighted_masked_objective()
.
A Small Example:
max_sentence_length = 5
character_number = 2
input_tensor = Input(shape=(max_sentence_length, character_number))
masked_input = Masking(mask_value=0)(input_tensor)
output = LSTM(3, return_sequences=True)(masked_input)
model = Model(input_tensor, output)
model.compile(loss="mae", optimizer="adam")
X = np.array([[[0, 0], [0, 0], [1, 0], [0, 1], [0, 1]],
[[0, 0], [0, 1], [1, 0], [0, 1], [0, 1]]])
y_true = np.ones((2, max_sentence_length, 3))
y_pred = model.predict(X)
print(y_pred)
[[[ 0. 0. 0. ]
[ 0. 0. 0. ]
[-0.11980877 0.05803877 0.07880752]
[-0.00429189 0.13382857 0.19167568]
[ 0.06817091 0.19093043 0.26219055]]
[[ 0. 0. 0. ]
[ 0.0651961 0.10283815 0.12413475]
[-0.04420842 0.137494 0.13727818]
[ 0.04479844 0.17440712 0.24715884]
[ 0.11117355 0.21645413 0.30220413]]]
# See if the loss computed by model.evaluate() is equal to the masked loss
unmasked_loss = np.abs(1 - y_pred).mean()
masked_loss = np.abs(1 - y_pred[y_pred != 0]).mean()
print(model.evaluate(X, y_true))
0.881977558136
print(masked_loss)
0.881978
print(unmasked_loss)
0.917384
As can be seen from this example, the loss on the masked part (the zeroes in y_pred
) is ignored, and the output of model.evaluate()
is equal to masked_loss
.
EDIT:
If there’s a recurrent layer with return_sequences=False
, the mask stop propagates (i.e., the returned mask is None
). In RNN.compute_mask()
:
def compute_mask(self, inputs, mask):
if isinstance(mask, list):
mask = mask[0]
output_mask = mask if self.return_sequences else None
if self.return_state:
state_mask = [None for _ in self.states]
return [output_mask] + state_mask
else:
return output_mask
In your case, if I understand correctly, you want a mask that’s based on y_true
, and whenever the value of y_true
is [0, 0, 1]
(the one-hot encoding of “#”) you want the loss to be masked. If so, you need to mask the loss values in a somewhat similar way to Daniel’s answer.
The main difference is the final average. The average should be taken over the number of unmasked values, which is just K.sum(mask)
. And also, y_true
can be compared to the one-hot encoded vector [0, 0, 1]
directly.
def get_loss(mask_value):
mask_value = K.variable(mask_value)
def masked_categorical_crossentropy(y_true, y_pred):
# find out which timesteps in `y_true` are not the padding character '#'
mask = K.all(K.equal(y_true, mask_value), axis=-1)
mask = 1 - K.cast(mask, K.floatx())
# multiply categorical_crossentropy with the mask
loss = K.categorical_crossentropy(y_true, y_pred) * mask
# take average w.r.t. the number of unmasked entries
return K.sum(loss) / K.sum(mask)
return masked_categorical_crossentropy
masked_categorical_crossentropy = get_loss(np.array([0, 0, 1]))
model = Model(input_tensor, output)
model.compile(loss=masked_categorical_crossentropy, optimizer="adam")
The output of the above code then shows that the loss is computed only on the unmasked values:
model.evaluate: 1.08339476585
tf unmasked_loss: 1.08989
tf masked_loss: 1.08339
The value is different from yours because I’ve changed the axis
argument in tf.reverse
from [0,1]
to [1]
.