400% higher error with PyTorch compared with identical Keras model (with Adam optimizer)

The problem here is unintentional broadcasting in the PyTorch training loop. The result of an nn.Linear operation on a batch of inputs has shape [B,D], where B is the batch size and D is the output dimension. Therefore, in your mean_squared_error function, ypred has shape [32,1] and ytrue has shape [32]. By the broadcasting rules used by NumPy and …
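To make the shape mismatch concrete, here is a minimal sketch. It assumes a hypothetical model with 3 input features and a single output, and random data in place of the real dataset; only the shapes [32,1] and [32] come from the discussion above. Subtracting a [32] tensor from a [32,1] tensor does not raise an error. It silently yields a [32,32] matrix of all pairwise differences, so the mean is taken over the wrong values.

```python
import torch

torch.manual_seed(0)

B = 32                               # batch size, as in the shapes discussed above
model = torch.nn.Linear(3, 1)        # hypothetical layer: 3 features in, 1 output
x = torch.randn(B, 3)                # stand-in inputs (random, for illustration only)
ytrue = torch.randn(B)               # 1-D targets, shape [32]

ypred = model(x)                     # shape [32, 1]

# Buggy loss: [32, 1] - [32] broadcasts to [32, 32].
diff_buggy = ypred - ytrue
mse_buggy = (diff_buggy ** 2).mean()

# One possible fix: drop the trailing dimension so both tensors have shape [32].
diff_fixed = ypred.squeeze(-1) - ytrue
mse_fixed = (diff_fixed ** 2).mean()

print(tuple(ypred.shape), tuple(ytrue.shape))   # (32, 1) (32,)
print(tuple(diff_buggy.shape))                  # (32, 32)
print(tuple(diff_fixed.shape))                  # (32,)
```

Reshaping the targets instead, for example with ytrue.unsqueeze(1), would work equally well; what matters is that both operands have the same shape before the subtraction.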