Migrating from Keras to PyTorch in 2020

Note: this specifically applies to PyTorch 1.5 and tf.Keras 2.1.0. As long as the final number in the version is the same, this should still be applicable, but otherwise, YMMV

While completing my fellowship at Insight, I reached a point in my recyclable classification project where I wanted to implement a siamese network to improve the generalizability of my model, while requiring less training data. The ad-hoc siamese networks I created in Keras had acceptable results, but were certainly not production-ready. I began looking on papers with code for a more SOTA model, and realized that Keras implementations are few and far between. In fact, PyTorch seems to have exploded in popularity to the point where it is now the #1 framework on papers with code. With this in mind, I decided it was time to start learning PyTorch.

Why move?

In addition to the predominance of PyTorch on paperswithcode, there are several advantages it offers. Back when I began learning Keras, TensorFlow was the dominant framework partially due to easier productionalization of models using tf.Serve. However, since then, the productionalization gap appears to be closing. In addition, PyTorch comes with out-of-the-box asynchronous behavior when loading, whereas this still has to be specified and configured when using tf.Datasets. TensorBoard was, and still is, the predominant monitoring solution for model training, but it’s now fully supported by PyTorch. Finally, in my opinion, TensorFlow has made some rather strange api decisions, whereas PyTorch seems to bear more similarity to Numpy.

Defining the computation graph

Since Keras has both the sequential and functional api, it’s worth doing a comparison of both with the preferred way of instantiating a PyTorch network. Here, let’s use a simple feedforward MLP as an example:

#Keras sequential api
model = Sequential()
model.add(layers.Dense(32, activation='relu', input_shape=(64,)))
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

#Keras functional api
input_layer = Input(shape=(64,))
x = layers.Dense(32, activation='relu')(input_layer)
x = layers.Dense(32, activation='relu')(x)
output = layers.Dense(10, activation='softmax')(x)

model = Model(input_layer, output)

#Pytorch example
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.dense1 = nn.Linear(64, 32)
        self.dense2 = nn.Linear(32, 32)
        self.output = nn.linear(32, 10)

The obvious difference here is that the standard practice for PyTorch models is encapsulation into a class, and instantiating the parent Module class. Another less obvious difference is that both the input and output sizes of each layer have to be specified in PyTorch. Other than that, it’s a pretty straightforward difference. Let’s try a less trivial example that’s more suited to the Keras functional API: a siamese network for image similarity classification, based on the original paper with several convolutional layers taken out for the sake of length. This employs custom layers, shared weights, and gives an overview of how a multi-input, single output neural network can be implemented in the two frameworks.

#Keras example
input_shape = (105, 105, 3)

img_input = Input(shape=input_shape)
left_input = Input(shape=input_shape)
right_input = Input(shape=input_shape)

x = Conv2D(64,(10,10), activation='relu', input_shape=input_shape)(img_input)
x = MaxPooling2D()(x)
x = Conv2D(128, (4, 4), activation='relu')(x)
x = Flatten()(x)
x = Dropout(0.5)(x)
out = Dense(2048, activation="sigmoid")(x)

twin = Model(img_input, out)

# get the image embeddings, sharing weights between the two networks
encoded_l = twin(left_input)
encoded_r = twin(right_input)

# merge two encoded inputs with the l1 distance between them
L1_layer = Lambda(lambda x: tf.math.abs(x[0] - x[1]))
L1_distance = L1_layer([encoded_l, encoded_r])

prediction = Dense(1, activation='sigmoid')(L1_distance)

siamese_net = Model(inputs=[left_input, right_input], outputs=prediction)

# The same model in PyTorch    correct = 0
class SiameseNetwork(nn.Module):
    def __init__(self):
            nn.Conv2d(3, 64, 10)
            nn.MaxPool2d(2, 2)
            nn.Conv2d(64, 128, 4)
            nn.Linear(2048, 2048)

Unlike Keras, convolutional layers in PyTorch have arguments in the order of in_channel size, out_channels size, kernel_size, stride, padding, with the default stride and padding being 1 and 0, respectively. You’re probably noticing that with the PyTorch model, we stopped around the twin definition in the Keras model. The reason being is that the biggest difference between keras and pytorch is how you train the model, aka the training loop.

Training the model

Defining the model isn’t very different between Keras and PyTorch, but training the model certainly is. Rather than calling model.compile(), you instead define your forward pass as a method of your model. Also, your loss function, optimizer, and learning rate are usually defined in the training loop. Let’s start with the forward pass and training loop for our first MLP:

# Defining the forward pass. Note that this is a method of Model
    def forward(self, x):
        x = self.dense1(x)
        # F is an alias for torch.nn.functional
        x = F.relu(x)
        x = self.dense2(x)
        x = F.relu(x)
        x = self.output(x)
        out = F.softmax(x)
        return out

This shows two methods of model creation: for the siamese model, we define the entire model intially, making the forward pass as simple as return twin(x). With the MLP, we defined the layers individually. Which method is better definitely depends on your use case, but my intuition is that a neural network that can be drawn as a linear progression of layers lends itself well to the MLP method, whereas defining your entire model as an attribute works well for more advanced graphs such as ResNet/Inception type models, or models with multimodal input/outputs. Alright, so we have our model and how our data flows through it. The next step is training and evaluation. This is indeed far more code than is needed by using callbacks in Keras, but the training loop adds complexity in exchange for significantly more flexibility.

# train_loader is some predefined Torch DataLoader instance
# device is your cpu/gpu name
def train(model, device, train_loader, optimizer, epoch):
    for batch, (X, y) in enumerate(train_loader):
        # Assuming X and y are torch tensors, you can also just call X.cuda() instead if
        # you know you don't need to switch devices.
        X, y = X.to(device), y.to(device)
        # We re-instantiate the gradients during each iteration
        y_hat = model(X)
        loss = F.mse_loss(y_hat, y)
        # Now we back-propagate
        # Report accuracy every 10 batches
        if batch % 10 == 0:
            print(f'Loss of {loss} after {batch} epochs on training set')

The function above is meant to be used in a for loop with a preset number of epochs. Optimizers are called in a similar manner compared to Keras. Similar to the LearningRateScheduler in Keras’ callbacks, we now have several built-in adaptive learning rates. We have our training function, now for the test one:

# Again, test_loader is a DataLoader instance
def test(model, device, test_loader):
    test_loss = 0
    correct = 0
    # We don't want to change the gradients, so we freeze the model here
    with torch.no_grad():
        for X, y in test_loader:
            X, y = X.to(device), y.to(device)
            y_hat = model(X)
            test_loss += F.mse_loss(y_hat, y, reduction='sum').item()
            pred = y_hat.argmax(dim=1) # For binary classification
            # For multiclass, pass keepdim=True above
            # Now we format the actual target and compare it to the predicted one
            correct += pred.eq(y.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    print(f'Average loss: {test_loss}\nAccuracy: {correct/len(test_loader.dataset)*100}')

Now we have our model with its foward propagation method, a training function, and a testing function. We presume there is a data loading function in there somewhere as well. So the final step is putting it all together, either in script for or in a main function for CLI execution. Here is the last bit in script form:

# With Torch, we have to specify GPU/CPU computation
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
# First we load the model onto the GPU
model = Model().to(device)
# Now we load our optimizer
optimizer = torch.optim.Adam(lr=0.001)
# Let's also apply a learning rate decay
scheduler = torch.optim.lr_scheduler.StepLR(optimizer)
# Now let's train for 100 epochs
for epoch in range(100):
   train(model, device, train_loader, optimizer, epoch)
   test(model, device, test_loader)
#Saving the weights of the model to a pickle file
torch.save(model.state_dict(), 'torch_example.pt')

Whew, that’s a lot of code for a 3 layer MLP! Of course, this is only a starting point. You’ll probably want some kind of early stopping mechanism, monitoring with tensorboard or custom visualizations, a tqdm progress bar, and/or logging. In performing this excercise, I’m of the mind that the additional code is actually a good thing, as debugging becomes far easier as you can isolate the line causing the issue with a visual debugger (cough or a print statement cough), as opposed to Keras abstracting that complexity away.

So this post doesn’t get too long, I’m going to direct you to the repository for my Insight project if you want to see the siamese network in PyTorch. Overall, PyTorch is pretty great, and a smoother transition than I originally thought. I’ll have to see if this is just a honeymoon phase, but I figure there’s likely a reason there are so many converts as of late. Happy hacking, and thanks for reading!