DEMOCRATIZING AI FOR EVERY DATA SCIENTIST, WITH MODERN TOOLS

In real-world applications, managed AI services such as Amazon Rekognition and Amazon Comprehend offer a viable alternative to dedicated data science teams building models from scratch. Even when a use case requires model re-training with purpose-built datasets such as custom image labels or text entities, it can be easily achieved with Amazon Rekognition Custom Labels or Amazon Comprehend Custom Entities.

These services offer state-of-the-art machine learning model implementations covering several use cases. In some contexts, however, such models are not a feasible approach: either the underlying network requires deep customization, or data scientists need to implement architectures beyond the state of the art, such as LSTMs, GANs, one-shot learners, reinforcement learning models, or even model ensembles.

Research and model building are never-ending jobs in machine learning, opening up a whole new set of capabilities every day. Nevertheless, taking a model from neural network architecture definition to production deployment often requires a large team of diverse professionals.

Amazon SageMaker comes into play aiming to democratize machine learning for everyone, with a set of tools targeting both data scientists and software engineers. As of 2020, Amazon SageMaker (SM) is a suite of tools dedicated to dataset labeling (SM Ground Truth), model development (SM Notebooks), distributed training and inference deployment (SM Models/Endpoints), and experiment creation, debugging, and monitoring (SageMaker Studio).

In just a few years, many deep learning frameworks appeared, starting with TensorFlow, Apache MXNet, and PyTorch, each of them raising the bar of model creation and customization. PyTorch has emerged as one of the most promising technologies, thanks to its flexibility in dynamic computational graph definition and its support for data parallelism.

With Lightning, PyTorch gets both simplified AND put on steroids.

Amazon SageMaker has supported PyTorch since day one and has built a consistent user base over the last few years. Nevertheless, PyTorch lacked the simplicity, low learning curve, and high level of abstraction of alternatives such as Keras (for TensorFlow). A few frameworks were developed to fill the gap, such as the excellent Fast.ai library, which aims to be an easy-to-learn solution for developers approaching PyTorch.

In 2019, to bring machine learning efforts to a common denominator, William Falcon published the first production-ready version of PyTorch Lightning, a framework to structure a PyTorch project, reduce boilerplate, and improve code readability.

In this article, we will start from scratch with a simple neural network, following a consolidated workflow to develop, test, and deploy a machine learning model on Amazon SageMaker, with a step-by-step tutorial aimed at a beginner audience. No prior knowledge of Amazon SageMaker or PyTorch is required, even though it could help in understanding some of the APIs.

MNIST is the new "Hello World."

We will start from scratch with a simple neural network used for handwritten digit recognition, using the famous MNIST dataset. The use case is pretty narrow, but in recent years it has become the "Hello World" of image processing with a neural network, due to the simplicity of the resulting model.

Amazon SageMaker Notebooks

The first step in a machine learning project is building the model in some experimental context. Amazon SageMaker Notebooks offer an easy setup of a JupyterLab environment. PyTorch offers a prepared dataset through the torchvision library, but since this article aims to present a workflow suitable for general-purpose model training, we decided not to use it, instead downloading the MNIST images from the internet and saving them into an S3 bucket.

When using SageMaker Studio to build the model, we suggest downloading a bunch of data locally to speed up development and testing. We can easily do that with a few lines of code.
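As a sketch, assuming the images live under a hypothetical s3://my-mnist-bucket/mnist/train/ prefix, a boto3 equivalent could look like this:

```python
import boto3
from pathlib import Path

# Copy a small sample of the MNIST images to a local data/ folder.
# Bucket name and key prefix are hypothetical placeholders.
s3 = boto3.client("s3")
pages = s3.get_paginator("list_objects_v2").paginate(
    Bucket="my-mnist-bucket", Prefix="mnist/train/"
)
for page in pages:
    for obj in page.get("Contents", [])[:100]:  # just a small sample per page
        key = obj["Key"]
        if key.endswith("/"):  # skip folder markers
            continue
        local_path = Path("data") / Path(key).relative_to("mnist/train")
        local_path.parent.mkdir(parents=True, exist_ok=True)
        s3.download_file("my-mnist-bucket", key, str(local_path))
```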

Now we can display a few random samples, just to better understand how the data is organized before we start building our Lightning model.
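For instance, a quick matplotlib sketch (assuming the samples were saved locally under a data/&lt;label&gt;/ layout, as in the download snippet above):

```python
import random
from pathlib import Path

import matplotlib.pyplot as plt
from PIL import Image

# Plot eight random digits from the local sample.
paths = random.sample(list(Path("data").glob("*/*.png")), 8)
fig, axes = plt.subplots(1, 8, figsize=(12, 2))
for ax, path in zip(axes, paths):
    ax.imshow(Image.open(path), cmap="gray")
    ax.set_title(path.parent.name)  # the folder name is the digit label
    ax.axis("off")
plt.show()
```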

A small set of MNIST dataset of handwritten digits images.

MNIST Classifier and Amazon SageMaker

Amazon SageMaker manages code runs from Python code after we set up a PyTorch Estimator. An estimator is a class that holds all the parameters required by a training (or inference) script to run in a SageMaker container.

To train a neural network with convolutional layers, we have to run our training job on an ml.p2.xlarge instance equipped with a GPU.

Amazon SageMaker defaults to looking for training code in a code folder within our project, but the path can be overridden when instantiating the Estimator. The training script is where the magic of PyTorch Lightning happens.
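A sketch of the estimator setup (role, versions, and hyperparameters are illustrative; argument names follow the SageMaker Python SDK v2):

```python
import sagemaker
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="main.py",
    source_dir="code",                 # default code folder, can be overridden
    role=sagemaker.get_execution_role(),
    framework_version="1.5.0",
    py_version="py3",
    instance_count=1,
    instance_type="ml.p2.xlarge",      # GPU instance for convolutional layers
    hyperparameters={"epochs": 6, "batch-size": 64},
)
```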

Our trainer can run with no changes either on our local GPU rig or in an Amazon SageMaker container.

The magic of Amazon SageMaker lies in environment variables, which provide defaults for trainer and model parameters. Within a container, these variables point to folders whose contents are copied from S3 before our script runs, and back to S3 after training completes.
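A sketch of the argument parsing typically found in a SageMaker training script; the local fallbacks are what let the same code run outside the container (argument names are illustrative):

```python
import argparse
import os

# SageMaker sets these variables inside the training container;
# SM_CHANNEL_<NAME> depends on the channel names passed to fit().
parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=6)
parser.add_argument("--batch-size", type=int, default=64)
parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "./model"))
parser.add_argument("--data-dir", default=os.environ.get("SM_CHANNEL_TRAIN", "./data"))
parser.add_argument("--output-data-dir",
                    default=os.environ.get("SM_OUTPUT_DATA_DIR", "./output"))
args, _ = parser.parse_known_args()
```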

At this point, we haven't defined a model yet; we've just mapped some variables and configured an estimator object. Still, some Lightning-specific constructs are already visible, such as the Trainer class.

Trainer, as its name suggests, is a Python class capable of abstracting all training workflow steps, plus a series of everyday operations such as saving model checkpoints after every epoch. Trainer automates a set of activities such as finding the best learning rate, ensuring reproducibility, setting the number of GPUs, and configuring the multi-node backend for parallel training, among many others.

Lightning offers a set of defaults to make training super simple, but every value can be overridden, and the developer retains full control over the complete lifecycle, because our classifier class must conform to a defined protocol.
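A sketch of how the training script wires things together (Lightning ~1.x API; MNISTClassifier is the module we define in the next section, and args comes from the argument parsing shown above):

```python
import os

import pytorch_lightning as pl

# Build the model and hand the whole training lifecycle to the Trainer.
model = MNISTClassifier()
trainer = pl.Trainer(
    max_epochs=args.epochs,
    gpus=int(os.environ.get("SM_NUM_GPUS", 0)),  # GPUs visible in the container
    default_root_dir=args.output_data_dir,       # checkpoints end up here
)
trainer.fit(model)
```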

Let's break down our code and check what happens at each step.

1. Import libraries and extend LightningModule

Every PyTorch Lightning implementation must extend the base pl.LightningModule class, which inherits from nn.Module and adds some utility methods.
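A minimal skeleton (the class name is our choice; the methods from the following steps live inside this class):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl


class MNISTClassifier(pl.LightningModule):
    """LightningModule inherits from nn.Module, adding Trainer hooks."""
```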

2. Prepare network layers

In the class constructor, we prepare the network layers to be used later when building the computational graph. Convolutional layers extract features from images and pass them to the following layers, which add nonlinearity (activations) and randomness (dropout).
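A sketch of the constructor, with illustrative layer sizes echoing the classic PyTorch MNIST example:

```python
def __init__(self):
    super().__init__()
    # Two convolutional blocks for feature extraction...
    self.conv1 = nn.Conv2d(1, 32, kernel_size=3)
    self.conv2 = nn.Conv2d(32, 64, kernel_size=3)
    # ...dropout for regularization, and a fully connected head.
    self.dropout = nn.Dropout2d(0.25)
    self.fc1 = nn.Linear(64 * 12 * 12, 128)  # 12x12 maps after conv + pooling
    self.fc2 = nn.Linear(128, 10)            # ten digit classes
```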

3. Build data loaders for train, validation and test datasets

The DataLoader classes are built from the PyTorch image loader. Shuffling and splitting ensure a random validation dataset, carved out of the training images.
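A sketch of these methods inside MNISTClassifier, assuming the downloaded images follow the &lt;label&gt;/&lt;image&gt;.png layout expected by torchvision's ImageFolder (batch size and split ratio are illustrative):

```python
import os

from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

DATA_DIR = os.environ.get("SM_CHANNEL_TRAIN", "./data")

def prepare_data(self):
    transform = transforms.Compose([
        transforms.Grayscale(),
        transforms.ToTensor(),
    ])
    dataset = datasets.ImageFolder(DATA_DIR, transform=transform)
    # Carve a random validation set out of the training images.
    n_val = len(dataset) // 10
    self.train_set, self.val_set = random_split(
        dataset, [len(dataset) - n_val, n_val]
    )

def train_dataloader(self):
    return DataLoader(self.train_set, batch_size=64, shuffle=True)

def val_dataloader(self):
    return DataLoader(self.val_set, batch_size=64)
```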

4. Implement utility methods required by Trainer

PyTorch Lightning enforces a standard project structure, requiring the classifier to implement certain methods that are invoked by the Trainer class during training and validation.
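One such required method is configure_optimizers; a minimal version (optimizer choice and learning rate are illustrative):

```python
def configure_optimizers(self):
    # The Trainer calls this once to obtain the optimizer(s) to use.
    return torch.optim.Adam(self.parameters(), lr=1e-3)
```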

5. Implement forward pass

The forward method is the same as the traditional PyTorch forward function, which must be implemented to build the computational graph.
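A sketch matching the layers defined in the constructor above:

```python
def forward(self, x):
    # Feature extraction: conv -> relu -> conv -> relu -> pool -> dropout.
    x = F.relu(self.conv1(x))
    x = F.relu(self.conv2(x))
    x = F.max_pool2d(x, 2)
    x = self.dropout(x)
    # Classifier head over the flattened feature maps.
    x = torch.flatten(x, 1)
    x = F.relu(self.fc1(x))
    return F.log_softmax(self.fc2(x), dim=1)
```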

6. Implement the training step

The training_step method is invoked by the Trainer for each image batch, computing the network predictions and their loss.
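A sketch using the Lightning 1.x logging API:

```python
def training_step(self, batch, batch_idx):
    x, y = batch
    logits = self(x)
    # Negative log-likelihood pairs with the log_softmax in forward().
    loss = F.nll_loss(logits, y)
    self.log("train_loss", loss)
    return loss
```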

7. Validation computing and stacking

Lightning offers a set of optional methods such as validation_step, validation_epoch_end, and validation_end, allowing developers to define how a validation loss should be computed and how to stack results to track improvements during training. These methods must return data conforming to a specific schema; PL then outputs all the metrics in a TensorBoard-compatible format.
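A sketch of two of these hooks (Lightning ~1.x API): validation_step computes per-batch metrics, and validation_epoch_end aggregates them.

```python
def validation_step(self, batch, batch_idx):
    x, y = batch
    logits = self(x)
    loss = F.nll_loss(logits, y)
    acc = (logits.argmax(dim=1) == y).float().mean()
    return {"val_loss": loss, "val_acc": acc}

def validation_epoch_end(self, outputs):
    avg_loss = torch.stack([o["val_loss"] for o in outputs]).mean()
    avg_acc = torch.stack([o["val_acc"] for o in outputs]).mean()
    # Logged values are written in a TensorBoard-compatible format.
    self.log("val_loss", avg_loss)
    self.log("val_acc", avg_acc)
```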

Equivalent methods can be implemented to support model testing, which is highly encouraged before going to production.

Now we're ready to give our model a spin and start training with Amazon SageMaker.

Model training on Amazon SageMaker

Training starts by running main.py from the command line or another Jupyter Notebook. It could also be run from an AWS Lambda function, invoked by an AWS Step Function, to make the training process fully scriptable and serverless. In any case, logs are collected in the console and pushed to Amazon CloudWatch for further inspection. This feature is pretty useful when starting multiple training jobs to fine-tune hyperparameters.
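Kicking off the job boils down to one call on the estimator defined earlier (channel name and S3 URI are hypothetical):

```python
# fit() provisions the instance, runs main.py, and streams the
# container logs to the console and to Amazon CloudWatch.
estimator.fit({"train": "s3://my-mnist-bucket/mnist/train/"})
```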

Console output for Amazon SageMaker training job.

Amazon SageMaker starts p2.xlarge instances on our behalf, downloads the input data into the container, installs all the dependencies listed in our requirements.txt file, and finally launches our entry point, main.py.

Console output for Amazon SageMaker training job.

Amazon SageMaker builds a job descriptor in JSON format and passes it to the training context. This object holds all the parameters sent to the training job; input directories are mapped to /opt/ml/ subfolders receiving data from S3, and the output gets collected into a result bucket. The training code is packaged to a separate S3 path as well, then downloaded into the container.

Finally, just before launching our training script, environment variables are set to standard SageMaker values.

After a couple of minutes, since we're training for just six epochs, our validation output is displayed, and the saved models are uploaded to S3. Since PyTorch Lightning automatically saves model checkpoints on our behalf, and we mapped its output directory to output_data_dir, we can also collect intermediate checkpoints and validation data from S3, ready to be processed and analyzed with TensorBoard.

The classification model is now available on S3, ready to be used in an inference script, served from an Amazon SageMaker endpoint, or deployed to edge devices using the JIT compiler.
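For example, a sketch of reloading the trained model for local inference (the checkpoint path is hypothetical and depends on the Trainer's checkpoint settings):

```python
import torch

model = MNISTClassifier.load_from_checkpoint("checkpoints/last.ckpt")
model.eval()

# Optional: export with the JIT compiler for edge deployment.
scripted = torch.jit.trace(model, torch.rand(1, 1, 28, 28))
scripted.save("mnist_classifier.pt")
```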

Where to go from here?

In this article, we've discussed how Amazon SageMaker and PyTorch Lightning work together to democratize deep learning, reducing the boilerplate every developer or data scientist has to write to take a model from scratch to production. Amazon SageMaker relieves the burden of spinning up and configuring training machines with just a few lines of code. At the same time, Lightning makes steps such as gradient management, optimization, and backpropagation transparent, allowing researchers to focus on the neural network architecture.

The full code of the project is available on GitHub. It can be run as a standalone script on any PC by launching main.py.

If you prefer a Jupyter Notebook interface, the same code can be run within Amazon SageMaker by opening notebook/sagemaker_deploy.ipynb. Since SageMaker launches the actual training jobs, there is no need for a GPU instance to run the notebook.

This article is just a sample project to showcase how SageMaker and Lightning can work together. Still, it can be used as a starting point for computer vision tasks such as image classification, by changing the network architecture to resemble VGG or ResNet and providing an adequate dataset.

In the next articles, we'll dive into machine learning production pipelines for image processing and introduce some of the architectural solutions we have adopted in Neosperience to implement Image Memorability and customer behavior analysis. Stay tuned!

My name is Luca Bianchi. I am Chief Technology Officer at Neosperience and the author of Serverless Design Patterns and Best Practices. I have built software architectures for production workloads at scale on AWS for nearly a decade.

Neosperience Cloud is the one-stop SaaS solution for brands aiming to bring Empathy in Technology, leveraging innovation in machine learning to provide support for 1:1 customer experiences.

You can contact me via Twitter and LinkedIn.