Zoom Our NN to classify the MNIST
Zoom session 3 — part 1
Time:
3 pm Pacific Time.
Topic:
In this session, we will build our own NN to classify the MNIST.
Let's build an MLP
The multi-layer perceptron (MLP) is the simplest neural network. It is a feed-forward (i.e. no loop), fully-connected (i.e. each neuron of one layer is connected to all the neurons of the adjacent layers) neural network with a single hidden layer.
Load packages
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR
The torch.nn.functional
module contains all the functions of the torch.nn
package.
These functions include loss functions, activation functions, pooling functions…
Create a SummaryWriter
instance for TensorBoard
writer = SummaryWriter()
Define the architecture of the network
# To build a model, create a subclass of torch.nn.Module:
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.fc1 = nn.Linear(784, 128)
self.fc2 = nn.Linear(128, 10)
# Method for the forward pass:
def forward(self, x):
x = torch.flatten(x, 1)
x = self.fc1(x)
x = F.relu(x)
x = self.fc2(x)
output = F.log_softmax(x, dim=1)
return output
Python’s class inheritance gives our subclass all the functionality of torch.nn.Module
while allowing us to customize it.
Define a training function
def train(model, device, train_loader, optimizer, epoch):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
data, target = data.to(device), target.to(device)
optimizer.zero_grad() # reset the gradients to 0
output = model(data)
loss = F.nll_loss(output, target) # negative log likelihood
writer.add_scalar("Loss/train", loss, epoch)
loss.backward()
optimizer.step()
if batch_idx % 10 == 0:
print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
epoch, batch_idx * len(data), len(train_loader.dataset),
100. * batch_idx / len(train_loader), loss.item()))
Define a testing function
def test(model, device, test_loader):
model.eval()
test_loss = 0
correct = 0
with torch.no_grad():
for data, target in test_loader:
data, target = data.to(device), target.to(device)
output = model(data)
# Sum up batch loss:
test_loss += F.nll_loss(output, target, reduction='sum').item()
# Get the index of the max log-probability:
pred = output.argmax(dim=1, keepdim=True)
correct += pred.eq(target.view_as(pred)).sum().item()
test_loss /= len(test_loader.dataset)
# Print a summary
print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
test_loss, correct, len(test_loader.dataset),
100. * correct / len(test_loader.dataset)))
Define a function main() which runs our network
def main():
epochs = 1
torch.manual_seed(1)
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
train_data = datasets.MNIST(
'~/projects/def-sponsor00/data',
train=True, download=True, transform=transform)
test_data = datasets.MNIST(
'~/projects/def-sponsor00/data',
train=False, transform=transform)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=1000)
model = Net().to(device) # create instance of our network and send it to device
optimizer = optim.Adadelta(model.parameters(), lr=1.0)
scheduler = StepLR(optimizer, step_size=1, gamma=0.7)
for epoch in range(1, epochs + 1):
train(model, device, train_loader, optimizer, epoch)
test(model, device, test_loader)
scheduler.step()
Run the network
main()
Write pending events to disk and close the TensorBoard
writer.flush()
writer.close()
The code is working. Time to actually train our model!
Jupyter is a fantastic tool. It has a major downside however: when you launch a Jupyter server, you are running a job on a compute node. If you want to play for 8 hours in Jupyter, you are requesting an 8 hour job. Now, most of the time you spend on Jupyter is spent typing, running bits and pieces of code, or doing nothing at all. If you ask for GPUs, many CPUs, and lots of RAM, all of it will remain idle almost all of the time. It is a really suboptimal use of Compute Canada resources.
In addition, if you ask for lots of resources for a long time, you will have to wait a long time in the queue before they get allocated to you.
Lastly, you will go through your allocation quickly.
A much better strategy is to develop and test your code (with very little data, few epochs, etc.) in an interactive job (with salloc
) or in Jupyter, then, launch an sbatch
job to actually train your model. This ensures that heavy duty resources such as GPU(s) are only allocated to you when you are actually needing and using them.
Concrete example with our training cluster: this cluster only has 1 GPU. If you want to use it in Jupyter, you have to request it for your Jupyter session. This means that the entire time your Jupyter session is active, nobody else can use that GPU. While you let your session idle or do tasks that do not require a GPU, this is not a good use of resources.
Let's train and test our model
Log in the training cluster
Open a terminal and SSH to our training cluster as we saw in the first lesson.
Load necessary modules
First, we need to load the Python and CUDA modules. This is done with the Lmod tool through the module command. Here are some key Lmod commands:
# Get help on the module command
$ module help
# List modules that are already loaded
$ module list
# See which modules are available for a tool
$ module avail <tool>
# Load a module
$ module load <module>[/<version>]
Here are the modules we need:
$ module load nixpkgs/16.09 gcc/7.3.0 cuda/10.0.130 cudnn/7.6 python/3.8.2
Install Python packages
You also need the Python packages matplotlib
, torch
, torchvision
, and tensorboard
.
On Compute Canada clusters, you need to create a virtual environment in which you install packages with pip
.
Do not use Anaconda
While Anaconda is a great tool on personal computers, it is not an appropriate tool when working on the Compute Canada clusters: binaries are unoptimized for those clusters and library paths are inconsistent with their architecture. Anaconda installs packages in $HOME
where it creates a very large number of small files. It can also create conflicts by modifying .bashrc
.
For this workshop, since we all need the same packages, I already created a virtual environment that we will all use. All you have to do is to activate it with:
$ source ~/projects/def-sponsor00/env/bin/activate
If you want to exit the virtual environment, you can press Ctrl-D or run:
(env) $ deactivate
For future reference, below is how you would install packages on a real Compute Canada cluster (but please don't do it in the training cluster as it is unnecessary and would only slow it down).
Create a virtual environment:
$ virtualenv –no-download ~/env
Activate the virtual environment:
$ source ~/env/bin/activate
Update pip:
(env) $ pip install –no-index –upgrade pip
Install the packages you need in the virtual environment:
(env) $ pip install –no-cache-dir –no-index matplotlib torch torchvision tensorboard
Write a Python script
Create a directory for this project and cd
into it:
mkdir mnist
cd mnist
Start a Python script with the text editor of your choice:
nano nn.py
In it, copy-paste the code we played with in Jupyter, but this time have it run for 10 epochs:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR
writer = SummaryWriter()
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.fc1 = nn.Linear(784, 128)
self.fc2 = nn.Linear(128, 10)
def forward(self, x):
x = torch.flatten(x, 1)
x = self.fc1(x)
x = F.relu(x)
x = self.fc2(x)
output = F.log_softmax(x, dim=1)
return output
def train(model, device, train_loader, optimizer, epoch):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
writer.add_scalar("Loss/train", loss, epoch)
loss.backward()
optimizer.step()
if batch_idx % 10 == 0:
print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
epoch, batch_idx * len(data), len(train_loader.dataset),
100. * batch_idx / len(train_loader), loss.item()))
def test(model, device, test_loader):
model.eval()
test_loss = 0
correct = 0
with torch.no_grad():
for data, target in test_loader:
data, target = data.to(device), target.to(device)
output = model(data)
test_loss += F.nll_loss(output, target, reduction='sum').item()
pred = output.argmax(dim=1, keepdim=True)
correct += pred.eq(target.view_as(pred)).sum().item()
test_loss /= len(test_loader.dataset)
print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
test_loss, correct, len(test_loader.dataset),
100. * correct / len(test_loader.dataset)))
def main():
epochs = 10 # don't forget to change the number of epochs
torch.manual_seed(1)
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
train_data = datasets.MNIST(
'~/projects/def-sponsor00/data',
train=True, download=True, transform=transform)
test_data = datasets.MNIST(
'~/projects/def-sponsor00/data',
train=False, transform=transform)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=1000)
model = Net().to(device)
optimizer = optim.Adadelta(model.parameters(), lr=1.0)
scheduler = StepLR(optimizer, step_size=1, gamma=0.7)
for epoch in range(1, epochs + 1):
train(model, device, train_loader, optimizer, epoch)
test(model, device, test_loader)
scheduler.step()
main()
writer.flush()
writer.close()
Write a Slurm script
Write a shell script with the text editor of your choice:
nano nn.sh
This is what you want in that script:
#!/bin/bash
#SBATCH --time=5:0
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --mem=4G
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
python ~/mnist/nn.py
Notes:
--time
accepts these formats: "min", "min:s", "h:min:s", "d-h", "d-h:min" & "d-h:min:s"%x
will get replaced by the script name &%j
by the job number
Submit a job
Finally, you need to submit your job to Slurm:
$ sbatch ~/mnist/nn.sh
You can check the status of your job with:
$ sq
PD
= pending
R
= running
CG
= completing (Slurm is doing the closing processes)
No information = your job has finished running
You can cancel it with:
$ scancel <jobid>
Once your job has finished running, you can display efficiency measures with:
$ seff <jobid>
Let's explore our model's metrics with TensorBoard
TensorBoard is a web visualization toolkit developed by TensorFlow which can be used with PyTorch.
Because we have sent our model's metrics logs to TensorBoard as part of our code, a directory called runs
with those logs was created in our ~/mnist
directory.
Launch TensorBoard
TensorBoard requires too much processing power to be run on the login node. When you run long jobs, the best strategy is to launch it in the background as part of the job. This allows you to monitor your model as it is running (and cancel it if things don't look right).
Example:
#!/bin/bash
#SBATCH ...
#SBATCH ...
tensorboard --logdir=runs --host 0.0.0.0 &
python ~/mnist/nn.py
Because we only have 1 GPU and are taking turns running our jobs, we need to keep our jobs very short here. So we will launch a separate job for TensorBoard. This time, we will launch an interactive job:
salloc --time=1:0:0 --mem=2000M
To launch TensorBoard, we need to activate our Python virtual environment (TensorBoard was installed by pip
):
source ~/projects/def-sponsor00/env/bin/activate
Then we can launch TensorBoard in the background:
tensorboard --logdir=~/mnist/runs --host 0.0.0.0 &
Now, we need to create a connection with SSH tunnelling between your computer and the compute note running your TensorBoard job.
Connect to TensorBoard from your computer
From a new terminal on your computer, run:
ssh -NfL localhost:6006:<hostname>:6006 userxxx@uu.c3.ca
Replace <hostname> by the name of the compute node running your salloc
job. You can find it by looking at your prompt (your prompt shows <username>@<hostname>).
Replace <userxxx> by your user name.
Now, you can open a browser on your computer and access TensorBoard at http://localhost:6006.