
MXNet Distributed Training with Azure ML (Custom Configuration Sample)

In my previous post, I described the basic concepts and benefits of Azure Machine Learning, with several Python code samples.
In this post, I proceed to more advanced topics: I show you how to set up (customize) Azure Machine Learning Compute (AmlCompute) for practical training, and then walk through Apache MXNet distributed training (a CIFAR-10 example) on Azure Machine Learning.

As I explained in my previous post, Azure ML provides built-in environments (called curated environments), in which you can train models with TensorFlow or PyTorch.
As you will see in this post, you can also bring other frameworks and customize the environment as you need.

Bringing Your Own Image

In usual experimentation, Azure ML uses the default images (mcr.microsoft.com/azureml/base, mcr.microsoft.com/azureml/base-gpu) hosted on mcr.microsoft.com (MCR, Microsoft's backend container catalog for publishing container images).
You can also bring your own container images into Azure Machine Learning, either by publishing them in a public repository (Docker Hub) or by registering them in Azure Container Registry (ACR).

Note : Each Azure ML workspace has an associated container registry (ACR).
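For example, if you register a locally built image in this ACR instead of Docker Hub, the flow is roughly as follows. This is a minimal sketch: replace <acr_name> with your workspace's registry name, and the image name and tag are just placeholders.

# Log in to the workspace's container registry (replace <acr_name> with your registry name)
az acr login --name <acr_name>
# Build the image from a local Dockerfile and tag it with the registry's login server
docker build -t <acr_name>.azurecr.io/azureml-openmpi:0.1.0-gpu .
# Push the image so that AmlCompute can pull it at run time
docker push <acr_name>.azurecr.io/azureml-openmpi:0.1.0-gpu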

Here I'll show you a simple example with the following Dockerfile.
The default images (mcr.microsoft.com/azureml/base, mcr.microsoft.com/azureml/base-gpu) include Miniconda, MKL, and Intel MPI, whereas the following image installs Open MPI for Azure ML compute.

FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
USER root:root
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update -y && \
    apt-get install -y wget bzip2 gcc g++ make
# Install Miniconda
ENV PATH /opt/miniconda/bin:$PATH
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-4.5.11-Linux-x86_64.sh -O /tmp/miniconda.sh && \
    /bin/bash /tmp/miniconda.sh -bf -p /opt/miniconda && \
    conda clean -ay && \
    rm /tmp/miniconda.sh
# Install Open MPI
RUN wget -q https://www.open-mpi.org/software/ompi/v1.10/downloads/openmpi-1.10.4.tar.gz && \
    tar -xzf openmpi-1.10.4.tar.gz && \
    cd openmpi-1.10.4 && \
    ./configure --prefix=/usr/local/mpi && \
    make -j"$(nproc)" install && \
    cd .. && \
    rm -rf /openmpi-1.10.4 && \
    rm -rf openmpi-1.10.4.tar.gz
ENV PATH=/usr/local/mpi/bin:$PATH \
    LD_LIBRARY_PATH=/usr/local/mpi/lib:$LD_LIBRARY_PATH
# Set Path for Non-Interactive Execution
RUN sed -i '1s/^/LD_LIBRARY_PATH="\/usr\/local\/mpi\/lib:$LD_LIBRARY_PATH"\n\n/' ~/.bashrc
RUN sed -i '1s/^/PATH="\/usr\/local\/mpi\/bin:$PATH"\n\n/' ~/.bashrc

I have already built this image and published it as "tsmatz/azureml-openmpi:0.1.0-gpu" on Docker Hub.
You can then quickly use this image for distributed training frameworks such as Horovod.
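If you want a quick sanity check of this image before using it in Azure ML (assuming Docker is installed locally), you can confirm that Open MPI is on the path, for instance:

# Pull the image and print the Open MPI version installed under /usr/local/mpi
docker run --rm tsmatz/azureml-openmpi:0.1.0-gpu mpirun --version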

Environment Settings

In distributed training, you can also use several environment variables in your script on Azure ML cluster compute. (See here for details.)

For instance, the following Python code will print the MPI master host name (such as "10.0.0.4") when it's run as an MPI-distributed job.

import os

master_address = os.environ["AZ_BATCH_MASTER_NODE"]
host_and_port = master_address.split(":")
print(host_and_port[0])  # print host only

Note : Azure ML Compute is built on the Azure Batch infrastructure, so you can also use Azure Batch or Azure Batch AI environment variables, such as the node id, node shared directory, MPI host list, and others.
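For instance, a small shell sketch like the following would print some of these variables inside a running job. Which variables are actually set depends on the job type and distribution settings, so treat this as an illustration rather than a guaranteed list.

# Print a few Azure Batch / Batch AI variables available on AmlCompute nodes
echo "node id         : $AZ_BATCH_NODE_ID"
echo "node shared dir : $AZ_BATCH_NODE_SHARED_DIR"
echo "MPI master node : $AZ_BATCHAI_MPI_MASTER_NODE"
echo "MPI host list   : $AZ_BATCH_HOST_LIST"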

Azure Batch environment variables
https://docs.microsoft.com/en-us/azure/batch/batch-compute-node-environment-variables

Azure Batch AI environment variables
https://github.com/microsoftarchive/BatchAI/blob/master/documentation/using-batchai-environment-variables.md

Run MXNet Distributed Training on Azure ML

Now let’s run distributed training in MXNet on Azure Machine Learning.

First, download cifar10_dist.py from the official MXNet samples and change the value of the "gpus_per_machine" variable. For instance, if each compute node in your cluster has 1 GPU, set "gpus_per_machine=1" (the default is 4) as follows.

Note : The following training code (the training blocks) will run only on MXNet worker nodes.
When the MXNet module is imported (the "import mxnet" statement below), the server and scheduler processes start immediately and do not proceed to the training blocks. As a result, training is skipped on the server and scheduler nodes, even though the training code exists in the script as follows.

cifar10_dist.py

from __future__ import print_function
import random, sys
import mxnet as mx
from mxnet import autograd, gluon, kv, nd
from mxnet.gluon.model_zoo import vision
import numpy as np

# Create a distributed key-value store
store = kv.create('dist')

# Classify the images into one of the 10 digits
num_outputs = 10

# 64 images in a batch
batch_size_per_gpu = 64
# How many epochs to run the training
epochs = 5

# How many GPUs per machine
gpus_per_machine = 1
# Effective batch size across all GPUs
batch_size = batch_size_per_gpu * gpus_per_machine

# Create the context (a list of all GPUs to be used for training)
ctx = [mx.gpu(i) for i in range(gpus_per_machine)]

# Convert to float 32
# Having channel as the first dimension makes computation more efficient. Hence the (2,0,1) transpose.
# Dividing by 255 normalizes the input between 0 and 1
def transform(data, label):
    return nd.transpose(data.astype(np.float32), (2,0,1))/255, label.astype(np.float32)

class SplitSampler(gluon.data.sampler.Sampler):
    """ Split the dataset into `num_parts` parts and sample from the part with index `part_index`

    Parameters
    ----------
    length: int
        Number of examples in the dataset
    num_parts: int
        Partition the data into multiple parts
    part_index: int
        The index of the part to read from
    """
    def __init__(self, length, num_parts=1, part_index=0):
        # Compute the length of each partition
        self.part_len = length // num_parts
        # Compute the start index for this partition
        self.start = self.part_len * part_index
        # Compute the end index for this partition
        self.end = self.start + self.part_len

    def __iter__(self):
        # Extract examples between `start` and `end`, shuffle and return them.
        indices = list(range(self.start, self.end))
        random.shuffle(indices)
        return iter(indices)

    def __len__(self):
        return self.part_len

# Load the training data
train_data = gluon.data.DataLoader(
    gluon.data.vision.CIFAR10(train=True, transform=transform),
    batch_size,
    sampler=SplitSampler(49920, store.num_workers, store.rank))

# Load the test data
test_data = gluon.data.DataLoader(
    gluon.data.vision.CIFAR10(train=False, transform=transform),
    batch_size, shuffle=False)

# Use ResNet from model zoo
net = vision.resnet18_v1()

# Initialize the parameters with Xavier initializer
net.collect_params().initialize(mx.init.Xavier(), ctx=ctx)

# SoftmaxCrossEntropy is the most common choice of loss function for multiclass classification
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()

# Use Adam optimizer. Ask trainer to use the distributed kv store.
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': .001}, kvstore=store)

# Evaluate accuracy of the given network using the given data
def evaluate_accuracy(data_iterator, net):

    acc = mx.metric.Accuracy()

    # Iterate through data and label
    for i, (data, label) in enumerate(data_iterator):

        # Get the data and label into the GPU
        data = data.as_in_context(ctx[0])
        label = label.as_in_context(ctx[0])

        # Get network's output which is a probability distribution
        # Apply argmax on the probability distribution to get network's classification.
        output = net(data)
        predictions = nd.argmax(output, axis=1)

        # Give network's prediction and the correct label to update the metric
        acc.update(preds=predictions, labels=label)

    # Return the accuracy
    return acc.get()[1]

# We'll use cross entropy loss since we are doing multiclass classification
loss = gluon.loss.SoftmaxCrossEntropyLoss()

# Run one forward and backward pass on multiple GPUs
def forward_backward(net, data, label):

    # Ask autograd to remember the forward pass
    with autograd.record():
        # Compute the loss on all GPUs
        losses = [loss(net(X), Y) for X, Y in zip(data, label)]

    # Run the backward pass (calculate gradients) on all GPUs
    for l in losses:
        l.backward()

# Train a batch using multiple GPUs
def train_batch(batch, ctx, net, trainer):

    # Split and load data into multiple GPUs
    data = batch[0]
    data = gluon.utils.split_and_load(data, ctx)

    # Split and load label into multiple GPUs
    label = batch[1]
    label = gluon.utils.split_and_load(label, ctx)

    # Run the forward and backward pass
    forward_backward(net, data, label)

    # Update the parameters
    this_batch_size = batch[0].shape[0]
    trainer.step(this_batch_size)

# Run as many epochs as required
for epoch in range(epochs):

    # Iterate through batches and run training using multiple GPUs
    batch_num = 1
    for batch in train_data:

        # Train the batch using multiple GPUs
        train_batch(batch, ctx, net, trainer)

        batch_num += 1

    # Print test accuracy after every epoch
    test_accuracy = evaluate_accuracy(test_data, net)
    print("Epoch %d: Test_acc %f" % (epoch, test_accuracy))
    sys.stdout.flush()

Here we use 1 scheduler, 1 server, and 2 workers (total 4 nodes) for MXNet distributed training.
Let's create an Azure Machine Learning compute (AmlCompute) cluster with 4 nodes as follows.

As you can see, here we use the Standard_NC4as_T4_v3 VM size, which has 1 Tesla T4 GPU per node, for our training cluster.

az ml compute create --name cluster01 \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace \
  --type amlcompute \
  --min-instances 0 \
  --max-instances 4 \
  --size Standard_NC4as_T4_v3
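If you want to check that the cluster has been created (an optional verification step), you can show its properties with the following command.

az ml compute show --name cluster01 \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace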

To run training in an environment in which MXNet is installed, I create and register a custom AML environment as follows.

az ml command

az ml environment create --file env_distributed_mxnet.yml \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

env_distributed_mxnet.yml

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: mxnet-distribution-env
image: tsmatz/azureml-openmpi:0.1.0-gpu
conda_file: conda_distributed_mxnet.yml
description: environment for mxnet distribution

conda_distributed_mxnet.yml

name: mxnet_environment
dependencies:
- python=3.6
- pip:
  - mxnet-cu90
channels:
- anaconda
- conda-forge
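To confirm that the environment has been registered in the workspace (again an optional check), you can list the registered environments as follows.

az ml environment list \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace \
  --output table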

Now I'll manually launch the MXNet distributed training processes without using MXNet's built-in launch.py.
To set MXNet roles and start training (running cifar10_dist.py), I have created the following shell script (run.sh).

As I mentioned above, I use the built-in environment variables that are available in an AML distributed job.

run.sh

# setup MXNet role
if [ $OMPI_COMM_WORLD_RANK -eq 0 ]
then
  export DMLC_ROLE=scheduler
elif [ $OMPI_COMM_WORLD_RANK -eq 1 ]
then
  export DMLC_ROLE=server
else
  export DMLC_ROLE=worker
fi
export DMLC_PS_ROOT_URI=$AZ_BATCHAI_MPI_MASTER_NODE
export DMLC_PS_ROOT_PORT=9092
export DMLC_NUM_SERVER=1
export DMLC_NUM_WORKER=2
# run training
python cifar10_dist.py

Now let's start (run) the distributed training on the Azure ML compute (AmlCompute) cluster as follows.
As you can see below, this job runs run.sh (see above) for training.

az ml command

az ml job create --file train_distributed_mxnet.yml \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

train_distributed_mxnet.yml

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: script-mxnet
command: bash run.sh
environment: azureml:mxnet-distribution-env@latest
compute: azureml:cluster01
display_name: mxnet_dist_test
experiment_name: mxnet_dist_test
resources:
  instance_count: 4
distribution:
  type: mpi
  process_count_per_instance: 1
description: MXNet distributed training

After your script completes (it will take several minutes), you can see the resulting output (accuracy) in the driver logs.
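You can also follow or collect these logs with the CLI. The following is a minimal sketch, assuming <job_name> is the job name returned when the job was submitted: stream the logs while the job is running, then download the job output after it completes.

# Stream logs of the running job to your terminal
az ml job stream --name <job_name> \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace
# Download the job output into the current directory
az ml job download --name <job_name> \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace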

You can download and run this example from here.

 

[Change Logs]

Aug 2022    Updated example for Azure ML CLI v2

 

