Distributed training with PyTorch and Azure ML

By Beatriz Stollnitz, Principal Cloud Advocate at Microsoft

Overview of distributed training

Adding distributed training to Azure ML code

    # Compute cluster of GPU instances that the distributed job will run on.
    cluster = AmlCompute(
        ...
        type="amlcompute",
        ...
    )
    # Environment based on a curated image with OpenMPI, CUDA, and cuDNN.
    environment = Environment(image="mcr.microsoft.com/azureml/" +
                              "openmpi4.1.0-cuda11.1-cudnn8-ubuntu20.04:latest",
                              conda_file=CONDA_PATH)
    # The job runs on 2 instances with 4 training processes per instance,
    # typically one process per GPU.
    job = command(
        ...
        resources=dict(instance_count=2),
        distribution=dict(type="PyTorch", process_count_per_instance=4),
        ...
    )

With this configuration, Azure ML sets the following environment variables for each training process:
  • WORLD_SIZE — The total number of processes across all instances. With the configuration above, that's 2 instances × 4 processes per instance = 8.
  • NODE_RANK — The index of the current instance. The first instance has NODE_RANK zero.
  • MASTER_ADDR — The IP address of the first instance.
  • MASTER_PORT — An available port on the first instance.
  • LOCAL_RANK — The index of the current process within its instance.
  • RANK — The global index of the current process (among all processes on all instances).
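
As a quick illustration of how these values fit together under the configuration above, a training script can read them straight from the environment. The snippet below is a minimal sketch for orientation, not part of the article's code:

    import os

    # With instance_count=2 and process_count_per_instance=4, Azure ML launches
    # 2 x 4 = 8 training processes, and every one of them sees the same WORLD_SIZE.
    world_size = int(os.environ["WORLD_SIZE"])   # 8
    node_rank = int(os.environ["NODE_RANK"])     # 0 or 1
    local_rank = int(os.environ["LOCAL_RANK"])   # 0..3 within this instance
    rank = int(os.environ["RANK"])               # 0..7 across both instances

    print(f"Process {rank} of {world_size} (node {node_rank}, local rank {local_rank})")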

Adding distributed training to PyTorch code

In the training script, we initialize the process group by calling torch.distributed.init_process_group, which needs two pieces of information:
  • The backend, which determines how the processes communicate with each other. The backends available to us are “gloo,” “mpi,” and “nccl.” We choose “nccl” because we want distributed GPU training.
  • The initialization method, which determines how we want to initialize information needed during training. This information can be initialized using TCP, a shared file system, or environment variables. We’ll choose environment variable initialization, so that PyTorch will look for the environment variables that Azure ML sets automatically.
    # Use the NCCL backend for distributed GPU training; "env://" tells PyTorch to
    # read the environment variables that Azure ML sets automatically.
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    # Each process drives the GPU whose index matches its local rank.
    device = torch.device("cuda", local_rank)
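
The snippets pass device around explicitly. A common companion step when using the NCCL backend, not shown in the article and purely an assumption here, is to also make that GPU the default CUDA device for the process:

    # Optional: pin this process's default CUDA device to its local rank.
    torch.cuda.set_device(local_rank)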
    from torch import nn
    ...
    # Wrap the model so that gradients are averaged across all processes during
    # the backward pass; each process keeps a full replica of the model.
    model = nn.parallel.DistributedDataParallel(
        module=NeuralNetwork().to(device), device_ids=[local_rank])
    # Every replica ends up with the same weights, so only the process with
    # global rank 0 saves the model.
    if rank == 0:
        save_model(model_dir, model)
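
save_model is a helper from the article's accompanying project, and its body isn't shown here. Below is a minimal sketch of what such a helper could look like for a DDP-wrapped model; the unwrapping through model.module, the file name, and the use of torch.save are assumptions:

    import os
    import torch

    def save_model(model_dir, model):
        # DistributedDataParallel stores the wrapped network in model.module;
        # saving that state_dict keeps the checkpoint loadable without DDP.
        os.makedirs(model_dir, exist_ok=True)
        torch.save(model.module.state_dict(), os.path.join(model_dir, "weights.pth"))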
    # The sampler gives each of the 8 processes a distinct shard of the training
    # data, so no two processes train on the same examples within an epoch.
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_data)
    train_loader = DataLoader(train_data,
                              batch_size=batch_size,
                              sampler=train_sampler)
    for epoch in range(epochs):
        ...
        # Reseed the sampler each epoch so the data is reshuffled across processes.
        train_sampler.set_epoch(epoch)
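
Putting the pieces together, the per-process training loop might look like the sketch below. The loss function, optimizer, and batch structure are assumptions; only the sampler, loader, device, and DDP-wrapped model come from the snippets above:

    loss_fn = nn.CrossEntropyLoss()                           # assumed loss
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # assumed optimizer

    for epoch in range(epochs):
        # Reshuffle this process's shard for the new epoch.
        train_sampler.set_epoch(epoch)
        for x, y in train_loader:
            # Each process moves its own mini-batch to its own GPU.
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            # DDP averages the gradients across all processes during backward().
            loss.backward()
            optimizer.step()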

Additional Resources:

Train compute-intensive models with Azure Machine Learning – Training | Microsoft Learn

Part 1: Training and Deploying Your PyTorch Model in the Cloud with Azure ML

Part 2: Training Your PyTorch Model Using Components and Pipelines in Azure ML

Part 3: Faster Training and Inference Using the Azure Container for PyTorch in Azure ML

Article originally posted here. Reposted with permission.
