
PyTorch: Train your deep-learning models on AAU GPUs

This tutorial shows how to use "Distributed Data Parallel (DDP) in PyTorch" to efficiently train your deep-learning models on the AAU GPUs available through UCloud.

See the official PyTorch DDP tutorial series for a more detailed tutorial on DDP with PyTorch.

This tutorial specifically focuses on Part 4: Multi-GPU DDP Training with fault tolerance using Torchrun.

The Python script multigpu_torchrun.py, used in the commands below, is needed to replicate this tutorial.

Prerequisite reading:

Update VM

sudo apt update
sudo apt upgrade -y 
sudo apt install nvidia-driver-525 nvidia-utils-525 -y  # Or newer version

Activate Conda

This can be done either by installing Conda from scratch or by deploying a prior installation. Please see "Using Conda for easy workflow deployment on AAU GPU VMs" for more information.

# Download and install miniconda (If needed)
curl -s -L -o miniconda_installer.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh 
bash miniconda_installer.sh -b -f -p miniconda3

# Add conda to PATH
export PATH=/home/ucloud/miniconda3/bin:$PATH

# Initialize conda
conda init && bash -i

# Reboot VM
sudo reboot

Re-connect to VM using SSH

ssh ucloud@IP_address_from_the_red_mark

Check NVIDIA driver configuration

nvidia-smi

# Expected Output
Mon Aug  7 09:38:25 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:05.0 Off |                    0 |
| N/A   70C    P0    31W /  70W |      0MiB / 15109MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Create or re-load a PyTorch conda environment

Look up the latest PyTorch installation command at https://pytorch.org/get-started/locally/

# Create a pytorch conda environment if one does not already exist in the Conda installation
conda deactivate
conda create --name pytorch
conda activate pytorch
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# Add pre-installed conda libraries to the library path (including cudatoolkit=11.2 and cudnn=8.1.0)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
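
Before moving on, it can be useful to confirm that PyTorch actually detects the CUDA driver and the GPUs from inside the new environment. Below is a minimal check using standard PyTorch calls (the script name check_gpus.py is just an example):

# check_gpus.py - verify that PyTorch sees the CUDA driver and all GPUs
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
print("GPU count:      ", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")

Run it with "python check_gpus.py" inside the activated pytorch environment; it should report one line per GPU (e.g. the Tesla T4s shown by nvidia-smi above).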

Transfer Files and Folders (SSH-Copy) to VM

Open a second terminal (1st terminal is connected to the VM):

scp -r "C:\path\pytorch_folder" ucloud@IP_address_from_the_red_mark:

Run PyTorch training in DDP mode:

In this example a model is trained for 50 epochs, with a model snapshot ("snapshot.pt") being saved every 10 epochs and the final model being saved as "finalmodel.pt".

Lines 78 and 79 of multigpu_torchrun.py need to be changed to adjust the training data and model:

78 train_set = MyTrainDataset(2048)  # load your dataset
79 model = torch.nn.Linear(20, 1)    # load your model
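
For orientation, the sketch below outlines the key ingredients a torchrun-driven DDP script such as multigpu_torchrun.py builds on: init_process_group, the LOCAL_RANK environment variable set by torchrun, DistributedDataParallel, and a DistributedSampler. It is a minimal illustration with placeholder data and model, not the full tutorial script (which also handles snapshot saving and loading):

# ddp_sketch.py - minimal outline of a torchrun-compatible DDP training loop
import os
import torch
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main(total_epochs: int):
    # torchrun sets LOCAL_RANK (and RANK/WORLD_SIZE) for every process it spawns
    init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder data and model - replace with your own (cf. lines 78-79 above)
    dataset = TensorDataset(torch.randn(2048, 20), torch.randn(2048, 1))
    model = torch.nn.Linear(20, 1).to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # DistributedSampler gives each GPU process a disjoint shard of the data
    loader = DataLoader(dataset, batch_size=32, sampler=DistributedSampler(dataset))
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for epoch in range(total_epochs):
        loader.sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for source, targets in loader:
            source, targets = source.to(local_rank), targets.to(local_rank)
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(source), targets)
            loss.backward()
            optimizer.step()

    destroy_process_group()

if __name__ == "__main__":
    main(total_epochs=50)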

"--nproc_per_node=" determines the number of GPUs to use. "gpu" pytorch will utilise all avaliable GPUs.

torchrun --standalone --nproc_per_node=gpu multigpu_torchrun.py 50 10
$ torchrun --standalone --nproc_per_node=2 multigpu_torchrun.py 50 10
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[GPU0] Epoch 0 | Batchsize: 32 | Steps: 32
[GPU1] Epoch 0 | Batchsize: 32 | Steps: 32
Epoch 0 | Training snapshot saved at snapshot.pt
[GPU1] Epoch 1 | Batchsize: 32 | Steps: 32
[GPU0] Epoch 1 | Batchsize: 32 | Steps: 32
[GPU0] Epoch 2 | Batchsize: 32 | Steps: 32
[GPU1] Epoch 2 | Batchsize: 32 | Steps: 32
[GPU0] Epoch 3 | Batchsize: 32 | Steps: 32
[GPU1] Epoch 3 | Batchsize: 32 | Steps: 32
...
[GPU0] Epoch 7 | Batchsize: 32 | Steps: 32
[GPU1] Epoch 7 | Batchsize: 32 | Steps: 32
[GPU0] Epoch 8 | Batchsize: 32 | Steps: 32
[GPU1] Epoch 8 | Batchsize: 32 | Steps: 32
[GPU0] Epoch 9 | Batchsize: 32 | Steps: 32
[GPU1] Epoch 9 | Batchsize: 32 | Steps: 32
[GPU0] Epoch 10 | Batchsize: 32 | Steps: 32
[GPU1] Epoch 10 | Batchsize: 32 | Steps: 32
Epoch 10 | Training snapshot saved at snapshot.pt
[GPU0] Epoch 11 | Batchsize: 32 | Steps: 32
[GPU1] Epoch 11 | Batchsize: 32 | Steps: 32
...
[GPU0] Epoch 47 | Batchsize: 32 | Steps: 32
[GPU1] Epoch 47 | Batchsize: 32 | Steps: 32
[GPU0] Epoch 48 | Batchsize: 32 | Steps: 32
[GPU1] Epoch 48 | Batchsize: 32 | Steps: 32
[GPU0] Epoch 49 | Batchsize: 32 | Steps: 32
[GPU1] Epoch 49 | Batchsize: 32 | Steps: 32
Final model saved at finalmodel.pt
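
After training completes, the saved weights can be loaded back for evaluation or inference. A minimal sketch, assuming finalmodel.pt holds the model's state_dict and the same torch.nn.Linear(20, 1) placeholder architecture as above (adjust both to match what your adapted script actually saves):

# load_final_model.py - load the trained weights for inference
import torch

model = torch.nn.Linear(20, 1)              # must match the trained architecture
state_dict = torch.load("finalmodel.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

with torch.no_grad():
    prediction = model(torch.randn(1, 20))  # dummy input with the expected feature shape
print(prediction)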

Track the GPU usage

nvidia-smi -l 5 # Will update every 5 seconds

# Expected Output
Mon Aug  7 09:38:25 2023
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1011      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2312      C   python                           1324MiB |
|    0   N/A  N/A      2381      C   ...a3/envs/rapids/bin/python     1042MiB |
|    1   N/A  N/A      1011      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2383      C   ...a3/envs/rapids/bin/python     1042MiB |
+-----------------------------------------------------------------------------+
Tue Aug 29 11:04:31 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:05.0 Off |                    0 |
| N/A   38C    P0    49W /  70W |   2389MiB / 15360MiB |     93%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:06.0 Off |                    0 |
| N/A   36C    P0    53W /  70W |   1067MiB / 15360MiB |     93%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
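
As a supplement to polling nvidia-smi, memory use can also be logged from inside the training script itself. A small sketch using standard torch.cuda calls, assuming the LOCAL_RANK variable set by torchrun:

# gpu_mem_log.py - report per-GPU memory use from within a PyTorch process
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", 0))   # set by torchrun; defaults to 0
allocated = torch.cuda.memory_allocated(local_rank) / 1024**2
reserved = torch.cuda.memory_reserved(local_rank) / 1024**2
print(f"[GPU{local_rank}] allocated: {allocated:.0f} MiB | reserved: {reserved:.0f} MiB")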

Transfer Results and Conda Environment to Local Machine (SSH-Copy)

Open a second terminal (1st terminal is connected to the VM):

scp -r ucloud@IP_address_from_the_red_mark:/home/ucloud/folder "C:\path-to-folder"