Tensorflow: Train your deep-learning models on AAU GPUs¶
This tutorial show how to deploy "Distributed Data Parallel (DDP) using Tensorflow/Keras" to efficiently train your deep-learning models on the AAU GPUs avalaible through UCloud.
See here for a more detailed tutorial on DDP using Tensorflow.
This tutorial specifically focuses on Multi-GPU DDP Training with fault tolerance
The following python script is needed to replicate this tutorial:
- multigpu_torchrun.py (Can be used as template)
Prerequisite reading:
Update VM¶
sudo apt update
sudo apt upgrade -y
sudo apt install nvidia-driver-525 nvidia-utils-525 -y # Or newer version
Activate Conda¶
This can be done by either installing a conda from scratch or by deploying er prior installation. Please see "Using Conda for easy workflow deployment on AAU GPU VMs" for information.
# Download and install miniconda (If needed)
curl -s -L -o miniconda_installer.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash miniconda_installer.sh -b -f -p miniconda3
# Set conda to path
export PATH=/home/ucloud/miniconda3/bin:$PATH # Set conda to path
# initialize conda
conda init && bash -i
# Reboot VM
sudo reboot
Re-connect to VM using SSH¶
ssh ucloud@IP_address_from_the_red_mark
Check nvidia driver Configuration¶
nvidia-smi
# Expected Output
Mon Aug 7 09:38:25 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02 Driver Version: 470.199.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:05.0 Off | 0 |
| N/A 70C P0 31W / 70W | 0MiB / 15109MiB | 7% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Create or re-load a Tensorflow/Keras Conda environment¶
# Create pytorch conda environment if non-exist on the Conda installation
conda deactivate
conda create --name tensorflow
conda activate tensorflow
conda install -c anaconda tensorflow-gpu
# Set pre-installed conda libraries to path (including cudatoolkit=11.2 cudnn=8.1.0 )
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
Transfer Files and Folders (SSH-Copy) to VM¶
Open a second terminal (1st terminal is connected to the VM):
scp -r "C:\path\pytorch_folder" ucloud@IP_address_from_the_red_mark:
Run Tensorflow training in DDP mode:¶
In this example a model is trained for 10 epocs using 2 GPUs with a model checkpoint being saved with "ckpt" folder for each epoc and the final model being saved as "final_model.keras".
functions "get_compiled_model" and "get_dataset" need to be changed to adjust training data and model.
# Run the model with 10 epcos and 2 GPUs
python multigpu_tensorflow.py 10 2
# Run the model with 10 epcos and all avaiable GPUs
python multigpu_tensorflow.py 10
Track the GPU usage¶
nvidia-smi -l 5 # Will update every 5 seconds
# Expected Output
Mon Aug 7 09:38:25 2023
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1011 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2312 C python 1324MiB |
| 0 N/A N/A 2381 C ...a3/envs/rapids/bin/python 1042MiB |
| 1 N/A N/A 1011 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2383 C ...a3/envs/rapids/bin/python 1042MiB |
+-----------------------------------------------------------------------------+
Tue Aug 29 11:04:31 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:05.0 Off | 0 |
| N/A 38C P0 49W / 70W | 2389MiB / 15360MiB | 93% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:00:06.0 Off | 0 |
| N/A 36C P0 53W / 70W | 1067MiB / 15360MiB | 93% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Transfer Results and Conda enviroment local machine (SSH-Copy)¶
Open a second terminal (1st terminal is connected to the VM):
scp -r ucloud@IP_address_from_the_red_mark:/home/ucloud/folder "C:\path-to-folder"