RAPIDS-cuML: Train your Scikit-learn models on AAU GPUs¶

This tutorial show how to deploy RAPIDS-cuML-GPU Machine Learning Algorithms to efficiently train your scikit-learn models on the AAU GPUs avalaible through UCloud.

"cuML is a suite of libraries that implement machine learning algorithms and mathematical primitives functions. cuML enables data scientists, researchers, and software engineers to run traditional tabular ML tasks on GPUs without going into the details of CUDA programming. For large datasets, these GPU-based implementations can complete 10-50x faster than their CPU equivalents. For details on performance, see the cuML Benchmarks Notebook."

In most cases, cuML's Python API matches the API from scikit-learn. which will make it easy to navigate from scikit-learn to RAPIDS-cuML

A table of the supported algoritmns can be found here.

This tutorial will use a random forrest which can be found on the cuML notebook examples

The following python script is needed to replicate this tutorial:

multigpu_rapids_cuml.py (Can be used as template)

Prerequisite reading:

Update VM¶

sudo apt update
sudo apt upgrade -y 
sudo apt install nvidia-driver-525 nvidia-utils-525 -y  # Or newer version

Activate Conda¶

This can be done by either installing a conda from scratch or by deploying er prior installation. Please see "Using Conda for easy workflow deployment on AAU GPU VMs" for information.

# Download and install miniconda (If needed)
curl -s -L -o miniconda_installer.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh 
bash miniconda_installer.sh -b -f -p miniconda3

# Set conda to path
export PATH=/home/ucloud/miniconda3/bin:$PATH # Set conda to path

# initialize conda
conda init && bash -i

# Reboot VM
sudo reboot

Re-connect to VM using SSH¶

ssh ucloud@IP_address_from_the_red_mark

Check nvidia driver Configuration¶

nvidia-smi

# Expected Output
Mon Aug  7 09:38:25 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:05.0 Off |                    0 |
| N/A   70C    P0    31W /  70W |      0MiB / 15109MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Create or re-load a RAPIDS Conda environment¶

look for latest RAPIDS installation at https://docs.rapids.ai/install

# Create pytorch conda environment if non-exist on the Conda installation
conda deactivate
conda create --solver=libmamba -n rapids -c rapidsai -c conda-forge -c nvidia  \
    rapids=23.08 python=3.10 cuda-version=11.8


# Set pre-installed conda libraries to path (including cudatoolkit=11.2 cudnn=8.1.0 )
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/

Transfer Files and Folders (SSH-Copy) to VM¶

Open a second terminal (1st terminal is connected to the VM):

scp -r "C:\path\pytorch_folder" ucloud@IP_address_from_the_red_mark:

Run a Random Forrest training on multiple GPUs:¶

python multigpu_rapids_cuml.py

Track the GPU usage¶

nvidia-smi -l 5 # Will update every 5 seconds

# Expected Output
Mon Aug  7 09:38:25 2023
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1011      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2312      C   python                           1324MiB |
|    0   N/A  N/A      2381      C   ...a3/envs/rapids/bin/python     1042MiB |
|    1   N/A  N/A      1011      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2383      C   ...a3/envs/rapids/bin/python     1042MiB |
+-----------------------------------------------------------------------------+
Tue Aug 29 11:04:31 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:05.0 Off |                    0 |
| N/A   38C    P0    49W /  70W |   2389MiB / 15360MiB |     93%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:06.0 Off |                    0 |
| N/A   36C    P0    53W /  70W |   1067MiB / 15360MiB |     93%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Transfer Results and Conda enviroment local machine (SSH-Copy)¶

Open a second terminal (1st terminal is connected to the VM):

scp -r ucloud@IP_address_from_the_red_mark:/home/ucloud/folder "C:\path-to-folder"