
Set up SLURM Cluster for AI Training and Inference

Jul 02, 2024

GreenNode
 

As AI projects continuously expand in complexity and scale, the demand for robust high-performance computing (HPC) environments skyrockets. This growth necessitates efficient resource management – a challenge that SLURM (Simple Linux Utility for Resource Management) tackles head-on. 

So what is SLURM, and how does this highly scalable workload manager work on Linux clusters? In this blog, we'll walk you through the SLURM concept and show you how to set up a SLURM cluster tailored for AI projects. This setup will help you efficiently allocate resources, schedule jobs, and maximize the use of your computing infrastructure for AI training and inference. 

Let’s dive in! 

1. Introduction to SLURM

SLURM (Simple Linux Utility for Resource Management) is an open-source workload manager designed to be the conductor of your high-performance computing (HPC) symphony.  Imagine having an orchestra of powerful machines, each with its own processing prowess.  During the performance, SLURM acts as the maestro. More than just a workload manager, SLURM ensures: 

  • Tasks are accurately distributed across your compute cluster 
  • Resources like CPU cores, memory, and even specialized hardware like GPUs are appropriately allocated 
  • Resources are redistributed as jobs start and finish, keeping the cluster scalable as workloads grow 

In short, SLURM is the central nervous system of your HPC environment, enabling AI engineers to efficiently harness the power of their compute clusters for tackling the most demanding AI challenges. 

2. Prerequisites

Before setting up a SLURM cluster, you’ll need the following: 

  • A cluster of machines running a Linux distribution (e.g., Ubuntu, CentOS) with network connectivity between them 
  • Root access to the machines in the cluster 

Lab environment: 

  • Head node (controller node) running Ubuntu 22.04 
    - hostname: cls-wrk105, IP: 10.237.96.105 
  • Compute nodes running Ubuntu 22.04 
    - hostname: cls-wrk105, IP: 10.237.96.105 
    - hostname: cls-wrk106, IP: 10.237.96.106 
    Slurm-1.jpg

You can see node information as follows:

Slurm-2.jpg
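If the .lab.local hostnames used later in slurm.conf don't already resolve (for example via DNS), you can add them to /etc/hosts on every node. A minimal sketch, assuming the lab IPs above: 

root@cls-wrk105:~# cat << EOF >> /etc/hosts 
10.237.96.105 cls-wrk105.lab.local cls-wrk105 
10.237.96.106 cls-wrk106.lab.local cls-wrk106 
EOF 

Repeat the same entries on cls-wrk106 so the controller and the compute daemons can reach each other by name. 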

3. Install and configure Slurm on head node

To create a SLURM cluster, start by installing and configuring SLURM on the head node. 

3.1. Install SLURM 

First, update the package index and install slurmctld on the head node: 

root@cls-wrk105:~# sudo apt update 
root@cls-wrk105:~# sudo apt install slurmctld 

You should see the following result: 

Slurm-3.jpg
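You can confirm which SLURM version was installed with: 

root@cls-wrk105:~# slurmctld -V 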
3.2. Munge Authentication 

Next, you need to ensure munge is running. To check this, do as follows: 

root@cls-wrk105:~# systemctl status munge 

You should see the following result: 

Slurm-4.jpg
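Optionally, you can sanity-check that MUNGE can encode and decode a credential locally, and (once the key has been copied to the compute nodes in section 4) across nodes: 

root@cls-wrk105:~# munge -n | unmunge 
root@cls-wrk105:~# munge -n | ssh cls-wrk106 unmunge 

Both commands should report a successful decode (STATUS: Success); a decode failure across nodes usually means the munge.key files differ. 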

3.3. Configure SLURM 

Configure SLURM by editing the main configuration file, usually located at /etc/slurm/slurm.conf.  

Below is a basic configuration example: 

root@cls-wrk105:~# sudo cat << EOF > /etc/slurm/slurm.conf 
ClusterName=lab-cluster 
SlurmctldHost=cls-wrk105.lab.local 
ProctrackType=proctrack/linuxproc 
SlurmctldPidFile=/var/run/slurmctld.pid 
SlurmctldPort=6817 
SlurmdPidFile=/var/run/slurmd.pid 
SlurmdPort=6818 
SlurmdSpoolDir=/var/lib/slurm/slurmd 
SlurmUser=slurm 
StateSaveLocation=/var/lib/slurm/slurmctld 
SwitchType=switch/none 
TaskPlugin=task/none 
# 
# TIMERS 
InactiveLimit=0 
KillWait=30 
MinJobAge=300 
SlurmctldTimeout=120 
SlurmdTimeout=300 
Waittime=0 
# SCHEDULING 
SchedulerType=sched/backfill 
SelectType=select/cons_tres 
SelectTypeParameters=CR_Core 
# 
#AccountingStoragePort= 
AccountingStorageType=accounting_storage/none 
JobCompType=jobcomp/none 
JobAcctGatherFrequency=30 
JobAcctGatherType=jobacct_gather/none 
SlurmctldDebug=info 
SlurmctldLogFile=/var/log/slurm/slurmctld.log 
SlurmdDebug=info 
SlurmdLogFile=/var/log/slurm/slurmd.log 
# 
# COMPUTE NODES 
DefMemPerNode=32000 
GresTypes=gpu 
NodeName=cls-wrk105.lab.local Gres=gpu:8 CPUs=64 RealMemory=504000 State=UNKNOWN 
NodeName=cls-wrk106.lab.local Gres=gpu:7 CPUs=40 RealMemory=252000 State=UNKNOWN 
PartitionName=gpu Nodes=ALL Default=YES MaxTime=INFINITE State=UP 
EOF 

If you have GPU resources, you can configure the gres field in the slurm.conf file to specify the number of GPUs available on each node. For example, if you have two nodes with different GPU configurations, you can set up the gres field as follows in the slurm.conf file: 

GresTypes=gpu 
NodeName=cls-wrk105.lab.local Gres=gpu:8 CPUs=64 RealMemory=504000 State=UNKNOWN 
NodeName=cls-wrk106.lab.local Gres=gpu:7 CPUs=40 RealMemory=252000 State=UNKNOWN 

In this example, cls-wrk105.lab.local has 8 GPUs, 64 CPUs, and 504GB of memory, while cls-wrk106.lab.local has 7 GPUs, 40 CPUs, and 252GB of memory. You can adjust these values based on your hardware configuration. See more: https://slurm.schedmd.com/slurm.conf.html 
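If you are unsure which CPUs and RealMemory values to use, Slurm can report them for you: once slurmd is installed on a node (see section 4), running slurmd -C prints a ready-made NodeName line with the sockets, cores, threads, and memory it detects, which you can adapt for the COMPUTE NODES section of slurm.conf: 

root@cls-wrk106:~# slurmd -C 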

3.4. Configure gres for GPU 

Ensure you install the NVIDIA driver and CUDA toolkit on each node to access the GPU resources. 

cls-wrk105: 8 GPUs 

You can see the list of GPUs as follows: 

Slurm-5.jpg

cls-wrk106: 7 GPUs 

Slurm-6.jpg
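The gres.conf file below refers to the GPU device files under /dev. You can list them on each node to confirm how many are present; the numbered entries (/dev/nvidia0, /dev/nvidia1, ...) are the ones gres.conf points at: 

root@cls-wrk105:~# ls /dev/nvidia* 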

Set up the /etc/slurm/gres.conf file to specify the GPU type and number of GPUs available on each node: 

root@cls-wrk105:~# sudo cat << EOF > /etc/slurm/gres.conf 
NodeName=cls-wrk105.lab.local Name=gpu File=/dev/nvidia0 
NodeName=cls-wrk105.lab.local Name=gpu File=/dev/nvidia1 
NodeName=cls-wrk105.lab.local Name=gpu File=/dev/nvidia2 
NodeName=cls-wrk105.lab.local Name=gpu File=/dev/nvidia3 
NodeName=cls-wrk105.lab.local Name=gpu File=/dev/nvidia4 
NodeName=cls-wrk105.lab.local Name=gpu File=/dev/nvidia5 
NodeName=cls-wrk105.lab.local Name=gpu File=/dev/nvidia6 
NodeName=cls-wrk105.lab.local Name=gpu File=/dev/nvidia7 
NodeName=cls-wrk106.lab.local Name=gpu File=/dev/nvidia0 
NodeName=cls-wrk106.lab.local Name=gpu File=/dev/nvidia1 
NodeName=cls-wrk106.lab.local Name=gpu File=/dev/nvidia2 
NodeName=cls-wrk106.lab.local Name=gpu File=/dev/nvidia3 
NodeName=cls-wrk106.lab.local Name=gpu File=/dev/nvidia4 
NodeName=cls-wrk106.lab.local Name=gpu File=/dev/nvidia5 
NodeName=cls-wrk106.lab.local Name=gpu File=/dev/nvidia6 
EOF 


If you build SLURM from source, you can enable NVML support by adding the --with-nvml flag to the configure command. This will allow SLURM to monitor GPU usage and temperature and enforce resource limits based on GPU utilization. 

In that case, you don't need to list every GPU device for each node in the gres.conf file; just add the following line instead: 

AutoDetect=nvml 

See more:

- https://slurm.schedmd.com/gres.conf.html  

- https://slurm.schedmd.com/gres.html 

Start the SLURM service on the head node: 

root@cls-wrk105:~# sudo systemctl start slurmctld 

You should see the following result: 

Slurm-7.jpg
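The slurmctld unit may already be enabled by the package; if not, you can enable it so the controller starts automatically after a reboot: 

root@cls-wrk105:~# sudo systemctl enable slurmctld 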

4. Install SLURM on compute nodes

To install SLURM on the compute nodes, do as follows: 

1. Install SLURM on the compute nodes: 

root@cls-wrk105:~# sudo apt update 
root@cls-wrk105:~# sudo apt install slurmd 

root@cls-wrk106:~# sudo apt update 
root@cls-wrk106:~# sudo apt install slurmd 

2. Copy the SLURM configuration and munge key files from the head node to each compute node: 

root@cls-wrk105:~# scp /etc/slurm/*.conf root@cls-wrk106.lab.local:/etc/slurm/ 
root@cls-wrk105:~# scp /etc/munge/munge.key root@cls-wrk106.lab.local:/etc/munge/munge.key 
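Depending on how the key is copied, you may need to fix its owner and permissions on the compute node so the munge daemon will accept it: 

root@cls-wrk106:~# sudo chown munge:munge /etc/munge/munge.key 
root@cls-wrk106:~# sudo chmod 400 /etc/munge/munge.key 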

3. Restart the munge service on each compute node: 

root@cls-wrk106:~# sudo systemctl restart munge 

4. Start the SLURM service on each compute node: 

root@cls-wrk105:~# sudo systemctl start slurmd 
root@cls-wrk106:~# sudo systemctl start slurmd 
Slurm-8.jpg

5. Testing the SLURM Cluster

1. Set the nodes to the idle state to allow jobs to be scheduled on them. You can do this using the scontrol command: 

root@cls-wrk105:~# scontrol update nodename=cls-wrk105.lab.local,cls-wrk106.lab.local state=idle 

2. Check the status of your nodes: 

root@cls-wrk105:~# sinfo 

Slurm-9.jpg

3. Run a test job to verify that the SLURM cluster is working correctly: 

root@cls-wrk105:~# srun -N2 hostname 

Slurm-10.jpg
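Before submitting a full training job, a quick allocation test (assuming the GPU gres configured above) confirms that SLURM can schedule a job with a GPU request onto the gpu partition: 

root@cls-wrk105:~# srun --gres=gpu:1 nvidia-smi -L 

If the job runs and lists the GPUs on the node it landed on, GPU resources are being scheduled as expected. 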

You can check some basic SLURM commands here: https://slurm.schedmd.com/quickstart.html#commands 

6. Run a GPU job

1. Install Apptainer

Install Apptainer to run the sample job on the compute nodes. Apptainer is a container runtime that allows you to run containerized applications on HPC clusters. You can install Apptainer from the PPA repository as follows: 

root@cls-wrk105:~# sudo add-apt-repository -y ppa:apptainer/ppa 
root@cls-wrk105:~# sudo apt update 
root@cls-wrk105:~# sudo apt install -y apptainer 
root@cls-wrk106:~# sudo add-apt-repository -y ppa:apptainer/ppa 
root@cls-wrk106:~# sudo apt update 
root@cls-wrk106:~# sudo apt install -y apptainer 
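You can confirm the installation on each node with: 

root@cls-wrk105:~# apptainer --version 
root@cls-wrk106:~# apptainer --version 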

2. Run sample job

Here is an example batch script that runs a PyTorch distributed data parallel (DDP) job on the SLURM cluster (the heredoc delimiter is quoted so that the $ variables are written into the script rather than expanded by the current shell): 

root@cls-wrk105:~# cat << 'EOF' > ./sbatch_torchrun_ddp.job 
#!/bin/bash 
#SBATCH --job-name=multinode-random 
#SBATCH --nodes=2 
#SBATCH --time=2:00:00 
#SBATCH --gres=gpu:2 
#SBATCH -o /tmp/multinode-random.%N.%J.%u.out # STDOUT 
#SBATCH -e /tmp/multinode-random.%N.%J.%u.err # STDERR 
#SBATCH --ntasks=2 
 
# Determine the head node and its IP address for the torchrun rendezvous 
nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST")) 
head_node=${nodes[0]} 
echo "Head node: $head_node" 
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address | cut -d" " -f2) 
echo "Master Node IP: $head_node_ip" 
export LOGLEVEL=INFO 
 
srun rm -rf /tmp/slurm-exec-instruction 
srun git clone https://github.com/QuangNamVu/slurm-exec-instruction /tmp/slurm-exec-instruction 
 
srun apptainer run \ 
   --nv --bind /tmp/slurm-exec-instruction/torch_dist_train/src/:/mnt \ 
   docker://pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel \ 
   torchrun --nnodes=2 \ 
   --nproc_per_node=1 \ 
   --rdzv_id=100 \ 
   --rdzv_backend=c10d \ 
   --rdzv_endpoint=$head_node_ip:29400 \ 
   /mnt/rand_ddp.py --batch_size 128 --total_epochs 10000 --save_every 50 
EOF 

3. Run the job using the sbatch command: 

root@cls-wrk105:~# sbatch ./sbatch_torchrun_ddp.job 

Slurm will schedule the job on the compute nodes and run the PyTorch DDP job using the specified resources. You can monitor the job using the squeue command and check the output files for the job. 
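For example, to list queued and running jobs and to inspect a specific job's allocation (using the job ID that sbatch printed): 

root@cls-wrk105:~# squeue 
root@cls-wrk105:~# scontrol show job <job_id> 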

Process running on cls-wrk105: 

Slurm-11.jpg

Process running on cls-wrk106: 

Slurm-12.jpg

The output file template was defined in sbatch_torchrun_ddp.job: 

#SBATCH -o /tmp/multinode-random.%N.%J.%u.out # STDOUT 

Here is a sample output of job ID 19, run on cls-wrk105 by the root user: 

root@cls-wrk105:~# cat /tmp/multinode-random.cls-wrk105.19.root.out 

You can see the sample output as follows: 

..... 
starting main 
delete snapshot file 
starting main 
delete snapshot file 
using backend NCCL 
==================== 
0 
==================== 
using backend NCCL 
==================== 
0 
==================== 
____________________ 
cuda:0 
____________________ 
____________________ 
cuda:0 
____________________ 
[GPU0] Epoch 0 | Batchsize: 64 | Steps: 1 
[GPU1] Epoch 0 | Batchsize: 64 | Steps: 1 
..... 

Conclusion

Setting up a SLURM cluster can greatly improve the efficiency and scalability of your compute resources for AI training and inference. 

SLURM's powerful job scheduling and resource management capabilities make it an essential tool for managing complex AI projects. We hope this guide has been helpful in setting up your SLURM cluster and optimizing your AI workflows. Follow GreenNode's newsletter on LinkedIn for more valuable insights and in-depth knowledge of the fast-paced AI industry. 

 
