For SLURM documentation, see https://slurm.schedmd.com/
Simply put, SLURM is a job scheduler that lets users allocate compute resources for computational jobs.
The following are my old notes from the time I migrated my HomeWood HPC cluster from Maui + Torque to SLURM.
In this example, I use two hosts to illustrate installing and configuring a basic SLURM setup:
Distro: CentOS 6
Management host: mgmt
Compute host: compute001
#Perform on both mgmt + compute001:
yum update
yum install wget gcc gcc-c++ make kernel-devel kernel-headers perl rpm-build -y
yum -y install epel-release
#Verify the munge user has the same uid + gid on both mgmt + compute001
#Or remove user "munge" and re-create it with matching ids
#Add the slurm user with the exact same uid + gid on both mgmt + nodes:
groupadd -g 497 slurm
useradd -m -c "SLURM ID" -d /var/lib/slurm -u 497 -g slurm -s /bin/bash slurm
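Since munge and SLURM both depend on matching numeric ids, it can help to compare the passwd entries from each host. The sketch below is only an illustration: `passwd_ids` and `same_ids` are hypothetical helper names, and the sample lines stand in for real `getent` output.

```shell
#!/bin/sh
# passwd_ids: hypothetical helper; prints "uid:gid" from one passwd-format line.
passwd_ids() {
    printf '%s\n' "$1" | cut -d: -f3,4
}

# same_ids: succeeds only when two passwd-format lines carry the same uid:gid.
same_ids() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$(passwd_ids "$1")" = "$(passwd_ids "$2")" ]
}

# On a real cluster the two lines would come from, for example:
#   local_line=$(getent passwd slurm)
#   remote_line=$(ssh compute001 getent passwd slurm)
# Sample lines are used here so the sketch runs anywhere.
local_line='slurm:x:497:497:SLURM ID:/var/lib/slurm:/bin/bash'
remote_line='slurm:x:497:497:SLURM ID:/var/lib/slurm:/bin/bash'

if same_ids "$local_line" "$remote_line"; then
    echo "slurm uid/gid match"
else
    echo "slurm uid/gid differ" >&2
fi
```

The same comparison can be repeated for the munge user once the munge package is installed.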
#Download blcr:
wget http://crd.lbl.gov/assets/Uploads/FTG/Projects/CheckpointRestart/downloads/blcr-0.8.5.tar.gz
#Build the RPM:
rpmbuild -tb --define 'with_multilib 0' blcr-0.8.5.tar.gz
#Install the packages:
cd /root/rpmbuild/RPMS/x86_64/
yum install blcr* --nogpgcheck -y
#Install Munge:
yum install munge munge-libs munge-devel -y
#Only on mgmt:
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown -R munge: /var/log/munge
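#Still on mgmt: copy munge.key to every compute node (compute001 here):
scp /etc/munge/munge.key compute001:/etc/munge/
#Fix ownership and permissions (on compute001, and the same on mgmt):
chown -R munge: /etc/munge/
chown -R munge: /var/log/munge
#Directories need the execute bit, so only the key itself gets 0400
chmod 0700 /etc/munge/ /var/log/munge/
chmod 0400 /etc/munge/munge.key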
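Wrong modes under /etc/munge are a common reason munged refuses to start, so a quick check before starting the daemon can save time. This is a minimal sketch: `mode_is` and `report` are hypothetical helpers, and the expected modes (0700 for the directories, 0400 for the key) are what the MUNGE documentation recommends as far as I know, not something the daemon was asked about here.

```shell
#!/bin/sh
# mode_is: hypothetical helper; succeeds when PATH has exactly the octal MODE.
# Relies on GNU stat (-c), which is what CentOS ships.
mode_is() {
    actual=$(stat -c '%a' "$1" 2>/dev/null) || return 2
    [ "$actual" = "$2" ]
}

# report: print one OK/BAD line per path.
report() {
    if mode_is "$1" "$2"; then
        echo "OK   $1 ($2)"
    else
        echo "BAD  $1 (expected $2)"
    fi
}

# Assumed modes: directories 0700, key 0400.
report /etc/munge 700
report /etc/munge/munge.key 400
report /var/log/munge 700
```

Run it as root on each host; a missing path is reported as BAD as well.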
#Start munge. Either run the daemon directly (--force overrides its startup warnings):
/usr/sbin/munged --force
#or use the init script:
service munge start
#Verify Munge is working:
munge -n
munge -n | unmunge
munge -n | ssh compute001 unmunge
remunge
#Back to setting up SLURM
#Dependencies both mgmt + compute001:
yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool rrdtool-devel ncurses-devel man2html libibmad libibumad mysql-devel perl-ExtUtils-MakeMaker freeipmi gtk2-devel redhat-lsb redhat-rpm-config -y
#Turn off SELinux
#Enable port 6817,6818,7321 udp + tcp
for portnumber in 6817 6818 7321; do iptables -A INPUT -m state --state NEW -p tcp --dport $portnumber -j ACCEPT; done
for portnumber in 6817 6818 7321; do iptables -A INPUT -m state --state NEW -p udp --dport $portnumber -j ACCEPT; done
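Before touching a live firewall it can be handy to print the rules first and review them. The sketch below does only that; `print_rules` is a hypothetical helper and nothing is applied to iptables.

```shell
#!/bin/sh
# print_rules: hypothetical helper; prints (does not run) one iptables
# command per port and protocol so the rules can be reviewed first.
print_rules() {
    for port in "$@"; do
        for proto in tcp udp; do
            echo "iptables -A INPUT -m state --state NEW -p $proto --dport $port -j ACCEPT"
        done
    done
}

print_rules 6817 6818 7321
```

Piping the output to `sh` as root would apply the rules. On CentOS 6, `service iptables save` should then keep them across reboots.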
#Sync clock on all nodes and mgmt:
yum install ntp -y
chkconfig ntpd on
ntpdate pool.ntp.org
/etc/init.d/ntpd start
#Only on mgmt:
#Download latest slurm: http://www.schedmd.com/#repos
wget http://www.schedmd.com/download/latest/slurm-16.05.2.tar.bz2
rpmbuild -ta slurm-16.05.2.tar.bz2
#Now copy all the slurm*.rpm files to mgmt and compute001 (all_nodes is a placeholder for each hostname), then install them:
scp slurm* all_nodes:/tmp/
yum --nogpgcheck localinstall slurm*
#On mgmt: create slurm.conf with the configurator: http://slurm.schedmd.com/configurator.easy.html
#Or start from the example file:
cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
#then substitute your system info
#Once slurm.conf is configured, copy it to /etc/slurm/ on every compute node:
scp /etc/slurm/slurm.conf compute00x:/etc/slurm/
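For orientation, this is roughly what a minimal slurm.conf for the two hosts in these notes could contain. It is a sketch under assumptions: `write_slurm_conf` is a hypothetical helper, and `ClusterName`, `CPUs=1`, and the `debug` partition name are illustrative values, not taken from the original cluster. The paths and ports match the ones used elsewhere in these notes.

```shell
#!/bin/sh
# write_slurm_conf: hypothetical helper; writes a minimal slurm.conf to FILE
# for one controller and one compute node. Values are illustrative assumptions.
write_slurm_conf() {
    out=$1
    ctl=$2
    node=$3
    cat > "$out" <<EOF
ClusterName=cluster
ControlMachine=$ctl
SlurmUser=slurm
AuthType=auth/munge
SlurmctldPort=6817
SlurmdPort=6818
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
NodeName=$node CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=$node Default=YES MaxTime=INFINITE State=UP
EOF
}

# Write to a temp file for review instead of overwriting /etc/slurm/slurm.conf.
conf=$(mktemp)
write_slurm_conf "$conf" mgmt compute001
cat "$conf"
```

The `NodeName` line should be replaced with the real hardware values reported by `slurmd -C` on the compute node.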
#On mgmt only:
mkdir /var/spool/slurmctld
mkdir /var/spool/slurmd
chown slurm: /var/spool/slurmctld
chown slurm: /var/spool/slurmd
chmod 755 /var/spool/slurmctld
chmod 755 /var/spool/slurmd
touch /var/log/slurmctld.log
chown slurm: /var/log/slurmctld.log
touch /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
chown slurm: /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
#On compute001:
mkdir /var/spool/slurmd
chown slurm: /var/spool/slurmd
chmod 755 /var/spool/slurmd
touch /var/log/slurmd.log
chown slurm: /var/log/slurmd.log
#Verify it:
slurmd -C
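The `NodeName=` line that `slurmd -C` prints is the one that goes into slurm.conf. A small filter can pull it out; `node_line` is a hypothetical helper, and the sample text below is made up for illustration rather than captured from a real node.

```shell
#!/bin/sh
# node_line: hypothetical helper; keeps only the NodeName=... line from
# `slurmd -C` style output read on stdin.
node_line() {
    grep '^NodeName='
}

# Illustrative sample; real values come from running `slurmd -C` on the node.
sample='NodeName=compute001 CPUs=4 RealMemory=7821
UpTime=0-01:02:03'

printf '%s\n' "$sample" | node_line
# On the node itself:  slurmd -C | node_line
```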
#On both mgmt + compute001:
chkconfig slurm on
/etc/init.d/slurm start
#Return nodes to service (a single node, then a range):
scontrol update nodename=compute001 state=resume
scontrol update nodename=compute[001-100] state=resume
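To find out which nodes actually need a resume, the node list can be filtered by state. This is a sketch with a hypothetical helper, `down_nodes`, fed with made-up input; on a live cluster the input would come from `sinfo`.

```shell
#!/bin/sh
# down_nodes: hypothetical helper; reads "name state" pairs on stdin and prints
# the names whose state starts with "down" or "drain".
down_nodes() {
    while read -r name state; do
        case $state in
            down*|drain*) echo "$name" ;;
        esac
    done
}

# Illustrative input; on a live cluster:  sinfo -N -h -o '%N %t' | down_nodes
printf '%s\n' 'compute001 idle' 'compute002 down*' 'compute003 drain' | down_nodes
```

Each printed name can then be passed to `scontrol update nodename=... state=resume`.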
#Possible error:
slurm_receive_msg: Protocol authentication error
[2016-08-03T15:13:19.650] error: slurm_receive_msg [172.20.1.1:49082]: Protocol authentication error
[2016-08-03T15:13:19.650] error: invalid type trying to be freed 65534
[2016-08-03T15:13:20.663] error: Munge decode failed: Expired credential
#Solution: make sure ntpd keeps the clocks on both hosts in sync
#Sync time between server and node:
yum install ntp
ntpdate server_name
service ntpd start
chkconfig ntpd on
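The "Expired credential" message points at clock skew between the hosts, so comparing epoch times directly is a quick sanity check. A minimal sketch: `skew_seconds` and `skew_ok` are hypothetical helpers, and the 300-second limit is my assumption about munge's default credential lifetime.

```shell
#!/bin/sh
# skew_seconds: hypothetical helper; absolute difference of two epoch times.
skew_seconds() {
    d=$(( $1 - $2 ))
    if [ "$d" -lt 0 ]; then
        d=$(( 0 - d ))
    fi
    echo "$d"
}

# skew_ok: succeeds when the two clocks are within LIMIT seconds of each other.
skew_ok() {
    [ "$(skew_seconds "$1" "$2")" -le "$3" ]
}

local_now=$(date +%s)
# On a real cluster:  remote_now=$(ssh compute001 date +%s)
remote_now=$local_now   # placeholder so the sketch runs without ssh

# 300 seconds is an assumed limit (believed to be the munge default TTL).
if skew_ok "$local_now" "$remote_now" 300; then
    echo "clock skew within limit"
else
    echo "clock skew too large" >&2
fi
```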
#Other error:
#Slurm will not start
#Solution: make sure the log files and /var/spool/slurmd named in /etc/slurm/slurm.conf have the proper ownership and permissions on both the server and the compute node
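A quick way to spot the ownership problem is to check who owns each path the daemons write to. This is a sketch: `owner_is` is a hypothetical helper, and the path list simply repeats the files and directories created earlier in these notes.

```shell
#!/bin/sh
# owner_is: hypothetical helper; succeeds when PATH is owned by USER.
# Relies on GNU stat (-c), which is what CentOS ships.
owner_is() {
    [ "$(stat -c '%U' "$1" 2>/dev/null)" = "$2" ]
}

# Paths taken from the setup steps in these notes; adjust to match slurm.conf.
for p in /var/spool/slurmctld /var/spool/slurmd \
         /var/log/slurmctld.log /var/log/slurmd.log; do
    if [ ! -e "$p" ]; then
        echo "MISSING      $p"
    elif owner_is "$p" slurm; then
        echo "OK           $p"
    else
        echo "WRONG OWNER  $p"
    fi
done
```

Not every path exists on every host (slurmctld.log only on mgmt, for example), so a MISSING line is not always an error.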
#Verify:
#Display compute nodes:
scontrol show nodes
#Run job on server:
srun -N1 /bin/hostname
#Display job queue:
scontrol show jobs
#Submit script jobs:
sbatch -N1 script-file
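For anyone who has not written a batch script before, here is a minimal sketch of what `script-file` could look like. `write_job` is a hypothetical helper, and the job name and output file name are illustrative choices.

```shell
#!/bin/sh
# write_job: hypothetical helper; writes a minimal one-node batch script to FILE.
write_job() {
    cat > "$1" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --output=hello-%j.out
srun /bin/hostname
EOF
}

job=$(mktemp)
write_job "$job"
cat "$job"
# Submit on the cluster with:  sbatch "$job"
```

In the output file name, `%j` is replaced by the job id, so each run gets its own file.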
Hope you enjoy it!