Error in SLURM cluster – Detected 1 oom-kill event(s): how to improve running jobs

The accepted answer is correct but, to be more precise, the error slurmstepd: error: Detected 1 oom-kill event(s) in step 1090990.batch cgroup. indicates that you are running low on CPU RAM (Linux host memory), not GPU memory. If you were, for instance, running a computation on a GPU, requesting more GPU memory than is available would result in an error like … Read more
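As a minimal sketch, more host memory can be requested in the submission script itself; the script name run.sh and the 8G figure below are illustrative assumptions, not values from the question:

    #!/bin/bash
    #SBATCH --job-name=oom-example   # illustrative job name
    #SBATCH --mem=8G                 # request 8 GB of CPU RAM per node (assumed value)
    # Alternatively, request memory per CPU instead of per node:
    # #SBATCH --mem-per-cpu=2G
    srun ./run.sh                    # run.sh is a hypothetical payload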

Slurm: Why use srun inside sbatch?

The srun command is used to create job ‘steps’. First, it brings better reporting of resource usage: the sstat command provides real-time resource usage for processes started with srun, and each step (each call to srun) is reported individually in the accounting. Second, it can be used to … Read more
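As a sketch of what this looks like in practice (the program names and the two-step layout are assumptions):

    #!/bin/bash
    #SBATCH --ntasks=4               # resources available to each step
    # Each srun call below becomes its own job step (<jobid>.0, <jobid>.1, ...)
    srun ./preprocess                # step 0, hypothetical program
    srun ./solve                     # step 1, hypothetical program

While the job runs, sstat -j <jobid>.0 reports live usage for the first step, and sacct -j <jobid> later lists each step on its own line.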

How to submit a job to any [subset] of nodes from nodelist in SLURM?

You can work the other way around; rather than specifying which nodes to use (with the effect that each job is allocated all 7 nodes), specify which nodes not to use: sbatch --exclude=myCluster[01-09] myScript.sh and Slurm will never allocate more than 7 nodes to your jobs. Make sure though that the cluster configuration allows … Read more
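As a hedged example, the exclusion can be given on the command line or as a directive inside the batch script itself (myCluster[01-09] and myScript.sh are the names from the excerpt):

    # On the command line:
    sbatch --exclude=myCluster[01-09] myScript.sh

    # Or equivalently as a directive inside myScript.sh:
    #SBATCH --exclude=myCluster[01-09]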

Use slurm job id

You can do something like this: RES=$(sbatch simulation) && sbatch --dependency=afterok:${RES##* } postprocessing The RES variable will hold the output of the sbatch command, something like Submitted batch job 102045. The construct ${RES##* } isolates the last word (see more info here), in this case the job id. The && part ensures you do … Read more
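A variant sketch of the same chaining that avoids the word-splitting: sbatch --parsable prints only the job id (with a cluster name appended after a semicolon on multi-cluster setups), so it can be captured directly. The script names simulation and postprocessing are taken from the excerpt:

    #!/bin/bash
    # Submit the first job and capture its id (--parsable suppresses the "Submitted batch job" text)
    JOBID=$(sbatch --parsable simulation)
    # The second job starts only if the first one completes successfully
    sbatch --dependency=afterok:${JOBID} postprocessing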

How to “undrain” slurm nodes in drain state

Found an approach: enter the scontrol interpreter (on the command line, type scontrol) and then run scontrol: update NodeName=node10 State=DOWN Reason="undraining" followed by scontrol: update NodeName=node10 State=RESUME. After that, scontrol: show node node10 displays, amongst other info, State=IDLE. Update: some of these nodes went back to the DRAIN state; their root partition turned out to be full, as revealed by e.g. show node a10, which showed Reason=SlurmdSpoolDir … Read more
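For reference, the same sequence works as plain shell commands outside the interactive interpreter (node10 is the node name from the excerpt; administrative rights are required):

    # Mark the node down with an explanatory reason, then bring it back
    scontrol update NodeName=node10 State=DOWN Reason="undraining"
    scontrol update NodeName=node10 State=RESUME
    # Check the result; sinfo -R also lists the drain/down reason per node
    scontrol show node node10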

HPC cluster: select the number of CPUs and threads in SLURM sbatch

Depending on the parallelism you are using, distributed or shared memory: --ntasks=# is the number of “tasks” (use with distributed parallelism), --ntasks-per-node=# is the number of “tasks” per node (use with distributed parallelism), and --cpus-per-task=# is the number of CPUs allocated to each task (use with shared-memory parallelism). From this question: if every node has 24 cores, is … Read more
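As a hedged sketch for 24-core nodes (the node count, the 4 x 6 split, and the program name hybrid_app are assumptions), a hybrid distributed/shared-memory request could look like:

    #!/bin/bash
    #SBATCH --nodes=2                # two 24-core nodes (assumed)
    #SBATCH --ntasks-per-node=4      # 4 distributed tasks (e.g. MPI ranks) per node
    #SBATCH --cpus-per-task=6        # 6 CPUs per task: 4 x 6 = 24 cores per node
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}   # hand the per-task CPU count to the threaded layer
    srun ./hybrid_app                # hypothetical hybrid program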