Error in SLURM cluster – Detected 1 oom-kill event(s): how to improve running jobs

The accepted answer is correct but, to be more precise, the error slurmstepd: error: Detected 1 oom-kill event(s) in step 1090990.batch cgroup. indicates that you are running low on CPU RAM (Linux host memory), not GPU memory. If you were, for instance, running a computation on a GPU, requesting more GPU memory than is available would result in an error like … Read more
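As a minimal sketch, more host memory can be requested in the submission script itself; the script name run.sh and the 8G figure below are illustrative assumptions, not values from the question:

    #!/bin/bash
    #SBATCH --job-name=oom-example   # illustrative job name
    #SBATCH --mem=8G                 # request 8 GB of CPU RAM per node (assumed value)
    # Alternatively, request memory per CPU instead of per node:
    # #SBATCH --mem-per-cpu=2G
    srun ./run.sh                    # run.sh is a hypothetical payload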

Slurm: Why use srun inside sbatch?

The srun command is used to create job ‘steps’. First, it brings better reporting of resource usage: the sstat command provides real-time resource usage for processes started with srun, and each step (each call to srun) is reported individually in the accounting. Second, it can be used to … Read more
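As a sketch of what this looks like in practice (the program names and the two-step layout are assumptions):

    #!/bin/bash
    #SBATCH --ntasks=4               # resources available to each step
    # Each srun call below becomes its own job step (<jobid>.0, <jobid>.1, ...)
    srun ./preprocess                # step 0, hypothetical program
    srun ./solve                     # step 1, hypothetical program

While the job runs, sstat -j <jobid>.0 reports live usage for the first step, and sacct -j <jobid> later lists each step on its own line.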

How to submit a job to any [subset] of nodes from nodelist in SLURM?

You can work the other way around; rather than specifying which nodes to use (with the effect that each job is allocated all 7 nodes), specify which nodes not to use: sbatch --exclude=myCluster[01-09] myScript.sh and Slurm will never allocate more than 7 nodes to your jobs. Make sure though that the cluster configuration allows … Read more
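As a hedged example, the exclusion can be given on the command line or as a directive inside the batch script itself (myCluster[01-09] and myScript.sh are the names from the excerpt):

    # On the command line:
    sbatch --exclude=myCluster[01-09] myScript.sh

    # Or equivalently as a directive inside myScript.sh:
    #SBATCH --exclude=myCluster[01-09]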

Use slurm job id

You can do something like this: RES=$(sbatch simulation) && sbatch --dependency=afterok:${RES##* } postprocessing The RES variable will hold the output of the sbatch command, something like Submitted batch job 102045. The construct ${RES##* } isolates the last word (see more info here), in this case the job id. The && part ensures you do … Read more
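A variant sketch of the same chaining that avoids the word-splitting: sbatch --parsable prints only the job id (with a cluster name appended after a semicolon on multi-cluster setups), so it can be captured directly. The script names simulation and postprocessing are taken from the excerpt:

    #!/bin/bash
    # Submit the first job and capture its id (--parsable suppresses the "Submitted batch job" text)
    JOBID=$(sbatch --parsable simulation)
    # The second job starts only if the first one completes successfully
    sbatch --dependency=afterok:${JOBID} postprocessing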

How to “undrain” slurm nodes in drain state

Found an approach: enter the scontrol interpreter (on the command line, type scontrol) and then run scontrol: update NodeName=node10 State=DOWN Reason="undraining" followed by scontrol: update NodeName=node10 State=RESUME. After that, scontrol: show node node10 displays, amongst other info, State=IDLE. Update: some of these nodes went back to the DRAIN state; their root partition turned out to be full, as revealed by e.g. show node a10, which showed Reason=SlurmdSpoolDir … Read more
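For reference, the same sequence works as plain shell commands outside the interactive interpreter (node10 is the node name from the excerpt; administrative rights are required):

    # Mark the node down with an explanatory reason, then bring it back
    scontrol update NodeName=node10 State=DOWN Reason="undraining"
    scontrol update NodeName=node10 State=RESUME
    # Check the result; sinfo -R also lists the drain/down reason per node
    scontrol show node node10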

HPC cluster: select the number of CPUs and threads in SLURM sbatch

Depending on the parallelism you are using, distributed or shared memory: --ntasks=# is the number of “tasks” (use with distributed parallelism), --ntasks-per-node=# is the number of “tasks” per node (use with distributed parallelism), and --cpus-per-task=# is the number of CPUs allocated to each task (use with shared-memory parallelism). From this question: if every node has 24 cores, is … Read more
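As a hedged sketch for 24-core nodes (the node count, the 4 x 6 split, and the program name hybrid_app are assumptions), a hybrid distributed/shared-memory request could look like:

    #!/bin/bash
    #SBATCH --nodes=2                # two 24-core nodes (assumed)
    #SBATCH --ntasks-per-node=4      # 4 distributed tasks (e.g. MPI ranks) per node
    #SBATCH --cpus-per-task=6        # 6 CPUs per task: 4 x 6 = 24 cores per node
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}   # hand the per-task CPU count to the threaded layer
    srun ./hybrid_app                # hypothetical hybrid program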