Comment in a bash script processed by Slurm
Just add another # at the beginning: ##SBATCH --mail-user… This will not be processed by Slurm.
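A minimal sketch of the idea (the job name, mail address, and /tmp path are illustrative assumptions): Slurm only honors lines that begin exactly with #SBATCH, so doubling the # turns a directive into an ordinary comment.

```shell
# Write a throwaway job script and count the directives Slurm would parse.
cat > /tmp/job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
##SBATCH --mail-user=me@example.com
echo "running"
EOF
# '##SBATCH' does not match '^#SBATCH', so only one directive remains active.
grep -c '^#SBATCH' /tmp/job.sh
```

The grep prints 1: the doubled-hash line is invisible to Slurm but still documents the setting for later re-enabling.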
The approved answer is correct, but to be more precise, the error slurmstepd: error: Detected 1 oom-kill event(s) in step 1090990.batch cgroup. indicates that you ran out of CPU RAM (system memory). If you were, for instance, running a computation on a GPU, requesting more GPU memory than is available would instead produce an error like …
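A hedged sketch of the usual remedy, assuming the memory sizes shown (pick values that fit your cluster): request more CPU RAM in the job script so the cgroup limit is not exceeded.

```shell
#!/bin/bash
# Hypothetical memory request; 16G is an example, not a recommendation.
#SBATCH --mem=16G            # total RAM per node
# Alternatively (mutually exclusive with --mem):
##SBATCH --mem-per-cpu=4G    # RAM per allocated CPU
./my_program                 # hypothetical binary that previously got oom-killed
```

If the oom-kill persists, check the job's actual peak usage with sacct or sstat before raising the request further.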
In order to see the details of all the nodes you can use: scontrol show node For a specific node: scontrol show node "nodename" And for the cores of a job you can use the %C format specifier, for instance: squeue -o "%.7i %.9P %.8j %.8u %.2t %.10M %.6D %C" More info about format.
By default, print in Python is buffered, meaning that it does not write to files or stdout immediately; the buffer must be 'flushed' to force the write. See this question for the available options. The simplest is to start the Python interpreter with the -u option. From the python man page: -u …
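A minimal sketch (the /tmp script path is an assumption): write a throwaway script, then run it with -u so every print reaches stdout immediately, which is what you want when tailing Slurm output files.

```shell
# Create a tiny Python script and run it unbuffered.
cat > /tmp/demo_flush.py <<'EOF'
print("started")           # with -u this line is written immediately
print("done", flush=True)  # flush=True forces the write even without -u
EOF
python3 -u /tmp/demo_flush.py
```

Setting PYTHONUNBUFFERED=1 in the job script has the same effect as -u, without changing how the interpreter is invoked.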
The srun command is used to create job 'steps'. First, it brings better reporting of resource usage: the sstat command provides real-time resource usage for processes started with srun, and each step (each call to srun) is reported individually in the accounting. Second, it can be used to …
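A sketch of what that looks like in practice (the binaries are hypothetical): each srun invocation inside the batch script becomes its own numbered step, so sstat and sacct can report on them separately.

```shell
#!/bin/bash
#SBATCH --ntasks=4
# Each srun call below is recorded as a separate step in the accounting.
srun ./preprocess    # step 0: e.g. sstat <jobid>.0 while it runs
srun ./compute       # step 1: e.g. sacct -j <jobid> shows it as its own row
```

Commands run without srun still execute, but their resource usage is folded into the generic batch step rather than reported per call.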
You can work the other way around; rather than specifying which nodes to use, with the effect that each job is allocated all 7 nodes, specify which nodes not to use: sbatch --exclude=myCluster[01-09] myScript.sh and Slurm will never allocate more than 7 nodes to your jobs. Make sure though that the cluster configuration allows …
"CG" stands for "completing", and it happens to a job that cannot terminate, probably because of a stuck I/O operation. More detailed info is in the Slurm Troubleshooting Guide.
You can do something like this: RES=$(sbatch simulation) && sbatch --dependency=afterok:${RES##* } postprocessing The RES variable will hold the output of the sbatch command, something like Submitted batch job 102045. The construct ${RES##* } isolates the last word (see more info here), in this case the job id. The && part ensures you do …
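The parameter expansion can be sketched without a cluster, using a sample of sbatch's usual reply (the job id below is the example value from the answer):

```shell
# ${RES##* } removes the longest prefix ending in a space, leaving the last word.
RES="Submitted batch job 102045"   # typical sbatch output
JOBID=${RES##* }                   # -> "102045"
echo "$JOBID"
```

An alternative on recent Slurm versions is sbatch --parsable, which prints only the job id and avoids the string surgery entirely.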
Found an approach: enter the scontrol interpreter (on the command line, type scontrol) and then scontrol: update NodeName=node10 State=DOWN Reason="undraining" scontrol: update NodeName=node10 State=RESUME Then scontrol: show node node10 displays, amongst other info, State=IDLE. Update: some of these nodes went back to the DRAIN state; I noticed their root partition was full after e.g. show node a10, which showed Reason=SlurmdSpoolDir …
Depending on the parallelism you are using, distributed or shared memory: --ntasks=# : number of "tasks" (use with distributed parallelism). --ntasks-per-node=# : number of "tasks" per node (use with distributed parallelism). --cpus-per-task=# : number of CPUs allocated to each task (use with shared-memory parallelism). From this question: if every node has 24 cores, is …
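The options above can be sketched for the 24-core-node scenario with a hypothetical hybrid MPI+OpenMP script (node counts, task counts, and the binary name are all illustrative assumptions):

```shell
#!/bin/bash
# 2 nodes x 4 tasks/node x 6 CPUs/task = 48 cores, i.e. two 24-core nodes fully used.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4    # distributed parallelism: 4 MPI ranks per node
#SBATCH --cpus-per-task=6      # shared-memory parallelism: 6 threads per rank
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./hybrid_app              # hypothetical binary; srun starts one process per task
```

A pure MPI job would instead set --ntasks (or --ntasks-per-node) alone, while a pure multithreaded job would use a single task with --cpus-per-task equal to the thread count.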