The accepted answer is correct but, to be more precise, the error
slurmstepd: error: Detected 1 oom-kill event(s) in step 1090990.batch cgroup.
indicates that your job ran out of host RAM (the memory used by the CPU), not GPU memory.
If you were instead running a computation on a GPU and requested more GPU memory than is available, you would get a different error, such as this one from PyTorch:
RuntimeError: CUDA out of memory. Tried to allocate 8.94 GiB (GPU 0; 15.90 GiB total capacity; 8.94 GiB already allocated; 6.34 GiB free; 0 bytes cached)
Check out the explanation in this article for more details.
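To tell the two cases apart in a job log, you can search for the two messages quoted above. This is only a minimal sketch: `classify_oom` is a hypothetical helper name (not part of Slurm or PyTorch), and it assumes the log contains the messages in the same wording as shown here.

```shell
#!/usr/bin/env bash
# classify_oom: hypothetical helper, not a Slurm or PyTorch command.
# Prints which kind of out-of-memory message a log file contains.
classify_oom() {
    local log="$1"
    if grep -q 'oom-kill event' "$log"; then
        # Message written by slurmstepd: the job ran out of host RAM.
        echo "host-ram"
    elif grep -q 'CUDA out of memory' "$log"; then
        # Message raised by PyTorch: the GPU ran out of memory.
        echo "gpu-memory"
    else
        echo "none"
    fi
}
```

For example, `classify_oom slurm-1090990.out` would print `host-ram` for the first error above, assuming the job wrote to sbatch's default output file name. Only the `host-ram` case is addressed by raising --mem-per-cpu.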
Solution:
Add the --mem-per-cpu parameter to your script, or increase its value if it is already set.
- If you are using sbatch:
sbatch your_script.sh
to run your script, add the following line to it:
#SBATCH --mem-per-cpu=<value bigger than you've requested before>
- If you are using srun:
srun python3 your_script.py
add the parameter like this:
srun --mem-per-cpu=<value bigger than you've requested before> python3 your_script.py
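For the sbatch case, a complete batch script might look roughly like the sketch below. The job name, task counts, and the `8G` value are placeholders I picked for illustration, not recommendations; choose a value larger than your previous request and within what your cluster allows.

```shell
#!/bin/bash
#SBATCH --job-name=example     # hypothetical job name
#SBATCH --ntasks=1             # one task
#SBATCH --cpus-per-task=1      # one CPU for that task
#SBATCH --mem-per-cpu=8G       # placeholder: host RAM per allocated CPU

# Same command as in the srun example above.
python3 your_script.py
```

Because the limit is per CPU, the total host memory available to the job scales with the number of CPUs you request.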