How do I submit a large number of very similar jobs?

A few tricks make it easier to submit large numbers of similar jobs, as in high-throughput computing (HTC).

Take care if your program needs a random number seed, e.g. for Monte Carlo simulation. Your program should handle it properly so that each job gets a different seed; otherwise several jobs will reproduce the same pseudo-random sequence.
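One simple way to keep seeds distinct is to derive each job's seed from a fixed base plus the job index. The following is a minimal sketch only: the helper name make_seed and the base value 1000 are hypothetical, and it assumes your program can accept a seed from outside (for example as a command-line argument), which the example program on this page is not stated to do.

```shell
#!/bin/bash

# Hypothetical helper: derive a distinct seed for each job from a
# base value and the job index, so that no two jobs share a seed.
make_seed() {
    local base=$1
    local index=$2
    echo $(( base + index ))
}

# Print the seeds for job indices 0..3, using an assumed base of 1000.
for x in {0..3}
do
    echo "job $x -> seed $(make_seed 1000 "$x")"
done
```

Distinct seeds guarantee that the jobs do not replay the same sequence; whether neighbouring seeds give sufficiently independent streams depends on the generator your program uses.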

We can start with the simple C++ program we introduced here. We then create a submission template called example3a-template.slurm:

#!/bin/bash
#
#SBATCH --qos=cu_hpc
#SBATCH --partition=cpu
#SBATCH --job-name=example3a
#SBATCH --output=example3a_INPUT1_log.txt
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G

module purge

#To handle PATHs
export MYCODEDIR=`pwd`
echo "MYCODEDIR = "$MYCODEDIR
echo "TMPDIR = "$TMPDIR

#Sleep 3m so the jobs stay in the queue long enough to capture the `squeue` output
sleep 3m

#To run C++ program
cd $TMPDIR
cp $MYCODEDIR/example3 .
chmod a+x example3
rm -rf example3.txt
./example3
cp -rf example3.txt $MYCODEDIR/example3_INPUT1.txt

And here is our submit.sh script:

#!/bin/bash

for x in {0..20..1}
do
    #prepare configuration: replace INPUT1 in the template with the loop index
    rm -f example3a_$x.slurm
    sed "s/INPUT1/$x/g" example3a-template.slurm >| example3a_$x.slurm

    #submit and clean slurm submission file
    echo "Job:" $x
    sbatch example3a_$x.slurm
    rm -rf example3a_$x.slurm
done

The script loops over x from 0 to 20. In each iteration, it

  1. prepares a submission script from the template, using the sed editor to replace every occurrence of the pattern INPUT1 with the value of $x,

  2. submits the job to the Slurm cluster,

  3. deletes the submission script (sbatch keeps its own copy at submission time, so this is safe).
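Before submitting many jobs, it can help to check what the substitution step produces for a single index. The sketch below isolates that step; the function name render_template is hypothetical and not part of submit.sh, and it only prints the result rather than calling sbatch.

```shell
#!/bin/bash

# Hypothetical helper: print a copy of the template to stdout with every
# occurrence of the placeholder INPUT1 replaced by the given value.
# Nothing is written to disk and nothing is submitted.
render_template() {
    local template=$1
    local value=$2
    sed "s/INPUT1/${value}/g" "$template"
}
```

For example, `render_template example3a-template.slurm 5 | diff example3a-template.slurm -` lists only the lines that differ from the template for x = 5, which should be the --output line and the final cp line.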

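After your submission is done, you can check your jobs using squeue. In the output below, the ST column shows R for running jobs and PD for pending ones; here the pending jobs are waiting because of the per-user job limit of the QOS (QOSMaxJobsPerUserLimit).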

[your_name@frontend-03 example2]$ squeue -u your_name
             JOBID PARTITION     NAME      USER ST       TIME  NODES NODELIST(REASON)
             81969       cpu example3 your_name PD       0:00      1 (QOSMaxJobsPerUserLimit)
             81970       cpu example3 your_name PD       0:00      1 (QOSMaxJobsPerUserLimit)
             81971       cpu example3 your_name PD       0:00      1 (QOSMaxJobsPerUserLimit)
             81972       cpu example3 your_name PD       0:00      1 (QOSMaxJobsPerUserLimit)
             81973       cpu example3 your_name PD       0:00      1 (QOSMaxJobsPerUserLimit)
             81974       cpu example3 your_name PD       0:00      1 (QOSMaxJobsPerUserLimit)
             81975       cpu example3 your_name PD       0:00      1 (QOSMaxJobsPerUserLimit)
             81976       cpu example3 your_name PD       0:00      1 (QOSMaxJobsPerUserLimit)
             81977       cpu example3 your_name PD       0:00      1 (QOSMaxJobsPerUserLimit)
             81978       cpu example3 your_name PD       0:00      1 (QOSMaxJobsPerUserLimit)
             81979       cpu example3 your_name PD       0:00      1 (QOSMaxJobsPerUserLimit)
             81980       cpu example3 your_name PD       0:00      1 (QOSMaxJobsPerUserLimit)
             81981       cpu example3 your_name PD       0:00      1 (QOSMaxJobsPerUserLimit)
             81982       cpu example3 your_name PD       0:00      1 (QOSMaxJobsPerUserLimit)
             81983       cpu example3 your_name PD       0:00      1 (QOSMaxJobsPerUserLimit)
             81984       cpu example3 your_name PD       0:00      1 (QOSMaxJobsPerUserLimit)
             81965       cpu example3 your_name  R       1:04      1 cpu-bladeh-01
             81966       cpu example3 your_name  R       1:04      1 cpu-bladeh-01
             81967       cpu example3 your_name  R       1:04      1 cpu-bladeh-01
             81968       cpu example3 your_name  R       1:04      1 cpu-bladeh-01
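With many jobs in the queue, a summary per state is often easier to read than the full listing. The sketch below counts jobs by the ST column; the function name count_states is hypothetical, and it assumes the default squeue column layout shown above (state in the fifth column, no spaces inside job names).

```shell
#!/bin/bash

# Hypothetical helper: read squeue output in its default layout from stdin,
# skip the header line, and count jobs per state (column 5, "ST").
count_states() {
    awk 'NR > 1 { count[$5]++ } END { for (s in count) print s, count[s] }' | sort
}
```

It would be used as `squeue -u your_name | count_states`. Applied to the listing above, it prints `PD 16` and `R 4`.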
