Running the benchmark tests for cluster vendors

Ian Harry edited this page Jul 3, 2018 · 14 revisions

This page describes the benchmark tests that must be run to assess the performance of candidate compute nodes. The results of these tests are required as part of a formal offer in response to our tender process.

The benchmark consists of a number of tests, all of which must be run, and their results are combined to give one final performance number. In addition to the performance tests, the energy usage must also be documented, as described under "Energy monitoring" at the end of this page.

After running the benchmark tests you will have a set of numbers:

  • PyCBC benchmark performance
  • Waveform generation performance
  • lalinference performance
  • ....
  • Energy efficiency number???

These will be combined together to produce a single number for each machine according to

INSERT FORMULA

Multiplying the above number by the number of machines that can be bought within the stated constraints on total cost and total energy consumption gives the final figure we will use for comparing offers.
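
The last step can be sketched in Python. This is only an illustration under stated assumptions: the per-machine number is taken as an input because the combining formula is not yet given on this page, the offer is assumed to contain a single type of machine, and the function name, parameter names, and all figures in the example call are hypothetical.

```python
import math


def final_offer_figure(per_machine_number, cost_per_machine,
                       power_per_machine_kw, max_total_cost,
                       max_total_power_kw):
    """Return (n_machines, final_figure) for one offer.

    Assumption: all machines are identical, so the number of machines is
    capped independently by the cost limit and by the power limit.
    """
    n_by_cost = math.floor(max_total_cost / cost_per_machine)
    n_by_power = math.floor(max_total_power_kw / power_per_machine_kw)
    n_machines = int(min(n_by_cost, n_by_power))
    return n_machines, per_machine_number * n_machines


# Hypothetical figures, for illustration only.
n_machines, figure = final_offer_figure(per_machine_number=2.0,
                                        cost_per_machine=5000.0,
                                        power_per_machine_kw=0.5,
                                        max_total_cost=100000.0,
                                        max_total_power_kw=8.0)
print(n_machines, figure)
```

In this made-up example the power limit, not the cost limit, determines how many machines can be bought.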

Setup for the benchmarks

Our benchmark tests are all verified to work within a docker/singularity image that we provide. This image can be found here:

https://hub.docker.com/r/spxiwh/docker-ligo-lalsuite-dev/

The instructions for building this image (the Dockerfile) can be found here:

https://github.com/spxiwh/docker-ligo-lalsuite-dev/blob/stretch/Dockerfile

Vendors are permitted to use their own install stage, for example to build optimized versions of underlying libraries, but the codes must run to completion and produce scientifically meaningful results, as defined below. Open question: should we allow vendors this freedom? Currently the codes do not verify their output, but we can add some sanity checks.

One can download and run this image on a test machine (with singularity installed) by running:

singularity pull --name IMAGE.simg docker://spxiwh/docker-ligo-lalsuite-dev:stretch
singularity run IMAGE.simg

Instructions for running the benchmarks

PyCBC

Below is the proposed PyCBC benchmark. It uses a public template bank from GitHub and pulls data from LOSC at runtime. I didn't use an injection file, as I didn't see the point, but one could be added. As O1 data was only recently released, I've triggered a rebuild of the Docker image to pick up this information. With Docker it's not clear how one would test using, for example, MKL libraries, which can't really be included in the image. Still, the image at least provides a list of instructions for installing everything that is needed.

First we set up the jobs by downloading the necessary input files and preparing them. This is not part of the benchmark.

curl -L https://raw.githubusercontent.com/ligo-cbc/pycbc-config/master/O2/bank/H1L1-HYPERBANK_SEOBNRv4v2_VARFLOW_THORNE-1163174417-604800.xml.gz > H1L1-HYPERBANK_SEOBNRv4v2_VARFLOW_THORNE-1163174417-604800.xml.gz
curl -L https://losc.ligo.org/archive/data/O1/1126170624/L-L1_LOSC_4_V1-1126256640-4096.gwf > L-L1_LOSC_4_V1-1126256640-4096.gwf

pycbc_coinc_bank2hdf --bank-file H1L1-HYPERBANK_SEOBNRv4v2_VARFLOW_THORNE-1163174417-604800.xml.gz --output-file H1L1-HYPERBANK_SEOBNRv4v2_VARFLOW_THORNE-1163174417-604800.hdf

export SCHEME=cpu

Next we need to make some decisions. For the production benchmark, NTEMPLATES must be set to the first value shown below. To check only that the benchmark runs to completion on a specific system, NTEMPLATES can instead be set to the lower, second value. Set only one of the two: if both lines are pasted, the second overrides the first.

# Use this if running the production benchmark
export NTEMPLATES=100000

# Use this if just wanting to test the benchmark
export NTEMPLATES=1000

# Run this after setting NTEMPLATES. If NTEMPLATES changes this must be rerun.
pycbc_hdf5_splitbank --bank-file H1L1-HYPERBANK_SEOBNRv4v2_VARFLOW_THORNE-1163174417-604800.hdf --templates-per-bank ${NTEMPLATES} --output-prefix H1L1-SPLITBANK_ --random-sort
mv H1L1-SPLITBANK_0.hdf tmp.hdf
rm -f H1L1-SPLITBANK_*hdf
mv tmp.hdf H1L1-SPLITBANK_0.hdf

We must also decide how many parallel jobs to run at the same time on the machine. For this benchmark you are free to use hyperthreading and even to oversubscribe a node if needed. The benchmark will measure total performance.

# How many parallel jobs do you want to run at the same time on the machine?
export N_JOBS=XXXX

Now we run N_JOBS iterations of our process on the same compute node. There are two sub-benchmarks so we perform one, wait for all processes to finish, then perform the second. These jobs run in the background, so do let them all finish before moving on!

# This line is important! Without it some parts of the job can unexpectedly start using multiple threads!
export OMP_NUM_THREADS=1

for IDX in $(seq 1 ${N_JOBS}); do pycbc_inspiral --sgchisq-snr-threshold 6.0 --sgchisq-locations "mtotal>40:20-30,20-45,20-60,20-75,20-90,20-105,20-120" --pad-data 8 --strain-high-pass 15 --sample-rate 2048 --segment-length 512 --segment-start-pad 144 --segment-end-pad 16 --allow-zero-padding --taper-data 1 --psd-estimation median --psd-segment-length 16 --psd-segment-stride 8 --psd-inverse-length 16 --psd-num-segments 63 --autogating-threshold 100 --autogating-cluster 0.5 --autogating-width 0.25 --autogating-taper 0.25 --autogating-pad 16 --enable-bank-start-frequency  --low-frequency-cutoff 20 --approximant 'SPAtmplt:mtotal<4' 'SEOBNRv4_ROM:else' --order -1 --snr-threshold 5.5 --cluster-method window --cluster-window 1 --cluster-function symmetric --chisq-bins "0.72*get_freq('fSEOBNRv4Peak',params.mass1,params.mass2,params.spin1z,params.spin2z)**0.7" --newsnr-threshold 5 --filter-inj-only  --injection-window 4.5 --processing-scheme ${SCHEME} --injection-filter-rejector-chirp-time-window 5 --channel-name L1:GDS-CALIB_STRAIN --gps-start-time 1126258462 --gps-end-time 1126260462 --trig-start-time 1126258700 --trig-end-time 1126260000 --output OUTPUTSEOB_${IDX}.hdf --bank-file H1L1-SPLITBANK_0.hdf --frame-files L-L1_LOSC_4_V1-1126256640-4096.gwf --channel-name L1:LOSC-STRAIN  & done

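Ensure the above jobs have all finished before moving on! In bash, running the wait builtin with no arguments in the same shell blocks until all of that shell's background jobs have exited.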

# This line is important! Without it some parts of the job can unexpectedly start using multiple threads!
export OMP_NUM_THREADS=1

for IDX in $(seq 1 ${N_JOBS}); do pycbc_inspiral --sgchisq-snr-threshold 6.0 --sgchisq-locations "mtotal>40:20-30,20-45,20-60,20-75,20-90,20-105,20-120" --pad-data 8 --strain-high-pass 15 --sample-rate 2048 --segment-length 512 --segment-start-pad 144 --segment-end-pad 16 --allow-zero-padding --taper-data 1 --psd-estimation median --psd-segment-length 16 --psd-segment-stride 8 --psd-inverse-length 16 --psd-num-segments 63 --autogating-threshold 100 --autogating-cluster 0.5 --autogating-width 0.25 --autogating-taper 0.25 --autogating-pad 16 --enable-bank-start-frequency  --low-frequency-cutoff 20 --approximant 'SPAtmplt:mtotal<50' 'SEOBNRv4_ROM:else' --order -1 --snr-threshold 5.5 --cluster-method window --cluster-window 1 --cluster-function symmetric --chisq-bins "0.72*get_freq('fSEOBNRv4Peak',params.mass1,params.mass2,params.spin1z,params.spin2z)**0.7" --newsnr-threshold 5 --filter-inj-only  --injection-window 4.5 --processing-scheme ${SCHEME} --injection-filter-rejector-chirp-time-window 5 --channel-name L1:GDS-CALIB_STRAIN --gps-start-time 1126258462 --gps-end-time 1126260462 --trig-start-time 1126258700 --trig-end-time 1126260000 --output OUTPUTF2_${IDX}.hdf --bank-file H1L1-SPLITBANK_0.hdf --frame-files L-L1_LOSC_4_V1-1126256640-4096.gwf --channel-name L1:LOSC-STRAIN & done

The performance numbers are then printed with a short Python script (formatted to be pasted directly into a bash window):

python - <<EOF
from __future__ import print_function  # works under Python 2 and 3
import glob
import h5py

# Each job contributes 1 / run_time; the performance number of a
# sub-benchmark is the sum of this rate over all of its output files.
for label, pattern in [(1, 'OUTPUTSEOB_*.hdf'), (2, 'OUTPUTF2_*.hdf')]:
    perf_tot = 0.0
    for fname in glob.glob(pattern):
        with h5py.File(fname, 'r') as f:
            perf_tot += 1. / f['L1/search/run_time'][0]
    print("PERFORMANCE NUMBER {}:".format(label), perf_tot)

EOF

Example results on some standard machines:

MACHINE                                      Performance number 1   Performance number 2
Pinatubo @ vulcan - 32 cores                 0.0634                 0.1279
ldas-pcdev3@CIT - 16 cores (HT = 32 jobs)    0.0546                 0.0960
ldas-pcdev5@CIT - 40 cores                   0.0992                 0.197
ldas-pcdev5@CIT - 40 cores (HT = 80 jobs)    0.1668                 0.243
ldas-pcdev6@LLO - 72 cores                   0.182                  0.308

These numbers are also divided by the number of physical cores to give the evaluation numbers.

MACHINE                                      Evaluation number 1    Evaluation number 2
Pinatubo @ vulcan - 32 cores                 0.00198125             0.003996875
ldas-pcdev3@CIT - 16 cores (HT = 32 jobs)    0.0034125              0.006
ldas-pcdev5@CIT - 40 cores                   0.00248                0.004925
ldas-pcdev5@CIT - 40 cores (HT = 80 jobs)    0.00417                0.006075
ldas-pcdev6@LLO - 72 cores                   0.00253                0.00428
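
The division by physical cores can be reproduced with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the two rows are copied from the tables above. Note that the hyperthreaded runs are still divided by the number of physical cores, not by the number of jobs.

```python
def evaluation_number(performance_number, physical_cores):
    """Divide a performance number by the number of physical cores."""
    return performance_number / physical_cores


# (machine, physical cores, performance number 1, performance number 2),
# copied from the example results on this page.
examples = [
    ("ldas-pcdev3@CIT (HT = 32 jobs)", 16, 0.0546, 0.0960),
    ("ldas-pcdev5@CIT (HT = 80 jobs)", 40, 0.1668, 0.243),
]

for machine, cores, perf1, perf2 in examples:
    print(machine,
          evaluation_number(perf1, cores),
          evaluation_number(perf2, cores))
```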

These temporary numbers are based on the rapid benchmark. They may not be reliable, as things like setup time might be relevant here. Don't read too much into them beyond getting a rough order of magnitude.

Waveform generation

Michael has a little waveform generation benchmark here: https://git.ligo.org/michael.puerrer/TD-wf-bench. It can run on any number of cores with MPI. Runtime can be tuned by changing the starting frequency or total mass. Current settings are chosen for short runtimes for testing.

At low mass and low starting frequency it will also require a significant chunk of memory per waveform (this can exceed 10GB for BNS). (PLEASE CLARIFY AND BE SPECIFIC HERE)

This benchmark can be run in the following way.

First we set up the jobs by downloading the necessary input files and preparing them. This is not part of the benchmark.

git clone https://git.ligo.org/michael.puerrer/TD-wf-bench.git
cd TD-wf-bench

This benchmark will use MPI to split the necessary benchmark work over a number of processes. Each process will run on a single core. The vendor can specify how many processes to spawn. Presumably this will be equal to the number of physical cores, or equal to twice the number of physical cores (hyperthreading), but other choices are possible. If the node is not filled, the non-utilized CPUs are still counted when evaluating the performance.

export NPROCESSES=XXXX

Then run the benchmark:

./wf_bench.sh ${NPROCESSES}

lalinference

Here is a benchmark for the lalinference parameter estimation code.

First we need to create an injection and prepare the ROQ data. This is not part of the benchmark.

# Generate an injection file
lalapps_inspinj --gps-start-time 441417609 --gps-end-time 441417639 --m-distr componentMass --min-mass1 5 --min-mass2 5 --max-mass1 5 --max-mass2 5 --max-mtotal 10 --i-distr uniform --waveform IMRPhenomPv2pseudoFourPN --amp-order 0 --l-distr random --f-lower 20 --t-distr uniform --time-step 30 --disable-spin --o injections.xml --snr-distr volume --ifos H1,L1,V1 --ligo-fake-psd LALAdLIGO --virgo-fake-psd LALAdVirgo --min-snr 20 --max-snr 20 --ligo-start-freq 20 --virgo-start-freq 20

#create directory needed to use the ROQ data and prepare necessary files
mkdir ROQdata

lalinference_datadump --L1-flow 20.0 --approx IMRPhenomPv2pseudoFourPN --psdlength 1024 --V1-timeslide 0 --V1-cache LALSimAdVirgo --chirpmass-max 6.170374 --inj injections.xml --comp-max 21.9986477179 --adapt-temps  --srate 4096.0 --event 0 --V1-fhigh 2047.96875 --neff 500 --seglen 32.0 --L1-channel L1:LDAS-STRAIN --L1-fhigh 2047.96875 --H1-timeslide 0 --trigtime 441417609 --comp-min 1.49140053129 --psdstart 441416535.0 --H1-cache LALSimAdLIGO --progress --H1-channel H1:LDAS-STRAIN --V1-channel V1:h_16384Hz --tol 1.0  --disable-spin  --V1-flow 20.0 --fref 100 --H1-fhigh 2047.96875 --L1-cache LALSimAdLIGO --amporder 0 --randomseed 1829391048 --dataseed -8975086 --L1-timeslide 0 --q-min 0.125 --chirpmass-min 3.346569 --H1-flow 20.0 --outfile ROQdata/data-dump  --data-dump  --ifo V1  --ifo H1  --ifo L1  

lalapps_compute_roq_weights -B /ROQ_data/IMRPhenomPv2/32s  -t 0.1  -T 0.000172895418228  --seglen 32.0  --fLow 20.0  --ifo V1  --fHigh 2047.96875  --data ROQdata/data-dumpV1-freqDataWithInjection.dat  --psd ROQdata/data-dumpV1-PSD.dat  --out ROQdata/

lalapps_compute_roq_weights -B /ROQ_data/IMRPhenomPv2/32s  -t 0.1  -T 0.000172895418228  --seglen 32.0  --fLow 20.0  --ifo L1  --fHigh 2047.96875  --data ROQdata/data-dumpL1-freqDataWithInjection.dat  --psd ROQdata/data-dumpL1-PSD.dat  --out ROQdata/

lalapps_compute_roq_weights -B /ROQ_data/IMRPhenomPv2/32s  -t 0.1  -T 0.000172895418228  --seglen 32.0  --fLow 20.0  --ifo H1  --fHigh 2047.96875  --data ROQdata/data-dumpH1-freqDataWithInjection.dat  --psd ROQdata/data-dumpH1-PSD.dat  --out ROQdata/

The benchmark starts here. We will run a number of lalinference processes on the benchmark machine, each of which uses 8 apparent cores. The number of processes to run is a free parameter. Obvious values are the number of physical cores divided by 8, or twice the number of physical cores divided by 8, but other values (e.g. overfilling the machine) are also allowed. We will assess performance according to the total rate of work done by the machine, so twice as many jobs that each run at half the speed give the same performance number. Choose the number of processes with:

export NPROCS=XXX
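
As a rough guide to this choice, the following Python sketch computes the two obvious process counts and shows why doubling the jobs at half the speed leaves the performance number unchanged. It assumes the performance number is the sum of 1/runtime over all processes, as in the script further down this page. The function names, the 32-core machine, and the runtimes are hypothetical.

```python
def candidate_process_counts(physical_cores, cores_per_process=8):
    """Return the two obvious NPROCS values: without and with hyperthreading."""
    return (physical_cores // cores_per_process,
            (2 * physical_cores) // cores_per_process)


def total_work_rate(runtimes_seconds):
    """Sum of 1/runtime over all processes."""
    return sum(1.0 / t for t in runtimes_seconds)


# Hypothetical machine with 32 physical cores.
print(candidate_process_counts(32))

# Four jobs taking 1000 s each and eight jobs taking 2000 s each
# represent the same total rate of work.
print(total_work_rate([1000.0] * 4), total_work_rate([2000.0] * 8))
```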

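Finally, for testing we provide a quick-to-complete instance of the benchmark, which can be used to verify the installation etc. For the full benchmark the production setting must be used. Set only one of the two values below: if both lines are pasted, the second overrides the first.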

# USE THIS SETTING WHEN DOING A PRODUCTION BENCHMARK
export NSTEPS=100000

# USE THIS SETTING IF WANTING TO TEST THE CODE IS FUNCTIONING
export NSTEPS=2000

The benchmark can then be started with:

# This line is important! Without it some parts of the job can unexpectedly start using multiple threads!
export OMP_NUM_THREADS=1

#run lalinference on the injection.
TIMEFORMAT=%R

for IDX in $(seq 1 ${NPROCS}); do
mkdir RUN_${IDX}
cd RUN_${IDX}
{ time lalinference_bench --psdlength 1024 --psdstart 441416535.0 --seglen 32 --srate 4096.0 --trigtime 441417609 --ifo H1 --H1-channel H1:LDAS-STRAIN --H1-cache LALSimAdLIGO --dataseed 1324 --chirpmass-max 6.170374 --chirpmass-min 3.346569 --q-min 0.125 --comp-max 21.9986477179 --comp-min 1.49140053129 --disable-spin --amporder 0 --fref 100 --inj ../injections.xml --event 0 --H1-timeslide 0 --trigtime 441417609 --psdstart 441416535.0 --tol 1.0 --H1-flow 20.0 --H1-fhigh 2047.96875 --ntemps 8 --np 8 --nsteps 1 --skip 100 --approx IMRPhenomPv2pseudoFourPN --outfile samples.hdf5  --randomseed 1829391048 --L1-flow 20.0  --V1-timeslide 0 --V1-cache LALSimAdVirgo  --L1-channel L1:LDAS-STRAIN --L1-fhigh 2047.96875 --V1-channel V1:h_16384Hz --V1-fhigh 2047.96875  --V1-flow 20.0 --L1-cache LALSimAdLIGO --L1-timeslide 0 --ifo V1 --ifo L1 --no-detector-frame --Niter ${NSTEPS} &> logging; } 2> runtime &
cd ..
done

OLD COMMAND KEPT HERE FOR FUTURE TESTING, PLEASE IGNORE THIS FOR NOW!!

# This line is important! Without it some parts of the job can unexpectedly start using multiple threads!
export OMP_NUM_THREADS=1

#run lalinference on the injection.
TIMEFORMAT=%R

for IDX in $(seq 1 ${NPROCS}); do
mkdir RUN_${IDX}
cd RUN_${IDX}
{ time lalinference_mpi_wrapper --L1-flow 20.0 --approx IMRPhenomPv2pseudoFourPN --psdlength 1024 --V1-timeslide 0 --V1-cache LALSimAdVirgo --chirpmass-max 6.170374 --mpirun mpirun --inj ../injections.xml --comp-max 21.9986477179 --srate 4096.0 --event 0 --H1-cache LALSimAdLIGO --executable lalinference_mcmc --seglen 32.0 --L1-channel L1:LDAS-STRAIN --L1-fhigh 2047.96875 --H1-timeslide 0 --trigtime 441417609 --comp-min 1.49140053129 --psdstart 441416535.0 --H1-channel H1:LDAS-STRAIN --V1-channel V1:h_16384Hz --V1-fhigh 2047.96875 --tol 1.0 --disable-spin --V1-flow 20.0 --fref 100 --outfile samples.hdf5 --L1-cache LALSimAdLIGO --amporder 0 --randomseed 1829391048 --dataseed -8975086 --L1-timeslide 0 --q-min 0.125 --chirpmass-min 3.346569 --H1-flow 20.0 --H1-fhigh 2047.96875 --V1-roqweightsLinear ../ROQdata/weights_linear_V1.dat  --V1-roqweightsQuadratic ../ROQdata/weights_quadratic_V1.dat  --H1-roqweightsLinear ../ROQdata/weights_linear_H1.dat  --H1-roqweightsQuadratic ../ROQdata/weights_quadratic_H1.dat  --L1-roqweightsLinear ../ROQdata/weights_linear_L1.dat  --L1-roqweightsQuadratic ../ROQdata/weights_quadratic_L1.dat  --roqtime_steps ../ROQdata/roq_sizes.dat  --roq-times ../ROQdata/tcs.dat  --roqnodesLinear ../ROQdata/fnodes_linear.dat  --roqnodesQuadratic ../ROQdata/fnodes_quadratic.dat --ifo V1  --ifo H1  --ifo L1 --ntemps 8 --np 8 --skip 100 --nsteps ${NSTEPS} &> /dev/null ; } 2> runtime &
cd ..
done

The performance number is printed with a short Python script (formatted to be pasted directly into a bash window):

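python - <<EOF
from __future__ import print_function  # works under Python 2 and 3

# Sum the rate of work (1 / wall-clock seconds) over all runs.
# NPROCS is substituted by bash before Python sees the script.
total_work = 0.0
for i in range(int(${NPROCS})):
    with open('RUN_{}/runtime'.format(i + 1), 'r') as runtime_file:
        runtime = float(runtime_file.readline())
    total_work += 1. / runtime

print(total_work)

EOF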

Example results on some standard machines for the short benchmark (note that the short benchmark does not give an accurate measurement of the actual performance):

MACHINE                     Performance number
Pinatubo @ vulcan 1 JOB     2.55780642521
krakatoa @ vulcan 1 JOB     1.5627930236

Energy monitoring

The vendor is required to report the energy usage of the full cluster running under full load, and this must be within the limits specified in the tender document. To do this the vendor is required to use an energy meter to measure the power usage of the proposed machines while under full load.

This should be measured by starting the PyCBC benchmark test using a number of processes equal to the number of physical cores of the system. After running the tests for 10 minutes (to allow the jobs to run through any setup tasks) the vendor should start the energy meter and measure the energy used over the next 2 hours. The PyCBC benchmark jobs must continue to run during these two hours.

The energy consumed, measured in kilowatt-hours (kWh) for both the small and big compute nodes, is then divided by 2 (the 2 hours of measurement) to get the average power used in kilowatts (kW). This number is used to compute the total power usage of the cluster according to:

E_total = E_big-nodes * N_big-nodes + E_small-nodes * N_small-nodes

where E_big-nodes is the power usage (in kW) of a single big node and E_small-nodes is the same for a single small node. N_big-nodes is the proposed number of big nodes in the vendor's offer and N_small-nodes is the number of small nodes.

A limit on E_total is given in the tender document. Offers must not exceed this.
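
The calculation can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the energy readings are taken to be per node, and the readings, node counts, and limit in the example are made up. The real limit is the one given in the tender document.

```python
def average_power_kw(energy_kwh, duration_hours=2.0):
    """Convert energy measured over the 2-hour window to average power."""
    return energy_kwh / duration_hours


def total_cluster_power_kw(energy_big_kwh, n_big, energy_small_kwh, n_small,
                           duration_hours=2.0):
    """E_total = E_big-nodes * N_big-nodes + E_small-nodes * N_small-nodes."""
    e_big = average_power_kw(energy_big_kwh, duration_hours)
    e_small = average_power_kw(energy_small_kwh, duration_hours)
    return e_big * n_big + e_small * n_small


def within_limit(e_total_kw, limit_kw):
    """Offers must not exceed the limit on E_total."""
    return e_total_kw <= limit_kw


# Hypothetical meter readings (kWh per node over 2 hours) and node counts.
e_total = total_cluster_power_kw(energy_big_kwh=1.6, n_big=10,
                                 energy_small_kwh=0.8, n_small=100)
print(e_total, within_limit(e_total, limit_kw=60.0))
```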