
Wrong running time #58

Open
Halbaroth opened this issue Mar 15, 2023 · 4 comments

Comments

@Halbaroth
Contributor

I am using Benchpress to track regressions in Alt-Ergo, but sometimes I get regressions that I cannot reproduce manually. My guess is that the way Benchpress computes running times is not appropriate for this purpose: the reported times can be wrong if the server is overloaded. I tried running Benchpress with fewer jobs to prevent this behavior, but I still see non-replicable regressions.

I could solve my problem by adding the running time without IO operations (the CPU time?) to the database for each test.

I would be glad to add this feature myself ;)

@Halbaroth
Contributor Author

Halbaroth commented Mar 15, 2023

let run cmd : Run_proc_result.t =
  let start = Ptime_clock.now () in
  (* call process and block *)
  let p =
    try
      let oc, ic, errc = Unix.open_process_full cmd (Unix.environment ()) in
      close_out ic;
      (* read out and err *)
      let err = ref "" in
      let t_err = Thread.create (fun e -> err := CCIO.read_all e) errc in
      let out = CCIO.read_all oc in
      Thread.join t_err;
      let status = Unix.close_process_full (oc, ic, errc) in
      object
        method stdout = out
        method stderr = !err
        method errcode = int_of_process_status status
        method status = status
      end
    with e ->
      object
        method stdout = ""
        method stderr = "process died: " ^ Printexc.to_string e
        method errcode = 1
        method status = Unix.WEXITED 1
      end
  in
  let errcode = p#errcode in
  Log.debug
    (fun k -> k "(@[run.done@ :errcode %d@ :cmd %a@]" errcode Misc.Pp.pp_str cmd);
  (* Compute time used by the command *)
  let rtime = Ptime.diff (Ptime_clock.now ()) start |> Ptime.Span.to_float_s in
  let utime = 0. in

Does Ptime_clock.now () return the elapsed wall-clock time? As an experiment, I will replace Ptime_clock.now () with Sys.time, which returns the CPU time. That should fix my issue. If it does, I will also add an extra field to the DB to keep the total CPU time.
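As a side note (a sketch, not benchpress code): `Sys.time` only measures the calling process, but `Unix.times` additionally reports the CPU time of reaped children through its `tms_cutime`/`tms_cstime` fields, which is closer to what is wanted here:

```ocaml
(* Sketch: measuring a child's CPU time from the parent.
   [Sys.time] counts only the parent's own CPU time, but
   [Unix.times] also exposes the accumulated CPU time of
   *waited-for* children in [tms_cutime] / [tms_cstime]. *)
let run_and_time cmd =
  let before = Unix.times () in
  let status = Sys.command cmd in  (* forks, runs, waits for the child *)
  let after = Unix.times () in
  let child_cpu =
    (after.Unix.tms_cutime -. before.Unix.tms_cutime)
    +. (after.Unix.tms_cstime -. before.Unix.tms_cstime)
  in
  (status, child_cpu)
```

Note that child CPU time is only accounted once the child has been reaped, so such a measurement would fit after `Unix.close_process_full` in the code above.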

@c-cube
Member

c-cube commented Mar 15, 2023

Sys.time will return the CPU time for the current process (i.e. benchpress), not for the solver. Ptime_clock.now returns the POSIX wall clock, which indeed might vary if the server is overloaded.

I'm not sure what the clean solution is. Maybe @Gbury can chime in about using cgroups to ensure that subprocesses get dedicated CPU resources? I think François Bobot does this in his benchmark tool.

@Gbury
Collaborator

Gbury commented Mar 15, 2023

This is all a bit complicated, but I'd say:

  • to ensure a set amount of CPU resources, the best bet would probably be to set the CPU affinity (see, e.g. https://unix.stackexchange.com/questions/73/how-can-i-set-the-processor-affinity-of-a-process-on-linux ), but that would also require setting the affinity of all the other processes running on the machine if one wants to ensure that the prover processes are alone on each core (or finding another solution?)
  • One of the motivations for using something other than wall-clock time to measure regressions is that some colleagues told us the OS might decide to batch IOs, which could result in arbitrary delays for some prover instances on systems that are under load (or that do a lot of IO, e.g. when starting more than a dozen provers at the same time). One idea was to use CPU time to identify regressions, but maybe there would be other solutions (stagger the starting times of processes, preload/put the input problem files in the cache, etc.)
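For illustration, the affinity idea in the first bullet could be approximated by prefixing the prover command with `taskset` (the `pin` helper below is a hypothetical sketch, not benchpress code, and assumes the `taskset` utility is installed):

```ocaml
(* Sketch: pin a prover command to a single CPU by prefixing it
   with taskset(1). [cpu] is the zero-based index of the CPU the
   command should run on; the returned string is a shell command. *)
let pin ~cpu cmd =
  Printf.sprintf "taskset -c %d %s" cpu cmd

(* e.g. pin ~cpu:3 "alt-ergo problem.smt2"
   builds "taskset -c 3 alt-ergo problem.smt2" *)
```

This only pins the prover itself; as discussed above, other processes on the machine can still be scheduled on that CPU unless it is isolated separately.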

@Halbaroth
Contributor Author

I am trying another solution given by a colleague ;)

bclement-ocp added a commit to bclement-ocp/benchpress that referenced this issue Jun 7, 2023
This patch does multiple things to try to improve stability of running
times:

 - Introduce cpu affinity support in benchpress. This is configured
   on the command line with the new `--cpus` option, which replaces `-j`
   and allows specifying a list of cpus or cpu ranges in taskset format,
   such as `0-3,7-12,15` (strides are not supported).

   When the `--cpus` option is provided, provers will be run in parallel
   on the provided cpus, with at most one prover running on a given cpu
   at once.

   Note that the `--cpus` setting *only* concerns the prover runs (and
   some glue code around one specific prover run): it does *not*
   otherwise restrict the cpus used by benchpress itself. It is
   recommended to use an external method such as the `taskset` utility
   to assign to the toplevel benchpress process a CPU affinity that does
   not overlap with the prover CPUs provided in `--cpus`.

   When using the `--cpus` setting, users should be aware that *there
   may still be other processes on the system using these cpus*
   (obviously)! It is thus recommended to use one of the existing
   techniques to isolate these CPUs from the rest of the system. I know
   of two ways to do this: the `isolcpus` kernel parameter, which is
   deprecated but is slightly easier to use, and cpusets, which are not
   deprecated but harder to use.

   To use `isolcpus`, simply set the cpus to use for benchmarking as the
   isolated CPUs on the kernel cmdline and reboot (OK, maybe not that
   simple for one-shot benchmarking, but fairly easy for a machine that
   is mostly used for benchmarking). There is no need to set the CPU
   affinity for the `benchpress` binary: it will never be scheduled on
   the isolated CPUs, and neither will any other processes (unless
   manually required to).

   To use cpusets (which is the solution I employed on our benchmark
   machine at OCamlPro), you should create a `system` cpuset that
   contains only the CPUs that will *NOT* be used by benchpress, and
   move all processes to that cpuset (this can be done on a running
   system, consult the cpuset documentation). Then, create another
   cpuset that contains the CPUs to use for benchpress, including the
   CPUs to use for the `benchpress` binary (in practice I use the root
   cpuset that contains all the CPUs), and run `benchpress` in that
   cpuset. You must not forget to use `taskset` to prevent the
   `benchpress` binary from using the CPUs destined for the provers. In
   practice, I move the shell that I use to run benchpress to that
   second cpuset.

 - Input files are copied into RAM (through `/dev/shm`) before running
   the provers, and the input to the prover is the copy in RAM.  This
   ensures we don't benchmark the time it takes to load the problem from
   disk, and helps minimise noise in situations of high disk contention
   (such as when tens of prover instances are trying to read their input
   at the same time). We are still limited by the RAM bandwidth but…
   what can you do.

 - Standard output and standard error of the prover are written to
   temporary files in RAM instead of using pipes. This ensures that the
   prover never gets stuck waiting for benchpress to read from the pipes
   (although this shouldn't be an issue due to huge default pipe buffer
   sizes on Linux), but more importantly minimises context switches
   between the prover and benchpress.

With these changes we observe much less noise in benchmark results
internally at OCamlPro, and this all but fixes sneeuwballen#58 (that said, it would
still be nice to add support for getting execution times through
getrusage).
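The `/dev/shm` copy described above could look roughly like this in OCaml (the helper name and file-naming scheme are mine, not the actual patch's code):

```ocaml
(* Sketch: copy a problem file into RAM-backed /dev/shm so the
   prover reads its input from memory rather than from disk.
   Returns the path of the copy; the caller is expected to
   delete it after the prover run. *)
let copy_to_shm path =
  let dst =
    Filename.concat "/dev/shm"
      (Printf.sprintf "bp-%d-%s" (Unix.getpid ()) (Filename.basename path))
  in
  (* read the whole source file *)
  let contents =
    let ic = open_in_bin path in
    let n = in_channel_length ic in
    let s = really_input_string ic n in
    close_in ic;
    s
  in
  (* write the copy into /dev/shm *)
  let oc = open_out_bin dst in
  output_string oc contents;
  close_out oc;
  dst
```

The prover is then invoked on the returned path instead of the original one, so disk latency no longer shows up in the measured running time.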
bclement-ocp added a commit to bclement-ocp/benchpress that referenced this issue Jun 7, 2023
This helps with sneeuwballen#58