Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tx-generator MVar deadlock reporting #5850

Merged
merged 9 commits into from
Jul 2, 2024
Merged

Conversation

NadiaYvette
Copy link
Contributor

@NadiaYvette NadiaYvette commented May 22, 2024

Description

In order to provide diagnostics surrounding MVar deadlocks seen during termination of the tx-generator, the following commit series provides some element of instrumentation & deliberate signal handling. This installs a signal handler for SIGINT and SIGTERM which issues trace messages and shuts down the entire process to the best of its ability. Some of the hope is that the explicit signal catching allows for deadlocks during termination of the tx-generator to be counteracted by sending signals and that that will log trace messages of the same kind as are handled by performance team tooling.

Checklist

  • Commit sequence broadly makes sense and commits have useful messages
  • New tests are added if needed and existing tests are updated. These may include:
    • golden tests
    • property tests
    • roundtrip tests
    • integration tests
      See Runnings tests for more details
      There should not be new tests required.
  • Any changes are noted in the CHANGELOG.md for affected package
    This change doesn't appear worthy of changelogging, as it doesn't cause upstream API changes.
  • The version bounds in .cabal files are updated
    No incompatibilities are introduced, so it doesn't appear that version bounds in cabal files need updating.
  • CI passes. See note on CI. The following CI checks are required:
    • Code is linted with hlint. See .github/workflows/check-hlint.yml to get the hlint version
    • Code is formatted with stylish-haskell. See .github/workflows/stylish-haskell.yml to get the stylish-haskell version
    • Code builds on Linux, MacOS and Windows for ghc-8.10.7 and ghc-9.2.7
  • Self-reviewed the diff

Note on CI

If your PR is from a fork, the necessary CI jobs won't trigger automatically for security reasons.
You will need to get someone with write privileges. Please contact IOG node developers to do this
for you.

@NadiaYvette NadiaYvette requested review from a team as code owners May 22, 2024 01:39
@NadiaYvette NadiaYvette self-assigned this May 22, 2024
@NadiaYvette NadiaYvette added the WIP Work In Progress (cannot be merged yet) label May 22, 2024
@NadiaYvette NadiaYvette force-pushed the nadia.chambers/txgen-mvar-004 branch 2 times, most recently from b5d6a4c to 3ab03e2 Compare May 22, 2024 04:11
@NadiaYvette NadiaYvette changed the title Nadia.chambers/txgen mvar 004 tx-generator MVar deadlock reporting May 22, 2024
@NadiaYvette NadiaYvette force-pushed the nadia.chambers/txgen-mvar-004 branch 8 times, most recently from cfa09f4 to e874653 Compare May 28, 2024 15:16
@mgmeier mgmeier force-pushed the nadia.chambers/txgen-mvar-004 branch from e874653 to 8ab6c14 Compare May 28, 2024 15:31
@NadiaYvette NadiaYvette removed the WIP Work In Progress (cannot be merged yet) label May 28, 2024
@NadiaYvette NadiaYvette force-pushed the nadia.chambers/txgen-mvar-004 branch from 8ab6c14 to 271a966 Compare June 3, 2024 16:57
@mgmeier mgmeier force-pushed the nadia.chambers/txgen-mvar-004 branch from 271a966 to a20a4cf Compare June 4, 2024 14:58
@NadiaYvette NadiaYvette force-pushed the nadia.chambers/txgen-mvar-004 branch 3 times, most recently from 7545743 to 0c8a34f Compare June 26, 2024 18:48
@NadiaYvette NadiaYvette requested a review from a team as a code owner June 26, 2024 18:48
@NadiaYvette NadiaYvette force-pushed the nadia.chambers/txgen-mvar-004 branch 3 times, most recently from df78323 to 7f111d0 Compare June 27, 2024 22:22
@NadiaYvette NadiaYvette force-pushed the nadia.chambers/txgen-mvar-004 branch 2 times, most recently from 8f010dc to 3962aa7 Compare July 1, 2024 08:47
@NadiaYvette NadiaYvette force-pushed the nadia.chambers/txgen-mvar-004 branch from 3962aa7 to bcbb945 Compare July 1, 2024 10:37
Copy link
Contributor

@mgmeier mgmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the huge chunk of work @NadiaYvette

@NadiaYvette NadiaYvette force-pushed the nadia.chambers/txgen-mvar-004 branch from f76820e to 5644eea Compare July 1, 2024 15:44
@mgmeier mgmeier force-pushed the nadia.chambers/txgen-mvar-004 branch from 5644eea to f76820e Compare July 1, 2024 15:45
NadiaYvette and others added 9 commits July 1, 2024 15:46
This is to speed up the development cycle esp. for rapid tests.
This adds a keepalive interval field for NixServiceOptions and then
handles defaulting it to 30 like now when it's absent from the JSON
and defaulted to 10. It also omits it when it's the default when
rendering it into JSON.
[email protected] wrote the nix code and the beginnings of the Haskell
side. So this commit goes about creating node aliases, represented in
Haskell in the NodeDescription structure, and using them in reporting,
which happens in the exception handlers. The node aliases contribute to
thread labels reported in the exception handlers where possible; older
base versions lack accessors for thread labels.

handleTxSubmissionClientError is the main exception handler involved.
txSubmissionClient, consumeTxsNonBlocking and newTpsThrottle are
spawned as threads within walletBenchmark and labelled accordingly,
with the txSubmissionClient labels further elaborated upon with the
node submitted to. SubmitMode's threadName field for the Benchmark
alternative is produced by the compiler and corresponds to a label for
a set of threads tracked in an AsyncBenchmarkControl structure as
opposed to a label for an individual thread, hence the effort to
generate thread labels by other means.
This installs a signal handler in Cardano.Benchmarking.Command.runCommand
It borrows heavily from the signal handler from cardano-node.
I. make AsyncBenchmarkControl a record
    The type alias of a tuple was not very mnemonic or self-explanatory.
    This replaces it with a record and haddock documents its fields.
II. use ABC to cancel threads
    The AsyncBenchmarkControl that should be initialized by the time a
    signal is received is fetched from the TVar and unpacked to be used
    to throw exceptions to the other threads. The other threads can now
    catch the exceptions in order to carry out orderly shutdowns in the
    sequel.
III. use TVar for Env AsyncBenchmarkControl
    In order to thread the AsyncBenchmarkControl through the contexts
    surrounding the creation and destruction of the Async structures and
    overall container, this stores a TVar (Maybe AsyncBenchmarkControl)
    as a value in a Map where previously it was just
    AsyncBenchmarkControl. The idea is to use the reference to it to be able
    to use it in the context of a signal handler by packaging the
    reference data with the code pointer in a partial application or
    monadic context or similar. With that data in hand, it's just a
    matter of iterating over the threads and reaping them all.
This took a fair amount of rearrangement to broaden the constant
environment in order to pass the keepalive interval in the
NixServiceOptions around. So a few different things happened:

I. create EnvConsts structure encompassing
  A. AsyncBenchmarkControl TVar (potentially changing to IORef)
  B. IOManager
  C. Maybe NixServiceOptions
  This moves the mutable reference in A. to the Reader environment from
  the State of the ExceptionT Env.Error (RWST EnvConsts () Env IO)
  ActionM monad. The reference stays constant though the referenced
  data changes.
II. pass EnvConsts to runScript and runSelftest
III. update Env.hs and NixService.hs accessors

Some of it represents changing a little of the design of the Env and
ActionM once again even after the prior commits, so a fair amount of
squashing commits that entirely redo earlier commits' changes and
rewriting commit messages will need to be done in the sequel.
I. Make tracers potentially available within signal handlers.
      This logs the event better.
II. killThread weak main TID.
      Killing the main thread if the signal is received in a
      secondary thread makes sense as a back-up strategy.
@NadiaYvette NadiaYvette force-pushed the nadia.chambers/txgen-mvar-004 branch from f76820e to dd32da1 Compare July 1, 2024 15:47
@mgmeier mgmeier added this pull request to the merge queue Jul 2, 2024
Merged via the queue into master with commit 6fa77a9 Jul 2, 2024
23 checks passed
@mgmeier mgmeier deleted the nadia.chambers/txgen-mvar-004 branch July 2, 2024 08:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants