3.8.1.66

@PalNilsson PalNilsson released this 09 Sep 12:33
· 63 commits to master since this release
4a45bc2
  • Added default path for ifconfig command (used to lookup IPv6 info) if command not found
  • Support for OIDC tokens in the urllib-based request function (used for pilot-PanDA server communications)
    • Together with a token key, the primary OIDC token is used to download a shorter-lived token, which is used in subsequent communications with the PanDA server
    • The pilot refreshes the token immediately after launch, overwriting the original long-lasting token
    • The short-lived tokens are refreshed periodically (once every 60 minutes)
    • Note: OIDC tokens are used by default if found locally, otherwise X509 is used - i.e. there is no corresponding pilot option to activate the mechanism
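
The token behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the pilot's actual code: the function names `choose_auth` and `refresh_due` are hypothetical, and requiring both a token file and a token key file before choosing OIDC is an assumption based on the "together with a token key" wording.

```python
import os
import time

# From the release notes: short-lived tokens are refreshed once every 60 minutes.
REFRESH_INTERVAL_S = 60 * 60


def choose_auth(token_path, token_key_path):
    """Return 'oidc' if a token and a token key are found locally, else 'x509'.

    Hypothetical helper: checking for both files is an assumption.
    """
    paths = (token_path, token_key_path)
    if all(p and os.path.isfile(p) for p in paths):
        return "oidc"
    return "x509"


def refresh_due(last_refresh, now=None, interval=REFRESH_INTERVAL_S):
    """Return True when the token has never been refreshed or the interval has elapsed."""
    now = time.time() if now is None else now
    return last_refresh is None or now - last_refresh >= interval
```

Passing `last_refresh=None` right after launch makes the first call to `refresh_due` return True, which matches the immediate refresh described above.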
  • SIGTERM signals received on Kubernetes resources are now reported with new error code 1379, “Job was preempted”
  • Added two error codes for arcproxy failures
    • 1380: “General arcproxy failure” (was previously reported as 1008: “General pilot error, consult batch log”)
    • 1381: “Arcproxy failure while loading shared libraries”
      • Note: this (1381) is currently only used internally and does not lead to a failed job
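
One way the two arcproxy codes could be told apart is sketched below. The codes and messages come from the notes above; the function names are hypothetical, and matching on the dynamic loader's "error while loading shared libraries" text is an assumption about how such a failure might be detected, not a description of the pilot's logic.

```python
ARCPROXY_GENERAL = 1380
ARCPROXY_SHARED_LIBS = 1381

ERROR_MESSAGES = {
    ARCPROXY_GENERAL: "General arcproxy failure",
    ARCPROXY_SHARED_LIBS: "Arcproxy failure while loading shared libraries",
}

# Per the release notes, 1381 is only used internally and does not fail the job.
NON_FATAL_CODES = {ARCPROXY_SHARED_LIBS}


def classify_arcproxy_failure(stderr):
    """Map arcproxy stderr text to an error code (assumed matching rule)."""
    if "error while loading shared libraries" in stderr:
        return ARCPROXY_SHARED_LIBS
    return ARCPROXY_GENERAL


def job_should_fail(code):
    """Return False for codes that are recorded but do not fail the job."""
    return code not in NON_FATAL_CODES
```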
  • Remote file open container now using EL9 instead of CentOS7
    • Required for latest ROOT release
    • Requested by A. De Silva
  • Skipping setting RUCIO_ACCOUNT for payload
    • Requested by R. Walker
  • A time-out was added to the gdb command execution (for producing a core dump file) when a looping job has been discovered
    • Requested by R. Walker
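
A time-out around an external command such as gdb can be sketched with the standard library as shown below. The helper name `run_with_timeout` is hypothetical, and the actual gdb command line and time-out value used by the pilot are not given in these notes, so none is assumed here.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (exit_code, stdout); exit_code is None on time-out.

    subprocess.run() kills the child process when the time-out expires
    and then raises TimeoutExpired, so a hung command cannot block the caller.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None, ""
    return proc.returncode, proc.stdout
```

Returning `None` as the exit code lets the caller distinguish a time-out from a command that ran and failed.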
  • Real-time logging
    • Now possible to specify real-time logging server (type, protocol, URL and port) via pilot argument
      • Previously, it only worked via pilot config
      • Requested by W. Guan
    • Added Loki real-time logging module (Rubin)
    • Real-time logging can now be activated for all jobs on a given queue (relevant for pilot logs, not payload stdout)
      • Activation currently via PQ.catchall
      • Streaming of pilot logs requested by I. Vukotic
      • To be tested more widely
  • New pilot option --noworkerpilotstatusupdate can be used to switch off worker pilot status updates
    • Needed at NERSC
    • Requested by T. Maeno
  • Added timeout to urlopen() used for pilot-PanDA server communication
    • The default timeout is too short and, for getjob operations, can lead to “jobdispatcher, 102: Sent job didn't receive reply from pilot within 30 min”-errors
    • In case of failure, the pilot currently falls back to curl-based communication
    • Timeout is now explicitly set to 30 s
    • Reported by Z. Yang (Rubin)
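
The combination of an explicit urlopen() timeout and a curl fallback might look roughly like this. It is a sketch under assumptions: `fetch` and `curl_fallback` are hypothetical names, the `opener` and `fallback` parameters exist only to make the example testable, and the curl options shown are standard curl flags rather than the ones the pilot necessarily uses.

```python
import subprocess
import urllib.request

TIMEOUT_S = 30  # from the release notes: timeout is explicitly set to 30 s


def curl_fallback(url, data, timeout):
    """Fetch url with the curl CLI; data (bytes or None) is sent as the request body."""
    cmd = ["curl", "--silent", "--show-error", "--fail", "--max-time", str(timeout)]
    if data is not None:
        cmd += ["--data-binary", "@-"]  # read the body from stdin
    cmd.append(url)
    proc = subprocess.run(cmd, input=data, capture_output=True, check=True)
    return proc.stdout


def fetch(url, data=None, timeout=TIMEOUT_S, opener=urllib.request.urlopen, fallback=None):
    """Try urlopen() with an explicit timeout; on failure use the fallback if given."""
    try:
        with opener(url, data=data, timeout=timeout) as resp:
            return resp.read()
    except OSError:
        # URLError and socket time-outs are both OSError subclasses.
        if fallback is None:
            raise
        return fallback(url, data, timeout)
```

A caller would pass `fallback=curl_fallback` to get the behaviour described above; leaving it as `None` re-raises the original error.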
  • Bug fix
    • Fixed the final job completion state being set before log stage-out had completed
      • This led to the “ddm, 200: Could not get GUID/LFN/MD5/FSIZE/SURL from pilot XML”-error
      • Reported by R. Walker, discussed in JIRA ticket ATLASPANDA-1047
  • Housekeeping with pylint
    • The average pylint score of all pilot modules is 9.56

Contributions from W. Guan, P. Nilsson