
Nextflow with Azure batch appears to fail when reading from multiple containers #5448

Open
zjupNN opened this issue Oct 30, 2024 · 2 comments


zjupNN commented Oct 30, 2024

Bug report

Nextflow with Azure Batch appears to fail when reading from multiple containers. Processes report the files as not existing, and .fusion.log contains messages about 403 authentication errors. This appears similar to a previously fixed issue, but it persists in 24.10.0, so the cause may be different.

See also discussion on Slack.

Expected behavior and actual behavior

We expected that it was possible to read from multiple Azure containers in the same workflow; it seems this is not the case.

Steps to reproduce the problem

Here is a small workflow to illustrate the problem:

process multi {
  conda "conda-forge::gawk"

  input:
  path(p1)
  path(p2)

  output:
  path("both.txt")

  script:
  """
  cat ${p1} ${p2} > both.txt
  """
}

workflow {
  p1 = Channel.fromPath(params.p1)
  p2 = Channel.fromPath(params.p2)
  multi(p1, p2)
}

Running

nextflow run main.nf \
  -profile azure_batch \
  -w az://output/multi \
  --p1 az://input1/foo.txt \
  --p2 az://input2/bar.txt

fails, whereas

nextflow run main.nf \
  -profile azure_batch \
  -w az://output/multi \
  --p1 az://output/foo.txt \
  --p2 az://output/bar.txt

works fine.

The config in question, containing the azure_batch profile (with some redacted info):

nextflow.enable.moduleBinaries = true

process {
    resourceLimits = [ cpus: 128, memory: 200.GB, time: 24.h ]

    errorStrategy = { task.exitStatus in [143, 137, 104, 134, 139] ? 'retry' : 'finish' }
    maxRetries = 1
    maxErrors = '-1'

    cpus = { 1 * task.attempt }
    memory = { 10.GB * task.attempt }
    time = { 12.h * task.attempt }
}

profiles {
  azure_batch {
    process {
      executor = 'azurebatch'
      machineType = "Standard_D2_v3,Standard_D4_v3,Standard_D8_v3,Standard_D16_v3,Standard_D32_v3"
    }

    managedIdentity {
      system = true
    }

    wave {
      enabled = true
      strategy = ['conda']
    }

    fusion {
      enabled = true
      exportStorageCredentials = true
    }

    azure {
      managedIdentity {
        system = true
      }

      storage {
        accountName = '[...]'
      }

      batch {
        location = '[...]'
        accountName = '[...]'

        autoPoolMode = true
        deletePoolsOnCompletion = true

        pools {
          auto {
            autoScale = true
            vmCount = 1
            maxVmCount = 100
            virtualNetwork = '[...]'
          }
        }
      }
    }
  }
}

Program output

Running Nextflow prints:

executor >  azurebatch (fusion enabled) (1)
[22/bfb52c] multi (1) [100%] 1 of 1, failed: 1
Execution cancelled -- Finishing pending tasks before exit
ERROR ~ Error executing process > 'multi (1)'

Caused by:
  The task exited with an exit code representing a failure


Command executed:

  cat foo.txt bar.txt > both.txt

Command exit status:
  1

Command output:
  (empty)

Command error:
  + cat foo.txt bar.txt
  cat: foo.txt: No such file or directory
  cat: bar.txt: No such file or directory

Work dir:
  [...]

Container:
  [...]

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

-- Check '.nextflow.log' file for details

The `.nextflow.log` does not contain anything that stands out, whereas the `.fusion.log` contains:

RESPONSE 403: 403 Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
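One plausible reading of the 403 (an assumption on our part, not confirmed by the log): Fusion authenticates blob requests with a SAS token scoped to the work-dir container, so requests against any other container are rejected by Azure Storage. The sketch below parses a hypothetical container-scoped SAS URL to show the query fields that limit a token's scope (the account name, container, and signature are all made up for illustration):

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical SAS URL of the kind Fusion could present for the work-dir
# container. A container-scoped token ("sr=c") only authorizes the container
# it was minted for; presenting it against a blob in a *different* container
# is what Azure Storage answers with HTTP 403.
sas_url = (
    "https://myaccount.blob.core.windows.net/output/foo.txt"
    "?sv=2022-11-02&sr=c&sp=rl&sig=REDACTED"
)

query = parse_qs(urlsplit(sas_url).query)
print("signed resource:", query["sr"][0])    # 'c' -> scoped to one container
print("signed permissions:", query["sp"][0]) # 'rl' -> read + list only
```

If this reading is right, it would explain why the second run (both inputs in the same container as the work dir) succeeds while the first one fails.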

Environment

  • Nextflow version: 24.10.0
  • Java version: 21.0.4
  • Operating system: Ubuntu 24.04.1 LTS
  • Bash version: fish 3.7.0/bash 5.2.21(1)

Additional context

We have not been able to verify whether the problem is Fusion-related. The pipeline still fails (with a similar but not identical error message) when running with `fusion.enabled = false`, but it has been difficult to diagnose whether this is the same issue or an unrelated problem with getting azcopy to where it needs to be during execution.

pditommaso (Member) commented:
Duplicate of #5444 (?)


zjupNN commented Oct 30, 2024

Yes, issue #5444 was created based on the Slack discussion related to this issue; here I've just collected the input from our data scientists to give an overview of how it was found.
