
wr as new Nextflow backend #1088

Closed
sb10 opened this issue Mar 26, 2019 · 79 comments

@sb10
Contributor

sb10 commented Mar 26, 2019

New feature

I develop wr, which is a workflow runner like Nextflow but can also be used purely as a backend scheduler. It can currently schedule to LSF and OpenStack.

The benefits to Nextflow users of going via wr instead of using Nextflow’s existing LSF or Kubernetes support are:

  1. wr makes more efficient use of LSF: it can pick an appropriate queue, use job arrays, and “reuse” job slots. In a simple test I did, Nextflow using wr in LSF mode was twice as fast as Nextflow using its own LSF scheduler.
  2. wr’s OpenStack support is incredibly easy to set up and use (basically a single command to run), and provides auto scaling up and down. Kubernetes, by comparison, is quite complex to get working on OpenStack, doesn’t auto scale, and wastes resources, since multiple nodes are needed even while no workflows are being run. I was able to get Nextflow to work with wr in OpenStack mode (but the shared-disk requirement for Nextflow’s state remains a concern).

Usage scenario

Users with access to LSF or OpenStack clusters who want to run their Nextflow workflows efficiently and easily.

Suggested implementation

Since I don’t know Java well enough to understand how to implement this “correctly”, I wrote a simple bsub emulator in wr, which is what my tests so far have been based on. I submit the Nextflow command as a job to wr, turning on the bsub emulation, and configure Nextflow to use its existing LSF scheduler. While running under the emulation, Nextflow’s bsub calls actually call wr.

Of course the proper way to do this would be to have Nextflow call wr directly (either via the wr command line or its REST API). The possibly tricky part of having it work in OpenStack mode is having Nextflow tell wr about OpenStack-specific things: which image to use, which hardware flavour to use, details on how to mount S3, etc. (the bsub emulation handles all of this).
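For example, a wr executor might need nextflow.config options along these lines (these option names are purely illustrative; nothing like this exists yet):

executor {
  $wr {
    cacertpath  = '/path/to/wr/ca.pem'   // illustrative: trust wr's REST API certificate
    cloudimage  = 'Ubuntu Xenial'        // illustrative: OpenStack image to boot for workers
    cloudflavor = 'm1.medium'            // illustrative: hardware flavour to request
  }
}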

Here's what I did for my LSF test...

echo_1000_sleep.nf:

#!/usr/bin/env nextflow

num = Channel.from(1..1000)

process echo_sleep {
  input:
  val x from num

  output:
  stdout result

  "echo $x && sleep 1"
}

result.subscribe { println it }

workflow.onComplete {
    println "Pipeline completed at: $workflow.complete"
    println "Execution status: ${ workflow.success ? 'OK' : 'failed' }"
}

nextflow.config:

process {
  executor='lsf'
  queue='normal'
  memory='100MB'
}

install wr:

wget https://github.com/VertebrateResequencing/wr/releases/download/v0.17.0/wr-linux-x86-64.zip
unzip wr-linux-x86-64.zip
mv wr /to/somewhere/in/my/PATH/wr

run:

wr manager start -s lsf
echo "nextflow run ./echo_1000_sleep.nf" | wr add --bsub -r 0 -i nextflow --cwd_matters --memory 1GB

Here's what I did to get it to work in OpenStack...

nextflow_install.sh:

sudo apt-get update
sudo apt-get install openjdk-8-jre-headless -y
wget -qO- https://get.nextflow.io | bash
sudo mv nextflow /usr/bin/nextflow

put input files in S3:

s3cmd put nextflow.config s3://sb10/nextflow/nextflow.config
s3cmd put echo_1000_sleep.nf s3://sb10/nextflow/echo_1000_sleep.nf

~/.openstack_rc:

[your rc file containing OpenStack environment variables downloaded from Horizon]

run:

source ~/.openstack_rc
wr cloud deploy --os 'Ubuntu Xenial' --username ubuntu
echo "cp echo_1000_sleep.nf /shared/echo_1000_sleep.nf && cp nextflow.config /shared/nextflow.config && cd /shared && nextflow run echo_1000_sleep.nf" | wr add --bsub -r 0 -o 2 -i nextflow --memory 1GB --mounts 'ur:sb10/nextflow' --cloud_script nextflow_install.sh --cloud_shared

The NFS share at /shared created by the --cloud_shared option is slow and limited in size; a better solution would be to set up your own high-performance shared filesystem in OpenStack (e.g. GlusterFS), then add the commands to mount this share to nextflow_install.sh. Or even better, is there a way to have Nextflow not store state on disk? If it could just query wr for job completion status, that would be better.

@pditommaso
Member

This is a nice initiative; I'll be very happy to review and merge a pull request implementing support for this back-end engine.

@sb10
Contributor Author

sb10 commented Mar 27, 2019

@pditommaso Is there a way to get around needing a shared disk for Nextflow state?
If I were going to try and write this myself, despite not knowing Java, what's my best starting point? Base it on Nextflow's LSF scheduler? Or one of the cloud ones? Has anyone written a guide to writing Nextflow schedulers?

@pditommaso
Member

Is there a way to get around needing a shared disk for Nextflow state

Not easily; the bottom line is that temporary task outputs need to be shared with downstream tasks, therefore some kind of shared storage is required. For example, with AWS Batch NF uses S3 as shared storage; I guess for OpenStack there's a similar object storage solution (and actually there is).

what's my best starting point?

It depends. If wr provides a command line tool somewhat similar to qsub, bsub, etc., you can just create an executor extending AbstractGridExecutor, along the same lines as the one for Slurm, for example.
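For what it's worth, a rough sketch of that route might look like the following (WrGridExecutor is hypothetical, the wr kill/status subcommands are placeholders, and the override names are based on how SlurmExecutor/LsfExecutor looked at the time, so they may not match the current API exactly):

package nextflow.executor

import java.nio.file.Path
import nextflow.processor.TaskRun

// rough sketch only: map Nextflow's grid-executor hooks onto wr's command line
class WrGridExecutor extends AbstractGridExecutor {

    @Override
    protected List<String> getDirectives(TaskRun task, List<String> result) {
        // wr takes resource requirements on the command line, not as script directives
        return result
    }

    @Override
    protected boolean pipeLauncherScript() {
        // like bsub, `wr add` reads the command to run from stdin
        return true
    }

    @Override
    List<String> getSubmitCommandLine(TaskRun task, Path scriptFile) {
        return ['wr', 'add', '--cwd_matters']
    }

    @Override
    def parseJobId(String text) {
        // pull the job id out of whatever `wr add` prints
        return text.trim()
    }

    @Override
    protected List<String> getKillCommand() {
        return ['wr', 'kill']            // placeholder subcommand name
    }

    @Override
    protected List<String> queueStatusCommand(Object queue) {
        return ['wr', 'status']          // placeholder subcommand name
    }

    @Override
    protected Map<String, QueueStatus> parseQueueStatus(String text) {
        // translate wr's status output into Nextflow's queue-status map
        return [:]
    }
}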

If you are planning instead to use a low-level API, you can extend the Executor base class.

Has anyone written a guide to writing Nextflow schedulers?

Nope, but I can advise you if needed.

@sb10
Contributor Author

sb10 commented Mar 27, 2019

temporary task outputs need to be shared to downstream tasks, therefore some kind of shared storage. For example with AWS Batch NF is using S3 as shared storage

That works for wr's purposes; wr provides easy mounting of S3, which could be used for job inputs/outputs.

The problem I have is that Nextflow, running on the "master" node, needs to check the output written to S3 by a "worker" node before it, e.g., submits the next job.

wr's S3 mount is not POSIX enough for what Nextflow on the "master" needs to do for this state checking (e.g. it caches S3 "directory" contents, so it is not aware when a different process writes to that "directory").

Can job input/outputs be stored in S3, but job state be queried from the scheduler?

@pditommaso
Member

Can job input/outputs be stored in S3, but job state be queried from the scheduler?

Yes, you will need to write your own version of TaskHandler
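A rough sketch of such a handler (the WrRestApi client and its add()/status()/exitCode()/cancel() methods are hypothetical names, not a real API; the base-class hooks are those TaskHandler exposed at the time):

import nextflow.processor.TaskHandler
import nextflow.processor.TaskRun
import nextflow.processor.TaskStatus

// minimal sketch: job state is queried from wr over its REST API rather than from shared disk
class WrTaskHandler extends TaskHandler {

    private final WrRestApi client     // hypothetical REST client
    String jobId

    WrTaskHandler(TaskRun task, WrRestApi client) {
        super(task)
        this.client = client
    }

    @Override
    void submit() {
        // hand the task's wrapper command to wr and remember the job id it assigns
        jobId = client.add(task)
        status = TaskStatus.SUBMITTED
    }

    @Override
    boolean checkIfRunning() {
        if( client.status(jobId) == 'started' ) {
            status = TaskStatus.RUNNING
            return true
        }
        return false
    }

    @Override
    boolean checkIfCompleted() {
        if( client.status(jobId) == 'complete' ) {
            task.exitStatus = client.exitCode(jobId)
            status = TaskStatus.COMPLETED
            return true
        }
        return false
    }

    @Override
    void kill() {
        // invoked when the workflow run is aborted
        client.cancel(jobId)
    }
}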

@sb10
Contributor Author

sb10 commented Apr 11, 2019

I have an initial basic but working implementation here: https://github.com/sb10/nextflow/blob/wr/modules/nextflow/src/main/groovy/nextflow/executor/WrExecutor.groovy

I have # *** comments for areas where I'm stuck. Can you give me some pointers? In particular:

  1. Don't know how to pick up options from nextflow.config.
  2. Not sure as to the best approach for doing the important things BashWrapperBuilder does without creating a file.
  3. How do you get a "pure" task name?
  4. How do you get Groovy to trust a ca cert?
  5. Not sure how passing all possible task options should be done. Just pass around the task object, and will that have all desirable information like memory requirement hints etc?
  6. Under what circumstances does kill() get called?

I've only had a quick glance at an example LSF test, and didn't really get it. How should I go about writing a complete test for this? Are end-to-end tests done, or only mocks?

I'm currently using the TaskPollingMonitor, but the ideal way of implementing this would be for Nextflow to listen on a websocket and for wr to tell it the moment that tasks change state. What would be the best approach to doing that? Implement my own TaskMonitor? Or is it even possible?

Currently, my implementation is only working running locally. I still haven't thought about how to handle input and output files being in S3. With existing Executors, how does a user start a workflow running when their input files are not available locally, but in an S3 bucket? And how do they specify that the output files go to S3 as well?

@pditommaso
Member

It looks like a good start.

I have # *** comments for areas where I'm stuck.

I would suggest submitting a draft PR so I can reply to your comments.

Don't know how to pick up options from nextflow.config.

The Executor object has a session attribute that holds the config map. Something like session.config.navigate('your.attribute') should work.
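For example, something like this inside the executor class (the 'executor.cacertpath' key and the fallback value are just illustrative):

// read an optional setting from nextflow.config, falling back to a default
String cacert = session.config.navigate('executor.cacertpath') ?: 'ca.pem'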

Not sure as to the best approach for doing the important things BashWrapperBuilder does without creating a file.

What do you mean?

How do you get a "pure" task name?

What do you mean for pure?

How do you get Groovy to trust a ca cert?

More than Groovy, you should look at the URLConnection Java API. Have a look at K8sClient.

Not sure how passing all possible task options should be done. Just pass around the task object, and will that have all desirable information like memory requirement hints etc?

Task-specific settings are accessible on the TaskRun#config map. Have a look for example here.
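For example (assuming the usual TaskConfig getters):

// per-task resource hints coming from the process directives
def mem  = task.config.getMemory()    // e.g. 100 MB from a memory '100MB' directive
def cpus = task.config.getCpus()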

Under what circumstances does kill() get called?

Mostly when the execution is aborted and the cleanup method is invoked.

How should I go about writing a complete test for this? Are end-to-end tests done, or only mocks?

Mocks would be a good start. Have a look at K8sClientTest.
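A bare-bones Spock spec in that style might look like this (WrTaskHandler and the WrRestApi client with its add() method are hypothetical names for your own classes):

import nextflow.processor.TaskRun
import spock.lang.Specification

class WrTaskHandlerTest extends Specification {

    def 'submit should add the task to wr' () {
        given:
        def task    = Mock(TaskRun)
        def client  = Mock(WrRestApi)      // hypothetical REST client
        def handler = new WrTaskHandler(task, client)

        when:
        handler.submit()

        then:
        1 * client.add(task)               // interaction check; no real wr needed
    }
}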

I'm currently using the TaskPollingMonitor, but the ideal way of implementing this would be for Nextflow to listen on a websocket and for wr to tell it the moment that tasks change state.

I would keep this as a future enhancement.

Currently, my implementation is only working running locally. I still haven't thought about how to handle input and output files being in S3. With existing Executors, how does a user start a workflow running when their input files are not available locally, but in an S3 bucket?

This should be managed automatically by BashWrapperBuilder

sb10 mentioned this issue Apr 15, 2019
@sb10
Contributor Author

sb10 commented Apr 15, 2019

I have # *** comments for areas where I'm stuck.

I would suggest submitting a draft PR so I can reply to your comments.

PR created here: #1114

Don't know how to pick up options from nextflow.config.

The Executor object has a session attribute that holds the config map. Something like session.config.navigate('your.attribute') should work.

OK, I'll give it a try. But what is session.getConfigAttribute() and how are you supposed to use it?

Not sure as to the best approach for doing the important things BashWrapperBuilder does without creating a file.

What do you mean?

If possible I don't want Nextflow to create any files of its own (only task commands should create workflow output files). If that's not possible, then at least I need the wrapper script it creates to not have any absolute paths in it, because those paths won't exist on other OpenStack instances that wr might create to run tasks.

How do you get a "pure" task name?

What do you mean for pure?

For a scatter task, task.name looks like name (1), while I just want name.

Under what circumstances does kill() get called?

Mostly when the execution is aborted and the cleanup method is invoked.

Specifically, what state could a task be in when kill() is called? Any state?

Currently my code only considers "started" and "complete" statuses, which actually comprise multiple underlying states, and miss some possibilities. For example, in wr a job can be "lost" if it was running on an instance that got deleted externally. The user confirms the situation and then tells wr the job is really dead, and can then choose to restart it.

I imagine that, in the current situation with my Nextflow implementation, such a job would initially satisfy checkIfRunning(), but never satisfy checkIfCompleted(). Is something going to happen automatically in Nextflow to handle these situations, or should I either a) include "lost" as a "complete" status, or b) have users check wr's own status interface to deal with lost jobs?

Currently, my implementation is only working running locally. I still haven't thought about how to handle input and output files being in S3. With existing Executors, how does a user start a workflow running when their input files are not available locally, but in an S3 bucket?

This should be managed automatically by BashWrapperBuilder

It's not clear to me how this is done. What does a user have to do to enable this feature?

The scenario I envisage is:

Nextflow is run on a local machine from which wr has been deployed to OpenStack. This local machine does not have the S3 bucket mounted, and is completely different to the OpenStack servers that will be created to run tasks (in particular, different home directory location).

nextflow.config is configured with the location of an S3 bucket that holds their input files and where output files should go, and their ~/.s3cfg holds their connection details.

This results in WrRestApi.add() (called by submit()) running the TaskBean(task).script in a directory that is mounted on the configured S3 bucket location.

For this to actually work, task.getWorkDirStr() would have to return a unique task path relative to the configured S3 bucket location, and wr would cd there before running the task bean script. Or alternatively, copyStrategy.getStageInputFilesScript needs to symlink from the correct location within a mounted bucket location to the local working directory, and then output files need to be copied over to the mounted bucket location.

Is a scenario like this at all possible? Or what is the "Nextflow" way of doing this kind of thing?

@pditommaso
Member

OK, I'll give it a try. But what is session.getConfigAttribute() and how are you supposed to use it?

Actually, it should work; the TES executor uses the same method. The best thing to do is to create a unit test and debug it.

If possible I don't want Nextflow to create any files of its own (only task commands should create workflow output files). If that's not possible, then at least I need the wrapper script it creates to not have any absolute paths in it, because those paths won't exist on other OpenStack instances that wr might create to run tasks.

You can create a copy strategy that does nothing. Have a look at TesFileCopyStrategy

For a scatter task, task.name looks like name (1), while I just want name

Check the TaskRun#processor#name
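In Groovy that would be roughly:

String processName = task.processor.name    // e.g. "echo_sleep" rather than "echo_sleep (1)"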

Specifically, what state could a task be in when kill() is called? Any state?

It's used to kill pending or running tasks when the workflow execution is aborted.

It's not clear to me how this is done. What does a user have to do to enable this feature?

It strongly depends on the storage type used as the pipeline work dir. When an input file is stored in a file system different from the one used as the work dir, NF copies that file from the foreign file system to the work dir (for example, an input is on FTP and the work dir is an S3 bucket). This is controlled by the resolveForeignFiles method linked above. Then, if the work dir is NOT a POSIX file system, files are copied to a scratch directory on the local node. This is generally done by the task handler.

@sb10
Contributor Author

sb10 commented Apr 15, 2019

Are there any guides or examples aimed at end-users of using Nextflow with data in S3?

@pditommaso
Member

There's nothing special; file paths just need to start with s3://

@sb10
Contributor Author

sb10 commented Apr 15, 2019

I'm testing with this basic workflow:

#!/usr/bin/env nextflow

Channel.fromPath('test_inputs/*.input').set { inputs_ch }

process capitalize {
  input:
  file x from inputs_ch
  output:
  file 'file.output' into outputs_ch
  script:
  """
  cat $x | tr [a-z] [A-Z] > file.output
  """
}

outputs_ch
      .collectFile()
      .println{ it.text }

workflow.onComplete {
    println "Pipeline completed at: $workflow.complete"
    println "Execution status: ${ workflow.success ? 'OK' : 'failed' }"
}

If I change the input to be specified like:

Channel.fromPath('s3://mybucket/test_inputs/*.input').set { inputs_ch }

Then it fails because it tries to contact Amazon S3. How do I tell Nextflow to use my own S3 service?

I looked at https://www.nextflow.io/docs/latest/config.html#config-aws and created a nextflow.config with an aws section with an endpoint value, but it didn't seem to help. In my ~/.s3cfg file I have to specify:

[default]
access_key = xxx
secret_key = xxx
encrypt = False
host_base = cog.sanger.ac.uk
host_bucket = %(bucket)s.cog.sanger.ac.uk
progress_meter = True
use_https = True

Assuming that can be resolved, what change do I make to my workflow so that the file.output is actually an input-specific output file stored in S3 as well?

I'm hoping that instead of the current method of copying files from S3 to local disk as part of staging, I can make use of wr's S3 mount capabilities, avoiding the need for large local disks when working on large files in S3.

@pditommaso
Member

Here there's an example (and, by the way, it seems to be by a guy at Sanger).

what change do I make to my workflow so that the file.output is actually an input-specific output file stored in S3 as well

Tasks are implicitly assumed to be executed in a POSIX file system, and therefore output files are created in the task work dir. Then, if the pipeline work dir is remote storage, NF will copy those files to the remote work dir.

I'm hoping that instead of the current method of copying files from S3 to local disk as part of staging, I can make use of wr's S3 mount capabilities, avoiding the need for large local disks when working on large files in S3.

It should be possible by writing your own copy strategy.

@sb10
Contributor Author

sb10 commented Apr 16, 2019

Then if the pipeline work dir is a remote storage

How does a user specify this? Or how does Nextflow know?

Anyway, I'll look in to writing my own copy strategy.

@pditommaso
Member

How does a user specify this? Or how does Nextflow know?

Using the -w command line option and specifying a supported remote storage path, e.g. s3 or gs:

nextflow run <something> -w s3://bucket/work/dir

Then, this decides if an input file is on foreign storage, i.e. different from the current work dir.

@sb10
Contributor Author

sb10 commented Apr 16, 2019

OK, I'll give it a try. But what is session.getConfigAttribute() and how are you supposed to use it?

Actually, it should work.

It wasn't working because of my config file, but I'm not sure how to get the TesExecutor's way of doing it to work.

session.getConfigAttribute('executor.cacertpath', [...]) works with:

executor {
  name='wr'
  cacertpath='foo.pem'
}

but session.getConfigAttribute('executor.wr.cacertpath', [...]) does not work with:

process {
  executor='wr'
}

executor {
  $wr {
    cacertpath='foo.pem'
  }
}

or any other combination I could think of. What is the correct way that people would expect to be able to specify executor options in their nextflow.config, and how do I pick those options up in all cases?

@sb10
Contributor Author

sb10 commented Apr 18, 2019

I've now implemented my WrTaskHandler as a BatchHandler, which speeds up getting task status. Is there an equivalent way to batch up the submit requests as well?

@pditommaso
Member

Do you mean batching submit requests? Nope, there isn't a specific mechanism for this.

@sb10
Contributor Author

sb10 commented Apr 18, 2019

Yes, that's what I mean. I'm thinking I can add a BatchSubmitHandler trait to BatchHandler.groovy, which is identical to the BatchHandler trait but just with a different method name (batchSubmit() instead of batch()). I need this because a single batch() method can't handle the needs of batching both status requests and submit requests.

Alternatively, I could pass my REST API client to my TaskMonitor and have the monitor class directly submit all the tasks in the pending queue to the client in 1 go.

Or something else?
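The second option could look roughly like this (the addAll() call on the hypothetical WrRestApi client, and the jobId field on the handler, are names of my own invention):

import nextflow.processor.TaskStatus

// hypothetical sketch: drain the monitor's pending queue and submit everything in one REST call
void submitPending(List<WrTaskHandler> pending, WrRestApi client) {
    List ids = client.addAll( pending.collect { it.task } )   // one POST for all queued tasks
    pending.eachWithIndex { handler, i ->
        handler.jobId  = ids[i]
        handler.status = TaskStatus.SUBMITTED
    }
}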

@pditommaso
Member

I think both approaches can work. I would suggest keeping it as an optimisation step, or just postponing it, because I need to merge some changes to the TaskPolling and Executor logic in master next week.

@sb10
Contributor Author

sb10 commented Apr 18, 2019

Getting back to S3, I've got my config file working by doing:

aws {
  client {
    endpoint = "https://cog.sanger.ac.uk"
    signerOverride = "S3SignerType"
  }
}

Then, using the normal new BashWrapperBuilder(task).build() and running with nextflow testS3.nf -w s3://sb10/nextflow/work, Nextflow finds and scatters over my S3 input files and writes the .command.run and .command.sh files to S3. Great!

But my wrapperFile variable is /sb10/nextflow/work/4f/03a48f82c42eaa2228e931eae3d8f9/.command.run and the uploaded .command.run file contains paths like:

nxf_launch() {
    /bin/bash -ue /sb10/nextflow/work/4f/03a48f82c42eaa2228e931eae3d8f9/.command.sh
}

Is this expected, or am I not quite doing something correctly? Am I supposed to mount my sb10 bucket at /sb10 for this to all work and come together?

@sb10
Contributor Author

sb10 commented Apr 18, 2019

And if that's the case, how can my WrTaskHandler.submit() method know that it should request such a mount when submitting to wr?

@pditommaso
Member

pditommaso commented Apr 18, 2019

One question at a time please, otherwise the thread becomes unreadable. You are getting that path because you are using the SimpleFileCopyStrategy (I guess), which is designed to work with local paths, and the S3Path object, when converted to a string, omits the protocol prefix, i.e. s3://.

Provided that you will mount the inputs (in a container, I suppose), how do you expect the script to reference these files?

@sb10
Contributor Author

sb10 commented Apr 18, 2019

The ideal way would be that I submit the command .mnt/sb10/nextflow/work/4f/03a48f82c42eaa2228e931eae3d8f9/.command.sh to wr while telling it to mount s3://sb10 at .mnt/sb10 in the current working directory.

On an OpenStack node wr will create a unique working directory in /tmp, mount the s3 bucket at .mnt/sb10, and execute the .sh file that is now available. Currently that won't work because the .sh would also need to refer to the .run file as being in .mnt/sb10/nextflow/work/4f/03a48f82c42eaa2228e931eae3d8f9/.command.run.

I've been trying to get my head around SimpleFileCopyStrategy and how I could make minimal changes while getting it to give me the relative paths I want, but I haven't understood it yet.

@pditommaso
Member

Let's put it this way: given a pipeline execution work dir of s3://foo, NF will compute for each task a work dir such as s3://foo/xx/yy. This path is set in the TaskBean#workDir attribute, which is then set in the SimpleFileCopyStrategy#workDir attribute.

Then, for each (S3) input file, the stageInputFile method is invoked; it is passed the source S3 path and the expected relative file name resolved against the task work dir (most of the time just the file name without a path).

Therefore it's the role of your custom file copy strategy to implement the logic to handle your use case. IMO it should be very similar to AwsBatchFileCopyStrategy.groovy, with the difference that you don't need to copy the input files, only the output files, to the targetDir (which is supposed to be the same as workDir unless the process specifies a custom storeDir).
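For what it's worth, a skeleton along those lines might look like this, assuming the hook signatures as they are referenced in this thread; the mount-point handling (toMountedPath) is purely illustrative:

import java.nio.file.Path
import nextflow.executor.SimpleFileCopyStrategy

// purely illustrative: inputs are assumed to be visible under a wr-managed S3 mount,
// so staging becomes a symlink and only outputs are copied back to the (remote) target dir
class WrFileCopyStrategy extends SimpleFileCopyStrategy {

    @Override
    String stageInputFile(Path path, String targetName) {
        // 'path' is the source S3 object, 'targetName' the name expected in the task work dir
        return "ln -s ${toMountedPath(path)} ${targetName}"
    }

    @Override
    String getUnstageOutputFilesScript(List<String> outputFiles, Path targetDir) {
        // copy declared outputs into the mounted target dir so NF can see them in S3
        return outputFiles.collect { "cp -r ${it} ${toMountedPath(targetDir)}/" }.join('\n')
    }

    private String toMountedPath(Path path) {
        // hypothetical helper: rewrite an S3 path into a path under the wr mount point
        return ".mnt/${path.toString().replaceFirst('^/', '')}"
    }
}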

Entering in Easter mode 😄👋👋

@sb10
Contributor Author

sb10 commented Apr 26, 2019

Thanks. I now have tasks running successfully in OpenStack with data in S3, but my output file is not "unstaged" to be put in S3, so the workflow fails due to missing output files.
This is probably just because I don't know how to write a proper S3-based workflow. Can you tell me how to adjust this so it does the right thing?

#!/usr/bin/env nextflow

Channel.fromPath('s3://sb10/nextflow/test_inputs/*.input').set { inputs_ch }

process capitalize {
  input:
  file x from inputs_ch
  output:
  file 'file.output' into outputs_ch
  script:
  """
  cat $x | tr [a-z] [A-Z] > file.output
  """
}

outputs_ch
      .collectFile()
      .println{ it.text }

workflow.onComplete {
    println "Pipeline completed at: $workflow.complete"
    println "Execution status: ${ workflow.success ? 'OK' : 'failed' }"
}

In particular, I don't know how to specify that I want file.output to get stored back in S3 somewhere, at a location that is unique to the scatter, such that it can be found and acted on by the outputs_ch (or, in a more complex workflow, by a subsequent step).

Or should the workflow remain unchanged, and instead my CopyStrategy be changed to copy all output files to the same place the .command.sh was in (ie. the workDir in S3)?

Edit: my stageOutCommand() is not getting called because in BashWrapperBuilder, copyStrategy.getUnstageOutputFilesScript(outputFiles,targetDir) is only called if( changeDir || workDir != targetDir ). My workDir and targetDir are the same in this case, and changeDir is null. Is this expected? Should I be trying to set changeDir to something by enabling a scratch directory?

@sb10
Contributor Author

sb10 commented Jul 15, 2019

@PeteClapham Nextflow using wr as a backend does use wr's database for state, but my implementation still involves Nextflow creating at least two files and numerous folders per job. These files are bash scripts of what should be run.

I just tried to benchmark a workflow that did 1 million echoes (on the release version of Nextflow), but I filled my Lustre number-of-files disk quota after about 5000 jobs ;)

@PeteClapham

Yes, the disk requirements are also having an impact. Again, a database would help here. I'll send a local PM with more details.

@sb10
Contributor Author

sb10 commented Jul 15, 2019

@PeteClapham I don't think it's an issue in OpenStack though, since it will only be a few files and folders per machine and no shared disk. Of course S3 needs to cope with lots of files.

@PeteClapham

Absolutely; all platforms are going to be similarly limited regardless of the storage backend in use. Some will cope for longer than others, but the issue will remain.

Ultimately, all open/close operations come with some overhead. Physical spinning storage elements suffer more than SSDs etc., but even then databases are more scalable at this type of transaction.

@sb10
Contributor Author

sb10 commented Jul 15, 2019

@PeteClapham We don't have a need to scale infinitely though, because we only have so much hardware on which to execute jobs.

So if a workflow can fill a team's OpenStack quota without hitting performance issues, that's good enough.

I think it's theoretically possible to make my Nextflow->wr implementation create no extra files on disk (just folders and the files created by the actual bioinformatic executables), but that may be a lot of extra work. It would be good to understand the details of real problems first.

@PeteClapham

@sb10 I suspect that infinite scaling co-exists with Nessie and Bigfoot; at least I've never seen the three of them in the same room together :)

Realistically, we have numerous workloads with 20,000–50,000 jobs within a single array (on LSF) and similar numbers being pushed through cloud platforms. In both cases, 10k related concurrently running tasks is not unusual to see, depending on platform contention.

So this is not an unusual scaling question, and certainly one we have been seeing locally for many years.

We can expect to see customers looking to scale to greater numbers in the longer term, but this will depend upon the ability of the infrastructure and tools to provide the required support to make this happen, hence the comment above.

@PeteClapham

We have a local customer who has been hitting or reporting the above impact, so I'll pass on their details to allow a more rapid dialogue.

@PeteClapham

For clarity, holding state in small files and many artifacts on disk has been reported locally as significantly impacting scalability, quota management, and run cleanup should a component or components fail. It may be that there are better working solutions around the current concerns, and we would be happy to receive guidance from the NF team. Internally, we can look at optimisation options regarding wr and similar workflows. I suspect that there will be local optimisations arising as teams become more familiar with the tools in flight.

Thanks to you both for helping look at this. It is much appreciated.

@pditommaso
Member

At some point, NF should implement a garbage collection algorithm able to remove files as soon as they are not needed any more (#452).

@sb10
Contributor Author

sb10 commented Aug 2, 2019

You can read the guide I wrote for users here: https://github.com/VertebrateResequencing/wr/wiki/Nextflow

To try this out yourself for real, you can follow the guide up to the LSF section, but start wr in local mode with wr manager start -s local instead. Once you've done that, use Nextflow as normal. Depending on hardware and disk, I found Nextflow -> wr (in local mode) to be either a little slower or sometimes even a little faster than pure Nextflow (in local mode).

(To "install" wr, just download the latest version from https://github.com/VertebrateResequencing/wr/releases and extract it from the zip.)

The PR has no integration tests with wr, btw. Is that a problem?

@pditommaso
Member

It looks good.

The PR has no integration tests with wr, btw. Is that a problem?

No, but it could be useful for you to set up a CI service for this.

@sb10
Contributor Author

sb10 commented Oct 24, 2019

Hi, I'm not sure how to set up a CI service for it. This code has been in use here for production work at scale for some months now. It would be good to get it integrated so users can use the latest version of Nextflow without losing wr's OpenStack scheduling.

@sb10
Contributor Author

sb10 commented Dec 4, 2019

I've rebased it on top of latest master. Tests pass locally, but your CI tests fail. I'm not sure why or how to debug.

@pditommaso
Member

Is there a link to the error log, or can you post it here?

@sb10
Contributor Author

sb10 commented Dec 4, 2019

@pditommaso
Member

nextflow.wr.executor.WrFileCopyStrategyTest > should return task env FAILED
    org.spockframework.runtime.SpockComparisonFailure at WrFileCopyStrategyTest.groovy:57

nextflow.wr.executor.WrFileCopyStrategyTest > should return cp script to unstage output files to S3 FAILED
    org.spockframework.mock.TooFewInvocationsError at WrFileCopyStrategyTest.groovy:340

nextflow.wr.executor.WrFileCopyStrategyTest > should return commands for an S3 bin dir FAILED
    org.spockframework.runtime.SpockComparisonFailure at WrFileCopyStrategyTest.groovy:548

nextflow.wr.processor.WrMonitorTest > should submit all pending tasks FAILED
    org.spockframework.mock.TooFewInvocationsError at WrMonitorTest.groovy:109

It's the wr tests that are failing. Weird that local tests pass. Are you using your own passwords/S3 buckets/etc.?

@sb10
Contributor Author

sb10 commented Dec 4, 2019

I don't think any actual S3 access is occurring in these tests.
https://github.com/sb10/nextflow/blob/nf-wr/modules/nf-wr/src/test/nextflow/wr/executor/WrFileCopyStrategyTest.groovy

Sorry, I was running the tests wrong locally. When I run make test module=nf-wr I do get the same error. Now I can debug and fix them...

@sb10
Contributor Author

sb10 commented Dec 4, 2019

Fixed and all tests now passing. I don't know how they passed before or what changed in between, but hopefully it's all correct now.

@sb10
Contributor Author

sb10 commented Dec 4, 2019

Now the tests in CI are failing due to timing out after about 1.5hrs.
https://github.com/sb10/nextflow/commit/40339f0708ea4ca5fc6cac9fb70638185dc920e4/checks
Any ideas why that is happening?
The tests locally complete in a few seconds.

@pditommaso
Member

It looks like one or more wr tests hang.

@sb10
Contributor Author

sb10 commented Dec 4, 2019

How do you debug what's happening in that CI service? If it doesn't hang locally, what can I do?

@pditommaso
Member

The test report is uploaded to the Nextflow web site, but this is disabled on pull requests for security reasons. You may try adding a Timeout to your test classes. For example:
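A minimal sketch of that, using Spock's @Timeout on the WrMonitorTest class mentioned above:

import java.util.concurrent.TimeUnit
import spock.lang.Specification
import spock.lang.Timeout

// fail any feature method that runs for more than 60 seconds instead of hanging the CI build
@Timeout(value = 60, unit = TimeUnit.SECONDS)
class WrMonitorTest extends Specification {
    // ... existing feature methods ...
}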

@sb10
Contributor Author

sb10 commented Mar 23, 2020

Hi, getting back to working on this again. I've just added my files from #1148 to https://github.com/nextflow-io/nf-wr, it compiles with the command in the README.md, but I don't know how to run the tests or use it with Nextflow.

Can you offer some guidance?

@sb10
Contributor Author

sb10 commented Mar 23, 2020

I got tests to work and they're now passing. But how does an end-user install latest Nextflow that uses nf-wr?

@pditommaso
Member

pditommaso commented Mar 23, 2020

Nice. Once packaged and uploaded to Maven, the user will need to use the NXF_GRAB variable to pull in the nf-wr dependency for now. At some point, there will be a plugin mechanism to make this smoother.

@sb10
Contributor Author

sb10 commented Mar 23, 2020

Thanks. I don't know anything about Maven. Have you got code somewhere that does this package and upload? Or do I just follow something like https://maven.apache.org/repository/guide-central-repository-upload.html ? What repository do I upload to? And can you give me the exact value I'd set NXF_GRAB to?

@pditommaso
Member

Uploading to Maven central is quite tricky. I'll try to set up automated uploading with the CI build at my earliest convenience.

@sb10
Contributor Author

sb10 commented Mar 26, 2020

Thanks. In the meantime, is there any way I can create a build for myself with the latest Nextflow and nf-wr? I.e. point it at my local clone or .jar file instead of at Maven?

@pditommaso
Member

Let's move this to its own issue nextflow-io/nf-wr#1
