Add --split option #80
Conversation
Not sure why the build is failing here; all tests pass locally. Will investigate in the morning.
Force-pushed from 8ba63d8 to 53675f4.
Build fixed and debugging commits force-pushed away, so ready for review now.
I've done some testing with this, it's good stuff! In general it seems to work well and the individual series files are correctly sized. Non-split mode appears unchanged. After splitting, however, trying to open an individual series .tif in FIJI (and I presume any other bioformats-based viewer) 'sees' all of the series which were exported rather than just the opened file. The GUI seems to list them all as being within the dropped .tif, but the actual data is not there. This wouldn't be a major problem; however, if any of the other series files are missing (e.g. we just copied Series 2 to another location and tried to open it), the importer fails to open the file with an indexing error:
I anticipate that users will want to isolate and work on the individual series files, so perhaps it's better if they're fully decoupled from each other? In a perfect world, perhaps the reader would check for access to the sub-files before listing them as options for display. As another smaller point, could we have the name generator strip
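For reference, a hedged way to see this coupling from the command line is with Bio-Formats' showinf tool; the file name below is just an assumed example of a --split output, not a file from this testing:

```
# Sketch only: inspect a single split file with showinf. With the behaviour
# described above, the reported series count covers the whole original dataset
# rather than just this file, and opening fails if the sibling
# output_s*.ome.tiff files are not present alongside it.
showinf -nopix -noflat output_s2.ome.tiff
```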
Conflicts fixed - this is now ready for review again.
@sbesson / @melissalinkert: Let's make sure we select a good cross-section of data to run some representative conversions against, with and without this PR, to check that:
I think that covers all of the multi-Image cases. It might also be interesting to test an even larger plate - something like 1536 x 20 fields, or the largest we've encountered so far. I don't think we have a public example of that, but a fake plate might be OK in that case since the idea is to see what happens when tens of thousands of files are created.
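One hedged way to set up such a test, sketched here as an assumption rather than the procedure actually used below: encode the plate dimensions in a Bio-Formats "fake" file name and run both conversion modes over it. The file names, pixel sizes and --split output prefix are illustrative only.

```
# Sketch: synthesise a 1536-well (32 x 48), 20-field plate with the
# Bio-Formats "fake" format (parameters are encoded in the file name),
# then convert with and without --split. Generating ~30K series will
# itself take a long time.
touch "plate&plates=1&plateRows=32&plateCols=48&fields=20&sizeX=256&sizeY=256.fake"
bioformats2raw "plate&plates=1&plateRows=32&plateCols=48&fields=20&sizeX=256&sizeY=256.fake" plate.zarr
raw2ometiff plate.zarr plate.ome.tiff       # single-file output, as before
raw2ometiff --split plate.zarr plate_split  # one OME-TIFF per input series
```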
Capturing a few preliminary results from the testing of this functionality:
Discussing with @melissalinkert, a related issue is the fact that the current implementation writes the same metadata in the header of each TIFF file. For large HCS data, this can become a significant overhead both in terms of performance and in terms of disk space. As a potential compromise, a proposal would be to update the current implementation and store data generated with the
As noted in the comments, this is primarily to save disk space (and maybe a bit of write time) for large plates which might have thousands of files when converted with --split.
Restarted the conversion tests using the last commit against the set of samples. While the timings are largely identical for datasets with a few series, there seems to be some performance regression as the number of series increases.
Using 3 consecutive measurements, the NIRHTa+001 NGFF plate takes 86min, 118min and 85min to convert into a single OME-TIFF and 207min, 205min and 187min to convert into a multi-file OME-TIFF.
For the plate with 30K fields of view, the conversion to a single file OME-TIFF takes 1557min and the conversion to a multi-file OME-TIFF takes 2125min (but completes without exceptions).
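As a rough back-of-the-envelope reading of those numbers (illustrative arithmetic only, assuming roughly 30,720 output files for the 30K-field plate):

```
# Extra wall-clock time per output file implied by the numbers above
# (2125 min split vs 1557 min single-file, ~30720 output files).
echo "scale=2; (2125 - 1557) * 60 / 30720" | bc
# => ~1.1 extra seconds per output file
```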
I think 9cb9421 will help, but I'd still expect
With the last commit, I am still seeing a significant overhead due to the metadata writing. The full logs are at debug level, with the relevant part being the last lines:

% tail -n 5 *.log
==> NIRHTa+001_default.log <==
2023-03-24 21:13:54,675 [pool-2304-thread-6] DEBUG c.g.p.PyramidFromDirectoryWriter - writing 81797 compressed bytes to NIRHTa+001.ome.tiff at 2830557771
2023-03-24 21:13:54,675 [pool-2304-thread-6] INFO org.perf4j.TimingLogger - start[1679692434668] time[6] tag[writeTile]
2023-03-24 21:13:54,677 [pool-2304-thread-3] DEBUG c.g.p.PyramidFromDirectoryWriter - writing 325464 compressed bytes to NIRHTa+001.ome.tiff at 2830639568
2023-03-24 21:13:54,678 [pool-2304-thread-3] INFO org.perf4j.TimingLogger - start[1679692434666] time[11] tag[writeTile]
2023-03-24 22:26:42,624 [main] INFO org.perf4j.TimingLogger - start[1679692383152] time[4419472] tag[convertToPyramid]
==> NIRHTa+001_split.log <==
2023-03-24 17:24:27,294 [pool-2304-thread-3] DEBUG c.g.p.PyramidFromDirectoryWriter - writing 325464 compressed bytes to NIRHTa+001/NIRHTa+001_s2303.ome.tiff at 575996
2023-03-24 17:24:27,295 [pool-2304-thread-3] INFO org.perf4j.TimingLogger - start[1679678667283] time[11] tag[writeTile]
2023-03-24 17:24:27,295 [pool-2304-thread-2] DEBUG c.g.p.PyramidFromDirectoryWriter - writing 269584 compressed bytes to NIRHTa+001/NIRHTa+001_s2303.ome.tiff at 901460
2023-03-24 17:24:27,295 [pool-2304-thread-2] INFO org.perf4j.TimingLogger - start[1679678667280] time[15] tag[writeTile]
2023-03-24 20:51:02,428 [main] INFO org.perf4j.TimingLogger - start[1679678611862] time[12450565] tag[convertToPyramid]
==> large_plate_split.log <==
2023-03-26 05:04:50,153 [pool-30720-thread-2] DEBUG c.g.p.PyramidFromDirectoryWriter - writing 7010 compressed bytes to large_plate/large_plate_s30719.ome.tiff at 16
2023-03-26 05:04:50,153 [pool-30720-thread-2] INFO org.perf4j.TimingLogger - start[1679807090152] time[1] tag[writeTile]
2023-03-26 05:04:50,155 [pool-30720-thread-1] DEBUG c.g.p.PyramidFromDirectoryWriter - writing 47140 compressed bytes to large_plate/large_plate_s30719.ome.tiff at 7026
2023-03-26 05:04:50,155 [pool-30720-thread-1] INFO org.perf4j.TimingLogger - start[1679807090151] time[3] tag[writeTile]
2023-03-26 19:13:06,140 [main] INFO org.perf4j.TimingLogger - start[1679806913217] time[51072923] tag[convertToPyramid]

That's still an overhead of a few seconds per output file, which will lead to huge time increases for datasets in the 1K-100K files range. Would it be useful to regenerate these logs with additional debug statements e.g. in the new
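As an aside on reading these logs, one hedged way to summarise the perf4j output per tag from the shell (log file names as in the tail output above):

```
# Sketch: count perf4j entries per tag, then pull the final convertToPyramid
# line (the total conversion time) from one of the logs quoted above.
grep -o 'tag\[[A-Za-z]*\]' NIRHTa+001_split.log | sort | uniq -c
grep 'tag\[convertToPyramid\]' NIRHTa+001_split.log | tail -n 1
```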
Based on the logs, it looks like the main difference is in a617ff8 adds some extra stopwatches in and around
Finally coming back to this. I managed to run a performance assessment of this PR using the sample files mentioned in #80 (comment). The conversions were executed without this PR using 0.4.1, with this PR using the default (single-file mode), and with this PR using the split option. Each file was converted three times to account for file system caching, except for the large_plate which takes a very long time. Additionally, tests were performed on two VMs in OpenStack, one using SSD storage and the other non-SSD storage. With SSD storage, the metrics were as follows:
Without SSD storage, the metrics were as follows:
The main outcome of this testing is that for performant storage, this PR does not add any significant overhead in either single-file or multi-file mode. However, for less performant storage there is definitely a degradation of the conversion times, which likely scales with the number of files to be written and can increase the conversion time by as much as 100% in some cases. Next: I will run some tests on the converted files to check that the outcome is byte-identical. My initial thought is to make use of upstream's https://github.com/ome/bioformats/blob/v6.13.0/components/test-suite/src/loci/tests/testng/EquivalentPixelsTest.java. @melissalinkert how easily do you think this could be converted into a test utility, similarly to ome/bioformats#3876?
The main issue is that it depends on TestNG at the moment, so we'd either need some build changes or to split it into separate classes if we want both test and command-line functionality. https://github.com/melissalinkert/bioformats/commits/equivalent-pixels-tool is a place to start, but that's on top of the work already in ome/bioformats#3876, so I will hold off on a PR until that's merged.
Finally managed to do the pixel tests using the upstream test-equivalent target. The only exception was the large_plate sample, as the command failed with out-of-memory exceptions even after raising the memory considerably.
For all the other targets, the pixel tests completed successfully for all series/resolutions. I'll perform a final functional test by importing 2 samples into OMERO for visual inspection. Afterwards we should be able to get this into the upcoming raw2ometiff release.
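For completeness, a hedged sketch of that kind of functional check with the OMERO CLI; the server address, user and file name below are placeholders rather than the actual test setup:

```
# Sketch: import one converted sample into an OMERO server for visual
# inspection afterwards in OMERO.web. Host, user and path are placeholders.
omero login user@omero.example.org
omero import NIRHTa+001/NIRHTa+001_s0.ome.tiff
```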
Final call for comments, @DavidStirling / @chris-allan. Plan to merge this on Tuesday (May 30) unless there are any objections.
See #79.
raw2ometiff --split input.zarr output should write a set of OME-TIFF files named output_s0.ome.tiff, output_s1.ome.tiff, etc., with one file per input series.

showinf -noflat output_s0.ome.tiff should detect all of the output_s*.ome.tiff files together, and display the same dimensions as the original data that was converted by bioformats2raw.

raw2ometiff input.zarr output.ome.tiff should behave as before; in addition to correctness, it's worth checking for any noticeable performance differences. I didn't see a real difference when testing with some small fake plates and CMU-1.ndpi, but it's worth double-checking.

Given how much this modifies TIFF writing internals, this will need careful review and testing with both HCS and WSI data, and with and without the new --split option.