Add --split option #80
Conversation
Not sure why the build is failing here; all tests pass locally. Will investigate in the morning.
Force-pushed from 8ba63d8 to 53675f4.
Build fixed and debugging commits force-pushed away, so ready for review now.
I've done some testing with this, it's good stuff! In general it seems to work well and the individual series files are correctly sized. Non-split mode appears unchanged. After splitting, however, trying to open an individual series .tif in FIJI (and I presume any other bioformats-based viewer) 'sees' all of the series which were exported rather than just the opened file. The GUI seems to list them all as being within the dropped .tif, but the actual data is not there. This wouldn't be a major problem; however, if any of the other series files are missing (e.g. we just copied Series 2 to another location and tried to open it), the importer fails to open the file with an indexing error:
I anticipate that users will want to isolate and work on the individual series files, so perhaps it's better if they're fully decoupled from each other? In a perfect world, perhaps the reader would check for access to the sub-files before listing them as options for display. As another smaller point, could we have the name generator strip
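For reference, a hedged way to see this coupling from the command line is with Bio-Formats' showinf tool; the file name below is just an assumed example of a --split output, not a file from this testing:

```
# Sketch only: inspect a single split file with showinf. With the behaviour
# described above, the reported series count covers the whole original dataset
# rather than just this file, and opening fails if the sibling
# output_s*.ome.tiff files are not present alongside it.
showinf -nopix -noflat output_s2.ome.tiff
```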
Conflicts fixed - this is now ready for review again.
@sbesson / @melissalinkert: Let's make sure we select a good cross-section of data to run some representative conversions against, with and without this PR, to check that:
I think that covers all of the multi-Image cases. It might also be interesting to test an even larger plate - something like 1536 x 20 fields, or the largest we've encountered so far. I don't think we have a public example of that, but a fake plate might be OK in that case since the idea is to see what happens when tens of thousands of files are created.
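One hedged way to set up such a test, sketched here as an assumption rather than the procedure actually used below: encode the plate dimensions in a Bio-Formats "fake" file name and run both conversion modes over it. The file names, pixel sizes and --split output prefix are illustrative only.

```
# Sketch: synthesise a 1536-well (32 x 48), 20-field plate with the
# Bio-Formats "fake" format (parameters are encoded in the file name),
# then convert with and without --split. Generating ~30K series will
# itself take a long time.
touch "plate&plates=1&plateRows=32&plateCols=48&fields=20&sizeX=256&sizeY=256.fake"
bioformats2raw "plate&plates=1&plateRows=32&plateCols=48&fields=20&sizeX=256&sizeY=256.fake" plate.zarr
raw2ometiff plate.zarr plate.ome.tiff       # single-file output, as before
raw2ometiff --split plate.zarr plate_split  # one OME-TIFF per input series
```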
Capturing a few preliminary results from the testing of this functionality:
Discussing with @melissalinkert, a related issue is the fact that the current implementation writes the same metadata in the header of each TIFF file. For large HCS data, this can become a significant overhead both in terms of performance and in terms of disk space. As a potential compromise, a proposal would be to update the current implementation and store data generated with the
As noted in the comments, this is primarily to save disk space (and maybe a bit of write time) for large plates which might have thousands of files when converted with --split.
Restarted the conversion tests using the last commit against the set of samples. While the timings are largely identical for datasets with a few series, there seems to be some performance regression as the number of series increases.
Using 3 consecutive measurements, the NIRHTa+001 NGFF plate takes 86min, 118min and 85min to convert into a single OME-TIFF and 207min, 205min and 187min to convert into a multi-file OME-TIFF.
For the plate with 30K fields of view, the conversion to a single file OME-TIFF takes 1557min and the conversion to a multi-file OME-TIFF takes 2125min (but completes without exceptions).
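As a rough back-of-the-envelope reading of those numbers (illustrative arithmetic only, assuming roughly 30,720 output files for the 30K-field plate):

```
# Extra wall-clock time per output file implied by the numbers above
# (2125 min split vs 1557 min single-file, ~30720 output files).
echo "scale=2; (2125 - 1557) * 60 / 30720" | bc
# => ~1.1 extra seconds per output file
```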
I think 9cb9421 will help, but I'd still expect
With the last commit, I am still seeing a significant overhead due to the metadata writing. The full logs are at debug level, with the relevant part being the last lines:

% tail -n 5 *.log
==> NIRHTa+001_default.log <==
2023-03-24 21:13:54,675 [pool-2304-thread-6] DEBUG c.g.p.PyramidFromDirectoryWriter - writing 81797 compressed bytes to NIRHTa+001.ome.tiff at 2830557771
2023-03-24 21:13:54,675 [pool-2304-thread-6] INFO org.perf4j.TimingLogger - start[1679692434668] time[6] tag[writeTile]
2023-03-24 21:13:54,677 [pool-2304-thread-3] DEBUG c.g.p.PyramidFromDirectoryWriter - writing 325464 compressed bytes to NIRHTa+001.ome.tiff at 2830639568
2023-03-24 21:13:54,678 [pool-2304-thread-3] INFO org.perf4j.TimingLogger - start[1679692434666] time[11] tag[writeTile]
2023-03-24 22:26:42,624 [main] INFO org.perf4j.TimingLogger - start[1679692383152] time[4419472] tag[convertToPyramid]
==> NIRHTa+001_split.log <==
2023-03-24 17:24:27,294 [pool-2304-thread-3] DEBUG c.g.p.PyramidFromDirectoryWriter - writing 325464 compressed bytes to NIRHTa+001/NIRHTa+001_s2303.ome.tiff at 575996
2023-03-24 17:24:27,295 [pool-2304-thread-3] INFO org.perf4j.TimingLogger - start[1679678667283] time[11] tag[writeTile]
2023-03-24 17:24:27,295 [pool-2304-thread-2] DEBUG c.g.p.PyramidFromDirectoryWriter - writing 269584 compressed bytes to NIRHTa+001/NIRHTa+001_s2303.ome.tiff at 901460
2023-03-24 17:24:27,295 [pool-2304-thread-2] INFO org.perf4j.TimingLogger - start[1679678667280] time[15] tag[writeTile]
2023-03-24 20:51:02,428 [main] INFO org.perf4j.TimingLogger - start[1679678611862] time[12450565] tag[convertToPyramid]
==> large_plate_split.log <==
2023-03-26 05:04:50,153 [pool-30720-thread-2] DEBUG c.g.p.PyramidFromDirectoryWriter - writing 7010 compressed bytes to large_plate/large_plate_s30719.ome.tiff at 16
2023-03-26 05:04:50,153 [pool-30720-thread-2] INFO org.perf4j.TimingLogger - start[1679807090152] time[1] tag[writeTile]
2023-03-26 05:04:50,155 [pool-30720-thread-1] DEBUG c.g.p.PyramidFromDirectoryWriter - writing 47140 compressed bytes to large_plate/large_plate_s30719.ome.tiff at 7026
2023-03-26 05:04:50,155 [pool-30720-thread-1] INFO org.perf4j.TimingLogger - start[1679807090151] time[3] tag[writeTile]
2023-03-26 19:13:06,140 [main] INFO org.perf4j.TimingLogger - start[1679806913217] time[51072923] tag[convertToPyramid]

That's still an overhead of a few seconds per output file, which will lead to huge time increases for datasets in the 1K-100K files range. Would it be useful to regenerate these logs with additional debug statements e.g. in the new
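As an aside on reading these logs, one hedged way to summarise the perf4j output per tag from the shell (log file names as in the tail output above):

```
# Sketch: count perf4j entries per tag, then pull the final convertToPyramid
# line (the total conversion time) from one of the logs quoted above.
grep -o 'tag\[[A-Za-z]*\]' NIRHTa+001_split.log | sort | uniq -c
grep 'tag\[convertToPyramid\]' NIRHTa+001_split.log | tail -n 1
```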
Based on the logs, it looks like the main difference is in a617ff8 adds some extra stopwatches in and around
Finally coming back to this. I managed to run a performance assessment of this PR using the sample files mentioned in #80 (comment). The conversions were executed without this PR using 0.4.1, with this PR using the default (single-file mode), and with this PR using the split option. Each file was converted three times to account for file system caching, except for the large_plate which takes a very long time. Additionally, tests were performed on two VMs in OpenStack, one using SSD storage and the other non-SSD storage. With SSD storage, the metrics were as follows:
Without SSD storage, the metrics were as follows:
The main outcome of this testing is that for performant storage, this PR does not add any significant overhead in either single-file or multi-file mode. However, for less performant storage there is definitely a degradation of the conversion times, which likely scales with the number of files to be written and can increase the conversion time by as much as 100% in some cases. Next: I will run some tests on the converted files to check that the outcome is byte-identical. My initial thought is to make use of upstream's https://github.com/ome/bioformats/blob/v6.13.0/components/test-suite/src/loci/tests/testng/EquivalentPixelsTest.java. @melissalinkert how easily do you think this could be converted into a test utility, similarly to ome/bioformats#3876?
The main issue is that it depends on TestNG at the moment, so we'd either need some build changes or to split it into separate classes if we want both test and command-line functionality. https://github.com/melissalinkert/bioformats/commits/equivalent-pixels-tool is a place to start, but that's on top of the work already in ome/bioformats#3876, so I will hold off on a PR until that's merged.
Finally managed to do the pixel tests using the upstream test-equivalent target. The only exception was the large_plate sample, as the command failed with out-of-memory exceptions even after raising the memory considerably.
For all the other targets, the pixel tests completed successfully for all series/resolutions. I'll perform a final functional test by importing 2 samples into OMERO for visual inspection. Afterwards we should be able to get this into the upcoming raw2ometiff release.
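For completeness, a hedged sketch of that kind of functional check with the OMERO CLI; the server address, user and file name below are placeholders rather than the actual test setup:

```
# Sketch: import one converted sample into an OMERO server for visual
# inspection afterwards in OMERO.web. Host, user and path are placeholders.
omero login user@omero.example.org
omero import NIRHTa+001/NIRHTa+001_s0.ome.tiff
```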
Final call for comments, @DavidStirling / @chris-allan. Plan to merge this on Tuesday (May 30) unless there are any objections.
See #79.
raw2ometiff --split input.zarr output should write a set of OME-TIFF files named output_s0.ome.tiff, output_s1.ome.tiff, etc., with one file per input series.

showinf -noflat output_s0.ome.tiff should detect all of the output_s*.ome.tiff files together, and display the same dimensions as the original data that was converted by bioformats2raw.

raw2ometiff input.zarr output.ome.tiff should behave as before; in addition to correctness, it's worth checking for any noticeable performance differences. I didn't see a real difference when testing with some small fake plates and CMU-1.ndpi, but it's worth double-checking.

Given how much this modifies TIFF writing internals, this will need careful review and testing with both HCS and WSI data, and with and without the new --split option.