Allow processing multiple files sequentially in a single run, with file names read from stdin. #4200

tstoeter · 2024-06-20T12:43:10Z

This reduces JVM startup overhead when processing thousands of files. If -- is passed as file argument then file names are read line by line from stdin and processed sequentially. Usage information for reading file names from stdin are added with the second commit.

Code compiles with ant clean jars tools, compilation with Maven not tested.
All tests using ant test have passed.

…le names read from stdin. This reduces JVM startup overhead when processing thousands of files. If -- is passed as file argument then file names are read line by line from stdin and processed sequentially.

melissalinkert · 2024-06-20T14:51:42Z

Thanks, @tstoeter. Could you give a sense of the performance improvement you observe with and without this change?

I'm also curious what the use case is for running showinf on many files at once. I can understand where this feature would be useful for conversion (i.e. in bfconvert), but less clear on how it applies to just displaying metadata/images. Are you parsing the output of showinf in some way?

tstoeter · 2024-06-20T15:25:15Z

Quick tests showed a 5x speedup for processing and printing metadata of 200 DICOM files (without displaying the images).

Our use-case is reading out metadata of many files in order to build relationships between them, based on their metadata. So yes, the metadata output will then be process further. We do not display the images.

Certainly, the code can also be quickly adopted by bfconvert for speeding up batch processing converting images, too.

melissalinkert · 2024-06-20T18:29:12Z

I do see a noticeable difference in a simple test without this change:

$ bfconvert "test&sizeX=10&sizeY=10&sizeZ=1000.fake" test_z%z.tiff
$ time for i in `ls *.tiff`; do showinf -no-upgrade -nopix $i; done

and with this change (on the same data):

$ time ls *.tiff | showinf -no-upgrade -nopix --

I'd be fine with including this in CI builds, but defer to @ome/formats on timing.

Note that before merging this pull request, we will need to have a signed Contributor License Agreement. @tstoeter, please submit the CLA according to the instructions here: https://ome-contributing.readthedocs.io/en/latest/cla.html

Quick tests showed a 5x speedup for processing and printing metadata of 200 DICOM files (without displaying the images).

Our use-case is reading out metadata of many files in order to build relationships between them, based on their metadata. So yes, the metadata output will then be process further. We do not display the images.

Interesting. Outside the scope of this pull request, note that showinf may not print every DICOM tag in a file. All tags can be accessed via the initialized DICOM reader, though, which may be a more reliable way to compare metadata across files.

tstoeter · 2024-06-21T05:58:04Z

Thanks a lot @melissalinkert for the additional tests and pointing out the limitations of showinf regarding DICOM tags. I have signed and submitted the CLA and would be happy if my PR got merged.

joshmoore · 2024-06-21T06:55:30Z

All tags can be accessed via the initialized DICOM reader, though, which may be a more reliable way to compare metadata across files.

Thanks, @melissalinkert. Outside the scope of this PR, but I wonder if as a parallel to the ability to add metadata from #4016 there could be a flag to include the additional metadata in an OME-XML annotation.

sbesson

Assuming there is a use case, I also don't have strong objection to adding support for reading files sequentially.

Looking at equivalent functionality, several GNU utilities will have read from the standard input either if - is passed or if no file are given - see e.g. https://www.gnu.org/software/coreutils/manual/html_node/cat-invocation.html#cat-invocation or https://www.gnu.org/software/coreutils/manual/html_node/cat-invocation.html#cat-invocation

Any rationale for using -- rather than - here?

tstoeter · 2024-06-24T07:31:25Z

Thank you @sbesson. I intended to contrast the -- functionality from the other options, but using - instead works just as well.

Adhere to GNU utils convention of using `-` for reading from stdin. Also expand and adapt usage information.

sbesson

Apologies for the delays @tstoeter, this slept through the Summer cracks and was brought again to my attention.

Retested the latest commit with a version of the command-line tools built with this PR

sbesson@Sebastiens-MacBook-Pro-3 bioformats % cat files 
test&sizeX=2048&sizeY-2048.fake
test&sizeT=20.fake
test&sizeC=3.fake
sbesson@Sebastiens-MacBook-Pro-3 bioformats % cat files | ./tools/showinf -nopix - 
====% test&sizeX=2048&sizeY-2048.fake
Initializing reader
FakeReader initializing test&sizeX=2048&sizeY-2048.fake
ignoring token: sizeY-2048
Initialization took 0.043s

Reading core metadata
filename = test&sizeX=2048&sizeY-2048.fake
Used files = [/Users/sbesson/Documents/GitHub/bioformats/test&sizeX=2048&sizeY-2048.fake]
Series count = 1
Series #0 :
	Image count = 1
	RGB = false (1) 
	Interleaved = false
	Indexed = false (true color)
	Width = 2048
	Height = 512
	SizeZ = 1
	SizeT = 1
	SizeC = 1
	Tile size = 2048 x 512
	Thumbnail size = 128 x 32
	Endianness = intel (little)
	Dimension order = XYZCT (certain)
	Pixel type = uint8
	Valid bits per pixel = 8
	Metadata complete = true
	Thumbnail series = false
	-----
	Plane #0 <=> Z 0, C 0, T 0


Reading global metadata

Reading metadata
====% test&sizeT=20.fake
Initializing reader
FakeReader initializing test&sizeT=20.fake
Initialization took 0.002s

Reading core metadata
filename = test&sizeT=20.fake
Used files = [/Users/sbesson/Documents/GitHub/bioformats/test&sizeT=20.fake]
Series count = 1
Series #0 :
	Image count = 20
	RGB = false (1) 
	Interleaved = false
	Indexed = false (true color)
	Width = 512
	Height = 512
	SizeZ = 1
	SizeT = 20
	SizeC = 1
	Tile size = 512 x 512
	Thumbnail size = 128 x 128
	Endianness = intel (little)
	Dimension order = XYZCT (certain)
	Pixel type = uint8
	Valid bits per pixel = 8
	Metadata complete = true
	Thumbnail series = false
	-----
	Plane #0 <=> Z 0, C 0, T 0
	Plane #8 <=> Z 0, C 0, T 8
	Plane #9 <=> Z 0, C 0, T 9
	Plane #10 <=> Z 0, C 0, T 10
	Plane #11 <=> Z 0, C 0, T 11
	Plane #12 <=> Z 0, C 0, T 12
	Plane #19 <=> Z 0, C 0, T 19


Reading global metadata

Reading metadata
====% test&sizeC=3.fake
Initializing reader
FakeReader initializing test&sizeC=3.fake
Initialization took 0.002s

Reading core metadata
filename = test&sizeC=3.fake
Used files = [/Users/sbesson/Documents/GitHub/bioformats/test&sizeC=3.fake]
Series count = 1
Series #0 :
	Image count = 3
	RGB = false (1) 
	Interleaved = false
	Indexed = false (true color)
	Width = 512
	Height = 512
	SizeZ = 1
	SizeT = 1
	SizeC = 3
	Tile size = 512 x 512
	Thumbnail size = 128 x 128
	Endianness = intel (little)
	Dimension order = XYZCT (certain)
	Pixel type = uint8
	Valid bits per pixel = 8
	Metadata complete = true
	Thumbnail series = false
	-----
	Plane #0 <=> Z 0, C 0, T 0
	Plane #1 <=> Z 0, C 1, T 0
	Plane #2 <=> Z 0, C 2, T 0


Reading global metadata

Reading metadata

The format of the output is unchanged for single file input

sbesson@Sebastiens-MacBook-Pro-3 bioformats % ./tools/showinf -nopix test.fake
Initializing reader
FakeReader initializing test.fake
Initialization took 0.041s

Reading core metadata
filename = test.fake
Used files = [/Users/sbesson/Documents/GitHub/bioformats/test.fake]
Series count = 1
Series #0 :
	Image count = 1
	RGB = false (1) 
	Interleaved = false
	Indexed = false (true color)
	Width = 512
	Height = 512
	SizeZ = 1
	SizeT = 1
	SizeC = 1
	Tile size = 512 x 512
	Thumbnail size = 128 x 128
	Endianness = intel (little)
	Dimension order = XYZCT (certain)
	Pixel type = uint8
	Valid bits per pixel = 8
	Metadata complete = true
	Thumbnail series = false
	-----
	Plane #0 <=> Z 0, C 0, T 0


Reading global metadata

Reading metadata
sbesson@Sebastiens-MacBook-Pro-3 bioformats %

Finally tested the error scenario in the case where one of the input files fails to initialize. The utility aborts and returns with a non-zero return code which is certainly my expectation

sbesson@Sebastiens-MacBook-Pro-3 bioformats % cat files | ./tools/showinf -nopix -
====% test&sizeX=2048&sizeY-2048.fake
Initializing reader
FakeReader initializing test&sizeX=2048&sizeY-2048.fake
ignoring token: sizeY-2048
Initialization took 0.041s

Reading core metadata
filename = test&sizeX=2048&sizeY-2048.fake
Used files = [/Users/sbesson/Documents/GitHub/bioformats/test&sizeX=2048&sizeY-2048.fake]
Series count = 1
Series #0 :
	Image count = 1
	RGB = false (1) 
	Interleaved = false
	Indexed = false (true color)
	Width = 2048
	Height = 512
	SizeZ = 1
	SizeT = 1
	SizeC = 1
	Tile size = 2048 x 512
	Thumbnail size = 128 x 32
	Endianness = intel (little)
	Dimension order = XYZCT (certain)
	Pixel type = uint8
	Valid bits per pixel = 8
	Metadata complete = true
	Thumbnail series = false
	-----
	Plane #0 <=> Z 0, C 0, T 0


Reading global metadata

Reading metadata
====% test.ini
Initializing reader
Exception in thread "main" java.io.FileNotFoundException: test.ini (No such file or directory)
	at java.base/java.io.RandomAccessFile.open0(Native Method)
	at java.base/java.io.RandomAccessFile.open(RandomAccessFile.java:356)
	at java.base/java.io.RandomAccessFile.<init>(RandomAccessFile.java:273)
	at java.base/java.io.RandomAccessFile.<init>(RandomAccessFile.java:223)
	at loci.common.NIOFileHandle.<init>(NIOFileHandle.java:130)
	at loci.common.NIOFileHandle.<init>(NIOFileHandle.java:151)
	at loci.common.NIOFileHandle.<init>(NIOFileHandle.java:165)
	at loci.common.Location.getHandle(Location.java:522)
	at loci.common.Location.getHandle(Location.java:462)
	at loci.common.Location.getHandle(Location.java:443)
	at loci.common.Location.getHandle(Location.java:426)
	at loci.common.Location.checkValidId(Location.java:551)
	at loci.formats.ImageReader.getReader(ImageReader.java:183)
	at loci.formats.ImageReader.setId(ImageReader.java:859)
	at loci.formats.ReaderWrapper.setId(ReaderWrapper.java:692)
	at loci.formats.tools.ImageInfo.testRead(ImageInfo.java:1048)
	at loci.formats.tools.ImageInfo.main(ImageInfo.java:1145)

From my side, happy to get this into the roadmap for the upcoming Bio-Formats release (with the appropriate changes to the reference documentation).

melissalinkert · 2024-08-29T22:57:01Z

This looks fine to me as well. I've added the include label now, so this should go into tonight's merge builds. If builds are still green tomorrow, I'll approve.

melissalinkert

https://merge-ci.openmicroscopy.org/jenkins/job/BIOFORMATS-test-repo/167/ continues to pass, and https://merge-ci.openmicroscopy.org/jenkins/job/BIOFORMATS-push/180/console confirms this was included.

tstoeter · 2024-08-30T15:01:39Z

Great news, thank you all!

tstoeter added 2 commits June 20, 2024 11:25

Allow processing multiple files sequentially in a single run, with fi…

ec5f8c2

…le names read from stdin. This reduces JVM startup overhead when processing thousands of files. If -- is passed as file argument then file names are read line by line from stdin and processed sequentially.

Add usage information for reading file names from stdin.

41922a8

sbesson reviewed Jun 24, 2024

View reviewed changes

Use - instead of -- to read files from stdin.

a569f5a

Adhere to GNU utils convention of using `-` for reading from stdin. Also expand and adapt usage information.

sbesson self-requested a review August 26, 2024 13:19

sbesson approved these changes Aug 27, 2024

View reviewed changes

sbesson requested a review from melissalinkert August 27, 2024 09:58

sbesson added this to the 8.0.0 milestone Aug 27, 2024

melissalinkert added the include label Aug 29, 2024

melissalinkert approved these changes Aug 30, 2024

View reviewed changes

sbesson merged commit aaaf99d into ome:develop Sep 2, 2024
18 checks passed

tstoeter mentioned this pull request Sep 11, 2024

Add documentation for showinf batch processing via stdin. ome/bio-formats-documentation#392

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow processing multiple files sequentially in a single run, with file names read from stdin. #4200

Allow processing multiple files sequentially in a single run, with file names read from stdin. #4200

tstoeter commented Jun 20, 2024

melissalinkert commented Jun 20, 2024

tstoeter commented Jun 20, 2024

melissalinkert commented Jun 20, 2024

tstoeter commented Jun 21, 2024

joshmoore commented Jun 21, 2024

sbesson left a comment

tstoeter commented Jun 24, 2024

sbesson left a comment

melissalinkert commented Aug 29, 2024

melissalinkert left a comment

tstoeter commented Aug 30, 2024

Allow processing multiple files sequentially in a single run, with file names read from stdin. #4200

Allow processing multiple files sequentially in a single run, with file names read from stdin. #4200

Conversation

tstoeter commented Jun 20, 2024

melissalinkert commented Jun 20, 2024

tstoeter commented Jun 20, 2024

melissalinkert commented Jun 20, 2024

tstoeter commented Jun 21, 2024

joshmoore commented Jun 21, 2024

sbesson left a comment

Choose a reason for hiding this comment

tstoeter commented Jun 24, 2024

sbesson left a comment

Choose a reason for hiding this comment

melissalinkert commented Aug 29, 2024

melissalinkert left a comment

Choose a reason for hiding this comment

tstoeter commented Aug 30, 2024