Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow processing multiple files sequentially in a single run, with file names read from stdin. #4200

Merged
merged 3 commits into from
Sep 2, 2024

Conversation

tstoeter
Copy link
Contributor

This reduces JVM startup overhead when processing thousands of files. If -- is passed as file argument then file names are read line by line from stdin and processed sequentially. Usage information for reading file names from stdin are added with the second commit.

Code compiles with ant clean jars tools, compilation with Maven not tested.
All tests using ant test have passed.

tstoeter added 2 commits June 20, 2024 11:25
…le names read from stdin.

This reduces JVM startup overhead when processing thousands of files. If -- is passed as file argument
then file names are read line by line from stdin and processed sequentially.
@melissalinkert
Copy link
Member

Thanks, @tstoeter. Could you give a sense of the performance improvement you observe with and without this change?

I'm also curious what the use case is for running showinf on many files at once. I can understand where this feature would be useful for conversion (i.e. in bfconvert), but less clear on how it applies to just displaying metadata/images. Are you parsing the output of showinf in some way?

@tstoeter
Copy link
Contributor Author

Quick tests showed a 5x speedup for processing and printing metadata of 200 DICOM files (without displaying the images).

Our use-case is reading out metadata of many files in order to build relationships between them, based on their metadata. So yes, the metadata output will then be process further. We do not display the images.

Certainly, the code can also be quickly adopted by bfconvert for speeding up batch processing converting images, too.

@melissalinkert
Copy link
Member

I do see a noticeable difference in a simple test without this change:

$ bfconvert "test&sizeX=10&sizeY=10&sizeZ=1000.fake" test_z%z.tiff
$ time for i in `ls *.tiff`; do showinf -no-upgrade -nopix $i; done

and with this change (on the same data):

$ time ls *.tiff | showinf -no-upgrade -nopix --

I'd be fine with including this in CI builds, but defer to @ome/formats on timing.

Note that before merging this pull request, we will need to have a signed Contributor License Agreement. @tstoeter, please submit the CLA according to the instructions here: https://ome-contributing.readthedocs.io/en/latest/cla.html

Quick tests showed a 5x speedup for processing and printing metadata of 200 DICOM files (without displaying the images).

Our use-case is reading out metadata of many files in order to build relationships between them, based on their metadata. So yes, the metadata output will then be process further. We do not display the images.

Interesting. Outside the scope of this pull request, note that showinf may not print every DICOM tag in a file. All tags can be accessed via the initialized DICOM reader, though, which may be a more reliable way to compare metadata across files.

@tstoeter
Copy link
Contributor Author

Thanks a lot @melissalinkert for the additional tests and pointing out the limitations of showinf regarding DICOM tags. I have signed and submitted the CLA and would be happy if my PR got merged.

@joshmoore
Copy link
Member

All tags can be accessed via the initialized DICOM reader, though, which may be a more reliable way to compare metadata across files.

Thanks, @melissalinkert. Outside the scope of this PR, but I wonder if as a parallel to the ability to add metadata from #4016 there could be a flag to include the additional metadata in an OME-XML annotation.

Copy link
Member

@sbesson sbesson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming there is a use case, I also don't have strong objection to adding support for reading files sequentially.

Looking at equivalent functionality, several GNU utilities will have read from the standard input either if - is passed or if no file are given - see e.g. https://www.gnu.org/software/coreutils/manual/html_node/cat-invocation.html#cat-invocation or https://www.gnu.org/software/coreutils/manual/html_node/cat-invocation.html#cat-invocation

Any rationale for using -- rather than - here?

@tstoeter
Copy link
Contributor Author

Thank you @sbesson. I intended to contrast the -- functionality from the other options, but using - instead works just as well.

Adhere to GNU utils convention of using `-` for reading from stdin.
Also expand and adapt usage information.
@sbesson sbesson self-requested a review August 26, 2024 13:19
Copy link
Member

@sbesson sbesson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delays @tstoeter, this slept through the Summer cracks and was brought again to my attention.

Retested the latest commit with a version of the command-line tools built with this PR

sbesson@Sebastiens-MacBook-Pro-3 bioformats % cat files 
test&sizeX=2048&sizeY-2048.fake
test&sizeT=20.fake
test&sizeC=3.fake
sbesson@Sebastiens-MacBook-Pro-3 bioformats % cat files | ./tools/showinf -nopix - 
====% test&sizeX=2048&sizeY-2048.fake
Initializing reader
FakeReader initializing test&sizeX=2048&sizeY-2048.fake
ignoring token: sizeY-2048
Initialization took 0.043s

Reading core metadata
filename = test&sizeX=2048&sizeY-2048.fake
Used files = [/Users/sbesson/Documents/GitHub/bioformats/test&sizeX=2048&sizeY-2048.fake]
Series count = 1
Series #0 :
	Image count = 1
	RGB = false (1) 
	Interleaved = false
	Indexed = false (true color)
	Width = 2048
	Height = 512
	SizeZ = 1
	SizeT = 1
	SizeC = 1
	Tile size = 2048 x 512
	Thumbnail size = 128 x 32
	Endianness = intel (little)
	Dimension order = XYZCT (certain)
	Pixel type = uint8
	Valid bits per pixel = 8
	Metadata complete = true
	Thumbnail series = false
	-----
	Plane #0 <=> Z 0, C 0, T 0


Reading global metadata

Reading metadata
====% test&sizeT=20.fake
Initializing reader
FakeReader initializing test&sizeT=20.fake
Initialization took 0.002s

Reading core metadata
filename = test&sizeT=20.fake
Used files = [/Users/sbesson/Documents/GitHub/bioformats/test&sizeT=20.fake]
Series count = 1
Series #0 :
	Image count = 20
	RGB = false (1) 
	Interleaved = false
	Indexed = false (true color)
	Width = 512
	Height = 512
	SizeZ = 1
	SizeT = 20
	SizeC = 1
	Tile size = 512 x 512
	Thumbnail size = 128 x 128
	Endianness = intel (little)
	Dimension order = XYZCT (certain)
	Pixel type = uint8
	Valid bits per pixel = 8
	Metadata complete = true
	Thumbnail series = false
	-----
	Plane #0 <=> Z 0, C 0, T 0
	Plane #8 <=> Z 0, C 0, T 8
	Plane #9 <=> Z 0, C 0, T 9
	Plane #10 <=> Z 0, C 0, T 10
	Plane #11 <=> Z 0, C 0, T 11
	Plane #12 <=> Z 0, C 0, T 12
	Plane #19 <=> Z 0, C 0, T 19


Reading global metadata

Reading metadata
====% test&sizeC=3.fake
Initializing reader
FakeReader initializing test&sizeC=3.fake
Initialization took 0.002s

Reading core metadata
filename = test&sizeC=3.fake
Used files = [/Users/sbesson/Documents/GitHub/bioformats/test&sizeC=3.fake]
Series count = 1
Series #0 :
	Image count = 3
	RGB = false (1) 
	Interleaved = false
	Indexed = false (true color)
	Width = 512
	Height = 512
	SizeZ = 1
	SizeT = 1
	SizeC = 3
	Tile size = 512 x 512
	Thumbnail size = 128 x 128
	Endianness = intel (little)
	Dimension order = XYZCT (certain)
	Pixel type = uint8
	Valid bits per pixel = 8
	Metadata complete = true
	Thumbnail series = false
	-----
	Plane #0 <=> Z 0, C 0, T 0
	Plane #1 <=> Z 0, C 1, T 0
	Plane #2 <=> Z 0, C 2, T 0


Reading global metadata

Reading metadata

The format of the output is unchanged for single file input

sbesson@Sebastiens-MacBook-Pro-3 bioformats % ./tools/showinf -nopix test.fake
Initializing reader
FakeReader initializing test.fake
Initialization took 0.041s

Reading core metadata
filename = test.fake
Used files = [/Users/sbesson/Documents/GitHub/bioformats/test.fake]
Series count = 1
Series #0 :
	Image count = 1
	RGB = false (1) 
	Interleaved = false
	Indexed = false (true color)
	Width = 512
	Height = 512
	SizeZ = 1
	SizeT = 1
	SizeC = 1
	Tile size = 512 x 512
	Thumbnail size = 128 x 128
	Endianness = intel (little)
	Dimension order = XYZCT (certain)
	Pixel type = uint8
	Valid bits per pixel = 8
	Metadata complete = true
	Thumbnail series = false
	-----
	Plane #0 <=> Z 0, C 0, T 0


Reading global metadata

Reading metadata
sbesson@Sebastiens-MacBook-Pro-3 bioformats % 

Finally tested the error scenario in the case where one of the input files fails to initialize. The utility aborts and returns with a non-zero return code which is certainly my expectation

sbesson@Sebastiens-MacBook-Pro-3 bioformats % cat files | ./tools/showinf -nopix -
====% test&sizeX=2048&sizeY-2048.fake
Initializing reader
FakeReader initializing test&sizeX=2048&sizeY-2048.fake
ignoring token: sizeY-2048
Initialization took 0.041s

Reading core metadata
filename = test&sizeX=2048&sizeY-2048.fake
Used files = [/Users/sbesson/Documents/GitHub/bioformats/test&sizeX=2048&sizeY-2048.fake]
Series count = 1
Series #0 :
	Image count = 1
	RGB = false (1) 
	Interleaved = false
	Indexed = false (true color)
	Width = 2048
	Height = 512
	SizeZ = 1
	SizeT = 1
	SizeC = 1
	Tile size = 2048 x 512
	Thumbnail size = 128 x 32
	Endianness = intel (little)
	Dimension order = XYZCT (certain)
	Pixel type = uint8
	Valid bits per pixel = 8
	Metadata complete = true
	Thumbnail series = false
	-----
	Plane #0 <=> Z 0, C 0, T 0


Reading global metadata

Reading metadata
====% test.ini
Initializing reader
Exception in thread "main" java.io.FileNotFoundException: test.ini (No such file or directory)
	at java.base/java.io.RandomAccessFile.open0(Native Method)
	at java.base/java.io.RandomAccessFile.open(RandomAccessFile.java:356)
	at java.base/java.io.RandomAccessFile.<init>(RandomAccessFile.java:273)
	at java.base/java.io.RandomAccessFile.<init>(RandomAccessFile.java:223)
	at loci.common.NIOFileHandle.<init>(NIOFileHandle.java:130)
	at loci.common.NIOFileHandle.<init>(NIOFileHandle.java:151)
	at loci.common.NIOFileHandle.<init>(NIOFileHandle.java:165)
	at loci.common.Location.getHandle(Location.java:522)
	at loci.common.Location.getHandle(Location.java:462)
	at loci.common.Location.getHandle(Location.java:443)
	at loci.common.Location.getHandle(Location.java:426)
	at loci.common.Location.checkValidId(Location.java:551)
	at loci.formats.ImageReader.getReader(ImageReader.java:183)
	at loci.formats.ImageReader.setId(ImageReader.java:859)
	at loci.formats.ReaderWrapper.setId(ReaderWrapper.java:692)
	at loci.formats.tools.ImageInfo.testRead(ImageInfo.java:1048)
	at loci.formats.tools.ImageInfo.main(ImageInfo.java:1145)

From my side, happy to get this into the roadmap for the upcoming Bio-Formats release (with the appropriate changes to the reference documentation).

@sbesson sbesson requested a review from melissalinkert August 27, 2024 09:58
@sbesson sbesson added this to the 8.0.0 milestone Aug 27, 2024
@melissalinkert
Copy link
Member

This looks fine to me as well. I've added the include label now, so this should go into tonight's merge builds. If builds are still green tomorrow, I'll approve.

Copy link
Member

@melissalinkert melissalinkert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tstoeter
Copy link
Contributor Author

Great news, thank you all!

@sbesson sbesson merged commit aaaf99d into ome:develop Sep 2, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants