-
Notifications
You must be signed in to change notification settings - Fork 246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow processing multiple files sequentially in a single run, with file names read from stdin. #4200
Conversation
…le names read from stdin. This reduces JVM startup overhead when processing thousands of files. If -- is passed as file argument then file names are read line by line from stdin and processed sequentially.
Thanks, @tstoeter. Could you give a sense of the performance improvement you observe with and without this change? I'm also curious what the use case is for running |
Quick tests showed a 5x speedup for processing and printing metadata of 200 DICOM files (without displaying the images). Our use-case is reading out metadata of many files in order to build relationships between them, based on their metadata. So yes, the metadata output will then be process further. We do not display the images. Certainly, the code can also be quickly adopted by |
I do see a noticeable difference in a simple test without this change:
and with this change (on the same data):
I'd be fine with including this in CI builds, but defer to @ome/formats on timing. Note that before merging this pull request, we will need to have a signed Contributor License Agreement. @tstoeter, please submit the CLA according to the instructions here: https://ome-contributing.readthedocs.io/en/latest/cla.html
Interesting. Outside the scope of this pull request, note that |
Thanks a lot @melissalinkert for the additional tests and pointing out the limitations of |
Thanks, @melissalinkert. Outside the scope of this PR, but I wonder if as a parallel to the ability to add metadata from #4016 there could be a flag to include the additional metadata in an OME-XML annotation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming there is a use case, I also don't have strong objection to adding support for reading files sequentially.
Looking at equivalent functionality, several GNU utilities will have read from the standard input either if -
is passed or if no file are given - see e.g. https://www.gnu.org/software/coreutils/manual/html_node/cat-invocation.html#cat-invocation or https://www.gnu.org/software/coreutils/manual/html_node/cat-invocation.html#cat-invocation
Any rationale for using --
rather than -
here?
Thank you @sbesson. I intended to contrast the |
Adhere to GNU utils convention of using `-` for reading from stdin. Also expand and adapt usage information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the delays @tstoeter, this slept through the Summer cracks and was brought again to my attention.
Retested the latest commit with a version of the command-line tools built with this PR
sbesson@Sebastiens-MacBook-Pro-3 bioformats % cat files
test&sizeX=2048&sizeY-2048.fake
test&sizeT=20.fake
test&sizeC=3.fake
sbesson@Sebastiens-MacBook-Pro-3 bioformats % cat files | ./tools/showinf -nopix -
====% test&sizeX=2048&sizeY-2048.fake
Initializing reader
FakeReader initializing test&sizeX=2048&sizeY-2048.fake
ignoring token: sizeY-2048
Initialization took 0.043s
Reading core metadata
filename = test&sizeX=2048&sizeY-2048.fake
Used files = [/Users/sbesson/Documents/GitHub/bioformats/test&sizeX=2048&sizeY-2048.fake]
Series count = 1
Series #0 :
Image count = 1
RGB = false (1)
Interleaved = false
Indexed = false (true color)
Width = 2048
Height = 512
SizeZ = 1
SizeT = 1
SizeC = 1
Tile size = 2048 x 512
Thumbnail size = 128 x 32
Endianness = intel (little)
Dimension order = XYZCT (certain)
Pixel type = uint8
Valid bits per pixel = 8
Metadata complete = true
Thumbnail series = false
-----
Plane #0 <=> Z 0, C 0, T 0
Reading global metadata
Reading metadata
====% test&sizeT=20.fake
Initializing reader
FakeReader initializing test&sizeT=20.fake
Initialization took 0.002s
Reading core metadata
filename = test&sizeT=20.fake
Used files = [/Users/sbesson/Documents/GitHub/bioformats/test&sizeT=20.fake]
Series count = 1
Series #0 :
Image count = 20
RGB = false (1)
Interleaved = false
Indexed = false (true color)
Width = 512
Height = 512
SizeZ = 1
SizeT = 20
SizeC = 1
Tile size = 512 x 512
Thumbnail size = 128 x 128
Endianness = intel (little)
Dimension order = XYZCT (certain)
Pixel type = uint8
Valid bits per pixel = 8
Metadata complete = true
Thumbnail series = false
-----
Plane #0 <=> Z 0, C 0, T 0
Plane #8 <=> Z 0, C 0, T 8
Plane #9 <=> Z 0, C 0, T 9
Plane #10 <=> Z 0, C 0, T 10
Plane #11 <=> Z 0, C 0, T 11
Plane #12 <=> Z 0, C 0, T 12
Plane #19 <=> Z 0, C 0, T 19
Reading global metadata
Reading metadata
====% test&sizeC=3.fake
Initializing reader
FakeReader initializing test&sizeC=3.fake
Initialization took 0.002s
Reading core metadata
filename = test&sizeC=3.fake
Used files = [/Users/sbesson/Documents/GitHub/bioformats/test&sizeC=3.fake]
Series count = 1
Series #0 :
Image count = 3
RGB = false (1)
Interleaved = false
Indexed = false (true color)
Width = 512
Height = 512
SizeZ = 1
SizeT = 1
SizeC = 3
Tile size = 512 x 512
Thumbnail size = 128 x 128
Endianness = intel (little)
Dimension order = XYZCT (certain)
Pixel type = uint8
Valid bits per pixel = 8
Metadata complete = true
Thumbnail series = false
-----
Plane #0 <=> Z 0, C 0, T 0
Plane #1 <=> Z 0, C 1, T 0
Plane #2 <=> Z 0, C 2, T 0
Reading global metadata
Reading metadata
The format of the output is unchanged for single file input
sbesson@Sebastiens-MacBook-Pro-3 bioformats % ./tools/showinf -nopix test.fake
Initializing reader
FakeReader initializing test.fake
Initialization took 0.041s
Reading core metadata
filename = test.fake
Used files = [/Users/sbesson/Documents/GitHub/bioformats/test.fake]
Series count = 1
Series #0 :
Image count = 1
RGB = false (1)
Interleaved = false
Indexed = false (true color)
Width = 512
Height = 512
SizeZ = 1
SizeT = 1
SizeC = 1
Tile size = 512 x 512
Thumbnail size = 128 x 128
Endianness = intel (little)
Dimension order = XYZCT (certain)
Pixel type = uint8
Valid bits per pixel = 8
Metadata complete = true
Thumbnail series = false
-----
Plane #0 <=> Z 0, C 0, T 0
Reading global metadata
Reading metadata
sbesson@Sebastiens-MacBook-Pro-3 bioformats %
Finally tested the error scenario in the case where one of the input files fails to initialize. The utility aborts and returns with a non-zero return code which is certainly my expectation
sbesson@Sebastiens-MacBook-Pro-3 bioformats % cat files | ./tools/showinf -nopix -
====% test&sizeX=2048&sizeY-2048.fake
Initializing reader
FakeReader initializing test&sizeX=2048&sizeY-2048.fake
ignoring token: sizeY-2048
Initialization took 0.041s
Reading core metadata
filename = test&sizeX=2048&sizeY-2048.fake
Used files = [/Users/sbesson/Documents/GitHub/bioformats/test&sizeX=2048&sizeY-2048.fake]
Series count = 1
Series #0 :
Image count = 1
RGB = false (1)
Interleaved = false
Indexed = false (true color)
Width = 2048
Height = 512
SizeZ = 1
SizeT = 1
SizeC = 1
Tile size = 2048 x 512
Thumbnail size = 128 x 32
Endianness = intel (little)
Dimension order = XYZCT (certain)
Pixel type = uint8
Valid bits per pixel = 8
Metadata complete = true
Thumbnail series = false
-----
Plane #0 <=> Z 0, C 0, T 0
Reading global metadata
Reading metadata
====% test.ini
Initializing reader
Exception in thread "main" java.io.FileNotFoundException: test.ini (No such file or directory)
at java.base/java.io.RandomAccessFile.open0(Native Method)
at java.base/java.io.RandomAccessFile.open(RandomAccessFile.java:356)
at java.base/java.io.RandomAccessFile.<init>(RandomAccessFile.java:273)
at java.base/java.io.RandomAccessFile.<init>(RandomAccessFile.java:223)
at loci.common.NIOFileHandle.<init>(NIOFileHandle.java:130)
at loci.common.NIOFileHandle.<init>(NIOFileHandle.java:151)
at loci.common.NIOFileHandle.<init>(NIOFileHandle.java:165)
at loci.common.Location.getHandle(Location.java:522)
at loci.common.Location.getHandle(Location.java:462)
at loci.common.Location.getHandle(Location.java:443)
at loci.common.Location.getHandle(Location.java:426)
at loci.common.Location.checkValidId(Location.java:551)
at loci.formats.ImageReader.getReader(ImageReader.java:183)
at loci.formats.ImageReader.setId(ImageReader.java:859)
at loci.formats.ReaderWrapper.setId(ReaderWrapper.java:692)
at loci.formats.tools.ImageInfo.testRead(ImageInfo.java:1048)
at loci.formats.tools.ImageInfo.main(ImageInfo.java:1145)
From my side, happy to get this into the roadmap for the upcoming Bio-Formats release (with the appropriate changes to the reference documentation).
This looks fine to me as well. I've added the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://merge-ci.openmicroscopy.org/jenkins/job/BIOFORMATS-test-repo/167/ continues to pass, and https://merge-ci.openmicroscopy.org/jenkins/job/BIOFORMATS-push/180/console confirms this was included.
Great news, thank you all! |
This reduces JVM startup overhead when processing thousands of files. If -- is passed as file argument then file names are read line by line from stdin and processed sequentially. Usage information for reading file names from stdin are added with the second commit.
Code compiles with
ant clean jars tools
, compilation with Maven not tested.All tests using
ant test
have passed.