Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulk annotations and streaming #5

Merged
merged 8 commits into from
Sep 3, 2018

Conversation

kkoz
Copy link
Contributor

@kkoz kkoz commented Aug 20, 2018

Added changes to support bulk annotations for projects and datasets, and to change csv reading to process the data line by line instead of buffering the data in memory and writing to the table all at once.

@@ -24,12 +24,16 @@

"""

import csv
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import causes the Python code compliance step to fail in Travis

@sbesson
Copy link
Member

sbesson commented Aug 21, 2018

@kkoz thanks for your first contribution to this plugin. I commented on one import which causes the Travis build to fail. Once CI passes, we can look into testing these changes functionally. A couple of immediate questions:

  • I assume this change should be tested against examples of P/D as well as S/P/W for non-regression. We have certainly examples of bulk annotations for both types of hierarchies in IDR
  • briefly scanning through the changes, the API changes in populate_metadata probably mean this would be released as 0.3.0
  • it looks like info argument is now largely a no-op as write_to_omero is no longer conditional. From the perspective of the CLI plugin this is entirely fine as this argument was never exposed.

@kkoz
Copy link
Contributor Author

kkoz commented Aug 21, 2018

@sbesson Thanks for the notes. I will be making a commit shortly with the unnecessary import removed. For future reference, is there any way to check if my code will pass a travis-ci check without doing a commit? I've responded to your questions/comments below.

  • The code should be tested against P/D and S/P/W data.

  • That's fine - I will update CHANGES.rst

  • Removed info from command line args. Please let me know if there's anything else that needs to be done on this front.

@@ -29,7 +29,9 @@

from omero.model import PlateI, WellI, WellSampleI, OriginalFileI
from omero.rtypes import rint, rstring, unwrap
from omero.util.populate_metadata import ParsingContext
from omero.util.populate_metadata import (ParsingContext,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realizing that this still imports the legacy OMERO.py classes rather than the class in the package. Largely, something I missed in #1 as I did not activate these tests. Can you try to update this line to

from populate_metadata import (ParsingContext,
...
``

and similarly in `test/integration/metadata/test_populate.py` so that your new factory class is picked up?

NB: in a follow-up PR, we'll probably want to look into properly namespacing these packages.

@sbesson
Copy link
Member

sbesson commented Aug 22, 2018

@kkoz for pre-testing, assuming you have Docker installed, you should be able to test your changes locally by running the same calls as the ones executed in .travis.yml:

$ git clone git://github.com/openmicroscopy/omero-test-infra .omero
$ .omero/cli-docker

This is going to consume the omero-test-infra infrastructure to 1- create a Docker-based OMERO environment (DB, server, web), 2- install the plugin and run Python compliance checks, 3- execute the integration tests.

@kkoz
Copy link
Contributor Author

kkoz commented Aug 22, 2018

@sbesson Thanks for the info. I ran .omero/cli-docke but the output is mostly flake8 notes like ./omero-metadata/venv/lib/python2.7/distutils/__init__.py:100:1: E305 expected 2 blank lines after class or function definition, found 0
There are a lot of these: flake8.main.application MainProcess 130168 INFO Found a total of 108169 violations and reported 104755
Is there some way for me to disable the flake8 checks so I can see the output of the tests?
Also, which version of populate_metadata.py is being used in the container? How can I make sure it's the one I want to test? I'll perform another commit with the import changes in the meantime. Thanks again for your help.

@sbesson
Copy link
Member

sbesson commented Aug 22, 2018

@kkoz looks great now. Re flake8, you can disable it by updating .omero/py-check and commenting the flake8 line. Is the venv folder your local development virtual environment? If so, it should be possible to add a file excluding the check of known folders.

The version of populate_metadata.py tested should be the local version. The local directory is copied in the docker container via docker cp and installed there via python setup.py install before executing the tests.

class ParsingContext(object):
"""Generic parsing context for CSV files."""

def __init__(self, client, target_object, file=None, fileid=None,
def __init__(self, client, target_object,
parsing_util_factory, file=None, fileid=None,
Copy link
Member

@sbesson sbesson Aug 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is parsing_util_factory a new mandatory argument in ParsingContext as proposed here or an optional key/value pair as suggested in the test updates. Trying to consume this branch using the CLI wrapper against a project, I get the following error:

bash-4.2$ pip install git+git://github.com/kkoz/omero-metadata.git@bulk-annotations-and-streaming#egg=omero-metadata --user
Collecting omero-metadata from git+git://github.com/kkoz/omero-metadata.git@bulk-annotations-and-streaming#egg=omero-metadata
...
bash-4.2$  OMERO-server/OMERO.server/bin/omero metadata populate --report --batch=10000 --file IDR-import/idr0021-lawo-pericentriolarmaterial/experimentA/idr0021-experimentA-annotation.csv Project:1
Using session for root@localhost:4064. Idle timeout: 10 min. Current group: system
Traceback (most recent call last):
  File "OMERO-server/OMERO.server/bin/omero", line 125, in <module>
    rv = omero.cli.argv()
  File "/home/omero/workspace/OMERO-server/OMERO.server/lib/python/omero/cli.py", line 1620, in argv
    cli.invoke(args[1:])
  File "/home/omero/workspace/OMERO-server/OMERO.server/lib/python/omero/cli.py", line 1095, in invoke
    stop = self.onecmd(line, previous_args)
  File "/home/omero/workspace/OMERO-server/OMERO.server/lib/python/omero/cli.py", line 1172, in onecmd
    self.execute(line, previous_args)
  File "/home/omero/workspace/OMERO-server/OMERO.server/lib/python/omero/cli.py", line 1254, in execute
    args.func(args)
  File "/home/omero/.local/lib/python2.7/site-packages/omero_cli_metadata.py", line 482, in populate
    options=localcfg)
TypeError: __init__() takes at least 4 arguments (9 given)

NB: this also reveals the absence of integration test for the CLI metadata populate subcommand which would have picked up this error. This is independent of this PR but I will try and produce an extra commit to cover this use case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention was to have it be mandatory. This was probably a bit of premature refactoring on my part. I will see if I can change the code to restore the signature of __init__ in ParsingContext. My eventual goal was to refactor the code such that it allowed for unit testing. But that is outside the scope of this PR so I'll try to change it.

@sbesson sbesson added this to the 0.3.0 milestone Aug 24, 2018
@sbesson
Copy link
Member

sbesson commented Aug 24, 2018

A couple of notes from the discussion with @emilroz @kkoz and @joshmoore. Generally, in the absence of any other outstanding requirement, we are starting to define the roadmap for a 0.3.0 release of omero-metadata including:

  • the support of large CSV (multiple 10-100G)
  • better support/fixes of the behavior for various column types
  • additional API cleanup/refactoring including =the usage of the new factory class
  • more unit and integration tests (CLI metadata populate integration tests #8)

In terms of functional testing perspective, while the latest omero-metadata:0.2.2 can keep been used in production, different types of IDR studies e.g. idr0021 and idr0008 should be used to confirm the various population contexts (bulk annotation population, map annotation population, map annotation deletion) remain functional.

Testing the last commit, the default metadata populate command failed as the CLI wrapper is expecting ctx.write_to_omero which is removed by this PR:

bin/omero metadata populate --report --batch=1000 --file=../../IDR-import/screenA/idr0037-screenA-annotation.csv Screen:52
Using session for root@localhost:4064. Idle timeout: 10 min. Current group: system
DEBUG:omero.util.populate_metadata:Loading Screen:52
...
Traceback (most recent call last):
  File "bin/omero", line 125, in <module>
    rv = omero.cli.argv()
  File "/home/omero/workspace/OMERO-server/OMERO.server/lib/python/omero/cli.py", line 1620, in argv
    cli.invoke(args[1:])
  File "/home/omero/workspace/OMERO-server/OMERO.server/lib/python/omero/cli.py", line 1095, in invoke
    stop = self.onecmd(line, previous_args)
  File "/home/omero/workspace/OMERO-server/OMERO.server/lib/python/omero/cli.py", line 1172, in onecmd
    self.execute(line, previous_args)
  File "/home/omero/workspace/OMERO-server/OMERO.server/lib/python/omero/cli.py", line 1254, in execute
    args.func(args)
  File "/home/omero/.local/lib/python2.7/site-packages/omero_cli_metadata.py", line 492, in populate
    ctx.write_to_omero(batch_size=args.batch, loops=loops, ms=ms)

As discussed, no major objection to removing the write_to_omero API as long as the CLI options (--batch, --loops, --ms) remain functional. The simplest option might be to add these to the context constructor and handle them internally depending on the type of operation.

…() for all contexts. Updated omero_cli_metadata.py and integration tests to no longer call write_to_omero.
@kkoz
Copy link
Contributor Author

kkoz commented Aug 27, 2018

@sbesson I made changes to solve the problem, but I'm having trouble testing them. When I run:
bin/omero metadata populate --report --batch=1000 --file=/home/kevin/testdata/test.csv Plate:1
I get:
bin/omero: error: argument <subcommand>: invalid choice: 'metadata'
Also it might be good to have an IDR csv file to test with. How can I get one?

@sbesson
Copy link
Member

sbesson commented Aug 28, 2018

@kkoz about the invalid choice issue, immediate question: does pip freeze list you the omero-metadata module?
All of the IDR CSV files are located in https://github.com/IDR/idr-metadata and its submodules - we are moving towards 1 repository per IDR study. For testing, two "small" studies we have been using are:

For both of these, the raw data is also available if needed.

@sbesson sbesson mentioned this pull request Aug 28, 2018
@kkoz
Copy link
Contributor Author

kkoz commented Aug 29, 2018

@sbesson I successfully imported the idr0021datasets into omero. I added them to a project I created and ran ./bin/omero metadata populate --report --batch=1000 --file=/home/kevin/code/idr0021-lawo-pericentriolarmaterial/experimentA/idr0021-experimentA-annotation.csv Project:201 without any issues. The project now has a "bulk_annotations" attachment but none of the datasets or images have Tables with data associated with them. Did I do use the populate metadata command incorrectly, or is this unexpected behavior? I am using OMERO.web 5.4.6-ice36-b87.

@sbesson
Copy link
Member

sbesson commented Aug 30, 2018

@kkoz excellent. I'll have a go at retesting the annotation workflow (might be tomorrow).
I think the absence of Tables in P/D/I hierarchies in OMERO.web 5.4.6 is expected and this is what ome/openmicroscopy#5832 is trying to address. I will try to use a Web client deployed with these changes to confirm from the Web UI that the bulk annotation works as expected.

@sbesson
Copy link
Member

sbesson commented Aug 31, 2018

Tested these changes against the idr0021 study imported into a test server. After import, went through the three population commands:

$ bin/omero metadata populate Project:101 --file /home/omero/workspace/IDR-import/experimentA/idr0021-experimentA-annotation.csv --report
$ bin/omero metadata populate --context bulkmap --cfg /home/omero/workspace/IDR-import/experimentA/idr0021-experimentA-bulkmap-config.yml --report Project:101
$ bin/omero metadata populate --context deletemap --batch 100 --wait 120 --cfg /home/omero/workspace/IDR-import/experimentA/idr0021-experimentA-bulkmap-config.yml --report Project:101
$ bin/omero metadata populate --context bulkmap --cfg /home/omero/workspace/IDR-import/experimentA/idr0021-experimentA-bulkmap-config.yml --report Project:101

For this type of annotation i.e. P/D/I with Dataset Name and Image Name columns, everything worked as expected - see https://web-proxy.openmicroscopy.org/east-web/webclient/?show=project-101 for the output (including the webclient table support from ome/openmicroscopy#5832).

@sbesson
Copy link
Member

sbesson commented Aug 31, 2018

Executed a similar workflow to #5 (comment) against a screen (idr0008/screenB). Both table population and table to map annotation conversion worked without issue. I was not able to fully test the deletion operation due to time issue. However we know this is one for which can be improved.

I will perform a final round of code review on Monday.

Copy link
Member

@sbesson sbesson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few inline comments. Proposing to merge this as it is especially as further refactoring is planned down the line /cc @joshmoore

options=localcfg)
ctx.parse()
loops = 0
ms = 0
if not args.dry_run:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly the --dry-run argument is now largely a no-op, similarly to the -info command. Primary question is whether we might want to resurrect it at some point via some constructor option of equivalent.

WELL_NAME_COLUMN,
DATASET_NAME_COLUMN,
IMAGE_NAME_COLUMN,
'image']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update this line in my #10 once merged


original_file = table.getOriginalFile()
file_annotation = FileAnnotationI()
file_annotation.ns = rstring(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR, it should be possible to consume omero.constants.namespaces.NSBULKANNOTATIONS

@sbesson sbesson merged commit 782d9ed into ome:master Sep 3, 2018
manics added a commit to manics/scripts that referenced this pull request Jun 21, 2019
Dataset Project only supported for omero-metadata external plugin.
Deal with breaking change in omero-metadata: ome/omero-metadata#5
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants