Add threshold logic to adjust flags for certain input sizes. #489

Open
wants to merge 1 commit into master

Conversation

tneymanov
Collaborator

  • Extracts manually supplied flags from argv.
  • Uses the supplied args and input size estimates to decide whether each flag should be enabled.

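A minimal sketch of the flag-extraction step, assuming a plain '--flag' / '--flag=value' command line; the helper name below is illustrative, not necessarily what this PR uses:

def get_supplied_flag_names(argv):
  # Illustrative helper: collect the names of flags the user passed
  # explicitly, so the optimizer never overrides a manually supplied value.
  supplied = set()
  for arg in argv:
    if arg.startswith('--'):
      # Handles both '--flag=value' and '--flag value' styles.
      supplied.add(arg[2:].split('=')[0])
  return supplied

# Example:
#   get_supplied_flag_names(['--infer_headers', '--num_workers=30'])
#   returns {'infer_headers', 'num_workers'}

The second step, comparing the input size estimates against per-flag thresholds, is what the Threshold / soft_pass / hard_pass code in the diff below implements.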
@coveralls

Pull Request Test Coverage Report for Build 1736

  • 201 of 204 (98.53%) changed or added relevant lines in 5 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.3%) to 89.237%

Changes Missing Coverage                             Covered Lines   Changed/Added Lines   %
gcp_variant_transforms/libs/optimize_flags_test.py  134             135                   99.26%
gcp_variant_transforms/vcf_to_bq.py                  1               3                     33.33%

Files with Coverage Reduction                        New Missed Lines   %
gcp_variant_transforms/vcf_to_bq.py                  1                  32.05%

Totals
Change from base Build 1735: 0.3%
Covered Lines: 7885
Relevant Lines: 8836

💛 - Coveralls

@allieychen (Contributor) left a comment

Thanks Tural! I have added a few comments.

@@ -0,0 +1,157 @@
# Copyright 2019 Google Inc. All Rights Reserved.
Contributor

new copyright format?

# See the License for the specific language governing permissions and
# limitations under the License.

"""Util class used to optimize default values for flags, based on provided
Contributor

A module docstring starts with a one-line summary. http://go/pystyle#382-modules
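For example, a docstring in that shape might look like this (wording illustrative):

"""Utilities for optimizing default flag values based on input size.

A longer description can follow after a blank line, e.g. how the manually
supplied args and the estimated input dimensions are combined to decide
which flags to enable.
"""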


class Dimensions(object):
"""Contains dimensions of the input data and the manually supplied args."""
def __init__(self,
Contributor

nit: add one empty line.

class Dimensions(object):
"""Contains dimensions of the input data and the manually supplied args."""
def __init__(self,
line_count=None, # type: int
Contributor

Leave two spaces between the ',' and the '# type' comment.
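I.e., something like this (a sketch with only the spacing fixed; the parameter set is trimmed for illustration):

class Dimensions(object):
  """Contains dimensions of the input data and the manually supplied args."""

  def __init__(self,
               line_count=None,  # type: int
               file_count=None  # type: int
              ):
    self.line_count = line_count
    self.file_count = file_count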

class Threshold(Dimensions):
"""Describes the limits the input needs to pass to enable a certain flag.

Unlike Dimensions object, should not have supplied_args set and not all
Contributor

Remove the extra spaces at the beginning of this line and the following one. https://engdoc.corp.google.com/eng/doc/devguide/py/style/index.md?cl=head#38-comments-and-docstrings

known_args.optimize_for_large_inputs = True
if INFER_HEADERS_TS.soft_pass(input_dimensions, operator.le):
known_args.infer_headers = True
if NUM_BIGQUERY_WRITE_SHARDS_TS.soft_pass(input_dimensions):
Contributor

FYI, to help decide the threshold: the 1000 Genomes dataset in the integration tests needs this flag set, while the latest release doesn't. You can test on these datasets.
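For instance, the threshold constant could be pinned between those two datasets (the number below is a pure placeholder to be calibrated, and the exact Threshold constructor arguments may differ from this sketch):

# Placeholder calibration: pick a limit above the estimated size of the
# latest-release test data (which should NOT enable the flag) and at or
# below the 1000 Genomes test data (which SHOULD enable it).
NUM_BIGQUERY_WRITE_SHARDS_TS = Threshold(
    'num_bigquery_write_shards',
    line_count=1000000000)  # hypothetical limit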

if INFER_ANNOTATION_TYPES_TS.soft_pass(input_dimensions, operator.le):
known_args.infer_annotation_types = True
if SHARD_VARIANTS_TS.soft_pass(input_dimensions, operator.le):
known_args.shard_variants = False
Contributor

I think shard_variants is more closely related to the variant count in each file: if we have many small files (which together may still contain a lot of records), we don't need to do sharding in that case.
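If so, the check could be driven by an estimated per-file record count rather than the totals; a hypothetical sketch (the constant is a placeholder that would need calibration):

LINES_PER_FILE_SHARDING_LIMIT = 1000000  # placeholder, needs calibration


def _should_shard_variants(input_dimensions):
  # Hypothetical helper: shard only when individual files are large. With
  # many small files the records are already spread across files, so
  # sharding each file's variants adds little benefit.
  if not input_dimensions.file_count:
    return False
  lines_per_file = (
      float(input_dimensions.line_count) / input_dimensions.file_count)
  return lines_per_file > LINES_PER_FILE_SHARDING_LIMIT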

known_args.shard_variants = False

def _optimize_pipeline_args(pipeline_args, known_args, input_dimensions):
if NUM_WORKERS_TS.hard_pass(input_dimensions):
Contributor

In most cases, I think setting the number of workers doesn't mean much: no matter what you set, Dataflow will largely ignore the number and autoscale. But it does matter in the annotation pipeline, where it determines how many VMs we bring up to run VEP.

Another flag to consider is worker-machine-type, since the job may fail or run too slowly due to memory issues. I would recommend trying to optimize the machine type using input_dimensions as well.
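A rough sketch of the machine-type side, assuming it sits next to _optimize_pipeline_args; the --worker_machine_type flag is Dataflow's, but the helper name, machine type, and limit here are placeholders:

LARGE_INPUT_LINE_COUNT = 500 * 10 ** 6  # placeholder, needs calibration


def _maybe_set_machine_type(pipeline_args, supplied_args, input_dimensions):
  # Hypothetical helper: only override the machine type if the user did not
  # supply one, and the input looks large enough that the default machine
  # may hit memory limits.
  if 'worker_machine_type' in supplied_args:
    return
  if input_dimensions.line_count > LARGE_INPUT_LINE_COUNT:
    pipeline_args.append('--worker_machine_type=n1-highmem-8')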

file_count)
self.flag_name = flag_name

def not_supplied(self, state):
Contributor

I think replacing state with input_dimensions (the name you already use in some other places) would be easier to follow.

class Threshold(Dimensions):
"""Describes the limits the input needs to pass to enable a certain flag.

Unlike Dimensions object, should not have supplied_args set and not all
Contributor

The overriding seems weird, especially since a Threshold should not have supplied_args. How about having Dimensions describe only the dimensions of the input data (by removing supplied_args)? That way you can also remove the assignments of these values to known_args (https://github.com/googlegenomics/gcp-variant-transforms/blob/master/gcp_variant_transforms/vcf_to_bq.py#L271).

Moreover, since this class describes the limit the input dimensions need to pass to enable a flag, I think we can move the check of whether the corresponding flag was supplied to the outer layer rather than doing it inside this class.
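Concretely, the suggested split might look roughly like this (a sketch only; the hard_pass name follows the diff, but the exact signatures and the _maybe_enable helper are assumptions):

import operator


class Dimensions(object):
  """Describes only the dimensions of the input data."""

  def __init__(self, line_count=None, file_count=None):
    self.line_count = line_count
    self.file_count = file_count


class Threshold(Dimensions):
  """Limits the input dimensions must pass for a flag to be enabled."""

  def __init__(self, flag_name, line_count=None, file_count=None):
    super(Threshold, self).__init__(line_count, file_count)
    self.flag_name = flag_name

  def hard_pass(self, dims, cond=operator.ge):
    # Every limit that is set must be satisfied by the input dimensions.
    return all(
        cond(getattr(dims, name), limit)
        for name, limit in [('line_count', self.line_count),
                            ('file_count', self.file_count)]
        if limit is not None)


def _maybe_enable(threshold, supplied_args, input_dimensions, known_args):
  # The "was this flag supplied manually?" check lives in the caller,
  # not inside Threshold.
  if threshold.flag_name in supplied_args:
    return
  if threshold.hard_pass(input_dimensions):
    setattr(known_args, threshold.flag_name, True)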

Labels: None yet
Projects: None yet
3 participants