
Create a QC namespace #233

Open
illusional opened this issue May 15, 2024 · 5 comments

illusional commented May 15, 2024

Some context: https://docs.google.com/document/d/1hO4-VAKjul25_lfYrvELocgfag3TIwBzICdHaHVnATk/edit#heading=h.1atj7ihnv188

Relevant user stories:

  • As a user, I want to explore a hypothesis during early QC.
    • This involves heavy computation, usually on the full set of sequencing groups (albeit, only a section of data).
    • As there's a lot of exploration, users should be able to iteratively run analyses WITHOUT review.
    • This comes with the caveat that we DON'T want users to publish this analysis. Basically a restricted test environment.

Its interaction with metamist is UNDEFINED here, and will be resolved properly later.

This will involve creating:

  • QC service accounts for each of Hail, Dataproc, and Cromwell
  • A qc group of users, specified in users.yaml
  • A set of qc buckets (qc, qc-analysis, qc-tmp) (no web bucket)
    • A $dataset-qc group (of persons) that has read access to the qc-analysis, and list access to qc / qc-tmp
    • main-full can APPEND data, but NOT read (to discourage copying results back). (Not needed if QC can read from main)
  • QC service accounts cannot access main level data (based on Hope's feedback below)
    • QC service accounts can READ the main-bucket
  • I don't think we should support the depends_on flag here, so QC groups should NOT allow access to transitive datasets.
    • the qc service accounts should be able to access the common-main bucket (for reference data)
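The permissions proposed above could be sketched as a simple access matrix. This is purely illustrative (the group and bucket names mirror the bullet points; none of this is real CPG infrastructure code), but it makes the intended asymmetries explicit, e.g. main-full can append to qc but not read it:

```python
# Hypothetical sketch of the proposed QC bucket permissions.
# All group/bucket names are placeholders taken from the bullet points
# above; real enforcement would live in the infrastructure config.

ACCESS_MATRIX = {
    # (group, bucket): set of allowed operations
    ("dataset-qc", "qc"): {"list"},
    ("dataset-qc", "qc-tmp"): {"list"},
    ("dataset-qc", "qc-analysis"): {"read", "list"},
    ("qc-service-account", "main"): {"read"},         # read main, no write
    ("qc-service-account", "common-main"): {"read"},  # reference data
    ("main-full", "qc"): {"append"},                  # append, NOT read
}


def is_allowed(group: str, bucket: str, operation: str) -> bool:
    """Return True if the group may perform the operation on the bucket."""
    return operation in ACCESS_MATRIX.get((group, bucket), set())
```

Anything not listed defaults to denied, which matches the "no transitive dataset access" point above.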

Random notes:

  • Make sure it's included in the storage.toml
  • Produce some docs that highlight the importance of being cautious of cost with full compute access.
  • Maybe some mechanism to make this a time-dependent authorization, so that the qc namespace doesn't become the default place to test our analyses.

@violetbrina, it's worth thinking about what other implications creating a new namespace has. analysis-runner, billing, etc.

@hopedisastro

hopedisastro commented May 15, 2024

Sometimes iterative and early QC requires the full dataset; some examples:

  • Uncovering a batch effect (between two datasets) when plotting Somalier sex ploidy plots based on read coverage of the CRAMs. This pattern is hard to tease out with just five samples in test. In many cases analysts may not have a prior hypothesis of what is driving the effect, so they need to iteratively plot different variables to see what is associated with it. That is not practical with the current set-up in main: it would dramatically increase the PR load, and add a time lag between waiting for a PR to be approved for a very minimal change and executing the analysis to see the result.
  • Plotting QC metrics across the entire cohort ("exploratory QC"), e.g. plotting the MAF across all loci in a cohort to see what an appropriate threshold for the MAF hard QC filter would be. Sometimes the plot is skewed by outliers, so in the current set-up even a straightforward plot would require at least two PRs: 1) plot the metric with all the data, and 2) change the axis limits to remove the outliers; not to mention additional PRs if an interesting subset of the data is identified from the first two plots and requires further investigation/subsetting.

@hopedisastro

hopedisastro commented May 15, 2024

Would also welcome thoughts on whether storage of data in the qc bucket should be permanent or time-limited.

If we enforce the policy that all 'final' analysis should still be done in main, then I think storage of files in the qc bucket should be time-limited (though perhaps a bit longer than tmp), because the storage requested for this bucket could balloon pretty quickly, given that in theory one could run practically any analysis on the full dataset without a PR.

Access to the qc bucket should also be time-limited, e.g. I would only need access during the QC stage of my analysis. I could estimate how long this stage would take (say 2 months), get approval from my manager, and then apply for access to the bucket for that time period. After the period elapses, my access privileges would be automatically revoked and I would need to re-apply. <- not sure if this process can be automated?
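The expiring-access idea could in principle be automated; a minimal sketch of what a grant record might look like (all names hypothetical, and real enforcement would sit in the IAM layer, e.g. conditional role bindings, not in Python):

```python
from datetime import date, timedelta

# Hypothetical sketch of a time-limited QC access grant.
# Real revocation would be enforced by the cloud IAM layer; this only
# models the approval window described above.


class QcAccessGrant:
    def __init__(self, user: str, approved_by: str, start: date, months: int):
        self.user = user
        self.approved_by = approved_by
        self.start = start
        # Approximate a month as a 30-day block for this sketch.
        self.expiry = start + timedelta(days=30 * months)

    def is_active(self, today: date) -> bool:
        """Access is valid only inside the approved window; afterwards
        the user must re-apply."""
        return self.start <= today < self.expiry


grant = QcAccessGrant(
    "analyst@example.org", "manager@example.org", date(2024, 5, 15), months=2
)
```

A scheduled job could sweep grants daily and revoke any where `is_active` is false, which would make the re-application step automatic rather than honour-system.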

Users requesting access to the qc bucket should also make the case for why their specific use case needs the whole dataset, instead of going through the conventional test-then-main pathway. Otherwise, this could be a slippery slope where the default is to run everything through qc first and then go to main (the incentive being to get 'main' results faster through the qc bucket pathway, without needing a PR).

@hopedisastro

hopedisastro commented May 15, 2024

Given the increased security risk, we could also have a mandatory induction/tutorial for access to this qc bucket, covering:

  • Its intended purpose
  • The recommended workflow (e.g. I think users should still verify that their script works in test before running anything in qc, to reduce resource wastage)
  • Acceptable uses
  • Unacceptable uses (e.g., it goes without saying: downloading files to a local machine)

But the 2nd and 3rd points can be screened through the process in my previous post above, if we require users to specify their use case before granting them access.

@hopedisastro

It sounds logistically like a lot more work to implement, but I think it's worth it.

There are always going to be interesting outliers or unexpected subsets of the data in any cohort we ingest.

The current set-up creates a lot of friction when investigating these types of results. In practical terms we still could, but the PR system (and the time lag between submitting a PR and getting it approved, not to mention the cost of context switching when one briefly has to work on a different analysis while waiting for approval and then switching back) can disincentivize analysts from exploring subsets of the data.

But this also needs to be balanced against data security, efficient cloud storage use, and alignment with existing CPG principles for main and test. One thought that just came to mind is a price cap, e.g. if an exploratory QC job is going to cost >$50 (or some other threshold), we could still require the script to be reviewed by someone before being run (this policy could be outlined in the mandatory induction/tutorial).
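The price-cap idea amounts to a simple gate before job submission; a minimal sketch, where the threshold and the notion of a pre-computed cost estimate are both assumptions, not existing analysis-runner behaviour:

```python
# Hypothetical sketch of the proposed price cap: exploratory QC jobs
# below the threshold run unreviewed, anything at or above it needs
# sign-off first. The $50 figure comes from the discussion above.

REVIEW_THRESHOLD_USD = 50.0


def requires_review(estimated_cost_usd: float) -> bool:
    """Return True if a job's estimated cost means it must be
    reviewed before running in the qc namespace."""
    return estimated_cost_usd >= REVIEW_THRESHOLD_USD
```

The hard part in practice would be producing the cost estimate itself (batch pricing, instance hours, storage egress) rather than the gate, which is trivial once an estimate exists.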

@cassimons
Contributor

A few thoughts/questions that might be useful to consider when designing a solution. Some of these may not be "in scope" for this specific user story.

  • Is "QC" the right name for this namespace? Hope's user story does describe QC activities, but these are only one type of QC. If we label a bucket qc, it would imply it is the space for general/all QC. Do we want that?
  • Cost control. Do we want/need to consider how cost control will be implemented in this scenario? The answer might be as simple as "default budgets will be set low to mitigate cost blow-out risks", or there might be better controls available. Either way, it would be great to have this discussed as part of the solution.
  • Is there a strong reason to exclude a web bucket? QC outputs are often well suited to it, and without one, code designed to run in main may not work out of the box in qc.
  • When spitballing this general need, the idea of locking this activity to specified "blessed" branches came up as a possible control. Is that something we still want to consider?
