Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: add architecture documentation #3184

Draft
wants to merge 20 commits into
base: main
Choose a base branch
from
Draft

docs: add architecture documentation #3184

wants to merge 20 commits into from

Conversation

fengelniederhammer
Copy link
Contributor

Summary

This adds architecture documentation following the https://docs.arc42.org/ template.

Screenshot

PR Checklist

  • All necessary documentation has been adapted.
    - [ ] The implemented feature is covered by an appropriate test.

@fhennig
Copy link
Contributor

fhennig commented Nov 6, 2024

Starting point: https://github.com/loculus-project/loculus/blob/arc42/architecture_docs/README.md

I just read through it all, looks great already! I really like the 4 different view (block view, runtime, deployment, ..).

As someone more recently joining, I'm quite curious about things that probably go into section 1:

  • What is the intended user/submitter audience?
    • small lab groups with limited deployement skills?
  • What are some use cases that the software should support?
    • i.e. a pathoplexus-like use-case, using it just for analysis of ENA data, using it just in your lab etc.
    • I think these different use cases also influence the architecture quite a bit, with the quite modular setup we have (i.e. you don't need the ENA submission bit, or you don't need auth, or you can even use LAPIS on its own)
  • What is maybe not in the feature scope?
  • solution strategy: Could that be 'build a web app'? I'm not sure if that is something that should go in there?

Maybe also worth adding somewhere: LAPIS is not just a "headless loculus", right? maybe that's just a "historic" thing, but I think LAPIS is pretty independent from loculus in some ways, might be interesting to have that in there somewhere too as a context or driving factor.

Just some random assorted thoughts! Not sure if any of this is really in the scope of the architecture docs.

But I really like the structure and what's there so far!


Some parts of the configuration are redundant and could be simplified.
Also, the Helm chart contains a lot of default values
that are not suitable for general Loculus instances and will result in unexpected behavior if not overwritten.
Copy link
Contributor

@anna-parker anna-parker Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ena deposition was written as an optional component, however it still needs to submit all data and keep submission state. Therefore we duplicate all records and keep them in the backend db schema and the ena deposition schema - this creates unnecessary database bloat.

Although the two schemas are in the same db they behave as separate dbs with only the backend pod directly querying the public db schema and the ena-deposition and ingest (see below) pod querying the ena-deposition schema.

Potentially the ingest and ena-submission pod should be merged together as they both interact with INSDC and are optional. Additionally, ingest queries the ena deposition schema directly at the moment to ensure it does not reingest sequences that we submitted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added something 👍

Comment on lines +8 to +11
We provide instructions how to install Loculus, and we host instances ourselves,
but other maintainers can host their own instances as well.
We offer guidance and documentation and are open for feature requests, but we do not provide direct support for custom instances.
We cannot forsee all possible configurations and environments that Loculus might be deployed in.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does someone have a suggestion how to define the scope better? What's in the scope, what's not in the scope?

Comment on lines +18 to +21
We decided to:
* Use NCBI to download data, because it's [datasets cli](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/)
is most convenient to use.
* Use ENA to upload data, because TODO????????
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anna-parker @corneliusroemer Do you know why this was decided?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because they have a submission API that is publicly documented and no other INSDC database has this.

where it duplicates the data from the backend to keep track of which sequences have already been uploaded.
2. The ingest service accesses the same schema as the deposition to check which ingested sequences have been uploaded by Loculus.

A solution to the first problem would be to adapt the backend such that it can track which sequences have been uploaded to ENA.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was decided against to keep ena deposition optional but yes it might be a better solution

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants