loculus-project · fengelniederhammer · Nov 4, 2024 · Nov 5, 2024 · Nov 5, 2024 · Nov 5, 2024
diff --git a/README.md b/README.md
@@ -39,6 +39,10 @@ Note that if you are developing the backend or frontend/website in isolation a f
 
 ## Architecture
 
+[architecture_docs/](./architecture_docs) contains the architecture documentation of Loculus.
+
+TLDR:
+
 - Backend code is in `backend`, see [`backend/README.md`](/backend/README.md)
 - Frontend code is in `website`, see [`website/README.md`](/website/README.md)
 - Sequence and metadata processing pipeline is in [`preprocessing`](/preprocessing) folder, see [`preprocessing/specification.md`](/preprocessing/specification.md)

diff --git a/architecture_docs/01_introduction_and_goals.md b/architecture_docs/01_introduction_and_goals.md
@@ -0,0 +1,9 @@
+# Introduction And Goals
+
+Also see the top level [README.md](../README.md) for a high-level overview of the project.
+
+Loculus is a software package to power microbial genomic databases.
+
+This is an overview of important use cases:
+
+![Use Cases](plantuml/01_use_cases.svg)
diff --git a/architecture_docs/02_architecture_constraints.md b/architecture_docs/02_architecture_constraints.md
@@ -0,0 +1,18 @@
+# Architecture Constraints
+
+Loculus is developed under the following constraints:
+
+### Open Source Software
+
+We decided to develop Loculus under an open source license.
+The code is publicly available.
+
+Some reasons why we chose to develop Loculus as open source software:
+* to increase transparency of the project,
+* to allow others to contribute,
+* others are supposed to use the software - they should be able to see how it works.
+
+### Configurability
+
+Loculus is designed to be highly configurable.
+It should be usable for different organisms and different use cases.
diff --git a/architecture_docs/03_context_and_scope.md b/architecture_docs/03_context_and_scope.md
@@ -0,0 +1,6 @@
+# Context and Scope
+
+This section puts Loculus into context with the outside world and defines the scope of the project.
+All external participants are listed in the diagram below:
+
+![Context View](plantuml/03_context_view.svg)
diff --git a/architecture_docs/04_solution_strategy.md b/architecture_docs/04_solution_strategy.md
@@ -0,0 +1,11 @@
+# Solution Strategy
+
+This section describes important decisions that were made to solve the problem:
+* Loculus uses [LAPIS](https://github.com/GenSpectrum/LAPIS) and [SILO](https://github.com/GenSpectrum/LAPIS-SILO) to provide fast access to the sequence data.
+* Loculus implements a central HTTP API to store and retrieve data.
+  This API encapsulates the data storage in a Postgres database.
+  All other services interact with this API.
+  The API is mostly agnostic to organism-specific logic.
+* A preprocessing pipeline handles the organism-specifics, such as alignment and translation.
+  We provide a Nextclade-based pipeline, but maintainers can plug in their own pipeline.
+
diff --git a/architecture_docs/05_building_block_view.md b/architecture_docs/05_building_block_view.md
@@ -0,0 +1,62 @@
+# Building Block View
+
+In the following diagrams, the arrows point from the actor to the system component that is used by the actor.
+Data flow may be in the opposite direction
+(e.g. in the case of a download: the actor requests a download from the website, the website sends the data to the
+actor).
+
+## Overview
+
+This diagram provides a high level overview of the components of Loculus
+and how they interact with each other and external participants.
+
+![Building Block View](plantuml/05_level_1.svg)
+
+* Users can either
+    * use the website to browse the data and download sequences
+    * or they can use LAPIS directly to query the data (e.g. for automated analysis).
+* Submitters can 
+  * log in via Keycloak 
+  * submit new sequence data via the website
+  * or they can use the API directly to automate their submission process.
+* The backend infrastructure stores and processes the data.
+* LAPIS / SILO provides the query engine for the sequence data that is stored in the backend infrastructure.
+* The backend infrastructure also fetches sequence data from / uploads sequence data to INSDC services.
+* The website and the backend infrastructure use Keycloak to verify the identity of users.
+
+## LAPIS / SILO
+
+This diagram shows how Loculus utilizes 
+[LAPIS](https://github.com/GenSpectrum/LAPIS) and
+[SILO](https://github.com/GenSpectrum/LAPIS-SILO).
+
+![LAPIS / SILO](plantuml/05_level_2_lapis.svg)
+
+* LAPIS provides an HTTP API to query the sequence data.
+  * LAPIS is used by the website, but it can also be used by users directly. 
+* The SILO API is a query engine that stores the data in memory to provide fast access.
+  LAPIS accesses it via HTTP. 
+* The SILO preprocessing fetches data from the Loculus backend at regular intervals,
+  processes it into a format that the SILO API can load, and stores the result in a shared volume (on disk).
+  * The SILO API will pick up the processed data and load it into memory.
+
+## Loculus Backend Infrastructure
+
+This diagram shows the backend infrastructure of Loculus.
+
+![Backend Infrastructure](plantuml/05_level_2_backend.svg)
+
+The "Loculus Backend" is the central HTTP API.
+It encapsulates the data storage.
+All data is stored in a Postgres database.
+Several other components interact with the backend:
+* The website
+  * sends data to the backend (e.g. new sequence data, newly created groups)
+  * requests data from the backend (e.g. some parts of sequence data, groups)
+* Submitters can use the API directly to submit new sequence data.
+* The preprocessing pipeline fetches unprocessed data, processes it and resubmits it to the backend.
+* The Ingest service fetches data from NCBI and submits it to the backend.
+  * Ingest must be specifically enabled for a specific organism.
+* The ENA deposition service checks whether new data has been uploaded to Loculus and submits it to ENA.
+  * ENA deposition must be specifically enabled for a specific organism.
+* The SILO preprocessing fetches all sequence data from the backend and loads it into SILO.
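The fetch, process, resubmit cycle of the preprocessing pipeline described in `05_building_block_view.md` can be sketched roughly as follows. This is only an illustration in Python, not the Nextclade-based pipeline that Loculus ships: `InMemoryBackend`, `fetch_unprocessed`, `submit_processed`, `process_entry` and the entry fields are hypothetical names, and the in-memory class stands in for the backend HTTP API.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryBackend:
    """Hypothetical stand-in for the backend HTTP API (not the real interface)."""

    unprocessed: list[dict] = field(default_factory=list)
    processed: list[dict] = field(default_factory=list)

    def fetch_unprocessed(self, batch_size: int) -> list[dict]:
        # Hand out up to batch_size entries and remove them from the queue.
        batch = self.unprocessed[:batch_size]
        self.unprocessed = self.unprocessed[batch_size:]
        return batch

    def submit_processed(self, entries: list[dict]) -> None:
        self.processed.extend(entries)


def process_entry(entry: dict) -> dict:
    """Toy organism-specific step: normalise the sequence and flag invalid characters."""
    sequence = entry["sequence"].upper()
    errors = [] if set(sequence) <= set("ACGTN") else ["sequence contains invalid characters"]
    return {"accession": entry["accession"], "sequence": sequence, "errors": errors}


def run_once(backend: InMemoryBackend, batch_size: int = 10) -> int:
    """One fetch/process/resubmit cycle; returns the number of entries handled."""
    batch = backend.fetch_unprocessed(batch_size)
    backend.submit_processed([process_entry(entry) for entry in batch])
    return len(batch)
```

Keeping all organism-specific logic inside `process_entry` mirrors the design goal stated in the solution strategy: the backend stays mostly organism-agnostic, and a maintainer could swap in a different processing step without touching the loop.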
diff --git a/architecture_docs/06_runtime_view.md b/architecture_docs/06_runtime_view.md
@@ -0,0 +1,27 @@
+# Runtime view
+
+## Sequence Entry Lifecycle
+
+The following diagram shows a prototypical lifecycle of sequence data in Loculus:
+A submitter uploads data on the website, the backend infrastructure processes it
+and finally, the data is available for querying via LAPIS.
+
+![Submission Process](plantuml/06_submission_process.svg)
+
+The [backend runtime view](../backend/docs/runtime_view.md) provides a more detailed view of what happens in the backend
+during the submission process.
+
+## Handling of Rejected Data
+
+The next diagram shows in more detail how users interact with Loculus when uploaded data is rejected by the preprocessing pipeline:
+
+![Submission Details](plantuml/06_user_submission_details.svg)
+
+Users are asked to edit erroneous data and resubmit it before they can approve it.
+If the data has been reprocessed successfully, they can approve it, and it will be available for querying via LAPIS.
+
+## ENA deposition
+
+![ENA deposition](plantuml/06_ena_deposition.svg)
+
+TODO: describe this.
diff --git a/architecture_docs/07_deployment_view.md b/architecture_docs/07_deployment_view.md
@@ -0,0 +1,37 @@
+# Deployment View
+
+All artifacts of Loculus are available as Docker images.
+Thus, Loculus can be operated in any environment that supports Docker containers.
+Because configuring Loculus involves extensive processing, we provide a [Helm](https://helm.sh/) chart that does most of that work,
+so we suggest operating Loculus in a Kubernetes cluster.
+
+## High Level Overview
+
+In a production environment, you will most likely want persistent databases.
+We recommend hosting the databases external to your Loculus cluster, as shown in the following diagram:
+
+![Deployment Overview](plantuml/07_deployment_overview.svg)
+
+For local development, we use [k3d](https://k3d.io/) to spin up a local cluster.
+There, the databases are also hosted within the cluster, because they don't need to be persistent.
+
+## Cluster Internals
+
+The following diagram sketches the internal structure of the deployed cluster.
+Only connections to/from outside the cluster are marked with arrows here.
+All other connections are omitted for simplicity.
+
+![Cluster Details](plantuml/07_cluster_details.svg)
+
+Inside the cluster, we assume that there is [Traefik](https://traefik.io/) running as an ingress controller.
+[k3s](https://k3s.io/) and k3d already come with Traefik installed by default.
+We configured Traefik to expose the relevant services to the public:
+* the website,
+* the backend,
+* LAPIS,
+* Keycloak.
+
+We only need a single instance of the website, the backend and Keycloak (and their respective databases).
+The other services (LAPIS, SILO, preprocessing pipeline, ingest and ENA deposition) have to be configured
+and deployed per organism that the Loculus instance supports.
+We utilize Helm to generate those multiple service instances.
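The split between shared and per-organism services described in `07_deployment_view.md` can be illustrated with a small sketch. It is written in Python for readability and does not reflect how the Helm chart is actually templated: the service name strings, the `<service>-<organism>` naming scheme and the shape of the `organisms` argument are assumptions made for this example. The optional handling of ingest and ENA deposition follows the building block view, which states that both must be enabled per organism.

```python
# Deployed once per Loculus instance.
SHARED_SERVICES = ["website", "backend", "keycloak"]
# Deployed once per organism.
REQUIRED_PER_ORGANISM = ["lapis", "silo", "preprocessing"]
# Deployed per organism only when explicitly enabled.
OPTIONAL_PER_ORGANISM = ["ingest", "ena-deposition"]


def expand_services(organisms: dict[str, set[str]]) -> list[str]:
    """Map each organism to its enabled optional services and list all service instances.

    The naming scheme "<service>-<organism>" is hypothetical.
    """
    names = list(SHARED_SERVICES)
    for organism, enabled in organisms.items():
        names.extend(f"{service}-{organism}" for service in REQUIRED_PER_ORGANISM)
        names.extend(
            f"{service}-{organism}" for service in OPTIONAL_PER_ORGANISM if service in enabled
        )
    return names
```

For example, `expand_services({"organism-a": {"ingest"}, "organism-b": set()})` yields the three shared services, four instances for `organism-a` and three for `organism-b`.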
diff --git a/architecture_docs/08_crosscutting_concepts.md b/architecture_docs/08_crosscutting_concepts.md
@@ -0,0 +1,18 @@
+# Crosscutting Concepts
+
+## Logging
+
+Log messages are written directly to stdout so that they can be collected by the container orchestrator.
+
+## Request Tracing
+
+Where possible, APIs should implement request ids:
+* The API should accept a request id in the request header.
+* The API must include the request id in the response header. If no request id is provided, the API should generate one.
+* The API must include the request id in all log messages.
+
+This allows for tracing of requests through the system.
+It is also helpful if services log the request id that they receive from a service that they consume.
+
+In Spring Boot, implementing request ids is quite straightforward with `@RequestScope`.
+Also see [the implementation in the backend](https://github.com/loculus-project/loculus/blob/cbbbc9746604679df225059af6683ebcb568e038/backend/src/main/kotlin/org/loculus/backend/log/RequestId.kt).
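The three request id rules from `08_crosscutting_concepts.md` can be sketched independently of any framework. The snippet below is a Python illustration, not the backend's Spring Boot implementation: the header name `x-request-id`, the function `handle_request` and the plain-dict representation of headers (with lower-cased keys) are assumptions made for this example.

```python
import logging
import uuid

REQUEST_ID_HEADER = "x-request-id"  # assumed header name, not taken from the Loculus code


def handle_request(headers: dict[str, str], logger: logging.Logger) -> dict[str, str]:
    """Return response headers that carry the request id.

    Assumes header keys are already lower-cased.
    """
    # Rule 1: accept a request id from the request header.
    # Rule 2 (second half): generate one if the caller did not provide it.
    request_id = headers.get(REQUEST_ID_HEADER) or str(uuid.uuid4())

    # Rule 3: include the request id in every log message.
    logger.info("[%s] handling request", request_id)

    # Rule 2 (first half): always echo the request id in the response header.
    return {REQUEST_ID_HEADER: request_id}
```

A caller that forwards the returned id to downstream services, and logs it alongside its own id, gets the cross-service tracing described above.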
diff --git a/architecture_docs/09_architecture_decisions.md b/architecture_docs/09_architecture_decisions.md
@@ -0,0 +1,5 @@
+# Architecture Decisions
+
+ADRs...
+
+Check Nuclino
diff --git a/architecture_docs/10_quality_requirements.md b/architecture_docs/10_quality_requirements.md
@@ -0,0 +1,26 @@
+# Quality Requirements
+
+Following the [ISO-25010](https://iso25000.com/index.php/en/iso-25000-standards/iso-25010) standard, we define the following quality requirements for our system:
+
+## Performance Efficiency
+
+* Time behavior: When a submitter uploads a sequence, then the sequence should be available for querying within 10 minutes.
+* Time behavior: When a user queries a sequence, then the query should return within 1 second.
+
+## Interaction Capability
+
+* Operability: A maintainer should be able to set up a new Loculus instance from reading the documentation.
+
+## Security
+
+* Integrity: Only submitters belonging to the respective group should be able to make changes on sequence data.
+
+## Transparency
+
+We also identified two quality requirements that don't fit into the ISO-25010 standard:
+
+* The Loculus project is transparent. Important decisions are publicly documented.
+  Users can comprehend how Loculus works and how the data is processed.
+* It is comprehensible who submitted which data and when.
+  This is important so that submitters can be credited appropriately for their work
+  (e.g. by citing their data in a publication).
diff --git a/architecture_docs/11_risks_and_technical_debt.md b/architecture_docs/11_risks_and_technical_debt.md
@@ -0,0 +1,16 @@
+# Risks and Technical Debt
+
+## Configuration Processing
+
+We use a `values.yaml` file as the main input source for the Helm chart that configures a Loculus instance.
+
+We leveraged the powerful templating capabilities of Helm to generate the configuration files for the individual artifacts.
+This works well, because we can distribute the mostly redundant configuration values efficiently.
+
+However, this became quite complex and hard to maintain over time.
+It is untested and hard to debug if something goes wrong.
+It is also (as of now) mostly undocumented.
+
+Some parts of the configuration are redundant and could be simplified.
+Also, the Helm chart contains a lot of default values 
+that are not suitable for general Loculus instances and will result in unexpected behavior if not overwritten.
diff --git a/architecture_docs/12_glossary.md b/architecture_docs/12_glossary.md
@@ -0,0 +1,5 @@
+# Glossary
+
+See glossary on the documentation page:
+* https://loculus.org/introduction/glossary/
+* [source file](../docs/src/content/docs/introduction/glossary.md)
diff --git a/architecture_docs/README.md b/architecture_docs/README.md
@@ -0,0 +1,4 @@
+# Architecture Documentation
+
+This folder documents the architecture of Loculus.
+It is based on the template provided by https://arc42.org/.
diff --git a/architecture_docs/plantuml/.gitignore b/architecture_docs/plantuml/.gitignore
@@ -0,0 +1 @@
+plantuml.jar
diff --git a/architecture_docs/plantuml/01_use_cases.puml b/architecture_docs/plantuml/01_use_cases.puml
@@ -0,0 +1,31 @@
+@startuml
+
+title Loculus Use Cases
+left to right direction
+
+actor User as user
+actor Submitter as submitter
+actor Maintainer as maintainer
+
+rectangle Loculus {
+    usecase "Upload data" as upload
+    usecase "Revise data" as revise
+    usecase "Browse data" as browse
+    usecase "Download data" as download
+
+    usecase "Configure new organism" as configure
+    usecase "Host own instance" as host
+    usecase "Sync data with INSDC" as insdc
+}
+
+submitter --> upload
+submitter --> revise
+
+user --> browse
+user --> download
+
+maintainer --> configure
+maintainer --> host
+maintainer --> insdc
+
+@enduml
diff --git a/architecture_docs/plantuml/01_use_cases.svg b/architecture_docs/plantuml/01_use_cases.svg