docs: add architecture documentation #3184

Draft: wants to merge 20 commits into base: main
Changes from 16 commits

4 changes: 4 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -39,6 +39,10 @@ Note that if you are developing the backend or frontend/website in isolation a f

## Architecture

[architecture_docs/](./architecture_docs) contains the architecture documentation of Loculus.

TLDR:

- Backend code is in `backend`, see [`backend/README.md`](/backend/README.md)
- Frontend code is in `website`, see [`website/README.md`](/website/README.md)
- The sequence and metadata processing pipeline is in the [`preprocessing`](/preprocessing) folder, see [`preprocessing/specification.md`](/preprocessing/specification.md)
9 changes: 9 additions & 0 deletions architecture_docs/01_introduction_and_goals.md
@@ -0,0 +1,9 @@
# Introduction And Goals

Also see the top level [README.md](../README.md) for a high-level overview of the project.

Loculus is a software package to power microbial genomic databases.

This is an overview of important use cases:

![Use Cases](plantuml/01_use_cases.svg)
18 changes: 18 additions & 0 deletions architecture_docs/02_architecture_constraints.md
@@ -0,0 +1,18 @@
# Architecture Constraints

Loculus is developed under the following constraints:

### Open Source Software

We decided to develop Loculus under an open source license.
The code is publicly available.

We chose to develop Loculus as open source software for the following reasons:
* to increase the transparency of the project,
* to allow others to contribute,
* to let those who use the software see how it works.

### Configurability

Loculus is designed to be highly configurable.
It should be usable for different organisms and different use cases.
6 changes: 6 additions & 0 deletions architecture_docs/03_context_and_scope.md
@@ -0,0 +1,6 @@
# Context and Scope

This section puts Loculus into context with the outside world and defines the scope of the project.
All external participants are listed in the diagram below:

![Context View](plantuml/03_context_view.svg)
11 changes: 11 additions & 0 deletions architecture_docs/04_solution_strategy.md
@@ -0,0 +1,11 @@
# Solution Strategy

This section describes important decisions that were made to solve the problem:
* Loculus uses [LAPIS](https://github.com/GenSpectrum/LAPIS) and [SILO](https://github.com/GenSpectrum/LAPIS-SILO) to provide fast access to the sequence data.
* Loculus implements a central HTTP API to store and retrieve data.
This API encapsulates the data storage in a Postgres database.
All other services interact with this API.
The API is mostly agnostic to organism-specific logic.
* A preprocessing pipeline handles the organism-specifics, such as alignment and translation.
We provide a Nextclade-based pipeline, but maintainers can plug in their own pipeline.

62 changes: 62 additions & 0 deletions architecture_docs/05_building_block_view.md
@@ -0,0 +1,62 @@
# Building Block View

In the following diagrams, the arrows point from the actor to the system component that is used by the actor.
Data flow may be in the opposite direction
(e.g. in the case of a download: the actor requests a download from the website, the website sends the data to the
actor).

## Overview

This diagram provides a high level overview of the components of Loculus
and how they interact with each other and external participants.

![Building Block View](plantuml/05_level_1.svg)

* Users can either
    * use the website to browse the data and download sequences
    * or use LAPIS directly to query the data (e.g. for automated analysis).
* Submitters can
    * log in via Keycloak
    * submit new sequence data via the website
    * or use the API directly to automate their submission process.
* The backend infrastructure stores and processes the data.
* LAPIS / SILO provides the query engine for the sequence data that is stored in the backend infrastructure.
* The backend infrastructure also fetches sequence data from / uploads sequence data to INSDC services.
* The website and the backend infrastructure use Keycloak to verify the identity of users.

## LAPIS / SILO

This diagram shows how Loculus utilizes
[LAPIS](https://github.com/GenSpectrum/LAPIS) and
[SILO](https://github.com/GenSpectrum/LAPIS-SILO).

![LAPIS / SILO](plantuml/05_level_2_lapis.svg)

* LAPIS provides an HTTP API to query the sequence data.
* LAPIS is used by the website, but it can also be used by users directly.
* The SILO API is a query engine that stores the data in memory to provide fast access.
LAPIS accesses it via HTTP.
* The SILO preprocessing fetches data from the Loculus backend at regular intervals,
processes it into a format that the SILO API can load and stores the result in a shared volume (on disk).
* The SILO API will pick up the processed data and load it into memory.
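
The hand-off through the shared volume can be pictured as "write to a temporary file, then rename". The following Python sketch only illustrates that idea and is not the SILO implementation: `run_preprocessing_once`, `load_latest`, the JSON format and the file name `current.json` are hypothetical, and sorting by accession stands in for the real processing step.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def run_preprocessing_once(
    fetch_released_data: Callable[[], list[dict]],
    shared_volume: Path,
) -> Path:
    """Fetch data, convert it, and publish it atomically on the shared volume."""
    records = fetch_released_data()
    # Placeholder for the real processing: sort so that the output is deterministic.
    processed = sorted(records, key=lambda record: record["accession"])

    shared_volume.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first so that a reader never sees a half-written file.
    fd, tmp_name = tempfile.mkstemp(dir=shared_volume, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp_file:
        json.dump(processed, tmp_file)
    target = shared_volume / "current.json"
    os.replace(tmp_name, target)  # atomic when source and target share a file system
    return target


def load_latest(shared_volume: Path) -> list[dict]:
    """What the query engine side would do: pick up the published file and load it into memory."""
    with open(shared_volume / "current.json") as published:
        return json.load(published)
```

In a real deployment, `run_preprocessing_once` would be called on a timer, and the reader would check for a new file before reloading.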

## Loculus Backend Infrastructure

This diagram shows the backend infrastructure of Loculus.

![Backend Infrastructure](plantuml/05_level_2_backend.svg)

The "Loculus Backend" is the central HTTP API.
It encapsulates the data storage.
All data is stored in a Postgres database.
Several other components interact with the backend:
* The website
    * sends data to the backend (e.g. new sequence data, newly created groups)
    * requests data from the backend (e.g. some parts of sequence data, groups)
* Submitters can use the API directly to submit new sequence data.
* The preprocessing pipeline fetches unprocessed data, processes it and resubmits it to the backend.
* The Ingest service fetches data from NCBI and submits it to the backend.
    * Ingest must be explicitly enabled for each organism.
* The ENA deposition service checks whether new data has been uploaded to Loculus and submits it to ENA.
    * ENA deposition must be explicitly enabled for each organism.
* The SILO preprocessing fetches all sequence data from the backend and loads it into SILO.
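
The fetch, process and resubmit cycle of a preprocessing pipeline can be sketched as follows. This is a simplified illustration, not the Nextclade-based pipeline: `InMemoryBackend` is a stand-in for the backend HTTP API, its method names are hypothetical, and the "processing" is a toy validation step.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryBackend:
    """Stand-in for the backend HTTP API; a real pipeline would call it over HTTP."""

    unprocessed: list[dict] = field(default_factory=list)
    processed: list[dict] = field(default_factory=list)

    def extract_unprocessed(self, limit: int) -> list[dict]:
        # Hand out at most `limit` entries and remove them from the queue.
        batch, self.unprocessed = self.unprocessed[:limit], self.unprocessed[limit:]
        return batch

    def submit_processed(self, entries: list[dict]) -> None:
        self.processed.extend(entries)


def process(entry: dict) -> dict:
    """Toy organism-specific step: normalise the sequence and flag invalid characters."""
    sequence = entry["sequence"].upper()
    errors = [] if set(sequence) <= set("ACGTN") else ["sequence contains invalid characters"]
    return {"accession": entry["accession"], "sequence": sequence, "errors": errors}


def run_pipeline(backend: InMemoryBackend, batch_size: int = 2) -> int:
    """Fetch batches until nothing is left to process; return the number of entries handled."""
    handled = 0
    while batch := backend.extract_unprocessed(batch_size):
        backend.submit_processed([process(entry) for entry in batch])
        handled += len(batch)
    return handled
```

Because the pipeline only talks to the backend through these two calls, the organism-specific logic stays outside the backend, which is what makes the pipeline replaceable.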
27 changes: 27 additions & 0 deletions architecture_docs/06_runtime_view.md
@@ -0,0 +1,27 @@
# Runtime view

## Sequence Entry Lifecycle

The following diagram shows a prototypical lifecycle of sequence data in Loculus:
A submitter uploads data on the website, the backend infrastructure processes it
and finally, the data is available for querying via LAPIS.

![Submission Process](plantuml/06_submission_process.svg)

The [backend runtime view](../backend/docs/runtime_view.md) provides a more detailed view of what happens in the backend
during the submission process.

## Submission Details

The next diagram shows in more detail how the user interacts with Loculus when the preprocessing pipeline rejects uploaded data:

![Submission Details](plantuml/06_user_submission_details.svg)

Users are asked to edit erroneous data and resubmit it before they can approve it.
If the data has been reprocessed successfully, they can approve it, and it will be available for querying via LAPIS.
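
The lifecycle described above can be read as a small state machine. The sketch below is only an illustration: the status names and the transition table are assumptions made for this example and do not necessarily match the statuses used by the backend.

```python
from enum import Enum


class Status(Enum):
    # Illustrative status names, not necessarily those used by the backend.
    RECEIVED = "received"
    IN_PROCESSING = "in_processing"
    HAS_ERRORS = "has_errors"
    AWAITING_APPROVAL = "awaiting_approval"
    APPROVED = "approved"


ALLOWED_TRANSITIONS = {
    Status.RECEIVED: {Status.IN_PROCESSING},
    Status.IN_PROCESSING: {Status.HAS_ERRORS, Status.AWAITING_APPROVAL},
    # Editing and resubmitting erroneous data sends it back for reprocessing.
    Status.HAS_ERRORS: {Status.RECEIVED},
    Status.AWAITING_APPROVAL: {Status.APPROVED},
    Status.APPROVED: set(),
}


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the lifecycle does not allow the step."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.name} to {target.name}")
    return target
```

In this model, data with errors cannot be approved directly; it has to pass through processing again first.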

## ENA deposition

![ENA deposition](plantuml/06_ena_deposition.svg)

TODO: describe this.
37 changes: 37 additions & 0 deletions architecture_docs/07_deployment_view.md
@@ -0,0 +1,37 @@
# Deployment View

All artifacts of Loculus are available as Docker images.
Thus, Loculus can be operated in any environment that supports Docker containers.
Because configuring Loculus involves extensive processing, we provide a [Helm](https://helm.sh/) chart that does most of that work,
so we suggest operating Loculus in a Kubernetes cluster.

## High Level Overview

In a production environment, you will most likely want persistent databases.
We recommend hosting the databases external to your Loculus cluster, as shown in the following diagram:

![Deployment Overview](plantuml/07_deployment_overview.svg)

For local development, we use [k3d](https://k3d.io/) to spin up a local cluster.
There, the databases are also hosted within the cluster, because they don't need to be persistent.

## Cluster Internals

The following diagram sketches the internal structure of the deployed cluster.
Only connections to/from outside the cluster are marked with arrows here.
All other connections are omitted for simplicity.

![Cluster Details](plantuml/07_cluster_details.svg)

Inside the cluster, we assume that there is [Traefik](https://traefik.io/) running as an ingress controller.
[k3s](https://k3s.io/) and k3d already come with Traefik installed by default.
We configured Traefik to expose the relevant services to the public:
* the website,
* the backend,
* LAPIS,
* Keycloak.

We only need a single instance of the website, the backend and Keycloak (and their respective databases).
The other services (LAPIS, SILO, preprocessing pipeline, ingest and ENA deposition) have to be configured
and deployed per organism that the Loculus instance supports.
We utilize Helm to generate those multiple service instances.
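
The effect of the per-organism templating can be illustrated with a few lines of Python. This is a sketch of the idea only, not of the Helm chart: the naming scheme `<service>-<organism>` and the shape of the `organisms` mapping are assumptions made for this example.

```python
SHARED_SERVICES = ["website", "backend", "keycloak"]
PER_ORGANISM_SERVICES = ["lapis", "silo", "preprocessing", "ingest", "ena-deposition"]
OPTIONAL_SERVICES = {"ingest", "ena-deposition"}  # must be enabled per organism


def plan_deployments(organisms: dict[str, dict]) -> list[str]:
    """List the service instances needed for the given organisms."""
    names = list(SHARED_SERVICES)  # deployed once per Loculus instance
    for organism, config in sorted(organisms.items()):
        for service in PER_ORGANISM_SERVICES:
            if service in OPTIONAL_SERVICES and not config.get(service, False):
                continue  # optional service not enabled for this organism
            names.append(f"{service}-{organism}")
    return names
```

For example, `plan_deployments({"ebola": {"ingest": True}, "mpox": {}})` yields the three shared services, four instances for the first organism and three for the second.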
18 changes: 18 additions & 0 deletions architecture_docs/08_crosscutting_concepts.md
@@ -0,0 +1,18 @@
# Crosscutting Concepts

## Logging

Log messages are written directly to stdout so that they can be collected by the container orchestrator.

## Request Tracing

Where possible, APIs should implement request ids:
* The API should accept a request id in the request header.
* The API must include the request id in the response header. If no request id is provided, the API should generate one.
* The API must include the request id in all log messages.

This allows for tracing of requests through the system.
It is also helpful if services log the request id that they receive from a service that they consume.

In Spring Boot, implementing request ids is quite straightforward with `@RequestScope`.
Also see [the implementation in the backend](https://github.com/loculus-project/loculus/blob/cbbbc9746604679df225059af6683ebcb568e038/backend/src/main/kotlin/org/loculus/backend/log/RequestId.kt).
5 changes: 5 additions & 0 deletions architecture_docs/09_architecture_decisions.md
@@ -0,0 +1,5 @@
# Architecture Decisions

ADRs...

Check Nuclino
26 changes: 26 additions & 0 deletions architecture_docs/10_quality_requirements.md
@@ -0,0 +1,26 @@
# Quality Requirements

Following the [ISO-25010](https://iso25000.com/index.php/en/iso-25000-standards/iso-25010) standard, we define the following quality requirements for our system:

## Performance Efficiency

* Time behavior: When a submitter uploads a sequence, then the sequence should be available for querying within 10 minutes.
* Time behavior: When a user queries a sequence, then the query should return within 1 second.

## Interaction Capability

* Operability: A maintainer should be able to set up a new Loculus instance from reading the documentation.

## Security

* Integrity: Only submitters belonging to the respective group should be able to make changes on sequence data.

## Transparency

We also identified two quality requirements that don't fit into the ISO-25010 standard:

* The Loculus project is transparent. Important decisions are publicly documented.
Users can comprehend how Loculus works and how the data is processed.
* It is comprehensible who submitted which data and when.
This is important so that submitters can be credited appropriately for their work
(e.g. by citing their data in a publication).
16 changes: 16 additions & 0 deletions architecture_docs/11_risks_and_technical_debt.md
@@ -0,0 +1,16 @@
# Risks and Technical Debt

## Configuration Processing

We use a `values.yaml` file as the main input to the Helm chart that configures a Loculus instance.

We leveraged the powerful templating capabilities of Helm to generate the configuration files for the individual artifacts.
This works well, because we can distribute the mostly redundant configuration values efficiently.

However, this became quite complex and hard to maintain over time.
It is untested and hard to debug when something goes wrong.
It is also (as of now) mostly undocumented.

Some parts of the configuration are redundant and could be simplified.
Also, the Helm chart contains a lot of default values
that are not suitable for general Loculus instances and will result in unexpected behavior if not overwritten.
@anna-parker anna-parker Nov 6, 2024
ENA deposition was written as an optional component; however, it still needs to submit all data and keep submission state. Therefore we duplicate all records and keep them in both the backend db schema and the ENA deposition schema, which creates unnecessary database bloat.

Although the two schemas are in the same db they behave as separate dbs with only the backend pod directly querying the public db schema and the ena-deposition and ingest (see below) pod querying the ena-deposition schema.

Potentially the ingest and ena-submission pod should be merged together as they both interact with INSDC and are optional. Additionally, ingest queries the ena deposition schema directly at the moment to ensure it does not reingest sequences that we submitted.

Contributor Author

I added something 👍

5 changes: 5 additions & 0 deletions architecture_docs/12_glossary.md
@@ -0,0 +1,5 @@
# Glossary

See glossary on the documentation page:
* https://loculus.org/introduction/glossary/
* [source file](../docs/src/content/docs/introduction/glossary.md)
4 changes: 4 additions & 0 deletions architecture_docs/README.md
@@ -0,0 +1,4 @@
# Architecture Documentation

This folder documents the architecture of Loculus.
It is based on the template provided by https://arc42.org/.
1 change: 1 addition & 0 deletions architecture_docs/plantuml/.gitignore
@@ -0,0 +1 @@
plantuml.jar
31 changes: 31 additions & 0 deletions architecture_docs/plantuml/01_use_cases.puml
@@ -0,0 +1,31 @@
@startuml

title Loculus Use Cases
left to right direction

actor User as user
actor Submitter as submitter
actor Maintainer as maintainer

rectangle Loculus {
usecase "Upload data" as upload
usecase "Revise data" as revise
usecase "Browse data" as browse
usecase "Download data" as download

usecase "Configure new organism" as configure
usecase "Host own instance" as host
usecase "Sync data with INSDC" as insdc
}

submitter --> upload
submitter --> revise

user --> browse
user --> download

maintainer --> configure
maintainer --> host
maintainer --> insdc

@enduml
1 change: 1 addition & 0 deletions architecture_docs/plantuml/01_use_cases.svg