Request for Enhancement: X-Road Metrics version 2.x #155

melbeltagy · 2024-12-30T10:49:02Z

TL;DR

The current implementation of X-Road Metrics works properly and performs very well, but it has many pain points from the administration, security, and development points of view.
The proposed solution is to migrate the existing applications from CLI-based applications to a microservices architecture with AdminUI and UserUI components.

Detailed Edition

Background information

The current implementation of X-Road Metrics, although it works properly and performs very well, has the following pain points from the administration, security, and development points of view.

Administration point-of-view

Administrative tasks are scattered across different modules:
- The Collector module has scripts to create MongoDB users, collections, and indexes
- The OpenData module has a script to create PostgreSQL users
- The Anonymizer module creates the PostgreSQL tables
For every module, maintaining the configuration requires direct access to the file system of the module's server and changing the files manually.
Every module supports different operations. E.g., Collector has list, collect, and update, while Anonymizer has only one. The issue is that these operations are not clear to the administrator.
Usually, only one operation is scheduled as a cron job on the server.
To run any of these jobs/operations outside of their schedule, or to run an operation that has no schedule, administrators once again have to connect to the running server's shell and execute the required operation(s) manually.
Although it is possible to build a Docker image out of each module, running the images in Kubernetes and maintaining their configurations is not an easy administrative task at all.

Security point-of-view

Since each module's job is configured as a native cron job, the Docker images will always require root privileges, so they do not adhere to image security best practices (similar to the Security Server Sidecar image).
This issue actually prevents some Kubernetes-based runtimes, e.g., OpenShift, from running the images without workarounds that some organizations consider a big risk.
Storing configurations that contain database credentials on the file system is a big issue.
Backwards compatibility with Python 3.8 and Ubuntu 20.04 (Focal) is enforced by holding dependencies back from their latest versions, which also holds back security fixes.

Development point-of-view

Each module has its own technical stack. Some are pure Python CLI applications, and others use the Django framework. Then there is the Networking module, which is based on the R language. Maintaining these different technical stacks in one repository is difficult, which raises the cost of maintenance and of adding new features.
A lot of workarounds have been put in place to keep backwards compatibility with Python 3.8 and Ubuntu 20.04 (Focal).
Despite the latest efforts to improve the developer experience, it's not easy to run the applications locally (they must be built and run in Docker locally, and the images must be built by the developer since the repo does not have any means of building images).
Running any module requires administration tasks as well, which exposes the developer to the same issues as an administrator.
The structure of each module does not follow the standard layout. E.g., tests are inside each package.
Each module has some utility command lines that are either obsolete or that only a handful of people know how to use or what they do.
The way the code is structured makes IDEs report errors and fail to resolve the different imports and type checks (I tried many times to fix this, but could not without drastic changes).
Some modules are part of the code base but are not published yet. This makes it hard to understand which parts to maintain and which to set aside for now.

Important

Although each of the previous pain points can be addressed individually, it's really hard and time consuming given the current state. (I tried 😞 )

Metrics Version 2.0

The proposed solution is:

Migrate existing applications (modules) from CLI applications to microservices based on the FastAPI framework, i.e., switch from CLI to REST APIs. This applies to all modules, including Networking.
Introduce Admin UI, a unified web application targeting administrators. It allows them to:
- Run the different operations/jobs supported by any of the Collector, Corrector, Reports, and Anonymizer applications.
- Configure each application (including OpenData and Networking)
- Download PDF reports from the UI (instead of accessing the server's file system or waiting for emails)
  - Emails would still be available
Introduce User UI, another unified web application that targets public users (replacing the existing Django and R UIs)
- This way, only UserUI and its proxy are in the DMZ, while OpenData and Networking are deployed in the internal network.
Introduce a new module responsible for all DB administrative tasks: creating MongoDB users, collections, and indexes, and migrating the PostgreSQL database using a database migration tool, e.g., Alembic.
Default packaging would be Docker images.
Notes:
Sharing my thoughts about this point:
- My preferred option is to actually drop support for Debian packaging.
- If Docker images are not the only required option, then the second-best option is to package each application as a Python application (or as a wheel) and publish it to Artifactory.
  This way, we can support any OS and be more flexible.
- If Debian packages specifically are still needed, they can still be created and published.
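
The first proposal above, moving from CLI operations to REST endpoints, can be illustrated with a framework-free sketch. It shows only the idea of a registry that makes each module's operations discoverable and callable by name. The registry, the decorator, and the returned strings are hypothetical; in a FastAPI service each entry would presumably be exposed through a route, which is not shown here.

```python
from typing import Callable, Dict, List

# Hypothetical registry: module name -> operation name -> callable.
OPERATIONS: Dict[str, Dict[str, Callable[[], str]]] = {}


def operation(module: str, name: str) -> Callable[[Callable[[], str]], Callable[[], str]]:
    """Register a function as a named operation of a module."""
    def decorator(func: Callable[[], str]) -> Callable[[], str]:
        OPERATIONS.setdefault(module, {})[name] = func
        return func
    return decorator


# The three Collector operations named in the issue; bodies are placeholders.
@operation("collector", "list")
def collector_list() -> str:
    return "collector: list finished"


@operation("collector", "collect")
def collector_collect() -> str:
    return "collector: collect finished"


@operation("collector", "update")
def collector_update() -> str:
    return "collector: update finished"


def list_operations() -> Dict[str, List[str]]:
    """What an admin-facing endpoint could return to make operations visible."""
    return {module: sorted(ops) for module, ops in OPERATIONS.items()}


def run_operation(module: str, name: str) -> str:
    """Dispatch a request to the registered operation, if there is one."""
    try:
        func = OPERATIONS[module][name]
    except KeyError:
        raise LookupError(f"unknown operation: {module}/{name}") from None
    return func()
```

With such a registry, an administrator (or the Admin UI) could first ask which operations exist and then trigger one on demand, instead of opening a shell on the server.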

The proposed solution will inherently provide the following fixes as well

Migrate all configurations to environment variables or MongoDB (depending on the configuration's nature)
Using Docker images means that we can always use the latest Python and dependencies, and hence the latest security fixes.
Docker images will adhere to best practices
Fix the code structure
Clean up the code base by removing unused code, libraries, and tools
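
As a rough illustration of the first fix above, database credentials could be read from environment variables instead of a file on disk. This is a minimal sketch using only the standard library; the variable names (`XRD_METRICS_MONGO_*`) and the `MongoSettings` class are assumptions, since the issue does not define a naming scheme.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names; the issue does not define a naming scheme.
REQUIRED_VARS = (
    "XRD_METRICS_MONGO_HOST",
    "XRD_METRICS_MONGO_USER",
    "XRD_METRICS_MONGO_PASSWORD",
)


@dataclass(frozen=True)
class MongoSettings:
    host: str
    port: int
    user: str
    password: str


def load_mongo_settings(env: Mapping[str, str] = os.environ) -> MongoSettings:
    """Build settings from the environment and fail fast if a value is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return MongoSettings(
        host=env["XRD_METRICS_MONGO_HOST"],
        # 27017 is MongoDB's default port; the variable itself is optional here.
        port=int(env.get("XRD_METRICS_MONGO_PORT", "27017")),
        user=env["XRD_METRICS_MONGO_USER"],
        password=env["XRD_METRICS_MONGO_PASSWORD"],
    )
```

In a container deployment the values would typically come from the orchestrator's secret mechanism, so no credentials need to be stored in the image or on the server's file system.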

Additional items to include

Move extra modules that are not published yet to a different branch from main, i.e., move analysis_module, analysis_ui_module, archive_module, and opendata_collector_module to an incubation branch for enhanced clarity and maintainability.

Proposed Architecture:

Proposed Milestones/plan:

Milestone 1
- Move non-production modules into an incubation branch
Milestone 2
This milestone should be easy to accomplish and would be an easy win. Admins can use Swagger at this stage
- Migrate Collector, Corrector, Reports, and Anonymizer applications into FastAPI microservices
- Introduce Administration module
- Switch to Docker images
- Migrate existing related configurations to MongoDB
- Upgrade MongoDB version to 7.x and Python version to 3.12.x
Milestone 3
- Introduce AdminUI
Milestone 4
- Migrate OpenData to FastAPI
- Introduce UserUI
Milestone 5
- Migrate Networking to FastAPI
- Use UserUI


melbeltagy · 2025-01-01T07:49:50Z

Milestone 1

Regarding Milestone 1, I tried multiple approaches and found that the simplest one is to move the non-released modules into a nested folder and disable any files that would cause GitHub or Dependabot to treat the folders as Python applications.
This is already implemented in PR #156.
I highly recommend merging this PR, as it removes confusing code from the developers' concern.
The modules can, of course, be moved out of the folder when needed, after being upgraded and fixed.

Alternate Options

Option 1

Conclusion: It causes a lot of hidden, irreversible issues that are usually not noticed until it's too late (according to the official documentation and articles online).

Use git filter-branch. It sounded very promising until I read the warnings.
When I tried it, the git command refused to run and pointed to the second option below.

Option 2

Conclusion: A big no-go for our use case.

The previous option points to git-filter-repo as a replacement.
In summary, it's a tool that removes any trace of unwanted files or folders. It works great, but it does not fit our use case, as it filters content at the repository level, not the branch level.

Option 3

Conclusion: It can be done if really needed, but it has drawbacks and seems to be overkill in our case.

Move each of the non-released modules into its own incubation branch using the git subtree command.
This tool fits best when a folder that represents a service or a library needs to be split into its own repository. It still works for our use case, but with caveats.

VitaliStupin · 2025-01-02T14:07:22Z

If a proper rewrite is considered for X-Road Metrics, performance and optimization issues should be addressed.

The application currently fulfills its purpose but scales poorly as the volume of X-Road queries increases. For example, three months of Estonian X-tee data requires 1TB of storage, with half occupied by indexes. Some commonly used indexes reach 40GB in size, and multiple indexes are utilized concurrently. While MongoDB is optimized for in-memory operations, it's infeasible to allocate hundreds of gigabytes of memory exclusively for MongoDB. A machine with 64GB of RAM struggles to process new metrics due to excessive I/O operations, as it cannot hold all required indexes in memory.

MongoDB might not be the ideal database for X-Road Metrics. However, when considering alternatives, it's crucial to acknowledge that any other database optimized for fast ingestion and less frequent updates would likely face similar challenges. A typical sequence of database operations to ingest metrics data into MongoDB is as follows:

Collector: Adds PartyA (client or producer) data to the raw_messages collection.
Corrector: Reads PartyA data from raw_messages.
Corrector: Searches the xRequestId index to check if PartyB data has already been added to clean_data (in this flow, PartyB data is not yet available).
Corrector: Adds PartyA data to the clean_data collection.
Corrector: Marks PartyA data as processed in raw_messages.
Collector: Adds PartyB data to raw_messages.
Corrector: Reads PartyB data from raw_messages.
Corrector: Searches the xRequestId index to check if PartyA data has already been added to clean_data.
Corrector: Reads PartyA data from clean_data.
Corrector: Updates PartyA data with fields from PartyB and saves it to clean_data.
Corrector: Marks PartyB data as processed in raw_messages.
Cleanup: Finds and deletes processed documents from raw_messages.
Cleanup: Finds and deletes documents from clean_data older than X days (e.g., 90 days). This step is particularly slow as it requires checking and processing individual rows.
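
The sequence above can be traced with a small in-memory simulation. It is a sketch for following the steps, not the actual Collector or Corrector code: the two collections are plain Python containers, the dictionary keyed by `xRequestId` stands in for the index, and the `side`, `fields`, and `processed` keys are hypothetical names.

```python
from typing import Dict, List

raw_messages: List[dict] = []     # stands in for the raw_messages collection
clean_data: Dict[str, dict] = {}  # keyed by xRequestId, standing in for the index


def collect(doc: dict) -> None:
    """Collector: add one party's data to raw_messages."""
    raw_messages.append({**doc, "processed": False})


def correct() -> None:
    """Corrector: pair each unprocessed raw document with its counterpart."""
    for raw in raw_messages:
        if raw["processed"]:
            continue
        request_id = raw["xRequestId"]
        # Look up whether the other party has already been added to clean_data.
        merged = clean_data.get(request_id)
        if merged is None:
            merged = {"client": None, "producer": None}
            clean_data[request_id] = merged
        # Insert or update the document with this party's fields.
        merged[raw["side"]] = raw["fields"]
        # Mark the raw document as processed.
        raw["processed"] = True


def cleanup_raw() -> int:
    """Cleanup: delete processed documents and return how many were removed."""
    before = len(raw_messages)
    raw_messages[:] = [raw for raw in raw_messages if not raw["processed"]]
    return before - len(raw_messages)
```

Even in this toy form, each request is written to `raw_messages`, read back, flagged, and later deleted, in addition to the reads and writes on `clean_data`, which is the I/O overhead the list describes.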

Some steps in this process are redundant. For example, processed documents in raw_messages could be deleted directly instead of being marked first. Alternatively, the raw_messages collection could be bypassed altogether by passing data directly from the collector to the corrector. This would slightly slow down the collection process but significantly reduce overall disk I/O.

Deleting old documents from the database is a very slow operation. Deleting a single day's worth of data can take hours. The primary issue is that MongoDB does not support table partitioning. This prevents the deletion of an entire month/week/day instantly, necessitating slow search/delete operations within MongoDB.
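
The difference between the two deletion strategies can be shown with a toy model. This is only an illustration of the idea under simplified assumptions: both stores are in-memory Python classes with hypothetical names, not database code, and real partition handling depends on the database chosen.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List


class FlatStore:
    """All documents in one collection: deletion must inspect every document."""

    def __init__(self) -> None:
        self.docs: List[dict] = []

    def insert(self, day: date, doc: dict) -> None:
        self.docs.append({"day": day, **doc})

    def delete_older_than(self, cutoff: date) -> int:
        kept = [doc for doc in self.docs if doc["day"] >= cutoff]
        removed = len(self.docs) - len(kept)
        self.docs = kept
        return removed


class PartitionedStore:
    """One partition per day: expiring data drops whole partitions."""

    def __init__(self) -> None:
        self.partitions: Dict[date, List[dict]] = defaultdict(list)

    def insert(self, day: date, doc: dict) -> None:
        self.partitions[day].append(doc)

    def drop_older_than(self, cutoff: date) -> int:
        expired = [day for day in self.partitions if day < cutoff]
        for day in expired:
            del self.partitions[day]  # no per-document work
        return len(expired)
```

In the flat store the cost of a cleanup grows with the number of documents, while in the partitioned store it grows only with the number of expired partitions.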

The current data model is also suboptimal. Client and producer data are separated into different sub-elements, and one of these sub-elements may be missing. This requires up to four separate searches for many operations, especially during report generation, instead of a single search. Furthermore, storing duplicate data in both client and producer elements introduces overhead. A more efficient approach would involve:

Fields common to both client and producer.
Aggregated fields for optimized searching and simplified reporting/visualizations.
Client-specific fields without duplicates.
Producer-specific fields without duplicates.

This structure would minimize the number of searches required and eliminate unnecessary conditions and checks for the existence of client or producer elements.
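
One possible shape for such a document is sketched below. The class, the helper function, and all field names are illustrative assumptions rather than a proposed schema. The point is that shared values are stored once and that `client` and `producer` always exist, even when one side is missing.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet, Optional


@dataclass
class CleanDocument:
    x_request_id: str
    common: Dict[str, Any]      # fields shared by both sides, stored once
    aggregated: Dict[str, Any]  # precomputed fields for searching and reporting
    client: Dict[str, Any] = field(default_factory=dict)    # client-only fields
    producer: Dict[str, Any] = field(default_factory=dict)  # producer-only fields


def build_document(
    request_id: str,
    client: Optional[Dict[str, Any]],
    producer: Optional[Dict[str, Any]],
    common_keys: FrozenSet[str],
) -> CleanDocument:
    """Split the two raw sides into common, aggregated, and side-specific parts."""
    client = client or {}
    producer = producer or {}
    common: Dict[str, Any] = {}
    for key in common_keys:
        # Take the shared value from whichever side reported it.
        if key in client:
            common[key] = client[key]
        elif key in producer:
            common[key] = producer[key]
    return CleanDocument(
        x_request_id=request_id,
        common=common,
        aggregated={"has_client": bool(client), "has_producer": bool(producer)},
        client={k: v for k, v in client.items() if k not in common_keys},
        producer={k: v for k, v in producer.items() if k not in common_keys},
    )
```

A query on a shared field then needs a single lookup in `common`, regardless of which side delivered the data.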

Report generation also necessitates optimization. Due to the inability of indexes to fit into memory, generating reports for members with numerous queries can be as slow as a full collection scan. Performing a single full scan and aggregating data for each member instead should be considered. Database support for partitioning would also significantly improve performance by reducing index sizes and eliminating the need to process data from previous months.
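
The single-scan idea can be sketched as follows: instead of running one indexed query per member, all documents are read once and counters are accumulated per member. The `member` and `succeeded` keys and the two statistics are hypothetical; real reports contain more fields.

```python
from collections import defaultdict
from typing import Dict, Iterable


def aggregate_by_member(documents: Iterable[dict]) -> Dict[str, Dict[str, int]]:
    """Accumulate per-member statistics in a single pass over the documents."""
    totals: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"queries": 0, "succeeded": 0}
    )
    for doc in documents:
        stats = totals[doc["member"]]
        stats["queries"] += 1
        if doc["succeeded"]:
            stats["succeeded"] += 1
    return dict(totals)
```

Because `documents` can be any iterable, the same function would work on a streaming cursor, so the full data set never has to be held in memory at once.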

Currently, X-Road Metrics does not provide Kibana/Grafana visualizations and searches. The Estonian X-Road center synchronizes data from MongoDB to Elasticsearch and utilizes Kibana for problem identification and visualization of X-Road usage statistics. If major changes are implemented, the synchronization process may need to be adjusted. If the database is replaced, a new synchronization solution would be required. However, this should not be considered a major obstacle, as optimizing X-Road Metrics can significantly reduce hardware resource consumption.

melbeltagy changed the title ~~Enhancement Request: X-Road Metrics version 2.x~~ Request for Enhancement: X-Road Metrics version 2.x Dec 30, 2024
