Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: general service discovery for farm testnets #1335

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

NikolaMilosa
Copy link
Contributor

This service will be used for all sorts of farm testnets which will monitor arbitrary services. The idea is to use this to register testnet machines and then use our vector stack to scrape logs

@NikolaMilosa NikolaMilosa requested a review from a team as a code owner March 6, 2025 11:59
Copy link
Contributor

github-actions bot commented Mar 6, 2025

MSD Diff

No diff

@NikolaMilosa NikolaMilosa changed the title Nm general journald service discovery feat: general service discovery for farm testnets Mar 10, 2025
@sasa-tomic sasa-tomic requested a review from Copilot March 10, 2025 11:08
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Overview

This PR introduces a general service discovery service for farm testnets that registers target machines and integrates with the vector stack for log scraping. Key changes include the implementation of a server module with endpoints for adding and retrieving targets, new storage implementations (in-memory and file), and updates to metrics, supervisor, and configuration builder modules to support the service discovery workflow.

Reviewed Changes

File Description
src/server/mod.rs Implements the server with axum routes and a graceful shutdown.
src/storage/in_memory.rs and src/storage/file.rs Introduces in-memory and file storage mechanisms for target persistence.
src/metrics.rs Adds metrics instrumentation for monitoring target statuses.
src/supervisor.rs Implements target supervision with target polling, error logging, and GC handling.
src/server/add_targets.rs & get_targets.rs Define HTTP endpoints for target management.
Cargo.toml Updates dependencies and workspace members.
multiservice-discovery-shared/builders/general_exec.rs Adds a configuration builder for general service discovery.
multiservice-discovery-downloader Updates configuration generation logic and target conversion functions.
multiservice-discovery-shared/builders/exec_log_config_structure.rs Exports builder types for command execution configurations.

Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

rs/ic-observability/multiservice-discovery-downloader/src/downloader_loop.rs:107

  • Using unwrap on cli.output_dir.parent() may panic if the output directory has no parent; add a proper check or use a fallback value.
fs_err::create_dir_all(cli.output_dir.parent().unwrap()).unwrap();

rs/ic-observability/multiservice-discovery-downloader/src/downloader_loop.rs:190

  • Deserialization with unwrap in convert_and_filter_general_targets can lead to a runtime panic if the input does not match the expected schema; consider error handling to gracefully skip or report faulty entries.
values.into_iter().map(|value| serde_json::from_value(value).unwrap())

Comment on lines 176 to 180
if SystemTime::now().duration_since(self.last_successful_sync).unwrap() > self.gc_timeout {
self.info("GC elapsed, removing target...");
(self.remove_self)().await;
self.metrics.remove_observed_value(&self.target.name);
break;
Copy link
Preview

Copilot AI Mar 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using unwrap on duration_since can cause a panic if the system clock is adjusted or if an error occurs; consider handling the Err case explicitly.

Suggested change
if SystemTime::now().duration_since(self.last_successful_sync).unwrap() > self.gc_timeout {
self.info("GC elapsed, removing target...");
(self.remove_self)().await;
self.metrics.remove_observed_value(&self.target.name);
break;
match SystemTime::now().duration_since(self.last_successful_sync) {
Ok(duration) => {
if duration > self.gc_timeout {
self.info("GC elapsed, removing target...");
(self.remove_self)().await;
self.metrics.remove_observed_value(&self.target.name);
break;
}
}
Err(e) => {
self.warn(format!("Failed to calculate duration since last successful sync: {:?}", e));
}

Copilot is powered by AI, so mistakes are possible. Review output carefully before use.

Positive Feedback
Negative Feedback

Provide additional feedback

Please help us improve GitHub Copilot by sharing more details about this comment.

Please select one or more of the options
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't make any sense to make the code uglier since the time is never set to anything but SystemTime::now. If this happens ever I would want the app to completely crash since we have a bug and not to just print the error and try and recover.

This should never happen.

Copy link
Contributor

@DFINITYManu DFINITYManu Mar 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the system time is a problem, can you change it to use monotonic time? std::time::Instant. They are guaranteed to be increasing all the time. Plsfix.

#[derive(Parser, Debug)]
struct CliArgs {
/// Storage mode used for general service discovery.
#[arg(default_value_t = StorageMode::InMemory, long, short)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the in-memory storage be the default? What happens with the list of testnets if the service is updated/restarted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where ever we deploy it it will be deployed with a file. This is the default just so that locally when testing something you can quickly do a cargo run and be off to testing and not have to provide the files all the time

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love sane dev defaults.

Copy link
Member

@sasa-tomic sasa-tomic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Copy link
Contributor

@DFINITYManu DFINITYManu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid system clock, use std time instant. These can be compared and have no risk of falling backwards.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants