startup and cfg issues with opamp supervisor (last_recv_remote_config.dat, opamp_server_port) #36196

Closed
cforce opened this issue Nov 5, 2024 · 7 comments
Labels
bug Something isn't working cmd/opampsupervisor needs triage New item requiring triage

Comments

@cforce

cforce commented Nov 5, 2024

Component(s)

cmd/opampsupervisor

What happened?

I built the collector and the supervisor from tag 0.112.0 and observed the following issues:

user@:~/collector# ./supervisor --config=supervisor.yaml
{"level":"info","ts":1730782759.1137323,"caller":"supervisor/supervisor.go:202","msg":"Supervisor starting","id":"0192fab1-0864-75df-8ee0-a58b065f70e2"}
{"level":"info","ts":1730782759.1137323,"caller":"supervisor/supervisor.go:202","msg":"Supervisor starting","id":"0192fab1-0864-75df-8ee0-a58b065f70e2"}
{"level":"error","ts":1730782759.1138253,"caller":"supervisor/supervisor.go:778","msg":"error while reading last received config","error":"open last_recv_remote_config.dat: no such file or directory","stacktrace":"github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.(*Supervisor).loadAndWriteInitialMergedConfig\n\t/builds/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor/supervisor.go:778\ngithub.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.(*Supervisor).Start\n\t/builds/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor/supervisor.go:205\nmain.runInteractive\n\t/builds/opentelemetry-collector-contrib/cmd/opampsupervisor/main.go:43\nmain.run\n\t/builds/opentelemetry-collector-contrib/cmd/opampsupervisor/main_others.go:9\nmain.main\n\t/builds/opentelemetry-collector-contrib/cmd/opampsupervisor/main.go:19\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:272"}
{"level":"error","ts":1730782759.1138253,"caller":"supervisor/supervisor.go:778","msg":"error while reading last received config","error":"open last_recv_remote_config.dat: no such file or directory","stacktrace":"github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.(*Supervisor).loadAndWriteInitialMergedConfig\n\t/builds/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor/supervisor.go:778\ngithub.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.(*Supervisor).Start\n\t/builds/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor/supervisor.go:205\nmain.runInteractive\n\t/builds/opentelemetry-collector-contrib/cmd/opampsupervisor/main.go:43\nmain.run\n\t/builds/opentelemetry-collector-contrib/cmd/opampsupervisor/main_others.go:9\nmain.main\n\t/builds/opentelemetry-collector-contrib/cmd/opampsupervisor/main.go:19\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:272"}
{"level":"info","ts":1730782759.1157913,"caller":"supervisor/supervisor.go:989","msg":"No config present, not starting agent."}
{"level":"info","ts":1730782759.1157913,"caller":"supervisor/supervisor.go:989","msg":"No config present, not starting agent."}

1.) Looking at the code, it does not seem to tolerate the case where the file "last_recv_remote_config.dat" does not exist.
This can happen on a fresh initial start, when no config has ever been downloaded and cached. Is there a test covering this scenario?

func (s *Supervisor) loadAndWriteInitialMergedConfig() error {

2.) The supervisor should be fine with 1.) and keep retrying to connect to the OpAMP backend until it is reachable, instead of logging "..not starting agent.". The agent (the supervisor acting as client to the OpAMP backend) should retry the connection continuously and never end up in a failed state where the agent is not started. 1.) is not a startup failure state; I would call it a bug. Also, in any real startup failure state (e.g. the supervisor.yaml config file can't be read), the supervisor should terminate by exiting instead of only writing error logs.

3.) But maybe I am misreading the info message "No config present, not starting agent.", because the collector actually is still started, but with the wrong OpAMP port config: one that differs from what I configured in supervisor.yaml.

supervisor.yaml

agent:
  executable: ./collector
  opamp_server_port: 12548
  health_check_port: 12547

After the initial start of the supervisor (see 1.), the generated effective.yaml looks like the one below.
As you can see, the OpAMP server endpoint does NOT use the configured "opamp_server_port: 12548" but still uses
a random port, 36455.

exporters:
    nop: null
extensions:
    health_check:
        endpoint: localhost:12547
    opamp:
        instance_uid: 0192fb21-2782-7c4a-b29a-96d36787d53e
        ppid: 40915
        ppid_poll_interval: 5s
        server:
            ws:
                endpoint: ws://127.0.0.1:36455/v1/opamp
                tls:
                    insecure: true
receivers:
    nop: null
service:
    extensions:
        - health_check
        - opamp
    pipelines:
        logs:
            exporters:
                - nop
            receivers:
                - nop
    telemetry:
        logs:
            encoding: json
        resource:
            host.arch: arm64
            host.name: XXX
            os.description: YYYYY
            os.type: linux
            service.instance.id: 0192fb21-2782-7c4a-b29a-96d36787d53e
            service.name: collector
            service.version: 0.112.0
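For anyone who wants to check this mismatch mechanically, the following sketch compares the port in the `endpoint` value of the effective config with the configured `opamp_server_port`. `endpointUsesPort` is a hypothetical helper for illustration only, and the endpoint string is copied by hand from the YAML above rather than parsed from the file.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// endpointUsesPort is a hypothetical helper. It reports whether the port in
// a ws:// endpoint URL equals the expected port.
func endpointUsesPort(endpoint string, want int) (bool, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return false, fmt.Errorf("parse endpoint: %w", err)
	}
	got, err := strconv.Atoi(u.Port())
	if err != nil {
		return false, fmt.Errorf("endpoint %q has no numeric port: %w", endpoint, err)
	}
	return got == want, nil
}

func main() {
	// Endpoint from the generated effective.yaml, port from supervisor.yaml.
	ok, err := endpointUsesPort("ws://127.0.0.1:36455/v1/opamp", 12548)
	fmt.Printf("matches configured port: %t (err=%v)\n", ok, err)
}
```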

Collector version

0.112.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

related
#36001
ArthurSens@65992b7
@djaglowski fyi

@cforce cforce added bug Something isn't working needs triage New item requiring triage labels Nov 5, 2024
Contributor

github-actions bot commented Nov 5, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@cforce
Author

cforce commented Nov 7, 2024

".. continuously and never get in the situation to end up in a failed state where agent is no started. 1.) is no startup fail state - it is a bug i would say. "

Revisiting this sentence, I now think this should be handled like a "noop" config case: having received nothing yet is equivalent to having received a noop config. Defaulting to noop on the initial run is a valid "good case", and the supervisor simply waits until it is connected.

An initial state can occur when the agent ID file does not exist; this should fall back to noop.
If the agent ID file exists but last_recv_remote_config.dat does not, this should also fall back to noop until the OpAMP server is connected and config sync can start.
From my view, none of these cases is worth an error log, as they are all good cases.
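The fallback rules described above can be sketched as a small decision function. Everything here is hypothetical: `startupState`, `initialConfig`, and the contents of `noopConfig` are made up for illustration and do not reflect how the supervisor composes its config.

```go
package main

import "fmt"

// noopConfig is a placeholder config (an assumption for this sketch) built
// from the nop receiver and exporter that appear in the effective.yaml above.
const noopConfig = `receivers:
  nop:
exporters:
  nop:
service:
  pipelines:
    logs:
      receivers: [nop]
      exporters: [nop]
`

// startupState describes what the supervisor finds on disk at startup.
type startupState struct {
	agentIDFileExists bool
	lastRemoteConfig  []byte // nil when last_recv_remote_config.dat is absent
}

// initialConfig returns the config to start with. fallbackReason is empty
// when a cached remote config is used; otherwise it explains the noop
// fallback, which is a normal case and not an error.
func initialConfig(s startupState) (cfg string, fallbackReason string) {
	switch {
	case !s.agentIDFileExists:
		return noopConfig, "no agent ID file yet"
	case len(s.lastRemoteConfig) == 0:
		return noopConfig, "no last_recv_remote_config.dat yet"
	default:
		return string(s.lastRemoteConfig), ""
	}
}

func main() {
	_, reason := initialConfig(startupState{agentIDFileExists: true})
	fmt.Println("fallback reason:", reason)
}
```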

@cforce
Author

cforce commented Nov 7, 2024

After the initial start of the supervisor (see 1.), the generated effective.yaml looks like the one below.
As you can see, the OpAMP server endpoint does NOT use the configured "opamp_server_port: 12548" but still uses
a random port, 36455.

If I terminate the process and start the supervisor again, it seems to clean up (the ports in the effective config YAML are correct then, and the collector is able to connect). But this manual workaround will not happen in a standard workflow. It should behave on the first initial start as it does on the second.

@dpaasman00
Contributor

dpaasman00 commented Nov 7, 2024

Hey Fabian, so looking into this I think there's a couple things going on.

  1. For the error log that's output when there is no last_recv_remote_config.dat file present, I agree this shouldn't be reported as an error. This is expected and will always occur whenever the supervisor is started and it hasn't received a config file from an OpAMP server for the collector. Related to this, I recently opened this issue which would change how the supervisor handles the collector's config file.
  2. For the opamp_server_port parameter issues you were having, I believe this is because you were using a build of the supervisor based on the v0.112.0 release. That PR was merged after the v0.112.0 release, so that parameter isn't available. However it is included in release v0.113.0 which was just released yesterday, and is available here. There shouldn't be an issue using that parameter on this release. This does raise a follow up issue though that the supervisor didn't throw an error when you tried using a parameter it doesn't recognize in that build. I'll open up a bug issue for this.

Let me know if this helps or if there's anything else in your post that I missed!

Edit: fix link to issue in bullet point 1.

@cforce
Author

cforce commented Nov 7, 2024

Thanks for looking into it. I will test 0.113.0 soon and check whether the port config works for me as well.

@cforce
Author

cforce commented Nov 16, 2024

Tested and working

@cforce cforce closed this as completed Nov 16, 2024
@philcook-oiq

Awesome!
