_detector.GoogleCloudResourceDetector misbehaving on Buildkite #363

Closed
xrmx opened this issue Dec 11, 2024 · 2 comments · Fixed by #366
Labels
bug Something isn't working priority: p2

Comments

@xrmx
Contributor

xrmx commented Dec 11, 2024

I have _detector.GoogleCloudResourceDetector misbehaving when running under Buildkite (CI tool) on GKE.

When I load the resource detector, the process gets restarted and I end up in an infinite loop, with one of these messages from the OTel SDK per run:

Detector <GoogleCloudResourceDetector object at 0x7f8ae0522250> took longer than 5 seconds, skipping

I've debugged this a bit, and something in _gke_resource causes this.

As an experiment, I open-coded the logic into another resource detector and, to my surprise, the following code works fine:

from opentelemetry.sdk.resources import Resource, ResourceDetector


class GoogleDebugCloudResourceDetector(ResourceDetector):
    def detect(self) -> Resource:
        from opentelemetry.resourcedetector.gcp_resource_detector import _metadata, _gke  # , _detector
        from opentelemetry.resourcedetector.gcp_resource_detector._constants import (
            ResourceAttributes,
        )

        try:
            print(_metadata.get_metadata())
        except Exception:
            return Resource.get_empty()

        if _gke.on_gke():
            cluster_location = _metadata.get_metadata()["instance"]["attributes"]["cluster-location"]
            hyphen_count = cluster_location.count("-")
            if hyphen_count == 1:
                zone_or_region_key = ResourceAttributes.CLOUD_REGION
            elif hyphen_count == 2:
                zone_or_region_key = ResourceAttributes.CLOUD_AVAILABILITY_ZONE
            else:
                print("oops no zone_or_region_key")
                zone_or_region_key = "oops"

            cluster_name = _metadata.get_metadata()["instance"]["attributes"]["cluster-name"]
            host_id = str(_metadata.get_metadata()["instance"]["id"])
            return Resource(
                {
                    ResourceAttributes.CLOUD_PLATFORM_KEY: ResourceAttributes.GCP_KUBERNETES_ENGINE,
                    zone_or_region_key: cluster_location,
                    ResourceAttributes.K8S_CLUSTER_NAME: cluster_name,
                    ResourceAttributes.HOST_ID: host_id,
                }
            )

        # Not on GKE: return an empty resource instead of an implicit None.
        return Resource.get_empty()

Unrelated question:

  • WDYT about making _metadata.is_available a wrapper around _metadata.get_metadata() that returns False if an exception is raised? That way we make one fewer HTTP call.
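
A minimal sketch of that suggestion, not the library's actual code: `make_is_available` and `FakeMetadataServer` are hypothetical names, and the sketch assumes the metadata fetch caches its result, which is what would make the separate availability probe redundant.

```python
from typing import Any, Callable, Mapping


def make_is_available(
    get_metadata: Callable[[], Mapping[str, Any]],
) -> Callable[[], bool]:
    """Build an is_available() that reuses get_metadata() as the probe."""

    def is_available() -> bool:
        try:
            get_metadata()
        except Exception:
            # Any failure to fetch metadata is treated as "not available".
            return False
        return True

    return is_available


class FakeMetadataServer:
    """Hypothetical stand-in for the metadata server; counts requests."""

    def __init__(self, reachable: bool) -> None:
        self.reachable = reachable
        self.requests = 0
        self._cache = None

    def get_metadata(self) -> Mapping[str, Any]:
        # Assumption: the fetch is cached, so a second call costs nothing.
        if self._cache is None:
            self.requests += 1
            if not self.reachable:
                raise ConnectionError("metadata server unreachable")
            self._cache = {"instance": {"id": 123}}
        return self._cache
```

With a cached fetch, calling `is_available()` and then `get_metadata()` issues a single request in total; an unreachable server simply yields `False`.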
@xrmx
Contributor Author

xrmx commented Dec 12, 2024

This fails. I see every print, so is _detector._make_resource really the issue?

from opentelemetry.sdk.resources import Resource, ResourceDetector


class GoogleDebugCloudResourceDetector(ResourceDetector):
    def detect(self) -> Resource:
        from opentelemetry.resourcedetector.gcp_resource_detector import _metadata, _gke, _detector
        from opentelemetry.resourcedetector.gcp_resource_detector._constants import (
            ResourceAttributes,
        )

        if not _metadata.is_available():
            return Resource.get_empty()

        print(_metadata.get_metadata())

        if _gke.on_gke():
            print("on gke")
            zone_or_region = _gke.availability_zone_or_region()
            print("got zone and region")
            zone_or_region_key = (
                ResourceAttributes.CLOUD_AVAILABILITY_ZONE
                if zone_or_region.type == "zone"
                else ResourceAttributes.CLOUD_REGION
            )

            cluster_name = _gke.cluster_name()
            print("after cluster name", cluster_name)
            host_id = _gke.host_id()
            print("after host_id", host_id)
            attrs = {
                ResourceAttributes.CLOUD_PLATFORM_KEY: ResourceAttributes.GCP_KUBERNETES_ENGINE,
                zone_or_region_key: zone_or_region.value,
                ResourceAttributes.K8S_CLUSTER_NAME: cluster_name,
                ResourceAttributes.HOST_ID: host_id,
            }
            print("before return")
            return _detector._make_resource(attrs)

        # Not on GKE: return an empty resource instead of an implicit None.
        return Resource.get_empty()

@aabmass
Collaborator

aabmass commented Dec 17, 2024

Is there any chance you have a minimal repro? I'm not familiar with Buildkite, but I haven't seen this issue on GKE in general.

@pintohutch added the "priority: p2" and "bug" (Something isn't working) labels Dec 18, 2024
xrmx added a commit to xrmx/opentelemetry-operations-python that referenced this issue Dec 19, 2024
We should create a Resource instance and not use Resource.create because,
if OTEL_EXPERIMENTAL_RESOURCE_DETECTORS is set, we will go into an
infinite loop trying to load and instantiate all the resource
detectors.

Fix GoogleCloudPlatform#363
xrmx added a commit to xrmx/opentelemetry-operations-python that referenced this issue Dec 24, 2024
We should create a Resource instance and not use Resource.create because,
if OTEL_EXPERIMENTAL_RESOURCE_DETECTORS is set, we will go into an
infinite loop trying to load and instantiate all the resource
detectors.

Fix GoogleCloudPlatform#363
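
The loop described in the commit message can be illustrated with a toy model. This is a simplified sketch, not the SDK or detector source: `ToyResource`, `RecursiveDetector`, `DirectDetector`, `run`, and the explicit `registry` argument are hypothetical, and the only thing taken from the thread is the idea that a create-style factory consults OTEL_EXPERIMENTAL_RESOURCE_DETECTORS and runs every listed detector.

```python
import os

ENV_VAR = "OTEL_EXPERIMENTAL_RESOURCE_DETECTORS"


class ToyResource:
    """Simplified model of a resource; not the real SDK class."""

    def __init__(self, attributes):
        self.attributes = dict(attributes)

    @classmethod
    def create(cls, attributes, registry):
        # Model of a factory that also runs every detector named in the env var.
        merged = {}
        for name in filter(None, os.environ.get(ENV_VAR, "").split(",")):
            detector = registry[name.strip()]()
            merged.update(detector.detect(registry).attributes)
        merged.update(attributes)
        return cls(merged)


class RecursiveDetector:
    """Builds its result via create(), which runs this detector again."""

    def detect(self, registry):
        return ToyResource.create(
            {"cloud.platform": "gcp_kubernetes_engine"}, registry
        )


class DirectDetector:
    """Builds its result with the constructor, so no detectors are re-run."""

    def detect(self, registry):
        return ToyResource({"cloud.platform": "gcp_kubernetes_engine"})


def run(detector_cls):
    """Run create() with one registered detector and report what happens."""
    registry = {"gcp": detector_cls}
    os.environ[ENV_VAR] = "gcp"
    try:
        return ToyResource.create({}, registry).attributes
    except RecursionError:
        return "recursion"
    finally:
        os.environ.pop(ENV_VAR, None)
```

In this model `run(RecursiveDetector)` never terminates on its own and is cut off by Python's recursion limit, while `run(DirectDetector)` returns the attributes. In the real SDK the symptom reported in this issue was different (a 5 second detector timeout and process restarts), so the sketch only shows why constructing the resource directly breaks the cycle.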