Experience running this plugin with gRPCRoutes in a linkerd-meshed cluster #75

Open
FredrikAugust opened this issue Aug 30, 2024 · 10 comments

@FredrikAugust
Contributor

FredrikAugust commented Aug 30, 2024

It appears we're among the first to test this plugin with Linkerd and GRPCRoutes, so I thought I'd share what we've learned in case it helps others.

We're running a custom build (straight from trunk) hosted in S3 and injecting it into Argo Rollouts via the Helm chart:

        controller:
          trafficRouterPlugins:
            trafficRouterPlugins: |-
              - name: "argoproj-labs/gatewayAPI"
                location: "https://********/gatewayapi-plugin-linux-amd64"

Our grpcRoutes look something like this:

apiVersion: gateway.networking.k8s.io/v1alpha2
kind: GRPCRoute
metadata:
  annotations:
    retry.linkerd.io/grpc: cancelled,deadline-exceeded,resource-exhausted,unavailable,internal
    retry.linkerd.io/limit: "3"
    retry.linkerd.io/timeout: 300ms
  name: abc-grpc-route-query
  namespace: abc
spec:
  parentRefs:
  - group: ""
    kind: Service
    name: init
    port: 80
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: xyz
      port: 80
      weight: 100
    - group: ""
      kind: Service
      name: xyz-canary
      port: 80
      weight: 0
    matches:
    - method:
        method: xyz
        service: abc.xyz
        type: Exact

Our rollouts have the following canary strategy configuration:

      trafficRouting:
        plugins:
          argoproj-labs/gatewayAPI:
            grpcRoutes:
            - name: xyz-read-grpc-route-query
            - name: xyz-read-grpc-route-command
            namespace: init

We just rolled this out to our staging cluster, and the GRPCRoutes seem to update in real time, as they're supposed to. I'm going to try to get some metrics from Linkerd to see how it all works and post them here within a couple of days.

We're running the linkerd-enterprise-control-plane Helm chart version 2.16, which introduced support for retries in GRPCRoutes. That was our motivation for migrating everything over to the new Gateway API.

@FredrikAugust
Contributor Author

One thing we did run into while setting this up was a CRD incompatibility between linkerd-crds and this plugin: linkerd-crds installs httpRoute v1alpha2, whereas this plugin expects v1. This was relatively easy to work around, because Traefik, which we also use, ships with the v1 CRDs.

@FredrikAugust
Contributor Author

So we've deployed all of this to our cluster, but I'm having a hard time verifying that it's working, as we only use the HTTP/GRPC routes for traffic routing during canary rollouts. Our linkerd-proxy metrics show no data for routes. I'm going to try to create a minimal repro locally with kind to see if it actually uses the HTTP/GRPCRoutes.

@FredrikAugust
Contributor Author

FredrikAugust commented Sep 2, 2024

Well I just got to try HTTPRoute using the podinfo docker image to verify, and it seems to work very nicely with linkerd. I've created a simple repro here which is a bit messy, but it works 😅 https://github.com/kvist-no/linkerd-gateway-api-repro.

The way it works:

  • traefik with httproute pointing to stable svc
  • create a httproute with parentref pointing to stable svc and backendrefs stable and canary svc
  • use this plugin to point to the latter httproute

With this, it seems to work flawlessly! I'm going to test grpcroutes next and ensure they work as well
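
For anyone wanting to reproduce the second bullet, a minimal sketch of such an HTTPRoute could look roughly like this. The route name, namespace, Service names and port are assumptions for illustration (9898 is podinfo's default HTTP port) and are not copied from the repro repo. The `group: ""` / `kind: Service` form mirrors the GRPCRoute shown earlier in this thread.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: podinfo-canary-route   # hypothetical name
  namespace: default           # hypothetical namespace
spec:
  parentRefs:
  # The route is attached to the stable Service rather than to a Gateway
  - group: ""
    kind: Service
    name: podinfo-stable       # hypothetical stable Service
    port: 9898
  rules:
  - backendRefs:
    # The plugin rewrites these two weights during a rollout
    - group: ""
      kind: Service
      name: podinfo-stable
      port: 9898
      weight: 100
    - group: ""
      kind: Service
      name: podinfo-canary     # hypothetical canary Service
      port: 9898
      weight: 0
```

The starting weights of 100/0 match the idle state in which all traffic goes to stable.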

@Philipp-Plotnikov
Collaborator

Thank you @FredrikAugust for feedback!🙏

@FredrikAugust
Contributor Author

FredrikAugust commented Sep 2, 2024

No worries. The status now is that I've confirmed it works fine with Traefik -> Linkerd + this plugin with Argo Rollouts. What's missing is testing that gRPC works as it should, which is a little trickier because Traefik doesn't support GRPCRoutes as of now.

I'll try to test this tomorrow by running a simple application which connects to stable and just calls the Info service of podinfo (which returns the hostname) over and over and logs the result. That way it should be easy to see that the split is ~50/50, that canary deploys are working in terms of the traffic splits, and that the retries are working as they should for gRPC (I've confirmed they're good for HTTP).

@FredrikAugust
Contributor Author

Okay, so I got around to creating the helper tool: https://github.com/kvist-no/grpc-lb-tester.

It does two things. Every n seconds it sends two gRPC queries to the podinfo backends:

  • Info to get the hostname (useful for testing that canary routing works as expected)
  • Status with code: Unavailable (useful for testing retry.linkerd.io functionality)

And the verdict is that it all seems to work.

[Image: Screenshot 2024-09-03 at 11 27 25]

I first set up two backends for stable, and ensured that they each got ~50% of the traffic (this is controlled by Linkerd's load-balancing algorithm). Then I triggered a rollout upgrade and set the steps to:

- weight = 50%
- pause

When it paused at 50%, I verified that, again, the single canary pod got ~50% of the traffic and the two stable pods shared the other 50%.
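
In Rollout terms, the two steps above roughly correspond to the canary strategy sketched here. `setWeight` and `pause` are standard Argo Rollouts step types, and a `pause` without a duration waits until the rollout is promoted manually. The Service names reuse xyz and xyz-canary from the GRPCRoute earlier in the thread. The route name and namespace are placeholders that would need to match your own setup.

```yaml
strategy:
  canary:
    stableService: xyz            # Service behind the stable backendRef
    canaryService: xyz-canary     # Service behind the canary backendRef
    trafficRouting:
      plugins:
        argoproj-labs/gatewayAPI:
          grpcRoutes:
          - name: my-grpc-route   # placeholder route name
          namespace: my-namespace # placeholder namespace
    steps:
    - setWeight: 50               # shift 50% of the traffic to canary
    - pause: {}                   # wait for a manual promote
```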

Then I promoted the rollout and the weight flipped to 100% stable and 0% canary, and the traffic was routed accordingly. I also tried undoing a rollout and that seemed to work fine.

Secondly, I tested the retry.linkerd.io functionality, which I wasn't sure would work as Linkerd and Traefik use different CRDs for HTTPRoutes, but it also worked fine.

For testing HTTP, I simply ran time curl localhost:8000/status/500 and saw that the response times increased as I upped the retry count for the HTTPRoute, and for gRPC I did the same, only looking at the time it took for my helper tool to get a response (see image).

[Image: Screenshot 2024-09-03 at 11 45 43]
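
For reference, the HTTP retry setup exercised by the curl test can be sketched as annotations on the HTTPRoute. The keys follow the same retry.linkerd.io scheme as the GRPCRoute earlier in the thread, with retry.linkerd.io/http as the HTTP counterpart of retry.linkerd.io/grpc. The route name and the values shown are assumptions for illustration, not our exact config.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: podinfo-canary-route          # hypothetical name
  annotations:
    retry.linkerd.io/http: 5xx        # retry on 5xx responses, e.g. /status/500
    retry.linkerd.io/limit: "3"       # the value that was raised between curl runs
    retry.linkerd.io/timeout: 300ms   # per-attempt timeout
# spec (parentRefs, rules) omitted; only the annotations matter here
```

Raising the limit means more attempts before the failing response is returned, which is why the measured response time grows with it.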

I don't think there is anything left to test from the subset of functionality that we will use, but I can loop back if we encounter any problems in production. Thank you for the great plugin, it works very well! I hope we can see a release of the gRPC functionality soon 🙌

@FredrikAugust
Contributor Author

I'll update this once I get a response from Linkerd regarding support for GRPCRoute v1.

@FredrikAugust
Contributor Author

So I've gotten a response from Linkerd: they want to support v1, but don't have a timeline for it yet. See linkerd/linkerd2#13032

@kostis-codefresh
Collaborator

kostis-codefresh commented Sep 9, 2024

@FredrikAugust 0.4.0 was just released and it includes grpc support https://github.com/argoproj-labs/rollouts-plugin-trafficrouter-gatewayapi/releases/tag/v0.4.0

@FredrikAugust
Contributor Author

Awesome, @kostis-codefresh! I don't think we'll be able to test it before Linkerd upgrades to stable though, unless there is a way to configure the version used in this plugin, which I don't think there is.

Would it be a bad idea to allow the API version to be controlled through an environment variable?
