
GRPC Demo #798

Open · wants to merge 1 commit into main
Conversation

corps
Member

@corps commented Jun 14, 2024

A fully working gRPC demo. Working on a Notion doc explaining all the parts, open questions, how to improve/expand on it, etc.

Basic idea is that this is a minimal working example of

  1. A Python package that automatically builds mypy-typed gRPC stubs from proto files.
  2. A custom codegen tool that generates type adaptors (demonstrating how writing a custom codegen plugin works, and how adaptors can help us migrate existing endpoints).
  3. A working HTTP/1.1 transport implementation in a basic Python app.
  4. A working envoy proxy configuration backing that.
  5. Mutual TLS authentication backing the gRPC server, along with the ability to inspect the identity of the caller (and thus enforce, for instance, ACLs).

https://www.notion.so/sentry/gRPC-Implementation-For-Seer-cb14fb94e90d4afaa4d80c664e0c9031?showMoveTo=true&saveParent=true

@@ -40,14 +40,6 @@ jobs:
- name: Build image
run: |
make update
# TODO: Re-enable this when json schemas are ready to come back
Member Author

Just removing some cruft, not relevant

@@ -1,4 +1,9 @@
FROM golang:1.20.0 as go-build
RUN mkdir go && cd go && export GOPATH=/go && \
Member Author

Tool that helps set up an mTLS key chain.

Member

With mTLS requiring more infrastructure and cert management, we should also consider having a solution for bearer token based authentication with support for no-downtime token rotation.

Member Author

But here's the thing -- token-based authentication will, I feel, ultimately mean even more infrastructure and management.
Once we start having multiple service-to-service sites, and once we want things like ACLs (for instance, seer shouldn't have access to hybrid cloud endpoints), you need identity management, which tokens don't have baked in. We'd have to /build/ all that out, and we'd /still/ end up needing infra tooling that allows no-downtime token rotation on top of that identity management -- which we'd have to implement in each place.

With mTLS, we have one workflow, with tools that already exist, and it's an industry standard.

Member

We can use JWTs to bake in identities and carry end-user context into downstream services. We could really get a lot of benefit from the added context in a multi-tenant environment and any time we cross some sort of domain trust boundary.

mTLS alone does not lend itself well to nuanced authorization decisions, at least in my opinion. A combination of mTLS for service-to-service authentication (PeerAuthentication) and JWTs to carry end-user context (RequestAuthentication) is more robust than just one or the other. It's a fairly common pattern from my research. AuthorizationPolicies can be used to introspect the JWT's claims and make authorization decisions at the service mesh level.

In terms of added infrastructure, we'd be looking at some sort of trusted Secure Token Service (STS) to support the minting of JWTs. Tokens are issued through standard OAuth flows, and services would just implement JWT verification (if we decided to not just let the service mesh do this) and logic to pass it on where necessary.

Of course, implementing one thing at a time is probably best. 🙂

PROTOS_OUT=protos/build
PROTOS_SOURCE=protos/src
.PHONY: protos
protos: ## Regenerate protobuf python files
Member Author

In this workflow, all it takes is installing (pip install) to get both the protos and the generated files together, making it relatively simple to both deploy and use in development.
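
For concreteness, a hedged sketch of what that generation step can look like when driven from Python with grpcio-tools and mypy-protobuf (the directories mirror PROTOS_SOURCE/PROTOS_OUT above; the actual recipe in this PR's Makefile may differ):

```python
# Sketch only: regenerate message modules, service stubs, and mypy .pyi stubs.
import glob

from grpc_tools import protoc

PROTOS_SOURCE = "protos/src"
PROTOS_OUT = "protos/build"

args = [
    "protoc",
    f"-I{PROTOS_SOURCE}",
    f"--python_out={PROTOS_OUT}",       # *_pb2.py message modules
    f"--grpc_python_out={PROTOS_OUT}",  # *_pb2_grpc.py service stubs
    f"--mypy_out={PROTOS_OUT}",         # .pyi stubs via the mypy-protobuf plugin
    *glob.glob(f"{PROTOS_SOURCE}/**/*.proto", recursive=True),
]
raise SystemExit(protoc.main(args))
```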

@@ -1,5 +1,16 @@
# seer

## gRPC experiment
Member Author

TODO: Will remove, I don't intend to merge this into master. I have some ideas how we can break this out into a workable workflow and separate concerns.

@@ -0,0 +1,70 @@
static_resources:
Member Author

I'm using envoy here because I know it has decent GKE support (https://cloud.google.com/kubernetes-engine/docs/tutorials/exposing-grpc-services-on-gke-using-envoy-proxy)

But to be honest, it wouldn't have to be GKE if we didn't want it to. Here, envoy handles the mTLS and the HTTP/2 -> HTTP/1.1 proxying, but I'm confident both are also configurable in k8s with many other filters. That said, envoy is pretty nice and has a lot of advantages, including decent xDS configuration. For service-to-service at scale, I feel inclined to bring this up.

Member

We have envoy in a few places in our infrastructure already. Using envoy for grpc service discovery and mtls termination should be ok. If we can avoid downgrading to HTTP1.1 I think it would be worth the extra effort.

@@ -0,0 +1,237 @@
import contextlib
Member Author

This is a minimal demo of creating a custom codegen tool. I didn't go out of my way to make it pretty; it could definitely be made nicer if we committed to this approach.
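
For readers who haven't written one before, a protoc plugin is just a program that reads a serialized CodeGeneratorRequest on stdin and writes a CodeGeneratorResponse on stdout. A minimal sketch of that mechanism (not this PR's actual tool, and the emitted adaptor stubs are only illustrative):

```python
#!/usr/bin/env python
# Minimal protoc plugin skeleton: protoc feeds CodeGeneratorRequest on stdin
# and expects CodeGeneratorResponse on stdout.
import sys

from google.protobuf.compiler import plugin_pb2


def main() -> None:
    request = plugin_pb2.CodeGeneratorRequest.FromString(sys.stdin.buffer.read())
    response = plugin_pb2.CodeGeneratorResponse()

    for proto_file in request.proto_file:
        if proto_file.name not in request.file_to_generate:
            continue  # dependencies are in the request but not generated for
        out = response.file.add()
        out.name = proto_file.name.replace(".proto", "_adaptors.py")
        lines = [f"# generated from {proto_file.name}"]
        for message in proto_file.message_type:
            lines.append(f"class From{message.name}Adaptor:")
            for field in message.field:
                lines.append(f"    def adapt_from_{field.name}(self, value): ...")
        out.content = "\n".join(lines) + "\n"

    sys.stdout.buffer.write(response.SerializeToString())


if __name__ == "__main__":
    main()
```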

@@ -0,0 +1,115 @@
"""
Member Author

NOTE: I do not actually intend to commit any of the generated files (everything in protos/py) in practice, but I did so in this demo to demonstrate the output.

self, proto_request: ScoreRequest, context: grpc.ServicerContext
) -> ScoreResponse:
identities = context.peer_identities()
if not identities or b"consumer" not in identities:
Member Author

Just a simple demo of how authorization can use peer information from mTLS.

Member

Having to do authentication checks in each RPC method will be error prone. gRPC has interceptors which seem like a good fit for authentication methods.
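
As a sketch of that suggestion (the interceptor itself is hypothetical, not part of this diff; it assumes the same mTLS setup and only wraps unary-unary handlers to stay short):

```python
import grpc


class MTLSAuthInterceptor(grpc.ServerInterceptor):
    """Reject calls whose mTLS peer identity is not in the allow list."""

    def __init__(self, allowed_identities: frozenset = frozenset({b"consumer"})):
        self._allowed = allowed_identities

    def intercept_service(self, continuation, handler_call_details):
        handler = continuation(handler_call_details)
        if handler is None or handler.unary_unary is None:
            return handler  # streaming handlers are left unwrapped in this sketch

        inner = handler.unary_unary

        def checked(request, context):
            identities = context.peer_identities()
            if not identities or not any(i in self._allowed for i in identities):
                context.abort(grpc.StatusCode.PERMISSION_DENIED, "caller not allowed")
            return inner(request, context)

        return grpc.unary_unary_rpc_method_handler(
            checked,
            request_deserializer=handler.request_deserializer,
            response_serializer=handler.response_serializer,
        )


# server = grpc.server(thread_pool, interceptors=[MTLSAuthInterceptor()])
```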

context.abort(grpc.StatusCode.PERMISSION_DENIED, "only consumer can access")

request = SeverityRequest()
request.adapt_from(proto_request)
Member Author

This is an example of a realistic rollout case: we have an existing endpoint, using pydantic classes, with production clients we want to keep working. We create and use a generated adaptor to convert between protos and the pydantic classes during the rollout, so we can incrementally change the client side to use protos natively. After that, the implementation itself could use the protos natively and possibly drop the JSON endpoints if they are no longer necessary.

It is also possible to use native proto <-> json format tools, but there are some caveats we should discuss. I think in either case these adaptors are safe tools for rolling out to existing endpoints.
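
For reference, the "native proto <-> json format tools" are along the lines of google.protobuf.json_format. A small sketch of that alternative and two of its caveats, assuming the ScoreRequest message from this PR's protos:

```python
from google.protobuf import json_format

from sentry_services.seer.severity_pb2 import ScoreRequest

proto = ScoreRequest(message="boom", has_stacktrace=1)

# Caveat 1: keys are camelCased by default ("hasStacktrace");
# preserving_proto_field_name=True keeps the snake_case names the existing
# pydantic models expect.
payload = json_format.MessageToDict(proto, preserving_proto_field_name=True)

# Caveat 2: unset fields are omitted by default, so the JSON shape can differ
# from what existing clients send today.
roundtripped = json_format.ParseDict(payload, ScoreRequest())
```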


from seer.severity.severity_inference import SeverityRequest


Member Author

In this case, I've implemented a custom Flask transport for gRPC. Because we are using an HTTP/2 -> HTTP/1.1 bridge, we end up needing to implement some gRPC server details here to support that. To use the native gRPC server, we'd have to support HTTP/2, and we'd likely have to run separate servers in situations where we are deploying it alongside an existing Django or Flask app.

This compromise demonstrates that if we really want to commit to one server for both gRPC and our existing HTTP/1.1 endpoints, it is possible, with a modest amount of legwork.
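
For context on what "some gRPC server details" means: the gRPC wire format length-prefixes every message (a 1-byte compressed flag, a 4-byte big-endian length, then the serialized protobuf), so a bridge endpoint has to frame and deframe request/response bodies itself. A sketch of just that framing piece (not the PR's actual Flask transport):

```python
import struct


def frame_grpc_message(payload: bytes, compressed: bool = False) -> bytes:
    """Wrap a serialized protobuf in gRPC's length-prefixed framing."""
    return struct.pack(">BI", 1 if compressed else 0, len(payload)) + payload


def unframe_grpc_message(body: bytes) -> bytes:
    """Extract the first message from a gRPC-framed body."""
    flag, length = struct.unpack(">BI", body[:5])
    if flag:
        raise NotImplementedError("compressed messages are out of scope for this sketch")
    return body[5 : 5 + length]
```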

Member

I think in production we would want separate deployments for RPC and web traffic. However, in local development having a single server would be nice to have.

# response.raise_for_status()
# print(response.json())

with grpc.secure_channel(
Member Author

Fortunately, gRPC clients will still natively use HTTP/2 and do not need to be aware of how we implement the HTTP/1.1 bridge.
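
For completeness, the client side of that channel looks roughly like this (the cert paths and stub name are placeholders, not necessarily what the demo uses):

```python
import grpc

# Hypothetical paths; the demo's key chain comes from the mTLS tooling above.
with open("certs/ca.pem", "rb") as f:
    ca = f.read()
with open("certs/consumer.key", "rb") as f:
    key = f.read()
with open("certs/consumer.pem", "rb") as f:
    chain = f.read()

credentials = grpc.ssl_channel_credentials(
    root_certificates=ca,
    private_key=key,          # presenting a client cert is what makes this mutual TLS
    certificate_chain=chain,
)

with grpc.secure_channel("localhost:8443", credentials) as channel:
    # e.g. stub = severity_pb2_grpc.SeverityServiceStub(channel); stub.Score(request)
    pass
```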

Member Author

Note: it's also possible to use envoy on the client side as well, meaning that application code would not need its own credentials files. In this sort of deployment, the envoy sidecar container holds the actual client TLS configuration, and the application code opens an insecure channel directly to the sidecar, which handles mTLS on its behalf.

pred = self.classifier.predict_proba(input_data)[0][1]

return SeverityResponse(severity=round(min(1.0, max(0.0, pred)), 2))
return SeverityResponse(severity=0.6)
Member Author

Just to make the demo simpler, disabling the actual inference.

if not identities or b"consumer" not in identities:
context.abort(grpc.StatusCode.PERMISSION_DENIED, "only consumer can access")

request = SeverityRequest()


Seems like you could almost make this a middleware or passthrough annotation that handles conversion for you if necessary
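
A rough sketch of what that passthrough annotation could look like, building on the generated adapt_from/apply_to methods (the decorator and the aggregate apply_to() call are hypothetical):

```python
import functools


def adapts(domain_request_cls, proto_response_cls):
    """Hypothetical decorator: build the domain model from the proto before the
    handler runs, then map the domain response back onto a fresh proto."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, proto_request, context):
            domain_request = domain_request_cls()
            domain_request.adapt_from(proto_request)   # generated adaptor mixin
            domain_response = fn(self, domain_request, context)
            proto_response = proto_response_cls()
            domain_response.apply_to(proto_response)   # assumed aggregate apply_to()
            return proto_response

        return wrapper

    return decorator


# Usage, with names from the surrounding diff:
# @adapts(SeverityRequest, ScoreResponse)
# def score(self, request, context): ...
```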

@@ -0,0 +1,115 @@
"""
@generated by mypy-protobuf. Do not edit manually!
Member

Aren't these classes generated by your code?

Comment on lines +96 to +100
self.apply_to_message(proto)
self.apply_to_has_stacktrace(proto)
self.apply_to_handled(proto)
self.apply_to_trigger_timeout(proto)
self.apply_to_trigger_error(proto)
Member

Should these method calls also be passing the val parameter?

Comment on lines +11 to +33
class FromScoreRequestAdaptor:
def adapt_from_message(self, value: builtins.str):
pass

def adapt_from_has_stacktrace(self, value: builtins.int):
pass

def adapt_from_handled(self, value: builtins.bool):
pass

def adapt_from_trigger_timeout(self, value: builtins.bool):
pass

def adapt_from_trigger_error(self, value: builtins.bool):
pass

def adapt_from(self, proto: sentry_services.seer.severity_pb2.ScoreRequest):
self.adapt_from_message(proto.message)
self.adapt_from_has_stacktrace(proto.has_stacktrace)
if proto.HasField("handled"):
self.adapt_from_handled(proto.handled)
self.adapt_from_trigger_timeout(proto.trigger_timeout)
self.adapt_from_trigger_error(proto.trigger_error)
Member

Did you consider generating 'mapping functions' that accept a target object and a proto request, and return an updated target object? Currently we'll be adding many methods to domain models. A mapping function/builder wouldn't add methods to existing objects and should have similar type safety.
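
To make that alternative concrete, a generated mapping function might look something like this (hypothetical output, not what the PR generates today; it assumes the pydantic .copy(update=...) API on the domain model):

```python
def score_request_to_severity_request(
    target: "SeverityRequest",
    proto: "ScoreRequest",
) -> "SeverityRequest":
    """Return an updated copy of `target` instead of adding methods to it."""
    updates = {
        "message": proto.message,
        "has_stacktrace": proto.has_stacktrace,
        "trigger_timeout": proto.trigger_timeout,
        "trigger_error": proto.trigger_error,
    }
    if proto.HasField("handled"):
        updates["handled"] = proto.handled
    return target.copy(update=updates)
```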

Member Author

🤷 Sure, I mean, there's tons of things we can do. My point with the demo here is merely to demonstrate that working solutions can be derived from a ~100 line script that makes the domain <-> wire transformation easier. When we get to a point of actual implementation, I figure it's something that we iterate on more closely before we declare it standardized.

Member

> there's tons of things we can do. My point with the demo here is merely to demonstrate that working solutions can be derived from a ~100 line script that makes the domain <-> wire transformation easier. When we get to a point of actual implementation, I figure it's something that we iterate on more closely before we declare it standardized.

That's fair. I wasn't sure if this was the design you were planning to move forward with.

Member Author

I will say this. There are 2 main gotchas we have to consider that drove my current design, whether or not I stick to it:

  1. It is super important that adding new fields to protos and generating a new adaptor cannot break existing code. Because I imagine protos being generated in repos separately from their consumers, there can be version drift by default, which is actually good. But this also means the default generated code cannot presume any behavior if and when domain model and proto versions drift (either can be updated before the other). In this case, the adaptor's default behavior is a no-op unless specifically wired (see the sketch below).

  2. It won't always be the case that straight assignment works; types may need further, deeper transformation, so a simple assignment sometimes won't be enough. That's probably fine, as the domain model can write setters to receive attributes, and I suspect that is still the exception rather than the rule.
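
A small sketch of gotcha (1) in practice, assuming the pydantic SeverityRequest mixes in the generated adaptor (the import path and class layout here are guesses, purely illustrative):

```python
from pydantic import BaseModel

from protos.py.sentry_services.seer.severity_adaptors import FromScoreRequestAdaptor  # path is a guess


class SeverityRequest(BaseModel, FromScoreRequestAdaptor):
    message: str = ""
    has_stacktrace: int = 0

    def adapt_from_message(self, value: str) -> None:
        self.message = value                 # explicitly wired

    def adapt_from_has_stacktrace(self, value: int) -> None:
        self.has_stacktrace = value          # explicitly wired

    # adapt_from_handled / adapt_from_trigger_* stay as the inherited no-ops, so a
    # proto generated from a newer .proto revision still adapts without breaking.
```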
