
Make Keylime easily deployable on Kubernetes/Openshift #1

Open · maugustosilva opened this issue Jun 2, 2023 · 2 comments

@maugustosilva (Contributor):
After several discussions with @mpeters @ansasaki Lukas Vrabec @galmasi and Marcus Hesse, we collectively decided that the time to have Keylime easily deployed on Kubernetes/Openshift has come. I propose we use this issue to concentrate all the relevant discussion on this topic.

I will start by listing some common relevant points, and I do thank Marcus Hesse for starting the discussion on the keylime-operator on CNCF's Slack. I believe I have addressed most of your questions in this writeup.

The main goal is to end with an "Attestation Operator", which can not only automatically add nodes (i.e., agents) to specific verifiers but can also properly react to administrative activities such as node reboots or cordoning off.

I am not a Kubernetes/Openshift expert by any means, so my proposal here is bound to be incomplete/incorrect; additions/corrections are welcome. That being said, I see the following set of intermediate steps, in increasing order of complexity, as a good way to achieve our goal.

  1. Ensure that all keylime components can be fully executed in a containerized manner. For this, the following requirements should be satisfied:
    a. Unmodified public images. I suggest we expand https://quay.io/organization/keylime (under Red Hat's control), already offering the "latest" verifier, registrar and tenant, to also include the rust agent image (@ansasaki is pursuing this)
    b. Carefully determine the minimal set of (container) privileges required to run the agent (see the sketch after this list)
    c. Provide some tool to perform containerized keylime deployments (@maugustosilva and @galmasi have a tool, which is about to be released as open source, to perform this task).
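
To make (b) above concrete, here is a minimal sketch of what the agent's device access could look like in a DaemonSet; the image name, device path, and privileged securityContext are assumptions and a starting point, not a vetted least-privilege profile:

```yaml
# Hypothetical sketch: image name, device path and privileges are assumptions,
# not a vetted least-privilege profile.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: keylime-agent
spec:
  selector:
    matchLabels:
      app: keylime-agent
  template:
    metadata:
      labels:
        app: keylime-agent
    spec:
      containers:
        - name: agent
          image: quay.io/keylime/keylime_agent:latest  # assumed image name/tag
          securityContext:
            privileged: true  # starting point; to be narrowed per (b)
          volumeMounts:
            - name: tpm
              mountPath: /dev/tpmrm0
      volumes:
        - name: tpm
          hostPath:
            path: /dev/tpmrm0  # in-kernel TPM resource manager device
            type: CharDevice
```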

  2. Create a simple Kubernetes application for keylime. At this point, we should be able to start by writing progressively more elaborate yaml files

    a. The idea is to start with a very simple deployment with the following objects (a minimal manifest sketch appears at the end of this item):
    * A StatefulSet (initially with 1 replica) for the Registrar
    * A StatefulSet (initially with 1 replica) for the Verifier
    * A DaemonSet for the Agents
    * Both Registrar and Verifier exposed as a Service (type=NodePort)
    * mTLS certificates stored as Secrets
    * Given that keylime can be fully configured via environment variables, we shall drive all configuration through environment variables in our yaml.

    b. Initially, I propose we adopt the following simplifying boundary conditions:
    * Given the use of sqlite, we can start without any DB deployment
    * mTLS certificates are pre-generated (with keylime_ca commands) and added to the Kubernetes cluster
    * Environment variables will also be set and maintained by some external tool
    * The tenant will NOT be part of the initial deployment.
    * Make use of "Node Feature Discovery" to label all the nodes with tpm devices (and make that label part of the DaemonSet node selector)

    c. From this point we should expand to a "scale-out" deployment:
    * Multiple Registrars and Verifiers
    * A pre-packaged helm deployment of some SQL database server will be used.
    * A Service (type=LoadBalancer)

    d. At this point, the following technical considerations should be made:
    * I am hoping we can "get away" with a pre-packaged n-way replicated SQL DB server.
    * Verifiers are identified by a "verifier ID", which I assume can be taken from the "persistent identifier within a StatefulSet"
    * The load balancing algorithm will have to use the URI (which contains the agent UUID) for the selection of the backend (i.e., we cannot use round-robin or source IP, given that presently a single tenant will add all the agents to the set of verifiers)
    * The tenant is still considered a component outside of the whole deployment
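
A minimal sketch of the item (a) objects for the Verifier, under the item (b) boundary conditions; every name here (image, ConfigMap, Secret, mount path) is an illustrative assumption:

```yaml
# Illustrative sketch only: image, ConfigMap and Secret names, mount path and
# port are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: keylime-verifier
spec:
  serviceName: keylime-verifier
  replicas: 1
  selector:
    matchLabels:
      app: keylime-verifier
  template:
    metadata:
      labels:
        app: keylime-verifier
    spec:
      containers:
        - name: verifier
          image: quay.io/keylime/keylime_verifier:latest  # assumed image name/tag
          envFrom:
            - configMapRef:
                name: keylime-config  # env-var driven configuration, maintained externally
          volumeMounts:
            - name: mtls-certs
              mountPath: /var/lib/keylime/cv_ca  # assumed certificate location
              readOnly: true
      volumes:
        - name: mtls-certs
          secret:
            secretName: keylime-mtls  # pre-generated with keylime_ca
---
apiVersion: v1
kind: Service
metadata:
  name: keylime-verifier
spec:
  type: NodePort
  selector:
    app: keylime-verifier
  ports:
    - port: 8881
      targetPort: 8881  # keylime verifier's default port
```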

  3. Create an Operator for keylime. My experience writing operators is fairly limited, but I will point out some of the desirable characteristics:

    * Ability to automatically generate all pertinent certificates
    * Ability to deal with environment variables
    * Ability to automatically add agents to verifiers
    * Ability to react to administrative tasks on nodes, such as reboots, draining, and cordoning off.
  4. Make the Operator more "production-ready"

    * How to deal with (measured boot and runtime/IMA) policies?
    * How to deal with "scale-out" operations (i.e., if the number of verifier pods increases, should we perform "rebalancing")?
    * How to integrate "durable attestation" into this scenario?
  5. The majority of the aforementioned stakeholders (@maugustosilva @mpeters @ansasaki Lukas Vrabec @galmasi and Marcus Hesse) voted for having this work developed in a new repository within the keylime project. I will create such a repository.

@mheese (Contributor) commented Jun 5, 2023:

> The main goal is to end with an "Attestation Operator", which can not only automatically add nodes (i.e., agents) to specific verifiers but can also properly react to administrative activities such as node reboots or cordoning off.

Listing the goal/purpose of the operator is a great idea. We should place this in the README for everyone to see immediately.

> 1. Ensure that all `keylime` components can be fully executed in a containerized manner. For this, the following requirements should be satisfied:
>    a. Unmodified public images. I suggest we expand https://quay.io/organization/keylime (under Red Hat's control), already offering the "latest" `verifier`, `registrar` and `tenant`, to also include the rust `agent` image (@ansasaki is pursuing this)
>    b. Carefully determine the minimal set of (container) privileges required to run the `agent`

@ansasaki are you actively working on this? If not, this is a good task for me to take on.

> c. Provide some tool to perform containerized `keylime` deployments (@maugustosilva and @galmasi have a tool, which is about to be released as open source, to perform this task).

@maugustosilva I assume this is for containerized deployments outside of Kubernetes?

> 2. Create a simple Kubernetes application for `keylime`. At this point, we should be able to start by writing progressively more elaborate `yaml` files
>    a. The idea is to start with a very simple deployment with the following objects:
>    * A `StatefulSet` (initially with 1 replica) for the `Registrar`
>    * A `StatefulSet` (initially with 1 replica) for the `Verifier`
>    * A `DaemonSet` for the `Agents`
>    * Both `Registrar` and `Verifier` exposed as a `Service` (`type=NodePort`)
>    * mTLS certificates stored as `Secrets`
>    * Given that `keylime` can be fully configured via environment variables, we shall drive all configuration through environment variables in our yaml.
>    b. Initially, I propose we adopt the following simplifying boundary conditions:
>    * Given the use of `sqlite`, we can start without any DB deployment
>    * mTLS certificates are pre-generated (with `keylime_ca` commands) and added to the Kubernetes cluster
>    * Environment variables will also be set and maintained by some external tool
>    * The `tenant` will NOT be part of the initial deployment.
>    * Make use of "Node Feature Discovery" to label all the nodes with `tpm` devices (and make that label part of the `DaemonSet` node selector)

I like the idea of the initial boundary conditions; it will make it a lot easier to make progress. Here are some questions/comments I have:

  * we should provide the deployment as a Helm chart
  * we could easily make the Helm chart accessible as ORAS artifacts, like the container images, over the quay.io registry
  * does the registrar really need to be a StatefulSet as well? If yes, why? I thought its design is "stateless", and a Deployment could be enough
  * the Verifier as a StatefulSet is unfortunately probably required when it is being scaled, because a specific verifier "owns" agents. As we are generally redesigning some things, this is IMHO something we should pay attention to so that we can avoid this design. Maybe instead of verifiers owning agents it could be a job distribution system? That would turn verifiers into "verifier workers" that take on jobs; it makes them stateless, and they are way easier to scale in general
  * as mentioned before, and I think we all agreed, the agent deployment should be optional, but activated by default
  * with regards to certificates there are two things we should do: (a) to begin with, we document commands on how to generate the certificates and create Kubernetes secrets from them; (b) we can have a "cert-manager" integration, as this is the most popular tool to manage certificates on Kubernetes
  * sqlite is probably a good start, as long as it is possible for the registrar and the verifier to have their own sqlite database
  * registrar and verifier deployments/statefulsets must have hard-coded replica counts of 1 for now in the Helm chart
  * love the idea around Node Feature Discovery for discovering TPM devices, but I think this could also come in a second step
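
To sketch that NFD idea (a minimal sketch; the label name and matched kernel module are assumptions, not a tested rule): NFD's NodeFeatureRule can match on loaded kernel modules, so something like this could label TPM-equipped nodes:

```yaml
# Hypothetical NodeFeatureRule: label name and matched kernel module are assumptions.
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: tpm-device
spec:
  rules:
    - name: "tpm device present"
      labels:
        "feature.node.kubernetes.io/tpm": "true"
      matchFeatures:
        - feature: kernel.loadedmodule
          matchExpressions:
            tpm: {op: Exists}  # assumes the tpm core module is loaded on TPM nodes
```

The agent DaemonSet would then carry a matching `nodeSelector` on `feature.node.kubernetes.io/tpm: "true"`.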
>    c. From this point we should expand to a "scale-out" deployment:
>    * Multiple `Registrars` and `Verifiers`
>    * A pre-packaged `helm` deployment of some SQL database server will be used.
>    * A `Service` (`type=LoadBalancer`)
>    d. At this point, the following technical considerations should be made:
>    * I am hoping we can "get away" with a pre-packaged n-way replicated SQL DB server.
>    * `Verifiers` are identified by a "verifier ID", which I assume can be taken from the "persistent identifier within a StatefulSet"
>    * The load balancing algorithm will have to use the URI (which contains the `agent` UUID) for the selection of the backend (i.e., we cannot use round-robin or source IP, given that presently a single `tenant` will add all the `agents` to the set of `verifiers`)
>    * The `tenant` is still considered a component outside of the whole deployment
  * this is exactly the right next step, but I feel like this is a long way in the future, unfortunately
  * we should be able to use a Helm chart dependency to pull in a SQL database deployment (sketched below)
  * I think you are getting to the heart of the problem: the tenant interaction is what is actually performing the load balancing, so to speak
  * the way I thought about it is that any "tenant" interaction is essentially part of the "operator". The tenant essentially becomes an operator.
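
For that dependency, a standard `dependencies` entry in the keylime chart's Chart.yaml would do; the chart name, version, and repository below are placeholder assumptions, not a recommendation:

```yaml
# Chart.yaml of the keylime chart; the concrete chart, version and repository
# are placeholder assumptions.
dependencies:
  - name: mysql
    version: "9.x.x"
    repository: https://charts.bitnami.com/bitnami
    condition: mysql.enabled  # lets the sqlite-only deployment skip the DB entirely
```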
> 3. Create an `Operator` for `keylime`. My experience writing operators is fairly limited, but I will point out some of the desirable characteristics:
>    * Ability to automatically generate all pertinent certificates
>    * Ability to deal with environment variables
>    * Ability to automatically add `agents` to `verifiers`
>    * Ability to react to administrative tasks on nodes, such as reboots, draining, and cordoning off.
  * what do you mean by ability to deal with environment variables?
  * agree on certs
  * agree on automatically adding agents
  * love the idea of reacting to reboots, etc., although not all events might be easy to detect
  * the language of choice should be golang for the operator (as that ecosystem is basically all golang)
  * most of the goals should be doable with CRDs and their respective Kubernetes controllers
  * however, there might be a need to create a Kubernetes resource in the registrar to kickstart the process to make it "automatic" (otherwise the creation of a resource would be the tenant CLI equivalent)
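
As a rough illustration of the CRD direction (everything here — group, kind, and fields — is invented for the sake of the sketch, not an agreed-upon API), adding an agent to a verifier could be driven by a custom resource along these lines:

```yaml
# Entirely hypothetical custom resource: the group, kind and fields are
# invented for illustration only.
apiVersion: attestation.keylime.dev/v1alpha1
kind: AttestedNode
metadata:
  name: worker-0
spec:
  agentUUID: "d432fbb3-d2f1-4a97-9ef7-75bd81c00000"  # keylime's default agent UUID, as an example
  verifier: keylime-verifier-0  # StatefulSet pod identity reused as the "verifier ID"
  policies:
    runtime: default-ima-policy       # reference to a policy resource or ConfigMap
    measuredBoot: default-mb-policy
```

A controller reconciling such resources would effectively replace the tenant CLI calls, which matches the "tenant becomes an operator" framing above.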
> 4. Make the `Operator` more "production-ready"
>    * How to deal with (`measured boot` and `runtime/IMA`) policies?
>    * How to deal with "scale-out" operations (i.e., if the number of `verifier` pods increases, should we perform "rebalancing")?
>    * How to integrate "durable attestation" into this scenario?

These are the $100 questions :)

> 5. The majority of the aforementioned stakeholders (@maugustosilva @mpeters @ansasaki Lukas Vrabec @galmasi and Marcus Hesse) voted for having this work developed in a new repository within the `keylime` project. I will create such a repository.

@maugustosilva if you don't mind, I would like to start creating issues for at least some of the work that you are proposing here, so that I can get started on them.

@maugustosilva (Contributor, Author):

Hey @mheese, trying to answer a few of the questions here, but I will most definitely start to break this out into multiple issues:

> we should provide the deployment as a Helm chart

100% agree

> we could easily make the Helm chart accessible as ORAS artifacts, like the container images, over the quay.io registry

sure, cannot think of a reason why not

> does the registrar really need to be a StatefulSet as well? If yes, why? I thought its design is "stateless", and a Deployment could be enough

absolutely right, the Registrar does not have to be a StatefulSet

> the Verifier as a StatefulSet is unfortunately probably required when it is being scaled, because a specific verifier "owns" agents. [...] Maybe instead of verifiers owning agents it could be a job distribution system?

agree, but it is a tall order; it will require significant changes in keylime

> as mentioned before, and I think we all agreed, the agent deployment should be optional, but activated by default

ah yes, yes... I have actually been playing around with an NFD script to label nodes with TPMs

> with regards to certificates there are two things we should do: (a) to begin with, we document commands on how to generate the certificates and create Kubernetes secrets from them; (b) we can have a "cert-manager" integration, as this is the most popular tool to manage certificates on Kubernetes

on it (item a); for (b), see the cert-manager sketch at the end of this comment

> sqlite is probably a good start, as long as it is possible for the registrar and the verifier to have their own sqlite database
> registrar and verifier deployments/statefulsets must have hard-coded replica counts of 1 for now in the Helm chart

+1

> love the idea around Node Feature Discovery for discovering TPM devices, but I think this could also come in a second step

sure, not crucial, will just leave it as an open issue

> this is exactly the right next step, but I feel like this is a long way in the future, unfortunately

I see... maybe I am underestimating the complexities of it

> we should be able to use a Helm chart dependency to pull in a SQL database deployment

I am counting on your help and expertise on that one; I am certainly not too familiar with any "good and simple" SQL helm charts

> I think you are getting to the heart of the problem: the tenant interaction is what is actually performing the load balancing, so to speak

An unfortunate problem, which is not gonna go away any time soon (waaaay too many changes in keylime proper)

> the way I thought about it is that any "tenant" interaction is essentially part of the "operator". The tenant essentially becomes an operator.

right, but even in this case a keylime admin might want to stop/remove/update a particular agent at a given time

> what do you mean by ability to deal with environment variables?

how do we propagate env vars back to Pods? Maybe it is just a matter of `envFrom` with a `configMapRef`

> agree on certs

+1

> agree on automatically adding agents

will generate an issue on keylime proper

> love the idea of reacting to reboots, etc., although not all events might be easy to detect

+1

> the language of choice should be golang for the operator (as that ecosystem is basically all golang)

I see.

> most of the goals should be doable with CRDs and their respective Kubernetes controllers

I thought so, but I still do not have the full picture in my head

> however, there might be a need to create a Kubernetes resource in the registrar to kickstart the process to make it "automatic" (otherwise the creation of a resource would be the tenant CLI equivalent)

Hmmm, interesting
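
On item (b) of the certificate discussion above, a cert-manager integration could eventually replace the keylime_ca pre-generation step. A minimal sketch, assuming a pre-created CA Issuer and placeholder names throughout:

```yaml
# Sketch of a cert-manager managed mTLS certificate; the issuer, names and
# DNS entries are assumptions, not an agreed-upon layout.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: keylime-verifier-mtls
spec:
  secretName: keylime-mtls        # consumed by the pods as a regular Secret
  issuerRef:
    name: keylime-ca-issuer       # assumed pre-created CA Issuer
    kind: Issuer
  commonName: keylime-verifier
  dnsNames:
    - keylime-verifier.keylime.svc
  usages:
    - server auth
    - client auth                 # mTLS: the same certificate is used both ways
```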

galmasi pushed a commit to galmasi/attestation-operator that referenced this issue on Jan 25, 2024:
… dependency in the main keylime chart.
Signed-off-by: George Almasi <[email protected]>