Use Helm generated instead of cert-manager generated certs for the operator #1648

Merged: 25 commits, merged Mar 3, 2025
Commits (25):

- 20c7ba1 initial draft (jvoravong, Jan 29, 2025)
- b36d8a3 Update CI/CD to support self signed cert data (jvoravong, Jan 29, 2025)
- 688d380 update functional tests (jvoravong, Feb 13, 2025)
- 3be762e Merge branch 'main' (jvoravong, Feb 13, 2025)
- 9670d64 more test fixes (jvoravong, Feb 13, 2025)
- ccf0937 Merge branch 'main' (jvoravong, Feb 13, 2025)
- 18d5c01 updates to keep support for the certmanager subchart around, updated … (jvoravong, Feb 14, 2025)
- c23c1a1 Update docs/auto-instrumentation-install.md (jvoravong, Feb 14, 2025)
- 5895726 draft migration guide for 0.118.0 to 0.119.0 (jvoravong, Feb 18, 2025)
- cf52cab Update docs/auto-instrumentation-install.md (jvoravong, Feb 18, 2025)
- 367f02e Merge branch 'main' of https://github.com/signalfx/splunk-otel-collec… (jvoravong, Feb 26, 2025)
- 4ade4e0 Update docs after main merge (jvoravong, Feb 26, 2025)
- c429f2b split our pre-commit update into a separate PR (jvoravong, Feb 26, 2025)
- c4f6cd7 Merge branch 'main' of https://github.com/signalfx/splunk-otel-collec… (jvoravong, Feb 26, 2025)
- 83ce345 Documentation improvements, mostly just reorganize content for easier… (jvoravong, Feb 27, 2025)
- 49091d2 name fix (jvoravong, Feb 27, 2025)
- c28f181 remove doc TODOs (jvoravong, Feb 27, 2025)
- 87e1823 More upgrading step touch ups (jvoravong, Feb 27, 2025)
- 868016f remove functional test values file updates because they are not needed (jvoravong, Feb 27, 2025)
- 4bef359 regenerate functional_tests/testdata/expected_kind_values/expected_cl… (jvoravong, Feb 27, 2025)
- d12f6f3 dummy commit to get CI/CD run with the "Ignore Tests" PR label (jvoravong, Feb 27, 2025)
- 6f7aea3 restore comment that wasn't ment to be removed (jvoravong, Feb 27, 2025)
- 4c8d9e4 doc update for autoGenerateCert.enabled (jvoravong, Feb 28, 2025)
- ea8b041 Remove missed cert-manager references in docs (jvoravong, Mar 3, 2025)
- 4dff958 Update UPGRADING.md (jvoravong, Mar 3, 2025)
15 changes: 15 additions & 0 deletions .chloggen/migration-operator-helm-certs.yaml
@@ -0,0 +1,15 @@
# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: breaking
# The name of the component, or a single word describing the area of concern, (e.g. agent, clusterReceiver, gateway, operator, chart, other)
component: operator
# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: Migrate the operator to use Helm generated TLS certificates instead of cert-manager by default
# One or more tracking issues related to the change
issues: [1648]
# (Optional) One or more lines of additional information to render under the primary note.
# These lines will be padded with 2 spaces and then inserted directly into the document.
# Use pipe (|) for multiline entries.
subtext: |
  - Previously, certificates were generated by cert-manager by default; now they are generated by Helm templates unless configured otherwise.
  - This change simplifies the setup for new users while still supporting those who prefer using cert-manager or other solutions. For more details, see the [related documentation](https://github.com/signalfx/splunk-otel-collector-chart/tree/main/docs/auto-instrumentation-install.md#tls-certificate-requirement-for-kubernetes-operator-webhooks).
  - If you use `.Values.operator.enabled=true` (with or without `.Values.certmanager.enabled=true`), please review the [upgrade guidelines](https://github.com/signalfx/splunk-otel-collector-chart/blob/main/UPGRADING.md#01190-to-01200).
18 changes: 0 additions & 18 deletions .github/workflows/functional_test_v2.yaml
@@ -67,9 +67,6 @@ jobs:
       - name: Update dependencies
         run: |
           make dep-update
-      - name: Deploy cert-manager
-        run: |
-          make cert-manager
       - name: run functional tests
         id: run-functional-tests
         env:
@@ -128,9 +125,6 @@ jobs:
       - name: Update dependencies
         run: |
           make dep-update
-      - name: Deploy cert-manager
-        run: |
-          make cert-manager
       - name: run functional tests
         env:
           HOST_ENDPOINT: 0.0.0.0
@@ -183,19 +177,13 @@ jobs:
       - name: Update dependencies
         run: |
           cd base && make dep-update
-      - name: Deploy cert-manager
-        run: |
-          cd base && make cert-manager
       - name: Deploy previous version of the chart
         run: |
           helm list | grep -q "^sock$" && echo "Found previous 'sock' release. Deleting..." && helm delete sock
           cd base && helm install sock helm-charts/splunk-otel-collector --set cloudProvider=aws --set distribution=eks --set splunkObservability.realm=us0 --set splunkObservability.accessToken=xxxxx
       - name: Update dependencies
         run: |
           make dep-update
-      - name: Deploy cert-manager
-        run: |
-          make cert-manager
       - name: run functional tests
         env:
           HOST_ENDPOINT: 0.0.0.0
@@ -238,19 +226,13 @@ jobs:
       - name: Update dependencies
         run: |
           cd base && make dep-update
-      - name: Deploy cert-manager
-        run: |
-          cd base && make cert-manager
       - name: Deploy previous version of the chart
         run: |
           helm list | grep -q "^sock$" && echo "Found previous 'sock' release. Deleting..." && helm delete sock
           cd base && helm install sock helm-charts/splunk-otel-collector --set cloudProvider=aws --set distribution=eks --set splunkObservability.realm=us0 --set splunkObservability.accessToken=xxxxx --set operator.enabled=true --set environment=dev
       - name: Update dependencies
         run: |
           make dep-update
-      - name: Deploy cert-manager
-        run: |
-          make cert-manager
       - name: run functional tests
         env:
           HOST_ENDPOINT: 0.0.0.0
96 changes: 96 additions & 0 deletions UPGRADING.md
@@ -1,5 +1,101 @@
# Upgrade guidelines

## 0.119.0 to 0.120.0

This guide provides steps for new users, transitioning users, and those maintaining previously deployed Operator-related TLS certificates and configurations.

- New users: No migration is required for Operator TLS certificates.
- Previous users: Migration may be needed if using `operator.enabled=true` or `certmanager.enabled=true`.

To maintain previous functionality and avoid breaking changes, review the following sections.
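To see which of the scenarios below applies to an existing release, you can compare its values against the conditions each scenario describes. The function below is a minimal, hypothetical sketch (`migration_scenario` is not part of the chart). It assumes the release still uses the pre-0.120.0 default of cert-manager-issued webhook certificates, and the commented usage assumes `jq` is installed.

```bash
# Hypothetical helper (not part of the chart): maps the values of an existing
# release to the migration scenarios described in this section.
#   $1 = value of operator.enabled    ("true" or "false")
#   $2 = value of certmanager.enabled ("true" or "false")
migration_scenario() {
  local operator_enabled="$1"
  local certmanager_enabled="$2"

  if [ "$operator_enabled" != "true" ]; then
    # Without the operator there is no operator webhook certificate to migrate.
    echo "no-migration"
  elif [ "$certmanager_enabled" = "true" ]; then
    # Scenario 1: operator and cert-manager both deployed by this chart.
    echo "scenario-1"
  else
    # Scenario 2: operator deployed by this chart, cert-manager managed elsewhere.
    echo "scenario-2"
  fi
}

# Example usage, assuming jq is installed:
#   values=$(helm get values splunk-otel-collector --namespace <your_namespace> --all --output json)
#   migration_scenario "$(echo "$values" | jq -r '.operator.enabled')" \
#                      "$(echo "$values" | jq -r '.certmanager.enabled')"
```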

### **Maintaining Previous Functionality via Helm Values Update**

#### **Scenario 1: Operator and cert-manager Deployed via This Helm Chart**

If you previously deployed both the Operator and cert-manager via this Helm chart (`operator.enabled=true` and `certmanager.enabled=true`), you can preserve functionality by adding the following values:

```yaml
operator:
  enabled: true
  admissionWebhooks:
    certManager:
      enabled: true
      certificateAnnotations:
        "helm.sh/hook": post-install,post-upgrade
        "helm.sh/hook-weight": "1"
      issuerAnnotations:
        "helm.sh/hook": post-install,post-upgrade
        "helm.sh/hook-weight": "1"
certmanager:
  enabled: true
  installCRDs: true
```

#### **Scenario 2: Operator Deployed with External cert-manager (Not Managed by This Helm Chart)**

If you previously deployed the Operator and used an externally managed cert-manager (`operator.enabled=true` and `certmanager.enabled=false`), you can preserve functionality by adding the following values:

```yaml
operator:
  enabled: true
  admissionWebhooks:
    certManager:
      enabled: true
```

### **Adopting New Functionality (Requires Migration Steps)**

If you want to migrate from cert-manager-managed certificates to the now-default Helm-generated certificates, additional steps may be required to avoid conflicts.

#### **Potential Upgrade Issue: Existing Secret Conflict**

If you see an error message like the following during a Helm install or upgrade:

```
warning: Upgrade "{helm_release_name}" failed: pre-upgrade hooks failed: warning: Hook pre-upgrade splunk-otel-collector/charts/operator/templates/admission-webhooks/operator-webhook.yaml failed: 1 error occurred:* secrets "splunk-otel-collector-operator-controller-manager-service-cert" already exists
```

This typically occurs because:
- The cert-manager `Certificate` resources are removed immediately.
- The associated **secret** is not removed at the same time. cert-manager does not delete it on its own; it is garbage-collected later only if it carries an owner reference to its `Certificate` (cert-manager's `--enable-certificate-owner-ref` option), and otherwise remains until you delete it.

To resolve this, uninstall the chart, wait until the old secret has been removed (or delete it manually), and then install the latest version of the chart.
The commands below assume your Helm release is named `splunk-otel-collector`.
- Be aware that these steps uninstall the whole release: the operator and the chart's other components are unavailable in your environment until the chart is reinstalled.
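Before uninstalling, you can check whether the conflicting secret was issued by cert-manager. This sketch assumes the default release name and reads the `cert-manager.io/certificate-name` annotation, which cert-manager adds to the secrets it issues; the helper name `issuing_certificate` is hypothetical.

```bash
# Secret named in the error message above (release "splunk-otel-collector").
SECRET_NAME="splunk-otel-collector-operator-controller-manager-service-cert"

# Hypothetical helper: prints the name of the cert-manager Certificate that
# issued the secret, read from its cert-manager.io/certificate-name
# annotation. Prints nothing if the annotation is absent; exits non-zero if
# the secret does not exist.
issuing_certificate() {
  local namespace="$1"
  kubectl get secret "$SECRET_NAME" --namespace "$namespace" \
    -o jsonpath='{.metadata.annotations.cert-manager\.io/certificate-name}'
}

# Example usage:
#   issuing_certificate <your_namespace>
```

A non-empty result indicates the secret is a leftover from cert-manager rather than one created by the chart's Helm templates.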

#### **Step 1: Delete this Helm Chart**

Use a command like this to delete the chart in your namespace:

```bash
helm delete splunk-otel-collector --namespace <your_namespace>
```

#### **Step 2: Verify That the Old cert-manager Secret No Longer Exists**

Use the following command to check if the certificate secret remains in your namespace:

```bash
kubectl get secret splunk-otel-collector-operator-controller-manager-service-cert --namespace <your_namespace>
```

#### **Step 3: Wait for Secret Removal or Manually Delete It**

If the secret still exists, you must wait for cert-manager to remove it or delete it manually:

```bash
kubectl delete secret splunk-otel-collector-operator-controller-manager-service-cert --namespace <your_namespace>
```
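Steps 2 and 3 can also be scripted as a single polling loop. The sketch below is illustrative only: the helper name `wait_or_delete_secret` and its default timeout and interval are assumptions, not values recommended by the chart.

```bash
# Secret named in the error message above (release "splunk-otel-collector").
SECRET_NAME="splunk-otel-collector-operator-controller-manager-service-cert"

# Hypothetical helper combining Steps 2 and 3: poll until the secret is gone,
# and delete it manually if it is still present once the timeout expires.
#   $1 = namespace, $2 = timeout in seconds (default 120),
#   $3 = polling interval in seconds (default 5)
wait_or_delete_secret() {
  local namespace="$1" timeout="${2:-120}" interval="${3:-5}"
  local waited=0 found

  while :; do
    # With --ignore-not-found, `kubectl get` prints nothing and exits 0 when
    # the secret is absent, so a non-zero exit signals a real error.
    if ! found=$(kubectl get secret "$SECRET_NAME" --namespace "$namespace" \
        --ignore-not-found -o name); then
      echo "kubectl get failed" >&2
      return 1
    fi
    if [ -z "$found" ]; then
      echo "secret $SECRET_NAME is gone"
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "secret still present after ${timeout}s, deleting it" >&2
      kubectl delete secret "$SECRET_NAME" --namespace "$namespace"
      return
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Example usage: wait up to 5 minutes, checking every 10 seconds.
#   wait_or_delete_secret <your_namespace> 300 10
```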

#### **Step 4: Proceed with Helm Install**

Once the secret is no longer present, install the latest version of the chart (`0.120.0`):

```bash
helm install splunk-otel-collector splunk-otel-collector-chart/splunk-otel-collector --values ~/values.yaml --namespace <your_namespace>
```
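After installing, you can inspect the certificate that Helm generated for the operator webhook. This sketch assumes the release name used above and that the secret stores the certificate under the standard `tls.crt` key; `webhook_cert_pem` is a hypothetical helper name.

```bash
# Secret that holds the webhook certificate (release "splunk-otel-collector").
SECRET_NAME="splunk-otel-collector-operator-controller-manager-service-cert"

# Hypothetical helper: prints the PEM-encoded webhook certificate. Assumes the
# secret stores it under the standard kubernetes.io/tls key `tls.crt`.
webhook_cert_pem() {
  local namespace="$1"
  kubectl get secret "$SECRET_NAME" --namespace "$namespace" \
    -o jsonpath='{.data.tls\.crt}' | base64 -d
}

# Example usage: show the certificate's subject and validity window.
#   webhook_cert_pem <your_namespace> | openssl x509 -noout -subject -dates
```

Because the Helm-generated certificate is regenerated on every Helm upgrade unless `operator.admissionWebhooks.autoGenerateCert.recreate` is `false`, the `notAfter` date reported by `openssl` is expected to move forward after each upgrade.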

## 0.113.0 to 0.116.0

This guide provides steps for new users, transitioning users, and those maintaining previous operator CRD configurations:
147 changes: 93 additions & 54 deletions docs/auto-instrumentation-install.md
@@ -68,10 +68,6 @@ these frameworks often have pre-built instrumentation capabilities already avail
 - [partially enable profiling](../examples/enable-operator-and-auto-instrumentation/instrumentation/instrumentation-enable-profiling-partially.yaml).
 
 ```bash
-# Check if cert-manager is already installed, don't deploy a second cert-manager.
-kubectl get pods -l app=cert-manager --all-namespaces
-
-# If cert-manager is not deployed, make sure to add certmanager.enabled=true to the list of values to set
 helm install splunk-otel-collector -f ./my_values.yaml --set operatorcrds.install=true,operator.enabled=true,environment=dev splunk-otel-collector-chart/splunk-otel-collector
 ```
 
@@ -462,81 +458,124 @@ helm template splunk-otel-collector-chart/splunk-otel-collector --include-crds \
 | kubectl delete --dry-run=client -f -
 ```
 
-### Documentation Resources
+### TLS Certificate Requirement for Kubernetes Operator Webhooks
 
-- https://developers.redhat.com/devnation/tech-talks/using-opentelemetry-on-kubernetes
-- https://github.com/open-telemetry/opentelemetry-operator/blob/main/README.md
-- https://github.com/open-telemetry/opentelemetry-operator/blob/main/docs/api.md#instrumentation
-- https://github.com/open-telemetry/opentelemetry-operator/blob/main/README.md#opentelemetry-auto-instrumentation-injection
-- https://github.com/open-telemetry/opentelemetry-operator/blob/main/README.md#use-customized-or-vendor-instrumentation
+In Kubernetes, the API server communicates with operator webhook components over HTTPS, which requires a valid TLS certificate that the API server trusts. The operator supports several methods for configuring the required certificate, each with different levels of complexity and security.
 
-### Troubleshooting the Operator and Cert Manager
+---
 
-#### Check the logs for failures
+#### 1. **Using a Self-Signed Certificate Generated by the Chart**
 
-**Operator Logs:**
+This is the default and simplest method for generating a TLS certificate. It automatically creates a self-signed certificate for the webhook, making it suitable for internal environments or testing purposes. However, it may not be trusted by clients outside your cluster.
 
-```bash
-kubectl logs -l app.kubernetes.io/name=operator
+**Note**: The following settings reflect the default values starting in version **0.120.0** of this chart. You only need to update them if you use a **previous chart version** or need additional customization.
+
+```yaml
+operator:
+  admissionWebhooks:
+    autoGenerateCert:
+      enabled: true
+      certPeriodDays: 3650
+    certManager:
+      enabled: false
 ```
 
-**Cert-Manager Logs:**
+- Setting `operator.admissionWebhooks.certManager.enabled` to `false` and `operator.admissionWebhooks.autoGenerateCert.enabled` to `true` ensures that Helm generates a self-signed TLS certificate.
+- Helm generates a self-signed certificate that is valid for 10 years (3650 days) and stores it in a secret for the Operator webhook. The certificate's validity period can be adjusted using `operator.admissionWebhooks.autoGenerateCert.certPeriodDays`.
+- The certificate is **automatically regenerated** on every Helm upgrade. To disable this behavior, set `operator.admissionWebhooks.autoGenerateCert.recreate` to `false`.
 
-```bash
-kubectl logs -l app=certmanager
-kubectl logs -l app=cainjector
-kubectl logs -l app=webhook
-```
+---
 
-#### Operator Issues
+#### 2. **Using a cert-manager Certificate**
 
-##### Networking and Firewall Requirements
+Using `cert-manager` offers more control over certificate management and is more suitable for production environments. However, due to Helm’s install/upgrade order of operations, cert-manager CRDs and certificates cannot be installed within the same Helm operation. To work around this limitation, you can choose one of the following options:
 
-Ensure the Mutating Webhook used by the operator for pod auto-instrumentation is not hindered by network policies or firewall rules. Key points to ensure:
+##### Option 1: **Pre-deploy cert-manager**
 
-- **Webhook Accessibility**: The webhook must freely communicate with the cluster IP and the Kubernetes API server. Ensure network policies or firewall rules permit operator-related services to interact with these endpoints.
-- **Required Ports**: Policies should explicitly allow traffic to the necessary ports for seamless operation.
+If `cert-manager` is already deployed in your cluster, you can configure the operator to use it without enabling certificate generation by Helm.
 
-Use the following command to identify the IP addresses and ports that need to be accessible:
+**Configuration:**
+```yaml
+operator:
+  admissionWebhooks:
+    certManager:
+      enabled: true
+```
 
-```bash
-kubectl get svc -n {operator_namespace}
-# Example output indicating necessary IP and port configurations:
-# NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
-# kubernetes ClusterIP 10.0.0.1 <none> 443/TCP 10d
-# splunk-splunk-otel-collector-agent ClusterIP 10.0.176.113 <none> 8006/TCP,14250/TCP,14268/TCP,... 3d17h
-# splunk-splunk-otel-collector-operator ClusterIP 10.0.254.125 <none> 8443/TCP,8080/TCP 3d17h
-# splunk-splunk-otel-collector-operator-webhook ClusterIP 10.0.222.223 <none> 443/TCP 3d17h
+##### Option 2: **Deploy cert-manager and the operator together**
+
+If you need to install `cert-manager` along with the operator, use a Helm post-install or post-upgrade hook to ensure that the certificate is created after the cert-manager CRDs are installed.
+
+**Configuration:**
+```yaml
+operator:
+  admissionWebhooks:
+    certManager:
+      enabled: true
+      certificateAnnotations:
+        "helm.sh/hook": post-install,post-upgrade
+        "helm.sh/hook-weight": "1"
+      issuerAnnotations:
+        "helm.sh/hook": post-install,post-upgrade
+        "helm.sh/hook-weight": "1"
+certmanager:
+  enabled: true
+  installCRDs: true
 ```
 
-- **Configuration Action**: Adjust your network policies and firewall settings based on the service endpoints and ports listed by the command. This ensures the webhook and operator services can properly communicate within the cluster.
+This method is useful when installing `cert-manager` as a subchart or as part of a larger Helm chart installation.
 
-#### Cert-Manager Issues
+---
 
-If the operator seems to be hanging, it could be due to the cert-manager not auto-creating the required certificate. To troubleshoot:
+#### 3. **Using a Custom Externally Generated Certificate**
 
-- Check the health and logs of the cert-manager pods for potential issues.
-- Consider restarting the cert-manager pods.
-- Ensure that your cluster has only one instance of cert-manager, which should include `certmanager`, `certmanager-cainjector`, and `certmanager-webhook`.
+For full control, you can use an externally generated certificate. This is suitable if you already have a certificate issued by a trusted CA or have specific security requirements.
 
-For additional guidance, refer to the official cert-manager documentation:
-- [Troubleshooting Guide](https://cert-manager.io/docs/troubleshooting/)
-- [Uninstallation Guide](https://cert-manager.io/v1.2-docs/installation/uninstall/kubernetes/)
+**Configuration:**
+- Set both `operator.admissionWebhooks.certManager.enabled` and `operator.admissionWebhooks.autoGenerateCert.enabled` to `false`.
+- Provide the paths to your certificate (`certFile`), private key (`keyFile`), and CA certificate (`caFile`) in the values.
 
-##### Validate Certificates
+**Example:**
+```yaml
+operator:
+  admissionWebhooks:
+    certManager:
+      enabled: false
+    autoGenerateCert:
+      enabled: false
+    certFile: /path/to/cert.crt
+    keyFile: /path/to/cert.key
+    caFile: /path/to/ca.crt
+```
+
+This method allows you to use a certificate that is trusted by external systems, such as certificates issued by a corporate CA.
 
-Ensure that the certificate, which the cert-manager creates and the operator utilizes, is available.
+---
 
+For more advanced use cases, refer to the [official Helm chart documentation](https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-operator/values.yaml) for detailed configuration options and scenarios.
+
+### Troubleshooting the Operator and Cert Manager
+
+#### Check the logs for failures
+
+**Operator Logs:**
+
 ```bash
-kubectl get certificates
-# NAME READY SECRET AGE
-# splunk-otel-collector-operator-serving-cert True splunk-otel-collector-operator-controller-manager-service-cert 5m
+kubectl logs -l app.kubernetes.io/name=operator
 ```
 
-##### Using a Self-Signed Certificate for the Webhook
+**Cert-Manager Logs:**
 
-The operator supports various methods for managing TLS certificates for the webhook. Below are the options available through the operator, with a brief description for each. For detailed configurations and specific use cases, please refer to the operator’s
-[official Helm chart documentation](https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-operator/values.yaml)
+```bash
+kubectl logs -l app=certmanager
+kubectl logs -l app=cainjector
+kubectl logs -l app=webhook
+```
+
+### Documentation Resources
 
-**Note**: While using a self-signed certificate offers a quicker and simpler setup, it has limitations, such as not being trusted by default by clients.
-This may be acceptable for testing purposes or internal environments. For complete configurations and additional guidance, please refer to the provided link to the Helm chart documentation.
+- https://developers.redhat.com/devnation/tech-talks/using-opentelemetry-on-kubernetes
+- https://github.com/open-telemetry/opentelemetry-operator/blob/main/README.md
+- https://github.com/open-telemetry/opentelemetry-operator/blob/main/docs/api.md#instrumentation
+- https://github.com/open-telemetry/opentelemetry-operator/blob/main/README.md#opentelemetry-auto-instrumentation-injection
+- https://github.com/open-telemetry/opentelemetry-operator/blob/main/README.md#use-customized-or-vendor-instrumentation
9 changes: 9 additions & 0 deletions examples/enable-operator-and-auto-instrumentation/README.md
@@ -21,6 +21,15 @@ This example demonstrates how to:
- **Single App Focus:** Explore trace-related performance of a single instrumented NodeJS application in the APM console.
- **Simplified Use Case:** Although relations between applications will not be showcased in the APM console, this demo offers a simplified setup suitable for understanding basic instrumentation and trace visualization.

## [Simple Webserver - .NET Instrumentation](./otel-demo-nodejs.md)
This example demonstrates how to:
- Deploy the chart to the current namespace and the demo to the `dotnet-demo` namespace.
- Instrument a single .NET application.

**Highlights:**
- **Single App Focus:** Explore trace-related performance of a single instrumented .NET application in the APM console.
- **Simplified Use Case:** Although relations between applications will not be showcased in the APM console, this demo offers a simplified setup suitable for understanding basic instrumentation and trace visualization.

## Exploring Traces and Applications in APM Console
The examples provide practical insights into using the APM console for exploring application relations and traces.
Whether dealing with multiple applications interacting with each other or focusing on a single application, you will gain hands-on experience in visualizing trace data using Splunk Observability APM.
@@ -11,6 +11,4 @@ operatorcrds:
   install: true
 operator:
   enabled: true
-certmanager:
-  enabled: true
 