Improve troubleshooting of Istio #440

Open
2 of 10 tasks
barchw opened this issue Oct 25, 2023 · 0 comments
Labels
area/service-mesh Issues or PRs related to service-mesh Epic kind/feature Categorizes issue or PR as related to a new feature.

Comments

Collaborator

barchw commented Oct 25, 2023

Description

The Istio service mesh is the component with the highest number of internal and customer incidents. In most cases, the problem is caused by misconfiguration on the customer side. We should make troubleshooting easier for both customers and SREs to lower the number of cases that get routed to Goats.

Reasons

  • Istio is difficult to troubleshoot without extensive knowledge.
  • It is often hard to find the right tooling to diagnose an issue: the Istio documentation is huge, and some of the most useful istioctl commands are only available under the experimental istioctl x command.

Tasks to do/discuss(❓)

  • Make the Istio Custom Resource status more fine-grained #444

    • Split up description field into additional ones that would be easier to aggregate
    • Add more information to the Custom Resource when a Warning happened
    • Add conditions to status
  • Troubleshooting guide for debugging Istio issues with concrete commands.

  • Provide solutions and best practices based on state of the cluster:

    • Warnings in Busola
    • ❓Use the output of istioctl analyze - IDEA: Separate istio-agent that can be used to retrieve info about the service-mesh state
    • ❓Provide an admission webhook that would reject resources whose configuration is not allowed in Kyma (for example, Authorization Policies that block the ingress-gateway healthz endpoint)
  • ❓Cluster reconciliation/upgrade check: Run a check before reconciliation (or an Istio upgrade) to evaluate whether the current cluster should be reconciled. If a resource or field in use on the cluster would lead to an error state, e.g. EnvoyFilter (proxy_protocol), we could set the Istio CR to a Warning state (user action required) and skip reconciliation, which might reduce the number of incidents.
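
To make the reconciliation check idea more concrete, here is a rough sketch of what such a pre-check could look like. It is only an illustration, not a proposed implementation: the names `CheckResult`, `mentions`, and `pre_reconcile_check`, the "Ready"/"Warning" state values, and the simple string match on `proxy_protocol` are all assumptions made for this example. Resources are assumed to be already-parsed manifests (plain dicts).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CheckResult:
    """Outcome of the hypothetical pre-reconciliation check."""
    state: str                                   # "Ready" or "Warning" (illustrative values)
    reasons: list = field(default_factory=list)  # findings to surface to the user


def mentions(node, needle: str) -> bool:
    """Return True if any key or string value in a parsed manifest contains needle."""
    if isinstance(node, dict):
        return any(needle in str(k) or mentions(v, needle) for k, v in node.items())
    if isinstance(node, list):
        return any(mentions(item, needle) for item in node)
    return isinstance(node, str) and needle in node


def pre_reconcile_check(resources: list) -> CheckResult:
    """Flag EnvoyFilters that reference proxy_protocol (assumed blocking criterion)."""
    reasons = []
    for res in resources:
        if res.get("kind") != "EnvoyFilter":
            continue
        if mentions(res.get("spec", {}), "proxy_protocol"):
            meta = res.get("metadata", {})
            namespace = meta.get("namespace", "default")
            name = meta.get("name", "<unnamed>")
            reasons.append(f"EnvoyFilter {namespace}/{name} uses proxy_protocol")
    # Any finding means: skip reconciliation and ask the user to act.
    return CheckResult("Warning" if reasons else "Ready", reasons)
```

A caller would skip reconciliation whenever `state` is "Warning" and copy `reasons` into the Istio CR status, which would also tie in with the finer-grained status fields from #444.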

@barchw barchw added kind/feature Categorizes issue or PR as related to a new feature. Epic area/service-mesh Issues or PRs related to service-mesh labels Oct 25, 2023