
Deprovisioning of sql-database fails when namespace is deleted #678

Open
pabloromeo opened this issue Feb 22, 2019 · 9 comments

pabloromeo commented Feb 22, 2019

We are currently creating on-demand environments for pull-requests and have started provisioning DBs for them using OSBA.
We create a new k8s namespace and provision the database there.
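
Roughly, the per-PR setup looks like this (a simplified sketch; the namespace, instance name, and parameters below are placeholders rather than our real values):

kubectl create namespace pr-NNN
svcat provision some-db \
  --namespace pr-NNN \
  --class azure-sql-12-0 \
  --plan basic \
  --param location=<azure-region> \
  --param resourceGroup=<resource-group>
svcat bind some-db --namespace pr-NNN --name some-db-binding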

The creation works as expected. However, when we later delete the entire namespace, to destroy the whole on-demand environment, OSBA does not deprovision and delete the database.
It fails with the following error:

time="2019-02-22T18:33:22Z" level=error msg="error executing job; not submitting any follow-up tasks" error="error executing deprovisioning step \"deleteARMDeployment\" for instance \"ff5e0f6e-3531-11e9-b9d6-e68df5afd861\": error executing deprovisioning step: error deleting ARM deployment: error deleting deployment \"1d787c49-7d99-419d-8b43-30647eeabece\" from resource group \"<<redacted>>\": pollingTrackerBase#updateRawBody: failed to unmarshal response body: StatusCode=0 -- Original Error: unexpected end of JSON input" job=executeDeprovisioningStep taskID=f2137f1b-fa76-4251-b669-f68423ba5ac4

Running:
svcat get instances -n pr-NNN

   NAME     NAMESPACE       CLASS        PLAN           STATUS
+---------+-----------+----------------+-------+-----------------------+
  some-db   pr-NNN      azure-sql-12-0   basic   DeprovisionCallFailed

In fact, it seems to have left our namespace stuck in a perpetual "Terminating" state.

Also, manually deprovisioning through svcat did not delete the database either:

:~$ svcat deprovision some-db -n pr-NNN
deleted some-db

But the database is still there, even after waiting quite a while. There are no messages in the OSBA logs either.
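
For what it's worth, the database still shows up when I list the databases on the SQL server directly with the Azure CLI, along the lines of the following (the resource group and server names are placeholders here):

az sql db list --resource-group <resource-group> --server <sql-server-name> --output table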

@zhongyi-zhang
Contributor

This looks like a similar issue to #672, on the Azure Go SDK side. We bumped the SDK version as they suggested. Have you upgraded to the latest OSBA release?

@pabloromeo
Author

I can give that a try. Is there any documentation on how to safely upgrade OSBA?

@zhongyi-zhang
Contributor

Helm can upgrade and roll back. Or, if you mean you're concerned about OSBA's behavior after upgrading, maybe you could verify it on a test cluster or minikube first?
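
For example, assuming OSBA was installed from the azure chart repo under the release name osba (adjust these to your actual install), the upgrade would be roughly:

helm repo update
helm upgrade osba azure/open-service-broker-azure --version 1.5.0 --reuse-values
helm rollback osba <previous-revision>   # only if you need to go back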

@pabloromeo
Author

I've upgraded OSBA to 1.5.0 and reinstalled the service-catalog from scratch, since it was in an unrecoverable state after the last attempt.

Now when the namespace is deleted, the kubectl delete namespace <namespace> operation times out after 10 minutes, and the catalog-apiserver pod gets terminated with an OOMKilled status. It restarts, but just ends up being terminated again after a few minutes.

Running svcat against the instances, I get:

   NAME     NAMESPACE       CLASS        PLAN                     STATUS
+---------+-----------+----------------+-------+------------------------------------------+
  some-db   pr-XXX      azure-sql-12-0   basic   DeprovisionBlockedByExistingCredentials

and against the bindings:

        NAME         NAMESPACE   INSTANCE            STATUS
+-------------------+-----------+----------+--------------------------+
  some-db-binding     pr-XXX      some-db    UnbindingRequestInFlight

@zhongyi-zhang
Contributor

Any OSBA logs for unbinding? Let's check whether the issue is on the OSBA unbinding side or the svcat side. Tracing back from https://github.com/kubernetes-incubator/service-catalog/blob/7fec2384506143b88910f575913f5fdbe1601d7f/pkg/controller/controller_binding.go#L796, I think it is possible that the unbinding request didn't reach OSBA.
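
For example, something along these lines (the namespace and deployment name here assume a default helm install of OSBA; adjust them to your cluster):

kubectl logs -n osba deployment/osba-open-service-broker-azure --since=1h | grep -i unbind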

@pabloromeo
Author

From what I was able to see while monitoring both the OSBA and service-catalog logs, the service-catalog is the component that is failing; the OSBA logs show nothing while the deprovisioning fails.

From the controller-manager logs, I see that it first attempts to delete the secrets associated with the db, which results in a 404.
That results in:
Error syncing ServiceBinding pr-NNN/some-db-binding (retry: 9/15): OSB client not found for the broker osba
and then
ServiceInstance "pr-NNN/some-db" v352: All associated ServiceBindings must be removed before this ServiceInstance can be deleted
then
'Warning' reason: 'DeprovisionBlockedByExistingCredentials' All associated ServiceBindings must be removed before this ServiceInstance can be deleted

Remember that the trigger for all of this was the deletion of the entire namespace.
It would appear that deleting the namespace removes the secrets before service-catalog takes over and tries to deprovision; by the time it does, the secrets are gone (hence the 404), so it cannot continue deleting the binding and, ultimately, the instance.

It keeps retrying the entire process over and over until the catalog-catalog-apiserver crashes, goes into a crash loop of retries, and restarts; I've seen it rack up hundreds of restarts. The only way to stop it was to reinstall the service-catalog with Helm again.

The result seems to be a deadlock: it keeps retrying the deletion of a secret that no longer exists. This leaves the instance in DeprovisionBlockedByExistingCredentials and the binding in UnbindingRequestInFlight.
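
For now I'm considering working around it by tearing things down explicitly before deleting the namespace, so the binding and instance are gone before their secrets disappear. Something like the following (names as above; I'm assuming, not confirming, that this avoids the race):

svcat unbind some-db -n pr-NNN
svcat get bindings -n pr-NNN      # wait until the binding is gone
svcat deprovision some-db -n pr-NNN
svcat get instances -n pr-NNN     # wait until the instance is gone
kubectl delete namespace pr-NNN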

@zhongyi-zhang
Contributor

I suppose the OSB client not found for the broker osba error was caused by the re-installation of the service-catalog. The BrokerClientManager in the service-catalog lost the clients for the registered brokers -- it no longer knows how to call OSBA. And the OSBA logs show nothing, so the unbinding request didn't reach OSBA at all... I am afraid this issue can't be solved on the OSBA side.
I found a similar issue here: kubernetes-retired/service-catalog#1574. Regardless of which service broker is used, deleting a namespace can be a problem for the service-catalog.

@pabloromeo
Author

Do you know of any way to reconnect the broker to those instances if the service-catalog is reinstalled? I've been forced to reinstall frequently, because when provisioning fails, the service-catalog starts crashing and never recovers.
I end up with bindings stuck in "Failed", "UnbindingRequestInFlight", or "ErrorInstanceRefsUnresolved" status that never recover, and instances in "ReferencesNonexistentBroker" status.

I must be doing something wrong, because at the moment this strategy of automatically provisioning and deprovisioning through OSBA seems unstable and not very reliable :(.

I just can't identify the problem. Once provisioning fails, there is no way to fix it, for example by forcing the removal of the bindings and instances from the service-catalog, even if I manually remove the Azure resources that were provisioned.
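
For reference, by forcing the removal I mean something like clearing the finalizers on the stuck objects so the service-catalog stops waiting on the broker. A rough sketch only (I'm not sure this is safe, since it bypasses the broker entirely):

kubectl -n pr-NNN patch servicebinding some-db-binding --type=merge -p '{"metadata":{"finalizers":[]}}'
kubectl -n pr-NNN patch serviceinstance some-db --type=merge -p '{"metadata":{"finalizers":[]}}'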

@zhongyi-zhang
Contributor

Did you try just re-registering OSBA with svcat register?
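
For example, something like the following (the URL is just an illustration of OSBA's in-cluster service address; use your actual service name and namespace, and pass whatever auth the broker was originally registered with):

svcat register osba --url http://<osba-service>.<osba-namespace>.svc.cluster.local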

I kind of understand how you feel... Still, I should say that your case is a mix of several situations... If it weren't for the Azure Go SDK issue, and svcat were healthy, OSBA handles common provision failures well, with svcat automatically calling deprovision.

When manually removing Azure resources, please try one more step -- also remove the related records in the OSBA store (the Redis instance).
