
Deprovisioning of sql-database fails when namespace is deleted #678

Open
pabloromeo opened this issue Feb 22, 2019 · 9 comments

pabloromeo commented Feb 22, 2019

We are currently creating on-demand environments for pull-requests and have started provisioning DBs for them using OSBA.
We create a new k8s namespace and provision the database there.
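
Roughly, the per-PR setup looks like this (a simplified sketch; the namespace, instance name, and parameters below are placeholders rather than our real values):

kubectl create namespace pr-NNN
svcat provision some-db \
  --namespace pr-NNN \
  --class azure-sql-12-0 \
  --plan basic \
  --param location=<azure-region> \
  --param resourceGroup=<resource-group>
svcat bind some-db --namespace pr-NNN --name some-db-binding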

The creation works as expected. However, when we later delete the entire namespace, to destroy the whole on-demand environment, OSBA does not deprovision and delete the database.
It fails with the following error:

time="2019-02-22T18:33:22Z" level=error msg="error executing job; not submitting any follow-up tasks" error="error executing deprovisioning step \"deleteARMDeployment\" for instance \"ff5e0f6e-3531-11e9-b9d6-e68df5afd861\": error executing deprovisioning step: error deleting ARM deployment: error deleting deployment \"1d787c49-7d99-419d-8b43-30647eeabece\" from resource group \"<<redacted>>\": pollingTrackerBase#updateRawBody: failed to unmarshal response body: StatusCode=0 -- Original Error: unexpected end of JSON input" job=executeDeprovisioningStep taskID=f2137f1b-fa76-4251-b669-f68423ba5ac4

Running:
svcat get instances -n pr-NNN

   NAME     NAMESPACE       CLASS        PLAN           STATUS
+---------+-----------+----------------+-------+-----------------------+
  some-db   pr-NNN      azure-sql-12-0   basic   DeprovisionCallFailed

In fact, it seems to have left our namespace stuck in a perpetual "Terminating" state.

Also, manually deprovisioning through svcat did not delete the database either:

:~$ svcat deprovision some-db -n pr-NNN
deleted some-db

But the database is still there, even after waiting quite a while. There are no messages in the OSBA logs either.
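
For what it's worth, the database still shows up when I list the databases on the SQL server directly with the Azure CLI, along the lines of the following (the resource group and server names are placeholders here):

az sql db list --resource-group <resource-group> --server <sql-server-name> --output table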

@zhongyi-zhang
Contributor

This looks like a similar issue to #672, on the Azure Go SDK side. We bumped the SDK version as they suggested. Have you upgraded to the latest OSBA release?

@pabloromeo
Author

I can give that a try. Is there any documentation on how to safely upgrade OSBA?

@zhongyi-zhang
Contributor

Helm can upgrade and roll back. Or, if you mean you're concerned about OSBA's behavior after upgrading, maybe you could verify it on a test cluster or minikube first?
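
For example, assuming OSBA was installed from the azure chart repo under the release name osba (adjust these to your actual install), the upgrade would be roughly:

helm repo update
helm upgrade osba azure/open-service-broker-azure --version 1.5.0 --reuse-values
helm rollback osba <previous-revision>   # only if you need to go back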

@pabloromeo
Author

I've upgraded OSBA to 1.5.0 and reinstalled the service-catalog from scratch, since it was in an unrecoverable state after the last attempt.

Now when the namespace is deleted, the kubectl delete namespace <namespace> operation times out after 10 minutes, and the catalog-apiserver pod gets terminated with an OOMKilled status. It restarts, but just ends up being terminated again after a few minutes.

Running svcat against the instances, I get:

   NAME     NAMESPACE       CLASS        PLAN                     STATUS
+---------+-----------+----------------+-------+------------------------------------------+
  some-db   pr-XXX      azure-sql-12-0   basic   DeprovisionBlockedByExistingCredentials

and against the bindings:

        NAME         NAMESPACE   INSTANCE            STATUS
+-------------------+-----------+----------+--------------------------+
  some-db-binding     pr-XXX      some-db    UnbindingRequestInFlight

@zhongyi-zhang
Contributor

Any OSBA logs for unbinding? Let's check whether the issue is on the OSBA unbinding side or the svcat side. Tracing back from https://github.com/kubernetes-incubator/service-catalog/blob/7fec2384506143b88910f575913f5fdbe1601d7f/pkg/controller/controller_binding.go#L796, I think it is possible that the unbinding request didn't reach OSBA.
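
For example, something along these lines (the namespace and deployment name here assume a default helm install of OSBA; adjust them to your cluster):

kubectl logs -n osba deployment/osba-open-service-broker-azure --since=1h | grep -i unbind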

@pabloromeo
Author

From what I was able to see while monitoring both the OSBA and service-catalog logs, the service-catalog is the component that is failing; the OSBA logs show nothing while the deprovisioning fails.

From the controller-manager logs, I see that it first attempts to delete the secrets associated with the db, which results in a 404.
That results in:
Error syncing ServiceBinding pr-NNN/some-db-binding (retry: 9/15): OSB client not found for the broker osba
and then
ServiceInstance "pr-NNN/some-db" v352: All associated ServiceBindings must be removed before this ServiceInstance can be deleted
then
'Warning' reason: 'DeprovisionBlockedByExistingCredentials' All associated ServiceBindings must be removed before this ServiceInstance can be deleted

Remember that the trigger for all of this was the deletion of the entire namespace.
It would appear that deleting the namespace removes the secrets before service-catalog takes over and tries to deprovision; by the time it does, the secrets are gone (hence the 404), so it cannot continue deleting the binding and, ultimately, the instance.

It keeps retrying the entire process over and over until the catalog-catalog-apiserver crashes, goes into a crash loop of retries, and restarts; I've seen it rack up hundreds of restarts. The only way to stop it was to reinstall the service-catalog with Helm again.

The result seems to be a deadlock: it keeps retrying the deletion of a secret that no longer exists. This leaves the instance in DeprovisionBlockedByExistingCredentials and the binding in UnbindingRequestInFlight.
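
For now I'm considering working around it by tearing things down explicitly before deleting the namespace, so the binding and instance are gone before their secrets disappear. Something like the following (names as above; I'm assuming, not confirming, that this avoids the race):

svcat unbind some-db -n pr-NNN
svcat get bindings -n pr-NNN      # wait until the binding is gone
svcat deprovision some-db -n pr-NNN
svcat get instances -n pr-NNN     # wait until the instance is gone
kubectl delete namespace pr-NNN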

@zhongyi-zhang
Contributor

I suppose the OSB client not found for the broker osba error was caused by the re-installation of the service-catalog. The BrokerClientManager in the service-catalog lost the clients for the registered brokers -- it no longer knows how to call OSBA. And the OSBA logs show nothing, so the unbinding request didn't reach OSBA at all... I am afraid this issue can't be solved on the OSBA side.
I found a similar issue here: kubernetes-retired/service-catalog#1574. Regardless of which service broker is used, deleting a namespace can be a problem for the service-catalog.

@pabloromeo
Author

Do you know of any way to reconnect the broker to those instances if the service-catalog is reinstalled? I've been forced to reinstall frequently, because when provisioning fails, the service-catalog starts crashing and never recovers.
I end up with bindings stuck in "Failed", "UnbindingRequestInFlight", or "ErrorInstanceRefsUnresolved" status that never recover, and instances in "ReferencesNonexistentBroker" status.

I must be doing something wrong, because at the moment this strategy of automatically provisioning and deprovisioning through OSBA seems unstable and not very reliable :(.

I just can't identify the problem. Once provisioning fails, there is no way to fix it, for example by forcing the removal of the bindings and instances from the service-catalog, even if I manually remove the Azure resources that were provisioned.
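
For reference, by forcing the removal I mean something like clearing the finalizers on the stuck objects so the service-catalog stops waiting on the broker. A rough sketch only (I'm not sure this is safe, since it bypasses the broker entirely):

kubectl -n pr-NNN patch servicebinding some-db-binding --type=merge -p '{"metadata":{"finalizers":[]}}'
kubectl -n pr-NNN patch serviceinstance some-db --type=merge -p '{"metadata":{"finalizers":[]}}'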

@zhongyi-zhang
Contributor

Did you try just re-registering OSBA with svcat register?
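
For example, something like the following (the URL is just an illustration of OSBA's in-cluster service address; use your actual service name and namespace, and pass whatever auth the broker was originally registered with):

svcat register osba --url http://<osba-service>.<osba-namespace>.svc.cluster.local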

I kind of understand how you feel... Still, I should say that your case is a mix of several situations... If it weren't for the Azure Go SDK issue, and svcat were healthy, OSBA handles common provision failures well, with svcat automatically calling deprovision.

When manually removing Azure resources, please try one more step -- also remove the related records in the OSBA store (the Redis instance).
