
Tutorials Tests/Checks FAIL (dapr app not ready / timing) #1048

Closed
paulyuk opened this issue Jul 9, 2024 · 14 comments · Fixed by #1057
@paulyuk (Contributor) commented Jul 9, 2024

Expected Behavior

Tutorials pass

Tutorials are technically unsupported parts of the Quickstarts branch, but I don't like seeing regressions like this creep in, so I treat it as a P1 (non ship blocker, non PR blocker) that we should investigate. It likely points to a real product issue.

Actual Behavior

Tutorials Fail

Here is a good example where the dapr client app tries to call the target dapr app's /neworder API and fails because the target is not yet available, likely because it hasn't started yet. I can reproduce timing issues like this on my local KinD deployment too.
Failure example
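
For context, the pythonapp side of the tutorial is essentially a one-second post loop against its local sidecar. Here is a minimal sketch of that calling pattern (a paraphrase, not the exact sample code; the dapr-app-id header, port fallback, and timeout are assumptions):

```python
import json
import os
import time

import requests

# 3500 is the sidecar HTTP port seen in the failing logs.
dapr_port = os.getenv("DAPR_HTTP_PORT", "3500")
dapr_url = f"http://localhost:{dapr_port}/neworder"

n = 0
while True:
    n += 1
    order = {"data": {"orderId": n}}
    try:
        # The dapr-app-id header asks the local sidecar to route the
        # call to nodeapp's /neworder method.
        response = requests.post(
            dapr_url,
            data=json.dumps(order),
            headers={"dapr-app-id": "nodeapp", "Content-Type": "application/json"},
            timeout=5,
        )
        if not response.ok:
            print(f"HTTP {response.status_code} => {response.text}", flush=True)
    except requests.exceptions.RequestException as e:
        # While the local sidecar is still starting, this prints the
        # "Connection refused" lines seen later in this thread.
        print(e, flush=True)
    time.sleep(1)
```

Until both sidecars are up, those posts can only fail; the question is where the failure surfaces.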

Steps to Reproduce the Problem

Use the link to the Action run above.

More importantly, it repros when you do a Multi-App Run on your local machine, e.g. using:

```
dapr run -k -f .
```

@msfussell
@yaron2

@paulyuk added the bug (Something isn't working), P1, and unsupported labels on Jul 9, 2024
@paulyuk added this to the 1.14 milestone on Jul 9, 2024
@paulyuk (Contributor, Author) commented Jul 9, 2024

This might need a dapr-cli maintainer to look at, since it feels more like a health-check issue for Multi-App Run on Kubernetes.

@paulyuk (Contributor, Author) commented Jul 9, 2024

@yaron2 you mentioned you may be able to help with this dapr-cli Multi-App Run health-check timing? TY

@mukundansundar (Contributor) commented

@paulyuk The failure seems to be transient and more related to why dapr-dev-redis was going into CrashLoopBackOff: https://github.com/dapr/quickstarts/actions/runs/9702023201/job/26776912327#step:18:82...
Are you able to repro this locally?

@paulyuk (Contributor, Author) commented Jul 16, 2024

Hey @mukundansundar and @yaron2, the issue I reported above repros locally.

The Redis crash loop does not repro for me using KinD:

```
> dapr init -k --dev
⌛  Making the jump to hyperspace...
ℹ️  Note: To install Dapr using Helm, see here: https://docs.dapr.io/getting-started/install-dapr-kubernetes/#install-with-helm-advanced

ℹ️  Container images will be pulled from Docker Hub
✅  Deploying the Dapr control plane with latest version to your cluster...
✅  Deploying the Dapr dashboard with latest version to your cluster...
✅  Deploying the Dapr Redis with latest version to your cluster...
✅  Deploying the Dapr Zipkin with latest version to your cluster...
ℹ️  Applying "statestore" component to Kubernetes "default" namespace.
ℹ️  Applying "pubsub" component to Kubernetes "default" namespace.
ℹ️  Applying "appconfig" zipkin configuration to Kubernetes "default" namespace.
✅  Success! Dapr has been installed to namespace dapr-system. To verify, run `dapr status -k' in your terminal. To get started, go here: https://aka.ms/dapr-getting-started


> kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
dapr-dev-redis-master-0            1/1     Running   0          24m
dapr-dev-redis-replicas-0          1/1     Running   0          24m
dapr-dev-redis-replicas-1          1/1     Running   0          23m
dapr-dev-redis-replicas-2          1/1     Running   0          23m
dapr-dev-zipkin-7d5f8fc8b5-wds69   1/1     Running   0          24m
```

But the original crash above, where the Node app fails to load with dapr run -k -f ., is still happening locally. Have you tried it?

```
> dapr init -k --dev
⌛  Making the jump to hyperspace...
ℹ️  Note: To install Dapr using Helm, see here: https://docs.dapr.io/getting-started/install-dapr-kubernetes/#install-with-helm-advanced

ℹ️  Container images will be pulled from Docker Hub
✅  Deploying the Dapr control plane with latest version to your cluster...
✅  Deploying the Dapr dashboard with latest version to your cluster...
✅  Deploying the Dapr Redis with latest version to your cluster...
✅  Deploying the Dapr Zipkin with latest version to your cluster...
ℹ️  Applying "statestore" component to Kubernetes "default" namespace.
ℹ️  Applying "pubsub" component to Kubernetes "default" namespace.
ℹ️  Applying "appconfig" zipkin configuration to Kubernetes "default" namespace.
✅  Success! Dapr has been installed to namespace dapr-system. To verify, run `dapr status -k' in your terminal. To get started, go here: https://aka.ms/dapr-getting-started
> dapr run -k -f dapr.yaml
ℹ️  This is a preview feature and subject to change in future releases.
ℹ️  Validating config and starting app "nodeapp"
ℹ️  Deploying app "nodeapp" to Kubernetes
ℹ️  Deploying service YAML "/home/pyadmin/src/paulyuk/quickstarts/tutorials/hello-kubernetes/node/.dapr/deploy/service.yaml" to Kubernetes
ℹ️  Deploying deployment YAML "/home/pyadmin/src/paulyuk/quickstarts/tutorials/hello-kubernetes/node/.dapr/deploy/deployment.yaml" to Kubernetes
⚠  Error deploying pod to Kubernetes. See logs directly from Kubernetes command line.
ℹ️  Writing log files to directory : /home/pyadmin/src/paulyuk/quickstarts/tutorials/hello-kubernetes/node/.dapr/logs
ℹ️  Validating config and starting app "pythonapp"
ℹ️  Deploying app "pythonapp" to Kubernetes
ℹ️  Deploying deployment YAML "/home/pyadmin/src/paulyuk/quickstarts/tutorials/hello-kubernetes/python/.dapr/deploy/deployment.yaml" to Kubernetes
ℹ️  Streaming logs for containers in pod "pythonapp-5cd765b8f4-zgqlm"
ℹ️  Writing log files to directory : /home/pyadmin/src/paulyuk/quickstarts/tutorials/hello-kubernetes/python/.dapr/logs
ℹ️  Starting to monitor Kubernetes pods for deletion.
== APP - pythonapp == HTTPConnectionPool(host='localhost', port=3500): Max retries exceeded with url: /neworder (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e5c485f50>: Failed to establish a new connection: [Errno 111] Connection refused'))
== APP - pythonapp == HTTPConnectionPool(host='localhost', port=3500): Max retries exceeded with url: /neworder (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e5c422150>: Failed to establish a new connection: [Errno 111] Connection refused'))
== APP - pythonapp == HTTPConnectionPool(host='localhost', port=3500): Max retries exceeded with url: /neworder (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e5c4282d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
== APP - pythonapp == HTTPConnectionPool(host='localhost', port=3500): Max retries exceeded with url: /neworder (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e5c4283d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
== APP - pythonapp == HTTPConnectionPool(host='localhost', port=3500): Max retries exceeded with url: /neworder (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e5c4220d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
== APP - pythonapp == HTTPConnectionPool(host='localhost', port=3500): Max retries exceeded with url: /neworder (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e5c415cd0>: Failed to establish a new connection: [Errno 111] Connection refused'))
== APP - pythonapp == HTTPConnectionPool(host='localhost', port=3500): Max retries exceeded with url: /neworder (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e5c432450>: Failed to establish a new connection: [Errno 111] Connection refused'))
== APP - pythonapp == HTTPConnectionPool(host='localhost', port=3500): Max retries exceeded with url: /neworder (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e5c4385d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
== APP - pythonapp == HTTP 500 => {"errorCode":"ERR_DIRECT_INVOKE","message":"failed to invoke, id: nodeapp, err: failed to invoke target nodeapp after 3 retries. Error: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.244.0.15:50002
...
```

The key error is the one showing that nodeapp isn't available, hence calls to it from pythonapp fail:

```
== APP - pythonapp == HTTP 500 => {"errorCode":"ERR_DIRECT_INVOKE","message":"failed to invoke, id: nodeapp, err: failed to resolve address for 'nodeapp-dapr.default.svc.cluster.local': lookup nodeapp-dapr.default.svc.cluster.local on 10.96.0.10:53: no such host"}
```

After about 32 seconds (of retries), the e2e flow works locally on KinD:

```
== APP - nodeapp == Got a new order! Order ID: 32
== APP - nodeapp == Successfully persisted state for Order ID: 32
== APP - nodeapp == Got a new order! Order ID: 33
== APP - nodeapp == Successfully persisted state for Order ID: 33
== APP - nodeapp == Got a new order! Order ID: 34
```

@paulyuk (Contributor, Author) commented Jul 17, 2024

Note: @greenie-msft and I also tried putting Dapr Resiliency in place, and orders still got skipped until nodeapp was ready!

We added ./resources/resiliency.yaml:

```yaml
apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: myresiliency
# similar to subscription and configuration specs, scopes lists the Dapr App IDs that this
# resiliency spec can be used by.
spec:
  # policies is where timeouts, retries and circuit breaker policies are defined. 
  # each is given a name so they can be referred to from the targets section in the resiliency spec.
  policies:
    # retries are named templates for retry configurations and are instantiated for the life of the operation.
    retries:
      retryInvokeForever:
        policy: constant
        maxInterval: 5s
        maxRetries: -1 # retry indefinitely

  # targets are what named policies are applied to. Dapr supports 3 target types - apps, components and actors
  targets:
    apps:
      nodeapp:
        retry: retryInvokeForever
```

and modified the dapr.yaml common section to apply the resiliency policy CRD:

```yaml
version: 1
apps:
  - appDirPath: ./node
    appID: nodeapp
    appPort: 3000
    containerImage: ghcr.io/dapr/samples/hello-k8s-node:latest
    createService: true
  - appDirPath: ./python
    appID: pythonapp
    containerImage: ghcr.io/dapr/samples/hello-k8s-python:latest
common: # optional section for variables shared across apps  
  resourcesPath: ./resources # any dapr resources to be shared across apps
```

It is almost as if this type of exception (Max retries exceeded with url: /neworder) does not factor into Dapr resiliency. Unless the sample's try/except is interfering?
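
One way to test that hypothesis (a hypothetical diagnostic, not the sample's actual code): separate client-side connection failures, which never reach the sidecar and so can never trigger a resiliency policy, from Dapr-level invocation errors, which are the class a retry policy can act on:

```python
import requests

def post_order(url: str, payload: dict) -> None:
    try:
        response = requests.post(
            url, json=payload, headers={"dapr-app-id": "nodeapp"}, timeout=5
        )
    except requests.exceptions.ConnectionError as e:
        # The request never left the app: the local sidecar isn't
        # listening yet, so no Resiliency policy can possibly apply.
        print(f"sidecar unreachable (resiliency cannot help): {e}", flush=True)
        return
    if not response.ok:
        # The sidecar answered but invocation failed (e.g. ERR_DIRECT_INVOKE).
        # This is the class of failure the retry policy above targets.
        print(f"dapr invoke error {response.status_code}: {response.text}", flush=True)
```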

@yaron2 (Member) commented Jul 17, 2024

The issue here is that the app can't reach its Dapr instance because the latter isn't up yet. This means the resiliency policy is ineffective, as the requests never hit Dapr to begin with. This applies to all the errors that contain "connection refused".
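
Concretely, one way to sidestep this race on the app side is to hold calls until the local sidecar answers its health endpoint. A minimal sketch, assuming the default HTTP port 3500 and the documented /v1.0/healthz endpoint (which returns 204 when healthy):

```python
import time

import requests

DAPR_HEALTH_URL = "http://localhost:3500/v1.0/healthz"  # local sidecar health endpoint

def wait_for_sidecar(timeout_s: float = 60.0) -> None:
    """Poll the local Dapr sidecar until it reports healthy (HTTP 204)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(DAPR_HEALTH_URL, timeout=1).status_code == 204:
                return
        except requests.exceptions.ConnectionError:
            pass  # sidecar not listening yet; keep polling
        time.sleep(1)
    raise TimeoutError("Dapr sidecar did not become healthy in time")
```

Note this only covers the local sidecar; the later ERR_DIRECT_INVOKE / no-such-host errors are the remote nodeapp sidecar still coming up, which this check cannot see.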

@yaron2 (Member) commented Jul 17, 2024

I'm investigating

@paulyuk (Contributor, Author) commented Jul 17, 2024

Thank you. It put my brain in a loop, thinking about how your app's Dapr sidecar can check on the other app's sidecar when it isn't up yet, or when you don't know which healthz endpoint you're trying to hit for the remote app.

@greenie-msft (Contributor) commented

Thanks for the explanation, Yaron. So the retries we're seeing in the logs are coming from the Python HTTP client?

@yaron2 (Member) commented Jul 22, 2024

> Thanks for the explanation, Yaron. So the retries we're seeing in the logs are coming from the Python HTTP client?

Yes

@paulyuk (Contributor, Author) commented Jul 22, 2024

@yaron2 @msfussell - per our chat, the tactical solution for this in 1.14 is to revert back to single `dapr run -- ...` commands, since Multi-App Run doesn't know about the dependencies and ordering priorities that affect timing. I filed a separate issue requesting support for that here: dapr/cli#1435

I will take ownership of the 1.14 scoped fix so we get tests passing again. PR on the way.

@paulyuk (Contributor, Author) commented Jul 23, 2024

> @paulyuk The failure seems to be transient and more related to why dapr-dev-redis was going into CrashLoopBackOff: https://github.com/dapr/quickstarts/actions/runs/9702023201/job/26776912327#step:18:82... Are you able to repro this locally?

Hey - I am hitting this issue now and it is blocking tests with --dev init, but it does not occur locally; it's only in the GH Action runner.

@paulyuk (Contributor, Author) commented Jul 23, 2024

I'm filing a specific bug on the remaining failure with the Redis crash loop: dapr/cli#1436

cc @yaron2

@paulyuk (Contributor, Author) commented Jul 26, 2024

Fixed by #1057 and dapr/cli#1437.

@paulyuk closed this as completed on Jul 26, 2024