Tutorials Tests/Checks FAIL (dapr app not ready / timing) #1048

Closed
paulyuk opened this issue Jul 9, 2024 · 14 comments · Fixed by #1057

@paulyuk
Contributor

paulyuk commented Jul 9, 2024

Expected Behavior

Tutorials pass

Tutorials are technically unsupported parts of the Quickstart branch, but I don't like seeing regressions creep in like this. I'm treating it as a P1 (not a ship blocker, not a PR blocker) that we should investigate. It likely points to a real product issue.

Actual Behavior

Tutorials Fail

Here is a good example: the Dapr client app calls the target Dapr app's /neworder API and fails because the target is not yet available, likely because it hasn't started. I can reproduce timing issues like this on my local KinD deployment too.
Failure example

Steps to Reproduce the Problem

Use the link to Action above.

But more importantly, it repros when you do a multi app run of the app on your local machine, e.g. using

dapr run -k -f .

@msfussell
@yaron2

@paulyuk
Contributor Author

paulyuk commented Jul 9, 2024

We might need a dapr-cli maintainer to look at this, since it feels more like a health check issue with Multi-App Run on Kubernetes.

@paulyuk
Contributor Author

paulyuk commented Jul 9, 2024

@yaron2, you mentioned you may be able to help with this Dapr CLI Multi-App Run health check timing issue? TY

@mukundansundar
Contributor

@paulyuk The failure seems to be transient and more related to why dapr-dev-redis was going into CrashLoopBackOff: https://github.com/dapr/quickstarts/actions/runs/9702023201/job/26776912327#step:18:82...
Are you able to repro this locally?

@paulyuk
Contributor Author

paulyuk commented Jul 16, 2024

Hey @mukundansundar and @yaron2, the issue I reported above repros locally.

The Redis crash loop does not repro for me using KinD:

> dapr init -k --dev
⌛  Making the jump to hyperspace...
ℹ️  Note: To install Dapr using Helm, see here: https://docs.dapr.io/getting-started/install-dapr-kubernetes/#install-with-helm-advanced

ℹ️  Container images will be pulled from Docker Hub
✅  Deploying the Dapr control plane with latest version to your cluster...
✅  Deploying the Dapr dashboard with latest version to your cluster...
✅  Deploying the Dapr Redis with latest version to your cluster...
✅  Deploying the Dapr Zipkin with latest version to your cluster...
ℹ️  Applying "statestore" component to Kubernetes "default" namespace.
ℹ️  Applying "pubsub" component to Kubernetes "default" namespace.
ℹ️  Applying "appconfig" zipkin configuration to Kubernetes "default" namespace.
✅  Success! Dapr has been installed to namespace dapr-system. To verify, run `dapr status -k' in your terminal. To get started, go here: https://aka.ms/dapr-getting-started


> kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
dapr-dev-redis-master-0            1/1     Running   0          24m
dapr-dev-redis-replicas-0          1/1     Running   0          24m
dapr-dev-redis-replicas-1          1/1     Running   0          23m
dapr-dev-redis-replicas-2          1/1     Running   0          23m
dapr-dev-zipkin-7d5f8fc8b5-wds69   1/1     Running   0          24m

But the original failure above, where the Node app is not ready under dapr run -k -f ., still happens locally. Have you tried it?

> dapr init -k --dev
⌛  Making the jump to hyperspace...
ℹ️  Note: To install Dapr using Helm, see here: https://docs.dapr.io/getting-started/install-dapr-kubernetes/#install-with-helm-advanced

ℹ️  Container images will be pulled from Docker Hub
✅  Deploying the Dapr control plane with latest version to your cluster...
✅  Deploying the Dapr dashboard with latest version to your cluster...
✅  Deploying the Dapr Redis with latest version to your cluster...
✅  Deploying the Dapr Zipkin with latest version to your cluster...
ℹ️  Applying "statestore" component to Kubernetes "default" namespace.
ℹ️  Applying "pubsub" component to Kubernetes "default" namespace.
ℹ️  Applying "appconfig" zipkin configuration to Kubernetes "default" namespace.
✅  Success! Dapr has been installed to namespace dapr-system. To verify, run `dapr status -k' in your terminal. To get started, go here: https://aka.ms/dapr-getting-started
> dapr run -k -f dapr.yaml
ℹ️  This is a preview feature and subject to change in future releases.
ℹ️  Validating config and starting app "nodeapp"
ℹ️  Deploying app "nodeapp" to Kubernetes
ℹ️  Deploying service YAML "/home/pyadmin/src/paulyuk/quickstarts/tutorials/hello-kubernetes/node/.dapr/deploy/service.yaml" to Kubernetes
ℹ️  Deploying deployment YAML "/home/pyadmin/src/paulyuk/quickstarts/tutorials/hello-kubernetes/node/.dapr/deploy/deployment.yaml" to Kubernetes
⚠  Error deploying pod to Kubernetes. See logs directly from Kubernetes command line.
ℹ️  Writing log files to directory : /home/pyadmin/src/paulyuk/quickstarts/tutorials/hello-kubernetes/node/.dapr/logs
ℹ️  Validating config and starting app "pythonapp"
ℹ️  Deploying app "pythonapp" to Kubernetes
ℹ️  Deploying deployment YAML "/home/pyadmin/src/paulyuk/quickstarts/tutorials/hello-kubernetes/python/.dapr/deploy/deployment.yaml" to Kubernetes
ℹ️  Streaming logs for containers in pod "pythonapp-5cd765b8f4-zgqlm"
ℹ️  Writing log files to directory : /home/pyadmin/src/paulyuk/quickstarts/tutorials/hello-kubernetes/python/.dapr/logs
ℹ️  Starting to monitor Kubernetes pods for deletion.
== APP - pythonapp == HTTPConnectionPool(host='localhost', port=3500): Max retries exceeded with url: /neworder (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e5c485f50>: Failed to establish a new connection: [Errno 111] Connection refused'))
== APP - pythonapp == HTTPConnectionPool(host='localhost', port=3500): Max retries exceeded with url: /neworder (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e5c422150>: Failed to establish a new connection: [Errno 111] Connection refused'))
== APP - pythonapp == HTTPConnectionPool(host='localhost', port=3500): Max retries exceeded with url: /neworder (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e5c4282d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
== APP - pythonapp == HTTPConnectionPool(host='localhost', port=3500): Max retries exceeded with url: /neworder (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e5c4283d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
== APP - pythonapp == HTTPConnectionPool(host='localhost', port=3500): Max retries exceeded with url: /neworder (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e5c4220d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
== APP - pythonapp == HTTPConnectionPool(host='localhost', port=3500): Max retries exceeded with url: /neworder (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e5c415cd0>: Failed to establish a new connection: [Errno 111] Connection refused'))
== APP - pythonapp == HTTPConnectionPool(host='localhost', port=3500): Max retries exceeded with url: /neworder (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e5c432450>: Failed to establish a new connection: [Errno 111] Connection refused'))
== APP - pythonapp == HTTPConnectionPool(host='localhost', port=3500): Max retries exceeded with url: /neworder (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e5c4385d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
== APP - pythonapp == HTTP 500 => {"errorCode":"ERR_DIRECT_INVOKE","message":"failed to invoke, id: nodeapp, err: failed to invoke target nodeapp after 3 retries. Error: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.244.0.15:50002
...

The key error is the one showing that nodeapp isn't available, so calls to it from pythonapp fail:

== APP - pythonapp == HTTP 500 => {"errorCode":"ERR_DIRECT_INVOKE","message":"failed to invoke, id: nodeapp, err: failed to resolve address for 'nodeapp-dapr.default.svc.cluster.local': lookup nodeapp-dapr.default.svc.cluster.local on 10.96.0.10:53: no such host"}

After about 32 seconds of retries, the end-to-end flow works locally on KinD:

== APP - nodeapp == Got a new order! Order ID: 32
== APP - nodeapp == Successfully persisted state for Order ID: 32
== APP - nodeapp == Got a new order! Order ID: 33
== APP - nodeapp == Successfully persisted state for Order ID: 33
== APP - nodeapp == Got a new order! Order ID: 34
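
The three failure shapes in the log above look like they belong to different stages of startup. The following is a minimal Python sketch that sorts such log lines into stages. The `classify` helper and its stage labels are illustrative assumptions, not part of Dapr or the quickstart, and the mapping from message to cause is an interpretation of the log rather than a confirmed diagnosis.

```python
def classify(line: str) -> str:
    """Map a pythonapp log line to a rough startup stage (illustrative labels)."""
    if "Connection refused" in line and "port=3500" in line:
        # The app could not open a connection to its own sidecar on localhost:3500.
        return "own-sidecar-not-listening"
    if "ERR_DIRECT_INVOKE" in line and "no such host" in line:
        # The sidecar answered, but the target's Dapr service name did not resolve.
        return "target-service-not-resolvable"
    if "ERR_DIRECT_INVOKE" in line and "code = Unavailable" in line:
        # The name resolved, but the target sidecar was not accepting connections.
        return "target-sidecar-unavailable"
    return "other"


# Shortened versions of the lines shown in this thread.
samples = [
    "HTTPConnectionPool(host='localhost', port=3500): Max retries exceeded "
    "with url: /neworder ([Errno 111] Connection refused)",
    'HTTP 500 => {"errorCode":"ERR_DIRECT_INVOKE","message":"failed to resolve '
    'address: no such host"}',
    'HTTP 500 => {"errorCode":"ERR_DIRECT_INVOKE","message":"rpc error: '
    'code = Unavailable"}',
]

for sample in samples:
    print(classify(sample))
```

Only the last two stages involve a request that actually reached the local sidecar.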

@paulyuk
Contributor Author

paulyuk commented Jul 17, 2024

Note @greenie-msft and I also tried to put Dapr Resiliency in place, and Orders still got skipped until the NodeApp was ready!

We added ./resources/resiliency.yaml:

apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: myresiliency
# similar to subscription and configuration specs, scopes lists the Dapr App IDs that this
# resiliency spec can be used by.
spec:
  # policies is where timeouts, retries and circuit breaker policies are defined. 
  # each is given a name so they can be referred to from the targets section in the resiliency spec.
  policies:
    # retries are named templates for retry configurations and are instantiated for life of the operation.
    retries:
      retryInvokeForever:
        policy: constant
        maxInterval: 5s
        maxRetries: -1 # retry indefinitely

  # targets are what named policies are applied to. Dapr supports 3 target types - apps, components and actors
  targets:
    apps:
      nodeapp:
        retry: retryInvokeForever

and modified the common section of dapr.yaml to apply the resiliency policy CRD:

version: 1
apps:
  - appDirPath: ./node
    appID: nodeapp
    appPort: 3000
    containerImage: ghcr.io/dapr/samples/hello-k8s-node:latest
    createService: true
  - appDirPath: ./python
    appID: pythonapp
    containerImage: ghcr.io/dapr/samples/hello-k8s-python:latest
common: # optional section for variables shared across apps  
  resourcesPath: ./resources # any dapr resources to be shared across apps

It is almost as if this type of exception (Max retries exceeded with url: /neworder) is not covered by Dapr resiliency. Unless the try/catch in the sample is interfering?

@yaron2
Member

yaron2 commented Jul 17, 2024

The issue here is that the app can't reach its Dapr instance because the latter isn't up yet. This means the resiliency policy is ineffective, as the requests never hit Dapr to begin with. This applies to all errors that contain "connection refused".
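
One way an app could avoid sending requests before its own sidecar is listening is to poll the sidecar's health endpoint first. The following is a minimal Python sketch, assuming the sidecar HTTP API is on localhost:3500 (as in the logs in this thread) and using Dapr's `/v1.0/healthz` endpoint. The helper names, timeout, and interval are illustrative and are not taken from the quickstart's code.

```python
import time
import urllib.error
import urllib.request

# Assumption: the sidecar's HTTP API listens on localhost:3500, as in the logs above.
HEALTH_URL = "http://localhost:3500/v1.0/healthz"


def sidecar_is_healthy(url: str = HEALTH_URL) -> bool:
    """Return True if the health endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts and non-2xx responses all count as "not ready".
        return False


def wait_for_sidecar(probe=sidecar_is_healthy, timeout_s: float = 60.0,
                     interval_s: float = 1.0, sleep=time.sleep) -> bool:
    """Poll probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_for_sidecar()` before the first `/neworder` request would only cover the "connection refused" phase. It says nothing about whether nodeapp itself is ready, but once requests reach the sidecar, a resiliency policy at least has a chance to apply.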

@yaron2
Member

yaron2 commented Jul 17, 2024

I'm investigating

@paulyuk
Contributor Author

paulyuk commented Jul 17, 2024

Thank you. It put my brain in a loop thinking about how your app's Dapr sidecar can check on the other app's sidecar if it's not up, or if you don't know the healthz endpoint to hit for the remote app.

@greenie-msft
Contributor

Thanks for the explanation, Yaron. So the retries we're seeing in the logs are coming from python http client?

@yaron2
Member

yaron2 commented Jul 22, 2024

Thanks for the explanation, Yaron. So the retries we're seeing in the logs are coming from python http client?

Yes

@paulyuk
Contributor Author

paulyuk commented Jul 22, 2024

@yaron2 @msfussell - per our chat, the tactical solution for 1.14 is to revert to individual dapr run -- ... commands, since Multi-App Run doesn't know about the dependencies and ordering priorities that affect timing. I filed a separate issue requesting support for that: dapr/cli#1435

I will take ownership of the 1.14 scoped fix so we get tests passing again. PR on the way.
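
The ordering gap described above can be illustrated with a small Python sketch: given a map of which app needs which, a topological sort yields a start order in which dependencies come first. The `depends_on` map is hypothetical. Per this thread, Multi-App Run has no such setting, so this only shows the idea behind the request, not an existing CLI feature.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: app ID -> set of app IDs that must be running first.
# This is not a real dapr.yaml field; it only illustrates the concept.
depends_on = {
    "nodeapp": set(),
    "pythonapp": {"nodeapp"},
}


def start_order(deps):
    """Return app IDs so that every app comes after the apps it depends on."""
    return list(TopologicalSorter(deps).static_order())


print(start_order(depends_on))  # ['nodeapp', 'pythonapp']
```

Ordering alone would not remove the timing issue. A dependency-aware runner would also need to wait for each app to report ready before starting the next one.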

@paulyuk
Contributor Author

paulyuk commented Jul 23, 2024

@paulyuk The failure seems to be transient and more related to why dapr-dev-redis was going into CrashLoopBackOff: https://github.com/dapr/quickstarts/actions/runs/9702023201/job/26776912327#step:18:82... Are you able to repro this locally?

Hey - I am hitting this issue now, and it is blocking tests with --dev init, but it does not occur locally. It only happens in the GH Actions runner.

@paulyuk
Contributor Author

paulyuk commented Jul 23, 2024

I'm filing a specific bug on the remaining failing issue with Redis crashloop:
dapr/cli#1436
@yaron2

@paulyuk
Contributor Author

paulyuk commented Jul 26, 2024

Fixed by #1057 and dapr/cli#1437.

@paulyuk paulyuk closed this as completed Jul 26, 2024
@github-project-automation github-project-automation bot moved this from Needs Owner to Done in v1.14 Release Tracking Board Jul 26, 2024

5 participants