
dask: capture live logs from running Dask scheduler and worker pods #610

Open
tiborsimko opened this issue Nov 13, 2024 · 0 comments · May be fixed by reanahub/reana#850 or #616
@tiborsimko (Member)
Current behaviour

When the administrator enables the "live logs" feature on the REANA deployment:

$ cd reana
$ git diff -w | cat
diff --git a/helm/configurations/values-dev.yaml b/helm/configurations/values-dev.yaml
index c264323..b021a89 100644
--- a/helm/configurations/values-dev.yaml
+++ b/helm/configurations/values-dev.yaml
@@ -15,7 +15,7 @@ components:
     environment:
       REANA_RUNTIME_KUBERNETES_KEEP_ALIVE_JOBS_WITH_STATUSES: failed
       REANA_OPENSEARCH_USE_SSL: false
-        REANA_OPENSEARCH_ENABLED: false
+      REANA_OPENSEARCH_ENABLED: true
   reana_workflow_engine_cwl:
     image: docker.io/reanahub/reana-workflow-engine-cwl
   reana_workflow_engine_yadage:
@@ -36,7 +36,7 @@ pgbouncer:

 # OpenSearch configuration for dev environment
 opensearch:
-  enabled: false
+  enabled: true
   tls:
     generate: false
   singleNode: true
@@ -69,9 +69,12 @@ opensearch:

 # FluentBit configuration for dev environment
 fluent-bit:
-  enabled: false
+  enabled: true
   outputConfig:
     tls: "Off"
     tlsCaFile: ""
   extraVolumes: []
   extraVolumeMounts: []
+
+dask:
+  enabled: true
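
For context, live log collection relies on Fluent Bit tailing container log files from the node. A minimal sketch of a tail input scoped to the Dask run pods is shown below; the tag name and the exact path glob are assumptions for illustration, not the REANA chart's actual configuration:

```ini
[INPUT]
    Name    tail
    # Dask scheduler/worker pod logs share the reana-run-dask- name prefix
    Path    /var/log/containers/reana-run-dask-*.log
    Tag     kube.dask.*
```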

And when a user runs a Dask workflow:

cd reana-demo-dask-coffea
reana-client run -w dask

Then the "live logs" do not show any progress for the Dask jobs; the user sees only:

$ reana-client logs -w dask --follow -i 10
2024-11-13 09:26:46,169 | root | MainThread | INFO | Publishing step:0, cmd: python analysis.py, total steps 1 to MQ

In other words, we are not capturing logs from the running Dask scheduler and worker pods, such as:

$ kubectl logs reana-run-dask-6f93ffcf-7e23-42de-a269-e2ce629f80f7-schedud7wfs
/usr/local/lib/python3.10/site-packages/distributed/cli/dask_scheduler.py:140: FutureWarning: dask-scheduler is deprecated and will be removed in a future release; use `dask scheduler` instead
  warnings.warn(
2024-11-13 09:26:36,664 - distributed.scheduler - INFO - -----------------------------------------------
2024-11-13 09:26:37,462 - distributed.scheduler - INFO - State start
2024-11-13 09:26:37,464 - distributed.scheduler - INFO - -----------------------------------------------
2024-11-13 09:26:37,465 - distributed.scheduler - INFO -   Scheduler at:    tcp://10.244.0.85:8786
2024-11-13 09:26:37,465 - distributed.scheduler - INFO -   dashboard at:                     :8787
2024-11-13 09:26:48,243 - distributed.scheduler - INFO - Receive client connection: Client-6839cd01-a1a1-11ef-800d-6ad72d89ef66
2024-11-13 09:26:48,244 - distributed.core - INFO - Starting established connection to tcp://10.244.0.86:39330
2024-11-13 09:26:52,455 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://10.244.0.87:46297', name: reana-run-dask-6f93ffcf-7e23-42de-a269-e2ce629f80f7-default-worker-239985c006, status: init, memory: 0, processing: 0>
2024-11-13 09:26:52,455 - distributed.scheduler - INFO - Starting worker compute stream, tcp://10.244.0.87:46297
2024-11-13 09:26:52,455 - distributed.core - INFO - Starting established connection to tcp://10.244.0.87:37246
2024-11-13 09:27:07,521 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://10.244.0.88:39151', name: reana-run-dask-6f93ffcf-7e23-42de-a269-e2ce629f80f7-default-worker-4e117f4fec, status: init, memory: 0, processing: 0>
2024-11-13 09:27:07,522 - distributed.scheduler - INFO - Starting worker compute stream, tcp://10.244.0.88:39151
2024-11-13 09:27:07,522 - distributed.core - INFO - Starting established connection to tcp://10.244.0.88:37438
$ kubectl logs reana-run-dask-6f93ffcf-7e23-42de-a269-e2ce629f80f7-defauljdnbt
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Jet_muonIdx2 => Muon
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Muon_fsrPhotonIdx => FsrPhoton
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Photon_electronIdx => Electron
  warnings.warn(
2024-11-13 09:27:04,324 - distributed.utils_perf - INFO - full garbage collection released 39.17 MiB from 104785 reference cycles (threshold: 9.54 MiB)
2024-11-13 09:27:14,928 - distributed.utils_perf - INFO - full garbage collection released 39.54 MiB from 104785 reference cycles (threshold: 9.54 MiB)
2024-11-13 09:27:27,675 - distributed.utils_perf - INFO - full garbage collection released 32.21 MiB from 86289 reference cycles (threshold: 9.54 MiB)
2024-11-13 09:27:40,523 - distributed.utils_perf - INFO - full garbage collection released 25.70 MiB from 67810 reference cycles (threshold: 9.54 MiB)
2024-11-13 09:27:55,872 - distributed.utils_perf - INFO - full garbage collection released 41.95 MiB from 110946 reference cycles (threshold: 9.54 MiB)
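
One possible direction for the fix (a sketch, not REANA's actual implementation): the scheduler and worker pods share a predictable name prefix derived from the workflow ID, so a log collector could select them by that prefix (or by a pod label) and stream their logs alongside the job logs. The helper names below are hypothetical.

```python
# Sketch: identify the Dask scheduler/worker pods of a workflow run by their
# name prefix, then stream their logs. Function names are illustrative only.

def dask_pods_for_workflow(pod_names, workflow_id):
    """Return the Dask scheduler and worker pods belonging to a workflow."""
    prefix = f"reana-run-dask-{workflow_id}"
    return [name for name in pod_names if name.startswith(prefix)]


def stream_dask_pod_logs(pod_name, namespace="default"):
    """Follow a pod's log stream (requires in-cluster access)."""
    # Hypothetical integration point using the official Kubernetes client.
    from kubernetes import client, config

    config.load_incluster_config()
    v1 = client.CoreV1Api()
    resp = v1.read_namespaced_pod_log(
        name=pod_name, namespace=namespace, follow=True, _preload_content=False
    )
    for chunk in resp.stream():
        yield chunk.decode("utf-8", errors="replace")


pods = [
    "reana-run-dask-6f93ffcf-7e23-42de-a269-e2ce629f80f7-schedud7wfs",
    "reana-run-dask-6f93ffcf-7e23-42de-a269-e2ce629f80f7-defauljdnbt",
    "reana-run-job-e476c924-95ee-4c11-9207-084afa3c7843",
]
print(dask_pods_for_workflow(pods, "6f93ffcf-7e23-42de-a269-e2ce629f80f7"))
```

The prefix match correctly picks up the scheduler and worker pods above while excluding the regular `reana-run-job-*` pod, whose logs are already captured.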

Expected behaviour

The user would expect to see live logs from the running Dask scheduler and workers when following the workflow execution with `--follow`.

Note

When the workflow finishes successfully, this is what we capture in the database:

$ reana-client logs -w dask
==> Workflow engine logs
2024-11-13 09:26:46,169 | root | MainThread | INFO | Publishing step:0, cmd: python analysis.py, total steps 1 to MQ
2024-11-13 09:31:43,425 | root | MainThread | INFO | Workflow 6f93ffcf-7e23-42de-a269-e2ce629f80f7 finished. Files available at /var/reana/users/00000000-0000-0000-0000-000000000000/workflows/6f93ffcf-7e23-42de-a269-e2ce629f80f7.
==> Job logs
==> Step: process
==> Workflow ID: 6f93ffcf-7e23-42de-a269-e2ce629f80f7
==> Compute backend: Kubernetes
==> Job ID: reana-run-job-e476c924-95ee-4c11-9207-084afa3c7843
==> Docker image: docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049
==> Command: python analysis.py
==> Status: finished
==> Started: 2024-11-13T09:26:46
==> Finished: 2024-11-13T09:31:39
==> Logs:
Matplotlib created a temporary cache directory at /tmp/matplotlib-fkzl98_7 because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[##                                      ] | 5% Completed | 19.4s
[#####                                   ] | 13% Completed | 44.3s
[########                                ] | 21% Completed |  1min  8.3s
[############                            ] | 30% Completed |  1min 30.9s
[###############                         ] | 38% Completed |  1min 53.6s
[##################                      ] | 45% Completed |  2min 16.2s
[#####################                   ] | 54% Completed |  2min 38.9s
[########################                ] | 62% Completed |  3min  1.5s
[###########################             ] | 69% Completed |  3min 24.1s
[###############################         ] | 77% Completed |  3min 46.9s
[##################################      ] | 85% Completed |  4min  9.5s
[#####################################   ] | 93% Completed |  4min 32.2s
all events 53446198
number of chunks 534

And this is what we capture in OpenSearch "live logs":

$ reana-client logs -w dask --follow
2024-11-13 09:26:46,169 | root | MainThread | INFO | Publishing step:0, cmd: python analysis.py, total steps 1 to MQ
2024-11-13 09:31:43,425 | root | MainThread | INFO | Workflow 6f93ffcf-7e23-42de-a269-e2ce629f80f7 finished. Files available at /var/reana/users/00000000-0000-0000-0000-000000000000/workflows/6f93ffcf-7e23-42de-a269-e2ce629f80f7.
==> Workflow has completed, you might want to rerun the command without the --follow flag.
@tiborsimko tiborsimko added this to Dask Nov 13, 2024
@tiborsimko tiborsimko moved this to Ready for work in Dask Nov 13, 2024
@tiborsimko tiborsimko changed the title dask: capure live logs from running Dask scheduler and worker pods dask: capture live logs from running Dask scheduler and worker pods Nov 13, 2024
@Alputer Alputer self-assigned this Nov 26, 2024
Alputer added a commit to Alputer/reana-workflow-controller that referenced this issue Nov 26, 2024
Alputer added a commit to Alputer/reana-workflow-controller that referenced this issue Nov 26, 2024
Status: In review