Deploy sonar prototype to Fox #57

Closed
17 of 21 tasks
lars-t-hansen opened this issue Sep 7, 2023 · 9 comments

@lars-t-hansen
Collaborator

lars-t-hansen commented Sep 7, 2023

Since sonar and sonalyze now seem to be OK for multi-node systems, we should start collecting data on Fox.

There are some issues around whether Sonar is fast enough, which I'm addressing:

At the moment a Sonar invocation on a Fox node takes about 100 ms (a rough measurement, not a careful benchmark); as NordicHPC/sonar#86 shows, we should be able to cut this time by roughly half. Deploying to Fox probably does not depend on that fix, but it would be nice to get it done.

We would like to also implement the other three features for performance, reliability, and quality reasons, but it would be good to first measure their impact on Fox nodes with and without GPUs.

In addition, there's the deployment checklist:

@lars-t-hansen lars-t-hansen added component:infra Shell scripts, cron scripts, web server, etc and removed task:misc labels Oct 17, 2023
@lars-t-hansen
Collaborator Author

If cron is not possible, it may be possible to trigger the sonar run remotely by ssh, every few minutes, for every node. Either way, we'll figure it out.
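
A minimal sketch of what the ssh fallback could look like, assuming passwordless ssh from a central host to each node. The names `build_ssh_command` and `trigger_all`, the node names, and the remote command are hypothetical and not taken from the actual deployment; only the `ssh` options `BatchMode` and `ConnectTimeout` are standard OpenSSH.

```python
import subprocess
from typing import Callable, Iterable, List


def build_ssh_command(node: str, remote_cmd: str, timeout_s: int = 10) -> List[str]:
    """Build the argv for running remote_cmd on node without prompting."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # fail instead of asking for a password
        "-o", f"ConnectTimeout={timeout_s}",  # do not hang on an unreachable node
        node,
        remote_cmd,
    ]


def trigger_all(
    nodes: Iterable[str],
    remote_cmd: str,
    runner: Callable = subprocess.run,
) -> List[str]:
    """Run remote_cmd on every node in turn; return the nodes where it failed."""
    failed = []
    for node in nodes:
        try:
            result = runner(build_ssh_command(node, remote_cmd), timeout=60)
            ok = result.returncode == 0
        except (subprocess.TimeoutExpired, OSError):
            ok = False
        if not ok:
            failed.append(node)
    return failed
```

Something like `trigger_all(["node-a", "node-b"], "/path/to/run-sonar.sh")` would then be called every few minutes from a scheduler on the central host. The sketch runs the nodes sequentially; with many nodes a parallel variant would probably be needed.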

@lars-t-hansen lars-t-hansen added this to the M2 milestone Oct 27, 2023
@Sabryr
Contributor

Sabryr commented Nov 6, 2023

I think for the Fox meeting, it is better to propose that you deploy this on 4 compute nodes (2 with GPU and 2 without), i.e. not on all nodes at the same time. Even on the NRIS side (in consultation with Radovan et al.) we ask for "some" nodes for the first round. This gives the admins more assurance on a production system.

@lars-t-hansen
Collaborator Author

lars-t-hansen commented Nov 7, 2023

Some results from a meeting:

  • We run sonar and analysis as ec-fox-sw (actually this turns out to be controversial)
  • We use cron on each node, but we do this by copying a file into /etc/cron.d/ on each node so that the system runs it, rather than installing a crontab for the user
  • We run analysis on the interactive nodes, login-2 was mentioned, but eventually we move this into a VM (which will be the target of message-based, not disk-based, logging)
  • Sonar data currently go into /cluster/var/sonar/data
  • Lockfiles should go in /var/tmp per bhm
  • The sonar binary can go into /cluster/sbin (or with sonalyze/naicreport) for now, though eventually we want it in /node/bin, but that requires more infra and we want to be more in production than we currently are
  • The sonalyze/naicreport binaries can be anywhere really, perhaps in home dir for user running the analysis, /fp/home01/u01/ec-fox-sw/sonar/bin
  • Crontab sources live in /fp/home01/u01/ec-fox-sw/sonar/cron (and of course github)
  • Probably the dashboard will show a host as down if the scripts stop running (b/c heartbeat is gone) and we don't need any more than that
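
The lockfile arrangement above can be sketched as follows. This is an illustration under assumptions, not the script actually deployed on Fox: the function name `run_with_lock` and the lock file name `sonar.lock` are hypothetical, and only the `/var/tmp` location comes from the meeting notes. The idea is that a cron-triggered run is skipped if a previous run still holds the lock.

```python
import os
import subprocess
from typing import List, Optional


def run_with_lock(
    cmd: List[str],
    lock_dir: str = "/var/tmp",
    lock_name: str = "sonar.lock",  # hypothetical file name
) -> Optional[int]:
    """Run cmd unless a lockfile exists; return its exit code, or None if skipped."""
    lock_path = os.path.join(lock_dir, lock_name)
    try:
        # O_CREAT | O_EXCL makes creation atomic: exactly one process can succeed.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return None  # a previous run is still active, or left a stale lock behind
    try:
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))  # record the owner for debugging
        return subprocess.run(cmd).returncode
    finally:
        os.unlink(lock_path)
```

A lock created this way survives a crash or reboot in the middle of a run, so stale locks have to be cleaned up separately.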

@lars-t-hansen
Collaborator Author

The analysis is live on Fox (on a small number of nodes).

@lars-t-hansen
Collaborator Author

The analysis is now live on all Fox nodes.

@lars-t-hansen lars-t-hansen self-assigned this Nov 20, 2023
@lars-t-hansen
Collaborator Author

Lockfile cleanup can be accomplished by an @reboot action in the cron file.
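
As a rough sketch of what such an `@reboot` action might invoke: a small helper that deletes leftover lockfiles, on the reasoning that no sonar run can be in progress right after boot. The function name `remove_stale_locks` and the `sonar*.lock` naming pattern are assumptions for illustration, not the names used in the real cron file.

```python
import glob
import os
from typing import List


def remove_stale_locks(lock_dir: str, pattern: str = "sonar*.lock") -> List[str]:
    """Delete lockfiles matching pattern in lock_dir; return the paths removed."""
    removed = []
    for path in sorted(glob.glob(os.path.join(lock_dir, pattern))):
        try:
            os.unlink(path)
            removed.append(path)
        except FileNotFoundError:
            pass  # already gone, nothing to do
    return removed
```

Files in the directory that do not match the pattern are left alone, so the helper can safely point at a shared location such as /var/tmp.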

@lars-t-hansen
Collaborator Author

The analysis is live on Fox cpu, gpu, interactive, and login nodes. Data are exfiltrated to the remote analysis host. All blocking bugs are really sonar bugs.

@lars-t-hansen
Collaborator Author

lars-t-hansen commented Jan 17, 2024

Lockfile removal: ~sonar is /var/run/sonar on compute and gpu nodes, which would work just fine. On int and login nodes, however, it is /home/sonar, which is not fine. I have petitioned to change the homedir to /var/run/sonar on int and login; after that we can simply change the script to use /var/run/sonar instead of /var/tmp for lockfiles.

Edit: After discussion with the Fox admins: we'll move the homedir on the int and login nodes. The intention was always that it should be /var/run/sonar, and any new nodes that are created will get that too.

Edit^2: Except then we probably need a service to create /var/run/sonar on boot. One can do something via tmpfiles.d(5), but it's becoming fairly elaborate, especially if we want to move toward making sonar a systemd service anyway.

@lars-t-hansen
Collaborator Author

I'm kicking the lockfile issue down the road, issue #352.
