GUI button to update users' Della htr2hpc installs #44

Open
4 tasks done
cmroughan opened this issue Feb 3, 2025 · 11 comments
Labels
🆕 enhancement New feature or request

Comments

@cmroughan
Collaborator

cmroughan commented Feb 3, 2025

Requesting a code addition that would create a button in eScr which a user could press to automatically update their local installation of htr2hpc on Della. This would streamline any updates that might need to be made during the beta test, for example updating the logic for dynamic slurm resource allotment, if necessary.

  • write a custom celery task to run remote hpc setup
  • modify profile template to add instructions and button to kick off hpc setup task
  • add hpc setup script output to task report message
  • resolve coremltools installation problem
@rlskoeser
Contributor

@cmroughan I took a stab at this, changes are in #47

I did some testing locally and think it's generally working (although it currently installs the develop version of htr2hpc). I thought we could try this on the test site once you've finished testing the other changes.

@rlskoeser rlskoeser moved this from In Progress to Under Review in Iteration Planning Board Feb 5, 2025
@rlskoeser
Contributor

Initial testing feedback from @mnaydan: tested, but couldn't tell that anything was happening after pushing the button. On the second or third try the task ran and returned the success message (probably because the setup script runs more quickly by then).

I've updated the task; it should now send an info notification when it starts, with text indicating that setup will be slow on the first run. I'm not seeing that notification reliably: sometimes it doesn't show up at all, and at least once it showed up after the success message (although maybe that was for separate runs of the task?).

@mnaydan mnaydan moved this from Under Review to In Progress in Iteration Planning Board Feb 5, 2025
@cmroughan
Collaborator Author

Copied from comment on PR 47:

Checking in admin for the results of a new user's train task -- the task looks to have completed successfully, with the model indeed uploaded to eScr. The task report messaging shows that we did hit an error -- am I remembering correctly that right now the setup script is installing a different branch of htr2hpc? Maybe that's causing the disconnect:

Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'
Failed to load _MLCPUComputeDeviceProxy: No module named 'coremltools.libcoremlpython'
Failed to load _MLGPUComputeDeviceProxy: No module named 'coremltools.libcoremlpython'
Failed to load _MLNeuralEngineComputeDeviceProxy: No module named 'coremltools.libcoremlpython'
Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'
Failed to load _MLComputePlanProxy: No module named 'coremltools.libcoremlpython'
Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'
Failed to load _MLModelAssetProxy: No module named 'coremltools.libcoremlpython'
WARNING:py.warnings:/home/wh4213/.conda/envs/htr2hpc/lib/python3.11/site-packages/PIL/Image.py:2926: RuntimeWarning: divide by zero encountered in divide
  As = 1.0 / w

(See in admin the report for Task 4296.)

Update:

I attempted to replicate on my user by running the htr2hpc update task and then running a train task, but I did not get the missing coremltools error. Probably because it is already installed in my htr2hpc conda env and reinstalling the htr2hpc package did nothing to change that. But we will want to pin down why coremltools is not getting set up by default.

@rlskoeser
Contributor

@cmroughan I didn't add logic to add messages to a task report because I thought I had to set it up, but when I was demoing the new features I saw that there were task reports showing up in this list. Should I try adding the script output to the task report so it's easier to troubleshoot?

I'd forgotten about this coremltools problem, is that still unresolved?

@cmroughan
Collaborator Author

To my knowledge the absence of coremltools is still unresolved -- I thought maybe it was being caused by the script installing the develop branch and so perhaps the bug was there. Perhaps we just have to add a pip install coremltools to the first-time install script.
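One way to confirm whether an environment is affected before running a train task is to check that the compiled extension named in the warnings can be located. A minimal sketch; the helper name `module_available` is hypothetical, and it only checks that the module can be found, not that it loads cleanly:

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located in the current environment."""
    try:
        # For dotted names, find_spec imports the parent package first,
        # so a missing or broken parent raises ImportError here.
        return importlib.util.find_spec(name) is not None
    except ImportError:
        return False
```

Calling `module_available("coremltools.libcoremlpython")` with the Python interpreter from the htr2hpc conda env should return False on an account that shows the warnings, assuming the warnings mean what they say.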

If it's easy to quickly add the script output to the task report that could be helpful! Now that I'm thinking about it, there definitely might be troubleshooting that needs to happen with the setup script as new users sign on for the beta test.

@rlskoeser
Contributor

rlskoeser commented Feb 10, 2025

@cmroughan is it possible the coremltools issue is due to anaconda3/2024.2 vs. anaconda3/2024.6? I was using different versions inconsistently in the code and made them all 2024.6, but I just tested the setup script and I get an error with 2024.6, while it seems to work with 2024.2. I'm not sure how to reproduce your error, but coremltools is showing as installed and I can run the ketos and kraken scripts with no args without errors (I don't know if that's a sufficient test).

When I ran the script with anaconda3/2024.6, this is the error I saw:

ERROR: Could not find a version that satisfies the requirement torch==2.1 (from versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0, 2.4.1, 2.5.0, 2.5.1, 2.6.0)
ERROR: No matching distribution found for torch==2.1
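
The version list in that error can be checked mechanically. A rough sketch, assuming the message format shown above; `available_versions` and `pin_is_available` are hypothetical helpers, and the comparison is a simplification of pip's matching (it only treats trailing `.0` segments as optional):

```python
import re


def available_versions(pip_error: str) -> list[str]:
    """Extract the versions listed after 'from versions:' in a pip error."""
    match = re.search(r"\(from versions: ([^)]*)\)", pip_error)
    if not match:
        return []
    return [v.strip() for v in match.group(1).split(",") if v.strip()]


def _normalize(version: str) -> tuple[str, ...]:
    """Drop trailing zero segments so '2.1' and '2.1.0' compare equal."""
    parts = version.split(".")
    while len(parts) > 1 and parts[-1] == "0":
        parts.pop()
    return tuple(parts)


def pin_is_available(pin: str, pip_error: str) -> bool:
    """Return True if the pinned version appears in pip's candidate list."""
    wanted = _normalize(pin)
    return any(_normalize(v) == wanted for v in available_versions(pip_error))
```

For the error above, the candidate list starts at 2.2.0, so the `torch==2.1` pin has no match under the anaconda3/2024.6 module.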

@rlskoeser rlskoeser moved this from In Progress to Under Review in Iteration Planning Board Feb 10, 2025
@cmroughan
Collaborator Author

Oh perhaps that was it! I think to properly test it, either I need to wipe everything to do a fresh install on my account or grab someone with a Della account to run the htr2hpc install and then try to run a train job.

(The issue had appeared in Wouter's train job -- the training ran successfully, but presented those warnings in the task message (and might have hit an error if the script had needed to determine a best model rather than using the one kraken had found). I could ask Wouter to click the button to reinstall htr2hpc and run another train job, see if the same error appears.)

@rlskoeser
Contributor

@cmroughan this feature is ready for testing again. I figured out where/how the escriptorium code creates the task report (there are some celery signal handlers that run before and after the task) and was able to hook into that to find the task report and add messages with the script command and script output. The test site is updated with these changes.
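
As a rough illustration of the kind of message being added (not the code in #47), a helper could run the command through an injected callable and combine the command and its output into one report string. `format_setup_report` and `run_setup_and_report` are hypothetical names, and nothing here touches the eScriptorium task report API:

```python
from typing import Callable


def format_setup_report(command: str, output: str) -> str:
    """Combine the setup command and its output into one report message."""
    return (
        "Running setup script:\n"
        f"{command}\n\n"
        "setup script output:\n\n"
        f"{output.strip()}\n"
    )


def run_setup_and_report(run: Callable[[str], str], command: str) -> str:
    """Run `command` via the supplied runner and return the report text.

    `run` is an assumption: any callable that executes a command on the
    remote host and returns its captured output as a string.
    """
    return format_setup_report(command, run(command))
```

Passing the runner in as an argument keeps the formatting testable without a connection to Della.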

I've also switched the setup script and slurm code to use the anaconda3/2024.2 module, which I think resolves the setup problem. If you want to test/experiment manually, you can copy the new version of the user setup script to della and then change the conda_env_name variable to something different.

@rlskoeser
Contributor

rlskoeser commented Feb 10, 2025

@cmroughan if Wouter is up for removing his htr2hpc conda env and running the setup task again, that would be helpful!

I had this handy from my own testing, documenting in case useful:

conda env remove -n htr2hpc

@cmroughan
Collaborator Author

I deleted my env and ran a test to try the htr2hpc install with a clean slate. I get different errors than Mary's, but I do get Wouter's error when running a train task.

The output of the setup task, which hits errors at the pip install step:

Running setup script:
./user_setup.sh --skip-ssh-setup --reinstall-htr2hpc

setup script output:

Setting up your account for htr2hpc ....
This process may take five minutes or more on first run. Do not exit until the process completes.
Creating conda environment htr2hpc and installing dependencies
Retrieving notices: ...working... done
Channels:
- defaults
- conda-forge
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

Downloading and Extracting Packages: ...working... done
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... By downloading and using the CUDA Toolkit conda packages, you accept the terms and conditions of the CUDA End User License Agreement (EULA): https://docs.nvidia.com/cuda/eula/index.html

Installed package of scikit-learn can be accelerated using scikit-learn-intelex.
More details are available here: https://intel.github.io/scikit-learn-intelex

For example:

$ conda install scikit-learn-intelex
$ python -m sklearnex my_application.py

done
Installing pip dependencies: ...working... Ran pip subprocess with arguments:
['/home/croughan/.conda/envs/htr2hpc/bin/python', '-m', 'pip', 'install', '-U', '-r', '/tmp/condaenv.krtta1mm.requirements.txt', '--exists-action=b']
Pip subprocess output:
Processing /scratch/gpfs/croughan/.conda/envs/htr2hpc

failed
/home/croughan
Creating htr2hpc working directory in scratch: /scratch/gpfs/croughan/htr2hpc
Setup complete! 🚀 🚃

Pip subprocess error:
ERROR: file:///. (from -r /tmp/condaenv.krtta1mm.requirements.txt (line 2)) does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.

CondaEnvException: Pip failed

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.1.0 requires torch==2.1.0, but you have torch 2.4.1 which is incompatible.
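
Note that the script prints "Setup complete!" even though the pip step failed. A hedged sketch of how the task could flag this from the captured output; the marker list and the names `find_setup_errors` and `setup_succeeded` are assumptions for illustration, based only on the strings in this output:

```python
# Substrings treated as failure signs; taken from the output above,
# not from any documented conda or pip contract.
FAILURE_MARKERS = ("CondaEnvException", "ERROR:", "Pip subprocess error")


def find_setup_errors(output: str) -> list[str]:
    """Return the lines of setup output that contain a failure marker."""
    return [
        line.strip()
        for line in output.splitlines()
        if any(marker in line for marker in FAILURE_MARKERS)
    ]


def setup_succeeded(output: str) -> bool:
    """Treat setup as successful only if no failure marker appears."""
    return not find_setup_errors(output)
```

With a check like this, the task could report a warning instead of a plain success message when the output contains pip or conda errors.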

And then, when running a task, I get the same coremltools error that Wouter did:

Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'
Failed to load _MLCPUComputeDeviceProxy: No module named 'coremltools.libcoremlpython'
Failed to load _MLGPUComputeDeviceProxy: No module named 'coremltools.libcoremlpython'
Failed to load _MLNeuralEngineComputeDeviceProxy: No module named 'coremltools.libcoremlpython'
Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'
Failed to load _MLComputePlanProxy: No module named 'coremltools.libcoremlpython'
Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'
Failed to load _MLModelAssetProxy: No module named 'coremltools.libcoremlpython'

@cmroughan
Collaborator Author

Also, I don't know for sure, but I suspect that the initial "Running user setup script, on first run this may take a while..." is often failing to appear because clicking the setup button sends a POST that refreshes the page. Perhaps the message tries to appear but gets cleared out by the page refresh?

@mnaydan mnaydan moved this from Under Review to In Progress in Iteration Planning Board Feb 17, 2025
Projects
Status: In Progress
Development

No branches or pull requests

2 participants