Windows ssh support by ipcluster #837

Closed
ottointhesky opened this issue Sep 20, 2023 · 45 comments
@ottointhesky
Contributor

Thanks to the standard OpenSSH package for Windows, it is straightforward to set up SSH servers on Windows 10 and above (https://learn.microsoft.com/en-us/windows-server/administration/openssh/openssh_install_firstuse?tabs=gui). Therefore, ipcluster should also support Windows (or, even better, mixed-OS environments including macOS) if possible.

So I decided to give it a try and set up a minimal cluster of two Windows computers using ipcluster based on the ssh profile. And I was successful :-)

I'm happy to contribute the necessary code changes if there is interest from the community, but it would require some discussion to achieve a proper implementation. As a start, some details on my solution:

Besides some obvious command/script changes (e.g. export -> set, rm -> del, test -e -> if exist ...), the trickiest part was starting the engines in such a way that they are not killed by Windows when the SSH connection is closed (on Linux the nohup command is used; Cygwin's nohup didn't work for me). The trick was to start the process with the 'CREATE_BREAKAWAY_FROM_JOB' flag (as mentioned in https://stackoverflow.com/questions/52449997/how-to-detach-python-child-process-on-windows-without-setsid). I have currently solved this with a separate nohup.py script, but maybe there are better and easier ways to achieve that; a rough sketch of the detaching trick is shown below.
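For illustration, a minimal sketch of such a detached start (my own example, not the nohup.py from the PR; it assumes Python 3.7+ on Windows, where subprocess exposes these creation flags):

import subprocess
import sys

def start_detached(cmd):
    # Start cmd so it survives the SSH session: break away from the session's job
    # object, detach from the console, and do not receive the parent's Ctrl+C events.
    flags = (
        subprocess.CREATE_BREAKAWAY_FROM_JOB
        | subprocess.DETACHED_PROCESS
        | subprocess.CREATE_NEW_PROCESS_GROUP
    )
    proc = subprocess.Popen(
        cmd,
        creationflags=flags,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        close_fds=True,
    )
    return proc.pid

if __name__ == "__main__":
    # placeholder command; a real launcher would start the engine process here
    print(start_detached([sys.executable, "-c", "import time; time.sleep(60)"]))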

Shutting down was also more difficult than expected. taskkill returned an 'Access denied' error message (for whatever reason), but using PowerShell and the Stop-Process command eventually did work, e.g. invoked as sketched below.
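A minimal sketch of what that call could look like from Python (illustrative only; the actual launcher code may invoke it differently):

import subprocess

def kill_process(pid):
    # Ask PowerShell to terminate the process; taskkill gave 'Access denied' here.
    subprocess.run(
        ["powershell.exe", "-Command", f"Stop-Process -Id {pid} -Force"],
        check=True,
    )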

Since shell commands will always differ (slightly) from OS to OS, there are basically two options to address the differences: either all shell commands are changed to (small) Python scripts which have an OS-independent calling pattern, or the OS on the other side of the SSH connection is known so that the correct command sequence can be sent. Both options have pros and cons: switching to Python from the very start requires a working Python instance on the engine computers, which might make troubleshooting and error reporting more difficult if Python is not working (properly). Using OS-dependent shell commands requires knowledge of the OS/shell in the first place (which can be worked out by a command that works on all platforms, e.g. echo "OS-CMD=%OS%;OS-PW=$env:OS;OS-LINUX=$OSTYPE;SHELL=$SHELL"; see the probe sketch below) and then storing this information (is an adaptation of the internal data structures necessary?).
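For illustration, a small probe along those lines could look like this (a sketch only; 'myhost' and the function name are placeholders, not part of the proposal):

import subprocess

PROBE = 'echo "OS-CMD=%OS%;OS-PW=$env:OS;OS-LINUX=$OSTYPE;SHELL=$SHELL"'

def detect_remote_shell(host):
    # Only the variables the remote shell knows get expanded; the rest echo literally.
    out = subprocess.run(["ssh", host, PROBE], capture_output=True, text=True, check=True).stdout
    if "OS-CMD=Windows_NT" in out:
        return "cmd"          # cmd.exe expanded %OS%
    if "OS-PW=Windows_NT" in out:
        return "powershell"   # PowerShell expanded $env:OS
    if "SHELL=/" in out:
        return "posix"        # bash/zsh expanded $SHELL
    return "unknown"

print(detect_remote_shell("myhost"))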

Looking forward to your comments!

@minrk
Member

minrk commented Sep 21, 2023

That would be great, thanks for offering! Without any ability to test, it's likely to end up unmaintained unless you or someone with access to Windows + SSH can volunteer to keep it maintained, because I can't do it.

Do you think you can install Windows SSH on the GitHub actions runner?

@ottointhesky
Contributor Author

You are right, testing is a must! I have zero experience with GitHub action runners but if there is a corresponding ssh ipcluster test for Linux I'm 100% sure that the same thing is achievable for Windows. Maybe the existing test could be simply extended by a Windows instance...

By the way, I'm also happy to maintain this feature since I will use it quite often in the future…

@minrk
Member

minrk commented Sep 21, 2023

We have Windows tests already (the runs-on field picks the OS), but I imagine there may be some additional installation or configuration necessary to make openssh available.

So you can probably start with a PR that duplicates the ssh test run to also run on Windows, then you can try figuring out what needs to be set up. Currently, the SSH tests use docker-compose to launch the cluster, which I guess won't work for testing Windows.

@ottointhesky
Contributor Author

I will try my best, but as mentioned before I'm new to GitHub action runners. Let's see how far I will get...

Meanwhile, we should start the discussion on how to technically support different OSes and shells. (It might be that ipcluster currently doesn't fully work on exotic Linux shells. Furthermore, under Windows you can configure OpenSSH to use cmd.exe or powershell.exe, which have totally different syntax.) As mentioned in my first post, the differences can be addressed either a) on the 'client' side via Python scripts or b) on the 'server' side by determining the client OS/shell.

ad a)
pros: + unified approach, less OS-dependent code
cons: - it might be difficult/impossible to find a unified way of correctly single- and/or double-quoting parameters on all OSes
- requires a correctly configured/installed Python from the very beginning, which could make troubleshooting difficult

ad b)
pros: + free of prerequisites
+ can support any OS and shell (correctly quoting parameters is possible)
cons: - more OS-dependent code required
- storage of the OS and shell type is required; adaptation of the internal data structures needed

Is it necessary to have a broader discussion on this topic?

@minrk
Member

minrk commented Sep 22, 2023

If we want to support Windows, the way to go is probably to implement as much as possible in Python (perhaps move all this to ipyparallel.cluster.ssh), such that the only command-line operations to execute are really /path/to/python.exe -m ipyparallel.cluster.ssh [operation]. That should limit the scope of string quoting/etc. we need to deal with.

@ottointhesky
Contributor Author

Sounds good to me and I will go in that direction.
Meanwhile, I have been working hard to get the Windows SSH cluster to run under Docker/GitHub (similar to the Linux version, also with the user ciuser). It took me some days, but I think I'm there now. I do the testing in my forked repository (if you want to have a look...).

I have to admit that Windows tries its best to NOT let you run its OpenSSH server under Docker. Luckily I was able to get around all the issues...

However, there is one thing in the Linux SSH cluster I do not understand: the SSH key pair is generated in the docker container, but I'm not able to pinpoint where the private key is copied from the docker container to the main runner VM (it is then used by ipcluster_config.py - https://github.com/ipython/ipyparallel/blob/main/ci/ssh/ipcluster_config.py). Can you give me a hint?

I have a solution for Windows, but I want to stick to the Linux solution as closely as possible...

@minrk
Member

minrk commented Sep 28, 2023

The ssh key is extracted from the running container with docker cp here

@ottointhesky
Contributor Author

Thanks very much for the hint!
As mentioned in the code, this should be moved to the corresponding Dockerfile or "next" to the docker compose statement. I suggest the latter, since we need something similar for Windows. Docker in "Windows mode" doesn't support 'docker cp' at all (one really annoying thing...), so I manually copy the ipyparallel files to the docker container using scp (the other option would be using git inside the docker container itself, but installing git in the currently used Windows image (no .NET framework) is not so straightforward).

My point is that we require a slightly different ssh section in test.yml for Windows and Linux anyway, so we could probably remove all docker commands from conftest.py. By the way, what's the point of ssh_dir? It returns the local ci/ssh directory but also checks whether the docker container is running. Two totally different things in one function? A bit confusing...

I suggest splitting that up into two functions, or maybe moving the actual ssh_dir code directly into ssh_key (I haven't found any other reference to ssh_dir).

Since my ssh docker test under Windows is now fully working, I will proceed with the actual implementation of ipyparallel.cluster.ssh...

@minrk
Member

minrk commented Sep 28, 2023

Feel free to reorganize it in a way that makes sense. The main goal is that the ssh tests should be skipped if the ssh context can't be set up. Originally, I wanted the fixture to call docker-compose build/up/etc., but that wasn't working for whatever reason.

@ottointhesky
Contributor Author

Today, I spent quite some time understanding how the parameter setting and actual function calling work in the app classes like IPCluster, etc.
Since we decided to go for the python solution - something like
/path/to/python.exe -m ipyparallel.cluster.ssh [operation] [--param1 value1] ...
I thought of applying the existing application/parameter concepts. But since I don't fully understand the concepts yet (including all the consequences), it's hard for me to tell whether the existing 'ipython' concepts are overkill and I should rather go for a simple 'argparse' implementation.
What do you suggest?
If we go for the IPython concept, which base class should be used for the "ssh app"? traitlets.config.application.Application, IPython.core.application.BaseIPythonApplication, ...

@minrk
Member

minrk commented Sep 29, 2023

I think a simple cli entrypoint with argparse makes the most sense. That ought to be quicker to load. If we find that we have some server-side configuration to do, we should switch it to a traitlets app to be consistent with config loading.
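For concreteness, a minimal argparse entrypoint along these lines might look as follows (the subcommand names and behavior are illustrative guesses, not the final ipyparallel.cluster.ssh interface):

import argparse
import os
import signal

def main():
    parser = argparse.ArgumentParser(prog="python -m ipyparallel.cluster.ssh")
    sub = parser.add_subparsers(dest="operation", required=True)

    p_exists = sub.add_parser("exists", help="check whether a path exists")
    p_exists.add_argument("path")

    p_kill = sub.add_parser("kill", help="terminate a process by pid")
    p_kill.add_argument("pid", type=int)

    args = parser.parse_args()
    if args.operation == "exists":
        print(os.path.exists(args.path))
    elif args.operation == "kill":
        os.kill(args.pid, signal.SIGTERM)

if __name__ == "__main__":
    main()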

@ottointhesky
Contributor Author

Thanks for the quick reply. So I will implement an argparse solution...

@ottointhesky
Contributor Author

I'm getting closer to a fully working implementation. But before actually replacing shell code in ipyparallel/cluster/launcher.py I may ask for your feedback regarding the proposed implementation:

I changed the name to ipyparallel/cluster/shellcmd because it allows sending commands to any shell. Hence, it's not limited to SSH, which makes it very versatile and also easy to test. I have already established a corresponding test that ensures the code works with all possible shells. I haven't tested the code on Linux yet, but I do not expect any surprises since it works under WSL on Windows.

The actual implementation consists of two classes:
ShellCommandSend and ShellCommandReceive, which have matching command functions: cmd_start, cmd_running, cmd_kill, cmd_mkdir, cmd_rmdir, cmd_exists and cmd_remove. So one can use that nice Python API and there is no need to mess with subprocess calls. The sender object internally builds the actual shellcmd call (e.g. python -m ipyparallel.cluster.shellcmd mkdir test_dir) and sends it to the corresponding shell. On the 'other side', the main function of shellcmd.py receives the command and calls the corresponding member function of ShellCommandReceive, which then actually executes the command. This way, it's straightforward to add additional parameters or further commands (see the rough sketch below).
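A rough sketch of that pairing (illustrative only; the real ShellCommandSend/ShellCommandReceive in the PR differ in detail):

import os
import subprocess

class ShellCommandSend:
    def __init__(self, shell_prefix):
        # e.g. ["ssh", "user@host"] for a remote shell, or [] for the local shell
        self.shell_prefix = shell_prefix

    def _send(self, *args):
        # build the shellcmd call and run it in the target shell
        cmd = self.shell_prefix + ["python", "-m", "ipyparallel.cluster.shellcmd", *args]
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

    def cmd_mkdir(self, path):
        self._send("mkdir", path)

    def cmd_exists(self, path):
        return self._send("exists", path).strip() == "True"

class ShellCommandReceive:
    # executed on the 'other side' by the main function of shellcmd.py
    def cmd_mkdir(self, path):
        os.makedirs(path, exist_ok=True)

    def cmd_exists(self, path):
        print(os.path.exists(path))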

During testing I discovered that calling the ipyparallel.cluster.shellcmd module takes quite some time. In the case of ssh it is also necessary that the remote version of shellcmd.py matches the local version (not so relevant for a final distribution, but a bit annoying during development/debugging). Hence I implemented an option that sends the ShellCommandReceive code (through the stdin pipe) to the other Python instance. This is approx. 10x faster than calling the ipyparallel.cluster.shellcmd module and also removes the need for matching file versions. Currently, both options are implemented (see the flag use_code_sending), but for simplicity we should eventually stick to one solution only. Although performance is probably not so important (at least for starting up a cluster), I tend to use the code-sending method. What's your opinion on that?

Final question: for establishing an SSH IPCluster, the SSH infrastructure needs to be set up on all computers involved. Furthermore, Python and the ipyparallel package are required in advance. But Python and ipyparallel wouldn't have to be a prerequisite if IPCluster supported an (engine) initialisation step based on shell scripts. IPCluster could even support different scripts for different systems (ShellCommandSend knows the system on the other side). This would be a particularly nice feature for Windows, since it doesn't come with Python by default. What do you think?

@ottointhesky
Contributor Author

ottointhesky commented Oct 12, 2023

I have been working hard to get the shellcmd functionality working on all platforms and shells. Relying only on the GitHub runners was way too ineffective (much slower compared to a local setup, and hard/impossible to analyze errors/problems). I now have two full development environments: one on Windows and one on Linux (Ubuntu). Although the setup took a while, I now have full debugging functionality on both systems, which was well worth it.

As mentioned before, the functionality is split into two classes, ShellCommandSend and ShellCommandReceive, with matching member functions. However, communicating from the sender to the receiver over 'normal' CLI calls is really a nightmare. As long as you have only simple 'words' as parameters things are easy, but with complex parameters (which include single quotes, double quotes, whitespace, etc.) things get really nasty. Windows cmd.exe, Windows PowerShell and ssh/bash all behave slightly differently, and we need to support all shells since OpenSSH can use cmd and PowerShell on Windows.

Things are much simpler if we send the command via Python, similar to e.g. the existing ssh_waitpid function. In that case we won't need (or at least won't use) a CLI interface and we can avoid the quoting horror. This makes the code more compact and easier to understand and maintain. If you agree, I will change the code accordingly.

The pytest test_ssh now works smoothly on my Windows and Linux machines, with one exception: the SSHProxy (-> SSHProxyEngineSetLauncher) tests do not work under Windows. I tried to understand/debug the problem, but it is difficult, since I do not fully understand the communication of the objects (yet). I'm sure it's not a problem with my changes, since I was able to pass the tests (multiple times) by debugging/stepping through the code. Hence, I guess there is some sort of synchronization problem on Windows. I will have another look later, but maybe it's quicker if somebody with good knowledge of the system tries to solve this issue.

@minrk
Member

minrk commented Oct 12, 2023

Thanks for working on this!

I don't have time to get into too much detail, but I will say that piping code to Python is a good idea, rather than importing an entrypoint on the other side. That will be a lot more reliable, I think. It does put small limitations on the far side of what code is acceptable, but since it's meant to be very simple RPC shell commands, that shouldn't be a problem. So if you send a Python script to be executed via stdin which listens for commands on lines, you should be able to send simple commands on lines, e.g. serialized as JSON for reasonable behavior without platform-sensitive command-line parsing.
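A bare-bones illustration of that JSON-lines idea (a sketch, not the eventual implementation): the receiver loop below would run in the remote Python process, reading one JSON command per line from stdin and writing one JSON reply per line, so no platform-sensitive shell quoting is involved.

import json
import os
import sys

def receive_loop(stream=sys.stdin):
    for line in stream:
        line = line.strip()
        if not line:
            continue
        cmd = json.loads(line)
        if cmd["op"] == "mkdir":
            os.makedirs(cmd["path"], exist_ok=True)
            reply = {"ok": True}
        elif cmd["op"] == "exists":
            reply = {"ok": True, "exists": os.path.exists(cmd["path"])}
        elif cmd["op"] == "exit":
            break
        else:
            reply = {"ok": False, "error": "unknown op"}
        print(json.dumps(reply), flush=True)

if __name__ == "__main__":
    receive_loop()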

BTW, feel free to do your work in a draft pull request. That will make conversations about specific code easier, and there's no need to wait for things to be 'complete' before you open one.

@ottointhesky
Contributor Author

Thanks very much for your quick reply.

Ok, then I will definitely go for the piping method. I will clean up my code a bit and send a draft pull request in the next days... :-)

@ottointhesky
Contributor Author

I haven't received any feedback on my PR yet. Is this because of the reviewers' lack of time, because the PR is not fully working under GitHub runners yet, or some other reason? Please get back to me and drop me a short note...

@minrk
Member

minrk commented Oct 23, 2023

Sorry, it's just because I've been busy. I clicked the 'approve and run' button, so it should be running on CI now. I will try to get to it when I can. Thank you so much for doing the work! I'll post a review on the PR.

@ottointhesky
Contributor Author

Thanks, and I didn't want to stress you. Since I have no experience with PRs, I wasn't sure if I did everything correctly. By the way, the pre-commit.ci run shows some errors. How can I fix those (the output is somehow truncated)? Is there a way to run this locally? Thanks!

@minrk
Member

minrk commented Oct 25, 2023

Yes, you can install pre-commit:

pip install pre-commit

and run

pre-commit install

in the repo (many repos use this tool). Then every time you commit, these checks will be run on the changes about to be committed. You can also run the hooks on already-committed and/or unstaged files with

pre-commit run --all-files

@ottointhesky
Contributor Author

Thanks very much for the pre-commit explanations!

Finally, I almost got the ssh tests with the Windows GitHub runner to work - see my last SSH test run. The last 'big' obstacle was a file-name issue with the engine logs (forward/backward slashes), which is why I never got the engine logs retrieved. Since it worked in my local environment (no idea why the different slashes work on my machine but not in the GitHub runner), I thought for a long time that the engines never got started. But this wasn't the problem...

Now I get the engine logs and can see the final problem: the engines try to register against an incorrect controller IP. At the end of the 'Set up docker-compose for ssh windows launcher' step I output the network addresses (ipconfig):

Ethernet adapter Ethernet 2: 10.1.106.1
Ethernet adapter vEthernet (nat): 172.22.160.1
Ethernet adapter vEthernet (a8fa81d95af0738): 172.27.144.1

But the engines try to register against 10.1.54.0:

2023-10-26 16:18:55.413 [IPEngine] Registering with controller at tcp://10.1.54.0:50868
2023-10-26 16:19:00.423 [IPEngine] CRITICAL | Registration timed out after 5.0 seconds

How do the engines determine the controller ip? Can this somehow be controlled?

Under Linux things work correctly:
ifconfig:

r-9da2dc09472a: 172.18.0.1
docker0: 172.17.0.1
eth0: 10.1.0.49

log:

2023-10-26 15:50:50.561 [IPEngine] Registering with controller at tcp://10.1.0.49:35785

@minrk
Member

minrk commented Oct 27, 2023

How do the engines determine the controller ip? Can this somehow be controlled?

There are two meanings of 'ip'. One is the 'bind' ip address (likely 0.0.0.0 for ssh), set with --ip or --engine-ip, and one is the 'connect' ip, used only when bind is ambiguous (i.e. 0.0.0.0 or 127.0.0.1). This latter one is set with 'location':

ipcontroller --ip 0.0.0.0 --location 10.1.54.0

or

ipcontroller --ip 0.0.0.0 --location 'controller-container-name'

If they should differ, bind ips can be set separately for engines and clients with --engine-ip, --client-ip, etc.

The default for the location is to use socket.gethostname(), which usually gets the container name or id. In containers, hostnames are generally preferred to ips, since connect ips are often not knowable inside the container. Hostnames may not work if the client is actually on the host system, which doesn't get DNS resolution for containers, unlike within other containers.
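As a quick local check (illustrative only), you can see what the default location would be on the controller host, since ipcontroller falls back to socket.gethostname():

import socket

hostname = socket.gethostname()
print("default location:", hostname)
# engines connect to whatever this name resolves to from their side
print("resolves locally to:", socket.gethostbyname(hostname))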

@ottointhesky
Contributor Author

Thanks again for the info. By adding a corresponding entry to the container's hosts file (of course the docker add-host command does NOT support Windows...), I was able to 'place' the correct IP address. Although I tried the different IP addresses of the different host network adapters, I never got the connection to work. The IPEngine always reports a 'registration timed out'...

I guess the GitHub runner's Windows firewall blocks the network traffic. This is all so annoying, since there is no issue on a local machine...
I hope I can find a solution for this one too.

@ottointhesky
Contributor Author

YEEESSS. Turning off the Windows firewall was the final missing link. As you can see here, the ssh tests did work (persistence paid off eventually).
As mentioned before, the SSHProxy tests do not work yet. This is a 'real' issue, since the behaviour is identical on my local machines.

I will update my PR next week. So stay tuned!

@ottointhesky
Contributor Author

Finally, everything is working, and the code, GitHub workflows, etc. are cleaned up and (hopefully) properly commented. So from my side the PR is ready to go...

@ottointhesky
Contributor Author

I don't want to be annoying, but did you have time to look at the pull request? If you have any suggestions for improvements/changes, I'm happy to adapt the code. I'm also willing to update the documentation regarding the SSH Windows support. But I would like to finish this work as soon as possible. Thanks

@ottointhesky
Contributor Author

Thanks for your feedback on the PR. I was busy this week, but I can hopefully work on the code again next week.

@ottointhesky
Contributor Author

My apologies for not working on the PR for such a long time. There were so many urgent issues in the last 6 months that I never had a few days in a row to finish up the PR. But now I'm back!

After resyncing with the main project, I now have an issue when installing ipyparallel in the docker container for the ssh test. In my local container everything works perfectly fine, but NOT in the GitHub Windows runner. The main issue is that the editable pip installation of ipyparallel fails within the container. I have only basic knowledge of the pip installation process and JupyterLab, so I don't understand the problem. Could you point me in the right direction?

Please have a look at the end of the step 'Set up docker-compose for ssh windows launcher' (not the failed ssh test step) of this GitHub runner test. There, you will find messages like

lab/src/clusters.tsx(99,33): error TS2344: Type 'import("C:/src/ipyparallel/node_modules/@lumino/widgets/types/widget").Widget' does not satisfy the constraint 'import("C:/src/ipyparallel/node_modules/@jupyterlab/apputils/node_modules/@lumino/widgets/types/widget").Widget'. [...]

Any idea? As mentioned before, it must be related to the GitHub Windows runner (no issue in a local docker container...)
Thanks & Cheers

@ottointhesky
Contributor Author

Just a small update: I found the docker-compose issue. Somehow the environment variable IPP_DISABLE_JS=1 is not fully propagated into the docker container. Very strange, since a simple
echo IPP_DISABLE_JS=$env:IPP_DISABLE_JS
command tells me that it's set. Nevertheless, I'm sure I will find a workaround...

@minrk
Member

minrk commented Jun 18, 2024

Great! No rush, and yeah it is definitely fine to set IPP_DISABLE_JS in the containers for testing this, I'm not sure why it's not propagating, but setting it explicitly ought to work.

@ottointhesky
Contributor Author

Thanks for your reply!

Setting the environment variable using powershell commands resolved the issue. See
https://github.com/ottointhesky/ipyparallel/blob/0a0efb02be1da21c224f9e805159c13809902e4f/ci/ssh/win_Dockerfile_template#L9-L13
and
https://github.com/ottointhesky/ipyparallel/blob/0a0efb02be1da21c224f9e805159c13809902e4f/ci/ssh/win_Dockerfile_template#L23C1-L23C112

In my PR, pre-commit.ci's ruff complains about a lot of files which I haven't touched. Is this a problem or can I simply ignore it?

I have some more questions regarding your code suggestions: I have replied to about half of them already. Is it OK the way I did that, or do you expect a separate commit for each comment? Should I resolve the conversations when I incorporate your suggestions, or do you always want to do that yourself?

Thanks & cheers

@minrk
Member

minrk commented Jun 19, 2024

In my PR, pre-commit.ci's ruff complains about a lot of files which I haven't touched. Is this a problem or can I simply ignore it?

My guess is most of this will be resolved by merging from main.

I have some more questions regarding your code suggestions: I have replied to about half of them already. Is it OK the way I did that, or do you expect a separate commit for each comment? Should I resolve the conversations when I incorporate your suggestions, or do you always want to do that yourself?

Resolving yourself with a link to the commit is a fine way to do it, or just a description of what you changed. Feel free to leave them open with a comment if you feel like I should have a look to decide if it's been addressed or not. It definitely doesn't need to be a dedicated commit for each comment.

Don't worry about the git history, I think we'll use squash & merge for this one.

@ottointhesky
Contributor Author

Thanks again for your comments.

I have resolved all code comments except for one (always (re-)opening the debug log file handle). Have a look at my comment and feel free to persist with your suggestion...

Apart from that, I think my PR is ready to go.

Just a final note: I have also added test_shellcmd.py to the ssh GitHub runner test. Otherwise the tests
[ssh-win], [ssh-win_src], [ssh-linux] and [ssh-linux_src] are never executed.

@ottointhesky
Contributor Author

Yesterday I found and fixed a small issue in my breakaway support detection. It was working in GitHub runners but not in the "real world" (the workaround was always used...).
Unfortunately, the test workflow action is not successful because GitHub has declared macos-11 runners deprecated. Simply switching to macos-12 results in some test errors :-(
see macos-12 test
How should we proceed? Is there someone who is fixing the macOS issue?

@minrk
Member

minrk commented Jun 25, 2024

I'll have a look at the macos issue in #886

@ottointhesky
Contributor Author

I'll have a look at the macos issue in #886

perfect & thanks! There is obviously an issue with the hostname ip resolution (I remember a similar problem is the windows docker container)

@ottointhesky
Contributor Author

I don't want to be annoying, but is there a chance we can get my PR into a committable state within the next two weeks (afterwards I'm on holiday for 3 weeks...)? The new feature would also require some adaptations of the ipcluster documentation. Should I write that?

@minrk
Member

minrk commented Jul 10, 2024

Thanks for your huge amount of work! I'm also going on holiday. If you could remove all the temporary skips of tests and things, and make appropriate changes to the docs, I can probably take it from there when I have a chance.

@ottointhesky
Contributor Author

You are very welcome! Your comments and the reviewing process were a positive experience for me.
I have now also committed the documentation changes, and I hope I removed all the temporary test skips.

The only thing left is the 'SSHProxy' support/test under Windows. I vaguely remember that it was basically working on my machine, but not reliably. Since my knowledge about the network communication of ipp is still limited, I didn't investigate the issue further. I'm willing to assist here as well, but in a later, separate PR (since a normal ssh cluster works perfectly fine).

I hope you find time to merge the PR soon. Enjoy your holidays!

@minrk
Member

minrk commented Oct 28, 2024

Finally merged, thank you!

@minrk minrk closed this as completed Oct 28, 2024
@ottointhesky
Contributor Author

Thanks for improving and merging my code parts! Coming from C++ I can still see that I haven't fully adapted to all possibilities of dynamic programming languages ;-)
I can’t wait to see the changes in the next pip release of ipyparallel. Is there a date/plan for the next release?

I also have a completely different question: Is there a possibility to submit (a small amount of) auxiliary/custom data along with a task that will be stored in the central task database? Is there a better place to ask such questions? Thanks & Cheers

@minrk
Member

minrk commented Nov 4, 2024

I've been cleaning up a few things and fixing compatibility with the latest version of ipykernel. The only thing I have left is a docs PR that I'm working on, then I'll make a release. Hopefully this week or next.

As for the other question, it would be better to open a new issue with more details. I'd need more information about what you want to do with the centrally stored data.

@ottointhesky
Contributor Author

Many thanks for the quick reply and the infos.
And I will open a new issue with more details.

@minrk
Member

minrk commented Nov 6, 2024

@ottointhesky ipyparallel 9.0 is out with your PR. Thanks again!

@ottointhesky
Contributor Author

Perfect and thanks for the info!
