-
Notifications
You must be signed in to change notification settings - Fork 988
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support UDP/TCP port fowarding to a host without setting up a tun #1179
base: master
Are you sure you want to change the base?
Conversation
Thanks for the contribution! Before we can merge this, we need @cre4ture to sign the Salesforce Inc. Contributor License Agreement. |
Thanks a ton for taking this project on! I took a stab at this before finding your changes here, it's here if you want to take a look. In particular e764eb adds a script that sets up a two device loopback tunnel (without root) and runs an iperf3 speed test. I was able to improve throughput by ~7x by adding some buffering to the I still think there are several improvements left on the table:
|
Hey, cool. I could use some support especially regarding the performance and the automated testing. But also for other golang specifics as I'm rather unexperienced there. I tried to do some performance profiling with the help of It seems that its doing some dynamic memory allocation. But from a first glance into the relevant source-code I could not see the reason for it. Do you have an idea? I'm actually already failing to upgrade the gvisior to the latest version. something is wrong with it such that it complains about multiple packages in a directrory. But how can this be undiscovered by the maintainters of gvisor? I have the feeling that I'm doing something wrong... :-/ I will for sure try your idea with the buffered queues. |
@akernet I added your test and afterwards also the performance improvement via cherry-picking. Hope this is fine for you :-) Your performance improvement lead to a improvment of x6 for the test in your testscript-commit on my hardware. I will do a further test where I will use this on a real-world example (the file-copy test with two seperate machines) to confirm this improvment. I'm a bit concerned because the pprof profiling directed my in some different direction. At least this is how I interpret it right now. I will let you know. |
it seems to me that the improvement with the buffered pipes has no impact on the my testing scenario with the file-copy between two different machines. I'm using a "private cloud" server storage called "garage" mounted via "rclone". The two machines are my laptop and a NUC PC ("server"). Both connected via ethernet cable (1GBit/s) to my local network. I use the offical release binary for the laptop and exchange for testing purposes the binary on the server side. I measure the throughput rate (nautilus file copy) on the laptop. And I measure the CPU load on the server side. These are my results:
The last test shall demonstrate the the locally compiled binary has comparable performance as the release binary. In general, the userspace tun/stack has an impact of factor 4. which results from x2 times CPU load and 1/2 throughput rate put together. |
I'm also new to go but I'll try to find some time in the next days! :)
Ofc!
Cool, I've been looking at garage for the past weeks so nice to see that you are using it with Nebula!
Yeah this is a bit strange. How is the system load overall, is it close to max? The buffering should not make Nebula more efficient when it comes to CPU, in fact it's probably gonna be more costly due to the copying. What it does is decoupling the outside and inside (gvisor) parts, allowing them to run in parallel. If the system is at limits already this is unlikely to help, however. I'll try to do some benchmarking in CPU limited scenarios too, gvisor is always gonna use some additional resources but the 66% of your pprof run seems a bit high. |
Update regarding performance:
This brings the copy-speed to almost the reference (99MB/second) - only 10% difference remaining. So overall we are at around 175% of the reference. Which is significant compared to the initial ~ 400%. @akernet I think the main difference comes from a further improvement of your performance tuning "buffering to UserDevice". I achieved it by addtionally avoiding data-copy and dynamic allocation steps (i think). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice! Cool that you got the buffer reuse to work
// Its there to avoid a fresh dynamic memory allocation of 1 byte | ||
// for each time its used. | ||
var BYTE_SLICE_ONE []byte = []byte{1} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest moving this to a separate PR since it affects normal configs too
port-forwarder/fwd_tcp.go
Outdated
default: | ||
} | ||
|
||
rn, r_err := from.Read(buf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this could be replaced with io.Copy()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it can. I once tried this. But it didn't improve the performance. Thats why I went on with other experiments. But now, as we solved the issue to a big extent, I will introduce it again as it simplifies the code. Thanks for pointing it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a variant with io.Copy().
Problem: It seems to slightly decrease performance. So I'm wondering if I better should keep the original implementation.
e2523ca
to
08c3922
Compare
Thanks for the contribution! Before we can merge this, we need @akernet to sign the Salesforce Inc. Contributor License Agreement. |
9718675
to
3a03eb9
Compare
@akernet please re-review. and it seems that you need to also sign the cla as I cherry picked from your branch to honor your contribution. :-) |
5be9458
to
24169d7
Compare
FYI we recently merged #1181 which caused conflicts. One conflict is caused by returning |
…isor_stack # Conflicts: # examples/go_service/main.go # service/service.go
I merged the changes from main. It seems that the interface is OK. |
udp/udp_linux.go
Outdated
@@ -315,6 +321,10 @@ func (u *StdConn) getMemInfo(meminfo *[unix.SK_MEMINFO_VARS]uint32) error { | |||
|
|||
func (u *StdConn) Close() error { | |||
//TODO: this will not interrupt the read loop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johnmaguire I think I have the test stable now. At least on linux. The issue was a unclean shutdown of the nebula service at the end of each test which caused confusion in the following tests.
The change here in this file solved the issue.
I will check the next days if there are some sleeps to cleanup, now as I found the issue.
ed51dd9
to
84d1a26
Compare
return s.eg.Wait() | ||
err := s.eg.Wait() | ||
|
||
s.ipstack.Destroy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding this line made the windows test stable. They now run 200+ in a row without issue.
@@ -323,6 +329,14 @@ func (u *RIOConn) Close() error { | |||
windows.PostQueuedCompletionStatus(u.rx.iocp, 0, 0, nil) | |||
windows.PostQueuedCompletionStatus(u.tx.iocp, 0, 0, nil) | |||
|
|||
u.rx.mu.Lock() // for waiting till active reader is done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had to run the test 1000 times to achieve stable reproduction on my windows 11. Is fixed with this change.
a2cbb62
to
cd510b3
Compare
@johnmaguire after extensive testing and a few findings I dare to claim that the tests are now stable on windows and linux. Can you please approve another workflow run on this PR? And have a look about what is still do be done to merge this? |
@cre4ture Any new changes? Currently it seems to be incompatible with the master branch. |
…isor_stack # Conflicts: # service/service_test.go
@ExplodingDragon thanks for the notification. I merged the changes in. I think the PR is ready for review. It's not clear when the Maintainers will find time for it. |
Is it possible to also add the ability to map two cidrs against each other?
|
Hello maggie44, thanks for your interesting question. I can't really answer that without further studying the documentation myself. hope this helps you a step forward. |
I meant in relation to this PR. Based on the example in the initial post it is configured like this:
I was querying whether it could also be configured like this:
But I see now that the local address would need an interface? Which would defeat the object of removing the TUN device. The goal is to be able to map a lot of devices to the local ports or address without TUN, and without having super long lists of IPs in the config file. |
Best I can think of right now would be to have TCP forwarding from a local IP, like
Where Or for port compatibility:
Straying off the original intent of this PR though, initially I was thinking of it only as a CIDR mapping which would have been a smaller change. Looking at the Service function example there might already be the ability to do this anyway. |
@maggie44 Aha. I see the purpose of your question more clear now. |
Thanks @cre4ture. Here is the sort of thing I had in mind. Looks like it's possible with the gVisor https://gist.github.com/maggie44/189b2536e39f4660f12f3b92235ca3fc |
First of all, thank you for this contribution. It works very well for me. I just wanted to note that port forwarding from the same node to itself does not work with this solution. I know that it's unnecessary because I can directly connect to the port on the same node, but by accident it happened to me. The following does not work on my machine with nebula/dial_address 192.168.100.3.
I don't know enough details of how it works, so I can't debug it. It tries to connect but gets stuck.
Feel free to ignore, since I can bypass the issue by connecting directly to the port on the same machine. |
Hi @niowniow, thanks for testing the implementation on this PR and your feedback. So, I can't really tell if this is supported or not as it depends on a part of the code that this PR doesn't touch. I think. |
This implements #1014.
It also allows to use remote SOCKS proxy (or
ssh
server) to enable the forwarding of more than a single port at a time.This would solve the needs for #915.
Build-Status twin PR: cre4ture#1
How to be used can be read in the example
config.yml
:Till now I only did some basic manual tests with netcat. They resulted in the expected behavior.Further tests where done by @sybrensa and also by myself.
usafe_routesFAILS, but potential fix for the user-tun available here: https://github.com/cre4ture/nebula/tree/feature/enable_unsafe_routes_for_user_tunThese points are still open:a TODODONE.also TODODONE.I thinkDONE.I did file copy tests.
What I got was 1/2 the rate and around 4 times more CPU usage compared to the case with kernel-tun.I tried multiple different improvements on my code, but I fear its actually more a limitation of the gvisor netstack :-/
Any help or idea is welcome.
Update: My file copy tests achieve now around 90% of the speed at 160% of CPU load of nebula. This is a significant improvment and acceptable (at least for my useases). Thanks @akernet for the support here.