Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cleanup: refactor and use bicopy everywhere #2944

Closed
wants to merge 1 commit into from

Conversation

tamird
Copy link
Contributor

@tamird tamird commented Nov 25, 2024

See commit message.

@tamird tamird force-pushed the remove-bicopy branch 2 times, most recently from 4ec7fa0 to 63ae319 Compare November 27, 2024 17:49
Copy link
Member

@nirs nirs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the behavior but you did not provide any reason for the change. If the purpose is to remove eliminate a package, just inline the package code in the single user. There is no need to change the code.

If you want to reimplement it and/or change the behavior, please explain why.

Also we have good tests for port_darwin? Before rewriting we should have good tests ensuring that we don't change the behavior by mistake. It should be simple test coping data over unix sockets.

pkg/hostagent/port_darwin.go Outdated Show resolved Hide resolved
pkg/hostagent/port_darwin.go Outdated Show resolved Hide resolved
pkg/hostagent/port_darwin.go Outdated Show resolved Hide resolved
pkg/hostagent/port_darwin.go Show resolved Hide resolved
@tamird tamird force-pushed the remove-bicopy branch 2 times, most recently from 6a612d9 to c8e7582 Compare November 27, 2024 21:30
pkg/vz/network_darwin.go Show resolved Hide resolved
pkg/vz/network_darwin.go Show resolved Hide resolved
@tamird tamird force-pushed the remove-bicopy branch 3 times, most recently from a2a75a4 to b92ac24 Compare November 28, 2024 13:55
@tamird tamird changed the title cleanup: remove package bicopy cleanup: refactor and use bicopy everywhere Nov 28, 2024
@AkihiroSuda AkihiroSuda requested a review from nirs December 2, 2024 02:07
pkg/bicopy/bicopy.go Outdated Show resolved Hide resolved
pkg/bicopy/bicopy.go Outdated Show resolved Hide resolved
pkg/bicopy/bicopy.go Outdated Show resolved Hide resolved
pkg/vz/network_darwin.go Outdated Show resolved Hide resolved
pkg/portfwd/client.go Outdated Show resolved Hide resolved
pkg/portfwd/client.go Outdated Show resolved Hide resolved
pkg/bicopy/bicopy.go Show resolved Hide resolved
@nirs nirs requested a review from balajiv113 December 3, 2024 01:14
@nirs
Copy link
Member

nirs commented Dec 3, 2024

I think the common behavior in all 3 cases is that we start an endless io.Copy() in both directions. The copy is not expected to complete in normal conditions. This is the case for vz/network_darwin.go, I' not sure about the other 2, maybe port forwarding may end earlier in normal conditions.

In the vz network case, the copy should stop only when the vm is shutting down, or if socket_vmnet crashed or terminated. In this case we want to log an error or warning message about the peer closing the socket. Any other error should be logged as an error. In base case we want to log the issue immediately without a delay, so we don't lose errors.

When one of the io.Copy() goroutines fail, we want to other to fail immediately. I'm not sure this is ensured by current code. This should fixed and tested before we unify the implementation.

When both io.Copy() finished, we already logged the error, so there is nothing to return to the caller, even if we have a caller waiting for the result.

Usually saving the first error (synchronized with a lock or atomics) and returning it will be more helpful than returning 8 possible errors joined. After something fail, the next errors are usually just noise. If we can ensure that the second copy fail immediately when the first fail, we can return the first error to the caller and the caller can log it. This will be more useful for a library since different callers may want to use different loggers. May not be relevant to lima use case.

If we improve bicopy we want to share them with upstream. Did you open upstream issues for the problems you are trying to fix in bicopy?

@tamird
Copy link
Contributor Author

tamird commented Dec 3, 2024

If we improve bicopy we want to share them with upstream. Did you open upstream issues for the problems you are trying to fix in bicopy?

I'm just removing features we aren't using. If they need early cancellation that's just fine.

Copy link
Member

@nirs nirs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the simplified Bicopy!

pkg/vz/network_darwin.go Outdated Show resolved Hide resolved
pkg/bicopy/bicopy.go Show resolved Hide resolved
@nirs
Copy link
Member

nirs commented Dec 3, 2024

If we improve bicopy we want to share them with upstream. Did you open upstream issues for the problems you are trying to fix in bicopy?

I'm just removing features we aren't using. If they need early cancellation that's just fine.

Good reason why this is not relevant to upstream.

pkg/vz/network_darwin.go Show resolved Hide resolved
pkg/portfwd/client.go Show resolved Hide resolved
pkg/hostagent/port_darwin.go Show resolved Hide resolved
pkg/bicopy/bicopy.go Show resolved Hide resolved
pkg/bicopy/bicopy.go Outdated Show resolved Hide resolved
pkg/bicopy/bicopy.go Show resolved Hide resolved
pkg/portfwd/client.go Outdated Show resolved Hide resolved
@nirs
Copy link
Member

nirs commented Dec 3, 2024

@tamird see #2969 for the io.EOF checks.

@tamird
Copy link
Contributor Author

tamird commented Dec 4, 2024

Looks like I caught a test flake - I don't see bicopy in the stack trace.

TestGetPorts/loadBalancer_service
2024-12-03T23:19:21.6922190Z panic: test timed out after 10m0s

@jandubois
Copy link
Member

Looks like I caught a test flake - I don't see bicopy in the stack trace.

I've triggered a re-run, but I'm waiting for approval from another reviewer before merging; not going to work my way through 72 comments... 😄

<-finish
// TODO: return copied bytes
}()
wg.Wait()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need bicopy to handle the error and not throw it to the caller ??

I would prefer it to be handled by the caller itself. It gives us better control. Handling error ourself looks generic but all we are doing is just logging. But in some cases we may need to act on error in caller side

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There has been a lot of back and forth on this already. I'm inclined to leave it as is. Another option is to pass a handler callback, but nobody needs this today.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but nobody needs this today - If errors are returned back, callers will have more control to handle logging parts. Some cases might need more detailed logging rather than generic message context: io.Copy(y, x) which is not self explanatory.

If you have a valid reason for not returning error please let me know.

Also i think func Bicopy(context string, x, y NamedReadWriter) error {} should be good as well. We don't need to complicate with handlers / callbacks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nirs expressed concern that if io.Copy in one direction returns, the io.Copy in the other direction might not - and returning an error cannot happen until both functions complete. That's why the latest versions logs directly instead of returning an error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will eventually fail because one end of stream is closed already.

Since we have wg.Done at last. Returning error after both io.Copy are closed looks correct to me.

Copy link
Member

@balajiv113 balajiv113 Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With errgroup as soon as one is returned non-nil its over. It will stop the other go func as well.

If we are using reader/writer then we will not even have close so it's just stopping the loop

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goroutine is canceled magically? how does it stop?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anything in the docs about cancelling goroutine blocked on a syscall.
https://pkg.go.dev/golang.org/x/sync/errgroup#WithContext

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right, My mistake.

Only context is closed. So we need to check for context close and close the connection.

For pure reader / writer until that syscall is returned we have no other option. But most of our reader / writer have close method so io.Closable should be fine

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should also be added to tcpproxy upstream - ensure termination and return an error. With both we can use it instead of Bicopy.

Improve error logging in bicopy and extend its use to
`pkg/vz/network_darwin.go`.

Signed-off-by: Tamir Duberstein <[email protected]>
logrus.WithError(err).Debug("failed to call xCloser.Close")
defer wg.Done()
if _, err := io.Copy(x.ReadWriter, y.ReadWriter); err != nil {
logrus.WithError(err).Errorf("%s: io.Copy(%s, %s)", context, x.Name, y.Name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logs an error like:

"forwarding packets: io.Copy(VMNENT, VZ)" ... {"error": "actual error from io.Copy"}

For code already calling bicopy, this is an improvement, since before we logged a DEBUG message without any information on which direction failed, or what is the purpose of the copy:

"failed to call io.Copy" ... ... {"error": "actual error from io.Copy"}

But for the call from network_darwin this is worse. Before we had this clear error:

"Failed to forward packets from VZ to VMNET: actual error from io.Copy"

I would like to keep the previous error message or something very close. I don't see the value of logging the error separately using logrous.WithError().

It can be done using:

logrus.Errorf("Failed to %s from %s to %s: %s", context, y.Name, x.Name, err)

defer ac.Close()
unixConn, err := net.DialUnix("unix", nil, plf.unixAddr)
if err != nil {
return err
logrus.WithError(err).Errorf("pseudoloopback forwarder: failed to dial %q", plf.unixAddr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error adds context (good) but logs the error separately (bad). Before we logged only the error from net.DialUnix (line 139).

}, bicopy.NamedReadWriter{
ReadWriter: unixConn,
Name: "unix",
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context does not match other context ("pseudoloopback fowarder" vs "forward packets") and the names of the readwriters are little bit too generic, I'm not sure what is the context on this code, may be others can suggest better values.

@@ -29,7 +29,14 @@ func HandleTCPConnection(ctx context.Context, client *guestagentclient.GuestAgen
}

rw := &GrpcClientRW{stream: stream, id: id, addr: guestAddr, protocol: "tcp"}
bicopy.Bicopy(rw, conn, nil)

bicopy.Bicopy("tcp tunnel", bicopy.NamedReadWriter{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the tunnel for tcp port forwarding, so maybe a better context would be "forward tcp packets for port N". We probably can get the port from conn.RemoteAddr(), since we forward port N to port N on the host.

Name: guestAddr,
}, bicopy.NamedReadWriter{
ReadWriter: conn,
Name: id,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

id is a description of the entire port forwading action: "tcp-%s-%s" (see line 17), so it is not a good for the conn.


bicopy.Bicopy("tcp tunnel", bicopy.NamedReadWriter{
ReadWriter: rw,
Name: in.Id,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what Id is. Can you share the output we get in runtime?

@@ -32,7 +32,14 @@ func (s *TunnelServer) Start(stream api.GuestService_TunnelServer) error {
return err
}
rw := &GRPCServerRW{stream: stream, id: in.Id}
bicopy.Bicopy(rw, conn, nil)

bicopy.Bicopy("tcp tunnel", bicopy.NamedReadWriter{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably use the same context used in the client with the guest port.

@tamird
Copy link
Contributor Author

tamird commented Jan 13, 2025

Since I don't remember why I care about this and it's not fixing obvious bugs, I'm closing. If you want to reuse this work, you have my permission to do so without attribution.

@tamird tamird closed this Jan 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants