Recv from channel hangs forever if other party is gone #12
Aha, after looking into this I think it's relatively expected behavior. When an I/O object (like this channel) is converted to a blocking future, the only way for it to be guaranteed to get resolved is if the event loop is running in the background somewhere. Here, though, the event loop isn't running, so I believe it was just by luck that this resolved before. If you change this so that the event loop is running while you wait, it should get resolved.
It doesn't help:
Still hangs. The problem with the current behavior (IMHO) is that if the event loop is terminated for some reason (e.g. because of an error), peers in other threads waiting for the completion of some operation (e.g. a response from a server) simply hang instead of getting any error. Because a channel is bound to an event loop, I think destruction of the event loop should invalidate all associated objects (channels), and any operation on these channels should produce an error.
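As a toy sketch of that proposal (this is not tokio-core's API, just an illustration of the invalidate-on-drop idea using only the standard library), a receiver that holds a weak reference to its event loop can refuse to block once the loop is gone:

```rust
use std::sync::mpsc::{self, RecvError};
use std::sync::{Arc, Weak};

// Stand-in for the reactor; a real one would own the polling machinery.
// It exists here only so that it can be dropped.
struct EventLoop;

// A receiver bound to an event loop through a weak reference.
struct BoundReceiver<T> {
    reactor: Weak<EventLoop>,
    inner: mpsc::Receiver<T>,
}

impl<T> BoundReceiver<T> {
    fn recv(&self) -> Result<T, RecvError> {
        // If the loop has been destroyed, fail immediately instead of
        // blocking on a wakeup that can never arrive. (A real version
        // would also have to handle the loop dying mid-wait.)
        if self.reactor.upgrade().is_none() {
            return Err(RecvError);
        }
        self.inner.recv()
    }
}

fn main() {
    let reactor = Arc::new(EventLoop);
    let (_tx, rx) = mpsc::channel::<u32>();
    let rx = BoundReceiver {
        reactor: Arc::downgrade(&reactor),
        inner: rx,
    };

    drop(reactor); // the "event loop" is torn down

    // Every operation on the bound channel now produces an error.
    assert!(rx.recv().is_err());
}
```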
Oh sorry, to clarify: I mean that for an I/O object to work, the event loop it came from must still be alive somewhere. That is, because you drop the event loop, the channel no longer has anything to wake it up. How come the event loop is dropping early for you? It's intended that event loops are very long lived (for the entire lifetime of a program), and they're only cleaned up once the entire world is being torn down.
I can think of two scenarios:
Ah ok, it looks like we may be working with different models of when event loops are created then? Additionally, it's assumed that panics in each client are isolated from all others.
Suppose your client is a command-line database client. That client has a single connection per process, so it has an event loop that serves that single connection. All connection-protocol handling logic happens inside that event loop, while high-level synchronous blocking logic (like reading from the file system) may happen in other threads, and they communicate via channels. If the event loop dies, the other threads blocking on channels need to be notified of that fact.
Catching panics won't help if the future itself completes with an error without panicking, or simply completes successfully because there is nothing more to do. Have a look at this example:
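(A hedged sketch of that scenario, assuming the tokio-core 0.1-era `tokio_core::channel::channel(&Handle)` constructor and futures 0.1 `Stream::for_each`/`Stream::wait`; the names and signatures here are recalled from that API, not taken from the original post.)

```rust
extern crate futures;
extern crate tokio_core;

use std::io;

use futures::Stream;
use futures::future;
use tokio_core::channel::channel;
use tokio_core::reactor::Core;

fn main() {
    let mut core = Core::new().unwrap();
    let handle = core.handle();

    // Two reactor-bound channels: requests flow into the event loop,
    // responses flow back out to the synchronous side.
    let (_req_tx, req_rx) = channel::<String>(&handle).unwrap();
    let (resp_tx, resp_rx) = channel::<String>(&handle).unwrap();

    // The protocol-handling future that was supposed to run forever.
    let server = req_rx.for_each(move |req| resp_tx.send(format!("echo: {}", req)));

    // The event loop dies early, e.g. because binding a port failed; here
    // that is simulated by running a startup future that resolves to an error.
    let bind_result: io::Result<()> =
        Err(io::Error::new(io::ErrorKind::AddrInUse, "bind failed"));
    let startup = core.run(future::result(bind_result));
    println!("event loop finished with {:?}", startup);

    drop(server); // the protocol logic never got to run
    drop(core);   // the reactor itself is torn down

    // The synchronous side now waits for a response that can never come,
    // and with the reactor gone it is never even woken up: this hangs.
    for resp in resp_rx.wait() {
        println!("{:?}", resp);
    }
}
```

Note that dropping `server` also drops the only response sender, so a channel that was not reactor-bound would at least observe end-of-stream here; the reactor-bound one never learns about it.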
(I've created two channels because it is a demo; a real example would be much larger.) The server fails to bind to its port, and the other side hangs waiting for a response that will never come. It reminds me of similar problems in Java. But the problem is unsolvable in Java, because Java has no RAII. And Rust has it. Why not take advantage of it to make the API safer?
Yes, unfortunately this is just something that fundamentally isn't designed to work. The source of "blocking" here is the event loop, as that's what the channel is bound to. If that's destroyed, then the channel has no ability to actually block. What you'd want to use in a case like that is something like a channel that isn't tied to the event loop.
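For contrast, here's a channel that is not bound to any reactor (the standard library's `std::sync::mpsc`, used purely as an illustration) observing the other side going away:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<u32>();

    // The sending side goes away entirely.
    thread::spawn(move || drop(tx)).join().unwrap();

    // A plain mpsc receiver does not depend on an event loop for wakeups,
    // so the hangup is reported directly instead of hanging forever.
    assert!(rx.recv().is_err());
}
```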
Channel should not block, it should immediately return an error.
Probably. I don't yet fully understand why different channels exist. Anyway, I simply don't understand your motivation. AFAIU, it is possible to just make the receiver not hang. It is described in #17. It does not practically add any overhead and makes the API safer. Why not do it?
Perhaps related to this issue: I have an anecdotal example of spending several hours debugging while tokio-core just wouldn't work. No errors, nothing; scheduled futures just wouldn't run in production while working perfectly in a unit test. It turned out to have to do with how things were initialized.
@alexcrichton Awesome!
This commit is targeted at solving tokio-rs/tokio-core#12 and incorporates the solution from tokio-rs/tokio-core#17. Namely the `need_read` and `need_write` functions on `PollEvented` now return an error when the connected reactor has gone away and the task cannot be blocked. This will typically naturally translate to errors being returned by various connected I/O objects and should help tear down the world in a clean-ish fashion.
Code:
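A hedged reconstruction of the kind of test being described (a reactor-bound channel whose sender is dropped while nothing runs the event loop), again assuming the 0.1-era `tokio_core::channel::channel(&Handle)` constructor and futures 0.1 `Stream::wait`, not the original code:

```rust
extern crate futures;
extern crate tokio_core;

use futures::Stream;
use tokio_core::reactor::Core;

#[test]
fn recv_after_sender_is_gone() {
    let core = Core::new().unwrap();
    let (tx, rx) = tokio_core::channel::channel::<u32>(&core.handle()).unwrap();

    // The other party goes away.
    drop(tx);

    // Nothing is turning the event loop the receiver is bound to, so the
    // hangup is never delivered and this blocking iteration never returns.
    for _msg in rx.wait() {}
}
```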
This test hangs. I cannot tell for sure, but it seems like it worked (i.e. did not hang) two days ago.