
yaws is stuck and cannot receive any new HTTP connections #492

Open
yaocl opened this issue Jul 9, 2024 · 17 comments


yaocl commented Jul 9, 2024

We run the yaws server in embedded mode, and it starts up and processes requests normally.
HAProxy sits in front of it to monitor the yaws cluster nodes and to receive and dispatch new HTTP webpage/websocket connections.

A yaws node will suddenly get stuck and cannot accept any new HTTP connections.

We used observer_cli to inspect the Erlang VM and found, on the stuck node:

  1. a yaws_server process is stuck with a pending Message: [{inet_async, #Port<0.14205>, 4242, {ok, #Port<0.17083>}}]; the message is never processed or flushed
  2. the process's call is yaws_server:acceptor0/2, with current Function: prim_inet:accept0/2

We tested yaws versions 2.0.6 and 2.0.9, and tried Erlang versions 20 and 25.
The problem happens randomly.

[Screenshot 2024-07-09 11 37 11]

[Screenshot 2024-07-09 11 38 22]

[Screenshot 2024-07-09 11 39 19]


vinoski commented Jul 10, 2024

First thing you might try is building from the yaws master branch to see if you can reproduce the problem with the latest code.


vinoski commented Jul 10, 2024

For this observer image it would be good if you could provide all the process info you can get from observer for pid <0.767.0>. Undoubtedly that pid will vary for each server execution, but I'm interested in any stack trace you can get for the process that might help determine why it's not receiving the queued message.


vinoski commented Jul 10, 2024

The same image referenced in the previous comment shows an {inet_async, ...} message. I don't believe yaws calls anything that would generate such a message. I therefore assume you're running your own code within the same virtual machine as yaws — can you explain what it does? Are you running some sort of distributed multi-node system?


yaocl commented Jul 10, 2024

The 2nd snapshot shows the process info for <0.767.0>.
We use yaws with our own code and start it via yaws:start_embedded. Two VM nodes run the same code, and the mnesia tables are replicated and synced between the nodes.

I think the inet_async message is created by the Erlang VM to deliver a TCP connection accepted by the OS. yaws is stuck on this message, so the new TCP connection is never processed. But we have no idea why the yaws acceptor process gets stuck and cannot process new messages.
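For context on where this message comes from: inside OTP, prim_inet:accept/1 posts an asynchronous accept request on the listen socket and then blocks in accept0/2 waiting for an {inet_async, ...} reply, which matches the message shown stuck in the mailbox above. A minimal sketch of that mechanism (prim_inet is an undocumented OTP-internal module, so treat this as illustration only, not an API to rely on):

```erlang
-module(accept_demo).
-export([run/0]).

%% Illustration only: reproduce the async accept that prim_inet:accept/1
%% (and therefore yaws_server:acceptor0/2) performs internally, to show
%% where the {inet_async, ListenSock, Ref, {ok, Sock}} message comes from.
run() ->
    {ok, L} = gen_tcp:listen(0, [binary, {active, false}]),
    {ok, Port} = inet:port(L),
    %% Connect to ourselves so an accept can complete.
    {ok, _C} = gen_tcp:connect("127.0.0.1", Port, []),
    {ok, Ref} = prim_inet:async_accept(L, -1),
    receive
        {inet_async, L, Ref, {ok, S}} ->
            io:format("accepted socket ~p~n", [S])
    after 5000 ->
        io:format("no inet_async reply within 5s~n")
    end.
```

A healthy acceptor consumes this reply immediately; the screenshots show it sitting unreceived in the mailbox instead.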


vinoski commented Jul 11, 2024

For the second image it would be good to also see the Current Stack tab for that process.


yaocl commented Jul 12, 2024

These are more details of process <0.767.0>:

[Screenshot 2024-07-09 11 39 36]

[Screenshot 2024-07-09 11 39 48]


vinoski commented Jul 12, 2024

Thanks. Given that yaws doesn't make any calls itself that would result in an inet_async message reply, it's likely that your application is doing something within a yaws process that causes such a message to arrive in this inbox. The fact that you're using yaws in embedded mode and that no other user has, as far as I can recall, ever reported a problem like this would seem to reinforce the idea that this is an application issue. Even so, I'm curious as to how this can happen, so I'm wondering if you can supply more details about what your application code is doing, especially within processes that are controlled by yaws and are dispatching into your code.


yaocl commented Jul 15, 2024

Here is another case from today, attached below.

I cannot provide the code since it's a company product.
The application primarily uses websockets to process client data. The websocket processes can relay data between the websocket clients.
There are also websocket connections from a separate Java application; these connections stay connected on standby to process data. In this case there are 44 standby websocket processes.

Two Erlang cluster nodes run yaws on two different machines. The websocket process pids are stored in a mnesia table and synced between the nodes. When one websocket client needs to send data to other clients, we look up the pids in the mnesia table and send the data to those pids.

HAProxy sits in front of the yaws nodes. It is configured with "option httpchk *" to check yaws' online status every 5 seconds, so HAProxy itself also opens HTTP connections.
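The health-check setup described above could be sketched in HAProxy configuration like this (backend name, addresses, and ports are placeholders, not the actual deployment):

```
backend yaws_nodes
    option httpchk *                  # sends "OPTIONS * HTTP/1.0" probes
    server node1 192.0.2.1:8080 check inter 5s
    server node2 192.0.2.2:8080 check inter 5s
```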

[Screenshot 2024-07-15 08 41 49]

[Screenshot 2024-07-15 08 42 11]

[Screenshot 2024-07-15 08 42 21]

[Screenshot 2024-07-15 08 42 35]


vinoski commented Jul 16, 2024

When your websocket code is finished, does it perform a socket close? If not, can you try adding a call to yaws_api:websocket_close/1 to your code?


yaocl commented Jul 17, 2024

The server does not close websocket connections itself; it always waits for the client to open/close them.

Based on the server log, when websocket connections are closed by clients, these websocket callback functions are invoked:

handle_message({close, Status, Reason}, _)

terminate(Reason, State)

Should I put yaws_api:websocket_close/1 in terminate/2? terminate/2 seems to be a yaws websocket callback function.

After testing for a few days, we found that ClamAV's "clamscan" can cause more yaws_server:acceptor0/2 processes to end up with a pending inet_async message. While "clamscan" runs, the CPU load and memory usage are extremely high.
We've tried nice, ionice, and cpulimit to throttle clamscan and need a few days to verify the result.

/usr/bin/nice -n 19 /usr/bin/ionice -c2 -n7 /usr/bin/clamscan -i -r /directory_to_scan >> /var/log/clamscan/clamscan_daily.log &
sleep 5
cpulimit -e clamscan --limit 10 --lazy &

"clamscan" is executed at 4:20 AM by a daily cronjob, but yaws_server:acceptor0 processes with a pending "inet_async" message still appeared when we monitored the VM during the daytime.

Sometimes the VM has 1 or 2 yaws_server:acceptor0 processes with a pending inet_async message, yet yaws still accepts new HTTP connections. Is it possible to monitor processes with such an incorrectly pending message and kill them?
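As a diagnostic aid (not a fix), a watcher for exactly this symptom could be sketched as follows: find processes parked in prim_inet:accept0/2 with an {inet_async, ...} reply already queued. The module name and the kill helper are assumptions, not anything yaws provides, and killing acceptors should be verified on a test node first.

```erlang
-module(acceptor_watch).
-export([stuck_acceptors/0, kill_stuck/0]).

%% List processes that are blocked in prim_inet:accept0/2 while an
%% {inet_async, ...} reply is already sitting in their mailbox -- the
%% symptom described above. Note that asking for 'messages' copies the
%% whole mailbox, so run this sparingly on busy nodes.
stuck_acceptors() ->
    [P || P <- processes(), is_stuck(P)].

is_stuck(P) ->
    case process_info(P, [current_function, messages]) of
        [{current_function, {prim_inet, accept0, _}},
         {messages, [{inet_async, _, _, _} | _]}] -> true;
        _ -> false
    end.

%% Kill the stuck acceptors. yaws_server should then start a fresh
%% acceptor, but treat this as a workaround, not a cure.
kill_stuck() ->
    [exit(P, kill) || P <- stuck_acceptors()].
```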


vinoski commented Jul 17, 2024

Apologies for not being clear. Rather than making an API call, if your websocket callback module receives any sort of message particular to your application in its handle_message/2 function indicating that the websocket connection is no longer needed, please return {close, normal} from handle_message/2. Also, if handle_message/2 receives a {close, Status, Reason} message, please return {close, Status}.


yaocl commented Jul 18, 2024

It's written this way:

handle_message({close, Status, Reason}, _) ->
  lager:debug("Close connection: status=~p reason=~p", [Status, Reason]),
  {close, Status}.


vinoski commented Jul 18, 2024

OK, that's good, but I assume the application protocol you've created for your application on top of websocket also has a shutdown or closing aspect, and the message you use to indicate that also needs to be handled with {close, normal}. See the basic callback example code for an example.


yaocl commented Jul 19, 2024

Our code follows basic_echo_callback_extended.

What's the difference between handle_message/1 and handle_message/2?
Do we need to add a handle_message/1 to our module based on basic_echo_callback_extended?


vinoski commented Jul 19, 2024

Ok, since you're using the extended example, do you have a handle_message/2 that returns a close tuple, similar to the one in the example?

The handle_message/1 function gets only a message as an argument, whereas handle_message/2 gets a message and the callback module state as arguments.


yaocl commented Jul 23, 2024

We found that the source of the problem is

yaws/src/yaws.erl

Line 2736 in 0c1d25e

Recv = do_recv(CliSock, 0, SSL),

yaws tries to read all the HTTP headers. It invokes do_recv with Count =:= 0, but do_recv never returns. It looks like the gen_tcp:recv call inside do_recv does not time out and return an error.

http_collect_headers(CliSock, Req, H, SSL, Count) when Count < 1000 ->
    setopts(CliSock, [{packet, httph}, {packet_size, 16#4000}], SSL),
    Recv = do_recv(CliSock, 0, SSL),

The problem is not about websockets. It is caused by HAProxy's regular httpchk probes checking whether the server is alive, and it does not happen immediately after the server starts. We tried decreasing the httpchk interval from 5 seconds to 200 ms on two servers; sometimes it happened after 1 hour, sometimes after 2 or 3 hours.

Some posts suggest using prim_inet:recv, but the prim_* modules are internal to Erlang/OTP.
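For illustration, the usual defence against a read that never returns is an explicit timeout on gen_tcp:recv/3. A minimal sketch (the module and function names are made up for this example; this is not yaws' actual code):

```erlang
-module(recv_demo).
-export([recv_with_timeout/1]).

%% Sketch: read one packet with an explicit 30-second timeout instead of
%% blocking indefinitely. {error, timeout} lets the caller close the
%% socket and recover rather than hanging in the receive.
recv_with_timeout(Sock) ->
    case gen_tcp:recv(Sock, 0, 30000) of
        {ok, Packet}     -> {ok, Packet};
        {error, timeout} -> {error, timeout};   % peer idle too long
        {error, Reason}  -> {error, Reason}     % closed, econnreset, ...
    end.
```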


vinoski commented Jul 23, 2024

But surely httpchk is sending a legal http request? It seems to send an OPTIONS request by default, which Yaws can handle. It would be interesting to enable Yaws traffic tracing to see what it's receiving. You can do that either by passing -t to yaws when you run it, or by configuring trace = traffic in the global part of your yaws.conf file. Traces will be sent to trace log files, siblings of other regular Yaws log files. Alternatively you can trace only HTTP traffic with -T or with trace = http.
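The trace suggestion above could be sketched in the global part of yaws.conf like this (the logdir value is just an example; trace log files are written alongside the other yaws logs):

```
logdir = /var/log/yaws
trace = traffic     # full traffic tracing; use "trace = http" for HTTP-only tracing
```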
