Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenSSL TLS provider not working correctly #304

Open
tripleslash opened this issue Nov 27, 2023 · 25 comments
Open

OpenSSL TLS provider not working correctly #304

tripleslash opened this issue Nov 27, 2023 · 25 comments
Assignees

Comments

@tripleslash
Copy link
Contributor

tripleslash commented Nov 27, 2023

When a HTTPs response with a large content body is sent, then not all bytes will be transferred and the connection will hang indefinitely until it reaches a timeout.

Example request:

curl -X 'GET' -k
'https://192.168.200.232/v2/endpoint'
-H 'accept: application/json'
-H 'Authorization: Basic dGVzdDo=' --trace-time -v
Note: Unnecessary use of -X or --request, GET is already inferred.
15:44:55.711265 * Trying 192.168.200.232:443...
15:44:55.717252 * Connected to 192.168.200.232 (192.168.200.232) port 443 (#0)
15:44:55.719318 * ALPN, offering h2
15:44:55.719346 * ALPN, offering http/1.1
15:44:55.719472 * TLSv1.0 (OUT), TLS header, Certificate Status (22):
15:44:55.719503 * TLSv1.3 (OUT), TLS handshake, Client hello (1):
15:44:55.761049 * TLSv1.2 (IN), TLS header, Certificate Status (22):
15:44:55.761226 * TLSv1.3 (IN), TLS handshake, Server hello (2):
15:44:55.761775 * TLSv1.2 (IN), TLS header, Finished (20):
15:44:55.761849 * TLSv1.2 (IN), TLS header, Supplemental data (23):
15:44:55.761886 * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
15:44:55.761913 * TLSv1.2 (IN), TLS header, Supplemental data (23):
15:44:55.761999 * TLSv1.3 (IN), TLS handshake, Certificate (11):
15:44:55.763293 * TLSv1.2 (IN), TLS header, Supplemental data (23):
15:44:55.763344 * TLSv1.3 (IN), TLS handshake, CERT verify (15):
15:44:55.763410 * TLSv1.2 (IN), TLS header, Supplemental data (23):
15:44:55.763449 * TLSv1.3 (IN), TLS handshake, Finished (20):
15:44:55.763489 * TLSv1.2 (OUT), TLS header, Finished (20):
15:44:55.763513 * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
15:44:55.763558 * TLSv1.2 (OUT), TLS header, Supplemental data (23):
15:44:55.763591 * TLSv1.3 (OUT), TLS handshake, Finished (20):
15:44:55.763633 * SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
15:44:55.763656 * ALPN, server did not agree to a protocol
15:44:55.763675 * Server certificate:
15:44:55.763690 * subject:
15:44:55.763699 * start date: Oct 4 08:46:14 2023 GMT
15:44:55.763707 * expire date: Sep 10 08:46:14 2123 GMT
15:44:55.763722 * issuer:
15:44:55.763752 * SSL certificate verify result: self-signed certificate in certificate chain (19), continuing anyway.
15:44:55.763795 * TLSv1.2 (OUT), TLS header, Supplemental data (23):
15:44:55.763831 > GET /v2/endpoint HTTP/1.1
15:44:55.763831 > Host: 192.168.200.232
15:44:55.763831 > User-Agent: curl/7.81.0
15:44:55.763831 > accept: application/json
15:44:55.763831 > Authorization: Basic dGVzdDo=
15:44:55.763831 >
15:44:55.776094 * TLSv1.2 (IN), TLS header, Supplemental data (23):
15:44:55.776234 * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
15:44:55.776353 * TLSv1.2 (IN), TLS header, Supplemental data (23):
15:44:55.776497 * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
15:44:55.776742 * old SSL session ID is stale, removing
15:44:56.901714 * TLSv1.2 (IN), TLS header, Supplemental data (23):
15:44:56.910324 * Mark bundle as not supporting multiuse
15:44:56.910420 < HTTP/1.1 200 OK
15:44:56.910559 < content-length: 56531
15:44:56.910639 < content-type: application/json; charset=utf-8
15:44:56.910772 < server: backend-api/0.1.0.6822
15:44:56.910938 < access-control-allow-origin: *
15:44:56.911114 < date: Mon, 27 Nov 2023 14:44:52 GMT
15:44:56.911246 <
{"json_content_that_will_be_cut_off
15:44:56.912609 * TLSv1.2 (IN), TLS header, Supplemental data (23):
^C ////// <<<< timeout

Is it possible that the last chunk of data is not flushed to the output?

@an-tao
Copy link
Owner

an-tao commented Nov 27, 2023

Which version of trantor or drogon did you use?

@tripleslash
Copy link
Contributor Author

tripleslash commented Nov 27, 2023

My trantor lib is built from commit tree https://github.com/an-tao/trantor/tree/26a599857a2fa5d61823ef85012e3d9123d5a675 as of right now and drogon is built from tag 1.9.0 tree.

I saw there were no relevant changes to the OpenSSL handling in either drogon or trantor since those tags

@tripleslash
Copy link
Contributor Author

If I keep the connection open and don't cancel it I get the following curl error after 30 seconds:

image

@tripleslash
Copy link
Contributor Author

Is it possible that the last chunk of data is not properly flushed to the TCP socket when the HTTP connection is closed? Maybe there are still some bytes pending and trantor does not wait until it is fully transferred to the client before closing the connection?

@an-tao
Copy link
Owner

an-tao commented Nov 27, 2023

I can't reproduce this error in my environment.

curl -X 'GET' -k https://10.0.0.100:1100/test -v -s -o /dev/null
*   Trying 10.0.0.100:1100...
* Connected to 10.0.0.100 (10.0.0.100) port 1100 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.2 (IN), TLS header, Finished (20):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [6 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [1470 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [520 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.2 (OUT), TLS header, Finished (20):
} [5 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: C=TW; ST=Taipei; L=Taipei; O=Martin; OU=Martin; CN=Martin
*  start date: Feb  9 12:37:34 2023 GMT
*  expire date: Jan 16 12:37:34 2123 GMT
*  issuer: C=TW; ST=Taipei; L=Taipei; O=Martin; OU=Martin; CN=Martin
*  SSL certificate verify result: self-signed certificate (18), continuing anyway.
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
> GET /test HTTP/1.1
> Host: 10.0.0.100:1100
> User-Agent: curl/7.81.0
> Accept: */*
> 
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [233 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [233 bytes data]
* old SSL session ID is stale, removing
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-length: 56531
< content-type: text/html; charset=utf-8
< server: drogon/1.9.1
< date: Mon, 27 Nov 2023 15:49:03 GMT
< 
{ [16243 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* Connection #0 to host 10.0.0.100 left intact

@marty1885 What do you think might be causing this issue?

@tripleslash
Copy link
Contributor Author

I had a similar issue once in an implementation of a TCP SSL socket using asio. Turned out I had to call asio::ssl_stream::shutdown before calling the socket close method. Maybe it could be something similar here? Where some bytes remain unsent on the socket until it is gracefully shut down (which would cause a flush)?

@tripleslash
Copy link
Contributor Author

Also happens with latest trantor (v1.5.15) and drogon (v1.9.1)

@marty1885
Copy link
Collaborator

Me too I can't replicate the issue what so ever. This is what is blocking me from fixing it. :(

@tripleslash Can you share details about the environment you are using? OpenSSL version, compiler, distro, etc.. And if you have Botan3. Can you try Botan 3 and see if the same thing is happening?

@tripleslash
Copy link
Contributor Author

tripleslash commented Nov 28, 2023

Custom built Linux BSP with Kernel 6.5.12 and OpenSSL 3.1.0. I cant change my TLS Provider because many other applications on my system depend on OpenSSL
Compiler is GCC 12.2.1 armv7

@tripleslash
Copy link
Contributor Author

tripleslash commented Nov 28, 2023

20231128 08:24:30.062488 UTC 4347 TRACE [newFileResponse] send http file:/var/www//ui/assets/images/loadingd80ae7c176ae41d8797c.png offset 0 length 0 - HttpResponseImpl.cc:296
20231128 08:24:30.063095 UTC 4347 TRACE [sendStaticFileResponse] Save in cache for 5 seconds - StaticFileRouter.cc:562
20231128 08:24:30.063734 UTC 4347 TRACE [renderToBuffer] reponse(no body):HTTP/1.1 200 OK
content-length: 51581
content-type: image/png
server: backend-api/0.1.0.6824
access-control-allow-origin: *
accept-range: bytes
expires: Thu, 01 Jan 1970 00:00:00 GMT
last-modified: Mon, 27 Nov 2023 15:39:30 GMT
date: Tue, 28 Nov 2023 08:24:30 GMT

 - HttpResponseImpl.cc:704
20231128 08:24:30.069186 UTC 4347 TRACE [sendFileInLoop] send file in loop - TcpConnectionImpl.cc:1133
20231128 08:24:30.086815 UTC 4347 TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215
20231128 08:24:30.086909 UTC 4347 TRACE [sendFileInLoop] send file in loop: return on loop exit - TcpConnectionImpl.cc:1234
20231128 08:24:30.093514 UTC 4347 TRACE [sendFileInLoop] send file in loop - TcpConnectionImpl.cc:1133
20231128 08:24:30.107780 UTC 4347 TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215
20231128 08:24:30.107874 UTC 4347 TRACE [sendFileInLoop] send file in loop: return on loop exit - TcpConnectionImpl.cc:1234
20231128 08:24:30.124704 UTC 4347 TRACE [sendFileInLoop] send file in loop - TcpConnectionImpl.cc:1133
20231128 08:24:30.155241 UTC 4347 TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215
20231128 08:24:30.155333 UTC 4347 TRACE [sendFileInLoop] send file in loop: return on loop exit - TcpConnectionImpl.cc:1234
20231128 08:24:30.181652 UTC 4347 TRACE [sendFileInLoop] send file in loop - TcpConnectionImpl.cc:1133
20231128 08:24:30.223903 UTC 4347 TRACE [sendFileInLoop] send file in loop: return on loop exit - TcpConnectionImpl.cc:1234
20231128 08:24:33.879093 UTC 4347 TRACE [operator()] Erase cache - StaticFileRouter.cc:568
20231128 08:24:34.320739 UTC 4347 TRACE [recvData] Received 473 bytes from lower layer - OpenSSLProvider.cc:533
20231128 08:24:34.321107 UTC 4347 TRACE [processApplicationData] Received 451 bytes from SSL - OpenSSLProvider.cc:739

This is what happens when an incomplete file is transferred in the drogon.log with the highest log level. Maybe this can help you find the problem?

@an-tao
Copy link
Owner

an-tao commented Nov 28, 2023

Can you test your application in a normal linux x86 server?

@tripleslash
Copy link
Contributor Author

tripleslash commented Nov 28, 2023

TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215
this line always seems to happen when the issue appears

@an-tao
Copy link
Owner

an-tao commented Nov 28, 2023

TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215 this line always seems to happen when the issue appears

You could modify that line to output the error number.

@marty1885
Copy link
Collaborator

That is normal and expected. That sounds like the async socket is full and we are waiting for the OS to make the socket writable again. Hmm.... I agree this could solved by your proposal. But I really need to replicate the bug locally to fix it.

@tripleslash You should be able to switch the TLS provider. Other code can keep using OpenSSL. Trantor will simply use Botan3 for the underlying. Unless you are setting OpenSSL specific configuration commands.

@tripleslash
Copy link
Contributor Author

tripleslash commented Nov 28, 2023

Resource temporarily unavailable (errno=11) send file in loop: break on socket buffer full (?) - 11

errno = 11 = EAGAIN

@tripleslash
Copy link
Contributor Author

tripleslash commented Nov 28, 2023

That is normal and expected. That sounds like the async socket is full and we are waiting for the OS to make the socket writable again. Hmm.... I agree this could solved by your proposal. But I really need to replicate the bug locally to fix it.

@tripleslash You should be able to switch the TLS provider. Other code can keep using OpenSSL. Trantor will simply use Botan3 for the underlying. Unless you are setting OpenSSL specific configuration commands.

It is quite some effort to add a new BSP build rule to build a completely new TLS lib. Apart from that I really would like to use the same TLS implementation across the entire system

@tripleslash
Copy link
Contributor Author

tripleslash commented Nov 28, 2023

The issue seems to happen more frequently when you try to connect to the device from outside the intranet or over SSH proxies for example. In intranet it is hard to reproduce I agree.

Try to establish a connection from some remote network (maybe over your phone via LTE) and see if it happens for you. If not then try an ssh socks5 proxy.

The File also needs to be quite large. 100 KB should be enough.

@marty1885
Copy link
Collaborator

@tripleslash I've looked into the issue but it's baffling me. Is it possible for me to get access to a system that can replicate the issue?

Another weird thing that I noticed. The HTTP/1.1 response should not (and does not, according to the logs) cause the TCP connection to send. Thus not look like something got terminated early. Maybe this has something to do with threading...?

@tripleslash
Copy link
Contributor Author

tripleslash commented Dec 4, 2023

@tripleslash I've looked into the issue but it's baffling me. Is it possible for me to get access to a system that can replicate the issue?

Another weird thing that I noticed. The HTTP/1.1 response should not (and does not, according to the logs) cause the TCP connection to send. Thus not look like something got terminated early. Maybe this has something to do with threading...?

What type of access are you looking for? I could forward a https Port but cannot give you SSH access to the device. If thats good enough for you to see the issue in action then you can add me on Discord under xerabtw

@marty1885
Copy link
Collaborator

@tripleslash Sorry getting back to you late. Can you try the latest master? We fixed a bug that may be related to this bug report.

@tripleslash
Copy link
Contributor Author

@marty1885 This issue still persists with the current master of trantor+drogon

@an-tao
Copy link
Owner

an-tao commented Dec 21, 2023

I think the PR may resolve this problem, @tripleslash would u please check it out?

@tripleslash
Copy link
Contributor Author

I think the PR may resolve this problem, @tripleslash would u please check it out?

I can confirm that the issue seems to be gone with the refactored trantor API ! @an-tao

@tripleslash
Copy link
Contributor Author

another update on this: i was able to still reproduce this error but it seems to happen much more rarely now with the trantor refactoring

@an-tao
Copy link
Owner

an-tao commented Dec 22, 2023

@tripleslash Thanks so much for your test on this. I added some debug logs to the destructor of the Tcp Connection Impl class (6f85477).
please check it out. when the hanging occurs, disconnect from client, and find those logs on server, let's figure it out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants