Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-deterministic :bad_match and :function_clause errors (read_timeout) #176

Open
dfrese opened this issue Feb 13, 2024 · 11 comments
Open

Comments

@dfrese
Copy link

dfrese commented Feb 13, 2024

Hi there,

occasionally I get :function_clause and :bad_match errors (with the latter exposing data from the db!).
It can usually be remedied by increasing the :read_timeout parameter, indicating the Oracle server is just sometimes a little faster, sometimes slower.

But isn't that a bug? It seems to me, that the Erlang code tries to decode a partial message after the socket recv times out, leading to failures in attempting to decode it. The second case here seems suspicious to me: Returning :ok on timeout:

recv(read_timeout, Socket, Length, {_Tout, ReadTout} = Touts, Acc, Data) ->
    case sock_recv(Socket, 0, ReadTout) of
        {ok, NetworkData} ->
            recv(Socket, Length, Touts, <<Acc/bits, NetworkData/bits>>, Data);
        {error, timeout} ->
            {ok, ?TNS_DATA, Data};
        {error, Reason} ->
            {error, socket, Reason}
    end.

The actual error location is usually somewhere in the 'decoder', but it is hidden by the try-catch in jamdb_oracle.ex/sql_query

  1. Maybe there is a reason for that, but shouldn't it return an error? Just removing the second case maybe?

  2. Why is there a separate read_timeout in addition to the overall timeout anyway? If multiple socket reads are needed (because of the fetch size, I guess), why not just calculate an absolute 'end time' from the overall query timeout at the start, and then give each socket recv a remaining relative time - giving up if that is <= 0. In that way, I can say for each Query how long I want to wait, regardless of the fetch size or the way the communication with the server is split. I might want to use different timeouts for different queries, so setting a global read_timeout that serves all is not very practical.

@vstavskyi
Copy link
Member

Packet header unfortunately doesn't contain PacketSize for long packets > 8k
PacketSize is always <= 8k (for ethernet)

recv() loop is executed until timeout, driver can't endlessly wait last packet's part

PacketSize is here

decode_packet(<<PacketSize:16, _PacketFlags:16, ?TNS_DATA, _Flags:8, 0:16, _DataFlags:16, Rest/bits>>

Add erlang:display and you'll see where break occurs

decode_token(<<Token, Data/binary>>, Acc) ->
    erlang:display({Token, Data,Acc}),
    case Token of

@vstavskyi
Copy link
Member

Please test stage branch.
Protocol version is different.
The header now contains dataflags.

decode_packet(<<PacketSize:32, ?TNS_DATA, _Flags:8, 0:16, DataFlags:16, Rest/bits>>) ->
    BodySize = PacketSize-10,
    case Rest of
        <<PacketBody:BodySize/binary, Rest2/bits>> when DataFlags =:= 0 ->
            {more, ?TNS_DATA, PacketBody, Rest2};
        <<PacketBody:BodySize/binary, Rest2/bits>> when DataFlags =:= 8192 ->
            {ok, ?TNS_DATA, PacketBody, Rest2};

@dfrese
Copy link
Author

dfrese commented Feb 26, 2024

Packet header unfortunately doesn't contain PacketSize for long packets > 8k PacketSize is always <= 8k (for ethernet)

I don't quite understand; gen_tcp:recv is always called with length=0 as far as I can see. So as long as it is only called when at least one byte is definitely still missing, I don't see the problem. You don't have to "detect" the end of a message via the timeout, do you?

recv() loop is executed until timeout, driver can't endlessly wait last packet's part

Sure. It should wait for the overall/total timeout chosen by the user minus the already elapsed time. (unless the user gives a timeout of :infinity)

@vstavskyi
Copy link
Member

0 means that all available bytes are returned. All bytes of one packet.
After each data packet of maximal size (or marker packet) recv(read_timeout) receives next packet or aborts by timeout.

@dfrese
Copy link
Author

dfrese commented Feb 26, 2024

0 means that all available bytes are returned. All bytes of one packet. After each data packet of maximal size (or marker packet) recv(read_timeout) receives next packet or aborts by timeout.

Sure. But I still don't understand why it's necessary to try to decode a packet after a timeout occured. (assuming the timeout would correspond to the query timeout specified by the user; instead of being a fixed global value)

erlang:display({Token, Data,Acc}),

I've seen it with a TTI_RXH token, if that helps.
(It only happens with a large 'FETCH' params and a small 'READ_TIMEOUT' btw.)

Please test stage branch.
Protocol version is different.

Getting timeouts (DBConnection.ConnectionError) :timeout now for any query, and with small fetch size and large read-timeout. handle_login seems to succeed though.
With Oracle 19c, that is.

@vstavskyi
Copy link
Member

vstavskyi commented Feb 27, 2024

Getting timeouts

fix

It seems dataflags are not empty only with 23c :(

@dfrese
Copy link
Author

dfrese commented Feb 27, 2024

That kind of helped. But now it actually waits for the full read_timeout in normal operations! I.e. it gets reaaaally slow with a read_timeout of 500ms; even the connect fails with a read_timeout of 3000.
That wasn't the case before. Before, I only got timeouts with complex queries and a slow server.

So this made things much worse, I dare to say.

@vstavskyi
Copy link
Member

vstavskyi commented Feb 27, 2024

stage23c is slow with 19c
Don't use it.

Try add parameter socket_options with recbuf
parameters: [... read_timeout: 500, socket_options: [{:recbuf, 8192}] ...]
If sdu=8192 (default) then recbuf=8192 or you can use other values 4096,2048,...

Looks better
commit

@dfrese
Copy link
Author

dfrese commented Feb 27, 2024

Try add parameter socket_options with recbuf parameters: [... read_timeout: 500, socket_options: [{:recbuf, 8192}] ...] If sdu=8192 (default) then recbuf=8192 or you can use other values 4096,2048,...

Why?

I have removed the :error -> :ok clause in my fork (pre-stage23c) and use a read_timeout of 3 seconds. That works ok for me. It's not ideal, because I have basically no control over the maximum time a query can take. But smaller and larger queries work ok, if the server is fast enough. And in my usecase it's not a huge problem, as long as a query doesn't fail after only 500ms, because the server needs that quite often. (And because after 5 seconds, a Phoenix UI will usually fail anyway).

Overall I just wanted to make a suggestion for improvement. I don't see why the read_timeout parameter is necessary at all. It doesn't correspond to any external/high level behaviour of running queries (*), nor does it affect the server, as far as I understand it.

(* something similar might make sense for stream() queries; but not for ordinary queries. And Ecto.Repo.stream() has a separate max_rows parameter anyway, and probably a default semantic for what the query timeout means for stream() queries)

@dfrese
Copy link
Author

dfrese commented Mar 20, 2024

Hi,

even with my changes (not trying to decode after a socket timeout), I get sporadic :bad_match errors in production.
They come in two forms, and start like

{:badmatch, <<0, 0, 1, 10, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 61, 0, 1, 1, 0, 0, 0, 0, 25, 79, 82, 65 ... >>}
{:badmatch, <<1, 10, 1, 10, 0, 6, 34, 1, 7, 0, 1, 100, 0, 0, 0, 7, 2, 193, 50, 3, 194, 17, 33, 3, 194, 9, 17, ... >>}

Tried to enforce it by doing many queries, but without luck. Happens every few hours, so it's hard to collect any more information.
Can you see from the bytes what the problem might be?

Thanks!

@dfrese
Copy link
Author

dfrese commented May 7, 2024

I also see this sporadically

(MatchError) no match of right hand side value: "03: no data found\n"

(jamdb_oracle 0.5.9) src/jamdb_oracle_tns_decoder.erl:495: :jamdb_oracle_tns_decoder.decode_ub4/1
(jamdb_oracle 0.5.9) src/jamdb_oracle_tns_decoder.erl:86: :jamdb_oracle_tns_decoder.decode_token/2
(jamdb_oracle 0.5.9) src/jamdb_oracle_tns_decoder.erl:131: :jamdb_oracle_tns_decoder.decode_token/3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants