Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPE: ACK node overwritten message (impact unknown) #253

Open
ethouris opened this issue Jan 29, 2018 · 23 comments · Fixed by #761 · May be fixed by #1888
Open

IPE: ACK node overwritten message (impact unknown) #253

ethouris opened this issue Jan 29, 2018 · 23 comments · Fixed by #761 · May be fixed by #1888
Assignees
Labels
[core] Area: Changes in SRT library core Priority: High Type: Bug Indicates an unexpected problem or unintended behavior
Milestone

Comments

@ethouris
Copy link
Collaborator

There's a message appeared in one of tests:

11:54:57.501306/SRT:RcvQ:worker*E: SRT.c: IPE: ACK node overwritten when acknowledging 1979 (ack extracted: 0)

Impact unknown, although this is qualified as IPE, so it shouldn't happen. This was during tests against 1.1.3 version (Makito) and it might come from the other side. Need to make sure that this kind of problem is never CAUSED by the 1.3 version client.

@ethouris
Copy link
Collaborator Author

Further study states that:

  1. The implementation of the ACK window - refactored only formally (just in order to avoid dynamic allocation and too fine-grained delocalization) - is messy, although it's meant to be for relatively simple use. It's hard to say the exact reason that ACK node may be overwritten.
  2. A probable reason why this happens is that the ACK window is "too short" and the ACKACK response for ACK message has a journal of an ACK message that has been already wiped out the ACK window. Trouble is, the ACK window has set a size of 1024 elements, and this is about having ACKACK response lacking for the whole window, which is either that the ACKACK response isn't being sent (seems impossible) or lost on track (as larger packets are more probable to be lost, and data packets are not being lost, very hardly possible), or there's some unknown internal implementation problem.
  3. The only impact of this event is that the RTT, which is measured as a distance in time between the moment when this ACK message is being sent and when the ACKACK message with the same journal ("ACKNumber") is received, cannot be measured because one of the ingredients for this calculation is missing. RTT is however measured all the time, so losing one opportunity to measure RTT is not a big deal. It would be much worse if such messages appear very often one after another, which means that RTT isn't really being measured, but this isn't the case here.

Research result: minor problem, rather annoying, but it's a developer-uncomfortable situation of an unknown source of problems, even if they are minor.

Recommended: Use some general class of a ring buffer and implement CACKWindow class anew on that base. Important is a possibility to track the running size of the ACK window.

@ethouris ethouris added Priority: Low Type: Bug Indicates an unexpected problem or unintended behavior Status: On Hold labels Jan 29, 2018
@ethouris ethouris removed the Type: Bug Indicates an unexpected problem or unintended behavior label Mar 21, 2018
@ethouris ethouris reopened this Nov 29, 2019
@ethouris ethouris added this to the v1.4.2 milestone Nov 29, 2019
@ethouris
Copy link
Collaborator Author

There should be determined the reason why this node overriding happens. If this isn't anything that influences the overall processing, then the log contents should be changed.

@maxsharabayko
Copy link
Collaborator

Message can be seen when netem is applied:

sudo tc qdisc add dev eno1 root netem delay 25ms 10ms loss 4%

SRT default settings with 100 ms latency over a local switch

Note. Delete netem config

sudo tc qdisc del dev eno1 root netem delay 25ms 10ms loss 4%

@ethouris
Copy link
Collaborator Author

When deleting, only the parameters up to root are required.

@maxsharabayko maxsharabayko modified the milestones: v1.5.0, v1.5.1 Dec 27, 2019
@stoyanovgeorge
Copy link

I am also experiencing the same problem with v.1.4.1. I think this is connected when you have higher jitter.

@ddm1233
Copy link

ddm1233 commented Aug 5, 2020

I am also experiencing the same problem with v.1.4.1

image

client : windows
server : ubuntu 18.04

srt file transfer mode

image

client call srt_send sync only and server recv sync

image

@Mutanter
Copy link

Mutanter commented Oct 19, 2020

Well, got these message too.

Pretty constant by using OpenMPTCP link bonding while MLVPN is selected as VPN for ICMP and UDP packet stream.

However, only as soon as the upload bandwidth of one link is completely exhausted by increasing the stream bitrate > eth1 max-upload size, therefore the router has to add a second link (eth2) to the bitstream. This event seems to relate/lead to the message "IPE: ACK node overwritten when acknowledging". From this point, a lot of these messages appear, even though the SRT stream runs smoothly and without visible disturbance. Once the bitrate drops below the maximum upload value of a single uplink gateway the messages disappear.

Let me know if I can provide further information since the test-setup is probably kinda unusual.
Might help you to reproduce the "problem" if it's still a concern.

@Fanfwe
Copy link

Fanfwe commented Mar 4, 2021

Hello,
I am experiencing this also.
Based on my tests, this happens when the connection has some jitter, as @stoyanovgeorge also indicated. I have played with this through a network simulator, and the results are pretty clear:
With no jitter, things work fine. I can transport a 1Mbit/s stream, and the actual bandwidth on wire is approx. 1.1/1.2Mbit/s. No messages in the logs either.
Then I introduce a 50ms jitter through the simulator. Right away, I get spammed with a ton of messages in the logs, like the one described above. Another interesting thing is that in the SRT stats, the calculated RTT is still good, but I can see the counter of lost packets increasing, and my bandwidth on-wire jumps to 1.7Mbit/s ! The stream is delivered properly, no lost frames, but at a cost of almost double the bandwidth, where this shouldn't be necessary (the bandwidth available between sender and receiver is way more than 1Mbit/s, and I have configured SRT with a RTT of 3600, which is very, very conservative).

@rationalsa
Copy link

rationalsa commented Mar 8, 2021

I've also seen excessive retransmissions on networks with high jitter / reordering even for packets within their TTL window. This patch took care of it, but it's probably a bit heavy handed. NAKs are still sent through other code paths for packets which are actually lost / late.

@mbakholdina mbakholdina self-assigned this Mar 16, 2021
@mbakholdina mbakholdina modified the milestones: v1.5.1, v1.4.3 Mar 16, 2021
@maxsharabayko
Copy link
Collaborator

maxsharabayko commented Mar 16, 2021

There are two possible reasons for the "ACK node overwritten" error message:

  1. incoming ACKACK packet acknowledges some ACK packet that is already missing in the record of the last 1024 ACK packets sent. Full ACK is sent at least every 10 ms, meaning that the last 10 seconds of ACK packets sent are stored. With RTT below this value, it should not be the reason for this message.
  2. The estimated RTT value is negative or zero, meaning that there was a shift in clock time, and ACK was sent later than the corresponding ACKACK has been received according to that clock. Should not happen with monotonic/steady clock.
  3. Peer error. Sender might have sent some wrong ACKACK packet with a wrong packet sequence number (counter). This value was not found in the records. Similar to no. 1, but a different reason.
int32_t ack = 0;
const int rtt = m_ACKWindow.acknowledge(ctrlpkt.getAckSeqNo(), ack);
if (rtt <= 0)
{
    LOGC(inlog.Error,
            log << CONID() << "IPE: ACK node overwritten when acknowledging " << ctrlpkt.getAckSeqNo()
                << " (ack extracted: " << ack << ")");
    break;
}

TODO

@maxsharabayko
Copy link
Collaborator

@rationalsa You are disabling periodic NAK reports, which could also be done using the SRTO_NAKREPORT socket option.

To reduce the overhead of retransmissions, please enable a more clever retransmission algorithm using SRTO_RETRANSMITALGO=1.
URI query parameter retransmitalgo=1.

@mbakholdina
Copy link
Collaborator

mbakholdina commented Mar 19, 2021

@maxsharabayko
Here is an update on the first test results on the issue. I've performed all the tests using two CentOS7 machines and LanForge network emulator. I haven't been able to reproduce the issue with the latest SRT version (latest master, at least release candidate v1.4.3, commit 449d917, used srt-xtransmit test application). The following cases were covered:

Scenario Link Capacity RTT Jitter Packet Loss Sending Rate SRT Settings
1 45 Mbps 100 ms 1 % 0 % 1 Mbps latency = 400 (4RTT without taking the jitter into account)
2 45 Mbps 100 ms 10 % 0 % 1 Mbps latency = 400 (4RTT without taking the jitter into account)
3 45 Mbps 100 ms 20 % 0 % 1 Mbps latency = 400 (4RTT without taking the jitter into account)
4 45 Mbps 100 ms 50 % 0 % 1 Mbps latency = 400 (4RTT without taking the jitter into account)
5 45 Mbps 100 ms 50 % 1 % 1 Mbps latency = 400 (4RTT without taking the jitter into account)
6 45 Mbps 100 ms 50 % 5 % 1 Mbps latency = 400 (4RTT without taking the jitter into account)
7 45 Mbps 100 ms 50 % 5 % 10 Mbps latency = 400 (4RTT without taking the jitter into account)
8 45 Mbps 20 ms 10 ms 4 % 10 Mbps latency = 80 (4RTT without taking the jitter into account)

Jitter and packet loss are set up on both ends.

In case 8 which is the reproduction from here, I only see dropped packets. However, it's expected behavior due to high jitter and not enough latency which doesn't take the jitter into account. Should be at least 120.

13:16:58.074516 [D] SOCKET::SRT 0x21A70521 (srt://:4200) Accepted connection 0x21A70520
13:16:58.392392/SRT:TsbPd!W:SRT.br: @564593952:RCV-DROPPED 1 packet(s), packet seqno %1555099858 delayed for 0.084 ms
13:17:18.997238/SRT:TsbPd!W:SRT.br: @564593952:RCV-DROPPED 1 packet(s), packet seqno %1555119440 delayed for 0.094 ms
13:17:54.329206/SRT:TsbPd!W:SRT.br: @564593952:RCV-DROPPED 1 packet(s), packet seqno %1555153018 delayed for 0.093 ms
13:18:25.076541/SRT:TsbPd!W:SRT.br: @564593952:RCV-DROPPED 1 packet(s), packet seqno %1555182239 delayed for 0.059 ms

Commands

_build/bin/srt-xtransmit receive srt://:4200?latency=80 -v
_build/bin/srt-xtransmit generate --sendrate 10Mbps --duration 120 srt://192.168.2.2:4200?latency=80 -v

Next steps:

@mbakholdina

This comment has been minimized.

@mbakholdina mbakholdina added the [core] Area: Changes in SRT library core label Mar 19, 2021
@rationalsa
Copy link

@mbakholdina Just adding some jitter with tc qdisc add dev eth0 root netem delay 200ms 40ms seems to be enough to cause this in a 449d917 srt-live-transmit listener.

@mbakholdina
Copy link
Collaborator

mbakholdina commented Mar 22, 2021

Hi @rationalsa,

I confirm the reproduction with the use of netem, srt-xtransmit test application and the SRT version from PR #1876.

I setup tc qdisc add dev eth0 root netem delay 200ms 40ms on the receiver, listener side. The log is the following:

[haivision@big-flop srt-xtransmit]$ _build/bin/srt-xtransmit receive srt://:4200?latency=800 -v
16:35:26.442500 [I] SRC URL: srt://:4200?latency=800
16:35:26.443001 [D] SOCKET::SRT 0x7F93333 (srt://:4200) Listening
16:35:26.443033 [D] SOCKET::SRT 0x7F93333 (srt://:4200) ASYNC Waiting for incoming connection
16:35:35.884545 [D] SOCKET::SRT 0x7F93333 (srt://:4200) Accepted connection 0x7F93332
16:35:36.353145/SRT:RcvQ:w1*E:SRT.in: @133772082:IPE: The record about ACK is not found, RTT estimate at the receiver side can not be calculated (ACK number: 2, last ACK sent: 16, oldest ACK record: not yet available, RTT (EWMA): 115747)
16:35:36.402885/SRT:RcvQ:w1*E:SRT.in: @133772082:IPE: The record about ACK is not found, RTT estimate at the receiver side can not be calculated (ACK number: 4, last ACK sent: 19, oldest ACK record: not yet available, RTT (EWMA): 127244)
16:35:36.554384/SRT:RcvQ:w1*E:SRT.in: @133772082:IPE: The record about ACK is not found, RTT estimate at the receiver side can not be calculated (ACK number: 14, last ACK sent: 30, oldest ACK record: not yet available, RTT (EWMA): 153433)
16:35:36.718613/SRT:RcvQ:w1*E:SRT.in: @133772082:IPE: The record about ACK is not found, RTT estimate at the receiver side can not be calculated (ACK number: 25, last ACK sent: 43, oldest ACK record: not yet available, RTT (EWMA): 164014)

Commands

_build/bin/srt-xtransmit receive srt://:4200?latency=800 -v
_build/bin/srt-xtransmit generate --sendrate 1Mbps --duration 120 srt://192.168.3.2:4200?latency=800 -v

Important
I've just figured out that jitter hasn't been applied in my previous tests with the use of Lanforge. TODO: Figure out the possible reason and replicate some of the tests if required.

@maxsharabayko
Copy link
Collaborator

maxsharabayko commented Mar 22, 2021

Thanks for the investigation! Now the reason is clear.

The class CACKWindow is responsible for tracking ACK-ACKACK pairs. When ACK is sent, it stores a record with the sending time. When ACKACK is received, it estimates RTT and removes all records older than the current one.

E.g., ACKACK no. 10 arrives on the receiver, ACK records 1-10 are removed.
ACKACK no 9 arrives out of order, there is no record, which results in the IPE message.

TODO

@maxsharabayko maxsharabayko self-assigned this Mar 22, 2021
@maxsharabayko maxsharabayko modified the milestones: v1.4.3, v1.4.4 Mar 22, 2021
@maxsharabayko maxsharabayko added Priority: High Type: Bug Indicates an unexpected problem or unintended behavior and removed Priority: Low labels Mar 22, 2021
@ethouris

This comment has been minimized.

@ethouris

This comment has been minimized.

@maxsharabayko

This comment has been minimized.

@ethouris

This comment has been minimized.

@maxsharabayko
Copy link
Collaborator

maxsharabayko commented Mar 31, 2021

Status Update

SRT v1.4.3 will preserve the current behavior of ignoring out-of-order ACKACK packets. To be improved in further versions.

Meanwhile ,logs have been changed in PR #1876.
Now, the expected log message in case of reordering is the following warning:

... W:SRT.in: ... ACKACK out of order, skipping RTT calculation (ACK number: ###, last ACK sent: ###, RTT (EWMA): ###

Error log will be shown if:

  • Incoming ACKACK is for an ACK sent more than 10s ago.
    IPE: ACK record not found, can't estimate RTT ...
  • Negative RTT estimation (time shift?)
    IPE: invalid RTT estimate ###, possible time shift. Clock: <clock_type>

Please feel free to check the behavior with the latest master and provide your feedback.

@gregfr
Copy link

gregfr commented Apr 14, 2022

Is it possible that this problem creates a desync between audio and video?

@maxsharabayko
Copy link
Collaborator

Is it possible that this problem creates a desync between audio and video?

Not really. The issue mainly affects RTT estimations.
Especially if your audio and video are multiplexed into the same SRT stream, that should not cause the desync.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[core] Area: Changes in SRT library core Priority: High Type: Bug Indicates an unexpected problem or unintended behavior
Projects
None yet
9 participants