Added logic to skip ahead to next intact message after overflow #41

mikael-s-persson · 2024-07-30T22:17:59Z

Added logic to skip ahead to next intact message after overflowing the QS queue, this removes spammy "Bad checksum" errors.

This was discovered while having an issue of not being able to send the QS messages fast enough, which leads to an overflow of the QS buffer (fixed capacity circular buffer), which overwrites the data at the TX end (tail). This results in pretty much all subsequent messages coming out being corrupted, leading to a massive amount of "Bad checksum error" spamming the qspy output without leaving anything useful.

This PR just adds simple logic so that when the buffer is overflown (clobbering the tail), the tail is seek'ed forward to the next intact message by consuming bytes up to and including the next QS_FRAME.

The resulting behavior is that the overflowing condition results in "Discontinuity" errors rather than a wall of checksum errors, which is much more intuitive to diagnose the issue. It incurs more additional loss of information since the partially clobbered message at the tail is not lost already by the time this code reached. And finally, the additional non-deterministic timing of that loop is innocuous in a debugging facility (behind Q_SPY flag).

I tested this in my project and it works as described above.
I'm not sure if there's a test suite somewhere where this regression test condition should be added to. That should be pretty trivial to do.

…e QS queue, this removes spammy bad checksum errors

quantum-leaps · 2024-07-31T00:54:39Z

Hi Sven
Thank you for your interest in QP and QP/Spy software tracing.

...This results in pretty much all subsequent messages coming out being corrupted

I don't understand that claim. I think that only one (the oldest) trace record gets partially overwritten by the new trace record. Of course, if you keep adding to the head without advancing the tail, more and more old trace records gets overwritten, but still only one record at a time gets corrupted. Please explain why you think that all subsequent records come out corrupted.

--MMS

mikael-s-persson · 2024-07-31T01:14:04Z

I was using the TX mechanism from the arm-cr dpp example: https://github.com/QuantumLeaps/qpc-examples/blob/8912782f022cdf8ebf2528619d7d46af7fd4f097/arm-cr/dpp_launchxl2-tms57012/qv/bsp.c#L322

When reaching full capacity on the buffer, the two ends of that pipe are basically racing each other constantly and the data coming out on sci is pretty much a continuous stream of nonsense, with only occasional intact messages getting through when the consumer end is winning briefly. And if you've reached full capacity, then the consumer most consistently losing that race is expected.

The PR fix is actually paired with a fix on the TX side, implementing the QV_onIdle via this function (which you might consider using in the dpp example):

void QS_updateOverSci(int32_t flush_msg_count) {
    QS_rxParse();  // parse all the received bytes

    if (flush_msg_count == 0 && !sciIsTxReady(sciREG1)) {
        // Cannot write, tx is not ready.
        return;
    }
    for (;;) {
        QF_INT_DISABLE();
        uint16_t b = QS_getByte();
        QF_INT_ENABLE();

        if (b != QS_EOD) {
            sciSendByte(sciREG1, b);
        } else {
            break;
        }
        if (b == QS_FRAME) {
            --flush_msg_count;
        }
        if (flush_msg_count <= 0) {
            break;
        }
    }
}

The above fix on the TX side could not work (easily) without changing QS's internals to seek forward to the message boundaries.

quantum-leaps · 2024-07-31T13:15:08Z

Hi Sven,
The situation you describe with a persistent TX buffer overflow is pathological. Sure, you can give the consumer side a leg up by discarding some bytes that otherwise would have to be sent out. But the producer quickly catches up and you're again in the same place. Then you try to help the consumer again by indiscriminately flushing out some messages. But then you relinquish the control of what information gets discarded.

Producing less data is a much better (arguably the only) way to fix such issues. For this, QS provides powerful filtering mechanisms. These "global" and "local" filters allow you precisely to decide what is valuable and what is not. This is a much better policy than throwing stuff out indiscriminately.

Finally, I'd like you to understand my reluctance to accept such fixes. The main problem is that functions like QS_endRec_() run inside a critical section, which means that they should be short and deterministic. I hope you see that adding loops with unknown iteration counts does not help in this regard. Here is a significantly more efficient version of your fix for the QS_endRec_() function (without the overhead of calling QS_getByte() and all the other unnecessary checks):

void QS_endRec_(void) {
    . . .
    // overrun over the old data?
    if (QS_priv_.used > end) {
        QS_priv_.used = end;   // the whole buffer is used
        while (buf[head] != QS_FRAME) {
            ++head;
            if (head == end) {
                head = 0U;
            }
            --QS_priv_.used;
        }
        QS_priv_.tail = head;  // shift the tail to the old data
    }
}

But even this version is non-deterministic and I doubt that such a change wouldn't hurt more than it helps. (I would just use the filters!)

--MMS

mikael-s-persson · 2024-07-31T15:29:16Z

The situation you describe with a persistent TX buffer overflow is pathological.

Of course it is. The point is being able to diagnose this effectively. When seeing a large number of "Bad checksum" errors, it was not easy to understand what the problem was. All I could find out about this error was this post: https://sourceforge.net/p/qpc/discussion/668726/thread/935d7a34a8/
Needless to say, the answer there served no purpose in understand what was going on, and worse, sent me on a wild goose chase suspecting some issue with interrupts, or the sci interface, or memory corruption, etc.
So, this patch (and the per-message QV_onIdle sending) is mainly aimed at making the errors more directly related to the underlying issue, i.e., causing "Discontinuity" errors rather than "Bad checksum" errors. There might be other ways to achieve the same result.
In lieu of fixing that problem, could you document (maybe somewhere under https://www.state-machine.com/qtools/qpspy.html, like in a troubleshooting section) that one of the likely cause of a large number of "Bad checksum" errors is this particular situation? It might save others the trouble I went through.

The main problem is that functions like QS_endRec_() run inside a critical section, which means that they should be short and deterministic.

Yes, I understand that. I did make a note of that fact in the original comment: "the additional non-deterministic timing of that loop is innocuous in a debugging facility (behind Q_SPY flag).". If you require run-time determinism even under the "SPY" build mode, then obviously, this is not appropriate.

These "global" and "local" filters allow you precisely to decide what is valuable and what is not.

I could not figure out a way to specifically filter out all QS output related in any way to one particular event (which fires at 1kHz, and is the main culprit for the inability to keep up). Any help on that would be appreciated.

Here is a significantly more efficient version of your fix

Thanks. There is an off-by-one bug in that code. In case others find this thread and want to patch this in, this is the fixed version:

void QS_endRec_(void) {
    . . .
    // overrun over the old data?
    if (QS_priv_.used > end) {
        QS_priv_.used = end;   // the whole buffer is used
        for (;;) {
            uint8_t discarded_b = buf[head];
            ++head;
            if (head == end) {
                head = 0U;
            }
            --QS_priv_.used;
            if (discarded_b == QS_FRAME) {
                break;
            }
        }
        QS_priv_.tail = head;  // shift the tail to start of oldest record
    }
}

quantum-leaps · 2024-07-31T16:08:52Z

Hi Sven,
It seems to me that the real issue is filtering out this high-bandwidth QS trace record. If this is your own application-specific trace record with the record-ID, say, MY_RECORD, you can apply the QS global filter: QS_GLB_FILTER(-MY_RECORD) (please note the minus sign). If this is a pre-defined trace record, say, QS_QF_TICK, you filter it out: QS_GLB_FILTER(-QS_QF_TICK) (again, please note the minus sign). This is actually demonstrated in the DPP example.

But maybe your situation is a bit more subtle than that. Please provide more details if this crude solution does not address your situation.

Finally, regarding the proposed "fix" to the QS_endRec_() function, how about this version:

void QS_endRec_(void) {
    . . .
    // overrun over the old data?
    if (QS_priv_.used > end) {
        QS_priv_.used = end;   // the whole buffer is used
        do {
            b = buf[head];
            ++head;
            if (head == end) {
                head = 0U;
            }
            --QS_priv_.used;
        } while (b != QS_FRAME)
        QS_priv_.tail = head;  // shift the tail to the old data
    }
}

--MMS

mikael-s-persson · 2024-07-31T16:35:29Z

But maybe your situation is a bit more subtle than that. Please provide more details if this crude solution does not address your situation.

In a DMA interrupt (from ADC sampling at 1kHz), I post an event like this:

        ADCRXEvt *adc_ce = Q_NEW(ADCRXEvt, ADC_RX_READY_SIG);
        ...
        QACTIVE_POST(AO_SensorAggregator, &adc_ce->super, &l_dmaISR);

And essentially, what I need is something like QS_LOC_FILTER(-ADC_RX_READY_SIG) that would remove any traces directly resulting from that signal/event.

quantum-leaps · 2024-07-31T18:02:51Z

...what I need is something like QS_LOC_FILTER(-ADC_RX_READY_SIG)

Unfortunately, filtering by event signal is not possible because the signal space is too vast (it could be 32 bits).

You could filter out everything related to your AO_SensorAggregator AO. You can do this by: QS_LOC_FILTER(-AO_SensorAggregator->prio).

In case you also trace the Q_NEW() allocations, you'd need to suppress them as well. If your ADCRXEvt is allocated from a dedicated pool (i.e., the pool serves only these events), you could suppress the pre-defined record QS_QF_NEW only for that event pool by: QS_LOC_FILTER(-(QS_EP_ID + <pool-id>)).

Finally, a quick question: does the DMA need to post so frequently? I mean, DMA can collect ADC data into a buffer (event) and interrupt only when the buffer is full.

--MMS

mikael-s-persson · 2024-07-31T19:37:11Z

Yeah, that's pretty much the conclusion I had reached as far as filtering. And yes, we are going to improve our DMA collection to something more reasonably tuned for our needs.

quantum-leaps and others added 2 commits July 23, 2024 14:58

7.4.0-rc.3

a576ed4

Added logic to skip ahead to next intact message after overflowing th…

cff3516

…e QS queue, this removes spammy bad checksum errors

Improved version of fix

cfb4da9

quantum-leaps force-pushed the master branch 4 times, most recently from 720f487 to 37b686e Compare August 6, 2024 14:47

quantum-leaps force-pushed the master branch from ec20d14 to c284a04 Compare August 16, 2024 14:58

quantum-leaps force-pushed the master branch from e5cc46a to d3f94c7 Compare September 5, 2024 15:10

quantum-leaps force-pushed the master branch 4 times, most recently from a80ce2e to 309f482 Compare October 17, 2024 18:28

quantum-leaps force-pushed the master branch from 0cf63da to afb40bd Compare October 30, 2024 15:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added logic to skip ahead to next intact message after overflow #41

Added logic to skip ahead to next intact message after overflow #41

mikael-s-persson commented Jul 30, 2024

quantum-leaps commented Jul 31, 2024 •

edited

Loading

mikael-s-persson commented Jul 31, 2024

quantum-leaps commented Jul 31, 2024

mikael-s-persson commented Jul 31, 2024

quantum-leaps commented Jul 31, 2024

mikael-s-persson commented Jul 31, 2024

quantum-leaps commented Jul 31, 2024

mikael-s-persson commented Jul 31, 2024

Added logic to skip ahead to next intact message after overflow #41

Are you sure you want to change the base?

Added logic to skip ahead to next intact message after overflow #41

Conversation

mikael-s-persson commented Jul 30, 2024

quantum-leaps commented Jul 31, 2024 • edited Loading

mikael-s-persson commented Jul 31, 2024

quantum-leaps commented Jul 31, 2024

mikael-s-persson commented Jul 31, 2024

quantum-leaps commented Jul 31, 2024

mikael-s-persson commented Jul 31, 2024

quantum-leaps commented Jul 31, 2024

mikael-s-persson commented Jul 31, 2024

quantum-leaps commented Jul 31, 2024 •

edited

Loading