Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added logic to skip ahead to next intact message after overflow #41

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

mikael-s-persson
Copy link

Added logic to skip ahead to next intact message after overflowing the QS queue, this removes spammy "Bad checksum" errors.

This was discovered while having an issue of not being able to send the QS messages fast enough, which leads to an overflow of the QS buffer (fixed capacity circular buffer), which overwrites the data at the TX end (tail). This results in pretty much all subsequent messages coming out being corrupted, leading to a massive amount of "Bad checksum error" spamming the qspy output without leaving anything useful.

This PR just adds simple logic so that when the buffer is overflown (clobbering the tail), the tail is seek'ed forward to the next intact message by consuming bytes up to and including the next QS_FRAME.

The resulting behavior is that the overflowing condition results in "Discontinuity" errors rather than a wall of checksum errors, which is much more intuitive to diagnose the issue. It incurs more additional loss of information since the partially clobbered message at the tail is not lost already by the time this code reached. And finally, the additional non-deterministic timing of that loop is innocuous in a debugging facility (behind Q_SPY flag).

I tested this in my project and it works as described above.
I'm not sure if there's a test suite somewhere where this regression test condition should be added to. That should be pretty trivial to do.

quantum-leaps and others added 2 commits July 23, 2024 14:58
@quantum-leaps
Copy link
Contributor

quantum-leaps commented Jul 31, 2024

Hi Sven
Thank you for your interest in QP and QP/Spy software tracing.

...This results in pretty much all subsequent messages coming out being corrupted

I don't understand that claim. I think that only one (the oldest) trace record gets partially overwritten by the new trace record. Of course, if you keep adding to the head without advancing the tail, more and more old trace records gets overwritten, but still only one record at a time gets corrupted. Please explain why you think that all subsequent records come out corrupted.

--MMS

@mikael-s-persson
Copy link
Author

I was using the TX mechanism from the arm-cr dpp example: https://github.com/QuantumLeaps/qpc-examples/blob/8912782f022cdf8ebf2528619d7d46af7fd4f097/arm-cr/dpp_launchxl2-tms57012/qv/bsp.c#L322

When reaching full capacity on the buffer, the two ends of that pipe are basically racing each other constantly and the data coming out on sci is pretty much a continuous stream of nonsense, with only occasional intact messages getting through when the consumer end is winning briefly. And if you've reached full capacity, then the consumer most consistently losing that race is expected.

The PR fix is actually paired with a fix on the TX side, implementing the QV_onIdle via this function (which you might consider using in the dpp example):

void QS_updateOverSci(int32_t flush_msg_count) {
    QS_rxParse();  // parse all the received bytes

    if (flush_msg_count == 0 && !sciIsTxReady(sciREG1)) {
        // Cannot write, tx is not ready.
        return;
    }
    for (;;) {
        QF_INT_DISABLE();
        uint16_t b = QS_getByte();
        QF_INT_ENABLE();

        if (b != QS_EOD) {
            sciSendByte(sciREG1, b);
        } else {
            break;
        }
        if (b == QS_FRAME) {
            --flush_msg_count;
        }
        if (flush_msg_count <= 0) {
            break;
        }
    }
}

The above fix on the TX side could not work (easily) without changing QS's internals to seek forward to the message boundaries.

@quantum-leaps
Copy link
Contributor

Hi Sven,
The situation you describe with a persistent TX buffer overflow is pathological. Sure, you can give the consumer side a leg up by discarding some bytes that otherwise would have to be sent out. But the producer quickly catches up and you're again in the same place. Then you try to help the consumer again by indiscriminately flushing out some messages. But then you relinquish the control of what information gets discarded.

Producing less data is a much better (arguably the only) way to fix such issues. For this, QS provides powerful filtering mechanisms. These "global" and "local" filters allow you precisely to decide what is valuable and what is not. This is a much better policy than throwing stuff out indiscriminately.

Finally, I'd like you to understand my reluctance to accept such fixes. The main problem is that functions like QS_endRec_() run inside a critical section, which means that they should be short and deterministic. I hope you see that adding loops with unknown iteration counts does not help in this regard. Here is a significantly more efficient version of your fix for the QS_endRec_() function (without the overhead of calling QS_getByte() and all the other unnecessary checks):

void QS_endRec_(void) {
    . . .
    // overrun over the old data?
    if (QS_priv_.used > end) {
        QS_priv_.used = end;   // the whole buffer is used
        while (buf[head] != QS_FRAME) {
            ++head;
            if (head == end) {
                head = 0U;
            }
            --QS_priv_.used;
        }
        QS_priv_.tail = head;  // shift the tail to the old data
    }
}

But even this version is non-deterministic and I doubt that such a change wouldn't hurt more than it helps. (I would just use the filters!)

--MMS

@mikael-s-persson
Copy link
Author

The situation you describe with a persistent TX buffer overflow is pathological.

Of course it is. The point is being able to diagnose this effectively. When seeing a large number of "Bad checksum" errors, it was not easy to understand what the problem was. All I could find out about this error was this post: https://sourceforge.net/p/qpc/discussion/668726/thread/935d7a34a8/
Needless to say, the answer there served no purpose in understand what was going on, and worse, sent me on a wild goose chase suspecting some issue with interrupts, or the sci interface, or memory corruption, etc.
So, this patch (and the per-message QV_onIdle sending) is mainly aimed at making the errors more directly related to the underlying issue, i.e., causing "Discontinuity" errors rather than "Bad checksum" errors. There might be other ways to achieve the same result.
In lieu of fixing that problem, could you document (maybe somewhere under https://www.state-machine.com/qtools/qpspy.html, like in a troubleshooting section) that one of the likely cause of a large number of "Bad checksum" errors is this particular situation? It might save others the trouble I went through.

The main problem is that functions like QS_endRec_() run inside a critical section, which means that they should be short and deterministic.

Yes, I understand that. I did make a note of that fact in the original comment: "the additional non-deterministic timing of that loop is innocuous in a debugging facility (behind Q_SPY flag).". If you require run-time determinism even under the "SPY" build mode, then obviously, this is not appropriate.

These "global" and "local" filters allow you precisely to decide what is valuable and what is not.

I could not figure out a way to specifically filter out all QS output related in any way to one particular event (which fires at 1kHz, and is the main culprit for the inability to keep up). Any help on that would be appreciated.

Here is a significantly more efficient version of your fix

Thanks. There is an off-by-one bug in that code. In case others find this thread and want to patch this in, this is the fixed version:

void QS_endRec_(void) {
    . . .
    // overrun over the old data?
    if (QS_priv_.used > end) {
        QS_priv_.used = end;   // the whole buffer is used
        for (;;) {
            uint8_t discarded_b = buf[head];
            ++head;
            if (head == end) {
                head = 0U;
            }
            --QS_priv_.used;
            if (discarded_b == QS_FRAME) {
                break;
            }
        }
        QS_priv_.tail = head;  // shift the tail to start of oldest record
    }
}

@quantum-leaps
Copy link
Contributor

Hi Sven,
It seems to me that the real issue is filtering out this high-bandwidth QS trace record. If this is your own application-specific trace record with the record-ID, say, MY_RECORD, you can apply the QS global filter: QS_GLB_FILTER(-MY_RECORD) (please note the minus sign). If this is a pre-defined trace record, say, QS_QF_TICK, you filter it out: QS_GLB_FILTER(-QS_QF_TICK) (again, please note the minus sign). This is actually demonstrated in the DPP example.

But maybe your situation is a bit more subtle than that. Please provide more details if this crude solution does not address your situation.

Finally, regarding the proposed "fix" to the QS_endRec_() function, how about this version:

void QS_endRec_(void) {
    . . .
    // overrun over the old data?
    if (QS_priv_.used > end) {
        QS_priv_.used = end;   // the whole buffer is used
        do {
            b = buf[head];
            ++head;
            if (head == end) {
                head = 0U;
            }
            --QS_priv_.used;
        } while (b != QS_FRAME)
        QS_priv_.tail = head;  // shift the tail to the old data
    }
}

--MMS

@mikael-s-persson
Copy link
Author

But maybe your situation is a bit more subtle than that. Please provide more details if this crude solution does not address your situation.

In a DMA interrupt (from ADC sampling at 1kHz), I post an event like this:

        ADCRXEvt *adc_ce = Q_NEW(ADCRXEvt, ADC_RX_READY_SIG);
        ...
        QACTIVE_POST(AO_SensorAggregator, &adc_ce->super, &l_dmaISR);

And essentially, what I need is something like QS_LOC_FILTER(-ADC_RX_READY_SIG) that would remove any traces directly resulting from that signal/event.

@quantum-leaps
Copy link
Contributor

...what I need is something like QS_LOC_FILTER(-ADC_RX_READY_SIG)

Unfortunately, filtering by event signal is not possible because the signal space is too vast (it could be 32 bits).

You could filter out everything related to your AO_SensorAggregator AO. You can do this by: QS_LOC_FILTER(-AO_SensorAggregator->prio).

In case you also trace the Q_NEW() allocations, you'd need to suppress them as well. If your ADCRXEvt is allocated from a dedicated pool (i.e., the pool serves only these events), you could suppress the pre-defined record QS_QF_NEW only for that event pool by: QS_LOC_FILTER(-(QS_EP_ID + <pool-id>)).

Finally, a quick question: does the DMA need to post so frequently? I mean, DMA can collect ADC data into a buffer (event) and interrupt only when the buffer is full.

--MMS

@mikael-s-persson
Copy link
Author

Yeah, that's pretty much the conclusion I had reached as far as filtering. And yes, we are going to improve our DMA collection to something more reasonably tuned for our needs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants