-
I'm trying to offload some UART IO work into a worker: #1193

In this approach, the main task delegates all UART IO to the worker, so there is a lot of message passing between them. I found that the delay between `postMessage` and `onmessage` varies a lot (I'm using `new Date().getTime()` to get a rough tick count to measure performance).

To my understanding (I'm running on an ESP32-S3), each worker runs on a FreeRTOS task, and FreeRTOS message queues are used for inter-task communication. I checked the ESP32 codebase: preemption is supported in my configuration and the tick rate is 1000 Hz. So the delay while the system is idle should be ~1-2 ms. So I guess this time I'm running into performance issues on this tiny 240 MHz chip... My guesses are:
Any suggestions?
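The per-message measurement can be sketched like this. The echo "worker" below is simulated with `setTimeout` so the timing logic is runnable anywhere; on the device, a real worker from the Moddable `worker` module would take its place (that substitution, and the message shape, are assumptions of this sketch):

```javascript
// Sketch: measure the postMessage -> onmessage round trip by tagging
// each message with a send timestamp and comparing on the reply.
// The echo worker simulates the real one with setTimeout.
function makeEchoWorker(delayMs) {
  return {
    onmessage: null,
    postMessage(msg) {
      setTimeout(() => this.onmessage && this.onmessage(msg), delayMs);
    },
  };
}

function measureRoundTrip(worker) {
  return new Promise((resolve) => {
    const sent = Date.now(); // coarse millisecond tick, like new Date().getTime()
    worker.onmessage = () => resolve(Date.now() - sent);
    worker.postMessage({ sent });
  });
}

const worker = makeEchoWorker(5);
measureRoundTrip(worker).then((ms) => console.log(`round trip: ${ms} ms`));
```

Logging each round trip individually, rather than an average, makes the variance visible and shows whether slow messages cluster (e.g. around GC runs).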
-
I'm not sure what is going on. I'd like to understand better how much data you are pushing through the system:
Using async/await is fine, of course. But nothing is free. Each await allocates memory and so puts a little pressure on the GC. That eventually adds up. Also, running the callbacks of resolved promises will block messages received from the worker. The virtual machine can only do one thing at a time. Can you quantify "LOTS"? I'm curious how many promises are resolved per second. If you aren't sure, we can add some simple logging to
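The logging could be as simple as this sketch, where `counted()` and the reporting interval are hypothetical names, not part of the original code:

```javascript
// Sketch: rough count of promise resolutions per second, by routing
// awaits through a counting wrapper.
let resolvedCount = 0;

async function counted(promise) {
  const result = await promise;
  resolvedCount += 1;
  return result;
}

// Log and reset once per second. On a Moddable device, Timer.repeat
// from the "timer" module would play the role of setInterval here.
function startReporting() {
  return setInterval(() => {
    console.log(`promises resolved/s: ${resolvedCount}`);
    resolvedCount = 0;
  }, 1000);
}
```

Wrapping only the hot await sites (e.g. inside the command loop) is enough to get an order-of-magnitude number.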
This is correct. And you see the 1 ms latency in the worker example, so we know that is possible. Do make sure that you have both cores enabled. I think the ESP-IDF default is just one. The Moddable SDK sdkconfig settings enable both -- if you are using that.
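For reference, dual-core operation in ESP-IDF builds is controlled by the `CONFIG_FREERTOS_UNICORE` option in `sdkconfig`. A sketch of what to look for (exact defaults depend on your IDF version and target):

```
# In sdkconfig, both cores are enabled when this option is unset:
# CONFIG_FREERTOS_UNICORE is not set
```

A single-core build would instead contain the line `CONFIG_FREERTOS_UNICORE=y`.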
What you are doing is accurate.
-
Yeah, it's hard to explain what's going on... We have a hardware abstraction protocol that tunnels all peripherals through a UART, and we're using this mechanism to run automated tests, which indeed stress the system a lot. The call chain looks like this:
Each atCmd() call sends some data (~20-50 bytes) to the worker, and the worker sends the data through the UART. Then atCmd() waits for data coming back from the communication module (like "OK") and continues. Below is the original log with lots of details.
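That request/response round trip can be sketched roughly like this. Every name and the message shape here are hypothetical, and it is simplified to a single in-flight command; the real code would need queuing, timeouts, and matching of responses:

```javascript
// Hypothetical sketch of the atCmd() pattern: post a command to the
// UART worker, then await the module's reply line (e.g. "OK").
// Handles one in-flight command at a time for simplicity.
let pendingResolve = null;

function onWorkerMessage(msg) { // wired to worker.onmessage on-device
  if (pendingResolve && msg.line === "OK") {
    const resolve = pendingResolve;
    pendingResolve = null;
    resolve(msg.line);
  }
}

function atCmd(worker, command) {
  return new Promise((resolve) => {
    pendingResolve = resolve;
    worker.postMessage({ uart: command }); // ~20-50 bytes per command
  });
}
```

Each such call creates a promise and an await on the main VM, which is where the per-command GC pressure discussed above comes from.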
There are a few things to note from the log:
-
@linfan68 Your intuition about the use of "incremental" is on target. It is described in the manifest and XS in C docs. For embedded projects, we almost always want to set the incremental values to 0. That prevents the virtual machine from growing at runtime, so that it cannot use more memory than expected -- which could disrupt operation of other parts of the system. It is also more efficient to initially allocate the memory needed for a VM, rather than incrementally growing the VM. For devices with lots of memory like yours, having a non-zero value for incremental can be convenient, even if sub-optimal. If nothing more, it would be helpful during development to see where the memory allocation stabilizes. Those values can then be used to set the initial allocations.
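As a sketch, the `creation` section of a manifest with incremental growth disabled might look like this. The initial values below are placeholder assumptions, to be tuned from where the instrumentation shows allocation stabilizing; see the manifest documentation for the full set of fields:

```json
"creation": {
  "chunk": {
    "initial": 32768,
    "incremental": 0
  },
  "heap": {
    "initial": 2048,
    "incremental": 0
  }
}
```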
This kind of heuristic can be useful. We haven't taken a doubling approach in XS because it consumes memory faster, which can be dangerous on a memory-constrained device. It is easy enough to experiment with different behaviors. I think the following will do what you describe.

For slots, after this line (moddable/xs/sources/xsMemory.c, line 1648 in d471f26), add:

```c
the->minimumHeapCount *= 2;
```

For chunks, after this line (moddable/xs/sources/xsMemory.c, line 570 in d471f26), add:

```c
if (the->firstBlock)
	the->minimumChunksSize *= 2;
```

Please give that a try and share how it goes. If the direction is promising, we can think about how to integrate something like that.
That's some helpful data, thank you.
It is great to see that the worker managing serial is running fully asynchronously.
That's correct.
The garbage collector is taking most of the time. It is running more frequently than I had expected based on the initial report. So, let's focus there.
From the instrumentation line for the main VM, we can see:
You have lots of free system memory, so…