don't use safe caller for event delegation to subscribers #861
Signed-off-by: Markus Rathgeb <[email protected]>
Adding a summary of my testing from here.

Test parameters
Build 1607:
Build 1607 with org.openhab.core as described above:
Note 1: Event rate measured with all event types in
Note 2: Event Manager thread CPU time measured by taking snapshots every 10 seconds using the following command:
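(The command itself is not reproduced here. For reference, a comparable per-thread CPU measurement can also be taken from inside the JVM via JMX; the following is a minimal, hypothetical sketch, not the command used above:)

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadCpuSnapshot {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (long id : threads.getAllThreadIds()) {
            ThreadInfo info = threads.getThreadInfo(id);
            if (info != null) {
                // getThreadCpuTime returns nanoseconds (-1 if unsupported or disabled)
                System.out.printf("%8d ms  %s%n",
                        threads.getThreadCpuTime(id) / 1_000_000, info.getThreadName());
            }
        }
    }
}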
Thanks a lot to @mhilbush for his testing!
I cranked the event rate up to 69 events per second. Pretty darn good! CPU: 15-20% with spikes to 40%
Thanks. Ran a Fortune 100 web site for 8 years. Our philosophy was:
private final ScheduledExecutorService watcher = Executors
        .newSingleThreadScheduledExecutor(new NamedThreadFactory("EventHandlerWatcher"));
private final ExecutorService executor = Executors
        .newSingleThreadExecutor(new NamedThreadFactory("EventHandlerExecutor"));
AFAIR, the safeCaller does not use a single thread, but a pool of x threads. Are we safe to reduce it to a single thread here, or might we run into other issues?
I also remember a feature where we assigned events to bindings and made sure that if one binding exhausts all threads of a pool, this does not negatively impact the others. I cannot find this in the code right now, so I assume it was not part of the internal event handling itself.
Events should be handled in order.
Using multiple threads for the event handling can result in a different event order.
Assume changed events "1->2", "2->3".
If you use different threads, how do you ensure that the submitted executions keep their order, given that one thread can be scheduled ahead of the other?
E.g. you submit the event handling for "1->2" to the executor, and after that you submit the event handling for "2->3".
The executor uses thread1 for "1->2" and thread2 for "2->3".
Now, sometimes thread1 is executed before thread2, other times thread2 is executed before thread1 (or they are executed at the same time on different CPUs, or execution alternates between them, etc.).
The event order is then no longer deterministic.
So, regardless of how it has been done before, it should be done in order.
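(A minimal, hypothetical sketch of the difference; the class name and printed payloads are purely illustrative:)

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EventOrderDemo {
    public static void main(String[] args) {
        // A single worker thread drains its queue in submission order,
        // so "1->2" is always handled before "2->3".
        ExecutorService ordered = Executors.newSingleThreadExecutor();
        ordered.submit(() -> System.out.println("handle 1->2"));
        ordered.submit(() -> System.out.println("handle 2->3"));
        ordered.shutdown();

        // With a pool of two threads, both tasks may run concurrently and
        // either may finish first; the observed order is non-deterministic.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> System.out.println("handle 1->2"));
        pool.submit(() -> System.out.println("handle 2->3"));
        pool.shutdown();
    }
}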
IIRC this was also discussed some years ago.
As written above already, we could interrupt the event subscriber if the limit has been exceeded.
But if you realize that an add-on consumes too many resources, perhaps it would be better to fix that add-on (and notice the bad behaviour).
Just to put one more data point out there...
On my test system (a ~9-year-old AMD CPU with 6 cores running at 3.7 GHz each), I max out one core at ~90 events per second. I produce this load with 5 systeminfo things set to update all channels (except disk, because I don't know the performance penalty of that function) at a refresh rate of interval_high=1, interval_medium=4.
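(For reference, such a configuration might look roughly like this in a .things file; the thing IDs are hypothetical, only the interval values are the ones quoted above:)

Thing systeminfo:computer:load1 [ interval_high=1, interval_medium=4 ]
Thing systeminfo:computer:load2 [ interval_high=1, interval_medium=4 ]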
The top CPU-consuming threads are:
735734 OH-EventHandlerExecutor-1
20119 Timer-7
19935 ESH-usb-serial-discovery-linux-sysfs-1
18066 pool-4-thread-1
17530 OH-OSGiEventManager
11320 EventAdminAsyncThread #11
11261 EventAdminAsyncThread #12
9450 safeCall-40
9211 OH-EventHandlerWatcher-1
9129 safeCall-38
8779 pool-11-thread-1
8565 Start Level: Equinox Container: 7dcb28cd-c6e7-4135-a1f8-6d3d46449a09
8531 safeCall-39
6941 sshd-SshServer[5e8ba151]-nio2-thread-1
6514 sshd-SshServer[5e8ba151]-nio2-thread-2
6321 safeCall-37
6212 pool-9-thread-1
6205 ZWaveReceiveInputThread
4087 safeCall-41
2974 Finalizer
2922 Reference Handler
2681 items-67
2461 pool-706-thread-1
2097 fileinstall-/opt/openhab2/userdata/etc
2078 Karaf Lock Monitor Thread
1925 items-68
1643 openHAB-job-scheduler_Worker-1
1580 openHAB-job-scheduler_Worker-2
1555 ResourcePublisher
1516 Active Thread: Equinox Container: 7dcb28cd-c6e7-4135-a1f8-6d3d46449a09
Overall, openHAB is consuming a little less than two CPUs' worth of load.
top - 08:03:15 up 10 days, 14:04, 2 users, load average: 2.03, 1.43, 1.13
Tasks: 330 total, 2 running, 263 sleeping, 0 stopped, 0 zombie
%Cpu(s): 32.6 us, 1.8 sy, 0.0 ni, 65.3 id, 0.1 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 8167104 total, 2072852 free, 2031328 used, 4062924 buff/cache
KiB Swap: 16876540 total, 16876540 free, 0 used. 5811896 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31984 root 20 0 7042648 747712 23360 S 181.7 9.2 65:11.77 java
10228 mark 20 0 819608 100000 53848 S 9.3 1.2 23:08.14 Xorg
15095 xrdp 20 0 59036 28976 8084 S 7.0 0.4 20:59.19 xrdp
10900 mark 20 0 542100 41572 25628 S 4.3 0.5 8:15.48 xfce4-terminal
10323 mark 20 0 429236 26368 20684 S 0.7 0.3 0:41.54 xfwm4
1932 root 20 0 1668112 66936 38988 S 0.3 0.8 15:09.72 dockerd
3341 gdm 20 0 589708 24896 18020 R 0.3 0.3 5:55.77 gsd-color
27420 mark 20 0 13824 876 776 S 0.3 0.0 0:01.59 tail
31641 mark 20 0 42052 3800 2912 R 0.3 0.0 0:00.07 top
1 root 20 0 226512 10252 6708 S 0.0 0.1 0:42.78 systemd
@maggu2810 We never had the requirement to respect a specific order, and AFAIR older OSGi specs didn't ask for ordered delivery either. I remember discussions where we found out that Equinox used single-threaded delivery while Felix used a thread pool instead.
It seems that OSGi 7 has now introduced a specific setting for defining whether ordering is required or not.
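(If I read the R7 Event Admin spec correctly, a handler declares this via the event.delivery service property. A hypothetical registration sketch, assuming a BundleContext is available; the topic is illustrative:)

import java.util.Dictionary;
import java.util.Hashtable;
import org.osgi.framework.BundleContext;
import org.osgi.service.event.Event;
import org.osgi.service.event.EventConstants;
import org.osgi.service.event.EventHandler;

public class UnorderedHandlerRegistration {
    // `context` would come from a BundleActivator or DS component.
    void register(BundleContext context) {
        EventHandler handler = (Event event) -> System.out.println(event.getTopic());
        Dictionary<String, Object> props = new Hashtable<>();
        props.put(EventConstants.EVENT_TOPIC, "openhab/*");
        // R7 (Event Admin 1.4): declare that this handler does not need ordered delivery.
        props.put(EventConstants.EVENT_DELIVERY, EventConstants.DELIVERY_ASYNC_UNORDERED);
        context.registerService(EventHandler.class, handler, props);
    }
}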
But if you realize that an add-on consumes too many resources, perhaps it would be better to fix that add-on (and notice the bad behaviour).
Sure, but the idea was that its effect is constrained to its own functionality and does not cause others to misbehave at the same time. I remember situations where a binding took too long when consuming events, and this resulted in late delivery of events to other bindings, which then showed very odd behavior; it was difficult to figure out which binding was the culprit. It is much better if only the binding that actually has the bug behaves badly.
IMO, this PR is a dramatic improvement over the current implementation. If there are no issues other than the ability to scale beyond the rates I've outlined above, can't this be merged? It resolves a very critical issue in the current snapshots.
If there's a desire to change the implementation to scale better, can't that be done in another PR?
I had a look at the code, and the changes in this PR don't impact how commands are sent to ThingHandlers using the SafeCaller:
Lines 71 to 75 in e875826
safeCaller.create(handler, ThingHandler.class)
        .withTimeout(CommunicationManager.THINGHANDLER_EVENT_TIMEOUT).onTimeout(() -> {
            logger.warn("Handler for thing '{}' takes more than {}ms for handling a command",
                    handler.getThing().getUID(), CommunicationManager.THINGHANDLER_EVENT_TIMEOUT);
        }).build().handleCommand(link.getLinkedUID(), command);
So commands will still be handled asynchronously, which was my main concern. Commands will thus continue to be handled even if a couple of ThingHandlers hijack threads, e.g. due to network timeouts or when waiting on locks.
FTR this is how I generated the event load.
I was just about to ask @mhilbush to add some command handling to the load generator (with e.g. 300ms processing time of the handler).
But @wborn's observation that this had actually moved to the ProfileCallback should hopefully answer it - this should not be related to the event dispatching itself anymore. I consider my second concern to be clarified then.
Thank you all for the hard work and your patience with me!
Thank you all for the hard work and your patience with me!
No worries. It's a change to a critical area, so I get the need for diligence.
I was just about to ask @mhilbush to add some command handling to the load generator (with e.g. 300ms processing time of the handler).
I wrote the load generator in a hurry, so there's room for improvement. As I plan to use it in the future, and possibly extend it, what were you looking for here?
what were you looking for here?
A feature that makes the load generator not only create status updates, but also commands, and then have the handleCommand method in the ThingHandler block for e.g. 300ms to simulate some heavy work (like accessing external devices). This would better simulate a setup where users can directly observe whether there is a delay in event processing: if switching on a device suddenly takes a few seconds because the command hung in a single-threaded queue, it is instantly visible and annoying to the user. But as mentioned above, this PR should not have an issue with this behavior, so such an addition would only be interesting for future tests in other contexts.
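(A hypothetical sketch of such a blocking handler; the class name is invented and the package names are as in current openHAB core, so this is not part of the actual load generator:)

import org.openhab.core.library.types.OnOffType;
import org.openhab.core.thing.ChannelUID;
import org.openhab.core.thing.Thing;
import org.openhab.core.thing.ThingStatus;
import org.openhab.core.thing.binding.BaseThingHandler;
import org.openhab.core.types.Command;

public class SlowCommandThingHandler extends BaseThingHandler {

    public SlowCommandThingHandler(Thing thing) {
        super(thing);
    }

    @Override
    public void initialize() {
        updateStatus(ThingStatus.ONLINE);
    }

    @Override
    public void handleCommand(ChannelUID channelUID, Command command) {
        try {
            Thread.sleep(300); // simulate ~300 ms of heavy work, e.g. device I/O
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        updateState(channelUID, OnOffType.ON); // acknowledge, so any queueing delay is observable
    }
}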
@mhilbush Can you check that summary of your test? Is my understanding correct that you are saying the test makes no sense against the current master branch, because it is so slow?
Yes, the current master branch is too slow to run anything close to the volume I ran against Baseline, Test 1, and Test 2.
I was just thinking about doing a table like this. 😉 Working on it now.
Can you update the ID column with the following so that the rows map to the tests above? Otherwise, you've captured it correctly. 1 => Baseline
I assume the naming "baseline" is very confusing. So I preferred not to choose a special name at all, but an ID, a description, and a truth table. 😉
This pull request has been mentioned on openHAB Community. There might be relevant details there: https://community.openhab.org/t/preparation-for-2-5m2/75738/73
Signed-off-by: Markus Rathgeb <[email protected]>
GitOrigin-RevId: 8e195d4
These changes should address #829.
At the moment we warn if an event subscriber exceeds the expected maximum runtime.
This has been done before, too.
It would also be possible (something another implementation I played around with today does, too) to interrupt the event subscriber if it exceeds the limit. I don't think we should do this: it could lead to very confusing behaviour under very high load. The warning should be enough to detect a really badly behaving event subscriber.
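(A minimal sketch of that warn-only approach, building on the watcher/executor pair shown earlier; the timeout value, logging, and method names are illustrative assumptions, not the actual PR code:)

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class EventDispatchSketch {
    private static final long MAX_RUNTIME_MS = 5000; // assumed limit

    private final ScheduledExecutorService watcher = Executors.newSingleThreadScheduledExecutor();
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    void dispatch(Object subscriber, Runnable handleEvent) {
        // The single-threaded executor preserves event order (see discussion above).
        executor.submit(() -> {
            // Schedule a warning instead of interrupting the subscriber.
            ScheduledFuture<?> warning = watcher.schedule(
                    () -> System.err.printf("Subscriber %s runs longer than %d ms%n",
                            subscriber, MAX_RUNTIME_MS),
                    MAX_RUNTIME_MS, TimeUnit.MILLISECONDS);
            try {
                handleEvent.run();
            } finally {
                warning.cancel(false); // no-op if the warning already fired
            }
        });
    }
}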