
Use a virtual threads friendly pool with Jackson #35816

Closed
mariofusco wants to merge 8 commits

Conversation

mariofusco
Contributor

This pull request is a follow-up to the work done to make the internal pool used by Jackson configurable and pluggable. It is not intended to be merged (also because we will have to wait for the stable release of Jackson 2.16 before doing so), but to start some discussion on how and when to use this feature.

/cc @geoand @cescoffier @franz1981

quarkus-bot added the area/dependencies and area/jackson labels on Sep 8, 2023
@@ -63,6 +64,9 @@ public ObjectMapper objectMapper(@All List<ObjectMapperCustomizer> customizers,
for (ObjectMapperCustomizer customizer : sortedCustomizers) {
customizer.customize(objectMapper);
}
if (true) {
Contributor Author

By default, and mostly for backward-compatibility reasons, Jackson uses a ThreadLocal-based pool, which however isn't virtual-thread friendly.

We should decide under which conditions we may want to replace that default pool with something not relying on ThreadLocal. Personally I wouldn't leave this decision to end users, but would make it automatically. For instance, we could use a virtual-thread-friendly pool when the project is running on Java 21+ and uses the @RunOnVirtualThread annotation at least once.

@@ -63,6 +64,9 @@ public ObjectMapper objectMapper(@All List<ObjectMapperCustomizer> customizers,
for (ObjectMapperCustomizer customizer : sortedCustomizers) {
customizer.customize(objectMapper);
}
if (true) {
objectMapper.getFactory().setBufferRecyclerPool(BufferRecyclerPool.LockFreePool.shared());
Contributor Author

Here I'm just demonstrating how to configure a different pool implementation using the new Jackson API. The specific pool that I'm using is one of the implementations now offered out of the box by Jackson. This pool works reasonably well also with virtual threads, but it's pretty simple, also because one of the constraints understandably imposed by the Jackson maintainers was not to add any dependency to jackson-core. I believe that we could develop our own pool implementation, for instance based on some JCTools highly concurrent data structures, and if we find that it performs satisfactorily in all conditions (with both native and virtual threads) we may also decide to always use it regardless.
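
For reference, this is roughly what plugging one of the out-of-the-box Jackson 2.16 pools looks like (a minimal sketch: only the setBufferRecyclerPool call and BufferRecyclerPool.LockFreePool.shared() come from the diff above, the wrapper class is illustrative):

import com.fasterxml.jackson.core.util.BufferRecyclerPool;
import com.fasterxml.jackson.databind.ObjectMapper;

public final class PoolConfigExample {
    static ObjectMapper newMapper() {
        ObjectMapper mapper = new ObjectMapper();
        // Replace the default ThreadLocal-based pool with the lock-free implementation
        // shipped with jackson-core 2.16, which doesn't rely on ThreadLocal at all.
        mapper.getFactory().setBufferRecyclerPool(BufferRecyclerPool.LockFreePool.shared());
        return mapper;
    }
}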

mariofusco marked this pull request as draft September 8, 2023 08:33
@cescoffier
Member

@geoand @franz1981 @mariofusco I think for the first step we should have a condition like this:

java >= 21 && at least one @RunOnVirtualThread annotation used

I would use one of the default implementations for now, until we can measure the performance and see which one is the

  • best with virtual threads
  • best with platform threads (both worker and event loop)

We also need to see if we can leverage the JCTools queues to improve performance.
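
To make the proposed condition concrete, a rough sketch of the guard (purely illustrative: the helper class and the annotationUsed flag are made up and stand in for whatever build-time detection of @RunOnVirtualThread Quarkus would do; the JDK and Jackson calls are the same ones shown in the diff above):

import com.fasterxml.jackson.core.util.BufferRecyclerPool;
import com.fasterxml.jackson.databind.ObjectMapper;

final class VirtualThreadPoolGuard { // hypothetical helper, for illustration only
    // 'annotationUsed' would be computed at build time from @RunOnVirtualThread usages
    static void configure(ObjectMapper objectMapper, boolean annotationUsed) {
        if (Runtime.version().feature() >= 21 && annotationUsed) {
            objectMapper.getFactory().setBufferRecyclerPool(BufferRecyclerPool.LockFreePool.shared());
        }
    }
}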

@geoand
Contributor

geoand commented Sep 8, 2023

Currently @RunOnVirtualThread can be applied to a method, class or the entire application, correct?

What I am thinking is that we might want to be more fine-grained about which Jackson ObjectMapper we update.

@mariofusco
Contributor Author

What I am thinking is that we might want to be more fine-grained about which Jackson ObjectMapper we update.

If I understand this comment correctly, "being more fine-grained" implies that we will have more than one pool instance around. While the current Jackson implementation supports this scenario without any problem, I advise against it, because we would have an unnecessary proliferation of those BufferRecycler objects scattered among the different pools in use, which is exactly what the presence of a (possibly single) pool is intended to avoid.

@geoand
Contributor

geoand commented Sep 8, 2023

Gotcha thanks.

What I am concerned about is that if there is, say, a single @RunOnVirtualThread for one JAX-RS class, then that will end up impacting the ObjectMapper used by every JAX-RS class (and actually every other piece of Quarkus code that uses the default mapper).
If we don't think that will cause problems, I'm fine with the global configuration.

@mariofusco
Contributor Author

If we don't think that will cause problems, I'm fine with the global configuration.

As I wrote, my ideal solution would be to have a pool implementation (possibly leveraging JCTools or some similar library) that is good enough in all possible situations (with and without virtual threads), so we could plug it in and use it regardless of the specific project's conditions and environment. I will investigate this, possibly with the help of our performance superhero @franz1981.

@geoand
Contributor

geoand commented Sep 8, 2023

Sounds good!

@franz1981
Contributor

franz1981 commented Sep 8, 2023

@geoand @cescoffier Thanks to the change by @mariofusco I think we could leverage this already to speed up the non-Loom case too :P:P

See here

[flamegraph image]

This is the flamegraph of the JSON test on Techempower (which is using the reactive stack AND no pipelining): the encoding to JSON is largely dominated by SoftReference::get and ThreadLocal::get (at https://github.com/FasterXML/jackson-core/blob/f511bcd022acb3d33fbe8cbf4c44d3de340e4b91/src/main/java/com/fasterxml/jackson/core/util/BufferRecyclers.java#L57): I think that, now that we can plug in our own recycler, we could opt to use FastThreadLocal(s) and maybe remove the soft reference (or use a better mechanism?).
Replacing the thread local with a fast thread local should be a no-brainer and I suggest doing it in a separate PR, wdyt?

FYI, soft refs are pretty useless (in practice) due to how SoftRefLRUPolicyMSPerMB (http://www.oracle.com/technetwork/java/hotspotfaq-138619.html#gc_softrefs) works, and it usually kicks in when it's too late; I think we can make a better choice there, really.

Just for reference from https://docs.oracle.com/en/java/javase/17/gctuning/other-considerations.html#GUID-9E3E5371-20F5-4B70-A003-9D7851B115AF

Soft References

Soft references are kept alive longer in the server virtual machine than in the client.
The rate of clearing can be controlled with the command-line option -XX:SoftRefLRUPolicyMSPerMB=, which specifies the number of milliseconds (ms) a soft reference will be kept alive (once it is no longer strongly reachable) for each megabyte of free space in the heap. The default value is 1000 ms per megabyte, which means that a soft reference will survive (after the last strong reference to the object has been collected) for 1 second for each megabyte of free space in the heap. This is an approximate figure because soft references are cleared only during garbage collection, which may occur sporadically.

Which means that the BufferRecycler needs to no longer be strongly reachable (which should happen on JsonGenerator::close, i.e. right after the encode is completed, unless we are streaming JSON) before the GC policy kicks in: say we have 50 MB free in the heap, it means that such a reference will be kept alive for 50 seconds(!!!), which is quite a long time!!! And it gets worse and worse as we use less heap (because more data will be kept alive for longer and longer!)
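
For what it's worth, the FastThreadLocal idea mentioned above would look roughly like this (a sketch only: it uses Netty's io.netty.util.concurrent.FastThreadLocal, the holder class name is made up, and the wiring into Jackson's recycler-pool SPI is deliberately omitted):

import com.fasterxml.jackson.core.util.BufferRecycler;
import io.netty.util.concurrent.FastThreadLocal;

final class FastThreadLocalRecyclerHolder { // hypothetical name
    // One strongly-referenced BufferRecycler per thread-local slot: cheap indexed lookup
    // on FastThreadLocalThreads and no SoftReference indirection on the hot path.
    private static final FastThreadLocal<BufferRecycler> RECYCLER = new FastThreadLocal<>() {
        @Override
        protected BufferRecycler initialValue() {
            return new BufferRecycler();
        }
    };

    static BufferRecycler acquire() {
        return RECYCLER.get();
    }
}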

@geoand
Contributor

geoand commented Sep 8, 2023

Replacing the thread local with a fast thread local should be a no-brainer and I suggest doing it in a separate PR, wdyt?

Yeah I agree, @mariofusco's change opens this up for us, that's great!

In summary, if we can prove that our own BufferRecyclerPool is better regardless of Loom, I'm all for including it

@franz1981
Contributor

franz1981 commented Sep 8, 2023

@geoand I can easily run (ehm ehm, gotta check if the CI is working TBH) a custom Quarkus build - as long as the Jackson version is on Maven - or use my shiny new Ryzen and play with it already

Beware that the fast-thread-local one is for Java < 21 because, similarly to thread locals, it won't play nicely with Loom, but still... given that it will take a few years before everything runs on VT, we can have benefits for everyone in the meantime

@mariofusco
Contributor Author

As anticipated, my original goal was implementing a pool that is good enough for "all seasons", so I put together all the suggestions from @cescoffier and @geoand (yeah, why not do something more fine-grained that behaves differently for native and virtual threads and takes the best of both worlds?), together with the irreplaceable help of @franz1981, to implement this new HybridJacksonPool.

This pool works regardless of the JVM version in use and internally uses 2 distinct pools: one for native threads (which is exactly the same ThreadLocal-based one provided by Jackson out of the box) and the other designed to be virtual-thread friendly. It switches between the 2 depending only on the nature of the thread (virtual or not) requesting the acquisition of a pooled resource, a check performed via MethodHandle to guarantee compatibility with older JVM versions as well.

Note that this pool also guarantees that a pooled resource is always released to the same internal pool from which it was acquired, even if the releasing thread is different from the one that originally acquired it.
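
To make the description above concrete, here is a much-simplified sketch of the dispatch logic (this is not the actual HybridJacksonPool code: the field types and the release tagging are illustrative stand-ins, only the acquire/release routing idea reflects what is described above):

import com.fasterxml.jackson.core.util.BufferRecycler;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.function.Predicate;

final class HybridPoolSketch {
    private final Predicate<Thread> isVirtual;                        // resolved via MethodHandle in the real pool
    private final ThreadLocal<BufferRecycler> nativePool =            // same idea as Jackson's default pool
            ThreadLocal.withInitial(BufferRecycler::new);
    private final ConcurrentLinkedDeque<BufferRecycler> virtualPool =  // stand-in for the virtual-thread-friendly pool
            new ConcurrentLinkedDeque<>();

    HybridPoolSketch(Predicate<Thread> isVirtual) {
        this.isVirtual = isVirtual;
    }

    BufferRecycler acquire() {
        if (isVirtual.test(Thread.currentThread())) {
            BufferRecycler r = virtualPool.pollFirst();
            return r != null ? r : new BufferRecycler();
        }
        return nativePool.get();
    }

    void release(BufferRecycler recycler, boolean acquiredFromVirtualPool) {
        // the real pool tags each recycler with its owning sub-pool, so it is always
        // returned to the pool it was acquired from, regardless of the releasing thread
        if (acquiredFromVirtualPool) {
            virtualPool.offerFirst(recycler);
        }
        // recyclers owned by the ThreadLocal pool simply stay attached to their thread
    }
}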

I developed and ran some preliminary benchmarks and the results look extremely promising. In particular, comparing the original Jackson pool with my hybrid one using a benchmark with 100 threads (either native or virtual) running in parallel and marshalling a simple object to JSON, I got the following outcome:

Benchmark                                          (objectSize)  (parallelTasks)  (poolStrategy)  (useVirtualThreads)   Mode  Cnt      Score     Error  Units
JacksonMultithreadWriteVanilla.writePojoMediaItem         small              100    THREAD_LOCAL                 true  thrpt   10   6309.715 ±  16.574  ops/s
JacksonMultithreadWriteVanilla.writePojoMediaItem         small              100    THREAD_LOCAL                false  thrpt   10  22541.301 ± 648.275  ops/s
JacksonMultithreadWriteVanilla.writePojoMediaItem         small              100          HYBRID                 true  thrpt   10  19365.310 ± 285.984  ops/s
JacksonMultithreadWriteVanilla.writePojoMediaItem         small              100          HYBRID                false  thrpt   10  22568.321 ± 158.982  ops/s

If these results are confirmed (I will ask @franz1981 to double-check them), I don't see any reason why we shouldn't use this pool unconditionally.

@geoand
Contributor

geoand commented Sep 15, 2023

Awesome!


private static Predicate<Thread> findIsVirtual() {
    try {
        MethodHandle virtualMh = MethodHandles.publicLookup().findVirtual(Thread.class, "isVirtual", MethodType.methodType(boolean.class));
Contributor

franz1981 Sep 15, 2023

I would save the MethodHandle in a static final variable, or sadly it would be quite costly.
I know it's captured by the predicate, but I suggest benchmarking to be sure it will be optimized the same... if we care about not having the MethodHandle populated until necessary, we could use a nice static holder pattern,
i.e. a VirtualChecker static class which would hold the method handle ONLY when it is called the first time (if it is ever called).
In this way the usual exceptions thrown by the JVM during the reflective search with method handles won't happen until it is actually used.
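
The lazy static-holder shape being suggested would look more or less like this (a sketch: the class name VirtualChecker comes from the comment above, the fallback behaviour on lookup failure is a guess):

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// The MethodHandle lookup (and any exception it may throw on pre-Loom JVMs)
// only happens when VirtualChecker is first used, i.e. at class initialization.
final class VirtualChecker {
    private static final MethodHandle IS_VIRTUAL_MH = findIsVirtualMh();

    private static MethodHandle findIsVirtualMh() {
        try {
            return MethodHandles.publicLookup()
                    .findVirtual(Thread.class, "isVirtual", MethodType.methodType(boolean.class));
        } catch (NoSuchMethodException | IllegalAccessException e) {
            return null; // Thread::isVirtual not available on this JVM
        }
    }

    static boolean isVirtual(Thread t) {
        if (IS_VIRTUAL_MH == null) {
            return false;
        }
        try {
            return (boolean) IS_VIRTUAL_MH.invokeExact(t);
        } catch (Throwable e) {
            throw new RuntimeException(e);
        }
    }
}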


private static long getProbeOffset() {
try {
return UnsafeAccess.UNSAFE.objectFieldOffset(Thread.class.getDeclaredField("threadLocalRandomProbe"));
Contributor

@cescoffier is there anything to do here, considering that we use the UnsafeAccess from JCTools AND probably native image won't be happy about it?

Member

Yes, it might be a problem in native. We avoid Unsafe.

@jponge that's also something to consider when using JCTools in Mutiny.


private final int mask;

private final MpmcUnboundedXaddArrayQueue<BufferRecycler>[] queues;
Contributor

franz1981 Sep 15, 2023

This queue is cool, but it uses Unsafe and I didn't (ehm ehm, my fault!) implement a VarHandle/atomic-ref/long version of it, as for other JCTools variants... so maybe @cescoffier has some suggestion about native image or the approach to follow here; I'm not aware of Quarkus code directly using Unsafe, but I could be wrong (I haven't searched for it)...

Last but not least: JCTools is great (not by my merit :P), but the benchmarks don't show it to be the best tool for the job by a large margin, because of how CPUs work:

  • the MPMC queue uses an array, adding/removing elements out of it in FIFO order, BUT: if the queue is empty or holds fewer than 16/32 elements, producer(s) and consumer(s) will perform store(s) (by adding new values, or reading and writing null to mark consumption) on the same cache line, competing for it. The CPU's cache-coherency protocol will make the ping-pong of that up-to-date information the primary bottleneck. In order to benefit from using a queue (and get the contention to spread between head/tail), you need a much larger, already populated pool (more than 32 elements), but that would cost! In the "real" world (but we can add telemetry and report it, to see if I'm right or not!) the queue will be mostly empty, causing this contention to happen for real, slowing everyone down
  • there is a LOCK_FREE variant which uses a single atomic ref as the top of a linked-list stack: I'm not a BIG FAN of linked lists, but still, for the most practical cases it will deliver similar performance, with no Unsafe nor JCTools usage, and it's all jackson stuff. And although counterintuitive, the reason is what I've explained in the previous point: we fight for a single cache line in both cases. The only difference is that with JCTools releasing doesn't cause any competition to change the state, but it will still hit the contention on the cache line in the array, while the LOCK_FREE pool, instead, is always competing to update the state; that is the only relevant difference.

The striping, instead, can be made A LOT better, given what we said above about contention; in short:

we can use an AtomicReferenceArray whose used elements are 16 slots apart (or 32, to be safe? this is because of some Intel quirks with the HW prefetcher which can make false sharing happen across 2 cache lines, not just one), and the slots we actually use are padded to avoid false sharing while we update them (because of Java-heap noisy neighbours, or just JIT code which uses/reads the Java object header, which is likely on the same cache line as the value in the top element).
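
A sketch of that padded-striping idea (illustrative only: the class is made up, only the golden-ratio probe line is taken from the PR code, and a real implementation would likely need CAS loops and a smarter empty-stripe fallback):

import com.fasterxml.jackson.core.util.BufferRecycler;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Only one slot out of every 16 is used, so concurrently-updated slots land on
// different cache lines and don't false-share.
final class StripedRecyclerPoolSketch {
    private static final int PADDING = 16;       // 32 might be safer wrt HW prefetching, as noted above
    private final int mask;
    private final AtomicReferenceArray<BufferRecycler> slots;

    StripedRecyclerPoolSketch(int stripes) {     // stripes must be a power of 2
        this.mask = stripes - 1;
        this.slots = new AtomicReferenceArray<>(stripes * PADDING);
    }

    private int slotIndex() {
        // golden-ratio mix of the thread id picks the stripe (see the probe() discussion below)
        int probe = (int) ((Thread.currentThread().getId() * 0x9e3779b9) & Integer.MAX_VALUE);
        return (probe & mask) * PADDING;
    }

    BufferRecycler acquire() {
        BufferRecycler r = slots.getAndSet(slotIndex(), null);  // take the recycler parked in our stripe, if any
        return r != null ? r : new BufferRecycler();
    }

    void release(BufferRecycler recycler) {
        slots.set(slotIndex(), recycler);                        // park it back in our stripe (overwriting is fine for a cache)
    }
}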

Contributor

I'm not aware of Quarkus code directly using Unsafe

Directly, no, we do not use it, but JBoss Threads, which we do lean on, uses it and I don't see any GraalVM substitutions for it, so I assume it works in native

Contributor

no Unsafe nor JCTools usage, and it's all jackson stuff

All else being equal, this would definitely be preferable for Quarkus

Contributor Author

no Unsafe nor JCTools usage, and it's all jackson stuff

All else being equal, this would definitely be preferable for Quarkus

I agree, I discussed this yesterday with @franz1981 and we believe we can achieve similar performance without using Unsafe or JCTools. I will work on this in the next few days.

* Mix thread id with golden ratio and then xorshift it
* to spread consecutive ids (see Knuth multiplicative method as reference).
*/
int probe = (int) ((Thread.currentThread().getId() * 0x9e3779b9) & Integer.MAX_VALUE);
Contributor

what are the perf numbers if we use this? @mariofusco

@mariofusco
Contributor Author

As discussed, I developed a new pool, not depending on JCTools, that is a more sophisticated variation of my original lock-free pool. I compared this new pool with the JCTools-based one, with and without using Unsafe for the queue-striping algorithm, and obtained these results:

Benchmark                                          (objectSize)  (parallelTasks)           (poolStrategy)  (useVirtualThreads)   Mode  Cnt      Score     Error  Units
JacksonMultithreadWriteVanilla.writePojoMediaItem         small              100             THREAD_LOCAL                 true  thrpt   40   6306.450 ±  24.453  ops/s
JacksonMultithreadWriteVanilla.writePojoMediaItem         small              100             THREAD_LOCAL                false  thrpt   40  23023.395 ± 255.326  ops/s
JacksonMultithreadWriteVanilla.writePojoMediaItem         small              100           HYBRID_JCTOOLS                 true  thrpt   40  18951.275 ±  85.859  ops/s
JacksonMultithreadWriteVanilla.writePojoMediaItem         small              100           HYBRID_JCTOOLS                false  thrpt   40  22515.013 ±  88.088  ops/s
JacksonMultithreadWriteVanilla.writePojoMediaItem         small              100         HYBRID_LOCK_FREE                 true  thrpt   40  18880.090 ±  75.578  ops/s
JacksonMultithreadWriteVanilla.writePojoMediaItem         small              100         HYBRID_LOCK_FREE                false  thrpt   40  23043.384 ± 333.119  ops/s
JacksonMultithreadWriteVanilla.writePojoMediaItem         small              100    HYBRID_JCTOOLS_UNSAFE                 true  thrpt   40  18769.230 ±  57.271  ops/s
JacksonMultithreadWriteVanilla.writePojoMediaItem         small              100    HYBRID_JCTOOLS_UNSAFE                false  thrpt   40  23172.108 ± 231.973  ops/s
JacksonMultithreadWriteVanilla.writePojoMediaItem         small              100  HYBRID_LOCK_FREE_UNSAFE                 true  thrpt   40  18607.895 ±  83.783  ops/s
JacksonMultithreadWriteVanilla.writePojoMediaItem         small              100  HYBRID_LOCK_FREE_UNSAFE                false  thrpt   40  22735.674 ±  81.071  ops/s

In case you want to take a look, or try the benchmarks on your own, all the experiments I made with the various pool implementations are available here.

In essence, this new version of the pool, which neither depends on JCTools nor uses Unsafe, seems to be even a little bit faster, or at least not slower, than the old JCTools-based one (18880 ops/sec vs. 18769) when using virtual threads (with native threads they both fall back to the old ThreadLocal-based implementation, so they are identical), so with my last commit I updated this pull request with the new pool implementation.

/cc @cescoffier @geoand @franz1981 @theRealAph

Comment on lines 150 to 156
return virtualMh != null ? t -> {
    try {
        return (boolean) virtualMh.invokeExact(t);
    } catch (Throwable e) {
        throw new RuntimeException(e);
    }
} : t -> false;
Contributor

We really try hard to avoid using lambdas as much as possible in code that ends up on the runtime classpath

Contributor Author

Thanks for the reminder, fixed.

Contributor

🙏

}

private int probe() {
int probe = (int) ((Thread.currentThread().getId() * 0x9e3779b9) & Integer.MAX_VALUE);


Needs some comment about golden ratio hashing.

Contributor Author

Done.

@geoand
Contributor

geoand commented Nov 21, 2023

Jackson 2.16 has been released, so should this be moved forward?

@mariofusco
Contributor Author

@geoand as anticipated, I moved the pool to Vert.x (see eclipse-vertx/vert.x#4920) and already upgraded my PR to Jackson 2.16 there.

My suggestion is to merge the Vert.x pull request (and ship a 4.5.1 release of Vert.x containing it) asap, close this pull request, and create a new, smaller pull request for Quarkus reusing the same pool implementation that we have in Vert.x /cc @vietj @franz1981

@geoand
Contributor

geoand commented Nov 21, 2023

Great, thanks for the update!

@gsmet
Member

gsmet commented Nov 21, 2023

FWIW, the Jackson update is causing some CI issues: #37188. Not sure if it's a regression or something on our side; I didn't have the time to investigate (and won't anytime soon).

@geoand
Contributor

geoand commented Nov 21, 2023

I can try and take a look at the end of the week

@@ -63,6 +63,9 @@ public ObjectMapper objectMapper(@All List<ObjectMapperCustomizer> customizers,
for (ObjectMapperCustomizer customizer : sortedCustomizers) {
customizer.customize(objectMapper);
}

objectMapper.getFactory().setBufferRecyclerPool(HybridJacksonPool.INSTANCE);
Contributor

fabrii Nov 24, 2023

What if the user decides to use their own ObjectMapper, as described in https://quarkus.io/guides/resteasy#json, and forgets to set the correct buffer recycler pool? Maybe raise a warning? Or force the buffer recycler pool in some other way?
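
To make the concern concrete, this is the kind of user code in question (a sketch based on the CDI producer pattern from the Quarkus JSON guide; the class name is made up, HybridJacksonPool.INSTANCE is the constant used in the diff above, and its import is omitted since the final package is not settled here):

import com.fasterxml.jackson.databind.ObjectMapper;
import jakarta.enterprise.inject.Produces;
import jakarta.inject.Singleton;

// A user-supplied ObjectMapper producer completely replaces the one configured above,
// so the buffer recycler pool would have to be set again by hand — easy to forget.
public class CustomObjectMapperProducer {

    @Singleton
    @Produces
    ObjectMapper objectMapper() {
        ObjectMapper mapper = new ObjectMapper();
        // without this line the user silently falls back to Jackson's default ThreadLocal pool
        mapper.getFactory().setBufferRecyclerPool(HybridJacksonPool.INSTANCE);
        return mapper;
    }
}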

Contributor

That's a very good point, we need to see how it behaves

Contributor

indeed, good point!

@mariofusco
Contributor Author

Replaced by #38196

mariofusco closed this Feb 8, 2024
quarkus-bot added the triage/invalid label on Feb 8, 2024