CaffeineCache as backing map can have inconsistent data after eviction #136

Open

lyiu18 opened this issue Feb 6, 2025 · 9 comments

Labels: bug (Something isn't working)

@lyiu18 commented Feb 6, 2025

Using Caffeine as the backing map, we saw a possible inconsistency after evicting an entry from the cache.

The cache is set up as follows.

Instead of using Caffeine directly as shown in the documentation, I added it inside a read-write-backing-map, since I need to integrate a cache store:

<scheme-name>caffeine-distributed-scheme</scheme-name>

<backing-map-scheme>
  <read-write-backing-map-scheme>
    <internal-cache-scheme>
      <caffeine-scheme/>
    </internal-cache-scheme>
    <cachestore-scheme/>
  </read-write-backing-map-scheme>
</backing-map-scheme>

A dummy cache loader is used that throws an exception whenever it is accessed; it is there simply to make the problem visible.
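
For reference, the dummy loader is roughly of this shape (a minimal sketch, not the actual test class; the class name and message are illustrative):

    import com.tangosol.net.cache.CacheLoader;

    // Illustrative stand-in for the dummy loader described above: any
    // read-through fails loudly, so an unexpected load is immediately
    // visible in the test.
    public class FailingCacheLoader implements CacheLoader<Object, Object> {
        @Override
        public Object load(Object key) {
            throw new IllegalStateException("unexpected load for key " + key);
        }
    }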

Insert data into the cache normally.

Evict an entry right before issuing cache.clear.

The evict eventually reaches this line:

https://github.com/oracle/coherence/blob/main/prj/coherence-core-components/src/main/java/com/tangosol/coherence/component/util/daemon/queueProcessor/service/grid/partitionedService/partitionedCache/Storage.java#L8919

It seems CaffeineCache records the eviction but may not finish the operation synchronously.

The cache clear then hits this line:

https://github.com/oracle/coherence/blob/main/prj/coherence-core-components/src/main/java/com/tangosol/coherence/component/util/daemon/queueProcessor/service/grid/partitionedService/partitionedCache/Storage.java#L1937

which returns a key set that still includes the key that was just evicted.

The code that follows tries to remove the keys one by one, finds that the evicted key is already gone from the cache, and then tries to reload it from the cache loader, which triggers the exception from the dummy cache loader.
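
Roughly, the failing sequence looks like this (a minimal sketch, not the actual test code; the cache name and values are illustrative, and the synthetic remove shown here is the path described later in this thread):

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;

    public class EvictThenClearRepro {
        public static void main(String[] args) {
            NamedCache<String, String> cache = CacheFactory.getCache("caffeine-test");
            cache.put("k1", "v1");

            // Synthetic remove: ends up calling evict on the Caffeine backing map
            cache.invoke("k1", entry -> {
                entry.remove(true);
                return null;
            });

            // The clear still sees the evicted key, triggers a read-through,
            // and surfaces the dummy loader's exception
            cache.clear();
        }
    }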

Is this intended behavior? The main concern is whether there are more cases where this inconsistency will show up and cause further trouble.

Environment:

  • Coherence: 14.1.2.0.1
  • Java: 21 (Zulu)
  • OS: Linux (Red Hat 7)
  • Container/cloud environment: no
lyiu18 added the bug label on Feb 6, 2025
@ben-manes (Contributor)

I am not familiar with Coherence's code, but you might be describing this scenario in our Guava migration guide.

Invalidation with concurrent computations

Guava will ignore entries during invalidation while they are still being computed. In Caffeine each entry being invalidated will block the caller until it has finished computing and will then be removed. However, computing entries may be skipped by an invalidateAll() due to those keys being suppressed by the underlying hash table. If an asynchronous cache is used instead then invalidation is non-blocking as the incomplete future will be removed and the removal notification delegated to the computing thread.

ConcurrentHashMap does not expose an iterator that includes the in-flight computeIfAbsent keys. When a Map.clear() is invoked, we have to iterate over the entries in order to both remove the mapping and update associated metadata, such as removal from the LRU list and expiration policy. As these in-flight keys are suppressed we cannot make this linearizable. If I recall correctly, internally the hash table does make it linearizable on its clear() by blocking on those computes, since it must lock the shared hashbin, but we cannot use that since we also need to know the removed entries.

Caffeine cannot directly resolve this without internal access to the hash table, which would require maintaining patches to replace internal apis (like jdk.internal.misc.Unsafe) and carefully track future changes. The alternative is to manage this externally, such as by using a generational id as part of the key so that previous data is not accessible under the new namespace.
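
As a rough illustration of that last idea (a sketch only; all names are illustrative, and this is not something the Coherence adapter does today): "clearing" bumps a generation counter so entries written under an older generation can never be observed again, and they simply age out of the cache instead of requiring a linearizable invalidateAll().

    import java.util.concurrent.atomic.AtomicLong;

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    final class GenerationalCache<K, V> {
        private record GenKey<K>(long generation, K key) {}

        private final AtomicLong generation = new AtomicLong();
        private final Cache<GenKey<K>, V> cache = Caffeine.newBuilder()
            .maximumSize(10_000)
            .build();

        V getIfPresent(K key) {
            return cache.getIfPresent(new GenKey<>(generation.get(), key));
        }

        void put(K key, V value) {
            cache.put(new GenKey<>(generation.get(), key), value);
        }

        void clear() {
            generation.incrementAndGet(); // old generation becomes unreachable
        }
    }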

@lyiu18 (Author) commented Feb 7, 2025

Thanks, Ben.

I don't think this is a big problem for Caffeine itself, or even when using Caffeine as a front/local cache. However, when it is used as a backing map there are two layers holding the entries, and they should contain the same entries. Since evict is not a synchronous operation in Caffeine, I am hoping this could somehow be handled in the Coherence code, so that the change is reflected in the Coherence layer synchronously, or in some other way.

I haven't read all the code, so these are just some thoughts.

@ben-manes (Contributor)

@aseovic helped me quite a bit, but admittedly I didn't really know what I was doing when writing the adapter. The evictionListener is called synchronously within eviction, as part of the ConcurrentHashMap.compute call that removes the entry, to notify the attached MapListener, which appears to include the FrontMapListener to (probably) keep them consistent. So while an eviction is not synchronous with the writer, the custom logic can be performed within the atomic scope of the per-entry eviction step. However, which thread performs eviction is not specified, as policy updates are appended to a buffer and replayed under a try-lock, which allows writers to avoid waiting for each other after the map operation (similar to a database's write-ahead transaction log). This can become asynchronous if the eviction listener is expensive, but since the cache's own listener is not, I set it to use the calling thread to avoid any issues with sharing the ForkJoinPool commonPool.

I tried to write tests, so maybe someone can reproduce the problem in isolation. Even though the interfaces provide good separation, there might be some implicit implementation detail that depends on the older LocalCache's behavior.

@aseovic (Member) commented Feb 7, 2025

@lyiu18 This one is tricky...

LocalCache.evict is a blocking call, so by the time it returns the entry is guaranteed to be removed from the backing map. In Caffeine, as @ben-manes pointed out, it is not. The ConfigurableCacheMap interface doesn't define what the behavior should be, but I believe the expectation is that after evict returns the entry is no longer in the cache and is not returned by either the key set or the entry set iterator. This would require either Caffeine or the CaffeineCache backing map implementation @ben-manes wrote to keep track of evicted keys/entries and exclude them from subsequent iteration (or other operations), which is likely too expensive and would eliminate or reduce some of the benefits of using Caffeine.

What I don't understand, though, is why you are evicting individual entries and then calling clear, especially within a cache store, and what exactly you mean by "two layers holding the entries" -- the entries are only "held" in one place, the backing map itself (Caffeine or LocalCache), which ReadWriteBackingMap wraps.

That said, the only reason to use evict instead of remove, which I believe would work fine with Caffeine, is to avoid deleting the entry from the underlying cache store/database. If that's what you are trying to do, there may be another option, which is to perform a synthetic remove using an entry processor:

    /**
     * Remove this Entry from the Map if it is present in the Map.
     * <p>
     * This method supports both the operation corresponding to {@link
     * Map#remove} as well as synthetic operations such as eviction. If the
     * containing Map does not differentiate between the two, then this
     * method will always be identical to <tt>InvocableMap.this.remove(getKey())</tt>.
     * <p>
     * As of Coherence 12.1.2, if <tt>fSynthetic</tt> is true and the
     * {@link #getBackingMap BackingMap} associated with this entry is a
     * ReadWriteBackingMap, this method will bypass the
     * {@link com.tangosol.net.cache.CacheStore CacheStore} or
     * {@link com.tangosol.net.cache.BinaryEntryStore BinaryEntryStore}.
     *
     * @param fSynthetic pass true only if the removal from the Map should
     *                   be treated as a synthetic event
     */
    void remove(boolean fSynthetic);

Basically, calling

cache.invoke(key, entry -> entry.remove(true));

will remove the entry from both CaffeineCache and LocalCache synchronously, but the remove call will not be propagated to the cache store, just as an evict wouldn't be.
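
If remote lambdas are not an option in your setup, the same synthetic remove can be packaged as a standalone entry processor (a minimal sketch; the class name is illustrative and plain Java serialization is assumed):

    import com.tangosol.util.InvocableMap;

    // Removes the entry synthetically so the ReadWriteBackingMap does not
    // forward the deletion to the CacheStore/BinaryEntryStore.
    public class SyntheticRemoveProcessor<K, V>
            implements InvocableMap.EntryProcessor<K, V, Void> {
        @Override
        public Void process(InvocableMap.Entry<K, V> entry) {
            entry.remove(/* fSynthetic */ true);
            return null;
        }
    }

and invoked as cache.invoke(key, new SyntheticRemoveProcessor<>());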

Does that address your concern, @lyiu18?

@ben-manes (Contributor)

The ConfigurableCacheMap interface doesn't define what the behavior should be, but I believe the expectation is that after evict returns the entry is no longer in the cache and is not returned by either the key set or the entry set iterator.

I wonder if something like this would be worth changing to?

public void evict(Object oKey)
    {
    f_cache.asMap().computeIfPresent(oKey, (k, oValueOld) ->
        {
        notifyEviction(k, oValueOld, RemovalCause.EXPIRED);
        return null;
        });
    }

Currently it simply sets the entry's TTL to zero, which implicitly forces it to be expired and eventually evicted. For whatever reason, I didn't consider setExpireAfter(key, duration) being used for immediate expiration, so it is handled as a lazy reordering, whereas the above forces explicit behavior. The key/entry set iterator should check before returning an expired entry; however, it does so based on the Ticker. That is typically System.nanoTime(), which is monotonic, but Coherence's is based on System.currentTimeMillis(), which is not (though it tries to mitigate this). That might account for some visibility, but it is hard to tell without a reproducer.

@lyiu18 (Author) commented Feb 7, 2025

To Aleks's question:

What I meant by two layers is that some data is held in PartitionedCache/ReadWriteBackingMap directly.

In fact, when clear is called, the code eventually calls the PartitionedCache#Storage method public Set collectKeySet(PartitionSet partMask, boolean fSnapshot).

This method uses the PartitionedKeyIndex instead of the internal map (ReadWriteBackingMap/Caffeine map).

So the data differs; I assume the index map only gets updated when the Caffeine map actually performs the eviction and runs the callbacks.

If it called the Caffeine map directly, I assume it would be fine. I didn't test thoroughly, but when my debug window displays all the entries in the Caffeine map, the evicted key is not there (though I am not sure whether that is a timing issue or not).

As for using
cache.invoke(key, entry -> entry.remove(true));

that is actually exactly what I used. It eventually goes to the PartitionedCache#Storage method removePrimaryResource, which calls evict on the internal map, and that is where the problem starts to show.

To Ben's question:

I was thinking about the same change, but I am not familiar enough with the code to see how it should be changed, or whether it would make PartitionedCache (and its index, etc.) update quickly, but it may be worth a try.

Some background on how we found this problem:

It was actually by accident. This is one of our test cases, which is why we used a dummy cache loader that throws an exception (with a real cache loader you would probably see nothing, but there would be an extra/wasted cache-loading operation), and the clear function is used to clear out the cache in preparation for the next unit test.

My main concern is not just this case; it is mainly how we can make sure the data stays consistent across these different layers (PartitionedCache and its internal CaffeineCache), and whether this will cause more problems than just this one.

It seems the evict function schedules the work to be completed asynchronously; are there any other functions with similar effects?

Or is there some function in PartitionedCache that needs to be modified, etc.?

@ben-manes (Contributor) commented Feb 7, 2025

It seems the evict function schedules the work to be completed asynchronously; are there any other functions with similar effects?

In Caffeine's eviction, yes. It tries to scale concurrent, independent writes rather than perform them in lock-step, which is especially bad for computations that may call a user's cache loader. Therefore we split the hash table and the policy to operate under different locks and synchronize the two using a ring buffer, so that after the hash-table update (the external view) writers can schedule their changes against the policy (the internal view), and the two asynchronously become consistent. If lock-step behavior is required, then the adapter should (a) use its own lock for all writes and (b) call cleanUp() after the modification. I hope that isn't necessary, and that the atomic operations like computations and the eviction listener can be used instead.
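
As a standalone illustration of that split (plain Caffeine, nothing Coherence-specific; the class name is made up), the map view can briefly run ahead of the size policy until the pending work is drained:

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    public class PendingEvictionDemo {
        public static void main(String[] args) {
            // Size-based eviction is applied by replaying buffered policy
            // updates, so the cache may temporarily hold more than
            // maximumSize entries.
            Cache<Integer, Integer> cache = Caffeine.newBuilder()
                .maximumSize(10)
                .build();

            for (int i = 0; i < 1_000; i++) {
                cache.put(i, i);
            }
            System.out.println("before cleanUp: " + cache.estimatedSize()); // may exceed 10
            cache.cleanUp(); // drain the pending eviction work
            System.out.println("after cleanUp:  " + cache.estimatedSize()); // back under the maximum
        }
    }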

The proposed change to Coherence's evict(key) completes synchronously by treating it as an explicit removal from Caffeine's perspective. It bypasses the evictionListener (and the async removalListener is not used) and forwards the notification to Coherence's listeners as if the entry had expired. This is similar to how the other map write methods are implemented. I'd be interested in whether that modification to CaffeineCache passes your unit tests.

@lyiu18 (Author) commented Feb 7, 2025

Yes, with that change to the notifyEviction call there, it does pass the tests I have.

@aseovic (Member) commented Feb 7, 2025

As for using cache.invoke(key, entry -> entry.remove(true));

that is actually exactly what I used. It eventually goes to the PartitionedCache#Storage method removePrimaryResource, which calls evict on the internal map, and that is where the problem starts to show.

I see... Didn't realize it eventually ends up calling evict :-(

That said, Ben's proposed change seems reasonable, and if it passes your tests we should be good to go.

We'll discuss it internally tomorrow and assign someone to make the change and backport it to all the code lines that have Caffeine support. Thanks, @ben-manes, for figuring out the solution, and @lyiu18 for reporting it.
