Design Doc: Cache invalidation of URL patterns
Srihari Sukumaran
last updated - July 19, 2012
Allow users (webmasters) to invalidate cache entries for individual URLs or URL patterns.
The current cache invalidation feature allows a webmaster to invalidate all URLs of their site at once. It invalidates every item cached prior to a timestamp (corresponding to the time of the click). This does not give the webmaster the flexibility to invalidate only specific URLs; e.g., a webmaster who updates one resource will want to invalidate only the cache entries corresponding to that resource. With the current feature, the webmaster must either invalidate all URLs (thus increasing latency for loading unaffected resources) or accept stale content.
We extend the idea behind the current implementation of "all cache items for a domain" invalidation: invalidation is done at the time of cache lookup. Each URL pattern to be invalidated and the timestamp at which its invalidation was requested are written to config (first the storage config, then rewrite options). The current approach for the HTTP cache, checking against these when the actual cache lookup is performed (specifically, in the callback), can be applied here as well. However, the current approach for the metadata and property caches, which includes the cache invalidation timestamp in the rewrite options signature (and hence in the metadata and property cache keys), is not necessarily what we want for URL cache invalidation. Hence we plan to support two URL cache invalidation modes:
- 'Strict' URL cache invalidation: Here we will not include the URL patterns and timestamps in the rewrite options signature. Thus the metadata cache will not be invalidated at all. For the property cache we will explicitly use the URL patterns and timestamps, either as for HTTPCache or by including the timestamps (of all matching patterns) in the pcache key. This 'strict' invalidation is suitable for HTML, and not for resources.
- 'Reference' URL cache invalidation: Here we will include the URL patterns and timestamps in the rewrite options signature, thus invalidating all metadata and pcache values for the domain. This is suitable for resources, where it makes sense to also invalidate all potential 'references' to the resource being invalidated (e.g., all rewriting metadata, and HTML cached in blink, since it could contain references to the rewritten resource).
Pros:
- No major changes in the core rewriter code -- basically adding logic to existing functions.
- Allows URL patterns for invalidation.
Cons:
- Slightly more code complexity in webmaster interface.
The StorageConfig proto will have the following related to cache invalidation:
message StorageConfig {
  ...
  // Existing field for invalidating ALL URLs:
  optional int64 cache_invalidation_timestamp;
  // This list will be in increasing order of
  // URLCacheInvalidationItem::timestamp. This should happen naturally.
  repeated URLCacheInvalidationItem url_cache_invalidation_items;
  ...
}
message URLCacheInvalidationItem {
  int64 timestamp;
  string url_pattern;
  bool strict;  // default true
}
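To make the ordering requirement concrete, here is a minimal Python sketch that mirrors the two messages above as in-memory records. It is only an illustration: the class names and the `add_invalidation` helper are hypothetical, and rejecting out-of-order items is an assumption made here (the design only says the ordering should hold naturally).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class UrlCacheInvalidationItem:
    """Illustrative mirror of the URLCacheInvalidationItem message."""
    timestamp: int
    url_pattern: str
    strict: bool = True  # mirrors the "default true" note in the proto


@dataclass
class StorageConfig:
    """Illustrative mirror of the cache-invalidation part of StorageConfig."""
    cache_invalidation_timestamp: int = 0
    url_cache_invalidation_items: List[UrlCacheInvalidationItem] = field(
        default_factory=list)

    def add_invalidation(self, item: UrlCacheInvalidationItem) -> None:
        # Hypothetical helper: keep the list in increasing timestamp order by
        # only ever appending, and fail loudly if that would break the order.
        items = self.url_cache_invalidation_items
        if items and item.timestamp < items[-1].timestamp:
            raise ValueError("invalidation items must arrive in timestamp order")
        items.append(item)
```

Appending in request order keeps the list sorted without any extra work, which is what lets the lookup-time check stop early.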
We plan to modify the current console UI as little as possible to support URL cache invalidation.
The ‘Flush Domain’ button should be replaced by two buttons: ‘Flush Html’ and ‘Flush all related resources’.
The user can enter a URL pattern and click either of the buttons. The pattern will then be used to invalidate caches for all the domains in the project.
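A rough Python sketch of what a click could do on the backend is shown below. It assumes that ‘Flush Html’ corresponds to a strict item and ‘Flush all related resources’ to a reference (non-strict) item; that mapping is inferred from the two modes described earlier and is not stated explicitly. The function name, the per-domain dictionary, and the caller-supplied clock value are hypothetical.

```python
from typing import Dict, List, NamedTuple


class InvalidationItem(NamedTuple):
    timestamp: int
    url_pattern: str
    strict: bool


# Assumed mapping from button label to invalidation mode.
BUTTON_TO_STRICT = {
    "Flush Html": True,
    "Flush all related resources": False,
}


def on_flush_click(button: str,
                   url_pattern: str,
                   domain_items: Dict[str, List[InvalidationItem]],
                   now: int) -> InvalidationItem:
    """Record one invalidation item for every domain in the project."""
    item = InvalidationItem(now, url_pattern, BUTTON_TO_STRICT[button])
    for items in domain_items.values():
        # Appending in click order keeps each list sorted by timestamp,
        # provided `now` never goes backwards.
        items.append(item)
    return item
```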
RewriteOptions should be modified to contain a field corresponding to url_cache_invalidation_items, which will be populated from storage config.
For invalidating individual URLs, it is even more important that the latency in propagating the storage config to the different rewrite_proxy instances is minimal. The pubsub channel should take care of this.
The current HTTP cache invalidation is based on the (pure virtual) function HTTPCache::Callback::IsCacheValid(). Whenever HTTPCache retrieves an item from the cache, it checks the item's validity (among other things) by invoking this function, which is provided by the client’s callback. Every cache get requires the client to create and pass in an HTTPCache::Callback object. These objects belong to subclasses of HTTPCache::Callback that have access to RewriteOptions; hence, in IsCacheValid, cache_invalidation_timestamp can be compared to the date field of the response header of the item retrieved from the cache to decide validity.
In addition to comparing against response headers date field, IsCacheValid will need to perform an equivalent of the following:
valid = true;
for each item in url_cache_invalidation_items, traversed backwards:
  if (item’s timestamp < date field of response header of cache entry)
    break;  // all remaining items have an even smaller timestamp
  if (url of cache entry matches item’s url_pattern)
    valid = false; break;
end
The above check runs in the line of the request, i.e., on every cache lookup.
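The check can be sketched as runnable Python as follows. This is a sketch, not the planned C++ implementation. It assumes URL patterns are shell-style wildcards and uses the standard library's `fnmatchcase` for matching; the actual pattern syntax is not specified in this document. The function name and the record type are illustrative.

```python
from fnmatch import fnmatchcase
from typing import NamedTuple, Sequence


class InvalidationItem(NamedTuple):
    timestamp: int
    url_pattern: str


def is_url_cache_valid(url: str,
                       entry_date: int,
                       items: Sequence[InvalidationItem]) -> bool:
    """Return False if any invalidation item covers this cache entry.

    `items` must be sorted by increasing timestamp. We walk it newest-first
    and stop at the first item that is older than the cache entry, because
    every remaining item is older still and cannot invalidate the entry.
    """
    for item in reversed(items):
        if item.timestamp < entry_date:
            break
        # Assumption: patterns are shell-style wildcards ('*', '?').
        if fnmatchcase(url, item.url_pattern):
            return False
    return True
```

Because the list is sorted, an entry that was cached recently only needs to be compared against the few items newer than it, which limits the cost added to each lookup.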
Non-strict (reference) invalidation items will be included in rewrite options signature, thus invalidating the metadata cache. Strict invalidation items have no effect on metadata cache.
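The effect on the signature can be illustrated with a small Python sketch. The hashing scheme, the `base_signature` argument, and the function name are all hypothetical and chosen only for illustration; the real rewrite options signature is computed differently. The point is only that reference items feed into the signature and strict items do not.

```python
import hashlib
from typing import NamedTuple, Sequence


class InvalidationItem(NamedTuple):
    timestamp: int
    url_pattern: str
    strict: bool


def options_signature(base_signature: str,
                      items: Sequence[InvalidationItem]) -> str:
    """Fold only non-strict (reference) items into the signature."""
    digest = hashlib.sha256(base_signature.encode("utf-8"))
    for item in items:
        if item.strict:
            continue  # strict items must not change metadata cache keys
        digest.update(f"|{item.timestamp}:{item.url_pattern}".encode("utf-8"))
    return digest.hexdigest()[:16]
```

Since cache keys include the signature, adding a reference item changes every key derived from it, so old entries are simply never looked up again. Adding a strict item leaves the keys unchanged.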
Non-strict (reference) invalidation items will be included in rewrite options signature, thus invalidating the property cache.
Strict invalidation items will be handled as follows:
Pass RewriteOptions::url_cache_invalidation_items into PropertyPage constructor (in ProxyInterface::InitiatePropertyCacheLookup).
Then, in PropertyCache::CacheInterfaceCallback::Done(), we can compare pcache_value->write_timestamp_ms() against the page->url_cache_invalidation_items entries to determine whether pcache_value is invalidated. If it is, do not add it to the page (i.e., do not call page_->AddValueFromProtobuf for it). Either all or none of the values in a cohort should be invalidated (since the values in a cohort are all written together), so when a cohort is invalidated we can pass false to the collector’s Done method.
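The cohort-level behaviour described above can be sketched in Python as follows. The names are illustrative stand-ins for the C++ classes, shell-style wildcard matching via `fnmatchcase` is an assumption, and the returned boolean stands in for the flag passed to the collector's Done method.

```python
from fnmatch import fnmatchcase
from typing import List, NamedTuple, Sequence, Tuple


class InvalidationItem(NamedTuple):
    timestamp: int
    url_pattern: str


class PropertyValue(NamedTuple):
    name: str
    write_timestamp_ms: int


def is_invalidated(url: str, written_at: int,
                   items: Sequence[InvalidationItem]) -> bool:
    # Same newest-first scan as the HTTP cache check; items are sorted
    # by increasing timestamp.
    for item in reversed(items):
        if item.timestamp < written_at:
            return False
        if fnmatchcase(url, item.url_pattern):
            return True
    return False


def load_cohort(page_url: str,
                cohort_values: Sequence[PropertyValue],
                items: Sequence[InvalidationItem],
                ) -> Tuple[List[PropertyValue], bool]:
    """Return (values to add to the page, success flag for the collector)."""
    for value in cohort_values:
        if is_invalidated(page_url, value.write_timestamp_ms, items):
            # Values in a cohort are written together, so treat the whole
            # cohort as a miss rather than keeping a partial set.
            return [], False
    return list(cohort_values), True
```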
When the webmaster enters a URL to invalidate, it is propagated to all server instances.
A URL entered for invalidation need not be persisted; in fact, it would be awkward to do so.
We need a URL invalidator class that is a consumer of invalidation URL publications. On receiving an update (a URL) it should invoke Delete on all caches -- http_cache() and metadata_cache().
Pros:
- Simple to implement.
- Simple UI.
- No invalidation checks in line of request.
Cons:
- Implementing the URL invalidator class is very tricky:
  - How to figure out which metadata keys to call Delete on?
  - Will this lead to redundant Delete RPCs to remote cache servers?
- Supporting URL patterns is hard (if not impossible)
We need a service, e.g., InvalidateCacheUrl, whose request proto contains the URL to be invalidated and the domain. The devconsole backend server, on receipt of an InvalidateCacheUrl request, will publish the URL and domain on the pubsub channel for URL invalidation.
There should be a URLInvalidator class that subscribes to the channel. This class requires RewriteOptions, http_cache, metadata_cache and PssUrlNamer. It has to run as a "background task" that wakes up when an update is received on the pubsub channel. When it receives an update, it has to explicitly invoke Delete on all the caches. This will involve two Deletes on the http_cache, one for the URL and the other for its rewritten version, and multiple Deletes on the metadata cache. But it is not clear how it will synthesize all the cache keys to be invalidated.
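To show where the difficulty lies, here is a minimal Python sketch of such an invalidator, with plain dictionaries standing in for the caches. Everything in it is hypothetical. In particular, `rewritten_url_for` stands in for whatever PssUrlNamer would provide, and `metadata_keys_for` is exactly the piece this design does not know how to write; the sketch simply takes it as a caller-supplied function.

```python
from typing import Callable, Dict, Iterable


class UrlInvalidator:
    """Deletes cache entries for a URL when an invalidation is published."""

    def __init__(self,
                 http_cache: Dict[str, bytes],
                 metadata_cache: Dict[str, bytes],
                 rewritten_url_for: Callable[[str], str],
                 metadata_keys_for: Callable[[str], Iterable[str]]) -> None:
        self._http_cache = http_cache
        self._metadata_cache = metadata_cache
        self._rewritten_url_for = rewritten_url_for
        # Open problem in the design: how to enumerate these keys at all.
        self._metadata_keys_for = metadata_keys_for

    def on_update(self, url: str) -> int:
        """Handle one published URL; return how many entries were deleted."""
        deleted = 0
        # Two deletes on the HTTP cache: the URL and its rewritten version.
        for key in (url, self._rewritten_url_for(url)):
            if self._http_cache.pop(key, None) is not None:
                deleted += 1
        # Multiple deletes on the metadata cache.
        for key in self._metadata_keys_for(url):
            if self._metadata_cache.pop(key, None) is not None:
                deleted += 1
        return deleted
```

Note that the sketch takes a single exact URL. Supporting patterns would require enumerating cache keys, which key-value caches generally do not offer.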
It is not even clear if this alternate design is feasible at all.