
Nikodemus Siivola


Method Cache Hacking
hacking, May 1st 2007

PCL uses a cache that maps classes to methods (or slot indexes, constant return values, etc.) to optimize method dispatch. This cache is basically an open hash-table specialized to be keyed by lists of classes (class wrappers really, but never mind that.)

The original implementation made some provisions for thread safety: each time the cache was written to, a counter was incremented, and if the counter changed between the start and end of a read, the read was treated as a cache miss. I won't comment on this in more depth, as by the time SBCL diverged from CMUCL the bit rot had already set in.

So, how else do you implement a thread-safe open hash? A plain lock is out of the question, as the cache is quite important to overall CLOS performance, and locking it would serialize all calls to generic functions with an underlying cache. A read/write lock combination might be more feasible, but since the cache is filled lazily this would make the warm-up significantly more expensive.

This is 2007, multi-cores are here, and we have compare and swap. So why not implement a lockless cache using that? Happily enough, it turns out to be pretty easy. The cache vector layout looks something like this:

... wrap1 wrap2 value1 nil nil nil wrap3 wrap4 value2 ...

There you see value1 keyed by wrap1 and wrap2, an empty cache line (the three nil values), and value2 keyed by wrap3 and wrap4.
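To make that concrete, here's a minimal sketch of the layout arithmetic in Common Lisp. This is not the actual PCL code: WRAPPER-HASH below is a made-up stand-in for the real wrapper hashing, and the number of keys per line is hard-coded to two, as in the example.

  ;; Sketch only: two wrappers per line, plus the value slot.
  (defconstant +nkeys+ 2)
  (defconstant +line-size+ (1+ +nkeys+))

  (defun wrapper-hash (wrapper)
    ;; Stand-in hash; not how the real wrappers are hashed.
    (sxhash wrapper))

  (defun probe-start (wrappers length)
    ;; Hash the wrappers together and align the result to the
    ;; start of a cache line.
    (let ((hash 0))
      (dolist (w wrappers)
        (setf hash (logxor hash (wrapper-hash w))))
      (* +line-size+ (mod hash (floor length +line-size+)))))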

Reads work front-to-back: the wrappers are hashed, and keys are matched starting from the hashed index. A nil where a wrapper is expected indicates a miss. If there is a wrapper mismatch, the next cache line is checked, and so forth.
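Continuing the sketch (same disclaimers as above), a read looks something like this, with nil doubling as the miss marker:

  (defun cache-read (cache wrappers)
    ;; Probe front-to-back: a NIL where a wrapper should be means
    ;; the line was never filled, so the probe ends in a miss; a
    ;; wrapper mismatch sends us on to the next line.
    (let* ((length (length cache))
           (lines (floor length +line-size+))
           (index (probe-start wrappers length)))
      (dotimes (probe lines)
        (let ((hit t))
          (loop for i from 0 below +nkeys+
                for w in wrappers
                do (let ((stored (svref cache (+ index i))))
                     (cond ((null stored)
                            (return-from cache-read nil))
                           ((not (eq stored w))
                            (setf hit nil)))))
          (when hit
            ;; The value slot may still be NIL -- an insert in
            ;; progress, or a lazy deletion -- and that also reads
            ;; as a miss.
            (return-from cache-read (svref cache (+ index +nkeys+)))))
        (setf index (mod (+ index +line-size+) length)))
      nil))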

Inserts work back-to-front: the wrappers are hashed, and the first cache line starting from that index whose last wrapper place is free is written to from back to front. Using compare and swap to write the last wrapper ensures atomicity versus other writers, and the value can be smashed in once all the wrappers have been written; reads never see an incomplete series of wrappers, and during the small window when the wrappers are in place but the value is not, reads just miss as if the value had been deleted. By being careful we can also deal with simultaneous inserts using the same keys, which may involve busy-waiting, but that should be extremely rare.
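Here's the insert in the same sketch, hand-waving the memory ordering details and the simultaneous same-key case, and assuming a compare-and-swap that works on SVREF places (current SBCL spells this SB-EXT:COMPARE-AND-SWAP; the primitive available in 2007 may be spelled differently):

  (defun cache-insert (cache wrappers value)
    ;; Find a line whose *last* wrapper slot is free and claim it
    ;; by CASing that slot. Readers match front-to-back, so they
    ;; can never mistake a half-written line for a hit.
    (let* ((length (length cache))
           (lines (floor length +line-size+))
           (index (probe-start wrappers length))
           (last-key (1- +nkeys+)))
      (dotimes (probe lines)
        (when (null (sb-ext:compare-and-swap
                     (svref cache (+ index last-key))
                     nil
                     (nth last-key wrappers)))
          ;; The line is ours: fill in the remaining wrappers
          ;; back-to-front, then smash the value in last.
          (loop for i from (1- last-key) downto 0
                do (setf (svref cache (+ index i)) (nth i wrappers)))
          (setf (svref cache (+ index +nkeys+)) value)
          (return-from cache-insert t))
        (setf index (mod (+ index +line-size+) length)))
      nil)) ; no free line found: time to expand the cache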

Modifications work front-to-back, like reads: once the right cache line has been found, the new value can be smashed into place.

Deletions are lazy: instead of deleting the wrappers, nil is simply assigned to the value slot, and the wrappers are omitted the next time the cache is copied (typically when it is next expanded.)
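In the sketch, both of these reduce to a single store into the value slot, which is why neither needs anything special:

  (defun cache-modify (cache index value)
    ;; INDEX is the start of a line located by a front-to-back probe.
    (setf (svref cache (+ index +nkeys+)) value))

  (defun cache-delete (cache index)
    ;; Lazy deletion: clear only the value slot; the wrappers stay
    ;; put until the cache is next copied.
    (setf (svref cache (+ index +nkeys+)) nil))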

Since inserts are the only operation that (potentially) needs to modify multiple elements of the cache vector, we need to disable interrupts for the critical section there, but once that is done the cache is interrupt safe.
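In terms of the sketch, and assuming SBCL's SB-SYS:WITHOUT-INTERRUPTS, that means wrapping just the insert:

  (defun cache-insert-protected (cache wrappers value)
    ;; An interrupt handler that never returns (a throw, say) between
    ;; claiming a line and filling it would leave the line half-written
    ;; forever, so the multi-store insert runs with interrupts disabled.
    (sb-sys:without-interrupts
      (cache-insert cache wrappers value)))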

This simplifies things a bit, and leaves aside some low-level cleverness already present in the original cache implementation, but aside from that this is all there is to it. So, how does this perform? If my benchmarks are correct, as well as or better than the cache currently used by SBCL.

Coming to SBCL CVS near you soon.