
Nikodemus Siivola


Is That A Rest-List In Your Pocket?
hacking, September 23rd 2012

Prelude

SBCL has for a while now been able to elide &REST list allocation when it is only used as an argument to APPLY, so

(defun foo (&rest args) (apply #'bar args))

is non-consing if BAR is. Note: I'm not saying eliding heap allocation, I'm saying eliding list allocation completely: instead of moving the arguments into a stack- or heap-allocated list and then pulling them out again later, the compiler generates code that passes them directly as arguments to BAR.

If all you look at is heap consing, this is no better than stack allocation, but it saves a noticeable amount of work at runtime, and it doesn't break tail calls the way stack allocation does.
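
A quick way to see this in action (BAR, FOO, and DRIVER are just illustrative names here): SBCL's TIME reports bytes consed, so a fully elided rest list shows up as (close to) zero allocation.

(defun bar (a b c)
  (+ a b c))

;; The rest list only flows into APPLY, so no list is ever built.
(defun foo (&rest args)
  (apply #'bar args))

(defun driver ()
  (dotimes (i 1000000)
    (foo 1 2 3)))

(time (driver)) ; look at the "bytes consed" line in the output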

That's as far as it went, however: if you did anything else with the rest list, the compiler gave up and allocated the full list, on the stack if you asked for that using DYNAMIC-EXTENT.
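
For comparison, asking for stack allocation explicitly looks like this (SUM is a made-up example):

(defun sum (&rest numbers)
  ;; Promise the compiler that NUMBERS doesn't escape, so the
  ;; rest list can be allocated on the stack.
  (declare (dynamic-extent numbers))
  (reduce #'+ numbers))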

First Act

Earlier this week Nathan Froyd (an SBCL hacker, of Ironclad fame) committed a change that fixed a rather embarrassing oversight: we were heap-allocating the rest lists in full calls to the vararg entry points of arithmetic functions like + and *.

This is less catastrophic for most code than you might imagine, since SBCL works pretty hard to call more efficient entry points, so those calls are virtually never seen in performance-sensitive code.

Doesn't make it any less embarrassing, though. Years and years it's been like that, until Nathan noticed.

Nathan fixed the heap consing by adding DYNAMIC-EXTENT declarations to the functions involved, which not only reduced GC pressure a bit, but also provided a small performance boost.
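
I'm paraphrasing here, but the shape of the fix was roughly this (MY-+ is a stand-in, not the actual SBCL definition):

(defun my-+ (&rest numbers)
  ;; Stack-allocate the rest list instead of heap-allocating it.
  (declare (dynamic-extent numbers))
  (let ((result 0))
    (dolist (n numbers result)
      (setf result (+ result n)))))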

Second Act

Adding those DYNAMIC-EXTENT declarations had another side effect as well: a couple of backtrace tests broke due to unexpected frames, the stack allocation having foiled tail calls. Several tests used division by zero to trigger an error, so the arithmetic changes showed up there.

That would have been a fair tradeoff, and the backtrace tests could just have been adjusted to allow the extra frames, but we could do a bit better.

SBCL has an extra (internal, unsupported) lambda-list keyword, SB-INT:&MORE, which is a good deal hairier to use than &REST, but allows dealing with variable arguments without any consing, heap or stack. So those arithmetic functions were changed to use SB-INT:&MORE instead, which fixed the backtrace tests and gave another small performance boost.

Third Act

I was looking at the SB-INT:&MORE changes and wondering whether we should expose it to users as well, since it is obviously useful occasionally, and what kind of interface cleanup that would entail.

Thinking about that, I realized that I could instead just extend the compiler smarts for dealing with &REST. Under the hood, when SBCL optimizes an APPLY with a rest list as the final argument, it already converts it to use &MORE.

So, I extended that part of the compiler to deal with (for starters) a few list functions that would be sufficient for implementing the arithmetic functions using rest lists and compiler magic.

The conversion path looks roughly like this:

;;; Original source, using LENGTH as an example
(defun foo (&rest args)
  (length args))

;;; Compiler adds hidden &MORE arguments when it sees the &REST.
(lambda (&rest args &more #:context #:count)
  (length args))

;;; A source-level transformation notices that LENGTH is applied to a
;;; &REST argument and transforms it into %REST-LENGTH.
(lambda (&rest args &more #:context #:count)
  (sb-c::%rest-length args #:context #:count))

;;; During optimization another transformation sees the %REST-LENGTH,
;;; and verifies that the rest list is never modified, or used in
;;; any place that would require actually allocating it — this being
;;; the case, it proceeds.
(lambda (&rest args &more #:context #:count)
  #:count)

;;; Since the rest list isn't used anymore, it is deleted.
(lambda (&more #:context #:count)
  #:count)

That's it, approximately. Currently this can be done for ELT, NTH, CAR, FIRST, LENGTH, LIST-LENGTH, and VALUES-LIST; additionally, using a rest list as the test form in an IF is equally efficient and doesn't force its allocation.

LENGTH, ELT, and NTH on rest-lists deserve a special mention: they're all O(1) when this optimization has been applied.
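
For example, assuming a bleeding-edge SBCL where the optimization fires, something like this hypothetical AVERAGE never allocates its rest list, and the loop is not nearly as expensive as it looks:

(defun average (&rest numbers)
  (if numbers ; using the rest list as a test form doesn't force allocation
      (let ((sum 0))
        ;; LENGTH and NTH are both O(1) on an unallocated rest list.
        (dotimes (i (length numbers) (/ sum (length numbers)))
          (incf sum (nth i numbers))))
      (error "Need at least one number.")))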

Unfortunately we don't yet have any compiler notes about this, so if you intend to take advantage of this optimization, you're best off verifying the results from the disassembly.
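
For instance, a check along these lines (COUNT-ARGS is just an illustration) should show no list allocation in the generated code, only a read of the hidden argument count:

(defun count-args (&rest args)
  (length args))

(disassemble 'count-args)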

Coda

With that in place, I rewrote the vararg arithmetic functions using &REST. Amusingly, they now look rather charmingly naive, written the way someone who doesn't understand the cost of list traversal would write them:

(defun - (number &rest more-numbers)
  (if more-numbers
      (let ((result number))
        ;; LENGTH and NTH on the rest list are O(1) here, so this
        ;; loop is not as quadratic as it looks.
        (dotimes (i (length more-numbers) result)
          (setf result (- result (nth i more-numbers)))))
      (- number)))

...but using a bleeding-edge SBCL, this compiles into rather nice code.

Finally, some pretty pictures. These are benchmark results for calling the vararg #'+ with 2, 4, or 8 arguments. F means fixnum, S a single float, and D a double float. The numbers are benchmark iterations per second, so bigger is better. The topmost chart is for the current version using SBCL's newly found rest-smarts, the middle chart is for the version using DYNAMIC-EXTENT, and the bottom one is for the version before all this madness started.

[Figure: Benchmarks, linear scale.]

[Figure: Benchmarks, logarithmic scale.]

If you look at the vararg+[ff], vararg+[ffff], and vararg+[ffffffff] benchmarks, you can see how the &REST list allocation and access costs almost dominate them: even with stack allocation, going from 8 to 2 arguments barely doubles the speed; with the latest version, each halving of the argument count doubles the speed for both the fixnums-only and the single-floats-only benchmarks.

This was run on x86-64, so both single floats and fixnums are immediate objects. Doubles, however, need heap allocation here, so if you look at the double float numbers, some of the allocation costs come from the numbers and intermediate results.

...but before you get too excited about these numbers, remember the reason no one noticed this for such a long time: in real-world performance-sensitive code these entry points don't really matter that much.