random-state.net

Nikodemus Siivola

<< next | top | previous >>

How To Fix A Compiler Bug #
hacking, August 17th 2011

I've been asked for an introduction or examples in working on SBCL.

Let's see if we can make sense of lp#738464. The issue is that code such as

(flet ((foo ()))
  (declare (ftype non-function-type foo))
  (foo))

complains about "failed AVER: (NULL (FUNCTIONAL-ENTRY-FUN LEAF))" instead of giving a sensible error.

  1. Replicate the issue.

    Entering the form in Slime lands me in the debugger with the expected error. This is good. It means that the bug is still there, and that the description is valid. If that didn't happen, it doesn't necessarily mean that the bug is gone — platform specific bugs, bugs appearing only under specific build options, and bugs contingent on specific optimize settings are relatively common.

  2. Study the issue.
    1. What should have happened?

      We should have gotten an error that says "cannot use a non-function type with FTYPE" or something along those lines.

    2. Why did the thing that happened, happen?

      The bad type apparently propagated quite a ways into the system and caused bad things to happen. (No point overanalyzing things.) It is obvious to me that this error does not occur when the system is dealing with the declaration, but much later — and it's obvious because I've spent a lot of time working on SBCL. If you know nothing about SBCL, here's how you could figure this out:

      • Notice that the backtrace at the point of error says nothing about dealing with declarations, but is full of stuff like SB-C::CHANGE-REF-LEAF, SB-C::LOCALL-ANALYZE-FUN-1, SB-C::REFERENCE-ENTRY-POINT, etc.
      • Figure out where declarations are handled. You could scan through a bunch of code starting from COMPILE and meta-dotting your way forward. You could ask on #sbcl, #lisp, or sbcl-devel. You could use APROPOS or "git grep -e " with various likely things and look if you find something that looks right.

      Let's assume I git-grepped a bit and finally hit upon "git grep -e def.*ftype". One of the matching lines is:

      ir1tran.lisp:(defun process-ftype-decl (spec res names fvars context)

      which looks like it might just have to have something to do with processing FTYPE declarations...

      Before you go and take a look at it, understand this: you don't need to read all of it just now. What you need to do is to make sure it's the thing that processes FTYPE declarations. TRACE is often insanely useful here: hit C-c C-t on top of the function name in Slime, go to REPL, and try the original form from the bugreport again.

      Apropos: M-. inside SBCL sources (for various reasons that would be nice to fix) is sometimes off-by-one-or-two toplevel forms. In this particular case you're likely to end up on process-type-decl (type, not ftype!) So notice that and find the right toplevel form — or be terribly confused for a while because trace output looks strange. This happened to me just now...

      Having traced the correct definition we see it indeed being called with our NON-FUNCTION-TYPE, so our diagnosis is confirmed: the place that processes the declarations doesn't notice that it's not a function type.

    3. Do I have any idea how to fix this?

      Making process-ftype-decl complain sounds about right. Looking at the code we see that it uses compiler-specifier-type to parse the type. So we need to check that the type returned is a subtype of function.

      If you don't know yet how to do the equivalent of SUBTYPEP on SBCL internal type objects, you can use the same process as earlier to figure this out: read more source, ask around, etc. (The answer is: use CSUBTYPEP).

    4. What if you have no idea how to fix it?

      Spend a while — 15 minutes or 15 days, whatever — trying to understand the issue well enough to fix it. Maybe you'll figure it out, maybe you won't. Repeat this excercise over various bugs for a few years, and you'll be an SBCL expert. Understand that sometimes you might need to look for information outside the sources: textbooks, papers, original implementors, etc.

  3. Fix the issue.

    Because I have every intention of actually fixing this just now, I'll assign the bug to myself and mark it as "In Progress". This helps avoid duplicate work.

    We already know the general shape of the fix we're going for:

    (unless (csubtypep ...) ...complain...)

    Now's the time to figure out just what our complaint is. I opt for reporting the invalid type and then ignoring it.

    (unless (csubtypep type (specifier-type 'function))
      (compiler-style-warn "ignoring declared FTYPE: ~S (not a function type)" spec)
      (return-from process-ftype-decl res))

    I edit to above code into the right place in the definition, recompile it with C-c C-c and go to the REPL to test it. Looking good.

    Note on hotpatching the compiler: sure, you can break things easily. It is, however, much faster than doing a full rebuild. Just set up a persistent slime-scratch buffer, and save your work before doing risky stuff. Even if you break something, most of the time you can fix things without restarting — the rest of the time you do M-x slime-restart-inferior-lisp, or at the very worst you have to go to the terminal to "kill -9" the process. The thing to note is that even if you did do a full build, you'd still run the exact same risks! All that said, there are changes that are hard or impossible to hotpatch: sometimes you do need to do a full rebuild.

  4. Finish up

    I add a test-case for the bug into tests/compiler.pure.lisp:

    (with-test (:name :bug-738464)
      (multiple-value-bind (fun warn fail)
          (compile nil `(lambda ()
                          (flet ((foo () 42))
                            (declare (ftype non-function-type foo))
                            (foo))))
        (assert (eql 42 (funcall fun)))
        (assert (and warn (not fail)))))

    EDIT: At this point double-check that the test-case fails without the fix. Specifically, make sure the test fails when run as part of the testsuite. See earlier comment on bugs contingent on optimization qualities, etc. (Thanks to Lutz Euler for pointing out this omission.)

    I commit my changes to the local repository with an appropriate commit message. This is a pretty trivial change, so it doesn't need much of an explanation.

    ignore non-function FTYPEs
    
      Fixes lp#738464.
    
      Give a style-warning and ignore the bad type.

    Now I do a full rebuild and run all tests to make sure I didn't break anything.

  5. Send to upstream.

    At this point I would normally write a NEWS entry for the bug, amend the last commit with it, and push to the official repo. However, SBCL is in its montly freeze right now, so I'm holding up for now and just pushed to the pending branch on my tree at Github.

    Non-commiters should at this point just do "git format-patch -1" and add their patch to the Lauchpad bug report, and add the tag "review" to the bug. (NEWS is conflict magnet, so it's easier all around if the person who commits the patch writes the entry.)

Obviously most any point here could be expanded a lot — but hopefully this provides a starting point of sorts to people interested in digging into SBCL. Fixing tiny buglets like this is an excellent way to get your bearings.