Assertions should be fatal

Thanks to Dawson Engler for helping get us connected to Coverity last Fall. Dan Veditz and I have done several scans of Mozilla sources (Firefox branch and trunk, Thunderbird with Calendar enabled) using Coverity’s SWAT static analysis toolset.

(As an aside: I am a big fan of the Meta Compiler approach, and I have hopes of helping extend it to the Mozilla-specific domains of HTML, XUL, CSS, the DOM, and JavaScript. Some day, practical programming languages will get beyond out-Java’ing Java or out-C++’ing C++ and let us declare our invariants, instead of trying to assert them with runtime checks that duplicate and invert logic, and that are compiled out of production code.)

The good news is that our nominal error rates are respectable at first glance: as good as or better than other large open source projects. The results show many trivial redundant null checks, missing or inconsistent null checks, and the like. These numerous bugs will be filed and fixed in batches, and we haven’t classified them all yet, so the bug-filing won’t be immediate. These are low priority bugs.

There were fewer more serious errors caught, including dead code, but the most worrisome problems visible to the static analyzers were not obviously bugs. Rather, they were cases where an index into an array would be out of bounds only if assertions (NS_ASSERTION, etc.) were not seen as fatal when botched.

The toolset can be taught that NS_ASSERTION is fatal (exits the control flow graph), just as <assert.h>’s assert macro is understood by default to be fatal. But thanks to some bad ancient history in the Mozilla project, for most builds, NS_ASSERTION is not fatal!

(The bad history dates from 1999 and involved various people, mostly Netscape employees, all “trying to get their jobs done” in the face of bogus assertions from some of their fellow hackers. Instead of forcing bad assertions to be fixed, at some opportunity/social/political cost to oneself, it was too easy to go along with the minority view that “assertions should just be warnings”.

To be fair, some wanted all assertions to be combined via newer macros with run-time checks that would not compile out of production builds, but that went too far: performance-critical code with a limited set of trusted callers should not have to run-time check all indices.

Anyway, I’m sorry I gave in and went along with the assertions-as-warnings flow. That was a decision made in haste and repented at leisure!)

So back to the present. Given assertions-as-warnings, the problem becomes: do we have enough test coverage with DEBUG builds to be sure that developers will actually botch assertions whose invariants are being violated by a rare bad caller or broken data dependency? And of those few developers, how many will actually notice the warning messages in the cluttered standard output or standard error of the Mozilla-based program they’re testing? And of those, how many will actually do anything about fixing the broken assertions?

Anyone committed to reality has to believe that the odds ratios multiply into too small a fraction for us to be confident that assertions are effective “path killers” for the static analyzers to respect.

I go further: anyone committed to software quality should want fatal assertions. Assertions are proof-points, to oneself and to others reading one’s code. They are not idle wishes, or doubtful questions posed to the software gods. If your code has a bogus assertion, you should want to know why it botched, and how to fix it. You should want to keep it in a fixed form, rather than remove it, if at all possible.

Dan and I pored over the several dozen potential static and dynamic overrun errors that the Coverity tools found, and at least for the ones in Firefox, we convinced ourselves that the callers were well-behaved. So again, based on our four eyeballs and fallible brains, we believe that the tools found nothing overtly bad or exploitable.

But, we must fix this historic assertions-are-warnings botch.

Just making assertions fatal now will turn tinderboxes on Linux and Windows, at least, bright orange.

I call that a good start, and then we can close the tree, divide the labor among all hands on deck at irc.mozilla.org, and fix or remove the bogus assertions. We could do this right at the start of the 1.9 alpha cycle. You may well object that this will be too disruptive. If so, please comment here proposing a better plan that actually fixes the problem in a finite number of days. We shouldn’t put this off beyond the start of the 1.9 cycle.

/be

14 Replies to “Assertions should be fatal”

  1. Are you going to publish possible assertion suspects?
    Is there a bug number associated with this issue?

  2. JR: it’s a huge list, and we aren’t able to publish the tool output, but we’ll get bugs on file. I’ll post a bug link here when there is one (or more!).
    /be

  3. When I just wrote “it’s a huge list”, I was talking about the list of bogus assertions that would turn the DEBUG tinderboxes orange. The list of static and dynamic overruns reported by Coverity is shorter, and I’ll talk to dveditz about putting it into the bug system.
    /be

  4. 1. Whenever an assertion is printed on the console, the assertion should also be logged along with a stack trace.
    2. Change the default behavior for assertions on Linux. It is currently “just print an error message in the console”. It should be “pop up a dialog showing the assertion with the choices Break and Continue”, which is the default on Windows, or “break”.
    3. Encourage developers to treat assertion failures as having the same severity as crashes.
    4. Reopen assertion bugs that were marked as WONTFIX/INVALID with reasons like “yeah, don’t do that” (bug 90780). There are 95 bugs with “assert” in the summary that are marked as WONTFIX/INVALID.
    5. Create optimized builds in which assertions cause a stack trace to produced (e.g. using Talkback), ideally without crashing the application. Encourage testers to download these builds instead of normal nightly builds in order to help improve code quality.

  5. I often troll Asa’s blog, but thought I would visit yours too.
    1. I praise you for having comments avaliable on your blog.
    2. I praise your efforts for going back and fixing code that you deam apropriate.
    3. How large of an ordeal is this? Is this a one or two day event? How many people will this take? What effects do you think this will have on the build stability?

  6. Re: Jesse’s comments:
    > 1. Whenever an assertion is printed on the console, the assertion
    > should also be logged along with a stack trace.
    Logged where? We can stack-backtrace on multiple platforms, that’s not hard. Getting release-mode logging that has the right properties (logs to the right place in the filesystem, is configurable).
    > 2. Change the default behavior for assertions on Linux. It is
    > currently “just print an error message in the console”. It should
    > be “pop up a dialog showing the assertion with the choices Break
    > and Continue”, which is the default on Windows, or “break”.
    Yes, absolutely: Linux should behave as Windows does — we need platform parity. What about Mac?
    > 3. Encourage developers to treat assertion failures as having the
    > same severity as crashes.
    The only effective way to encourage this is to make assertions be as disruptive and hard to undo as a crash. So item 2 should suffice, and we can keep the option of making assertions warnings, but the default mode needs to be “fatal” to tinderbox build automation.
    > 4. Reopen assertion bugs that were marked as WONTFIX/INVALID with
    > reasons like “yeah, don’t do that” (bug 90780). There are 95 bugs
    > with “assert” in the summary that are marked as WONTFIX/INVALID.
    Please help by reopening any such bugs, and make these bugs block bug 279923.
    > 5. Create optimized builds in which assertions cause a stack trace
    > to produced (e.g. using Talkback), ideally without crashing the
    > application. Encourage testers to download these builds instead of
    > normal nightly builds in order to help improve code quality.
    The talkback event is a great idea, and I’ll talk to Jay about it. As far as I’m concerned we could make all daily builds have this assertion talkback feature. Why make extra build variants if we can avoid it?
    /be

  7. From 4: Which bugs should block bug 279923? Just bogus assertions, or any reproducible assertion?
    From 5: Reasons I suggested a separate build:
    a) If assertion tests make Firefox too slow, fewer people will use nightlies, and testing will suffer overall.
    How much slower does Firefox get if you turn on assertions?
    b) At least at first, assertion failures will be common. If the assertion-failure behavior is disruptive (Talkback pops up, or worse, Firefox crashes and then Talkback pops up), nobody will want to use nightlies. If the assertion-failure behavior is to issue a Talkback report silently, users have to opt in somehow, and downloading a special build is one way.
    c) Test what will be released. If nightlies have assertions enabled but release builds do not, any bug that causes some C++ code to have undefined behavior could e.g. cause release builds to crash but have no visible effect in nightly builds.
    How often do you think that would happen? Do nightlies already differ from release builds in any way? Are release candidates sufficient to work around this problem?

  8. Let’s start with bogus assertions that are likely to keep the tree closed, due to smoking-orange debug tinderboxes.
    Another reason for a separate daily build with talkback-assertions: we have enough talkback server load that we make the Windows installer toss a coin to decide whether to install talkback, even though the agent was included in the .xpi. We shouldn’t jump out of this frying pan into a bigger talkback server load fire.
    I don’t know how slow assertions and other DEBUG code make our builds, but it seems pretty severe. If only a relative few DEBUG-#ifdef’d code blocks are to blame, we should find and fix — or perhaps special-case-#ifdef — them.
    Your reason (c) seems to suppose that assertions defend against false conditions with an early return, but that’s not the case. NS_ENSURE_TRUE and similar macros do, however.
    I agree that we want to test release-mode code. I expect as our static analysis and assertion-fixing proceed in tandem, we’ll converge on a state where we have runtime checks for cases that can’t be checked with Coverity and similar tools, and DEBUG-only assertions for cases with checkable, well-behaved callers or inputs — just in case someone adds a bad caller before the next static recheck.
    Nightlies have order 10k downloads; milestones are order 100k or better. Coverage is much better in the final milestones, integrating over a couple of months. We will likely have further “rings” of users, with developer-previews of future Firefox versions, etc., to gain greater QA and prove fitness with successively larger cohorts.
    /be

  9. In (c), I’m worrying about undefined behavior in other code, possibly unrelated to the assertion. For example, consider a piece of code that uses an object that has been freed recently. This might not crash most of the time in normal builds. An assertion test might allocate memory that overlaps with the freed object, making the build more likely to crash.

  10. Assertions generally do not allocate, and without C++ magic you’d need #ifdef DEBUG around a block of code to allocate and free, anyway.
    Your worry is on target in general, though, as any assertion could have side effects. Without exhaustive code reading, we’ll just have to find out the fun way.
    /be

  11. Hmm, given the large number of crash bugs that we permit to live in our codebase I am a little bit skeptical for the improvement by this action (searching for crash and test case as keywords shows what I mean). The Windows assert is already disruptive enough for me (https://bugzilla.mozilla.org/show_bug.cgi?id=152015).
    So what should happen to the bogus asserts?
    Let me give an example:
    https://lxr.mozilla.org/seamonkey/source/layout/tables/nsTableFrame.cpp#1952
    I activated the assertion to get a handle on the bogus callers. If this assertion is triggered either the reflow is completely wrong or a performance problem is indicated. Nothing critical, but I would like to know and fix the bad callers. How can/should this be handled? I don’t want this go unnoticed. Do I need to encapsulate it in a private DEBUG macro? Or should I convert it into a warning. If so is it possible to drop from warning into the debugger like it works currently with assertions under windows by setting something similar like the XPCOM_DEBUG_BREAK? And finally is there any documentation on the macro use policy. I am quite often in doubt which one to use. And I am not alone, as reviews and sr’s quite frequently also ask others for adding an assertion or changing the type. So before making the tree orange defining the assertion policy clearly so that it can be enforced seems to one thing I would expect from the leading devs.
    I believe that people who add assertions make the commitment to fix the issues revealed by triggering these assertions.

Comments are closed.