Fresh XPCOM thinking

Everyone who gets far enough into Mozilla code has that “wow, this is chatty . . . verbose . . . inefficient even” reaction to XPCOM — or so I thought. Having played Cassandra once in the dark days before Netscape 6, lived to witness deCOMtamination, and watched the next generation of core hackers grow up wise from birth, I foolishly believed that we were well past worrying about XPCOM abuse.

But we aren’t. Perhaps SVG-the-standard is to blame, but not exclusively. We have to do better, and not just with peer pressure, but I’ll spend a paragraph on exhortation too:

Please don’t imitate COMtaminated code. Don’t use XPCOM in core rendering and layout data structures. Do use XPCOM where it wins, as high-level inter-library and inter-programming-language glue, where the QueryInterface and virtual method call costs are noise, and the benefits for programming in the large are obvious.

More than warning again, I believe it is time to change XPCOM in deeper ways, to fix several problems:

  1. The inability to free cyclic structures automatically.
  2. The code bloat and cycle costs of nsresult return type.
  3. The lack of ubiquitous nsIClassInfo implementations.

Problem 1 means memory leaks, forever, because we’ll never cover all cases, and new cases are being added constantly (sometimes just with JS completing the cycle, no C++ changes — so via a Firefox extension, for example).
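
To make the leak concrete, here is a minimal sketch of a reference-counted cycle. It assumes `std::shared_ptr` as a stand-in for XPCOM's AddRef/Release counting, and the `Node` type is hypothetical. It only illustrates the failure mode, not how XPCOM objects are actually declared.

```cpp
#include <cassert>
#include <memory>

// Hypothetical node type; std::shared_ptr stands in for XPCOM's
// AddRef/Release reference counting (an assumption for illustration).
struct Node {
    static int live;              // number of Node objects currently alive
    std::shared_ptr<Node> other;  // strong (owning) reference
    Node() { ++live; }
    ~Node() { --live; }
};
int Node::live = 0;

// Builds two nodes that own each other, then drops the external references.
// Returns how many of those nodes are still alive afterwards.
int leak_via_cycle() {
    int before = Node::live;
    {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        a->other = b;
        b->other = a;  // cycle: neither count can reach zero
        assert(a.use_count() == 2);  // held by the local and by its peer
    }
    return Node::live - before;
}

// Same shape without the back edge: both nodes are freed.
int no_leak_without_cycle() {
    int before = Node::live;
    {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        a->other = b;
    }
    return Node::live - before;
}
```

Once the local references go away, each node in the cycle still holds a count on the other, so plain reference counting never frees either one.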

Problem 2 makes our source code overlarge and hard to read, never mind the terrible effects on register pressure, instruction cache utilization, memory bandwidth wasted on the out param that wants to be the return value, etc.
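
A rough sketch of the two styles side by side may help. `Result`, `kOk`, `kFailure` and `Failed` are stand-ins invented for this example, not the real nsresult definitions, and the digit-summing functions are hypothetical.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-ins for illustration only; these are not the real XPCOM definitions.
using Result = unsigned;
constexpr Result kOk = 0;
constexpr Result kFailure = 0x80004005u;
inline bool Failed(Result rv) { return (rv & 0x80000000u) != 0; }

// --- Result-code style: the value comes back through an out parameter. ---
Result ParseDigitRC(char c, int* out) {
    if (c < '0' || c > '9') return kFailure;
    *out = c - '0';
    return kOk;
}

Result SumDigitsRC(const std::string& s, int* out) {
    int total = 0;
    for (char c : s) {
        int d = 0;
        Result rv = ParseDigitRC(c, &d);
        if (Failed(rv)) return rv;  // repeated at every call site
        total += d;
    }
    *out = total;
    return kOk;
}

// --- Exception style: the return value carries the result. ---
int ParseDigitEx(char c) {
    if (c < '0' || c > '9') throw std::invalid_argument("not a digit");
    return c - '0';
}

int SumDigitsEx(const std::string& s) {
    int total = 0;
    for (char c : s) total += ParseDigitEx(c);  // failures propagate implicitly
    return total;
}
```

The exception version drops the out parameter and the per-call test. Whether the compiled code actually ends up smaller or faster is what the proposed experiments would have to measure.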

Problem 3 means we can’t readily reflect on instance-of relationships in JS except for DOM objects. Having DOM class information (what interfaces does this object implement? is this object thread-safe?) is cool, but we still don’t take advantage of nsIClassInfo (e.g. for the solution to problem 1 suggested below) for want of ubiquity. (Another example: we believe we could use it to avoid all locking costs in XPConnect.)

These problems can be solved:

  1. I pointed dbaron and graydon at work by David Bacon et al. on reference-count cycle collecting that we could infuse into XPCOM, curing all cycle-induced leaks. We may hope to have fixed the last such leak, but Murphy was an optimist. XPCOM without some kind of cycle collector is like a car without a brake.
  2. I’m looking for volunteers to help run some experiments testing the state of C++ exception support on the compilers we care about (mainly GCC and MSVC). What is the code size increase from turning on exceptions (rtti too if required)? That’s the first experiment to run. Actually converting our mega-lines of C++ code to use exception handling instead of nsresult returning and testing will be real work, and lots of it. I have some thoughts for how to semi- or fully-automate that work that I’ll blog about soon.
  3. Right now the way you implement nsIClassInfo involves painful C macros, so no one does it. With some C++ sugar, and with the right defaults, defining class information for each concrete XPCOM class should be simple, so we can actually count on it being there.

These problems should be solved, and I’m committed to working to solve them. Who is with me?

27 Replies to “Fresh XPCOM thinking”

  1. I’m all for pushing XPCOM into the 21st century!
    Some numbers: VC8 official build, zipped (not 7z’d):
    Without exceptions: 7,038,820
    With exceptions: 7,768,638
    10% extra, zipped size. Not sure if that would go down with 7z’ing, and I haven’t tested the performance of the two builds. I’ll probably fire up Trender in the next day or so and see what I get.

  2. “Problem 2 makes our source code overlarge and hard to read, never mind the terrible effects on register pressure, instruction cache utilization, memory bandwidth wasted on the out param that wants to be the return value, etc.”
    Well I can understand the value in readability, but won

  3. It looks like that’s with RTTI; however, so is the without-exceptions number. /GR (enable RTTI) seems to be the default now with VC8, and we don’t explicitly turn it off. Rerunning with /GR-. (I need to get these patches into the tree; the --enable-cpp-exceptions and rtti configure options only work with gcc right now.)

  4. Ok, new numbers:
    VC8, size of .zip (just “make” in browser/installer):
    no exc, no rtti: 6,891,769
    no exc, with rtti: 7,038,820 (+147,051 +2.1%)
    with exc, no rtti: 7,622,794 (+731,025 +10.6%)
    with exc, with rtti: 7,768,638 (+876,869 +12.7%)
    So RTTI doesn’t seem to be required, but is a minor hit compared to exception handling.

  5. Alex: no, exceptions would not obviously make performance worse. The most common “handler” for a failure nsresult is:
    if (NS_FAILED(rv)) return rv;
    or similar. With the right rewriting rules, this pandemic pattern *goes away*. Failure is rare, and should be rarer once we get rid of overloaded “COMFALSE” and similar nsresults that should be plain return types.
    /be

  6. Two notes on the codesize numbers:
    1) The change in mZ is a lot more important than the zip size to me (since mZ is what actually affects things like startup time, i-cache pressure, etc).
    2) I’d be interested in seeing the following 4 numbers compared (independent of RTTI):
        a) Vanilla build
        b) Build with NS_ENSURE_SUCCESS defined to be nothing.
        c) Build with exceptions.
        d) Build with exceptions and the NS_ENSURE_SUCCESS change.
    I realize that not all of our code uses NS_ENSURE_SUCCESS, but that would at least give us some order-of-magnitude ideas about how replacing it with exceptions would work…
    As a particular metric, I would be interested in the size of gklayout.dll/.so.

  7. gklayout.lib size:
    53996044 firefox-official-ex-nortti-noensure
    54026988 firefox-official-ex-nortti
    59480644 firefox-official-ex-rtti
    52476526 firefox-official-noex-rtti
    47022886 firefox-official-noex-nortti
    I don’t have numbers for a vanilla build with NS_ENSURE_SUCCESS defined to nothing, but it doesn’t look like it’ll make that big of a difference.

  8. Please don’t change the XPCOM ABI! 🙂
    I’m skeptical. Do we have proof that checking nsresults is actually hurting performance and that exceptions would make it better? Changing our entire codebase over to use exceptions would be a monumental challenge. You cannot just compile away NS_ENSURE macros. It would be a huge undertaking to revise all of our error handling. I think there are far better uses of our time.
    As for nsIClassInfo, I think we should make the macros better and encourage their use. For example, I’m all for adding nsIClassInfo to Necko classes. I think we should blackflag every explicit QueryInterface in our JS code and figure out if we can’t replace it with nsIClassInfo. That will make our toolkit easier to use.
    As for reference cycles, they are generally introduced through observer patterns. I think we should harden our observees to ensure that they only introduce reference cycles that they are guaranteed to break. (e.g., nsIChannel implementations are required to release their listener in OnStopRequest.) That will make our APIs more robust to abuse from poorly written extensions. Of course, there are other sources of reference cycles, but we can get a lot of mileage out of our existing architecture and some well placed effort.
    I guess I’m just not convinced that we need to rip-n-replace the architecture that has gotten us this far.

  9. Benjamin, your write-up didn’t do more than make assertions. I am provoking, and attempting as time allows, actual measurement as proof of improvement.
    Of course doing what bz proposed was just a first step to measure easy wins involving NS_ENSURE_* macros. No one said otherwise, and it’s not worth arguing about. The proposal I’ve made stands or falls on code size once the source has been freed of most nsresult testing and consequent early returns, and results have moved back to the return value from the last out parameter.
    Darin, none of these changes involves manual “rip and replace” of the entire architecture. As you note, class info is additive. A cycle collector requires object layout to be specified, but that is also additive — no ripping or replacing.
    The exception experiment is r&r, but here’s the deal. To be compelling, it must involve changing a large body of code, proving both significant savings and correctness. It can’t be done by hand, nor do I propose such a waste of time. The idea is to use a sound static analysis framework to assist in the rewrite. The goal is for most cases to be handled automatically. More on this in a bit.
    /be

  10. Sorry, I forgot to add in reply to Benjamin’s post that there are always ways to bridge an exception-based XPCOM C++ binding to the current one, in both call directions. I don’t want to break XPCOM, I want to fix it compatibly. If this isn’t possible, then I want proof that its nsresult design, which we know makes our source code significantly bigger than, e.g., WebKit, is not a source of significant code size and cycle costs.
    If we can prove all that, I’ll gladly concede that we shouldn’t change XPCOM. Otherwise, all the MSCOM compatibility in the world (which we can preserve with a bit of bridging work) doesn’t cut it for me.
    /be
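
One way such a bridge might look, in both call directions, is sketched below. Every name here (`Result`, `ResultError`, `ThrowIfFailed`, `CatchToResult`) is hypothetical and the error values are stand-ins. This is a sketch of the general idea, not a proposed XPCOM binding.

```cpp
#include <cassert>
#include <exception>

// Stand-ins for illustration; not the real XPCOM types or error codes.
using Result = unsigned;
constexpr Result kOk = 0;
constexpr Result kFailure = 0x80004005u;
constexpr Result kUnexpected = 0x8000ffffu;
inline bool Failed(Result rv) { return (rv & 0x80000000u) != 0; }

// Hypothetical exception type that carries a result code.
struct ResultError : std::exception {
    explicit ResultError(Result c) : code(c) {}
    const char* what() const noexcept override { return "result-code failure"; }
    Result code;
};

// Direction 1: exception-style code calling a result-code API.
inline void ThrowIfFailed(Result rv) {
    if (Failed(rv)) throw ResultError(rv);
}

// Direction 2: a result-code API implemented on top of exception-style code.
// No exception is allowed to escape across the boundary.
template <typename Fn>
Result CatchToResult(Fn&& fn) noexcept {
    try {
        fn();
        return kOk;
    } catch (const ResultError& e) {
        return e.code;
    } catch (...) {
        return kUnexpected;
    }
}
```

Wrappers like these would sit only at the boundary between old-style and new-style code, so existing result-code callers never see an exception.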

  11. One more point: the grass may not be much greener, but it is not obvious that making every single last call site worry about nsresult testing beats making far fewer call sites worry about using try blocks. To repeat a point for the third time, all the
    if (NS_FAILED(rv)) return rv;
    statements go away with exceptions. Those blocks are not virtuous in my eyes, they are hazardous. We know of many bugs where the wrong rv was returned, or a failing rv was dropped on the floor (bad when there is an exception left pending in JS). Such lines are too often mindless bloat, salt thrown over one’s XPCOM shoulder. Enough with superstition.
    /be

  12. > The exception experiment is r&r, but here’s the deal. To be compelling, it must involve changing a
    > large body of code, proving both significant savings and correctness. It can’t be done by hand,
    > nor do I propose such a waste of time. The idea is to use a sound static analysis framework to assist
    > in the rewrite. The goal is for most cases to be handled automatically. More on this in a bit.
    OK, I’m curious. In my opinion, solid error handling is paramount, and I am skeptical (hopefully, understandably) of any automatic tool that changes error handling code paths.
    In Necko-land, I’ve worked hard to ensure that all errors are handled. It’s true that people often do not take such care. (Indeed, there is a reason why I had to do so much work on Necko to make it handle error conditions properly, and there’s still more work to be done.) At any rate, I’m not convinced that exception handling results in better error handling in practice for many of the reasons bsmedberg enumerated. There’s simply no substitute for good programming 😉

  13. > mZ is what actually affects things like startup
    > time, i-cache pressure, etc
    The mZ effects of exception handling are hard to interpret. The unwind tables, if implemented properly, should reside in their own pages, never to be touched until an exception actually occurs (which should be rare if exceptions are being used properly), and therefore don’t affect i-cache and should have a minimal effect on startup time. I sure hope exception unwind tables don’t require dynamic linker relocations.

  14. Darin: amen to no substitute for good programming. We are arguing about trade-offs among exceptional case handling techniques. Handling all cases correctly requires care, as ever. Different programming languages can help or hinder. Modern type systems in particular can help quite a bit, but C++ and C are not modern in their type systems.
    SpiderMonkey (in C, so no chance even for C++ exception handling) uses explicit return testing, overloading null for error on pointer return type, using JSBool with false for error otherwise (mostly).
    Cairo’s approach keeps error state in each object, and makes objects go safely dead after an error. So you can generally avoid querying error state after each operation. Committing unwanted effects instead of stopping short on error becomes the new hazard. This may work better for graphics code than for other kinds of code.
    There’s no panacea, but the way nsresults are used today is pessimistic. Most of those if-return statements are properly predicted not-taken by compilers and hardware branch predictors. That suggests that there are better optimizations than writing them unalterably into the source code, and fetching them (when short) in the instruction stream.
    /be
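
The error-state-in-the-object approach described above can be sketched roughly as follows. The `Path` class is hypothetical and is not cairo's API. It only shows the "go safely dead after an error" behavior.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical type illustrating a "sticky" error state; not cairo's API.
class Path {
public:
    enum class Status { Ok, InvalidPoint };

    // Appends a point. After the first error the object is "dead":
    // later calls are safe no-ops and the first error is preserved.
    Path& LineTo(double x, double y) {
        if (status_ != Status::Ok) return *this;
        if (std::isnan(x) || std::isnan(y)) {
            status_ = Status::InvalidPoint;
            return *this;
        }
        points_.push_back({x, y});
        return *this;
    }

    Status status() const { return status_; }
    std::size_t size() const { return points_.size(); }

private:
    struct Point { double x, y; };
    Status status_ = Status::Ok;
    std::vector<Point> points_;
};
```

A caller can chain several operations and check `status()` once at the end. The trade-off is that work done before the error stays committed.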

  15. roc wrote:
    “The unwind tables, if implemented properly, should reside in their own pages, never to be touched until an exception actually occurs (which should be rare if exceptions are being used properly), and therefore don’t affect i-cache and should have a minimal effect on startup time.”
    This implies that in some cases (I’ve alluded to NS_COMFALSE, but there are other non-NS_OK nsresults used frequently to indicate cases that may not be exceptional), a method will want to use a boolean return type, or some similar way of indicating that a less-frequent, but still common, case arose.
    Using a magic nsresult (NS_ERROR_NOT_AVAILABLE, e.g., in the XUL FastLoad code) made sense when the return value was preempted by nsresult to match XPCOM’s pattern. Freed of that pattern, such code may not want to throw such an nsresult code if the cost of throwing is high compared to using an “in-band” return code.
    /be
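
A small sketch of that split between in-band and exceptional results, assuming a hypothetical `Cache` type and `Lookup` function (this is not the XUL FastLoad API):

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical cache for illustration; not the XUL FastLoad API.
using Cache = std::map<std::string, int>;

// A miss is common, so it is reported in-band through the bool return value.
// An empty key is treated as a genuine programming error and throws.
bool Lookup(const Cache& cache, const std::string& key, int* value) {
    if (key.empty()) throw std::invalid_argument("empty key");
    auto it = cache.find(key);
    if (it == cache.end()) return false;  // frequent case: nothing is thrown
    *value = it->second;
    return true;
}
```

The frequent "not available" outcome costs an ordinary branch, and only the rare failure pays for a throw.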

  16. I have actual experience working with a large C++ codebase that didn’t have exception handling and was changed to use it.
    1. Compilers are quite good about exception support (although you have to send some sort of “dynamic” flag to the gcc linker or things will magically not get caught properly!); that’s not the problem.
    2. I think you’re very unlikely to get a performance boost. Since you’re already pushing to use XPCOM in fewer cases, I don’t think you should be looking for one anyway.
    3. Very little of our existing codebase was ever actually moved over to use exceptions, but all of it had to guard against it when it called into new code. It was kind of a mess. But code written without exceptions in mind is tricky to convert; allowing exceptions to just fall out usually skips cleanup code.
    4. Generating decent error messages from exceptions requires a *lot* of “try, catch, annotate, rethrow”. There’s no good way around this. Is your code shorter and clearer? Sometimes. It is often indented an extra level though. (IOW, not a huge win.)
    5. I think it’s equally hard to get things right either way. The only difference is what happens when the programmer doesn’t handle error cases properly. With return codes, if there’s an oversight, errors mysteriously disappear. With exceptions, things fail (with a cryptic message) that should succeed.
    6. This is not a minor change. Using exceptions absolutely requires using the C++ language a lot more heavily. You *have* to use classes for everything that requires cleanup, which means relying on destructors and STL a lot more. It’s a totally different style of coding, and not necessarily less cumbersome overall.
    7. Debuggers have trouble with stepping through stack-unwinding. This is a huge pain in the butt.
    In short, I don’t think it’s a huge net gain. It’s like switching to a whole different language, with a different set of tradeoffs. Probably not worth the effort. Your other two suggestions sound much wiser. (BTW: Python has reference counting too, and also eventually added cyclic collection after the fact, and it introduced a bunch of bugs; so you might want to talk to those guys, pick up valuable test cases. 🙂
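
Points 3 and 6 above can be illustrated with a minimal sketch. The handle counter and all function names are hypothetical. They exist only to show cleanup being skipped when an exception passes through code written for return codes, and a destructor-based guard avoiding that.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical resource with manual acquire/release, counted for illustration.
static int g_open_handles = 0;
void AcquireHandle() { ++g_open_handles; }
void ReleaseHandle() { --g_open_handles; }

void MayThrow(bool fail) {
    if (fail) throw std::runtime_error("operation failed");
}

// Written with result codes in mind: cleanup is skipped if MayThrow throws.
void ManualCleanup(bool fail) {
    AcquireHandle();
    MayThrow(fail);
    ReleaseHandle();  // never reached when an exception propagates
}

// RAII guard: the destructor runs during stack unwinding.
class HandleGuard {
public:
    HandleGuard() { AcquireHandle(); }
    ~HandleGuard() { ReleaseHandle(); }
    HandleGuard(const HandleGuard&) = delete;
    HandleGuard& operator=(const HandleGuard&) = delete;
};

void RaiiCleanup(bool fail) {
    HandleGuard guard;
    MayThrow(fail);
}
```

With the guard, the release happens on both the normal and the throwing path. That is why an exception-based style pushes cleanup into destructors.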

  17. By the way, #6 above suggests that you will pay a performance penalty in the long run. I know nothing about the Mozilla codebase, but the more you leverage the C++ language beyond C, the slower things get, naturally.

  18. For what it’s worth, I think this discussion should really be happening in the relevant newsgroup, not here. Makes it more findable too.

  19. Jason: item 3 “Very little of our existing codebase was ever actually moved over to use exceptions” contradicts your preamble “large C++ codebase that didn’t have exception handling and was changed to use it.” You can’t get the code cost wins without changing signatures to free the return type and to move results from |out| parameters to return values.
    Exceptions are not better or worse in usability compared to return codes in my view, just different. I agree with the non-advocacy parts of scc’s writeup at https://scottcollins.com/articles/exceptions.html. My sole aim here is to reduce source code bloat and compiled code costs (both static and dynamic). If in spite of (to be demonstrated) quantified, significant wins from switching exception models at scale, the Mozilla community can’t work with exceptions for some human reason (hard to believe), that *might* mean we renounce the wins. Let’s have that problem.
    bz: yeah, I’ll start threads shortly.
    /be

  20. It seems to me that the biggest problem many of you have is that you are not sure what is really essential for such a big project and where the real problems are.
    It might help to get the opinion of your colleagues from the gaming community who maintain similarly huge projects. Their “boss” gave a talk at this year’s POPL’06 where he addressed at least issue 1; see “The Next Mainstream Programming Language: A Game Developer’s Perspective”, Tim Sweeney, https://www.cs.princeton.edu/~dpw/popl/06/Tim-POPL.ppt
    In short, his opinion is that for big projects there is no alternative to garbage collection. Also, the coders were ready to sacrifice 10% in code efficiency to gain 10% in productivity; e.g., they abandoned assembler altogether. You cannot afford not to be type-safe nowadays.
    My personal remark about which garbage collector you should use: use the one that Java uses. It is the one that works, and the one that you can always ask a question about.
    There are lots of other issues covered in the talk that could also be helpful to you. The talk also helps if you want to change a scripting language – to understand what is necessary and what is not.
    The build size is completely inessential from my viewpoint, as long as you have much bigger problems like real bugs.
    You can also try to use some model checkers to verify that your code is free from errors of certain types (e.g. NULL dereference, assertion failures and access past array bounds). The world has evolved far beyond almost unusable lint. Check what model checkers have been created in the last 5 years and try some of them out. Unfortunately, I’m unable to give an overview of good ones.
    Regards,
    sasha.

  21. sasha: I’m familiar with Tim’s POPL preso already. And besides making JS2 a stronger (but impure) functional language, I’m explicitly proposing GC for XPCOM here.
    We can’t rely on Java. There is no industrial-strength open source JVM, and the web needs JS, which is SpiderMonkey in Firefox, with many and deep API dependencies.
    We use model checkers (Coverity) already.
    We didn’t just fall off the turnip truck, you know. I’m frank about not having data yet to argue for exceptions, but that does not mean we don’t have any data. We know NS_FAILED(rv) tests bloat our source code. Most are propagating rare or impossible errors.
    /be

  22. More on the “use Java’s GC” velleity: besides the lack of open-source *and* high-quality JVM, we do not use Java — we use XPCOM implemented by a mix of JS (which has its own GC) and C++ (which does not, until Graydon’s new cycle collector comes up per this blog item’s proposal).
    Again we are not rewriting anything. No Javagator. So we have to make incremental, sensible engineering decisions that have good return on investment. The bad thing about XPCOM is not ref-counting, it is the reliance on only ref-counting without automating cycle breaking.
    /be

  23. melon: those network prefs are not usable by default. Unfortunately, pipelining breaks with too many proxies out there, and your setting for the maximum number of connections is just anti-social, and anti-RFC.
    /be
