Static Analysis FTW

One of the best “researchy” investments we’ve made at Mozilla over the last few years has been in static analysis, both for C++ (including Taras Glek‘s excellent Dehydra, with which you write the custom analysis in JS) and now for JS itself:

DoctorJS is based on Dimitris Vardoulakis‘s work this summer implementing CFA2 for JavaScript at Mozilla. Dimitris is a student at Northeastern University under Olin Shivers (who is in fact a super-hero, not a super-villain as his name might suggest). Dimitris is one of many awesome interns we’ve managed to recruit in recent summers.

Credit for web design and nodejs-wrangling go to Dave Herman and Patrick Walton.

What is static analysis (for those of you who skipped the wikipedia link in my first line)? You could think of static code analysis as running your program without actually executing it. A static analysis tool reads source code (or sometimes bytecode or machine code) and associates concrete and abstract values with program locations or slots on a stack built as it scans the code in straight-line or abstract fashion — but the tool does not actually take every branch, of course. Yet in spite of not running your program, a good static analysis can tell you non-obvious and sometimes amazing facts about your code.

This description doesn’t begin to do justice to the cleverness of the algorithms used to keep precision while not taking too long (or effectively forever), but I hope it gives a feel for how most analyses work. The real fun starts when you have higher-order functions (as JS has).

All static analyses are approximate, since only running your program will (in general) tell you what output it gives for a given input, or even whether it ever halts. But simple programs can be modeled with great precision, and even conservative static analyses that give up at some point can shed light by pointing out sketchy or buggy parts of your code. Windows’ Static Driver Verification, based on the SLAM project at MSR, is a notable success story.

It should be clear that an optimizing compiler does static analysis of several kinds in order to translate your source language into efficient instructions written in another language, perhaps physical machine code, virtual machine code for a “managed language runtime”, or another higher-level programming language (e.g. JS — see OpenLaszlo, GWT, Cappuccino, and my latest favorite, Skulpt, among many examples).

A compiler that checks types is obviously conservative (sometimes too conservative), in that it will call a program that fails to type-check an erroneous program, even if that program would have behaved well at runtime for all possible inputs. Dynamic languages are popular in large part because programmers can keep types latent in the code, with type checking done imperfectly (yet often more quickly and expressively) in the programmers’ heads and unit tests, and therefore programmers can do more with less code writing in a dynamic language than they could using a static language.

(For many common tasks; not all static languages are less expressive all the time; qualifications needed ad nauseum. I am not religious — I use static and dynamic languages all the time — and if there is one thing I’ve learned as a  programmer, it is that there is never one right language for all jobs.)

Static analysis, since it is approximate, is not going to solve every problem. But a clever analysis, making good use of all the information its author can program it to glean, can do a lot more than what conventional static languages’ type checkers can do.

For example, do you really need to write type annotations in your code for it to go fast? I’ve argued that you don’t, for example here (where I did argue for optional annotations of runtime “guards” or “contracts”, only at API boundaries — different beasts from “types” as the term is conventionally used in static languages). Let’s see how well DoctorJS does with some SunSpider (crypto-aes.js) code:

 * AES Cipher function: encrypt 'input' with Rijndael algorithm
 *   takes   byte-array 'input' (16 bytes)
 *           2D byte-array key schedule 'w' (Nr+1 x Nb bytes)
 *   applies Nr rounds (10/12/14) using key schedule w for 'add round key' stage
 *   returns byte-array encrypted value (16 bytes)
function Cipher(input, w) {    // main Cipher function [§5.1]
    . . .

DoctorJS’s output includes this JSON fragment:

    "name": "Cipher",
    "tagfile": "js",
    "addr": "/^function Cipher(input, w) {    \/\/ main Cipher function [§5.1]$/",
    "kind": "f",
    "type": "Array[number] function(Array[number], Array[Array[number]])",
    "lineno": "13"
. . .

From the type property we can see that DoctorJS figured out that that the Cipher function takes an array of numbers as its input parameter (this should be an array of bytes, but the analysis can’t yet figure that out — yet), and a second array of arrays of numbers named w (the “key schedule”). This by itself is pretty amazing.

The addr property gives a regexp to find Cipher in the crypto-aes.js source, which happens also to be a valid ctags (or jsctags) tag address.

The other properties should be self-explanatory.

The idea for DoctorJS came to me just over a week ago when I said to Dave Herman something like “we should take Dimitris’s analysis, put it on NodeJS, and make a twitter-ific web service with several formats returned by different apps, so that everyone can use the fruits of the analysis.”

Typical pointy-haired whiteboard operation by me :-P. Of course the details involved choosing to fork a process for each analysis request, since the analysis could take a while, and it is not written in “callback” or continuation-passing style (nor should it be: this concurrency vs. code simplicity trade-off is in general a false dilemma, and it’s an issue affecting Node and JS to which I’ll return soon); fixing bugs; configuring servers and proxies; and doing some fast site design. For all of this, my thanks to Dimitris, Dave, Patrick, and Zandr Milewski (who does Mozilla Labs IT).

DoctorJS is up now, and we hope people find it useful, not just a curiosity. Is there another output format for summary jsctags or type information you would prefer, which is much more concise than the JSON currently served (so it could be worthwhile adding an app to serve that other format, instead of you having to transcode)? Are there other results you would like to see, e.g. linking uses of variables to their definitions? Or even complete JSON-encoded abstract syntax trees? Did you find what look like bugs? Please let us know.

Completely separate from DoctorJS, Dehydra, and other static analysis services and tools: an online type inference mostly-static analysis for JaegerMonkey, from the always-awesome Brian Hackett. This looks promising, although it is lower priority at the moment than other JM work.

BTW, I talked to Chris Williams of JSConf fame about DoctorJS in the current episode of A Minute With Brendan. Yes, I’m now nerding out in public about hot JS and Mozilla topics every week for a minute or so. Stay tuned, it’ll be a regular thing.