Determinism

A waterfall in Glen Nevis, Scotland

It's easy to get used to thinking about computers and software in non-deterministic terms, as if they operate on the basis of probabilities rather than by following strict lines of logic. As users, we become depressingly familiar with the variable responsiveness of GUIs, “random” crashes of applications and operating systems, and web sites that occasionally take a while to respond.

This impression also pervades software development. The bugs that consume the most time are those that cannot be reproduced by performing a fixed sequence of actions. There are many sources of non-determinism in modern software systems: distributed systems (which covers most databases), user interaction (particularly with multiple users), multi-threaded systems (with both operating system schedulers and SMP hardware acting as sources of non-determinism), external devices (including disks), and timing dependencies (both due to explicit functionality that is time dependent, and implicit timing dependencies within the code).
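
To make the multi-threaded case concrete, here is a minimal sketch in Go (my own illustration, not taken from any of the systems mentioned here): two goroutines increment a shared counter without synchronization, so the final value depends on how the scheduler interleaves them and can change from run to run.

    // Two goroutines race on a shared counter; the result varies per run.
    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        var wg sync.WaitGroup
        counter := 0 // shared state, deliberately unsynchronized

        for i := 0; i < 2; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for j := 0; j < 100000; j++ {
                    counter++ // data race: read-modify-write is not atomic
                }
            }()
        }
        wg.Wait()

        // A serialized run would print 200000; with the race, the value
        // depends on the interleaving chosen by the scheduler.
        fmt.Println("counter =", counter)
    }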

Another factor is the size and complexity of many modern software systems. Reasoning about such systems deterministically is often infeasible. When you are trying to analyze the cause of an observed event, the hypothetical chains of causality can rapidly fan out to involve state spread throughout the system (and perhaps other systems). Examining all the relevant source code is not feasible, and it may not all be available. And some of those chains probably come into contact with the sources of non-determinism mentioned above. Instead, we end up working on the basis of plausible guesses, usually involving “differential software analysis” — looking at whatever changed recently.

With all that, it's easy to forget that the heart of our machines still implements the sequential and deterministic von Neumann architecture. So I find it interesting when that determinism gets exposed. Here are a couple of cases that caught my attention recently:

VMware Workstation 6 has a Record/Replay feature. When recording, all of the non-deterministic inputs to a virtual machine are logged. VMware can then replay the session exactly, by executing a VM with the logged inputs. Vyacheslav Malyugin at VMware described how this can help with debugging in his article Workstation 6.0 and the death of irreproducible bugs:

Have you ever dealt with an irreproducible bug? The one that hits once in a blue moon and hides when you try to use any debugging tools? Well, since we also get them in VMware, we decided to do something about it. So we combined the gdb support in Workstation 6.0 with the Record/Replay. The result allows you to record the execution triggering the bug and then debug it with gdb as many times as you want, each time getting 100% reproducibility.

(Unsurprisingly, Record/Replay is mutually exclusive with SMP support.)
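
The core idea generalizes beyond virtual machines. As a rough sketch in Go (assuming nothing about VMware's actual implementation), wrap each non-deterministic input source so that "record" mode logs every value it hands out and "replay" mode feeds the log back in; a computation that only sees those inputs then reproduces its original run exactly.

    // Record/replay of a non-deterministic input source (illustrative only).
    package main

    import (
        "fmt"
        "math/rand"
    )

    // InputSource abstracts a source of non-deterministic inputs.
    type InputSource interface {
        Next() int
    }

    // recorder returns live values and appends each one to a log.
    type recorder struct {
        log *[]int
    }

    func (r recorder) Next() int {
        v := rand.Int() // stand-in for a real non-deterministic input
        *r.log = append(*r.log, v)
        return v
    }

    // replayer returns the previously logged values, in order.
    type replayer struct {
        log []int
        pos int
    }

    func (r *replayer) Next() int {
        v := r.log[r.pos]
        r.pos++
        return v
    }

    // compute stands in for the "guest" workload: given the same inputs,
    // it produces the same result.
    func compute(src InputSource) int {
        sum := 0
        for i := 0; i < 5; i++ {
            sum += src.Next() % 100
        }
        return sum
    }

    func main() {
        var log []int
        recorded := compute(recorder{log: &log})
        replayed := compute(&replayer{log: log})
        fmt.Println(recorded, replayed, recorded == replayed) // always true
    }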

The other story is by Eric Van Hensbergen, who works at IBM on the Blue Gene supercomputer. The processor chip used in Blue Gene has two cores. In two cores walk into a bar, Eric writes about what happened when they were first booting their port of the Plan 9 operating system on Blue Gene:

[...] We had assumed the one core was held in reset during boot.

We were wrong.

The freaky shit is, the damn OS booted to a prompt and I could type, list directories, screw around — and we were getting a single output stream that was more or less coherent. The cores were executing in such precise synchronization (in the same memory space no less) that they even were writing the same characters to the console buffers and incrementing the console pointer to the same value. Complete lock step.