Acquired

A couple of weeks ago I switched employers, but I'm only able to talk about it now that the deal has been officially announced: RabbitMQ has been acquired by VMware, and along with it, a team of developers from LShift. I'm happy to be a member of that team.

Strictly speaking, we've been acquired by SpringSource. But the boundary between SpringSource and the rest of VMware is porous. In any case, we'll be working out of a VMware office in London.

As some readers may know, I hold strong opinions about virtualization. I believe that full system virtualization rightfully belongs as an operating system feature, and I'm very happy to see the x86 PC platform and operating systems maturing to the point where that becomes a reality. Non-OS virtualization solutions are a temporary anomaly, a bit like DOS extenders were 20 years ago, and the sooner they go away, the better. For this reason, I have been an advocate of KVM for some time.

But with that said, I'm very excited to be joining VMware. They have been in the virtualization business a long time, and understand the needs of their customers in ways that are much broader than just system virtualization. I'm sure that the rest of the RabbitMQ team and I are going to have an interesting time there.

LShift will carry on much as before, and they will be hiring even more strenuously than usual. So if you are a top-notch developer in London who is looking for a stimulating working environment, you might want to contact them.

Pachuco ported to ARM

I had a few days of vacation to use up recently, and I spent some of the time working on pachuco. The main achievement was to port it to ARM. So now the compiler supports x86, x86-64, and ARM. The code is on github.

My main motivation for this project was to learn ARM machine code. The only general-purpose ISAs with a healthy future seem to be x86, x86-64 and ARM, but I hadn't done any low-level development on ARM until now. The port also proves that pachuco isn't tied to x86/x86-64. It didn't require any significant changes to the core of the compiler, though a lot of code got moved around to separate the target-specific parts from the target-independent parts.

The ARM machine I used for development is an NSLU2. This has a 266MHz XScale-IXP42x chip implementing the ARMv5 architecture, and 32MB of memory. It supports the THUMB instruction set, but I just used the main ARM instruction set.

A couple of things that arose in the process of developing the port strike me as worthy of note:

The first relates to the bootstrapping process. Although pachuco can compile itself, I still tend to develop under sbcl, because it makes identifying the causes of bugs much easier. But sbcl hasn't been ported to ARM, so I couldn't follow exactly the same process I followed on x86. The traditional way to port a self-compiler to a new platform is to cross-compile: run the compiler on a supported machine, but generating code for ARM; then copy the results across to the ARM machine to run, or more realistically, to find out how they fail to run. But following this process literally would introduce cumbersome steps into the edit-test cycle.

What I did instead was to substitute sbcl with a wrapper script that runs sbcl on a remote x86 system via ssh. The script automatically copies the necessary files back and forth. This is still cross-compiling, but that fact is hidden from everything but the wrapper script. This required almost no changes to the main Makefile and build scripts, and allowed me to maintain a simple and rapid edit-test cycle.
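
A minimal sketch of such a wrapper follows. It is not the actual script from the pachuco tree: the use of rsync, the REMOTE_DIR location, the build/ output directory and the function name are all assumptions made for illustration. Only the BOOTSTRAP_HOST variable name is taken from the make invocation in this post.

```shell
#!/bin/sh
# Hypothetical sketch of a remote-sbcl wrapper (not the real pachuco script).
# Assumptions: rsync and ssh are available, the remote copy of the tree
# lives in REMOTE_DIR, and generated files end up under build/.
BOOTSTRAP_HOST=${BOOTSTRAP_HOST:-x86-host}
REMOTE_DIR=${REMOTE_DIR:-/tmp/pachuco-remote}

remote_sbcl() {
    # 1. Push the local working tree to the x86 machine.
    ${RSYNC:-rsync} -a --delete ./ "$BOOTSTRAP_HOST:$REMOTE_DIR/" || return 1
    # 2. Run sbcl there, passing the caller's arguments through.
    #    ($* flattens the arguments, so arguments containing spaces
    #    are not preserved.)
    ${SSH:-ssh} "$BOOTSTRAP_HOST" "cd $REMOTE_DIR && sbcl $*" || return 1
    # 3. Pull the generated files back to the ARM machine.
    ${RSYNC:-rsync} -a "$BOOTSTRAP_HOST:$REMOTE_DIR/build/" ./build/
}
```

A real wrapper would end with a call such as remote_sbcl "$@", so that the Makefile can invoke it exactly as it would invoke sbcl.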

The second interesting obstacle became evident as I got close to completing the bootstrap process. It turned out that the bootstrap process would take almost an hour, rather than the one or two minutes I was expecting. The cause was the assembler. The pachuco compiler produces assembly code, and uses the system assembler (specifically gas) to turn that into an executable. The assembly file produced when pachuco compiles itself is about 130k lines, and with 32MB of memory, gas swaps a lot while processing that file. I can't see a good reason for gas to use so much memory (more than the pachuco compiler uses to hold the program), except that it is most often used in conjunction with gcc, and C source files tend to be limited in size.

The solution was to split the output of the pachuco compiler into many smaller 10k-line files. gas can assemble these without swapping, and the linker connects the program back together to make the executable. Achieving this involved shuffling the order of the generated assembly code, and using global rather than local assembly labels in the appropriate places.
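
The assemble-and-link step for the split output can be sketched like this. The chunk file names, the function name and the use of gcc as the link driver are assumptions; the real build scripts may well do it differently.

```shell
# assemble_chunks EXE CHUNK.s...: assemble each chunk on its own, then
# link the resulting objects into one executable.
# Labels referenced from another chunk must be global (.globl in gas),
# otherwise the link step reports undefined symbols.
assemble_chunks() {
    exe=$1; shift
    objs=
    for src in "$@"; do
        obj=${src%.s}.o
        # Each invocation only has to hold one small chunk in memory.
        ${AS:-as} -o "$obj" "$src" || return 1
        objs="$objs $obj"
    done
    # $objs is deliberately unquoted so it splits into separate arguments.
    ${CC:-gcc} -o "$exe" $objs
}
```

For example: assemble_chunks build/stage1 build/chunk-*.s (with hypothetical chunk names).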

Pachuco on ARM now bootstraps for me in a couple of minutes (compared to 20 seconds on my Core2 laptop). It's necessary to set several environment variables and makefile variables to get there, but most of those should go away as I refine the port.

dwragg@bb5a:/tmp/pachuco$ make clean ; HEAP_SIZE=8 BOOTSTRAP_HOST=192.168.1.65 BOOTSTRAP_COMPILER_REMOTE=/home/dwragg/work/pachuco/scripts/sbcl-wrapper time make BOOTSTRAP_COMPILER=scripts/remote-bootstrap CODEGEN=simple COMPILEOPTS="-S -s" 
rm -rf build
mkdir -p build
scripts/compile -C scripts/remote-bootstrap -S -s -o build/stage0-test test/test.pco
build/stage0-test
Tests done
mkdir -p build
scripts/compile -C scripts/remote-bootstrap -S -s -o build/stage0-gc-test test/gc-test.pco
build/stage0-gc-test
GC tests done
mkdir -p build
scripts/compile -C scripts/remote-bootstrap -S -s -o build/stage1 language/util.pco language/expander.pco language/interpreter.pco compiler/walker.pco compiler/mach.pco compiler/mach-32bit.pco compiler/mach-arm.pco compiler/compiler.pco compiler/codegen-simple.pco compiler/codegen-generic.pco compiler/codegen-arm.pco compiler/driver.pco compiler/drivermain.pco
mkdir -p build
scripts/compile -C build/stage1 -S -s -o build/stage2 language/util.pco language/expander.pco language/interpreter.pco compiler/walker.pco compiler/mach.pco compiler/mach-32bit.pco compiler/mach-arm.pco compiler/compiler.pco compiler/codegen-simple.pco compiler/codegen-generic.pco compiler/codegen-arm.pco compiler/driver.pco compiler/drivermain.pco
mkdir -p build
scripts/compile -C build/stage2 -S -s -o build/stage3 language/util.pco language/expander.pco language/interpreter.pco compiler/walker.pco compiler/mach.pco compiler/mach-32bit.pco compiler/mach-arm.pco compiler/compiler.pco compiler/codegen-simple.pco compiler/codegen-generic.pco compiler/codegen-arm.pco compiler/driver.pco compiler/drivermain.pco
cmp -s build/stage2.s build/stage3.s
114.33user 11.71system 2:37.19elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
109088inputs+42832outputs (526major+137598minor)pagefaults 0swaps

And I wore an onion on my belt, which was the style at the time

A local kernel exploit in the Linux kernel, involving access to a NULL pointer, was publicized recently and got a lot of attention. Jonathan Corbet provided a detailed two-part writeup on LWN.net (part 1, part 2).

The key to the vulnerability is this: If the kernel tries to dereference a NULL pointer, i.e. tries to access memory at address 0 or nearby, it actually accesses the virtual memory space of the current user process (since Linux on x86 gives the bottom 3GB of the virtual memory space to the user process and reserves the top 1GB for the kernel). And user processes can arrange for memory to be mapped at address 0 (at least under certain circumstances). So it is possible that a NULL pointer dereference in the kernel will not fail with a page-fault exception, as would usually be expected, but will actually return data that is controlled by the user process. This allows an exploit to be crafted.

One thing that I haven't seen mentioned in the ensuing coverage is the fact that, once upon a time, Linux excluded such vulnerabilities by design. Actually, I hadn't noticed, or had managed to forget, that Linux changed its design, so I was a bit surprised to learn that it is vulnerable to such exploits.

Back at the dawn of time, Linux on x86 used the same 3GB/1GB virtual memory split that it does today. But it also used segments to prevent unintended access from kernel code to user-space memory. The segments used when executing kernel code covered only the top 1GB of the linear memory space, so that it was impossible to accidentally access user-space addresses. Address 0 at the bottom of the kernel's segment actually referred to linear address 0xc0000000, inside the kernel's address space, away from the control of user processes. When kernel code really wanted to read or write the memory of a user process, it had to call special functions to do so, which used non-default segment registers.

This changed in linux-2.1.0, back in 1996. In fact, this was the major change separating 2.1.x from the 2.0.x series, and Linus devoted his pre-2.1.0 release note to this topic. Since then, Linux on x86 has used a “flat” segment model for the kernel: The segments simply cover the whole of the linear address space, from 0 to 4GB. Linus didn't highlight one unfortunate consequence of this change — user processes can control whether NULL pointer dereferences from inside the kernel succeed, and what data they yield.
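
The difference between the two models comes down to the segment base that the CPU adds to a kernel pointer. A small illustration, using shell arithmetic purely as a calculator (the function name is made up for this example):

```shell
# linear_address BASE OFFSET: the linear address formed from a segment
# base and an offset within that segment.
linear_address() {
    printf '0x%08x\n' $(( $1 + $2 ))
}

# Old model: kernel segment based at 3GB, so a NULL dereference still
# lands inside the kernel's part of the linear address space.
linear_address 0xc0000000 0    # prints 0xc0000000
# Flat model: base 0, so a NULL dereference lands on linear address 0,
# which a user process may have mapped.
linear_address 0 0             # prints 0x00000000
```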

I'm not aware of anything in principle that would prevent this change being reverted on x86. But the x86-64 architecture more or less disables segmentation, so it can't support a rigid distinction between the kernel and user address spaces in the same way (in fact, I can't think of any practical way to achieve something like that on x86-64). In contrast, the RISC architectures supported by Linux tend to include some notion of address space identifiers, which are used to distinguish the user memory space from the kernel memory space, so they do not have such vulnerabilities. Certainly SPARC does this, and even ARM seems to include appropriate facilities.

All of this is probably only of historical interest. But I do find the present solutions to this vulnerability, which restrict how a user process can arrange its address space, to be regrettable. There is a pleasing purity in the idea that user processes should be able to arrange their address space however they like, and the same for the kernel, without interactions between them.
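
The restriction in question is presumably the vm.mmap_min_addr sysctl, which sets the lowest address that an unprivileged process may map. A quick way to inspect it, as a sketch (the helper name and its optional file argument are inventions for this example):

```shell
# Report the lowest address that unprivileged processes may map.
# The optional argument only exists so the function can be pointed at
# another file; by default it reads the real setting via /proc.
show_mmap_min_addr() {
    file=${1:-/proc/sys/vm/mmap_min_addr}
    if [ -r "$file" ]; then
        echo "mmap_min_addr = $(cat "$file") bytes"
    else
        echo "mmap_min_addr not available" >&2
        return 1
    fi
}
```

A value of 0 means the restriction is switched off, and mappings at address 0 are allowed again.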

Kernel Mode Setting for some ATI GPUs breaks suspend/resume

Another Fedora post.

I ran into this bug in Fedora 10, and by looking at those bug reports, I was able to work around it. But I had forgotten all about that when I installed Fedora 11. So I'm writing this partly as a note for myself in case the bug still exists in Fedora 12.

The problem is that Fedora incorporates something called Kernel Mode Setting for ATI Radeon GPUs. KMS is a technology that gives the kernel more responsibility for managing the graphics hardware, rather than leaving it in the hands of the X server. Unfortunately, there is a bug in the Radeon KMS support, at least for the Radeon Mobility X1400 chip in my Thinkpad T60. It will suspend, but the video never comes back during a resume.

The workaround is to disable KMS. This can be done by adding the nomodeset option to the kernel command line in /boot/grub/grub.conf. You'll need to reboot for this to take effect.
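
The edit amounts to appending nomodeset to each kernel line. A sketch that does this as a filter, assuming the legacy GRUB layout in which such lines begin with the word kernel (the function name is made up for this example):

```shell
# Append "nomodeset" to every GRUB "kernel" line that lacks it.
# Reads a grub.conf on stdin and writes the modified file to stdout.
add_nomodeset() {
    sed '/^[[:space:]]*kernel[[:space:]]/{
/nomodeset/!s/[[:space:]]*$/ nomodeset/
}'
}
```

It can be run as add_nomodeset < /boot/grub/grub.conf > /tmp/grub.conf.new, so that the result can be checked before it is copied over the original.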

KMS is a worthwhile technology, if for no other reason than making the Fedora boot process much prettier. But with suspend working, I rarely have to watch the boot process, so its appearance is not a major concern for me.

(On a distantly related note, I see that the GGI/KGI project is still just about going. That takes me back a few years.)

Fedora 11, netbooks, and using ext4 without a journal

Fedora 11 is out. One of the highlights of this release is support for ext4 (the successor to ext3 as the mainstream filesystem in the Linux world). But Fedora 11 doesn't just support ext4. It really wants you to use ext4; in fact, the Live CD won't install to any other filesystem.

(This is because an install from the Fedora Live CD works simply by copying the ext4 filesystem image contained on the Live CD onto the target storage device, and then resizing it. This makes such installs astonishingly fast, faster than any other method I've seen to install a comparably complete Linux system. The downside is a lack of flexibility.)

I don't have too many reservations about using ext4 on my personal systems. But it did raise an issue when I came to upgrade my Acer Aspire One netbook. The model I have uses a 7.5GB SSD as its main storage. Given the price of the whole machine, it's a safe assumption that this SSD is a cheap part, without the same kind of sophisticated wear levelling algorithms that server SSDs have. But ext4 is a journaled filesystem, just like ext3. The journal is a small area of storage which gets written to for each filesystem transaction, so it's conceivable that it could wear out a region of the SSD. For this reason, netbooks that come with Linux pre-installed tend to use the ext2 filesystem, which isn't journaled. And when I had Fedora 10 installed on mine, I switched the filesystem from ext3 to ext2 for that same reason. ext3 uses similar on-disk structures to ext2, so this is a straightforward operation, but some of the new features in ext4 mean this kind of downgrade is not possible.

But there is a solution to this: ext4 has an unjournaled mode. You can remove the journal from an ext4 filesystem with the tune2fs command:

tune2fs -O ^has_journal /dev/sdXX

The only problem is that this feature was introduced fairly recently, and some tools have yet to catch up. In particular, the version of e2fsprogs currently in Fedora 11 won't read the UUIDs from such filesystems. And because the Fedora installer generates an /etc/fstab which identifies filesystems by UUID, if you just run the above command, the system can't find its root filesystem and so doesn't get very far through the boot process.

So what you need to do to disable journaling is boot from the Live CD, mount the system partition and any boot partition, and edit /etc/fstab and /boot/grub/grub.conf to replace UUID references with old-fashioned /dev/sdXX block device names. Then unmount them, run the tune2fs command above, and you should have a functional system without a journal.
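
The UUID replacement can be sketched as a filter that swaps one UUID= reference for a device name. The function name is made up for this example, and the UUID and device in the usage line are placeholders.

```shell
# replace_uuid UUID DEVICE: rewrite "UUID=<uuid>" references to DEVICE.
# Reads the file on stdin and writes the result to stdout. It works for
# both /etc/fstab entries and "root=UUID=..." in grub.conf.
replace_uuid() {
    sed "s|UUID=$1|$2|g"
}
```

For example: replace_uuid 1234-abcd /dev/sda2 < etc/fstab > etc/fstab.new, run from the mounted system partition. After running tune2fs, the output of tune2fs -l /dev/sdXX should no longer list has_journal among the filesystem features.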

Update: there is a comprehensive list of tips to customize Fedora 11 for SSD-based netbooks here. It explains in more detail how to remove the journal from an ext4 filesystem.

Grains of rice on a chess board

Today I read an IBM advertisement in a magazine. The same text can be found on the IBM site here. It contains the following sentence:

Experts predict that by 2010, the amount of digital information will double every 11 hours.

Experts in what, I wonder?
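
To put a number on it, here is a back-of-the-envelope check with awk of what one doubling every 11 hours adds up to over a year (the function name is just for this example):

```shell
# Doublings per year at one doubling every 11 hours, and the resulting
# growth factor expressed as a power of ten.
doubling_growth() {
    awk 'BEGIN {
        doublings = 365 * 24 / 11
        printf "%.0f doublings per year, a factor of about 10^%.0f\n",
               doublings, doublings * log(2) / log(10)
    }'
}

doubling_growth
# prints: 796 doublings per year, a factor of about 10^240
```

For comparison, the chess board of the title needs only 63 doublings to get from the first square to the last.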

DUI.Stream and HTTP request pipelining

Today I read about DUI.Stream:

We call this technique MXHR (short for Multipart XMLHttpRequests), and we wrote an addition to our Digg User Interface library called DUI.Stream to implement it. Specifically, DUI.Stream opens and reads multipart HTTP responses piece-by-piece through an XHR, passing each chunk to a JavaScript handler as it loads.

But how is that fundamentally different from HTTP request pipelining, where an HTTP client sends a stream of HTTP requests without waiting for a response before issuing the next request? HTTP request pipelining was standardised in RFC2616 - HTTP/1.1.
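
To make the comparison concrete, this is roughly what pipelining looks like on the wire: several requests written back to back on one connection before any response is read. The function name is invented for this sketch, and the host and paths in the usage line are placeholders.

```shell
# pipeline_requests HOST PATH...: print one HTTP/1.1 GET request per
# path, back to back, asking the server to close after the last one.
pipeline_requests() {
    host=$1; shift
    while [ $# -gt 0 ]; do
        printf 'GET %s HTTP/1.1\r\nHost: %s\r\n' "$1" "$host"
        # Only the final request carries "Connection: close".
        [ $# -eq 1 ] && printf 'Connection: close\r\n'
        printf '\r\n'
        shift
    done
}
```

Something like pipeline_requests example.com /a.css /b.js | nc example.com 80 sends both requests at once; a server that supports pipelining returns the responses in the same order.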

Well, one major difference is that Digg can allow all their users to take advantage of their technique today. Amazingly, HTTP request pipelining is still disabled by default in the current releases of Firefox. On the IE side, it has been enabled in IE 8.

I notice that the 10th birthday of RFC2616 is coming up in June. I wonder if we should all mark the occasion by enabling HTTP request pipelining, by upgrading our browsers, or by going to about:config and tweaking it manually.

The grim reality of C

In the last few years, I haven't had cause to write much C code; just enough to stop me getting rusty. I remain very fond of the C programming language, and sometimes I find myself thinking that programming in C has its advantages, despite the need to pay attention to a few details that are not an issue in higher-level languages. Maybe C is not so bad. Maybe, if I had to tackle a project with the right set of requirements, I'd work in C again.

But coming into contact with C code can be like reading about surgery before anaesthetic: The unwelcome and painful realization that people had to live like that for so long, and that in some primitive parts of the world, they still do.

Although I don't get to write much C code, I do still regularly refer to C code written by others. But there's a depth of understanding of a language and its milieu that can only be gained by writing a significant amount of code in that language, or making significant changes to an existing project. Today, I've been doing the latter.

The project in question is a relatively successful open source project, of a non-trivial size (150k lines). The leaders and main contributors appear to be competent and experienced, and have made sensible design decisions. After a superficial review of the code, my response was quite positive. It adheres to a range of uncontroversial coding standards. It is consistently formatted. And it uses utility functions where appropriate to handle error-prone tasks safely. In particular, it avoids the direct use of malloc and friends, and instead uses a set of helper macros and functions for allocation and deallocation of common data structures and arrays.

But after a few hours of working on this codebase, I began to realize that it is riddled with allocation and deallocation bugs. The helpers help, but not enough. The consequence of the smallest confusion over the ownership of some data structure is a memory leak.

I'm not going to name the project, because I continue to believe that its overall quality is very good, and that the developers have tried hard to get these issues right. What it brings home to me is a fact that I haven't been forced to face for a long time: while bad programmers can write bad code in any language, even good programmers find it hard to write good, bug-free code in C.

USB pass-through with libvirt and KVM, part two

A recent post here discussed how to enable USB pass-through under libvirt. But the technique there only allowed devices to be added when a VM was next started; you couldn't connect devices to a running VM.

This is a fairly major limitation. At best, the need to restart the VM is inconvenient. At worst, it makes the use of a USB device impossible. The full use of many USB devices involves disconnecting and reconnecting them to a running machine.

Recent versions of QEMU (and so KVM) do in fact have support for this; it's just that libvirt doesn't expose it. The QEMU support involves marking USB devices (identified by a vendor/product ID pair) for autoconnect. QEMU will then listen for connection events from the host OS, and respond to connections from the relevant devices by signalling a connection of the pass-through device to an emulated USB hub within the VM.

When libvirt runs KVM/QEMU, it specifies the pass-through USB devices through command line options. So by introducing a wrapper script that rewrites the command line options, we can enable autoconnect under libvirt.

Note that you will need a recent version of the KVM userspace support for this to work. It works for me with kvm-84. kvm-74, as included in Fedora 10 and Ubuntu 8.10, is not recent enough.

Here is the script I use:

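#!/bin/sh
# Turn "-usbdevice host:X" into "-usbdevice host:auto:*.*:X", then run the real binary.
# Caveat: echo $* re-splits on whitespace, so no argument may contain spaces.
exec /usr/bin/qemu-kvm `echo $* | sed 's|-usbdevice host:\([^ ]*\)|-usbdevice host:auto:*.*:\1|g'`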

Note that different Linux distributions use slightly different names for the KVM/QEMU binary. /usr/bin/qemu-kvm above is correct for RHEL/Fedora systems; under Ubuntu, you should substitute /usr/bin/kvm.
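
To see what the rewrite does, the same sed expression can be run on a sample option string. The vendor:product pair here is a made-up example, and the function name exists only for this demonstration.

```shell
# Same substitution as in the wrapper script, applied to stdin.
rewrite_usb_args() {
    sed 's|-usbdevice host:\([^ ]*\)|-usbdevice host:auto:*.*:\1|g'
}

echo '-usb -usbdevice host:1234:5678' | rewrite_usb_args
# prints: -usb -usbdevice host:auto:*.*:1234:5678
```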

Save the script as /etc/libvirt/qemu/qemu-kvm, and make it executable. With that in place, you need to tell libvirt to use it instead of the real KVM binary. Do that by editing the VM XML description as described in my previous post. You need to edit the //domain/devices/emulator entry to refer to the wrapper script, e.g.

<domain type='kvm'>
  <name>windowsxp</name>
  …
  <devices>
    <emulator>/etc/libvirt/qemu/qemu-kvm</emulator>
    …
  </devices>
</domain>

Restart the relevant VMs, and USB pass-through with autoconnect should now work.

It's tempting to ask why this functionality isn't built into libvirt. My impression is that they are aiming for something more ambitious: The ability to enumerate devices on the host and then selectively pass those through to running VMs. This will be good when it's done; it's just a shame that it isn't there yet.

Blogging for The Man

I have a blog post up on the company blog. But it's not about work (well, not work work), so if you are one of my small band of regular readers here, you should take a look.