Using debootstrap to create a base system for qemu

Recently I wrote about how to debug the Linux kernel running under qemu. There I showed how to give the emulated kernel access to the host's filesystem. But that access was read-only, as the consequences of giving the guest kernel write access to the host's filesystem could be drastic. On the other hand, not being able to write to the filesystem limits the kinds of activities that can be debugged.

Fortunately, it's easy to extend the approach to provide a writable filesystem, without going to the trouble of doing a full Linux guest install. The debootstrap tool quickly builds a Debian base system in a directory (and you don't even have to be running Debian to use it — it's in the Fedora repos). As root, do:

# debootstrap --variant=minbase sid guest-root-dir

(The --variant=minbase option requests an absolutely minimal system. Skip it for a not-quite-so-minimal system, or use the --include=pkg1,pkg2,... option to include other Debian packages in the system.)
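
For example, to pull a few extra tools into the guest straight away (these particular package names are just an illustration):

# debootstrap --variant=minbase --include=procps,iproute2,strace sid guest-root-dir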

Then the options to qemu are changed slightly to use this new filesystem, and to allow read-write access to it. Also note that you now need to run qemu as root, so that it can set ownership on files within the exposed filesystem.

# qemu-system-x86_64 -s -nographic \
        -kernel kernel tree path/arch/x86/boot/bzImage \
        -fsdev local,id=root,path=guest-root-dir,security_model=passthrough \
        -device virtio-9p-pci,fsdev=root,mount_tag=/dev/root \
        -append 'root=/dev/root rw rootfstype=9p rootflags=trans=virtio console=ttyS0 init=/bin/sh'

Then you can connect gdb to qemu as in the previous post.

Debugging the Linux kernel with qemu and gdb

Recently I wanted to run the Linux kernel under a debugger to understand the finer points of the networking code. It's easy to do this using qemu's gdb support, but the details are scattered in various places. This post pulls them together.

You can debug the kernel in the context of a full VM image. But qemu provides a more convenient alternative: You can give the guest kernel access to the host filesystem (this uses the 9P remote filesystem, running over the virtio transport rather than a network). That way, we can make use of binaries we have lying around on the host system.

First, we have to build the kernel. Of course, in order to use binaries from the host system, the architecture should match. And to be able to explore the running kernel, gdb needs debug information, so your .config should have:

CONFIG_DEBUG_INFO=y

For filesystem access, you'll need virtio and 9P support:

CONFIG_VIRTIO=y
CONFIG_VIRTIO_PCI=y
CONFIG_NET_9P=y
CONFIG_NET_9P_VIRTIO=y
CONFIG_9P_FS=y

Other than that, the kernel configuration can be bare-bones. You don't need most device drivers. You won't need kernel module support. You won't need normal filesystems (just procfs and sysfs). So you can start from the default kernel config and turn a lot of things off. My .config for 3.17rc5 and x86-64 is here.

Because we're using the host's filesystem, there's no guest disk image to prepare, so we're now ready to launch the kernel under qemu and gdb. I'm using qemu-1.6.2 and gdb-7.7.1 from Fedora 20. Start qemu in one terminal window (as an ordinary user, you don't need root for this) with:

$ qemu-system-x86_64 -s -nographic \
        -kernel kernel tree path/arch/x86/boot/bzImage \
        -fsdev local,id=root,path=/,readonly,security_model=none \
        -device virtio-9p-pci,fsdev=root,mount_tag=/dev/root \
        -append 'root=/dev/root ro rootfstype=9p rootflags=trans=virtio console=ttyS0 init=/bin/sh'

Here:

  • The -s option enables gdb target support (qemu listens for a gdb connection on TCP port 1234).
  • The -kernel option boots the specified kernel directly, rather than going through the normal emulated boot process.
  • -fsdev ...,path=/,readonly,security_model=none tells qemu to give read-only access to the host filesystem (see this follow-up for read-write access).
  • The -append option adds kernel command line parameters to tell the kernel to use the 9P filesystem as the root filesystem, to use a serial console (i.e. the terminal where you ran qemu), and to boot directly into a shell rather than into /sbin/init.

You should see the kernel boot messages appear, ending with a shell prompt. The qemu console obeys some key sequences beginning with control-A: Most importantly, C-a h for help and C-a x to terminate qemu.

Then in another terminal run gdb with:

$ gdb kernel tree path/vmlinux
GNU gdb (GDB) Fedora 7.7.1-18.fc20
[...]
Reading symbols from vmlinux...done.
(gdb) target remote :1234
Remote debugging using :1234
atomic_read (v=<optimized out>) at ./arch/x86/include/asm/atomic.h:27
27              return (*(volatile int *)&(v)->counter);

The guest kernel is stopped at this point, so you can set breakpoints etc. before resuming it with continue.
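
Since I was interested in the networking code, for example, the next step might look something like this (ip_rcv is just an illustrative choice of breakpoint):

(gdb) break ip_rcv
(gdb) continue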

A few caveats:

Because we passed init=/bin/sh on the kernel command line, there was no init system to set up various things that are normally present on a Linux system. For instance, the proc and sys filesystems are missing, and the loopback network interface has not been started. You can fix those issues with the following commands:

sh-4.2# export PATH=$PATH:/sbin:/usr/sbin
sh-4.2# mount -t proc none /proc
sh-4.2# mount -t sysfs none /sys
sh-4.2# ip addr add 127.0.0.1 dev lo
sh-4.2# ip link set dev lo up

Another consequence of starting bash directly from the kernel is this warning:

sh: cannot set terminal process group (-1): Inappropriate ioctl for device
sh: no job control in this shell

Due to this lack of job control, you won't be able to interrupt commands with control-C. So be careful that you don't lose your shell to a command that runs forever!

qemu has a -S option which doesn't start the guest until you connect with gdb and tell it to continue, so you can use gdb to debug the boot process. But I've found that doing that with x86_64 kernels tends to trigger a recent bug in qemu's gdb support. (That bug only affects x86_64 guests, so you can avoid it by building the emulated kernel for i386 or another arch. But then you can't share the filesystem from an x86_64 host.)
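
When it does work (for instance with an i386 guest), the workflow is roughly: add -S alongside -s in the qemu command above, then in gdb connect and set a breakpoint before letting the guest run. A hardware breakpoint set with hbreak tends to be more dependable this early in boot:

(gdb) target remote :1234
(gdb) hbreak start_kernel
(gdb) continue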

Tail Calls and C

Some C compilers, such as gcc and clang, can perform tail call optimization (TCO). But not all calls that are in tail position (using an intuitive notion of what tail position means in C) will be subject to TCO. The documentation for these compilers says little about which calls are eligible for TCO. That's disappointing if you wish to write C code that exploits this optimization.

One reason for this obscurity might be a feature of the C language that can prevent TCO even when a call is syntactically in a tail position. Consider a called function that accesses local variables of the calling function via a pointer, e.g.:

#include <stdio.h>

void g(int *p);

void f(void)
{
    int x = 42;
    g(&x);
}

void g(int *p)
{
    printf("%d\n", *p);
}

In this example, TCO cannot be applied to the call to g, because that would have the result that f's local variables are no longer available (having been cleaned off the stack). But the behaviour of this C code is well defined: g should be able to dereference the pointer to access the value stored in x.

That is a trivial example. But the issue doesn't only arise when a pointer is passed directly to a call in tail position. A pointer to a local variable of the calling function might be exposed through a less obvious route, such as a global variable or the heap. So if a pointer is taken to a local variable anywhere in the calling function, and that local variable remains in scope at the site of a potential tail call, it might prevent TCO:

void f(void)
{
    int x = 42;

    global_var = &x;
    ...

    /* The compiler cannot perform TCO here,
     * unless it can establish that g does not
     * dereference the pointer in global_var. */
    g();
}

As the comment suggests, it's possible that the compiler can perform some analysis to establish that the called function does not in fact dereference the pointer to the local variable. But given the compilation model typically used by C compilers, in which the called function is often defined in a separate translation unit and so its body is not visible at the call site, it is optimistic to expect them to perform such analysis.

But perhaps there is a way to avoid this issue: If the programmer really wants the call to g to be eligible for TCO, they can make it explicit that the lifetime of x does not overlap the call by introducing a nested scope:

void f(void)
{
    {
        int x = 42;

        global_var = &x;
        ...
    }

    g();
}

Unfortunately, this does not have the desired effect for gcc (4.8.2) and clang (3.3). I have written a simple test suite to explore the TCO capabilities of gcc and clang, and it demonstrates that even with the nested scope, taking the pointer to x defeats TCO for f.

(In fact, even if the contents of the nested scope are hoisted into an inline function called from f, that is still sufficient to contaminate f and prevent TCO, in both gcc and clang.)
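
If you want to check a particular case yourself, a crude way (just an illustration, not how the test suite works) is to compile with -S and inspect the assembly generated for f: a jmp to g means the call became a tail call, while a call to g means it did not. Assuming the example functions are in a file called tco.c:

$ gcc -O2 -S tco.c
$ grep -E 'jmp|call' tco.s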

I'm not aware of any other features of the C language that can pose an obstacle to TCO. But there are implementation issues in gcc and clang that can prevent TCO. That will be the subject of a future post.

Measuring humidity with a Raspberry Pi

I got a Raspberry Pi a few months ago, and one of the things I wanted to do with it was a bit of hardware hacking (the Raspberry Pi having an easily accessible IO header). But I didn't have a specific project in mind.

So I got an Adafruit Raspberry Pi Breakout Kit, hoping that it would act as a source of inspiration. When the novelty of playing about with LEDs and switches had worn off, I saw that Adafruit also has a very cost-effective humidity sensor — the DHT22. The DHT22 is a fully integrated sensor that supplies digital relative humidity and temperature measurements. I have a not entirely frivolous reason to want to measure the humidity levels at home, so this seemed like a good project. But in the end, I chose a different sensor: the HYT-271 (bought from Farnell). I made that choice because the DHT22 uses a custom bus protocol, which has to be bit-banged using GPIO pins. Adafruit has an article with sample code to do just that, but that wouldn't leave much for me to learn in the process. The HYT-271 is a little more expensive, but in contrast it uses a standard I²C interface, so it would give me an opportunity to learn something for myself while still staying close to well-trodden paths.

Connecting the HYT-271 to the Raspberry Pi

This part is easy: The four pins of the HYT-271 are wired to the corresponding pins on the Raspberry Pi's IO header (SDA, SCL, GND, and VDD to one of the 3.3V pins).

Because I²C is an open-drain bus, the SDA and SCL lines need pull-up resistors. The Raspberry Pi schematics show that it incorporates 1.8kΩ pull-up resistors on these lines, so external pull-ups are unnecessary. In fact, 1.8kΩ is close to the lowest value allowed for a 3.3V I²C bus (see this page), so it seems unlikely you would ever use external pull-ups with a Raspberry Pi.

I made the connections via the breakout kit and a breadboard. The pitch on the HYT-271's pins is 0.05 inches, but the pins are long enough that they can be carefully splayed to fit in the 0.1 inch pitch of a breadboard:

A HYT-271 humidity sensor connected to a Raspberry Pi

Enabling the I²C drivers

I'm running Raspbian on my Raspberry Pi. Using the I²C bus involves a small amount of configuration. I took these steps from this page about connecting to an I²C ADC. As root:

  1. Add i2c-dev to the end of /etc/modules (this allows userspace programs to access the I²C bus).
  2. Comment out the line in /etc/modprobe.d/raspi-blacklist.conf that says blacklist i2c-bcm2708 (apparently it is blacklisted simply because it is thought few users will need it).
  3. Install i2c-tools:
    apt-get install i2c-tools
  4. Add the relevant users to the i2c group, so that they can access the I²C devices:
    adduser USER i2c
  5. Reboot so that these changes take effect:
    reboot

Once that's done, we can use i2cdetect to check whether the Raspberry Pi can see the HYT-271:

pi@raspberrypi /tmp $ i2cdetect -y bcm2708_i2c.1
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
00:          -- -- -- -- -- -- -- -- -- -- -- -- --
10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
20: -- -- -- -- -- -- -- -- 28 -- -- -- -- -- -- --
30: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
40: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
50: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
60: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
70: -- -- -- -- -- -- -- --

The “28” is the HYT-271, which uses I²C address 0x28, so things are looking good.

(The bus name bcm2708_i2c.1 is correct for the Raspberry Pi Revision 2. On Revision 1 boards, the I²C bus on the IO header is bcm2708_i2c.0.)
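
If you're not sure what the bus is called on your board, i2cdetect -l lists the I²C buses it can find:

pi@raspberrypi /tmp $ i2cdetect -l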

Ideally at this point we would be able to use the other i2c-tools commands to verify that the HYT-271 is functioning. Unfortunately, despite the name, i2c-tools has a strong emphasis on SMBus rather than generic I²C, and its i2cget and i2cset commands cannot issue raw I²C read and write transactions. So we need some custom code to proceed further.

Code

Unfortunately, the documentation for the HYT series is lacking. The datasheets do not describe what I²C transactions are needed to get a reading from the sensor. Sample code is available on the hygrochip.com site, but the Arduino code seems to have some issues. So I examined their sample BASIC code to produce something that worked. In order to get a reading, you have to:

  1. Do a write transaction to begin a measurement (the data written seems to be irrelevant).
  2. Wait 60ms (if you do a read transaction immediately, you will get back the values for the previous measurement).
  3. Read the 4 bytes containing the humidity and temperature measurements.

(The sample Arduino code misses out steps 1 and 2, which will cause it to return the same values all the time.)
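
To make those steps concrete, here is a stripped-down sketch using the Linux i2c-dev interface. The bus device (/dev/i2c-1, for a Revision 2 board) and the scaling of the 14-bit raw values to 0-100 %RH and -40 to 125 °C are my reading of the datasheet ranges rather than anything spelled out there, so treat them as assumptions:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/i2c-dev.h>

int main(void)
{
    unsigned char buf[4] = { 0 };
    unsigned int raw_hum, raw_temp;

    /* The HYT-271 responds at I2C address 0x28. */
    int fd = open("/dev/i2c-1", O_RDWR);
    if (fd < 0 || ioctl(fd, I2C_SLAVE, 0x28) < 0) {
        perror("i2c");
        return 1;
    }

    /* Step 1: a write transaction starts a measurement
     * (the byte written doesn't seem to matter). */
    if (write(fd, buf, 1) != 1) {
        perror("write");
        return 1;
    }

    /* Step 2: give the sensor time to complete the measurement. */
    usleep(60000);

    /* Step 3: read the four result bytes. */
    if (read(fd, buf, 4) != 4) {
        perror("read");
        return 1;
    }

    /* The top two bits of the first byte are status flags; the rest of
     * the first two bytes is a 14-bit humidity value. The last two bytes
     * carry a 14-bit temperature value in their upper bits. */
    raw_hum = ((buf[0] & 0x3f) << 8) | buf[1];
    raw_temp = (buf[2] << 6) | (buf[3] >> 2);

    printf("%f %f\n", raw_hum * 100.0 / 16383,
           raw_temp * 165.0 / 16383 - 40.0);

    close(fd);
    return 0;
}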

You can find my C program on github:

pi@raspberrypi /tmp/hygrochip-linux $ ./hyt-read
44.906307 21.798206

Shift Instructions

The bitwise shift instructions in the most common instruction set architectures have a quirk.

You can observe this with the following C program. It shifts 1 left by zero bits, then one bit, then two bits, then three bits, etc., printing the result:

#include <stdio.h>

int main(void) {
	unsigned int i;
	for (i = 0; i < 128; i++)
		printf("%u\n", 1U << i);

	return 0;
}

As you might expect, this program outputs increasing powers of two. But what happens when the shift count grows to the point where the set bit gets shifted off the left end of an unsigned int? A reasonable guess is that result should become zero, and stay at zero as the shift count increases further.

But if you compile and run the program on x86, the actual results look like this when plotted on a chart:

As expected, the result initially follows the exponential curve of powers of two. But when we reach the 1U << 32 case, and we might have expected a result of 0, the result actually returns to 1, and the function becomes periodic. The explanation for this is that the x86 SHL instruction only uses the bottom 5 bits of the shift count, and so the shift count is treated modulo 32.

By the way, if you try a similar experiment in languages other than C or C++, you probably won't see this behaviour. Only in C/C++ is the shift operation defined loosely enough that a compiler can use the unadorned machine instruction. Implementations of other languages do extra work to make their shift operations operate less surprisingly, and more consistently across different instruction set architectures.

Is this just a peculiar quirk of x86? Well, ARM does something similar. Here's a chart of the same program's output when running on ARM:

ARM's "logical shift left by register" operand type uses the bottom 8 bits of the shift count register. So 1U << i rises as before until, at 1U << 32, the set bit is shifted off the end of the unsigned int and the result drops to zero. But then 1U << 256 returns to one, and the function repeats.

Why do x86 and ARM behave in this way? Historical reasons. Here's a note from the definition of the SHL instruction in Intel's Software Developer's Manual:

IA-32 Architecture Compatibility

The 8086 does not mask the shift count. However, all other IA-32 processors (starting with the Intel 286 processor) do mask the shift count to 5 bits, resulting in a maximum count of 31. This masking is done in all operating modes (including the virtual-8086 mode) to reduce the maximum execution time of the instructions.

This is clearly an old historical note (and not just because it is outdated — in x86-64, 64-bit shift operations mask the shift count to 6 bits). The cycle counts for shift instructions on the 8086 varied with the shift count, presumably because it implemented them with a microcode loop. But in later Intel x86 processors, shift instructions take a constant number of cycles, so the idea of a maximum execution time is an anachronism. And clearly it is never actually necessary for the hardware to do a shift by more than 32 or 64 bits: larger shift counts can be handled by simply zeroing the result (and detection of a large shift count can be done in parallel with a modulo shift, so it seems unlikely that this would be problematic in circuit design terms). This is confirmed by the SSE shift instructions (PSLL* etc.) which do not mask the shift count.

So it seems unlikely that a green-field instruction set would have these kind of quirks. They originated in processor designs many years ago that were constrained in the number of available transistors, and have been retained for compatibility.