Tail Calls and C

Some C compilers, such as gcc and clang, can perform tail call optimization (TCO). But not all calls that are in tail position (using an intuitive notion of what tail position means in C) will be subject to TCO. The documentation for these compilers is obscure about which calls are eligible for TCO. That's disappointing if you wish to write C code which exploits this optimization.

One reason for this obscurity might be a feature of the C language that can prevent TCO even when a call is syntactically in a tail position. Consider a called function that accesses local variables of the calling function via a pointer, e.g.:

void f(void)
    int x = 42;

void g(int *p)
    printf("%d\n", *p);

In this example, TCO cannot be applied to the call to g, because that would have the result that f's local variables are no longer available (having been cleaned off the stack). But the behaviour of this C code is well defined: g should be able to dereference the pointer to access the value stored in x.

That is a trivial example. But the issue doesn't only arise when pointers directly passed to a call in tail position. A pointer to a local variable of a calling function might be exposed through a less obvious route, such as a global variable or the heap. So if a pointer is taken to a local variable anywhere in the calling function, and that local variable remains in-scope at the site of a potential tail call, it might prevent TCO:

void f(void)
    int x = 42;

    global_var = &x;

    /* The compiler cannot perform TCO here,
     * unless it can establish that g does not
     * dereference the pointer in global_var. */

As the comment suggests, it's possible that the compiler can perform some analysis to establish that the called function does not in fact dereference the pointer to the local variable. But given the compilation model typically used by C compilers, it is optimistic to expect them to perform such analysis.

But perhaps there is a way to avoid this issue: If the programmer really wants the call to g to be eligible for TCO, they can make it explicit that the lifetime of x does not overlap the call by introducing a nested scope:

void f(void)
        int x = 42;

        global_var = &x;


Unfortunately, this does not have the desired effect for gcc (4.8.2) and clang (3.3). I have written a simple test suite to explore the TCO capabilities of gcc and clang, and it demonstrates that even with the nested scope, taking the pointer to x defeats TCO for f.

(In fact, even if the contents of the nested scope are hoisted into an inline function called from f, that is still sufficient to contaminate f and prevent TCO, in both gcc and clang.)

I'm not aware of other unrelated features of the C language that can pose an obstacle to TCO. But there are implementation issues in gcc and clang that can prevent TCO. That will be the subject of a future post.

Measuring humidity with a Raspberry Pi

I got a Raspberry Pi a few months ago, and one of the things I wanted to do with it was a bit of hardware hacking (the Raspberry Pi having an easily accessible IO header). But I didn't have a specific project in mind.

So I got a Adafruit Raspberry Pi Breakout Kit, hoping that is would act as a source of inspiration. When the novelty of playing about with LEDs and switches had worn off, I saw that Adafruit also has a very cost effective humidity sensor — the DHT22. The DHT22 is a fully integrated sensor that supplies digital relative humidity and temperature measurements. I have a not entirely frivolous reason to want to measure the humidity levels at home, so this seemed like a good project. But in the end, I chose a different sensor: the HYT-271, (bought from Farnell). The choice was because the DHT22 uses a custom bus protocol, which has to be bit-banged using GPIO pins. Adafruit has an article with sample code to do just that. But that wouldn't leave much for me to learn in the process. The HYT-271 is a little more expensive, but in contrast it uses a standard I²C interface, so it would give me an opportunity to learn something for myself while still staying close to well-trodden paths.

Connecting the HYT-271 to the Raspberry Pi

This part is easy: The four pins of the HYT-271 are wired to the corresponding pins on the Raspberry Pi's IO header (SDA, SCL, GND, and VDD to one of the 3.3V pins).

Because I²C is an open-drain bus, the SDA and SCL lines need pull-up resistors. The Raspberry Pi schematics show that it incorporates 1.8KΩ pull-up resistors on these lines, so external pull-ups are unnecessary. In fact, 1.8KΩ is close to the lowest value allowed for a 3.3V I²C bus (see this page), so it seems unlikely you would ever use external pull-ups with a Raspberry Pi.

I made the connections via the breakout kit and a breadboard. The pitch on the HYT-271's pins is 0.05 inches, but the pins are long enough that they can be carefully splayed to fit in the 0.1 inch pitch of a breadboard:

A HYT-271 humidity sensor connected to a Raspberry Pi

Enabling the I²C drivers

I'm running raspian on my Raspberry Pi. Using the I²C bus involves a small amount of configuration. I took these steps from this page about connecting to an I²C ADC. As root:

  1. Add i2c-dev to the end of /etc/modules (this allows userspace programs to access the I²C bus).
  2. Comment out the line in /etc/modprobe.d/raspi-blacklist.conf that says blacklist i2c-bcm2708 (apparently it is blacklisted simply because it is thought few users will need it).
  3. Install i2c-tools:
    apt-get install i2c-tools
  4. Add the relevant users to the i2c group, so that they can access the I²C devices:
    adduser USER i2c
  5. Reboot so that these changes take effect:

Once that's done, we can use i2c-detect to check whether the Raspberry Pi can see the HYT-271:

pi@raspberrypi /tmp $ i2cdetect -y bcm2708_i2c.1
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
00:          -- -- -- -- -- -- -- -- -- -- -- -- --
10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
20: -- -- -- -- -- -- -- -- 28 -- -- -- -- -- -- --
30: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
40: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
50: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
60: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
70: -- -- -- -- -- -- -- --

The “28” is the HYT-271, which uses I²C address 0x28, so things are looking good.

(The bus name bcm2708_i2c.1 is correct for the Raspberry Pi Revision 2. On Revision 1 boards, the I²C bus on the IO header is bcm2708_i2c.0.)

Ideally at this point we would be able to use the other i2c-tools commands to verify that the HYT-271 is functioning. Unfortunately, despite the name, i2c-tools has a strong emphasis on SMBus rather than generic I²C, and its i2cget and i2cset commands cannot issue raw I²C read and write transactions. So we need some custom code to proceed further.


Unfortunately, the documentation for the HYT series is lacking. The datasheets do not describe what I²C transactions are needed to get a reading from the sensor. Sample code is available on the hygrochip.com site, but the Arduino code seems to have some issues. So I examined their sample BASIC code to produce something that worked. In order to get a reading, you have to:

  1. Do a write transaction to begin a measurement (the data written seems to be irrelevant).
  2. Wait 60ms (if you do a read transaction immediately, you will get back the values for the previous measurement).
  3. Read the 4 bytes containing the humidity and temperature measurements.

(The sample Arduino code misses out steps 1 and 2, which will cause it to return the same values all the time.)

You can find my C program on github:

pi@raspberrypi /tmp/hygrochip-linux $ ./hyt-read
44.906307 21.798206

Shift Instructions

The bitwise shift instructions in the most common instruction set architectures have a quirk.

You can observe this with the following C program. It shifts 1 left by zero bits, then one bit, then two bits, then three bits, etc., printing the result:

#include <stdio.h>

int main(void) {
	unsigned int i;
	for (i = 0; i < 128; i++)
		printf("%u\n", 1U << i);

	return 0;

As you might expect, this program outputs increasing powers of two. But what happens when the shift count grows to the point where the set bit gets shifted off the left end of an unsigned int? A reasonable guess is that result should become zero, and stay at zero as the shift count increases further.

But if you compile and run the program on x86, the actual results look like this when plotted on a chart:

As expected, the result initially follows the exponential curve of powers of two. But when we reach the 1U << 32 case, and we might have expected a result of 0, the result actually returns to 1, and the function becomes periodic. The explanation for this is that the x86 SHL instruction only uses the bottom 5 bits of the shift count, and so the shift count is treated modulo 32.

By the way, if you try a similar experiment in languages other than C or C++, you probably won't see this behaviour. Only in C/C++ is the shift operation defined loosely enough that a compiler can use the unadorned machine instruction. Implementations of other languages do extra work to make their shift operations operate less surprisingly, and more consistently across different instruction set architectures.

Is this just a peculiar quirk of x86? Well, ARM does something similar. Here's a chart of the same program's output when running on ARM:

ARM's Logical shift left by register instruction operand type uses the bottom 8 bits of the shift count register. So 1U << i rises from one to 1U << 32, then drops to zero as the set bit is shifted off the end of the unsigned int. But then 1U << 256 returns to one, and the function repeats.

Why do x86 and ARM behave in this way? Historical reasons. Here's a note from the definition of the SHL instruction in Intel's Software Developer's Manual:

IA-32 Architecture Compatibility

The 8086 does not mask the shift count. However, all other IA-32 processors (starting with the Intel 286 processor) do mask the shift count to 5 bits, resulting in a maximum count of 31. This masking is done in all operating modes (including the virtual-8086 mode) to reduce the maximum execution time of the instructions.

This is clearly an old historical note (and not just because it is outdated — in x86-64, 64-bit shift operations mask the shift count to 6 bits). The cycle counts for shift instructions on the 8086 varied with the shift count, presumably because it implemented them with a microcode loop. But in later Intel x86 processors, shift instructions take a constant number of cycles, so the idea of a maximum execution time is an anachronism. And clearly it is never actually necessary for the hardware to do a shift by more than 32 or 64 bits: larger shift counts can be handled by simply zeroing the result (and detection of a large shift count can be done in parallel with a modulo shift, so it seems unlikely that this would be problematic in circuit design terms). This is confirmed by the SSE shift instructions (PSLL* etc.) which do not mask the shift count.

So it seems unlikely that a green-field instruction set would have these kind of quirks. They originated in processor designs many years ago that were constrained in the number of available transistors, and have been retained for compatibility.