Shift Instructions

The bitwise shift instructions in the most common instruction set architectures have a quirk.

You can observe this with the following C program. It shifts 1 left by zero bits, then one bit, then two bits, then three bits, etc., printing the result:

#include <stdio.h>

int main(void) {
	unsigned int i;
	for (i = 0; i < 512; i++)
		printf("%u\n", 1U << i);

	return 0;
}

As you might expect, this program outputs increasing powers of two. But what happens when the shift count grows to the point where the set bit gets shifted off the left end of an unsigned int? A reasonable guess is that the result should become zero, and stay at zero as the shift count increases further.

But if you compile and run the program on x86, the actual results look like this when plotted on a chart:

As expected, the result initially follows the exponential curve of powers of two. But when we reach the 1U << 32 case, where we might have expected a result of 0, the result actually returns to 1, and the function becomes periodic. The explanation is that the x86 SHL instruction only uses the bottom 5 bits of the shift count, so the shift count is treated modulo 32.
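
To make that concrete, here's a small C sketch of what the observed 32-bit result amounts to. The function name is just for illustration, and remember that in C itself a shift count of 32 or more is undefined behaviour, so this models the instruction rather than the language:

/* A model of the observed 32-bit x86 SHL behaviour: only the bottom
   5 bits of the count take part, so the count is reduced modulo 32.
   The masked shift below is always well-defined C, and it reproduces
   the periodic chart above: a count of 32 behaves like 0, 33 like 1,
   and so on. */
unsigned int x86_shl_model(unsigned int x, unsigned int n) {
	return x << (n & 31);
}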

By the way, if you try a similar experiment in languages other than C or C++, you probably won't see this behaviour. Only in C and C++ is the shift operation defined loosely enough that a compiler can use the unadorned machine instruction: shifting by a count greater than or equal to the width of the type is undefined behaviour, so the compiler doesn't have to care what the instruction does in that case. Implementations of other languages do extra work to make their shift operations behave less surprisingly, and more consistently across different instruction set architectures.
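
The details of that extra work differ from language to language, so the following is only a sketch of its flavour, written as a C helper with a made-up name (no particular language uses exactly this rule): the count is checked up front, and an over-wide shift is simply defined to give zero.

/* Sketch of a shift with fully specified behaviour: counts of 32 or
   more shift every bit out, so the result is defined to be zero, and
   the underlying C shift is only ever performed with a count of 0-31. */
unsigned int shl32_defined(unsigned int x, unsigned int n) {
	return n < 32 ? x << n : 0;
}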

Is this just a peculiar quirk of x86? Well, ARM does something similar. Here's a chart of the same program's output when running on ARM:

ARM's "logical shift left by register" operand form uses only the bottom 8 bits of the shift count register. So 1U << i rises through the powers of two up to 1U << 31, then drops to zero at 1U << 32 as the set bit is shifted off the end of the unsigned int. But at 1U << 256 the result returns to one, and the function repeats.
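
Again as a sketch only, the behaviour described above can be modelled in C for a 32-bit value like this (the function name is illustrative):

/* A model of the observed ARM behaviour: only the bottom byte of the
   shift count register is used; effective counts of 0-31 give the usual
   shift, counts of 32-255 shift the set bit out entirely and give zero,
   and a count of 256 wraps back round to a shift of zero. */
unsigned int arm_lsl_model(unsigned int x, unsigned int n) {
	n &= 0xff;	/* bottom 8 bits of the shift count register */
	return n < 32 ? x << n : 0;
}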

Why do x86 and ARM behave in this way? Historical reasons. Here's a note from the definition of the SHL instruction in Intel's Software Developer's Manual:

IA-32 Architecture Compatibility

The 8086 does not mask the shift count. However, all other IA-32 processors (starting with the Intel 286 processor) do mask the shift count to 5 bits, resulting in a maximum count of 31. This masking is done in all operating modes (including the virtual-8086 mode) to reduce the maximum execution time of the instructions.

This is clearly an old historical note (and not just because it is outdated: in x86-64, 64-bit shift operations mask the shift count to 6 bits). The cycle counts for shift instructions on the 8086 varied with the shift count, presumably because it implemented them with a microcode loop. But in later Intel x86 processors, shift instructions take a constant number of cycles, so the idea of a maximum execution time is an anachronism.

And clearly it is never actually necessary for the hardware to do a shift by more than 32 or 64 bits: larger shift counts can be handled by simply zeroing the result (and detection of a large shift count can be done in parallel with a modulo shift, so it seems unlikely that this would be problematic in circuit design terms). This is confirmed by the SSE shift instructions (PSLL* etc.), which do not mask the shift count.
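
To spell out that contrast, here are the two 64-bit behaviours side by side as C sketches (the function names are made up for illustration): the masked count used by the scalar shifts, and the zero-for-large-counts result that not masking amounts to.

#include <stdint.h>

/* The x86-64 scalar rule: the count is masked to 6 bits, so it is
   treated modulo 64. */
uint64_t shl64_masked(uint64_t x, unsigned int n) {
	return x << (n & 63);
}

/* The alternative described above: a count of 64 or more simply gives
   zero, which is how the unmasked SSE shifts end up behaving. */
uint64_t shl64_zeroing(uint64_t x, unsigned int n) {
	return n < 64 ? x << n : 0;
}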

So it seems unlikely that a green-field instruction set would have these kinds of quirks. They originated in processor designs from many years ago that were constrained in the number of available transistors, and they have been retained for compatibility.