Tailbone: GCC's Inline Assembler

Using GCC's Inline Assembler

GCC has come a long way in the past twenty years. It has a good enough optimizer to make assembler programming unprofitable for most general tasks, but there is always a subset that may exploit the machine more effectively than the compiler semantics will allow. GCC has a rather nice inline assembler, which offers a few advantages over a standalone assembler:

ABI independence
Code inlining
Flexible register substitution

These allow for efficient integration of assembler code. So long as you have a fairly narrow range of target platforms, you can't beat it with a stick. In addition, LLVM's Clang supports most GCC extensions - specifically "asm" - so you have a choice of at least two compilers.

For a sample target, I grabbed this routine that I had noticed in a slide deck on OpenBGP:


uint8_t
mask2prefixlen6(struct sockaddr_in6 *sa_in6)
{
	uint8_t	l = 0, *ap, *ep;

	/*
	 * sin6_len is the size of the sockaddr so substract the offset of
	 * the possibly truncated sin6_addr struct.
	 */
	ap = (uint8_t *)&sa_in6->sin6_addr;
	ep = (uint8_t *)sa_in6 + sockaddr_len((struct sockaddr *)sa_in6);
	for (; ap < ep; ap++) {
		/* this "beauty" is adopted from sbin/route/show.c ... */
		switch (*ap) {
		case 0xff:
			l += 8;
			break;
		case 0xfe:
			l += 7;
			return (l);
		case 0xfc:
			l += 6;
			return (l);
		case 0xf8:
			l += 5;
			return (l);
		case 0xf0:
			l += 4;
			return (l);
		case 0xe0:
			l += 3;
			return (l);
		case 0xc0:
			l += 2;
			return (l);
		case 0x80:
			l += 1;
			return (l);
		case 0x00:
			return (l);
		default:
			fatalx("non contiguous inet6 netmask");
		}
	}

	return (l);
}

Yak! It seemed an ideal example to illustrate potential benefits of embedding assembler routines in high-level code:

It's ugly... but it could be a quick-and-dirty solution that was never intended to survive
It's generically targeted: (practically) no ALU width requirements or advantages
It's not easy to improve using standard high-level constructs
It's likely faster than it looks
It's easily and efficiently solved using specialized processor instructions (BSF and/or POPCNT on x86, CLS on ARM, etc.)

I'm only looking to open a particular can of worms, but I feel obligated to point out a few others:

The routine takes a fixed action on invalid data, but...
The routine only tests a small subset of values for validity due to the short-circuit returns

...and two issues (apparently) not specific to this routine:

The mask element is stored as a big-endian byte array
The 16B address structure may be truncated?

All sorts of interesting design choices there.
I include the former two as they relate to code design:

I would not have a utility routine dictate program flow or exception handling
Incomplete and/or incorrect processing should be avoided, or, if justifiable, identified and documented

I include the latter two as they directly affect any solution:

Endian swapping is annoying but not difficult or particularly costly; simple operations (e.g. mask and compare-for-equality) can be performed without normalization, but for any more detailed processing I would (ideally) establish a boundary and convert data that crosses it
The "possibly truncated sin6_addr struct" appears to be a red herring stemming from (mostly) early implementation errors - a truncated field would have an undetermined value

So how bad is it? To find out, I simplified the routine a bit:


static inline uint8_t mask2prefixlen6(uint8_t *ap) {
    uint8_t l = 0, *ep;

    ep = (uint8_t *) ap + 16;
    for (; ap < ep; ap++) {
        switch (*ap) {
        case 0xff:
            l += 8;
            break;
        case 0xfe:
            l += 7;
            return (l);
        case 0xfc:
            l += 6;
            return (l);
        case 0xf8:
            l += 5;
            return (l);
        case 0xf0:
            l += 4;
            return (l);
        case 0xe0:
            l += 3;
            return (l);
        case 0xc0:
            l += 2;
            return (l);
        case 0x80:
            l += 1;
            return (l);
        case 0x00:
            return (l);
        default:
            return (0);
        }
    }
    return (l);
}

I use "static inline" to eliminate call overhead; I use it by default on my own routines, as they're quite short (but wide):


typedef uint8_t IPv6_Size_t;

// Returns undefined size when Mask is invalid. (For the purpose of conversion to a size an invalid IP mask is defined as having a 1 bit following a 0.)

static inline IPv6_Size_t IPv6_Mask_to_Size(uint64_t v_MaskL_u64, uint64_t v_MaskH_u64) {
    uint64_t Size_u64, Temp_u64;

    __asm__(   "popcntq		%[MaskH], %[Size]		\n"	// Size = count of 1s in MaskH;
	"	popcntq		%[MaskL], %[Temp]		\n"	// Temp = count of 1s in MaskL;
	"	addb		%b[Temp], %b[Size]"			// Size += Temp;
	: [Size] "=&r" (Size_u64), [Temp] "=&r" (Temp_u64)								// Output operands.
	: [MaskL] "r" (v_MaskL_u64), [MaskH] "r" (v_MaskH_u64)								// Input operands.
	: "cc"														// Clobbered elements.
    );
    return((IPv6_Size_t) Size_u64);
}

// Returns undefined size when Mask is invalid. (For the purpose of conversion to a size an invalid IP mask is defined as having a 1 bit following a 0.)

static inline IPv6_Size_t IPv6_Mask_to_Size_BSF_2(uint64_t v_MaskL_u64, uint64_t v_MaskH_u64) {
    uint64_t Size_u64 = 64, Temp_u64 = 64;

    __asm__(   "bsfq		%[MaskH], %[Size]		\n"	// Size = first 1 bit of MaskH, low -> high;
	"	bsfq		%[MaskL], %[Temp]		\n"	// Temp = first 1 bit of MaskL, low -> high;
	"	subb		$128, %b[Size]			\n"	// Size -= 128;
	"	addb		%b[Temp], %b[Size]		\n"	// Size += Temp;
	"	negb		%b[Size]"				// Size -= Size;
	: [Size] "+r" (Size_u64), [Temp] "+r" (Temp_u64)								// Output operands.
	: [MaskL] "r" (v_MaskL_u64), [MaskH] "r" (v_MaskH_u64)								// Input operands.
	: "cc"														// Clobbered elements.
    );
    return((IPv6_Size_t) Size_u64);
}

// Size is undefined on failure.

static inline int IPv6_Mask_to_Size_Validate(uint64_t v_MaskL_u64, uint64_t v_MaskH_u64, IPv6_Size_t &r_Size_u8) {
    int R_i32;

    __asm__(   "popcntq		%[MaskH], %q[Size]		\n"	// Size = count of 1s in MaskH;
	"	xorl		%[R], %[R]			\n"	// R = 0;					// Set ZF for shift.
	"	shlq		%[Size], %[MaskH]		\n"	// MaskH <<= Size;				// Result discarded; Flags unmodified when count == 0.
	"	jnz		2f				\n"	// if(ZF != 1) goto Exit;
	"	testq		%[MaskL], %[MaskL]		\n"
	"	jz		1f				\n"	// if(MaskL == 0) goto Success;
	"	cmpb		$64, %[Size]			\n"
	"	jne		2f				\n"	// if(Size != 64) goto Exit;
	"	popcntq		%[MaskL], %q[Size]		\n"	// Size = count of 1s in MaskL;
	"	xorl		%[R], %[R]			\n"	// R = 0;					// Set ZF for shift.
	"	shlq		%[Size], %[MaskL]		\n"	// MaskL <<= Size;				// Result discarded; Flags unmodified when count == 0.
	"	jnz		2f				\n"	// if(ZF != 1) goto Exit;
	"	addb		$64, %[Size]			\n"	// Size += 64;
	"1:							\n"	// Success:
	"	movb		$1, %b[R]			\n"	// R = 1;
	"2:"								// Exit:
	: [MaskL] "+r" (v_MaskL_u64), [MaskH] "+r" (v_MaskH_u64), [Size] "=c" (r_Size_u8), [R] "=q" (R_i32)		// Output operands.
	: 
	: "cc"														// Clobbered elements.
    );
    return(R_i32);
}

I used this to create an array of arrays of the 129 valid masks:


// Mask is undefined when Size > 128.

static inline void IPv6_Size_to_Mask(IPv6_Size_t v_Size_u8, uint64_t &r_MaskL_u64, uint64_t &r_MaskH_u64) {

    __asm__(   "xorl		%k[MaskL], %k[MaskL]		\n"	// MaskL = 0;					// Clear upper bits for SET.
	"	cmpb		$64, %[Size]			\n"	
	"	setbe		%b[MaskL]			\n"	// MaskL = (Size <= 64);
	"	sbbq		%[MaskH], %[MaskH]		\n"	// MaskH -= MaskH + (Size < 64);
	"	subb		$128, %[Size]			\n"	// Size -= 128;
	"	decq		%[MaskL]			\n"	// MaskL -= 1;
	"	shrq		%[Size], %[MaskH]		\n"	// MaskH >>= Size;
	"	negb		%[Size]				\n"	// Size = -Size;
	"	notq		%[MaskH]			\n"	// MaskH != MaskH;
	"	shlq		%[Size], %[MaskL]"			// MaskL <<= Size;
	: [Size] "+c" (v_Size_u8), [MaskL] "=r" (r_MaskL_u64), [MaskH] "=r" (r_MaskH_u64)				// Output operands.
	: 
	: "cc"														// Clobbered elements.
    );
}

...and this to convert between little- and big-endian data (my routines are all little-endian; I made the arrays big-endian to match the expectation of mask2prefixlen6):


static inline uint64_t BSwap64(uint64_t v_R_u64) {

    __asm__(   "bswapq		%[R]"					// Endian-swap R;
	: [R] "+r" (v_R_u64)												// Output operands.
    );
    return(v_R_u64);
}

Yes, I notate my code this way, and I use Hungarian-type variable identifiers (a legacy from learning BASIC 40 years ago). I did not bother to attempt to isolate the routines, so execution times (from RDTSC/RDTSCP) varied a bit.

Average of 10 runs, 200 * 129 big-endian masks. (Probably a bad choice of run size, but hey - relative performance only.)

Sandy Bridge: Core i5-2500K at 4GHz, Windows 10
Haswell: Core i3-4130T at 2.9GHz, Linux

Sandy, GCC:

Initialize mask array:	1.2126
IPv6_Mask_to_Size:	1.0000
IPv6_Mask_to_Size_BSF_2:	1.0044
IPv6_Mask_to_Size_Validate:	1.2211
mask2prefixlen6:	10.1033

Sandy, Clang (slightly fudged):

Initialize mask array:	1.2359
IPv6_Mask_to_Size:	1.0051
IPv6_Mask_to_Size_BSF_2:	1.0000
IPv6_Mask_to_Size_Validate:	1.2763
mask2prefixlen6:	11.5341

Haswell, GCC:

Initialize mask array:	1.3346
IPv6_Mask_to_Size:	1.0000
IPv6_Mask_to_Size_BSF_2:	1.1177
IPv6_Mask_to_Size_Validate:	1.4175
mask2prefixlen6:	7.3471

The Clang results are "fudged" because execution times were very inconsistent - I cherry-picked a reasonable set. Looking at the assembler output, the only substantive difference between the two compilers is in label alignment (a familiar issue, so I won't bother poking at it at this time). One note: Clang generates a gigantic (35kB of assembler text) go-go-gadget jump table for the mask2prefixlen6 case statement, which evaluated consistently slower than the simpler GCC output (on the Sandy Bridge).
At any rate, mask2prefixlen6 is slower than I expected. Is it an appropriate target for a bit of embedded assembler? You be the judge.

Last modified on 07/26/2018
Comments and corrections to webmaster