GCC has come a long way in the past twenty years. Its optimizer is good enough to make assembler programming unprofitable for most general tasks, but there is always a subset of tasks where hand-written code can exploit the machine more effectively than the compiler's semantics allow. GCC has a rather nice inline assembler, which offers a few advantages over a standalone assembler:
- ABI independence
- Code inlining
- Flexible register substitution
These allow for efficient integration of assembler code. So long as you have a fairly narrow range of target platforms, you can't beat it with a stick. In addition, LLVM's Clang supports most GCC extensions - specifically "asm" - so you have a choice of at least two compilers.
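To make the mechanics concrete, here is a minimal sketch of GCC's extended asm (a trivial example of my own, not code from the routines below): the template names its operands, the compiler picks the actual registers and substitutes them, and the whole thing is inlined at each call site with no fixed calling convention.

#include <stdint.h>                                // Fixed-width types used throughout.

static inline uint64_t RotateLeft64(uint64_t v_R_u64, uint8_t v_Count_u8) {
    __asm__( "rolq %%cl, %[R]"                     // R = rotate R left by CL bits;
             : [R] "+r" (v_R_u64)                  // Output operands.
             : "c" (v_Count_u8)                    // Input operands.
             : "cc"                                // Clobbered elements.
    );
    return(v_R_u64);
}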
For a sample target, I grabbed this routine that I had noticed in a slide deck on OpenBGP:
uint8_t
mask2prefixlen6(struct sockaddr_in6 *sa_in6)
{
    uint8_t l = 0, *ap, *ep;
    /*
     * sin6_len is the size of the sockaddr so substract the offset of
     * the possibly truncated sin6_addr struct.
     */
    ap = (uint8_t *)&sa_in6->sin6_addr;
    ep = (uint8_t *)sa_in6 + sockaddr_len((struct sockaddr *)sa_in6);
    for (; ap < ep; ap++) {
        /* this "beauty" is adopted from sbin/route/show.c ... */
        switch (*ap) {
        case 0xff:
            l += 8;
            break;
        case 0xfe:
            l += 7;
            return (l);
        case 0xfc:
            l += 6;
            return (l);
        case 0xf8:
            l += 5;
            return (l);
        case 0xf0:
            l += 4;
            return (l);
        case 0xe0:
            l += 3;
            return (l);
        case 0xc0:
            l += 2;
            return (l);
        case 0x80:
            l += 1;
            return (l);
        case 0x00:
            return (l);
        default:
            fatalx("non contiguous inet6 netmask");
        }
    }
    return (l);
}
Yak! It seemed an ideal example to illustrate potential benefits of embedding assembler routines in high-level code:
- It's ugly... but it could be a quick-and-dirty solution that was never intended to survive
- It's generically targeted: (practically) no ALU width requirements or advantages
- It's not easy to improve using standard high-level constructs
- It's likely faster than it looks
- It's easily and efficiently solved using specialized processor instructions (BSF and/or POPCNT on x86, CLS on ARM, etc.)
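As a preview of that last point, the same instructions are reachable from plain C via compiler builtins. A minimal sketch (names and layout are my own, assuming the 128-bit mask is already split into two native-endian 64-bit halves and is known to be valid):

static inline uint8_t Builtin_Mask_to_Size(uint64_t v_MaskL_u64, uint64_t v_MaskH_u64) {
    // __builtin_popcountll compiles to POPCNT when the target supports it
    // (e.g. -mpopcnt or -march=native); otherwise GCC/Clang emit a fallback.
    return (uint8_t)(__builtin_popcountll(v_MaskH_u64) +
                     __builtin_popcountll(v_MaskL_u64));
}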
I'm only looking to open a particular can of worms, but I feel obligated to point out a few others:
- The routine takes a fixed action on invalid data, but...
- The routine only tests a small subset of values for validity due to the short-circuit returns
...and two issues (apparently) not specific to this routine:
- The mask element is stored as a big-endian byte array
- The 16B address structure may be truncated?
All sorts of interesting design choices there.
I include the former two as they relate to code design:
- I would not have a utility routine dictate program flow or exception handling
- Incomplete and/or incorrect processing should be avoided, or, if justifiable, identified and documented
I include the latter two as they directly affect any solution:
- Endian swapping is annoying but not difficult or particularly costly; simple operations (e.g. mask and compare-for-equality) can be performed without normalization, but for any more detailed processing I would (ideally) establish a boundary and convert data that crosses it
- The "possibly truncated sin6_addr struct" appears to be a red herring stemming from (mostly) early implementation errors - a truncated field would have an undetermined value
So how bad is it? To find out, I simplified the routine a bit:
static inline uint8_t mask2prefixlen6(uint8_t *ap) {
    uint8_t l = 0, *ep;
    ep = (uint8_t *) ap + 16;
    for (; ap < ep; ap++) {
        switch (*ap) {
        case 0xff:
            l += 8;
            break;
        case 0xfe:
            l += 7;
            return (l);
        case 0xfc:
            l += 6;
            return (l);
        case 0xf8:
            l += 5;
            return (l);
        case 0xf0:
            l += 4;
            return (l);
        case 0xe0:
            l += 3;
            return (l);
        case 0xc0:
            l += 2;
            return (l);
        case 0x80:
            l += 1;
            return (l);
        case 0x00:
            return (l);
        default:
            return (0);
        }
    }
    return (l);
}
I use "static inline" to eliminate call overhead; I use it by default on my own routines, as they're quite short (but wide):
typedef uint8_t IPv6_Size_t;
// Returns undefined size when Mask is invalid. (For the purpose of conversion to a size an invalid IP mask is defined as having a 1 bit following a 0.)
static inline IPv6_Size_t IPv6_Mask_to_Size(uint64_t v_MaskL_u64, uint64_t v_MaskH_u64) {
uint64_t Size_u64, Temp_u64;
__asm__( "popcntq %[MaskH], %[Size] \n" // Size = count of 1s in MaskH;
" popcntq %[MaskL], %[Temp] \n" // Temp = count of 1s in MaskL;
" addb %b[Temp], %b[Size]" // Size += Temp;
: [Size] "=&r" (Size_u64), [Temp] "=&r" (Temp_u64) // Output operands.
: [MaskL] "r" (v_MaskL_u64), [MaskH] "r" (v_MaskH_u64) // Input operands.
: "cc" // Clobbered elements.
);
return((IPv6_Size_t) Size_u64);
}
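A usage sketch with example values of my own: a /72 mask has its top 72 bits set, so the high little-endian word is all ones and the low word carries the remaining 8 bits.

uint64_t MaskH_u64 = 0xFFFFFFFFFFFFFFFFull;        // Bits 127..64 of the mask.
uint64_t MaskL_u64 = 0xFF00000000000000ull;        // Bits  63..0  of the mask.
IPv6_Size_t Size_u8 = IPv6_Mask_to_Size(MaskL_u64, MaskH_u64);   // Size_u8 == 72.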
// Returns undefined size when Mask is invalid. (For the purpose of conversion to a size an invalid IP mask is defined as having a 1 bit following a 0.)
static inline IPv6_Size_t IPv6_Mask_to_Size_BSF_2(uint64_t v_MaskL_u64, uint64_t v_MaskH_u64) {
uint64_t Size_u64 = 64, Temp_u64 = 64;
__asm__( "bsfq %[MaskH], %[Size] \n" // Size = first 1 bit of MaskH, low -> high;
" bsfq %[MaskL], %[Temp] \n" // Temp = first 1 bit of MaskL, low -> high;
" subb $128, %b[Size] \n" // Size -= 128;
" addb %b[Temp], %b[Size] \n" // Size += Temp;
" negb %b[Size]" // Size -= Size;
: [Size] "+r" (Size_u64), [Temp] "+r" (Temp_u64) // Output operands.
: [MaskL] "r" (v_MaskL_u64), [MaskH] "r" (v_MaskH_u64) // Input operands.
: "cc" // Clobbered elements.
);
return((IPv6_Size_t) Size_u64);
}
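The BSF variant leans on the identity that, for a valid mask of length p, BSF(MaskH) yields 64 - min(p, 64) and BSF(MaskL) yields 64 - max(p - 64, 0), so p = 128 - BSF(MaskH) - BSF(MaskL); the preloaded 64s cover the all-zero halves, relying on BSF leaving its destination untouched for a zero source. A plain-C model of that arithmetic, for reference only (my sketch, not part of the measured code):

static inline uint8_t Model_Mask_to_Size_BSF(uint64_t v_MaskL_u64, uint64_t v_MaskH_u64) {
    // __builtin_ctzll is undefined for 0, so the zero case keeps the preloaded 64,
    // mirroring what the asm above relies on from BSF.
    unsigned SizeH_u32 = v_MaskH_u64 ? (unsigned) __builtin_ctzll(v_MaskH_u64) : 64;
    unsigned SizeL_u32 = v_MaskL_u64 ? (unsigned) __builtin_ctzll(v_MaskL_u64) : 64;
    return (uint8_t) (128 - SizeH_u32 - SizeL_u32);
}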
// Size is undefined on failure.
static inline int IPv6_Mask_to_Size_Validate(uint64_t v_MaskL_u64, uint64_t v_MaskH_u64, IPv6_Size_t &r_Size_u8) {
int R_i32;
__asm__( "popcntq %[MaskH], %q[Size] \n" // Size = count of 1s in MaskH;
" xorl %[R], %[R] \n" // R = 0; // Set ZF for shift.
" shlq %[Size], %[MaskH] \n" // MaskH <<= Size; // Result discarded; Flags unmodified when count == 0.
" jnz 2f \n" // if(ZF != 1) goto Exit;
" testq %[MaskL], %[MaskL] \n"
" jz 1f \n" // if(MaskL == 0) goto Success;
" cmpb $64, %[Size] \n"
" jne 2f \n" // if(Size != 64) goto Exit;
" popcntq %[MaskL], %q[Size] \n" // Size = count of 1s in MaskL;
" xorl %[R], %[R] \n" // R = 0; // Set ZF for shift.
" shlq %[Size], %[MaskL] \n" // MaskL <<= Size; // Result discarded; Flags unmodified when count == 0.
" jnz 2f \n" // if(ZF != 1) goto Exit;
" addb $64, %[Size] \n" // Size += 64;
"1: \n" // Success:
" movb $1, %b[R] \n" // R = 1;
"2:" // Exit:
: [MaskL] "+r" (v_MaskL_u64), [MaskH] "+r" (v_MaskH_u64), [Size] "=c" (r_Size_u8), [R] "=q" (R_i32) // Output operands.
:
: "cc" // Clobbered elements.
);
return(R_i32);
}
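A usage sketch with example masks of my own, showing the pass/fail split:

IPv6_Size_t Size_u8;
int Valid_i32;

Valid_i32 = IPv6_Mask_to_Size_Validate(0, 0xFFFF000000000000ull, Size_u8);
// Valid_i32 == 1, Size_u8 == 16: a well-formed /16.
Valid_i32 = IPv6_Mask_to_Size_Validate(0, 0xFFFF000000000001ull, Size_u8);
// Valid_i32 == 0: a 1 bit below the run of 0s; Size_u8 is undefined.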
I used this to create an array of arrays of the 129 valid masks:
// Mask is undefined when Size > 128.
static inline void IPv6_Size_to_Mask(IPv6_Size_t v_Size_u8, uint64_t &r_MaskL_u64, uint64_t &r_MaskH_u64) {
__asm__( "xorl %k[MaskL], %k[MaskL] \n" // MaskL = 0; // Clear upper bits for SET.
" cmpb $64, %[Size] \n"
" setbe %b[MaskL] \n" // MaskL = (Size <= 64);
" sbbq %[MaskH], %[MaskH] \n" // MaskH -= MaskH + (Size < 64);
" subb $128, %[Size] \n" // Size -= 128;
" decq %[MaskL] \n" // MaskL -= 1;
" shrq %[Size], %[MaskH] \n" // MaskH >>= Size;
" negb %[Size] \n" // Size = -Size;
" notq %[MaskH] \n" // MaskH != MaskH;
" shlq %[Size], %[MaskL]" // MaskL <<= Size;
: [Size] "+c" (v_Size_u8), [MaskL] "=r" (r_MaskL_u64), [MaskH] "=r" (r_MaskH_u64) // Output operands.
:
: "cc" // Clobbered elements.
);
}
...and this to convert between little- and big-endian data (my routines are all little-endian; I made the arrays big-endian to match the expectation of mask2prefixlen6):
static inline uint64_t BSwap64(uint64_t v_R_u64) {
__asm__( "bswapq %[R]" // Endian-swap R;
: [R] "+r" (v_R_u64) // Output operands.
);
return(v_R_u64);
}
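A sketch of how the test array might be filled, combining the two helpers (my guess at the shape; this isn't shown in the original): 129 masks, each as two big-endian 64-bit words with the network-order-first half in element 0.

uint64_t Masks_u64[129][2];

for (unsigned Size_u32 = 0; Size_u32 <= 128; Size_u32++) {
    uint64_t MaskL_u64, MaskH_u64;
    IPv6_Size_to_Mask((IPv6_Size_t) Size_u32, MaskL_u64, MaskH_u64);
    Masks_u64[Size_u32][0] = BSwap64(MaskH_u64);   // Bytes 0..7 of the 16B address.
    Masks_u64[Size_u32][1] = BSwap64(MaskL_u64);   // Bytes 8..15 of the 16B address.
}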
Yes, I notate my code this way, and I use Hungarian-style variable identifiers (a legacy of learning BASIC 40 years ago). I did not attempt to isolate the routines, so execution times (from RDTSC/RDTSCP) varied a bit.
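For reference, a minimal sketch of the kind of TSC read involved (not the actual harness used here): RDTSCP waits for preceding instructions to finish before sampling the counter, and also returns the processor ID in ECX, which is clobbered below.

static inline uint64_t Read_TSCP(void) {
    uint32_t Lo_u32, Hi_u32;
    __asm__ __volatile__( "rdtscp"                 // EDX:EAX = TSC; ECX = TSC_AUX;
             : "=a" (Lo_u32), "=d" (Hi_u32)        // Output operands.
             :                                     // Input operands.
             : "rcx"                               // Clobbered elements.
    );
    return(((uint64_t) Hi_u32 << 32) | Lo_u32);
}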
Average of 10 runs, 200 * 129 big-endian masks. (Probably a bad choice of run size, but hey - relative performance only.)
Sandy Bridge: Core i5-2500K at 4GHz, Windows 10
Haswell: Core i3-4130T at 2.9GHz, Linux
                             | Sandy, GCC | Sandy, Clang (slightly fudged) | Haswell, GCC
-----------------------------|------------|--------------------------------|-------------
Initialize mask array:       |     1.2126 |                         1.2359 |       1.3346
IPv6_Mask_to_Size:           |     1.0000 |                         1.0051 |       1.0000
IPv6_Mask_to_Size_BSF_2:     |     1.0044 |                         1.0000 |       1.1177
IPv6_Mask_to_Size_Validate:  |     1.2211 |                         1.2763 |       1.4175
mask2prefixlen6:             |    10.1033 |                        11.5341 |       7.3471
The Clang results are "fudged" because execution times were very inconsistent - I cherry-picked a reasonable set. Looking at the assembler output, the only substantive difference between the two compilers is in label alignment (a familiar issue, so I won't bother poking at it at this time). One note: Clang generates a gigantic (35 kB of assembler text) go-go-gadget jump table for the mask2prefixlen6 switch statement, which consistently ran slower than the simpler GCC output (on the Sandy Bridge).
At any rate, mask2prefixlen6 is slower than I expected. Is it an appropriate target for a bit of embedded assembler? You be the judge.
Last modified on 07/26/2018