GCC has come a long way in the past twenty years. Its optimizer is now good enough to make assembler programming unprofitable for most general tasks, but there is always a subset of tasks where hand-written code can exploit the machine more effectively than the compiler's semantics allow. GCC also has a rather nice inline assembler, which offers a few advantages over a standalone assembler:

- ABI independence
- Code inlining
- Flexible register substitution (see the minimal sketch just below)
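As a taste of that flexibility, here is a minimal sketch of GCC's extended asm (my own hypothetical `add64` helper, x86-64 AT&T syntax assumed): the compiler substitutes whatever registers it likes for the `"r"`-constrained operands, so the same fragment inlines cleanly wherever it lands.

```c
#include <stdint.h>

// Hypothetical example: the compiler chooses registers for A and B,
// substitutes them into the template, and inlines the whole thing.
static inline uint64_t add64(uint64_t a, uint64_t b) {
	__asm__(
		"addq %[B], %[A]"	// A += B;
		: [A] "+r" (a)		// Read/write operand, any GP register.
		: [B] "r" (b)		// Input operand, any GP register.
		: "cc"			// Flags are clobbered.
	);
	return(a);
}
```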

For a sample target, I grabbed this routine that I had noticed in a slide deck on OpenBGPD:

```c
uint8_t
mask2prefixlen6(struct sockaddr_in6 *sa_in6)
{
	uint8_t	 l = 0, *ap, *ep;

	/*
	 * sin6_len is the size of the sockaddr so substract the offset of
	 * the possibly truncated sin6_addr struct.
	 */
	ap = (uint8_t *)&sa_in6->sin6_addr;
	ep = (uint8_t *)sa_in6 + sockaddr_len((struct sockaddr *)sa_in6);
	for (; ap < ep; ap++) {
		/* this "beauty" is adopted from sbin/route/show.c ... */
		switch (*ap) {
		case 0xff:
			l += 8;
			break;
		case 0xfe:
			l += 7;
			return (l);
		case 0xfc:
			l += 6;
			return (l);
		case 0xf8:
			l += 5;
			return (l);
		case 0xf0:
			l += 4;
			return (l);
		case 0xe0:
			l += 3;
			return (l);
		case 0xc0:
			l += 2;
			return (l);
		case 0x80:
			l += 1;
			return (l);
		case 0x00:
			return (l);
		default:
			fatalx("non contiguous inet6 netmask");
		}
	}
	return (l);
}
```

Yak! It seemed an ideal example to illustrate the potential benefits of embedding assembler routines in high-level code:

- It's ugly... but it could be a quick-and-dirty solution that was never intended to survive
- It's generically targeted: (practically) no ALU width requirements or advantages
- It's not easy to improve using standard high-level constructs
- It's likely faster than it looks
- It's easily and efficiently solved using specialized processor instructions (BSF and/or POPCNT on x86, CLS on ARM, etc.)

- The routine takes a fixed action on invalid data, but...
- The routine only tests a small subset of values for validity due to the short-circuit returns

- The mask element is stored as a big-endian byte array
- The 16B address structure may be truncated?

I include the first two because they relate to code design:

- I would not have a utility routine dictate program flow or exception handling
- Incomplete and/or incorrect processing should be avoided, or, if justifiable, identified and documented

- Endian swapping is annoying but not difficult or particularly costly; simple operations (e.g. mask and compare-for-equality) can be performed without normalization, but for any more detailed processing I would (ideally) establish a boundary and convert data that crosses it
- The "possibly truncated sin6_addr struct" appears to be a red herring stemming from (mostly) early implementation errors - a truncated field would have an undetermined value

So how bad is it? To find out, I simplified the routine a bit:

```c
static inline uint8_t
mask2prefixlen6(uint8_t *ap)
{
	uint8_t	 l = 0, *ep;

	ep = (uint8_t *)ap + 16;
	for (; ap < ep; ap++) {
		switch (*ap) {
		case 0xff:
			l += 8;
			break;
		case 0xfe:
			l += 7;
			return (l);
		case 0xfc:
			l += 6;
			return (l);
		case 0xf8:
			l += 5;
			return (l);
		case 0xf0:
			l += 4;
			return (l);
		case 0xe0:
			l += 3;
			return (l);
		case 0xc0:
			l += 2;
			return (l);
		case 0x80:
			l += 1;
			return (l);
		case 0x00:
			return (l);
		default:
			return (0);
		}
	}
	return (l);
}
```
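As a point of reference (my addition, not part of the original measurements): with compiler builtins, the specialized-instruction approach is a one-liner. `__builtin_popcountll` lowers to the POPCNT instruction under `-mpopcnt`, and a valid contiguous mask's prefix length is simply its population count. The sketch assumes host-order halves and a valid mask; the name is mine.

```c
#include <stdint.h>

// Sketch only: a valid (contiguous) mask's prefix length is just its
// 1-bit count. Assumes host-order halves and a valid mask.
static inline uint8_t Prefix_Len_Builtin(uint64_t v_MaskL_u64, uint64_t v_MaskH_u64) {
	return (uint8_t)(__builtin_popcountll(v_MaskH_u64) +
	                 __builtin_popcountll(v_MaskL_u64));
}
```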

I use "static inline" to eliminate call overhead; I use it by default on my own routines, as they're quite short (but wide):

```c
typedef uint8_t IPv6_Size_t;

// Returns undefined size when Mask is invalid. (For the purpose of conversion to
// a size, an invalid IP mask is defined as having a 1 bit following a 0.)
static inline IPv6_Size_t IPv6_Mask_to_Size(uint64_t v_MaskL_u64, uint64_t v_MaskH_u64) {
	uint64_t Size_u64, Temp_u64;

	__asm__(
		"popcntq %[MaskH], %[Size] \n"	// Size = count of 1s in MaskH;
		" popcntq %[MaskL], %[Temp] \n"	// Temp = count of 1s in MaskL;
		" addb %b[Temp], %b[Size]"	// Size += Temp;
		: [Size] "=&r" (Size_u64), [Temp] "=&r" (Temp_u64)	// Output operands.
		: [MaskL] "r" (v_MaskL_u64), [MaskH] "r" (v_MaskH_u64)	// Input operands.
		: "cc"							// Clobbered elements.
	);
	return((IPv6_Size_t) Size_u64);
}

// Returns undefined size when Mask is invalid. (For the purpose of conversion to
// a size, an invalid IP mask is defined as having a 1 bit following a 0.)
static inline IPv6_Size_t IPv6_Mask_to_Size_BSF_2(uint64_t v_MaskL_u64, uint64_t v_MaskH_u64) {
	uint64_t Size_u64 = 64, Temp_u64 = 64;

	__asm__(
		"bsfq %[MaskH], %[Size] \n"	// Size = first 1 bit of MaskH, low -> high;
		" bsfq %[MaskL], %[Temp] \n"	// Temp = first 1 bit of MaskL, low -> high;
		" subb $128, %b[Size] \n"	// Size -= 128;
		" addb %b[Temp], %b[Size] \n"	// Size += Temp;
		" negb %b[Size]"		// Size = -Size;
		: [Size] "+r" (Size_u64), [Temp] "+r" (Temp_u64)	// Output operands.
		: [MaskL] "r" (v_MaskL_u64), [MaskH] "r" (v_MaskH_u64)	// Input operands.
		: "cc"							// Clobbered elements.
	);
	return((IPv6_Size_t) Size_u64);
}

// Size is undefined on failure.
static inline int IPv6_Mask_to_Size_Validate(uint64_t v_MaskL_u64, uint64_t v_MaskH_u64, IPv6_Size_t &r_Size_u8) {
	int R_i32;

	__asm__(
		"popcntq %[MaskH], %q[Size] \n"	// Size = count of 1s in MaskH;
		" xorl %[R], %[R] \n"		// R = 0; // Set ZF for shift.
		" shlq %[Size], %[MaskH] \n"	// MaskH <<= Size; // Result discarded; flags unmodified when count == 0.
		" jnz 2f \n"			// if(ZF != 1) goto Exit;
		" testq %[MaskL], %[MaskL] \n"
		" jz 1f \n"			// if(MaskL == 0) goto Success;
		" cmpb $64, %[Size] \n"
		" jne 2f \n"			// if(Size != 64) goto Exit;
		" popcntq %[MaskL], %q[Size] \n"	// Size = count of 1s in MaskL;
		" xorl %[R], %[R] \n"		// R = 0; // Set ZF for shift.
		" shlq %[Size], %[MaskL] \n"	// MaskL <<= Size; // Result discarded; flags unmodified when count == 0.
		" jnz 2f \n"			// if(ZF != 1) goto Exit;
		" addb $64, %[Size] \n"		// Size += 64;
		"1: \n"				// Success:
		" movb $1, %b[R] \n"		// R = 1;
		"2:"				// Exit:
		: [MaskL] "+r" (v_MaskL_u64), [MaskH] "+r" (v_MaskH_u64),
		  [Size] "=c" (r_Size_u8), [R] "=q" (R_i32)		// Output operands.
		:							// Input operands.
		: "cc"							// Clobbered elements.
	);
	return(R_i32);
}
```
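A hedged usage sketch (the byte-array loading scheme is my assumption, not the original glue): load the two halves of a 16-byte big-endian mask and swap them into the little-endian order the routines expect, here using the compiler's `__builtin_bswap64` for brevity.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical glue: bytes 0..7 are the high half of the mask in network
// (big-endian) order, bytes 8..15 the low half.
static inline IPv6_Size_t Prefix_Len_From_Bytes(const uint8_t *p_Mask_u8) {
	uint64_t MaskH_u64, MaskL_u64;

	memcpy(&MaskH_u64, p_Mask_u8, 8);	// High half, big-endian.
	memcpy(&MaskL_u64, p_Mask_u8 + 8, 8);	// Low half, big-endian.
	return IPv6_Mask_to_Size(__builtin_bswap64(MaskL_u64),	// MaskL, host order.
	                         __builtin_bswap64(MaskH_u64));	// MaskH, host order.
}
```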

I used this to create an array of arrays of the 129 valid masks:

```c
// Mask is undefined when Size > 128.
static inline void IPv6_Size_to_Mask(IPv6_Size_t v_Size_u8, uint64_t &r_MaskL_u64, uint64_t &r_MaskH_u64) {
	__asm__(
		"xorl %k[MaskL], %k[MaskL] \n"	// MaskL = 0; // Clear upper bits for SET.
		" cmpb $64, %[Size] \n"
		" setbe %b[MaskL] \n"		// MaskL = (Size <= 64);
		" sbbq %[MaskH], %[MaskH] \n"	// MaskH -= MaskH + (Size < 64);
		" subb $128, %[Size] \n"	// Size -= 128;
		" decq %[MaskL] \n"		// MaskL -= 1;
		" shrq %[Size], %[MaskH] \n"	// MaskH >>= Size;
		" negb %[Size] \n"		// Size = -Size;
		" notq %[MaskH] \n"		// MaskH = ~MaskH;
		" shlq %[Size], %[MaskL]"	// MaskL <<= Size;
		: [Size] "+c" (v_Size_u8), [MaskL] "=r" (r_MaskL_u64),
		  [MaskH] "=r" (r_MaskH_u64)				// Output operands.
		:							// Input operands.
		: "cc"							// Clobbered elements.
	);
}
```

...and this to convert between little- and big-endian data (my routines are all little-endian; I made the arrays big-endian to match the expectation of mask2prefixlen6):

```c
static inline uint64_t BSwap64(uint64_t v_R_u64) {
	__asm__(
		"bswapq %[R]"		// Endian-swap R;
		: [R] "+r" (v_R_u64)	// Output operands.
	);
	return(v_R_u64);
}
```
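Putting the pieces together, a plausible sketch of the array construction described above (the layout and names are my guesses, not the original code): 129 masks for prefix lengths 0 through 128, each stored as two big-endian 64-bit halves, high half first, to match what mask2prefixlen6 expects.

```c
#include <stdint.h>

// Hypothetical layout: Mask_Array_u64[Size] holds the big-endian mask for
// prefix length Size, high half in [0], low half in [1].
static uint64_t Mask_Array_u64[129][2];

static void Init_Mask_Array(void) {
	for (unsigned int Size_u32 = 0; Size_u32 <= 128; Size_u32++) {
		uint64_t MaskL_u64, MaskH_u64;

		IPv6_Size_to_Mask((IPv6_Size_t) Size_u32, MaskL_u64, MaskH_u64);
		Mask_Array_u64[Size_u32][0] = BSwap64(MaskH_u64);	// High half first.
		Mask_Array_u64[Size_u32][1] = BSwap64(MaskL_u64);	// Low half second.
	}
}
```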

Yes, I notate my code this way, and I use Hungarian-type variable identifiers (a legacy from learning BASIC 40 years ago). I did not attempt to isolate the routines, so execution times (from RDTSC/RDTSCP) varied a bit.
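For completeness, a minimal sketch of the RDTSC/RDTSCP pattern mentioned above (my reconstruction, not the actual harness); RDTSCP does not read the counter until prior instructions have retired, which is why it goes at the end.

```c
#include <stdint.h>
#include <x86intrin.h>

// Sketch: time a single run of a work function in TSC ticks.
static inline uint64_t Time_Run(void (*p_Work_Fn)(void)) {
	unsigned int Aux_u32;
	uint64_t Start_u64, End_u64;

	Start_u64 = __rdtsc();		// Read TSC at start.
	p_Work_Fn();			// Work under test.
	End_u64 = __rdtscp(&Aux_u32);	// Read TSC after prior instructions retire.
	return(End_u64 - Start_u64);
}
```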

Average of 10 runs, 200 * 129 big-endian masks. (Probably a bad choice of run size, but hey - relative performance only.)

- Sandy Bridge: Core i5-2500K at 4 GHz, Windows 10
- Haswell: Core i3-4130T at 2.9 GHz, Linux

| Routine | Set 1 | Set 2 | Set 3 |
| --- | --- | --- | --- |
| Initialize mask array | 1.2126 | 1.2359 | 1.3346 |
| IPv6_Mask_to_Size | 1.0000 | 1.0051 | 1.0000 |
| IPv6_Mask_to_Size_BSF_2 | 1.0044 | 1.0000 | 1.1177 |
| IPv6_Mask_to_Size_Validate | 1.2211 | 1.2763 | 1.4175 |
| mask2prefixlen6 | 10.1033 | 11.5341 | 7.3471 |

The Clang results are "fudged" because execution times were very inconsistent - I cherry-picked a reasonable set. Looking at the assembler output, the only substantive difference between the two compilers is in label alignment (a familiar issue, so I won't bother poking at it at this time). One note: Clang generates a gigantic (35kB of assembler text) go-go-gadget jump table for the mask2prefixlen6 case statement, which evaluated consistently slower than the simpler GCC output (on the Sandy Bridge).
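(If the jump table itself is suspect, both GCC and Clang accept `-fno-jump-tables`, which forces switches to be lowered as compare-and-branch chains; I'll note it here as an untested option.)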

At any rate, mask2prefixlen6 is slower than I expected. Is it an appropriate target for a bit of embedded assembler? You be the judge.
