
The word-at-a-time interface


Posted Jun 14, 2012 18:04 UTC (Thu) by jzbiciak (subscriber, #5246)
In reply to: The word-at-a-time interface by mina86
Parent article: The word-at-a-time interface

Perhaps it's a codesize thing, at least for 64-bit arches? A 64-bit manifest constant is fairly big and may require many more than 8 bytes to generate with code. Since these are inlines, multiply the cost by the number of call sites.

A data-page relative access might be comparatively smaller and possibly cheaper when you account for smaller I-cache footprint. You're still costing some D-cache, but it gets reused at every call site.



The word-at-a-time interface

Posted Jun 14, 2012 18:15 UTC (Thu) by jzbiciak (subscriber, #5246) [Link]

To throw some data into all this, I coded up a short test:

struct csts
{
    long csta, cstb;
};

long foo1(long x, const struct csts *c)
{
    return (x + c->csta) & c->cstb;
}

long foo2(long x)
{
    return (x + 0x0101010101010101ul) & 0x8080808080808080ul;
}

The objdump output illustrates the codesize argument:

0000000000000000 <foo1>:
   0:	48 89 f8             	mov    %rdi,%rax
   3:	48 03 06             	add    (%rsi),%rax
   6:	48 23 46 08          	and    0x8(%rsi),%rax
   a:	c3                   	retq   
   b:	eb 03                	jmp    10 <foo2>
   d:	90                   	nop
   e:	90                   	nop
   f:	90                   	nop

0000000000000010 <foo2>:
  10:	48 b8 01 01 01 01 01 	movabs $0x101010101010101,%rax
  17:	01 01 01 
  1a:	48 ba 80 80 80 80 80 	movabs $0x8080808080808080,%rdx
  21:	80 80 80 
  24:	48 8d 04 07          	lea    (%rdi,%rax,1),%rax
  28:	48 21 d0             	and    %rdx,%rax
  2b:	c3                   	retq   

If you ignore the end padding (which, in practice, would disappear with inlining), the first function is less than half the size of the second (11 bytes vs. 28). (Not exactly sure what that jmp to foo2 is doing there, but I'm ignoring it.) With inlining, the situation naturally gets muddier if you have multiple uses of the same constant, but still, you can see that the constant-in-struct approach has some natural size advantages.

And yes, my data-page relative comment above is maybe a little bit off, unless after inlining the generation of the "struct word_at_a_time *" gets resolved there as data-page relative. It may well not.

The word-at-a-time interface

Posted Jun 14, 2012 19:17 UTC (Thu) by mina86 (subscriber, #68442) [Link]

Then again, if you call it in a loop, those constants are loaded into registers before the loop. But then again, that is probably true for the code with the structure as well. In fact, when compiling hash_name() with -O2, both versions gave the same result, so the whole thing may be a premature optimisation which the compiler is capable of handling itself.

The word-at-a-time interface

Posted Jun 14, 2012 19:36 UTC (Thu) by jzbiciak (subscriber, #5246) [Link]

I'm not referring to the cycle cost of generating the constants, but rather the codesize cost. Even if the constant generation gets hoisted out of the loop, there's still the codesize outside the loop.

If hash_name() is the only call site for these inlines, then yeah, it's tilting at windmills. If the inlines start to find use in other places, then the savings, while small, may start to add up.

The word-at-a-time interface

Posted Jun 14, 2012 19:38 UTC (Thu) by mina86 (subscriber, #68442) [Link]

But that's not how the constants work, or at least not how I understand it. The idea is that each call site will have its own const structure on the stack. After all, there's no shared structure anywhere. If there were, the functions would not need to have it passed as an argument.

The word-at-a-time interface

Posted Jun 15, 2012 1:54 UTC (Fri) by jzbiciak (subscriber, #5246) [Link]

Indeed, I had visually glossed over this bit:

As will be described in more detail below, this structure contains some useful constant values. The structure is small and the contents are architecture-dependent, so it was evidently deemed unnecessary to create a single, globally-accessible copy.

Right there in black and white. Ah well, mea culpa.

Still, it's not every call site that gets its own copy of these constants, but rather each enclosing function or compilation unit. If the same functions get called multiple times from one outer function or compilation unit, then you have an opportunity for reuse across all those instances.

As I said upthread, though, it does feel a bit like tilting at windmills.

The word-at-a-time interface

Posted Jun 15, 2012 17:37 UTC (Fri) by hamjudo (guest, #363) [Link]

Locally generating a copy means that the copy is in a cache line with related stuff. It actually takes more time to pull in a stale cache line than to make a fresh copy of something small. Things could really go slowly if that global copy lived in a cache line with data that was written by another CPU.

Optimizations get rejected if they make the worst-case times worse, even if they improve the average case.

The word-at-a-time interface

Posted Jun 15, 2012 18:15 UTC (Fri) by jzbiciak (subscriber, #5246) [Link]

I guess it depends on how the local copy initializer gets implemented. If the compiler effectively does memcpy(local, shared_constant_initializer, size), then you've gained nothing.

As for a global copy -- it would be in .rodata or equivalent. It would only result in a read miss, and never in cache-line "ping-pong", because it would sit with other read-only data. The cost of bringing it in wouldn't be dramatically different from bringing the shared instructions into the instruction cache. In MESI terms, it could be in E, S or I (just like program code), but never M. That said, a global copy is unlikely to get optimized away; see below.

All that said, it appears that GCC just turns the struct back into manifest constants. I threw together what I imagined a string length function might look like (ignoring pointer alignment issues), and the compiler just generated the raw constants without ever storing them to the stack. So, on x86_64 at least, there's no real difference between passing the struct pointer around and just having manifest constants in the code, it would appear. The compiler generated identical code for both of these versions:

#define CONSTANTS {0x0101010101010101ul, 0x8080808080808080ul}

struct word_at_a_time
{
    unsigned long one_bits, high_bits;
};

static inline unsigned long has_zero(unsigned long a, unsigned long *bits,
                                     const struct word_at_a_time *c)
{
    unsigned long mask = ((a - c->one_bits) & ~a) & c->high_bits;
    *bits = mask;
    return mask;
}

static inline unsigned long prep_zero_mask(unsigned long value,
                                           unsigned long bits,
                                           const struct word_at_a_time *c)
{
    return bits;
}

static inline unsigned long create_zero_mask(unsigned long bits)
{
    bits = (bits - 1) & ~bits;
    return bits >> 7;
}

static inline unsigned int find_zero(unsigned long mask)
{
    return mask*0x0001020304050608ul >> 56;
}

int string_len(char *str)
{
    unsigned long *l_str = (unsigned long *)str;
    unsigned int bytes = 0, zero_byte;
    unsigned long bits;
    const struct word_at_a_time csts = CONSTANTS;

    while (1)
    {
        bits = 0;
        if (has_zero(*l_str, &bits, &csts))
        {
            bits = prep_zero_mask(0, bits, &csts);
            bits = create_zero_mask(bits);
            zero_byte = find_zero(bits);

            return zero_byte + bytes;
        }

        bytes += sizeof(unsigned long);
        l_str++;
    }
}

...and...

static inline unsigned long has_zero(unsigned long a, unsigned long *bits)
{
    unsigned long mask = ((a - 0x0101010101010101ul) & ~a) & 0x8080808080808080ul;
    *bits = mask;
    return mask;
}

static inline unsigned long prep_zero_mask(unsigned long value,
                                           unsigned long bits)
{
    return bits;
}

static inline unsigned long create_zero_mask(unsigned long bits)
{
    bits = (bits - 1) & ~bits;
    return bits >> 7;
}

static inline unsigned int find_zero(unsigned long mask)
{
    return mask*0x0001020304050608ul >> 56;
}

int string_len(char *str)
{
    unsigned long *l_str = (unsigned long *)str;
    unsigned int bytes = 0, zero_byte;
    unsigned long bits;

    while (1)
    {
        bits = 0;
        if (has_zero(*l_str, &bits))
        {
            bits = prep_zero_mask(0, bits);
            bits = create_zero_mask(bits);
            zero_byte = find_zero(bits);

            return zero_byte + bytes;
        }

        bytes += sizeof(unsigned long);
        l_str++;
    }
}

So, whatever benefits the structure abstraction provides aren't in evidence on x86-64, at least with GCC 4.4.5 or 4.6.0. At least it doesn't slow the code down any. I guess I'd like to hear more about the intended benefits of this structure (and what platform(s) benefit), since I clearly didn't understand it completely on first examination.

Anyone have a pointer to a thread somewhere that discusses the struct and its intended benefits more directly?


Compiler output from 4.6.0 with -O3 -fomit-frame-pointer for the curious:

string_len:
.LFB4:
    .cfi_startproc
    movq    (%rdi), %rax
    movabsq $-72340172838076673, %r8     ; this is 0xFEFEFEFEFEFEFEFFul
    movabsq $-9187201950435737472, %rsi  ; this is 0x8080808080808080ul
    leaq    (%rax,%r8), %rdx
    notq    %rax
    andq    %rax, %rdx
    xorl    %eax, %eax
    andq    %rsi, %rdx
    jne .L3
    .p2align 4,,10
    .p2align 3
.L5:
    movq    8(%rdi), %rcx
    addl    $8, %eax
    addq    $8, %rdi
    leaq    (%rcx,%r8), %rdx
    notq    %rcx
    andq    %rcx, %rdx
    andq    %rsi, %rdx
    je  .L5
.L3:
    movq    %rdx, %rcx
    subq    $1, %rdx
    notq    %rcx
    andq    %rdx, %rcx
    movabsq $283686952306184, %rdx  ; this is 0x01020304050608ul
    shrq    $7, %rcx
    imulq   %rdx, %rcx
    shrq    $56, %rcx
    addl    %ecx, %eax
    ret

Interestingly, -O2 -Os -fomit-frame-pointer generated somewhat different code, but still generated identical code for both versions of the functions:

string_len:
.LFB4:
    .cfi_startproc
    xorl    %edx, %edx
    movabsq $-72340172838076673, %rax    ; this is 0xFEFEFEFEFEFEFEFFul
    movabsq $-9187201950435737472, %rsi  ; this is 0x8080808080808080ul
.L2:
    movq    (%rdi,%rdx), %r8
    movl    %edx, %r9d
    addq    $8, %rdx
    leaq    (%r8,%rax), %rcx
    notq    %r8
    andq    %r8, %rcx
    andq    %rsi, %rcx
    je  .L2
    movq    %rcx, %rax
    decq    %rcx
    movabsq $283686952306184, %rdx  ; this is 0x01020304050608ul
    notq    %rax
    andq    %rcx, %rax
    shrq    $7, %rax
    imulq   %rdx, %rax
    shrq    $56, %rax
    addl    %r9d, %eax
    ret


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.