Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

assembly - Why does GCC choose dword movl to copy a long shift count to CL?

In the third chapter of Computer Systems: A Programmer's Perspective, an example program is given in the discussion of shift operations:

long shift_left4_rightn(long x, long n)
{
    x <<= 4;
    x >>= n;
    return x;
}

Its assembly code is as follows (reproducible with GCC 10.2 -O1 for x86-64 on the Godbolt compiler explorer; -O2 schedules the instructions in a different order but still uses movl to ECX):

shift_left4_rightn:
  endbr64
  movq   %rdi, %rax    # Get x
  salq   $4, %rax      # x <<= 4
  movl   %esi, %ecx    # Get n
  sarq   %cl, %rax     # x >>= n
  ret

I wonder why the instruction that fetches n is movl %esi, %ecx instead of movq %rsi, %rcx, since n is a quad word.

On the other hand, movb %sil, %cl might be more suitable if optimization is considered, since the shift amount only uses the single-byte register %cl and the higher bits are all ignored.

As a result, I cannot figure out the reason for using "movl %esi, %ecx" when dealing with a long integer.
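
To make the "higher bits are ignored" point concrete, here is a minimal C sketch of what sarq %cl, %rax computes. The function name sarq_model is made up for illustration. It assumes the usual x86-64 behavior of masking a 64-bit shift count to its low 6 bits, and it relies on GCC and Clang implementing >> on a negative signed value as an arithmetic shift, which the C standard leaves implementation-defined.

```c
#include <stdint.h>

/* Hypothetical model of `sarq %cl, %rax`: the hardware uses only the
 * low 6 bits of the count for a 64-bit shift, so everything above bit 5
 * of the count register has no effect on the result. */
int64_t sarq_model(int64_t value, uint64_t count)
{
    /* Masking also keeps the C shift well defined (count < 64). */
    unsigned masked = (unsigned)(count & 63);

    /* Implementation-defined for negative values in ISO C;
     * GCC and Clang perform an arithmetic (sign-preserving) shift. */
    return value >> masked;
}
```

Under this model a count of 2, a count of 0x102, and a count with garbage in the upper 32 bits all shift by the same amount, which is why the width of the mov that loads the count does not change the result.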


1 Answer


Yes, GCC realizes that upper bits are ignored by sar.
Then movl is the natural consequence of applying two simple optimization rules:

  • Avoid writing partial registers (i.e. 8- or 16-bit, where writing merges into the old value instead of zero-extending). This matters for various reasons across different microarchitectures, including in this case a false dependency on the old value of RCX. (See the related Q&A "Why doesn't GCC use partial registers?")
  • Prefer 32-bit operand size because it's the default in x86-64 machine code, not needing any prefixes. And it's at least as fast as any other operand-size for any instruction.
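
The difference between merging and zero-extending writes can be modeled in C as a rough sketch. The function names below are hypothetical, and the model covers only the architectural result of each write, not any microarchitectural timing.

```c
#include <stdint.h>

/* Hypothetical model of `movb %sil, %cl`: only the low 8 bits change,
 * so the new RCX depends on the old RCX (a merge). */
uint64_t write_byte(uint64_t old_rcx, uint64_t src)
{
    return (old_rcx & ~UINT64_C(0xFF)) | (src & UINT64_C(0xFF));
}

/* Hypothetical model of `movw %si, %cx`: the low 16 bits change,
 * the upper 48 bits of the old value are kept. */
uint64_t write_word(uint64_t old_rcx, uint64_t src)
{
    return (old_rcx & ~UINT64_C(0xFFFF)) | (src & UINT64_C(0xFFFF));
}

/* Hypothetical model of `movl %esi, %ecx`: a 32-bit write zero-extends
 * into the full 64-bit register, so the old RCX is irrelevant. */
uint64_t write_dword(uint64_t old_rcx, uint64_t src)
{
    (void)old_rcx; /* deliberately unused: no dependency on the old value */
    return src & UINT64_C(0xFFFFFFFF);
}
```

In this model old_rcx is a real input to the byte and word writes but not to the dword write, which mirrors the false dependency the first rule is about.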

Fun fact: even if the arg had been uint8_t, compilers would still hopefully use movl %esi, %ecx. You might think that reading a wider register when the arg value is only in SIL could create a partial-register stall, but an unofficial extension to the x86-64 System V calling convention is that callers should zero- or sign-extend narrow args to at least 32 bits. So we can assume ESI was written with at least a 32-bit operation.

The specific downsides of some other choices:

  • movq %rsi, %rcx - waste of a REX prefix (code-size downside).
  • movb %sil, %cl - writes a partial register, and still needs a REX prefix to access SIL.
  • movzbl %sil, %ecx - code size: 2-byte opcode, and needs a REX to read SIL. Also, AMD CPUs only do mov-elimination (zero latency) for movl / movq, not movzx.
  • movw %si, %cx - zero advantages, needs an operand-size prefix and writes a partial register.
  • movzwl %si, %ecx - Tied with movq for code-size, but defeats mov-elimination even on Intel CPUs.
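
The code-size comparison above can be checked against the machine code. The table below is a sketch that lists one valid encoding per candidate, as I would expect an assembler to emit it. The struct and function names are made up for illustration, and an assembler may pick an equivalent encoding of the same length.

```c
#include <stddef.h>

/* One candidate instruction with one valid x86-64 encoding of it. */
struct encoding {
    const char *insn;
    unsigned char bytes[4];
    size_t len;
};

const struct encoding candidates[] = {
    { "movl   %esi, %ecx", { 0x89, 0xF1 },             2 }, /* no prefix          */
    { "movq   %rsi, %rcx", { 0x48, 0x89, 0xF1 },       3 }, /* REX.W              */
    { "movb   %sil, %cl",  { 0x40, 0x88, 0xF1 },       3 }, /* REX to reach SIL   */
    { "movzbl %sil, %ecx", { 0x40, 0x0F, 0xB6, 0xCE }, 4 }, /* REX + 2-byte op    */
    { "movw   %si, %cx",   { 0x66, 0x89, 0xF1 },       3 }, /* operand-size 0x66  */
    { "movzwl %si, %ecx",  { 0x0F, 0xB7, 0xCE },       3 }, /* 2-byte opcode      */
};

/* Return the index of the shortest encoding in the table. */
size_t shortest_index(void)
{
    size_t n = sizeof candidates / sizeof candidates[0];
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (candidates[i].len < candidates[best].len)
            best = i;
    }
    return best;
}
```

With these encodings movl is the only 2-byte option, and movzwl ties with movq at 3 bytes, matching the list above.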

Fun fact: if we pad with a dummy arg so n arrives in RDX, GCC still chooses movl %edx, %ecx, even though movb %dl, %cl is the same code-size (no REX needed to access DL). So yes, GCC is definitely avoiding byte operand-size.

Fun fact 2: Clang unfortunately does waste a REX on movq, missing this optimization. https://godbolt.org/z/6GWhMd

But if we make the count arg unsigned char, clang and GCC do both use movl instead of movb, fortunately. https://godbolt.org/z/e95WP8
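
For anyone who wants to reproduce the two experiments without following the links, the variants could look roughly like this. These are my reconstructions, not necessarily the exact source behind the Godbolt links, and the function names are hypothetical.

```c
/* Variant 1: a dummy second arg pushes n into the third integer
 * argument register (RDX in the x86-64 System V convention). */
long shift_left4_rightn_padded(long x, long dummy, long n)
{
    (void)dummy; /* only present to shift n from RSI to RDX */
    x <<= 4;
    x >>= n;
    return x;
}

/* Variant 2: the count is a narrow type, so only the low byte of the
 * argument register is meaningful to the callee. */
long shift_left4_rightn_u8(long x, unsigned char n)
{
    x <<= 4;
    x >>= n;
    return x;
}
```

Compiling these with -O1 or -O2 and reading the instruction that loads ECX is enough to see which operand size each compiler picks.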

