Saturday, April 12, 2014

Inline assembler stupidity

I keep getting caught by this, because it's a perfect example of the compiler doing something contrary to what you actually wrote.
  asm volatile (
                "ldr %0, [%1]\n\t"
                "add %0, %0, #1\n\t"
                "str %0, [%1]\n\t"
                : "=r" (tmp)
                : "r" (p)
                );

Guess what this gets compiled to?
  30: f9400000  ldr x0, [x0]
  34: 91000400  add x0, x0, #0x1
  38: f9000000  str x0, [x0]

...equivalent to, of course,
  asm volatile (
                "ldr %0, [%0]\n\t"
                "add %0, %0, #1\n\t"
                "str %0, [%0]\n\t"
                : "+r" (p)
                );
This sort of aggressive and non-obvious optimization is crazy, because if I had really wanted that generated code, I'd have written the inline asm the second way, with a read-write modifier. Maybe for architectures with few, specialized registers this is a reasonable approach, but for RISCish instruction sets with large register files it's nuts. There should be a warning option for this nonsense.

The "correct way" is to use an earlyclobber modifier, which tells the compiler that the output is written before the inputs are finished being read, so it must not assign them to the same register.
  asm volatile (
                "ldr %0, [%1]\n\t"
                "add %0, %0, #1\n\t"
                "str %0, [%1]\n\t"
                : "=&r" (tmp)
                : "r" (p)
                );
IMO anything that needs a separate paragraph in third-party documentation as "a caveat" needs to be fixed.

Speaking of which... Given that C really is a high-level assembly, why not finally standardize on inline asm?

Wednesday, April 2, 2014

Exotic QEMU bugs and fixes

I found that the linux-user portion of QEMU has a few bugs around signals. Really, around handling "self-modifying" code and having the code generator step on unmapped memory.

The test is pretty simple. Have a page of memory containing one instruction which will cause SIGILL to be delivered, followed by a 'ret'. On a SIGILL, unmap the page. On a SIGSEGV, map the page back in. I have two of these tests: one with actual mmap/munmap, and another with mprotect. The tests verify corner conditions in the binary-translation logic: back-to-back signals and an attempt to execute unmapped code.

"self-modifying" code sounds grand, but it's just the signal return path. While newer Linux kernels use VDSO symbols for the restorer (that's the part that does the sigreturn syscall), QEMU still creates an on-the-stack trampoline. When QEMU creates a translation block for the trampoline, it marks the page internally as read-only so that it can detect when the TB should be invalidated. It is this latter logic which was short-circuiting and exiting earlier than needed.
That's fixed in

The second problem is that QEMU doesn't deal very well with being forced to run code that's unmapped. The TCG generator walks over the unmapped memory and gets a SIGSEGV, and QEMU then attempts to deliver the signal to the translated program (which, again, means finding and/or creating more TBs). The problem, though, is that this attempts to reacquire tcg_ctx.tb_ctx.tb_lock, which was never dropped because of the signal. I.e., after a SIGSEGV here:
#0  disas_a64_insn (s=0x7fffffffdc40, env=<optimized out>) at /target-arm/translate-a64.c:8972
#1  gen_intermediate_code_internal_a64 (cpu=cpu@entry=0x62532200, tb=tb@entry=0x7ffff440b120, search_pc=search_pc@entry=false) at /target-arm/translate-a64.c:9097
#2  0x00000000600d76e5 in gen_intermediate_code_internal (search_pc=false, tb=0x7ffff440b120, cpu=0x62532200) at /target-arm/translate.c:10629
#3  gen_intermediate_code (env=env@entry=0x6253a468, tb=tb@entry=0x7ffff440b120) at /target-arm/translate.c:10904
#4  0x00000000600e4851 in cpu_arm_gen_code (env=env@entry=0x6253a468, tb=tb@entry=0x7ffff440b120, gen_code_size_ptr=gen_code_size_ptr@entry=0x7fffffffdd64) at /translate-all.c:159
#5  0x00000000600e5152 in tb_gen_code (cpu=cpu@entry=0x62532200, pc=pc@entry=4820992, cs_base=cs_base@entry=0, flags=<optimized out>, cflags=cflags@entry=0) at /translate-all.c:973
#6  0x0000000060040e7a in tb_find_slow (flags=<optimized out>, pc=4820992, env=0x6253a468, cs_base=<optimized out>) at /cpu-exec.c:162
#7  tb_find_fast (env=0x6253a468) at /cpu-exec.c:193
#8  cpu_arm_exec (env=env@entry=0x6253a468) at /cpu-exec.c:611
#9  0x000000006005ad2c in cpu_loop (env=env@entry=0x6253a468) at /linux-user/main.c:1015
#10 0x0000000060004dd1 in main (argc=1, argv=<optimized out>, envp=<optimized out>) at /linux-user/main.c:4392

We longjmp back to the CPU loop and deadlock here:
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
#1  0x000000006012991d in _L_lock_858 ()
#2  0x000000006012978a in __pthread_mutex_lock (mutex=0x604ffa98 <tcg_ctx+350904>) at pthread_mutex_lock.c:61
#3  0x0000000060040bfd in cpu_arm_exec (env=env@entry=0x6253a228) at /cpu-exec.c:610
#4  0x000000006005ad2c in cpu_loop (env=env@entry=0x6253a228) at /linux-user/main.c:1015
#5  0x0000000060004dd1 in main (argc=1, argv=<optimized out>, envp=<optimized out>) at /linux-user/main.c:4392
The solution is to allow tb_gen_code to back out if it knows it can't read the memory. A new exception type, EXCP_TB_EFAULT, is added, which then needs to be handled just like an address fault inside cpu_loop.

This makes the above tests pass on AArch64 and x86 (32-bit only, since there is no signal handling support for the x86_64-linux-user target at the moment).

Update: the fixes look like they're going in. The TCG deadlock is being fixed in a simpler way, which is a better and more self-contained fix.