Showing posts with label AArch64. Show all posts

Sunday, May 8, 2016

64-bit ARM OS/Kernel/Systems Development Demo on an nVidia Shield TV (Tegra X1)

64-bit ARM OS/Kernel/Systems Development on an nVidia Shield TV (Tegra X1)

The Shield TV is based on the 64-bit nVidia X1 chip. Unlike the K1, this is actually a Cortex-A57 based design, instead of being based on the nVidia "Denver" design. That by itself is kind of interesting already. The Shield TV was available much much earlier than the X1-based nVidia development board (Jetson TX1, you can even buy it on Amazon), and costs about a third of the TX1. The Shield TV allows performing an unlock via "fastboot oem unlock", allowing custom OS images to be booted. Unlike the TX1, you don't get a UART (and I haven't found the UART pads yet, either).

What this is

https://github.com/andreiw/shieldTV_demo

This is a small demo, demonstrating how to build and boot arbitrary code on your Tegra Shield TV. Unlike the previous Tegra K1 demo, you get EL2 (hypervisor mode!).

What you need

  • A Shield TV, unlocked. Search YouTube for walkthroughs.
  • Shield OS version >= 1.3.
  • GNU Make.
  • An AArch64 GNU toolchain.
  • ADB/Fastboot tools.
  • Bootimg tools (https://github.com/pbatard/bootimg-tools), built and somewhere in your path.
  • An HDMI-capable screen. Note: this means a native HDMI input, not a DVI-HDMI adapter. You want the firmware to configure the screen into 1920x1080 mode; otherwise you'll end up in 640x480, and we don't want that...

How to build

$ CROSS_COMPILE=aarch64-linux-gnu- make
...should yield 'shieldTV_demo'.

How to boot

  1. Connect the Shield TV via a USB cable to your dev workstation.
  2. Reboot device via:
    $ adb reboot-bootloader
    ...you should now see the nVidia splash screen, followed by the boot menu.
  3. If OS is 1.3, you can simply:
    $ fastboot boot shieldTV_demo
  4. If OS is 1.4 or 2.1, you will need to:
    $ fastboot flash recovery shieldTV_demo
    ...and then boot the "recovery kernel" by following instructions on screen.
The code will now start. You will see text and some diagonal lines drawn on a black background. The text should say we're at EL2 and the lines should be green. The drawing will be slow: the MMU is off, so the caches are disabled.

Let me know if it's interesting to see the MMU setup code.

Final thoughts

The Shield TV is a better deal than the TX1 for the average hobbyist, even with the missing UART. For the price it sells at, the TX1 should come with a decent amount of RAM, not just 1GB more than the Shield TV. nVidia... are you listening? Uncripple your firmware so booting custom images is not a song-and-dance (you broke it in 1.4!) and at least TELL us where the UART pads are on the motherboard. If you're really cool, put together an "official" Ubuntu image that runs on both the TX1 and the Shield (and fix SCR_EL3.HCE, too).

Friday, November 28, 2014

Using the Nexus 9 secure agent for debug logging

#!/usr/bin/python

import fileinput, re, sys

#
# It turns out the "Trusty Secure OS" Crippleware on the Nexus 9 is
# good for at least something. It is thankfully pretty chatty, meaning
# you can use it for logging from code where it's inconvenient
# or impossible to write to the UART directly, like MMU bringup code ;-).
#
# A sequence like:
#   mov x0, #'V'
#   smc #0xFFFF
#
# ...will result in the following getting emitted. I am guessing x1...x3
# get printed here as param0..2 but I am too lazy to check.
#
# smc_undefined:67: Undefined monitor call!
# smc_undefined:67: SMC: 0x56 (Stdcall entity 0 function 0x56)
# smc_undefined:67: param0: 0xf77c2e69
# smc_undefined:67: param1: 0xf77c2e68
# smc_undefined:67: param2: 0x0
#
# Now you can do basic logging to debug early bring-up. The following
# Python will turn your giant Minicom capture into something more
# sensible.
#

def process(line):
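    # Raw string so that '\s' is a regex escape, not a Python string escape.
    m = re.match(r'\s*smc_undefined:67: SMC: (0x[0-9a-f]+)', line)
    if m:
        # The SMC number is the character that was loaded into x0.
        sys.stdout.write(chr(int(m.group(1), 16)))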

for line in fileinput.input():
    process(line)

# Finish with a newline so the shell prompt starts on a fresh line.
print()
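As a quick sanity check, the same regular expression can be run over the sample output quoted in the comment above. This is a minimal sketch: `decode_capture` is a hypothetical helper name, and the `smc_undefined:67` prefix is simply what this particular firmware printed, so it may well differ on other builds.

```python
import re

# Same pattern as the script above. The "67" is whatever source line this
# particular Trusty build happened to log, so treat it as an assumption.
SMC_LINE = re.compile(r'\s*smc_undefined:67: SMC: (0x[0-9a-f]+)')


def decode_capture(lines):
    """Return the characters logged via 'smc' calls in a serial capture."""
    chars = []
    for line in lines:
        m = SMC_LINE.match(line)
        if m:
            chars.append(chr(int(m.group(1), 16)))
    return ''.join(chars)


# The five lines emitted for a single "mov x0, #'V'; smc #0xFFFF".
sample = [
    "smc_undefined:67: Undefined monitor call!",
    "smc_undefined:67: SMC: 0x56 (Stdcall entity 0 function 0x56)",
    "smc_undefined:67: param0: 0xf77c2e69",
    "smc_undefined:67: param1: 0xf77c2e68",
    "smc_undefined:67: param2: 0x0",
]

decoded = decode_capture(sample)
print(decoded)
```

Only the "SMC:" line matches; the "Undefined monitor call!" and "paramN" lines are skipped, so each smc yields exactly one character.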

Sunday, November 23, 2014

64-bit ARM OS/Kernel/Systems Development Demo on a Nexus 9

64-bit ARM OS/Kernel/Systems Development on a Nexus 9

The Nexus 9 is based on a 64-bit nVidia K1 chip. At the moment it is the most affordable (price-wise) and accessible (unit-wise) platform for exploring OS work on AArch64. The Nexus 9 allows performing an unlock via "fastboot oem unlock", allowing custom Android images to be booted.
https://github.com/andreiw/nexus9_demo

What this is

This is a small demo, demonstrating how to build and boot arbitrary code on your Nexus 9 and do some basic I/O. The demo demonstrates serial I/O and draws two black diagonal lines on the framebuffer.

What you need - required

What you need - optional

How it works

HBOOT, the Nexus bootloader, expects images to be in a certain format. The booted kernel/code must:
  • Be 64-bit
  • Be binary (not ELF)
  • Be linked at 0x80080000
  • Be compressed using "gzip"
  • Be followed by the binary FDT
  • Be contained in an "ANDROID!" boot image.
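
The payload layout in the list above can be sketched in a few lines of Python. This is only an illustration of "gzip-compressed flat binary immediately followed by the FDT", not the demo's actual build script. `build_payload` is a hypothetical name, and wrapping the result in the "ANDROID!" boot image is still left to a separate boot image tool.

```python
import gzip
import struct

# Big-endian magic number at the start of a flattened device tree blob.
FDT_MAGIC = 0xD00DFEED


def build_payload(kernel_bin: bytes, fdt: bytes) -> bytes:
    """Return gzip(kernel) followed directly by the binary FDT."""
    if kernel_bin[:4] == b'\x7fELF':
        raise ValueError("kernel must be a flat binary, not ELF")
    if len(fdt) < 4 or struct.unpack('>I', fdt[:4])[0] != FDT_MAGIC:
        raise ValueError("second blob does not look like a binary FDT")
    return gzip.compress(kernel_bin) + fdt


# Tiny stand-ins for a real kernel image and device tree blob.
fake_kernel = b'\x00' * 64
fake_fdt = struct.pack('>I', FDT_MAGIC) + b'\x00' * 36

payload = build_payload(fake_kernel, fake_fdt)
print(len(payload), payload[:2].hex())
```

The two checks only catch the most obvious mistakes (feeding in an ELF file, or a device tree source instead of the compiled blob). The link address still has to be set in the linker script.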

Some notes:

  • The link address appears to be hardcoded in HBOOT. The Android boot image bases and the AArch64 kernel header fields appear to be ignored.
  • The boot image can contain an additional ramdisk/initrd/payload.
  • The FDT is patched by HBOOT to contain correct linux,initrd-start and linux,initrd-end addresses.

How to build

$ CROSS_COMPILE=aarch64-linux-gnu- make

How to boot

Connect your Android tablet via a USB cable. Optionally connect the UART headphone jack adapter to your computer. The settings are 115200 8-n-1.
$ adb reboot-bootloader
$ fastboot boot nexus9_demo

Actual output of the demo

Hello!
CurrentEL = 0000000000000001
SCTLR_EL1 = 0000000010C5083A
Bye!
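
The SCTLR_EL1 value is worth decoding, since it shows the state the CPU is in when the demo gets control. A minimal sketch, using the architectural bit positions for M (MMU enable, bit 0), A (alignment check, bit 1), C (data cache enable, bit 2) and I (instruction cache enable, bit 12). The helper name is made up for illustration.

```python
# Architectural SCTLR_EL1 bit positions for the fields of interest.
SCTLR_BITS = {
    'M': 0,    # stage 1 MMU enable
    'A': 1,    # alignment fault checking
    'C': 2,    # data/unified cache enable
    'I': 12,   # instruction cache enable
}


def decode_sctlr(value: int) -> dict:
    """Map each named SCTLR_EL1 bit to True (set) or False (clear)."""
    return {name: bool((value >> bit) & 1) for name, bit in SCTLR_BITS.items()}


flags = decode_sctlr(0x10C5083A)
for name, is_set in flags.items():
    print(name, int(is_set))
```

For the value printed above, M, C and I all come out clear, so the MMU and both caches are off on entry, while alignment checking (A) is on.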

Where to go from here

"nexus9_dts" is the decompiled "nexus9_dtb". "nexus9_dtb" was extracted from the Android boot.img.

Final thoughts

From studying the Tegra K1 TRM, the K1 should have virtualization support (i.e. EL2). However, the HTC firmware does not allow booting an EL2-enabled OS. All kernels are booted in EL1. This is rather unfortunate and prevents playing around with KVM and Xen on this platform. Perhaps there are some problems with EL2 support. Or perhaps HTC/nVidia/Google were too myopic to allow EL2 access. It's unclear if the "oem unlock" allows reflashing custom unsigned firmware. "nvtboot" seems to enforce signed "Trusted OS" payloads, at least from dumping the strings. The boot flow looks something like this:
  • "nvtboot" (32-bit) runs on the AVP/COP.
  • "nvtboot" loads "tos" (64-bit) (Trusty aka Secure OS) on the AArch64 chip.
  • "tos" loads HBOOT (32-bit).
  • HBOOT loads Android and implements the fastboot protocol.
It's unclear how to enter NVFlash/APX mode, or how helpful that would be.

Saturday, April 12, 2014

Inline assembler stupidity

I keep getting caught by this, because this is a perfect example of the compiler doing something contrary to what you're writing.
  asm volatile (
                "ldr %0, [%1]\n\t"
                "add %0, %0, #1\n\t"
                "str %0, [%1]\n\t"
                : "=r" (tmp)
                : "r" (p)
                :
                );

Guess what this gets compiled to?
  30: f9400000  ldr x0, [x0]
  34: 91000400  add x0, x0, #0x1
  38: f9000000  str x0, [x0]

...equivalent to, of course,
  asm volatile (
                "ldr %0, [%0]\n\t"
                "add %0, %0, #1\n\t"
                "str %0, [%0]\n\t"
                : "+r" (p)
                :
                :
                );
This sort of aggressive and non-obvious optimization is crazy, because if I really wanted the generated code, I'd have written the inline asm the second way, with a read-write modifier. The compiler assumes all inputs are consumed before any output is written, so it feels free to put tmp and p in the same register. Maybe for architectures with very few, specialized registers this is a reasonable approach, but for RISCish instruction sets with large register files this is nuts. There should be a warning option for this nonsense.

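The "correct way" is to use an earlyclobber modifier (&), which tells the compiler that the output is written before the inputs are finished with, so it must not share a register with any of them.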
  asm volatile (
                "ldr %0, [%1]\n\t"
                "add %0, %0, #1\n\t"
                "str %0, [%1]\n\t"
                : "=&r" (tmp)
                : "r" (p)
                :
                );
IMO anything that needs a separate paragraph in third-party documents as "a caveat" needs to be fixed.

Speaking of which... Given that C really is a high-level assembly, why not finally standardize on inline asm?

Wednesday, April 2, 2014

Exotic QEMU bugs and fixes

I found that the linux-user portion of QEMU has a few bugs around signals. Really, around handling "self-modifying" code and having the code generator step on unmapped memory.

The test is pretty simple.  Have a page of memory containing one instruction which will cause SIGILL to be delivered, followed by a 'ret'. On a SIGILL, unmap the page. On a SIGSEGV, map the page back in. I've two of these tests - one with actual mmap/munmap, and another with mprotect. The tests verify corner conditions in the binary translation logic, with back-to-back signals and an attempt to execute unmapped code.
https://github.com/andreiw/andreiw-wip/blob/master/qemu/tests/sigtest.c
https://github.com/andreiw/andreiw-wip/blob/master/qemu/tests/sigtest_mprotect.c

"self-modifying" code sounds grand, but it's just the signal return path. While newer Linux kernels use VDSO symbols for the restorer (that's the part that does the sigreturn syscall), QEMU still creates an on-the-stack trampoline. When QEMU creates a translation block for the trampoline, it marks the page internally as read-only so that it can detect when the TB should be invalidated. It is this latter logic which was short-circuiting and exiting earlier than needed.
That's fixed in https://github.com/andreiw/andreiw-wip/blob/master/qemu/0001-qemu-fix-page_check_range.patch

The second problem is that QEMU doesn't deal very well with being forced to run code that's unmapped. The TCG generator walks over the unmapped memory, gets a SIGSEGV, which attempts delivery of the signal to the translated program (which again, means getting and/or creating more TBs). The problem, though, is that we attempt to reacquire the tcg_ctx.tb_ctx.tb_lock, which we never dropped due to the signal. i.e. after a SIGSEGV here:
#0  disas_a64_insn (s=0x7fffffffdc40, env=<optimized out>) at /target-arm/translate-a64.c:8972
#1  gen_intermediate_code_internal_a64 (cpu=cpu@entry=0x62532200, tb=tb@entry=0x7ffff440b120, search_pc=search_pc@entry=false) at /target-arm/translate-a64.c:9097
#2  0x00000000600d76e5 in gen_intermediate_code_internal (search_pc=false, tb=0x7ffff440b120, cpu=0x62532200) at /target-arm/translate.c:10629
#3  gen_intermediate_code (env=env@entry=0x6253a468, tb=tb@entry=0x7ffff440b120) at /target-arm/translate.c:10904
#4  0x00000000600e4851 in cpu_arm_gen_code (env=env@entry=0x6253a468, tb=tb@entry=0x7ffff440b120, gen_code_size_ptr=gen_code_size_ptr@entry=0x7fffffffdd64) at /translate-all.c:159
#5  0x00000000600e5152 in tb_gen_code (cpu=cpu@entry=0x62532200, pc=pc@entry=4820992, cs_base=cs_base@entry=0, flags=<optimized out>, cflags=cflags@entry=0) at /translate-all.c:973
#6  0x0000000060040e7a in tb_find_slow (flags=<optimized out>, pc=4820992, env=0x6253a468, cs_base=<optimized out>) at /cpu-exec.c:162
#7  tb_find_fast (env=0x6253a468) at /cpu-exec.c:193
#8  cpu_arm_exec (env=env@entry=0x6253a468) at /cpu-exec.c:611
#9  0x000000006005ad2c in cpu_loop (env=env@entry=0x6253a468) at /linux-user/main.c:1015
#10 0x0000000060004dd1 in main (argc=1, argv=<optimized out>, envp=<optimized out>) at /linux-user/main.c:4392

We longjmp back to the CPU loop and deadlock here:
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
#1  0x000000006012991d in _L_lock_858 ()
#2  0x000000006012978a in __pthread_mutex_lock (mutex=0x604ffa98 <tcg_ctx+350904>) at pthread_mutex_lock.c:61
#3  0x0000000060040bfd in cpu_arm_exec (env=env@entry=0x6253a228) at /cpu-exec.c:610
#4  0x000000006005ad2c in cpu_loop (env=env@entry=0x6253a228) at /linux-user/main.c:1015
#5  0x0000000060004dd1 in main (argc=1, argv=<optimized out>, envp=<optimized out>) at /linux-user/main.c:4392
The solution is to allow tb_gen_code to back out if it knows it can't read the memory. A new exception type is added, EXCP_TB_EFAULT, which then needs to be handled just like an address fault inside cpu_loop.
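
The failure mode and the shape of the fix can be modelled in a few lines of Python. This is a toy model only, not QEMU code: a non-reentrant `threading.Lock` stands in for tb_lock, an exception stands in for the longjmp back to the CPU loop, and all function names are invented for the illustration.

```python
import threading


class GuestFault(Exception):
    """Stands in for the longjmp out of the host SIGSEGV handler."""


def tb_gen_code_buggy(lock, mapped):
    lock.acquire()
    if not mapped:
        # Fault in the middle of translation: control leaves this
        # function with the lock still held.
        raise GuestFault()
    lock.release()
    return 'ok'


def tb_gen_code_fixed(lock, mapped):
    lock.acquire()
    try:
        if not mapped:
            # Back out cleanly and let the caller treat this like an
            # address fault (the EXCP_TB_EFAULT idea).
            return 'EXCP_TB_EFAULT'
        return 'ok'
    finally:
        lock.release()


def can_relock(lock):
    """True if the CPU loop could take the lock again; False models the deadlock."""
    got = lock.acquire(timeout=0.1)
    if got:
        lock.release()
    return got


buggy_lock = threading.Lock()
try:
    tb_gen_code_buggy(buggy_lock, mapped=False)
except GuestFault:
    pass
buggy_ok = can_relock(buggy_lock)

fixed_lock = threading.Lock()
result = tb_gen_code_fixed(fixed_lock, mapped=False)
fixed_ok = can_relock(fixed_lock)

print(buggy_ok, result, fixed_ok)
```

In the buggy path the second acquire can never succeed (the timeout is only there so the model terminates), which is the deadlock in the backtrace above. In the fixed path the lock is always released before the fault is reported.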

https://github.com/andreiw/andreiw-wip/blob/master/qemu/0002-qemu-handle-tb_gen_code-getting-called-for-unmapped-.patch
https://github.com/andreiw/andreiw-wip/blob/master/qemu/0003-x86-implement-EXCP_TB_EFAULT.patch

This makes the above tests pass on AArch64 and x86 (32-bit only, since there is no signal handling support for the x86_64-linux-user target at the moment).

Update: Fixes look like they're going in. The TCG deadlock is getting fixed in a simpler way. It is a better and more self-contained fix. http://www.mail-archive.com/qemu-devel@nongnu.org/msg225421.html