Wednesday, June 8, 2016

Disassembling NT system files

Most NT files are stripped. This means that trying to disassemble them is a bit annoying because there are no symbols available. Checked builds of NT came with the symbol files (e.g. support/debug/ppc/symbols/exe/ntoskrnl.dbg for ntoskrnl.exe), but tools like Microsoft's dumpbin or OpenWatcom's wdis don't use them.

Now there's https://github.com/andreiw/dbgsplice to add the COFF symbol table back!


Sadly, the OpenWatcom analogue (wdis) is quite buggy, so it's hard to recommend. It was a capable enough disassembler for setupldr and veneer.exe, but it gets horribly confused by complicated section layouts.

Of course the DBG files contain quite a bit more info, and the aux COFF symbols can be used to annotate code far more richly than dumpbin's output suggests.
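As an illustration (this is not dbgsplice's code, just a rough sketch per the PE/COFF spec of what consuming a spliced-in COFF symbol table looks like; error handling and bounds checks omitted):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static uint32_t rd32(const void *p) { uint32_t v; memcpy(&v, p, 4); return v; }

int main(int argc, char **argv)
{
    FILE *f = fopen(argv[1], "rb");
    uint8_t dos[0x40], fh[20];

    fread(dos, sizeof(dos), 1, f);
    fseek(f, rd32(dos + 0x3c) + 4, SEEK_SET);   /* e_lfanew + "PE\0\0" */
    fread(fh, sizeof(fh), 1, f);                /* IMAGE_FILE_HEADER */

    uint32_t sym_off = rd32(fh + 8);            /* PointerToSymbolTable */
    uint32_t sym_num = rd32(fh + 12);           /* NumberOfSymbols */

    uint8_t *syms = malloc(sym_num * 18);       /* each symbol record is 18 bytes */
    fseek(f, sym_off, SEEK_SET);
    fread(syms, 18, sym_num, f);

    /* The string table (long names) immediately follows the symbols; its
       first 4 bytes are its total size, and offsets include those 4 bytes. */
    uint32_t str_size;
    fread(&str_size, 4, 1, f);
    char *strtab = malloc(str_size);
    memcpy(strtab, &str_size, 4);
    fread(strtab + 4, 1, str_size - 4, f);

    for (uint32_t i = 0; i < sym_num; i++) {
        uint8_t *s = syms + i * 18;
        char shortname[9] = { 0 };
        const char *name;

        if (rd32(s) == 0)                       /* long name: string table offset */
            name = strtab + rd32(s + 4);
        else {
            memcpy(shortname, s, 8);            /* short name, not NUL-terminated */
            name = shortname;
        }
        printf("%08x %s\n", rd32(s + 8), name); /* Value, name */
        i += s[17];                             /* skip NumberOfAuxSymbols records */
    }
    return 0;
}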

Sunday, May 8, 2016

Easy creation of proxy DLL pragmas

Converting dumpbin DLL exports information to MSVC linker pragmas

Yes, a bit of a weird request. But imagine you want to create a dummy DLL that forwards all the existing symbols of another DLL. Of course you're not going to do it by hand.

You have dumpbin output that looks like:
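(something along these lines; the ordinals and hints are made up, and the exact columns vary by dumpbin version, but the awk below assumes the name lands in the third whitespace-separated field)

    ordinal hint   name

          8    0   NtClose
          9    1   NtOpenFile
        ...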

And you want:
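(one pragma per exported symbol, which is exactly what the printf in the awk below emits)

#pragma comment(linker, "/export:NtClose=ntdll.NtClose")
#pragma comment(linker, "/export:NtOpenFile=ntdll.NtOpenFile")
...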

I'm not an awk expert, but this works, except that I ran dumpbin on Windows and awk on OS X, hahah. But you get the gist...
dumpbin /exports C:\winnt\system32\ntdll.dll |
awk 'NR > 19 && $3 != "" { printf "#pragma comment(linker, \"/export:%s=ntdll.%s\")\n", $3, $3 }'
You might have to tweak the number of lines to skip, depending on your tools. I'm on MSVC 4.0 (hello '90s!).

64-bit ARM OS/Kernel/Systems Development Demo on an nVidia Shield TV (Tegra X1)

The Shield TV is based on the 64-bit nVidia X1 chip. Unlike the K1, this is actually a Cortex-A57 based design instead of the nVidia "Denver" design, which by itself is kind of interesting. The Shield TV was available much, much earlier than the X1-based nVidia development board (the Jetson TX1, which you can even buy on Amazon) and costs about a third as much. The Shield TV can be unlocked via "fastboot oem unlock", allowing custom OS images to be booted. Unlike the TX1, you don't get a UART (and I haven't found the UART pads yet, either).

What this is

https://github.com/andreiw/shieldTV_demo

This is a small demo, demonstrating how to build and boot arbitrary code on your Tegra Shield TV. Unlike the previous Tegra K1 demo, you get EL2 (hypervisor mode!).

You will need:
  • A Shield TV, unlocked. Search YouTube for walkthroughs.
  • Shield OS version >= 1.3.
  • GNU Make.
  • An AArch64 GNU toolchain.
  • ADB/Fastboot tools.
  • Bootimg tools (https://github.com/pbatard/bootimg-tools), built and somewhere in your path.
  • An HDMI-capable screen. Note: HDMI, not a DVI-HDMI adapter. You want the firmware to configure the screen into 1920x1080 mode; otherwise you'll be stuck in 640x480, and we don't want that...

How to build

$ CROSS_COMPILE=aarch64-linux-gnu- make
...should yield 'shieldTV_demo'.

How to boot

  1. Connect the Shield TV to your dev workstation with a USB cable.
  2. Reboot device via:
    $ adb reboot-bootloader
    ...you should now see the nVidia splash screen, followed by the boot menu.
  3. If the OS is 1.3, you can simply:
    $ fastboot boot shieldTV_demo
  4. If the OS is 1.4 or 2.1, you will need to:
    $ fastboot flash recovery shieldTV_demo
    ...and then boot the "recovery kernel" by following instructions on screen.
The code will now start. You will see text and some drawn diagonal lines on a black background. The text should say we're at EL2 and the lines should be green. The drawing will be slow - the MMU is off and the caches are thus disabled.
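This is not the demo's actual source, but for reference, checking the exception level on AArch64 boils down to reading CurrentEL, whose bits [3:2] hold the EL:

/* Sketch: CurrentEL[3:2] holds the exception level on AArch64. */
static inline unsigned int current_el(void)
{
        unsigned long el;

        asm volatile("mrs %0, CurrentEL" : "=r" (el));
        return (unsigned int) ((el >> 2) & 3);  /* 2 means we made it to EL2 */
}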

Let me know if it's interesting to see the MMU setup code.

Final thoughts

The Shield TV is a better deal than the TX1 for the average hobbyist, even with the missing UART. For the price it's sold at, the TX1 should come with a decent amount of RAM, not just 1GB more than the Shield TV. nVidia...are you listening? Uncripple your firmware so booting custom images is not a song-and-dance (you broke it in 1.4!) and at least TELL us where the UART pads are on the motherboard. If you're really cool, put together an "official" Ubuntu image that runs on the TX1 and the Shield (and fix SCR_EL3.HCE, too).

Saturday, May 7, 2016

Porting TianoCore to a new architecture

"UEFI" on...?

This article is the first in a series of posts touching on the general process of bringing up TianoCore EDK2 on an otherwise unsupported architecture. Maybe you want to support UEFI for your CPU architecture, or simply want a reasonable firmware environment. In either case, because UEFI is not actually defined for your architecture, you're going to have to do a bit more work than your typical platform bring-up. By the time you're done, you could become a perfect addition to the UEFI Forum and its specification committees...yeah!

This blog post and the ones following it continually refer to the ongoing PPC64LE Tiano port I am working on, available at https://github.com/andreiw/ppc64le-edk2/. Since everyone can read the fine code, this document mostly highlights the various steps performed throughout the commits. The git repo isn't perfect, though. Some changes ended up evolving over a few commits while I ironed things out and brought in more code. Hopefully I don't miss mentioning anything important.


TianoCore

Tiano is Intel's open-source (BSD-licensed) implementation of the UEFI specification. EDK2 is the second and current iteration of the implementation.

UEFI is officially supported on the IA32, X64, IPF, ARM and AARCH64 architectures; EDK2 has CPU support code for the x86 and ARM variants. There's a MIPS EDK1 port floating about, and now two PPC64LE OpenPower EDK2 ports, one of which ended up fueling this article...

I'm not going to focus on the architecture of either UEFI or Tiano. Good books have been written on the subject. Here's some tl;dr material, though, for the hopelessly impatient:
At least read the User Documentation and glance at the boot flow diagram. You should now be able to fetch, build and boot TianoCore using the emulation package and have a rough understanding of what it takes to get a build going via Conf/target.txt and Conf/tools_def.txt.
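If you just want to sanity-check your setup, the usual dance looks roughly like this (the package path below is the current EmulatorPkg; the architecture and toolchain tag come from Conf/target.txt, so adjust to taste):

$ make -C BaseTools
$ . edksetup.sh
$ build -p EmulatorPkg/EmulatorPkg.dsc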

Your Target

Your target is a 32-bit or 64-bit little-endian chip. I suppose big-endian is doable, but none of the Tiano code is endian safe and UEFI is strictly little-endian.

Development Environment

This assumes that you are doing development on Linux and are using an ELF toolchain and GCC compilers. You are going to need:

Basic Project Setup

Pick a short identifier for your architecture. Pick a name that's unused - for the Power8 port I picked PPC64. This tag will be used with the Tiano build scripts. The next step is to create a couple of Pkg directories that will contain our stuff. In my PPC64 port I initially went with a single-package solution, but I will be splitting it up into the more conventional layout, where platform-independent portions are in PPC64Pkg and platform-dependent parts (including build scripts for actual boards) are in PPC64PlatformPkg. You can refer to this commit as an example of the minimum required to build a dummy EFI application that does nothing.
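That dummy application really is as dumb as it sounds. Assuming the usual UefiApplicationEntryPoint plumbing in its INF, the whole source is roughly:

#include <Uefi.h>

/*
 * Minimal "does nothing" UEFI application. The INF pulls in
 * UefiApplicationEntryPoint, which ends up calling UefiMain.
 */
EFI_STATUS
EFIAPI
UefiMain (
  IN EFI_HANDLE        ImageHandle,
  IN EFI_SYSTEM_TABLE  *SystemTable
  )
{
  return EFI_SUCCESS;
}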

If your architecture were already supported, a build command like this for your package would succeed:
build -p YourArchPlatformPkg/YourArchPlatformPkg.dsc

Build Infrastructure

The next step is to enable the build infrastructure and scripts to understand your architecture identifier. Here's a list of files I had to modify - this is all stuff under BaseTools/Source/Python and the changes are incredibly mechanical.
  • Common/DataType.py
  • Common/EdkIIWorkspaceBuild.py
  • Common/FdfParserLite.py
  • Common/MigrationUtilities.py
  • CommonDataClass/CommonClass.py
  • CommonDataClass/ModuleClass.py
  • CommonDataClass/PlatformClass.py
  • GenFds/FdfParser.py
  • GenFds/FfsInfStatement.py
  • GenFds/GenFds.py
  • TargetTool/TargetTool.py
  • build/build.py

Build tools

UEFI executables are PE/COFF files. Since we are building on Linux, EDK2 uses a workflow where the ELF artifacts produced by the cross-compiler are converted into PE32/PE32+ files with the GenFw tool. The PE/COFF artifacts are then wrapped into FFS objects and assembled into what is known as an FV ("firmware volume"); multiple FVs are put into an FD. The FV is really a flat file system that uses GUIDs for everything and can store other types of objects as well. You could also generate what is known as a TE (terse executable), which is basically a PE/COFF image with a cut-down header.

Tiano deals with several kinds of executables. The UEFI runtime (DXE core, UEFI drivers, and so on) are relocated as they are loaded, while the code that runs prior to the UEFI runtime itself is XIP (execute-in-place) and is thus pre-relocated to fixed addresses by the tool constructing the FV. The point behind this being that such pre-UEFI code (which is the SEC and PEI phases for Tiano), is run in an environment before the DRAM is available.

Thus we need to enable GenFw to create PE/COFF executables from ELF for our architecture. This ties into the compiler options we're going to use, which is not something we've addressed yet. It helps to understand that a PE/COFF file is basically a position-independent executable. Although there is a "preferred" link address that all symbols are relocated against, the PE/COFF image contains a sufficient amount of relocation information, known as "base relocations", to allow loading at any address. So it would appear that the easiest approach is to generate a position-independent ET_DYN ELF executable (with the -pie flag to the linker) and then focus on converting the output to PE/COFF. This is the approach I highly suggest adopting. You will only have to deal with a single relocation type (R_PPC64_RELATIVE in my case) that maps naturally to either the 64-bit or the 32-bit COFF base relocation type, depending on the bit width of your architecture.
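This isn't GenFw's actual code, just a sketch of the essential mapping under that PIE approach: each R_<ARCH>_RELATIVE relocation becomes one PE base relocation of the DIR64 (or HIGHLOW, for 32-bit) type at the same RVA:

#include <stdint.h>

/* PE base relocation types, from the PE/COFF spec. */
#define IMAGE_REL_BASED_HIGHLOW  3    /* 32-bit slots */
#define IMAGE_REL_BASED_DIR64    10   /* 64-bit slots */

/* Base relocations are grouped into blocks, one per 4K page. */
typedef struct {
    uint32_t virtual_address;         /* page RVA: r_offset & ~0xfffULL */
    uint32_t size_of_block;           /* header plus entries, in bytes */
} pe_base_reloc_block;

/*
 * One ELF R_<ARCH>_RELATIVE relocation at RVA r_offset becomes one
 * 16-bit entry in that page's block: type in the top 4 bits, page
 * offset in the low 12 bits.
 */
static uint16_t relative_to_base_reloc(uint64_t r_offset, int is_64bit)
{
    uint16_t type = is_64bit ? IMAGE_REL_BASED_DIR64 : IMAGE_REL_BASED_HIGHLOW;

    return (uint16_t)((type << 12) | (r_offset & 0xfff));
}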

Other approaches are possible, such as embedding all relocations with the --emit-relocs linker flag and dealing with the entire soup of crazy relocs later, but the success of this approach is highly dependent on the architecture and ABI. It may be impossible to convert to PE/COFF due to a mismatch between ELF and COFF base relocs and ABI issues. When I started the PPC64 port, I followed the AArch64 approach, which did just this, and ended up being forced to use the older (and not really meant for LE) ELFv1 ABI. Don't do it.

Note that depending on the ABI, you may have to do a bit of tool work. I am guessing this was the reason why the AArch64 port never adopted PIE ELF binaries. Fortunately, you should be able to follow along with my changes to Elf64Convert.c. You may also have to make changes to the base linker script used with the GNU toolchains.

Don't forget that you will need to manually rebuild the BaseTools if you make any changes!
make -C BaseTools

At this point we can go back and figure out the compile options. These live in the BaseTools/Conf/tools_def.template file, which edksetup.sh copies to Conf/ on freshly checked-out trees. The compiler options heavily depend on your architecture, of course, but generally speaking (a sketch of what such an entry looks like follows the list):
  • build on top of definitions made for new architectures like AArch64, because there are simply fewer of them in this file to wrap your mind around
  • consider the PPC64 definitions in my tree
  • -pie, unless position-independent executables don't work for you for some reason
  • large model
  • soft float (you can always move to hard float later if that floats your boat, but it's just more CPU state to wrap your head around)
  • PECOFF_HEADER_SIZE=0x228 for 64-bit chips, 0x220 for 32-bit ones
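For a feel of the shape of these entries, here's a heavily simplified sketch. The toolchain tag, flags and values are purely illustrative; crib the real ones from the AArch64 entries or my PPC64 tree:

DEFINE GCC_PPC64_CC_FLAGS  = DEF(GCC_ALL_CC_FLAGS) -fpie -mcmodel=large -msoft-float
*_GCC49_PPC64_CC_PATH      = ENV(GCC49_PPC64_PREFIX)gcc
*_GCC49_PPC64_CC_FLAGS     = DEF(GCC_PPC64_CC_FLAGS)
*_GCC49_PPC64_DLINK_FLAGS  = -pie
# ...plus however PECOFF_HEADER_SIZE gets fed to the linker script in your tree.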
This is the point where trying to build again should start giving you compile errors, because we still haven't modified any of the Tiano include and library files to be aware of the new architecture.

To be continued.

Thursday, October 1, 2015

Toying around with LE PowerPC64 via the PowerNV QEMU

I've validated that my ppc64le_hello example runs on top of BenH's PowerNV QEMU tree. Runs really snappy!

The only thing that doesn't work is mixed page-size segment support (MPSS, e.g. 16MB pages in a 4K segment); QEMU does not support MPSS at the moment. Also, QEMU does not implement any of the IBM simulator's crazy Mambo calls.
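For reference, with a PowerNV-capable QEMU the invocation looks something like this (exact options depend on the tree you're using; skiboot.lid is the OPAL firmware image):

$ qemu-system-ppc64 -M powernv -bios skiboot.lid -kernel ppc64le_hello -nographic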

Monday, July 13, 2015

Toying around with LE PowerPC64 via the Power8 simulator

ppc64le_hello is a simple example of what it takes to write stand-alone (that is, system or OS) code that runs in Little-Endian and Hypervisor modes on the latest OpenPOWER/Power8 chips. Of course, I don't have a spare $3k to get one of these nice Tyan reference systems, but IBM does have a free, albeit glacially slow and non-OSS, POWER8 Functional Simulator.

What you get is a simple payload you can boot via skiboot, or another OPAL-compatible firmware. Features, in no particular order:
  • 64-bit real-mode HV LE operation.
  • logging via sim interface (mambo_write).
  • logging via OPAL firmware (opal_write).
  • calling C code, stack/BSS/linkage setup/TOC.
  • calling BE code from LE.
  • FDT parsing, dumping FDT.
  • Taking and returning from exceptions, handling unrecoverable/nested exceptions.
  • Timebase (i.e. the "timestamp counter"), decrementer and hypervisor decrementer manipulation with some basic timer support (done for periodic callbacks into OPAL).
  • Running at HV alias addresses (loaded at 0x00000000200XXXXX, linked at 0x80000000200XXXXX). The idea being that the code will access physical RAM and its own data structures solely using the HV addresses.
  • SLB setup: demonstrates 1T segments with 4K base page and 16M base page size. One segment (slot = 0) is used  to back the HV alias addresses with 16M pages. Another  segment maps EA to VA 1:1 using 4K pages.
  • Very basic HTAB setup. Mapping and unmapping for pages in the 4K and 16M segments, supporting MPSS (16M pages in the 4K segment). No secondary PTEG. No eviction support. Not SMP safe. Any access within the HV alias addresses gets mapped in. Any faults at other unmapped locations are crashes, as addresses below 0x8000000000000000 should only be mapped explicitly.
  • Taking exception vectors with MMU on at the alternate vector location (AIL) 0xc000000000004000.
  • Running unprivileged code.
See README for more information, including how to build and run. At some point it ran on a real Power8 machine - and may run still ;-).

Monday, July 6, 2015

DOES> in Jonesforth

Jonesforth 47 quoth:
NOTES ----------------------------------------------------------------------

DOES> isn't possible to implement with this FORTH because we don't have a separate
data pointer.

Thankfully, that's not true. The following is a tad AArch32-specific, given that I am playing with pijFORTHos (https://github.com/organix/pijFORTHos), but the principle remains the same. Let's first look at how DOES> gets used.
: MKCON WORD CREATE 0 , , DOES> @ ;
This creates a word MKCON, that when invoked like:
1337 MKCON PUSH1337
...creates a new word PUSH1337 that will behave, as if it were defined as:
: PUSH1337 1337 ;
Recall the CREATE...;CODE example. DOES> is very similar to ;CODE, except that you want Forth words invoked, not native machine words. In ;CODE, the native machine words are embedded in the word using CREATE...;CODE, and in CREATE...DOES> it will be Forth words instead. So if we had no DOES> word, we could write something like:
: MKCON WORD CREATE 0 , , ;CODE $DODOES @ ;
...where $DODOES is the machine code generator word that creates the magic we've yet to figure out. $DODOES needs to behave like a mix between DOCOL and NEXT, that is adjusting FIP (the indirect threaded code instruction pointer, pointing to the next word to execute) to point past $DODOES to the @ word. The DFA of the CREATEd word (i.e. PUSH1337) is put on the stack, so @ can read the constant (1337) out. This means the simplest CREATE...DOES> example is:
: DUMMY WORD CREATE 0 , DOES> DROP ;
DUMMY ADUMMY
...because we need to clean up the DFA for ADUMMY that is pushed on its invocation. Anyway, we could thus define DOES> like:
: DOES> IMMEDIATE ' (;CODE) , [COMPILE] $DODOES ;
Let's look at two ways of implementing $DODOES. Way 1 - fully inline. The address of the Forth words (the new FIP) is calculated by skipping past the bits emitted by $DODOES.
        .macro COMPILE_INSN, insn:vararg
        .int LIT
        \insn
        .int COMMA
        .endm

        .macro NEXT_BODY, wrap_insn:vararg=
        \wrap_insn ldr r0, [FIP], #4
        \wrap_insn ldr r1, [r0]
        \wrap_insn bx  r1
        .endm
@
@ A CREATE...DOES> word is basically a special CREATE...;CODE
@ word, where the forth words follow $DODOES. $DODOES thus
@ adjusts FIP to point right past $DODOES and does NEXT.
@
@ You can think of this as a special DOCOL that sets FIP to a
@ certain offset into the CREATE...DOES> word's DFA. This
@ version is embedded into the DFA so finding FIP is
@ as easy as moving FIP past itself.
@
@ - Just like DOCOL, we enter with CFA in r0.
@ - Just like DOCOL, we need to push (old) FIP for EXIT to pop.
@ - The forth words expect DFA on stack.
@
        .macro DODOES_BODY, magic=, wrap_insn:vararg=
0:      \wrap_insn PUSHRSP FIP
1:      \wrap_insn ldr FIP, [r0]
        \wrap_insn add FIP, FIP, #((2f-0b)/((1b-0b)/(4)))
        \wrap_insn add r0, r0, #4
        \wrap_insn PUSHDSP r0
        NEXT_BODY \wrap_insn
2:
        .endm
@
@ $DODOES ( -- ) emits the machine words used by DOES>.
@
defword "$DODOES",F_IMM,ASMDODOES
        DODOES_BODY ASMDODOES, COMPILE_INSN
        .int EXIT

Way 2 - partly inline, where the emitted code does an absolute branch and link. This reduces the amount of memory used per definition at the cost of a branch. Ultimately this is the solution adopted. _DODOES calculates the new FIP by adjusting the return address from the branch-and-link done by the inlined bits.
_DODOES:
        PUSHRSP FIP        @ just like DOCOL, for EXIT to work
        mov FIP, lr        @ FIP now points to label 3 below
        add FIP, FIP, #4   @ add 4 to skip past ldr storage
        add r0, r0, #4     @ r0 was CFA
        PUSHDSP r0         @ need to push DFA onto stack
        NEXT

        .macro DODOES_BODY, wrap_insn:vararg=
1:      \wrap_insn ldr r12, . + ((3f-1b)/((2f-1b)/(4)))
2:      \wrap_insn blx r12
3:      \wrap_insn .long _DODOES
        .endm

@
@ $DODOES ( -- ) emits the machine words used by DOES>.
@
defword "$DODOES",F_IMM,ASMDODOES
        DODOES_BODY COMPILE_INSN
        .int EXIT
In either case, just like DOCOL, we need to push the old FIP pointer before calculating the new one. The old FIP pointer corresponds to the address within the word that called the DOES>-created word. In both cases we need to push the DFA of the executing word onto the stack (this is in r0 on the AArch32 Jonesforth).

Finally, in both cases the CREATE...DOES> word is indistinguishable from a CREATE...;CODE word, and the created word is indistinguishable from a word created by a CREATE...;CODE word.
\ This is the CREATE...;CODE $DOCON END-CODE example before.
: MKCON WORD CREATE 0 , , ;CODE ( MKCON+7 ) E590C004 E52DC004 E49A0004 E5901000 E12FFF11 (END-CODE)
CODE CON ( CODEWORD MKCON+7 ) 5 (END-CODE)

\ Fully inlined CREATE...DOES>.
: MKCON_WAY1 WORD CREATE 0 , , ;CODE ( MKCON_WAY1+7) E52BA004 E590A000 E28AA020 E2800004 E52D0004 E49A0004 E5901000 E12FFF11 9714 938C (END-CODE)
CODE CON_BY_WAY1 ( CODEWORD MKCON_WAY1+7 ) 5 (END-CODE)

\ Partly-inlined CREATE...DOES>. 
: MKCON_WAY2 WORD CREATE 0 , , ;CODE ( MKCON_WAY2+7 ) E59FC000 E12FFF3C 9F64 9714 938C (END-CODE)
CODE CON_BY_WAY2 ( CODEWORD MKCON_WAY2+7 ) 5 (END-CODE)
This makes decompiling (i.e. SEE) a bit tricky, but not impossible. As you can see here, I haven't written a good disassembler yet, which would detect these sequences as $DOCON. IMHO this is still a lesser evil than introducing new fields or flags into the word definition header.

P.S. Defining constants is a classical example of using DOES>, but a bit silly when applied to Jonesforth, where it's an intrinsic. It's an intrinsic so that certain compile-time constants, known only at assembler time, can be exposed to the Forth prelude and beyond. The other classical example of DOES> is struct-like definitions.
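For the curious, here's roughly what the struct-like use might look like in the same dialect as the MKCON example above (word names made up; cells are 4 bytes on AArch32):

\ FIELD defines a word that adds its recorded offset to a struct base address.
: FIELD ( offset size -- offset+size ) WORD CREATE 0 , OVER , + DOES> @ + ;
0
4 FIELD POINT>X
4 FIELD POINT>Y
CONSTANT /POINT
\ "somepoint POINT>Y" now yields the address of the Y field.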

P.P.S. You might be wondering how I'm SEEing into code words, as neither Jonesforth nor pijFORTHos support it. I guess I'll blog about that next real soon whenever... The ( CODEWORD XXX ) business here shows the "code word" pointed to by the CFA, which is necessarily not DOCOL (otherwise it would be a regular colon definition, not CODE). The ( CODEWORD word+offset ) notation tells you that the machine words pointed to by the CFA are part of a different word. Native (jonesforth.s-defined) intrinsics would decompile as something like:
CODE 2SWAP ( CODEWORD 85BC ) (END-CODE)