Monday, March 28, 2011

Bringing up UEFI on Linux-ARM devices...

Introduction.

Given that TianoCore is open source and now supports the ARM architecture, sooner or later someone will want to play around with it on their device of choice. Most likely that's an Android phone, or perhaps some other Linux-based device. That also probably means that you don't have access to the bootloader sources, or the bootloader is "secure", or you simply don't want to risk bricking the device completely... It most likely also means that you know which SoC is inside, but have no idea about all the power regulators, display controllers, and other logic in the device, so nuking the firmware is definitely not the right option. Better to keep it around so it initializes the hardware enough that you can interact with it purely by manipulating SoC devices...

So what do you do?

TianoCore UEFI firmware consists of three phases of execution. The SEC is the little bit of bootstrap code that runs at cold (or warm) boot, and does just enough to jump to the second stage, the PEI, which configures the DRAM and jumps to the DXE phase. DXE is UEFI proper and loads and dispatches UEFI drivers and applications... We want to chain load UEFI from our existing bootloader. The existing bootloader can load Linux kernels. Our UEFI bootstrap can skip directly to DXE, since the hardware is sufficiently initialized that we can access our RAM.

That was a really quick description of UEFI boot, too. Please see here, here and here for more info.

Basically the goal is to make our DXE IPL look like a Linux kernel. The flash image containing our boot firmware volume will, for all intents and purposes, be treated like an initrd. I'll omit certain things that you can pick up by looking at the various EDK2 platform packages, such as the BeagleBoardPkg and everything under ArmPlatformPkg.

Hello, World!

The EDK2 build tools generally create PE32+ images as the output, since that's what UEFI uses for drivers and applications. If you build with GCC, the intermediate files are ELF, and the GenFw tool does all the dirty work of converting the file formats and rebasing the images if necessary. The Linux kernel on ARM is a binary blob, composed of a decompressor and a compressed image. Luckily, GenFw can be asked to create a pure binary.

Add something like this to your .inf file:
[BuildOptions]
*_*_*_GENFW_FLAGS = -b
Of course, to be booted by a Linux kernel bootloader, we need to look like the Linux kernel. Thankfully, it's as easy as a few magic words at the start of the binary. Something like this -
.align
_ModuleEntryPoint:
        .text
        .type   _ModuleEntryPoint, #function
        .rept   8
        mov     r0, r0
        .endr

        b       1f
        .word   0x016f2818              @ Magic numbers to help the loader
        .word   _ModuleEntryPoint
        .word   _end
1:
        mov     ip, r2                  @ Save ATAG pointer.
        ...
        ...
        ...

You can read more about Linux-ARM booting here.
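
If you want to sanity-check the binary that comes out of the build before flashing it, something like the following host-side sketch can help. It assumes the header above is assembled in ARM mode (4-byte instructions), which puts the magic word at offset 0x24 and the start/end words right after it. The function and macro names are mine, not anything from EDK2 or the kernel, and whether your particular bootloader actually checks these words is something you would have to find out.

```c
#include <stddef.h>
#include <stdint.h>

/* Offsets follow from the stub above: 8 NOPs (32 bytes) + 1 branch (4 bytes). */
#define ZIMAGE_MAGIC        0x016f2818u
#define ZIMAGE_MAGIC_OFFSET 0x24u
#define ZIMAGE_START_OFFSET 0x28u
#define ZIMAGE_END_OFFSET   0x2cu

/* Read a little-endian 32-bit word regardless of host endianness/alignment. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Returns 1 if the buffer carries the zImage-style magic and a sane
 * start/end pair, 0 otherwise. On success *size_out (if non-NULL)
 * receives end - start, i.e. the linked size including .bss.
 */
static int zimage_header_ok(const uint8_t *image, size_t len, uint32_t *size_out)
{
    uint32_t start, end;

    if (image == NULL || len < ZIMAGE_END_OFFSET + 4u) {
        return 0;
    }
    if (read_le32(image + ZIMAGE_MAGIC_OFFSET) != ZIMAGE_MAGIC) {
        return 0;
    }
    start = read_le32(image + ZIMAGE_START_OFFSET);
    end = read_le32(image + ZIMAGE_END_OFFSET);
    if (end <= start) {
        return 0;
    }
    if (size_out != NULL) {
        *size_out = end - start;
    }
    return 1;
}
```

Read the first 0x30 bytes of the GenFw output into a buffer and pass it in; a zero return usually means the entry stub did not end up at the very start of the image.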

With that you should be able to write a little assembly program that can build using the EDK2 tool chain and write "Hello, World!" to, say, an already configured UART.

Location.

The next problem is a bit more interesting - we don't know where we are running. Well, that's not quite true. We know where we are running, but our code was linked to run at a particular address at build time, so if that's not where the bootloader put us, all of our absolute data accesses will be garbage. We have two options:
  1. Relocate ourselves from wherever we are to the linked address. This involves thinking about the memory map (and knowing it at least somewhat).
  2. Build position independent code, and patch ourselves up.
Both methods rely (in different ways) on GCC and GNU LD linker scripts. For option (1) we need to know how large our executable is. I suppose the size could be determined with an external tool and patched into the binary... but using a linker script was simpler, even if it ties us to the tool chain.

To use a link script, add something like this to your module INF file:
[BuildOptions]
*_*_*_DLINK_FLAGS == --oformat=elf32-littlearm -nostdlib -u $(IMAGE_ENTRY_POINT) -e \
$(IMAGE_ENTRY_POINT) -Map $(DEST_DIR_DEBUG)/$(BASE_NAME).map -X -T $(MODULE_DIR)/Arm/LinIpl.lds

For option (1) the following is sufficient for a linker script:
SECTIONS
{
  . = 0x1400000;
  _start = .;
  _text = .;

  .text : {
    *(.start)
    *(.text)
    *(.text.*)
    *(.fixup)
    *(.data)
    *(.rodata)
    *(.rodata.*)
    *(.glue_7)
    *(.glue_7t)
    . = ALIGN(4);
  }

  . = ALIGN(4);
  .bss   : { *(.bss) }
  _end = .;
}
... the ". = 0x1400000" specifies the link address. That number I picked more-or-less knowingly, and if you don't know what to pick it's not the best way to start out.

The actual code to perform the relocation looks something like this -
@ Load stack and relocate everything to link address.
 adr r0, LC0
 ldmia  r0, {r1, r2, r3, r4, sp}        @ r2 - linked _start
 subs  r0, r1, r0                       @ r0 = linked - actual
 sub r0, r2, r0                         @ r0 = actual _start
1: ldr  r1, [r0], #4
 str r1, [r2], #4
 cmp r2, r3
 blo  1b
 dsb                                    @ Make sure the copy has completed...
 isb                                    @ ...then flush the pipeline.
 mov  pc, r4
reloc_done:
 ...
 ...
 ...
 .ltorg
 
 .align 2
 .type LC0, #object
LC0:
 .word   LC0        @ r1
 .word _start       @ r2
 .word _end         @ r3
 .word   reloc_done @ r4
 .word _stack_end   @ sp
 .align
 
 .section ".bss"
 .align  2
_stack: .space 4096
_stack_end:
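
For what it's worth, here is the same copy loop modeled in C, mostly to spell out one pitfall. This is only a sketch with names of my own choosing: the real thing has to stay in assembly because no stack exists yet. The extra check is my addition and is deliberately conservative: the loop executes from the source image, so if the load and link ranges partially overlap, the copy can overwrite code that is still running.

```c
#include <stdint.h>

/*
 * Model of the relocation loop: copy the image from where it was loaded
 * (actual_start) to where it was linked (linked_start..linked_end).
 * Returns 0 on success, -1 if the two ranges partially overlap.
 */
static int relocate_image(uintptr_t actual_start,
                          uintptr_t linked_start,
                          uintptr_t linked_end)
{
    uintptr_t size = linked_end - linked_start;
    const uint32_t *src = (const uint32_t *)actual_start;
    uint32_t *dst = (uint32_t *)linked_start;
    uint32_t *end = (uint32_t *)linked_end;

    /* Already running at the link address: nothing to do. */
    if (actual_start == linked_start) {
        return 0;
    }

    /* Refuse any partial overlap between source and destination. */
    if (linked_start < actual_start + size &&
        actual_start < linked_start + size) {
        return -1;
    }

    /* Word-by-word forward copy, same as the ldr/str loop. */
    while (dst < end) {
        *dst++ = *src++;
    }
    return 0;
}
```

In practice this just means picking a link address that is comfortably far from wherever your bootloader loads kernels.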
For option (2) we'll use position-independent code. Note that it's not sufficient to compile just the IPL code with -fpic; all linked modules need to be compiled the same way too (and since we use plenty of EDK libraries, there is no getting around that). I used a separate DSC file specifically to build the IPL bits separately from the rest of my platform files. Add something like this to your DSC file -
[BuildOptions]

  # Important! Must build as PIC code.
  GCC:*_*_ARM_ARCHCC_FLAGS     == -march=armv7-a -fpic -mthumb
The linker script looks like -
SECTIONS
{
  . = 0;
  _text = .;

  .text : {
    *(.start)
    *(.text)
    *(.text.*)
    *(.fixup)
    *(.data)
    *(.rodata)
    *(.rodata.*)
    *(.glue_7)
    *(.glue_7t)
    . = ALIGN(4);
  }

  _got_start = .;
  .got                  : { *(.got) }
  _got_end = .;

  . = ALIGN(4);
  .bss                  : { *(.bss) }
  _end = .;
}
...this way we get access to our Global Offset Table so we can patch it up. The GOT fix-up code looks like this -
@ Load stack and fix-up GOT.
        adr     r0, LC0
        ldmia   r0, {r1, r2, r3, sp}
        subs    r0, r0, r1              @ r0 = actual - linked
        add     r2, r2, r0
        add     r3, r3, r0
        add     sp, sp, r0

1:      ldr     r6, [r2], #0            
        add     r6, r6, r0              @ actual = linked + r0
        str     r6, [r2], #4            
        cmp     r2, r3
        blo     1b
        ...
        ...
        ...
        .ltorg
        
        .align  2
        .type   LC0, #object
LC0:            
        .word   LC0                        @ r1
        .word   _got_start                 @ r2
        .word   _got_end                   @ r3
        .word   _stack_end                 @ sp
        .align
The disadvantage of this method (2) is that you have to be careful. The GOT fix-ups will not fix up your own pointers to objects, so if you have a structure/table with function pointers (like an EFI protocol), you will need to patch it manually.
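
To make the manual patching concrete, here is a small C model of both fix-ups. It is a sketch under my own assumptions: addresses are modeled as plain 32-bit words (as they are on the target), DEMO_PROTOCOL_IMAGE is a made-up stand-in for a statically initialized table of function pointers, and delta is the same "actual - linked" value the assembly computes in r0.

```c
#include <stdint.h>

/* Add the load offset to every word in [first, last), like the GOT loop. */
static void fixup_words(uint32_t *first, uint32_t *last, uint32_t delta)
{
    uint32_t *slot;

    for (slot = first; slot < last; slot++) {
        *slot += delta; /* actual = linked + delta (wraps like ARM 'add') */
    }
}

/*
 * Hypothetical protocol-like table. Its members were initialized at link
 * time, so they hold link-time addresses and are NOT covered by the GOT.
 */
typedef struct {
    uint32_t Reset;
    uint32_t Read;
} DEMO_PROTOCOL_IMAGE;

/* Each such table has to be patched by hand, once, with the same delta. */
static void patch_demo_protocol(DEMO_PROTOCOL_IMAGE *protocol, uint32_t delta)
{
    protocol->Reset += delta;
    protocol->Read += delta;
}
```

The important part is "once": patching a table twice (say, from two init paths) leaves you with pointers that are off by the load offset.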

Finding the FD and RAM.

As you've seen in the link above about the Linux-ARM boot process, we get passed a pointer to a list of ATAGs, which contain info about RAM and the initrd location, among other things. Additionally, the RAM size could be passed through the kernel command line.
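
For the command-line case, the Linux convention is a "mem=nn[KMG]" option. A minimal host-side sketch of pulling the size out of a command line might look like this. It uses the C library for brevity, so inside the IPL you would swap in the BaseLib ASCII string helpers; cmdline_mem_size is a name I made up, and the optional "@start" part of the option is ignored here.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns the size in bytes given by a "mem=" option, or 0 if the
 * command line has none. Only matches "mem=" at the start of the
 * string or right after a space, so "nomem=5" is not picked up.
 */
static uint64_t cmdline_mem_size(const char *cmdline)
{
    const char *p = cmdline;

    while (p != NULL && (p = strstr(p, "mem=")) != NULL) {
        if (p == cmdline || p[-1] == ' ') {
            char *end;
            uint64_t size = strtoull(p + 4, &end, 0);

            switch (*end) {
            case 'G': case 'g': size <<= 10; /* fall through */
            case 'M': case 'm': size <<= 10; /* fall through */
            case 'K': case 'k': size <<= 10; break;
            default: break;
            }
            return size;
        }
        p += 4; /* Skip this non-matching occurrence and keep looking. */
    }
    return 0;
}
```

A zero return is the cue to fall back to ATAG_MEM or to a baked-in default.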

Parsing ATAGs is easy -
#define ATAG_TYPE_CORE    (0x54410001)
#define ATAG_TYPE_MEM     (0x54410002)
#define ATAG_TYPE_INITRD2 (0x54420005)
#define ATAG_TYPE_CMDLINE (0x54410009)
#define ATAG_TYPE_NONE    (0x0)

typedef struct _ATAG_HEADER
{
  UINT32 Size;
  UINT32 Tag;
} ATAG_HEADER, *PATAG_HEADER;

typedef struct _ATAG_INITRD2
{
  ATAG_HEADER Header;
  UINT32 Start;
  UINT32 Size;
} ATAG_INITRD2, *PATAG_INITRD2;

typedef struct _ATAG_MEM
{
  ATAG_HEADER Header;
  UINT32 Size;   // Linux's struct tag_mem32 puts the size first...
  UINT32 Start;  // ...and the physical start address second.
} ATAG_MEM, *PATAG_MEM;

typedef struct _ATAG_CMDLINE
{
  ATAG_HEADER Header;
  CHAR8 CmdLine[1];
} ATAG_CMDLINE, *PATAG_CMDLINE;

EFI_STATUS
ATagsValid(
  IN  PATAG_HEADER ATagsList
  )
{
  if (ATagsList->Tag == ATAG_TYPE_CORE) {
    return EFI_SUCCESS;
  }
  return EFI_NOT_FOUND;
}

PATAG_HEADER ATagsGet(
  IN  PATAG_HEADER ATags,
  IN  UINT32 TagType
  )
{

  //
  // Start at the next tag...
  //

  PATAG_HEADER Tag = ATags;
  Tag = (PATAG_HEADER) ((UINTN) Tag + Tag->Size * sizeof(UINT32));

  while (Tag->Tag != TagType && Tag->Tag != ATAG_TYPE_NONE) {
    Tag = (PATAG_HEADER) ((UINTN) Tag + Tag->Size * sizeof(UINT32));
  }
  return Tag->Tag == ATAG_TYPE_NONE ? NULL : Tag;
}
...
...
...
{
  FdAtag = (PATAG_INITRD2) ATagsGet (ATags, ATAG_TYPE_INITRD2);
  if (!FdAtag) {
    DEBUG ((EFI_D_ERROR, "No EFI FD image passed to IPL :(\n"));
    goto done;
  }
  DEBUG ((EFI_D_INFO, "FD @ 0x%x-0x%x\n", FdAtag->Start,
         FdAtag->Start + FdAtag->Size));
  MemAtag = (PATAG_MEM) ATagsGet (ATags, ATAG_TYPE_MEM);
  if (MemAtag) {
    MemBase = MemAtag->Start;
    MemSize = MemAtag->Size;
  } else {
    MemBase = PcdGet32(PcdMemoryBase);
    MemSize = PcdGet32(PcdMemorySize);
    DEBUG ((EFI_D_ERROR, "Where'd ATAG_MEM go? Using baked-in default...\n"));
  }
  DEBUG ((EFI_D_INFO, "RAM @ 0x%x-0x%x\n",
         MemBase,
         MemBase + MemSize));
}
...
...
...
You could make the above as complicated as necessary to handle non-contiguous memory ranges, etc. The bootloader also passes a machine type to the IPL, so you could even create an image that runs on multiple devices :-).
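
As a rough idea of what handling non-contiguous memory could look like, here is a host-runnable sketch that walks the whole list and collects every ATAG_MEM entry instead of stopping at the first. It treats the list as raw 32-bit words and follows the field order of Linux's struct tag_mem32 (size, then start). The mem_range type and atag_collect_mem are names I invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define ATAG_TYPE_NONE 0x0u
#define ATAG_TYPE_MEM  0x54410002u

typedef struct {
    uint32_t start;
    uint32_t size;
} mem_range;

/*
 * Each tag is: [size in 32-bit words, header included][tag id][payload...].
 * Copies up to 'max' ATAG_MEM ranges into 'out' and returns how many were
 * stored. Stops at ATAG_NONE or at a tag too small to contain its header.
 */
static size_t atag_collect_mem(const uint32_t *tag, mem_range *out, size_t max)
{
    size_t count = 0;

    while (tag[0] >= 2u && tag[1] != ATAG_TYPE_NONE) {
        if (tag[1] == ATAG_TYPE_MEM && count < max) {
            out[count].size = tag[2];
            out[count].start = tag[3];
            count++;
        }
        tag += tag[0]; /* Header size field is in words, so this hops tags. */
    }
    return count;
}
```

Each collected range would then get its own memory table entry and its own resource descriptor HOB.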

You're also responsible for setting up a page table, and mapping (linearly, of course, this is UEFI) the necessary memory and MMIO ranges with the correct attributes.

For me this is something like -
ARM_MEMORY_REGION_DESCRIPTOR *
EFIAPI
IplPlatformMemoryRegions (
  IN  UINT32 MemoryBase,
  IN  UINT32 MemoryLength,
  IN  ARM_MEMORY_REGION_ATTRIBUTES CacheAttributes
  )
{
  mMemoryTable[0].PhysicalBase = MemoryBase;
  mMemoryTable[0].VirtualBase = MemoryBase;
  mMemoryTable[0].Length = MemoryLength;
  mMemoryTable[0].Attributes = CacheAttributes;

  mMemoryTable[1].PhysicalBase = TEGRA_IO_SPACE_START;
  mMemoryTable[1].VirtualBase = TEGRA_IO_SPACE_START;
  mMemoryTable[1].Length = TEGRA_IO_SPACE_END - TEGRA_IO_SPACE_START;
  mMemoryTable[1].Attributes = ARM_MEMORY_REGION_ATTRIBUTE_DEVICE;

  mMemoryTable[2].PhysicalBase = 0;
  mMemoryTable[2].VirtualBase = 0;
  mMemoryTable[2].Length = 0;
  mMemoryTable[2].Attributes = 0;

  return mMemoryTable;
}

VOID
InitCache (
  IN  UINT32 MemoryBase,
  IN  UINT32 MemoryLength
  )
{
  ARM_MEMORY_REGION_ATTRIBUTES CacheAttributes;
  ARM_MEMORY_REGION_DESCRIPTOR *MemoryTable;
  VOID *TranslationTableBase;
  UINTN TranslationTableSize;

  if (FeaturePcdGet(PcdCacheEnable) == TRUE) {
    CacheAttributes = ARM_MEMORY_REGION_ATTRIBUTE_SECURE_WRITE_BACK;
  } else {
    CacheAttributes = ARM_MEMORY_REGION_ATTRIBUTE_SECURE_UNCACHED_UNBUFFERED;
  }

  MemoryTable = IplPlatformMemoryRegions (MemoryBase,
                                          MemoryLength,
                                          CacheAttributes);
  ArmConfigureMmu (MemoryTable, &TranslationTableBase, &TranslationTableSize);
  BuildMemoryAllocationHob((EFI_PHYSICAL_ADDRESS)(UINTN)TranslationTableBase,
                           TranslationTableSize, EfiBootServicesData);
}
Note that I've marked the regions for secure accesses; by default the AXI transactions would be non-secure. It may appear unimportant, but with non-secure accesses you won't be able to properly access many important system devices.

Handing Off.

Now we just need to figure out where the DXE stack will be, create a HOB (Hand-Off Block) list that will describe the CPU and memory layout, and pass control to DXE. Back in EDK1 days it ended up being a bunch of ugly code lifted verbatim from the PEI Core, but it's a lot simpler now, as PrePiLib contains pretty much everything you need.

Something like this -
  //
  // DXE stack is below end of first chunk.
  //

  StackBase = MemBase + MemSize - PcdGet32(PcdPrePiStackSize);
  DEBUG ((EFI_D_INFO, "DXE Stack @ 0x%x-0x%x\n",
         StackBase,
         StackBase + PcdGet32(PcdPrePiStackSize)));

  //
  // We'll make the HOB lie after the FD.
  //

  HobBase = (VOID *) (FdAtag->Start + FdAtag->Size);
  CreateHobList ((VOID*) MemBase, MemSize, HobBase, (VOID *) StackBase);

  //
  // Enable CPU features, cache/MMU.
  //

  ArmEnableBranchPrediction ();
  InitCache (MemBase, MemSize);
  BuildMemoryAllocationHob(FdAtag->Start,
                           FdAtag->Size,
                           EfiBootServicesData);
  BuildFvHob(FdAtag->Start + PcdGet32(PcdFlashFvMainBase), PcdGet32(PcdFlashFvMainSize));

  DEBUG ((EFI_D_INFO, "Handing-off to DXE...\n"));
  InitVectors(PcdGet32(PcdCpuVectorBaseAddress));
  LoadDxeCoreFromFv (NULL, 0);
  DEBUG ((EFI_D_ERROR, "DXE core returned :(\n"));
Note that InitVectors above doesn't actually set up exception vectors; it "cleans them up", replacing each entry with a "jump back to self" loop. Otherwise the CpuDxe will think the exception entry is already in use and refuse to replace it. D'Oh!
VOID
InitVectors (
  IN  UINT32 VectorsBase
  )
{
  UINT32 *Start = (UINT32 *) VectorsBase;
  UINT32 *End = (UINT32 *) VectorsBase + VECTORS_COUNT;
  for (;Start < End; Start++) {
    *Start = VECTORS_B_LOOP;
  }
  ArmDataSyncronizationBarrier ();
}
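The definitions of VECTORS_COUNT and VECTORS_B_LOOP aren't shown above. The sketch below fills in values that I'm assuming from the architecture rather than quoting from my sources: the ARM exception table has 8 entries, and an ARM-mode "b ." (branch to self) encodes as 0xEAFFFFFE. The helper arm_b_encoding is hypothetical and only there to show where that number comes from.

```c
#include <stddef.h>
#include <stdint.h>

#define VECTORS_COUNT  8u          /* Assumed: reset, undef, svc, prefetch abort,
                                      data abort, reserved, irq, fiq. */
#define VECTORS_B_LOOP 0xEAFFFFFEu /* Assumed: ARM-mode "b ." */

/*
 * Encode an unconditional ARM-mode branch located at 'from' that jumps to
 * 'to'. The offset is relative to from + 8 (pipeline) and stored in words
 * in the low 24 bits.
 */
static uint32_t arm_b_encoding(uint32_t from, uint32_t to)
{
    uint32_t offset = to - from - 8u;

    return 0xEA000000u | ((offset >> 2) & 0x00FFFFFFu);
}

/* Host-side model of InitVectors: park every vector in a self-loop. */
static void fill_vectors(uint32_t *base)
{
    size_t i;

    for (i = 0; i < VECTORS_COUNT; i++) {
        base[i] = VECTORS_B_LOOP;
    }
}
```

A branch to self always has an offset of -8 bytes, so the same word works for every entry no matter where the table lives.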
Odds & Ends.

You might be wondering what your FDF file should look like. To reiterate, the IPL gets built but it doesn't get stored inside the FV/FD. After you finish building, you might wish to combine the two into an Android boot image for flashing with fastboot. The IPL is your kernel, while the FD is your initrd :-). My FDF looks like this. There is a single FV. Remember, since we don't know where the FD will be loaded by the bootloader, the base address doesn't really matter. We have no XIP code (like embedded SEC, PEI Core, PEIMs) in our FV, so we have nothing to worry about -
[FD.Tegra_EFI]
#
# Warning! BaseAddress is "logical". That means PcdFlashFvMainBase is an offset
# from where it really is.
#
BaseAddress = 0x0|gEmbeddedTokenSpaceGuid.PcdEmbeddedFdBaseAddress
Size = 0x00100000|gEmbeddedTokenSpaceGuid.PcdEmbeddedFdSize
ErasePolarity = 1
BlockSize     = 0x1
NumBlocks     = 0x100000
0x00000000|0x00100000
gEmbeddedTokenSpaceGuid.PcdFlashFvMainBase|gEmbeddedTokenSpaceGuid.PcdFlashFvMainSize
FV = Tegra

[FV.Tegra]
BlockSize          = 0x1
NumBlocks          = 0         
FvAlignment        = 8         
ERASE_POLARITY     = 1
MEMORY_MAPPED      = TRUE
STICKY_WRITE       = TRUE
LOCK_CAP           = TRUE
LOCK_STATUS        = TRUE
WRITE_DISABLED_CAP = TRUE
WRITE_ENABLED_CAP  = TRUE
WRITE_STATUS       = TRUE
WRITE_LOCK_CAP     = TRUE
WRITE_LOCK_STATUS  = TRUE
READ_DISABLED_CAP  = TRUE
READ_ENABLED_CAP   = TRUE
READ_STATUS        = TRUE
READ_LOCK_CAP      = TRUE
READ_LOCK_STATUS   = TRUE

  INF MdeModulePkg/Core/Dxe/DxeMain.inf
...
...
...
The End.

This should get you as far as an ASSERT inside DxeMain about missing UEFI architectural protocols - some of these, like the CpuDxe, are already present in the source distribution, others you will have to write if they don't exist yet for your SoC - like the Timer and Interrupt Controller code.

1 comment:

  1. Nice. Seems you have sufficient knowledge about UEFI on ARM.
    I was trying to use UEFI on our ARM platform. I'm going through the theory, understanding spec part etc now. Once i dirt my hands with board and code, this blog will be pretty much useful. Thanks.
