Monday, May 24, 2010

Linux: synchronizing kernel mappings manually

I want to write a hypervisor for ARM. I decided that my launch vehicle is the Linux kernel; in fact, I am effectively turning the Linux kernel into the hypervisor. I want to keep it as a loadable kernel module, mainly because I don't want to modify the Linux kernel. This means I should, in theory, be able to load it (and unload it) anywhere there is hardware support, root access, and a kernel source tree to build the module against.

The particular ISA extensions I am targeting add another CPU mode, in which you can intercept hypercalls, IRQs, FIQs, and certain types of aborts. The first time I tried to intercept an IRQ, the kernel died in the prefetch abort handler with a "Bad mode in prefetch abort handler detected" message.

Huh?

The problem stems from the way memory is mapped within the kernel. Anytime vmalloc() is called (and loaded modules go into vmalloc()ed memory), the kernel simply updates the master kernel page tables (init_mm.pgd). When a page fault occurs, the kernel page fault handler checks whether the address falls between VMALLOC_START and VMALLOC_END, and if so propagates the page table entries from init_mm.pgd to the currently active translation tables. On ARM, page faults come in as either data aborts or prefetch (instruction access) aborts. The module was loaded in the context of the insmod process, whose page tables were then fixed up by the page fault handler as the module was initialized. When the IRQ arrived, the CPU went into the hypervisor mode, and the hypervisor IRQ vector tried to pass control to a piece of code located in the module. Because a different process was executing at the time of the IRQ, a page fault occurred. That fault should have resulted in patching up the page table, but didn't, because the abort handler didn't expect the previous CPU state to be this new unknown-to-Linux state.

Fixing this means ensuring that no matter what context we're executing in, a page fault never happens while accessing my module's data or code. We can't access init_mm (and the list lock), because these symbols are not exported by the kernel. However, we /can/ propagate changes in the VMALLOC_START-VMALLOC_END range from the current process's address space to all the other process address spaces, as the last part of initializing the module. But we do need to be sure that the current address space contains all the mappings for the module's code and data.
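
To make the two paths concrete, here is a toy user-space model of them. It is only a sketch and not kernel code: every name in it (toy_mm, toy_fault_fixup, toy_propagate) and every constant (16 top-level slots, the TOY_VMALLOC_* range) is made up for illustration, and a real module would have to walk real page tables for every process under the appropriate locking, which this ignores entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only. Sizes and addresses are invented for illustration. */
#define PGD_ENTRIES        16u          /* top-level slots per address space */
#define SLOT_SHIFT         28u          /* each slot covers 256 MiB */
#define TOY_VMALLOC_START  0xC0000000u  /* hypothetical, not the real value */
#define TOY_VMALLOC_END    0xE0000000u  /* hypothetical, exclusive */

typedef uint32_t pgd_entry;             /* 0 means "nothing mapped here" */

struct toy_mm {
    pgd_entry pgd[PGD_ENTRIES];         /* one top-level table per process */
};

size_t slot_of(uint32_t addr)
{
    return addr >> SLOT_SHIFT;
}

int in_vmalloc(uint32_t addr)
{
    return addr >= TOY_VMALLOC_START && addr < TOY_VMALLOC_END;
}

/*
 * Lazy path: models what the fault handler does. If the faulting address
 * is in the vmalloc range and the master table has an entry for it, copy
 * that entry into the faulting address space. Returns 1 if fixed up.
 */
int toy_fault_fixup(struct toy_mm *cur, const struct toy_mm *master,
                    uint32_t addr)
{
    size_t i;

    if (!in_vmalloc(addr))
        return 0;
    i = slot_of(addr);
    if (master->pgd[i] == 0)
        return 0;
    cur->pgd[i] = master->pgd[i];
    return 1;
}

/*
 * Eager path: models the workaround. Copy every populated vmalloc-range
 * entry from src (the address space that already took the faults) into
 * each of the other address spaces, so they never need to fault.
 */
void toy_propagate(const struct toy_mm *src, struct toy_mm *others[],
                   size_t n)
{
    size_t first = slot_of(TOY_VMALLOC_START);
    size_t last = slot_of(TOY_VMALLOC_END - 1);
    size_t i, k;

    assert(last < PGD_ENTRIES);
    for (k = 0; k < n; k++)
        for (i = first; i <= last; i++)
            if (src->pgd[i] != 0)
                others[k]->pgd[i] = src->pgd[i];
}
```

In this model, the insmod address space picks up its entries through toy_fault_fixup() while the module initializes, and a single toy_propagate() call at the end of initialization hands those entries to everyone else. Note that the eager copy only helps for entries the source address space already has, which is exactly why the current address space must contain all of the module's mappings first.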

Sunday, May 23, 2010

A long time ago in a galaxy far away...


This is a 32-bit UEFI firmware based on UEFI EDK. As per the UEFI specification, it runs in unpaged protected mode. This was around 3.5 years ago, before the time of OVMF, and involved filling in all the missing bits to make the Nt32 simulator real firmware - low-level code, chipset support, patched build tools to properly relocate execute-in-place PE32 binaries for SEC/PEI phases. Sure, I'd done it all before, but this time I did everything carefully and non-hacky. And this time it took me no more than 5 all-nighters, ignored lectures and "work" at my then job...but that's a stark contrast to the month to a month and a half it took me to do the same for x64 before =).

I had followed the same pattern of Tiano porting - maintaining separate SEC, PEI and DXE phases, even though in the context of a virtual machine, the PEI phase had nothing to do but load the DXE core...

The platform changes have been lost forever, while the tool patches to generate proper ROMs (fitting my own specification of how it should work, given the lack of functioning code provided by Intel) have carried on with me into my more official EFI endeavors, but as actual code are also either gone or bit rotting on some hard drive in one of my old boxes... I never got past booting to the EFI shell - I think my plans were to test ELILO, but given my senior year at UIC and other concerns, I never got around to it.

P.S.: Ignore the silly SVN commit comment. You definitely don't want to use the TSC for the timer calibration, given that the TSC may fluctuate depending on CPU power savings....

Maybe I should do a port for ARMv5 Integrator/CP in QEMU...

Saturday, May 22, 2010

Day 1

In the interest of keeping my notes pertaining to my errrr...professional activities in one place and open for others, I've decided to start a blog that I will solely dedicate to low-level and system programming, as well as any and all pet projects of mine...