Wednesday, December 1, 2010

LED triggers redux

As I found out today, profile_event_unregister and friends aren't supposed to be used in generic drivers, and are meant only for profiler usage.

Right now I am looking at reimplementing the trigger to instead create an additional kobject under the LED one, containing an additional brightness attribute. Writing to the attribute will decrease the refcount, and I would guess the refcount to remain above zero for as long as I have the sysfs fd associated with the attribute open. Then I can leverage the kobject release to handle the cleanup and turn the LED off. Looks good in my mind at 4 AM :-).

Edit: wrong again. I completely misunderstood the relationship between kobjects and sysfs. Sysfs manipulations have no effect on kobject lifetime, so the refcount doesn't change while store/show are executing - and there is no need for it to. The sysfs attributes are created in response to kobject creation, so you're guaranteed the kobject exists while they do. Sysfs cleanup is performed when the kobject is destroyed, and that waits until all outstanding sysfs operations have completed. As far as the LED trigger goes, since my original problem ties LED status to a separate hardware device, the right solution seems to be a trigger that lets arbitrary kernel clients manipulate the LEDs they are interested in. I'll have code up soon :-).

Anyway, it's always exciting to see a better way of implementing something than you could before. It means you've grown a bit. For the past couple of months I've been trying to use any free time on various pet projects to explore different Linux kernel subsystems...and it's been a ride so far. It's the one thing I regret not having done enough as an NT and Hyper-V dev (exploring these systems, respectively, not Linux ;)), but that was mostly because I lacked the time, not the initiative.

I am starting to pine for EFI again. I should look for that hard drive with my 32-bit Bochs port. For old times' sake. Or maybe port EDKII to my ARM environment...

Saturday, November 20, 2010

Turning off LEDs on process crashes...

Edit: As I found out today, profile_event_unregister and friends aren't supposed to be used in generic drivers, and are meant only for profiler usage. So if you use this code, there will be kittens getting hurt someplace, and people will laugh at your kernel patches. Or something like that. YMMV.

Sometimes you want to manipulate LEDs from a program. This is easy: LEDs live in /sys/class/leds, and all you need to do is set the brightness sysfs attribute. Unfortunately, if the task that manipulated the LED dies before turning it off, there is no automatic way of cleaning up after it. This is why people like driving drivers through a file descriptor - if anything goes wrong, the close() happens automatically.
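For reference, the "easy" part boils down to a few lines of C. This is just a sketch; the helper takes the attribute path as a parameter (so it can be pointed at any plain file for testing), and the actual LED name under /sys/class/leds depends on your hardware.

```c
#include <stdio.h>

/* Write a brightness value to a sysfs-style attribute file.
 * sysfs attributes take a decimal string; a trailing newline is fine.
 * Returns 0 on success, -1 on failure. */
static int set_brightness(const char *path, int value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;

    int rc = (fprintf(f, "%d\n", value) < 0) ? -1 : 0;
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}
```

The catch described above: if the process dies between `set_brightness(path, 1)` and `set_brightness(path, 0)`, the LED stays on.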

But the LED interface happened, and no one is going to change it. The solution, of course, is to implement an LED trigger. But where LED triggers usually turn LEDs on, this one will turn them off - when the task that set the trigger completes its execution. I generalized this a bit so that any arbitrary task can be watched, and so that any brightness can be set on exit (because I'm nice like that, and it didn't cost me anything).

Usage is something like this (from within your program):

# echo "owner" > /sys/class/leds/XXX/trigger
# echo "1" > /sys/class/leds/XXX/brightness
# stuff...
# echo "0" > /sys/class/leds/XXX/brightness
# echo "none" > /sys/class/leds/XXX/trigger

Implementation-wise, the driver registers a PROFILE_TASK_EXIT notifier. The notifier is global, i.e. it's not tied to any specific process, so it is invoked for every exiting process (though only while the trigger is in actual use), hence the need to compare PIDs. It would be nice to have a targeted PROFILE_TASK_EXIT...


That repo patch I wrote about, the one that lets you continue syncing git projects even if some of them fail, has now been merged in by Google. Enjoy :-)!

ARMv7 kernel with L1 cache disabled.

I was (well, still am) hunting down some memory corruption inside our kernel, and figured removing as many possible culprits as I could would be a good idea. Given the various PL310 cache controller errata, I figured I might as well disable this guy and see if that helps stability somewhat. Doing that is as easy as disabling CONFIG_CACHE_L2X0. Even though I obviously wasn't going to play with disabling L1 (after all, if that's where your problems are, you have bigger issues...), once I saw the CONFIG_CPU_ICACHE_DISABLE/CONFIG_CPU_DCACHE_DISABLE options I knew I had to try them out. Even if just to see our ARM target crawl.

Of course, after building with that I booted into a hard hang. I tried with just CONFIG_CPU_ICACHE_DISABLE, which worked (glacially), so it was disabling the D-cache that was hosing me. I wasn't going to let a measly kernel config option defeat me, so there went my Friday night :-)... It took me a while to figure out that it was actually hanging inside printk(), on a spin_lock. Locking and atomic operations are implemented in Linux with the LDREX/STREX instructions on ARMv6 and above. If you look at the description of these instructions, they involve an exclusive monitor, which for L1 is part of the Data Cache Unit (DCU) - and my L2 is off (not that it would do me any good: the PL310 doesn't contain an exclusive monitor). So the STREX always fails, and the lock appears taken.

Of course, spinlocks are only used with SMP, and SMP is only supported in Linux on ARMv6 and above (which added LDREX and STREX), so since I didn't feel like implementing raw_spin_lock with the SWP instruction (deprecated on ARMv6 and later), disabling SMP was pretty much the obvious choice at 1 AM. After that I needed to enable the pre-ARMv6 variants for mutexes, locks, atomic operations, bit operations and __xchg/cmpxchg. And now it boots... Of course my user space, being compiled for ARMv7, expects functional LDREX/STREX, and so it hangs the init process.
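To make the failure mode concrete, here is a minimal test-and-set spinlock in portable C11 - an illustration of the pattern, not the kernel's actual raw_spin_lock. On ARMv6+ the atomic exchange below compiles down to an LDREX/STREX loop, and STREX only reports success if the exclusive monitor still holds its reservation; with the D-cache (and thus the L1 monitor) disabled, the store-exclusive never succeeds, so the while loop spins forever:

```c
#include <stdatomic.h>

/* A minimal test-and-set spinlock. On ARMv6+, the atomic exchange is
 * implemented with an LDREX/STREX loop; if STREX can never succeed
 * (no functional exclusive monitor), spin_lock() never returns. */
typedef struct {
    atomic_flag locked;
} spinlock_t;

static void spin_lock(spinlock_t *l)
{
    /* returns the previous value: true means someone else holds it */
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        ; /* spin until the holder releases the lock */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

On pre-ARMv6, the equivalent primitive was built around SWP, which atomically swaps a register with memory without needing an exclusive monitor - hence the pre-ARMv6 fallbacks mentioned above.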

Wednesday, October 13, 2010

Email notifications.

Frequently, you may kick off some long task, like a build or an SCM sync, that will take a while to finish, and you might not want to hang around to watch it finish or fail. Additionally, if it does fail, you might want to notify someone via email.

So I just had to put together a tool that lets me run an arbitrary command with arbitrary parameters, and pipe stdout/stderr/return status into an email. The tool handles locale/encoding correctly, tails a configurable amount of interspersed stdout/stderr output in the body of the email in a fixed-size font, and lets you attach both stdout and stderr in full as separate attachments. It also reports run times for the executed task. It uses SMTP via an MTA/MSA, and supports TLS plus authentication. Unless told otherwise, it still prints all stdout/stderr to the console.

Not the prettiest or best code ever, as I didn't have time, and I'm not really a Python person. It only depends on, because I was lazy and didn't want to deal with generating HTML.

< 2.6.36 and ioctl

I didn't realize that all ioctl()s were handled under the BKL, unless of course a driver used the newer unlocked_ioctl way of doing things. With 2.6.36, of course, you no longer have a choice =).

Friday, July 30, 2010

Forcing repo to continue syncing projects even if an individual project sync fails...

In case anyone got frustrated at repo bailing on git errors for a particular project - here you go! Hopefully this gets merged in.

Ethos KGDB

So one of the many things I am working on in my spare time is adding Ethos kernel debugging support via the gdb debugger. The gdb remote debugging protocol is quite simple, making initial bringup straightforward. The protocol is all ASCII, with values encoded in hexadecimal and commands being single characters. The commands are a pretty simple lot - reading and modifying memory and registers. Compare this to the KD protocol, which is binary and expects NT kernel structures to be available for reading by the debugger core, making it pretty verbose to implement for debugging targets other than ntoskrnl through WinDbg/KD. I suppose the difference is the result of a debugger growing around a specific OS and need, rather than around generic target debugging support. Of course, once the basic CPU state manipulation works in Ethos, kernel-specific commands will need to be added - to handle address space switches, for example, or physical address space reads. In either case, all Ethos-specific support will be versioned and unnecessary for basic kernel panic debugging, which is the primary motivator for remote debugging in the first place...
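To give a flavor of how simple the framing is: a gdb remote protocol packet is just `$<payload>#<checksum>`, where the checksum is the modulo-256 sum of the payload bytes as two hex digits. A sketch of the sending side:

```c
#include <stdio.h>

/* Frame a gdb remote-serial-protocol packet: $payload#cs, where cs
 * is the modulo-256 sum of the payload bytes, as two hex digits.
 * E.g. the "read registers" command 'g' frames as "$g#67". */
static void rsp_frame(const char *payload, char *out, size_t len)
{
    unsigned char sum = 0;
    for (const char *p = payload; *p; p++)
        sum += (unsigned char)*p;

    snprintf(out, len, "$%s#%02x", payload, sum);
}
```

The receiver acknowledges each well-formed packet with `+` (or `-` to request retransmission), which is about all the transport layer has to know.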

So Ethos is a kernel that runs in a Xen paravirtualized virtual machine. That means that whereas on real hardware you could have a serial UART, a 1394 adapter, or Ethernet, within a paravirtualized machine you only have the virtual devices exposed by Dom0. Obviously, the debugging support code path should be pretty small and should not tie into too many components of the debugged kernel, otherwise you suddenly lose the ability to step through a fair amount of code. The only virtual device that can be accessed almost immediately after bootup with minimal pain is the console device. All other virtual device initialization relies on being able to access XenBus, a bus abstraction that virtual devices use to communicate between virtual machines, used for configuration negotiation. XenBus/XenStore is implemented at a high level as a namespace with file/directory access semantics, and lets Xen domains exchange small bits of information.

Disregarding the obvious question of why Xen doesn't expose a couple of hypercalls as a debugger channel, like the Hyper-V hypervisor does, we're left figuring out which communication channel has the minimum impact. Using the console is the simplest and lets you debug when nothing else is even working yet (like when you're trying to bring up on x64 ;-)). On the other hand, if you're interested in debugging a live, non-crashed kernel, you might not wish to clobber the kernel console, which is used for kernel messages. The approach I am taking is to abstract the packet transport from the actual debugging engine, and to implement both a console transport and a XenBus transport, where a particular key is used to exchange packets between debugger and kernel. The console transport will be used for early boot debugging and kernel panics.

Thursday, July 29, 2010


Two years after I last touched the Ethos kernel, I am back at hacking away at what is now referred to as the cardboard Ethos, while we are looking ahead to developing the next incarnation in a safer programming language than C. Right now, that language might be Go.

What is Ethos?

For the moment, I am the architecture and memory management code owner. That effectively means cleaning up the implementation and architecture of the code dealing with physical and virtual address space manipulation, scheduling and process management, and other low-level bits. It took a bit of time to bring myself up to speed with code that I had written just a few years ago (which is kinda sad...), but I place the blame squarely on not having done it cleanly enough the first time (not that that was or is a priority at this stage of the project ;-)). Aside from cleaning up the tree and fixing bugs, I am adding kernel gdb debugging over Ethernet, and hopefully porting Ethos to x64 if work and personal life allow ;-).

Today I've added a stack unwinder / backtrace that can step over interrupt contexts, so that panic logs are more useful than a backtrace that stops at the exception handler frame.

Now, on a BUG() occurring, say, within the timer interrupt path, you get something like the following -

Terminating Ethos Kernel
[0xc00236f0] timer_handler + 21
[0xc000726f] xen_event_handle + ec
[0xc0007b0f] do_hypervisor_callback + a7
[0xc00030a6] hypervisor_callback + 35
-----> Next frame returning from interupt context is kernel space <-----
[0xc0005cc1] memset + 24
[0xc0010e4c] elfLoad + 390
[0xc001221c] scheduleInit + 214
[0xc00156a2] start_kernel + 1dd
[0xc000000e] stack_start + 0
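The general shape of a backtrace routine - collect return addresses up the stack, print them - can be shown with glibc's own unwinder. This is a user-space analogue for illustration, not the Ethos code, which additionally has to recognize and step across interrupt frames as above:

```c
#include <execinfo.h>  /* glibc-specific backtrace() */
#include <stdio.h>

/* Print the raw return addresses of the current call stack, one per
 * line, roughly like the panic log above (minus symbol names, which
 * would take a symbol table lookup per address). Returns the depth. */
static int print_backtrace(void)
{
    void *frames[32];
    int depth = backtrace(frames, 32);

    for (int i = 0; i < depth; i++)
        printf("[%p]\n", frames[i]);
    return depth;
}
```

In a kernel, the same walk is done by following saved frame pointers, and the trick for interrupt frames is knowing the layout of the register dump the exception entry path pushed, so the walk can resume from the interrupted PC.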

Wednesday, July 28, 2010

ARM11 Technical Documentation

It's excruciatingly frustrating to navigate the ARM website. There are a bajillion similar-sounding documents, each containing minor or chip-revision-specific minutiae, when all you are really trying to find is a generic overview of the architecture for system programming.

In case anyone needs it - ARM11 MPCore™ Processor Technical Reference Manual

Let's just hope the ARM website isn't like MSDN, which recycles links for no apparent good reason more often than you would think reasonable...

Tuesday, July 27, 2010

Virtualizing ARM with TrustZone

Newer ARM processors come with security extensions called TrustZone, designed to enable a secure environment for software. Effectively, the TrustZone extensions "split" an ARM processor into two domains of operation - a secure and a non-secure one. From the point of view of TrustZone-unaware code, the two domains are generally identical. Each domain has 7 modes of operation (usr, sys, svc, irq, fiq, abt, und), and the secure domain also has an 8th - mon - which is meant for secure monitor code. Which domain the CPU is executing in is controlled by the NS bit in CP15 register c1. Most of the system control registers are banked, leaving code in the secure and non-secure domains more or less unaware of the other. The 32-bit physical address space is further extended by one bit, the secure bit, creating two separate physical address spaces, enabling memory-mapped secure devices as well as secure-only memory. With appropriate hardware support, one can route specific interrupts to the secure domain, or carve out a "secure" area of RAM. The intended usage scenario for TrustZone is creating a secure nucleus that can serve as a boot-time root of trust and as the foundation for security in a system. The code running in the secure domain effectively owns the hardware, and has access to both the secure and non-secure domains. The non-secure domain is largely unaware of the secure domain and, in a well-designed system, cannot access any secure resources.

An idea which immediately comes to mind is that the TrustZone extensions could be used to simplify virtualization - after all, they let you run code in a "virtual" ARM processor. The secure domain would be used for the hypervisor, while the non-secure domain would host the virtual machines. In practice, however, TrustZone in its current implementation was never really meant for generic virtualization.

The first problem which immediately comes to mind is physical address space protection. Most TrustZone-capable SoCs I have seen so far have a facility for "carving out" a region of RAM, making it visible in the secure physical address space while hiding it from the non-secure one. This allows the secure code and data to be inaccessible from the non-secure domain. However, if you have several VMs running, you will not be able to protect the physical address space of each from the other VMs. Worse, in existing designs all hardware is accessible from the non-secure domain, so there is no way to stop a VM from messing with the physical state. Additionally, if you're basing your hypervisor on top of an existing kernel like Linux, defining the secure region in terms of base and length is fairly difficult, unless you hide half (or whatever) of RAM using something like the mem= boot parameter. None of these physical protection issues are CPU issues, and all are solvable with a custom memory controller - effectively an additional MMU, with physical-to-machine address tables describing the secure and the non-secure physical address space. However, unless you are designing a new SoC and a device around it, you have to live with no physical address space protection between OSes. Effectively, that means the non-secure domain has to run your own code, which ensures that physical memory belonging to the secure domain or to other OSes is not trampled - i.e. paravirtualization.

The other 90% of the iceberg does end up being an ARM issue. See, the secure monitor mode is meant for a secure monitor, which facilitates switching between the secure and non-secure domains on a "secure monitor call" or an interrupt. Code operating in mon mode can access the banked R13/R14/SPSR registers of any secure mode; this works by allowing secure code to transition to mon from any other mode. Code operating in mon mode can also toggle the NS bit to access the non-secure versions of the control registers, but it cannot access the non-secure banked R13 (stack), R14 (link) and SPSR registers. Not without resorting to large-overhead hacks (more on this later). So if you wanted to use the TrustZone non-secure domain for VMs, you wouldn't (easily) be able to switch those registers while scheduling. Coupled with the physical address space protection issue, this might just nudge you towards porting, say, the Xen hypervisor to run as the non-secure OS and taking it from there :-).

Using TrustZone to run Xen side by side with, say, Linux, Symbian or NT has definite advantages - no need to modify the "host" OS other than loading a driver implementing a TrustZone monitor, and pretty transparent and fast switches to the Xen hypervisor and thus to the other OSes. This provides a solution for devices where replacing the bundled OS or booting a third-party kernel is not an option... which, as far as ARM devices go nowadays, is almost all of them.

Monday, May 24, 2010

Linux: synchronizing kernel mappings manually

I want to write a hypervisor for ARM. I decided that my launch vehicle is the Linux kernel - in fact, I am effectively turning the Linux kernel into the hypervisor. I want to keep it as a loadable kernel module, mainly because I don't want to modify the Linux kernel. This means I should theoretically be able to load it (and unload it) anywhere there is hardware support, root access, and a kernel source tree to build the module against.

The particular ISA extensions I am targeting add another CPU mode, in which you can intercept hypercalls, IRQs, FIQs, and certain types of aborts. The first time I tried to intercept an IRQ, the kernel died in the prefetch abort handler with a "Bad mode in prefetch abort handler detected" message.


The problem stems from the way memory is mapped within the kernel. Anytime vmalloc() is called (and loaded modules go into vmalloc()ed memory), the kernel simply updates the master kernel page tables (init_mm->pgd). When a page fault occurs, the kernel page fault handler checks whether the address falls within VMALLOC_START and VMALLOC_END, and propagates the page table entries from init_mm->pgd to the currently active translation tables. On ARM, page faults come in as either data aborts or prefetch (instruction access) aborts. When the module loaded, this happened in the context of the insmod process, whose page tables got fixed up by the page fault handler as the module was initialized. When the IRQ arrived, the CPU went into hypervisor mode, and the hypervisor IRQ vector tried to pass control to a piece of code located in the module. Because a different process was executing at the time of the IRQ, a page fault occurred. It should have resulted in patching up the page tables, but didn't, because the abort handler didn't expect the previous CPU state to be this new, unknown-to-Linux state.

Fixing this means ensuring that, no matter what context we're executing in, a page fault never happens while accessing my module's data or code. We can't access init_mm (and the list lock), because those symbols are not exported by the kernel. However, we /can/ propagate changes in the VMALLOC_START-VMALLOC_END range from the current process's address space to all the other process address spaces, as the last step of initializing the module. But we do need to be sure that the current address space contains all the mappings for the module's code and data.

Sunday, May 23, 2010

A long time ago in a galaxy far away...

This is a 32-bit UEFI firmware based on the UEFI EDK. As per the UEFI specification - unpaged protected mode. This was around 3.5 years ago, before the time of OVMF, and involved filling in all the missing bits to turn the Nt32 simulator into real firmware - low-level code, chipset support, patched build tools to properly relocate execute-in-place PE32 binaries for the SEC/PEI phases. Sure, I'd done it all before, but this time I did everything carefully and non-hacky. And this time it took me no more than 5 all-nighters, ignored lectures and "work" at my then job... a stark contrast to the month or month and a half it took me to do the same for x64 before =).

I had followed the same pattern of Tiano porting - maintaining separate SEC, PEI and DXE phases - even though in the context of a virtual machine, the PEI phase had nothing to do but load the DXE core...

The platform changes have been lost forever, while the tool patches to generate proper ROMs (following my own specification of how it should work, given the lack of functioning code provided by Intel) carried on with me into my more official EFI endeavors - but as actual code they, too, are either gone or bit-rotting on some hard drive in one of my old boxes... I never got past booting to the EFI shell - I think my plan was to test ELILO, but given my senior year at UIC and other concerns, I never got around to it.

P.S.: Ignore the silly SVN commit comment. You definitely don't want to use the TSC for timer calibration, given that the TSC may fluctuate with CPU power-saving states...

Maybe I should do a port for ARMv5 Integrator/CP in QEMU...

Saturday, May 22, 2010

Day 1

In the interest of keeping my notes pertaining to my errrr...professional activities in one place and open for others, I've decided to start a blog that I will solely dedicate to low-level and system programming, as well as any and all pet projects of mine...