blog

DRM leasing part three (vblank)

The last couple of weeks have been consumed by getting frame sequence numbers and events handled within the leasing environment (and Vulkan) correctly.

Vulkan EXT_display_control extension

This little extension provides the bits necessary for applications to track the display of frames to the user.

VkResult
vkGetSwapchainCounterEXT(VkDevice           device,
             VkSwapchainKHR         swapchain,
             VkSurfaceCounterFlagBitsEXT    counter,
             uint64_t           *pCounterValue);

This function just retrieves the current frame count from the display associated with swapchain.

VkResult
vkRegisterDisplayEventEXT(VkDevice          device,
              VkDisplayKHR          display,
              const VkDisplayEventInfoEXT   *pDisplayEventInfo,
              const VkAllocationCallbacks   *pAllocator,
              VkFence           *pFence);

This function creates a fence that will be signaled when the specified event happens. Right now, the only event supported is when the first pixel of the next display refresh cycle leaves the display engine for the display. If you want something fancier (like two frames from now), you get to do that on your own using this basic function.

drmWaitVBlank

drmWaitVBlank is the existing interface for all things sequence related and has three modes (always nice to have one function do three things, I think). It can:

Query the current vblank number
Block until a specified vblank number
Queue an event to be delivered at a specific vblank number

This interface has a few issues:

It has been kludged into supporting multiple CRTCs by taking bits from the 'type' parameter to hold a 'pipe' number, which is the index in the kernel into the array of CRTCs.
It has a random selection of 'int' and 'long' datatypes in the interface, making it need special helpers for 32-bit apps running on a 64-bit kernel.
Times are in microseconds, frame counts are 32 bits. Vulkan does everything in nanoseconds and wants 64-bits of frame counts.

For leases, figuring out the index into the kernel list of crtcs is pretty tricky -- our lease has a subset of those crtcs, so we can't actually compute the global crtc index.

drmCrtcGetSequence

int drmCrtcGetSequence(int fd, uint32_t crtcId,
               uint64_t *sequence, uint64_t *ns);

Here's a simple new function — hand it a crtc ID and it provides the current frame sequence number and the time when that frame started (in nanoseconds).

drmCrtcQueueSequence

int drmCrtcQueueSequence(int fd, uint32_t crtcId,
                 uint32_t flags, uint64_t sequence,
             uint64_t user_data);

struct drm_event_crtc_sequence {
    struct drm_event    base;
    __u64           user_data;
    __u64           time_ns;
    __u64           sequence;
};

This will cause a CRTC_SEQUENCE event to be delivered at the start of the specified frame sequence. That event will include the frame when the event was actually generated (in case it's late), along with the time (in nanoseconds) when that frame was started. The event also includes a 64-bit user_data value, which can be used to hold a pointer to whatever data the application wants to see in the event handler.

The 'flags' argument contains a combination of:

#define DRM_CRTC_SEQUENCE_RELATIVE      0x00000001  /* sequence is relative to current */
#define DRM_CRTC_SEQUENCE_NEXT_ON_MISS      0x00000002  /* Use next sequence if we've missed */
#define DRM_CRTC_SEQUENCE_FIRST_PIXEL_OUT   0x00000004  /* Signal when first pixel is displayed */

These are similar to the values provided for the drmWaitVBlank function, except I've added a selector for when the event should be delivered to align with potential future additions to Vulkan. Right now, the only time you can ask for is first-pixel-out, which says that the event should correspond to the display of the first pixel on the screen.

DRM events → Vulkan fences

With the kernel able to deliver a suitable event at the next frame, all the Vulkan code needed was a to create a fence and hook it up to such an event. The existing fence code only deals with rendering fences, so I added window system interface (WSI) fencing infrastructure and extended the radv driver to be able to handle both kinds of fences within that code.

Multiple waiting threads

I've now got three places which can be waiting for a DRM event to appear:

Frame sequence fences.
Wait for an idle image. Necessary when you want an image to draw the next frame to.
Wait for the previous flip to complete. The kernel can only queue one flip at a time, so we have to make sure the previous flip is complete before queuing another one.

Vulkan allows these to be run from separate threads, so I needed to deal with multiple threads waiting for a specific DRM event at the same time.

XCB has the same problem and goes to great lengths to manage this with a set of locking and signaling primitives so that only one thread is ever doing poll or read from the socket at time. If another thread wants to read at the same time, it will block on a condition variable which is then signaled by the original reader thread at the appropriate time. It's all very complicated, and it didn't work reliably for a number of years.

I decided to punt and just create a separate thread for processing all DRM events. It blocks using poll(2) until some events are readable, processes those and then broadcasts to a condition variable to notify any waiting threads that 'something' has happened. Each waiting thread simply checks for the desired condition and if not satisfied, blocks on that condition variable. It's all very simple looking, and seems to work just fine.

Code Complete, Validation Remains

At this point, all of the necessary pieces are in place for the VR application to take advantage of an HMD using only existing Vulkan extensions. Those will be automatically mapped into DRM leases and DRM events as appropriate.

The VR compositor application is working pretty well; tests with Dota 2 show occasional jerky behavior in complex scenes, so there's clearly more work to be done somewhere. I need to go write a pile of tests to independently verify that my code is working. I wonder if I'll need to wire up some kind of light sensor so I can actually tell when frames get displayed as it's pretty easy to get consistent-but-wrong answers in this environment.

Source Code

Linux. This is based off of a reasonably current drm-next branch from Dave Airlie. 965 commits past 4.12 RC3.

git://people.freedesktop.org/~keithp/linux drm-lease-v3
X server (which includes xf86-video-modesetting). This is pretty close to master.

git://people.freedesktop.org/~keithp/xserver drm-lease
RandR protocol changes

git://people.freedesktop.org/~keithp/randrproto drm-lease
xcb proto (no changes to libxcb sources, but it will need to be rebuilt)

git://people.freedesktop.org/~keithp/xcb/proto drm-lease
DRM library. About a dozen patches behind master.

git://people.freedesktop.org/~keithp/drm drm-lease
Mesa. Branched early this month (4 June), this is pretty far from master.

git://people.freedesktop.org/~keithp/mesa drm-lease

Posted Sat Jul 1 01:45:31 2017 Tags: tags/fdo

DRM leasing part three (Vulkan)

With the kernel APIs off for review, and the X RandR bits looking like they're in reasonable shape, I finally found some time to sit down and figure out how I wanted to integrate this into Vulkan.

Avoiding two DRM file descriptors

Given that a DRM lease is represented by a DRM master file descriptor, we want to use that for all of the operations in the driver, including rendering and mode setting. Using the vulkan driver render node and the lease master node together would require passing buffer objects between the kernel contexts using even more file descriptors.

The Mesa Vulkan drivers open the device nodes while enumerating devices, not when they are created. This seems a bit early to me, but it makes sure that the devices being enumerated are actually available for use, and not just present in the system. To replace the render node fd with the lease master fd means hooking into the system early enough that the enumeration code can see the lease fd. And that means creating an instance extension as the instance gets created before devices are enumerated.

The VK_KEITHP_kms_display instance extension

This simple instance extension provides the necessary hooks to get the lease information from the application down into the driver before the DRM node is opened. In the first implementation, I added a function that could be called before the devices were enumerated to save the information in the Vulkan loader. That worked, but required quite a bit of study of the Vulkan loader and its XML description of the full Vulkan API.

Mark Young suggested that a simpler plan would be to chain the information into the VkInstanceCreateInfo pNext field; with no new APIs added to Vulkan, there shouldn't be any need to change the Vulkan loader -- the device driver would advertise the new instance extension and the application could find it.

That would have worked great, except the Vulkan loader 'helpfully' elides all instance extensions it doesn't know about before returning the list to the application. I'd say this was a bug and should be fixed, but for now, I've gone ahead and added the few necessary definitions to the loader to make it work.

In the application, it's a simple matter of searching for this extension, constructing the VkKmsDisplayInfoKEITHP structure, chaining that into the VkInstanceCreateInfo pNext list and passing that in to the vkCreateInstance call.

typedef struct VkKmsDisplayInfoKEITHP {
    VkStructureType         sType;  /* VK_STRUCTURE_TYPE_KMS_DISPLAY_INFO_KEITHP */
    const void*             pNext;
    int                     fd;
    uint32_t                crtc_id;
    uint32_t                *connector_ids;
    int                     connector_count;
    drmModeModeInfoPtr      mode;
} VkKmsDisplayInfoKEITHP;

As you can see, this includes the master file descriptor along with all of the information necessary to set the desired video mode using the specified resources.

The driver just walks the pNext list from the VkInstanceCreateInfo structure looking for any provided VkKmsDisplayInfoKEITHP structure and pulls the data out.

To avoid questions about file descriptor lifetimes, the driver dup's the provided fd. The application is expected to close their copy at a suitable time.

The VK_KHR_display extension

Vulkan already has an API for directly accessing the raw device, including code for exposing video modes and everything. As tempting as it may be to just go do something simpler, there's a lot to be said for using existing APIs.

This extension doesn't provide any direct APIs for acquiring display resources, relying on the VK_EXT_acquire_xlib_display extension for that part. And that takes a VkPhysicalDisplay parameter, which is only available after the device is opened, which is why I created the VK_KEITHP_kms_display extension instead of just using the VK_EXT_acquire_xlib_display extension -- we can't increase the capabilities of the render node opened by the driver, and we don't want to keep two file descriptors around.

With the information provided by the VK_KEITHP_kms_display extension, we can implement all of the VK_KHR_display extension APIs, including enumerating planes and modes and creating the necessary display surface. Of course, there's only one plane and one mode, so some of the implementation is pretty simplistic.

The big piece of work was to create the swap chain structure and associated frame buffers.

A working example

I've taken the 'cube' example from the Vulkan loader and hacked it up to use XCB to construct a DRM lease, the VK_KEITHP_kms_display extension to pass that lease into the Vulkan driver. The existing support for the VK_KHR_display extension "just worked", which was pretty satisfying.

It's a bit of a mess

I'm not satisfied with the mesa code at this point; there's a bunch of code in the radeon driver which should be in the vulkan WSI bits, and the vulkan WSI bits should probably not have the KMS interfaces wired in. I'll ask around and see what other Mesa developers think I should do before restructuring it further; I'll probably have to rewrite it all at least one more time before it's ready to upstream.

Seeing the code

I'll be cleaning the code up a bit before sending it out for review, but it's already visible in my own repositories:

Posted Tue May 30 01:41:22 2017 Tags: tags/fdo

TeleMini V3.0 Dual-deploy altimeter with telemetry now available

TeleMini v3.0 is an update to our original TeleMini v1.0 flight computer. It is a miniature (1/2 inch by 1.7 inch) dual-deploy flight computer with data logging and radio telemetry. Small enough to fit comfortably in an 18mm tube, this powerful package does everything you need on a single board:

512kB on-board data logging memory, compared with 5kB in v1.
40mW, 70cm ham-band digital transceiver for in-flight telemetry and on-the-ground configuration, compared to 10mW in v1.
Transmitted telemetry includes altitude, speed, acceleration, flight state, igniter continutity, temperature and battery voltage. Monitor the state of the rocket before, during and after flight.
Radio direction finding beacon transmitted during and after flight. This beacon can be received with a regular 70cm Amateur radio receiver.
Barometer accurate to 100k' MSL. Reliable apogee detection, independent of flight path. Barometric data recorded on-board during flight. The v1 boards could only fly to 45k'.
Dual-deploy with adjustable apogee delay and main altitude. Fires standard e-matches and Q2G2 igniters.
0.5” x 1.7”. Fits easily in an 18mm tube. This is slightly longer than the v1 boards to provide room for two extra mounting holes past the pyro screw terminals.
Uses rechargeable Lithium Polymer battery technology. All-day power in a small and light-weight package.
Learn more at http://www.altusmetrum.org/TeleMini/
Purchase these at http://shop.gag.com/home-page/telemini-v3.html

I don't have anything in these images to show just how tiny this board is—but the spacing between the screw terminals is 2.54mm (0.1in), and the whole board is only 13mm wide (1/2in).

This was a fun board to design. As you might guess from the version number, we made a couple prototypes of a version 2 using the same CC1111 SoC/radio part as version 1 but in the EasyMini form factor (0.8 by 1.5 inches). Feed back from existing users indicated that bigger wasn't better in this case, so we shelved that design.

With the availability of the STM32F042 ARM Cortex-M0 part in a 4mm square package, I was able to pack that, the higher power CC1200 radio part, a 512kB memory part and a beeper into the same space as the original TeleMini version 1 board. There is USB on the board, but it's only on some tiny holes, along with the cortex SWD debugging connection. I may make some kind of jig to gain access to that for configuration, data download and reprogramming.

For those interested in an even smaller option, you could remove the screw terminals and battery connector and directly wire to the board, and replace the beeper with a shorter version. You could even cut the rear mounting holes off to make the board shorter; there are no components in that part of the board.

Posted Tue Apr 25 09:01:19 2017 Tags: tags/rockets

DRM leasing part deux (kernel side)

I've stabilized the kernel lease implementation so that leases now work reliably, and don't require a reboot before running the demo app a second time. Here's whats changed:

Reference counting is hard. I'm creating a new drm_master, which means creating a new file. There were a bunch of reference counting errors in my first pass; I've cleaned those up while also reducing the changes needed to the rest of the DRM code.
Added a 'mask_lease' value to the leases -- this controls whether resources in the lease are hidden from the lessor, allowing the lessor to continue to do operations on the leased resources if it likes.
Hacked the mutex locking assertions to not crash in normal circumstances. I'm now doing:

BUG_ON(__mutex_owner(&master->dev->mode_config.idr_mutex) != current);

to make sure the mutex is held by the current thread, instead of just making sure some thread holds the mutex. I have this memory of a better way to do this, but now I can't dig it up. Suggestions welcome, of course.

I'm reasonably pleased with the current state of the code, although I want to squash the patches together so that only the final state of the design is represented, rather than the series of hacks present now.

Comments on #dri-devel

I spent a bit of time on the #dri-devel IRC channel answering questions about the DRM-lease design and the changes above reflect that.

One concern was about mode setting from two masters at the same time. Mode setting depends on a number of shared 'hidden' resources; things like memory fifos and the like. If either lessee or lessor wants to change the displayed modes, they may fail due to conflicts over these resources and not be able to recover easily.

A related concern was that the current TEST/render/commit mechanism used by compositors may no longer be reliable as another master could change hidden resource usage between the TEST and commit operations. Daniel Vetter suggested allowing the lessor to 'exclude' lessee mode sets during this operation.

One solution would be to have the lessor set the mode on the leased resources before ceding control of the objects, and once set, the lessee shouldn't perform additional mode setting operations. This would limit the flexibility in the lessee quite a bit as it wouldn't be able to pop up overlay planes or change resolution.

Questions

Let's review the questions from my last post, DRM-lease:

What should happen when a Lessor is closed? Should all access to controlled resources be revoked from all descendant Lessees?

Answer: The lessor is referenced by the lessee and isn't destroyed until it has been closed and all lessees are destroyed.
How about when a Lessee is closed? Should the Lessor be notified in some way?

Answer: I think so? Need to figure out a mechanism here.
CRTCs and Encoders have properties. Should these properties be automatically included in the lease?

Answer -- no, userspace is responsible for constructing the entire lease.

Remaining Kernel Work

The code is running, and appears stable. However, it's not quite done yet. Here's a list of remaining items that I know about:

Changing leases should update sub-leases. When you reduce the resources in one lease, the kernel should walk any sub-leases and clear out resources which the lessor no longer has access to.
Sending events when leases are created/destroyed. When a lease is created, if the mask_lease value is set, then the lessor should get regular events describing the effective change. Similarly, both lessor and lessee should get events when a lease is changed.
Refactoring the patch series to squash intermediate versions of the new code.

Remaining Other Work

Outside of the kernel, I'll be adding X support for this operation. Here's my current thinking:

Extend RandR to add a lease-creation request that returns a file descriptor. This will take a mode and the server will set that mode before returning.
Provide some EDID-based matching for HMD displays in the X server. The goal is to 'hide' these from the regular desktop so that the HMD screen doesn't flicker with desktop content before the lease is created. I think what I want is to let an X client provide some matching criteria and for it to get an event when a output is connected with matching EDID information.

With that, I think this phase of the project will be wrapped up and it will be time to move on to actually hooking up a real HMD and hacking the VR code to use this new stuff.

Seeing the code

For those interested in seeing the state of the code so far, there's kernel, drm and kmscube repositories here:

Posted Fri Mar 31 17:25:22 2017 Tags: tags/fdo

DRM display resource leasing (kernel side)

So, you've got a fine head-mounted display and want to explore the delights of virtual reality. Right now, on Linux, that means getting the window system to cooperate because the window system is the DRM master and holds sole access to all display resources. So, you plug in your device, play with RandR to get it displaying bits from the window system and then carefully configure your VR application to use the whole monitor area and hope that the desktop will actually grant you the boon of page flipping so that you will get reasonable performance and maybe not even experience tearing. Results so far have been mixed, and depend on a lot of pieces working in ways that aren't exactly how they were designed to work.

We could just hack up the window system(s) and try to let applications reserve the HMD monitors and somehow removing them from the normal display area so that other applications don't randomly pop up in the middle of the screen. That would probably work, and would take advantage of much of the existing window system infrastructure for setting video modes and performing page flips. However, we've got a pretty spiffy standard API in the kernel for both of those, and getting the window system entirely out of the way seems like something worth trying.

I spent a few hours in Hobart chatting with Dave Airlie during LCA and discussed how this might actually work.

Goals

Use KMS interfaces directly from the VR application to drive presentation to the HMD.
Make sure the window system clients never see the HMD as a connected monitor.
Maybe let logind (or other service) manage the KMS resources and hand them out to the window system and VR applications.

Limitations

Don't make KMS resources appear and disappear. It turns out applications get confused when the set of available CRTCs, connectors and encoders changes at runtime.

An Outline for Multiple DRM masters

By the end of our meeting in Hobart, Dave had sketched out a fairly simple set of ideas with me. We'd add support in the kernel to create additional DRM masters. Then, we'd make it possible to 'hide' enough state about the various DRM resources so that each DRM master would automagically use disjoint subsets of resources. In particular, we would.

Pretend that connectors were always disconnected
Mask off crtc and encoder bits so that some of them just didn't seem very useful.
Block access to resources controlled by other DRM masters, just in case someone tried to do the wrong thing.

Refinement with Eric over Swedish Pancakes

A couple of weeks ago, Eric Anholt and I had breakfast at the original pancake house and chatted a bit about this stuff. He suggested that the right interface for controlling these new DRM masters was through the existing DRM master interface, and that we could add new ioctls that the current DRM master could invoke to create and manage them.

Leasing as a Model

I spent some time just thinking about how this might work and came up with a pretty simple metaphor for these new DRM masters. The original DRM master on each VT "owns" the output resources and has final say over their use. However, a DRM master can create another DRM master and "lease" resources it has control over to the new DRM master. Once leased, resources cannot be controlled by the owner unless the owner cancels the lease, or the new DRM master is closed. Here's some terminology:

DRM Master: Any DRM file which can perform mode setting.
Owner: The original DRM Master, created by opening /dev/dri/card*
Lessor: A DRM master which has leased out resources to one or more other DRM masters.
Lessee: A DRM master which controls resources leased from another DRM master. Each Lessee leases resources from a single Lessor.
Lessee ID: An integer which uniquely identifies a lessee within the tree of DRM masters descending from a single Owner.
Lease: The contract between the Lessor and Lessee which identifies which resources which may be controlled by the Lessee. All of the resources must be owned by or leased to the Lessor.

With Eric's input, the interface to create a lease was pretty simple to write down:

int drmModeCreateLease(int fd,
               const uint32_t *objects,
               int num_objects,
               int flags,
               uint32_t *lessee_id);

Given an FD to a DRM master, and a list of objects to lease, a new DRM master FD is returned that holds a lease to those objects. 'flags' can be any combination of O_CLOEXEC and O_NONBLOCK for the newly minted file descriptor.

Of course, the owner might want to take some resources back, or even grant new resources to the lessee. So, I added an interface that rewrites the terms of the lease with a new set of objects:

int drmModeChangeLease(int fd,
               uint32_t lessee_id,
               const uint32_t *objects,
               int num_objects);

Note that nothing here makes any promises about the state of the objects across changes in the lease status; the lessor and lessee are expected to perform whatever modesetting is required for the objects to be useful to them.

Window System Integration

There are two ways to integrate DRM leases into the window system environment:

Have logind "lease" most resources to the window system. When a HMD is connected, it would lease out suitable resources to the VR environment.
Have the window system "own" all of the resources and then add window system interfaces to create new DRM masters leased from its DRM master.

I'll probably go ahead and do 2. in X and see what that looks like.

One trick with any of this will be to hide HMDs from any RandR clients listening in on the window system. You probably don't want the window system to tell the desktop that a new monitor has been connected, have it start reconfiguring things, and then have your VR application create a new DRM master, making the HMD appear to have disconnected to the window system and have that go reconfigure things all over again.

I'm not sure how this might work, but perhaps having the VR application register something like a passive grab on hot plug events might make sense? Essentially, you want it to hear about monitor connect events, go look to see if the new monitor is one it wants, and if not, release that to other X clients for their use. This can be done in stages, with the ability to create a new DRM master over X done first, and then cleaning up the hotplug stuff later on.

Current Status

I hacked up the kernel to support the drmModeCreateLease API, and then hacked up kmscube to run two threads with different sets of KMS resources. That ran for nearly a minute before crashing and requiring a reboot. I think there may be some locking issues with page flips from two threads to the same device.

I think I also made the wrong decision about how to handle lessors closing down. I tried to let the lessors get deleted and then 'orphan' the lessees. I've rewritten that so that lessees hold a reference on their lessor, keeping the lessor in place until the lessee shuts down. I've also written the kernel parts of the drmModeChangeLease support.

Questions

What should happen when a Lessor is closed? Should all access to controlled resources be revoked from all descendant Lessees?

Proposed answer -- lessees hold a reference to their lessor so that the entire tree remains in place. A Lessor can clean up before exiting by revoking lessee access if it chooses.
How about when a Lessee is closed? Should the Lessor be notified in some way?
CRTCs and Encoders have properties. Should these properties be automatically included in the lease?

Proposed answer -- no, userspace is responsible for constructing the entire lease.

Posted Tue Mar 28 15:22:22 2017 Tags: tags/fdo

Consulting for Valve in my spare time

Valve Software has asked me to help work on a couple of Linux graphics issues, so I'll be doing a bit of consulting for them in my spare time. It should be an interesting diversion from my day job working for Hewlett Packard Enterprise on Memory Driven Computing and other fun things.

First thing on my plate is helping support head-mounted displays better by getting the window system out of the way. I spent some time talking with Dave Airlie and Eric Anholt about how this might work and have started on the kernel side of that. A brief synopsis is that we'll split off some of the output resources from the window system and hand them to the HMD compositor to perform mode setting and page flips.

After that, I'll be working out how to improve frame timing reporting back to games from a composited desktop under X. Right now, a game running on X with a compositing manager can't tell when each frame was shown, nor accurately predict when a new frame will be shown. This makes smooth animation rather difficult.

Posted Tue Mar 14 12:10:17 2017 Tags: tags/fdo

FOSDEM 2017 -- The Machine and ChaosKeys

Yay! I get to go to FOSDEM this year. I'll be speaking about Free Software for The Machine on Sunday afternoon at 14:00 in K.1.105 (La Fontaine).

I'll also be bringing along a number of ChaosKeys and will have them available for either $40 US or €40. You can either bring cash or pre-pay at the Altus Metrum shop.

Hope to see you there.

Posted Mon Jan 30 16:40:46 2017 Tags: tags/debian

Finding a Libc for tiny embedded ARM systems

You'd think this problem would have been solved a long time ago. All I wanted was a C library to use in small embedded systems -- those with a few kB of flash and even fewer kB of RAM.

Small system requirements

A small embedded system has a different balance of needs:

Stack space is limited. Each thread needs a separate stack, and it's pretty hard to move them around. I'd like to be able to reliably run with less than 512 bytes of stack.
Dynamic memory allocation should be optional. I don't like using malloc on a small device because failure is likely and usually hard to recover from. Just make the linker tell me if the program is going to fit or not.
Stdio doesn't have to be awesomely fast. Most of our devices communicate over full-speed USB, which maxes out at about 1MB/sec. A stdio setup designed to write to the page cache at memory speeds is over-designed, and likely involves lots of buffering and fancy code.
Everything else should be fast. A small CPU may run at only 20-100MHz, so it's reasonable to ask for optimized code. They also have very fast RAM, so cycle counts through the library matter.

Available small C libraries

I've looked at:

μClibc. This targets embedded Linux systems, and also appears dead at this time.
musl libc. A more lively project; still, definitely targets systems with a real Linux kernel.
dietlibc. Hasn't seen any activity for the last three years, and it isn't really targeting tiny machines.
newlib. This seems like the 'normal' embedded C library, but it expects a fairly complete "kernel" API and the stdio bits use malloc.
avr-libc. This has lots of Atmel assembly language, but is otherwise ideal for tiny systems.
pdclib. This one focuses on small source size and portability.

Current AltOS C library

We've been using pdclib for a couple of years. It was easy to get running, but it really doesn't match what we need. In particular, it uses a lot of stack space in the stdio implementation as there's an additional layer of abstraction that isn't necessary. In addition, pdclib doesn't include a math library, so I've had to 'borrow' code from other places where necessary. I've wanted to switch for a while, but there didn't seem to be a great alternative.

What's wrong with newlib?

The "obvious" embedded C library is newlib. Designed for embedded systems with a nice way to avoid needing a 'real' kernel underneath, newlib has a lot going for it. Most of the functions have a good balance between speed and size, and many of them even offer two implementations depending on what trade-off you need. Plus, the build system 'just works' on multi-lib targets like the family of cortex-m parts.

The big problem with newlib is the stdio code. It absolutely requires dynamic memory allocation and the amount of code necessary for 'printf' is larger than the flash space on many of our devices. I was able to get a cortex-m3 application compiled in 41kB of code, and that used a smattering of string/memory functions and printf.

How about avr libc?

The Atmel world has it pretty good -- avr-libc is small and highly optimized for atmel's 8-bit avr processors. I've used this library with success in a number of projects, although nothing we've ever sold through Altus Metrum.

In particular, the stdio implementation is quite nice -- a 'FILE' is effectively a struct containing pointers to putc/getc functions. The library does no buffering at all. And it's tiny -- the printf code lacks a lot of the fancy new stuff, which saves a pile of space.

However, much of the places where performance is critical are written in assembly language, making it pretty darn hard to port to another processor.

Mixing code together for fun and profit!

Today, I decided to try an experiment to see what would happen if I used the avr-libc stdio bits within the newlib environment. There were only three functions written in assembly language, two of them were just stubs while the third was a simple ultoa function with a weird interface. With those coded up in C, I managed to get them wedged into newlib.

Figuring out the newlib build system was the only real challenge; it's pretty awful having generated files in the repository and a mix of autoconf 2.64 and 2.68 version dependencies.

The result is pretty usable though; my STM 32L discovery board demo application is only 14kB of flash while the original newlib stdio bits needed 42kB and that was still missing all of the 'syscalls', like read, write and sbrk.

Here's gitweb pointing at the top of the tiny-stdio tree:

gitweb

And, of course you can check out the whole thing

git clone git://keithp.com/git/newlib

'master' remains a plain upstream tree, although I do have a fix on that branch. The new code is all on the tiny-stdio branch.

I'll post a note on the newlib mailing list once I've managed to subscribe and see if there is interest in making this option available in the upstream newlib releases. If so, I'll see what might make sense for the Debian libnewlib-arm-none-eabi packages.

Posted Sat Jan 7 23:32:10 2017 Tags: tags/debian tags/rockets

Updates to Altos Lisp

I wrote a few days ago about a tiny lisp interpreter I wrote for AltOS

Really, it's almost "done" now, I just wanted to make a few improvements

Incremental Collection

I was on a walk on Wednesday when I figured out that I didn't need to do a full collection every time; a partial collection that only scanned the upper portion of memory would often find plenty of free space to keep working for a while.

To recap, the heap is in two pieces; the ROM piece and the RAM piece. The ROM piece is generated during the build process and never changes afterwards (hence the name), so the only piece which is collected is the RAM piece. Collection works like:

chunk_low = heap base
new_top = heap base

For all of the heap
    Find the first 64 live objects above chunk_low
    Compact them all to new_top
    Rewrite references in the whole heap for them
    Set new_top above the new locations
    Set chunk_low above the old locations

top = new_top

The trick is to realize that there's really no need to start at the bottom of the heap; you can start anywhere you like and compact stuff, possibly leaving holes below that location in the heap. As the heap tends to have long-lived objects slowly sift down to the beginning, it's useful to compact objects higher than that, skipping the compaction process for the more stable area in memory.

Each time the whole heap is scanned, the top location is recorded. After that, incremental collects happen starting at that location, and when that doesn't produce enough free space, a full collect is done.

The collector now runs a bunch faster on average now.

Binary Searches

I stuck some linear searches in a few places in the code, the first was in the collector when looking to see where an object had moved to. As there are 64 entries, the search is reduced from 32 to 6 compares on average. The second place was in the frame objects, which hold the list of atom/value bindings for each lexical scope (including the global scope). These aren't terribly large, but a binary search is still a fine plan. I wanted to write down here the basic pattern I'm using for binary searches these days, which avoids some of the boundary conditions I've managed to generate in the past:

int find (needle) {
    int l = 0;
    int r = count - 1;
    while (l <= r) {
        int m = (l + r) >> 1;
        if (haystack[m] < needle)
            l = m + 1;
        else
            r = m - 1;
    }
    return l;
}

With this version, the caller can then check to see if there's an exact match, and if not, then the returned value is the location in the array where the value should be inserted. If the needle is known to not be in the haystack, and if the haystack is large enough to accept the new value:

void insert(needle) {
    int l = find(needle);

    memmove(&haystack[l+1],
        &haystack[l],
        (num - l) * sizeof (haystack[0]));

    haystack[l] = needle;
}

Similarly, if the caller just wants to know if the value is in the array:

bool exists(needle) {
    int l = find(needle);

    return (l < count && haystack[l] == needle);
}

Call with Current Continuation

Because the execution stack is managed on the heap, it's completely trivial to provide the scheme-like call with current continuation, which constructs an object which can be 'called' to transfer control to a saved location:

> (+ "hello " (call/cc (lambda (return) (setq boo return) (return "foo "))) "world")
"hello foo world"
> (boo "bar ")
"hello bar world"
> (boo "yikes ")
"hello yikes world"

One thing I'd done previously is dump the entire state of the interpreter on any error, and that included a full stack trace. I adopted that code for printing of these continuation objects:

boo
[
    expr:   (call/cc (lambda (return) (set (quote boo) return) (return "foo ")))
    state:  val
    values: (call/cc
             [recurse...]
             )
    sexprs: ()
    frame:  {}
]
[
    expr:   (+ "hello " (call/cc (lambda (return) (set (quote boo) return) (return "foo "))) "world")
    state:  formal
    values: (+
             "hello "
             )
    sexprs: ((call/cc (lambda (return) (set (quote boo) return) (return "foo ")))
             "world"
             )
    frame:  {}
]

The top stack frame is about to return from the call/cc spot with a value; supply a value to 'boo' and that's where you start. The next frame is in the middle of computing formals for the + s-expression. It's found the + function, and the "hello " string and has yet to get the value from call/cc or the value of the "world" string. Once the call/cc "returns", that value will get moved to the values list and the sexpr list will move forward one spot to compute the "world" value.

Implementing this whole mechanism took only a few dozen lines of code as the existing stack contexts were already a continuation in effect. The hardest piece was figuring out that I needed to copy the entire stack each time the continuation was created or executed as it is effectively destroyed in the process of evaluation.

I haven't implemented dynamic-wind yet; when I did that for nickle, it was a bit of a pain threading execution through the unwind paths.

Re-using Frames

I decided to try and re-use frames (those objects which hold atom/value bindings for each lexical scope). It wasn't that hard; the only trick was to mark frames which have been referenced from elsewhere as not-for-reuse and then avoid sticking those in the re-use queue. This reduced allocations even further so that for simple looping or tail-calling code, the allocator may never end up being called.

How Big Is It?

I've managed to squeeze the interpreter and all of the rest of the AltOS system into 25kB of Cortex-M0 code. That leaves space for the 4kB boot loader and 3kB of flash to save/restore the 3kB heap across resets.

Adding builtins to control timers and GPIOs would make this a reasonable software load for an Arduino; offering a rather different programming model for those with a taste for adventure. Modern ARM-based Arduino boards have plenty of flash and ram for this. It might be interesting to get this running on the Arduino Zero; there's no real reason to replace the OS either; porting the lisp interpreter into the bare Arduino environment wouldn't take long.

Posted Sat Nov 19 01:20:28 2016

A Tiny Lisp for AltOS

I took a bit of a diversion over the last week or so when I wondered how small a lisp interpreter I could write, and whether I could fit that into one of the processors that AltOS runs on. It turns out, you can write a tiny lisp interpreter that fits in about 25kB of ram with a 3kB heap for dynamic data.

I decided to target our ChaosKey boards; they're tiny, and I've got a lot of them. That processor offers 28kB of usable flash space (after the 4kB boot loader) and 6kB of ram with the processor running at a steaming 48MHz.

I'm not at all sure this is useful, but I always enjoy doing language implementations, and this one presented some 'interesting' challenges:

Limited RAM. I don't have space to do a classic stop/copy collector.
Limited stack. A simple lisp implementation uses the C stack for all recursion in execution and memory collection. I don't have enough ram for that.

Iterative Compacting Allocator

I'm betting someone has built one of these before, but I couldn't find one, so I wrote my own.

The basic strategy is to walk the heap to find a subset of the active objects which are allocated sequentially in memory with only unused storage between them. These objects are then compacted in-place, and then the heap is walked again to update all references to the moved objects. Then, the process is restarted to find another subset and move them.

By looking for these subsets starting at the bottom of the heap, and working upwards towards the top, the whole heap can be compacted into a contiguous chunk at the bottom of memory.

Allocation involves moving a pointer along at the top of active memory; when it gets to the top of the heap, collect and see if there's space now.

As always, the hardest part was to make sure all active memory was tied down. The second hardest part was to make sure that all active pointers were updated after any allocation, in case a collect moved the underlying object. That was just bookkeeping, but did consume much of the development time.

One additional trick was to terminate the recursion during heap walking by flagging active cons cell locations in a global bitmap and then walking that separately, iterating until that bitmap is empty. Nested lambdas form another recursion which should probably get the same approach, but I haven't done that yet.

An unexpected "benefit" of the tiny heap is that the collector gets called a lot, so any referencing bugs will have a good chance of being uncovered in even a short program execution.

ROM-able Lisp

Instead of implementing all of the language in C, I wanted to be able to implement various pieces in Lisp itself. Because of the complex nature of the evaluation process, adding things like 'let' or even 'defun' turn out to be dramatically simpler in Lisp. However, I didn't want to consume bunches of precious RAM to hold these basic functions.

What I did was to create two heaps, one in ROM and the other in RAM. References are be tagged as to which heap they're in.

16-bit Values

Lisp programs use a pile of references. Using a full 32 bits for each one would mean having a lot less effective storage. So, instead, I use an offset from the base of the heap. The top bit of the offset is used to distinguish between the ROM heap and the RAM heap.

I needed a place to store type information, so I settled on using the bottom two bits of the references. This allows for four direct type values. One of these values is used to indicate an indirect type, where the type is stored in the first byte of the object. The direct types are:

Value	Type
0	Cons cell
1	14-bit int
2	String
3	Other

With 2 tag bits, the allocator needs to work in 32-bit units as the references couldn't point to individual bytes. Finally, I wanted 0 to be nil, so I add four to the offsets within the heaps.

The result is that the ROM and RAM heaps can each cover up to 32k - 4 bytes.

Note that ints are not stored in the heap; instead they are immediate values stored in 14 bits, providing a range of -8192 to 8191. One can imagine wanting more range in ints at some point.

Heap-based Evaluator

A simple lisp implementation uses the fact that eval is re-entrant and do the operation on the C stack:

val eval(val exprs) {
    val vals;

    while (exprs) {
        vals = append(vals, eval(car(exprs)));
        exprs = exprs->cdr;
    }
    return execute (car(vals), cdr(vals));
}

This makes things really simple and provides for a clean framework for implementing various bits of lisp, including control flow and macros. However, it rapidly consumes all of the available memory for a stack, while also requiring separate bookkeeping for the in-use memory in each frame.

I replaced this design with one which keeps the lisp stack on the heap, and then performs eval with a state machine with all state stored in global variables so that the memory manager can reference them directly.

Each eval operation is performed in a separate 'stack' context, which holds the entire eval state except for the current value, which lives in a separate global variable and is used to pass values out of one stack frame and into another. When the last stack context is finished, the evaluation terminates and the value is returned to the caller.

There are nine states in the state machine, each of which is implemented in a separate function, making the state machine a simple matter of pulling the current state from the top of the stack and invoking the associated function:

while (ao_lisp_stack) {
    if (!(*evals[ao_lisp_stack->state])() || ao_lisp_exception) {
        ao_lisp_stack_clear();
        return AO_LISP_NIL;
    }
}
return ao_lisp_v;

Because there's no C recursion involved, catching exceptions is a simple matter of one test at this level.

Primitives like progn, while, cond and eval all take special magic in the state machine to handle; getting all of that working took several attempts before I found the simple loop shown above.

Lexical Scoping

The last time I did a lisp interpreter, I implemented dynamic scoping. Atoms were all global and had values associated directly with them. Evaluating a lambda started by saving all of the existing global values for the parameter atoms and then binding the new values. When finished, the previous values would be restored. This is almost correct, but provides surprising results for things like:

> (setq baz 1)
> (def foo (lambda (bar) (+ baz bar)))
> (def bletch (lambda (baz) (foo baz)))
> (bletch 2)
4

The value that foo gets for 'baz' is 2 instead of 1 under dynamic scoping, which most people find surprising. This time, I was determined to use lexical scoping, and it turned out to be surprisingly easy.

The first trick was to separate the atoms from their 'value'; each atom can have a different value in different lexical scopes. So, each lexical scope gets a 'frame' object, those contain the value for each atom defined in that scope. There's a global scope which holds all of the globally defined values (like baz, foo and bletch above). Each frame points to its enclosing scope, so you can search upwards to find the right value.

The second trick was to realize that the lexical scope of a lambda is the scope in which the lambda itself is evaluated, and that the evaluation of a lambda expression results in a 'function' object, which contains the lambda and its enclosing scope:

> (def foo (lambda (bar bletch)
       ((lambda (baz)
          (+ baz bar))
        bletch)))
> (foo 2 3)
5

In this case, the inner lambda in foo can 'see' the value of bar from the enclosing lambda. More subtly, even if the inner lambda were executed multiple times, it would see the same baz, and could even change it. This can be used to implement all kinds of craziness, including generators:

> (defun make-inc (add)
  ((lambda (base)
     (lambda ()
       (progn
     (setq base (+ base add))
     base)))
   0)
  )

> (setq plus2 (make-inc 2))
> (plus2)
2
> (plus2)
4

The current implementation of each frame is a simple array of atom/value pairs, with a reference to the parent frame to form the full scope. There are dramatically faster implementations of this same concept, but the goal here was small and simple.

A Tiny Allocator Optimization

With eval consuming heap space for stacks, frames and argument lists, the interpreter was spending a lot of time in the collector. As a simple optimization, I added some free lists for stack frames and cons cells.

Stack frames are never referenced when they're finished, so they can always go on the free list. Cons cells used to construct argument lists for functions are usually free.

Builtin functions have a bit which indicates whether they might hold on to a reference to the argument list. Interpreted lambdas can't get the list while nlambdas, lexprs and macros do.

Each lambda execution creates a new frame, and while it would be possible to discover if that frame 'escapes' the lambda, I decided to not attempt to cache free ones yet.

Save and Restore

To make the lisp interpreter more useful in tiny computers, I added the ability to save and restore the entire heap to flash. This requires leaving enough space in the flash to preserve the heap, further constraining the amount of flash available for the application.

Code

All of this code is in the 'lisp' branch of my AltOS repository:

AltOS

The lisp interpreter is independent from the rest of AltOS and could be re-purposed for another embedded operating system. It runs fine on ChaosKey hardware, and also on the STM32F042 Nucleo-32 board

There's also a test framework which runs on Linux, and is how I developed almost all of the code. That's in the src/test directory in the above repository, and is called 'ao_lisp_test'.

Towers of Hanoi

Here's an implementation of the classic recursive Towers of Hanoi game; it shows most of the current features of the language.

;
; Towers of Hanoi
;
; Copyright © 2016 Keith Packard <keithp@keithp.com>
;
; This program is free software; you can redistribute it and/or modify
; it under the terms of the GNU General Public License as published by
; the Free Software Foundation, either version 2 of the License, or
; (at your option) any later version.
;
; This program is distributed in the hope that it will be useful, but
; WITHOUT ANY WARRANTY; without even the implied warranty of
; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
; General Public License for more details.
;

; ANSI control sequences

(defun move-to (col row)
  (patom "\033[" row ";" col "H" nil)
  )

(defun clear ()
  (patom "\033[2J" nil)
  )

(defun display-string (x y str)
  (move-to x y)
  (patom str)
  )

; Here's the pieces to display

(setq stack '("*" "**" "***" "****" "*****" "******" "*******"))

(setq top (+ (length stack) 3))

;
; Here's all of the stacks of pieces
; This is generated when the program is run
;
(setq stacks nil)

; Display one stack, clearing any
; space above it

(defun display-stack (x y clear stack)
  (cond ((= 0 clear)
     (cond (stack (progn
            (display-string x y (car stack))
            (display-stack x (1+ y) 0 (cdr stack))
            )
              )
           )
     )
    (t (progn
         (display-string x y "          ")
         (display-stack x (1+ y) (1- clear) stack)
         )
       )
    )
  )

; This should probably be included in the rom image...

(defun length (list)
  (cond (list (1+ (length (cdr list))))
    (0)
    )
  )

; Position of the top of the stack on the screen
; Shorter stacks start further down the screen

(defun stack-pos (y stack)
  (- y (length stack))
  )

; Display all of the stacks, spaced 20 columns apart

(defun display-stacks (x y stacks)
  (cond (stacks (progn
          (display-stack x 0 (stack-pos y (car stacks)) (car stacks))
          (display-stacks (+ x 20) y (cdr stacks)))
        )
    )
  )

; Display all of the stacks, then move the cursor
; out of the way and flush the output

(defun display ()
  (display-stacks 0 top stacks)
  (move-to 1 21)
  (flush)
  )

; Reset stacks to the starting state, with
; all of the pieces in the first stack and the
; other two empty

(defun reset-stacks ()
  (setq stacks (list stack nil nil))
  (length stack)
  )

; more functions which could usefully
; be in the rom image

(defun min (a b)
  (cond ((< a b) a)
    (b)
    )
  )

(defun nth (list n)
  (cond ((= n 0) (car list))
    ((nth (cdr list) (1- n)))
    )
  )

; Replace a stack in the list of stacks
; with a new value

(defun replace (list pos member)
  (cond ((= pos 0) (cons member (cdr list)))
    ((cons (car list) (replace (cdr list) (1- pos) member)))
    )
  )

; Move a piece from the top of one stack
; to the top of another

(defun move-piece (from to)
  (let ((from-stack (nth stacks from))
    (to-stack (nth stacks to))
    (piece (car from-stack)))
    (setq from-stack (cdr from-stack))
    (setq to-stack (cons piece to-stack))
    (setq stacks (replace stacks from from-stack))
    (setq stacks (replace stacks to to-stack))
    (display)
    (delay 100)
    )
  )

; The implementation of the game

(defun _hanoi (n from to use)
  (cond ((= 1 n)
     (progn
      (move-piece from to)
      nil)
     )
    (t
     (progn
      (_hanoi (1- n) from use to)
      (_hanoi 1 from to use)
      (_hanoi (1- n) use to from)
      )
     )
    )
  )

; A pretty interface which
; resets the state of the game,
; clears the screen and runs
; the program

(defun hanoi ()
  (setq len (reset-stacks))
  (clear)
  (_hanoi len 0 1 2)
  )

Posted Mon Nov 14 23:11:23 2016

All Entries