Halloween is over!

Not just the festival, but my project to write a short story, or part thereof, every day of October and publish them on the website. See my blog post from a couple of weeks ago for more information.

All of the stories are now written and published at http://netsplit.com/halloween/ - up to and including today’s. It’s a good feeling to finally be finished. I’ll leave them up for at least a little while longer before I decide what to do next with them; a round or two of editing and tidying up is certainly in order, and my current plan is to publish them collected together in printed or eBook form if people want it.

But until then I have a new project…

NaNoWriMo Participant 2011

…I’m going to be taking part in NaNoWriMo. For those who have not heard of it (which included myself until a month ago), it’s a competition of sorts to write a 50,000-word (minimum) first draft of a novel in the month of November.

I’ve had the idea for the story for a couple of years now, and what with whetting my appetite for writing again with the October project, I’m eager to get started on it. Now I just need to wait until midnight…

Halloween

Readers of this blog will mostly know me for the software I’ve written, most likely Upstart or my work on Ubuntu, but there’s another kind of writing I enjoy, and it’s something I haven’t taken much time to do in the last few years.

Like everyone else, I had to do some creative writing at school; I really enjoyed it and took it quite a bit further, writing many short stories over the years.

A friend of mine is taking part in NaNoWriMo this year, and the two of us discussed ways of practicing and, most of all, warming up for it. After all, writing over 1,650 words of a novel a day is no mean feat to take on stone cold. She’s been using 750words.com for a while, and I suggested she use that to write short stories along the lines of her planned novel (but not to be used for it) as practice.

Discussing it really fired up my desire to do some writing again myself, so I decided to join her. But obviously rather than copying, I decided to do something completely different in tone.

Each day in October, leading up to Halloween, I’m writing a short story myself. Since that’s a spooky event, I’m vaguely sticking to a horror theme for the stories. I say vaguely because it’s quite easy to slip from horror to other genres, such as science-fiction or thrillers, but the intent is certainly there that these all have a darker theme than usual.

I’ve also been using 750words.com for the most part, with an aim that each story be a minimum of that in length. One of the most interesting outcomes is that the earliest stories were hard work to reach 750 words from a simple idea, whereas the latest ones easily reach 1,000. In fact my latest story is almost twice the minimum length.

But at the end of the day, they are roughly 3-5 pages each, and since they’re posted and published each day, they’re more akin to first drafts of ideas than polished works. I remember reading once about an author who would pin up the pages of a story he was writing in a shop window as each one came off the typewriter (Google tells me this was almost certainly Harlan Ellison); I like to think I’m doing the Internet-era equivalent.

Perhaps they will give some people joy and delight, or perhaps they will give some people nightmares. Even if not, I’m enjoying writing them!

You can read those so far at http://netsplit.com/halloween/ in all the usual formats, and check back every day or so for new ones if you like what you read.

A new release process for Ubuntu?

With the nomination period beginning for the Ubuntu Technical Board, big changes like Unity having arrived in Ubuntu recently, and the upcoming UDS for what will likely be a new LTS release of Ubuntu, it’s as good a time as any to ask big questions about the development process, challenge assumptions, and make suggestions for big changes.

Cadence

The Ubuntu release process is well known, and its developers talk regularly about its cadence. A new release of Ubuntu comes out every six months, and each release follows a predictable pattern. I’ve stolen the following image from OMG! Ubuntu’s recent series about Ubuntu Development.

Each developer working on Ubuntu follows this cycle. When Ubuntu 11.10 is released on October 13th, they’ll begin again. After they recover, of course.

First there’ll be a bit of a wait for the archive to open; this gets quicker and quicker each release, but since it depends on a toolchain being built and other similarly fundamental things, it tends to be a period where most people figure out what they’re going to discuss at UDS.

UDS is a bit late in the 12.04 cycle, so the merge period will probably occupy developer time both before and after it. This isn’t represented on Daniel’s chart above, but this is the time when massive amounts of updates arrive from Debian; it’s a time of great instability for Ubuntu. At some point there will be an Alpha 1, but you won’t want to try to install that.

Planning for UDS is going to take up some time, as will writing up the results afterwards and turning them into work items. There’s also the UDS Hangover, which nobody (except Robbie Williamson, when drafting the 10.10 Release Cycle) seems to like to talk about: nothing gets done in the week or two following UDS, because everybody is too wiped out.

So realistically speaking, development of features for 12.04 is going to start around mid-November at the earliest. And by features I mean the big headline things in Ubuntu: Unity, the Software Center, the Installer. These things are important to get right.

Pretending for a moment that features keep being developed over the winter holidays like Thanksgiving, Christmas and New Year, you’ve got clear development time until Feature Freeze. The 12.04 Release Schedule isn’t published yet, but I figure that’s going to be somewhere around February 16th, after which everyone switches to bug fixing and release testing.

That’s just 13 weeks of development time!

Chaos

So you’re an Ubuntu developer working on features for the upcoming release, and you don’t have anywhere near as much time as you’d expect to actually do the development work. What happens if you’re replacing something that works with something completely new? Can’t you just target a later release, and work continually until the feature freeze of that release?

It turns out that you can’t. There is an incredible emphasis in the Ubuntu planning process on targeting features at particular releases. This is the exact thing you’re not supposed to do with a time-based release schedule.

Unfortunately Canonical’s own performance review and management are also based around this schedule. The Ubuntu developers it employs (the vast majority) have such fundamentals as their pay, bonuses, etc. dictated by how many of their assigned features and work items make it into the release by feature freeze. It’s not the only requirement, but it’s the biggest one.

Say your new feature is going to take twelve months of development time before it’s truly a replacement for the existing feature in Ubuntu. What you don’t do is spend twelve months developing it and land it when it’s a perfect replacement.

What you do do is develop it in 12-13 week bursts, which means it’s going to take you roughly four release cycles rather than two before it’s ready. And you land the quarter-complete feature in the first release, replacing the older, stable feature.

Consequence

If this were true, you would expect to see new features repeatedly arriving in Ubuntu before they were ready: removing the old, deprecated feature and breaking things temporarily, with the promise that everything will be better in the next release, certainly the one after that, definitely by the LTS.

Maybe you don’t believe that characterizes Ubuntu, in which case you should probably just stop reading now because we’re not going to agree on my fundamental complaint.

But I will say this: I know I’m responsible for doing this on more than one occasion, because I had to; I saw the exact same pattern in others’ work; when I was a manager, my reports complained that they had to follow this pattern; and I still see the same pattern today with features such as Unity and the Software Center.

Follow this pattern and developers are going to complain that they need a release where they don’t have any features to work on, and can just spend the time stabilizing and bug fixing.

Worse, follow this pattern and you’re going to create a user expectation that releases are largely unstable and contain sweeping changes, which will surprise administrators of Enterprise desktop deployments and discourage them from using your distribution at all.

A kludge for this would be to overlay a second release schedule onto your first, with more of an emphasis on stability and support. It’s a target for your developers to complete their features, or at least stabilize them, in those 12 weeks; and it’s a target for your users to consider for deployment. So three out of four of your releases are really just unstable previews of that final fourth release.

Complacency

This second LTS release cycle solves the unstable release issue, so why is this a problem?

Because developer time is wasted; because user time is wasted; because user confidence is lost.

Because features can take longer than two years to develop; or, even if a feature takes just two years, if it’s not begun immediately after the previous LTS release it’s not going to be ready for the next one, so you might postpone it and lose the lead.

Because you might expect a knock-on degeneracy effect in the LTS releases as well: with 12.04 LTS being less stable than 10.04 LTS, which was less stable than 8.04 LTS, which was less stable than 6.06 LTS. And it’s far too late now to have declared the 10.10/11.04/11.10/12.04 cycle a Super-Long-Term-Support release and kept back the complete replacement of the desktop environment.

Because the original reason for the six-month cycle has already been forgotten: features are targeted towards releases, rather than released when ready; because the original base for the release schedule (GNOME) is no longer a key component of the distribution; because no other key component has adopted this schedule.

Because there might be a better way.

Cataclysm

What I’m going to suggest here is a completely new development process for Ubuntu, complete with details about how it would be implemented.

I’m going to suggest a monthly release process, beginning with the 11.10 release. It so happens that this fits perfectly with Ubuntu’s version numbering system: the next release would be 11.11, followed by 11.12, then 12.01, and so on.

This monthly release would be known simply as release in your sources.list; updates would be published to it in the first week of the month. There would be no codenames, and due to the rapid releases, changes would be largely unsurprising and iterative on the previous release.

In order to provide user testing, a second release known as beta would exist. It’s from this branch that release would be copied in that first week of the month. beta would be updated every two weeks: in the first week of the month, just after it became the new release, and again at the half-way point of the month. Users who like a little bleeding on their edge can change their sources.list to use this more exciting release, or download appropriate disk images.

Developers wouldn’t run either of these; they would run the third release branch, alpha. It’s from here that beta is updated, and from here that daily disk images would be generated.

Publishing from alpha to beta, and then from beta to release, is handled semi-automatically. The release manager will track Release Critical bugs and hold up packages from copying from one to the other if they have outstanding problems. If this sounds familiar, that’s because it’s exactly how the Debian testing distribution works, and I recommend using the same software (which Ubuntu already uses to check for archive issues).

So where do developers upload? It’s tempting to just say alpha, but if we do, alpha will end up looking very different from release because it will be filled with unstable software that’s not ready for users yet. This will make it harder to fix problems in the release branch, because the components there no longer exist in alpha, having been replaced by something that’s not ready yet.

Developers will upload to an unpublished trunk branch. Packages will be copied to alpha provided:

  • there is a signed-off code review for the upload
  • the upload meets policy (lintian clean)
  • the upload builds on all released architectures
  • unit tests pass on all released architectures
  • functional and verification tests pass on all architectures for the archive as a whole

I just introduced a bunch of new checks to the developer process there: code review, mandatory unit tests, and then functional and verification tests piled on top.

The first four are relatively self-explanatory: fail any of these and your upload has marked the tree red. In that case not only will your package fail to copy to alpha, but you’re about to have a conversation with the Release Manager.

For functional and verification tests, this means doing more automated QA. A failing test could be an automated installer run, or an automated boot-and-test run, etc. These run some time after the fact, and a failure makes the entire tree red; the Release Manager or their team will have to examine the logs to figure out the culprit.

So things aren’t copying to alpha; now one of two things is going to happen.

  • the Release Manager reverts your upload. Because trunk is unpublished, this is simply overwriting with the older package from alpha; nobody except the original developer need ever have known about it.
  • after talking with the developer, it’s decided that further uploads of other packages are required (e.g. due to dep-wait, or the bug being elsewhere), in which case the tree remains red while the developer (or another, in rare cases) prepares that fix upload.

While the tree is red, nobody else is allowed to upload unless it’s a fix for the problem. All effort should go to fixing the tree.

If the archive always has to remain stable, how do you develop large features such as Upstart, Unity, Ubiquity, or the Software Center? You use a PPA, and do the development on your own timeline.

If your feature takes twelve months to develop, you take twelve months to develop it in that PPA. You’ll be posting regularly to mailing lists or blogging about your feature to encourage users to add your PPA to their sources.list and gain testing. Obviously you’ll be doing various uploads to the main series over time to get your dependencies in early, where they don’t conflict with what’s already there.

Conclusion

My proposal is a radical change to the Ubuntu Release Process, but surprisingly it would take very little technical effort to implement, because all the pieces are already there, including the work on performing automated functional and verification tests.

I believe it solves the problem of landing unstable features before they’re ready, because it almost entirely removes releases as a thing. As a developer you simply work in a PPA until you’ll pass review, then land a stable feature that can replace what was there before.

It solves the need for occasional stabilization and bug-fixing releases, because the main series is always stable and can receive bug fixes easily, separate from any development work going on. A developer can choose to spend some of their time looking after the main series in addition to their feature development work, or devote all of their time to it.

Another problem I’ve not talked about is that of building software on an unstable foundation; this change solves that too. Since developers will run alpha, and vendor developers can run a relatively up-to-date yet stable release branch, software can be built on a solid foundation. Only the new feature or software itself is unstable until ready.

Canonical can keep its review schedule and use developer uploads and work items; except that rather than landing in a release, they now land in a PPA.

Merges from Debian unstable can be handled pretty much continually, as long as they keep the tree green. Alternatively, one can decide that users ultimately don’t care about an updated version of cat, and that until a case can be made (e.g. an open bug) for a package’s update, it need not be merged.

Users can now be confident of always receiving a stable operating system, because of the multiple testing and QA passes each change continually receives. Updates come in monthly, two-weekly or daily-ish batches depending on where in the main series they choose to run.

Enterprise administrators can run this stable release, because it only changes gradually with well-tested updates. The big changes and features have a long gestation period in PPAs, with many advance notices and blog posts about them. They’re not a surprise and can be planned for well in advance of their landing.

Downsides will, doubtless, be found in the comments below.

For your consideration.

Tracing on Linux

The Linux tracing APIs are a relatively new addition to the kernel and one of the most powerful new features it’s gained in a long time. Unfortunately, the plethora of terms and names for the system can be confusing, so in this follow-up to my previous post on the proc connector and socket filter, I’ll take a look at achieving the same result using tracing, and hopefully unravel a little of the mystery along the way.

Rather than write a program along the way, I’ll be referring to sample code found in the kernel tree itself, so you’ll want a checkout. If you’re doing any work that touches the kernel beyond the standard POSIX APIs, I highly recommend this anyway; the source is quite readable and, once you find your way around, is the quickest way to answer questions.

Grab your checkout with git:

# git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
# cd linux-2.6

Tracepoints

One of the reasons there are so many terms and names is that, like most kernel systems, there are many layers, and each is exposed because different developers have different requirements. An important lower layer is that of tracepoints, also known as static tracepoints. For these we’ll be looking at the code in the samples/tracepoints directory of the kernel source; kernelese documentation of the API can be found in Documentation/trace/tracepoints.txt.

A tracepoint is a placeholder function call in kernel code that the developer of that subsystem has deemed a useful point for debugging code to be able to hook into. Static refers to the fact that they are fixed in place by the original developer. You can think of them as the kind of code you’d tend to guard with #if DEBUG in traditional C development, and like those statements they’re nearly free when not in use, except that you can turn these on and off at runtime.

The samples/tracepoints/tracepoint-sample.c file is a kernel module that creates a /proc/tracepoint-sample file, and has a couple of tracepoints coded into it by the developer. First it includes samples/tracepoints/tp-samples-trace.h, which actually declares the tracepoints.

DECLARE_TRACE(subsys_event,
        TP_PROTO(struct inode *inode, struct file *file),
        TP_ARGS(inode, file));
DECLARE_TRACE_NOARGS(subsys_eventb);

You can think of these as declaring the function prototypes: one trace function has two arguments, an inode and a file; the other has no arguments. And if they’re function prototypes, we need to define the functions; this is done back in the main tracepoint-sample.c file.

DEFINE_TRACE(subsys_event);
DEFINE_TRACE(subsys_eventb);

These tracepoints can now be called from the kernel code, passing the arguments that may need to be traced; remember that these have no side-effects unless enabled. The code that calls out to the tracepoints is in the my_open() function.

trace_subsys_event(inode, file);
for (i = 0; i < 10; i++)
        trace_subsys_eventb();

Simple, huh? Don’t worry about the rest; this primer is simply so you can recognise tracepoints in the kernel source when you see them. I don’t expect you to go leaping around the kernel adding tracepoints and rebuilding it, unless you want to, of course.

So how do you hook into tracepoints? The answer is: from other kernel code, usually in the form of a loadable module such as the one defined by samples/tracepoints/tracepoint-probe-sample.c; this includes the same header file as before to get the prototypes.

#include "tp-samples-trace.h"

In the module’s __init function it registers two functions of its own as hooks into the tracepoints; this activates the tracepoints and turns the code in the previous module from a near no-op into a function call to these functions.

ret = register_trace_subsys_event(probe_subsys_event, NULL);
WARN_ON(ret);
ret = register_trace_subsys_eventb(probe_subsys_eventb, NULL);
WARN_ON(ret);

And obviously in the module’s __exit function we have to unregister these, otherwise we leave dangling references.

unregister_trace_subsys_eventb(probe_subsys_eventb, NULL);
unregister_trace_subsys_event(probe_subsys_event, NULL);
tracepoint_synchronize_unregister();

As for those functions, they take a first argument which is a pointer to the same data as the second argument to the register call, and then the arguments defined in DECLARE_TRACE. You can do pretty much whatever you want here; the example simply extracts the filename and outputs it with a printk().

static void probe_subsys_event(void *ignore,
                               struct inode *inode, struct file *file)
{
        path_get(&file->f_path);
        dget(file->f_path.dentry);
        printk(KERN_INFO "Event is encountered with filename %s\n",
                file->f_path.dentry->d_name.name);
        dput(file->f_path.dentry);
        path_put(&file->f_path);
}

So that’s tracepoints: a low-level method for a kernel developer to pick places in their code that may be useful for debugging, and a method for loadable kernel code such as modules to hook into those places.

Trace Events (Kernel API)

So you know about tracepoints, and you’ve almost certainly heard about Trace Events, but what’s the difference? Firstly, trace events are actually built on tracepoints; you can think of them as a higher-level API, which is why I covered tracepoints first. Secondly, trace events are usable from userspace! We don’t need to write kernel modules to hook into them, though obviously we can only read data this way.

In fact, since they’re tracepoints with extra benefits, you wouldn’t think anyone would use the basic tracepoints at all, and you’d be right! A git grep DECLARE_TRACE in a current kernel tree will show you that the only user of the raw tracepoint macros is actually the trace events system.

Since everyone just defines trace events, a primer on the kernel side will be useful, so we’ll be looking at the code in samples/trace_events; if you want to read the userspace API documentation, it’s in Documentation/trace/events.txt.

Just one source file and one header file this time. First we’ll look at the header, samples/trace_events/trace-events-sample.h; this seems pretty complicated at first, but almost all of it is boilerplate that gets copied into every trace events header. The important bit is the TRACE_EVENT macro:

TRACE_EVENT(foo_bar,
        TP_PROTO(char *foo, int bar),
        TP_ARGS(foo, bar),
        TP_STRUCT__entry(
                __array(        char,   foo,    10              )
                __field(        int,    bar                     )
        ),
        TP_fast_assign(
                strncpy(__entry->foo, foo, 10);
                __entry->bar    = bar;
        ),
        TP_printk("foo %s %d", __entry->foo, __entry->bar)
);

The first part of this looks just like DECLARE_TRACE, and that’s no accident: we’re still declaring a tracepoint, so this will give us a function with the prototype declared in TP_PROTO and argument names in TP_ARGS.

The TP_STRUCT__entry and TP_fast_assign bits are new though. As well as declaring a tracepoint, trace events come with the equivalent “loadable module” code that copies data from the arguments of the function into a struct that can be examined from userspace. TP_STRUCT__entry defines that structure, and TP_fast_assign is C code that should quickly copy data into that structure.

So we’ve declared a tracepoint, we’ve defined a structure containing an array of 10 char and an int, and we’ve written C code to copy from the tracepoint arguments into that structure. The last bit of the trace event is TP_printk, which does exactly what you’d expect. Since the most common (at least, first) use of a trace event is going to be to output something, this macro defines a format string for that printk() call.

Back in the samples/trace_events/trace-events-sample.c file, we include this header, but first set a special define. This is set only once in the entire kernel source, and results in all of the functions being defined; i.e. TRACE_EVENT behaves as DEFINE_TRACE rather than DECLARE_TRACE.

#define CREATE_TRACE_POINTS
#include "trace-events-sample.h"

All other users of this header simply include the header.

From here on in the source, the trace event is just a tracepoint and is called in the same way: as a function call.

trace_foo_bar("hello", cnt);

That’s the kernel-side primer; you should now be able to git grep through the source and find trace events. Now it’s time to get to the fun bit and look at the userspace API for dealing with them. Remember, if you want anything more complicated, they’re just tracepoints, so you can write kernel modules and hook into them as before.

Trace Events (Userspace API)

We’re in userspace now, so you can leave the kernel source directory, but you do need to be root and you may need to mount a filesystem. This is because some distributions (like Ubuntu) have an allergy to debugging (seriously, they even disable things like gdb -p).

Try and change into the /sys/kernel/debug/tracing directory.

# cd /sys/kernel/debug/tracing

If this fails, you’ll need to mount the debugfs filesystem and try again.

# mount -t debugfs none /sys/kernel/debug
# cd /sys/kernel/debug/tracing

With that done, we should make sure tracing is enabled.

# cat tracing_enabled
1

If that’s 0, enable it:

# echo 1 > tracing_enabled

So we’ve enabled tracing, but what can we trace? Trace events are exposed in the events sub-directory at two levels: the first is the subsystem, and the second the trace events themselves. Since in my last blog post we were looking at tracing forks, it would be great if there were trace events for doing just that. This is where it helps to be able to git grep around the kernel source and recognise trace events, so you at least know the right subsystem name; it turns out that the sched subsystem has exactly the events we want.

deathspank tracing# ls events/sched
enable                   sched_process_exit/  sched_stat_sleep/
filter                   sched_process_fork/  sched_stat_wait/
sched_kthread_stop/      sched_process_free/  sched_switch/
sched_kthread_stop_ret/  sched_process_wait/  sched_wait_task/
sched_migrate_task/      sched_stat_iowait/   sched_wakeup/
sched_pi_setprio/        sched_stat_runtime/  sched_wakeup_new/

sched_process_fork sounds exactly right. If you look at it, it’s a directory containing four files: enable, filter, format and id. I bet you can guess how to enable fork tracing, but if not:

# cat events/sched/sched_process_fork/enable
0
# echo 1 > events/sched/sched_process_fork/enable

Pretty painless. Go ahead and run a few things, then turn the tracing off again when you’re done.

# echo 0 > events/sched/sched_process_fork/enable

Now let’s look at the result of our trace; recall that every trace event comes with a free printk() of formatted output? We can find that output in the top-level trace file.

# cat trace
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
             zsh-2667  [001]  6658.716936: sched_process_fork: comm=zsh pid=2667 child_comm=zsh child_pid=2748

So for each process fork, we get the parent and child process ids along with the process name. Pretty much exactly what we want!

There’s plenty to play around with in this API. As you’ve probably noticed, you can enable entire subsystems or all events using the enable files at the subsystem and events levels; there’s also a set_event file at the tracing level which can be used to make batch changes to tracing. See the kernel documentation for more details.

You’re probably wondering, though, what happened to the rest of the struct, especially if there are fields that aren’t included in the default printk(). You can examine the structure by reading the format file of a trace event, and you can use this with the filter file to exclude events you’re not interested in. Anything I write here would just duplicate the kernel documentation, so go read Documentation/trace/events.txt.

Perf

After a little bit of playing you’ll realise that tracing is not limited to your current process or shell: you’ll get events for processes you’re not interested in, and also events for subsystems you’re not interested in if other processes are doing traces of their own. There’s also only one global filter for the entire trace events system, so other users or processes doing tracing could override yours.

There’s an even higher level that we can use to work around these problems: the perf tool. Originally designed as the userspace component of the performance counters system, it has grown a wide variety of extra features, one of which is the ability to work with kernel tracepoints as an input source.

Since trace events are tracepoints, these count!

So let’s say we want to record the forks made by a process we run, without fear of contamination from other processes on the system or other users performing tracing. Using perf we can simply run

# perf record -e sched:sched_process_fork bash

And run as many commands as we like in that shell. When the shell exits, perf will write the results of the tracing to a perf.data file for analysis.

# exit
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (~735 samples) ]

We can analyse this later using various perf sub-commands, the simplest of which is an argument-less perf script, which outputs the equivalent of reading the trace file.

# perf script
            bash-3141  [003] 10201.049939: sched_process_fork: comm=bash pid=3141 child_comm=bash child_pid=3142
           :3142-3142  [001] 10201.050391: sched_process_fork: comm=bash pid=3142 child_comm=bash child_pid=3143

Conclusion

As an administrator debugging their system, or a developer trying to understand the performance or event timeline of their work, perf is perfect. It’s a very well documented tool with all the bells and whistles you need for tracing a wide variety of events.

Unfortunately the API between perf and the kernel is a private one; the perf tool source is shipped as part of the kernel source, and the two are version-mated with each other.

Recall that the topic of the previous blog post was to write a program to follow forks, rather than doing it as a system administrator.

If we want to write software to do it, the lower-level (but still high-level) trace events API seems a better bet. There is a wide range of applications for this API; for example, the ureadahead program on an Ubuntu system uses it to trace the open() and exec() syscalls the system performs during boot, so it knows which files to cache for faster boot times. But it’s easy for another process, or a user, to interfere with the results of this tracing, so it’s not ideal for our purpose either.

Finally, the tracepoints API is too low-level; writing a kernel module and building and maintaining it for each kernel version is just not on the cards.

So it would appear we’re at a dead-end for using tracing to do what we want. That’s not the end of the story though; there are other tracing tools such as kprobes and ftrace that I haven’t covered yet. Unfortunately this blog post has gotten a little too long, and the coverage of tracepoints, trace events and perf was worthwhile in and of itself, so we’ll have to pick those up next time!

The Proc Connector and Socket Filters

The proc connector is one of those interesting kernel features that most people rarely come across, and even more rarely find documentation on. Likewise the socket filter. This is a shame, because they’re both really quite useful interfaces that might serve a variety of purposes if they were better documented.

The proc connector allows you to receive notification of process events such as fork and exec calls, as well as changes to a process’s uid, gid or sid (session id). These are provided through a socket-based interface by reading instances of struct proc_event defined in the kernel header.

#include <linux/cn_proc.h>

The interface is built on the more generic connector API, which itself is built on the generic netlink API. These interfaces add some complexity as they are intended to provide bi-directional communication between the kernel and userspace; the connector API appears to have been largely forgotten as newer such socket interfaces simply declare their own first-class socket classes. So we need the headers for those too.

#include <linux/netlink.h>
#include <linux/connector.h>

(For brevity, I’ll omit any standard boilerplate such as the headers you need for syscalls and library functions that you should be used to as well as function definitions, error checking, and so-forth.)

Ok, now we’re ready to create the connector socket. This is straightforward enough: since we’re dealing with atomic messages rather than a stream, a datagram socket is appropriate.

int sock;
sock = socket (PF_NETLINK, SOCK_DGRAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
               NETLINK_CONNECTOR);

To select the proc connector we bind the socket using a struct sockaddr_nl object.

struct sockaddr_nl addr;

memset (&addr, 0, sizeof addr);
addr.nl_family = AF_NETLINK;
addr.nl_pid = getpid ();
addr.nl_groups = CN_IDX_PROC;

bind (sock, (struct sockaddr *)&addr, sizeof addr);

Unfortunately that’s not quite enough yet; the proc connector socket is a bit of a firehose, so it doesn’t in fact send any messages until a process has subscribed to it. So we have to send a subscription message.

As I mentioned before, the proc connector is built on top of the generic connector, and that itself is on top of netlink, so sending that subscription message involves embedding a message inside a message, inside a message.  If you understood Christopher Nolan’s Inception, you should do just fine.

Since we’re nesting a proc connector operation message inside a connector message inside a netlink message, it’s easiest to use an iovec for this kind of thing.

struct iovec iov[3];
char nlmsghdrbuf[NLMSG_LENGTH (0)];
struct nlmsghdr *nlmsghdr = (struct nlmsghdr *)nlmsghdrbuf;
struct cn_msg cn_msg;
enum proc_cn_mcast_op op;

nlmsghdr->nlmsg_len = NLMSG_LENGTH (sizeof cn_msg + sizeof op);
nlmsghdr->nlmsg_type = NLMSG_DONE;
nlmsghdr->nlmsg_flags = 0;
nlmsghdr->nlmsg_seq = 0;
nlmsghdr->nlmsg_pid = getpid ();

iov[0].iov_base = nlmsghdrbuf;
iov[0].iov_len = NLMSG_LENGTH (0);

cn_msg.id.idx = CN_IDX_PROC;
cn_msg.id.val = CN_VAL_PROC;
cn_msg.seq = 0;
cn_msg.ack = 0;
cn_msg.len = sizeof op;

iov[1].iov_base = &cn_msg;
iov[1].iov_len = sizeof cn_msg;

op = PROC_CN_MCAST_LISTEN;

iov[2].iov_base = &op;
iov[2].iov_len = sizeof op;

writev (sock, iov, 3);

The netlink message length is the combined length of the connector and proc connector operation messages that follow; it is otherwise simply a message from our process id with no flags set.  However, all of the netlink interfaces take great care to keep the structure following the message header properly aligned, using the NLMSG_LENGTH macro, to avoid issues on platforms that have fixed alignment requirements for data types; so we have to be careful of that too.

So there may be a bit of padding between the struct nlmsghdr and the struct cn_msg; we handle this by using a character buffer of the right size for the first iovec element, and accessing it through a struct nlmsghdr pointer.

The connector message indicates that it is relevant to the proc connector through the idx and val fields, and the length is the length of the proc connector operation message.

Finally, the proc connector operation message (just an enum) says that we want to subscribe. Why isn’t there padding between the connector and proc connector operation messages? Because the last element of struct cn_msg is a zero-length array, which yields the right alignment automatically; this interface is rather newer than netlink.

The iovec stitches it all together so that everything is sent as a single message: netlink header, padding, connector message and operation value.

There’s a matching PROC_CN_MCAST_IGNORE message if you want to turn off the firehose without closing the socket.

Ok, the firehose is on; now we need to read the stream of messages.  Just like the message we sent, the messages we receive are netlink messages; inside those netlink messages are connector messages, and inside those are proc connector messages.

Netlink allows for all sorts of things like multi-part messages, but in reality we can ignore most of that since the connector doesn’t use them; still, it’s worth future-proofing ourselves and being liberal in what we accept.

struct msghdr msghdr;
struct sockaddr_nl addr;
struct iovec iov[1];
char buf[PAGE_SIZE];
ssize_t len;

msghdr.msg_name = &addr;
msghdr.msg_namelen = sizeof addr;
msghdr.msg_iov = iov;
msghdr.msg_iovlen = 1;
msghdr.msg_control = NULL;
msghdr.msg_controllen = 0;
msghdr.msg_flags = 0;

iov[0].iov_base = buf;
iov[0].iov_len = sizeof buf;

len = recvmsg (sock, &msghdr, 0);

Why do we use recvmsg rather than just read? Because netlink allows arbitrary processes to send messages to each other, we need to make sure the message actually comes from the kernel; otherwise you have a potential security vulnerability. recvmsg lets us receive the sender address as well as the data.

if (addr.nl_pid != 0)
        continue;

(I’m assuming you’re reading in a loop there.)

So now we have a netlink message packet from the kernel; this may contain multiple individual netlink messages (it doesn’t, but it may), so we iterate over them.

for (struct nlmsghdr *nlmsghdr = (struct nlmsghdr *)buf;
     NLMSG_OK (nlmsghdr, len);
     nlmsghdr = NLMSG_NEXT (nlmsghdr, len))

And we should ignore error or no-op messages from netlink.

if ((nlmsghdr->nlmsg_type == NLMSG_ERROR)
    || (nlmsghdr->nlmsg_type == NLMSG_NOOP))
        continue;

Inside each individual netlink message is a connector message, we extract that and make sure it comes from the proc connector system.

struct cn_msg *cn_msg = NLMSG_DATA (nlmsghdr);

if ((cn_msg->id.idx != CN_IDX_PROC)
    || (cn_msg->id.val != CN_VAL_PROC))
        continue;

Now we can safely extract the proc connector message; this is a struct proc_event, which we haven’t seen before. It’s quite a large structure definition, so I won’t paste it here, since it contains a union member for each of the different possible message types. Instead, here’s code to print the relevant contents of an example message.

struct proc_event *ev = (struct proc_event *)cn_msg->data;

switch (ev->what) {
case PROC_EVENT_FORK:
        printf ("FORK %d/%d -> %d/%d\n",
                ev->event_data.fork.parent_pid,
                ev->event_data.fork.parent_tgid,
                ev->event_data.fork.child_pid,
                ev->event_data.fork.child_tgid);
        break;
/* more message types here */
}

As you can see, each message type has an associated member of the event_data union containing its information fields. Note that this gives you information about each individual kernel task, not just the top-level processes you’re normally used to seeing; in other words, you see threads as well as processes.

Like I keep saying, it’s a firehose. It would be great if there was some way to filter the socket in the kernel so that our process doesn’t even get woken up for messages. Wake-ups are bad, especially in the embedded space.

Fortunately there is a way to filter sockets on the kernel-side, the kernel socket filter interface. Unfortunately this isn’t too well documented either; but let’s use this opportunity to document an example.

We’ll filter the socket so that we only receive fork notifications, discarding the other types of proc connector event and, most importantly, discarding the messages that indicate new threads being created (those where the pid and tgid fields differ). One important rule of filtering is to take care that only expected messages are filtered out, and that unexpected messages are still passed through.

The filter consists of a set of machine-language instructions attached to the socket through a special socket option. Fortunately this machine language is copied from the Berkeley Packet Filter from BSD, so we can find documentation for it in the bpf(4) manual page there. Just ignore the structure definitions, because they are different on Linux.

So let’s get started with our example; first we need to add the right header.

#include <linux/filter.h>

And now we need to insert the filter into the socket setup; just before the subscription message is sent is usually a good place. On Linux the instructions are given as an array of struct sock_filter members, which we can construct using the BPF_STMT and BPF_JUMP macros.

Just to make sure everything is working, we’ll create a simple “no-op” filter.

struct sock_filter filter[] = {
        BPF_STMT (BPF_RET|BPF_K, 0xffffffff),
};

struct sock_fprog fprog;
fprog.filter = filter;
fprog.len = sizeof filter / sizeof filter[0];

setsockopt (sock, SOL_SOCKET, SO_ATTACH_FILTER, &fprog, sizeof fprog);

Not very useful, but it means we can now concentrate on writing the filter code itself. This filter consists of a single statement, BPF_RET, that tells the kernel how many bytes of the packet to deliver to the receiving process before returning from the filter. The BPF_K option means that we give the number of bytes as the argument to the statement, and in this case we give the largest possible value. In other words, this statement delivers the whole packet and returns from the filter.

To filter everything and not wake up the process at all, we deliver zero bytes and return from the filter.

BPF_STMT (BPF_RET|BPF_K, 0);

You may want to test that too.

Ok, now let’s actually do some examination of the packets to filter out the noise. Recall that we’re dealing with nested messages here, messages inside messages, inside messages. Visualizing this is really important to understanding what you’re dealing with.

The most basic filter code consists of three operations: load a value from the packet into the machine’s accumulator, compare that against a value and jump to a different instruction if equal (or not equal), and then possibly return or perform another operation.

All of the following filter code replaces whatever you had in the filter[] array before.

So first we examine the nlmsghdr at the start of the packet; we want to make sure that there is just one netlink message in this packet. If there are multiple, we just pass the whole packet to userspace to deal with. We check the nlmsg_type field to make sure it contains the value NLMSG_DONE.

BPF_STMT (BPF_LD|BPF_H|BPF_ABS,
          offsetof (struct nlmsghdr, nlmsg_type));
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
          htons (NLMSG_DONE),
          1, 0);
BPF_STMT (BPF_RET|BPF_K, 0xffffffff);

The first statement says to load (BPF_LD) a “halfword” (16-bit) value (BPF_H) from the absolute offset (BPF_ABS) equivalent to the position of the nlmsg_type member in struct nlmsghdr. Since we expect that structure to be the start of the message, this means the accumulator should now have that value.

The next statement is a jump (BPF_JMP), it says to compare the accumulator for equality (BPF_JEQ) against the constant argument (BPF_K). We only want to continue if this is the sole message, so the value we compare against is NLMSG_DONE – first remembering to deal with host and network ordering.

If true, the jump will jump one statement; if false the jump will not jump any statements. These are the third and fourth arguments to the BPF_JUMP macro.

Note that the error case is always to return the whole packet to the process, waking it up; the success case is further processing of the packet. This makes sure that we don’t filter out unexpected packets that userspace may really need to deal with. Don’t use the socket filter for security filtering, it’s for reducing wake-ups.

So let’s filter the next set of values, we want to make sure that this netlink message is from the connector interface. Again we load the right “word” (32-bit) values (BPF_W) from the appropriate offsets and check them against constants.

BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, id)
          + offsetof (struct cb_id, idx));
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
          htonl (CN_IDX_PROC),
          1, 0);
BPF_STMT (BPF_RET|BPF_K, 0xffffffff);

BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, id)
          + offsetof (struct cb_id, val));
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
          htonl (CN_VAL_PROC),
          1, 0);
BPF_STMT (BPF_RET|BPF_K, 0xffffffff);

So after this filter code has executed, we know the packet contains a single netlink message from the proc connector. Now we want to make sure it’s a fork message; this is a bit different from before, because this time we really do want to filter out the other message types, so the non-equality case returns zero bytes.

BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, what));
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
          htonl (PROC_EVENT_FORK),
          1, 0);
BPF_STMT (BPF_RET|BPF_K, 0);

And now we can compare the pid and tgid values for the parent process and the child process fields. This is again slightly interesting because we can’t compare against an absolute offset with the jump instruction so we use the second index register instead (BPF_X in the jump instruction). Of course it would be too easy if we could load directly into that, so we have to do it via the scratch memory store instead; this requires loading into the accumulator (BPF_LD), storing into scratch memory (BPF_ST) and loading the index register (BPF_LDX) from scratch memory (BPF_MEM).

BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, event_data)
          + offsetof (struct fork_proc_event, parent_pid));
BPF_STMT (BPF_ST, 0);
BPF_STMT (BPF_LDX|BPF_W|BPF_MEM, 0);

Then we load the tgid value into the accumulator and compare and jump as before; if they are equal we want to continue, if they are unequal we want to filter the packet.

BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, event_data)
          + offsetof (struct fork_proc_event, parent_tgid));
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_X,
          0,
          1, 0);
BPF_STMT (BPF_RET|BPF_K, 0);

Then we do the same for the child fields.

BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, event_data)
          + offsetof (struct fork_proc_event, child_pid));
BPF_STMT (BPF_ST, 0);
BPF_STMT (BPF_LDX|BPF_W|BPF_MEM, 0);

BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, event_data)
          + offsetof (struct fork_proc_event, child_tgid));

BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_X,
          0,
          1, 0);

BPF_STMT (BPF_RET|BPF_K, 0);

After all that filter hurdling, we have a packet that we want to pass through to the process, so the final instruction is a return of the largest packet size.

BPF_STMT (BPF_RET|BPF_K, 0xffffffff);

That’s it. Of course, what you do with this is up to you. One example could be a daemon that watches for excessive forks and kills fork bombs before they kill the machine. Since you get notification of changes of uid or gid, another example could be a security audit daemon, etc.

Upstart uses this interface for its own nefarious process tracking purposes.

Leaving Canonical

This will be my last week working for Canonical Ltd.

I joined the company almost seven years ago, right at its inception.  I was contracting at the time and a member of the Debian project maintaining the dpkg package manager, when I received an e-mail out of the blue that led to a phone call with a South African I’d never heard of who wanted to offer me a dream job working on a Debian-based Linux distribution.  Sadly I never kept that original e-mail, but I tried to replicate it from memory for Canonical’s 5th birthday:

Dear Friend,

How are you and your family hope fine?

I am Mark SHUTTLEWORTH, from the great country of SOUTH AFRICA.

Due to good fortune mine in business, I have come into money of the sum $575,000,000 (US).

I would like to with you discuss BUSINESS OPPORTUNITY, and solicit your confidentiality in this transaction.

Pleased to discuss by phone at your earliest convenience.

Ok, Mark wasn’t really a Nigerian 419 scammer, but some people did discard his e-mail as spam!  The job sounded interesting, and I was largely waiting for him to stop talking on the phone so I could say yes.  Even better, he was going to pay me up front for the first couple of months because the company hadn’t been formed yet let alone contracts signed and such.  No, I didn’t have to send him any money first to make the transaction happen ;-)

So I joined the super-secret IRC channel (#weirdos, on the FreeNode IRC network, just fire up Pidgin in Ubuntu and…) and discovered Jeff Waugh, Robert Collins and Thom May already onboard.  This was going to be big.  After a month of being in awe at each new person being brought on, we had our first meeting in London over Easter.  For many this was their decision time about whether to join, or not.  Plans were drawn up, mostly on napkins at Pizza Express in Sloane Square:

Photo by Lamont Jones

Funnily enough, there’s a Pizza Express in Millbank Tower, the current location of Canonical’s Offices.

We weren’t very good at coming up with names, the original domain name of the company was no-name-yet.com and the Debian folk called us the Super-Secret-Debian-Startup.  The company started out as MRS Virtual Development (Mark’s middle name is Richard).  And the nickname for the distribution before Ubuntu was settled as the final name was The Warty Warthog.

Everything was announced at Debconf in Porto Alegre, Brazil.  The first of many long economy class flights taken on behalf of the company.  This meant that by the time Jeff and I attended GUADEC in Kristiansand, word had got around.  There was much joking about our insane plans:

Mrs VD’s Warty Ubuntu?  Sounds like an STI cream!
Yes, it cures Red Hat.

Many were of the opinion that users just didn’t want a six-monthly release of Debian, with a hard emphasis on the Desktop, hotplug and making things just work.  Fortunately they were wrong, but we didn’t have time to be smug because things got a bit out of hand.  I remember Mark saying that his goal for the first two years was that Ubuntu be in the top three Linux distributions.  Ah.

Next up was our first ever big company meeting.  Lots of variations of these happened over the years before we finally settled into the Ubuntu Developer Summit (UDS) format.  Initially they were all-hands events, and started off a bit more like sprints/rallies than anything like the current schedule-frenzy that is UDS.  Fortunately one of the changes is that they’ve gotten a bit shorter.  After a two-week coding sprint at Mark’s apartment in London, there was a two-week all-hands in Oxford, UK.

Robert and the Hoverbook

I wanted to find a photo with laptops in, for some this event was painful.  We learned that hotel cleaners are not always to be trusted.  Fortunately Robert’s laptop was far too heavy to be stolen.

Then Ubuntu 4.10 was out!  And the world changed.  Well, maybe a bit.

The next conference was LCA in Canberra, followed by our own third developer meeting in Sydney.  This developer meeting was pretty recognizable as a UDS in fact, except two weeks long and all-hands again.  I was granted the very rare privilege of flying to Australia on Mark’s personal private jet.

"Canonical One"

I actually got to fly on this a few more times over the years, and after an amazing night-time landing flying across San Francisco into San Jose airport, got the bug and learned to fly myself!  But I’m digressing.

On the plane we’d brought Ubuntu 5.04 CDs, our second release.  We’d got a few boxes of them, and it was my responsibility to look after them and try to persuade the conference staff to let us put some out to pick up.  I took a small handful and wandered to the reception desk, with a sheepish look on my face.  I was accompanied back to my dorm by the reception staff, who wanted the rest!  I think that’s when I finally realized how popular Ubuntu had become; seeing almost everyone at the conference running Ubuntu machines only solidified that.

We had a big printed-out version of the 5.04 CD cover that we got people at the conference to sign.  It’s still on the wall of the Canonical Offices to this day.

Matthew signing the Ubuntu poster

I’ve probably made all this sound a bit glamorous: jet-set lifestyle, celebrity, probably even danger.  But at the end of the day, it was a job.  For example, at no point did we find ourselves white-water rafting in Brazil with an instructor who didn’t speak English.

White-water rafting

To this day I don’t know whether “Frenchie!” means “Faster!” or “STOP! We’re going to DIE!”.

There was a lot of hard work too.  When we were preparing to release our first Late To Ship, err, sorry, Long Term Support release a few of us decided to use the space at Canonical’s new offices at Mossop Street to get together and test the hell out of it.  The idea being that any serious issues could be fixed there and then.  We still do these “Release Sprints” to this day, though the next one breaks the tradition of being in London due to some Prince getting married that week.

Everything will be OK

We kept track of the release status using a sign helpfully provided for us by the then-COO Jane Silber, it has two sides.  This is the happy side.

More releases followed, more conferences, more meetings.  We got better at the releases, and even started getting better at the conferences after enough goes at it.  The meetings were generally ok, except sometimes there was a bit of a problem getting to them!

You see, myself and a colleague Colin King are cursed.  Seriously, if you ever find yourself getting on a plane and see both of us on that same plane, get off the plane.  No, better yet, get the hell out of the city!

You remember that great big snow storm in the UK back in the winter of 2009?  That was our fault!  Colin and I were booked to fly on the same flight.

Cancelled

Things went well until we were sat in the Gatwick business lounge and it started snowing outside.  Our plane never arrived, so our flight was cancelled.  Since the queue for the Easyjet desk went around the airport three times, our travel agent got us booked on a flight out of City Airport in the morning, and sorted us a hotel by that airport.  Easy.  Gatwick Express into London, District Line tube, then transfer onto a bus for City Airport.  The only problem was that by the time we’d left the tube there were several feet of snow on the ground, with more falling all the time; the buses were not running and we were a couple of hours’ walk from the airport.  Oh well, needs must!

The next day we rebooked repeatedly onto later flights until the afternoon, when we finally managed to get Eurostar tickets to Brussels.  Another night in a hotel, another 6am start, ICE to Köln and another to Berlin.  Finally arriving Tuesday afternoon.  Our average pace from Gatwick to Berlin turned out to be roughly walking speed.

Now this might have been an interesting story for the dinner table, except it happened again! That volcano in Iceland?  Our fault!  Just over a year since the previous time, we were at a conference in San Francisco together, and we ground all air traffic in the skies of Northern Europe.  We really are sorry about that, and since the disasters seem to be escalating, that’s why I have to leave Canonical.

Running in San Francisco

While an amusing thought, there’s actually a small amount of truth to it.  You see, due to airlines, flight priorities and so-forth I was actually stuck in San Francisco for three weeks as a result of the volcano.  Instead of attending the release sprint, I worked from the offices of Ubuntu-friendly companies in the bay area and fixed problems flagged the previous day by the release manager.  At night I explored the city.

I’d been to SF before quite a few times, including a long holiday with my then-partner, and I’ve always loved the place.  I was for all intents and purposes living there for three weeks, and a previous dream to move there got stronger.

I also bought an iPad which made me realize that perhaps the desktop distribution was approaching a decline.

I also got a chance to do pure development again, having bugs triaged for me, and I fell in love with programming again – rather than the oddball effort that is distribution engineering.

And I was working in offices, and while I’ve enjoyed working from home for the past seven years, I was far more productive in the office environment.

There are lots of other reasons of course, but ultimately they all come down to it being time for a change.  So where next?

One Infinite Loop

No.

While I do really admire what Apple have done, they’ve already got their ideas set in stone and I want to beat them.

Google

So I’m going to be joining Google.  After months of waiting, and worrying, my US Visa was approved last week and I’m half way through procrastinating about packing my house and life up for the big move!

Don’t worry though, I won’t be disappearing into a black hole!  I’m retaining my Ubuntu membership, Core Developer upload privileges and my seat on the Ubuntu Technical Board (which means there will be a non-Canonical person on the board once again!).  I’ve even re-activated my Debian membership.

I’m also going to continue developing Upstart.  I’ve been working hard on the new version for what seems like an age now, and I’m not giving up; not least because Google use Upstart themselves on many projects, including Chrome OS.

The only real worry is whether I end up spending more time at the Google Gym or the wide variety of Google cafés.

The Importance of Being Tested

In addition to the regular posts documenting features of 0.6, giving hints and tips about its usage, release announcements and so forth, I’ll also be posting insights and anecdotes about Upstart’s ongoing development.  A particular story cropped up again this month, and I thought I’d share it with you.

When I began work on Upstart, one of the earliest decisions I made was to make sure the code was very well covered by a comprehensive test suite.  I’d been working with Robert Collins a lot in the previous couple of years, and he is very much an advocate of practices such as Extreme Programming (XP) and Agile Development, especially the discipline of Test Driven Development.

I’d also recently seen a keynote by Andrew Tridgell in which he talked about some of the development of Samba 4, in particular the high use of both test cases and code generation in that code-base.  Something he said in the keynote stuck with me: “untested code is broken code”.

Statistics obviously depend on exactly how you count lines of code, but using a simple semi-colon count the combined source code of libnih and Upstart is slightly over 20,000 lines of code.  The combined source code of the test suite for both is slightly over 120,000 lines of code.

The init daemon is an extremely important part of a Linux system, if it crashes then you’re left with a kernel panic; if it simply misbehaves, you’re left with just severe problems.  Not only was I changing it, but I was replacing a very simple dumb system (Sys V init) with something comparatively complex with rules and behaviours that needed rigorous testing.

It would have been very scary to have developed it without the careful testing, and I would have been very worried if anyone had agreed to replace such a core component of the system without this test suite to back up its behaviour.

That being said, maintaining the test suite can be a huge burden.  Don’t believe what anybody tells you: if you’re writing test cases as well as code, your pace of development slows.  They’re right that you spend a lot less time debugging, of course, but unlike in the commercial software business, free software developers tend to release first and debug later.  If you use a similarly high test-to-code ratio in your own project, you’ll find that the time until your first release is pretty long, and the time between releases longer as well.

Another decision is whether to do Test Driven Development or not; that discipline requires that you always write the tests first, to fail, and only write code in order to make the tests pass.  I’m not a fan of TDD, and I’ve no problem admitting that I mostly did not use it for Upstart.  My gut feel is that TDD produces code that hangs, swings and loops just to deal with testing.  It also just doesn’t suit my coding style: I like to write code from the middle outwards, the function API is the last thing I tend to fix, where TDD forces it to be the first.

I’m also not convinced TDD is really suitable for a language like C; it’s pretty hard to get a test case to compile, run and fail without writing any supporting code such as a header file, etc.

I have found TDD useful when I have code that really does break down into a single unit with a well-defined and obvious API, where the inputs and outputs were obvious but the algorithm for getting between them wasn’t at the time.

What I’ve tended to do instead is write code naturally, as I would anyway, and write test cases alongside to run the code and make sure it’s working.  As the code grows more complex, more test cases appear for it.  One big advantage of this is that I don’t need to reboot or fire up a VM as much; I can exercise a large proportion of Upstart’s operation through the test suite.

Now, onto the stories.  There are two similar ones.

One of the side-effects of testing Upstart so thoroughly is that the tests exercise not only the code I’ve written but also code in libraries and even in the kernel.  One particular set of tests covers the code in libnih and Upstart that watches the configuration directory for changes; it’s this code that means Upstart automatically reloads jobs when you edit them, without needing an explicit signal.

One day these test cases started failing without warning.  Investigation showed that they passed fine under older kernels, but with the newest kernel update to Ubuntu, they failed.

The inotify subsystem in the kernel had undergone a radical overhaul and rewrite.  Rather than being its own code, it was completely rebased onto the new fsnotify system.  Fortunately I was aware of this, and after carefully checking that it was indeed the kernel behaviour that was now incorrect (and that it wasn’t incorrect before), I got in touch with Eric Paris, the author of the new code, and was able to give him minimal example code to replicate the problem.

inotify: check filename before dropping repeat events

This was a while ago, but pretty much the same story happened again recently, just this time not with the kernel.

Again, the story started with Upstart’s test suite failing.  The engineer who first noticed it assumed it was an issue with the new build daemon and disabled the test for the time being.  The test was in the part of the code testing Upstart’s interaction with D-Bus.

Now, sometimes I write tests to deal with corner-cases and “what if” scenarios that I dream up.  This isn’t always about testing my code; often it’s a case of finding out whether something is really possible, or whether some other component misbehaves.  These tests stay in the suite, of course.

A particular set of tests was intended to find out what happened if the D-Bus daemon crashed during the initial connection.  I considered this fairly important because at times the libdbus library has called exit() or abort() when things happened that it didn’t like, and if you call either of those from the init daemon, the kernel panics.

These tests had worked fine for a couple of years (actually at the time I had to fix bugs in libdbus to make them pass) but now one of these tests was breaking.  The disconnection was causing SIGPIPE to be delivered to the test.

Again, this turned out to be due to a change to D-Bus.  Lennart Poettering had been working on some changes to avoid libdbus’s awkward SIGPIPE handling and replace it with the use of the MSG_NOSIGNAL flag.  Unfortunately he’d missed a case in the authentication code.  The side-effect was that if the D-Bus daemon crashed, was killed, OOM’d, etc. during the initial connection, the connecting application would have gone too.  Especially bad for an init daemon.

Fortunately Upstart’s test suite caught it, and the fix was simple.

sysdeps-unix: use MSG_NOSIGNAL when sending creds

(reposted from http://upstart.at/2010/12/20/the-importance-of-being-tested/ – post comments there)

Events are like Methods

In last week’s post I talked about how Events can be treated like Signals, this week we’ll be looking at how Events can be treated like Methods.  That might seem a little surprising, since normally one considers signals and methods as very different things, but to Upstart they are both just events.

What do I mean by Methods?  You’ve almost certainly done some kind of programming, even if just a little scripting, so you should know about methods or functions.

In contrast to signals, which are just a notification that something happened on the system, a method is a request for the system to do something on your behalf.  Usually to make some kind of change to the system state.

Likewise in contrast to the signals where you don’t care about the result, for a method you want to wait for the changes to be completed and perhaps even be notified if the method failed.

It’s just as easy to implement a method in Upstart as it is to implement something that treats an event as a signal.  Here’s an example of how you might implement a suspend method:

start on suspend

task
exec pm-suspend

Doesn’t look much different from a signal; the only new stanza here is task (and even that isn’t necessary for a method).  So what happens if we want to trigger a suspend?  We use the command:

root@worldofwarcraft:~# initctl emit suspend

The difference here from emitting a signal, as we demonstrated in the previous post, is that we aren’t using the --no-wait flag.

So we emit the suspend event, and Upstart will start our job as a result; but initctl emit will not return immediately, it waits for the results of the event to complete before it returns.

Because we used the task stanza in the configuration, we’ve told Upstart that the process we execute is expected to take a limited amount of time and then finish by itself.  This means that Upstart will not believe the job is complete until the process has exited, and will continue to block the event while it is still running.

Finally if the command exited with an error, that error is propagated back to the event that started it, and the initctl emit command will exit with an error code.

So now we can use Upstart events and jobs for two different purposes; we can announce changes to the system, and we can use them as methods to make changes to the system.

The most typical event used as a method on your system is the runlevel event, which changes the runlevel for System-V compatibility and is generally emitted by the telinit and shutdown tools.  The /etc/init/rc.conf script that handles it is pretty simple, and looks not unlike the suspend example above:

start on runlevel [0123456]

task
exec /etc/init.d/rc $RUNLEVEL

What happens if you don’t include task?  Well, that means Upstart will consider the job ready once the executed process is running, at which point the event will be unblocked and initctl emit will return.  If the service fails to start, then initctl will return with an error.  This is great for methods that start (or stop) services.
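For instance, a hypothetical service job (mydaemon is a made-up name) would just drop the task stanza; emitting its start event then blocks only until the daemon is up and running:

start on runlevel [2345]

respawn
exec mydaemon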

Side-note: the start and stop commands act very much like method events; they block until the service is running or the task has finished, and they return errors as well.  However, they’re not actually implemented as events right now, an oversight I intend to correct in Upstart 2.

(reposted from http://upstart.at/2010/12/16/events-are-like-methods/ – post comments there)

Event matching in Upstart

A little while ago I was asked to solve a problem that somebody was having with Upstart, and I realised that people weren’t understanding how things were actually working and were just muddling along when doing event matching in jobs.  This is unfortunate, because it hides some of Upstart’s true power, so I thought it high time I actually explained this.

Let’s start with a simple example.  Fire up any Linux distribution with Upstart 0.6, Ubuntu or Fedora current releases will do, and create a file named /etc/init/example1.conf with the following content:

start on surprise

This is pretty simple; it’s a job that does nothing except declare that it’s started when the surprise event happens.  We can demonstrate that it works by emitting the event ourselves and checking the status of the job before and afterwards:

root@angrybirds:/etc/init# status example1
example1 stop/waiting
root@angrybirds:/etc/init# initctl emit surprise
root@angrybirds:/etc/init# status example1
example1 start/running

Nothing too surprising after all, I hope.  The job did indeed start on the surprise event, and would now be running if we’d actually told Upstart to run something.

Incidentally, I’m often asked why there isn’t a single list of events anywhere; that’s because you can match any event you like, as long as you know something emits it.  Events are supposed to come from all manner of sources.  I do try to document them though: try running man 7 startup on your system to see an example of an event’s man page.

If events were just names, they’d be pretty boring.  Events can also have attached environment variables, and these get put into the environment of any job’s process started by the event.  Here’s /etc/init/example2.conf:

start on weather

script
    echo $KIND > /tmp/weather
end script

This will now run a small shell script that outputs the $KIND environment variable to a file.  This isn’t set anywhere, but we can pass it in the event.

root@angrybirds:/etc/init# cat /tmp/weather
cat: /tmp/weather: No such file or directory
root@angrybirds:/etc/init# initctl emit weather KIND=RAIN
root@angrybirds:/etc/init# cat /tmp/weather
RAIN

Ok, these are just examples but there are plenty of useful events on your system right now which carry environment variables such as which network interface just came up, and so on.
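For instance, on Ubuntu releases of this era the network scripts emit a net-device-up event carrying the interface name, so a job can wait on a specific interface (this assumes the usual IFACE variable used by those scripts):

start on net-device-up IFACE=eth0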

If you wanted to run only on a certain kind of weather, you might think to check the value of $KIND within the script.  You could do that, but it’s inefficient; ideally you don’t want your script run at all.  Fortunately we can match the environment of an event in the job easily enough; here’s /etc/init/example3.conf:

start on weather KIND=snow

Hopefully you’ll figure that this one will only start if it’s snowing, and you’d be right:

root@angrybirds:/etc/init# status example3
example3 stop/waiting
root@angrybirds:/etc/init# initctl emit weather KIND=hail
root@angrybirds:/etc/init# status example3
example3 stop/waiting
root@angrybirds:/etc/init# initctl emit weather KIND=snow
root@angrybirds:/etc/init# status example3
example3 start/running

Events can have more than one environment variable, and you can have more than one match:

start on weather KIND=rain INTENSITY=heavy

The matches are actually globs, so you can use * and ? in them; and as well as =, there’s also !=.
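Sketching a few hypothetical start conditions against the weather event from above, each of the following is valid: the first matches snow or snaw or any other single character in that position, the second matches any kind of heavy weather, and the third matches anything except sun.

start on weather KIND=sn?w
start on weather KIND=* INTENSITY=heavy
start on weather KIND!=sun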

One useful application of the latter is in the stop on stanza; as well as being available to the job’s processes, event environment variables can also be used in other stanzas within the job.  Here’s a cute example for /etc/init/example4.conf:

start on weather KIND=rain or weather KIND=snow
stop on weather KIND!=$KIND

This one takes a bit of explaining.  First of all, to start the job we match the weather event with $KIND set to either rain or snow.  Then we supply a condition to stop the job: we again match the weather event with a given value of $KIND, except this time the value we match against looks like the variable itself.

In fact this expansion of $KIND is the value that variable had when the job was started, not the value in the new event.  So the condition says to stop the job if it stops raining or stops snowing, depending on which of the two started it.  Most importantly, if an event simply repeats the same kind of weather, perhaps with a different intensity, the job carries on running (though it doesn’t have its environment updated; UNIX can’t do that).

root@angrybirds:/etc/init# status example4
example4 stop/waiting
root@angrybirds:/etc/init# initctl emit weather KIND=rain INTENSITY=heavy
root@angrybirds:/etc/init# status example4
example4 start/running
root@angrybirds:/etc/init# initctl emit weather KIND=rain INTENSITY=light
root@angrybirds:/etc/init# status example4
example4 start/running
root@angrybirds:/etc/init# initctl emit weather KIND=sun
root@angrybirds:/etc/init# status example4
example4 stop/waiting

Ok, last fake example before we get onto the fun bits.  Remember the example from above:

start on weather KIND=rain INTENSITY=heavy

Upstart lets us shortcut this a little: the environment variables are passed in a particular order on the initctl command line, and if we know that order, we can refer to a variable simply by its position.  So as long as we know a weather event always has a KIND followed by an INTENSITY, we can shorten that to:

start on weather rain heavy

If you’ve used Upstart at all, you’ve seen that shortcut before.  A lot.  You may not have even realised it was a shortcut at all, and that’s what I hope to fix here.

Here’s an example of where you’ve used that:

start on started dbus

You should hopefully now recognise that started is the name of the event there, and dbus is simply the value of its first argument, whatever that might be.  Remember I mentioned that events have man pages?  Take a look at man 7 started, which is the man page for this event.

It documents which environment variables are attached to the started event, and most importantly what order they come in.

started JOB=JOB INSTANCE=INSTANCE [ENV]...

So really when we wrote the previous, we were just using a shortcut to specify:

start on started JOB=dbus

You might wonder what difference this makes.  A good example of how to exploit this is the stopped event.  If you look at its man page (man 7 stopped) you’ll see it has a large number of environment variables specifying not only which job stopped but also why it stopped.  One of those is the exit signal, for example.

Now that you know you’re just matching the $JOB environment variable, it’s obvious that you don’t have to!  You can match any other environment variable or variables in the event, or none at all.

Here’s how to run a script if any other job on the system exits with a segmentation fault:

start on stopped EXIT_SIGNAL=SEGV

I said you didn’t have to match any variables at all, just as in the first examples; there’s a neat use for that with the job events.  The starting event blocks the named job from actually starting until anything started by that event is itself running or, in the case of jobs marked task, has finished.

Here’s a little job that runs every time another job is started, and blocks that job from actually starting until the script finishes.

start on starting
task

script
    ....
end script

Useful both for debugging and performance analysis.
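As a hypothetical concrete version (the log path and format are made up), the starting event carries $JOB, so we can timestamp every job’s startup for boot profiling:

start on starting
task

script
    echo "$(date +%s) starting $JOB" >> /var/log/job-starts.log
end script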

Now for the really neat bit.  So far we’ve concentrated on the environment variables that come from events, and those that Upstart puts into the job events.  But we can influence these in rather useful ways.

Firstly, we can declare a default value for an environment variable in a job; if no alternative value is given in the start event or command, this default value wins:

start on mounted

env MOUNTPOINT=/tmp
script
    ....
end script

This script will run for each occurrence of the mounted event, and will hopefully get the value for $MOUNTPOINT from that event.  But should the value be missing from the event, or the script be started manually by a system administrator, a default value is provided.

This isn’t a contrived example; it’s from the job on your system that cleans up the /tmp directory on boot.  The default value wasn’t there in earlier versions of Ubuntu, and that had a rather disastrous side-effect when the job was run by hand.

Ok, so we can set the values of environment variables from a job, and we don’t have to match the job name in the usual job events.  We can combine these two facts in a very interesting way, because we can export the value of a job’s environment variable into its job events.

Here’s the first job:

env AM_A_DISPLAY_MANAGER=1
export AM_A_DISPLAY_MANAGER

This sets the default value of $AM_A_DISPLAY_MANAGER, but this isn’t a variable we ever expect to be supplied by an event, so it just gets passed into the environment of the job’s processes.  On its own, it’s not that useful either.

The export line is the useful one: it adds the value of the named environment variable to the job’s events, that is, the starting, started, stopping and stopped events.

Now, in another job, we can do:

start on started AM_A_DISPLAY_MANAGER=1

This is run when any job is started that has that environment variable in its events.  In other words, we can tag classes of services so we don’t have to list every single one.

And because everything in Upstart is the same fundamental type of thing, this can work in the opposite direction.  For example we can put in our job:

env NEED_PORTMAP=1
export NEED_PORTMAP

This means our job’s events will have NEED_PORTMAP=1 in them.  Now, remembering that a job waits for the side-effects of its starting event to complete, we can write in /etc/init/portmap.conf:

start on starting NEED_PORTMAP=1

So we can implement a dependency-based init system with Upstart, an event-based init system.

I look forward to finding out what else you can do with it.

(reposted from http://upstart.at/2010/12/03/event-matching-in-upstart/ – post comments there)