AudioFeeder updates for ogv.js

I’ve taken a break from the blog for too long! Time for an update on current work. We’re doing a final push on the video.js-based frontend media player for MediaWiki’s TimedMediaHandler, with some new user interface bits, better mobile support, and laying the foundation for smoother streaming in the future.

Among other things, I’m doing some cleanup on the AudioFeeder component in the ogv.js codec shim, which is still used in Safari on iOS devices and older Macs.

It abstracts a digital sound output channel with an append-only buffer that can be stopped and started, have its volume changed, and report the current playback position.

When I was starting on this work in 2014 or so, Internet Explorer 11 was still supported, so I needed a Flash backend for IE and a Web Audio backend for Safari… at the time, the only way to create the endless virtual buffer in Web Audio was a ScriptProcessorNode, which ran its data-manipulation callback on the main thread. This required a fairly large buffer size for each callback, to ensure the browser’s audio thread had data available even if the main thread was hung up drawing a video frame or something.

Fast forward to 2022: IE 11 and Flash are EOL and I’ve been able to drop them from our support matrix. Safari and other browsers still support ScriptProcessorNode, but it’s been officially deprecated for a while in favor of AudioWorklets.

I’ve been meaning to look into adding an AudioWorklet backend but hadn’t had the need; however, I’m seeing some performance problems with the current code on Safari Technology Preview, especially on my slower 2015 MacBook Pro, which doesn’t grok VP9 in hardware so needs the shim. :) Figured it’s worth taking a day or two to see if I can avoid a perf regression on old Macs when the next Safari update comes out.

So first — what’s a worklet? It’s an interface being adopted by a few web APIs (I think some of the CSS animation features use them too) as a fairly structured way of loading little scripts into a dedicated worker thread (the worklet) to do specific performance-critical things off the main thread (audio, layout, animation).

An AudioWorkletNode hooks into the Web Audio graph, giving something similar to a ScriptProcessorNode but where the processing callback runs in the worklet, on an AudioWorkletProcessor subclass. The processor object has audio-specific stuff like the media time at the start of the buffer, and is given input/output channels to read/write.

For an ever-growing output, we use 0 inputs and 1 output; ideally I can support multichannel audio as well, which I never bothered to do in the old code (for simplicity it downmixed everything to stereo). Because the worklet processors run on a dedicated thread, the data comes in small chunks — by default something like 128 samples — whereas I’d been using like 8192-sample buffers on the main thread! This allows you to have low latency, if you prefer it over a comfy buffer.
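As a rough sketch of what the processor side might look like (my own illustration with made-up names, not the actual ogv.js code, and simplified to copy a queue of mono buffers into every output channel):

    // audio-feeder-worklet.js — a sketch of the processor side, not the real thing.
    // Runs on the audio rendering thread; process() is called once per render quantum.
    class AudioFeederProcessor extends AudioWorkletProcessor {
        constructor(options) {
            super();
            this.queue = [];         // Float32Array buffers posted from the main thread
            this.offset = 0;         // read position within the current buffer
            this.samplesOutput = 0;  // total samples played so far
            this.port.onmessage = (event) => {
                this.queue.push(event.data.samples);
            };
        }

        process(inputs, outputs, parameters) {
            const output = outputs[0];        // one output; output[n] is one channel
            const frames = output[0].length;  // render quantum, typically 128 samples
            for (let i = 0; i < frames; i++) {
                if (this.queue.length === 0) {
                    break;  // underrun: the rest stays zero-filled (silence)
                }
                const buffer = this.queue[0];
                for (let channel = 0; channel < output.length; channel++) {
                    output[channel][i] = buffer[this.offset];
                }
                this.offset++;
                this.samplesOutput++;
                if (this.offset >= buffer.length) {
                    this.queue.shift();
                    this.offset = 0;
                }
            }
            // Report playback state back to the main thread.
            this.port.postMessage({
                contextTime: currentTime,  // worklet global: context time for this block
                samplesOutput: this.samplesOutput,
            });
            return true;  // keep the processor alive
        }
    }
    registerProcessor('audio-feeder-processor', AudioFeederProcessor);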

Communicating with worker threads traditionally sucks in JavaScript — unless you opt into the new secure stuff for shared memory buffers you have to send asynchronous messages; however those messages can include structured data so it’s easy to send Float32Arrays full of samples around.

The AudioWorkletNode on the main thread gets its own MessagePort, which connects to a fellow MessagePort on the AudioWorkletProcessor in the audio thread, and you can post JS objects back and forth using the standard “structured clone” algorithm, which copies the data rather than sharing live objects.

I haven’t quite got it running yet but I think I’m close. ;) On node creation, an initial set of queued buffers is sent in with the setup parameters. When audio starts playing, after the first callback copies out its data, it posts some state info back to the main thread, with the audio chunk’s timestamp and the number of samples output so far.

The main thread’s AudioFeeder abstraction can then pair those up to report what timestamp within the data feed is being played now, with compensation for any surprises (like the audio thread missing a callback itself, or a buffer underrun from the main thread causing a delay).
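The main-thread side of that wiring might look roughly like this (again my own sketch with made-up names; the real AudioFeeder also has to compensate for underruns and missed callbacks):

    // Main thread: load the worklet module and create the node (0 inputs, 1 output).
    const context = new AudioContext();
    await context.audioWorklet.addModule('audio-feeder-worklet.js');
    const node = new AudioWorkletNode(context, 'audio-feeder-processor', {
        numberOfInputs: 0,
        numberOfOutputs: 1,
        outputChannelCount: [2],
    });
    node.connect(context.destination);

    // Append decoded samples by posting them down to the worklet.
    function bufferSamples(samples /* Float32Array */) {
        node.port.postMessage({ samples });
    }

    // Track the worklet's state reports to answer "what's playing right now?"
    let lastReport = { contextTime: 0, samplesOutput: 0 };
    node.port.onmessage = (event) => {
        lastReport = event.data;
    };

    function playbackPositionSeconds() {
        const sinceReport = context.currentTime - lastReport.contextTime;
        return lastReport.samplesOutput / context.sampleRate + sinceReport;
    }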

When stopping, instead of just removing the node from the audio graph, I’ve got the main thread sending down a message that notifies the worklet code that it can safely stop, and asking for any remaining queued data back. This is important if we’ve built up a healthy buffer of decoded audio; if we continue playback from the same position, we can pass those buffers back into the new worklet node.

I kinda like the interface now that I’m digging into it. Should work… Either tonight or tomorrow I hope to sort that out and get ogv.js updated in-tree again in TMH.

HDR to SDR tone-mapping

I’ve been playing a lot of Flight Simulator lately, and when I acquired a monitor with basic high dynamic range (HDR) capability, thought it might be fun to try out. Little did I know it would launch me into a world of image processing and color spaces…

First, what is HDR? And what is SDR? Standard dynamic range images are optimized for a fairly small dynamic range between minimum and maximum brightness. The sRGB color space, standard for most computer stuff these days, is specified for viewing at a maximum screen brightness of 80 nits (1 nit == 1 candela per square meter) in a darkened room, though most people’s desktops are set much brighter to cope with daylight conditions.

High dynamic range images can have much higher brightnesses, while (hopefully) still maintaining good detail in darker regions of the image. The common HDR10 pixel format used for HDR video allows for a maximum luminance of 10,000 nits — 125 times the brightness of a standard-calibrated SDR signal! Common displays may be much more limited though — my monitor is rated as DisplayHDR 400, which provides a maximum brightness of just 400 nits (5 times the SDR standard). This is still plenty to show brighter whites and colors, and is actually really nice for Flight Simulator where bright daylight and dark shadows and interiors coexist all the time.

However now that I’m flying in HDR and taking screenshots of my simulated adventures, how do I share those photos with everyone with a normal monitor, in file formats that social media platforms support?

Naturally, I decided that converting files one-off with a viewer app I found wasn’t good enough, and wrote my own utility I can use for batch-processing. ;) Once cleaned up, this can also become useful for Wikipedia to render SDR thumbnails of HDR images (once we confirm which formats we can support without problems).

To illustrate how tone-mapping and the handling of out-of-gamut colors affect the rendering, I’ve taken a particularly dramatic screenshot from an early-morning flight at sunrise. See also the original file in JPEG XR format.

If we just clip the brighter colors into SDR range, the entire sky is completely blown out:

Or if we drop the exposure a few stops to optimize for the brightest colors, we can’t see anything but the sunrise:

To map the wide range of input into the [0, 1] range, we need some non-linear operator that preserves most of the detail in the low end untouched, then squishes brighter stuff into the top end with some loss of contrast.

A common HDR-to-SDR tone-mapping operator is Reinhard’s, where C is the input value and C_white is the maximum value to be preserved:

TMO(C) = C × (1 + C / C_white²) / (1 + C)

Reinhard et al, 2009
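In code the curve is just a couple of lines — here’s a minimal Rust sketch of it (my own illustration, not necessarily how my actual utility is structured), which can be applied either per channel or to the luminance:

    /// Reinhard tone-mapping operator (extended form).
    /// `c` is a linear input value; `c_white` is the smallest input that maps to 1.0.
    fn reinhard(c: f32, c_white: f32) -> f32 {
        c * (1.0 + c / (c_white * c_white)) / (1.0 + c)
    }

    /// Applying it independently to each channel is simple but can shift hues.
    fn tone_map_rgb(rgb: [f32; 3], c_white: f32) -> [f32; 3] {
        [
            reinhard(rgb[0], c_white),
            reinhard(rgb[1], c_white),
            reinhard(rgb[2], c_white),
        ]
    }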

If you apply this separately to the input Red, Green, and Blue channels, you end up with a result that isn’t displeasing, but causes a lot of color shifts as the color elements don’t scale at the same rates… in this case, the orange areas of the sky become much more yellow than they should be. There’s also a lot of desaturation of brighter areas, much more than I like personally:

If instead we apply the operator in the luminance domain, we can preserve colors more exactly. However there’s a big problem, which is that a pixel’s luminance (brightness) may be much lower than the maximum of its components! For instance a deep orange will have a very high red, a more modest green, and a much more modest blue. When we map the resulting colors into the output, the red clips at maximum before the green does, causing bright sky oranges to shift towards yellow and lose contrast:

One possibility is to map those too-bright colors back into gamut by progressively desaturating them. For both the luminance and saturation changes I’m using the Oklab color space, which is similar to CIELAB but designed to make it easy to scale and blend colors while maintaining perceptual qualities. If I apply just enough desaturation to keep every pixel’s Red, Green, and Blue elements in gamut, I lose some color in the brightest parts of the image, but it packs the full punch of the brightness of the sunrise:
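The actual tool does this chroma scaling in Oklab; as a simplified sketch of the same idea (my own illustration working directly in linear RGB, so the numbers won’t match the real output), you can pull an out-of-range color toward its own luminance just far enough that the largest channel fits:

    /// Relative luminance of a linear Rec. 709 / sRGB triple.
    fn luminance(rgb: [f32; 3]) -> f32 {
        0.2126 * rgb[0] + 0.7152 * rgb[1] + 0.0722 * rgb[2]
    }

    /// Desaturate just enough that every channel fits in [0, 1], preserving luminance.
    fn desaturate_into_gamut(rgb: [f32; 3]) -> [f32; 3] {
        let max = rgb[0].max(rgb[1]).max(rgb[2]);
        if max <= 1.0 {
            return rgb; // already in gamut
        }
        let y = luminance(rgb);
        if y >= 1.0 {
            return [1.0, 1.0, 1.0]; // brighter than white: nothing left to preserve
        }
        // Blend between gray-at-luminance and the original color so that the
        // largest channel lands exactly on 1.0:  y + t * (max - y) == 1.0
        let t = (1.0 - y) / (max - y);
        [
            y + t * (rgb[0] - y),
            y + t * (rgb[1] - y),
            y + t * (rgb[2] - y),
        ]
    }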

Which one’s right? There’s no one right answer. But when you’re batch processing you gotta pick a default, and I kinda like this last one. ;) It maintains the luminance data, which is most important to the human visual system, and though it loses the pure color of the sun and immediate area of the sunrise, it keeps the surrounding area much better than my other versions so far.

So what would we need to support these sorts of images on Wikipedia? A few things to consider:

First, actual file formats are important!

  • My screenshots are saved by the NVIDIA game capture tool in JPEG XR (a Microsoft-flavored standard, which may or may not have patent issues, but should be covered by their open-source patent covenant since they released a codec library for it). If patents aren’t a problem, it’s easy enough to use that library directly or indirectly.
  • I assume HDR can be done in HEIC/HEIF which is based on HEVC, the codec my NVIDIA tool captures videos in.
  • AVIF is the Alliance for Open Media / Google-flavored variant of HEIF, based on the AV1 codec instead of HEVC. There should be no patent problems from our perspective. I hear there may be browser support in Chrome at least, but I haven’t tested this yet either.
  • OpenEXR is a more classic HDR file format for photography and cinema production usage. I don’t know the patent state, but it’s implemented by widely used open source tools.
  • For video, VP9 should be fine and AV1 will work later, but we’ll need more complications in the pipeline to deal with transcoding SDR and HDR variants!

Second, rendering regular SDR thumbnails for browsers that don’t grok these formats natively or don’t know how to tone-map well: we could probably adapt the utility I wrote into a filter that we can plug into Thumbor. The code’s written in Rust as a CLI utility, and runs cross-platform. It could be adapted to take raw data on stdin/stdout or be called as a library.

Third, interactive browser display. Whether on an SDR or HDR monitor it would often be nice to be able to adjust exposure in the viewer, which necessitates being able to do the tone-mapping in real-time; this would be best done in WebGL with a shader, rather than something silly like compiling my Rust code to WebAssembly. :)

And then that would have to get integrated into MediaViewer, with suitable mobile and desktop interfaces if necessary.

If we actually want to display HDR thumbnails inline — well that’s another fun thing! AVIF would be the main target, I think, but I don’t know what the status of support is in browsers yet (both for the format, and for HDR specifically).

We might also want the thumbnail & initial display on zoom to be able to set an exposure multiplier, or even specify whether to use tone-mapping or clip the range, as image parameters in the wiki page.

All fun possibilities that need to be decided on and taken into account some time. :)

Civilization V cross-play is dead

PSA for Civilization V aficionados: the Windows and Mac versions are no longer compatible for online multiplayer.

It seems the game’s online state management is probably based on passing raw game structures, and some types differ between 32-bit and 64-bit versions: the Windows version of the game is 32-bit, but the Mac version was updated to 64-bit last year to allow it to run on recent versions of macOS that dropped 32-bit support.

It’s unclear whether updating the Windows version to 64-bit would resolve the incompatibility, as macOS and Windows have some differing types at 64-bit as well.

Sigh.

This was an avoidable problem, either by using device-independent serializations or device-independent core representations. And it was exacerbated by Apple dropping 32-bit compatibility and forcing developers to make rushed decisions about supporting or abandoning legacy products.

Modularity and cross-language projects: a nostalgic look

Before I was paid to work on code for a living, it was my hobby. My favorite project from when I was young, before The Internet came to the masses, was a support library for my other programs (little widgets, games, and utilities for myself): a loadable graphics driver system for VGA and SVGA cards, *just* before Windows became popular and provided all this infrastructure for you. ;)

The basics

I used a combination of Pascal, C, and assembly language to create host programs (mainly in Pascal) and loadable modules (linked together from C and asm code). I used C for higher-level parts of the drivers like drawing lines and circles, because I could express the code more easily than in asm yet I could still create a tiny linkable bit of code that was self-sufficient and didn’t need a runtime library.

High performance loop optimizations and BIOS calls were done in assembly language, directly invoking processor features like interrupt calls and manually unrolling and optimizing tight loops for blits, fills, and horizontal lines.

Driver model

A driver would be compiled with C’s “tiny” memory model and the C and asm code linked together into a DOS “.com” executable, about the simplest executable format devisable — it’s simply loaded whole into a 64-KiB “segment”, just past a small block at the start of the segment that holds things like the command-line arguments. Your code could safely assume its own load address within that segment, so you could use absolute pointers for branches and local memory storage.

I kept the same model, but loaded it within the host program’s memory and added one more convention: an address table at the start of the driver, pointing to the start of the various standard functions, which was a list roughly like this:

  • set mode
  • clear screen
  • set palette
  • set pixel
  • get pixel
  • draw horizontal line
  • draw vertical line
  • draw arbitrary line
  • draw circle
  • blit/copy

Optimizations

IIRC, a driver could choose to implement only a few base functions like set mode & set/get pixel and the rest would be emulated in generic C or Pascal code that might be slower than an optimized version.

The main custom optimizations (rather than generic “make code go fast”) were around horizontal lines & fills, where you could sometimes make use of a feature of the graphics card — for instance in the “Mode X” variants of VGA’s 256-color mode used by many games of the era, the “planar” memory mode of the VGA could be invoked to write four same-color pixels simultaneously in a horizontal line or solid box. You only had to go pixel-by-pixel at the left and right edges if they didn’t end on a 4-pixel boundary!

SVGA stuff sometimes also had special abilities you could invoke, though I’m not sure how far I ever got on that. (Mostly I remember using the VESA mode-setting and doing some generic fiddling at 640×480, 800×600, and maybe even the exotic promise of 1024×768!)

High-level GUI

I built a minimal high-level Pascal GUI on top of this driver which could do some very simple window & widget drawing & respond to mouse and keyboard events, using the low-level graphics driver to pick a suitable 256-color mode and draw stuff. If it’s the same project I’m thinking of, my dad actually paid me a token amount as a “subcontractor” to use my GUI library in a small program for a side consulting gig.

So that’s the story of my first paying job as a programmer! :)

Even more nostalgia

None of this was new or groundbreaking when I did it; most of it would’ve been old hat to anyone working in the graphics & GUI library industries, I’m sure! But it was really exciting to me to work out how the pieces went together with the tools available to me at the time, with only a room full of Byte magazine and Dr. Dobb’s Journal issues to connect me to the outside world of programming.

I’m really glad that kids (and adults!) learning programming today have access to more people and more resources, but I worry they’re also flooded with a world of “everything’s already been done, so why do anything from scratch?” Well it’s fun to bake a cake from scratch too, even if you don’t *have* to because you can just buy a whole cake or a cake mix!

The loadable drivers and the asymmetric use of programming languages to target specific areas of work are *useful*. The old Portland Pattern Repository Wiki called it “alternating hard and soft layers”. 8-bit programmers called it “doing awesome stuff with machine language embedded in my BASIC programs”. Embedding machine code in BASIC programs you typed in from magazines? That was how I lived in the late 1980s and early 1990s, my folks!

Future

I *might* still have the source code for some of this on an old backup CD-ROM. If I find it I’ll stick this stuff up on GitHub for the amusement of my fellow programmers. :)

Windows ARM64 Visual Studio Code Insiders now available

Just want to give a shout-out to the wonderful folks at Microsoft and elsewhere who have gotten a Visual Studio Code Insiders build created for Windows on ARM64, which runs natively on the Surface Pro X and other ARM64 machines.

It’s still not listed in the regular downloads but it works for me when installed directly, and should auto-update with further Insiders releases. :)

The x86 build ran acceptably for some light development on the Surface Pro X in emulation, but the native build feels a *lot* faster. Starts up instantly, no longer so sluggish to scroll or wait for linter updates.

Now all I need is Docker for Win10/ARM64 and for WSL2 to fix the ARM64 performance problems with Hyper-V. :)

Surface Pro X thoughts after a few weeks

After a few weeks using the fancy new Windows 10 ARM64 tablet, the Surface Pro X, I’ve got a few thoughts. Mostly good so far, but it remains an early adopter device with a few rough edges (virtually — the physical edges are smooth and beautiful!) Note that my use cases are not everyone’s use cases, so some people will have even more luck, or even less luck, getting things working. :) Your mileage can and will vary.

Hardware

It’s just gorgeous. Too gorgeous. It’s all black-on-black labeled with black type. Mostly this is fine, but I find it hard to find the USB-C ports on the left side when I’ve got it propped up on its stand. :)

Seriously though, my biggest hardware complaint is that the bezels are too small for its size when holding as a tablet in the hands — I keep hitting the corners with my fat hands and opening the start menu or closing an app. I’m still not 100% sold on the idea of tablets-with-keyboards at this size (13″ diagonal or so).

But for watching stuff, the screen is *fantastic*. The 3:2 aspect ratio is also much better for anything that’s not video, while still not feeling like I’ve wasted much space on a 16:9 letterbox.

The keyboard attachment is pretty good. Get it. GET IT. I got the one that also has the cradle for the pen, which I never use but felt like I had to try out. If I did more art I would probably use it.

Performance and emulation

The CPU is really good. It’s got a huge speed boost over the Snapdragon 835 and 850 in older ARM64 Windows machines, and feels very snappy in native apps like Firefox or the new Edge. With 4 high-power CPU cores and 4 low-power cores, it handles multithreaded workloads fairly well unless they get confused by the scheduler… I’ve sometimes seen background threads get pushed to the low-power cores, where they take a long time to run.

(In Task Manager, you can see the first 4 cores are the low-power cores, the next 4 are high-power.)

x86 Windows software is supported via emulation, both for store apps and regular win32 apps you find anywhere. But not everything works. I’ve generally had good luck with tools and applications – Visual Studio, VS Code, Chrome, Git for Windows, Krita, Inkscape all run. But about 1/2 of the Steam games I tried failed to run, maybe more. And software that’s x64-only won’t run at all, as there’s no emulator support for 64-bit code.

Emulated code in my unscientific testing runs 2-3 times slower than native code on sustained loops, but you can expect loading-time stuff to be slower because things have to get traced/compiled on the first run through or when code is modified in memory.

Nonetheless, 2-3 times slower than really-fast is still not-bad, and for UI-heavy or i/o-heavy applications it’s not too significant. I’ve had no real complaints using the x86 VS Code front-end, but more complaints with, say, compiling things in Visual Studio. :)

Web use case

Most of what I use a computer for these days is in a web browser environment, so “using the web” is big. Firefox has an optimized, native ARM64 build. Works great. ’nuff said.

Oh also Edge preview builds in the Dev and Canary channel are ARM64 native and run great, if you like that sort of thing.

Chrome, however, has not released a native build and will run in x86 emulation. If you need Chrome specifically it *will install and run* but it will be slow. Do not grab custom Chromium builds unless you’re using them only for testing, as they will not be secure or get updated!

Developer use case

I’m a software developer, so in addition to “everything that goes in a web browser” I need to use tools to work on a combination of stuff, mostly:

  • PHP and client-side JavaScript code (MediaWiki, a few other bits)
  • weird science C / JavaScript / emscripten / WebAssembly stuff (ogv.js, which plugs into MediaWiki’s video player extension)
  • research work in Rust (mtpng threaded PNG compressor)

LAMP stuff

I’m used to working in either a macOS or Linux environment, with Unix-like command line tools and usually a separate GUI text editor like Visual Studio Code, and never had good experiences trying to run the MediaWiki LAMP-stack tools on a Windows environment in years past. Even with Vagrant managing a VM, it had proved more fragile on Windows for me than on Mac or Linux.

WSL (Windows Subsystem for Linux) has changed that. I can run a Debian or Ubuntu system with less overhead and better integration to the host system than running in a traditional VM like VirtualBox or Hyper-V. On the Surface Pro X, you get the aarch64 distribution of Ubuntu or Debian (or whatever other supporting distro you choose to install) so it runs full speed, with no emulation overhead.

I’ve been using a MediaWiki git checkout in an Ubuntu setup, using the standard PHP/Apache/MySQL/whatevers and manually running git & composer updates. The main downside to using WSL here is that services don’t get started automatically because it doesn’t run the traditional init process, but “service mysql start” etc works as expected and gets you working.

For editing, I use Visual Studio Code. This is not yet available as an ARM64-optimized build (the x86 frontend runs in emulation), but as of 1.41 it now includes ARM64 support for WSL integration — which means you can run the PHP linter on your code inside the Linux environment while your editor frontend is a native Windows GUI app. No wacky X11 hacks required.

emscripten stuff

The emscripten compiler for WebAssembly stuff works great, but doesn’t ship ARM or ARM64 builds for any platform yet in the emsdk tool.

You can build manually from source for now, and hopefully I can get builds working from the emsdk installer too (though you still would have to run the build yourself).

The main annoyance I had was that Ubuntu LTS currently ships an old node.js, which I had to supplement with a newer build to get my environment the way I wanted it for my scripts. :) This was pretty straightforward.

Rust stuff

Rust includes support for building code for Windows ARM64 — it has to, to support things like Firefox! — but the compiler & tools distribution comes as x86. I’m sure this will eventually get worked out, but for now if you install Rust on Windows you’ll get the x86 build and may have to manually add the aarch64 target. But it does work — I can compile and run my mtpng project for Windows 10 ARM64 on the device.
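If I recall correctly, that’s something like:

    # Add the Windows ARM64 target to the (x86) toolchain...
    rustup target add aarch64-pc-windows-msvc

    # ...then cross-compile for it explicitly.
    cargo build --release --target aarch64-pc-windows-msvc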

Within a WSL environment, you can install Rust for Linux aarch64 and it “just works” as you’d expect, as well.

Final notes

All in all, pretty happy with it. I might have preferred a Surface Laptop X with similar specs but a built-in keyboard, but at a desk or other …. “surface” … it works fine for typey things like programming.

Certainly I prefer the keyboard to the keyboard on my 2018 MacBook Pro. ;)

Overflowing stacks in WebAssembly, whoops!

Native C-like programming environments tend to divide memory into several regions:

  • static data contains predefined constants loaded from the binary, and space for global variables (“data”)
  • your actual code has to go in memory too! (“text”)
  • space for dynamic memory allocation (“heap”), which may grow “up” as more memory is needed
  • space for temporary data and return addresses for function calls (“stack”), which may grow “down” as more memory is needed

Usually this is laid out with the stack high in the address space and the heap lower down, if I recall correctly. More heap is allocated on demand via malloc, and the stack can either grow, or warn of an overflow, because the CPU’s memory manager can detect accesses to pages beyond the edge of the stack.

In emscripten’s WebAssembly porting environment, things are similar but a little different:

  • code doesn’t live in linear memory, so functions don’t have memory addresses
  • return addresses and small local variables also live outside linear memory, so only arrays/structs and variables that have their address taken must go on the stack.
  • usable memory is contiguous; you can’t have a sparse address space where the stack and heap both have room to grow.

As a result, the stack is a fixed size, which introduces some fragility, but it usually uses less space.

Currently the memory layout starts with static data, followed by the stack, and then the heap. The stack grows “down”, meaning that when you run off the end of the stack you end up in static data territory and can overwrite global variables. This leads to weird, hard-to-detect errors.
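As a contrived C sketch of the failure mode (my own illustration, assuming that static-data-then-stack layout; a native build would more likely just crash with a stack overflow):

    #include <stdio.h>
    #include <string.h>

    /* Lives in static data, which sits just below the stack in this layout —
       right where an overflowing stack ends up. */
    static char global_config[64] = "important settings";

    /* Each call burns about 64 KiB of stack; recurse deep enough and the stack
       pointer walks down into static data with no page fault to stop it. */
    static int recurse(int depth) {
        char scratch[64 * 1024];
        memset(scratch, depth & 0xFF, sizeof scratch);
        if (depth > 0) {
            return recurse(depth - 1) + scratch[0];
        }
        return scratch[0];
    }

    int main(void) {
        int sum = recurse(100); /* roughly 6 MiB of stack: past the 5 MiB default */
        /* Under this layout the globals can be silently corrupted, so this
           may print garbage instead of the original string. */
        printf("%d %s\n", sum, global_config);
        return 0;
    }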

When making debug builds with assertions there are some “cookie” checks to ensure that some specific locations at the edge of the stack have not been overwritten at various times, but this doesn’t always catch things if you only overwrote the beginning of a buffer in static data and not the part that had the cookie. :) It also doesn’t seem to trigger on library workflows where you’re not running through the emscripten HTML5/SDL/WebGL runtime.

There’s currently a PR open to reduce the default stack size from 5 MiB to 0.5 MiB, which reduces the amount of memory needed for small modules significantly, and we’re chatting a bit about detecting errors in the case that codebases have regressions…

One thing that’s come up is the possibility of moving the stack to before static data, so you’d have: stack, static data, heap.

This has two consequences:

  • any memory access beyond the stack end will wrap around 0 into unallocated memory at the top of the address space, causing an immediate trap — this is done by the memory manager for free, with no false positives or negatives
  • literal references to addresses in static data will be larger numbers, thus may take more bytes to encode in the binary (variable-length encoding is used in WebAssembly for constants)

Probably the safety and debugging win would be a bigger benefit than the size cost, though the encoding size could be a consideration for size-conscious optimized builds when not debugging.

Rust error handling with Result and Option (WebAssembly ABI)

In our last adventure we looked at C++ exceptions in WebAssembly with the emscripten compiler. Now we’re taking a look at the main error handling system for another language targeting WebAssembly, Rust.

Rust has a “panic”/“unwind” system similar to C++ exceptions, but catching panics is generally recommended against. It also currently doesn’t work in WebAssembly output, where Rust doesn’t use emscripten’s trick of calling out to JavaScript and is instead waiting for proper exception handling integration to arrive in Wasm.

Instead, most user-level Rust code expresses fallible operations using the Result or Option enum types. An Option<T> can hold either some value of type T — Some(T) — or no value — None. A Result<T, E> is fancier and can hold either a data payload Ok(T) or an Err(E) holding some information about the error condition. The E type can be a string or an enum carrying a description of the error, or it can just be an empty () unit type if you’re not picky.

Rust’s “?” operator is used as a shortcut for checking and propagating these error values.

For instance, for a function that stores to memory (which might fail if the address is invalid or non-writable) you could write:

    // Looking up a memory page for writing could fail,
    // if the page is unmapped or read-only.
    pub fn page_for_write(&mut self, addr: GuestPtr)
    -> Result<&mut PageData, Exception> {
       // ... returns either Ok(something)
       // or Err(Exception::PageFault)
    }

    pub fn store_8(&mut self, addr: GuestPtr, val: u8)
    -> Result<(), Exception> {

        // First some address calculations which cannot fail.
        let index = addr_to_index(addr);
        let page = addr_to_page(addr);

        let mem = self.page_for_write(page)?;

        // The "?" operator checked for an Err(_) condition
        // and if so, propagated it to the caller. If it was
        // Ok(_) then we continue on with that value.
        mem.store_8(index, val);

        Ok(())
    }

This looks reasonably clean in the source — you have some ?s sprinkled about, but they indicate that a branch may occur, which helps when reasoning about performance. It makes things more predictable: the addition of a destructor in C++ might make a fast path suddenly slow with emscripten’s C++/JavaScript hybrid exceptions, while the worst that happens here is a visible flag check, which is not so bad, right?

Well, mostly. The beautiful yet horrible thing about Rust is that automatic optimization lets you create these wonderful clean conceptual APIs that compile down to fast code. The horrible part is that you sometimes don’t know how or why it’s doing what it does, and there can still be surprises!

The good news is that there are never any JavaScript call-outs in your hot paths, unlike emscripten’s C++ exception catching. Yay! But the bad news is that in WebAssembly, returning a structure like a Result<T, E> or an Option<T> that doesn’t resolve to a single word can be complicated. And knowing what does resolve to a single word is complicated. And sometimes things get packed into integers and I don’t even know what’s doing that.

Crash course on enum layout

Rust “enums” can be much fancier than C/C++ “enums”: they not only carry a discriminant indicating which variant they hold, but can also carry structure-like data payloads within them.

For Option<T>, this means you have a 1-bit discriminant between Some(_) or None and then the storage payload of the T type. If T is itself a non-zero or non-nullable type, or an enum with free space (a “niche”) then it can be optimized into a single word — for instance an Option<&u8> takes only a single pointer word of space because null references are not possible, while Option<*const u8> would take two words. When we use Result<(), E> for a return value on fallible operations that don’t need to return data, like the store_8 function above, we get that same benefit.
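You can watch the niche optimization happen just by asking for type sizes; here’s a quick sketch (my own example — the exact byte counts depend on the target’s pointer width):

    use std::mem::size_of;

    #[allow(dead_code)]
    enum Exception {
        PageFault,
        IllegalInstruction,
    }

    fn main() {
        // &u8 can never be null, so Option<&u8> reuses the null value for None:
        // it's guaranteed to be the same size as the bare reference.
        assert_eq!(size_of::<Option<&u8>>(), size_of::<&u8>());

        // A raw pointer *can* be null, so there's no niche to borrow and the
        // discriminant needs its own space.
        println!("Option<&u8>           = {} bytes", size_of::<Option<&u8>>());
        println!("Option<*const u8>     = {} bytes", size_of::<Option<*const u8>>());

        // A fieldless error enum has plenty of spare bit patterns, so a
        // Result<(), Exception> packs down small too.
        println!("Result<(), Exception> = {} bytes", size_of::<Result<(), Exception>>());
    }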

When you can return a single word, life is simple. At the low-level JIT implementation it’s probably passed in a register, and everything is awesome and fast.

However, a data-bearing Result<T, E> takes two words, and WebAssembly can’t return two words at once from a function. Instead, most of the time, the function is transformed to not use the return value at all: the caller passes the address of a stack location and the structure gets written to memory there, which means two memory stores and at least one memory read to check the discriminant value. Plus you have to manipulate the stack pointer, which you might not have needed otherwise in some functions.

This is not great, but is probably better than calling out to JavaScript and back. I haven’t yet benchmarked tight loops with this.

Packed enums in single words

I’ve also sometimes seen Result enums get packed into a single word, and I don’t yet understand the circumstances that this happens.

On a WebAssembly build, I’ve had it happen with Result<u16, E> where E is a small fieldless enum: it’s packed into a single 32-bit integer. But Result<u8, E> is sent through the stack, and Result<u32, E> is too, though it could have been packed into a single 64-bit integer.

On native x86-64 builds on macOS, I see the word packing for both Result<u16, E> and Result<u32, E>… meanwhile Result<u64, E> uses the stack, but Result<u8, E> uses two words returned in different registers.

I know the internals of Rust layout and calling conventions are officially undocumented and can change, but it’d be nice to know roughly what some of them are. ;)

Further information on Rust layout and ABI

Exception handling in emscripten: how it works and why it’s disabled by default

One of my “fun” projects I’m poking at on the side is an x86-64 emulator for the web using WebAssembly (not yet online, nowhere near working). I’ve done partial spike implementations in C++ and Rust with a handful of interpreted opcodes and emulation of a couple Linux syscalls; this is just enough for a hand-written “Hello, world!” executable. ;)

One of the key questions is how to handle exceptional conditions within the emulator, such as an illegal instruction, page fault, etc… There are multiple potentially fallible operations within the execution of a single opcode, so the method chosen to communicate those exceptions could make a big impact on performance.

For C++, one method which looks nice in the source code and can be good on native platforms is to use C++’s normal exception facility. This turns out to be somewhat problematic in emscripten builds for WebAssembly, which becomes clear when you look at how it’s implemented.

In fact it’s considered enough of a performance problem that catching exceptions is disabled by default in emscripten — throwing will abort the program with no way to catch it except at the surrounding JavaScript host page level. To get exceptions to actually work, you must pass -s DISABLE_EXCEPTION_CATCHING=0 to the compiler.

Throwing is easy

Let’s write ourselves a potentially-throwing function:

static volatile int should_explode = 0;

void kaboom() {
    if (should_explode) {
        throw "kaboom";
    }
}

Pretty exciting! The volatile int is just to make sure the optimizer doesn’t do anything clever for us, as real world functions don’t _always_ or _never_ throw, they _sometimes_ throw. At -O1 this compiles to WebAssembly something like this:

 (func $kaboom\28\29 (; 26 ;) (type $11)
  (local $0 i32)
  (if
   (i32.load
    (i32.const 13152)
   )
   (block
    (i32.store
     (local.tee $0
      (call $__cxa_allocate_exception
       (i32.const 4)
      )
     )
     (i32.const 1024)
    )
    (call $__cxa_throw
     (local.get $0)
     (i32.const 12432)
     (i32.const 0)
    )
    (unreachable)
   )
  )
 )

The function $__cxa_throw calls into emscripten’s JavaScript-side runtime to throw a native JS exception.

That’s right, WebAssembly doesn’t have a way to throw — or catch — an exception itself. Yet. They’re working on it, but it’s not done yet.

Calling best case

So now we want to call that function. In the best case scenario, there’s a function that doesn’t have any cleanup work to do after a potentially throwing call: no try/catch clauses, no destructors, no stack allocations to clean up.

void dostuff_nocatch() {
    printf("Doing nocatch stuff\n");
    kaboom();
    printf("Possibly unreachable stuff\n");
}

This compiles to nice, simple, fast code with a direct call to kaboom:

 (func $dostuff_nocatch\28\29 (; 32 ;) (type $11)
  (drop
   (call $puts
    (i32.const 1106)
   )
  )
  (call $kaboom\28\29)
  (drop
   (call $puts
    (i32.const 1126)
   )
  )
 )

If kaboom throws, the JavaScript + WebAssembly runtime unwinds the native call stack up through dostuff_nocatch until whatever surrounding code (if any) catches it. The second printf/puts call never gets made, as expected.

Catching gets clever

But if you have variables with destructors, as with the RAII pattern, or a try/catch block, things get weird fast.

void dostuff_catch() {
    try {
        printf("Doing ex stuff\n");
        kaboom();
    } catch (const char *e) {
        printf("Caught stuff\n");
    }
}

compiles to the much longer:

 (func $dostuff_catch\28\29 (; 27 ;) (type $11)
  (local $0 i32)
  (drop
   (call $puts
    (i32.const 1078)
   )
  )
  (i32.store
   (i32.const 14204)
   (i32.const 0)
  )
  (call $invoke_v
   (i32.const 1)
  )
  (local.set $0
   (i32.load
    (i32.const 14204)
   )
  )
  (i32.store
   (i32.const 14204)
   (i32.const 0)
  )
  (block $label$1
   (if
    (i32.eq
     (local.get $0)
     (i32.const 1)
    )
    (block
     (local.set $0
      (call $__cxa_find_matching_catch_3
       (i32.const 12432)
      )
     )
     (br_if $label$1
      (i32.ne
       (call $getTempRet0)
       (call $llvm_eh_typeid_for
        (i32.const 12432)
       )
      )
     )
     (drop
      (call $__cxa_begin_catch
       (local.get $0)
      )
     )
     (drop
      (call $puts
       (i32.const 1093)
      )
     )
     (call $__cxa_end_catch)
    )
   )
   (return)
  )
  (call $__resumeException
   (local.get $0)
  )
  (unreachable)
 )

A few things to note.

Here the direct call to kaboom is replaced with a call to the indirected invoke_v function:

  (call $invoke_v
   (i32.const 1)
  )

which is implemented in the JS runtime to wrap a try/catch around the call:

function invoke_v(index) {
  var sp = stackSave();
  try {
    dynCall_v(index);
  } catch(e) {
    stackRestore(sp);
    if (e !== e+0 && e !== 'longjmp') throw e;
    _setThrew(1, 0);
  }
}

Note this saves the stack position (in Wasm linear memory, for stack-allocated data — separate from the native/JS/Wasm call stack) and restores it, which allows functions that allocate stack data to participate in the faster call mode by not having to clean up their own stack pointer modifications.

If an exception is caught, a global flag is set which is checked and re-cleared after the call:

  ;; Load the threw flag into a local variable
  (local.set $0
   (i32.load
    (i32.const 14204)
   )
  )
  ;; Clear the threw flag for the next call
  (i32.store
   (i32.const 14204)
   (i32.const 0)
  )
  ;; Branch based on the stored state
  (block $label$1
   (if
    (i32.eq
     (local.get $0)
     (i32.const 1)
    )
    ....
   )
  )

If the flag wasn’t set after all, then the function continues on its merry way. If it was set, then it does its cleanup and either eats the exception (for catch) or re-throws it (for automatic cleanup like destructors).

Summary and surprises

So there’s several things that’ll slow down your code using C++ exception catching in emscripten, even if you never throw in the hot paths:

  • when catching, potentially throwing calls are indirected through JavaScript
  • destructors create implicit catch blocks, so you may be catching more than you think

There’s also some surprises:

  • in C++, every extern “C” function is potentially throwing!
  • “noexcept” functions that call potentially throwing functions add implicit catch blocks in order to call abort(), so they can guarantee that they do not throw exceptions themselves

So don’t just add “noexcept” to code willy-nilly or it may get worse.

It would be faster to use my own state flag and check it after every call (which could be hidden behind a template or a macro or something, probably, at least in part), since this would avoid calling out through JavaScript and indirecting all the calls.
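Something like this hypothetical sketch, reusing the kaboom example from above (not actual emulator code):

    #include <cstdio>

    // One global flag checked after each fallible call — no JavaScript round trip,
    // no indirected calls, just an ordinary branch in the WebAssembly output.
    static int g_exception = 0;   // 0 = no error, otherwise an error code

    // Propagate an error to our own (void) caller if the last call set the flag.
    #define CHECK_EXC() do { if (g_exception) return; } while (0)

    static volatile int should_explode = 0;

    void kaboom() {
        if (should_explode) {
            g_exception = 1;      // instead of: throw "kaboom";
        }
    }

    void dostuff_flagged() {
        printf("Doing flag-checked stuff\n");
        kaboom();
        CHECK_EXC();              // cheap, visible check instead of an implicit catch
        printf("Possibly unreachable stuff\n");
    }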

Next I’ll look at another way in the Rust version, using Result<T, E> return types and the error propagation “?” shortcut in Rust…

Building Chromium for Windows ARM64

I heard a couple months back that someone had confirmed that you can build Chromium (open-source version of Chrome browser) for Windows ARM64… Since there’s still no general release, figured I’d give it a shot to test things in.

You have to build on Windows x64 (it won’t build on-device, it’d be too slow and may require running some x64 binaries from the tooling which wouldn’t work) and then copy the output over to an ARM64 device.

Tricky bits:

  • I built with Visual Studio 2019, but something in the build setup seems to want Visual Studio 2017’s C runtime library? Or something? Anyway I had to change a couple references in build/vs_toolchain.py from ‘Microsoft.VC141.CRT’ to ‘Microsoft.VC142.CRT’ and *.DebugCRT because I could not get it to look in the directory where the VC141 files were.
  • If you don’t turn on use_jumbo_build it will take a vveerryy lloonngg time. If you do, though, it will use more memory. Beware.
  • I had a lot of problems with the provided python instance not picking up imports from the depot_tools package, even after removing my other Python instances. Had to set PYTHONPATH=C:\src\depot_tools

My gn args for the build were:

# Required or else it defaults to x64
target_cpu = "arm64"

# To get a release-quality build for benchmarking
is_debug = false
is_component_build = false

# Coalesce multiple files and skip the Native Client stuff which won't be used in testing
use_jumbo_build = true
enable_nacl = false

I copied the resulting output directory over to the ARM64 machine in the morning; cd into the out dir and just run “chrome” and bam it works!

After enabling SIMD in chrome://flags, I tested ogv.js’s WebAssembly version of the dav1d AV1 decoder, with excellent results. Matching my earlier experiments with the Linux version under WSL (where I couldn’t get playback working due to missing audio and slow X11 drawing, but decoding benchmarks worked) I see decode times about 2x as fast as Firefox (which only uses a baseline compiler on ARM64, so isn’t as well optimized).

Enabling SIMD roughly doubles the throughput… and enabling threading roughly doubles it again.

As a result, I can get 720p24 AV1 video decoding and playing in WebAssembly fairly stably, with a few dropped frames here and there:

That’s not bad!

Very very much hoping that WebAssembly threading and SIMD will come to Safari, where we use ogv.js to play Ogg/WebM/VP9 (and in the future AV1) video clips for Wikipedia. The ARM64 chips in recent iPads and iPhones are already amazing, and both desktops and mobiles will benefit from the performance improvements.

Also very much looking forward to Chrome and Edge coming to native ARM64 Windows. And looking forward to Firefox’s next-gen code generation engine, cranelift, improving their situation on these chips…. :)