brionv – reticula, electronica, & oddities

Testing in-browser video transcoding with MediaRecorder

A few months ago I made a quick test transcoding video from MP4 (or whatever else the browser can play) into WebM using the in-browser MediaRecorder API.

I’ve updated it to work in Chrome, using a <canvas> element as an intermediary recording surface as captureStream() isn’t available on <video> elements yet there.

Live demo: https://brionv.com/misc/browser-transcode-test/capture.html

There are a couple advantages of re-encoding a file this way versus trying to do all the encoding in JavaScript, but also some disadvantages…

Pros

actual encoding should use much less CPU than JavaScript cross-compile
less code to maintain!
don’t have to jump through hoops to get at raw video or audio data

Cons

MediaRecorder is realtime-oriented:
- will never decode or encode faster than realtime
- if encoding is slower than realtime, lots of frames are dropped
- on my MacBook Pro, realtime encoding tops out around 720p30, but eg phone camera videos will often be 1080p30 these days.
browser must actually support WebM encoding or it won’t work (eg, won’t work in Edge unless they add it in future, and no support at all in Safari)
Firefox and Chrome both seem to be missing Vorbis audio recording needed for base-level WebM (but do let you mix Opus with VP8, which works…)

So to get frame-rate-accurate transcoding, and to support higher resolutions, it may be necessary to jump through further hoops and try JS encoding.

I know this can be done — there are some projects compiling the entire ffmpeg package in emscripten and wrapping it in a converter tool — but we’d have to avoid shipping an H.264 or AAC decoder for patent reasons.

So we’d have to draw the source <video> to a <canvas>, pull the RGB bits out, convert to YUV, and run through lower-level encoding and muxing… oh did I forget to mention audio? Audio data can be pulled via Web Audio, but only in realtime.

So it may be necessary to do separate audio (realtime) and video (non-realtime) capture/encode passes, then combine into a muxed stream.

Canvas, Web Audio, MediaStream oh my!

I’ve often wished that for ogv.js I could send my raw video and audio output directly to a “real” <video> element for rendering instead of drawing on a <canvas> and playing sound separately to a Web Audio context.

In particular, things I want:

Not having to convert YUV to RGB myself
Not having to replicate the behavior of a <video> element’s sizing!
The warm fuzzy feeling of semantic correctness
Making use of browser extensions like control buttons for an active video element
Being able to use browser extensions like sending output to ChromeCast or AirPlay
Disabling screen dimming/lock during playback

This last is especially important for videos of non-trivial length, especially on phones which often have very aggressive screen dimming timeouts.

Well, in some browsers (Chrome and Firefox) now you can do at least some of this. 🙂

I’ve done a quick experiment using the <canvas> element’s captureStream() method to capture the video output — plus a capture node on the Web Audio graph — combining the two separate streams into a single MediaStream, and then piping that into a <video> for playback. Still have to do YUV to RGB conversion myself, but final output goes into an honest-to-gosh <video> element.

To my great pleasure it works! Though in Firefox I have some flickering that may be a bug, I’ll have to track it down.

Some issues:

Flickering on Firefox. Might just be my GPU, might be something else.
The <video> doesn’t have insight to things like duration, seeking, etc, so can’t rely on native controls or API of the <video> alone acting like a native <video> with a file source.
Pretty sure there are inefficiencies. Have not tested performance or checked if there’s double YUV->RGB->YUV->RGB going on.

Of course, Chrome and Firefox are the browsers I don’t need ogv.js for for Wikipedia’s current usage, since they play WebM and Ogg natively already. But if Safari and Edge adopt the necessary interfaces and WebRTC-related infrastructure for MediaStreams, it might become possible to use Safari’s full screen view, AirPlay mirroring, and picture-in-picture with ogv.js-driven playback of Ogg, WebM, and potentially other custom or legacy or niche formats.

Unfortunately I can’t test whether casting to a ChromeCast works in Chrome as I’m traveling and don’t have one handy just now. Hoping to find out soon! 😀

JavaScript async/await fiddling

I’ve been fiddling with using ECMAScript 2015 (“ES6”) in rewriting some internals for ogv.js, both in order to make use of the Promise pattern for asynchronous code (to reduce “callback hell”) and to get cleaner-looking code with the newer class definitions, arrow functions, etc.

To do that, I’ll need to use babel to convert the code to the older ES5 version to run in older browsers like Internet Explorer and old Safari releases… so why not go one step farther and use new language features like asynchronous functions that are pretty solidly specced but still being implemented natively?

Not yet 100% sure; I like the slightly cleaner code I can get, but we’ll see how it functions once translated…

Here’s an example of an in-progress function from my buffering HTTP streaming abstraction, currently being rewritten to use Promises and support a more flexible internal API that’ll be friendlier to the demuxers and seek operations.

I have three versions of the function: one using provisional ES2017 async/await, one using ES2015 Promises directly, and one written in ES5 assuming a polyfill of ES2015’s Promise class. See the full files or the highlights of ES2017 vs ES2015:

The first big difference is that we don’t have to start with the “new Promise((resolve,reject) => {…})” wrapper. Declaring the function as async is enough.

Then we do some synchronous setup which is the same:

Now things get different, as we perform one or two asynchronous sub-operations:

In my opinion the async/await code is cleaner:

First it doesn’t have as much extra “line noise” from parentheses and arrows.

Second, I can use a try/finally block to do the final state change only once instead of on both .then() and .catch(). Many promise libraries will provide an .always() or something but it’s not standard.

Third, I don’t have to mentally think about what the “intermittent return” means in the .then() handler after the triggerDownload call:

Here, returning a promise means that that function gets executed before moving on to the next .then() handler and resolving the outer promise, whereas not returning anything means immediate resolution of the outer promise. It ain’t clear to me without thinking about it every time I see it…

Whereas the async/await version:

makes it clear with the “await” keyword what’s going on.

Updated: I managed to get babel up and running; here’s a github gist with expanded versions after translation to ES5. The ES5 original is unchanged; the ES2015 promise version is very slightly more verbose, and the ES2017 version becomes a state machine monstrosity. 😉 Not sure if this is ideal, but it should preserve the semantics desired.

Dell P2415Q 24″ UHD monitor review

Last year I got two Dell P2415Q 24″ Ultra-HD monitors, replacing my old and broken 1080p monitor, to use with my MacBook Pro. Since the model’s still available, thought I’d finally post my experience.

tl;dr:

Picture quality: great
Price: good for what you get and they’re cheaper now than they were last year.
Functionality: mixed; some problems that need workarounds for me.

So first the good: the P2415Q is the “right size, right resolution” for me; with an operating system such as Mac OS X, Windows 10, or some Linux environments that handles 200% display scaling correctly, it feels like a 24″ 1080p monitor that shows much, much sharper text and images. When using the external monitors with my 13″ MacBook Pro, the display density is about the same as the internal display and the color reproduction seems consistent enough to my untrained eye that it’s not distracting to move windows between the laptop and external screens.

Two side by side plus the laptop makes for a vveerryy wwiiddee desktop, which can be very nice when developing & testing stuff since I’ve got chat, documentation, terminal, code, browser window, and debugger all visible at once. 🙂

The monitor accepts DisplayPort input via either full-size or mini, and also accepts HDMI (limited to 30 Hz at the full resolution, or full 60Hz at 1080p) which makes it possible to hook up random devices like phones and game consoles.

There is also an included USB hub capability, which works well enough but the ports are awkward to reach.

The bad: there are three major pain points for me, in reducing order of WTF:

Sometimes the display goes black when using DisplayPort; the only way to resolve it seems to be to disconnect the power and hard-reset the monitor. Unplugging and replugging the DisplayPort cable has no effect. Switching cables has no effect. Rebooting computer has no effect. Switching the monitor’s power on and off has no effect. Have to reach back and yank out the power.
There are neither speakers nor audio passthrough connectors, but when connecting over HDMI devices like game consoles and phones will attempt to route audio to the monitor, sending all your audio down a black hole. Workaround is to manually re-route audio back to default or attach a USB audio output path to the connected device.
Even though the monitor can tell if there’s something connected to each input or not, it won’t automatically switch to the only active input. After unplugging my MacBook from the DisplayPort and plugging a tablet in over HDMI, I still have to bring up the on-screen menu and switch inputs.

The first problem is so severe it can make the unit appear dead, but is easily worked around. The second and third may or may not bother you depending on your needs.

So, happy enough to use em but there’s real early adopter pain in this particular model monitor.

Exploring VP9 as a progressive still image codec

At Wikipedia we have long articles containing many images, some of which need a lot of detail and others which will be scrolled past or missed entirely. We’re looking into lazy-loading, alternate formats such as WebP, and other ways to balance display density vs network speed.

I noticed that VP9 supports scaling of reference frames from different resolutions, so a frame that changes video resolution doesn’t have to be a key frame.

This means that a VP9-based still image format (unlike VP8-based WebP) could encode multiple resolutions to be loaded and decoded progressively, at each step encoding only the differences from the previous resolution level.

So to load an image at 2x “Retina” display density, we’d load up a series of smaller, lower density frames, decoding and updating the display until reaching the full size (say, 640×360). If the user scrolls away before we reach 2x, we can simply stop loading — and if they scroll back, we can pick up right where we left off.

I tried hacking up vpxenc to accept a stream of concatenated PNG images as input, and it seems plausible…

Demo page with a few sample images (not yet optimized for network load; requires Firefox or Chrome):
https://media-streaming.wmflabs.org/pic9/

Compared to loading a series of intra-coded JPEG or WebP images, the total data payload to reach resolution X is significantly smaller. Compared against only loading the final resolution in WebP or JPEG, without any attempt at tuning I found my total payloads with VP9 to be about halfway between the two formats, and with tuning I can probably beat WebP.

Currently the demo loads the entire .webm file containing frames up to 4x resolution, seeking to the frame with the target density. Eventually I’ll try repacking the frames into separately loadable files which can be fed into Media Source Extensions or decoded via JavaScript… That should prevent buffering of unused high resolution frames.

Some issues:

Changing resolutions always forces a keyframe unless doing single-pass encoding with frame lag set to 1. This is not super obvious, but is neatly enforced in encoder_set_config in vp9_cx_iface.h! Use –passes=1 –lag-in-frames=1 options to vpxenc.

Keyframes are also forced if width/height go above the “initial” width/height, so I had to start the encode with a stub frame of the largest size (solid color, so still compact). I’m a bit unclear on whether there’s any other way to force the ‘initial’ frame size to be larger, or if I just have to encode one frame at the large size…

There’s also a validity check on resized frames that forces a keyframe if the new frame is twice or more the size of the reference frame. I used smaller than 2x steps to work around this (tested with steps at 1/8, 1/6, 1/4, 1/2, 2/3, 1, 3/2, 2, 3, 4x of the base resolution).

I had to force updates of the golden & altref on every frame to make sure every frame ref’d against the previous, or the decoder would reject the output. –min-gf-interval=1 isn’t enough; I hacked vpxenc to set the flags on the frame encode to VP8_EFLAG_FORCE_GF | VP8_EFLAG_FORCE_ARF.

I’m having trouble loading the VP9 webm files in Chrome on Android; I’m not sure if this is because I’m doing something too “extreme” for the decoder on my Nexus 5x or if something else is wrong…

Scaling video playback on slow and fast CPUs in ogv.js

Video playback has different performance challenges at different scales, and mobile devices are a great place to see that in action. Nowhere is this more evident than in the iPhone/iPad lineup, where the same iOS 9.3 runs across several years worth of models with a huge variance in CPU speeds…

In ogv.js 1.1.2 I’ve got the threading using up to 3 threads at maximum utilization (iOS devices so far have only 2 cores): main thread, video decode thread, and audio decode thread. Handling of the decoded frames or audio packets is serialized through the main thread, where the player logic drives the demuxer, audio output, and frame blitting.

On the latest iPad Pro 9.7″, advertising “desktop-class performance”, I can play back the Blender sci-fi short Tears of Steel comfortably at 1080p24 in Ogg Theora:

The performance graph shows frames consistently on time (blue line is near the red target line) and a fair amount of headroom on the video decode thread (cyan) with a tiny amount of time spent on the audio thread (green) and main thread (black).

At this and higher resolutions, everything is dominated by video decode time — if we can keep up with it we’re golden, but if we get behind everything would ssllooww ddoownn badly.

On an iPad Air, two models behind, we get similar performance on the 720p24 version, at about half the pixels:

We can see the blue bars jumping up once a second, indicating sensitivity to the timing report and graph being updated once a second on the main thread, but overall still good. Audio in green is slightly higher but still ignorable.

On a much older iPad 3, another two models behind, we see a very different graph as we play back a mere 240p24 quarter-SD resolution file:

The iPad 3 has an older generation, 32-bit processor, and is in general pretty sluggish. Even at a low resolution, we have less headroom for the cyan bars of the video decode thread. Blue bars dipping below the red target line show we’re slipping on A/V sync sometimes. The green bars are much higher, indicating the audio decode thread is churning a lot harder to keep our buffers filled. Last but not least the gray bars at the bottom indicate more time spent in demuxing, drawing, etc on the main thread.

On this much slower processor, pushing audio decoding to another core makes a significant impact, saving an average of several milliseconds per frame by letting it overlap with video decoding.

The gray spikes from the main thread are from the demuxer, and after investigation turn out to be inflated by per-packet overhead on the tiny Vorbis audio packets… Such as adding timestamps to many of the packets. Ogg packs multiple small packets together into a single “page”, with only the final packet at the end of the page actually carrying a timestamp. Currently I’m using liboggz to encapsulate the demuxing, using its option to automatically calculate the missing timestamp deltas from header data in the packets… But this means every few frames the demuxer suddenly releases a burst of tiny packets with a 15-50ms delay on the main thread as it walks through them. On the slow end this can push a nearly late frame into late territory.

I may have further optimizations to make in keeping the main thread clear on slower CPUs, such as more efficient handling of download progress events, but overlapping the video and audio decode threads helps a lot.

On other machines like slow Windows boxes with blacklisted graphics drivers, we also benefit from firing off the next video decode before drawing the current frame — if WebGL is unexpectedly slow, or we fall back to CPU drawing, it may take a significant portion of our frame budget just to paint. Sending data down to the decode thread first means it’s more likely that the drawing won’t actually slow us down as much. This works wonders on a slow ARM-based Windows RT 8.1 Surface tablet. 🙂

Thoughts on Ogg adaptive streaming

So I’d like to use adaptive streaming for video playback on Wikipedia and Wikimedia Commons, automatically selecting the appropriate source format and resolution at runtime based on bandwidth and CPU availability.

For Safari, Edge, and IE users, that means figuring out how to rig a Media Source Extensions-like interface into ogv.js to let the streaming handler inject its buffered data into the demuxer and codecs instead of letting the player handle its own buffering.

It also means I have to figure out how to do adaptive stream switching for Ogg streams and Theora video, since WebM VP8 still decodes too slowly in ogv.js to rely on for deployment…

Theory vs Theora

At its base, adaptive streaming relies on the ability to feed the decoders with data from another stream without them freaking out and demanding a pause or reset. We can either read packets from a subset of a monolithic file for each source, or from a bunch of tiny segmented files.

In order to do this, generally you need to switch on video keyframe boundaries: each keyframe represents a point in the data stream where the video decoder can reset its state.

For WebM with VP8 and VP9 codecs, the decoders are pretty good at this. As long as you came in on a keyframe boundary you can just start feeding it packets at a new resolution and it’ll happily output frames at the new resolution.

For Ogg Theora, there are a few major impediments.

Ogg stream serial numbers

At the Ogg stream level: each Ogg logical bitstream gets a random serial number; those serial numbers will not match across separate encodings at different resolutions.

Ogg explicitly allows for “chaining” of complete bitstreams, where one ends and you just tack another on, but we’re not quite doing that here… We want to be able to switch partway through with minimal interruption.

For Vorbis audio, this might require some work if pulling audio+video together from combined .ogv files, but it gets simpler if there’s one .oga audio stream and separate video-only .ogv streams — we’d essentially have separate demuxer contexts for audio and video, and would not need to meddle with the audio.

For the Theora video stream this is probably ok too, since when we reach a switch boundary we also need to feed the decoder with…

Header packets

Every Theora video stream sets up start codes at the beginning of the stream in its three header packets. This means that encodings of the same video at different resolutions will have different header setup.

So, when we switch sources we’ll need to reinitialize the Theora decoder with the header packets from the target stream; then it should be safe to feed new packets into it from our arbitrary start position.

This isn’t a super exotic requirement; I’ve seen some provision for ‘start codes’ for MP4 adaptive streaming too.

Keyframe timing

More worrisome is that keyframe timing is not predictable in a Theora stream. This is actually due to the libtheora encoder internals — it allows you to specify a maximum keyframe interval, but it may decide at any time to insert a keyframe on its own if it thinks it’s more efficient to store a frame that way, at which point the interval starts counting from there instead of the last scheduled keyframe.

Since this heuristic is determined based on actual frame data, the early keyframes will appear in different times and places for renderings at different resolutions… And so will every keyframe following them.

This means you don’t have switch points that are consistent between sources, breaking the whole model!

It looks like a keyframe can be forced by changing the keyframe interval to 1 right before a desire keyframe, then changing it back to the desired value after. This would result in still getting some early keyframes at unpredictable times, but then also getting predictable ones. As long as the switchover points aren’t too often, that’s probably fine — just keep decoding over the extra keyframes, but only switch/segment on the predictable ones.

Streams vs split files

Another note: it’s possible to either store data as one long file per source, or to split it up into small chunk files at each keyframe boundary.

Chunk files are nice because they can be streamed easily without use of the HTTP ‘Range’ header and they’re friendly to cache layers. Long files can be easier to manage on the server, but Wikimedia ops folks have told me that the way large files are stored doesn’t always interact ideally with the caching layer and they’d be much happier with split chunk files!

A downside of chunks is that it’s harder to download a complete copy of a file at a given resolution for offline playback. But, if we split audio and video tracks we’re in a world where that’s hard anyway… Can either just say “download the full resolution source then!” Or provide a remuxer to produce combined files for download on the fly from the chunks… 🙂
The keyframe timing seems the ugliest issue to deal with; may need to patch ffmpeg2theora or ffmpeg to work around it, but at least shouldn’t have to mess with libtheora itself…

WebM seeking coming in ogv.js 1.1.2

Seeking in WebM playback will finally be supported in ogv.js 1.1.2. Yay! Try it out!

This took some fancy footwork, as the demuxer library I’m using (nestegg) only provides a synchronous i/o-using interface for seeking: on the first seek, it needed to be able to first do a seek to the location of the cues in the file (they’re usually at the end in WebM, not at the beginning!), then read in the cue data, and then run a second seek to the start of a cluster.

On examining the innards of the library I found that it’s fairly safe to ‘restart’ the operation directly after the seek attempts, which saved me the trouble of patching the library code; though I may come up with a patch to more cleanly & reliably do this.

For the initial hack, I have my i/o callbacks detect that an attempt was made to seek outside the buffer range, and then when the nestegg library function fails out, the demuxer sees the seek attempt and passes it back to the caller (OGVPlayer) which is able to trigger a seek on the actual, asynchronous i/o layer. Once the new data starts arriving, we call back into the demuxer to read the cues or confirm that we’ve seeked to the right place and can continue decoding. As an additional protection against the library freaking out, I make sure that the entire cues element has been buffered before attempting to read it.

I’ve also fixed a bug that caused WebM playback to occasionally die with an out of memory error. This one was caused by not freeing packet data in the demuxer. *headdesk* At least that bug was easier. 😉

This gets WebM VP8 playback working about as well as Ogg Theora playback, except for the decoding being about 5x slower! Well, I’ve got plans for that. 😀

VP8 parallelization continued

Following up on earlier musings… Found an interesting analysis by someone who did some work in this area a few years ago.

Based on the way data flows from macroblocks adjacent above or to the left, it is safe to parallelize multiple super blocks in diagonal lines running from top-right towards the bottom-left. This means you start with a batch of one macroblock, then a batch of two, then three…. Up to the maximum diagonal dimension (30 blocks at 480p or 68 at 1080p) then back down again to one at the far bottom right corner. Hmm, not exactly diagonal, actually half diagonal?

(Based on my reading the filter stage doesn’t need to access the above-right block which simplifies things, but I could be wrong… Intraframe prediction looks scarier though and may need the different angle cut. Anyway the half diagonal order should work for the entire set of all operations…).

This breakdown should be suitable both for CPU worker-based threading (breaking the batches into smaller sub-batches per core) and for WebGL shader based GPU work (where each large batch would issue several draw calls, each processing up to the full batch size of macroblocks).

For CPU work there could also be pipelining between the data stages, though if doing full batching that may not be necessary.

Data locality and latency

On modern computers, accessing memory is often the slowest part of a naively written calculation. If the memory you’re working with isn’t in cache, fetching it can be verrrry slow. And if you need to transfer data between the CPU and GPU things get even nastier.

Unpacking the entropy-coded data structures for an entire frames worth of macroblocks then processing them in a different order sounds like it might be expensive. But it shouldn’t be insanely so; most processing will be local to each block.

For GPU usage, data also flows mostly one way from the CPU into textures and arrays uploaded into the GPU, where random texel access should be fast. We shouldn’t need to read any of that data back to the CPU until the end of the loop filter, when we read the YUV buffers back to feed into the next frame’s predictions and output to the player.

What we do need for the GPU is to read the results of each batch’s computations back into the next batch. As long as I can render into a texture and use that texture as a source for the next call I think that all works as I need it.

Loop filter madness

The loop filter stage reduces artifacting at the edges of macroblocks and sub-blocks from motion prediction and DCT fun. Because it’s very precisely specced, and the output feeds back into the next frame’s predictions, it’s important to get this right.

per spec, each macroblock may apply up to four filter passes:

left edge

subblock left edges

top edge

subblock top edges

Along each edge, for each pixel along the edge the boundaries are tested for a threshold and a filter may or may not be applied, variously to one, two, or three pixels deep from the edge. For the macroblock edges, when the filter applies it also modifies the adjacent block data!

It’s all pretty funky, but it looks parallelizable within limits.

Ah, now to find time to research this further while still getting other stuff done. 😉

Peeking into VP8 video decoding performance

The ogv.js distribution includes Ogg Theora video decoding, which we use on Wikipedia to play back our media files in Safari, IE, and Edge, but also has an experimental mode for WebM files with VP8 video.

WebM/VP8 is better supported by many tools, and provides higher quality video at lower bandwidth usage than Ogg Theora… There are two major reasons I haven’t taken WebM out of “experimental” status:

The demuxer library (nestegg) was hacked in quickly and I need to add support for seeking, “not crashing”, etc
Decoding VP8 video is several times slower than decoding Theora at the same resolution.

The first is a simple engineering problem; I can either hack up nestegg or replace it with something that fits the i/o model I’m working with better.

The second is an intrinsic complexity problem: although the two formats are broadly similar technologies, VP8 requires more computation to decode a frame than Theora does.

Add to this the fact that we have some environmental limitations on common CPU optimizations for parallelizable code in the C libraries:

JavaScript “Worker” threads are different from the low-level pthreads threading model used by C code, and the interfaces required to more closely emulate it are not yet available in Safari or Edge.
SIMD (“Same Instruction Multiple Data”) processing is not available in Safari, and not yet production-enabled in Edge.

So, if we can’t use SIMD instructions to parallelize tiny bits of processing, and we can’t simply crank up multithreading to use a now-ubiquitous second CPU core, how can we split up the work?

The first step, I think, is to determine some data boundaries where we might hope to be able to export data from the libvpx library before it’s fully done processing.

What’s what

The VP8 decoder & bitstream format is defined in RFC 6386, from which we can roughly divide decoding into four stages:

…decode input stream entropy encoding…
reconstruct base signal by applying motion vectors against a reference frame
reconstruct residual signal by applying inverse DCT
apply loop filter

The entropy encoding isn’t really a separate stage, but happens as other data gets read in and then fed into each other stage. It’s not really parallelizable itself either.

Aiming high

So before I go researching more on trying to optimize some of these steps, what’s actually the biggest, slowest stuff to work on?

I did a quick profiling run playing back some 480p WebM in Chrome (pretty good compiler, and the profiling tools support profiling just one worker thread which isolates the VP8 decoder nicely). Broke down the resulting function self-time list by rough category and found:

Filter: 54.40%
Motion: 22.75%
IDCT: 10.55%
Other: 12.31%

(“Other” included any function I didn’t tag, as well as general overhead.)

Ouch! Filtering looks like a good first application of a separate step.

Possible directions – staying on CPU

If we stick with the CPU, we can create further Worker threads and send blocks of data to be filtered. In many cases, even when processing of one macroblock to another is hard to parallelize because of data dependencies, there is natural parallelism in the color system — the Y (“luma”, or brightness) plane can be processed by one thread while the U and V (“chroma”, or color) planes can be processed independently by another.

However splitting processing between luma and chroma has a limited benefit. There’s twice as much luma data as chroma, so you save at most 33% of the runtime here.

Macroblocks and subblocks

VP8 divides most processing & data into “macroblocks” of 16×16 pixels, composed of 24 “subblocks” of 4×4 pixels (that’s 16 subblocks of luma data, plus 4 each from the two lower-resolution chroma planes).

In many cases we can parallelize at the subblock level, which divides 24 evenly into 2 cores, or even 4 or 8 cores! But the most performance-sensitive devices will usually only have 2 CPU cores available, giving us only a 50% potential speedup.

Going crazy – GPU time?

Graphics processors are much more aggressively multithreaded, with multiple cores and ability to handle many more simultaneous operations.

If we run more operations on the GPU, we might actually have a chance to run 24 subblocks all in parallel!

However there’s a lot more fuzziness involved in getting there:

may have to rewrite code from C into GLSL
have to figure out how to operate as fragment or vertex shaders
have to figure out how to efficiently get data in/out
oh and you can’t use WebGL from a Worker thread, so have to bounce data up to the main thread and back down to the codec
once all that’s done, what’s the actual performance of the GPU vs the CPU per-operation? no idea 😀

So, there would seem to be great scaling potential on the GPU side, but there are a lot of unknowns.

Worth investigating but we’ll see what’s really feasible…