Engineering and Developers Blog
What's happening with engineering and developers at YouTube
Resonance Audio: Multi-platform spatial audio at scale
Monday, November 6, 2017
Cross-posted from the VR Blog
Posted by Eric Mauskopf, Product Manager
As humans, we rely on sound to guide us through our environment, help us communicate with others, and connect us with what's happening around us. Whether walking along a busy city street or attending a packed music concert, we're able to hear hundreds of sounds coming from different directions. So when it comes to AR, VR, games, and even 360 video, you need rich sound to create an engaging immersive experience that makes you feel like you're really there. Today, we're releasing a new spatial audio software development kit (SDK) called Resonance Audio. It's based on technology from Google's VR Audio SDK, and it works at scale across mobile and desktop platforms.
Experience spatial audio in our Audio Factory VR app for Daydream and SteamVR
Performance that scales on mobile and desktop
Bringing rich, dynamic audio environments into your VR, AR, gaming, or video experiences without affecting performance can be challenging. There are often few CPU resources allocated for audio, especially on mobile, which can limit the number of simultaneous high-fidelity 3D sound sources for complex environments. The SDK uses highly optimized digital signal processing algorithms based on higher order Ambisonics to spatialize hundreds of simultaneous 3D sound sources, without compromising audio quality, even on mobile. We're also introducing a new feature in Unity for precomputing highly realistic reverb effects that accurately match the acoustic properties of the environment, reducing CPU usage significantly during playback.
Using geometry-based reverb by assigning acoustic materials to a cathedral in Unity
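To see why an Ambisonics-based pipeline scales so well, here is a minimal first-order encoding sketch in Python/NumPy (illustrative only; it is not part of the SDK, which uses higher-order Ambisonics and far more optimized DSP). Each mono source is mixed into one fixed-size soundfield, so the expensive binaural rendering step runs once per audio frame no matter how many sources are playing.

```python
import numpy as np

def encode_first_order(mono, azimuth, elevation):
    """Encode a mono signal into a 4-channel B-format (W, X, Y, Z) soundfield
    using one common first-order convention."""
    w = mono * (1.0 / np.sqrt(2.0))                   # omnidirectional component
    x = mono * np.cos(azimuth) * np.cos(elevation)
    y = mono * np.sin(azimuth) * np.cos(elevation)
    z = mono * np.sin(elevation)
    return np.stack([w, x, y, z])

# Any number of sources mix into the same four channels.
sr = 48000
t = np.arange(sr) / sr
soundfield = np.zeros((4, sr))
for i, freq in enumerate([220.0, 330.0, 440.0]):
    source = 0.1 * np.sin(2 * np.pi * freq * t)
    soundfield += encode_first_order(source, azimuth=i * np.pi / 4, elevation=0.0)
```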
Multi-platform support for developers and sound designers
We know how important it is that audio solutions integrate seamlessly with your preferred audio middleware and sound design tools. With Resonance Audio, we've released cross-platform SDKs for the most popular game engines, audio engines, and digital audio workstations (DAWs) to streamline workflows, so you can focus on creating more immersive audio. The SDKs run on Android, iOS, Windows, macOS, and Linux, and provide integrations for Unity, Unreal Engine, FMOD, Wwise, and DAWs. We also provide native APIs for C/C++, Java, Objective-C, and the web. This multi-platform support enables developers to implement sound designs once and easily deploy their projects with consistent-sounding results across the top mobile and desktop platforms. Sound designers can save time by using our new DAW plugin for accurately monitoring spatial audio that's destined for YouTube videos or apps developed with Resonance Audio SDKs. Web developers get the open source Resonance Audio Web SDK, which works in the top web browsers by using the Web Audio API.
DAW plugin for sound designers to monitor audio destined for YouTube 360 videos or apps developed with the SDK
Cutting-edge features for modeling complex sound environments
By providing powerful tools for accurately modeling complex sound environments, Resonance Audio goes beyond basic 3D spatialization. The SDK enables developers to control the direction acoustic waves propagate from sound sources. For example, when standing behind a guitar player, it can sound quieter than when standing in front. And when facing the direction of the guitar, it can sound louder than when your back is turned.
Controlling sound wave directivity for an acoustic guitar using the SDK
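As a rough illustration of what a directivity control does (this is a generic cardioid-style pattern sketched in Python, not Resonance Audio's exact formula or API), the gain heard at a given angle off the source's forward axis can be shaped by blending an omnidirectional term with a directional one and sharpening the result:

```python
import numpy as np

def directivity_gain(theta, alpha=0.5, sharpness=2.0):
    """Gain heard at angle theta (radians) off the source's forward axis.
    alpha=0 is omnidirectional, alpha=0.5 is cardioid-like, alpha=1 is a
    figure-eight; larger sharpness values narrow the pattern."""
    return np.abs((1.0 - alpha) + alpha * np.cos(theta)) ** sharpness

print(directivity_gain(0.0))     # 1.0 when standing in front of the guitar
print(directivity_gain(np.pi))   # 0.0 directly behind it (for a cardioid)
```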
Another SDK feature is automatically rendering near-field effects when sound sources get close to a listener's head, providing an accurate perception of distance, even when sources are close to the ear. The SDK also enables sound source spread, by specifying the width of the source, allowing sound to be simulated from a tiny point in space up to a wall of sound. We've also released an Ambisonic recording tool to spatially capture your sound design directly within Unity, save it to a file, and use it anywhere Ambisonic soundfield playback is supported, from game engines to YouTube videos.
If you're interested in creating rich, immersive soundscapes using cutting-edge spatial audio technology, check out the Resonance Audio documentation on our developer site, let us know what you think through GitHub, and show us what you build with #ResonanceAudio on social media; we'll be resharing our favorites.
Variable speed playback on mobile
Thursday, September 7, 2017
Variable speed playback was launched on the web several years ago and is one of our most highly requested features on mobile. Now, it's here!
You can speed up or slow down videos in the YouTube app on iOS and on Android devices running Android 5.0+. Playback speed can be adjusted from 0.25x (quarter speed) to 2x (double speed) in the overflow menu of the player controls.
The most commonly used speed setting on the web is 1.25x, closely followed by 1.5x.
Speed watching is the new speed listening, which was the new speed reading, especially when consuming long lectures or interviews. But variable speed isn't just useful for skimming through content to save time; it can also be an important tool for investigating finer details. For example, you might want to slow down a tutorial to learn some new choreography or figure out a guitar strumming pattern.
Our main challenge in speeding up or slowing down audio while retaining its comprehensibility was to efficiently change the duration of the audio signal without affecting the pitch or introducing distortion. This process is called time stretching. Without time stretching, an audio signal that was originally at 100 Hz becomes 200 Hz at double speed, causing that chipmunk effect. Similarly, slowing down the playback will lower the pitch. Time stretching can be achieved using a phase vocoder, which transforms the signal into its frequency-domain representation to make phase adjustments before producing a lengthened or shortened version. Time stretching can also be done in the time domain by carefully selecting windows from the original signal to be assembled into the new one. On Android, we used the Sonic library for our audio manipulation in ExoPlayer. Sonic uses PICOLA, a time-domain algorithm. On iOS, AVPlayer has a built-in playback rate feature with configurable time stretching; here, we have chosen to use the spectral (frequency-domain) algorithm.
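As a toy illustration of the time-domain approach (a bare-bones overlap-add in Python/NumPy; it is not PICOLA, Sonic, or anything YouTube ships, and it will produce audible phasing that real algorithms avoid by aligning windows on waveform similarity), the trick is to step through the input faster or slower than through the output while each window keeps its original pitch:

```python
import numpy as np

def stretch_ola(signal, rate, frame=2048, hop_out=512):
    """Naive overlap-add time stretch: rate=2.0 halves the duration (2x speed),
    rate=0.5 doubles it, while each window keeps its original pitch."""
    hop_in = max(1, int(hop_out * rate))
    window = np.hanning(frame)
    n_frames = max(1, (len(signal) - frame) // hop_in + 1)
    out = np.zeros((n_frames - 1) * hop_out + frame)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        chunk = signal[i * hop_in : i * hop_in + frame]
        out[i * hop_out : i * hop_out + len(chunk)] += chunk * window[: len(chunk)]
        norm[i * hop_out : i * hop_out + len(chunk)] += window[: len(chunk)]
    return out / np.maximum(norm, 1e-8)

# One second of a 440 Hz tone at double speed: ~0.5 s long, still 440 Hz.
sr = 48000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
fast = stretch_ola(tone, rate=2.0)
```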
To speed up or slow down video, we render the video frames in alignment with the modified audio timestamps. Video frames are not necessarily encoded chronologically, so for the video to stay in sync with the audio playback, the video decoder needs to work faster than the rate at which the frames need to be rendered. This is especially pertinent at higher playback speeds. On mobile, there are also often more network and hardware constraints than on desktop that limit our ability to decode video as fast as necessary. For example, less reliable wireless links affect how quickly and accurately we can download video data, and battery, CPU speed, and memory size limit the processing power we can spend on decoding it. To address these issues, we adapt the video quality to be only as high as we can download dependably. The video decoder can also skip forward to the next key frame if it has fallen behind the renderer, or the renderer can drop already-decoded frames to catch up to the audio track.
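The per-frame decision can be sketched like this (an illustrative toy in Python, not the actual ExoPlayer or AVPlayer logic; the half-second threshold is made up):

```python
def sync_decision(frame_pts, audio_pos, keyframe_lag=0.5):
    """Decide what to do with a decoded frame given the audio clock position (seconds)."""
    lag = audio_pos - frame_pts
    if lag > keyframe_lag:
        return "skip_to_keyframe"   # decoder fell far behind the audio track
    if lag > 0:
        return "drop"               # frame is already late; don't render it
    return "render"                 # on time or early; render at its timestamp

for pts in [0.00, 0.55, 0.70]:
    print(pts, sync_decision(frame_pts=pts, audio_pos=0.60))
# 0.0 -> skip_to_keyframe, 0.55 -> drop, 0.7 -> render
```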
If you want to check out the feature, try this: turn up your volume and play the classic dramatic chipmunk at 0.5x to see an EVEN MORE dramatic chipmunk. Enjoy!
Posted by Pallavi Powale, Software Engineer, who recently watched "Dramatic Chipmunk" at 0.5x speed.
Blur select faces with the updated Blur Faces tool
Monday, August 21, 2017
In 2012 we launched face blurring as a visual anonymity feature, allowing creators to obscure all faces in their video. Last February we followed up with custom blurring to let creators blur any objects in their video, even as they move. Since then we've been hard at work improving our face blurring tool.
Today we’re launching a new and improved version of Blur Faces, allowing creators to easily and accurately blur specific faces in their videos. The tool now displays images of the faces in the video, and creators simply click an image to blur that individual throughout their video.
To introduce this feature, we had to improve the accuracy of our face detection tools, allowing for recognition of the same person across an entire video. The tool is designed for a wide array of situations that we see in YouTube videos, including users wearing glasses, occlusion (the face being blocked, for example, by a hand), and people leaving the video and coming back later.
Instead of requiring creators to use video editing software to manually create feathered masks and motion tracks, our Blur Faces tool automatically handles motion and presents creators with a thumbnail that encapsulates all instances of that individual recognized by our technology. Creators can apply these blurring edits to already-uploaded videos without losing views, likes, and comments by choosing to "Save" the edits in place. Applying the effect using "Save As New" and deleting the original video will remove the original unblurred video from YouTube for an extra level of privacy. The blur applied to the published video cannot be practically reversed, but keep in mind that blurring does not guarantee absolute anonymity.
To get to Blur Faces, go to the Enhance tool for a video you own. This can be done from the Video Manager or watch page. The Blur Faces tool can be found under the “Blurring Effects” tab of Enhancements. The following image shows how to get there.
When you open the Blur Faces tool on your video for the first time, we start processing your video for faces. During processing, we break your video up into chunks of frames, and start detecting faces on each frame individually. We use a high-quality face detection model to increase our accuracy, and at the same time, we look for scene changes and compute motion vectors throughout the video which we will use later.
Once we've detected the faces in each frame of your video, we start matching face detections within a single scene of the video, relying on both the visual characteristics of the face and the face's motion. To compute motion, we use the same technology that powers our Custom Blurring feature. Face detections aren't perfect, so we use a few techniques to help us home in on edge cases, such as tracking motion through occlusions (see the water bottle in the above GIF) and near the edge of the video frame. Finally, we compute visual similarity across what we found in each scene, pick the best face to show as a thumbnail, and present it to you.
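As a greatly simplified sketch of what matching detections across frames means (illustrative Python only; the production system, as described above, also uses visual similarity and the precomputed motion vectors rather than plain box overlap), detections can be linked greedily into tracks by bounding-box overlap:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def link_detections(tracks, detections, threshold=0.3):
    """Greedily extend each track with the best-overlapping new detection."""
    for track in tracks:
        best = max(detections, key=lambda d: iou(track[-1], d), default=None)
        if best is not None and iou(track[-1], best) >= threshold:
            track.append(best)
            detections.remove(best)
    tracks.extend([[d] for d in detections])  # unmatched detections start new tracks
    return tracks

tracks = [[(10, 10, 50, 50)]]
tracks = link_detections(tracks, [(12, 11, 52, 49), (200, 80, 240, 120)])
print(len(tracks))  # 2: one box extends the existing track, the other starts a new one
```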
Before publishing your changes, we encourage you to preview the video. As we cannot guarantee 100 percent accuracy in every video, you can use our Custom Blurring tool to further enhance the automated face blurring edits in the same interface.
Ryan Stevens, Software Engineer, recently watched "158,962,555,217,826,360,000 (Enigma Machine)", and Ian Pudney, Software Engineer, recently watched "Wood burning With Lightning. Lichtenberg Figures!"
Visualizing Sound Effects
Thursday, March 23, 2017
At YouTube, we understand the power of video to tell stories, move people, and leave a lasting impression. One part of storytelling that many people take for granted is sound, yet sound adds color to the world around us. Just imagine not being able to hear music, the joy of a baby laughing, or the roar of a crowd. But this is often a reality for the 360 million people around the world who are deaf and hard of hearing. Over the last decade, we have been working to change that.
The first step came over ten years ago with the launch of captions. And in an effort to scale this technology, automated captions came a few years later. The success of that effort has been astounding, and a few weeks ago we announced that the number of videos with automatic captions now exceeds 1 billion. Moreover, people watch videos with automatic captions more than 15 million times per day. And we have made meaningful improvements to quality, resulting in a 50 percent leap in accuracy for automatic captions in English, which is getting us closer and closer to human transcription error rates.
But there is more to sound and the enjoyment of a video than words. In a joint effort between YouTube, Sound Understanding, and Accessibility teams, we embarked on the task of developing the first ever automatic sound effect captioning system for YouTube. This means finding a way to identify and label all those other sounds in the video without manual input.
We started this project by taking on a wide variety of challenges, such as how best to design the sound effect recognition system and which sounds to prioritize. At the heart of the work was using thousands of hours of videos to train a deep neural network model to achieve high-quality recognition results. There are more details in a companion post here.
As a result, we can now automatically detect the existence of these sound effects in a video and transcribe them into appropriate classes or sound labels. With so many sounds to choose from, we started with [APPLAUSE], [MUSIC] and [LAUGHTER], since these were among the most frequent manually captioned sounds, and they can add meaningful context for viewers who are deaf and hard of hearing.
So what does this actually look like when you are watching a YouTube video? The sound effect is merged with the automatic speech recognition track and shown as part of standard automatic captions.
Click the CC button to see the sound effect captioning system in action
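Conceptually, the merge is just interleaving detected sound-event intervals with the speech cues by time (the data and formatting below are made up for illustration; the real work happens inside the automatic caption pipeline):

```python
speech = [(2.0, 4.5, "welcome back everyone")]
sounds = [(0.0, 1.5, "[APPLAUSE]"), (5.0, 6.0, "[LAUGHTER]")]

# Merge both cue lists into one caption track, ordered by start time.
for start, end, text in sorted(speech + sounds):
    print(f"{start:05.2f} --> {end:05.2f}  {text}")
```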
We are still in the early stages of this work, and we are aware that these captions are fairly simplistic. However, the infrastructural backend of this system will allow us to expand and easily apply this framework to other sound classes. Future challenges might include adding other common sound classes like ringing, barking, and knocking, which present particular problems -- for example, with ringing we need to be able to decipher whether it is an alarm clock, a door, or a phone, as described here.
Since the addition of sound effect captions presented a number of unique challenges on both the machine learning and the user experience fronts, we continue to work to better understand the effect of the captioning system on the viewing experience, how viewers use sound effect information, and how useful it is to them. From our initial user studies, two-thirds of participants said these sound effect captions really enhance the overall experience, especially when they add crucial "invisible" sound information that people cannot tell from visual cues alone. Overall, users reported that their experience wouldn't be impacted by the system making occasional mistakes, as long as it was able to provide good information more often than not.
We are excited to support automatic sound effect captioning on YouTube, and we hope this system helps us make information useful and accessible for everyone.
Noah Wang, Software Engineer, recently watched "The Expert (Short Comedy Sketch)."
Improving VR videos
Tuesday, March 14, 2017
At YouTube, we are focused on enabling the kind of immersive and interactive experiences that only VR can provide, making digital video as immersive as it can be. In March 2015, we launched support for 360-degree videos, shortly followed by VR (3D 360) videos. In 2016 we brought 360 live streaming and spatial audio and a dedicated YouTube VR app to our users.
Now, in a joint effort between YouTube and Daydream, we're adding new ways to make 360 and VR videos look even more realistic.
360 videos need a large number of pixels per video frame to achieve a compelling immersive experience. In the ideal scenario, we would match human visual acuity, which is 60 pixels per degree of immersive content. However, we are limited by users' internet connection speeds and device capabilities. One way to bridge the gap between these limitations and human visual acuity is to use better projection methods.
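To make the gap concrete, here is the back-of-the-envelope arithmetic behind that statement:

```python
# Pixels needed to match roughly 60 pixels per degree over a full 360 x 180 sphere.
ppd = 60
width, height = 360 * ppd, 180 * ppd
print(width, height, width * height)   # 21600 x 10800, over 230 million pixels per
                                       # frame -- far beyond what we can stream today
```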
Better Projections
A projection is the mapping used to fit a 360-degree world view onto a rectangular video surface. The world map is a good example of a spherical Earth projected onto a rectangular piece of paper. A commonly used projection is called equirectangular projection. Initially, we chose this projection when we launched 360 videos because it is easy for camera software to produce and easy to edit.
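For reference, the equirectangular mapping itself is just a linear spread of longitude and latitude across the frame (a minimal sketch using the usual convention; it also hints at why the poles end up oversampled, since every row gets the same number of pixels):

```python
import numpy as np

def equirect_uv(azimuth, elevation):
    """Map a view direction (radians) to normalized [0, 1] texture coordinates."""
    u = (azimuth + np.pi) / (2 * np.pi)   # longitude spread evenly across the width
    v = (np.pi / 2 - elevation) / np.pi   # latitude spread evenly down the height
    return u, v

print(equirect_uv(0.0, 0.0))              # (0.5, 0.5): frame center looks at the horizon
```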
However, equirectangular projection has some drawbacks:
It has high quality at the poles (top and bottom of image) where people don’t look as much – typically, sky overhead and ground below are not that interesting to look at.
It has lower quality at the equator or horizon where there is typically more interesting content.
It has fewer vertical pixels for 3D content.
A straight line motion in the real world does not result in a straight line motion in equirectangular projection, making videos hard to compress.
Drawbacks of equirectangular (EQ) projection
These drawbacks made us look for better projection types for 360-degree videos. To compare different projection types we used saturation maps. A saturation map shows the ratio of video pixel density to display pixel density. The color coding goes from red (low) to orange, yellow, green and finally blue (high). Green indicates optimal pixel density of near 1:1. Yellow and orange indicate insufficient density (too few video pixels for the available display pixels) and blue indicates wasted resources (too many video pixels for the available display pixels). The ideal projection would lead to a saturation map that is uniform in color. At sufficient video resolution it would be uniformly green.
We investigated cubemaps as a potential candidate. Cubemaps have been used by computer games for a long time to display the skybox and other special effects.
Equirectangular projection saturation map
Cubemap projection saturation map
In the equirectangular saturation map the poles are blue, indicating wasted pixels. The equator (horizon) is orange, indicating an insufficient number of pixels. In contrast, the cubemap has green (good) regions nearer to the equator, and the wasteful blue regions at the poles are gone entirely. However, the cubemap results in large orange regions (not good) at the equator because a cubemap samples more pixels at the corners than at the center of the faces.
We achieved a substantial improvement using an approach we call Equi-angular Cubemap, or EAC. The EAC projection's saturation is significantly more uniform than the previous two, while further improving quality at the equator:
Equi-angular Cubemap - EAC
As opposed to a traditional cubemap, which distributes an equal number of pixels for equal distances on the cube surface, the equi-angular cubemap distributes an equal number of pixels for equal angular change.
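In other words, within each 90-degree cube face the texture coordinate becomes linear in viewing angle rather than in distance on the face. A small sketch of the two per-face mappings, following the description above (not the exact code YouTube uses):

```python
import numpy as np

def cubemap_face_coord(theta):
    """Traditional cubemap: the face coordinate is linear in tan(theta), so a
    degree of view near a corner covers more pixels than a degree at the center."""
    return np.tan(theta) / np.tan(np.pi / 4)   # theta in [-45, 45] deg -> [-1, 1]

def eac_face_coord(theta):
    """Equi-angular cubemap: equal pixel steps for equal angular steps."""
    return theta / (np.pi / 4)                 # linear in angle -> [-1, 1]

theta = np.radians(30)
print(cubemap_face_coord(theta), eac_face_coord(theta))   # ~0.577 vs ~0.667
```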
The saturation maps seemed promising, but we wanted to see if people could tell the difference. So we asked people to rate the quality of each without telling them which projection they were viewing. People generally rated EAC as higher quality compared to other projections. Here is an example comparison:
EAC vs EQ
Creating Industry Standards
We're just beginning to see innovative new projections for 360 video. We've worked with equirectangular projection and cubemaps, and now EAC. We think a standardized way to represent arbitrary projections will help everyone innovate, so we've developed a Projection Independent Mesh.
A Projection Independent Mesh describes the projection by including a 3D mesh, along with its texture mapping, in the video container. The video rendering software simply renders this mesh according to the specified texture mapping and does not need to understand the details of the projection used. This gives us infinite possibilities. We published our mesh format draft standard on GitHub, inviting industry experts to comment, and we hope to turn this into a widely agreed-upon industry standard.
Some 360-degree cameras do not capture the entire field of view. For example, they may not have a lens to capture the top and bottom, or they may only capture a 180-degree scene. Our proposal supports these cameras and allows the uncaptured portions of the field of view to be replaced with static geometry and an image. It also allows the mesh to be compressed using deflate or other compression schemes. We designed the mesh format with compression efficiency in mind and were able to fit the EAC projection within a 4 KB payload.
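As a rough feel for the payload size (the grid, layout, and quantization below are made up for illustration; the real box structure is defined in the draft spec linked above), even a straightforwardly packed vertex/UV grid deflates to a small fraction of its raw size:

```python
import struct
import zlib

# A hypothetical 19 x 19 grid of (x, y, z, u, v) floats standing in for a projection mesh.
vertices = [(x / 9.0, y / 9.0, 1.0, (x + 9) / 18.0, (y + 9) / 18.0)
            for x in range(-9, 10) for y in range(-9, 10)]
raw = b"".join(struct.pack("<5f", *v) for v in vertices)
packed = zlib.compress(raw, 9)             # deflate, as the proposal allows
print(len(raw), len(packed))               # ~7 KB raw vs. a much smaller deflated payload
```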
The projection independent mesh allows us to continue improving on projections and deploy them with ease since our renderer is now projection independent.
Spherical video playback on Android now benefits from EAC projection streamed using a projection independent mesh, and we automatically convert uploaded videos to an EAC mesh. This will soon be available on iOS and desktop too. Our ingestion format continues to be based on equirectangular projection, as mentioned in our upload recommendations.
Anjali Wheeler, Software Engineer, recently watched "Disturbed - The Sound Of Silence."
Supercharge your YouTube live tools with the new Super Chat API
Thursday, January 12, 2017
In December 2015, we launched an array of API services that let developers access a wealth of data about live streams, chat, and fan funding. Since then, we've seen thousands of creators use the tools listed on our Tools for Gaming Streamers page to enhance their streams by adding chatbots, overlays, polls, and more.
Today, we announced a new live feature for fans and creators, Super Chat, which lets anybody watching a live stream stand out from the crowd and get a creator's attention by purchasing highlighted chat messages. We're also announcing a new API service for this feature: the Super Chat API, designed to allow developers to access real-time information about Super Chat purchases.
The launch of this new API service will be followed by the shutdown of our Fan Funding API, so developers using the Fan Funding API need to move to the new Super Chat API as soon as possible.
On January 31, 2017, we'll begin offering replacements for the two ways developers currently get information about Fan Funding:
LiveChatMessages.list will gain a new message type, superChatMessage, which will contain details about Super Chats purchased during an active live stream.
A new endpoint, SuperChats.list, will be made available to list a channel's Super Chat purchases.
On February 28, 2017, we'll be turning down the two existing Fan Funding methods:
LiveChatMessages.list will no longer return messages of type fanFundingEvent.
FanFundingEvents.list will no longer return data.
During the transition period between Super Chat and Fan Funding, SuperChats.list will provide information about both Super Chat events and Fan Funding events, so we encourage all developers to switch to the new API as soon as it becomes available. Keep an eye on the YouTube Data API v3 Revision History to get the documentation for this service as soon as we post it.
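For example, once the new message type is live, polling a live chat for Super Chats might look roughly like this (a sketch using the Google API Python client; the type and field names follow the announcement above and may differ in the final documentation, and the key and chat ID are placeholders you would supply yourself):

```python
from googleapiclient.discovery import build

youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")   # or OAuth credentials

response = youtube.liveChatMessages().list(
    liveChatId="YOUR_LIVE_CHAT_ID",
    part="snippet,authorDetails",
).execute()

for item in response.get("items", []):
    snippet = item["snippet"]
    if snippet.get("type") == "superChatMessage":               # new type described above
        print(item["authorDetails"]["displayName"], snippet.get("superChatDetails"))
```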
If you've got questions on this, please feel free to ask the community on our Stack Overflow tag or send us a tweet at @YouTubeDev and we'll do our best to answer.
Marc Chambers, Developer Relations, recently watched "Show of the Week: New Games for 2017."