Engineering and Developers Blog
What's happening with engineering and developers at YouTube
Visualizing Sound Effects
Thursday, March 23, 2017
At YouTube, we understand the power of video to tell stories, move people, and leave a lasting impression. One part of storytelling that many people take for granted is sound, yet sound adds color to the world around us. Just imagine not being able to hear music, the joy of a baby laughing, or the roar of a crowd. But this is often a reality for the
360 million people
around the world who are deaf and hard of hearing. Over the last decade, we have been working to change that.
The first step came over ten years ago with the launch of
captions
. And in an effort to scale this technology,
automated captions
came a few years later. The success of that effort has been astounding, and a few weeks ago we
announced
that the number of videos with automatic captions now exceeds 1 billion. Moreover, people watch videos with automatic captions more than 15 million times per day. And we have made meaningful improvements to quality, resulting in a 50 percent leap in accuracy for automatic captions in English, which is getting us closer and closer to human transcription error rates.
But there is more to sound and the enjoyment of a video than words. In a joint effort between YouTube, Sound Understanding, and Accessibility teams, we embarked on the task of developing the first ever automatic sound effect captioning system for YouTube. This means finding a way to identify and label all those other sounds in the video without manual input.
We started this project by taking on a wide variety of challenges, such as how to best design the sound effect recognition system and what sounds to prioritize. At the heart of the work was utilizing thousands of hours of videos to train a deep neural network model to achieve high quality recognition results. There are more details in a companion post
here
.
As a result, we can now automatically detect the existence of these sound effects in a video and transcribe it to appropriate classes or sound labels. With so many sounds to choose from, we started with [APPLAUSE], [MUSIC] and [LAUGHTER], since these were among the most frequent manually captioned sounds, and they can add meaningful context for viewers who are deaf and hard of hearing.
So what does this actually look like when you are watching a YouTube video? The sound effect is merged with the automatic speech recognition track and shown as part of standard automatic captions.
Click the CC button to see the sound effect captioning system in action
We are still in the early stages of this work, and we are aware that these captions are fairly simplistic. However, the infrastructural backend to this system will allow us to expand and easily apply this framework to other sound classes. Future challenges might include adding other common sound classes like ringing, barking and knocking, which present particular problems -- for example, with ringing we need to be able to decipher if this is an alarm clock, a door or a phone as described
here
.
Since the addition of sound effect captions presented a number of unique challenges on both the machine learning end as well as the user experience, we continue to work to better understand the effect of the captioning system on the viewing experience, how viewers use sound effect information, and how useful it is to them. From our initial user studies, two-thirds of participants said these sound effect captions really enhance the overall experience, especially when they added crucial “invisible” sound information that people cannot tell from the visual cues. Overall, users reported that their experience wouldn't be impacted by the system making occasional mistakes as long as it was able to provide good information more often than not.
We are excited to support automatic sound effect captioning on YouTube, and we hope this system helps us make information useful and accessible for everyone.
Noah Wang, software engineer, recently watched "
The Expert (Short Comedy Sketch)
."
Labels
.net
360
acceleration
access control
accessibility
actionscript
activities
activity
android
announcements
apis
app engine
appengine
apps script
as2
as3
atom
authentication
authorization
authsub
best practices
blackops
blur faces
bootcamp
captions
categories
channels
charts
chrome
chromeless
client library
clientlibraries
clientlogin
code
color
comments
compositing
create
curation
custom player
decommission
default
deprecation
devs
direct
discovery
docs
Documentation RSS
dotnet
education
embed
embedding
events
extension
feeds
flash
format
friendactivity
friends
fun
gears
google developers live
google group
googlegamedev
googleio
html5
https
iframe
insight
io12
io2011
ios
iphone
irc
issue tracker
java
javascript
json
json-c
jsonc
knight
legacy
Live Streaming API
LiveBroadcasts API
logo
machine learning
mashups
media:keywords keywords tags metadata
metadata
mobile
mozilla
NAB 2016
news
oauth
oauth2
office hours
open source
partial
partial response
partial update
partners
patch
php
player
playlists
policy
previews
pubsubhubbub
push
python
quota
rails
releases
rendering
reports
responses
resumable
ruby
samples
sandbox
shortform
ssl https certificate staging stage
stack overflow
stage video
staging
standard feeds
storify
storyful
subscription
sup
Super Chat API
survey
tdd
theme
tos
tutorials
updates
uploads
v2
v3
video
video files
video transcoding
virtual reality
voting
VR
watch history
watchlater
webvtt
youtube
youtube api
YouTube Data API
youtube developers live
youtube direct
YouTube live
YouTube Reporting API
ytd
Archive
2018
Apr
2017
Nov
Sep
Aug
Mar
Jan
2016
Nov
Oct
Aug
May
Apr
2015
Dec
Nov
Oct
May
Apr
Mar
Jan
2014
Oct
Sep
Aug
May
Mar
2013
Dec
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
2012
Dec
Nov
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2011
Dec
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2010
Dec
Nov
Oct
Sep
Jul
Jun
May
Apr
Mar
Feb
Jan
2009
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2008
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
2007
Dec
Nov
Aug
Jun
May
Feed
YouTube
on
Follow @youtubedev