Google Research Blog
The latest news from Research at Google
Mobile Real-time Video Segmentation
Thursday, March 01, 2018
Valentin Bazarevsky and Andrei Tkachenka, Software Engineers, Google Research
Video segmentation is a widely used technique that enables movie directors and video content creators to separate the foreground of a scene from the background, and treat them as two different visual layers. By modifying or replacing the background, creators can convey a particular mood, transport themselves to a fun location or enhance the impact of the message. However, this operation has traditionally been performed as a time-consuming manual process (e.g. an artist rotoscoping every frame) or requires a studio environment with a green screen for real-time background removal (a technique referred to as chroma keying). In order to enable users to create this effect live in the viewfinder, we designed a new technique that is suitable for mobile phones.
Today, we are excited to bring precise, real-time, on-device mobile video segmentation to the YouTube app by integrating this technology into stories. Currently in limited beta, stories is YouTube’s new lightweight video format, designed specifically for YouTube creators. Our new segmentation technology allows creators to replace and modify the background, effortlessly increasing videos’ production value without specialized equipment.
Neural network video segmentation in YouTube stories.
To achieve this, we leverage machine learning to solve a semantic segmentation task using convolutional neural networks. In particular, we designed a network architecture and training procedure suitable for mobile phones, focusing on the following requirements and constraints:
A mobile solution should be lightweight and run at least 10-30 times faster than existing state-of-the-art photo segmentation models. For real-time inference, such a model needs to provide results at 30 frames per second.
A video model should leverage temporal redundancy (neighboring frames look similar) and exhibit temporal consistency (neighboring results should be similar).
High quality segmentation results require high quality annotations.
The Dataset
To provide high quality data for our machine learning pipeline, we annotated tens of thousands of images that captured a wide spectrum of foreground poses and background settings. Annotations consisted of pixel-accurate locations of foreground elements such as hair, glasses, neck, skin, lips, etc. and a general background label achieving a cross-validation result of 98% Intersection-Over-Union (IOU) of human annotator quality.
An example image from our dataset carefully annotated with nine labels - foreground elements are overlaid over the image.
Network Input
Our specific segmentation task is to compute a binary mask separating foreground from background for every input frame (three channels, RGB) of the video. Achieving temporal consistency of the computed masks across frames is key. Current methods that utilize LSTMs or GRUs to realize this are too computationally expensive for real-time applications on mobile phones. Instead, we first pass the computed mask from the previous frame as a prior by concatenating it as a fourth channel to the current RGB input frame to achieve temporal consistency, as shown below:
The original frame (left) is separated in its three color channels and concatenated with the previous mask (middle). This is used as input to our neural network to predict the mask for the current frame (right).
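For concreteness, here is a minimal sketch of how such a four-channel input could be assembled, assuming the frame and mask are NumPy arrays; the helper name and shapes are illustrative, not the production code:

```python
import numpy as np

def make_network_input(frame_rgb, prev_mask=None):
    """Builds the 4-channel input: RGB frame plus the previous frame's mask.

    frame_rgb: float array of shape (H, W, 3), values in [0, 1].
    prev_mask: float array of shape (H, W), values in [0, 1], or None
               for the first frame (an empty prior).
    """
    h, w, _ = frame_rgb.shape
    if prev_mask is None:
        prev_mask = np.zeros((h, w), dtype=frame_rgb.dtype)
    # Concatenate the prior mask as a fourth channel.
    return np.concatenate([frame_rgb, prev_mask[..., None]], axis=-1)

# At inference time the predicted mask is fed back as the next frame's prior:
#   mask = model.predict(make_network_input(frame, mask))
```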
Training Procedure
In video segmentation we need to achieve frame-to-frame temporal continuity, while also accounting for temporal discontinuities such as people suddenly appearing in the field of view of the camera. To train our model to robustly handle those use cases, we transform the annotated ground truth of each photo in several ways and use it as a previous frame mask (a sketch of this sampling follows the list below):
Empty previous mask - Trains the network to work correctly for the first frame and new objects in the scene. This emulates the case of someone appearing in the camera's frame.
Affine transformed ground truth mask - Minor transformations train the network to propagate and adjust to the previous frame mask. Major transformations train the network to understand inadequate masks and discard them.
Transformed image - We implement thin plate spline smoothing of the original image to emulate fast camera movements and rotations.
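The following is a rough sketch of how such prior masks might be sampled during training, assuming NumPy/SciPy arrays; the thin plate spline warp of the image is omitted, and the function name, probabilities and jitter magnitudes are illustrative choices rather than the published procedure:

```python
import numpy as np
from scipy import ndimage

def sample_prior_mask(gt_mask, rng):
    """Randomly transforms a ground-truth mask to emulate a 'previous frame' prior."""
    if rng.random() < 0.2:
        # Empty prior: first frame, or a new person entering the scene.
        return np.zeros_like(gt_mask)
    # Affine-transformed prior: small jitter teaches propagation,
    # large jitter teaches the network to distrust a bad prior.
    scale = 1.0 + rng.uniform(-0.1, 0.1)
    angle = rng.uniform(-10, 10)            # degrees
    shift = rng.uniform(-5, 5, size=2)      # pixels
    rotated = ndimage.rotate(gt_mask, angle, reshape=False, order=1)
    warped = ndimage.affine_transform(rotated, np.eye(2) / scale, offset=shift, order=1)
    return np.clip(warped, 0.0, 1.0)

# Example usage: prior = sample_prior_mask(gt_mask, np.random.default_rng(0))
```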
Our real-time video segmentation in action.
Network Architecture
With that modified input/output, we build on a standard hourglass segmentation network architecture by adding the following improvements:
We use big convolution kernels with large strides of four and above to detect object features on the high-resolution RGB input frame. Convolutions for layers with a small number of channels (as is the case for the RGB input) are comparatively cheap, so using big kernels here has almost no effect on the computational cost.
For speed gains, we aggressively downsample using large strides combined with skip connections like U-Net to restore low-level features during upsampling. For our segmentation model this technique results in a significant improvement of 5% IOU compared to using no skip connections.
Hourglass segmentation network w/ skip connections.
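A toy Keras version of this idea (a big kernel with stride 4 on the four-channel input, then U-Net-style skips during upsampling) might look as follows; the layer counts and sizes are made up for illustration and do not reflect the production architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_segmentation_net(h=256, w=256):
    """Toy hourglass: aggressive strided downsampling + U-Net-style skips."""
    inp = layers.Input(shape=(h, w, 4))            # RGB + previous mask
    # Big kernel, large stride: cheap on a 4-channel input, shrinks resolution fast.
    e1 = layers.Conv2D(32, 7, strides=4, padding='same', activation='relu')(inp)
    e2 = layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(e1)
    e3 = layers.Conv2D(128, 3, strides=2, padding='same', activation='relu')(e2)
    # Decoder with skip connections restoring low-level detail.
    d2 = layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(e3)
    d2 = layers.Concatenate()([d2, e2])
    d1 = layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu')(d2)
    d1 = layers.Concatenate()([d1, e1])
    out = layers.Conv2DTranspose(1, 7, strides=4, padding='same', activation='sigmoid')(d1)
    return tf.keras.Model(inp, out)
```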
For even further speed gains, we optimized default ResNet bottlenecks. In the literature, authors tend to squeeze channels in the middle of the network by a factor of four (e.g. reducing 256 channels to 64 by using 64 different convolution kernels). However, we noticed that one can squeeze much more aggressively, by a factor of 16 or 32, without significant quality degradation.
ResNet bottleneck with large squeeze factor.
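In code, such an aggressively squeezed bottleneck could look like the sketch below; the channel counts are illustrative rather than the published configuration:

```python
from tensorflow.keras import layers

def bottleneck(x, channels=256, squeeze_factor=16):
    """Residual bottleneck that squeezes channels far more aggressively
    than the usual factor of 4 (e.g. 256 -> 16 -> 256)."""
    mid = max(channels // squeeze_factor, 1)
    y = layers.Conv2D(mid, 1, padding='same', activation='relu')(x)
    y = layers.Conv2D(mid, 3, padding='same', activation='relu')(y)
    y = layers.Conv2D(channels, 1, padding='same')(y)
    return layers.Activation('relu')(layers.Add()([x, y]))
```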
To refine and improve the accuracy of edges, we add several DenseNet layers on top of our network in full resolution, similar to neural matting. This technique improves overall model quality by only a slight 0.5% IOU; however, the perceptual quality of segmentation improves significantly.
The end result of these modifications is that our network runs remarkably fast on mobile devices, achieving 100+ FPS on iPhone 7 and 40+ FPS on Pixel 2 with high accuracy (realizing 94.8% IOU on our validation dataset), delivering a variety of smooth running and responsive effects in YouTube stories.
Our immediate goal is to use the limited rollout in YouTube stories to test our technology on this first set of effects. As we improve and expand our segmentation technology to more labels, we plan to integrate it into Google's broader Augmented Reality services.
Acknowledgements
A thank you to our team members who worked on the tech and this launch with us: Andrey Vakunov, Yury Kartynnik, Artsiom Ablavatski, Ivan Grishchenko, Matsvei Zhdanovich, Andrei Kulik, Camillo Lugaresi, John Kim, Ryan Bolyard, Wendy Huang, Michael Chang, Aaron La Lau, Willi Geiger, Tomer Margolin, John Nack and Matthias Grundmann.
Announcing AudioSet: A Dataset for Audio Event Research
Thursday, March 30, 2017
Posted by Dan Ellis, Research Scientist, Sound Understanding Team
Systems able to recognize sounds familiar to human listeners have a wide range of applications, from adding sound effect information to automatic video captions, to potentially allowing you to search videos for specific audio events. Building Deep Learning systems to do this relies heavily on both a large quantity of computing (often from highly parallel GPUs), and also – and perhaps more importantly – on significant amounts of accurately-labeled training data. However, research in environmental sound recognition is limited by currently available public datasets.
In order to address this, we recently released AudioSet, a collection of over 2 million ten-second YouTube excerpts labeled with a vocabulary of 527 sound event categories, with at least 100 examples for each category. Announced in our paper at the IEEE International Conference on Acoustics, Speech, and Signal Processing, AudioSet provides a common, realistic-scale evaluation task for audio event detection and a starting point for a comprehensive vocabulary of sound events, designed to advance research into audio event detection and recognition.
Developing an Ontology
When we started on this work last year, our first task was to define a vocabulary of sound classes that provided a consistent level of detail over the spectrum of sound events we planned to label. Defining this ontology was necessary to avoid problems of ambiguity and synonyms; without this, we might end up trying to differentiate “Timpani” from “Kettle drum”, or “Water tap” from “Faucet”. Although a number of scientists have looked at how humans organize sound events, the few existing ontologies proposed have been small and partial. To build our own, we searched the web for phrases like “Sounds, such as X and Y”, or “X, Y, and other sounds”. This gave us a list of sound-related words which we manually sorted into a hierarchy of over 600 sound event classes ranging from “Child speech” to “Ukulele” to “Boing”. To make our taxonomy as comprehensive as possible, we then looked at comparable lists of sound events (for instance, the Urban Sound Taxonomy) to add significant classes we may have missed and to merge classes that weren't well defined or well distinguished. You can explore our ontology here.
The top two levels of the AudioSet ontology.
From Ontology to Labeled Data
With our new ontology in hand, we were able to begin collecting human judgments of where the sound events occur. This, too, raises subtle problems: unlike the billions of well-composed photographs available online, people don’t typically produce “well-framed” sound recordings, much less provide them with captions. We decided to use 10 second sound snippets as our unit; anything shorter becomes very difficult to identify in isolation. We collected candidate snippets for each of our classes by taking random excerpts from YouTube videos whose metadata indicated they might contain the sound in question (“Dogs Barking for 10 Hours”). Each snippet was presented to a human labeler with a small set of category names to be confirmed (“Do you hear a Bark?”). Subsequently, we proposed snippets whose content was similar to examples that had already been manually verified to contain the class, thereby finding examples that were not discoverable from the metadata. Because some classes were much harder to find than others – particularly the onomatopoeia words like “Squish” and “Clink” – we adapted our segment proposal process to increase the sampling for those categories. For more details, see our paper on the matching technique.
AudioSet provides the URLs of each video excerpt along with the sound classes that the raters confirmed as present, as well as precalculated audio features from the same classifier used to generate audio features for the updated YouTube-8M Dataset. Below is a histogram of the number of examples per class:
The total number of videos for selected classes in AudioSet.
You can browse this data at the AudioSet website, which lets you view all 2 million excerpts for all the classes.
A few of the segments representing the class “Violin, fiddle”.
Quality Assessment
Once we had a substantial set of human ratings, we conducted an internal Quality Assessment task where, for most of the classes, we checked 10 examples of excerpts that the annotators had labeled with that class. This revealed a significant number of classes with inaccurate labeling: some, like “Dribble” (uneven water flow) and “Roll” (a hard object moving by rotating) had been systematically confused (as basketball dribbling and drum rolls, respectively); some such as “Patter” (footsteps of small animals) and “Sidetone” (background sound on a telephony channel) were too difficult to label and/or find examples for, even with our content-based matching. We also looked at the behavior of a classifier trained on the entire dataset and found a number of frequent confusions indicating separate classes that were not really distinct, such as “Jingle” and “Tinkle”.
To address these “problem” classes, we developed a re-rating process by labeling groups of classes that span (and thus highlight) common confusions, and created instructions for the labeler to be used with these sounds. This re-rating has led to multiple improvements – merged classes, expanded coverage, better descriptions – that we were able to incorporate in this release. This iterative process of labeling and assessment has been particularly effective in shaking out weaknesses in the ontology.
A Community Dataset
By releasing AudioSet, we hope to provide a common, realistic-scale evaluation task for audio event detection, as well as a starting point for a comprehensive vocabulary of sound events. We would like to see a vibrant sound event research community develop, including through external efforts such as the DCASE challenge series. We will continue to improve the size, coverage, and accuracy of this data and plan to make a second release in the coming months when our re-rating process is complete. We additionally encourage the research community to continue to refine our ontology, which we have open sourced on GitHub. We believe that with this common focus, sound event recognition will continue to advance and will allow machines to understand sounds similar to the way we do, enabling new and exciting applications.
Acknowledgments:
AudioSet is the work of Jort F. Gemmeke, Dan Ellis, Dylan Freedman, Shawn Hershey, Aren Jansen, Wade Lawrence, Channing Moore, Manoj Plakal, and Marvin Ritter, with contributions from Sami Abu-El-Haija, Sourish Chaudhuri, Victor Gomes, Nisarg Kothari, Dick Lyon, Sobhan Naderi Parizi, Paul Natsev, Brian Patton, Rif A. Saurous, Malcolm Slaney, Ron Weiss, and Kevin Wilson.
Adding Sound Effect Information to YouTube Captions
Thursday, March 23, 2017
Posted by Sourish Chaudhuri, Software Engineer, Sound Understanding
The effect of audio on our perception of the world can hardly be overstated. Its importance as a communication medium via speech is obviously the most familiar, but there is also significant information conveyed by ambient sounds. These ambient sounds create context that we instinctively respond to, like getting startled by sudden commotion, the use of music as a narrative element, or how laughter is used as an audience cue in sitcoms.
Since 2009, YouTube has provided automatic caption tracks for videos, focusing heavily on speech transcription in order to make the content hosted more accessible. However, without similar descriptions of the ambient sounds in videos, much of the information and impact of a video is not captured by speech transcription alone. To address this, we announced the addition of sound effect information to the automatic caption track in YouTube videos, enabling greater access to the richness of all the audio content.
In this post, we discuss the backend system developed for this effort, a collaboration among the Accessibility, Sound Understanding and YouTube teams that used machine learning (ML) to enable the first ever automatic sound effect captioning system for YouTube.
Click the CC button to see the sound effect captioning system in action.
The application of ML – in this case, a Deep Neural Network (DNN) model – to the captioning task presented unique challenges. While the process of analyzing the time-domain audio signal of a video to detect various ambient sounds is similar to other well known classification problems (such as object detection in images), in a product setting the solution faces additional difficulties. In particular, given an arbitrary segment of audio, we need our models to be able to 1) detect the desired sounds, 2) temporally localize the sound in the segment and 3) effectively integrate it in the caption track, which may have parallel and independent speech recognition results.
A DNN Model for Ambient Sound
The first challenge we faced in developing the model was the task of obtaining enough labeled data suitable for training our neural network. While labeled ambient sound information is difficult to come by, we were able to generate a large enough dataset for training using weakly labeled data. But of all the ambient sounds in a given video, which ones should we train our DNN to detect?
For the initial launch of this feature, we chose [APPLAUSE], [MUSIC] and [LAUGHTER], prioritized based upon our analysis of human-created caption tracks, which indicates that they are among the most frequently captioned sounds. While the sound space is obviously far richer and provides even more contextually relevant information than these three classes, the semantic information conveyed by these sound effects in the caption track is relatively unambiguous, as opposed to sounds like [RING], which raises the question of “what was it that rang – a bell, an alarm, a phone?”
Much of our initial work on detecting these ambient sounds also included developing the infrastructure and analysis frameworks to enable scaling for future work, including both the detection of sound events and their integration into the automatic caption track. Investing in the development of this infrastructure has the added benefit of allowing us to easily incorporate more sound types in the future, as we expand our algorithms to understand a wider vocabulary of sounds (e.g. [RING], [KNOCK], [BARK]). In doing so, we will be able to incorporate the detected sounds into the narrative to provide more relevant information (e.g. [PIANO MUSIC], [RAUCOUS APPLAUSE]) to viewers.
Dense Detections to Captions
When a video is uploaded to YouTube, the sound effect recognition pipeline runs on the audio stream in the video. The DNN looks at short segments of audio and predicts whether that segment contains any one of the sound events of interest – since multiple sound effects can co-occur, our model makes a prediction at each time step for each of the sound effects. The segment window is then slid to the right (i.e. a slightly later point in time) and the model is used to make a prediction again, and so on until it reaches the end. This results in a dense stream of the (likelihood of) presence of each of the sound events in our vocabulary at 100 frames per second.
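A simplified sketch of this sliding-window scoring, assuming the audio is a 1-D sample array; `model.predict` here is a stand-in for the trained DNN, which is assumed to map a window of samples to a vector of per-sound-effect probabilities:

```python
import numpy as np

def dense_predictions(audio, model, sr=16000, window_s=1.0, hop_s=0.01):
    """Slides a window over the audio and collects per-class probabilities.

    With a 10 ms hop this yields a prediction stream at 100 frames per second.
    """
    window, hop = int(window_s * sr), int(hop_s * sr)
    scores = []
    for start in range(0, max(len(audio) - window, 0) + 1, hop):
        scores.append(model.predict(audio[start:start + window]))
    return np.array(scores)   # shape: (num_frames, num_sound_classes)
```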
The dense prediction stream is not directly exposed to the user, of course, since that would result in captions flickering on and off, and because we know that a number of sound effects have some degree of temporal continuity when they occur; e.g. “music” and “applause” will usually be present for a few seconds at least. To incorporate this intuition, we smooth over the dense prediction stream using a modified Viterbi algorithm containing two states: ON and OFF, with the predicted segments for each sound effect corresponding to the ON state. The figure below provides an illustration of the process of going from the dense detections to the final segments determined to contain sound effects of interest.
(Left) The dense sequence of probabilities from our DNN for the occurrence over time of single sound category in a video. (Center) Binarized segments based on the modified Viterbi algorithm. (Right) The duration-based filter removes segments that are shorter in duration than desired for the class.
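The sketch below illustrates the idea with a generic two-state (ON/OFF) Viterbi-style smoother followed by the duration filter shown in the figure; the transition penalty and minimum-duration threshold are placeholders, not the production settings:

```python
import numpy as np

def smooth_on_off(probs, switch_penalty=5.0, min_frames=100):
    """Two-state (OFF=0 / ON=1) Viterbi-style smoothing of one class's
    per-frame probability stream, then a duration-based filter."""
    eps = 1e-6
    emit = np.stack([np.log(1.0 - probs + eps), np.log(probs + eps)], axis=1)
    T = len(probs)
    score, back = emit[0].copy(), np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        new_score = np.empty(2)
        for s in (0, 1):
            stay, switch = score[s], score[1 - s] - switch_penalty
            back[t, s] = s if stay >= switch else 1 - s
            new_score[s] = max(stay, switch) + emit[t, s]
        score = new_score
    # Backtrack the best ON/OFF state sequence.
    states = np.zeros(T, dtype=int)
    states[-1] = int(np.argmax(score))
    for t in range(T - 1, 0, -1):
        states[t - 1] = back[t, states[t]]
    # Duration filter: drop ON runs shorter than min_frames (e.g. 1 s at 100 fps).
    segments, start = [], None
    for t, s in enumerate(np.append(states, 0)):
        if s == 1 and start is None:
            start = t
        elif s == 0 and start is not None:
            if t - start >= min_frames:
                segments.append((start, t))
            start = None
    return segments
```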
A classification-based system such as this one will certainly have some errors, and needs to be able to trade off false positives against missed detections as per the product goals. Due to the weak labels in the training dataset, the model was often confused between events that tended to co-occur. For example, a segment labeled “laugh” would usually contain both speech and laughter, and the model for “laugh” would have a hard time distinguishing them in test data. In our system, we allow further restrictions based on time spent in the ON state (i.e. do not report sound X unless it was determined to be present for at least Y seconds) to push performance toward a desired point on the precision-recall curve.
Once we were satisfied with the performance of our system in temporally localizing sound effect captions based on our offline evaluation metrics, we were faced with the following: how do we combine the sound effect and speech captions to create a single automatic caption track, and how (or when) do we present sound effect information to the user to make it most useful to them?
Adding Sound Effect Information into the Automatic Captions Track
Once we had a system capable of accurately detecting and classifying the ambient sounds in a video, we investigated how to convey that information to the viewer in an effective way. In collaboration with our User Experience (UX) research teams, we explored various design options and tested them in a qualitative pilot usability study. The participants of the study had different hearing levels and varying needs for captions. We asked participants a number of questions, including whether the captions improved their overall experience and their ability to follow events in the video and extract relevant information from the caption track, in order to understand the effect of variables such as:
Using separate parts of the screen for speech and sound effect captions.
Interleaving the speech and sound effect captions as they occur.
Only showing sound effect captions at the end of sentences or when there is a pause in speech (even if they occurred in the middle of speech).
How hearing users perceive captions when watching with the sound off.
While it wasn’t surprising that almost all users appreciated the added sound effect information when it was accurate, we also paid specific attention to the feedback when the sound detection system made an error (a false positive when determining presence of a sound, or failing to detect an occurrence). This presented a surprising result: when sound effect information was incorrect, it did not detract from the participant’s experience in roughly 50% of the cases. Based upon participant feedback, the reasons for this appear to be:
Participants who could hear the audio were able to ignore the inaccuracies.
Participants who could not hear the audio interpreted the erroneous caption simply as the presence of a sound event, and did not feel they had missed out on critical speech information.
Overall, users reported that they would be fine with the system making the occasional mistake as long as it was able to provide good information far more often than not.
Looking Forward
Our work toward enabling automatic sound effect captions for YouTube videos and the initial rollout is a step toward making the richness of content in videos more accessible to our users who experience videos in different ways and in different environments that require captions. We’ve developed a framework to enrich the automatic caption track with sound effects, but there is still much to be done here. We hope that this will spur further work and discussion in the community around improving captions using not only automatic techniques, but also around ways to make creator-generated and community-contributed caption tracks richer (including perhaps, starting with the auto-captions) and better to further improve the viewing experience for our users.
An updated YouTube-8M, a video understanding challenge, and a CVPR workshop. Oh my!
Wednesday, February 15, 2017
Posted by Paul Natsev, Software Engineer
Last September, we released the YouTube-8M dataset, which spans millions of videos labeled with thousands of classes, in order to spur innovation and advancement in large-scale video understanding. More recently, other teams at Google have released datasets such as Open Images and YouTube-BoundingBoxes that, along with YouTube-8M, can be used to accelerate image and video understanding. To further these goals, today we are releasing an update to the YouTube-8M dataset, and in collaboration with Google Cloud Machine Learning and kaggle.com, we are also organizing a video understanding competition and an affiliated CVPR’17 Workshop.
An Updated YouTube-8M
The new and improved YouTube-8M includes cleaner and more verbose labels (twice as many labels per video, on average), a cleaned-up set of videos, and for the first time, the dataset includes pre-computed audio features, based on a state-of-the-art audio modeling architecture, in addition to the previously released visual features. The audio and visual features are synchronized in time, at 1-second temporal granularity, which makes YouTube-8M a large-scale multi-modal dataset, and opens up opportunities for exciting new research on joint audio-visual (temporal) modeling. Key statistics on the new version are illustrated below (more details here).
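For readers who want to experiment, a sketch of reading the frame-level records might look like the following; the feature keys and dimensions ('rgb' 1024-D, 'audio' 128-D, 8-bit quantized per second) follow the dataset documentation, while the file name is a placeholder and the dequantization rescale is omitted:

```python
import tensorflow as tf

def parse_example(serialized):
    contexts, features = tf.io.parse_single_sequence_example(
        serialized,
        context_features={'labels': tf.io.VarLenFeature(tf.int64)},
        sequence_features={'rgb': tf.io.FixedLenSequenceFeature([], tf.string),
                           'audio': tf.io.FixedLenSequenceFeature([], tf.string)})
    # Each per-second feature vector is stored as quantized bytes.
    rgb = tf.cast(tf.io.decode_raw(features['rgb'], tf.uint8), tf.float32)      # (T, 1024)
    audio = tf.cast(tf.io.decode_raw(features['audio'], tf.uint8), tf.float32)  # (T, 128)
    return contexts['labels'], rgb, audio

dataset = tf.data.TFRecordDataset(['train0000.tfrecord']).map(parse_example)
```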
A tree-map visualization of the updated YouTube-8M dataset, organized into 24 high-level verticals, including the top-200 most frequent entities, plus the top-5 entities for each vertical.
Sample videos from the top-18 high-level verticals in the YouTube-8M dataset.
The Google Cloud & YouTube-8M Video Understanding Challenge
We are also excited to announce the Google Cloud & YouTube-8M Video Understanding Challenge, in partnership with Google Cloud and kaggle.com. The challenge invites participants to build audio-visual content classification models using YouTube-8M as training data, and to then label ~700K unseen test videos. It will be hosted as a Kaggle competition, sponsored by Google Cloud, and will feature a $100,000 prize pool for the top performers (details here). In order to enable wider participation in the competition, Google Cloud is also offering credits so participants can optionally do model training and exploration using Google Cloud Machine Learning. Open-source TensorFlow code, implementing a few baseline classification models for YouTube-8M, along with training and evaluation scripts, is available on GitHub. For details on getting started with local or cloud-based training, please see our README and the getting started guide on Kaggle.
The CVPR 2017 Workshop on YouTube-8M Large-Scale Video Understanding
We will announce the results of the challenge and host invited talks by distinguished researchers at the 1st YouTube-8M Workshop, to be held July 26, 2017, at the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017) in Honolulu, Hawaii. The workshop will also feature presentations by top-performing challenge participants and a selected set of paper submissions. We invite researchers to submit papers describing novel research, experiments, or applications based on the YouTube-8M dataset, including papers summarizing their participation in the above challenge.
We designed this dataset with scale and diversity in mind, and hope lessons learned here will generalize to many video domains (YouTube-8M captures over 20 diverse video domains). We believe the challenge can also accelerate research by enabling researchers without access to big data or compute clusters to explore and innovate at previously unprecedented scale. Please join us in advancing video understanding!
Acknowledgements
This post reflects the work of many others within Machine Perception at Google Research, including Sami Abu-El-Haija, Anja Hauth, Nisarg Kothari, Joonseok Lee, Hanhan Li, Sobhan Naderi Parizi, Rahul Sukthankar, George Toderici, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan, Jiang Wang, as well as Philippe Poutonnet and Mike Styer from Google Cloud, and our partners at Kaggle. We are grateful for the support and advice from many others at Google Research, Google Cloud, and YouTube, and especially thank Aren Jansen, Jort Gemmeke, Dan Ellis, and the Google Research Sound Understanding team for providing the audio features in the updated dataset.
Advancing Research on Video Understanding with the YouTube-BoundingBoxes Dataset
Monday, February 06, 2017
Posted by Esteban Real, Vincent Vanhoucke, Jonathon Shlens, Google Brain team and Stefano Mazzocchi, Google Research
One of the most challenging research areas in machine learning today is enabling computers to understand what a scene is about. For example, while humans know that a ball that disappears behind a wall only to reappear a moment later is very likely the same object, this is not at all obvious to an algorithm. Understanding this requires not only a global picture of what objects are contained in each frame of a video, but also where those objects are located within the frame and their locations over time. Just last year we published YouTube-8M, a dataset consisting of automatically labelled YouTube videos. And while this helps further progress in the field, it is only one piece of the puzzle.
Today, in order to facilitate progress in video understanding research, we are introducing YouTube-BoundingBoxes, a dataset consisting of 5 million bounding boxes spanning 23 object categories, densely labeling segments from 210,000 YouTube videos. To date, this is the largest manually annotated video dataset containing bounding boxes, which track objects in temporally contiguous frames. The dataset is designed to be large enough to train large-scale models, and be representative of videos captured in natural settings. Importantly, the human-labelled annotations contain objects as they appear in the real world with partial occlusions, motion blur and natural lighting.
Summary of dataset statistics. Bar Chart: Relative number of detections in existing image (red) and video (blue) data sets. The YouTube-BoundingBoxes dataset (YT-BB) is at the bottom. Table: The three columns are counts for: classification annotations, bounding boxes, and unique videos with bounding boxes. Full details on the dataset can be found in the preprint.
A key feature of this dataset is that bounding box annotations are provided for entire video segments. These bounding box annotations may be used to train models that explicitly leverage this temporal information to identify, localize and track objects over time. In a video, individual annotated objects might become entirely occluded and later return in subsequent frames. These annotations of individual objects are sometimes not recognizable from individual frames, but can be understood and recognized in the context of the video if the objects are localized and tracked accurately.
Three video segments, sampled at 1 frame per second. The final frame of each example shows how it is visually challenging to recognize the bounded object, due to blur or occlusion (train example, blue arrow). However, temporally-related frames, where the object has been more clearly identified, can allow object classes to be inferred. Note how only visible parts are included in the box: the orange arrow in the bear example (middle row) points to the hidden head. The dog example illustrates tight bounding boxes that track the tail (orange arrows) and foot (blue arrows). The airplane example illustrates how partial objects are annotated (first frame) and tracked across changes in perspective, occlusions and camera cuts.
We hope that this dataset might ultimately aid the computer vision and machine learning community and lead to new methods for analyzing and understanding real world vision problems. You can learn more about the dataset in this associated preprint.
Acknowledgements
This work was greatly helped along by Xin Pan, Thomas Silva, Mir Shabber Ali Khan, Ashwin Kakarla and many others, as well as support and advice from Manfred Georg, Sami Abu-El-Haija, Susanna Ricco and George Toderici.
Announcing YouTube-8M: A Large and Diverse Labeled Video Dataset for Video Understanding Research
Wednesday, September 28, 2016
Posted by Sudheendra Vijayanarasimhan and Paul Natsev, Software Engineers
Many recent breakthroughs in machine learning and machine perception have come from the availability of large labeled datasets, such as ImageNet, which has millions of images labeled with thousands of classes. Their availability has significantly accelerated research in image understanding, for example on detecting and classifying objects in static images. Video analysis provides even more information for detecting and recognizing objects, and understanding human actions and interactions with the world. Improving video understanding can lead to better video search and discovery, similarly to how image understanding helped re-imagine the photos experience. However, one of the key bottlenecks for further advancements in this area has been the lack of real-world video datasets with the same scale and diversity as image datasets.
Today, we are excited to announce the release of YouTube-8M, a dataset of 8 million YouTube video URLs (representing over 500,000 hours of video), along with video-level labels from a diverse set of 4800 Knowledge Graph entities. This represents a significant increase in scale and diversity compared to existing video datasets. For example, Sports-1M, the largest existing labeled video dataset we are aware of, has around 1 million YouTube videos and 500 sports-specific classes--YouTube-8M represents nearly an order of magnitude increase in both number of videos and classes.
In order to construct a labeled video dataset of this scale, we needed to address two key challenges: (1) video is much more time-consuming to annotate manually than images, and (2) video is very computationally expensive to process and store. To overcome (1), we turned to YouTube and its video annotation system, which identifies relevant Knowledge Graph topics for all public YouTube videos. While these annotations are machine-generated, they incorporate powerful user engagement signals from millions of users as well as video metadata and content analysis. As a result, the quality of these annotations is sufficiently high to be useful for video understanding research and benchmarking purposes.
To ensure the stability and quality of the labeled video dataset, we used only public videos with more than 1000 views, and we constructed a diverse vocabulary of entities, which are visually observable and sufficiently frequent. The vocabulary construction was a combination of frequency analysis, automated filtering, verification by human raters that the entities are visually observable, and grouping into 24 top-level verticals (more details in our technical report). The figures below depict the dataset browser and the distribution of videos along the top-level verticals, and illustrate the dataset’s scale and diversity.
A dataset explorer allows browsing and searching the full vocabulary of Knowledge Graph entities, grouped in 24 top-level verticals, along with corresponding videos. This screenshot depicts a subset of dataset videos annotated with the entity “Guitar”.
The distribution of videos in the top-level verticals illustrates the scope and diversity of the dataset and reflects the natural distribution of popular YouTube videos.
To address (2), we had to overcome the storage and computational resource bottlenecks that researchers face when working with videos. Pursuing video understanding at YouTube-8M’s scale would normally require a petabyte of video storage and dozens of CPU-years worth of processing. To make the dataset useful to researchers and students with limited computational resources, we pre-processed the videos and extracted frame-level features using a state-of-the-art deep learning model--the publicly available Inception-V3 image annotation model trained on ImageNet. These features are extracted at 1 frame-per-second temporal resolution, from 1.9 billion video frames, and are further compressed to fit on a single commodity hard disk (less than 1.5 TB). This makes it possible to download this dataset and train a baseline TensorFlow model at full scale on a single GPU in less than a day!
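As an illustration of the general recipe (not the exact pipeline used to build the dataset, whose additional compression step is not shown), one could extract per-frame embeddings at 1 frame per second with an off-the-shelf Inception model along these lines:

```python
import numpy as np
import tensorflow as tf

# Stock Keras InceptionV3 as a stand-in feature extractor; frames are assumed
# to be uint8 RGB arrays already sampled at one frame per second.
inception = tf.keras.applications.InceptionV3(include_top=False, pooling='avg')

def frame_features(frames_one_per_second):
    x = tf.image.resize(np.stack(frames_one_per_second), (299, 299))
    x = tf.keras.applications.inception_v3.preprocess_input(x)
    return inception.predict(x)   # one ~2048-D embedding per sampled frame
```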
We believe this dataset can significantly accelerate research on video understanding, as it enables researchers and students without access to big data or big machines to do their research at previously unprecedented scale. We hope this dataset will spur exciting new research on video modeling architectures and representation learning, especially approaches that deal effectively with noisy or incomplete labels, transfer learning and domain adaptation. In fact, we show that pre-training models on this dataset and applying / fine-tuning on other external datasets leads to state-of-the-art performance on them (e.g. ActivityNet, Sports-1M). You can read all about our experiments using this dataset, along with more details on how we constructed it, in our technical report.
Improving YouTube video thumbnails with deep neural nets
Thursday, October 08, 2015
Posted by Weilong Yang and Min-hsuan Tsai, Video Content Analysis team and the YouTube Creator team
Video thumbnails are often the first things viewers see when they look for something interesting to watch. A strong, vibrant, and relevant thumbnail draws attention, giving viewers a quick preview of the content of the video, and helps them to find content more easily. Better thumbnails lead to more clicks and views for video creators.
Inspired by the recent remarkable advances of deep neural networks (DNNs) in computer vision, such as image and video classification, our team has recently launched an improved automatic YouTube "thumbnailer" in order to help creators showcase their video content. Here is how it works.
The Thumbnailer Pipeline
While a video is being uploaded to YouTube, we first sample frames from the video at one frame per second. Each sampled frame is evaluated by a quality model and assigned a single quality score. The frames with the highest scores are selected, enhanced and rendered as thumbnails with different sizes and aspect ratios. Among all the components, the quality model is the most critical and turned out to be the most challenging to develop. In the latest version of the thumbnailer algorithm, we used a DNN for the quality model. So, what is the quality model measuring, and how is the score calculated?
The main processing pipeline of the thumbnailer.
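Conceptually, the selection step boils down to something like the following sketch, where `quality_model.score` is a stand-in for the DNN scorer and `top_k` is an illustrative parameter rather than the production setting:

```python
def pick_thumbnails(frames_one_per_second, quality_model, top_k=3):
    """Score every sampled frame with the quality model and keep the
    highest-scoring ones for enhancement and rendering."""
    scored = [(quality_model.score(f), i) for i, f in enumerate(frames_one_per_second)]
    scored.sort(reverse=True)
    return [frames_one_per_second[i] for _, i in scored[:top_k]]
```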
(Training) The Quality Model
Unlike the task of identifying if a video contains your favorite animal, judging the visual quality of a video frame can be very subjective - people often have very different opinions and preferences when selecting frames as video thumbnails. One of the main challenges we faced was how to collect a large set of well-annotated training examples to feed into our neural network. Fortunately, on YouTube, in addition to having algorithmically generated thumbnails, many YouTube videos also come with carefully designed custom thumbnails uploaded by creators. Those thumbnails are typically well framed, in-focus, and center on a specific subject (e.g. the main character in the video). We consider these custom thumbnails from popular videos as positive (high-quality) examples, and randomly selected video frames as negative (low-quality) examples. Some examples of the training images are shown below.
Example training images.
The visual quality model essentially solves a problem we call "binary classification": given a frame, is it of high quality or not? We trained a DNN on this set using a similar architecture to the Inception network in GoogLeNet that achieved the top performance in the ImageNet 2014 competition.
Results
Compared to the previous automatically generated thumbnails, the DNN-powered model is able to select frames with much better quality. In a human evaluation, the thumbnails produced by our new models are preferred to those from the previous thumbnailer in more than 65% of side-by-side ratings. Here are some examples of how the new quality model performs on YouTube videos:
Example frames with low and high quality score from the DNN quality model, from video “Grand Canyon Rock Squirrel”.
Thumbnails generated by old vs. new thumbnailer algorithm.
We recently launched this new thumbnailer across YouTube, which means creators can start to choose from higher quality thumbnails generated by our new thumbnailer. Next time you see an awesome YouTube thumbnail, don’t hesitate to give it a thumbs up. ;)
Released Data Set: Features Extracted From YouTube Videos for Multiview Learning
Tuesday, November 26, 2013
Posted by Omid Madani, Senior Software Engineer
“If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.” - The “duck test”
Performance of machine learning algorithms, supervised or unsupervised, is often significantly enhanced when a variety of feature families, or multiple views of the data, are available. For example, in the case of web pages, one feature family can be based on the words appearing on the page, and another can be based on the URLs and related connectivity properties. Similarly, videos contain both audio and visual signals where in turn each modality is analyzed in a variety of ways. For instance, the visual stream can be analyzed based on the color and edge distribution, texture, motion, object types, and so on. YouTube videos are also associated with textual information (title, tags, comments, etc.). Each feature family complements others in providing predictive signals to accomplish a prediction or classification task, for example, in automatically classifying videos into subject areas such as sports, music, comedy, games, and so on.
We have released a dataset of over 100k feature vectors extracted from public YouTube videos. These videos are labeled with one of 30 classes, each class corresponding to a video game (with some amount of class noise): each video shows gameplay of a video game, for teaching purposes for example. Each instance (video) is described by three feature families (textual, visual, and auditory), and each family is broken into subfamilies yielding up to 13 feature types per instance. Neither video identities nor class identities are released.
We hope that this dataset will be valuable for research on a variety of multiview related machine learning topics, including multiview clustering, co-training, active learning, classifier fusion and ensembles.
The data and more information can be obtained from the UCI machine learning repository (multiview video dataset), or from here.
New Challenges in Computer Science Research
Friday, July 27, 2012
Posted by Jeff Walz, Head of University Relations
Yesterday afternoon at the 2012 Computer Science Faculty Summit, there was a round of lightning talks addressing some of the research problems faced by Google across several domains. The talks pointed out some of the biggest challenges emerging from increasing digital interaction, which is this year’s Faculty Summit theme.
Research Scientist Vivek Kwatra kicked things off with a talk about video stabilization on YouTube. The popularity of mobile devices with cameras has led to an explosion in the amount of video people capture, which can often be shaky. Vivek and his team have found algorithmic approaches to make casual videos look more professional by simulating professional camera moves. Their stabilization technology vastly improves the quality of amateur footage.
Next, Ed Chi (Research Scientist) talked about social media, focusing on the experimental circle model that characterizes Google+. Ed is particularly interested in how social interaction on the web can be designed to mimic live communication. Circles on Google+ allow a user to manage their audience and share content in a targeted fashion, which reflects face-to-face interaction. Ed discussed how, from an HCI perspective, the challenge going forward is the need to consider the trinity of social media: context, audience, content.
John Wilkes, Principal Software Engineer, talked about cluster management at Google and the challenges of building a new cluster manager -- that is, an operating system for a fleet of machines. Everything at Google is big, and a consequence of operating at such tremendous scale is that machines are bound to fail. John’s team is working to make things easier for internal users, enabling us to respond to more system requests. There are several hard problems in this domain, such as issues with configuration, making it as easy as possible to run a binary, increasing failure tolerance, and helping internal users understand their own needs as well as the behavior and performance of their system in our complicated distributed environment.
Research Scientist and coffee connoisseur Alon Halevy took to the podium to confirm that he did indeed author an empirical book on coffee, and also talked with attendees about structured data on the web. Structured data comprises hundreds of millions of (relatively small) tables of data, and Alon’s work is focused on enabling data enthusiasts to discover and visualize those data sets. Great possibilities open up when people start combining data sets in meaningful ways, which inspired the creation of Fusion Tables. An example is a map made in the aftermath of the 2011 earthquake and tsunami in Japan that shows natural disaster data alongside the locations of the world’s nuclear plants. Moving forward, Alon’s team will continue to think about interesting things that can be done with data, and the techniques needed to distinguish good data from bad data.
To wrap up the session, Praveen Paritosh did a brief but deep dive into the Knowledge Graph, an intelligent model that understands real-world entities and their relationships to one another -- things, not strings -- which launched earlier this year.
The Google Faculty Summit continued today with more talks, and breakout sessions centered on our theme of digital interaction. Check back for additional blog posts in the coming days.
Video Stabilization on YouTube
Friday, May 04, 2012
Posted by Matthias Grundmann, Vivek Kwatra, and Irfan Essa, Research at Google
One thing we have been working on within Research at Google is developing methods for making casual videos look more professional, thereby providing users with a better viewing experience. Professional videos have several characteristics that differentiate them from casually shot videos. For example, in order to tell a story, cinematographers carefully control lighting and exposure and use specialized equipment to plan camera movement.
We have developed a technique that mimics professional camera moves and applies them to videos recorded by hand-held devices. Cinematographers use specialized equipment such as tripods and dollies to plan their camera paths and hold them steady. In contrast, think of a video you shot using a mobile phone camera. How steady was your hand and were you able to anticipate an interesting moment and smoothly pan the camera to capture that moment? To bridge these differences, we propose an algorithm that automatically determines the best camera path and recasts the video as if it were filmed using stabilization equipment. Specifically, we divide the original, shaky camera path into a set of segments, each approximated by either a constant, linear or parabolic motion of the camera. Our optimization finds the best of all possible partitions using a computationally efficient and stable algorithm. For details, check out our earlier blog post or read our paper, Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths, published in IEEE CVPR 2011.
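To make the idea concrete, here is a 1-D toy version of that optimization using cvxpy; the weights and crop-window bound are placeholders, and the real method operates on full 2-D camera paths with additional constraints, so this is only a sketch of the L1 objective:

```python
import cvxpy as cp
import numpy as np

def smooth_camera_path(original_path, window=20.0, w1=10.0, w2=1.0, w3=100.0):
    """Penalize the first, second and third derivatives of the new path with L1
    norms (favoring constant, linear and parabolic segments) while keeping it
    within a crop window of the original, shaky path."""
    c = np.asarray(original_path, dtype=float)
    p = cp.Variable(len(c))
    objective = (w1 * cp.norm1(cp.diff(p, 1)) +
                 w2 * cp.norm1(cp.diff(p, 2)) +
                 w3 * cp.norm1(cp.diff(p, 3)))
    constraints = [cp.abs(p - c) <= window]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return p.value
```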
The next time you upload your videos to YouTube, try stabilizing them by going to the YouTube editor or directly from the video manager by clicking on Edit->Enhancements. For even more convenience, YouTube will automatically detect if your video needs stabilization and offer to do it for you. Many videos on YouTube have already been enhanced using this technology.
More recently, we have been working on a related problem common in videos shot from mobile phones. The camera sensors in these phones contain what is known as an electronic rolling shutter. When taking a picture with a rolling shutter camera, the image is not captured instantaneously. Instead, the camera captures the image one row of pixels at a time, with a small delay when going from one row to the next. Consequently, if the camera moves during capture, it will cause image distortions ranging from shear in the case of low-frequency motions (for instance an image captured from a driving car) to wobbly distortions in the case of high-frequency perturbations (think of a person walking while recording video). These distortions are especially noticeable in videos where the camera shake is independent across frames. For example, take a look at the video below.
Original video with rolling shutter distortions
In our recent paper titled Calibration-Free Rolling Shutter Removal, which was awarded the best paper award at IEEE ICCP 2012, we demonstrate a solution to correct these rolling shutter distortions in videos. A significant feature of our approach is that it does not require any knowledge of the camera used to shoot the video. The time delay in capturing two consecutive rows that we mention above is in fact different for every camera and affects the extent of the distortions. Having knowledge of this delay parameter can be useful, but it is difficult to obtain or estimate via calibration. Imagine a video that is already uploaded to YouTube -- it will be challenging to obtain this parameter! Instead, we show that just the visual data in the video has enough information to appropriately describe and compensate for the distortions caused by the camera motion, even in the presence of a rolling shutter. For more information, see the narrated video description of our paper.
This technique is already integrated with the YouTube stabilizer. Starting today, if you stabilize a video from a mobile phone or other rolling shutter cameras, we will also automatically compensate for rolling shutter distortions. To see our technique in action, check out the video below, obtained after applying rolling shutter compensation and stabilization to the one above.
After stabilization and rolling shutter removal
Gamification for Improved Search Ranking for YouTube Topics
Monday, March 19, 2012
Posted by Charles DuHadway and Sanketh Shetty, Google Research
In earlier posts we discussed automatic ways to find the most talented emerging singers and the funniest videos using the YouTube Slam experiment. We created five “house” slams -- music, dance, comedy, bizarre, and cute -- which produce a weekly leaderboard not just of videos but also of YouTubers who are great at predicting what the masses will like. For example, last week’s cute slam winning video claims to be the cutest kitten in the world, beating out four other kittens, two puppies, three toddlers and an amazing duck who feeds the fish. With a whopping 620 slam points, YouTube user emoatali99 was our best connoisseur of cute this week. On the music side, it is no surprise that many of music slam’s top 10 videos were Adele covers. A Whitney Houston cover came out at the top this week, and music slam’s resident expert on talent had more than a thousand slam points. Well done! Check out the rest of the leaderboards for cute slam and music slam.
Can slam-style game mechanics incentivize our users to help improve the ranking of videos -- not just for these five house slams -- but for millions of other search queries and topics on YouTube? Gamification has previously been used to incentivize users to participate in non-game tasks such as image labeling and music tagging. How many votes and voters would we need for slam to do better than the existing ranking algorithm for topic search on YouTube?
As an experiment, we created new slams for a small number of YouTube topics (such as Latte Art Slam and Speed Painting Slam) using the existing top 20 videos for these topics as the candidate pool. As we accumulated user votes, we evaluated the resulting YouTube Slam leaderboard for that topic against the existing ranking on youtube.com/topics (baseline). Note that both the slam leaderboard and the baseline had the same set of videos, just in a different order.
What did we discover? It was no surprise that slam ranking performance had high variance in the beginning and gradually improved as votes accumulated. We are happy to report that four of the five topic slams converged within 1000 votes with a better leaderboard ranking than the existing YouTube topic search. In spite of the small number of voters, Slam achieves better ranking partly because of gamification incentives and partly because it is based on machine learning, using (a toy illustration follows this list):
Preference judgement over a pair, not absolute judgement on a single video, and,
Active solicitation of user opinion as opposed to passive observation. Due to what is called a “cold start” problem in data modeling, conventional (passive observation) techniques don’t work well on new items with little prior information. For any given topic, Slam’s improvement over the baseline in ranking of the “recent 20” set of videos was in fact better than the improvement in ranking of the “top 20” set.
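As a toy illustration of turning pairwise preference votes into a leaderboard (the post does not spell out the actual ranking model, so this Elo-style update is only a stand-in for the machine-learned ranker):

```python
def update_ratings(ratings, winner, loser, k=32.0):
    """Tiny Elo-style update from one pairwise vote: the winner's rating rises
    and the loser's falls, more so when the outcome was a surprise."""
    expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
    ratings[winner] += k * (1.0 - expected)
    ratings[loser] -= k * (1.0 - expected)

# Example: ratings = {'video_a': 1500.0, 'video_b': 1500.0}
#          update_ratings(ratings, 'video_a', 'video_b')
```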
Demographics and interests of the voters do affect slam leaderboard ranking, especially when the voter pool is small. An example is a Romantic Proposals Slam we featured on Valentine’s Day last month. Men thought this proposal during a Kansas City Royals game was the most romantic, although this one where the man pretends to fall off a building came close. On the other hand, women rated this meme proposal in a restaurant as the best, followed by this movie theater proposal.
Encouraged by these results, we will soon be exploring slams for a few thousand topics to evaluate the utility of gamification techniques to YouTube topic search. Here are some of them: Chocolate Brownie, Paper Plane, Bush Flying, Stealth Technology, Stencil Graffiti, and Yosemite National Park.
Have fun slamming!
Quantifying comedy on YouTube: why the number of o’s in your LOL matter
Thursday, February 09, 2012
Posted by Sanketh Shetty, YouTube Slam Team, Google Research
In a previous post, we talked about quantification of musical talent using machine learning on acoustic features for YouTube Music Slam. We wondered if we could do the same for funny videos, i.e. answer questions such as: is a video funny, how funny do viewers think it is, and why is it funny? We noticed a few audiovisual patterns across comedy videos on YouTube, such as shaky camera motion or audible laughter, which we can automatically detect. While content-based features worked well for music, identifying humor based on just such features is AI-Complete. Humor preference is subjective, perhaps even more so than musical taste.
Fortunately, at YouTube, we have more to work with. We focused on videos uploaded in the comedy category. We captured the uploader’s belief in the funniness of their video via features based on title, description and tags. Viewers’ reactions, in the form of comments, further validate a video’s comedic value. To this end we computed more text features based on words associated with amusement in comments. These included (a) sounds associated with laughter such as hahaha, with culture-dependent variants such as hehehe, jajaja, kekeke, (b) web acronyms such as lol, lmao, rofl, (c) funny and synonyms of funny, and (d) emoticons such as :), ;-), xP. We then trained classifiers to identify funny videos and then tell us why they are funny by categorizing them into genres such as “funny pets”, “spoofs or parodies”, “standup”, “pranks”, and “funny commercials”.
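As an illustration, comment-level amusement counts could be computed with a handful of regular expressions like the ones below; the actual production feature set is richer and is not public, so both the patterns and the function name are only examples:

```python
import re

# Illustrative patterns for the four cue families described above:
# (a) laughter sounds, (b) web acronyms, (c) "funny" words, (d) emoticons.
LAUGH_PATTERNS = [
    r'\b(?:ha){2,}h?\b', r'\b(?:he){2,}h?\b', r'\b(?:ja){2,}\b', r'\b(?:ke){2,}\b',
    r'\bl+o+l+\b', r'\blmao+\b', r'\brofl+\b',
    r'\bfunny\b|\bhilarious\b',
    r'[:;]-?[)pP]|xP',
]

def amusement_features(comment):
    """Counts occurrences of each amusement cue in one viewer comment."""
    return [len(re.findall(p, comment, flags=re.IGNORECASE)) for p in LAUGH_PATTERNS]
```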
Next we needed an algorithm to rank these funny videos by comedic potential, e.g. is “Charlie bit my finger” funnier than “David after dentist”? Raw viewcount on its own is insufficient as a ranking metric since it is biased by video age and exposure. We noticed that viewers emphasize their reaction to funny videos in several ways: e.g. capitalization (LOL), elongation (loooooool), repetition (lolololol), exclamation (lolllll!!!!!), and combinations thereof. If a user uses an “loooooool” vs an “loool”, does it mean they were more amused? We designed features to quantify the degree of emphasis on words associated with amusement in viewer comments. We then trained a passive-aggressive ranking algorithm using human-annotated pairwise ground truth and a combination of text and audiovisual features. Similar to Music Slam, we used this ranker to populate candidates for human voting for our Comedy Slam.
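A sketch of such emphasis features for a single comment token follows; the specific features and their exact definitions here are illustrative, not the ones used in the ranker:

```python
import re

def emphasis_features(token):
    """Quantifies how emphatically an amusement word is written, e.g.
    'LOOOOOL!!!' vs 'lol': capitalization, letter elongation, 'lol'
    repetition and exclamation marks."""
    return {
        'caps_ratio': sum(c.isupper() for c in token) / max(len(token), 1),
        'elongation': max((len(m.group()) for m in re.finditer(r'(.)\1+', token)), default=0),
        'repetition': len(re.findall(r'(?i)lo+l', token)),
        'exclamation': token.count('!'),
    }
```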
So far, more than 75,000 people have cast more than 700,000 votes, making comedy our most popular slam category. Give it a try!
Further reading:
“Opinion Mining and Sentiment Analysis,” by Bo Pang and Lillian Lee.
“A Great Catchy Name: Semi-Supervised Recognition of Sarcastic Sentences in Online Product Reviews,” by Oren Tsur, Dmitry Davidov, and Ari Rappoport.
“That’s What She Said: Double Entendre Identification,” by Chloe Kiddon and Yuriy Brun.
Discovering Talented Musicians with Acoustic Analysis
Wednesday, November 02, 2011
Posted by Charles DuHadway, YouTube Slam Team, Google Research
In an earlier post we talked about the technology behind Instant Mix for Music Beta by Google. Instant Mix uses machine hearing to characterize music attributes such as timbre, mood and tempo. Today we would like to talk about acoustic and visual analysis -- this time on YouTube. A fundamental part of YouTube's mission is to allow anyone anywhere to showcase their talents -- occasionally leading to life-changing success -- but many talented performers are never discovered. Part of the problem is the sheer volume of videos: forty-eight hours of video are uploaded to YouTube every minute (that’s eight years of content every day). We wondered if we could use acoustic analysis and machine learning to pore over these videos and automatically identify talented musicians.
First we analyzed audio and visual features of videos being uploaded. We wanted to find “singing at home” videos -- often correlated with features such as ambient indoor lighting, head-and-shoulders view of a person singing in front of a fixed camera, few instruments and often a single dominant voice. Here’s a sample set of videos we found.
Then we estimated the quality of singing in each video. Our approach is based on acoustic analysis similar to that used by Instant Mix, coupled with a small set of singing quality annotations from human raters. Given these data we used machine learning to build a ranker that predicts if an average listener would like a performance.
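The post does not name the learning method, but one common way to build such a ranker from a small set of human preferences is to train a classifier on feature differences between pairs of performances; the sketch below illustrates that pattern with scikit-learn, and the placeholder acoustic features and model choice are assumptions, not the actual system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row stands in for a vector of acoustic features for one performance
# (e.g. pitch stability, timbre statistics); values here are random placeholders.
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 8))

# Human raters preferred video i over video j in each pair (hypothetical labels).
pairs = [(0, 1), (2, 3), (5, 4), (7, 6)]

# Turn each preference into a classification example on feature differences:
# label 1 for (preferred - other), label 0 for the reversed difference.
X = np.vstack([features[i] - features[j] for i, j in pairs] +
              [features[j] - features[i] for i, j in pairs])
y = np.array([1] * len(pairs) + [0] * len(pairs))

ranker = LogisticRegression().fit(X, y)

# Score new videos: a higher score means the model predicts listeners would like it more.
scores = features @ ranker.coef_.ravel()
print(np.argsort(-scores)[:5])  # indices of the top-ranked performances
```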
While machines are useful for weeding through thousands of not-so-great videos to find potential stars, we know they alone can't pick the next great star. So we turn to YouTube users to help us identify the real hidden gems by playing a voting game called YouTube Slam. We're putting an equal amount of effort into the game itself -- how do people vote? What makes it fun? How do we know when we have a true hit? We're looking forward to your feedback to help us refine this process: give it a try*. You can also check out singer and voter leaderboards. Toggle “All time” to “Last week” to find emerging talent in fresh videos or all-time favorites.
Our “Music Slam” has only been running for a few weeks and we have already found some very talented musicians. Many of the videos have fewer than 100 views when we find them.
And while we're excited about what we've done with music, there's as much undiscovered potential in almost any subject you can think of. Try our other slams: cute, bizarre, comedy, and dance*. Enjoy!
Related work by Google Researchers:
“Video2Text: Learning to Annotate Video Content”, Hrishikesh Aradhye, George Toderici, Jay Yagnik, ICDM Workshop on Internet Multimedia Mining, 2009.
* Music and dance slams are currently available only in the US.
Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths
Monday, June 20, 2011
Posted by Matthias Grundmann, Vivek Kwatra, and Irfan Essa, Research Team
Earlier this year, we announced the launch of new features on the YouTube Video Editor, including stabilization for shaky videos, with the ability to preview them in real-time. The core technology behind this feature is detailed in this paper, which will be presented at the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 2011).
Casually shot videos captured by handheld or mobile cameras suffer from a significant amount of shake. Existing in-camera stabilization methods dampen high-frequency jitter but do not suppress low-frequency movements and bounces, such as those observed in videos captured by a walking person. Professionally shot videos, on the other hand, usually rely on carefully designed camera configurations, specialized equipment such as tripods or camera dollies, and ease-in and ease-out transitions. Our goal was to devise a completely automatic method for converting casual shaky footage into more pleasant and professional-looking videos.
Our technique mimics the cinematographic principles outlined above by automatically determining the best camera path using a robust optimization technique. The original, shaky camera path is divided into a set of segments, each approximated by either constant, linear, or parabolic motion. Our optimization finds the best of all possible partitions using a computationally efficient and stable algorithm.
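To give a flavor of this kind of optimization, below is a minimal one-dimensional sketch using cvxpy: penalizing the L1 norms of the path's first, second, and third differences encourages exactly the constant, linear, and parabolic segments mentioned, while a bound keeps the smoothed path close to the original so the crop window stays inside the frame. The weights, bound, and function name are illustrative assumptions, not the formulation in the paper.

```python
import numpy as np
import cvxpy as cp

def stabilize_path(original, crop_bound=20.0, w1=10.0, w2=1.0, w3=100.0):
    """Toy 1-D L1 path optimization encouraging constant/linear/parabolic segments.

    `original` is the estimated per-frame camera position (e.g. horizontal shift);
    the smoothed path must stay within `crop_bound` pixels of it so the stabilized
    crop window never leaves the frame. Weights are illustrative only.
    """
    n = len(original)
    eye = np.eye(n)
    D1 = np.diff(eye, n=1, axis=0)   # first-difference operator (velocity)
    D2 = np.diff(eye, n=2, axis=0)   # second difference (acceleration)
    D3 = np.diff(eye, n=3, axis=0)   # third difference (jerk)
    p = cp.Variable(n)
    objective = cp.Minimize(w1 * cp.norm(D1 @ p, 1) +
                            w2 * cp.norm(D2 @ p, 1) +
                            w3 * cp.norm(D3 @ p, 1))
    constraints = [cp.abs(p - original) <= crop_bound]
    cp.Problem(objective, constraints).solve()
    return p.value

# Usage: a jittery walking motion is smoothed into a near-piecewise-linear path.
t = np.arange(120)
shaky = 0.5 * t + 15 * np.sin(t / 3.0) + np.random.default_rng(1).normal(0, 3, 120)
print(np.round(stabilize_path(shaky)[:5], 1))
```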
To achieve real-time performance on the web, we distribute the computation across multiple machines in the cloud. This enables us to provide users with a real-time preview and interactive control of the stabilized result. Above we provide a video demonstration of how to use this feature on the YouTube Editor. We will also demo this live at Google’s exhibition booth at CVPR 2011.
For further details, please read our paper.