Google Research Blog
The latest news from Research at Google
Mobile Real-time Video Segmentation
Thursday, March 01, 2018
Valentin Bazarevsky and Andrei Tkachenka, Software Engineers, Google Research
Video segmentation is a widely used technique that enables movie directors and video content creators to separate the foreground of a scene from the background, and treat them as two different visual layers. By modifying or replacing the background, creators can convey a particular mood, transport themselves to a fun location or enhance the impact of the message. However, this operation has traditionally been performed as a time-consuming manual process (e.g. an artist rotoscoping every frame) or requires a studio environment with a green screen for real-time background removal (a technique referred to as chroma keying). In order to enable users to create this effect live in the viewfinder, we designed a new technique that is suitable for mobile phones.
Today, we are excited to bring precise, real-time, on-device mobile video segmentation to the YouTube app by integrating this technology into stories. Currently in limited beta, stories is YouTube’s new lightweight video format, designed specifically for YouTube creators. Our new segmentation technology allows creators to replace and modify the background, effortlessly increasing videos’ production value without specialized equipment.
Neural network video segmentation in YouTube stories.
To achieve this, we leverage machine learning to solve a semantic segmentation task using convolutional neural networks. In particular, we designed a network architecture and training procedure suitable for mobile phones, focusing on the following requirements and constraints:
A mobile solution should be lightweight and run at least 10-30 times faster than existing state-of-the-art photo segmentation models. For real-time inference, such a model needs to provide results at 30 frames per second.
A video model should leverage temporal redundancy (neighboring frames look similar) and exhibit temporal consistency (neighboring results should be similar).
High quality segmentation results require high quality annotations.
The Dataset
To provide high quality data for our machine learning pipeline, we annotated tens of thousands of images that captured a wide spectrum of foreground poses and background settings. Annotations consisted of pixel-accurate locations of foreground elements such as hair, glasses, neck, skin, lips, etc. and a general background label achieving a cross-validation result of 98% Intersection-Over-Union (IOU) of human annotator quality.
An example image from our dataset carefully annotated with nine labels - foreground elements are overlaid over the image.
Network Input
Our specific segmentation task is to compute a binary mask separating foreground from background for every input frame (three channels, RGB) of the video. Achieving temporal consistency of the computed masks across frames is key. Current methods that utilize LSTMs or GRUs to realize this are too computationally expensive for real-time applications on mobile phones. Instead, we first pass the computed mask from the previous frame as a prior by concatenating it as a fourth channel to the current RGB input frame to achieve temporal consistency, as shown below:
The original frame (left) is separated in its three color channels and concatenated with the previous mask (middle). This is used as input to our neural network to predict the mask for the current frame (right).
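For concreteness, here is a minimal sketch of how such a four-channel input could be assembled, assuming the frame and mask are NumPy arrays; the helper name and shapes are illustrative, not the production code:

```python
import numpy as np

def make_network_input(frame_rgb, prev_mask=None):
    """Builds the 4-channel input: RGB frame plus the previous frame's mask.

    frame_rgb: float array of shape (H, W, 3), values in [0, 1].
    prev_mask: float array of shape (H, W), values in [0, 1], or None
               for the first frame (an empty prior).
    """
    h, w, _ = frame_rgb.shape
    if prev_mask is None:
        prev_mask = np.zeros((h, w), dtype=frame_rgb.dtype)
    # Concatenate the prior mask as a fourth channel.
    return np.concatenate([frame_rgb, prev_mask[..., None]], axis=-1)

# At inference time the predicted mask is fed back as the next frame's prior:
#   mask = model.predict(make_network_input(frame, mask))
```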
Training Procedure
In video segmentation we need to achieve frame-to-frame temporal continuity, while also accounting for temporal discontinuities such as people suddenly appearing in the field of view of the camera. To train our model to robustly handle those use cases, we transform the annotated ground truth of each photo in several ways and use it as a previous frame mask (a sketch of this sampling follows the list below):
Empty previous mask - Trains the network to work correctly for the first frame and new objects in the scene. This emulates the case of someone appearing in the camera's frame.
Affine transformed ground truth mask - Minor transformations train the network to propagate and adjust to the previous frame mask. Major transformations train the network to understand inadequate masks and discard them.
Transformed image - We implement thin plate spline smoothing of the original image to emulate fast camera movements and rotations.
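The following is a rough sketch of how such prior masks might be sampled during training, assuming NumPy/SciPy arrays; the thin plate spline warp of the image is omitted, and the function name, probabilities and jitter magnitudes are illustrative choices rather than the published procedure:

```python
import numpy as np
from scipy import ndimage

def sample_prior_mask(gt_mask, rng):
    """Randomly transforms a ground-truth mask to emulate a 'previous frame' prior."""
    if rng.random() < 0.2:
        # Empty prior: first frame, or a new person entering the scene.
        return np.zeros_like(gt_mask)
    # Affine-transformed prior: small jitter teaches propagation,
    # large jitter teaches the network to distrust a bad prior.
    scale = 1.0 + rng.uniform(-0.1, 0.1)
    angle = rng.uniform(-10, 10)            # degrees
    shift = rng.uniform(-5, 5, size=2)      # pixels
    rotated = ndimage.rotate(gt_mask, angle, reshape=False, order=1)
    warped = ndimage.affine_transform(rotated, np.eye(2) / scale, offset=shift, order=1)
    return np.clip(warped, 0.0, 1.0)

# Example usage: prior = sample_prior_mask(gt_mask, np.random.default_rng(0))
```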
Our real-time video segmentation in action.
Network Architecture
With that modified input/output, we build on a standard hourglass segmentation network architecture by adding the following improvements:
We use big convolution kernels with large strides of four and above to detect object features on the high-resolution RGB input frame. Convolutions for layers with a small number of channels (as is the case for the RGB input) are comparatively cheap, so using big kernels here has almost no effect on the computational cost.
For speed gains, we aggressively downsample using large strides combined with skip connections like U-Net to restore low-level features during upsampling. For our segmentation model this technique results in a significant improvement of 5% IOU compared to using no skip connections.
Hourglass segmentation network w/ skip connections.
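A toy Keras version of this idea (a big kernel with stride 4 on the four-channel input, then U-Net-style skips during upsampling) might look as follows; the layer counts and sizes are made up for illustration and do not reflect the production architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_segmentation_net(h=256, w=256):
    """Toy hourglass: aggressive strided downsampling + U-Net-style skips."""
    inp = layers.Input(shape=(h, w, 4))            # RGB + previous mask
    # Big kernel, large stride: cheap on a 4-channel input, shrinks resolution fast.
    e1 = layers.Conv2D(32, 7, strides=4, padding='same', activation='relu')(inp)
    e2 = layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(e1)
    e3 = layers.Conv2D(128, 3, strides=2, padding='same', activation='relu')(e2)
    # Decoder with skip connections restoring low-level detail.
    d2 = layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(e3)
    d2 = layers.Concatenate()([d2, e2])
    d1 = layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu')(d2)
    d1 = layers.Concatenate()([d1, e1])
    out = layers.Conv2DTranspose(1, 7, strides=4, padding='same', activation='sigmoid')(d1)
    return tf.keras.Model(inp, out)
```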
For even further speed gains, we optimized default ResNet bottlenecks. In the literature, authors tend to squeeze channels in the middle of the network by a factor of four (e.g. reducing 256 channels to 64 by using 64 different convolution kernels). However, we noticed that one can squeeze much more aggressively, by a factor of 16 or 32, without significant quality degradation.
ResNet bottleneck with large squeeze factor.
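In code, such an aggressively squeezed bottleneck could look like the sketch below; the channel counts are illustrative rather than the published configuration:

```python
from tensorflow.keras import layers

def bottleneck(x, channels=256, squeeze_factor=16):
    """Residual bottleneck that squeezes channels far more aggressively
    than the usual factor of 4 (e.g. 256 -> 16 -> 256)."""
    mid = max(channels // squeeze_factor, 1)
    y = layers.Conv2D(mid, 1, padding='same', activation='relu')(x)
    y = layers.Conv2D(mid, 3, padding='same', activation='relu')(y)
    y = layers.Conv2D(channels, 1, padding='same')(y)
    return layers.Activation('relu')(layers.Add()([x, y]))
```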
To refine and improve the accuracy of edges, we add several DenseNet layers on top of our network in full resolution, similar to neural matting. This technique improves overall model quality by only a slight 0.5% IOU; however, the perceptual quality of segmentation improves significantly.
The end result of these modifications is that our network runs remarkably fast on mobile devices, achieving 100+ FPS on iPhone 7 and 40+ FPS on Pixel 2 with high accuracy (realizing 94.8% IOU on our validation dataset), delivering a variety of smooth running and responsive effects in YouTube stories.
Our immediate goal is to use the limited rollout in YouTube stories to test our technology on this first set of effects. As we improve and expand our segmentation technology to more labels, we plan to integrate it into Google's broader Augmented Reality services.
Acknowledgements
A thank you to our team members who worked on the tech and this launch with us: Andrey Vakunov, Yury Kartynnik, Artsiom Ablavatski, Ivan Grishchenko, Matsvei Zhdanovich, Andrei Kulik, Camillo Lugaresi, John Kim, Ryan Bolyard, Wendy Huang, Michael Chang, Aaron La Lau, Willi Geiger, Tomer Margolin, John Nack and Matthias Grundmann.
Announcing AudioSet: A Dataset for Audio Event Research
Thursday, March 30, 2017
Posted by Dan Ellis, Research Scientist, Sound Understanding Team
Systems able to recognize sounds familiar to human listeners have a wide range of applications, from adding sound effect information to automatic video captions, to potentially allowing you to search videos for specific audio events. Building Deep Learning systems to do this relies heavily on both a large quantity of computing (often from highly parallel GPUs), and also – and perhaps more importantly – on significant amounts of accurately-labeled training data. However, research in environmental sound recognition is limited by currently available public datasets.
In order to address this, we recently released AudioSet, a collection of over 2 million ten-second YouTube excerpts labeled with a vocabulary of 527 sound event categories, with at least 100 examples for each category. Announced in our paper at the IEEE International Conference on Acoustics, Speech, and Signal Processing, AudioSet provides a common, realistic-scale evaluation task for audio event detection and a starting point for a comprehensive vocabulary of sound events, designed to advance research into audio event detection and recognition.
Developing an Ontology
When we started on this work last year, our first task was to define a vocabulary of sound classes that provided a consistent level of detail over the spectrum of sound events we planned to label. Defining this ontology was necessary to avoid problems of ambiguity and synonyms; without this, we might end up trying to differentiate “Timpani” from “Kettle drum”, or “Water tap” from “Faucet”. Although a number of scientists have looked at how humans organize sound events, the few existing ontologies proposed have been small and partial. To build our own, we searched the web for phrases like “Sounds, such as X and Y”, or “X, Y, and other sounds”. This gave us a list of sound-related words which we manually sorted into a hierarchy of over 600 sound event classes ranging from “Child speech” to “Ukulele” to “Boing”. To make our taxonomy as comprehensive as possible, we then looked at comparable lists of sound events (for instance, the Urban Sound Taxonomy) to add significant classes we may have missed and to merge classes that weren't well defined or well distinguished. You can explore our ontology here.
The top two levels of the AudioSet ontology.
From Ontology to Labeled Data
With our new ontology in hand, we were able to begin collecting human judgments of where the sound events occur. This, too, raises subtle problems: unlike the billions of well-composed photographs available online, people don’t typically produce “well-framed” sound recordings, much less provide them with captions. We decided to use 10 second sound snippets as our unit; anything shorter becomes very difficult to identify in isolation. We collected candidate snippets for each of our classes by taking random excerpts from YouTube videos whose metadata indicated they might contain the sound in question (“Dogs Barking for 10 Hours”). Each snippet was presented to a human labeler with a small set of category names to be confirmed (“Do you hear a Bark?”). Subsequently, we proposed snippets whose content was similar to examples that had already been manually verified to contain the class, thereby finding examples that were not discoverable from the metadata. Because some classes were much harder to find than others – particularly the onomatopoeia words like “Squish” and “Clink” – we adapted our segment proposal process to increase the sampling for those categories. For more details, see our paper on the matching technique.
AudioSet provides the URLs of each video excerpt along with the sound classes that the raters confirmed as present, as well as precalculated audio features from the same classifier used to generate audio features for the updated YouTube-8M Dataset. Below is a histogram of the number of examples per class:
The total number of videos for selected classes in AudioSet.
You can browse this data at the AudioSet website, which lets you view all 2 million excerpts for all the classes.
A few of the segments representing the class “Violin, fiddle”.
Quality Assessment
Once we had a substantial set of human ratings, we conducted an internal Quality Assessment task where, for most of the classes, we checked 10 examples of excerpts that the annotators had labeled with that class. This revealed a significant number of classes with inaccurate labeling: some, like “Dribble” (uneven water flow) and “Roll” (a hard object moving by rotating) had been systematically confused (as basketball dribbling and drum rolls, respectively); some such as “Patter” (footsteps of small animals) and “Sidetone” (background sound on a telephony channel) were too difficult to label and/or find examples for, even with our content-based matching. We also looked at the behavior of a classifier trained on the entire dataset and found a number of frequent confusions indicating separate classes that were not really distinct, such as “Jingle” and “Tinkle”.
To address these “problem” classes, we developed a re-rating process by labeling groups of classes that span (and thus highlight) common confusions, and created instructions for the labeler to be used with these sounds. This re-rating has led to multiple improvements – merged classes, expanded coverage, better descriptions – that we were able to incorporate in this release. This iterative process of labeling and assessment has been particularly effective in shaking out weaknesses in the ontology.
A Community Dataset
By releasing AudioSet, we hope to provide a common, realistic-scale evaluation task for audio event detection, as well as a starting point for a comprehensive vocabulary of sound events. We would like to see a vibrant sound event research community develop, including through external efforts such as the DCASE challenge series. We will continue to improve the size, coverage, and accuracy of this data and plan to make a second release in the coming months when our re-rating process is complete. We additionally encourage the research community to continue to refine our ontology, which we have open sourced on GitHub. We believe that with this common focus, sound event recognition will continue to advance and will allow machines to understand sounds similar to the way we do, enabling new and exciting applications.
Acknowledgments:
AudioSet is the work of Jort F. Gemmeke, Dan Ellis, Dylan Freedman, Shawn Hershey, Aren Jansen, Wade Lawrence, Channing Moore, Manoj Plakal, and Marvin Ritter, with contributions from Sami Abu-El-Haija, Sourish Chaudhuri, Victor Gomes, Nisarg Kothari, Dick Lyon, Sobhan Naderi Parizi, Paul Natsev, Brian Patton, Rif A. Saurous, Malcolm Slaney, Ron Weiss, and Kevin Wilson.
Adding Sound Effect Information to YouTube Captions
Thursday, March 23, 2017
Posted by Sourish Chaudhuri, Software Engineer, Sound Understanding
The effect of audio on our perception of the world can hardly be overstated. Its importance as a communication medium via speech is obviously the most familiar, but there is also significant information conveyed by ambient sounds. These ambient sounds create context that we instinctively respond to, like getting startled by sudden commotion, the use of music as a narrative element, or how laughter is used as an audience cue in sitcoms.
Since 2009, YouTube has provided automatic caption tracks for videos, focusing heavily on speech transcription in order to make the content hosted more accessible. However, without similar descriptions of the ambient sounds in videos, much of the information and impact of a video is not captured by speech transcription alone. To address this, we announced the addition of sound effect information to the automatic caption track in YouTube videos, enabling greater access to the richness of all the audio content.
In this post, we discuss the backend system developed for this effort, a collaboration among the Accessibility, Sound Understanding and YouTube teams that used machine learning (ML) to enable the first ever automatic sound effect captioning system for YouTube.
Click the CC button to see the sound effect captioning system in action.
The application of ML – in this case, a Deep Neural Network (DNN) model – to the captioning task presented unique challenges. While the process of analyzing the time-domain audio signal of a video to detect various ambient sounds is similar to other well known classification problems (such as object detection in images), in a product setting the solution faces additional difficulties. In particular, given an arbitrary segment of audio, we need our models to be able to 1) detect the desired sounds, 2) temporally localize the sound in the segment and 3) effectively integrate it in the caption track, which may have parallel and independent speech recognition results.
A DNN Model for Ambient Sound
The first challenge we faced in developing the model was the task of obtaining enough labeled data suitable for training our neural network. While labeled ambient sound information is difficult to come by, we were able to generate a large enough dataset for training using weakly labeled data. But of all the ambient sounds in a given video, which ones should we train our DNN to detect?
For the initial launch of this feature, we chose [APPLAUSE], [MUSIC] and [LAUGHTER], prioritized based upon our analysis of human-created caption tracks, which indicates that they are among the most frequently captioned sounds. While the sound space is obviously far richer and provides even more contextually relevant information than these three classes, the semantic information conveyed by these sound effects in the caption track is relatively unambiguous, as opposed to sounds like [RING], which raises the question of “what was it that rang – a bell, an alarm, a phone?”
Much of our initial work on detecting these ambient sounds also included developing the infrastructure and analysis frameworks to enable scaling for future work, including both the detection of sound events and their integration into the automatic caption track. Investing in the development of this infrastructure has the added benefit of allowing us to easily incorporate more sound types in the future, as we expand our algorithms to understand a wider vocabulary of sounds (e.g. [RING], [KNOCK], [BARK]). In doing so, we will be able to incorporate the detected sounds into the narrative to provide more relevant information (e.g. [PIANO MUSIC], [RAUCOUS APPLAUSE]) to viewers.
Dense Detections to Captions
When a video is uploaded to YouTube, the sound effect recognition pipeline runs on the audio stream in the video. The DNN looks at short segments of audio and predicts whether that segment contains any one of the sound events of interest – since multiple sound effects can co-occur, our model makes a prediction at each time step for each of the sound effects. The segment window is then slid to the right (i.e. a slightly later point in time) and the model is used to make a prediction again, and so on until it reaches the end. This results in a dense stream of the (likelihood of) presence of each of the sound events in our vocabulary at 100 frames per second.
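A simplified sketch of this sliding-window scoring, assuming the audio is a 1-D sample array; `model.predict` here is a stand-in for the trained DNN, which is assumed to map a window of samples to a vector of per-sound-effect probabilities:

```python
import numpy as np

def dense_predictions(audio, model, sr=16000, window_s=1.0, hop_s=0.01):
    """Slides a window over the audio and collects per-class probabilities.

    With a 10 ms hop this yields a prediction stream at 100 frames per second.
    """
    window, hop = int(window_s * sr), int(hop_s * sr)
    scores = []
    for start in range(0, max(len(audio) - window, 0) + 1, hop):
        scores.append(model.predict(audio[start:start + window]))
    return np.array(scores)   # shape: (num_frames, num_sound_classes)
```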
The dense prediction stream is not directly exposed to the user, of course, since that would result in captions flickering on and off, and because we know that a number of sound effects have some degree of temporal continuity when they occur; e.g. “music” and “applause” will usually be present for a few seconds at least. To incorporate this intuition, we smooth over the dense prediction stream using a modified Viterbi algorithm containing two states: ON and OFF, with the predicted segments for each sound effect corresponding to the ON state. The figure below provides an illustration of the process of going from the dense detections to the final segments determined to contain sound effects of interest.
(Left) The dense sequence of probabilities from our DNN for the occurrence over time of single sound category in a video. (Center) Binarized segments based on the modified Viterbi algorithm. (Right) The duration-based filter removes segments that are shorter in duration than desired for the class.
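The sketch below illustrates the idea with a generic two-state (ON/OFF) Viterbi-style smoother followed by the duration filter shown in the figure; the transition penalty and minimum-duration threshold are placeholders, not the production settings:

```python
import numpy as np

def smooth_on_off(probs, switch_penalty=5.0, min_frames=100):
    """Two-state (OFF=0 / ON=1) Viterbi-style smoothing of one class's
    per-frame probability stream, then a duration-based filter."""
    eps = 1e-6
    emit = np.stack([np.log(1.0 - probs + eps), np.log(probs + eps)], axis=1)
    T = len(probs)
    score, back = emit[0].copy(), np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        new_score = np.empty(2)
        for s in (0, 1):
            stay, switch = score[s], score[1 - s] - switch_penalty
            back[t, s] = s if stay >= switch else 1 - s
            new_score[s] = max(stay, switch) + emit[t, s]
        score = new_score
    # Backtrack the best ON/OFF state sequence.
    states = np.zeros(T, dtype=int)
    states[-1] = int(np.argmax(score))
    for t in range(T - 1, 0, -1):
        states[t - 1] = back[t, states[t]]
    # Duration filter: drop ON runs shorter than min_frames (e.g. 1 s at 100 fps).
    segments, start = [], None
    for t, s in enumerate(np.append(states, 0)):
        if s == 1 and start is None:
            start = t
        elif s == 0 and start is not None:
            if t - start >= min_frames:
                segments.append((start, t))
            start = None
    return segments
```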
A classification-based system such as this one will certainly have some errors, and needs to be able to trade off false positives against missed detections as per the product goals. Due to the weak labels in the training dataset, the model was often confused between events that tended to co-occur. For example, a segment labeled “laugh” would usually contain both speech and laughter, and the model for “laugh” would have a hard time distinguishing them in test data. In our system, we allow further restrictions based on time spent in the ON state (i.e. do not report sound X unless it was determined to be present for at least Y seconds) to push performance toward a desired point on the precision-recall curve.
Once we were satisfied with the performance of our system in temporally localizing sound effect captions based on our offline evaluation metrics, we were faced with the following: how do we combine the sound effect and speech captions to create a single automatic caption track, and how (or when) do we present sound effect information to the user to make it most useful to them?
Adding Sound Effect Information into the Automatic Captions Track
Once we had a system capable of accurately detecting and classifying the ambient sounds in a video, we investigated how to convey that information to the viewer in an effective way. In collaboration with our User Experience (UX) research teams, we explored various design options and tested them in a qualitative pilot usability study. The participants of the study had different hearing levels and varying needs for captions. We asked participants a number of questions, including whether the captions improved their overall experience and their ability to follow events in the video and extract relevant information from the caption track, in order to understand the effect of variables such as:
Using separate parts of the screen for speech and sound effect captions.
Interleaving the speech and sound effect captions as they occur.
Only showing sound effect captions at the end of sentences or when there is a pause in speech (even if they occurred in the middle of speech).
How hearing users perceive captions when watching with the sound off.
While it wasn’t surprising that almost all users appreciated the added sound effect information when it was accurate, we also paid specific attention to the feedback when the sound detection system made an error (a false positive when determining presence of a sound, or failing to detect an occurrence). This presented a surprising result: when sound effect information was incorrect, it did not detract from the participant’s experience in roughly 50% of the cases. Based upon participant feedback, the reasons for this appear to be:
Participants who could hear the audio were able to ignore the inaccuracies.
Participants who could not hear the audio interpreted the erroneous caption simply as the presence of a sound event, and did not feel they had missed out on critical speech information.
Overall, users reported that they would be fine with the system making the occasional mistake as long as it was able to provide good information far more often than not.
Looking Forward
Our work toward enabling automatic sound effect captions for YouTube videos and the initial rollout is a step toward making the richness of content in videos more accessible to our users who experience videos in different ways and in different environments that require captions. We’ve developed a framework to enrich the automatic caption track with sound effects, but there is still much to be done here. We hope that this will spur further work and discussion in the community around improving captions using not only automatic techniques, but also around ways to make creator-generated and community-contributed caption tracks richer (including perhaps, starting with the auto-captions) and better to further improve the viewing experience for our users.
An updated YouTube-8M, a video understanding challenge, and a CVPR workshop. Oh my!
Wednesday, February 15, 2017
Posted by Paul Natsev, Software Engineer
Last September, we released the YouTube-8M dataset, which spans millions of videos labeled with thousands of classes, in order to spur innovation and advancement in large-scale video understanding. More recently, other teams at Google have released datasets such as Open Images and YouTube-BoundingBoxes that, along with YouTube-8M, can be used to accelerate image and video understanding. To further these goals, today we are releasing an update to the YouTube-8M dataset, and in collaboration with Google Cloud Machine Learning and kaggle.com, we are also organizing a video understanding competition and an affiliated CVPR’17 Workshop.
An Updated YouTube-8M
The new and improved YouTube-8M includes cleaner and more verbose labels (twice as many labels per video, on average), a cleaned-up set of videos, and for the first time, the dataset includes pre-computed audio features, based on a state-of-the-art audio modeling architecture, in addition to the previously released visual features. The audio and visual features are synchronized in time, at 1-second temporal granularity, which makes YouTube-8M a large-scale multi-modal dataset, and opens up opportunities for exciting new research on joint audio-visual (temporal) modeling. Key statistics on the new version are illustrated below (more details here).
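For readers who want to experiment, a sketch of reading the frame-level records might look like the following; the feature keys and dimensions ('rgb' 1024-D, 'audio' 128-D, 8-bit quantized per second) follow the dataset documentation, while the file name is a placeholder and the dequantization rescale is omitted:

```python
import tensorflow as tf

def parse_example(serialized):
    contexts, features = tf.io.parse_single_sequence_example(
        serialized,
        context_features={'labels': tf.io.VarLenFeature(tf.int64)},
        sequence_features={'rgb': tf.io.FixedLenSequenceFeature([], tf.string),
                           'audio': tf.io.FixedLenSequenceFeature([], tf.string)})
    # Each per-second feature vector is stored as quantized bytes.
    rgb = tf.cast(tf.io.decode_raw(features['rgb'], tf.uint8), tf.float32)      # (T, 1024)
    audio = tf.cast(tf.io.decode_raw(features['audio'], tf.uint8), tf.float32)  # (T, 128)
    return contexts['labels'], rgb, audio

dataset = tf.data.TFRecordDataset(['train0000.tfrecord']).map(parse_example)
```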
A tree-map visualization of the updated YouTube-8M dataset, organized into 24 high-level verticals, including the top-200 most frequent entities, plus the top-5 entities for each vertical.
Sample videos from the top-18 high-level verticals in the YouTube-8M dataset.
The Google Cloud & YouTube-8M Video Understanding Challenge
We are also excited to announce the Google Cloud & YouTube-8M Video Understanding Challenge, in partnership with Google Cloud and kaggle.com. The challenge invites participants to build audio-visual content classification models using YouTube-8M as training data, and to then label ~700K unseen test videos. It will be hosted as a Kaggle competition, sponsored by Google Cloud, and will feature a $100,000 prize pool for the top performers (details here). In order to enable wider participation in the competition, Google Cloud is also offering credits so participants can optionally do model training and exploration using Google Cloud Machine Learning. Open-source TensorFlow code, implementing a few baseline classification models for YouTube-8M, along with training and evaluation scripts, is available on GitHub. For details on getting started with local or cloud-based training, please see our README and the getting started guide on Kaggle.
The CVPR 2017 Workshop on YouTube-8M Large-Scale Video Understanding
We will announce the results of the challenge and host invited talks by distinguished researchers at the 1st YouTube-8M Workshop, to be held July 26, 2017, at the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017) in Honolulu, Hawaii. The workshop will also feature presentations by top-performing challenge participants and a selected set of paper submissions. We invite researchers to submit papers describing novel research, experiments, or applications based on the YouTube-8M dataset, including papers summarizing their participation in the above challenge.
We designed this dataset with scale and diversity in mind, and hope lessons learned here will generalize to many video domains (YouTube-8M captures over 20 diverse video domains). We believe the challenge can also accelerate research by enabling researchers without access to big data or compute clusters to explore and innovate at previously unprecedented scale. Please join us in advancing video understanding!
Acknowledgements
This post reflects the work of many others within Machine Perception at Google Research, including Sami Abu-El-Haija, Anja Hauth, Nisarg Kothari, Joonseok Lee, Hanhan Li, Sobhan Naderi Parizi, Rahul Sukthankar, George Toderici, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan, Jiang Wang, as well as Philippe Poutonnet and Mike Styer from Google Cloud, and our partners at Kaggle. We are grateful for the support and advice from many others at Google Research, Google Cloud, and YouTube, and especially thank Aren Jansen, Jort Gemmeke, Dan Ellis, and the Google Research Sound Understanding team for providing the audio features in the updated dataset.
Advancing Research on Video Understanding with the YouTube-BoundingBoxes Dataset
Monday, February 06, 2017
Posted by Esteban Real, Vincent Vanhoucke, Jonathon Shlens, Google Brain team and Stefano Mazzocchi, Google Research
One of the most challenging research areas in machine learning today is enabling computers to understand what a scene is about. For example, while humans know that a ball that disappears behind a wall only to reappear a moment later is very likely the same object, this is not at all obvious to an algorithm. Understanding this requires not only a global picture of what objects are contained in each frame of a video, but also where those objects are located within the frame and their locations over time. Just last year we published YouTube-8M, a dataset consisting of automatically labelled YouTube videos. And while this helps further progress in the field, it is only one piece of the puzzle.
Today, in order to facilitate progress in video understanding research, we are introducing YouTube-BoundingBoxes, a dataset consisting of 5 million bounding boxes spanning 23 object categories, densely labeling segments from 210,000 YouTube videos. To date, this is the largest manually annotated video dataset containing bounding boxes, which track objects in temporally contiguous frames. The dataset is designed to be large enough to train large-scale models, and be representative of videos captured in natural settings. Importantly, the human-labelled annotations contain objects as they appear in the real world with partial occlusions, motion blur and natural lighting.
Summary of dataset statistics. Bar Chart: Relative number of detections in existing image (red) and video (blue) data sets. The YouTube-BoundingBoxes dataset (YT-BB) is at the bottom. Table: The three columns are counts for: classification annotations, bounding boxes, and unique videos with bounding boxes. Full details on the dataset can be found in the preprint.
A key feature of this dataset is that bounding box annotations are provided for entire video segments. These bounding box annotations may be used to train models that explicitly leverage this temporal information to identify, localize and track objects over time. In a video, individual annotated objects might become entirely occluded and later return in subsequent frames. These annotations of individual objects are sometimes not recognizable from individual frames, but can be understood and recognized in the context of the video if the objects are localized and tracked accurately.
Three video segments, sampled at 1 frame per second. The final frame of each example shows how it is visually challenging to recognize the bounded object, due to blur or occlusion (train example, blue arrow). However, temporally-related frames, where the object has been more clearly identified, can allow object classes to be inferred. Note how only visible parts are included in the box: the orange arrow in the bear example (middle row) points to the hidden head. The dog example illustrates tight bounding boxes that track the tail (orange arrows) and foot (blue arrows). The airplane example illustrates how partial objects are annotated (first frame) and tracked across changes in perspective, occlusions and camera cuts.
We hope that this dataset might ultimately aid the computer vision and machine learning community and lead to new methods for analyzing and understanding real world vision problems. You can learn more about the dataset in this associated preprint.
Acknowledgements
This work was greatly helped along by Xin Pan, Thomas Silva, Mir Shabber Ali Khan, Ashwin Kakarla and many others, as well as support and advice from Manfred Georg, Sami Abu-El-Haija, Susanna Ricco and George Toderici.
Announcing YouTube-8M: A Large and Diverse Labeled Video Dataset for Video Understanding Research
Wednesday, September 28, 2016
Posted by Sudheendra Vijayanarasimhan and Paul Natsev, Software Engineers
Many recent breakthroughs in machine learning and machine perception have come from the availability of large labeled datasets, such as ImageNet, which has millions of images labeled with thousands of classes. Their availability has significantly accelerated research in image understanding, for example on detecting and classifying objects in static images. Video analysis provides even more information for detecting and recognizing objects, and understanding human actions and interactions with the world. Improving video understanding can lead to better video search and discovery, similarly to how image understanding helped re-imagine the photos experience. However, one of the key bottlenecks for further advancements in this area has been the lack of real-world video datasets with the same scale and diversity as image datasets.
Today, we are excited to announce the release of YouTube-8M, a dataset of 8 million YouTube video URLs (representing over 500,000 hours of video), along with video-level labels from a diverse set of 4800 Knowledge Graph entities. This represents a significant increase in scale and diversity compared to existing video datasets. For example, Sports-1M, the largest existing labeled video dataset we are aware of, has around 1 million YouTube videos and 500 sports-specific classes--YouTube-8M represents nearly an order of magnitude increase in both number of videos and classes.
In order to construct a labeled video dataset of this scale, we needed to address two key challenges: (1) video is much more time-consuming to annotate manually than images, and (2) video is very computationally expensive to process and store. To overcome (1), we turned to YouTube and its video annotation system, which identifies relevant Knowledge Graph topics for all public YouTube videos. While these annotations are machine-generated, they incorporate powerful user engagement signals from millions of users as well as video metadata and content analysis. As a result, the quality of these annotations is sufficiently high to be useful for video understanding research and benchmarking purposes.
To ensure the stability and quality of the labeled video dataset, we used only public videos with more than 1000 views, and we constructed a diverse vocabulary of entities, which are visually observable and sufficiently frequent. The vocabulary construction was a combination of frequency analysis, automated filtering, verification by human raters that the entities are visually observable, and grouping into 24 top-level verticals (more details in our technical report). The figures below depict the dataset browser and the distribution of videos along the top-level verticals, and illustrate the dataset’s scale and diversity.
A dataset explorer allows browsing and searching the full vocabulary of Knowledge Graph entities, grouped in 24 top-level verticals, along with corresponding videos. This screenshot depicts a subset of dataset videos annotated with the entity “Guitar”.
The distribution of videos in the top-level verticals illustrates the scope and diversity of the dataset and reflects the natural distribution of popular YouTube videos.
To address (2), we had to overcome the storage and computational resource bottlenecks that researchers face when working with videos. Pursuing video understanding at YouTube-8M’s scale would normally require a petabyte of video storage and dozens of CPU-years worth of processing. To make the dataset useful to researchers and students with limited computational resources, we pre-processed the videos and extracted frame-level features using a state-of-the-art deep learning model--the publicly available Inception-V3 image annotation model trained on ImageNet. These features are extracted at 1 frame-per-second temporal resolution, from 1.9 billion video frames, and are further compressed to fit on a single commodity hard disk (less than 1.5 TB). This makes it possible to download this dataset and train a baseline TensorFlow model at full scale on a single GPU in less than a day!
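As an illustration of the general recipe (not the exact pipeline used to build the dataset, whose additional compression step is not shown), one could extract per-frame embeddings at 1 frame per second with an off-the-shelf Inception model along these lines:

```python
import numpy as np
import tensorflow as tf

# Stock Keras InceptionV3 as a stand-in feature extractor; frames are assumed
# to be uint8 RGB arrays already sampled at one frame per second.
inception = tf.keras.applications.InceptionV3(include_top=False, pooling='avg')

def frame_features(frames_one_per_second):
    x = tf.image.resize(np.stack(frames_one_per_second), (299, 299))
    x = tf.keras.applications.inception_v3.preprocess_input(x)
    return inception.predict(x)   # one ~2048-D embedding per sampled frame
```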
We believe this dataset can significantly accelerate research on video understanding, as it enables researchers and students without access to big data or big machines to do their research at previously unprecedented scale. We hope this dataset will spur exciting new research on video modeling architectures and representation learning, especially approaches that deal effectively with noisy or incomplete labels, transfer learning and domain adaptation. In fact, we show that pre-training models on this dataset and applying / fine-tuning on other external datasets leads to state-of-the-art performance on them (e.g. ActivityNet, Sports-1M). You can read all about our experiments using this dataset, along with more details on how we constructed it, in our technical report.
Improving YouTube video thumbnails with deep neural nets
Thursday, October 08, 2015
Posted by Weilong Yang and Min-hsuan Tsai, Video Content Analysis team and the YouTube Creator team
Video thumbnails are often the first things viewers see when they look for something interesting to watch. A strong, vibrant, and relevant thumbnail draws attention, giving viewers a quick preview of the content of the video, and helps them to find content more easily. Better thumbnails lead to more clicks and views for video creators.
Inspired by the recent remarkable advances of deep neural networks (DNNs) in computer vision, such as image and video classification, our team has recently launched an improved automatic YouTube "thumbnailer" in order to help creators showcase their video content. Here is how it works.
The Thumbnailer Pipeline
While a video is being uploaded to YouTube, we first sample frames from the video at one frame per second. Each sampled frame is evaluated by a quality model and assigned a single quality score. The frames with the highest scores are selected, enhanced and rendered as thumbnails with different sizes and aspect ratios. Among all the components, the quality model is the most critical and turned out to be the most challenging to develop. In the latest version of the thumbnailer algorithm, we used a DNN for the quality model. So, what is the quality model measuring, and how is the score calculated?
The main processing pipeline of the thumbnailer.
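Conceptually, the selection step boils down to something like the following sketch, where `quality_model.score` is a stand-in for the DNN scorer and `top_k` is an illustrative parameter rather than the production setting:

```python
def pick_thumbnails(frames_one_per_second, quality_model, top_k=3):
    """Score every sampled frame with the quality model and keep the
    highest-scoring ones for enhancement and rendering."""
    scored = [(quality_model.score(f), i) for i, f in enumerate(frames_one_per_second)]
    scored.sort(reverse=True)
    return [frames_one_per_second[i] for _, i in scored[:top_k]]
```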
(Training) The Quality Model
Unlike the task of identifying if a video contains your favorite animal, judging the visual quality of a video frame can be very subjective - people often have very different opinions and preferences when selecting frames as video thumbnails. One of the main challenges we faced was how to collect a large set of well-annotated training examples to feed into our neural network. Fortunately, on YouTube, in addition to having algorithmically generated thumbnails, many YouTube videos also come with carefully designed custom thumbnails uploaded by creators. Those thumbnails are typically well framed, in-focus, and center on a specific subject (e.g. the main character in the video). We consider these custom thumbnails from popular videos as positive (high-quality) examples, and randomly selected video frames as negative (low-quality) examples. Some examples of the training images are shown below.
Example training images.
The visual quality model essentially solves a problem we call "binary classification": given a frame, is it of high quality or not? We trained a DNN on this set using a similar architecture to the Inception network in GoogLeNet that achieved the top performance in the ImageNet 2014 competition.
Results
Compared to the previous automatically generated thumbnails, the DNN-powered model is able to select frames with much better quality. In a human evaluation, the thumbnails produced by our new models are preferred to those from the previous thumbnailer in more than 65% of side-by-side ratings. Here are some examples of how the new quality model performs on YouTube videos:
Example frames with low and high quality score from the DNN quality model, from video “Grand Canyon Rock Squirrel”.
Thumbnails generated by old vs. new thumbnailer algorithm.
We recently launched this new thumbnailer across YouTube, which means creators can start to choose from higher quality thumbnails generated by our new thumbnailer. Next time you see an awesome YouTube thumbnail, don’t hesitate to give it a thumbs up. ;)
Released Data Set: Features Extracted From YouTube Videos for Multiview Learning
Tuesday, November 26, 2013
Posted by Omid Madani, Senior Software Engineer
“If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.” - The “duck test”
Performance of machine learning algorithms, supervised or unsupervised, is often significantly enhanced when a variety of feature families, or multiple views of the data, are available. For example, in the case of web pages, one feature family can be based on the words appearing on the page, and another can be based on the URLs and related connectivity properties. Similarly, videos contain both audio and visual signals where in turn each modality is analyzed in a variety of ways. For instance, the visual stream can be analyzed based on the color and edge distribution, texture, motion, object types, and so on. YouTube videos are also associated with textual information (title, tags, comments, etc.). Each feature family complements others in providing predictive signals to accomplish a prediction or classification task, for example, in automatically classifying videos into subject areas such as sports, music, comedy, games, and so on.
We have released a dataset of over 100k feature vectors extracted from public YouTube videos. These videos are labeled with one of 30 classes, each class corresponding to a video game (with some amount of class noise): each video shows gameplay of a video game, for teaching purposes for example. Each instance (video) is described by three feature families (textual, visual, and auditory), and each family is broken into subfamilies yielding up to 13 feature types per instance. Neither video identities nor class identities are released.
We hope that this dataset will be valuable for research on a variety of multiview related machine learning topics, including multiview clustering, co-training, active learning, classifier fusion and ensembles.
The data and more information can be obtained from the UCI machine learning repository (multiview video dataset), or from here.
New Challenges in Computer Science Research
Friday, July 27, 2012
Posted by Jeff Walz, Head of University Relations
Yesterday afternoon at the 2012 Computer Science Faculty Summit, there was a round of lightning talks addressing some of the research problems faced by Google across several domains. The talks pointed out some of the biggest challenges emerging from increasing digital interaction, which is this year’s Faculty Summit theme.
Research Scientist Vivek Kwatra kicked things off with a talk about video stabilization on YouTube. The popularity of mobile devices with cameras has led to an explosion in the amount of video people capture, which can often be shaky. Vivek and his team have found algorithmic approaches to make casual videos look more professional by simulating professional camera moves. Their stabilization technology vastly improves the quality of amateur footage.
Next, Ed Chi (Research Scientist) talked about social media, focusing on the experimental circle model that characterizes Google+. Ed is particularly interested in how social interaction on the web can be designed to mimic live communication. Circles on Google+ allow a user to manage their audience and share content in a targeted fashion, which reflects face-to-face interaction. Ed discussed how, from an HCI perspective, the challenge going forward is the need to consider the trinity of social media: context, audience, content.
John Wilkes, Principal Software Engineer, talked about cluster management at Google and the challenges of building a new cluster manager -- that is, an operating system for a fleet of machines. Everything at Google is big, and a consequence of operating at such tremendous scale is that machines are bound to fail. John’s team is working to make things easier for internal users, enabling us to respond to more system requests. There are several hard problems in this domain, such as issues with configuration, making it as easy as possible to run a binary, increasing failure tolerance, and helping internal users understand their own needs as well as the behavior and performance of their system in our complicated distributed environment.
Research Scientist and coffee connoisseur Alon Halevy took to the podium to confirm that he did indeed author an empirical book on coffee, and also talked with attendees about structured data on the web. Structured data comprises hundreds of millions of (relatively small) tables of data, and Alon’s work is focused on enabling data enthusiasts to discover and visualize those data sets. Great possibilities open up when people start combining data sets in meaningful ways, which inspired the creation of Fusion Tables. An example is a map made in the aftermath of the 2011 earthquake and tsunami in Japan that shows natural disaster data alongside the locations of the world’s nuclear plants. Moving forward, Alon’s team will continue to think about interesting things that can be done with data, and the techniques needed to distinguish good data from bad data.
To wrap up the session, Praveen Paritosh did a brief but deep dive into the Knowledge Graph, an intelligent model that understands real-world entities and their relationships to one another -- things, not strings -- which launched earlier this year.
The Google Faculty Summit continued today with more talks, and breakout sessions centered on our theme of digital interaction. Check back for additional blog posts in the coming days.
Video Stabilization on YouTube
Friday, May 04, 2012
Posted by Matthias Grundmann, Vivek Kwatra, and Irfan Essa, Research at Google
One thing we have been working on within Research at Google is developing methods for making casual videos look more professional, thereby providing users with a better viewing experience. Professional videos have several characteristics that differentiate them from casually shot videos. For example, in order to tell a story, cinematographers carefully control lighting and exposure and use specialized equipment to plan camera movement.
We have developed a technique that mimics professional camera moves and applies them to videos recorded by hand-held devices. Cinematographers use specialized equipment such as tripods and dollies to plan their camera paths and hold them steady. In contrast, think of a video you shot using a mobile phone camera. How steady was your hand and were you able to anticipate an interesting moment and smoothly pan the camera to capture that moment? To bridge these differences, we propose an algorithm that automatically determines the best camera path and recasts the video as if it were filmed using stabilization equipment. Specifically, we divide the original, shaky camera path into a set of segments, each approximated by either a constant, linear or parabolic motion of the camera. Our optimization finds the best of all possible partitions using a computationally efficient and stable algorithm. For details, check out our earlier blog post or read our paper, Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths, published in IEEE CVPR 2011.
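To make the idea concrete, here is a 1-D toy version of that optimization using cvxpy; the weights and crop-window bound are placeholders, and the real method operates on full 2-D camera paths with additional constraints, so this is only a sketch of the L1 objective:

```python
import cvxpy as cp
import numpy as np

def smooth_camera_path(original_path, window=20.0, w1=10.0, w2=1.0, w3=100.0):
    """Penalize the first, second and third derivatives of the new path with L1
    norms (favoring constant, linear and parabolic segments) while keeping it
    within a crop window of the original, shaky path."""
    c = np.asarray(original_path, dtype=float)
    p = cp.Variable(len(c))
    objective = (w1 * cp.norm1(cp.diff(p, 1)) +
                 w2 * cp.norm1(cp.diff(p, 2)) +
                 w3 * cp.norm1(cp.diff(p, 3)))
    constraints = [cp.abs(p - c) <= window]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return p.value
```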
The next time you upload your videos to YouTube, try stabilizing them by going to the YouTube editor or directly from the video manager by clicking on Edit->Enhancements. For even more convenience, YouTube will automatically detect if your video needs stabilization and offer to do it for you. Many videos on YouTube have already been enhanced using this technology.
More recently, we have been working on a related problem common in videos shot from mobile phones. The camera sensors in these phones contain what is known as an electronic rolling shutter. When taking a picture with a rolling shutter camera, the image is not captured instantaneously. Instead, the camera captures the image one row of pixels at a time, with a small delay when going from one row to the next. Consequently, if the camera moves during capture, it will cause image distortions ranging from shear in the case of low-frequency motions (for instance an image captured from a driving car) to wobbly distortions in the case of high-frequency perturbations (think of a person walking while recording video). These distortions are especially noticeable in videos where the camera shake is independent across frames. For example, take a look at the video below.
Original video with rolling shutter distortions
In our recent paper titled Calibration-Free Rolling Shutter Removal, which was awarded the best paper award at IEEE ICCP 2012, we demonstrate a solution to correct these rolling shutter distortions in videos. A significant feature of our approach is that it does not require any knowledge of the camera used to shoot the video. The time delay in capturing two consecutive rows that we mention above is in fact different for every camera and affects the extent of the distortions. Having knowledge of this delay parameter can be useful, but it is difficult to obtain or estimate via calibration. Imagine a video that is already uploaded to YouTube -- it will be challenging to obtain this parameter! Instead, we show that just the visual data in the video has enough information to appropriately describe and compensate for the distortions caused by the camera motion, even in the presence of a rolling shutter. For more information, see the narrated video description of our paper.
This technique is already integrated with the YouTube stabilizer. Starting today, if you stabilize a video from a mobile phone or other rolling shutter cameras, we will also automatically compensate for rolling shutter distortions. To see our technique in action, check out the video below, obtained after applying rolling shutter compensation and stabilization to the one above.
After stabilization and rolling shutter removal
Gamification for Improved Search Ranking for YouTube Topics
Monday, March 19, 2012
Posted by Charles DuHadway and Sanketh Shetty, Google Research
In earlier posts we discussed automatic ways to find the most talented emerging singers and the funniest videos using the YouTube Slam experiment. We created five “house” slams -- music, dance, comedy, bizarre, and cute -- which produce a weekly leaderboard not just of videos but also of YouTubers who are great at predicting what the masses will like. For example, last week’s cute slam winning video claims to be the cutest kitten in the world, beating out four other kittens, two puppies, three toddlers and an amazing duck who feeds the fish. With a whopping 620 slam points, YouTube user emoatali99 was our best connoisseur of cute this week. On the music side, it is no surprise that many of music slam’s top 10 videos were Adele covers. A Whitney Houston cover came out at the top this week, and music slam’s resident expert on talent had more than a thousand slam points. Well done! Check out the rest of the leaderboards for cute slam and music slam.
Can slam-style game mechanics incentivize our users to help improve the ranking of videos -- not just for these five house slams -- but for millions of other search queries and topics on YouTube? Gamification has previously been used to incentivize users to participate in non-game tasks such as image labeling and music tagging. How many votes and voters would we need for slam to do better than the existing ranking algorithm for topic search on YouTube?
As an experiment, we created new slams for a small number of YouTube topics (such as Latte Art Slam and Speed Painting Slam) using the existing top 20 videos for these topics as the candidate pool. As we accumulated user votes, we evaluated the resulting YouTube Slam leaderboard for that topic against the existing ranking on youtube.com/topics (baseline). Note that both the slam leaderboard and the baseline had the same set of videos, just in a different order.
What did we discover? It was no surprise that slam ranking performance had high variance in the beginning and gradually improved as votes accumulated. We are happy to report that four of the five topic slams converged within 1000 votes with a better leaderboard ranking than the existing YouTube topic search. In spite of the small number of voters, Slam achieves better ranking partly because of gamification incentives and partly because it is based on machine learning, using (a toy illustration follows this list):
Preference judgement over a pair, not absolute judgement on a single video, and,
Active solicitation of user opinion as opposed to passive observation. Due to what is called a “cold start” problem in data modeling, conventional (passive observation) techniques don’t work well on new items with little prior information. For any given topic, Slam’s improvement over the baseline in ranking of the “recent 20” set of videos was in fact better than the improvement in ranking of the “top 20” set.
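As a toy illustration of turning pairwise preference votes into a leaderboard (the post does not spell out the actual ranking model, so this Elo-style update is only a stand-in for the machine-learned ranker):

```python
def update_ratings(ratings, winner, loser, k=32.0):
    """Tiny Elo-style update from one pairwise vote: the winner's rating rises
    and the loser's falls, more so when the outcome was a surprise."""
    expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
    ratings[winner] += k * (1.0 - expected)
    ratings[loser] -= k * (1.0 - expected)

# Example: ratings = {'video_a': 1500.0, 'video_b': 1500.0}
#          update_ratings(ratings, 'video_a', 'video_b')
```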
Demographics and interests of the voters do affect slam leaderboard ranking, especially when the voter pool is small. An example is a Romantic Proposals Slam we featured on Valentine’s Day last month. Men thought this proposal during a Kansas City Royals game was the most romantic, although this one where the man pretends to fall off a building came close. On the other hand, women rated this meme proposal in a restaurant as the best, followed by this movie theater proposal.
Encouraged by these results, we will soon be exploring slams for a few thousand topics to evaluate the utility of gamification techniques to YouTube topic search. Here are some of them: Chocolate Brownie, Paper Plane, Bush Flying, Stealth Technology, Stencil Graffiti, and Yosemite National Park.
Have fun slamming!
Quantifying comedy on YouTube: why the number of o’s in your LOL matter
Thursday, February 09, 2012
Posted by Sanketh Shetty, YouTube Slam Team, Google Research
In a previous post, we talked about quantification of musical talent using machine learning on acoustic features for YouTube Music Slam. We wondered if we could do the same for funny videos, i.e. answer questions such as: is a video funny, how funny do viewers think it is, and why is it funny? We noticed a few audiovisual patterns across comedy videos on YouTube, such as shaky camera motion or audible laughter, which we can automatically detect. While content-based features worked well for music, identifying humor based on just such features is AI-Complete. Humor preference is subjective, perhaps even more so than musical taste.
Fortunately, at YouTube, we have more to work with. We focused on videos uploaded in the comedy category. We captured the uploader’s belief in the funniness of their video via features based on title, description and tags. Viewers’ reactions, in the form of comments, further validate a video’s comedic value. To this end we computed more text features based on words associated with amusement in comments. These included (a) sounds associated with laughter such as hahaha, with culture-dependent variants such as hehehe, jajaja, kekeke, (b) web acronyms such as lol, lmao, rofl, (c) funny and synonyms of funny, and (d) emoticons such as :), ;-), xP. We then trained classifiers to identify funny videos and then tell us why they are funny by categorizing them into genres such as “funny pets”, “spoofs or parodies”, “standup”, “pranks”, and “funny commercials”.
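As an illustration, comment-level amusement counts could be computed with a handful of regular expressions like the ones below; the actual production feature set is richer and is not public, so both the patterns and the function name are only examples:

```python
import re

# Illustrative patterns for the four cue families described above:
# (a) laughter sounds, (b) web acronyms, (c) "funny" words, (d) emoticons.
LAUGH_PATTERNS = [
    r'\b(?:ha){2,}h?\b', r'\b(?:he){2,}h?\b', r'\b(?:ja){2,}\b', r'\b(?:ke){2,}\b',
    r'\bl+o+l+\b', r'\blmao+\b', r'\brofl+\b',
    r'\bfunny\b|\bhilarious\b',
    r'[:;]-?[)pP]|xP',
]

def amusement_features(comment):
    """Counts occurrences of each amusement cue in one viewer comment."""
    return [len(re.findall(p, comment, flags=re.IGNORECASE)) for p in LAUGH_PATTERNS]
```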
Next we needed an algorithm to rank these funny videos by comedic potential, e.g. is “Charlie bit my finger” funnier than “David after dentist”? Raw viewcount on its own is insufficient as a ranking metric since it is biased by video age and exposure. We noticed that viewers emphasize their reaction to funny videos in several ways: e.g. capitalization (LOL), elongation (loooooool), repetition (lolololol), exclamation (lolllll!!!!!), and combinations thereof. If a user uses an “loooooool” vs an “loool”, does it mean they were more amused? We designed features to quantify the degree of emphasis on words associated with amusement in viewer comments. We then trained a passive-aggressive ranking algorithm using human-annotated pairwise ground truth and a combination of text and audiovisual features. Similar to Music Slam, we used this ranker to populate candidates for human voting for our Comedy Slam.
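A sketch of such emphasis features for a single comment token follows; the specific features and their exact definitions here are illustrative, not the ones used in the ranker:

```python
import re

def emphasis_features(token):
    """Quantifies how emphatically an amusement word is written, e.g.
    'LOOOOOL!!!' vs 'lol': capitalization, letter elongation, 'lol'
    repetition and exclamation marks."""
    return {
        'caps_ratio': sum(c.isupper() for c in token) / max(len(token), 1),
        'elongation': max((len(m.group()) for m in re.finditer(r'(.)\1+', token)), default=0),
        'repetition': len(re.findall(r'(?i)lo+l', token)),
        'exclamation': token.count('!'),
    }
```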
So far, more than 75,000 people have cast more than 700,000 votes, making comedy our most popular slam category. Give it a try!
Further reading:
“Opinion Mining and Sentiment Analysis,” by Bo Pang and Lillian Lee.
“A Great Catchy Name: Semi-Supervised Recognition of Sarcastic Sentences in Online Product Reviews,” by Oren Tsur, Dmitry Davidov, and Ari Rappoport.
“That’s What She Said: Double Entendre Identification,” by Chloe Kiddon and Yuriy Brun.
Discovering Talented Musicians with Acoustic Analysis
Wednesday, November 02, 2011
Posted by Charles DuHadway, YouTube Slam Team, Google Research
In an earlier post we talked about the technology behind Instant Mix for Music Beta by Google. Instant Mix uses machine hearing to characterize music attributes such as timbre, mood and tempo. Today we would like to talk about acoustic and visual analysis -- this time on YouTube. A fundamental part of YouTube's mission is to allow anyone anywhere to showcase their talents -- occasionally leading to life-changing success -- but many talented performers are never discovered. Part of the problem is the sheer volume of videos: forty-eight hours of video are uploaded to YouTube every minute (that’s eight years of content every day). We wondered if we could use acoustic analysis and machine learning to pore over these videos and automatically identify talented musicians.
First we analyzed audio and visual features of videos being uploaded. We wanted to find “singing at home” videos -- often correlated with features such as ambient indoor lighting, head-and-shoulders view of a person singing in front of a fixed camera, few instruments and often a single dominant voice. Here’s a sample set of videos we found.
Then we estimated the quality of singing in each video. Our approach is based on acoustic analysis similar to that used by Instant Mix, coupled with a small set of singing quality annotations from human raters. Given these data we used machine learning to build a ranker that predicts if an average listener would like a performance.
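The post does not name the learning method, but one common way to build such a ranker from a small set of human preferences is to train a classifier on feature differences between pairs of performances; the sketch below illustrates that pattern with scikit-learn, and the placeholder acoustic features and model choice are assumptions, not the actual system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row stands in for a vector of acoustic features for one performance
# (e.g. pitch stability, timbre statistics); values here are random placeholders.
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 8))

# Human raters preferred video i over video j in each pair (hypothetical labels).
pairs = [(0, 1), (2, 3), (5, 4), (7, 6)]

# Turn each preference into a classification example on feature differences:
# label 1 for (preferred - other), label 0 for the reversed difference.
X = np.vstack([features[i] - features[j] for i, j in pairs] +
              [features[j] - features[i] for i, j in pairs])
y = np.array([1] * len(pairs) + [0] * len(pairs))

ranker = LogisticRegression().fit(X, y)

# Score new videos: a higher score means the model predicts listeners would like it more.
scores = features @ ranker.coef_.ravel()
print(np.argsort(-scores)[:5])  # indices of the top-ranked performances
```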
While machines are useful for weeding through thousands of not-so-great videos to find potential stars, we know they alone can't pick the next great star. So we turn to YouTube users to help us identify the real hidden gems by playing a voting game called YouTube Slam. We're putting an equal amount of effort into the game itself -- how do people vote? What makes it fun? How do we know when we have a true hit? We're looking forward to your feedback to help us refine this process: give it a try*. You can also check out singer and voter leaderboards. Toggle “All time” to “Last week” to find emerging talent in fresh videos or all-time favorites.
Our “Music Slam” has only been running for a few weeks and we have already found some very talented musicians. Many of the videos have fewer than 100 views when we find them.
And while we're excited about what we've done with music, there's as much undiscovered potential in almost any subject you can think of. Try our other slams: cute, bizarre, comedy, and dance*. Enjoy!
Related work by Google Researchers:
“Video2Text: Learning to Annotate Video Content”, Hrishikesh Aradhye, George Toderici, Jay Yagnik, ICDM Workshop on Internet Multimedia Mining, 2009.
* Music and dance slams are currently available only in the US.
Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths
Monday, June 20, 2011
Posted by Matthias Grundmann, Vivek Kwatra, and Irfan Essa, Research Team
Earlier this year, we announced the launch of new features on the YouTube Video Editor, including stabilization for shaky videos, with the ability to preview them in real-time. The core technology behind this feature is detailed in this paper, which will be presented at the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 2011).
Casually shot videos captured by handheld or mobile cameras suffer from a significant amount of shake. Existing in-camera stabilization methods dampen high-frequency jitter but do not suppress low-frequency movements and bounces, such as those observed in videos captured by a walking person. Professionally shot videos, on the other hand, usually rely on carefully designed camera configurations, specialized equipment such as tripods or camera dollies, and ease-in and ease-out transitions. Our goal was to devise a completely automatic method for converting casual shaky footage into more pleasant and professional-looking videos.
Our technique mimics the cinematographic principles outlined above by automatically determining the best camera path using a robust optimization technique. The original, shaky camera path is divided into a set of segments, each approximated by either constant, linear, or parabolic motion. Our optimization finds the best of all possible partitions using a computationally efficient and stable algorithm.
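To give a flavor of this kind of optimization, below is a minimal one-dimensional sketch using cvxpy: penalizing the L1 norms of the path's first, second, and third differences encourages exactly the constant, linear, and parabolic segments mentioned, while a bound keeps the smoothed path close to the original so the crop window stays inside the frame. The weights, bound, and function name are illustrative assumptions, not the formulation in the paper.

```python
import numpy as np
import cvxpy as cp

def stabilize_path(original, crop_bound=20.0, w1=10.0, w2=1.0, w3=100.0):
    """Toy 1-D L1 path optimization encouraging constant/linear/parabolic segments.

    `original` is the estimated per-frame camera position (e.g. horizontal shift);
    the smoothed path must stay within `crop_bound` pixels of it so the stabilized
    crop window never leaves the frame. Weights are illustrative only.
    """
    n = len(original)
    eye = np.eye(n)
    D1 = np.diff(eye, n=1, axis=0)   # first-difference operator (velocity)
    D2 = np.diff(eye, n=2, axis=0)   # second difference (acceleration)
    D3 = np.diff(eye, n=3, axis=0)   # third difference (jerk)
    p = cp.Variable(n)
    objective = cp.Minimize(w1 * cp.norm(D1 @ p, 1) +
                            w2 * cp.norm(D2 @ p, 1) +
                            w3 * cp.norm(D3 @ p, 1))
    constraints = [cp.abs(p - original) <= crop_bound]
    cp.Problem(objective, constraints).solve()
    return p.value

# Usage: a jittery walking motion is smoothed into a near-piecewise-linear path.
t = np.arange(120)
shaky = 0.5 * t + 15 * np.sin(t / 3.0) + np.random.default_rng(1).normal(0, 3, 120)
print(np.round(stabilize_path(shaky)[:5], 1))
```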
To achieve real-time performance on the web, we distribute the computation across multiple machines in the cloud. This enables us to provide users with a real-time preview and interactive control of the stabilized result. Above we provide a video demonstration of how to use this feature on the YouTube Editor. We will also demo this live at Google’s exhibition booth at CVPR 2011.
For further details, please read our paper.