Google Research Blog
The latest news from Research at Google
"Aw, so cute!": Allo helps you respond to shared photos
Wednesday, May 18, 2016
Posted by Ariel Fuxman, Research Scientist
Today, Google
announced Allo
, our new mobile messaging app. From day one of the Allo development effort, we set out to build a truly special product that is powered by Google’s strengths in machine intelligence to make messaging easier, more efficient, and more expressive. Photo Reply is a unique feature of Allo that does just that! We use machine learning to understand what a shared photo depicts and to suggest rich natural language replies that the user can tap to send. This makes it easier for users to sustain meaningful conversations while using small mobile keyboards.
Here is an example of the responses that Allo suggests when a friend shares a photo of his child.
Photo Reply — Under the Hood
During the winter, our product managers, Patrick McGregor and Ryan Cassidy, challenged us to develop new approaches to simplify media sharing in messaging while simultaneously delighting users with Google insights. With my colleagues Vivek Ramavajjala, Sergey Nazarov, and Sujith Ravi, we set out to build Photo Reply.
We utilize Google's
image recognition technology
, developed by our
Machine Perception
team, to associate images with
semantic entities
— people, animals, cars, etc. We then apply a machine learned model that maps those recognized entities to actual natural language responses. Our system produces replies for thousands of entity types that are drawn from a taxonomy that is a subset of Google's
Knowledge Graph
and may be at different granularity levels. For example, when you receive a photo of a dog, the system may detect that the dog is actually a labrador and suggest "Love that lab!". Or given a photo of a pasta dish, it may detect the type of pasta ("Yum linguine!") and even the cuisine ("I love Italian food!").
Examples of response suggestions reflecting fine-grained object classes
One aspect of the system that we find very useful is that it can suggest responses not just for physical objects but also for abstract concepts. It can produce suggestions for events (birthday parties, weddings, etc.), nature (sunrises, mountains, etc.), recreational activities (hiking, camping, etc.), and many more categories. Also, the system can generate responses that reflect the emotions that might be associated with an image, such as “happiness”. Here are some examples of responses for abstract concepts:
Response suggestions reflecting abstract concepts
Learning entity-response associations
At runtime, Photo Reply recognizes entities in the shared photo and triggers responses for the entities. The model that maps entities to natural language responses is learned offline using
Expander
, which is a large-scale graph-based
semi-supervised learning
platform at Google. We built a massive graph where nodes correspond to photos, semantic entities, and textual responses. Edges in the graph indicate when an entity was recognized for a photo, when a specific response was given for a photo, and visual similarities between photos. Some of the nodes are "labeled" and we learn associations for the unlabeled nodes by propagating label information across the graph.
To illustrate this, consider the graph below. There are two labels: the red label corresponds to the response "yummy" and the blue label corresponds to "delicious". The nodes for "spaghetti" and "linguine" are unlabeled, but from the fact that they are close to the red and blue nodes, the algorithm can learn that they should be associated with the "yummy" and "delicious" responses. Notice that in this way, we are associating the entity "linguine" with the response "yummy" even though none of the linguine photos in the graph are directly connected to this answer. Expander can perform this kind of learning at very large scale, for graphs containing billions of nodes and hundreds of billions of edges.
Graph of entities, photos, and responses
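The propagation idea can be sketched on a toy graph. This is only an illustration under stated assumptions: the node names, the edges, and the simple neighbor-averaging update are hypothetical and are not Expander's actual algorithm or API.

```python
from collections import defaultdict

# Hypothetical toy graph: undirected edges between photos, entities, and responses.
EDGES = [
    ("photo_1", "spaghetti"), ("photo_1", "yummy"),
    ("photo_2", "spaghetti"), ("photo_2", "delicious"),
    ("photo_3", "linguine"),
    ("photo_3", "photo_1"),  # visual similarity between photos
    ("photo_3", "photo_2"),  # visual similarity between photos
]

# Labeled ("seed") nodes: each response node carries its own label with weight 1.
SEEDS = {"yummy": {"yummy": 1.0}, "delicious": {"delicious": 1.0}}


def propagate(edges, seeds, iterations=20):
    """Spread label weights by repeatedly averaging over each node's neighbors."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    labels = {node: dict(seeds.get(node, {})) for node in neighbors}
    for _ in range(iterations):
        updated = {}
        for node in neighbors:
            if node in seeds:
                # Seed nodes keep their original label distribution.
                updated[node] = dict(seeds[node])
                continue
            total = defaultdict(float)
            for nb in neighbors[node]:
                for label, weight in labels[nb].items():
                    total[label] += weight / len(neighbors[node])
            updated[node] = dict(total)
        labels = updated
    return labels


scores = propagate(EDGES, SEEDS)
print(scores["linguine"])
```

In this toy graph the "linguine" node ends up with weight for both "yummy" and "delicious" even though no linguine photo is directly connected to either response.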
Photo Reply is an exciting example of
multimodal learning
, where computer vision and natural language processing come together in order to create a compelling user experience. Allo will be available on Android and iOS later this summer. Be sure to check out what Allo sees in your beautiful photos!
Chat Smarter with Allo
Wednesday, May 18, 2016
Posted by Pranav Khaitan, Google Research
At Google, we are continuously building products powered by
Machine Learning
to delight our users and simplify their lives. Today, we are excited to talk about the technology behind
Allo
, a new smart messaging app that uses the power of
neural networks
and Google Search to make your text conversations easier and more productive.
Just like
Smart Reply for Inbox
, Allo understands the conversation history to generate a set of suggestions that the user will likely want to respond with. In addition to understanding the context of your conversation, Allo learns your individual style, so the responses are personalized for you.
How does it work?
About a year ago, we started exploring how we can make communication easier and more fun. The idea of Smart Reply for Allo came up in a brainstorming session with my teammates Sushant Prakash and Ori Gershony who then helped me lead our team to build this technology. We began by experimenting with neural network based model architectures which had proven to be successful for sequence prediction, including the encoder-decoder model used in Smart Reply for Inbox.
One challenge we faced was that response generation in online conversations has very strict latency requirements. To address this, Pavel Sountsov and Sushant came up with an innovative two-stage model that works as follows. First, a
recurrent neural network
looks at the conversation context one word at a time and encodes it in the hidden state of a
long short term memory
(LSTM). Below, we show an example with a context ‘Where are you?’. The context has three tokens, each of which is embedded into a continuous space and input to the LSTM. The LSTM state now encodes the context as a continuous vector. This vector is used to generate the response as a discretized semantic class.
Each semantic class is associated with a set of possible messages that belong to it. We use a second recurrent network to generate a specific message from that set. This network also converts the context into a hidden LSTM state, but this time the hidden state is used to generate the full message of the reply one token at a time. For example, after seeing the context “Where are you?”, the LSTM now generates the tokens of the response “I’m at work”.
A
beam search
is used to efficiently select the top-N highest scoring responses from among the very large set of possible messages that an LSTM can generate. A snippet of the search space explored by such a beam-search technique is shown below.
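A minimal beam search sketch may help make this concrete. The next-token table below is a hypothetical stand-in for the LSTM's predictions, and all names and probabilities are made up for illustration.

```python
import heapq
import math

# Hypothetical next-token probabilities keyed by the previous token.
# A real system would get these from the LSTM given the full context.
NEXT = {
    "<s>": {"i'm": 0.6, "at": 0.3, "home": 0.1},
    "i'm": {"at": 0.7, "home": 0.2, "</s>": 0.1},
    "at": {"work": 0.5, "home": 0.4, "</s>": 0.1},
    "work": {"</s>": 1.0},
    "home": {"</s>": 1.0},
}


def beam_search(next_probs, beam_width=2, top_n=2, max_len=5):
    """Keep only the best `beam_width` partial responses at each step."""
    beam = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beam:
            for token, p in next_probs[tokens[-1]].items():
                hypothesis = (score + math.log(p), tokens + [token])
                if token == "</s>":
                    finished.append(hypothesis)
                else:
                    candidates.append(hypothesis)
        if not candidates:
            break
        beam = heapq.nlargest(beam_width, candidates, key=lambda h: h[0])
    best = heapq.nlargest(top_n, finished, key=lambda h: h[0])
    # Strip the start and end markers before returning the text.
    return [(" ".join(tokens[1:-1]), score) for score, tokens in best]


results = beam_search(NEXT)
for text, score in results:
    print(f"{text!r}: {math.exp(score):.3f}")
```

With a beam width of 2, only two partial responses survive each step, so most of the search space is never expanded.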
As with any large-scale product, there were several engineering challenges we had to solve in generating a set of high-quality responses efficiently. For example, in spite of the two-stage architecture, our first few networks were very slow and required about half a second to generate a response. That was obviously a deal breaker for real-time communication apps! So we had to evolve our neural network architecture further to reduce the latency to less than 200ms. We moved from using a softmax layer to a hierarchical softmax layer, which traverses a tree of words instead of a list of words, making it more efficient.
Another interesting challenge we had to solve when generating predictions is controlling for message length. Sometimes none of the most probable responses are appropriate: if the model predicts too short a message, it might not be useful to the user, and if we predict something too long, it might not fit on the phone screen. We solved this by biasing the beam search to follow paths that lead to higher utility responses instead of favoring just the responses that are most probable. That way, we can efficiently generate response predictions of appropriate length that are useful to our users.
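To see why a tree helps, here is a small sketch of a hierarchical softmax over a hypothetical four-word vocabulary. The tree layout and node scores are invented for illustration; a trained network would produce the node scores from the conversation context.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical 4-word vocabulary arranged as a balanced binary tree.
# Each word is reached by a path of (inner_node, go_right) decisions.
PATHS = {
    "yes": [("root", False), ("left", False)],
    "no": [("root", False), ("left", True)],
    "maybe": [("root", True), ("right", False)],
    "later": [("root", True), ("right", True)],
}

# Hypothetical per-node scores (assumed, not taken from any real model).
NODE_SCORES = {"root": 0.4, "left": -1.2, "right": 2.0}


def word_probability(word):
    """Multiply the binary decisions along the path from the root to the word."""
    p = 1.0
    for node, go_right in PATHS[word]:
        p_right = sigmoid(NODE_SCORES[node])
        p *= p_right if go_right else 1.0 - p_right
    return p


probs = {word: word_probability(word) for word in PATHS}
print(probs)
print(sum(probs.values()))
```

Scoring one word touches only the nodes on its path (two here, roughly log2 of the vocabulary size in a balanced tree), whereas a flat softmax has to normalize over every word in the list.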
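The effect of a length bias can be sketched by rescoring a few finished candidates. The utility function, the weight, and the candidate list are all hypothetical, and this sketch applies the bias after the fact rather than inside the beam search itself.

```python
import math


def length_utility(num_tokens, ideal=3, tolerance=2):
    """Hypothetical utility: 1.0 at the ideal length, falling linearly to 0."""
    return max(0.0, 1.0 - abs(num_tokens - ideal) / tolerance)


def biased_score(log_prob, num_tokens, weight=2.0):
    """Combine the model's log-probability with the length utility."""
    return log_prob + weight * length_utility(num_tokens)


# Hypothetical candidates: (response, model probability).
CANDIDATES = [
    ("ok", 0.40),
    ("i'm at work", 0.30),
    ("i am currently at my desk at work", 0.30),
]


def rank(candidates, scorer):
    return sorted(candidates, key=scorer, reverse=True)


by_probability = rank(CANDIDATES, lambda c: math.log(c[1]))
by_utility = rank(
    CANDIDATES,
    lambda c: biased_score(math.log(c[1]), len(c[0].split())),
)

print(by_probability[0][0])
print(by_utility[0][0])
```

Ranking by probability alone puts the very short reply first, while the biased score prefers the medium-length reply.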
Personalized for you
The best part about these suggestions is that over time they are personalized to you so that your individual style is reflected in your conversations. For example, if you often reply to “How are you?” with “Fine.” instead of “I am good.”, it will learn your preference and your future suggestions will take that into account. This was accomplished by incorporating a user's "style" as one of the features in a neural network that is used to predict the next word in a response, resulting in suggestions that are customized for your personality and individual preferences. The user's style is captured in a sequence of numbers that we call the user embedding. These embeddings can be generated as part of the regular model training, but this approach requires waiting many days for training to complete and it cannot handle more than a few million users. To solve this issue, Alon Shafrir implemented an
L-BFGS
based technique to generate user embeddings quickly and at scale. Now, you'll be able to enjoy personalized suggestions after only a short time of using Allo.
More than just English
The neural network model described above is language-agnostic, so building separate prediction models for each language works quite well. To make sure that responses for each language benefit from our semantic understanding of other languages, Sujith Ravi came up with a graph-based machine learning technique that can connect possible responses across languages. Dana Movshovitz-Attias and Peter Young applied this technique to build a graph that connects responses to incoming messages and to other responses that have similar word embeddings and syntactic relationships. It also connects responses with similar meaning across languages based on the
machine translation
models developed by our
Translate team
.
With this graph, we use
semi-supervised learning
, as described in this
paper
, to learn the semantic meaning of responses and determine which are the most useful clusters of possible responses. As a result, we can allow the LSTM to score many possible variants of each possible response meaning, allowing the personalization routines to select the best response for the user in the context of the conversation. This also helps enforce diversity as we can now pick the final set of responses from different semantic clusters.
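Picking the final suggestions from different semantic clusters can be sketched as follows. The cluster names, responses, and scores are hypothetical, and the greedy selection is only one simple way to enforce diversity.

```python
# Hypothetical scored candidates: (response, semantic cluster, score).
CANDIDATES = [
    ("Hi!", "greeting", 0.92),
    ("Hello!", "greeting", 0.90),
    ("Hey there", "greeting", 0.85),
    ("How are you?", "how_are_you", 0.80),
    ("What's up?", "how_are_you", 0.70),
    ("Good morning", "time_of_day", 0.60),
]


def pick_diverse(candidates, k=3):
    """Take the best-scoring response from each cluster, up to k responses."""
    chosen, seen_clusters = [], set()
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    for text, cluster, _score in ranked:
        if cluster in seen_clusters:
            continue  # a response with this meaning is already selected
        seen_clusters.add(cluster)
        chosen.append(text)
        if len(chosen) == k:
            break
    return chosen


suggestions = pick_diverse(CANDIDATES)
print(suggestions)
```

Without the cluster check, the top three suggestions here would all be variants of the same greeting.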
Here’s an example of how the graph might look for a set of messages related to greetings:
Beyond Smart Reply
I am also very excited about the Google assistant in Allo with which you can converse and get information about anything that Google Search knows about. It understands your sentences and helps you accomplish tasks directly from the conversation. For example, the Google assistant can help you discover a restaurant and reserve a table from within the Allo app when chatting with your friends. This has been made possible because of the cutting-edge research in natural language understanding that we have been doing at Google. More details to follow soon!
These smart features will be part of the Android and iOS apps for Allo that will be available later this summer. We can’t wait for you to try and enjoy it!
We wish to acknowledge the hard work of the following in building Smart Reply:
Pranav Khaitan, Sushant Prakash, Pavel Sountsov, Alon Shafrir, Max Gubin, Shu Zhang, Sunita Sarawagi, Ori Gershony, Sergey Nazarov, Hung Pham, Harini Krishnamurthy, Ryan Cassidy, Dave Citron, Patrick McGregor, Sujith Ravi, Dana Movshovitz-Attias, Peter Young, Vivek Ramavajjala
Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open Source
Thursday, May 12, 2016
Posted by Slav Petrov, Senior Staff Research Scientist
At Google, we spend a lot of time thinking about how
computer systems
can
read
and
understand
human language
in order
to process it
in
intelligent ways
. Today, we are excited to share the fruits of our research with the broader community by releasing
SyntaxNet
, an open-source neural network framework implemented in
TensorFlow
that provides a foundation for
Natural Language Understanding
(NLU) systems. Our release includes all the code needed to train new SyntaxNet models on your own data, as well as
Parsey McParseface
, an English parser that we have trained for you and that you can use to analyze English text.
Parsey McParseface is built on powerful machine learning algorithms that learn to analyze the linguistic structure of language, and that can explain the functional role of each word in a given sentence. Because Parsey McParseface is the
most accurate such model in the world
, we hope that it will be useful to developers and researchers interested in automatic extraction of information, translation, and other core applications of NLU.
How does SyntaxNet work?
SyntaxNet is a framework for what’s known in academic circles as a
syntactic parser
, which is a key first component in many NLU systems. Given a sentence as input, it tags each word with a part-of-speech (POS) tag that describes the word's syntactic function, and it determines the syntactic relationships between words in the sentence, represented in the dependency parse tree. These syntactic relationships are directly related to the underlying meaning of the sentence in question. To take a very simple example, consider the following dependency tree for
Alice saw Bob
:
This structure encodes that
Alice
and
Bob
are nouns and
saw
is a verb. The main verb
saw
is the root of the sentence and
Alice
is the subject (nsubj) of
saw
, while
Bob
is its direct object (dobj). As expected, Parsey McParseface analyzes this sentence correctly, but also understands the following more complex example:
This structure again encodes the fact that
Alice
and
Bob
are the subject and object respectively of
saw
, and additionally that
Alice
is modified by a relative clause with the verb
reading
, that
saw
is modified by the temporal modifier
yesterday
, and so on. The grammatical relationships encoded in dependency structures allow us to easily recover the answers to various questions, for example
whom did Alice see?
,
who saw Bob?
,
what had Alice been reading about?
or
when did Alice see Bob?
.
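To make the question-answering point concrete, here is a small sketch that stores the dependency parse of the simple sentence Alice saw Bob and reads answers off it. The `Token` structure and the tag spellings are illustrative assumptions, not SyntaxNet's output format.

```python
from typing import NamedTuple


class Token(NamedTuple):
    index: int  # 1-based position in the sentence
    word: str
    pos: str    # coarse part-of-speech tag (illustrative spelling)
    head: int   # index of the head token, 0 for the root
    label: str  # dependency relation to the head


# Parse of "Alice saw Bob" as described in the text.
SENTENCE = [
    Token(1, "Alice", "NOUN", 2, "nsubj"),
    Token(2, "saw", "VERB", 0, "ROOT"),
    Token(3, "Bob", "NOUN", 2, "dobj"),
]


def root(tokens):
    """Return the token whose head is the artificial root (index 0)."""
    return next(t for t in tokens if t.head == 0)


def dependent(tokens, head, label):
    """Return the word attached to `head` with the given relation, if any."""
    for t in tokens:
        if t.head == head.index and t.label == label:
            return t.word
    return None


verb = root(SENTENCE)
who_saw = dependent(SENTENCE, verb, "nsubj")    # answers "who saw Bob?"
whom_seen = dependent(SENTENCE, verb, "dobj")   # answers "whom did Alice see?"
print(verb.word, who_saw, whom_seen)
```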
Why is Parsing So Hard For Computers to Get Right?
One of the main problems that makes parsing so challenging is that human languages show remarkable levels of ambiguity. It is not uncommon for moderate length sentences - say 20 or 30 words in length - to have hundreds, thousands, or even tens of thousands of possible syntactic structures. A natural language parser must somehow search through all of these alternatives, and find the most plausible structure given the context. As a very simple example, the sentence
Alice drove down the street in her car
has at least two possible dependency parses:
The first corresponds to the (correct) interpretation where Alice is driving in her car; the second corresponds to the (absurd, but possible) interpretation where the street is located in her car. The ambiguity arises because the preposition
in
can either modify
drove
or
street
; this example is an instance of what is called
prepositional phrase attachment ambiguity
.
Humans do a remarkable job of dealing with ambiguity, almost to the point where the problem is unnoticeable; the challenge is for computers to do the same. Multiple ambiguities such as these in longer sentences conspire to give a combinatorial explosion in the number of possible structures for a sentence. Usually the vast majority of these structures are wildly implausible, but are nevertheless possible and must be somehow discarded by a parser.
SyntaxNet applies neural networks to the ambiguity problem. An input sentence is processed from left to right, with dependencies between words being incrementally added as each word in the sentence is considered. At each point in processing many decisions may be possible—due to ambiguity—and a neural network gives scores for competing decisions based on their plausibility. For this reason, it is very important to use
beam search
in the model. Instead of simply taking the first-best decision at each point, multiple partial hypotheses are kept at each step, with hypotheses only being discarded when there are several other higher-ranked hypotheses under consideration. An example of a left-to-right sequence of decisions that produces a simple parse is shown below for the sentence
I booked a ticket to Google
.
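A left-to-right sequence of decisions of this kind can be sketched with a stack and a buffer. This sketch assumes the arc-standard transition system, a common choice for such parsers; the post does not spell out the exact action set. The action sequence is written by hand rather than predicted by a network, and the attachment of "to" to "ticket" is an assumption made for illustration.

```python
WORDS = ["I", "booked", "a", "ticket", "to", "Google"]

# Hand-written action sequence (assumed). A real parser would have a neural
# network score the possible actions at every step.
ACTIONS = [
    "SHIFT", "SHIFT", "LEFT",    # I <- booked
    "SHIFT", "SHIFT", "LEFT",    # a <- ticket
    "SHIFT", "SHIFT", "RIGHT",   # to -> Google
    "RIGHT",                     # ticket -> to
    "RIGHT",                     # booked -> ticket
]


def parse(words, actions):
    """Apply arc-standard actions and collect (head, dependent) arcs."""
    stack, buffer, arcs = [], list(range(len(words))), []
    for action in actions:
        if action == "SHIFT":
            # Move the next input word onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT":
            # Second-from-top becomes a dependent of the top of the stack.
            dep = stack.pop(-2)
            arcs.append((words[stack[-1]], words[dep]))
        elif action == "RIGHT":
            # Top of the stack becomes a dependent of the item below it.
            dep = stack.pop()
            arcs.append((words[stack[-1]], words[dep]))
        else:
            raise ValueError(f"unknown action: {action}")
    return arcs, stack, buffer


ARCS, STACK, BUFFER = parse(WORDS, ACTIONS)
for head, dep in ARCS:
    print(f"{head} -> {dep}")
```

When the sequence finishes, the buffer is empty and only the root word remains on the stack.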
Furthermore, as described in our
paper
, it is critical to tightly
integrate learning and search
in order to achieve the highest prediction accuracy. Parsey McParseface and other
SyntaxNet
models are some of the most complex networks that we have trained with the
TensorFlow
framework at Google. Given some data from the Google-supported
Universal Dependencies
project, you can train a parsing model on your own machine.
So How Accurate is Parsey McParseface?
On a standard benchmark consisting of randomly drawn English newswire sentences (the 20-year-old
Penn Treebank
), Parsey McParseface recovers individual dependencies between words with over 94% accuracy, beating our own previous state-of-the-art results, which were already
better than any previous approach
. While there are no explicit studies in the literature about human performance, we know from our in-house annotation projects that linguists trained for this task agree in 96-97% of the cases. This suggests that we are approaching human performance—but only on well-formed text. Sentences drawn from the web are a lot harder to analyze, as we learned from the
Google WebTreebank
(released in 2011). Parsey McParseface achieves just over 90% parse accuracy on this dataset.
While the accuracy is not perfect, it’s certainly high enough to be useful in many applications. The major source of errors at this point is examples such as the prepositional phrase attachment ambiguity described above, which require real-world knowledge (e.g. that a street is not likely to be located in a car) and deep contextual reasoning. Machine learning (and in particular, neural networks) has made significant progress in resolving these ambiguities. But our work is still cut out for us: we would like to develop methods that can learn world knowledge and enable equal understanding of natural language across
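Accuracy over individual dependencies can be sketched as the fraction of words whose predicted head matches the reference head. The post does not say whether its figures also require the relation label to match, so this sketch counts heads only. The head indices below are an illustrative analysis of the earlier example sentence, not output from Parsey McParseface.

```python
def attachment_accuracy(gold_heads, predicted_heads):
    """Fraction of words whose predicted head equals the reference head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("both parses must cover the same words")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)


# "Alice drove down the street in her car", words numbered from 1, 0 = root.
# Assumed analysis: prepositions head their objects.
#            Alice drove down the street in her car
GOLD =      [2,    0,    2,   5,  3,     2,  8,  6]  # "in" modifies "drove"
PREDICTED = [2,    0,    2,   5,  3,     5,  8,  6]  # "in" modifies "street"

accuracy = attachment_accuracy(GOLD, PREDICTED)
print(accuracy)
```

A single wrong prepositional phrase attachment in this eight-word sentence already costs one dependency out of eight.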
all
languages and contexts.
To get started, see the
SyntaxNet
code and download the Parsey McParseface parser model. Happy parsing from the main developers, Chris Alberti, David Weiss, Daniel Andor, Michael Collins & Slav Petrov.
Research at Google and ICLR 2016
Sunday, May 01, 2016
Posted by Dumitru Erhan, Gentleman Scientist
This week, San Juan, Puerto Rico hosts the
4th International Conference on Learning Representations
(ICLR 2016), a conference focused on how one can learn meaningful and useful representations of data for
Machine Learning
. ICLR includes conference and workshop tracks, with invited talks along with oral and poster presentations of some of the latest research on deep learning, metric learning, kernel learning, compositional models, non-linear structured prediction, and issues regarding non-convex optimization.
At the forefront of innovation in cutting-edge technology in
Neural Networks
and
Deep Learning
, Google focuses on both theory and application, developing learning approaches to understand and generalize. As Platinum Sponsor of ICLR 2016, Google will have a strong presence with over 40 researchers attending (many from the
Google Brain team
and
Google DeepMind
), contributing to and learning from the broader academic research community by presenting papers and posters, in addition to participating on organizing committees and in workshops.
If you are attending ICLR 2016, we hope you’ll stop by our booth and chat with our researchers about the projects and opportunities at Google that go into solving interesting problems for billions of people. You can also learn more about our research being presented at ICLR 2016 in the list below (Googlers highlighted in
blue
).
Organizing Committee
Program Chairs
Samy Bengio
, Brian Kingsbury
Area Chairs include:
John Platt
,
Tara Sainath
Oral Sessions
Neural Programmer-Interpreters
(Best Paper Award Recipient)
Scott Reed,
Nando de Freitas
Net2Net: Accelerating Learning via Knowledge Transfer
Tianqi Chen,
Ian Goodfellow
,
Jon Shlens
Conference Track Posters
Prioritized Experience Replay
Tom Schaul
,
John Quan
,
Ioannis Antonoglou
,
David Silver
Reasoning about Entailment with Neural Attention
Tim Rocktäschel,
Edward Grefenstette
,
Karl Moritz Hermann
,
Tomáš Kočiský
,
Phil Blunsom
Neural Programmer: Inducing Latent Programs With Gradient Descent
Arvind Neelakantan,
Quoc Le
,
Ilya Sutskever
MuProp: Unbiased Backpropagation For Stochastic Neural Networks
Shixiang Gu,
Sergey Levine
,
Ilya Sutskever
,
Andriy Mnih
Multi-Task Sequence to Sequence Learning
Minh-Thang Luong,
Quoc Le
,
Ilya Sutskever
,
Oriol Vinyals
,
Lukasz Kaiser
A Test of Relative Similarity for Model Selection in Generative Models
Eugene Belilovsky, Wacha Bounliphone, Matthew Blaschko,
Ioannis Antonoglou
, Arthur Gretton
Continuous control with deep reinforcement learning
Timothy Lillicrap
,
Jonathan Hunt
,
Alexander Pritzel
,
Nicolas Heess
,
Tom Erez
,
Yuval Tassa
,
David Silver
,
Daan Wierstra
Policy Distillation
Andrei Rusu
,
Sergio Gomez
,
Caglar Gulcehre,
Guillaume Desjardins
,
James Kirkpatrick
,
Razvan Pascanu
,
Volodymyr Mnih
,
Koray Kavukcuoglu
,
Raia Hadsell
Neural Random-Access Machines
Karol Kurach
,
Marcin Andrychowicz
,
Ilya Sutskever
Variable Rate Image Compression with Recurrent Neural Networks
George Toderici
,
Sean O'Malley
,
Damien Vincent
,
Sung Jin Hwang
,
Michele Covell
,
Shumeet Baluja
,
Rahul Sukthankar
,
David Minnen
Order Matters: Sequence to Sequence for Sets
Oriol Vinyals
,
Samy Bengio
,
Manjunath Kudlur
Grid Long Short-Term Memory
Nal Kalchbrenner
,
Alex Graves
,
Ivo Danihelka
Neural GPUs Learn Algorithms
Lukasz Kaiser
,
Ilya Sutskever
ACDC: A Structured Efficient Linear Layer
Marcin Moczulski,
Misha Denil
, Jeremy Appleyard,
Nando de Freitas
Workshop Track Posters
Revisiting Distributed Synchronous SGD
Jianmin Chen
,
Rajat Monga
,
Samy Bengio
,
Rafal Jozefowicz
Black Box Variational Inference for State Space Models
Evan Archer, Il Memming Park,
Lars Buesing
, John Cunningham, Liam Paninski
A Minimalistic Approach to Sum-Product Network Learning for Real Applications
Viktoriya Krakovna,
Moshe Looks
Efficient Inference in Occlusion-Aware Generative Models of Images
Jonathan Huang
,
Kevin Murphy
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
Christian Szegedy
,
Sergey Ioffe
,
Vincent Vanhoucke
Deep Autoresolution Networks
Gabriel Pereyra
,
Christian Szegedy
Learning visual groups from co-occurrences in space and time
Phillip Isola, Daniel Zoran,
Dilip Krishnan
, Edward H. Adelson
Adding Gradient Noise Improves Learning For Very Deep Networks
Arvind Neelakantan, Luke Vilnis,
Quoc V. Le
,
Ilya Sutskever
,
Lukasz Kaiser
,
Karol Kurach
, James Martens
Adversarial Autoencoders
Alireza Makhzani,
Jonathon Shlens
,
Navdeep Jaitly
,
Ian Goodfellow
Generating Sentences from a Continuous Space
Samuel R. Bowman, Luke Vilnis,
Oriol Vinyals
,
Andrew M. Dai
,
Rafal Jozefowicz
,
Samy Bengio