Google Research Blog
The latest news from Research at Google
A picture is worth a thousand (coherent) words: building a natural description of images
Monday, November 17, 2014
Posted by Google Research Scientists Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan
“Two pizzas sitting on top of a stove top oven”
“A group of people shopping at an outdoor market”
“Best seats in the house”
People can summarize a complex scene in a few words without thinking twice. It’s much more difficult for computers. But we’ve just gotten a bit closer -- we’ve developed a machine-learning system that can automatically produce captions (like the three above) to accurately describe images the first time it sees them. This kind of system could eventually help visually impaired people understand pictures, provide alternate text for images in parts of the world where mobile connections are slow, and make it easier for everyone to search on Google for images.
Recent research has greatly improved object detection, classification, and labeling. But accurately describing a complex scene requires a deeper representation of what’s going on in the scene, capturing how the various objects relate to one another and translating it all into natural-sounding language.
Automatically captioned: “Two pizzas sitting on top of a stove top oven”
Many efforts to construct computer-generated natural descriptions of images propose combining current state-of-the-art techniques in both computer vision and natural language processing to form a complete image description approach. But what if we instead merged recent computer vision and language models into a single jointly trained system, taking an image and directly producing a human-readable sequence of words to describe it?
This idea comes from recent advances in machine translation between languages, where a Recurrent Neural Network (RNN) transforms, say, a French sentence into a vector representation, and a second RNN uses that vector representation to generate a target sentence in German.
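As a rough illustration of that encoder-decoder idea (a minimal PyTorch sketch, not the original translation system; vocabulary sizes and dimensions are placeholders):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """One RNN compresses the source sentence into a single vector;
    a second RNN generates the target sentence from that vector."""
    def __init__(self, src_vocab, tgt_vocab, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, tgt):
        # The encoder's final state is the fixed-length vector
        # representation of the (say, French) source sentence.
        _, state = self.encoder(self.src_embed(src))
        # The decoder starts from that vector and predicts each (German)
        # target word from the previous one (teacher forcing in training).
        hidden, _ = self.decoder(self.tgt_embed(tgt[:, :-1]), state)
        return self.out(hidden)  # logits over the target vocabulary
```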
Now, what if we replaced that first RNN and its input words with a deep Convolutional Neural Network (CNN) trained to classify objects in images? Normally, the CNN’s last layer is used in a final Softmax among known classes of objects, assigning a probability that each object might be in the image. But if we remove that final layer, we can instead feed the CNN’s rich encoding of the image into an RNN designed to produce phrases. We can then train the whole system directly on images and their captions, maximizing the likelihood that the descriptions it produces best match the training descriptions for each image.
The model combines a vision CNN with a language-generating RNN so it can take in an image and generate a fitting natural-language caption.
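Continuing the sketch above, swapping the encoder RNN for a CNN gives a toy version of this setup. The ResNet stand-in, the dimensions, and the vocabulary size are our illustrative choices, not the model described in the paper:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Pretrained object-classification CNN with its final Softmax
        # layer removed, leaving a rich encoding of the image.
        cnn = models.resnet18(weights='IMAGENET1K_V1')
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])
        self.img_proj = nn.Linear(512, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.img_proj(self.encoder(images).flatten(1))  # (B, E)
        words = self.embed(captions[:, :-1])                    # shifted caption
        inputs = torch.cat([feats.unsqueeze(1), words], dim=1)  # image first
        hidden, _ = self.rnn(inputs)
        return self.out(hidden)                                 # (B, T, vocab)

# Training maximizes the likelihood of the reference captions:
# cross-entropy between predicted word distributions and ground-truth
# words, backpropagated through both the RNN and the CNN.
model = CaptionModel(vocab_size=10000)
loss_fn = nn.CrossEntropyLoss()
```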
Our experiments with this system on several openly published datasets, including Pascal, Flickr8k, Flickr30k and SBU, show how robust the qualitative results are -- the generated sentences are quite reasonable. It also performs well in quantitative evaluations with the Bilingual Evaluation Understudy (BLEU), a metric used in machine translation to evaluate the quality of generated sentences.
A selection of evaluation results, grouped by human rating.
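For intuition, BLEU can be computed with off-the-shelf tools such as NLTK; the reference and candidate sentences below are made up for illustration:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# One human reference caption and one generated caption (illustrative).
reference = "two pizzas sitting on top of a stove top oven".split()
candidate = "two pizzas on top of an oven".split()

# BLEU scores n-gram overlap between candidate and reference text;
# smoothing keeps short sentences from scoring zero when a
# higher-order n-gram has no match.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```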
A picture may be worth a thousand words, but sometimes it’s the words that are most useful -- so it’s important we figure out ways to translate from images to words automatically and accurately. As the datasets suited to learning image descriptions grow and mature, so will the performance of end-to-end approaches like this. We look forward to continuing developments in systems that can read images and generate good natural-language descriptions. To get more details about the framework used to generate descriptions from images, as well as the model evaluation, read the full paper here.
The World Parks Congress: Using technology to protect our natural environment
Wednesday, November 12, 2014
Posted by Dave Thau, Developer Advocate for Google Earth Engine and Karin Tuxen-Bettman, Program Manager, Google Earth Outreach
(Cross-posted on the Official Google Australia Blog)
This week, thousands of people from more than 160 countries will gather in Sydney for the once-in-a-decade IUCN World Parks Congress to discuss the governance and management of protected areas. The Google Earth Outreach and Google Earth Engine teams will be at the event to showcase examples of how technology can help protect our environment.
Here are a few of the workshops and events happening in Sydney this week:
Monday, November 10th - Tuesday, November 11th:
Over the last couple of days, the Google Earth Outreach and Earth Engine teams delivered a two-day hands-on workshop to develop the technical capacity of park managers, researchers, and communities. At this workshop, participants were introduced to Google mapping tools to help them with their conservation programs.
November 13 - 19:
Google will be at the Oceans Pavilion inside the World Parks Congress to demonstrate how Trekker, Street View and Open Data Kit on Android mobile devices can assist with parks monitoring and management.
Friday, November 14, 9:30-10:30am:
Join a Live Sydney Seahorse Hunt in Sydney Harbour, via Google Hangout, with the Catlin Seaview Survey and the Sydney Institute of Marine Science. Richard Vevers, Director of the Catlin Seaview Survey, will venture underwater to his favorite dive site and talk with experts about the unique marine life (including seahorses!) that explorers can expect to find around Sydney. Tune in here at 10:30am to catch all the action.
Saturday, November 15th, 8:30am:
Networking for nature: the future is cool. Hear about how technology-driven ocean initiatives can help us better understand and strengthen our connection with our natural environments. WPCA-Marine’s plenary session will include presentations by Sylvia Earle and Mission Blue, the Catlin Seaview Survey, Google, Oceana, and SkyTruth. The session will also feature leading young marine professionals Mariasole Bianco and Rebecca Koss.
Saturday, November 15th, 12:15pm:
We’ll be hosting a panel discussion on using Global Forest Watch to monitor protected areas in near-real-time. Global Forest Watch is a dynamic online alert system to help park rangers monitor and preserve vast stretches of parkland.
Saturday, November 15th, 1:30 - 3:00pm:
At the Biodiversity Pavilion, join Walter Jetz from Yale and Dave Thau from Google for a presentation on Google Earth Engine and The Map of Life. The presentation will showcase how Google Earth Engine is being used in a variety of conservation efforts, including monitoring water resources, tracking the health of the world's forests, and measuring the impact of protected areas on biodiversity preservation. We will also announce a new global resource from The Map of Life for mapping and monitoring biodiverse ecosystems.
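As a small illustration of that kind of forest monitoring, the Earth Engine Python API can sum mapped forest loss over a region. This is a sketch under stated assumptions: the asset id refers to the public Hansen Global Forest Change dataset (version suffixes vary by release), and the rectangle is a hypothetical stand-in for a real protected-area polygon:

```python
import ee
ee.Initialize()

# Public Hansen Global Forest Change dataset (2013 release id assumed).
gfc = ee.Image('UMD/hansen/global_forest_change_2013')
loss = gfc.select('loss')  # binary band: 1 where forest loss was mapped

# Hypothetical protected-area boundary; a real workflow would load the
# park's actual polygon as an ee.Feature.
park = ee.Geometry.Rectangle([149.9, -34.1, 150.4, -33.7])

# Sum the area (m^2) of loss pixels inside the boundary at 30 m scale.
loss_area = loss.multiply(ee.Image.pixelArea()).reduceRegion(
    reducer=ee.Reducer.sum(), geometry=park, scale=30, maxPixels=1e9)
print('Forest loss within park (m^2):', loss_area.getInfo())
```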
We believe that technology can help address some of our world’s most pressing environmental challenges and we look forward to working with Australian conservationists to integrate technology into their work.
You can find us at the Oceans Pavilion inside the World Parks Congress, where we will be joined by our environmental partners, including The Jane Goodall Institute, The World Resources Institute and The Map of Life.
We hope to see you at one of our events this week!
Googler Shumin Zhai awarded the ACM UIST Lasting Impact Award
Monday, November 03, 2014
Posted by Alfred Spector, Vice President, Engineering
Recently, at the 27th ACM User Interface Software and Technology Symposium (UIST’14), Google Senior Research Scientist Shumin Zhai and University of Cambridge Lecturer Per Ola Kristensson received the 2014 Lasting Impact Award for their seminal paper SHARK²: a large vocabulary shorthand writing system for pen-based computers. Most simply put, this is one of those rare works that is responsible for fundamental and lasting advances in the industry, and it is the basis for the rapidly growing number of keyboards that use gesture typing, including products such as ShapeWriter, Swype, SwiftKey, SlideIT, TouchPal, and Google Keyboard.
First presented 10 years ago at UIST’04, Shumin and Per Ola’s paper is a pioneering work on word-gesture keyboard interaction that described the architecture, algorithms and interfaces of a high-capacity multi-channel gesture recognition system, SHARK². SHARK² increased recognition accuracy and relaxed precision requirements by using the shape and location of gestures in addition to context-based language models. In doing so, Shumin and Per Ola delivered a paradigm of touchscreen gesture typing as an efficient method for text entry that has continued to drive the development of mobile text entry across the industry.
"Awarded for its scientific contribution of algorithms, insights, and user interface considerations essential to the practical realization of large-vocabulary shape-writing systems for graphical keyboards, laying the groundwork for new research, industrial applications, and widespread user benefit."
Prior to joining Google in 2011, Shumin worked at the IBM Almaden Research Center for 15 years, where he originated and led the SHARK project, further developing and refining it to include a low-latency recognition engine that introduced the ability to accurately recognize a large vocabulary of words based upon the patterns (sokgraphs) drawn on a touchscreen device. SHARK and SHARK² subsequently continued development as ShapeWriter. During his tenure at IBM, Shumin additionally pursued a wide variety of HCI research areas including, but not limited to, studying the ease and efficiency of HCI interfaces, camera-phone-based motion sensing, and cross-device user experience.
At Google, Shumin has continued to inspire the Human-Computer Interaction research community, publishing prolifically and leading a group that incorporates HCI research, machine learning, statistical language modeling and mobile computing to advance the state of the art of text input for smart touchscreen keyboards. Building on his earlier work with SHARK/ShapeWriter, Gesture Typing is just one of the innovations that make things like typing messages on a mobile device easier for hundreds of millions of people each day, and it remains one of the most prominent features on Android keyboards.
Shumin has been highly active in academia throughout his career, as both visiting professor and lecturer at world-class universities, and is currently the Editor-in-Chief of ACM Transactions on Computer-Human Interaction. He is a Fellow of the ACM and a Member of the CHI Academy. We’re proud to congratulate Shumin and Per Ola on receiving one of the most prestigious honors in the Human-Computer Interaction (HCI) research community, and we look forward to their future contributions.