Google Research Blog
The latest news from Research at Google
Improving Photo Search: A Step Across the Semantic Gap
Wednesday, June 12, 2013
Posted by Chuck Rosenberg, Image Search Team
Last month at Google I/O, we showed a major upgrade to the photos experience: you can now easily search your own photos without having to manually label each and every one of them. This is powered by computer vision and machine learning technology, which uses the visual content of an image to generate searchable tags for photos. These tags are combined with other sources, like text tags and EXIF metadata, to enable search across thousands of concepts like flower, food, car, jet ski, or turtle.
For many years Google has offered Image Search over web images; however, searching across photos represents a difficult new challenge. In Image Search there are many pieces of information that can be used to rank images, for example text from the web or the image filename. In the case of photos, however, there is typically little or no information beyond the pixels in the images themselves. This makes it harder for a computer to identify and categorize what is in a photo. There are some things a computer can do well, like recognizing rigid objects and handwritten digits. For other classes of objects this remains a daunting task, because the average toddler is better at understanding what is in a photo than the world's most powerful computers running state-of-the-art algorithms.
This past October, the state of the art moved a step closer to toddler performance: a system that used deep learning and convolutional neural networks easily beat more traditional approaches in the ImageNet computer vision competition, which is designed to test image understanding. The winning team came from Professor Geoffrey Hinton's group at the University of Toronto.
We built and trained models similar to those of the winning team using software infrastructure for training large-scale neural networks developed at Google in a group started by Jeff Dean and Andrew Ng. When we evaluated these models we were impressed: on our test set we saw double the average precision compared to other approaches we had tried. We knew we had found what we needed to make photo search easier for people using Google. We acquired the rights to the technology and went full speed ahead adapting it to run at large scale on Google's computers. We took cutting-edge research straight out of an academic lab and launched it in just a little over six months. You can try it out at photos.google.com.
Why the success now? What is new? Some things are unchanged: we still use convolutional neural networks, originally developed in the late 1990s by Professor Yann LeCun in the context of software for reading handwritten letters and digits. What is different is that both computers and algorithms have improved significantly. First, bigger and faster computers have made it feasible to train larger neural networks on much larger datasets. Ten years ago, running neural networks of this complexity would have been a momentous task even for a single image; now we are able to run them on billions of images. Second, new training techniques have made it possible to train the large, deep neural networks necessary for successful image recognition.
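To make the idea concrete, here is a minimal sketch (not Google's production system) of the two building blocks a convolutional network stacks many times: a learned convolution followed by a nonlinearity, and a final linear classification layer. The filter values and shapes below are illustrative assumptions.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution of a single-channel image with one learned filter."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)  # nonlinearity applied after each convolution

# Toy forward pass: one 5x5 filter over a 32x32 grayscale image,
# then a linear classifier over the pooled response.
image = np.random.rand(32, 32)
kernel = np.random.randn(5, 5) * 0.1          # in practice, learned from data
feature_map = relu(conv2d(image, kernel))
feature = feature_map.mean()                  # crude global pooling
score = 2.0 * feature - 0.5                   # w * x + b, the final linear layer
print("class score:", score)
```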
We thought the research community might find it interesting to hear about some of the unique aspects of the system we built and some qualitative observations we made while testing it.
The first is our label set and training set and how they compare to those used in the ImageNet Large Scale Visual Recognition competition. Since we were working on search across photos, we needed an appropriate label set. We came up with a set of about 2,000 visual classes based on the most popular labels on Google+ Photos that also seemed to have a visual component a human could recognize. In contrast, the ImageNet competition has 1,000 classes. As in ImageNet, the classes are not text strings but entities; in our case we use Freebase entities, which form the basis of the Knowledge Graph used in Google search. An entity is a way to uniquely identify something in a language-independent way. In English, when we encounter the word “jaguar”, it is hard to determine whether it refers to the animal or the car manufacturer. Entities assign a unique ID to each, removing that ambiguity: “/m/0449p” for the former and “/m/012x34” for the latter. In order to train better classifiers we used more training images per class than ImageNet, 5,000 versus 1,000. Since we wanted to provide only high-precision labels, we also refined the classes from our initial set of 2,000 down to the 1,100 most precise classes for our launch.
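A minimal illustration of why entity IDs help: mapping ambiguous surface strings to unique, language-independent IDs lets a classifier target exactly one concept per class. The dictionary below uses only the two MIDs quoted above; the multilingual label strings are illustrative assumptions.

```python
# Map ambiguous words to Freebase entity IDs (MIDs) so each class is unambiguous.
ENTITY_IDS = {
    ("jaguar", "animal"): "/m/0449p",       # the big cat
    ("jaguar", "car maker"): "/m/012x34",   # the manufacturer
}

# The same entity can carry labels in many languages, so search works
# regardless of the query language (example strings are assumptions).
LABELS_FOR_ENTITY = {
    "/m/0449p": {"en": "jaguar", "es": "jaguar", "de": "Jaguar (Tier)"},
}

print(ENTITY_IDS[("jaguar", "animal")])  # -> /m/0449p
```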
During our development process we made many more qualitative observations that we felt were worth mentioning:
1) Generalization performance. Even though there was a significant difference in visual appearance between the training and test sets, the network appeared to generalize quite well. To train the system, we used images mined from the web, which did not match the typical appearance of personal photos. Images on the web are often used to illustrate a single concept and are carefully composed, so an image of a flower might only be a close-up of a single flower. Personal photos, by contrast, are unstaged and impromptu: a photo of a flower might contain many other things and may not be very carefully composed. So our training set image distribution was not necessarily a good match for the distribution of images we wanted to run the system on, as the examples below illustrate. However, we found that our system trained on web images was able to generalize and perform well on photos.
A typical photo of a flower found on the web.
A typical photo of a flower found in an impromptu photo.
2) Handling of classes with multimodal appearance. The network seemed to handle classes with multimodal appearance quite well; for example, the “car” class contains both exterior and interior views of cars. This was surprising because the final layer is effectively a linear classifier, which creates a single dividing plane in a high-dimensional space. Since it is a single plane, this type of classifier is often not very good at representing multiple very different concepts.
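For readers less familiar with the terminology, here is a minimal sketch of what such a final layer computes: a single weight vector and bias define one dividing plane, and the sign of the score picks a side. The numbers are illustrative assumptions, not the launched model.

```python
import numpy as np

def linear_classifier_score(features, weights, bias):
    """Score for one class: a single hyperplane w.x + b in feature space."""
    return float(np.dot(weights, features) + bias)

rng = np.random.default_rng(0)
features = rng.normal(size=4096)        # e.g. activations from the last hidden layer
weights = rng.normal(size=4096) * 0.01  # learned weights for the "car" class
bias = -0.1

score = linear_classifier_score(features, weights, bias)
print("predict 'car'" if score > 0 else "predict 'not car'")
```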
3) Handling abstract and generic visual concepts. The system was able to do reasonably well on classes that one would think are somewhat abstract and generic. These include "dance", "kiss", and "meal", to name a few. This was interesting because for each of these classes it did not seem that there would be any simple visual clues in the image that would make it easy to recognize this class. It would be difficult to describe them in terms of simple basic visual features like color, texture, and shape.
Photos recognized as containing a meal.
4) Reasonable errors. Unlike other systems we experimented with, the errors we observed often seemed quite reasonable to people. The mistakes were the type a person might make: confusing things that look similar. Some people have already noticed this, for example mistaking a goat for a dog or a millipede for a snake. This is in contrast to other systems, which often make errors that seem nonsensical to people, like mistaking a tree for a dog.
Photo of a banana slug mistaken for a snake.
Photo of a donkey mistaken for a dog.
5) Handling very specific visual classes. Some of our classes are very specific, like particular types of flowers, for example “hibiscus” or “dahlia”. We were surprised that the system could do well on those. Recognizing specific subclasses often requires very fine detail to differentiate between them, so it was surprising that a system that could do well on a whole-image concept like “sunset” could also do well on very specific classes.
Photo recognized as containing a hibiscus flower.
Photo recognized as containing a dahlia flower.
Photo recognized as containing a polar bear.
Photo recognized as containing a grizzly bear.
The resulting computer vision system worked well enough to launch as a useful tool that improves personal photo search, which was a big step forward. So, is computer vision solved? Not by a long shot. Have we gotten computers to see the world as well as people do? Not yet; there is still a lot of work to do, but we are closer.
Video Stabilization on YouTube
Friday, May 04, 2012
Posted by Matthias Grundmann, Vivek Kwatra, and Irfan Essa, Research at Google
One thing we have been working on within Research at Google is developing methods for making casual videos look more professional, thereby providing users with a better viewing experience. Professional videos have several characteristics that differentiate them from casually shot videos. For example, in order to tell a story, cinematographers carefully control lighting and exposure and use specialized equipment to plan camera movement.
We have developed a technique that mimics professional camera moves and applies them to videos recorded on hand-held devices. Cinematographers use specialized equipment such as tripods and dollies to plan their camera paths and hold them steady. In contrast, think of a video you shot using a mobile phone camera: how steady was your hand, and were you able to anticipate an interesting moment and smoothly pan the camera to capture it? To bridge these differences, we propose an algorithm that automatically determines the best camera path and recasts the video as if it were filmed using stabilization equipment. Specifically, we divide the original, shaky camera path into a set of segments, each approximated by either a constant, linear, or parabolic motion of the camera. Our optimization finds the best of all possible partitions using a computationally efficient and stable algorithm. For details, check out our earlier blog post or read our paper, Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths, published in IEEE CVPR 2011.
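As a rough illustration of the optimization (not the exact production formulation), one can recover a smooth camera path by minimizing the L1 norms of its first, second, and third derivatives while keeping the new path close to the original; L1 penalties encourage derivatives that are exactly zero over whole segments, which yields the constant, linear, and parabolic pieces described above. This sketch uses the cvxpy modeling library and a 1-D path for simplicity; the penalty weights and crop bound are illustrative assumptions.

```python
import numpy as np
import cvxpy as cp

# Shaky 1-D camera path (e.g. horizontal translation per frame).
t = np.arange(200)
original = 0.5 * t + 10 * np.sin(t / 7.0) + np.random.randn(200) * 2.0

p = cp.Variable(200)          # smoothed path to solve for
d1 = cp.diff(p, 1)            # velocity
d2 = cp.diff(p, 2)            # acceleration
d3 = cp.diff(p, 3)            # jerk

# L1 penalties drive whole stretches of each derivative to exactly zero,
# producing piecewise constant / linear / parabolic motion.
objective = cp.Minimize(10 * cp.norm1(d1) + cp.norm1(d2) + 100 * cp.norm1(d3))
constraints = [cp.abs(p - original) <= 20]   # stay within a crop window of the original
cp.Problem(objective, constraints).solve()

smooth_path = p.value
```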
The next time you upload your videos to YouTube, try stabilizing them by going to the YouTube editor, or directly from the video manager by clicking on Edit->Enhancements. For even more convenience, YouTube will automatically detect if your video needs stabilization and offer to do it for you. Many videos on YouTube have already been enhanced using this technology.
More recently, we have been working on a related problem common in videos shot from mobile phones. The camera sensors in these phones contain what is known as an electronic rolling shutter. When taking a picture with a rolling shutter camera, the image is not captured instantaneously. Instead, the camera captures the image one row of pixels at a time, with a small delay when going from one row to the next. Consequently, if the camera moves during capture, it will cause image distortions ranging from shear in the case of low-frequency motions (for instance an image captured from a driving car) to wobbly distortions in the case of high-frequency perturbations (think of a person walking while recording video). These distortions are especially noticeable in videos where the camera shake is independent across frames. For example, take a look at the video below.
Original video with rolling shutter distortions
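To see why row-by-row capture distorts moving scenes, here is a minimal simulation (purely illustrative, not the paper's model): each image row samples the scene at a slightly later time, so camera motion during the exposure shears or wobbles the result. The motion functions and row delay below are assumptions.

```python
import numpy as np

def rolling_shutter_capture(scene, camera_x_at, row_delay=1e-4):
    """Simulate a rolling shutter: row r is read out at time r * row_delay,
    so each row sees the scene shifted by the camera position at that time."""
    height, width = scene.shape
    frame = np.zeros_like(scene)
    for r in range(height):
        shift = int(round(camera_x_at(r * row_delay)))
        frame[r] = np.roll(scene[r], -shift)   # horizontal shift of this row only
    return frame

scene = np.zeros((480, 640))
scene[:, 300:340] = 1.0                        # a vertical bar in the scene

# Low-frequency pan -> the bar comes out sheared; a high-frequency
# sinusoid produces the wobble seen in hand-held video.
sheared = rolling_shutter_capture(scene, lambda t: 2000.0 * t)
wobbly = rolling_shutter_capture(scene, lambda t: 15.0 * np.sin(2 * np.pi * 60 * t))
```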
In our recent paper, Calibration-Free Rolling Shutter Removal, which received the best paper award at IEEE ICCP 2012, we demonstrate a solution that corrects these rolling shutter distortions in videos. A significant feature of our approach is that it does not require any knowledge of the camera used to shoot the video. The time delay in capturing two consecutive rows mentioned above is in fact different for every camera and affects the extent of the distortions. Knowing this delay parameter can be useful, but it is difficult to obtain or estimate via calibration; imagine a video that is already uploaded to YouTube, where obtaining this parameter would be challenging. Instead, we show that the visual data in the video alone carries enough information to describe and compensate for the distortions caused by camera motion, even in the presence of a rolling shutter. For more information, see the narrated video description of our paper.
This technique is already integrated with the YouTube stabilizer. Starting today, if you stabilize a video from a mobile phone or another rolling shutter camera, we will also automatically compensate for rolling shutter distortions. To see our technique in action, check out the video below, obtained after applying rolling shutter compensation and stabilization to the one above.
After stabilization and rolling shutter removal
Excellent Papers for 2011
Thursday, March 22, 2012
Posted by Corinna Cortes and Alfred Spector, Google Research
UPDATE: Added Theo Vassilakis as an author of "Dremel: Interactive Analysis of Web-Scale Datasets".
Googlers across the company actively engage with the scientific community by publishing technical papers, contributing open-source packages, working on standards, introducing new APIs and tools, giving talks and presentations, participating in ongoing technical debates, and much more. Our publications offer technical and algorithmic advances, feature aspects we learn as we develop novel products and services, and shed light on some of the technical challenges we face at Google.
In an effort to highlight some of our work, we periodically select a number of publications to be featured on this blog. We first posted a set of papers on this blog in mid-2010 and subsequently discussed them in more detail in follow-up posts. In a second round, we highlighted noteworthy new papers from the latter half of 2010. This time we honor influential papers authored or co-authored by Googlers and published in 2011, covering roughly 10% of our total publications. It's tough choosing, so we may have left out some important papers; please see the full publications list to review the complete group.
In the coming weeks we will be offering a more in-depth look at these publications, but here are some summaries:
Audio processing
“Cascades of two-pole–two-zero asymmetric resonators are good models of peripheral auditory function”, Richard F. Lyon, Journal of the Acoustical Society of America, vol. 130 (2011), pp. 3893-3904.
Lyon's long title summarizes a result that he has been working toward over many years of modeling sound processing in the inner ear. This nonlinear cochlear model is shown to be "good" with respect to psychophysical data on masking, physiological data on mechanical and neural response, and computational efficiency. These properties derive from the close connection between wave propagation and filter cascades. This filter-cascade model of the ear is used as an efficient sound processor for several machine hearing projects at Google.
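As a rough illustration of the filter-cascade idea (not Lyon's actual pole-zero design), each cochlear channel can be modeled by running the signal through a chain of second-order filters and tapping the output after each stage. The coefficients below are placeholder biquads, assumptions made purely for demonstration.

```python
import numpy as np
from scipy.signal import lfilter

def filter_cascade(signal, stages):
    """Run a signal through a cascade of second-order (biquad) filters,
    returning the output tapped after each stage, one per 'channel'."""
    outputs = []
    x = signal
    for b, a in stages:              # (numerator, denominator) coefficients per stage
        x = lfilter(b, a, x)
        outputs.append(x)
    return outputs

# Placeholder stages: gentle low-pass biquads standing in for the
# asymmetric resonators of the real cochlear model.
stages = [(np.array([0.2, 0.2, 0.0]), np.array([1.0, -0.5, 0.1])) for _ in range(8)]
audio = np.random.randn(16000)       # one second of noise at 16 kHz
channels = filter_cascade(audio, stages)
```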
Electronic Commerce and Algorithms
“Online Vertex-Weighted Bipartite Matching and Single-bid Budgeted Allocations”, Gagan Aggarwal, Gagan Goel, Chinmay Karande, Aranyak Mehta, SODA 2011.
The authors introduce an elegant and powerful algorithmic technique to the area of online ad allocation and matching: a hybrid of random perturbations and greedy choice to make decisions on the fly. Their technique sheds new light on classic matching algorithms, and can be used, for example, to pick one among a set of relevant ads, without knowing in advance the demand for ad slots on future web page views.
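A minimal sketch of the perturbation-plus-greedy idea in an online vertex-weighted matching setting (a simplified rendering, not the paper's exact algorithm or analysis): each offline vertex draws a random perturbation once, and each arriving online vertex is greedily matched to the available neighbor with the highest perturbed weight. The exponential discount function is an assumption drawn from this line of work.

```python
import math
import random

def perturbed_greedy_matching(offline_weights, online_neighbors, seed=0):
    """offline_weights: {v: weight}; online_neighbors: list of neighbor sets,
    one per arriving online vertex. Returns {online_index: matched offline v}."""
    rng = random.Random(seed)
    y = {v: rng.random() for v in offline_weights}      # one perturbation per vertex
    discount = lambda v: 1.0 - math.exp(y[v] - 1.0)     # assumed discount function
    matched, result = set(), {}
    for i, neighbors in enumerate(online_neighbors):
        candidates = [v for v in neighbors if v not in matched]
        if not candidates:
            continue
        best = max(candidates, key=lambda v: offline_weights[v] * discount(v))
        matched.add(best)
        result[i] = best
    return result

print(perturbed_greedy_matching({"a": 3.0, "b": 1.0}, [{"a", "b"}, {"a"}]))
```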
“Milgram-routing in social networks”, Silvio Lattanzi, Alessandro Panconesi, D. Sivakumar, Proceedings of the 20th International Conference on World Wide Web (WWW 2011), pp. 725-734.
Milgram’s "six degrees of separation" experiment, and the fascinating small-world hypothesis that follows from it, have generated a lot of interesting research in recent years. In this landmark experiment, Milgram showed that people unknown to each other are often connected by surprisingly short chains of acquaintances. In this paper we show, theoretically and experimentally, how a recent model of social networks, "Affiliation Networks", offers an explanation of this phenomenon and inspires an interesting technique for local routing within social networks.
“Non-Price Equilibria in Markets of Discrete Goods”, Avinatan Hassidim, Haim Kaplan, Yishay Mansour, Noam Nisan, EC 2011.
We present a correspondence between markets of indivisible items and a family of auction-based n-player games. We show that a market has a price-based (Walrasian) equilibrium if and only if the corresponding game has a pure Nash equilibrium. We then turn to markets that do not have a Walrasian equilibrium (which is the interesting case) and study properties of the mixed Nash equilibria of the corresponding games.
HCI
“From Basecamp to Summit: Scaling Field Research Across 9 Locations”, Jens Riegelsberger, Audrey Yang, Konstantin Samoylov, Elizabeth Nunge, Molly Stevens, Patrick Larvie, CHI 2011 Extended Abstracts.
The paper reports on our experience with a basecamp research hub used to coordinate logistics and ongoing real-time analysis with research teams in the field. We also reflect on the implications for the meaning of research in a corporate context, where much of the value may lie less in a final report and more in the curated impressions and memories our colleagues take away from the research trip.
“User-Defined Motion Gestures for Mobile Interaction”, Jaime Ruiz, Yang Li, Edward Lank, CHI 2011: ACM Conference on Human Factors in Computing Systems, pp. 197-206.
Modern smartphones contain sophisticated sensors that can detect rich motion gestures: deliberate movements of the device by end users to invoke commands. However, little is known about best practices in motion gesture design for the mobile computing paradigm. We systematically studied the design space of motion gestures via a guessability study that elicits end-user motion gestures to invoke commands on a smartphone device. The study revealed consensus among our participants on parameters of movement and on mappings of motion gestures to commands, from which we developed a taxonomy for motion gestures and compiled an end-user-inspired motion gesture set. The work lays the foundation of motion gesture design, a new dimension for mobile interaction.
Information Retrieval
“Reputation Systems for Open Collaboration”, B.T. Adler, L. de Alfaro, A. Kulshreshtha, I. Pye, Communications of the ACM, vol. 54, no. 8 (2011), pp. 81-87.
This paper describes content-based reputation algorithms, which rely on automated content analysis to derive user and content reputation, and their applications to Wikipedia and Google Maps. The Wikipedia reputation system WikiTrust relies on a chronological analysis of user contributions to articles, metering positive or negative increments of reputation whenever new contributions are made. The Google Maps system Crowdsensus compares the information provided by users about map business listings and computes both a likely reconstruction of the correct listing and a reputation value for each user. Algorithm-based user incentives ensure the trustworthiness of evaluations of Wikipedia entries and Google Maps business information.
Machine Learning and Data Mining
“Domain adaptation in regression”, Corinna Cortes, Mehryar Mohri, Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT 2011).
Domain adaptation is one of the most important and challenging problems in machine learning. This paper presents a series of theoretical guarantees for domain adaptation in regression, gives an adaptation algorithm based on that theory that can be cast as a semi-definite programming problem, derives an efficient solution for that problem by using results from smooth optimization, shows that the solution can scale to relatively large data sets, and reports extensive empirical results demonstrating the benefits of this new adaptation algorithm.
“On the necessity of irrelevant variables”, David P. Helmbold, Philip M. Long, ICML 2011.
Relevant variables sometimes do much more good than irrelevant variables do harm, so that it is possible to learn a very accurate classifier using predominantly irrelevant variables. We show that this holds given an assumption that formalizes the intuitive idea that the variables are non-redundant. For problems like this it can be advantageous to add many additional variables, even if only a small fraction of them are relevant.
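A tiny simulation of the intuition (my own illustration, not the paper's construction): the label is driven by a minority of weakly relevant ±1 features, the rest are independent noise, and a classifier that simply sums all features is still accurate because the noise contributions largely cancel while the relevant ones add up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_relevant, n_irrelevant = 5000, 200, 800

# Relevant features agree with the label only slightly more often than not;
# the remaining 80% of features are pure noise.
y = rng.choice([-1, 1], size=n_samples)
relevant = y[:, None] * rng.choice([-1, 1], p=[0.4, 0.6], size=(n_samples, n_relevant))
irrelevant = rng.choice([-1, 1], size=(n_samples, n_irrelevant))
X = np.hstack([relevant, irrelevant])

# "Vote over everything": predict the sign of the sum of all 1000 features.
predictions = np.sign(X.sum(axis=1))
print("accuracy:", (predictions == y).mean())   # well above chance despite mostly noise
```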
“Online Learning in the Manifold of Low-Rank Matrices”, Gal Chechik, Daphna Weinshall, Uri Shalit, Neural Information Processing Systems (NIPS 23), 2011, pp. 2128-2136.
Learning measures of similarity from examples of similar and dissimilar pairs is a problem that is hard to scale. LORETA uses retractions, an operator from matrix optimization, to learn low-rank similarity matrices efficiently. This makes it possible to learn similarities between objects like images or texts represented with many more features than was previously possible.
Machine Translation
“Training a Parser for Machine Translation Reordering”, Jason Katz-Brown, Slav Petrov, Ryan McDonald, Franz Och, David Talbot, Hiroshi Ichikawa, Masakazu Seno, Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP '11).
Machine translation systems often need to understand the syntactic structure of a sentence to translate it correctly. Traditionally, syntactic parsers are evaluated as standalone systems against reference data created by linguists. Instead, we show how to train a parser to optimize reordering accuracy in a machine translation system, resulting in measurable improvements in translation quality over a more traditionally trained parser.
“Watermarking the Outputs of Structured Prediction with an Application in Statistical Machine Translation”, Ashish Venugopal, Jakob Uszkoreit, David Talbot, Franz Och, Juri Ganitkevitch, Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP).
We propose a general method to watermark and probabilistically identify the structured results of machine learning algorithms with an application in statistical machine translation. Our approach does not rely on controlling or even knowing the inputs to the algorithm and provides probabilistic guarantees on the ability to identify collections of results from one’s own algorithm, while being robust to limited editing operations.
“Inducing Sentence Structure from Parallel Corpora for Reordering”, John DeNero, Jakob Uszkoreit, Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Automatically discovering the full range of linguistic rules that govern the correct use of language is an appealing goal, but extremely challenging. Our paper describes a targeted method for discovering only those aspects of linguistic syntax necessary to explain how two different languages differ in their word ordering. By focusing on word order, we demonstrate an effective and practical application of unsupervised grammar induction that improves a Japanese to English machine translation system.
Multimedia and Computer Vision
“Kernelized Structural SVM Learning for Supervised Object Segmentation”, Luca Bertelli, Tianli Yu, Diem Vu, Burak Gokturk, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2011.
The paper proposes a principled way for computers to learn how to segment the foreground from the background of an image given a set of training examples. The technology is built upon a specially designed nonlinear segmentation kernel within the recently proposed structured SVM learning framework.
“Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths”, Matthias Grundmann, Vivek Kwatra, Irfan Essa, IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011).
Casually shot videos captured by handheld or mobile cameras suffer from a significant amount of shake. Existing in-camera stabilization methods dampen high-frequency jitter but do not suppress low-frequency movements and bounces, such as those observed in videos captured by a walking person. On the other hand, most professionally shot videos consist of carefully designed camera configurations, using specialized equipment such as tripods or camera dollies, and employ ease-in and ease-out for transitions. Our stabilization technique automatically converts casual shaky footage into more pleasant and professional-looking videos by mimicking these cinematographic principles. The original, shaky camera path is divided into a set of segments, each approximated by either constant, linear, or parabolic motion, using an algorithm based on robust L1 optimization. The stabilizer has been part of the YouTube Editor (youtube.com/editor) since March 2011.
“The Power of Comparative Reasoning”, Jay Yagnik, Dennis Strelow, David Ross, Ruei-Sung Lin, International Conference on Computer Vision (2011).
The paper describes a theoretically derived vector space transform that converts vectors into sparse binary vectors such that Euclidean space operations on the sparse binary vectors imply rank space operations in the original vector space. The transform (a) does not need any data-driven supervised or unsupervised learning, (b) can be computed from polynomial expansions of the input space in time linear in the degree of the polynomial, and (c) can be implemented in about 10 lines of code. We show competitive results on similarity search and sparse coding (for classification) tasks.
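A minimal sketch of a rank-based transform in this spirit (my illustration of winner-take-all style hashing, not necessarily the authors' exact code): each hash applies a fixed random permutation, keeps the first K entries, and records which of them is largest. The resulting codes depend only on the ordering of values, so they are invariant to any monotonic rescaling of the input.

```python
import numpy as np

def wta_hash(x, permutations, k=4):
    """Winner-take-all style codes: for each fixed permutation, take the first k
    permuted entries and output the index of the maximum (a rank comparison)."""
    return np.array([int(np.argmax(x[perm[:k]])) for perm in permutations])

rng = np.random.default_rng(0)
dim, n_hashes = 128, 64
permutations = [rng.permutation(dim) for _ in range(n_hashes)]

a = rng.normal(size=dim)
b = 3.0 * a + 5.0            # monotonic transform of a: same ordering of values
codes_a, codes_b = wta_hash(a, permutations), wta_hash(b, permutations)
print((codes_a == codes_b).all())   # True: codes depend only on relative order
```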
NLP
“Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections”, Dipanjan Das, Slav Petrov, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL '11), 2011, Best Paper Award.
We would like to have natural language processing systems for all languages, but obtaining labeled data for all languages and tasks is unrealistic and expensive. We present an approach which leverages existing resources in one language (for example English) to induce part-of-speech taggers for languages without any labeled training data. We use graph-based label propagation for cross-lingual knowledge transfer and use the projected labels as features in a hidden Markov model trained with the Expectation Maximization algorithm.
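A minimal sketch of graph-based label propagation in the abstract (not the paper's bilingual graph construction or its exact update): nodes with known label distributions stay fixed while unlabeled nodes repeatedly take the weighted average of their neighbors' distributions until the values settle.

```python
import numpy as np

def propagate_labels(adjacency, labels, labeled_mask, iterations=100):
    """adjacency: (n, n) symmetric weights; labels: (n, k) distributions,
    meaningful only where labeled_mask is True. Returns propagated (n, k)."""
    q = labels.copy()
    row_sums = adjacency.sum(axis=1, keepdims=True) + 1e-12
    for _ in range(iterations):
        q = adjacency @ q / row_sums            # each node averages its neighbors
        q[labeled_mask] = labels[labeled_mask]  # clamp the seed (labeled) nodes
    return q

# Tiny chain graph: node 0 is labeled class A, node 3 class B, middle nodes unknown.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
labels = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], float)
mask = np.array([True, False, False, True])
print(propagate_labels(A, labels, mask).round(2))
```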
Networks
“TCP Fast Open”, Sivasankar Radhakrishnan, Yuchung Cheng, Jerry Chu, Arvind Jain, Barath Raghavan, Proceedings of the 7th International Conference on emerging Networking EXperiments and Technologies (CoNEXT), 2011.
TCP Fast Open enables data exchange during TCP's initial handshake. It decreases application network latency by one full round-trip time, a significant speedup for today's short Web transfers. Our experiments on popular websites show that Fast Open reduces whole-page load time by over 10% on average, and in some cases by up to 40%.
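For readers who want to experiment, here is a hedged sketch of enabling Fast Open on a Linux host from Python. It assumes a kernel with TFO support and, because the socket module does not expose MSG_FASTOPEN on every build, defines that flag's Linux value by hand; treat the constants and the overall shape as assumptions, not a reference implementation.

```python
import socket

MSG_FASTOPEN = 0x20000000        # Linux sendto() flag; not always exposed by Python

# Server side: ask the kernel to accept TFO connections (queue length 16).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_TCP, getattr(socket, "TCP_FASTOPEN", 23), 16)
server.bind(("0.0.0.0", 8080))
server.listen()

# Client side: sendto() with MSG_FASTOPEN puts data in the SYN itself,
# so the request rides along with the handshake and saves a round trip.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.sendto(b"GET / HTTP/1.0\r\n\r\n", MSG_FASTOPEN, ("127.0.0.1", 8080))
```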
“Proportional Rate Reduction for TCP”, Nandita Dukkipati, Matt Mathis, Yuchung Cheng, Monia Ghobadi, Proceedings of the 11th ACM SIGCOMM Conference on Internet Measurement 2011, Berlin, Germany, November 2-4, 2011.
Packet losses increase latency of Web transfers and negatively impact user experience. Proportional rate reduction (PRR) is designed to recover from losses quickly, smoothly and accurately by pacing out retransmissions across received ACKs during TCP’s fast recovery. Experiments on Google Web and YouTube servers in U.S. and India demonstrate that PRR reduces the TCP latency of connections experiencing losses by 3-10% depending on response size.
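A simplified sketch of the proportional pacing idea (paraphrasing the mechanism later standardized in RFC 6937, not the paper's full algorithm): during recovery, the sender releases segments in proportion to the data the ACKs report as delivered, so the congestion window converges smoothly toward ssthresh instead of collapsing or stalling.

```python
import math

def prr_allowance(prr_delivered, prr_out, pipe, ssthresh, recover_fs):
    """How many segments may be sent on this ACK during fast recovery.
    prr_delivered: data delivered to the receiver since recovery started.
    prr_out: data we have sent during recovery. pipe: data still in flight.
    recover_fs: flight size when recovery began. (Simplified from RFC 6937.)"""
    if pipe > ssthresh:
        # Proportional phase: send in proportion to delivered data.
        sndcnt = math.ceil(prr_delivered * ssthresh / recover_fs) - prr_out
    else:
        # Reduction-bound phase: grow back toward ssthresh, but no faster
        # than the data that has actually left the network.
        sndcnt = min(ssthresh - pipe, max(prr_delivered - prr_out, 0) + 1)
    return max(sndcnt, 0)

# Example: mid-recovery, 20 segments delivered, 8 sent by us so far.
print(prr_allowance(prr_delivered=20, prr_out=8, pipe=30, ssthresh=25, recover_fs=50))
```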
Security and Privacy
“Automated Analysis of Security-Critical JavaScript APIs”, Ankur Taly, Úlfar Erlingsson, John C. Mitchell, Mark S. Miller, Jasvir Nagra, IEEE Symposium on Security & Privacy (SP), 2011.
As software is increasingly written in high-level, type-safe languages, attackers have fewer means to subvert system fundamentals, and attacks are more likely to exploit errors and vulnerabilities in application-level logic. This paper describes a generic, practical defense against such attacks, which can protect critical application resources even when those resources are partially exposed to attackers via software interfaces. In the context of carefully-crafted fragments of JavaScript, the paper applies formal methods and semantics to prove that these defenses can provide complete, non-circumventable mediation of resource access; the paper also shows how an implementation of the techniques can establish the properties of widely-used software, and find previously-unknown bugs.
“App Isolation: Get the Security of Multiple Browsers with Just One”, Eric Y. Chen, Jason Bau, Charles Reis, Adam Barth, Collin Jackson, 18th ACM Conference on Computer and Communications Security, 2011.
We find that anecdotal advice to use a separate web browser for sites like your bank is indeed effective at defeating most cross-origin web attacks. We also prove that a single web browser can provide the same key properties, for sites that fit within the compatibility constraints.
Speech
“Improving the speed of neural networks on CPUs”, Vincent Vanhoucke, Andrew Senior, Mark Z. Mao, Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011.
As deep neural networks become state of the art in real-time machine learning applications such as speech recognition, computational complexity is fast becoming a limiting factor in their adoption. We show how best to leverage modern CPU architectures to significantly speed up their inference.
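One CPU-friendly trick in this vein is low-precision arithmetic. Here is a minimal sketch (my illustration of 8-bit weight quantization in general, not the paper's specific kernels): weights are stored as int8 with a per-layer scale, the dot product runs on small integers, and the result is rescaled at the end.

```python
import numpy as np

def quantize_weights(w):
    """Map float weights to int8 plus a scale so that w ~= scale * w_q."""
    scale = np.abs(w).max() / 127.0
    w_q = np.round(w / scale).astype(np.int8)
    return w_q, scale

def quantized_matvec(w_q, scale, x):
    """Integer-dominated matrix-vector product, rescaled back to floats."""
    return scale * (w_q.astype(np.int32) @ x.astype(np.float32))

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512)).astype(np.float32)
x = rng.normal(size=512).astype(np.float32)

w_q, s = quantize_weights(W)
exact, approx = W @ x, quantized_matvec(w_q, s, x)
print("max abs error:", np.abs(exact - approx).max())   # small relative to activations
```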
“Bayesian Language Model Interpolation for Mobile Speech Input”, Cyril Allauzen, Michael Riley, Interspeech 2011.
Voice recognition on the Android platform must contend with many possible target domains, e.g. search, maps, SMS. For each of these, a domain-specific language model was built by linearly interpolating several n-gram LMs trained on a common set of Google corpora. The current work finds a way to efficiently compute a single n-gram language model with accuracy very close to the domain-specific LMs but with considerably less complexity at recognition time.
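A minimal sketch of linear language-model interpolation (the starting point the paper builds on, not its Bayesian single-model construction): the probability of a word given its history is a weighted mixture of the component n-gram models, with per-domain mixture weights. The toy component models and weights below are assumptions.

```python
def interpolated_prob(word, history, component_lms, weights):
    """P(word | history) as a convex combination of component n-gram LMs.
    component_lms: list of functions (word, history) -> probability.
    weights: per-domain mixture weights summing to 1."""
    return sum(w * lm(word, history) for w, lm in zip(weights, component_lms))

# Toy unigram-style components standing in for real n-gram models.
web_lm = lambda w, h: {"pizza": 0.02, "directions": 0.01}.get(w, 1e-4)
maps_lm = lambda w, h: {"pizza": 0.01, "directions": 0.05}.get(w, 1e-4)

maps_domain_weights = [0.3, 0.7]   # assumed weights tuned for the maps domain
print(interpolated_prob("directions", (), [web_lm, maps_lm], maps_domain_weights))
```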
Statistics
“Large-Scale Parallel Statistical Forecasting Computations in R”, Murray Stokely, Farzan Rohani, Eric Tassone, JSM Proceedings, Section on Physical and Engineering Sciences, 2011.
This paper describes the implementation of a framework for using distributed computational infrastructure from within the R interactive statistical computing environment, with applications to time-series forecasting. The system is widely used by the statistical analyst community at Google for data analysis on very large data sets.
Structured Data
“Dremel: Interactive Analysis of Web-Scale Datasets”, Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis, Communications of the ACM, vol. 54 (2011), pp. 114-123.
Dremel is a scalable, interactive ad hoc query system. By combining multi-level execution trees and a columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. Besides continued growth inside Google, Dremel now also backs an increasing number of external services, including BigQuery and UIs such as the AdExchange front-end.
“Representative Skylines using Threshold-based Preference Distributions”, Atish Das Sarma, Ashwin Lall, Danupon Nanongkai, Richard J. Lipton, Jim Xu, International Conference on Data Engineering (ICDE), 2011.
The paper adopts a principled approach to representative skylines and formalizes the problem of displaying k tuples such that the probability that a random user clicks on one of them is maximized. This requires mathematically modeling (a) the likelihood with which a user is interested in a tuple and (b) how one handles the lack of knowledge of an explicit set of users. The work presents theoretical and experimental results showing that the suggested algorithm significantly outperforms previously suggested approaches.
“Hyper-local, directions-based ranking of places”, Petros Venetis, Hector Gonzalez, Alon Y. Halevy, Christian S. Jensen, PVLDB, vol. 4(5) (2011), pp. 290-30.
Click-through information is one of the strongest signals we have for ranking web pages. We propose an equivalent signal for ranking real-world places: the number of times people ask for precise directions to the address of the place. We show that this signal is competitive in quality with human reviews while being much cheaper to collect, and we show that it can be incorporated efficiently into a location search system.
Systems
“Power Management of Online Data-Intensive Services”, David Meisner, Christopher M. Sadler, Luiz André Barroso, Wolf-Dietrich Weber, Thomas F. Wenisch, Proceedings of the 38th ACM International Symposium on Computer Architecture, 2011.
Compute- and data-intensive Web services (such as Search) are a notoriously hard target for energy-saving techniques. This article characterizes the statistical hardware activity behavior of servers running Web search and discusses the potential of existing and proposed energy-saving techniques.
“The Impact of Memory Subsystem Resource Sharing on Datacenter Applications”, Lingjia Tang, Jason Mars, Neil Vachharajani, Robert Hundt, Mary-Lou Soffa, ISCA, 2011.
In this work, the authors expose key characteristics of an emerging class of Google-style workloads and show how to enhance system software to take advantage of these characteristics to improve efficiency in data centers. The authors find that across datacenter applications there is both a sizable benefit and a potential degradation from improperly sharing micro-architectural resources on a single machine (such as on-chip caches and bandwidth to memory). Co-locating threads from multiple applications with diverse memory behavior changes the optimal mapping of threads to cores for each application. By employing an adaptive thread-to-core mapper, the authors improved the performance of the datacenter applications by up to 22% over the status quo thread-to-core mapping, achieving performance within 3% of optimal.
“Language-Independent Sandboxing of Just-In-Time Compilation and Self-Modifying Code”, Jason Ansel, Petr Marchenko, Úlfar Erlingsson, Elijah Taylor, Brad Chen, Derek Schuff, David Sehr, Cliff L. Biffle, Bennet S. Yee, ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2011.
Since its introduction in the early '90s, Software Fault Isolation (SFI) has been a static code technique, commonly perceived as incompatible with dynamic libraries, runtime code generation, and other dynamic code. This paper describes how to address this limitation and explains how the SFI techniques in Google Native Client were extended to support modern language implementations based on just-in-time code generation and runtime instrumentation. This work is already deployed in Google Chrome, benefiting millions of users, and was developed over a summer collaboration with three Ph.D. interns; it exemplifies how Research at Google is focused on rapidly bringing significant benefits to our users through groundbreaking technology and real-world products.
“Thialfi: A Client Notification Service for Internet-Scale Applications”, Atul Adya, Gregory Cooper, Daniel Myers, Michael Piatek, Proc. 23rd ACM Symposium on Operating Systems Principles (SOSP), 2011, pp. 129-142.
This paper describes a notification service that scales to hundreds of millions of users, provides sub-second latency in the common case, and guarantees delivery even in the presence of a wide variety of failures. The service has been deployed in several popular Google applications including Chrome, Google Plus, and Contacts.
Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths
Monday, June 20, 2011
Posted by Matthias Grundmann, Vivek Kwatra, and Irfan Essa, Research Team
Earlier this year, we announced the launch of new features on the YouTube Video Editor, including stabilization for shaky videos, with the ability to preview them in real time. The core technology behind this feature is detailed in this paper, which will be presented at the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 2011).
Casually shot videos captured by handheld or mobile cameras suffer from a significant amount of shake. Existing in-camera stabilization methods dampen high-frequency jitter but do not suppress low-frequency movements and bounces, such as those observed in videos captured by a walking person. On the other hand, most professionally shot videos consist of carefully designed camera configurations, using specialized equipment such as tripods or camera dollies, and employ ease-in and ease-out for transitions. Our goal was to devise a completely automatic method for converting casual, shaky footage into more pleasant and professional-looking videos.
Our technique mimics the cinematographic principles outlined above by automatically determining the best camera path using a robust optimization technique. The original, shaky camera path is divided into a set of segments, each approximated by either a constant, linear or parabolic motion. Our optimization finds the best of all possible partitions using a computationally efficient and stable algorithm.
To achieve real-time performance on the web, we distribute the computation across multiple machines in the cloud. This enables us to provide users with a real-time preview and interactive control of the stabilized result. Above we provide a video demonstration of how to use this feature on the YouTube Editor. We will also demo this live at Google's exhibition booth at CVPR 2011.
For further details, please read our paper.
Google at CVPR 2011
Thursday, June 16, 2011
Posted by Mei Han and Sergey Ioffe, Research Team
The computer vision community will get together in Colorado Springs during the week of June 20th for the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 2011). This year will see a record number of people attending the conference and its 27 co-located workshops and tutorials; registration was closed at 1,500 attendees even before the conference started.
Computer vision is at the core of many Google products, such as Image Search, YouTube, Street View, Picasa, and Goggles, and, as always, Google is involved in several ways with CVPR. Andrew Senior is serving as an area chair of CVPR 2011, and many Googlers are reviewers. Googlers also co-authored the following papers:
Where's Waldo: Matching People in Images of Crowds, by Rahul Garg, Deva Ramanan, Steve Seitz, Noah Snavely
Visual and Semantic Similarity in ImageNet, by Thomas Deselaers, Vittorio Ferrari
Multicore Bundle Adjustment, by Changchang Wu, Sameer Agarwal, Brian Curless, Steve Seitz
A Hierarchical Conditional Random Field Model for Labeling and Segmenting Images of Street Scenes, by Qixing Huang, Mei Han, Bo Wu, Sergey Ioffe
Kernelized Structural SVM Learning for Supervised Object Segmentation, by Luca Bertelli, Tianli Yu, Diem Vu, Salih Gokturk
Discriminative Tag Learning on YouTube Videos with Latent Sub-tags, by Weilong Yang, George Toderici
Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths, by Matthias Grundmann, Vivek Kwatra, Irfan Essa
Image Saliency: From Local to Global Context, by Meng Wang, Janusz Konrad, Prakash Ishwar, Yushi Jing, Henry Rowley
If you are attending the conference, stop by Google’s exhibition booth. In addition to talking with Google researchers, you will get to see examples of exciting computer vision research that has made it into Google products including, among others, the following:
Google Earth Facade Shadow Removal, by Mei Han, Vivek Kwatra, and Shengyang Dai
We will demonstrate our technique for removing shadows and other lighting/texture artifacts from building facades in Google Earth. We obtain cleaner, clearer, and more uniform textures which provide users with an improved visual experience.
Video Stabilization on YouTube Editor, by Matthias Grundmann, Vivek Kwatra, and Irfan Essa
Casually shot videos captured by handheld or mobile cameras suffer from a significant amount of shake. In contrast, professionally shot video usually employs stabilization equipment such as tripods or camera dollies and uses ease-in and ease-out for transitions. Our technique mimics these cinematographic principles by optimally dividing the original, shaky camera path into a set of segments and approximating each with either constant, linear, or parabolic motion, using a computationally efficient and stable algorithm. We will showcase a live version of our algorithm, featuring real-time performance and interactive control, which is publicly available at youtube.com/editor.
Tag Suggest for YouTube, by George Toderici and Mehmet Emre Sargin
YouTube offers millions of users the opportunity to upload videos and share them with their friends. Many users would love to have their videos discoverable but don't annotate them properly. One new feature on YouTube that seeks to address this problem is tag prediction, based on video content and, independently, on text metadata.
6/17/2011 UPDATE: "Posted by" was changed to include Sergey Ioffe.
Large Scale Image Annotation: Learning to Rank with Joint Word-Image Embeddings
Thursday, March 10, 2011
Posted by Jason Weston and Samy Bengio, Research Team
In our paper, we introduce a generic framework to find a joint representation of images and their labels, which can then be used for various tasks, including image ranking and image annotation.
We focus on the task of automatically assigning annotations (text labels) to images given only the pixel representation of the image, i.e., with no known metadata. This is achieved with a learning algorithm: the computer learns to predict annotations for new images from annotated training images. Such training datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. In this paper, we propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at the top of the ranked list of annotations for a given image and learning a low-dimensional joint embedding vector space for both images and annotations. Our system learns an interpretable model, in which annotations with alternate wordings ("president obama" or "barack"), different languages ("tour eiffel" or "eiffel tower"), or similar concepts (such as "toad" or "frog") are close in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations.
Our system is trained on ~10 million images with ~100,000 possible annotation types and it annotates a single new image in ~0.17 seconds (not including feature processing) and consumes only 82MB of memory. Our method both outperforms all the methods we tested against and in comparison to them is faster and consumes less memory, making it possible to house such a system on a laptop or mobile device.
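A minimal sketch of how such a joint embedding is used at annotation time (an illustration of the general scheme, not the trained production model): images and annotations are mapped into the same low-dimensional space by learned matrices, and an image's annotations are ranked by their dot product with the embedded image. The dimensions and random matrices below are assumptions standing in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_annotations, feat_dim, embed_dim = 100_000, 10_000, 100

W_image = rng.normal(size=(embed_dim, feat_dim)) * 0.01       # learned in training
W_label = rng.normal(size=(n_annotations, embed_dim)) * 0.01  # one row per annotation

def annotate(image_features, top_k=5):
    """Embed the image and rank all annotations by dot product in the joint space."""
    z = W_image @ image_features          # image -> 100-d embedding
    scores = W_label @ z                  # similarity to every annotation
    return np.argsort(-scores)[:top_k]    # indices of the best annotations

image_features = rng.normal(size=feat_dim)   # visual features in practice
print(annotate(image_features))
```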