Google Research Blog
The latest news from Research at Google
Understanding Bias in Peer Review
Thursday, November 30, 2017
Posted by Andrew Tomkins, Director of Engineering and William D. Heavlin, Statistician, Google Research
In the 1600s, a series of practices came into being known collectively as the “scientific method.” These practices encoded verifiable experimentation as a path to establishing scientific fact. Scientific literature arose as a mechanism to validate and disseminate findings, and standards of scientific peer review developed as a means to control the quality of entrants into this literature. Over the course of the development of peer review, one key structural question remains unresolved to the current day: should the reviewers of a piece of scientific work be made aware of the identity of the authors? Those in favor argue that such additional knowledge may allow the reviewer to set the work in perspective and evaluate it more completely. Those opposed argue instead that the reviewer may form an opinion based on past performance rather than the merit of the work at hand.
Existing academic literature on this subject describes specific forms of bias that may arise when reviewers are aware of the authors. In 1968, Merton proposed the Matthew effect, whereby credit goes to the best established researchers. More recently, Knobloch-Westerwick et al. proposed a Matilda effect, whereby papers from male-first authors were considered to have greater scientific merit than those from female-first authors. But with the exception of one classical study performed by Rebecca Blank in 1991 at the American Economic Review, there have been few controlled experimental studies of such effects on reviews of academic papers.
Last year we had the opportunity to explore this question experimentally, resulting in “Reviewer bias in single- versus double-blind peer review,” a paper that just appeared in the Proceedings of the National Academy of Sciences. Working with Professor Min Zhang of Tsinghua University, we performed an experiment during the peer review process of the 10th ACM Web Search and Data Mining Conference (WSDM 2017) to compare the behavior of reviewers under single-blind and double-blind review. Our experiment ran as follows:
We invited a number of experts to join the conference Program Committee (PC).
We randomly split these PC members into a single-blind cadre and a double-blind cadre.
We asked all PC members to “bid” for papers they were qualified to review, but only the single-blind cadre had access to the names and institutions of the paper authors.
Based on the resulting bids, we then allocated two single-blind and two double-blind PC members to each paper.
Each PC member read his or her assigned papers and entered reviews, again with only single-blind PC members able to see the authors and institutions.
At this point, we closed our experiment and performed the remainder of the conference reviewing process under the single-blind model. As a result, we were able to assess the difference in bidding and reviewing behavior of single-blind and double-blind PC members on the same papers. We discovered a number of surprises.
Our first finding shows that compared to their double-blind counterparts, single-blind PC members tend to enter higher scores for papers from top institutions (the finding holds for both universities and companies) and for papers written by well-known authors. This suggests that a paper authored by an up-and-coming researcher might be reviewed more negatively (by a single-blind PC member) than exactly the same paper written by an established star of the field.
Digging a little deeper, we show some additional findings related to the “bidding process,” in which PC members indicate which papers they would like to review. We found that single-blind PC members (a) bid for about 22% fewer papers than their double-blind counterparts, and (b) bid preferentially for papers from top schools and companies. Finding (a) is especially intriguing; with no author information, double-blind reviewers have less to go on, arguably making the job of weighing the merit of each paper more difficult. Yet the double-blind reviewers bid for more work, not less, than their single-blind counterparts. This suggests that double-blind reviewers become more engaged in the review process. Finding (b) is less surprising, but nonetheless enlightening: in the presence of author names and institutions, this information is incorporated into the reviewers’ bids. All else being equal, the odds that single-blind reviewers bid on papers from top institutions are about 15 percent above parity.
We also studied whether the actual or perceived gender of authors influenced the behavior of single-blind versus double-blind reviewers. Here the results are a little more nuanced. Compared to double-blind reviewers, we saw about a 22% decrease in the odds that a single-blind reviewer would give a female-authored paper a favorable review, but due to the smaller count of female-authored papers this result was not statistically significant. In an extended version of our paper, we consider our study as well as a range of other studies in the literature and perform a “meta-analysis” of all these results. From this larger pool of observations, the combined results do show a significant finding for the gender effect.
To conclude, we see that the practice of double-blind reviewing yields a denser landscape of bids, which may result in a better allocation of papers to qualified reviewers. We also see that reviewers who see author and institution information tend to bid more for papers from top institutions, and are more likely to vote to accept papers from top institutions or famous authors than their double-blind counterparts. This offers some evidence to suggest that a particular piece of work might be accepted under single-blind review if the authors are famous or come from top institutions, but rejected otherwise. Of course, the situation remains complex: double-blind review imposes an administrative burden on conference organizers, reduces the opportunity to detect several varieties of conflict of interest, and may in some cases be difficult to implement due to the existence of pre-prints or long-running research agendas that are well-known to experts in the field. Nonetheless, we recommend that journal editors and conference chairs carefully consider the merits of double-blind review.
Please take a look at our full paper for more details of our study.
Interpreting Deep Neural Networks with SVCCA
Tuesday, November 28, 2017
Posted by Maithra Raghu, Google Brain Team
Deep Neural Networks (DNNs) have driven unprecedented advances in areas such as vision, language understanding and speech recognition. But these successes also bring new challenges. In particular, contrary to many previous machine learning methods, DNNs can be susceptible to adversarial examples in classification, catastrophic forgetting of tasks in reinforcement learning, and mode collapse in generative modelling. In order to build better and more robust DNN-based systems, it is critically important to be able to interpret these models. In particular, we would like a notion of representational similarity for DNNs: can we effectively determine when the representations learned by two neural networks are the same?
In our paper, “SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability,” we introduce a simple and scalable method to address these points. Two specific applications we look at are comparing the representations learned by different networks, and interpreting representations learned by hidden layers in DNNs. Furthermore, we are open sourcing the code so that the research community can experiment with this method.
Key to our setup is the interpretation of each neuron in a DNN as an activation vector. As shown in the figure below, the activation vector of a neuron collects the scalar outputs it produces on the input data. For example, for 50 input images, a neuron in a DNN will output 50 scalar values, encoding how much it responds to each input. These 50 scalar values then make up the activation vector for the neuron. (Of course, in practice, we take many more than 50 inputs.)
Here a DNN is given three inputs, x₁, x₂, x₃. Looking at a neuron inside the DNN (bolded in red, right pane), this neuron produces a scalar output zᵢ corresponding to each input xᵢ. These values form the activation vector of the neuron.
With this basic observation and a little more formulation, we introduce Singular Vector Canonical Correlation Analysis (SVCCA), a technique for taking in two sets of neurons and outputting aligned feature maps learned by both of them. Critically, this technique accounts for superficial differences such as permutations in neuron orderings (crucial for comparing different networks), and can detect similarities where other, more straightforward comparisons fail.
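For readers who want to experiment before diving into the paper or the released code, here is a minimal numpy sketch of the two stages the name suggests — a singular-vector (SVD) step per network followed by CCA across networks — on activation matrices of shape (num_datapoints, num_neurons). It is only illustrative; the open-sourced implementation handles convolutional layers, variance thresholds, and numerical details omitted here.

```python
import numpy as np

def svcca_similarity(acts1, acts2, var_kept=0.99):
    """Toy SVCCA: acts1/acts2 are (num_datapoints, num_neurons) activation matrices."""
    def svd_reduce(acts):
        acts = acts - acts.mean(axis=0, keepdims=True)
        u, s, _ = np.linalg.svd(acts, full_matrices=False)
        # Singular Vector step: keep the top directions explaining `var_kept` of the variance.
        k = int(np.searchsorted(np.cumsum(s ** 2) / np.sum(s ** 2), var_kept)) + 1
        return u[:, :k] * s[:k]

    x, y = svd_reduce(acts1), svd_reduce(acts2)
    # CCA step: canonical correlations are the singular values of the product of
    # orthonormal bases for the two reduced subspaces.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    canonical_corrs = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(canonical_corrs.mean())  # single similarity score in [0, 1]
```

Comparing, say, a layer of net1 with a layer of net2 then amounts to calling svcca_similarity on their activations recorded over the same set of inputs.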
As an example, consider training two convolutional neural nets (net1 and net2, below) on CIFAR-10, a medium scale image classification task. To visualize the results of our method, we compare activation vectors of neurons with the aligned features output by SVCCA. Recall that the activation vector of a neuron is its raw scalar outputs on the input images. The x-axis of the plot consists of images sorted by class (gray dotted lines showing class boundaries), and the y-axis shows the output value of the neuron.
On the left pane, we show the two highest activation (largest euclidean norm) neurons in net1 and net2. Examining the highest activation neurons has been a popular method to interpret DNNs in computer vision, but in this case, the highest activation neurons in net1 and net2 have no clear correspondence, despite both being trained on the same task. However, after applying SVCCA (right pane), we see that the latent representations learned by both networks do indeed share some very similar features. Note that the top two rows representing aligned feature maps are close to identical, as are the second highest aligned feature maps (bottom two rows). Furthermore, these aligned mappings in the right pane also show a clear correspondence with the class boundaries, e.g. we see the top pair give negative outputs for Class 8, with the bottom pair giving a positive output for Class 2 and Class 7.
While SVCCA can be applied across networks, it can also be applied to the same network across time, enabling the study of how different layers in a network converge to their final representations. Below, we show panes that compare the representation of layers in net1 during training (y-axes) with the layers at the end of training (x-axes). For example, in the top left pane (titled “0% trained”), the x-axis shows layers of increasing depth of net1 at 100% trained, and the y-axis shows layers of increasing depth at 0% trained. Each (i,j) square then tells us how similar the representation of layer i at 100% trained is to layer j at 0% trained. The input layer is at the bottom left, and is (as expected) identical at 0% and 100% trained. We make this comparison at several points through training, at 0%, 35%, 75% and 100%, for convolutional (top row) and residual (bottom row) nets on CIFAR-10.
Plots showing learning dynamics of convolutional and residual networks on CIFAR-10. Note the additional structure also visible: the 2x2 blocks in the top row are due to batch norm layers, and the checkered pattern in the bottom row due to residual connections.
We find evidence of bottom-up convergence, with layers closer to the input converging first, and layers higher up taking longer to converge. This suggests a faster training method, Freeze Training — see our paper for details. Furthermore, this visualization also helps highlight properties of the network. In the top row, there are a couple of 2x2 blocks. These correspond to batch normalization layers, which are representationally identical to their preceding layers. On the bottom row, towards the end of training, we can see a checkerboard-like pattern appear, which is due to the residual connections of the network having greater similarity to previous layers.
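For readers who want to reproduce this kind of plot, the grids above can be assembled from pairwise SVCCA scores. The sketch below assumes you have saved per-layer activation matrices (over a fixed set of inputs) at a training checkpoint and at the end of training, and reuses the svcca_similarity function sketched earlier.

```python
import numpy as np

def layer_similarity_grid(acts_at_checkpoint, acts_at_end):
    """Each argument is a list of (num_datapoints, num_neurons) arrays, one per layer."""
    grid = np.zeros((len(acts_at_end), len(acts_at_checkpoint)))
    for i, final_layer in enumerate(acts_at_end):
        for j, partial_layer in enumerate(acts_at_checkpoint):
            # grid[i, j]: similarity of layer i (100% trained) to layer j (checkpoint).
            grid[i, j] = svcca_similarity(final_layer, partial_layer)
    return grid
```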
So far, we’ve concentrated on applying SVCCA to CIFAR-10. But by applying preprocessing techniques such as the Discrete Fourier transform, we can scale this method to ImageNet-sized models. We applied this technique to the ImageNet ResNet, comparing the similarity of latent representations to representations corresponding to different classes:
SVCCA similarity of latent representations with different classes. We take different layers in the ImageNet ResNet, with 0 indicating input and 74 indicating output, and compare representational similarity of the hidden layer and the output class. Interestingly, different classes are learned at different speeds: the firetruck class is learned faster than the different dog breeds. Furthermore, the two pairs of dog breeds (a husky-like pair and a terrier-like pair) are learned at the same rate, reflecting the visual similarity between them.
Our paper gives further details on the results we’ve explored so far, and also touches on different applications, e.g. compressing DNNs by projecting onto the SVCCA outputs, and Freeze Training, a computationally cheaper method for training deep networks. There are many follow-ups we’re excited about exploring with SVCCA — moving on to different kinds of architectures, comparing across datasets, and better visualizing the aligned directions are just a few ideas we’re eager to try out. We look forward to presenting these results next week at NIPS 2017 in Long Beach, and we hope the code will also encourage many people to apply SVCCA to their network representations to interpret and understand what their networks are learning.
Understanding Medical Conversations
Tuesday, November 21, 2017
Posted by Katherine Chou, Product Manager and Chung-Cheng Chiu, Software Engineer, Google Brain Team
Good documentation helps create good clinical care by communicating a doctor's thinking, their concerns, and their plans to the rest of the team. Unfortunately, physicians routinely spend more time doing documentation than doing what they love most — caring for patients. Doctors often spend ~6 hours of an 11-hour workday in the Electronic Health Records (EHR) system on documentation.[1] Consequently, one study found that more than half of surveyed doctors report at least one symptom of burnout.[2]
In order to help offload note-taking, many doctors have started using medical scribes as a part of their workflow. These scribes listen to the patient-doctor conversations and create notes for the EHR. According to a recent study, introducing scribes not only improved physician satisfaction, but also medical chart quality and accuracy.[3] But the number of doctor-patient conversations that need a scribe is far beyond the capacity of the people who are available for medical scribing.
We wondered: could the voice recognition technologies already available in Google Assistant, Google Home, and Google Translate be used to document patient-doctor conversations and help doctors and scribes summarize notes more quickly?
In “Speech Recognition for Medical Conversations”, we show that it is possible to build Automatic Speech Recognition (ASR) models for transcribing medical conversations. While most current ASR solutions in the medical domain focus on transcribing doctor dictations (i.e., single-speaker speech consisting of predictable medical terminology), our research shows that it is possible to build an ASR model that can handle multi-speaker conversations covering everything from the weather to complex medical diagnoses.
Using this technology, we will start working with physicians and researchers at Stanford University, who have done extensive research on how scribes can improve physician satisfaction, to understand how deep learning techniques such as ASR can facilitate the scribing process of physician notes. In our pilot study, we investigate what types of clinically relevant information can be extracted from medical conversations to assist physicians in reducing their interactions with the EHR. The study is fully patient-consented and the content of the recordings will be de-identified to protect patient privacy.
We hope these technologies will not only help return joy to practice by assisting doctors and scribes with their everyday workload, but also help patients get more dedicated and thorough medical attention, ideally leading to better care.
1. http://www.annfammed.org/content/15/5/419.full
2. http://www.mayoclinicproceedings.org/article/S0025-6196%2815%2900716-8/abstract
3. http://www.annfammed.org/content/15/5/427.full
SLING: A Natural Language Frame Semantic Parser
Wednesday, November 15, 2017
Posted by Michael Ringgaard, Software Engineer and Rahul Gupta, Research Scientist
Until recently, most practical natural language understanding (NLU) systems used a pipeline of analysis stages, from part-of-speech tagging and dependency parsing to steps that computed a semantic representation of the input text. While this facilitated easy modularization of the different analysis stages, errors in earlier stages would have cascading effects on later stages and on the final representation, and the intermediate stage outputs might not be relevant on their own. For example, a typical pipeline might perform the task of dependency parsing in an early stage and the task of coreference resolution towards the end. If one were only interested in the output of coreference resolution, it would still be affected by cascading errors from the dependency parsing stage.
Today we are announcing SLING, an experimental system for parsing natural language text directly into a representation of its meaning as a semantic frame graph. The output frame graph directly captures the semantic annotations of interest to the user, while avoiding the pitfalls of pipelined systems by not running any intermediate stages, which additionally avoids unnecessary computation. SLING uses a special-purpose recurrent neural network model to compute the output representation of input text through incremental editing operations on the frame graph. The frame graph, in turn, is flexible enough to capture many semantic tasks of interest (more on this below). SLING's parser is trained using only the input words, bypassing the need to produce any intermediate annotations (e.g. dependency parses).
SLING provides fast parsing at inference time by providing (a) an efficient and scalable frame store implementation and (b) a JIT compiler that generates efficient code to execute the recurrent neural network. Although SLING is experimental, it achieves a parsing speed of >2,500 tokens/second on a desktop CPU, thanks to its efficient frame store and neural network compiler. SLING is implemented in C++ and is available for download on GitHub. The entire system is also described in detail in a technical report.
Frame Semantic Parsing
Frame Semantics [1] represents the meaning of text — such as a sentence — as a set of formal statements. Each formal statement is called a frame, which can be seen as a unit of knowledge or meaning that also contains interactions with concepts or other frames typically associated with it. SLING organizes each frame as a list of slots, where each slot has a name (role) and a value, which could be a literal or a link to another frame. As an example, consider the sentence:
“Many people now claim to have predicted Black Monday.”
The figure below illustrates SLING recognizing mentions of entities (e.g. people, places, or events), measurements (e.g. dates or distances), and other concepts (e.g. verbs), and placing them in the correct semantic roles for the verbs in the input. The word predicted evokes the most dominant sense of the verb "predict", denoted as a PREDICT-01 frame. Additionally, this frame also has interactions (slots) with who made the prediction (denoted via the ARG0 slot, which points to the PERSON frame for people) and what was being predicted (denoted via ARG1, which links to the EVENT frame for Black Monday). Frame semantic parsing is the task of producing a directed graph of such frames linked through slots.
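To make the slot structure concrete, here is a minimal sketch of the frame graph for the example sentence, written as plain Python dictionaries. This is purely illustrative; SLING's actual frame store is an efficient C++ data structure, not Python objects.

```python
# Hypothetical, simplified frames for "Many people now claim to have predicted Black Monday."
person = {"type": "PERSON", "mention": "people"}
event = {"type": "EVENT", "mention": "Black Monday"}
predict = {
    "type": "PREDICT-01",  # frame evoked by the verb "predicted"
    "ARG0": person,        # slot: who made the prediction
    "ARG1": event,         # slot: what was predicted
}
```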
Although the example above is fairly simple, frame graphs are powerful enough to model a variety of complex semantic annotation tasks. For starters, frames provide a convenient way to bring together language-internal and external information types (e.g. knowledge bases). This can then be used to address complex language understanding problems such as reference, metaphor, metonymy, and perspective. The frame graphs for these tasks only differ in the inventory of frame types, roles, and any linking constraints.
SLING
SLING trains a recurrent neural network by optimizing for the semantic frames of interest.
The internal learned representations in the network’s hidden layers replace the hand-crafted feature combinations and intermediate representations in pipelined systems. Internally, SLING uses an encoder-decoder architecture where each input word is encoded into a vector using simple lexical features like the raw word, its suffix(es), punctuation, etc. The decoder uses that representation, along with recurrent features from its own history, to compute a sequence of transitions that update the frame graph to obtain the intended frame semantic representation of the input sentence. SLING trains its model using TensorFlow and DRAGNN.
The animation below shows how frames and roles are incrementally added to the under-construction frame graph using individual transitions. As discussed earlier with our simple example sentence, SLING connects the VERB and EVENT frames using the role ARG1, signifying that the EVENT frame is the concept being predicted. The EVOKE transition evokes a frame of a specified type from the next few tokens in the text (e.g. EVENT from Black Monday). Similarly, the CONNECT transition links two existing frames with a specified role. When the input is exhausted and the last transition (denoted as STOP) is executed, the frame graph is deemed complete and returned to the user, who can inspect the graph to get the semantic meaning behind the sentence.
One key aspect of our transition system is the presence of a small, fixed-size attention buffer of frames that represents the most recent frames to be evoked or modified, shown with the orange boxes in the figure above. This buffer captures the intuition that we tend to remember knowledge that was recently evoked, referred to, or enhanced. If a frame is no longer in use, it eventually gets flushed out of this buffer as new frames come into the picture. We found this simple mechanism to be surprisingly effective at capturing a large fraction of inter-frame links.
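The following sketch shows one way the EVOKE/CONNECT/STOP transitions and the attention buffer could be executed to build the example graph. The class, method names, and buffer semantics are our own simplified approximation, not SLING's actual transition system.

```python
from collections import deque

class FrameGraphBuilder:
    """Toy executor for EVOKE / CONNECT / STOP transitions with an attention buffer."""
    def __init__(self, attention_size=5):
        self.frames = []
        self.attention = deque(maxlen=attention_size)  # most recently evoked/modified frames

    def evoke(self, frame_type, span):
        frame = {"type": frame_type, "span": span}
        self.frames.append(frame)
        self.attention.appendleft(frame)               # newly evoked frame becomes most recent
        return frame

    def connect(self, source_pos, role, target_pos):
        # Positions index into the attention buffer, not the full frame list.
        source, target = self.attention[source_pos], self.attention[target_pos]
        source[role] = target
        self.attention.remove(source)
        self.attention.appendleft(source)              # a modified frame moves to the front

    def stop(self):
        return self.frames                             # graph is complete; return it

builder = FrameGraphBuilder()
builder.evoke("PERSON", "people")
builder.evoke("PREDICT-01", "predicted")
builder.evoke("EVENT", "Black Monday")
builder.connect(1, "ARG1", 0)   # PREDICT-01 --ARG1--> EVENT (Black Monday)
builder.connect(0, "ARG0", 2)   # PREDICT-01 --ARG0--> PERSON (people)
graph = builder.stop()
```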
Next Steps
The illustrative experiment above is just a launchpad for research in semantic parsing for tasks such as knowledge extraction, resolving complex references, and dialog understanding. The SLING release on GitHub comes with a pre-trained model for the task we illustrated, as well as examples and recipes to train your own parser on either the supplied synthetic data or your own data. We hope the community finds SLING useful and we look forward to engaging conversations about applying and extending SLING to other semantic parsing tasks.
Acknowledgements
The research described in this post was done by Michael Ringgaard, Rahul Gupta, and Fernando Pereira. We thank the TensorFlow and DRAGNN teams for open-sourcing their packages, and various colleagues at DRAGNN who helped us with multiple aspects of SLING's training setup.
1. Charles J. Fillmore. 1982. Frame semantics. Linguistics in the Morning Calm, pages 111–138.
On-Device Conversational Modeling with TensorFlow Lite
Tuesday, November 14, 2017
Posted by Sujith Ravi, Research Scientist, Google Expander Team
Earlier this year, we launched Android Wear 2.0, which featured the first "on-device" machine learning technology for smart messaging. This enabled cloud-based technologies like Smart Reply, previously available in Gmail, Inbox and Allo, to be used directly within any application for the first time, including third-party messaging apps, without ever having to connect to the cloud. So you can respond to incoming chat messages on the go, directly from your smartwatch.
Today, we announce TensorFlow Lite, TensorFlow’s lightweight solution for mobile and embedded devices. This framework is optimized for low-latency inference of machine learning models, with a focus on small memory footprint and fast performance. As part of the library, we have also released an on-device conversational model and a demo app that provides an example of a natural language application powered by TensorFlow Lite, in order to make it easier for developers and researchers to build new machine intelligence features powered by on-device inference. This model generates reply suggestions to input conversational chat messages, with efficient inference that can be easily plugged in to your chat application to power on-device conversational intelligence.
The on-device conversational model we have released uses a new ML architecture for training compact neural networks (as well as other machine learning models) based on a joint optimization framework, originally presented in ProjectionNet: Learning Efficient On-Device Deep Networks Using Neural Projections. This architecture can run efficiently on mobile devices with limited computing power and memory, by using efficient “projection” operations that transform any input to a compact bit vector representation — similar inputs are projected to nearby vectors that are dense or sparse depending on the type of projection. For example, the messages “hey, how's it going?” and “How's it going buddy?” might be projected to the same vector representation.
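To give a flavor of what such a projection operation can look like, here is a toy locality-sensitive sketch that hashes a message's tokens and projects them onto fixed random hyperplanes to get a compact bit vector. The feature hashing, dimensions, and thresholding here are illustrative assumptions, not the projection functions actually used by ProjectionNet.

```python
import numpy as np

def project_text(text, num_bits=64, num_features=4096, seed=0):
    """Toy projection: similar token sets tend to agree on more output bits."""
    rng = np.random.RandomState(seed)
    planes = rng.randn(num_bits, num_features)       # fixed random hyperplanes (never trained)
    features = np.zeros(num_features)
    for token in text.lower().split():
        features[hash(token) % num_features] += 1.0  # hashed bag-of-words features
    return (planes @ features > 0).astype(np.uint8)  # compact bit-vector representation

a = project_text("hey, how's it going?")
b = project_text("How's it going buddy?")
print((a == b).mean())  # fraction of matching bits; higher for more similar messages
```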
Using this idea, the conversational model combines these efficient operations at a low computation and memory footprint. We trained this on-device model end-to-end using an ML framework that jointly trains two types of models — a compact projection model (as described above) combined with a trainer model. The two models are trained in a joint fashion, where the projection model learns from the trainer model — the trainer is characteristic of an expert and modeled using larger and more complex ML architectures, whereas the projection model resembles a student that learns from the expert. During training, we can also stack other techniques such as quantization or distillation to achieve further compression or selectively optimize certain portions of the objective function. Once trained, the smaller projection model can be used directly for inference on device.
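A minimal sketch of what such a joint objective can look like is below: the trainer network fits the labels, while the projection network fits both the labels and the trainer's soft predictions, distillation-style. The exact loss terms and weighting used in ProjectionNet may differ; this only illustrates the student/expert coupling.

```python
import numpy as np

def joint_loss(trainer_logits, projection_logits, labels, distill_weight=0.5):
    """Sketch of a joint trainer/projection objective over a batch of examples."""
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def cross_entropy(target_probs, logits):
        return -np.mean(np.sum(target_probs * np.log(softmax(logits) + 1e-9), axis=-1))

    one_hot = np.eye(trainer_logits.shape[-1])[labels]
    trainer_loss = cross_entropy(one_hot, trainer_logits)        # expert fits the labels
    projection_loss = cross_entropy(one_hot, projection_logits)  # student fits the labels
    distill_loss = cross_entropy(softmax(trainer_logits), projection_logits)  # student mimics expert
    return trainer_loss + projection_loss + distill_weight * distill_loss
```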
For inference, the trained projection model is compiled into a set of TensorFlow Lite operations that have been optimized for fast execution on mobile platforms and executed directly on device. The TensorFlow Lite inference graph for the on-device conversational model is shown here.
TensorFlow Lite execution for the On-Device Conversational Model.
The open-source conversational model released today (along with code) was trained end-to-end using the joint ML architecture described above. Today’s release also includes a demo app, so you can easily download and try out one-touch smart replies on your mobile device. The architecture enables easy configuration for model size and prediction quality based on application needs. You can find a list of sample messages where this model does well here. The system can also fall back to suggesting replies from a fixed set that was learned and compiled from popular response intents observed in chat conversations. The underlying model is different from the ones Google uses for Smart Reply responses in its apps.[1]
Beyond Conversational Models
Interestingly, the ML architecture described above permits flexible choices for the underlying model. We also designed the architecture to be compatible with different machine learning approaches — for example, when used with TensorFlow deep learning, we learn a lightweight neural network (ProjectionNet) for the underlying model, whereas a different architecture (ProjectionGraph) represents the model using a graph framework instead of a neural network.
The joint framework can also be used to train lightweight on-device models for other tasks using different ML modeling architectures. As an example, we derived a ProjectionNet architecture that uses a complex feed-forward or recurrent architecture (like an LSTM) for the trainer model coupled with a simple projection architecture comprised of dynamic projection operations and a few narrow fully-connected layers. The whole architecture is trained end-to-end using backpropagation in TensorFlow and, once trained, the compact ProjectionNet is used directly for inference. Using this method, we have successfully trained tiny ProjectionNet models that achieve significant reductions in model size (up to several orders of magnitude) together with high accuracy on multiple visual and language classification tasks (a few examples here). Similarly, we trained other lightweight models using our graph learning framework, even in semi-supervised settings.
ML architecture for training on-device models: ProjectionNet trained using deep learning (left), and ProjectionGraph trained using graph learning (right).
We will continue to improve and release updated TensorFlow Lite models in open-source. We think that the released model (as well as future models) learned using these ML architectures may be reused for many natural language and computer vision applications or plugged into existing apps for enabling machine intelligence. We hope that the machine learning and natural language processing communities will be able to build on these to address new problems and use-cases we have not yet conceived.
Acknowledgments
Yicheng Fan and Gaurav Nemade contributed immensely to this effort. Special thanks to Rajat Monga, Andre Hentz, Andrew Selle, Sarah Sirajuddin, and Anitha Vijayakumar from the TensorFlow team; Robin Dua, Patrick McGregor, Andrei Broder, Andrew Tomkins and the Google Expander team.
1. The released on-device model was trained to optimize for small size and low-latency applications on mobile phones and wearables. Smart Reply predictions in Google apps, however, are generated using larger, more complex models. In production systems, we also use multiple classifiers that are trained to detect inappropriate content and apply further filtering and tuning to optimize the user experience and quality levels. We recommend that developers using the open-source TensorFlow Lite version also follow such practices for their end applications.
Fused Video Stabilization on the Pixel 2 and Pixel 2 XL
Friday, November 10, 2017
Posted by Chia-Kai Liang, Senior Staff Software Engineer and Fuhao Shi, Android Camera Team
One of the most important aspects of current smartphones is easily capturing and sharing videos. With the Pixel 2 and Pixel 2 XL smartphones, the videos you capture are smoother and clearer than ever before, thanks to our Fused Video Stabilization technique based on both optical image stabilization (OIS) and electronic image stabilization (EIS). Fused Video Stabilization delivers highly stable footage with minimal artifacts, and the Pixel 2 is currently rated as the leader in DxO's video ranking (also earning the highest overall rating for a smartphone camera). But how does it work?
A key principle in videography is keeping the camera motion smooth and steady. A stable video is free of distraction, so the viewer can focus on the subject of interest. But videos taken with smartphones are subject to many conditions that make taking a high-quality video a significant challenge:
Camera Shake
Most people hold their mobile phones in their hands to record videos - you pull the phone from your pocket, record the video, and the video is ready to share right after recording. However, that means your videos shake as much as your hands do -- and they shake a lot! Moreover, if you are walking or running while recording, the camera motion can make videos almost unwatchable:
Motion Blur
If the camera or the subject moves during exposure, the resulting photo or video will appear blurry. Even if we stabilize the motion in between consecutive frames, the motion blur in each individual frame cannot be easily restored in practice, especially on a mobile device. One typical video artifact due to motion blur is sharpness inconsistency: the video may rapidly alternate between blurry and sharp, which is very distracting even after the video is stabilized:
Rolling Shutter
The CMOS image sensor collects one row of pixels, or “scanline”, at a time, and it takes tens of milliseconds to go from the top scanline to the bottom. Therefore, anything moving during this period can appear distorted. This is called rolling shutter distortion. Even if you have a steady hand, the rolling shutter distortion will appear when you move quickly:
A simulated rendering of a video with global (left) and rolling (right) shutter.
Focus Breathing
When there are objects at varying distances in a video, the angle of view can change significantly due to objects “jumping” in and out of the foreground. As a result, everything shrinks or expands like the video below, which professionals call “breathing”:
A good stabilization system should address all of these issues: the video should look sharp, the motion should be smooth, and the rolling shutter and focus breathing should be corrected.
Many professionals mount the camera on a mechanical stabilizer to entirely isolate hand motion. These devices actively sense and compensate for the camera’s movement to remove all unwanted motions. However, they are usually expensive and cumbersome; you wouldn’t want to carry one every day. There are also handheld gimbal mounts available for mobile phones. However, they are usually larger than the phone itself, and you have to put the phone on one before you start recording. You’d need to do it fast before the interesting moment vanishes.
Optical Image Stabilization (OIS) is the most well-known method for suppressing handshake artifacts. Typically, in mobile camera modules with OIS, the lens is suspended in the middle of the module by a number of springs, and electromagnets are used to move the lens within its enclosure. The lens module actively senses and compensates for handshake motion at very high speeds. Because OIS responds to motion rapidly, it can greatly suppress handshake blur. However, the range of correctable motion is fairly limited (usually around 1-2 degrees), which is not enough to correct the unwanted motion between consecutive video frames, or to correct excessive motion blur during walking. Moreover, OIS cannot correct some kinds of motion, such as in-plane rotation. Sometimes it can even introduce a “jello” artifact:
This video was taken by a Pixel 2 with only OIS enabled. You can see the frame center is stabilized, but the boundaries have some jello-like artifacts.
Electronic Image Stabilization (EIS) analyzes the camera motion, filters out the unwanted parts, and synthesizes a new video by transforming each frame. The final stabilization quality depends on the algorithm design and implementation optimization of these stages. In general, software-based EIS is more flexible than OIS, so it can correct larger motions and more kinds of motion. However, EIS has some common limitations. First, to prevent undefined regions in the synthesized frame, it needs to reduce the field of view or resolution. Second, compared to OIS or an external stabilizer, EIS requires more computation, which is a limited resource on mobile phones.
Making a Better Video: Fused Video Stabilization
With Fused Video Stabilization, both OIS and EIS are enabled simultaneously during video recording to address all the issues mentioned above. Our solution has three processing stages as shown in the system diagram below. The first processing stage, motion analysis, extracts the gyroscope signal, the OIS motion, and other properties to estimate the camera motion precisely. Then, the motion filtering stage combines machine learning and signal processing to predict a person’s intention in moving the camera. Finally, in the frame synthesis stage, we model and remove the rolling shutter and focus breathing distortion. With Fused Video Stabilization, the videos from Pixel 2 have less motion blur and look more natural. The solution is efficient enough to run in all video modes, such as 60fps or 4K recording.
Motion Analysis
In the motion analysis stage, we use the phone’s high-speed gyroscope to estimate the rotational component of the hand motion (roll, pitch, and yaw). By sensing the motion at 200 Hz, we have dense motion vectors for each scanline, enough to model the rolling shutter distortion. We also measure lens motions that are not sensed by the gyroscope, including both the focus adjustment (z) and the OIS movement (x and y) at high speed. Because we need high temporal precision to model the rolling shutter effect, we carefully optimize the system to ensure perfect timestamp alignment between the CMOS image sensor, the gyroscope, and the lens motion readouts. A misalignment of merely a few milliseconds can introduce noticeable jittering artifacts:
Left: The stabilized video of a “running” motion with a 3ms timing error. Note the occasional jittering. Right: The stabilized video with correct timestamps. The bottom right corner shows the original shaky video.
Motion Filtering
The motion filtering stage takes the real camera motion from motion analysis and creates the stabilized virtual camera motion. Note that we push the incoming frames into a queue to defer the processing. This enables us to look ahead at future camera motions, using machine learning to accurately predict the user’s intention. Lookahead filtering is not feasible for OIS or any mechanical stabilizer, which can only react to previous or present motions. We discuss this in more detail below.
Frame Synthesis
At the final stage, we derive how the frame is transformed based on the real and virtual camera motions. To handle the rolling shutter distortion, we use multiple transformations for each frame. We split the input frame into a mesh and warp each part separately:
Left: The input video with mesh overlay. Right: The warped frame; the red rectangle is the final stabilized output. Note how the non-rigid warping corrects the rolling shutter distortion.
Lookahead Motion Filtering
One key feature in Fused Video Stabilization is our new lookahead filtering algorithm. It analyzes future motions to recognize the user-intended motion patterns, and creates a smooth virtual camera motion. The lookahead filtering has multiple stages that incrementally improve the virtual camera motion for each frame. In the first step, a Gaussian filter is applied to the real camera motions of both past and future frames to obtain a smoothed camera motion:
Left: The input unstabilized video. Right: The smoothed result after Gaussian filtering.
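A minimal sketch of this first smoothing step is below, operating on per-frame rotation angles with a centered Gaussian window over both past and (buffered) future frames. The real pipeline works with full rotations and many additional signals, so treat the window size, sigma, and angle representation here as illustrative assumptions.

```python
import numpy as np

def gaussian_smooth_motion(angles, sigma=8.0, radius=30):
    """angles: (num_frames, 3) per-frame camera yaw/pitch/roll from motion analysis.
    Frames are queued so that `radius` future frames are available for each output."""
    offsets = np.arange(-radius, radius + 1)
    weights = np.exp(-0.5 * (offsets / sigma) ** 2)
    weights /= weights.sum()
    padded = np.pad(angles, ((radius, radius), (0, 0)), mode="edge")
    smoothed = np.stack([padded[i:i + 2 * radius + 1].T @ weights
                         for i in range(len(angles))])
    return smoothed  # virtual (smoothed) camera orientation per frame
```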
You’ll notice that it’s still not very stable. To further improve the quality, we trained a model to extract intentional motions from the noisy real camera motions. We then apply additional filters given the predicted motion. For example, if we predict the camera is panning horizontally, we would reject more vertical motions. The result is shown below.
Left: The Gaussian-filtered result. Right: Our lookahead result. We predict that the user is panning to the right, and suppress more vertical motions.
In practice, the process above does not guarantee that there are no undefined “bad” regions, which can appear when the virtual camera is over-stabilized and the warped frame falls outside the original field of view. We predict the likelihood of this issue in the next couple of frames and adjust the virtual camera motion to get the final result.
Left: Our lookahead result. The undefined area at the bottom-left is shown in cyan. Right: The final result with the bad region removed.
As we mentioned earlier, even with OIS enabled, sometimes the motions are too large and cause motion blur in a single frame. When EIS is then applied to further smooth the camera motion, the motion blur leads to distracting sharpness variations:
Left: Pixel 2 with OIS only. Right: Pixel 2 with the basic Fused Video Stabilization. Note the sharpness variation around the “Exit” label.
This is a very common problem in EIS solutions. To address this issue, we exploit the “masking” property in the human visual system. Motion blur usually blurs the frame along a specific direction, and if the overall frame motion follows that direction, the human eye will not notice it. Instead, our brain treats the blur as a natural part of the motion, and masks it away from our perception.
With the high-frequency gyroscope and OIS signals, we can accurately estimate the motion blur for each frame. We compute where the camera pointed to at both the beginning and end of exposure, and the movement in-between is the motion blur. After that, we apply a machine learning algorithm (trained on a set of videos with and without motion blur) to map the motion blurs in past and future frames to the amount of real camera motion we want to keep, and blend the weighted real camera motion with the virtual one. As you can see below, with the motion blur masking, the distracting sharpness variation is greatly reduced and the camera motion is still stabilized.
Left: Pixel 2 with the basic Fused Video Stabilization. Right: The full Fused Video Stabilization solution with motion blur masking.
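The sketch below illustrates the motion blur masking idea described above: estimate per-frame blur from the change in camera orientation between the start and end of the exposure, then keep more of the real camera motion when the blur is large. The actual system learns this mapping from data; the linear ramp and threshold here are stand-in assumptions.

```python
import numpy as np

def blend_for_blur_masking(real_angles, virtual_angles,
                           exposure_start_angles, exposure_end_angles,
                           blur_threshold_deg=0.3):
    """All inputs are (num_frames, 3) arrays of camera yaw/pitch/roll in degrees."""
    # Estimated motion blur: how far the camera rotated during each frame's exposure.
    blur = np.linalg.norm(exposure_end_angles - exposure_start_angles, axis=-1)
    keep = np.clip(blur / blur_threshold_deg, 0.0, 1.0)[:, None]  # 0 = fully virtual camera
    # Blend real and virtual motion so the blur stays aligned with the perceived motion.
    return keep * real_angles + (1.0 - keep) * virtual_angles
```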
Results
We have seen many amazing videos from Pixel 2 with Fused Video Stabilization. Here are some for you to check out:
Videos taken by two Pixel 2 phones mounted on a single hand grip. Fused Video Stabilization is disabled in the left one.
Videos taken by two Pixel 2 phones mounted on a single hand grip. Fused Video Stabilization is disabled in the left one. Note that the videographer jumped together with the subject.
Fused Video Stabilization combines the best of OIS and EIS, shows great results in camera motion smoothing and motion blur reduction, and corrects both rolling shutter and focus breathing. With Fused Video Stabilization on the Pixel 2 and Pixel 2 XL, you no longer have to carefully place the phone before recording, hold it firmly over the entire recording session, or carry a gimbal mount everywhere. The recorded video will always be stable, sharp, and ready to share.
Acknowledgements
Fused Video Stabilization is a large-scale effort across multiple teams in Google, including the camera algorithm team, sensor algorithm team, camera hardware team, and sensor hardware team.
Seamless Google Street View Panoramas
Thursday, November 09, 2017
Posted by Mike Krainin, Software Engineer and Ce Liu, Research Scientist, Machine Perception
In 2007, we introduced Google Street View, enabling you to explore the world through panoramas of neighborhoods, landmarks, museums and more, right from your browser or mobile device. The creation of these panoramas is a complicated process, involving capturing images from a multi-camera rig called a rosette, and then using image blending techniques to carefully stitch them all together. However, many things can thwart the creation of a "successful" panorama, such as mis-calibration of the rosette camera geometry, timing differences between adjacent cameras, and parallax. And while we attempt to address these issues by using approximate scene geometry to account for parallax and by frequent camera re-calibration, visible seams in image overlap regions can still occur.
Left: A Street View car carrying a multi-camera rosette. Center: A close-up of the rosette, which is made up of 15 cameras. Right: A visualization of the spatial coverage of each camera. Overlap between adjacent cameras is shown in darker gray.
Left: The Sydney Opera House with stitching seams along its iconic shells. Right: The same Street View panorama after optical flow seam repair.
In order to provide more seamless Street View images, we’ve developed a new algorithm based on optical flow to help solve these challenges. The idea is to subtly warp each input image such that the image content lines up within regions of overlap. This needs to be done carefully to avoid introducing new types of visual artifacts. The approach must also be robust to varying scene geometry, lighting conditions, calibration quality, and many other conditions. To simplify the task of aligning the images and to satisfy computational requirements, we’ve broken it into two steps.
Optical Flow
The first step is to find corresponding pixel locations for each pair of images that overlap. Using techniques described in our PhotoScan blog post, we compute optical flow from one image to the other. This provides a smooth and dense correspondence field. We then downsample the correspondences for computational efficiency. We also discard correspondences where there isn’t enough visual structure to be confident in the results of optical flow.
The boundaries of a pair of constituent images from the rosette camera rig that need to be stitched together.
An illustration of optical flow within the pair’s overlap region.
Extracted correspondences in the pair of images. For each colored dot in the overlap region of the left image, there is an equivalently-colored dot in the overlap region of the right image, indicating how the optical flow algorithm has aligned the point. These pairs of corresponding points are used as input to the global optimization stage. Notice that the overlap covers only a small portion of each image.
Global Optimization
The second step is to warp the rosette’s images to simultaneously align all of the corresponding points from the overlap regions (as seen in the figure above). When stitched into a panorama, the set of warped images will then properly align. This is challenging because the overlap regions cover only a small fraction of each image, resulting in an under-constrained problem. To generate visually pleasing results across the whole image, we formulate the warping as a spline-based flow field with spatial regularization. The spline parameters are solved for in a non-linear optimization using Google’s open source Ceres Solver.
A visualization of the final warping process. Left: A section of the panorama covering 180 degrees horizontally. Notice that the overall effect of warping is intentionally quite subtle. Right: A close-up, highlighting how warping repairs the seams.
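To illustrate the shape of this optimization, here is a much-simplified sketch: a coarse grid of flow control points is fit so that corresponding points align (data term) while neighboring control points stay similar (regularization term). It uses a bilinear control grid and SciPy's least_squares in place of the spline parameterization and Ceres Solver used in production, so all parameters here are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_warp_field(points_a, points_b, grid_shape=(8, 8), smooth_weight=10.0):
    """points_a/points_b: (N, 2) corresponding locations in overlap regions, in [0, 1].
    Solves for a per-control-point flow that moves image A's points onto image B's."""
    gh, gw = grid_shape

    def bilinear_flow(flow, pts):
        flow = flow.reshape(gh, gw, 2)
        x = np.clip(pts[:, 0] * (gw - 1), 0, gw - 1 - 1e-6)
        y = np.clip(pts[:, 1] * (gh - 1), 0, gh - 1 - 1e-6)
        x0, y0 = x.astype(int), y.astype(int)
        fx, fy = (x - x0)[:, None], (y - y0)[:, None]
        return ((1 - fx) * (1 - fy) * flow[y0, x0] + fx * (1 - fy) * flow[y0, x0 + 1]
                + (1 - fx) * fy * flow[y0 + 1, x0] + fx * fy * flow[y0 + 1, x0 + 1])

    def residuals(flow):
        # Data term: warped points from A should land on their matches in B.
        data = (points_a + bilinear_flow(flow, points_a) - points_b).ravel()
        f = flow.reshape(gh, gw, 2)
        # Regularization: neighboring control points should move similarly.
        smooth = np.concatenate([(f[1:] - f[:-1]).ravel(),
                                 (f[:, 1:] - f[:, :-1]).ravel()]) * smooth_weight
        return np.concatenate([data, smooth])

    result = least_squares(residuals, np.zeros(gh * gw * 2))
    return result.x.reshape(gh, gw, 2)
```

The regularization term is what propagates the correction from the narrow overlap regions into the rest of the image, which is why the problem stays well-behaved despite being under-constrained.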
Our approach has many similarities to previously published work by Shum & Szeliski on “deghosting” panoramas. Key differences include that our approach estimates dense, smooth correspondences (rather than patch-wise, independent correspondences), and that we solve a nonlinear optimization for the final warping. The result is a better-behaved warping that is less likely to introduce new visual artifacts than the kernel-based approach.
Left: A close-up of the un-repaired panorama. Middle: Result of kernel-based interpolation. This fixes discontinuities but at the expense of strong wobbling artifacts due to the small image overlap and limited footprint of kernels. Right: Result of our global optimization.
This is important because our algorithm needs to be robust to the enormous diversity in content in Street View’s billions of panoramas. You can see how effective the algorithm is in the following examples:
Tower Bridge, London
Christ the Redeemer, Rio de Janeiro
An SUV on the streets of Seattle
This new algorithm was recently added to the Street View stitching pipeline. It is now being used to restitch existing panoramas on an ongoing basis. Keep an eye out for improved Street View near you!
Acknowledgements
Special thanks to Bryan Klingner for helping to integrate this feature with the Street View infrastructure.
Feature Visualization
Tuesday, November 07, 2017
Posted by Christopher Olah, Research Scientist, Google Brain Team and Alex Mordvintsev, Research Scientist, Google Research
Have you ever wondered what goes on inside neural networks? Feature visualization is a powerful tool for digging into neural networks and seeing how they work.
Our new article, published in Distill, does a deep exploration of feature visualization, introducing a few new tricks along the way!
Building on our work in DeepDream, and lots of work by others since, we are able to visualize what every neuron of a strong vision model (GoogLeNet [1]) detects. Over the course of multiple layers, the network gradually builds up abstractions: first it detects edges, then it uses those edges to detect textures, the textures to detect patterns, and the patterns to detect parts of objects….
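At its core, this kind of feature visualization optimizes an input image to strongly activate a chosen neuron or channel. Below is a bare-bones sketch of that gradient-ascent loop for a Keras-style model; the results in the Distill article depend heavily on additional regularization and transformation tricks that are omitted here, and the layer and channel names are placeholders.

```python
import tensorflow as tf

def visualize_channel(model, layer_name, channel, steps=256, lr=0.05):
    """Gradient-ascent feature visualization for one channel of an intermediate layer."""
    feature_extractor = tf.keras.Model(model.input, model.get_layer(layer_name).output)
    image = tf.Variable(tf.random.uniform((1, 224, 224, 3), 0.4, 0.6))  # start near gray
    for _ in range(steps):
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(feature_extractor(image)[..., channel])
        grads = tape.gradient(loss, image)
        image.assign_add(lr * grads / (tf.norm(grads) + 1e-8))  # normalized ascent step
    return image.numpy()[0]
```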
But neurons don’t understand the world by themselves — they work together. So we also need to understand how they interact with each other. One approach is to explore interpolations between them. What images can make them both fire, to different extents?
Here we interpolate from a neuron that seems to detect artistic patterns to a neuron that seems to detect lizard eyes:
We can also let you try adding different pairs of neurons together, to explore the possibilities for yourself:
In addition to allowing you to play around with visualizations, we explore a variety of techniques for getting feature visualization to work, and let you experiment with using them.
Techniques for visualizing and understanding neural networks are becoming more powerful. We hope our article will help other researchers apply these techniques, and give people a sense of their potential. Check it out on Distill.
Acknowledgement
We're extremely grateful to our co-author, Ludwig Schubert, who made incredible contributions to our paper and especially to the interactive visualizations.
Tangent: Source-to-Source Debuggable Derivatives
Monday, November 06, 2017
Posted by Alex Wiltschko, Research Scientist, Google Brain Team
(Crossposted on the Google Open Source Blog)
Tangent is a new, free, and open-source Python library for automatic differentiation. In contrast to existing machine learning libraries, Tangent is a source-to-source system, consuming a Python function f and emitting a new Python function that computes the gradient of f. This allows much better user visibility into gradient computations, as well as easy user-level editing and debugging of gradients. Tangent comes with many more features for debugging and designing machine learning models:
Easily debug your backward pass
Fast gradient surgery
Forward mode automatic differentiation
Efficient Hessian-vector products
Code optimizations
This post gives an overview of the Tangent API. It covers how to use Tangent to generate gradient code in Python that is easy to interpret, debug and modify.
Neural networks (NNs) have led to great advances in machine learning models for images, video, audio, and text. The fundamental abstraction that lets us train NNs to perform well at these tasks is a 30-year-old idea called reverse-mode automatic differentiation (also known as backpropagation), which comprises two passes through the NN. First, we run a “forward pass” to calculate the output value of each node. Then we run a “backward pass” to calculate a series of derivatives to determine how to update the weights to increase the model’s accuracy.
Training NNs, and doing research on novel architectures, requires us to compute these derivatives correctly, efficiently, and easily. We also need to be able to debug these derivatives when our model isn’t training well, or when we’re trying to build something new that we do not yet understand. Automatic differentiation, or just “autodiff,” is a technique to calculate the derivatives of computer programs that denote some mathematical function, and nearly every machine learning library implements it.
Existing libraries implement automatic differentiation by tracing a program’s execution (at runtime, like TF Eager, PyTorch and Autograd) or by building a dynamic data-flow graph and then differentiating the graph (ahead-of-time, like TensorFlow). In contrast, Tangent performs ahead-of-time autodiff on the Python source code itself, and produces Python source code as its output.
As a result, you can finally read your automatic derivative code just like the rest of your program. Tangent is useful to researchers and students who not only want to write their models in Python, but also read and debug automatically-generated derivative code without sacrificing speed and flexibility.
You can easily inspect and debug your models written in Tangent, without special tools or indirection. Tangent works on a large and growing subset of Python, provides extra autodiff features other Python ML libraries don’t have, is high-performance, and is compatible with TensorFlow and NumPy.
Automatic differentiation of Python code
How do we automatically generate derivatives of plain Python code? Math functions like tf.exp or tf.log have derivatives, which we can compose to build the backward pass. Similarly, pieces of syntax, such as subroutines, conditionals, and loops, also have backward-pass versions. Tangent contains recipes for generating derivative code for each piece of Python syntax, along with many NumPy and TensorFlow function calls.
Tangent has a one-function API:
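The original post illustrates this API with a short snippet that is not reproduced above; a minimal sketch of what a call looks like, assuming the tangent.grad entry point described here, is:

```python
import tangent

def f(x):
    return x * x + 3.0 * x

df = tangent.grad(f)  # df is ordinary generated Python code that computes df/dx
print(df(2.0))        # derivative 2*x + 3 evaluated at x = 2.0, i.e. 7.0
```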
Here’s an animated graphic of what happens when we call tangent.grad on a Python function:
If you want to print out your derivatives, you can run:
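The snippet that accompanied this step is likewise not reproduced above; based on the library's documented options it presumably looks something like the following (the verbose flag is our assumption here):

```python
# Ask Tangent to print the generated derivative source code as it is produced.
df = tangent.grad(f, verbose=1)
```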
Under the hood, tangent.grad first grabs the source code of the Python function you pass it. Tangent has a large library of recipes for the derivatives of Python syntax, as well as TensorFlow Eager functions. The function tangent.grad then walks your code in reverse order, looks up the matching backward-pass recipe, and adds it to the end of the derivative function. This reverse-order processing gives the technique its name: reverse-mode automatic differentiation.
The function df above only works for scalar (non-array) inputs. Tangent also supports:
Using TensorFlow Eager functions, for processing arrays of numbers
Subroutines
Control flow
Although we started with TensorFlow Eager support, Tangent isn’t tied to one numeric library or another—we would gladly welcome pull requests adding PyTorch or MXNet derivative recipes.
Next Steps
Tangent is open source now at github.com/google/tangent. Go check it out for download and installation instructions. Tangent is still an experiment, so expect some bugs. If you report them to us on GitHub, we will do our best to fix them quickly.
We are working to add support in Tangent for more aspects of the Python language (e.g., closures, inline function definitions, classes, more NumPy and TensorFlow functions). We also hope to add more advanced automatic differentiation and compiler functionality in the future, such as automatic trade-offs between memory and compute (Griewank and Walther 2000; Gruslys et al., 2016), more aggressive optimizations, and lambda lifting.
We intend to develop Tangent together as a community. We welcome pull requests with fixes and features. Happy differentiating!
Acknowledgments
Bart van Merriënboer contributed immensely to all aspects of Tangent during his internship, and Dan Moldovan led TF Eager integration, infrastructure and benchmarking. Also, thanks to the Google Brain team for their support of this post and special thanks to Sanders Kleinfeld, Matt Johnson and Aleks Haecky for their valuable contribution for the technical aspects of the post.
AutoML for large scale image classification and object detection
Thursday, November 02, 2017
Posted by Barret Zoph, Vijay Vasudevan, Jonathon Shlens and Quoc Le, Research Scientists, Google Brain Team
A few months ago, we introduced our AutoML project, an approach that automates the design of machine learning models. While we found that AutoML can design small neural networks that perform on par with neural networks designed by human experts, these results were constrained to small academic datasets like CIFAR-10 and Penn Treebank. We became curious how this method would perform on larger, more challenging datasets, such as ImageNet image classification and COCO object detection. Many state-of-the-art machine learning architectures have been invented by humans to tackle these datasets in academic competitions.
In Learning Transferable Architectures for Scalable Image Recognition, we apply AutoML to the ImageNet image classification and COCO object detection datasets -- two of the most respected large-scale academic datasets in computer vision. These two datasets pose a great challenge for us because they are orders of magnitude larger than the CIFAR-10 and Penn Treebank datasets. For instance, naively applying AutoML directly to ImageNet would require many months of training our method.
To apply our method to ImageNet, we altered the AutoML approach to make it more tractable for large-scale datasets:
We redesigned the search space so that AutoML could find the best layer, which can then be stacked many times in a flexible manner to create a final network.
We performed architecture search on CIFAR-10 and transferred the best learned architecture to ImageNet image classification and COCO object detection.
With this method, AutoML was able to find layers that work well on CIFAR-10 but also work well on ImageNet classification and COCO object detection. These two layers are combined to form a novel architecture, which we called “NASNet”.
Our NASNet architecture is composed of two types of layers: Normal Layer (left), and Reduction Layer (right). These two layers are designed by AutoML.
On ImageNet image classification, NASNet achieves a prediction accuracy of 82.7% on the validation set, surpassing all previous Inception models that we built [2, 3, 4]. Additionally, NASNet performs 1.2% better than all previously published results and is on par with the best unpublished result reported on arxiv.org [5]. Furthermore, NASNet may be resized to produce a family of models that achieve good accuracies while having very low computational costs. For example, a small version of NASNet achieves 74% accuracy, which is 3.1% better than equivalently-sized, state-of-the-art models for mobile platforms. The large NASNet achieves state-of-the-art accuracy while halving the computational cost of the best reported result on arxiv.org (i.e., SENet) [5].
Accuracies of NASNet and state-of-the-art, human-invented models at various model sizes on ImageNet image classification.
We also transferred the learned features from ImageNet to object detection. In our experiments, combining the features learned from ImageNet classification with the Faster-RCNN framework [6] surpassed previously published, state-of-the-art predictive performance on the COCO object detection task in both the largest and the mobile-optimized models. Our largest model achieves 43.1% mAP, which is 4% better than the previously published state-of-the-art.
Example object detection using Faster-RCNN with NASNet.
We suspect that the image features learned by NASNet on ImageNet and COCO may be reused for many computer vision applications. Thus, we have open-sourced NASNet for inference on image classification and for object detection in the Slim and Object Detection TensorFlow repositories. We hope that the larger machine learning community will be able to build on these models to address multitudes of computer vision problems we have not yet imagined.
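For readers who want to try the released checkpoints, here is a hedged sketch of building the mobile-sized NASNet classifier with the TF-Slim code in the tensorflow/models repository; the module path, function names, and 1001-class output are assumptions based on that repository rather than guarantees:

```python
import tensorflow as tf
from nets.nasnet import nasnet  # assumes research/slim from tensorflow/models is on PYTHONPATH

slim = tf.contrib.slim

# Placeholder batch of 224x224 RGB images.
images = tf.placeholder(tf.float32, [None, 224, 224, 3])

# Build the mobile-sized NASNet image classifier (checkpoint restoring omitted).
with slim.arg_scope(nasnet.nasnet_mobile_arg_scope()):
    logits, end_points = nasnet.build_nasnet_mobile(
        images, num_classes=1001, is_training=False)

predictions = end_points['Predictions']
```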
Special thanks to Jeff Dean, Yifeng Lu, Jonathan Huang, Vivek Rathod, Sergio Guadarrama, Chen Sun, Jonathan Shen, Vishy Tirumalashetty, Xiaoqiang Zheng, Christian Sigg and the Google Brain team for their help with the project.
References
[1] Learning Transferable Architectures for Scalable Image Recognition, Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. arXiv, 2017.
[2] Going Deeper with Convolutions, Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. CVPR, 2015.
[3] Rethinking the Inception Architecture for Computer Vision, Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. CVPR, 2016.
[4] Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. AAAI, 2017.
[5] Squeeze-and-Excitation Networks, Jie Hu, Li Shen, and Gang Sun. arXiv, 2017.
[6] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. NIPS, 2015.
Latest Innovations in TensorFlow Serving
Thursday, November 02, 2017
Posted by Chris Olston, Research Scientist, and Noah Fiedel, Software Engineer, TensorFlow Serving
Since initially open-sourcing TensorFlow Serving in February 2016, we’ve made some major enhancements. Let’s take a look back at where we started, review our progress, and share where we are headed next.
Before TensorFlow Serving, users of TensorFlow inside Google had to create their own serving system from scratch. Although serving might appear easy at first, one-off serving solutions quickly grow in complexity. Machine Learning (ML) serving systems need to support model versioning (for model updates with a rollback option) and multiple models (for experimentation via A/B testing), while ensuring that concurrent models achieve high throughput on hardware accelerators (GPUs and TPUs) with low latency. So we set out to create a single, general TensorFlow Serving software stack.
We decided to make it open-sourceable from the get-go, and development started in September 2015. Within a few months, we had created the initial end-to-end working system, leading to our open-source release in February 2016.
Over the past year and a half, with the help of our users and partners inside and outside our company, TensorFlow Serving has advanced performance, best practices, and standards:
Out-of-the-box optimized serving and customizability: We now offer a pre-built canonical serving binary, optimized for modern CPUs with AVX, so developers don't need to assemble their own binary from our libraries unless they have exotic needs. At the same time, we added a registry-based framework, allowing our libraries to be used for custom (or even non-TensorFlow) serving scenarios.
Multi-model serving: Going from one model to multiple concurrently-served models presents several performance obstacles. We serve multiple models smoothly by (1) loading in isolated thread pools to avoid incurring latency spikes on other models taking traffic; (2) accelerating initial loading of all models in parallel upon server start-up; (3) multi-model batch interleaving to multiplex hardware accelerators (GPUs/TPUs).
Standardized model format: We added SavedModel to TensorFlow 1.0, giving the community a single standard model format that works across training and serving (a short export sketch follows this list).
Easy-to-use inference APIs: We released easy-to-use APIs for common inference tasks (classification, regression) that we know work for a wide swathe of our applications. To support more advanced use-cases, we support a lower-level tensor-based API (predict) and a new multi-inference API that enables multi-task modeling.
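As a rough illustration of the export side (not the only way to do it), here is a minimal TensorFlow 1.x sketch using SavedModelBuilder; the toy graph, tensor names, and export path are placeholders:

```python
import tensorflow as tf

# Toy graph: y = Wx + b
x = tf.placeholder(tf.float32, shape=[None, 3], name='x')
w = tf.Variable(tf.ones([3, 1]))
b = tf.Variable(tf.zeros([1]))
y = tf.matmul(x, w) + b

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Write a versioned SavedModel directory that TensorFlow Serving can load.
    builder = tf.saved_model.builder.SavedModelBuilder('/tmp/my_model/1')
    signature = tf.saved_model.signature_def_utils.predict_signature_def(
        inputs={'x': x}, outputs={'y': y})
    builder.add_meta_graph_and_variables(
        sess,
        tags=[tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature,
        })
    builder.save()
```

To make the predict API concrete, here is a hedged sketch of a gRPC client in the style current at the time of writing; the server address, model name, signature name, and tensor keys are illustrative placeholders matching the export sketch above:

```python
import tensorflow as tf
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2

# Connect to a locally running tensorflow_model_server (gRPC on port 8500).
channel = implementations.insecure_channel('localhost', 8500)
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

# Build a PredictRequest against the model exported above.
request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'
request.model_spec.signature_name = 'serving_default'
request.inputs['x'].CopyFrom(
    tf.contrib.util.make_tensor_proto([[1.0, 2.0, 3.0]], dtype=tf.float32))

# Issue the RPC with a 10-second deadline and read back the output tensor.
response = stub.Predict(request, 10.0)
print(response.outputs['y'])
```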
All of our work has been informed by close collaborations with: (a) Google’s ML SRE team, which helps ensure we are robust and meet internal SLAs; (b) other Google machine learning infrastructure teams including ads serving and TFX; (c) application teams such as Google Play; (d) our partners at the UC Berkeley RISE Lab, who explore complementary research problems with the Clipper serving system; (e) our open-source user base and contributors.
TensorFlow Serving is currently handling tens of millions of inferences per second for 1100+ of our own projects including Google’s Cloud ML Prediction. Our core serving code is available to all via our open-source releases.
Looking forward, our work is far from done and we are exploring several avenues of innovation. Today we are excited to share early progress in two experimental areas:
Granular batching: A key technique we employ to achieve high throughput on specialized hardware (GPUs and TPUs) is "batching": processing multiple examples jointly for efficiency. We are developing technology and best practices to improve batching to: (a) enable batching to target just the GPU/TPU portion of the computation, for maximum efficiency; (b) enable batching within recursive neural networks, used to process sequence data, e.g., text and event sequences. We are experimenting with batching arbitrary sub-graphs using the Batch/Unbatch op pair.
Distributed model serving: We are looking at model sharding techniques as a means of handling models that are too large to fit on one server node or sharing sub-models in a memory-efficient way. We recently launched a 1TB+ model in production with good results, and hope to open-source this capability soon.
Thanks again to all of our users and partners who have contributed feedback, code and ideas. Join the project at: github.com/tensorflow/serving.