Scalable Deep Reinforcement Learning for Robotic Manipulation
Thursday, June 28, 2018
Posted by Alex Irpan, Software Engineer, Google Brain Team, and Peter Pastor, Senior Roboticist, X
How can robots acquire skills that generalize effectively to diverse, real-world objects and situations? While designing robotic systems that effectively perform repetitive tasks in controlled environments, like building products on an assembly line, is fairly routine, designing robots that can observe their surroundings and decide the best course of action while reacting to unexpected outcomes is exceptionally difficult. However, there are two tools that can help robots acquire such skills from experience: deep learning, which is excellent at handling unstructured real-world scenarios, and reinforcement learning, which enables longer-term reasoning while exhibiting more complex and robust sequential decision making. Combining these two techniques has the potential to enable robots to learn continuously from their experience, allowing them to master basic sensorimotor skills using data rather than manual engineering.
Designing reinforcement learning algorithms for robot learning introduces its own set of challenges: real-world objects span a wide variety of visual and physical properties, subtle differences in contact forces can make predicting object motion difficult, and objects of interest can be obstructed from view. Furthermore, robotic sensors are inherently noisy, adding to the complexity. All of these factors make it incredibly difficult to learn a general solution unless there is enough variety in the training data, which takes time to collect. This motivates exploring learning algorithms that can effectively reuse past experience, similar to our previous work on grasping, which benefited from large datasets. However, this previous work could not reason about the long-term consequences of its actions, which is important for learning how to grasp. For example, if multiple objects are clumped together, pushing one of them apart (called “singulation”) will make the grasp easier, even if doing so does not directly result in a successful grasp.
Examples of singulation.
To be more efficient, we need to use off-policy reinforcement learning, which can learn from data that was collected hours, days, or weeks ago. To design such an off-policy reinforcement learning algorithm that can benefit from large amounts of diverse experience from past interactions, we combined large-scale distributed optimization with a new fitted deep Q-learning algorithm that we call QT-Opt. A preprint is available on arXiv.
QT-Opt is a distributed Q-learning algorithm that supports continuous action spaces, making it well-suited to robotics problems. To use QT-Opt, we first train a model entirely offline, using whatever data we’ve already collected. This doesn’t require running the real robot, making it easier to scale. We then deploy and finetune that model on the real robot, further training it on newly collected data. As we run QT-Opt, we accumulate more offline data, letting us train better models, which lets us collect better data, and so on.
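To make the offline phase of this loop concrete, here is a minimal, illustrative sketch of fitted Q-learning with a continuous action space, in the spirit of QT-Opt. The tiny linear "network", the replay format, and the random-shooting action maximizer are simplifying assumptions for illustration (the QT-Opt paper optimizes actions with a sampling-based optimizer, the cross-entropy method); this is not the actual implementation.

```python
# Illustrative sketch only: offline fitted Q-learning over logged transitions
# with a continuous action space. Everything here is a stand-in for the real
# system (which uses a deep CNN over camera images and distributed training).
import numpy as np

def q_value(params, state, action):
    # Hypothetical Q-function: fixed random features plus a learned linear readout.
    features = np.tanh(params["proj"] @ np.concatenate([state, action]))
    return float(params["w"] @ features)

def best_action(params, state, action_dim, n_samples=64):
    # Approximate max_a Q(s, a) by scoring random candidate actions
    # (random shooting; the paper uses the cross-entropy method instead).
    candidates = np.random.uniform(-1.0, 1.0, size=(n_samples, action_dim))
    scores = [q_value(params, state, a) for a in candidates]
    return candidates[int(np.argmax(scores))]

def fitted_q_iteration(params, replay, gamma=0.9, lr=1e-3, steps=1000):
    # Train entirely from logged (state, action, reward, next_state, done)
    # tuples -- no real robot is needed for this offline phase.
    for _ in range(steps):
        s, a, r, s2, done = replay[np.random.randint(len(replay))]
        target = r
        if not done:
            a2 = best_action(params, s2, action_dim=len(a))
            target += gamma * q_value(params, s2, a2)
        # One SGD step on the squared Bellman error (readout weights only).
        features = np.tanh(params["proj"] @ np.concatenate([s, a]))
        td_error = q_value(params, s, a) - target
        params["w"] -= lr * td_error * features
    return params
```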
To apply this approach to robotic grasping, we used 7 real-world robots, which ran for 800 total robot hours over the course of 4 months. To bootstrap collection, we started with a hand-designed policy that succeeded 15-30% of the time. Data collection switched to the learned model when it started performing better. The policy takes a camera image and returns how the arm and gripper should move. The offline data contained grasps on over 1000 different objects.
Some of the training objects used.
In the past, we’ve seen that sharing experience across robots can accelerate learning. We scaled this training and data gathering process to ten GPUs, seven robots, and many CPUs, allowing us to collect and process a large dataset of over 580,000 grasp attempts. At the end of this process, we successfully trained a grasping policy that runs on a real world robot and generalizes to a diverse set of challenging objects that were not seen at training time.
Seven robots collecting grasp data.
Quantitatively, the QT-Opt approach succeeded in 96% of the grasp attempts across 700 trial grasps on previously unseen objects. Compared to our previous supervised-learning-based grasping approach, which had a 78% success rate, our method reduced the error rate by more than a factor of five (from a 22% failure rate to 4%).
The objects used at evaluation time. To make the task challenging, we aimed for a large variety of object sizes, textures, and shapes.
Notably, the policy exhibits a variety of closed-loop, reactive behaviors that are often not found in standard robotic grasping systems:
When presented with a set of interlocking blocks that cannot be picked up together, the policy separates one of the blocks from the rest before picking it up.
When presented with a difficult-to-grasp object, the policy figures out it should reposition the gripper and regrasp it until it has a firm hold.
When grasping in clutter, the policy probes different objects until the fingers hold one of them firmly, before lifting.
When we perturbed the robot by intentionally swatting the object out of the gripper -- something it had not seen during training -- it automatically repositioned the gripper for another attempt.
Crucially, none of these behaviors were engineered manually. They emerged automatically from self-supervised training with QT-Opt, because they improve the model’s long-term grasp success.
Examples of the learned behaviors. In the left GIF, the policy corrects for the moved ball. In the right GIF, the policy tries several grasps until it succeeds at picking up the tricky object.
Additionally, we’ve found that QT-Opt reaches this higher success rate using less training data, albeit taking longer to converge. This is especially exciting for robotics, where the bottleneck is usually collecting real robot data rather than training time. Combining this with other data efficiency techniques (such as our prior work on domain adaptation for grasping) could open several interesting avenues in robotics. We’re also interested in combining QT-Opt with recent work on learning how to self-calibrate, which could further improve generality.
Overall, the QT-Opt algorithm is a general reinforcement learning approach that’s giving us good results on real world robots. Besides the reward definition, nothing about QT-Opt is specific to robot grasping. We see this as a strong step towards more general robot learning algorithms, and are excited to see what other robotics tasks we can apply it to. You can learn more about this work in the short video below.
Acknowledgements
This research was conducted by Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. We’d also like to give special thanks to Iñaki Gonzalo and John-Michael Burke for overseeing the robot operations, Chelsea Finn, Timothy Lillicrap, and Arun Nair for valuable discussions, and other people at Google and X who’ve contributed their expertise and time towards this research. A preprint is available on arXiv.
Self-Supervised Tracking via Video Colorization
Wednesday, June 27, 2018
Posted by Carl Vondrick, Research Scientist, Machine Perception
Tracking objects in video is a fundamental problem in computer vision, essential to applications such as activity recognition, object interaction, or video stylization. However, teaching a machine to visually track objects is challenging partly because it requires large, labeled tracking datasets for training, which are impractical to annotate at scale.
In “Tracking Emerges by Colorizing Videos”, we introduce a convolutional network that colorizes grayscale videos, but is constrained to copy colors from a single reference frame. In doing so, the network learns to visually track objects automatically without supervision. Importantly, although the model was never trained explicitly for tracking, it can follow multiple objects, track through occlusions, and remain robust over deformations without requiring any labeled training data.
Example tracking predictions on the publicly-available, academic dataset DAVIS 2017. After learning to colorize videos, a mechanism for tracking automatically emerges without supervision. We specify regions of interest (indicated by different colors) in the first frame, and our model propagates them forward without any additional learning or supervision.
Learning to Recolorize Video
Our hypothesis is that the temporal coherency of color provides excellent large-scale training data for teaching machines to track regions in video. Clearly, there are exceptions when color is not temporally coherent (such as lights turning on suddenly), but in general color is stable over time. Furthermore, most videos contain color, providing a scalable self-supervised learning signal. We decolor videos, and then add the colorization step because there may be multiple objects with the same color, but by colorizing we can teach machines to track specific objects or regions.
In order to train our system, we use videos from the Kinetics dataset, which is a large public collection of videos depicting everyday activities. We convert all video frames except the first frame into grayscale, and train a convolutional network to predict the original colors in the subsequent frames. We expect the model to learn to follow regions in order to accurately recover the original colors. Our main observation is that the need to follow objects in order to colorize them causes a model for object tracking to be learned automatically.
We illustrate the video recolorization task using video from the DAVIS 2017 dataset. The model receives as input one color frame and a grayscale video, and predicts the colors for the rest of the video. The model learns to copy colors from the reference frame, which enables a mechanism for tracking to be learned without human supervision.
Learning to copy colors from the single reference frame requires the model to learn to internally point to the right region in order to copy the right colors. This forces the model to learn an explicit mechanism that we can use for tracking. To see how the video colorization model works, we show some predicted colorizations from videos in the Kinetics dataset below.
Examples of predicted colors from a colorized reference frame applied to input video, using the publicly-available Kinetics dataset.
Although the network is trained without ground-truth identities, our model learns to track any visual region specified in the first frame of a video. We can track outlined objects or a single point in the video. The only change we make is that, instead of propagating colors throughout the video, we now propagate labels representing the regions of interest.
Analyzing the Tracker
Since the model is trained on large amounts of unlabeled video, we want to gain insight into what the model learns. The videos below show a standard trick to visualize the embeddings learned by our model by projecting them down to three dimensions using Principal Component Analysis (PCA) and plotting the result as an RGB movie. The results show that nearest neighbors in the learned embedding space tend to correspond to object identity, even over deformations and viewpoint changes.
Top Row: We show videos from the DAVIS 2017 dataset. Bottom Row: We visualize the internal embeddings from the colorization model. Similar embeddings will have a similar color in this visualization. This suggests the learned embedding is grouping pixels by object identity.
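For reference, the visualization trick described above amounts to the following sketch, where per-pixel embeddings are reduced to their top three principal components and rescaled for display as RGB. The array shapes and function name are assumptions for illustration.

```python
# Project embeddings onto 3 principal components and scale them to [0, 1] as RGB.
import numpy as np

def embeddings_to_rgb(embeddings):
    """embeddings: (num_pixels, dim) array from the colorization model."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # PCA via SVD: rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:3].T                     # top three components
    coords -= coords.min(axis=0, keepdims=True)      # rescale each channel
    coords /= coords.max(axis=0, keepdims=True) + 1e-8
    return coords
```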
Tracking Pose
We found the model can also track human poses given key-points in an initial frame. We show results on the publicly-available, academic dataset JHMDB, where we track a human joint skeleton.
Examples of using the model to track movements of the human skeleton. In this case the input was a human pose for the first frame and subsequent movement is automatically tracked. The model can track human poses even though it was never explicitly trained for this task.
While we do not yet outperform heavily supervised models, the colorization model learns to track video segments and human pose well enough to outperform the latest methods based on optical flow. Breaking down performance by motion type suggests that our model is a more robust tracker than optical flow for many natural complexities, such as dynamic backgrounds, fast motion, and occlusions. Please see the paper for details.
Future Work
Our results show that video colorization provides a signal that can be used for learning to track objects in videos without supervision. Moreover, we found that the failures from our system are correlated with failures to colorize the video, which suggests that further improving the video colorization model can advance progress in self-supervised tracking.
Acknowledgements
This project was only possible thanks to several collaborations at Google. The core team includes Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama and Kevin Murphy. We also thank David Ross, Bryan Seybold, Chen Sun and Rahul Sukthankar.
Teaching Uncalibrated Robots to Visually Self-Adapt
Friday, June 22, 2018
Posted by Fereshteh Sadeghi, Student Researcher, Google Brain Team
People are remarkably proficient at manipulating objects without needing to adjust their viewpoint to a fixed or specific pose. This capability (referred to as visual motor integration) is learned during childhood from manipulating objects in various situations, and is governed by a self-adaptation and mistake-correction mechanism that uses rich sensory cues and vision as feedback. However, this capability is quite difficult for vision-based controllers in robotics, which until now have been built on a rigid setup for reading visual input data from a fixed-mounted camera that should not be moved or repositioned at train and test time. The ability to quickly acquire visual motor control skills under large viewpoint variation would have substantial implications for autonomous robotic systems — for example, this capability would be particularly desirable for robots that can help rescue efforts in emergency or disaster zones.
In “Sim2Real Viewpoint Invariant Visual Servoing by Recurrent Control”, presented at CVPR 2018 this week, we study a novel deep network architecture (consisting of two fully convolutional networks and a long short-term memory unit) that learns from a past history of actions and observations to self-calibrate. Using diverse simulated data consisting of demonstrated trajectories and reinforcement learning objectives, our visually-adaptive network is able to control a robotic arm to reach a diverse set of visually-indicated goals, from various viewpoints and independent of camera calibration.
Viewpoint invariant manipulation for visually indicated goal reaching with a physical robotic arm. We learn a single policy that can reach diverse goals from sensory input captured from drastically different camera viewpoints. First row shows the visually indicated goals.
The Challenge
Discovering how the controllable degrees of freedom (DoF) affect visual motion can be ambiguous and underspecified from a single image captured from an unknown viewpoint. Identifying the effect of actions on image-space motion and successfully performing the desired task requires a robust perception system augmented with the ability to maintain a memory of past actions. To be able to tackle this challenging problem, we had to address the following essential questions:
How can we make it feasible to provide the right amount of experience for the robot to learn the self-adaptation behavior based on pure visual observations that simulate a lifelong learning paradigm?
How can we design a model that integrates robust perception and self-adaptive control such that it can quickly transfer to unseen environments?
To do so, we devised a new manipulation task where a seven-DoF robot arm is provided with an image of an object and is directed to reach that particular goal amongst a set of distractor objects, while viewpoints change drastically from one trial to another. In doing so, we were able to simulate both the learning of complex behaviors and the transfer to unseen environments.
Visually indicated goal reaching task with a physical robotic arm and diverse camera viewpoints.
Harnessing Simulation to Learn Complex Behaviors
Collecting robot experience data is difficult and time-consuming. In a previous post, we showed how to scale up learning skills by distributing the data collection and trials to multiple robots. Although this approach expedited learning, it is still not feasibly extendable to learning complex behaviors such as visual self-calibration, where we need to expose robots to a huge space of various viewpoints. Instead, we opt to learn such complex behavior in simulation, where we can collect unlimited robot trials and easily move the camera to various random viewpoints. In addition to fast data collection, simulation also lets us avoid the hardware limitations that would otherwise require installing multiple cameras around the robot.
We use the domain randomization technique to learn generalizable policies in simulation.
To learn visually robust features that transfer to unseen environments, we used a technique known as domain randomization (a.k.a. simulation randomization), introduced by Sadeghi & Levine (2017), which enables robots to learn vision-based policies entirely in simulation such that they can generalize to the real world. This technique was shown to work well for various robotic tasks, such as indoor navigation, object localization, pick and placing, etc. In addition, to learn complex behaviors like self-calibration, we harnessed the simulation capabilities to generate synthetic demonstrations and combined them with reinforcement learning objectives to learn a robust controller for the robotic arm.
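The data-generation side of this recipe can be sketched as a loop that re-randomizes the camera and scene appearance for every simulated episode. The parameter ranges, the `simulator` object, and its `render_episode` method below are hypothetical placeholders standing in for a physics simulator such as pybullet, not an actual API.

```python
# Domain randomization sketch: sample a fresh viewpoint and appearance per
# episode so the learned policy cannot rely on a calibrated, fixed camera.
import random

def sample_randomized_scene():
    return {
        # Camera placed anywhere on a shell around the workspace.
        "camera_yaw": random.uniform(-180, 180),
        "camera_pitch": random.uniform(-60, -10),
        "camera_distance": random.uniform(0.7, 1.5),
        # Appearance randomization: textures, object colors, lighting.
        "table_texture": random.choice(["wood", "checker", "noise"]),
        "object_colors": [[random.random() for _ in range(3)] for _ in range(5)],
        "light_direction": [random.uniform(-1, 1) for _ in range(3)],
    }

def collect_simulated_trajectories(simulator, policy, num_episodes):
    data = []
    for _ in range(num_episodes):
        scene = sample_randomized_scene()
        # Each episode uses a different random viewpoint, so the recurrent
        # controller must infer the camera from its history of actions and
        # observations rather than from a known calibration.
        data.append(simulator.render_episode(scene, policy))
    return data
```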
Viewpoint invariant manipulation for visually indicated goal reaching with a simulated seven-DoF robotic arm. We learn a single policy that can reach diverse goals from sensory input captured from dramatically different camera viewpoints.
Disentangling Perception from Control
To enable fast transfer to unseen environments, we devised a deep neural network that combines perception and control, trained end-to-end simultaneously while also allowing each to be learned independently if needed. This disentanglement between perception and control eases transfer to unseen environments, and makes the model both flexible and efficient in that each of its parts (i.e. 'perception' or 'control') can be independently adapted to new environments with small amounts of data. Additionally, while the control portion of the network was trained entirely with simulated data, the perception part of our network was complemented by collecting a small number of static images with object bounding boxes, without needing to collect whole action-sequence trajectories with a physical robot. In practice, we fine-tuned the perception part of our network with only 76 object bounding boxes coming from 22 images.
Real-world robot and moving camera setup. First row shows the scene arrangements and the second row shows the visual sensory input to the robot.
Early Results
We tested the visually-adapted version of our network on a physical robot and on real objects with drastically different appearances than the ones used in simulation. Experiments were performed with either one or two objects on a table — “seen objects” (as labeled in the figure below) were used for visual adaptation using a small collection of real static images, while “unseen objects” had not been seen during visual adaptation. During the test, the robot arm was directed to reach a visually indicated object from various viewpoints. In the two-object experiments, the second object was there to "fool" the robotic arm. While the simulation-only network has good generalization capability (due to being trained with the domain randomization technique), the very small amount of static visual data used to visually adapt the controller boosted performance further, thanks to the flexible architecture of our network.
After adapting the visual features with the small amount of real images, performance was boosted by more than 10%. All used real objects are drastically different from the objects seen in simulation.
We believe that learning online visual self-adaptation is an important yet challenging problem, with the goal of learning generalizable policies for robots that can act in diverse and unstructured real-world setups. Our approach can be extended to any sort of automatic self-calibration. See the video below for more information on this work.
Acknowledgements
This research was conducted by Fereshteh Sadeghi, Alexander Toshev, Eric Jang and Sergey Levine. We would also like to thank Erwin Coumans and Yunfei Bai for providing pybullet, and Vincent Vanhoucke for insightful discussions.
How Can Neural Network Similarity Help Us Understand Training and Generalization?
Thursday, June 21, 2018
Posted by Maithra Raghu, Google Brain Team and Ari S. Morcos, DeepMind
In order to solve tasks, deep neural networks (DNNs) progressively transform input data into a sequence of complex representations (i.e., patterns of activations across individual neurons). Understanding these representations is critically important, not only for interpretability, but also so that we can more intelligently design machine learning systems. However, understanding these representations has proven quite difficult, especially when comparing representations across networks. In a previous post, we outlined the benefits of Canonical Correlation Analysis (CCA) as a tool for understanding and comparing the representations of convolutional neural networks (CNNs), showing that they converge in a bottom-up pattern, with early layers converging to their final representations before later layers over the course of training.
In “Insights on Representational Similarity in Neural Networks with Canonical Correlation”, we develop this work further to provide new insights into the representational similarity of CNNs, including differences between networks which memorize (e.g., networks which can only classify images they have seen before) and those which generalize (e.g., networks which can correctly classify previously unseen images). Importantly, we also extend this method to provide insights into the dynamics of recurrent neural networks (RNNs), a class of models that are particularly useful for sequential data, such as language. Comparing RNNs is difficult in many of the same ways as CNNs, but RNNs present the additional challenge that their representations change over the course of a sequence. This makes CCA, with its helpful invariances, an ideal tool for studying RNNs in addition to CNNs. As such, we have additionally open sourced the code used for applying CCA on neural networks, with the hope that it will help the research community better understand network dynamics.
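To give a sense of the comparison involved, here is a simplified, unweighted CCA distance between two layers' activations in NumPy. The paper uses a weighted variant of this measure, so treat this as an assumption-laden approximation rather than the released code.

```python
# Unweighted CCA distance between two activation matrices recorded on the
# same inputs; smaller values mean more similar representations.
import numpy as np

def cca_distance(acts_a, acts_b):
    """acts_a, acts_b: (num_datapoints, num_neurons) activation matrices."""
    a = acts_a - acts_a.mean(axis=0, keepdims=True)
    b = acts_b - acts_b.mean(axis=0, keepdims=True)
    # Orthonormal bases for each set of neurons (thin QR decomposition).
    qa, _ = np.linalg.qr(a)
    qb, _ = np.linalg.qr(b)
    # Singular values of qa^T qb are the canonical correlations.
    corrs = np.clip(np.linalg.svd(qa.T @ qb, compute_uv=False), 0.0, 1.0)
    return 1.0 - corrs.mean()
```

For example, evaluating this distance on the same layer of two independently trained networks, over a shared batch of inputs, is the kind of comparison the results below summarize at the group level.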
Representational Similarity of Memorizing and Generalizing CNNs
Ultimately, a machine learning system is only useful if it can generalize to new situations it has never seen before. Understanding the factors which differentiate between networks that generalize and those that don’t is therefore essential, and may lead to new methods to improve generalization performance. To investigate whether representational similarity is predictive of generalization, we studied two types of CNNs:
generalizing networks: CNNs trained on data with unmodified, accurate labels and which learn solutions which generalize to novel data.
memorizing networks: CNNs trained on datasets with randomized labels such that they must memorize the training data and cannot, by definition, generalize (as in Zhang et al., 2017).
We trained multiple instances of each network, differing only in the initial randomized values of the network weights and the order of the training data, and used a new weighted approach to calculate the CCA distance measure (see our paper for details) to compare the representations within each group of networks and between memorizing and generalizing networks.
We found that groups of different generalizing networks consistently converged to more similar representations (especially in later layers) than groups of memorizing networks (see figure below). At the softmax, which denotes the network’s ultimate prediction, the CCA distance for each group of generalizing and memorizing networks decreases substantially, as the networks in each separate group make similar predictions.
Groups of generalizing networks (blue) converge to more similar solutions than groups of memorizing networks (red). CCA distance was calculated between groups of networks trained on real CIFAR-10 labels (“Generalizing”) or randomized CIFAR-10 labels (“Memorizing”) and between pairs of memorizing and generalizing networks (“Inter”).
Perhaps most surprisingly, in later hidden layers, the representational distance between any given pair of memorizing networks was about the same as the representational distance between a memorizing and generalizing network (“Inter” in the plot above), despite the fact that these networks were trained on data with entirely different labels. Intuitively, this result suggests that while there are many different ways to memorize the training data (resulting in greater CCA distances), there are fewer ways to learn generalizable solutions. In future work, we plan to explore whether this insight can be used to regularize networks to learn more generalizable solutions.
Understanding the Training Dynamics of Recurrent Neural Networks
So far, we have only applied CCA to CNNs trained on image data. However, CCA can also be applied to calculate representational similarity in RNNs, both over the course of training and over the course of a sequence. Applying CCA to RNNs, we first asked whether the RNNs exhibit the same bottom-up convergence pattern we observed in our previous work for CNNs. To test this, we measured the CCA distance between the representation at each layer of the RNN over the course of training with its final representation at the end of training. We found that the CCA distance for layers closer to the input dropped earlier in training than for deeper layers, demonstrating that, like CNNs, RNNs also converge in a bottom-up pattern (see figure below).
Convergence dynamics for RNNs over the course of training exhibit bottom up convergence, as layers closer to the input converge to their final representations earlier in training than later layers. For example, layer 1 converges to its final representation earlier in training than layer 2 than layer 3 and so on. Epoch designates the number of times the model has seen the entire training set while different colors represent the convergence dynamics of different layers.
Additional findings in our paper show that wider networks (e.g., networks with more neurons at each layer) converge to more similar solutions than narrow networks. We also found that trained networks with identical structures but different learning rates converge to distinct clusters with similar performance, but highly dissimilar representations. We also apply CCA to RNN dynamics over the course of a single sequence, rather than simply over the course of training, providing some initial insights into the various factors which influence RNN representations over time.
Conclusions
These findings reinforce the utility of analyzing and comparing DNN representations in order to provide insights into network function, generalization, and convergence. However, there are still many open questions: in future work, we hope to uncover which aspects of the representation are conserved across networks, both in CNNs and RNNs, and whether these insights can be used to improve network performance. We encourage others to try out the code used for the paper to investigate what CCA can tell us about other neural networks!
Acknowledgements
Special thanks to Samy Bengio, who is a co-author on this work. We also thank Martin Wattenberg, Jascha Sohl-Dickstein and Jon Kleinberg for helpful comments.
Google at CVPR 2018
Monday, June 18, 2018
Posted by Christian Howard, Editor-in-Chief, Google AI Communications
This week, Salt Lake City hosts the 2018 Conference on Computer Vision and Pattern Recognition (CVPR 2018), the premier annual computer vision event comprising the main conference and several co-located workshops and tutorials. As a leader in computer vision research and a Diamond Sponsor, Google will have a strong presence at CVPR 2018 — over 200 Googlers will be in attendance to present papers and invited talks at the conference, and to organize and participate in multiple workshops.
If you are attending CVPR this year, please stop by our booth and chat with our researchers who are actively pursuing the next generation of intelligent systems that utilize the latest machine learning techniques applied to various areas of machine perception. Our researchers will also be available to talk about and demo several recent efforts, including the technology behind portrait mode on the Pixel 2 and Pixel 2 XL smartphones, the Open Images V4 dataset and much more.
You can learn more about our research being presented at CVPR 2018 in the list below (Googlers highlighted in blue).
Organization
Finance Chair: Ramin Zabih
Area Chairs include: Sameer Agarwal, Aseem Agrawala, Jon Barron, Abhinav Shrivastava, Carl Vondrick, Ming-Hsuan Yang
Orals/Spotlights
Unsupervised Discovery of Object Landmarks as Structural Representations
Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, Honglak Lee
DoubleFusion: Real-time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor
Tao Yu, Zerong Zheng, Kaiwen Guo, Jianhui Zhao, Qionghai Dai, Hao Li, Gerard Pons-Moll, Yebin Liu
Neural Kinematic Networks for Unsupervised Motion Retargetting
Ruben Villegas, Jimei Yang, Duygu Ceylan, Honglak Lee
Burst Denoising with Kernel Prediction Networks
Ben Mildenhall, Jiawen Chen, Jonathan Barron, Robert Carroll, Dillon Sharlet, Ren Ng
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Benoit Jacob, Skirmantas Kligys, Bo Chen, Matthew Tang, Menglong Zhu, Andrew Howard, Dmitry Kalenichenko, Hartwig Adam
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Chunhui Gu, Chen Sun, David Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik
Focal Visual-Text Attention for Visual Question Answering
Junwei Liang, Lu Jiang, Liangliang Cao, Li-Jia Li, Alexander G. Hauptmann
Inferring Light Fields from Shadows
Manel Baradad, Vickie Ye, Adam Yedida, Fredo Durand, William Freeman, Gregory Wornell, Antonio Torralba
Modifying Non-Local Variations Across Multiple Views
Tal Tlusty, Tomer Michaeli, Tali Dekel, Lihi Zelnik-Manor
Iterative Visual Reasoning Beyond Convolutions
Xinlei Chen, Li-jia Li, Fei-Fei Li, Abhinav Gupta
Unsupervised Training for 3D Morphable Model Regression
Kyle Genova, Forrester Cole, Aaron Maschinot, Daniel Vlasic, Aaron Sarna, William Freeman
Learning Transferable Architectures for Scalable Image Recognition
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc Le
The iNaturalist Species Classification and Detection Dataset
Grant van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, Serge Belongie
Learning Intrinsic Image Decomposition from Watching the World
Zhengqi Li, Noah Snavely
Learning Intelligent Dialogs for Bounding Box Annotation
Ksenia Konyushkova, Jasper Uijlings, Christoph Lampert, Vittorio Ferrari
Posters
Revisiting Knowledge Transfer for Training Object Class Detectors
Jasper Uijlings, Stefan Popov, Vittorio Ferrari
Rethinking the Faster R-CNN Architecture for Temporal Action Localization
Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David Ross, Jia Deng, Rahul Sukthankar
Hierarchical Novelty Detection for Visual Object Recognition
Kibok Lee, Kimin Lee, Kyle Min, Yuting Zhang, Jinwoo Shin, Honglak Lee
COCO-Stuff: Thing and Stuff Classes in Context
Holger Caesar, Jasper Uijlings, Vittorio Ferrari
Appearance-and-Relation Networks for Video Classification
Limin Wang, Wei Li, Wen Li, Luc Van Gool
MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks
Ariel Gordon, Elad Eban, Bo Chen, Ofir Nachum, Tien-Ju Yang, Edward Choi
Deformable Shape Completion with Graph Convolutional Autoencoders
Or Litany, Alex Bronstein, Michael Bronstein, Ameesh Makadia
MegaDepth: Learning Single-View Depth Prediction from Internet Photos
Zhengqi Li, Noah Snavely
Unsupervised Discovery of Object Landmarks as Structural Representations
Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, Honglak Lee
Burst Denoising with Kernel Prediction Networks
Ben Mildenhall, Jiawen Chen, Jonathan Barron, Robert Carroll, Dillon Sharlet, Ren Ng
Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling
Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Tianfan Xue, Joshua Tenenbaum, William Freeman
Sparse, Smart Contours to Represent and Edit Images
Tali Dekel, Dilip Krishnan, Chuang Gan, Ce Liu, William Freeman
MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features
Liang-Chieh Chen, Alexander Hermans, George Papandreou, Florian Schroff, Peng Wang, Hartwig Adam
Large Scale Fine-Grained Categorization and Domain-Specific Transfer Learning
Yin Cui, Yang Song, Chen Sun, Andrew Howard, Serge Belongie
Improved Lossy Image Compression with Priming and Spatially Adaptive Bit Rates for Recurrent Networks
Nick Johnston, Damien Vincent, David Minnen, Michele Covell, Saurabh Singh, Sung Jin Hwang, George Toderici, Troy Chinen, Joel Shor
MobileNetV2: Inverted Residuals and Linear Bottlenecks
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen
ScanComplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans
Angela Dai, Daniel Ritchie, Martin Bokeloh, Scott Reed, Juergen Sturm, Matthias Nießner
Sim2Real View Invariant Visual Servoing by Recurrent Control
Fereshteh Sadeghi, Alexander Toshev, Eric Jang, Sergey Levine
Alternating-Stereo VINS: Observability Analysis and Performance Evaluation
Mrinal Kanti Paul, Stergios Roumeliotis
Soccer on Your Tabletop
Konstantinos Rematas, Ira Kemelmacher, Brian Curless, Steve Seitz
Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints
Reza Mahjourian, Martin Wicke, Anelia Angelova
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Chunhui Gu, Chen Sun, David Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik
Inferring Light Fields from Shadows
Manel Baradad, Vickie Ye, Adam Yedida, Fredo Durand, William Freeman, Gregory Wornell, Antonio Torralba
Modifying Non-Local Variations Across Multiple Views
Tal Tlusty, Tomer Michaeli, Tali Dekel, Lihi Zelnik-Manor
Aperture Supervision for Monocular Depth Estimation
Pratul Srinivasan, Rahul Garg, Neal Wadhwa, Ren Ng, Jonathan Barron
Instance Embedding Transfer to Unsupervised Video Object Segmentation
Siyang Li, Bryan Seybold, Alexey Vorobyov, Alireza Fathi, Qin Huang, C.-C. Jay Kuo
Frame-Recurrent Video Super-Resolution
Mehdi S. M. Sajjadi, Raviteja Vemulapalli, Matthew Brown
Weakly Supervised Action Localization by Sparse Temporal Pooling Network
Phuc Nguyen, Ting Liu, Gautam Prasad, Bohyung Han
Iterative Visual Reasoning Beyond Convolutions
Xinlei Chen, Li-jia Li, Fei-Fei Li, Abhinav Gupta
Learning and Using the Arrow of Time
Donglai Wei, Andrew Zisserman, William Freeman, Joseph Lim
HydraNets: Specialized Dynamic Architectures for Efficient Inference
Ravi Teja Mullapudi, Noam Shazeer, William Mark, Kayvon Fatahalian
Thoracic Disease Identification and Localization with Limited Supervision
Zhe Li, Chong Wang, Mei Han, Yuan Xue, Wei Wei, Li-jia Li, Fei-Fei Li
Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis
Seunghoon Hong, Dingdong Yang, Jongwook Choi, Honglak Lee
Deep Semantic Face Deblurring
Ziyi Shen, Wei-Sheng Lai, Tingfa Xu, Jan Kautz, Ming-Hsuan Yang
Unsupervised Training for 3D Morphable Model Regression
Kyle Genova, Forrester Cole, Aaron Maschinot, Daniel Vlasic, Aaron Sarna, William Freeman
Learning Transferable Architectures for Scalable Image Recognition
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc Le
Learning Intrinsic Image Decomposition from Watching the World
Zhengqi Li, Noah Snavely
PiCANet: Learning Pixel-wise Contextual Attention for Saliency Detection
Nian Liu, Junwei Han, Ming-Hsuan Yang
Mobile Video Object Detection with Temporally-Aware Feature Maps
Mason Liu, Menglong Zhu
Tutorials
Computer Vision for Robotics and Driving
Anelia Angelova, Sanja Fidler
Unsupervised Visual Learning
Pierre Sermanet, Anelia Angelova
UltraFast 3D Sensing, Reconstruction and Understanding of People, Objects and Environments
Sean Fanello, Julien Valentin, Jonathan Taylor, Christoph Rhemann, Adarsh Kowdle, Jürgen Sturm, Christine Kaeser-Chen, Pavel Pidlypenskyi, Rohit Pandey, Andrea Tagliasacchi, Sameh Khamis, David Kim, Mingsong Dou, Kaiwen Guo, Danhang Tang, Shahram Izadi
Generative Adversarial Networks
Jun-Yan Zhu, Taesung Park, Mihaela Rosca, Phillip Isola, Ian Goodfellow
Google at NAACL
Friday, June 8, 2018
Posted by Kenton Lee, Research Scientist and Slav Petrov, Principal Scientist, Language Team, Google AI
This week, New Orleans, LA hosted the North American Chapter of the Association for Computational Linguistics (NAACL) conference, a venue for the latest research on computational approaches to understanding natural language. Google once again had a strong presence, presenting our research on a diverse set of topics, including dialog, summarization, machine translation, and linguistic analysis. In addition to contributing publications, Googlers were also involved as committee members, workshop organizers, panelists and presented one of the conference keynotes. We also provided telepresence robots, which enabled researchers who couldn’t attend in person to present their work remotely at the Widening Natural Language Processing Workshop (WiNLP) and several other workshops.
Googler Margaret Mitchell setting up our telepresence robots for Diana Gonzalez and Gibran Fuentes Pineda of the Universidad Nacional Autonoma de Mexico to remotely present their first-place work on visual storytelling.
This year NAACL also introduced a new Test of Time Award recognizing influential papers published between 2002 and 2012. We are happy and honored to recognize that all three papers receiving the award (listed below with a short summary) were co-authored by researchers who are now at Google (in blue):
BLEU: a Method for Automatic Evaluation of Machine Translation (2002)
Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu
Before the introduction of the BLEU metric, comparing Machine Translation (MT) models required expensive human evaluation. While human evaluation is still the gold standard, the strong correlation of BLEU with human judgment has permitted much faster experiment cycles. BLEU has been a reliable measure of progress, persisting through multiple paradigm shifts in MT.
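For readers unfamiliar with the metric, here is a minimal sketch of sentence-level BLEU with modified (clipped) n-gram precisions and a brevity penalty. It omits smoothing and multiple references, so it is an illustration of the idea rather than a reference implementation.

```python
# Minimal BLEU sketch: clipped n-gram precisions combined with a brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """candidate, reference: token lists. Returns a score in [0, 1]."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        precisions.append(max(overlap, 1e-9) / max(sum(cand.values()), 1))
    # Penalize candidates that are shorter than the reference.
    bp = 1.0 if len(candidate) > len(reference) else math.exp(
        1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```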
Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms (2002)
Michael Collins
The structured perceptron is a generalization of the classical perceptron to structured prediction problems, where the number of possible "labels" for each input is a very large set, and each label has rich internal structure. Canonical examples are speech recognition, machine translation, and syntactic parsing. The structured perceptron was one of the first algorithms proposed for structured prediction, and has been shown to be effective in spite of its simplicity.
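The core of the algorithm fits in a few lines. In the sketch below, `decode` (e.g., Viterbi search for a tagger) and `features` (feature extraction over an input/output pair) are task-specific stand-ins assumed for illustration; only the weight-update rule is the structured perceptron itself.

```python
# Structured perceptron sketch: predict the best-scoring structure under the
# current weights, then move the weights toward the gold structure's features
# and away from the predicted structure's features.
def structured_perceptron(train_data, decode, features, num_epochs=5):
    weights = {}  # sparse feature name -> weight
    for _ in range(num_epochs):
        for x, gold_y in train_data:
            pred_y = decode(x, weights)  # argmax over candidate structures
            if pred_y != gold_y:
                for f, count in features(x, gold_y).items():
                    weights[f] = weights.get(f, 0.0) + count
                for f, count in features(x, pred_y).items():
                    weights[f] = weights.get(f, 0.0) - count
    return weights
```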
Thumbs up?: Sentiment Classification using Machine Learning Techniques (2002)
Bo Pang, Lillian Lee, Shivakumar Vaithyanathan
This paper is amongst the first works in sentiment analysis and helped define the subfield of sentiment and opinion analysis and review mining. The paper introduced a new way to look at document classification, developed the first solutions to it using supervised machine learning methods, and discussed insights and challenges. This paper also had significant data impact -- the movie review dataset has supported much of the early work in this area and is still one of the commonly used benchmark evaluation datasets.
If you attended NAACL 2018, we hope that you stopped by the booth to check out some demos, meet our researchers and discuss projects and opportunities at Google that go into solving interesting problems for billions of people. You can learn more about Google research presented at NAACL 2018 below (Googlers highlighted in blue), and visit the Google AI Language Team page.
Area Chairs include: Dan Bikel, Dilek Hakkani-Tur, Zornitsa Kozareva, Marius Pasca, Emily Pitler, Idan Szpektor, Taro Watanabe
Publications Co-Chair: Margaret Mitchell
Keynote
Google Assistant or My Assistant? Towards Personalized Situated Conversational Agents
Dilek Hakkani-Tür
Publications
Bootstrapping a Neural Conversational Agent with Dialogue Self-Play, Crowdsourcing and On-Line Reinforcement Learning
Pararth Shah, Dilek Hakkani-Tür, Bing Liu, Gokhan Tür
SHAPED: Shared-Private Encoder-Decoder for Text Style Adaptation
Ye Zhang, Nan Ding, Radu Soricut
Olive Oil is Made of Olives, Baby Oil is Made for Babies: Interpreting Noun Compounds Using Paraphrases in a Neural Model
Vered Schwartz, Chris Waterson
Are All Languages Equally Hard to Language-Model?
Ryan Cotterell, Sebastian J. Mielke, Jason Eisner, Brian Roark
Self-Attention with Relative Position Representations
Peter Shaw, Jakob Uszkoreit, Ashish Vaswani
Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems
Bing Liu, Gokhan Tür, Dilek Hakkani-Tür, Pararth Shah, Larry Heck
Workshops
Subword & Character Level Models in NLP
Organizers: Manaal Faruqui, Hinrich Schütze, Isabel Trancoso, Yulia Tsvetkov, Yadollah Yaghoobzadeh
Storytelling Workshop
Organizers: Margaret Mitchell, Ishan Misra, Ting-Hao 'Kenneth' Huang, Frank Ferraro
Ethics in NLP
Organizers: Michael Strube, Dirk Hovy, Margaret Mitchell, Mark Alfano
NAACL HLT Panels
Careers in Industry
Participants: Philip Resnik (moderator), Jason Baldridge, Laura Chiticariu, Marie Mateer, Dan Roth
Ethics in NLP
Participants: Dirk Hovy (moderator), Margaret Mitchell, Vinodkumar Prabhakaran, Mark Yatskar, Barbara Plank
Realtime tSNE Visualizations with TensorFlow.js
Thursday, June 7, 2018
Posted by Nicola Pezzotti, Software Engineering Intern, Google Zürich
In recent years, the t-distributed Stochastic Neighbor Embedding (tSNE) algorithm has become one of the most used and insightful techniques for exploratory data analysis of high-dimensional data. Used to interpret deep neural network outputs in tools such as the TensorFlow Embedding Projector and TensorBoard, a powerful feature of tSNE is that it reveals clusters of high-dimensional data points at different scales while requiring only minimal tuning of its parameters. Despite these advantages, the computational complexity of the tSNE algorithm limits its application to relatively small datasets. While several evolutions of tSNE have been developed to address this issue (mainly focusing on the scalability of the similarity computations between data points), they have so far not been enough to provide a truly interactive experience when visualizing the evolution of the tSNE embedding for large datasets.
In “Linear tSNE Optimization for the Web”, we present a novel approach to tSNE that heavily relies on modern graphics hardware. Given the linear complexity of the new approach, our method generates embeddings faster than comparable techniques and can even be executed on the client side in a web browser by leveraging GPU capabilities through WebGL. The combination of these two factors allows for real-time interactive visualization of large, high-dimensional datasets. Furthermore, we are releasing this work as an open source library in the TensorFlow.js family in the hopes that the broader research community finds it useful.
Real-time evolution of the tSNE embedding for the complete MNIST dataset with our technique. The dataset contains images of 60,000 handwritten digits. You can find a live demo here.
The aim of tSNE is to cluster small “neighborhoods” of similar data points while also reducing the overall dimensionality of the data so it is more easily visualized. In other words, the tSNE objective function measures how well these neighborhoods of similar data are preserved in the 2- or 3-dimensional space, and arranges them into clusters accordingly.
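For reference, the standard tSNE objective (from the original tSNE formulation, not anything specific to this post) is the Kullback-Leibler divergence between the high-dimensional similarities P and the low-dimensional similarities Q defined with a Student-t kernel:

C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \qquad q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}},

where the y_i are the 2D (or 3D) embedding coordinates being optimized.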
In previous work, the minimization of the tSNE objective was performed as an N-body simulation problem, in which points are randomly placed in the embedding space and two different types of forces are applied to each point. Attractive forces bring the points closer to the points that are most similar in the high-dimensional space, while repulsive forces push them away from all the neighbors in the embedding.
While the attractive forces act on only a small subset of points (i.e., similar neighbors), repulsive forces are in effect between all pairs of points. Due to this, tSNE requires significant computation and many iterations of the objective function, which limits the possible dataset size to just a few hundred data points. To improve over a brute force solution, the Barnes-Hut algorithm was used to approximate the repulsive forces and the gradient of the objective function. This allows scaling of the computation to tens of thousands of data points, but it requires more than 15 minutes to compute the MNIST embedding in a C++ implementation.
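To make the N-body formulation concrete, the sketch below computes the exact, brute-force tSNE gradient in NumPy: the attractive term comes from the high-dimensional similarities P, the repulsive term from all pairs of embedded points. This O(N²) computation is what Barnes-Hut, and the WebGL texture approach described next, approximate; it is an illustration, not the released library.

```python
# Brute-force tSNE gradient: attractive forces from P, repulsive forces from
# every pair of points in the 2D embedding (the O(N^2) bottleneck).
import numpy as np

def tsne_gradient(P, Y):
    """P: (N, N) symmetric high-dimensional similarities; Y: (N, 2) embedding."""
    diff = Y[:, None, :] - Y[None, :, :]        # (N, N, 2) pairwise y_i - y_j
    inv = 1.0 / (1.0 + (diff ** 2).sum(-1))     # Student-t kernel on distances
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                         # low-dimensional similarities
    # (p_ij - q_ij) > 0 attracts point i toward j; < 0 repels it away.
    forces = 4.0 * ((P - Q) * inv)[:, :, None] * diff
    return forces.sum(axis=1)                   # (N, 2) gradient per point

# One gradient-descent step: Y -= learning_rate * tsne_gradient(P, Y)
```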
In our paper, we propose a solution to this scaling problem by approximating the gradient of the objective function using textures that are generated in WebGL. Our technique draws a “repulsive field” at every minimization iteration using a three channel texture, with the 3 components treated as colors and drawn in the RGB channels. The repulsive field is obtained for every point to represent both the horizontal and vertical repulsive force created by the point, and a third component used for normalization. Intuitively, the normalization term ensures that the magnitude of the shifts matches the similarity measure in the high-dimensional space. In addition, the resolution of the texture is adaptively changed to keep the number of pixels drawn constant.
Rendering of the three functions used to approximate the repulsive effect created by a single point. In the above figure, the repulsive forces show that a point in a blue area is pushed to the left/bottom, a point in a red area is pushed to the right/top, and a point in the white region does not move.
The contribution of every point is then added on the GPU, resulting in a texture similar to those presented in the GIF below, that approximate the repulsive fields. This innovative repulsive field approach turns out to be much more GPU friendly than the more commonly used calculation of point-to-point interactions. This is because repulsion for multiple points can be computed at once, and very quickly, on the GPU. In addition, we implemented the computation of the attraction between points on the GPU.
This animation shows the evolution of the tSNE embedding (upper left) and of the scalar fields used to approximate its gradient with normalization term (upper right), horizontal shift (bottom left) and vertical shift (bottom right).
We additionally revised the update of the embedding from an ad-hoc implementation to a series of standard tensor operations that are computed in TensorFlow.js, a JavaScript library to perform tensor computations in the web browser. Our approach, which is released as an open source library in the TensorFlow.js family, allows us to compute the evolution of the tSNE embedding entirely on the GPU while having better computational complexity.
With this implementation, what used to take 15 minutes to calculate (on the MNIST dataset) can now be visualized in real-time and in the web browser. Furthermore, this allows real-time visualizations of much larger datasets, a feature that is particularly useful when deep neural output is analyzed. One main limitation of our work is that this technique currently only works for 2D embeddings. However, 2D visualizations are often preferred over 3D ones, as 3D embeddings require more interaction to effectively understand cluster results.
Future Work
We believe that having a fast and interactive tSNE implementation that runs in the browser will empower developers of data analytics systems. We are particularly interested in exploring how our implementation can be used for the interpretation of deep neural networks. Additionally, our implementation shows how lateral thinking in using GPU computations (approximating the gradient using RGB textures) can be used to significantly speed up algorithmic computations. In the future we will be exploring how this kind of gradient approximation can be applied not only to speed up other dimensionality reduction algorithms, but also to implement other N-body simulations in the web browser using TensorFlow.js.
Acknowledgements
We would like to thank Alexander Mordvintsev, Yannick Assogba, Matt Sharifi, Anna Vilanova, Elmar Eisemann, Nikhil Thorat, Daniel Smilkov, Martin Wattenberg, Fernanda Viegas, Alessio Bazzica, Boudewijn Lelieveldt, Thomas Höllt, Baldur van Lew, Julian Thijssen and Marvin Ritter.
Announcing an updated YouTube-8M, and the 2nd YouTube-8M Large-Scale Video Understanding Challenge and Workshop
Tuesday, June 5, 2018
Posted by Joonseok Lee, Software Engineer, Google AI
Last year, we organized the first YouTube-8M Large-Scale Video Understanding Challenge with Kaggle, in which 742 teams consisting of 946 individuals from 60 countries used the YouTube-8M dataset (2017 edition) to develop classification algorithms which accurately assign video-level labels. The purpose of the competition was to accelerate improvements in large-scale video understanding, representation learning, noisy data modeling, transfer learning and domain adaptation approaches that can help improve the machine-learning models that classify video. In addition to the competition, we hosted an affiliated workshop at CVPR’17, inviting competition top-performers and researchers to share their ideas on how to advance the state-of-the-art in video understanding.
As a continuation of these efforts to accelerate video understanding, we are excited to announce another update to the YouTube-8M dataset, a new Kaggle video understanding challenge and an affiliated 2nd Workshop on YouTube-8M Large-Scale Video Understanding, to be held at the 2018 European Conference on Computer Vision (ECCV'18).
An Updated YouTube-8M Dataset (2018 Edition)
Our YouTube-8M (2018 edition) features a major improvement in the quality of annotations, obtained using a machine learning system that combines audio-visual content with title, description and other metadata to provide more accurate ground truth annotations. The updated version contains 6.1 million URLs, labeled with a vocabulary of 3,862 visual entities, with each video annotated with one or more labels and an average of 3 labels per video. We have also updated the starter code, with updated instructions for downloading and training TensorFlow video annotation models on the dataset.
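As a rough sketch of how such records can be consumed, the snippet below parses video-level tfrecords with TensorFlow 1.x. The feature names ("id", "labels", "mean_rgb", "mean_audio") and the example file path are assumptions based on the dataset's starter-code conventions; please defer to the official starter code linked above.

```python
# Hedged sketch: parse video-level YouTube-8M examples (TF 1.x style).
import tensorflow as tf

def parse_video_level_example(serialized):
    features = tf.parse_single_example(
        serialized,
        features={
            "id": tf.FixedLenFeature([], tf.string),
            "labels": tf.VarLenFeature(tf.int64),            # subset of 3,862 entities
            "mean_rgb": tf.FixedLenFeature([1024], tf.float32),
            "mean_audio": tf.FixedLenFeature([128], tf.float32),
        })
    # Average-pooled visual and audio features form the model input; the
    # sparse label set is the multi-label classification target.
    inputs = tf.concat([features["mean_rgb"], features["mean_audio"]], axis=0)
    return features["id"], inputs, features["labels"]

dataset = tf.data.TFRecordDataset(["/path/to/train_shard.tfrecord"])  # placeholder path
dataset = dataset.map(parse_video_level_example)
```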
The 2nd YouTube-8M Video Understanding Challenge
The 2nd YouTube-8M Video Understanding Challenge invites participants to build audio-visual content classification models using YouTube-8M as training data, and then to label an unknown subset of test videos. Unlike last year, we strictly impose a hard limit on model size, encouraging participants to advance a single model within a tight budget rather than assembling as many models as possible. Each of the top 5 teams will be awarded $5,000 to support their travel to Munich to attend ECCV’18. For details, please visit the Kaggle competition page.
The 2nd Workshop on YouTube-8M Large-Scale Video Understanding
To be held at ECCV’18, the workshop will consist of invited talks by distinguished researchers, as well as presentations by top-performing challenge participants in order to facilitate the exchange of ideas. We encourage those who wish to attend to submit papers describing their research, experiments, or applications based on the YouTube-8M dataset, including papers summarizing their participation in the challenge above. Please refer to the workshop page for more details.
It is our hope that this update to the dataset, along with the new challenge and workshop, will continue to advance the research in large-scale video understanding. We hope you will join us again!
Acknowledgements
This post reflects the work of many machine perception researchers including Sami Abu-El-Haija, Ke Chen, Nisarg Kothari, Joonseok Lee, Hanhan Li, Paul Natsev, Sobhan Naderi Parizi, Rahul Sukthankar, George Toderici, Balakrishnan Varadarajan, as well as Sohier Dane, Julia Elliott, Wendy Kan and Walter Reade from Kaggle. We are also grateful for the support and advice from our partners at YouTube.