Research Blog: Computer Vision

Announcing Open Images V4 and the ECCV 2018 Open Images Challenge

Monday, April 30, 2018

Posted by Vittorio Ferrari, Research Scientist, Machine Perception

Annotated images from the Open Images dataset. Left: Mark Paul Gosselaar plays the guitar by Rhys A. Right: Civilization by Paul Downey. Both images used under CC BY 2.0 license.

12.2M bounding-box annotations for 500 categories on 1.7M training images.

A broader range of categories than previous detection challenges, including new objects such as “fedora” and “snowman”.

In addition to the object detection main track, the challenge includes a Visual Relationship Detection track, on detecting pairs of objects in particular relations, e.g. “woman playing guitar”.

crowdsource.google.com

Seeing More with In Silico Labeling of Microscopy Images

Thursday, April 12, 2018

Posted by Eric Christiansen, Senior Software Engineer, Google Research

Background

Transmitted light (phase-contrast) image of a human motor neuron culture derived from induced pluripotent stem cells. Outset 1 shows a cluster of cells, possibly neurons. Outset 2 shows a flaw in the image obscuring underlying cells. Outset 3 shows neurites. Outset 4 shows what appear to be dead cells. Scale bar is 40 μm. Source images for this and the following figures come from the Finkbeiner lab at the Gladstone Institutes.

A phase-contrast z-stack of the same cells. Note how the appearance changes as the focus is shifted. Now we can see that the fuzzy shape in the lower right of Outset 1 is a single oblong cell, and that the rightmost cell in Outset 4 is taller than the uppermost cell, possibly indicating that it has undergone programmed cell death.

Fluorescence microscopy image of the same cells. The blue fluorescent label localizes to DNA, highlighting cell nuclei. The green fluorescent label localizes to a protein found only in dendrites, a neural substructure. The red fluorescent label localizes to a protein found only in axons, another neural substructure. With these labels it is much easier to understand what’s happening in the sample. For example, the green and red labels in Outset 1 confirm this is a neural cluster. The red label in Outset 3 shows that the neurites are axons, not dendrites. The upper-left blue dot in Outset 4 reveals a previously hard-to-see nucleus, and the lack of a blue dot for the cell at the left shows it to be DNA-free cellular debris.

Seeing more with deep learning

Overview of our system. (A) The dataset of training examples: pairs of transmitted light images from z-stacks with pixel-registered sets of fluorescence images of the same scene. Several different fluorescent labels were used to generate fluorescence images and were varied between training examples; the checkerboard images indicate fluorescent labels which were not acquired for a given example. (B) The untrained deep network was (C) trained on the data A. (D) A z-stack of images of a novel scene. (E) The trained network, C, is used to predict fluorescence labels learned from A for each pixel in the novel images, D.

Animation showing the same cells in transmitted light and fluorescence imaging, along with predicted fluorescence labels from our model. Outset 2 shows the model predicts the correct labels despite the artifact in the input image. Outset 3 shows the model infers these processes are axons, possibly because of their distance from the nearest cells. Outset 4 shows the model sees the hard-to-see cell at the top, and correctly identifies the object at the left as DNA-free cell debris.

Try it for yourself!

Acknowledgements

We thank the Google Accelerated Science team for originating and developing this project and its publication, and additionally Kevin P. Murphy for supporting its publication. We thank Mike Ando, Youness Bennani, Amy Chung-Yu Chou, Jason Freidenfelds, Jason Miller, Kevin P. Murphy, Philip Nelson, Patrick Riley, and Samuel Yang for ideas and editing help with this post. This study was supported by NINDS (NS091046, NS083390, NS101995), the NIH’s National Institute on Aging (AG065151, AG058476), the NIH’s National Human Genome Research Institute (HG008105), Google, the ALS Association, and the Michael J. Fox Foundation.

Looking to Listen: Audio-Visual Speech Separation

Wednesday, April 11, 2018

Posted by Inbar Mosseri and Oran Lang, Software Engineers, Google Research

The input to our method is a video with one or more people speaking, where the speech of interest is interfered with by other speakers and/or background noise. The output is a decomposition of the input audio track into clean speech tracks, one for each person detected in the video.

An Audio-Visual Speech Separation Model

Our multi-stream, neural network-based model architecture.

Application to Speech Recognition

Acknowledgements

The research described in this post was done by Ariel Ephrat (as an intern), Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, Bill Freeman and Michael Rubinstein. We would like to thank Yossi Matias and Google Research Israel for their support for the project, and John Hershey for his valuable feedback. We also thank Arkady Ziefman for his help with animations and figures, and Rachel Soh for helping us procure permissions for video content in our results.

MobileNetV2: The Next Generation of On-Device Computer Vision Networks

Tuesday, April 03, 2018

Posted by Mark Sandler and Andrew Howard, Google Research

Overview of MobileNetV2 Architecture. Blue blocks represent composite convolutional building blocks as shown above.

How does it compare to the first generation of MobileNets?

MobileNetV2 improves speed (reduced latency) and increases ImageNet Top 1 accuracy.

Tensorflow Object Detection API

Model	Params	Multiply-Adds	mAP	Mobile CPU latency
MobileNetV1 + SSDLite	5.1M	1.3B	22.2%	270ms
MobileNetV2 + SSDLite	4.3M	0.8B	22.1%	200ms
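
The relative savings implied by the table can be checked with a few lines of arithmetic. The sketch below only hard-codes the figures from the table; the dictionary keys and the helper name are illustrative and not part of any released code.

```python
# Figures copied from the SSDLite comparison table above.
v1 = {"params_m": 5.1, "madds_b": 1.3, "map_pct": 22.2, "latency_ms": 270}
v2 = {"params_m": 4.3, "madds_b": 0.8, "map_pct": 22.1, "latency_ms": 200}


def relative_reduction(old, new):
    """Fractional reduction going from old to new (0.25 means 25% lower)."""
    return (old - new) / old


# Compare the cost columns; mAP is left out because it is nearly unchanged.
reductions = {
    key: relative_reduction(v1[key], v2[key])
    for key in ("params_m", "madds_b", "latency_ms")
}

for name, value in reductions.items():
    print(f"{name}: {value:.1%} lower with MobileNetV2")
```

By these numbers MobileNetV2 + SSDLite uses roughly 16% fewer parameters and 38% fewer multiply-adds, and runs with about 26% lower latency, at essentially the same mAP.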

Model	Params	Multiply-Adds	mIOU
MobileNetV1 + DeepLabV3	11.15M	14.25B	75.29%
MobileNetV2 + DeepLabV3	2.11M	2.75B	75.32%

Acknowledgements

We would like to acknowledge our core contributors Menglong Zhu, Andrey Zhmoginov and Liang-Chieh Chen. We also give special thanks to Bo Chen, Dmitry Kalenichenko, Skirmantas Kligys, Mathew Tang, Weijun Wang, Benoit Jacob, George Papandreou, Zhichao Lu, Vivek Rathod, Jonathan Huang, Yukun Zhu, and Hartwig Adam.

References

[1] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H, arXiv:1704.04861, 2017.

[2] MobileNetV2: Inverted Residuals and Linear Bottlenecks, Sandler M, Howard A, Zhu M, Zhmoginov A, Chen LC, arXiv:1801.04381, 2018.

[3] Rethinking Atrous Convolution for Semantic Image Segmentation, Chen LC, Papandreou G, Schroff F, Adam H, arXiv:1706.05587, 2017.

[4] Speed/accuracy trade-offs for modern convolutional object detectors, Huang J, Rathod V, Sun C, Zhu M, Korattikara A, Fathi A, Fischer I, Wojna Z, Song Y, Guadarrama S, Murphy K, CVPR 2017.

[5] Deep Residual Learning for Image Recognition, He K, Zhang X, Ren S, Sun J, arXiv:1512.03385, 2015.

1 The shortcut (also known as skip) connections, popularized by ResNets [5], are commonly used to connect the non-bottleneck layers. MobileNetV2 inverts this notion and connects the bottlenecks directly.
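
To make the footnote concrete, here is a minimal NumPy sketch of a block whose shortcut joins two narrow bottleneck tensors while the wide layers sit in between. It is an illustration under stated assumptions, not the released implementation: batch normalization and strides are omitted, and the function names, tensor sizes, and the expansion factor used below are chosen for the example.

```python
import numpy as np


def relu6(x):
    """ReLU capped at 6."""
    return np.clip(x, 0.0, 6.0)


def pointwise_conv(x, w):
    """1x1 convolution: x is (H, W, C_in), w is (C_in, C_out)."""
    return x @ w


def depthwise_conv3x3(x, w):
    """3x3 depthwise convolution, stride 1, zero padding; w is (3, 3, C)."""
    height, width, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + height, dx:dx + width, :] * w[dy, dx, :]
    return out


def inverted_residual(x, w_expand, w_depthwise, w_project):
    """Narrow input -> wide hidden layers -> narrow linear output, plus shortcut."""
    hidden = relu6(pointwise_conv(x, w_expand))             # expand to a wide layer
    hidden = relu6(depthwise_conv3x3(hidden, w_depthwise))  # per-channel spatial filter
    out = pointwise_conv(hidden, w_project)                 # linear bottleneck: no activation
    return x + out                                          # shortcut connects the bottlenecks


rng = np.random.default_rng(0)
channels, expansion = 4, 6  # illustrative sizes
x = rng.standard_normal((8, 8, channels))
w_expand = rng.standard_normal((channels, channels * expansion))
w_depthwise = rng.standard_normal((3, 3, channels * expansion))
w_project = rng.standard_normal((channels * expansion, channels))
y = inverted_residual(x, w_expand, w_depthwise, w_project)
print(y.shape)  # same shape as x, so the shortcut addition is valid
```

Because the shortcut adds the block input to the projected output, both ends of the skip connection are the low-dimensional tensors, which is the inversion the footnote describes.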

Behind the Motion Photos Technology in Pixel 2

Tuesday, March 13, 2018

Posted by Matthias Grundmann, Research Scientist and Jianing Wei, Software Engineer, Google Research

Motion photos on the Pixel 2 in Google Photos. With the camera frozen in place the focus is put directly on the subjects. For more examples, check out this Google Photos album.

Camera Motion Estimation by Combining Hardware and Software

Motion photo as captured (left) and after freezing the camera by combining hardware and software (right). For more comparisons, check out this Google Photos album.

Feature classification into background (green) and foreground (orange) by using the motion metadata from the hardware sensors of the Pixel 2. Notice how the new approach not only labels the skateboarder accurately as foreground but also the half-pipe that is at roughly the same depth.

Background motion estimation in motion photos. By using the motion metadata from Gyro and OIS we are able to accurately classify features from the visual analysis into foreground and background.
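
One way to picture the classification described in the caption: a rotation reported by the gyro predicts how distant background points should move between frames, and features that disagree with that prediction are treated as foreground. The sketch below assumes a pinhole camera with a known intrinsic matrix `K`, a rotation `R` that maps the previous camera frame to the current one, and a fixed pixel threshold. All of these names and values are hypothetical; the sketch ignores OIS motion, rolling shutter, and parallax, and is not the Pixel 2 pipeline.

```python
import numpy as np


def rotation_homography(K, R):
    """Image-space homography induced by a pure camera rotation R."""
    return K @ R @ np.linalg.inv(K)


def classify_background(pts_prev, pts_curr, K, R, threshold_px=2.0):
    """Boolean mask: True where a feature track matches the rotation-predicted motion."""
    H = rotation_homography(K, R)
    ones = np.ones((len(pts_prev), 1))
    projected = (H @ np.hstack([pts_prev, ones]).T).T
    predicted = projected[:, :2] / projected[:, 2:3]
    error = np.linalg.norm(predicted - pts_curr, axis=1)
    return error < threshold_px


# Hypothetical intrinsics and a small yaw between two frames.
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
angle = 0.01  # radians
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])

pts_prev = np.array([[100.0, 200.0], [640.0, 360.0], [900.0, 500.0]])
H = rotation_homography(K, R)
moved = (H @ np.hstack([pts_prev, np.ones((3, 1))]).T).T
pts_curr = moved[:, :2] / moved[:, 2:3]
pts_curr[2] += [15.0, 0.0]  # the third feature also moves on its own

mask = classify_background(pts_prev, pts_curr, K, R)
print(mask)  # first two features follow the camera rotation, the third does not
```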

Motion Photo Stabilization and Playback

Motion photos stabilize even complex scenes with large foreground motions.

Motion Photo Sharing

Acknowledgements

Motion photos is a result of a collaboration across several Google Research teams, Google Pixel and Google Photos. We especially want to acknowledge the work of Karthik Raveendran, Suril Shah, Marius Renn, Alex Hong, Radford Juang, Fares Alhassen, Emily Chang, Isaac Reynolds, and Dave Loxton.

Semantic Image Segmentation with DeepLab in TensorFlow

Monday, March 12, 2018

Posted by Liang-Chieh Chen and Yukun Zhu, Software Engineers, Google Research

Acknowledgements

We would like to thank Iasonas Kokkinos, Kevin Murphy, Alan L. Yuille (co-authors of DeepLab-v1 and -v2), as well as Mark Sandler, Andrew Howard, Menglong Zhu, Chen Sun, Derek Chow, Andre Araujo, Haozhi Qi, Jifeng Dai, and the Google Mobile Vision team for their support and valuable discussions.

References

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation, Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam, arXiv:1802.02611, 2018.

Xception: Deep Learning with Depthwise Separable Convolutions, François Chollet, Proc. of CVPR, 2017.

Deformable Convolutional Networks — COCO Detection and Segmentation Challenge 2017 Entry, Haozhi Qi, Zheng Zhang, Bin Xiao, Han Hu, Bowen Cheng, Yichen Wei, and Jifeng Dai, ICCV COCO Challenge Workshop, 2017.

Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs, Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille, Proc. of ICLR, 2015.

Deeplab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille, TPAMI, 2017.

Rethinking Atrous Convolution for Semantic Image Segmentation, Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam, arXiv:1706.05587, 2017.

* DeepLab-v3+ is not used to power Pixel 2's portrait mode or real-time video segmentation. These are mentioned in the post as examples of features this type of technology can enable.

Introducing the iNaturalist 2018 Challenge

Friday, March 09, 2018

Posted by Yang Song, Staff Software Engineer and Serge Belongie, Visiting Faculty, Google Research

The map on the right shows where the photo was taken. Image credit: Serge Belongie.

Distribution of training images per species for iNat-2017 and iNat-2018, plotted on a log-linear scale, illustrating the long-tail behavior typical of fine-grained classification problems. Image Credit: Grant Van Horn and Oisin Mac Aodha.

Acknowledgements

We’d like to thank our colleagues and friends at iNaturalist, Visipedia, and FGVC5 for working together to advance this important area. At Google we would like to thank Hartwig Adam, Weijun Wang, Nathan Frey, Andrew Howard, Alessandro Fin, Yuning Chai, Xiao Zhang, Jack Sim, Yuan Li, Grant Van Horn, Yin Cui, Chen Sun, Yanan Qian, Grace Vesom, Tanya Birch, Celeste Chung, Wendy Kan, and Maggie Demkin.

Google-Landmarks: A New Dataset and Challenge for Landmark Recognition

Thursday, March 01, 2018

Posted by André Araujo and Tobias Weyand, Software Engineers, Google Research

Geographic distribution of landmarks in our dataset.

A few examples of images from the Google-Landmarks dataset, including landmarks such as Big Ben, Sacre Coeur Basilica, the rock sculpture of Decebalus and the Megyeri Bridge, among others.

Acknowledgments

Jack Sim, Will Cukierski, Maggie Demkin, Hartwig Adam, Bohyung Han, Shih-Fu Chang, Ondrej Chum, Torsten Sattler, Giorgos Tolias, Xu Zhang, Fernando Brucher, Marco Andreetto, Gursheesh Kour.

The Instant Motion Tracking Behind Motion Stills AR

Tuesday, February 06, 2018

Posted by Jianing Wei and Tyler Mullen, Software Engineers, Google Research

Motion Stills with instant motion tracking in action

When the phone is approximately steady, the accelerometer sensor provides the acceleration due to the Earth’s gravity. For horizontal planes, the gravity vector is parallel to the normal of the tracked plane and can accurately provide the initial orientation of the phone.
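
As a rough illustration of the idea above, the sketch below turns a single accelerometer reading into a plane normal in device coordinates and a tilt angle. It assumes a sensor convention in which a device lying flat reports gravity along its +z axis; the steadiness tolerance and the function names are hypothetical choices for the example.

```python
import numpy as np

G = 9.81  # standard gravity in m/s^2


def plane_normal_from_accel(accel, tolerance=0.5):
    """Unit normal of a horizontal plane in device coordinates, or None if not steady."""
    accel = np.asarray(accel, dtype=float)
    magnitude = np.linalg.norm(accel)
    if abs(magnitude - G) > tolerance:
        return None  # extra acceleration present, so the reading is not pure gravity
    return accel / magnitude


def tilt_angle_deg(normal):
    """Angle between the device z axis and the plane normal, in degrees."""
    cos_angle = np.clip(normal[2], -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))


flat = plane_normal_from_accel([0.0, 0.0, 9.81])
tilted = plane_normal_from_accel([0.0, G * np.sin(np.pi / 4), G * np.cos(np.pi / 4)])
print(tilt_angle_deg(flat), tilt_angle_deg(tilted))
```

The first reading corresponds to a phone held parallel to the ground and the second to a phone pitched by 45 degrees. A reading whose magnitude is far from gravity is rejected, because the phone is then not approximately steady.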

Instant Motion Tracking

The translation and the change in size (relative scale) of the box in the image plane can be used to determine the 3D translation between two camera positions, C1 and C2. However, as our camera model doesn’t assume the focal length of the camera lens, we do not know the true distance/depth of the tracked plane.

To account for this, we added scale estimation to our existing tracker (the one used in Motion Text) as well as region tracking outside the field of view of the camera. When the camera gets closer to the tracked surface, the virtual content scales accurately, which is consistent with how real-world objects are perceived. When you pan outside the field of view of the target region and back, the virtual object will reappear in approximately the same spot.
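
The relationship between box motion, relative scale, and camera translation can be sketched with a pinhole model. The code below is an illustration only: it assumes rotation has already been compensated, uses a placeholder focal length, and fixes the unknown initial depth to 1, so the result is expressed in units of that depth. The function and parameter names are hypothetical.

```python
import numpy as np


def relative_translation(center1, size1, center2, size2, focal=1.0, depth1=1.0):
    """Camera translation from C1 to C2, in units of the unknown initial depth.

    Centers are image coordinates relative to the principal point and sizes are
    apparent box widths, both in the same units as `focal`.
    """
    depth2 = depth1 * size1 / size2  # apparent size is inversely proportional to depth
    p1 = depth1 * np.array([center1[0] / focal, center1[1] / focal, 1.0])
    p2 = depth2 * np.array([center2[0] / focal, center2[1] / focal, 1.0])
    return p1 - p2  # the tracked region is static, so the camera moved by p1 - p2


# The box stays centered and doubles in size: the camera moved toward the plane.
forward = relative_translation((0.0, 0.0), 100.0, (0.0, 0.0), 200.0)

# The box keeps its size and shifts right in the image: the camera moved left.
sideways = relative_translation((0.0, 0.0), 100.0, (0.2, 0.0), 100.0)
print(forward, sideways)
```

In the first case the camera has covered half of the initial distance to the plane, whatever that distance is in metres. This is why the virtual content can scale plausibly even though the true depth is unknown.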

Independent translation (from visual signal only as shown by red box) and rotation tracking (from gyro; not shown)

After all this, we obtain the device’s 3D rotation (roll, pitch and yaw) using the phone’s built-in gyroscope. The estimated 3D translation combined with the 3D rotation provides us with the ability to render the virtual content correctly in the viewfinder. And because we treat rotation and translation separately, our instant motion tracking approach is calibration-free and works on any Android device with a gyroscope.

Augmented chicken family with Motion Stills AR mode

We are excited to bring this new mode to Motion Stills for Android, and we hope you’ll enjoy it. Please download the new release of Motion Stills and keep sending us feedback with #motionstills on your favorite social media.

Acknowledgements

For rendering, we are thankful we were able to leverage Google’s Lullaby engine using animated Poly models. A thank you to our team members who worked on the tech and this launch with us: John Nack, Suril Shah, Igor Kibalchich, Siarhei Kazakou, and Matthias Grundmann.

Introducing the CVPR 2018 Learned Image Compression Challenge

Wednesday, January 10, 2018

Posted by Michele Covell, Research Scientist, Google Research

Edit 17/01/2018: Due to popular request, the CLIC competition submission deadline has been extended to April 22. Please see compression.cc for more details.

Training set of 1,633 uncompressed images from both the Mobile and Professional datasets, available on compression.cc

Introducing NIMA: Neural Image Assessment

Monday, December 18, 2017

Posted by Hossein Talebi, Software Engineer and Peyman Milanfar, Research Scientist, Machine Perception

Ranking some examples labelled with the “landscape” tag from the AVA dataset using NIMA. Predicted NIMA (and ground truth) scores are shown below each image.
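
A ranking like the one in this figure can be sketched as follows, assuming (as described in the NIMA paper) that the model outputs a probability distribution over the ratings 1 to 10 for each image and that the single score shown is the mean of that distribution. The image names and distributions below are made up for illustration and are not model outputs.

```python
import numpy as np


def mean_score(distribution):
    """Expected rating for a distribution over the scores 1..N."""
    p = np.asarray(distribution, dtype=float)
    p = p / p.sum()  # tolerate slightly unnormalized inputs
    return float(np.dot(p, np.arange(1, len(p) + 1)))


def rank_images(predictions):
    """Image names sorted from highest to lowest mean score."""
    return sorted(predictions, key=lambda name: mean_score(predictions[name]), reverse=True)


# Hypothetical per-image rating distributions.
predictions = {
    "flat_lighting.jpg": [0.1] * 10,
    "sunset.jpg": [0.0, 0.0, 0.0, 0.0, 0.1, 0.2, 0.4, 0.2, 0.1, 0.0],
}

for name in rank_images(predictions):
    print(f"{name}: {mean_score(predictions[name]):.2f}")
```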

TID2013

Ranking some examples from the TID2013 dataset using NIMA. Predicted NIMA scores are shown below each image.

Perceptual Image Enhancement

NIMA can be used as a training loss to enhance images. In this example, the local tone and contrast of images are enhanced by training a deep CNN with NIMA as its loss. Test images are obtained from the MIT-Adobe FiveK dataset.

Looking Ahead

Fused Video Stabilization on the Pixel 2 and Pixel 2 XL

Friday, November 10, 2017

Posted by Chia-Kai Liang, Senior Staff Software Engineer and Fuhao Shi, Android Camera Team

A simulated rendering of a video with global (left) and rolling (right) shutter.
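
The difference between the two shutter types can be reproduced with a toy simulation. In the sketch below a vertical bar moves to the right while the sensor is read out. With a global shutter every row is sampled at the same instant; with a rolling shutter each row is sampled slightly later than the one above it, so the bar appears slanted. The frame size, speed, and per-row delay are arbitrary illustrative values.

```python
import numpy as np


def capture(height, width, x0, velocity, line_delay, bar_width=4):
    """Image of a vertical bar moving right at `velocity` pixels per millisecond.

    Row r is read out at time r * line_delay (ms); line_delay=0 models a global shutter.
    """
    row_times = np.arange(height) * line_delay
    left = np.round(x0 + velocity * row_times).astype(int)  # bar position when each row is read
    cols = np.arange(width)[None, :]
    return ((cols >= left[:, None]) & (cols < left[:, None] + bar_width)).astype(np.uint8)


global_frame = capture(8, 32, x0=5, velocity=4.0, line_delay=0.0)
rolling_frame = capture(8, 32, x0=5, velocity=4.0, line_delay=0.5)

for row in rolling_frame:
    print("".join("#" if value else "." for value in row))
```

Printing `global_frame` the same way shows a straight bar, while the rolling-shutter frame shears it by two pixels per row. That shear is the distortion a stabilizer has to undo.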

The video is taken by Pixel 2 with only OIS enabled. You can see the frame center is stabilized, but the boundaries have some jello-like artifacts.

Making a Better Video: Fused Video Stabilization

Motion Analysis

Left: The stabilized video of a “running” motion with a 3ms timing error. Note the occasional jittering. Right: The stabilized video with correct timestamps. The bottom right corner shows the original shaky video.

Motion Filtering

Frame Synthesis

Left: The input video with mesh overlay. Right: The warped frame, and the red rectangle is the final stabilized output. Note how the non-rigid warping corrects the rolling shutter distortion.

Lookahead Motion Filtering

Left: The input unstabilized video. Right: The smoothed result after Gaussian filtering.
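
The Gaussian smoothing in this comparison can be sketched on a one-dimensional camera path. This is a minimal illustration rather than the production filter: the path is a single value per frame, the kernel width is an arbitrary choice, and the synthetic "hand shake" is random noise added to a slow pan.

```python
import numpy as np


def gaussian_kernel(sigma, radius):
    """Normalized Gaussian weights for offsets -radius..radius."""
    offsets = np.arange(-radius, radius + 1)
    weights = np.exp(-0.5 * (offsets / sigma) ** 2)
    return weights / weights.sum()


def smooth_path(path, sigma=5.0):
    """Gaussian-smooth a 1-D camera path (one value per frame), replicating edge values."""
    radius = int(3 * sigma)
    padded = np.pad(np.asarray(path, dtype=float), radius, mode="edge")
    return np.convolve(padded, gaussian_kernel(sigma, radius), mode="valid")


rng = np.random.default_rng(0)
frames = np.arange(120)
raw = 0.5 * frames + rng.normal(scale=3.0, size=frames.size)  # slow pan plus shake
smoothed = smooth_path(raw)
correction = smoothed - raw  # per-frame shift a stabilizer would apply

print(np.abs(np.diff(raw)).mean(), np.abs(np.diff(smoothed)).mean())
```

Each smoothed sample depends on frames both before and after it, so a filter of this kind needs access to a short window of future frames.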

Left: The Gaussian filtered result. Right: Our lookahead result. We predict that the user is panning to the right, and suppress more vertical motions.

Left: Our lookahead result. The undefined area at the bottom-left is shown in cyan. Right: The final result with the bad region removed.

Left: Pixel 2 with OIS only. Right: Pixel 2 with the basic Fused Video Stabilization. Note the sharpness variation around the “Exit” label.

Left: Pixel 2 with the basic Fused Video Stabilization. Right: The full Fused Video Stabilization solution with motion blur masking.

Results

Videos taken by two Pixel 2 phones mounted on a single hand grip. Fused Video Stabilization is disabled in the left one.

Videos taken by two Pixel 2 phones mounted on a single hand grip. Fused Video Stabilization is disabled in the left one. Note that the videographer jumped together with the subject.

Acknowledgements

Fused Video Stabilization is a large-scale effort across multiple teams in Google, including the camera algorithm team, sensor algorithm team, camera hardware team, and sensor hardware team.