Google Research Blog
The latest news from Research at Google
Distributing the Edit History of Wikipedia Infoboxes
Thursday, May 30, 2013
Posted by Enrique Alfonseca, Google Research
Aside from its value as a general-purpose encyclopedia, Wikipedia is also one of the most widely used resources to acquire, either automatically or semi-automatically, knowledge bases of structured data. Much research has been devoted to automatically building disambiguation resources, parallel corpora and structured knowledge from Wikipedia. Still, most of those projects have been based on single snapshots of Wikipedia, extracting the attribute values that were valid at a particular point in time. So about a year ago we compiled and released a data set that allows researchers to see how data attributes can change over time.
Figure 1. Infobox for the Republic of Palau in 2006 and 2013 showing the capital change.
Many attributes vary over time. These include the presidents of countries, the spouses of people, the populations of cities and the number of employees of companies. Every Wikipedia page has an associated history from which users can view and compare past versions. Having the historical values of infobox entries available provides an overview of the changes affecting each entry. It helps to understand which attributes are more likely to change over time or change at regular intervals, and which ones attract more user interest and are actually updated in a timely fashion. We believe that such a resource will also be useful in training systems to learn to extract data from documents, as it will allow us to collect more training examples by matching old values of an attribute inside old pages.
For this reason, we released, in collaboration with Wikimedia Deutschland e.V., a resource containing all the edit history of infoboxes in Wikipedia pages. While this was already available indirectly in Wikimedia’s full history dumps, the smaller size of the released dataset will make it easier to download and process this data. The released dataset contains 38,979,871 infobox attribute updates for 1,845,172 different entities, and it is available for download. A description of the dataset can be found in our paper WHAD: Wikipedia Historical Attributes Data, accepted for publication at the Language Resources and Evaluation journal.
What kind of information can be learned from this data? Some examples from preliminary analyses include the following:
Every country in the world has a population attribute in its Wikipedia infobox, which is updated at least yearly for more than 90% of them. The average error rate with respect to the yearly World Bank estimates is between two and three percent, mostly due to rounding.
50% of deaths are recorded in Wikipedia infoboxes within a couple of days... but for scientists it takes 31 days to reach 50% coverage!
For the last episode of TV shows, the airing date is updated for 50% of them within 9 days; for the first episode of TV shows, it takes 106 days.
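Figures like "50% coverage within N days" can be computed from pairs of dates: when an event happened and when the infobox was updated. The sketch below is only an illustration, not the analysis used for the numbers above. The function name, the sample dates, and the choice to ignore events that were never updated are all assumptions made for the example.

```python
from datetime import date
import math

def days_to_coverage(lags_in_days, fraction=0.5):
    """Smallest lag (in days) by which at least `fraction` of events were updated."""
    if not lags_in_days:
        raise ValueError("need at least one lag")
    ordered = sorted(lags_in_days)
    # Number of events that must be covered to reach the target fraction.
    needed = max(math.ceil(fraction * len(ordered)), 1)
    return ordered[needed - 1]

# Hypothetical (event date, infobox update date) pairs, invented for illustration.
events = [
    (date(2012, 1, 1), date(2012, 1, 2)),
    (date(2012, 3, 5), date(2012, 3, 5)),
    (date(2012, 6, 10), date(2012, 7, 20)),
    (date(2012, 9, 1), date(2012, 9, 4)),
]

lags = [(updated - happened).days for happened, updated in events]
half_coverage_days = days_to_coverage(lags)
print(half_coverage_days)
```

A fuller analysis would also need to decide how to treat events that never appear in an infobox at all, which this sketch leaves out.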
While infobox attribute updates will be much easier to process as they transition into the Wikidata project, we are not there yet and we believe that the availability of this dataset will facilitate the study of changing attribute values. We are looking forward to the results of those studies.
Thanks to Googler Jean-Yves Delort, and to Guillermo Garrido and Anselmo Peñas from UNED, for putting this dataset together, and to Angelika Mühlbauer and Kai Nissen from Wikimedia Deutschland for their support. Thanks also to Thomas Hofmann and Fernando Pereira for making this data release possible.
Open Access for Publications
Wednesday, May 29, 2013
Posted by Alfred Spector, Vice President, Engineering
The Association for Computing Machinery (ACM) recently announced a new option for publication rights management, wherein researchers can choose to pay for the public to have perpetual open access to the publication. Google applauds this new option, and today we are announcing that we will pay the open access fees for all articles by Google researchers that are published in ACM journals. IEEE also has an open access option for some of its publications, and we also pay the open access fee for them and for publications in similar organizations.
Google has always believed that by improving access to the world’s knowledge, we can help improve everyone’s lives. When it comes to scientific research, we have consistently said that open access to publications speeds up research, accelerates innovation, and helps grow the global economy.
Policies like ACM’s continue to demonstrate the sustainability of open access publishing. ACM’s new option will also provide better access to the papers that we write at Google. We encourage researchers everywhere to pursue open access options whenever publishing articles, and to continue to make publications available as widely as possible, within their rights.
Explore more with Mapping with Google
Tuesday, May 28, 2013
Posted by Tina Ornduff, Program Manager
In September 2012 we launched Course Builder, an open source learning platform for educators or anyone with something to teach, to create online courses. This was our experimental first step in the world of online education, and since then the features of Course Builder have continued to evolve. Mapping with Google, our latest MOOC, showcases new features of the platform.
From your own backyard all the way to Mount Everest, Google Maps and Google Earth are here to help you explore the world. You can learn to harness the world’s most comprehensive and accurate mapping tools by registering for Mapping with Google.
Mapping with Google is a self-paced, online course developed to help you better navigate the world around you by improving your use of the new Google Maps, Maps Engine Lite, and Google Earth. All registrants will receive an invitation to preview the new Google Maps.
Through a combination of video and text lessons, activities, and projects, you’ll learn to do much more than look up directions or find your house from outer space. Tell a story of your favorite locations with rich 3D imagery, or plot sights to see on your upcoming trip and share with your travel buddies. During the course, you’ll have the opportunity to learn from Google experts and collaborate with a worldwide community of participants, via Google+ Hangouts and a course forum.
Mapping with Google will be offered from June 10 - June 24, and you can choose whether to explore the features of Google Maps, Google Earth, or both. In addition, you’ll have the option to complete a project, applying the skills you’ve learned to earn a certificate. Visit g.co/mappingcourse to learn more and register today.
The world is a big place; we like to think that you can make it a bit more manageable and adventurous with Google’s mapping tools.
Syntactic Ngrams over Time
Thursday, May 23, 2013
Posted by Yoav Goldberg, Professor at Bar Ilan University & Post-doc at Google 2011-2013
We are proud to announce the release of a very large dataset of counted dependency tree fragments from the English Books Corpus. This resource will help researchers, among other things, to model the meaning of English words over time and create better natural-language analysis tools. The resource is based on information derived from a syntactic analysis of the text of millions of English books.
Sentences in languages such as English have structure. This structure is called syntax, and knowing the syntax of a sentence is a step towards understanding its meaning. The process of taking a sentence and transforming it into a syntactic structure is called parsing. At Google, we parse a lot of text every day, in order to better understand it and be able to provide better results and services in many of our products.
There are many kinds of syntactic representations (you may be familiar with sentence diagramming), and at Google we've been focused on a certain type of syntactic representation called "dependency trees". The dependency-tree representation is centered around words and the relations between them. Each word in a sentence can either modify or be modified by other words. The various modifications can be represented as a tree, in which each node is a word.
For example, the sentence "we really like syntax" is analyzed as:
The verb "like" is the main word of the sentence. It is modified by a subject (denoted nsubj) "we", a direct object (denoted dobj) "syntax", and an adverbial modifier "really".
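The analysis described here can be written down as a small data structure in which every word points to the word it modifies. This is a minimal sketch rather than the format of any particular parser. The `Token` class and the `modifiers_of` helper are hypothetical names, and the `advmod` and `root` labels are assumptions, since the text above names only `nsubj` and `dobj`.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Token:
    index: int            # 1-based position in the sentence
    word: str
    head: Optional[int]   # index of the word this token modifies; None for the main word
    relation: str         # label of the arc from this token to its head

# "we really like syntax" as a dependency tree.
sentence = [
    Token(1, "we", 3, "nsubj"),
    Token(2, "really", 3, "advmod"),  # assumed label for the adverbial modifier
    Token(3, "like", None, "root"),   # assumed label for the main word
    Token(4, "syntax", 3, "dobj"),
]

def modifiers_of(tokens, head_word):
    """Return (relation, word) pairs for every token that modifies `head_word`."""
    head_indexes = {t.index for t in tokens if t.word == head_word}
    return [(t.relation, t.word) for t in tokens if t.head in head_indexes]

like_modifiers = modifiers_of(sentence, "like")
print(like_modifiers)
```

Storing only a head index and a relation label per word is enough to recover the whole tree, because each word has at most one head.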
An interesting property of syntax is that, in many cases, one could recover the structure of a sentence without knowing the meaning of most of the words. For example, consider the sentence "the krumpets gnorked the koof with a shlap". We bet you could infer its structure, and tell that a group of things called "krumpets" did something called "gnorking" to something called a "koof", and that they did so with a "shlap".
This property, by which you could infer the structure of the sentence based on various hints without knowing the actual meaning of the words, is very useful. For one, it suggests that even a computer could do a reasonable job at such an analysis, and indeed it can! While still not perfect, parsing algorithms these days can analyze sentences with impressive speed and accuracy. For instance, our parser correctly analyzes the made-up sentence above.
Let's try a more difficult example. Something rather long and literary, like the opening sentence of One Hundred Years of Solitude by Gabriel García Márquez, as translated by Gregory Rabassa:
Many years later, as he faced the firing squad, Colonel Aureliano Buendía was to remember that distant afternoon when his father took him to discover ice.
Pretty good for an automatic process, eh?
And it doesn’t end here. Once we know the structure of many sentences, we can use these structures to infer the meaning of words, or at least find words which have a similar meaning to each other.
For example, consider the fragments:
"order a XYZ"
"XYZ is tasty"
"XYZ with ketchup"
"juicy XYZ"
By looking at the words modifying XYZ and their relations to it, you could probably infer that XYZ is a kind of food. And even if you are a robot and don't really know what a "food" is, you could probably tell that the XYZ must be similar to other unknown concepts such as "steak" or "tofu".
But maybe you don't want to infer anything. Maybe you already know what you are looking for, say "tasty food". In order to find such tasty food, one could collect the list of words which are objects of the verb "ate", and are commonly modified by the adjectives "tasty" or "juicy". This should provide you with a large list of yummy foods.
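The "tasty food" search can be sketched as a filter over counted fragments. The tuples below are invented for illustration and are not taken from the released dataset. The `(head, relation, modifier, count)` layout, the `amod` label for adjectival modifiers, and the function name are all assumptions.

```python
from collections import defaultdict

# Hypothetical counted fragments: (head word, relation, modifier word, count).
fragments = [
    ("ate", "dobj", "steak", 120),
    ("ate", "dobj", "tofu", 45),
    ("ate", "dobj", "dinner", 300),
    ("steak", "amod", "juicy", 60),
    ("steak", "amod", "tasty", 25),
    ("tofu", "amod", "tasty", 12),
    ("dinner", "amod", "late", 80),
    ("car", "amod", "tasty", 1),
]

def tasty_foods(fragments, verb="ate", adjectives=("tasty", "juicy")):
    """Words that are objects of `verb` and are modified by one of `adjectives`."""
    eaten = defaultdict(int)      # object word -> times seen as object of the verb
    described = defaultdict(int)  # noun -> times modified by a target adjective
    for head, relation, modifier, count in fragments:
        if relation == "dobj" and head == verb:
            eaten[modifier] += count
        elif relation == "amod" and modifier in adjectives:
            described[head] += count
    candidates = [word for word in eaten if word in described]
    # Rank by how often the word carries one of the target adjectives.
    return sorted(candidates, key=lambda word: described[word], reverse=True)

foods = tasty_foods(fragments)
print(foods)
```

Requiring both conditions filters out things that are eaten but not described as tasty ("dinner") and things described as tasty but never eaten ("car").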
Imagine what you could achieve if you had hundreds of millions of such fragments. The possibilities are endless, and we are curious to know what the research community may come up with. So we parsed a lot of text (over 3.5 million English books, or roughly 350 billion words), extracted such tree fragments, counted how many times each fragment appeared, and put the counts online for everyone to download and play with.
350 billion words is a lot of text, and the resulting dataset of fragments is very, very large. The resulting datasets, each representing a particular type of tree fragment, contain billions of unique items, and each dataset’s compressed files take tens of gigabytes. Some coding and data analysis skills will be required to process it, but we hope that with this data amazing research will be possible, by experts and non-experts alike.
The dataset is based on the English Books corpus, the same dataset behind the ngram-viewer. This time there is no easy-to-use GUI, but we still retain the time information, so for each syntactic fragment, you know not only how many times it appeared overall, but also how many times it appeared in each year. You could, for example, look at the objects of the verb “drank” in each decade from 1900 to 2000 and learn how drinking habits changed over time: much more ‘beer’ and ‘coffee’, and somewhat less ‘wine’ and ‘glass’ (probably ‘of wine’). There’s also a drop in ‘whisky’, and an increase in ‘alcohol’. ‘Brandy’ catches on around the 1930s, and starts dropping around the 1980s. There is an increase in ‘juice’, and, thankfully, some decrease in ‘poison’.
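Because each fragment keeps its per-year counts, a decade-level view is a simple aggregation. The sketch below assumes the yearly counts for one fragment have already been read into `(year, count)` pairs. It makes no claim about the on-disk layout of the released files, and the sample numbers are invented.

```python
from collections import defaultdict

def counts_by_decade(yearly_counts, start=1900, end=2000):
    """Sum (year, count) pairs into decades, keeping years in [start, end)."""
    decades = defaultdict(int)
    for year, count in yearly_counts:
        if start <= year < end:
            decades[year - year % 10] += count
    return dict(sorted(decades.items()))

# Hypothetical per-year counts for one fragment, e.g. "coffee" attached to "drank".
coffee_after_drank = [
    (1899, 4), (1901, 10), (1907, 5), (1934, 20), (1999, 70), (2000, 90),
]

by_decade = counts_by_decade(coffee_after_drank)
print(by_decade)
```

To compare words fairly across decades, one would normally divide these raw counts by the total number of fragments seen in each decade, since the corpus contains far more text from recent years.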
The dataset is described in detail in this scientific paper, and is available for download here.
Launching the Quantum Artificial Intelligence Lab
Thursday, May 16, 2013
Posted by Hartmut Neven, Director of Engineering
We believe quantum computing may help solve some of the most challenging computer science problems, particularly in machine learning. Machine learning is all about building better models of the world to make more accurate predictions. If we want to cure diseases, we need better models of how they develop. If we want to create effective environmental policies, we need better models of what’s happening to our climate. And if we want to build a more useful search engine, we need to better understand spoken questions and what’s on the web so you get the best answer.
So today we’re launching the Quantum Artificial Intelligence Lab. NASA’s Ames Research Center will host the lab, which will house a quantum computer from D-Wave Systems, and the USRA (Universities Space Research Association) will invite researchers from around the world to share time on it. Our goal: to study how quantum computing might advance machine learning.
Machine learning is highly difficult. It’s what mathematicians call an “NP-hard” problem. That’s because building a good model is really a creative act. As an analogy, consider what it takes to architect a house. You’re balancing lots of constraints -- budget, usage requirements, space limitations, etc. -- but still trying to create the most beautiful house you can. A creative architect will find a great solution. Mathematically speaking the architect is solving an optimization problem and creativity can be thought of as the ability to come up with a good solution given an objective and constraints.
Classical computers aren’t well suited to these types of creative problems. Solving such problems can be imagined as trying to find the lowest point on a surface covered in hills and valleys. Classical computing might use what’s called “gradient descent”: start at a random spot on the surface, look around for a lower spot to walk down to, and repeat until you can’t walk downhill anymore. But all too often that gets you stuck in a “local minimum” -- a valley that isn’t the very lowest point on the surface.
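The local-minimum problem is easy to reproduce on a one-dimensional "surface". The function below is an arbitrary example chosen for this sketch because it has two valleys of different depth. The step size and the fixed starting points (used instead of random ones so the result is repeatable) are likewise assumptions.

```python
def f(x):
    """A surface with two valleys: a shallow one near x = 1.13, a deeper one near x = -1.30."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    """Derivative of f, i.e. the slope of the surface at x."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, learning_rate=0.01, steps=1000):
    """Repeatedly take a small step downhill from the starting point x."""
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x

from_right = gradient_descent(2.0)   # walks downhill into the shallow valley
from_left = gradient_descent(-2.0)   # walks downhill into the deeper valley

print(round(from_right, 2), round(f(from_right), 2))
print(round(from_left, 2), round(f(from_left), 2))
```

Started on the right-hand slope, the walk stops in the shallow valley and never sees the deeper one on the other side of the ridge, which is exactly the "stuck in a local minimum" situation.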
That’s where quantum computing comes in. It lets you cheat a little, giving you some chance to “tunnel” through a ridge to see if there’s a lower valley hidden beyond it. This gives you a much better shot at finding the true lowest point -- the optimal solution.
We’ve already developed some quantum machine learning algorithms. One produces very compact, efficient recognizers -- very useful when you’re short on power, as on a mobile device. Another can handle highly polluted training data, where a high percentage of the examples are mislabeled, as they often are in the real world. And we’ve learned some useful principles: e.g., you get the best results not with pure quantum computing, but by mixing quantum and classical computing.
Can we move these ideas from theory to practice, building real solutions on quantum hardware? Answering this question is what the Quantum Artificial Intelligence Lab is for. We hope it helps researchers construct more efficient and more accurate models for everything from speech recognition, to web search, to protein folding. We actually think quantum machine learning may provide the most creative problem-solving process under the known laws of physics. We’re excited to get started with NASA Ames, D-Wave, the USRA, and scientists from around the world.