In the race among tech companies to attract top talent in artificial intelligence, Yahoo Inc. is making a dramatic move: giving away a huge amount of data about how users interact with its services.
On Thursday, the embattled Internet company said it would release the largest cache of Internet behavior data: the clicks, hovers and scrolls of some 20 million anonymous users on Yahoo’s sports, finance, news, real estate and other pages. The trove, which will be available only to universities, is expected to give researchers a rare, real-world look at how large numbers of people behave online.
Yahoo, which is facing a brain drain after years of stagnant growth, is looking to attract academic researchers in the fast-growing and highly competitive field of artificial intelligence.
The Yahoo data dump comes at a time when technology companies are racing to strengthen their ties with academia, particularly in areas of artificial intelligence known as machine learning and deep learning, which involve training machines to mine massive data sets so they can respond to complex queries or make predictions.
Facebook Inc. and Google have recruited top researchers; for instance, Yann LeCun, who joined Facebook in 2013, continues to run New York University’s Center for Data Science.
“No matter how much talent you have, there is always more on a manager’s bucket list,” said Andrew Moore, dean of the School of Computer Science at Carnegie Mellon University. “No one in these big technology companies feels like they have enough people to do the things they want to do.”
Large quantities of data are necessary for machine learning, in which computers spot complex patterns and figure out, in Yahoo’s case, say, what kinds of headlines or design features attract teenage girls living in Rapid City, S.D., at 7:30 p.m. Such data sets are rare outside major Internet companies, and they’re closely held for what they can reveal about the business. The Yahoo data set weighs in at 13.5 terabytes, about two-thirds the size of the Library of Congress.
That is larger than anything available to the vast majority of academic computer scientists, and so big that it likely will have to be stored outside a university system, possibly in a cloud computing center run by Amazon.com Inc. or Alphabet Inc.’s Google, said Carnegie Mellon’s Moore, a former Google executive. The university signed a five-year, $10 million partnership with Yahoo last year to develop personalized apps based on user data.
“Data is not easy to come by for folks not inside companies,” said Gert Lanckriet, a professor in the Department of Electrical and Computer Engineering at the University of California, San Diego, who spoke at an event announcing the data release.
The Yahoo cache’s sheer size makes it valuable, experts said. Algorithms capable of analyzing large amounts of data differ fundamentally from those designed for less data. Yahoo’s generous release can help researchers learn how to build large-scale algorithms, which are especially useful to corporations. Yahoo has released over 50 data sets since 2006, including a cache of 100 million Flickr photos in 2014. Its largest past release was 413 gigabytes, a fraction of the current set. Google and Amazon have released relatively little data.
Tension is higher than ever between the need to attract talent and generate new ideas, on one hand, and the need to protect privacy and competitive advantage, on the other, said Hilary Mason, the founder of Fast Forward Labs, a data science startup. Many of the large technology companies are trying to create the same sorts of capabilities, she said, such as self-driving cars, image recognition, and personalized services. Yahoo runs a small risk of exposing trade secrets by releasing user data, but it has decided that the reward of attracting talent could be greater.
While several companies have released data aimed at researchers, the practice has a fraught history.
AOL, in the process of releasing data to researchers in 2006, accidentally revealed search queries. Netflix released the movie recommendations and logs of hundreds of thousands of customers in 2009, offering a $1 million prize to anyone who improved its recommendation algorithm. In both instances, outsiders used the data to deduce users’ identities, leading to class action lawsuits over violations of privacy laws. Netflix cancelled its prize.
A Facebook study that adjusted the content of users’ news feeds to generate emotional responses set off a huge privacy backlash. Facebook since has limited the proprietary data it makes available to outsiders.
“Ever since the AOL privacy debacle in 2006, companies have been afraid to release data,” Ms. Mason said.
Yahoo’s cache appears to be less sensitive. It includes only basic demographic information such as city, age, and gender, along with clicks and other interactions with Yahoo Web properties.
- published: 15 Jan 2016