Yahoo releases 13.5TB of data to help researchers

Dive Brief:

Machine learning is growing in popularity, but the average company can find it difficult to get the large data sets needed to test machine learning programs.
In response, Yahoo released the "largest ever" data set to machine learning scientists.
The data comes from interactions with the company's news feeds, including Yahoo News, Sports, Finance, Movies and Real Estate.

Dive Insight:

Computer scientists require large data sets to guide and test machine learning systems. To aid this effort, Yahoo yesterday released 110 billion records comprising 13.5TB of data for public use. The data is now available for download through Yahoo Labs' Webscope data sharing program and is more than ten times the size of the largest previously released data set, Yahoo said.

"Data is the life-blood of research in machine learning," the company said. "However, access to truly large-scale datasets is a privilege that has been traditionally reserved for machine learning researchers and data scientists working at large companies -- and out of reach for most academic researchers."

Machine learning is used for a wide range of tasks, including business applications such as content recommendation.

“Our goals are to promote independent research in the fields of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research,” said Suju Rajan, Director of Research for Personalization Science at Yahoo Labs.

Rajan said she “expects scientists to use the data to help build better recommendation engines, like those on Netflix and Amazon.” But it could also drive other research areas.

"They might be able to solve problems in a way that we can make use of at Yahoo, or come up with new research problems that we haven't even thought of yet," Rajan said.