TIL how Reddit transformed its data culture and processes

With 14 billion screenviews a month across 138,000 active communities and 330 million active monthly users, Reddit is one of the most popular websites in the world.

But when Mike Doherty joined Reddit as data engineering lead in late 2014, the company had no extract, transform, load (ETL) processes or data pipelines and was relying on Google Analytics' free services, he wrote in a company blog post. That gave the team the chance to "build up data systems from scratch" and mature the company's data architecture to take advantage of its stores of user information.

Reddit's two primary data sources for deriving insights were HAProxy logs and pixel-tracking logs, and the company moved from simple Pig scripts to Amazon's MapReduce service to create its first data warehouse, according to Doherty. It brought in Apache Hive to query data for deeper analysis, used Azkaban and Jenkins for its first ETL iteration, and turned to third-party vendors for data visualization tools and centralized systems for running queries.

Reddit had entered the "Bronze Age of Data" with the stage set for the Iron Age, according to Doherty.

The company has since moved from intuition-based product and feature design to a data-literate and data-driven approach, ensuring design and business decisions are not made without first consulting the data, according to Luis Bitencourt-Emilio, director of engineering, in a Q&A with DataScience.com.

In April, the "front page of the internet," as the site is commonly described by its users, rolled out its first major redesign in a decade, walking a fine line between maintaining the core platform that had brought in hundreds of millions of users and upending a user interface and branding that those users had grown accustomed to.

The front-end redesign was just the start. Underneath the sleeker, more user-friendly interface were an overhaul of the company's data culture and a back-end transition away from legacy systems to make use of modern technologies such as artificial intelligence and machine learning.

AMA: How does Reddit's data operation work today?

The Reddit experience is as intimate as users want it to be. The site doesn't collect personal data, so submitting information such as a name or email address is up to the user, according to Steve Huffman, CEO of Reddit, in an April interview with The Wall Street Journal.

When it comes to advertising, the site tailors ads to users based on their interests, details that can be gleaned from which posts or subreddits (sub-communities dedicated to specific topics, such as technology, Q&As, prequel memes or corgis) a user interacts with.

Measuring metrics like time on site and making use of that data at scale is no small feat, and in the last year the company has increased its engineering base from 30 people to around 150, according to Nick Caldwell, VP of engineering, in the same interview with the Journal. The company also rebuilt its data pipelines, settling on a schema that everyone could agree on and that offered a consistent view of the data.

Shortly before the redesign, Reddit launched a machine learning personalization model, which immediately improved time on site by 8% to 10%, according to Huffman. The back-end changes also boosted developer productivity, allowing new environments to get up and running in a matter of minutes or hours instead of days, Caldwell said.

While Reddit has a centralized data science team reporting to one director, the company decided to embed data workers in product teams so they can work closely with engineers, designers and product managers and have an impact close to end users, according to Bitencourt-Emilio.

By embedding data scientists throughout product teams, units are more cohesive and can figure out how to improve features or iterate on experiments without crossing team lines, Bitencourt-Emilio said. This is important because with every "hop" between two people, information can get lost in communication; fewer hops between data scientists and engineers, or even between data engineers and data scientists, means more efficiency and speed.

Breaking down information silos and ensuring titles aren't a barrier to collaboration helps teams function as a cohesive unit, but it can also make defining the role of the data scientist at Reddit difficult, Bitencourt-Emilio said. Across teams, roles can vary widely.

For example, UX developers run different experiments than the machine learning team, which measures time on site or return visitors, and the skills and work involved can differ between the two.