With 170 million active users and 25 million songs, Spotify is the world's largest music streaming subscription service. Undergirding the 1 billion plays per day is a digital infrastructure that is slowly shifting.
Spotify open sourced its in-house container orchestration service, Helios, in 2014. After several years of use, Spotify decided to switch from Helios to Kubernetes, another orchestration system released shortly after Helios that has since become the de facto orchestration platform.
Kubernetes is backed by thousands of developers and a huge ecosystem, and trying to reach feature parity with an in-house system not widely adopted by other enterprises is difficult, even for a business as large as Spotify, according to James Wen, a site reliability engineer at Spotify, speaking at DevFest DC last week.
It became clear that Spotify needed a managed solution rather than operating clusters from scratch. By moving to Kubernetes, Spotify would benefit from:
Cloud-native "magic," such as autoscaling, better resource utilization and self-healing
Less capacity planning for developers
Less proprietary technology
Faster experimentation and operations
How Spotify did it, and is still doing it
Although they had a clear goal in mind — to run all stateless services on Kubernetes — Wen and his colleagues couldn't move every service and team to Kubernetes in one fell swoop. Spotify's engineering organization sits between centralized operations and team autonomy: each team has operational responsibility for specific services, with set processes common across the business, similar to Netflix's "Paved Road."
Spotify decided to start small, experimenting with running one service on one Kubernetes cluster and then moving up to three services on a shared cluster for a few days, according to Wen. By working with a few services at a time, the transition would only affect a few teams while the kinks were worked out.
Spotify set up permissioning by namespace so developers wouldn't disrupt resources in another team's space, resource quotas so no single team could take up too many resources and developer documentation to communicate between the three teams.
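Namespace-scoped permissions and resource quotas of the kind described above can be expressed with standard Kubernetes objects. A minimal sketch follows — the team name, namespace, and quota figures are illustrative assumptions, not Spotify's actual configuration:

```yaml
# Grant a hypothetical "team-a" group edit rights only inside its own namespace,
# so its developers can't disrupt resources in another team's space.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-edit
  namespace: team-a
subjects:
  - kind: Group
    name: team-a
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit          # built-in role: read/write most namespaced resources
  apiGroup: rbac.authorization.k8s.io
---
# Cap how much of the shared cluster this namespace can consume,
# so no single team can take up too many resources.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "100"
```

Because the quota is enforced per namespace, the same pattern scales from the three teams in the experiment to the whole organization by stamping out one namespace, RoleBinding, and ResourceQuota per team.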
With the experimentation phase complete, Spotify moved on to the alpha phase, asking R&D teams to volunteer services they were interested in running on Kubernetes. Wen and his team helped migrate the services and pipelines to Kubernetes.
During this phase, it was okay if the migrated services weren't yet operationally sound, said Wen. The services were still running on both Helios and Kubernetes, so Spotify could wind down Kubernetes and fall back to Helios if an incident occurred.
For the second part of the alpha phase, Spotify ran two complex, high-traffic services on shared clusters. The move required that they understand network setup at a granular level, allowed them to experiment with autoscaling and provided a reference point and confidence for other teams that would be migrating their services in the future.
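The autoscaling experiments mentioned here map onto Kubernetes' HorizontalPodAutoscaler. A minimal CPU-based sketch, with a made-up service name and thresholds chosen purely for illustration:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: playlist-service    # hypothetical high-traffic service
  namespace: team-a
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: playlist-service
  minReplicas: 3            # keep a baseline for a high-traffic service
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas when average CPU exceeds 70%
```

Running an autoscaler like this on a shared cluster is exactly where the granular network and resource understanding mentioned above matters: scale-out events change pod counts and traffic patterns for everyone on the cluster.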
Spotify is currently in the beta phase of self-service migration, in which teams that want to move to Kubernetes can follow the documentation to partially or fully move to Kubernetes, Wen said. Some services are now fully off of Helios and only running on Kubernetes.
In the general availability phase, any new service at Spotify will be deployed only on Kubernetes, with tools for developers such as a one-click migration tool, vertical autoscaling, custom metrics autoscaling, IT general controls and migration tracking.
Spotify previously used a road team to help transition teams from bare metal to Google Cloud Platform, and engineers familiar with Kubernetes will conduct a similar operation, going around Spotify offices to help developers and teams make the migration.
Lessons and takeaways
In the transition process, Spotify was able to build custom extensions on top of Kubernetes, such as admission controllers and resource labels and metadata, and to integrate Spinnaker, an open source software delivery platform developed by Netflix.
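Custom admission controllers like those mentioned above are typically wired into a cluster as admission webhooks. A minimal ValidatingWebhookConfiguration sketch — the policy, webhook name, and backing service are all assumptions for illustration, not Spotify's actual setup:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: require-owner-label    # hypothetical policy: every pod must declare an owner
webhooks:
  - name: owner-label.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    clientConfig:
      service:
        name: label-validator   # in-cluster service implementing the check
        namespace: platform
        path: /validate
    failurePolicy: Ignore       # don't block deploys if the webhook is down
```

The API server calls the named service on every matching pod creation and rejects the request if the webhook says no, which is how organization-wide conventions (such as required resource labels and metadata) can be enforced without forking Kubernetes itself.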
Moving forward, the company still has to tackle a few challenges, including cluster management, multicluster operations in each region and building up support for data jobs, machine learning workloads and GPU workloads, according to Wen.
The company has learned a few lessons along the way to make future projects a little easier. Being mindful and deliberate in choosing terminology is incredibly important when bringing in outside systems with large ecosystems, according to Wen. "Overloaded" terminology can cause communication barriers between teams.
Moving step by step with steadily increasing goals, instead of attempting a single, monolithic migration, allowed Spotify to grow the scope and complexity of the effort and handle unknown factors at a manageable pace, keeping developer morale up. And, most importantly, companies undergoing a migration should talk to other companies and peers about infrastructure solutions, according to Wen.
Correction: The amount of capacity planning for developers was clarified.