Building a Golden Path from Production
Infrastructure, Platforms, DevOps, Systems Administration, SRE, call it what you will. This has been my life across three eras of the internet over the last three decades.
Along the way, I’ve had the privilege to play a part in scaling teams, infrastructure, and processes for three giants in their respective eras: AOL, YouTube/Google, and Dropbox. In the case of YouTube and Dropbox, I saw them grow and mature from small teams (tens of people in platform & infrastructure) to thousands. I also stayed long enough at each to see and learn from my mistakes.
One great part of seeing the internet in three different eras is the ability to derive principles that are long-lived and battle-tested. This post outlines Prodvana’s Golden Path -- our default software path to production. This is how we ship our product to you, our customers.
With our golden path, we can efficiently onboard a new engineer at Prodvana in 30 minutes and have them deploying code safely in less than 1 hour.
We’re sharing it here to provide a tangible north star with proven flight miles that you can replicate in your organization. If you want to deliver an amazing developer experience that leads to higher efficiency for your engineering teams, read on.
Our Golden Path Principles
Reliable - Don’t lose data. Be correct. Be available. Return responsively.
Principle of Least Surprise - The user should not be surprised by the outcome. People are part of the system, and their mental models should not be ignored. Systems should be debuggable.
Modular & Composable - Each component of the path should have clear interfaces that are reliable and operate under the principle of least surprise. As our needs evolve, modules should be able to be swapped out.
While these are certainly not the only principles that can work, these are the principles that guide and explain how we’ve built our Golden Path at Prodvana. This is ultimately the path that we use to ship software to you.
To get as close to you as possible, we start at the closest touch point: production. Then, we work backward to understand the best environment in which to develop the product that we are delivering for you. At Prodvana, we value our mission of intelligently delivering your software with zero overhead more than any one person’s preference for a hyper-specific toolchain.
Prodvana’s Golden Path
Start with the user. In this case, that’s you. Our users expect us to deliver a product that works. Our Golden Path is always in service to that.
Our Golden Path in production comprises three building blocks: a Reliability Platform, a Delivery Platform, and an Infrastructure Platform.
Our development path has two key building blocks: Cloud Development Environments and a Build & Test Platform.
Production is where we serve our customers.
Our users don’t care about our development environment or our test environment. They care about how the systems that they interact with behave.
We started here, applying our golden path principles as we selected the technology and created the framework that the technology would fit into. Starting from production keeps every choice tied to the outcome for the customer.
We wanted a production system that satisfies the principles outlined above. Ultimately, you could have more or fewer boxes in the diagram. Given our size and scale, we settled on three. We decided they need to be modular and composable, so we can easily swap one out or increase the complexity within a box when needed. If we need to add a new platform, we can attach it to the outputs that a box generates.
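To make the "boxes with inputs and outputs" idea concrete, here is a minimal sketch in Python. It is an illustration under our own assumptions, not how Prodvana is implemented: the `Platform` protocol, the `Delivery` and `AuditLog` classes, the `run_path` function, and every key name are hypothetical.

```python
from typing import Protocol


class Platform(Protocol):
    """One box in the path: it reads named inputs and returns named outputs."""

    def run(self, inputs: dict[str, str]) -> dict[str, str]: ...


class Delivery:
    """Hypothetical stand-in for a delivery box."""

    def run(self, inputs: dict[str, str]) -> dict[str, str]:
        return {"release": f"{inputs['image']} -> {inputs['environment']}"}


class AuditLog:
    """Hypothetical new box attached to the output that Delivery generates."""

    def run(self, inputs: dict[str, str]) -> dict[str, str]:
        return {"audit_entry": f"released {inputs['release']}"}


def run_path(platforms: list[Platform], seed: dict[str, str]) -> dict[str, str]:
    """Run each box in order, making earlier outputs available to later boxes."""
    state = dict(seed)
    for platform in platforms:
        state.update(platform.run(state))
    return state


result = run_path(
    [Delivery(), AuditLog()],
    {"image": "app:abc123", "environment": "staging"},
)
print(result["audit_entry"])
```

Because each box only depends on the names of its inputs and outputs, swapping `Delivery` for another implementation, or attaching a box like `AuditLog`, does not require touching the others.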
Reliability Platform
Job to be done: Store, Observe, and Notify us of anomalies in production.
Tech Stack: PagerDuty, Datadog, Detectify, Rootly, Vantage
The inputs: Datadog API, PagerDuty API
The outputs: Pages, Dashboards, Rootly Incident Management, Vantage Cost Reports
We treat the Reliability Platform as a closed loop; it is a watcher workflow. The data in the underlying subsystems, such as Datadog and PagerDuty, is used in control flows for the Delivery Platform, but the output is for humans.
These pieces of technology are chosen because they operate under the principle of least surprise: They are well-known industry standards, and they have flexible enough interfaces so that we can compose workflows on top of them. We plan to add a chaos system to this platform as we scale. Additionally, we expect that the monitoring aspects will require additional intelligence in the future for faster root cause analysis.
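As a rough illustration of reliability data feeding a delivery control flow, the sketch below decides whether a rollout may continue from a snapshot of monitoring signals. Everything here is an assumption made for the example: the `Signals` shape, the `MonitorState` values, and the `may_proceed` rule are hypothetical and do not mirror the Datadog or PagerDuty APIs.

```python
from dataclasses import dataclass
from enum import Enum


class MonitorState(Enum):
    OK = "ok"
    WARN = "warn"
    ALERT = "alert"


@dataclass(frozen=True)
class Signals:
    """Snapshot of reliability data; all field names are hypothetical."""

    monitor_states: dict[str, MonitorState]
    open_incidents: int


def may_proceed(signals: Signals) -> tuple[bool, list[str]]:
    """Return whether a rollout may continue, plus human-readable reasons if not."""
    reasons: list[str] = []
    if signals.open_incidents > 0:
        reasons.append(f"{signals.open_incidents} open incident(s)")
    for name, state in sorted(signals.monitor_states.items()):
        if state is MonitorState.ALERT:
            reasons.append(f"monitor '{name}' is alerting")
    return (not reasons, reasons)


ok, reasons = may_proceed(
    Signals({"api-latency": MonitorState.ALERT, "errors": MonitorState.WARN}, 1)
)
print(ok, reasons)
```

Returning the reasons alongside the decision keeps the gate debuggable, in line with the principle of least surprise: a blocked rollout says why it was blocked.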
Dynamic Delivery Platform
Job to be done: Deliver software intelligently with zero overhead.
Tech Stack: Prodvana
The inputs: Docker image, Environment Config
The outputs: Configuration for Production Environments
It may seem self-serving to use our own system, but there are five key reasons:
Uniform deployment flow for all engineers regardless of role.
A way to manage production constraints.
Ability to handle single and multi-tenant architectures - potentially with version skew for enterprise use cases that require high degrees of isolation.
A single place to catalog and deploy our services across multiple clouds.
Ability to handle infrastructure upgrades across multiple Kubernetes clusters.
Oh, and we love to dogfood.
Dynamic Delivery gives us all of the abilities listed.
With Dynamic Delivery, we’ve defined our environments and their constraints. These constraints include schema validations, SOC2 checks, and integrations with Datadog and PagerDuty. As we mature our environments, all of the services within Dynamic Delivery level up simultaneously. This saves us considerable time, which is extremely valuable for a small team.
Plus, no one wanted to work with the IaC toolchain to deploy code.
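To show what environment-level constraints can look like, here is a minimal sketch that applies one shared set of checks to every service. It is not Prodvana's configuration format or API: the `EnvironmentConfig` fields, the three rules, and the function names are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class EnvironmentConfig:
    """Hypothetical per-service config for one environment."""

    environment: str
    image: str
    replicas: int
    labels: dict[str, str] = field(default_factory=dict)


def check_constraints(cfg: EnvironmentConfig) -> list[str]:
    """Return the list of violated constraints (empty means the config passes)."""
    violations: list[str] = []
    if "@sha256:" not in cfg.image:
        violations.append("image must be pinned by digest")
    if cfg.environment == "production" and cfg.replicas < 2:
        violations.append("production needs at least 2 replicas")
    if "owner" not in cfg.labels:
        violations.append("missing 'owner' label")
    return violations


def check_all(services: dict[str, EnvironmentConfig]) -> dict[str, list[str]]:
    """Apply the same constraints to every service; report only the failures."""
    failures: dict[str, list[str]] = {}
    for name, cfg in services.items():
        violations = check_constraints(cfg)
        if violations:
            failures[name] = violations
    return failures


services = {
    "api": EnvironmentConfig("production", "app@sha256:abc", 3, {"owner": "platform"}),
    "worker": EnvironmentConfig("production", "worker:latest", 1),
}
print(check_all(services))
```

Since the rules live in one place, tightening a rule in `check_constraints` applies to every service at once, which is the "level up simultaneously" effect described above.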
Infrastructure Platform
Job to be done: Deliver the baseline environment in a closed loop (meaning we do not leak this into our Delivery Platform stack).
Tech Stack: GCP, GKE, Pulumi, AWS, and other cloud primitives
The inputs: Declarative Pulumi
The outputs: Cloud Configuration - nothing more
The infrastructure stack is the smallest portion of our Production stack. GCP provides a foundational layer for most of our underlying infrastructure needs. We use Pulumi to manage the creation of the resources that we need. We selected Pulumi over Terraform because of Pulumi’s testing frameworks and our engineering team’s familiarity with Python.
This layer is decoupled from the Dynamic Delivery Platform so that each can move and be deployed independently. Prodvana’s Runtime Abstraction provides an interface to this layer.
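The declarative input mentioned above means we describe the resources we want and let the tool work out the changes. The sketch below illustrates that idea in plain Python. It is not Pulumi's API and makes no claim about how Pulumi computes its plans; the `plan` function and the resource dictionaries are hypothetical.

```python
def plan(desired: dict[str, dict], current: dict[str, dict]) -> dict[str, list[str]]:
    """Compare desired and current resources by name and list the changes needed."""
    shared = desired.keys() & current.keys()
    return {
        "create": sorted(desired.keys() - current.keys()),
        "update": sorted(name for name in shared if desired[name] != current[name]),
        "delete": sorted(current.keys() - desired.keys()),
    }


# Hypothetical resource descriptions, keyed by resource name.
desired = {
    "cluster": {"nodes": 5},
    "bucket": {"location": "US"},
}
current = {
    "cluster": {"nodes": 3},
    "old-queue": {"region": "us-east1"},
}
print(plan(desired, current))
```

Because the comparison is a pure function of two descriptions, it can be unit-tested without touching a cloud account, which is the kind of property that makes a testable IaC toolchain attractive.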
Development is where we innovate.
Without innovation, we die. We strive to ensure a world-class development environment for ourselves and our business. Friction should be low here, and we want to ensure the ability to bring in best-in-class tools as needed.
We also believe that there is and should be a defined boundary between development and production. Today, we do this with a Docker image serving as the interface. This allows developers to move quickly by mocking out certain components without impacting production.
Build and Test
Job to be done: Generate a high-confidence artifact for production.
Tech Stack: GitHub Actions, Cypress, Mergify, Trunk.io
The inputs: Source Code
The output: Docker Image for production qualification
Our build and test environment at this stage is fairly simple. We use GitHub Actions (with its imperative config) to define our build and test steps and to push builds into our Docker registries.
Cloud Development Environment
Job to be done: Eliminate friction between your idea and testing.
Tech Stack: GitHub Codespaces, Docker Compose, Trunk.io
The Inputs: Ideas
The Output: Pull Request
Codespaces is the backbone of our development workflow. We use it heavily and build each of our subsystems as a Docker image. We run these images alongside basic infrastructure scaffolding such as kind (Kubernetes in Docker) to emulate a production environment for functional testing.
We recognize that there is no way to validate production fully in this space. Hence, we rely heavily on tests and our Dynamic Delivery Platform insights to inform us about issues during deployments.
After reading this, you may think, “Wow, Prodvana had the luxury of putting this together end-to-end in a greenfield environment.”
That is far from the truth. As a startup, we’re constantly moving fast and trying new things. With this framework, we have been able to pivot technical decisions, introduce new systems, and deploy software rapidly for our users. Additionally, by having this north star, we moved quickly by dividing and conquering the critical elements of the Golden Path. We did not wait for development to be great before we tackled production. We swapped components in the infrastructure layer seamlessly and efficiently as we continued to learn more about how we wanted our load-balancing and traffic layers configured.
You can apply the same thinking to larger brownfield golden path problems. I’ve personally had to do this twice in my career, at Dropbox and YouTube. In one case, I led a team of ~20 engineers in building the golden path for over 600 engineers on a Go, Python (including a 4 million line monolith), Rust, and Java code base. We migrated and sunset the previous flows in less than a year. This included building a remote development platform similar to GitHub Codespaces, migrating tens of thousands of tests to Bazel tests, and building an entirely new way of thinking about the development environment that was “unimpeachable.”
Prodvana’s Golden Path and the principles we used to define it are based on hard-won lessons from my experience leading teams in this space. This north star framework can be applied to organizations big and small, greenfield and brownfield.
If you’re interested in learning how Dynamic Delivery can fit into your Golden Path to drive efficiency, sign up here!