Let’s talk about something fundamental, something often relegated to the last minute, but which, when embraced early, can elevate the craft of software engineering from mere coding to true engineering excellence.

I’m speaking of Service Level Objectives (SLOs) and Service Level Indicators (SLIs).

Remind me what they are again?

SLI - Service Level Indicator

A quantitative metric for a service’s performance, as experienced by the user of the service. It is a measure of a property of the service that is a good proxy for your user experience.

This is your design input. It forces you to define what success looks like. If you don’t know what this is, you can’t instrument your application code.

Some examples include successful request rate and request latency.
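To make that concrete, an SLI is typically expressed as the ratio of good events to valid events over a measurement window. A minimal sketch, with the request counts as purely illustrative inputs:

```python
# Hypothetical counts gathered from telemetry over one measurement window.
total_requests = 1_000_000
successful_requests = 999_200        # non-5xx responses
requests_under_300ms = 985_000       # served within the agreed latency threshold

# Availability SLI: proportion of requests that succeeded.
availability_sli = successful_requests / total_requests     # 0.9992 -> 99.92%

# Latency SLI: proportion of requests faster than the threshold.
latency_sli = requests_under_300ms / total_requests         # 0.985 -> 98.50%

print(f"Availability SLI: {availability_sli:.2%}")
print(f"Latency SLI: {latency_sli:.2%}")
```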

SLO - Service Level Objective

This is a specific target value, or range of values, for an SLI that the service must maintain over a period of time.

It determines your investment in infrastructure - the level of redundancy required to provide high availability, the level of testing, etc. It denotes the trade-off between reliability and delivery lead time.

This one needs more detailed explanation. Let’s try some examples:

  • 99.999% availability - This is an extremely high SLO. It means the system can only afford about 5 minutes and 15 seconds of total unplanned downtime a year. This forces you to consider multi-region deployments, high redundancy with active-active failover, and automated chaos engineering in production, all of which demand very high investment. You see this level of SLO in financial trading applications, safety-critical applications, and critical infrastructure control systems such as power grids and gas pipelines, as well as in some of the critical infrastructure relied on by worldwide streaming services like Netflix.

  • 99.9% availability - This is a moderate SLO and hence the most common and pragmatic one. It means the system can afford at most 8 hours and 45 minutes of total unplanned downtime a year. The priority here is to recover quickly from downtime rather than avoid it at all costs, though it may still require cross-region or cross-availability-zone failover preparedness. This pragmatic SLO is chosen everywhere: e-commerce, where customers who have difficulty purchasing can come back and try again later; brick-and-mortar stores, where payment difficulties mean the cashier team resorts to pen-and-paper accounting during the outage; and much CRM (customer relationship management) software, such as Salesforce, which can afford to go down during the non-working hours of the firms that use it. All of that is based on publicly available information. The downtime arithmetic behind these figures is sketched just after this list.
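The downtime figures above fall straight out of the arithmetic. Here is a minimal sketch (assuming a 365-day year; a leap year shifts the numbers only slightly) that converts an availability target into a yearly downtime budget:

```python
def allowed_downtime_per_year(availability_pct: float) -> str:
    """Convert an availability target into a yearly unplanned-downtime budget."""
    minutes_per_year = 365 * 24 * 60                       # 525,600 minutes
    downtime_minutes = minutes_per_year * (1 - availability_pct / 100)
    hours, remainder = divmod(downtime_minutes, 60)
    minutes, seconds = divmod(remainder * 60, 60)
    return f"{int(hours)}h {int(minutes)}m {int(seconds)}s"

print(allowed_downtime_per_year(99.999))   # ~0h 5m 15s
print(allowed_downtime_per_year(99.9))     # ~8h 45m 36s
print(allowed_downtime_per_year(99.0))     # ~87h 36m 0s
```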

If you’re still in the camp that believes SLOs are a “productionisation thing” or a box to tick before release, then let’s have a serious chat about shifting our thinking around reliability to the left.

Check out Google’s SRE book for more details on SLI/SLO concepts, and see its Availability Table for the other “nines” of availability.

The Myth of the Last-Minute SLO

I’ve seen it time and again at places where I work and have worked, and I’m tired of it. Countless times we have had discussions in our Platform Health Review, or the like, about teams not having set their SLOs. The reasons range from “we have other priorities” to “our capacity is limited” to “we have bigger fish to fry”. Package it as you wish, but it shows that you are focusing on the wrong metric.

A team discovers, designs, codes, tests, and then, as the release date looms, someone from the Platform or SRE team asks: “What about the SLOs?”

Suddenly, we’re scrambling to define what success looks like after the thing is built.

This is like building a two-storey house on clay foundations because it was easier and quicker, and then wondering later why the house is sinking.

This is exactly why these parameters should inform the entire build process.

Reliability isn’t a coat of paint applied at the end; it’s the very fibre of the wood.

Why Reliability Must Shift Left

1. Design & Architecture stage: Laying the Foundations

When we begin sketching out our architecture diagrams, the SLIs are our initial design constraints. They aren’t suggestions; they are concrete, measurable Non-Functional Requirements (NFRs).

Think of it this way: If we were a construction company and our client demands a bridge that can support X tons (which would translate to our availability/latency SLO), we don’t design a suspension bridge and then wonder if it is strong enough to bear the maximum load. The load requirement dictates the entire structural design, the materials, the stress points, etc. You ignore it, and you risk building a bridge that collapses under pressure.

The Cost of “Later”: Discovering a fundamental architectural flaw that prevents us from hitting our SLO after the code is written is exponentially more expensive than catching it during the design review. This isn’t just about money; it’s about wasted effort, delayed delivery, and eroding trust in the team’s ability to build robust systems.

2. Development Stage: Building with Purpose

Knowing our target SLOs from the outset guides our very development practices.

Being intentional about instrumentation

How can we measure our SLIs if we haven’t instrumented our code to emit the necessary data points? We must ensure that the telemetry required to gauge latency, error rates, and throughput is in place from the first commit. This should never be an afterthought; it is a critical component of an observable system.
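As a rough sketch of what that can look like, here is a request handler instrumented with the Python prometheus_client library; the metric names and the `process` function are illustrative, not a prescription:

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; emit whatever maps onto your chosen SLIs.
REQUESTS = Counter("checkout_requests_total", "Requests handled", ["outcome"])
LATENCY = Histogram("checkout_request_seconds", "Request latency in seconds")

def process(request):
    # Stand-in for real business logic.
    return {"status": "ok"}

def handle_request(request):
    start = time.perf_counter()
    try:
        response = process(request)
        REQUESTS.labels(outcome="success").inc()
        return response
    except Exception:
        REQUESTS.labels(outcome="error").inc()
        raise
    finally:
        # Always record latency, whether the request succeeded or failed.
        LATENCY.observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for your monitoring system to scrape
    handle_request({"item": "example"})
```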

Purposeful Quality Assurance

Our SLOs provide the definition of success for our performance, load, and integration tests. Instead of vague directives like “make it fast,” we now have concrete targets: “ensure the 95th percentile API response time remains under 200ms at 1,000 requests/second.” This focuses testing on validating our promise to our application users.
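As a sketch of how such a target can be enforced in an automated check (the latency samples here are synthetic; in practice they would come from your load-test harness):

```python
import statistics

def assert_latency_slo(latencies_ms: list[float], p95_target_ms: float = 200.0) -> None:
    """Fail the check if the observed p95 latency breaches the SLO target."""
    # statistics.quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(latencies_ms, n=20)[-1]
    assert p95 <= p95_target_ms, f"p95 latency {p95:.1f}ms exceeds target {p95_target_ms}ms"

# Synthetic samples standing in for the output of a real load-test run.
observed = [120.0] * 500 + [150.0] * 400 + [190.0] * 80 + [450.0] * 20
assert_latency_slo(observed)   # p95 here is 190ms, so the check passes
print("Latency SLO check passed")
```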

3. Pre-Production: The Quality Assurance Gate

The Error Budget as a Release Readiness indicator: Once we have our SLOs, we can derive an Error Budget. This is a powerful governance mechanism. If our testing in staging, or even our canary deployments, indicates we’re burning through that budget at an unacceptable rate, then we don’t deploy. The SLO becomes the objective arbiter of release readiness, protecting both our users and our operational sanity.
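A minimal sketch of that arithmetic, with the window length and request counts purely illustrative:

```python
def error_budget_status(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Report how much of the error budget a service has consumed in the window."""
    allowed_failures = total_requests * (1 - slo)           # requests we are allowed to fail
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "allowed_failures": round(allowed_failures),
        "actual_failures": failed_requests,
        "budget_consumed": f"{consumed:.0%}",
    }

# A 99.9% SLO over a 30-day window with 50 million requests gives a budget of 50,000 failures.
print(error_budget_status(slo=0.999, total_requests=50_000_000, failed_requests=32_000))
# -> {'allowed_failures': 50000, 'actual_failures': 32000, 'budget_consumed': '64%'}
```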

Effective Alerting: Without defined SLIs and SLOs, our alerts are often noisy, reactive, and untrustworthy. By knowing what truly matters to the user experience, we can craft alerts that are actionable, signal-rich, and directly tied to the health of our service from the user’s perspective.
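One common way to make such alerts signal-rich is multi-window burn-rate alerting, described in Google’s SRE Workbook. A simplified sketch of the idea (the 14.4 threshold corresponds to burning roughly 2% of a 30-day budget within an hour; the error ratios would come from your metrics backend):

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than the sustainable rate the error budget is being burned."""
    return error_ratio / (1 - slo)

def should_page(short_window_error_ratio: float, long_window_error_ratio: float,
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    # Page only when both a short and a long window show a high burn rate,
    # which filters out brief blips while still catching sustained budget burn.
    return (burn_rate(short_window_error_ratio, slo) > threshold
            and burn_rate(long_window_error_ratio, slo) > threshold)

# Example: 2% of requests failing over the last 5 minutes and 1.6% over the last hour.
print(should_page(short_window_error_ratio=0.02, long_window_error_ratio=0.016))   # True
```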

Note

Software Craftsperson’s Mantra: Reliability is a Feature

Let this be our guiding principle: Reliability is not a bolt-on; it is a fundamental feature of our software, no less important than user authentication or a new business workflow.

Let me ask you, would you build an algorithm without specifying its inputs, expected outputs, expected time or memory constraints?

Of course not.

SLOs are simply the explicit definition of the user experience for this critical “reliability feature”. They are the north star that ensures your system doesn’t just work, but is fit for purpose.

Making reliability a first-class citizen in every stage of your software development lifecycle is how you become a software craftsperson.

What About SLOs For Non-production Environments?

This is another question that comes up when one mentions SLOs early in the SDLC. Non-production environments do not have end users. So, do we need SLOs there?

SLOs are a measure of user experience, and different environments have different purposes and different expectations of stability and availability. It will always benefit a team to have clear SLOs for every environment it runs, so that problems are understood early on.

Let us look at some ways this can be implemented.

Development Environment

This environment is expected to be unstable; after all, this is where developers actively push their changes. So it is not an environment one would recommend for measuring end-user experience.

But if developers are the users of the environment, then perhaps there are metrics relevant to them that can be measured and are genuinely useful.

Some indicators:

  • Developer feedback-cycle time: time taken from commit to PR merge to deployment in the dev environment
  • Local build time: must be fast enough to not get in the way
  • Test reliability: rate of test failures

These could determine how useful the development environment is to developers.
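If you wanted to track the first of those indicators, a sketch of the measurement might look like this (the event timestamps would come from your VCS and CI/CD systems; the field names are illustrative):

```python
from datetime import datetime

def feedback_cycle_minutes(committed_at: str, merged_at: str, deployed_at: str) -> dict:
    """Break the developer feedback cycle into its two main segments (in minutes)."""
    committed, merged, deployed = (
        datetime.fromisoformat(t) for t in (committed_at, merged_at, deployed_at)
    )
    return {
        "commit_to_merge": (merged - committed).total_seconds() / 60,
        "merge_to_deploy": (deployed - merged).total_seconds() / 60,
        "total": (deployed - committed).total_seconds() / 60,
    }

print(feedback_cycle_minutes("2024-05-01T10:00:00", "2024-05-01T10:40:00", "2024-05-01T10:55:00"))
# -> {'commit_to_merge': 40.0, 'merge_to_deploy': 15.0, 'total': 55.0}
```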

Staging or Integration Environment

This is an environment where changes from multiple teams are integrated and there is some level of automated testing in place. The expectation is that this environment’s automated tests should pass reliably to prove that the E2E system is functional. In some organisations, this environment could also be used for performance testing or user acceptance testing.

The stability of this environment must therefore be higher than that of the development environment, which makes SLOs even more relevant here.

Some indicators:

  • Automated test success rate: percentage of automated tests passing
  • Availability: uptime of the environment
  • Latency (relevant if performance testing is done here): API response times, e.g. p99 at or below a specified number of milliseconds. If this doubles as a performance testing environment, this indicator should be more stringent than its production equivalent.
  • Load testing: ability to simulate production or higher-than-production loads - thrash the system with 2x or 3x the expected load and see how it behaves.
  • Critical journeys: success rate of critical journeys, sometimes also referred to as critical user flows or smoke tests (a sketch of tracking this follows after the list).
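As a sketch of how that last indicator could be tracked, assuming the journey checks are ordinary functions that raise on failure (the journeys themselves are illustrative):

```python
def critical_journey_success_rate(journeys: dict) -> float:
    """Run each critical-journey check and return the fraction that passed."""
    passed = 0
    for name, check in journeys.items():
        try:
            check()
            passed += 1
        except Exception as exc:
            print(f"Journey failed: {name} ({exc})")
    return passed / len(journeys)

def checkout_journey():
    # Stand-in for a real end-to-end check; fails deliberately in this example.
    raise RuntimeError("payment stub down")

# Illustrative journeys; in practice each would drive the staging environment end to end.
journeys = {
    "login": lambda: None,
    "add_to_basket": lambda: None,
    "checkout": checkout_journey,
}

rate = critical_journey_success_rate(journeys)
print(f"Critical journey success rate: {rate:.0%}")   # 67%, below a 100% staging target
```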

Error Budgets for Non-production Environments

This is common sense really, but I thought it better to be explicit about it. If the error budget is a measure of how much unreliability can be afforded before breaching the relevant SLO, then non-production environments will have their own interpretation of error budgets.

  • Production - unreliability in production might need a rollback or a hotfix
  • Staging - unreliability in staging might mean halting further deployments until the environment is stable again. For example, no merges until the journey tests pass reliably.
  • Development - despite being accepted as THE unstable environment, there may be some failed tests or CI issues that must be fixed before development can continue.

Conclusion

SRE principles have been deployed and tested in organisations operating at massive scale - Google, Microsoft, Amazon, Netflix, and many more. The principles are sound and have been proven to work. This is why I advocate defining SLOs and SLIs early in the SDLC rather than only thinking about them just before release. Let’s not just create job roles and titles called SRE; let’s also imbibe the principles of SRE in our engineering practices.