Reliability planning is most effective when it starts during design, not shortly before release.

This post focuses on how Service Level Objectives (SLOs) and Service Level Indicators (SLIs) can be used early in the SDLC to guide architecture and delivery choices.

Remind me what they are again?

SLI - Service Level Indicator

A quantitative measure of a service’s behaviour as experienced by its users. A good SLI tracks a property of the service that is a close proxy for the user experience.

An SLI is a design input. It defines what success looks like in measurable terms and informs what telemetry must be emitted from the application.

Common examples include the rate of successful requests and request latency.
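
As a minimal sketch, an availability SLI can be computed as the proportion of successful requests over a measurement window. The numbers and window below are illustrative, not from any particular system:

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Availability SLI: fraction of requests that succeeded in the window."""
    if total_requests == 0:
        return 1.0  # no traffic in the window: treat the target as met
    return successful_requests / total_requests

# Example: 99,982 good requests out of 100,000 over the last 28 days
print(f"{availability_sli(99_982, 100_000):.5f}")  # 0.99982
```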

SLO - Service Level Objective

A target value, or range of values, for an SLI that the service must meet over a defined period of time.

It influences infrastructure and process decisions: redundancy model, failover strategy, testing depth, and the trade-off between reliability and delivery speed.

Examples:

  • 99.999% availability - This allows roughly 5 minutes and 15 seconds of unplanned downtime per year. It usually requires high investment in multi-region design, strong redundancy, and advanced operational controls.

  • 99.9% availability - This allows roughly 8 hours and 45 minutes of unplanned downtime per year. It is often a pragmatic target where fast recovery is prioritised, while still requiring thoughtful failover and incident response practices.
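
The downtime figures above follow directly from the availability target. A quick sketch of the arithmetic, using a 365-day year for simplicity:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def allowed_downtime_minutes(availability_target: float) -> float:
    """Unplanned downtime per year permitted by a given availability target."""
    return (1 - availability_target) * MINUTES_PER_YEAR

print(allowed_downtime_minutes(0.99999))  # ~5.3 minutes  (five nines)
print(allowed_downtime_minutes(0.999))    # ~525.6 minutes, roughly 8h 45m
```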

SLOs are useful throughout delivery, not only during production readiness checks.

For additional background, see Google’s SRE book and the Availability Table.

Why Define SLOs Early

In many teams, SLOs are drafted close to release. At that point, architecture and implementation are already fixed, so major reliability improvements become costly.

Defining SLOs earlier helps teams make better trade-offs before decisions are locked in. It also keeps reliability discussions objective, because success criteria are measurable from the start.

Why Reliability Must Shift Left

1. Design and Architecture

At design time, SLIs act as measurable non-functional requirements.

For example, if availability and latency targets are strict, architecture may need regional redundancy, queueing, and graceful degradation patterns from day one.
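
As one illustration of a graceful-degradation pattern, a non-critical dependency call can be bounded by a tight timeout and fall back to generic content instead of failing the whole request. The endpoint, timeout, and fallback below are hypothetical:

```python
import requests

FALLBACK_RECOMMENDATIONS = ["popular-item-1", "popular-item-2"]  # assumed static fallback

def fetch_recommendations(user_id: str) -> list[str]:
    """Fetch personalised recommendations, degrading to generic content on failure."""
    try:
        # Hypothetical internal endpoint; the 200 ms timeout reflects a strict latency budget.
        resp = requests.get(f"https://recs.internal.example/v1/users/{user_id}", timeout=0.2)
        resp.raise_for_status()
        return resp.json()["items"]
    except requests.RequestException:
        # Graceful degradation: keep the page working instead of propagating the failure.
        return FALLBACK_RECOMMENDATIONS
```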

Discovering a reliability gap after implementation usually means redesign, rework, and schedule impact. Catching the same gap in design review is significantly cheaper.

2. Development Stage

Once targets are clear, development becomes more intentional.

Intentional Instrumentation

Teams can instrument latency, error rate, and throughput from early iterations, which avoids late-stage observability gaps.
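
For example, using the Prometheus Python client (other metrics libraries follow the same shape), a handler can emit the latency and error-rate SLIs from the first iteration. The metric names and handler here are illustrative:

```python
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("checkout_requests_total", "Checkout requests by outcome", ["outcome"])
LATENCY = Histogram("checkout_latency_seconds", "Checkout request latency in seconds")

def process_order(order: dict) -> None:
    """Placeholder for the real business logic."""

@LATENCY.time()  # records each call's duration into the histogram
def handle_checkout(order: dict) -> None:
    try:
        process_order(order)
        REQUESTS.labels(outcome="success").inc()
    except Exception:
        REQUESTS.labels(outcome="error").inc()
        raise

start_http_server(9100)  # expose /metrics for the monitoring system to scrape
```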

Purposeful Quality Assurance

SLOs also define pass/fail criteria for performance and integration tests. Instead of generic goals, teams test against explicit targets such as p95 latency thresholds under expected load.
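
A minimal sketch of such a check, assuming latency samples (in seconds) are exported by the load-test run and an illustrative 300 ms p95 target:

```python
import json
import math

P95_TARGET_SECONDS = 0.300  # illustrative target taken from the latency SLO

def p95(samples: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

def load_test_latencies(path: str = "load_test_latencies.json") -> list[float]:
    """Latency samples (seconds) exported by the load-test run; the file format is assumed."""
    with open(path) as f:
        return json.load(f)

def test_checkout_p95_latency():
    assert p95(load_test_latencies()) <= P95_TARGET_SECONDS
```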

3. Pre-Production and Release Readiness

The error budget, the amount of unreliability an SLO tolerates over its window, becomes a practical release indicator. If staging or canary data shows budget burn above acceptable levels, deployment can be paused with a clear rationale.
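
A minimal sketch of such a gate, assuming a 99.9% availability SLO and error counts taken from the canary; the 50% burn threshold is something a team would agree on, not a standard:

```python
SLO_TARGET = 0.999             # assumed availability SLO for the window
ERROR_BUDGET = 1 - SLO_TARGET  # fraction of requests allowed to fail

def budget_burned(failed: int, total: int) -> float:
    """Fraction of the error budget consumed by the observed error rate."""
    if total == 0:
        return 0.0
    return (failed / total) / ERROR_BUDGET

def release_gate(failed: int, total: int, max_burn: float = 0.5) -> bool:
    """Allow the rollout to proceed only while burn stays under the agreed threshold."""
    return budget_burned(failed, total) < max_burn

print(release_gate(failed=40, total=100_000))   # 0.04% errors -> 40% of budget -> True
print(release_gate(failed=120, total=100_000))  # 0.12% errors -> 120% of budget -> False
```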

Alerting quality also improves when alerts map directly to SLI health instead of low-level infrastructure noise.
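
One widely used pattern here is multi-window burn-rate alerting, described in Google’s SRE Workbook: page only when the error budget is burning fast over both a long and a short window. A minimal sketch, assuming per-window error rates are already available:

```python
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET

def burn_rate(error_rate: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is burning."""
    return error_rate / ERROR_BUDGET

def should_page(error_rate_1h: float, error_rate_5m: float, threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window exceed the burn-rate threshold."""
    return burn_rate(error_rate_1h) > threshold and burn_rate(error_rate_5m) > threshold

print(should_page(error_rate_1h=0.02, error_rate_5m=0.03))    # True: sustained fast burn
print(should_page(error_rate_1h=0.0005, error_rate_5m=0.03))  # False: short-lived blip
```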

Software Craftsperson’s note: Reliability is a feature

Reliability should be treated like any other product requirement: explicitly defined, testable, and reviewed throughout delivery.

What About SLOs For Non-production Environments?

This question comes up often when introducing SLOs early in the SDLC: should non-production environments also have SLOs?

Although end users are not directly present in non-production environments, these environments still have users and service expectations. Defining environment-specific targets helps teams detect reliability issues earlier.

Here is a practical way to approach it.

Development Environment

Development environments are naturally less stable because change frequency is high.

However, developers are the users in this context, so operational quality still matters.

Some indicators:

  • Developer feedback-cycle time: time from commit to deploy in development.
  • Local build time: fast enough not to get in the way of iteration.
  • Test reliability: failure rate and flakiness trend.

These indicators show whether the environment supports fast and predictable iteration.
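
As a sketch of the last indicator, test flakiness can be estimated from CI history: a test is flaky if it both passed and failed against the same revision. The record format below is an assumption:

```python
from collections import defaultdict

def flaky_test_rate(runs: list[tuple[str, str, bool]]) -> float:
    """runs holds (test_name, commit_sha, passed); returns the fraction of flaky tests."""
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for test, sha, passed in runs:
        outcomes[(test, sha)].add(passed)

    tests = {test for test, _, _ in runs}
    flaky = {test for (test, _), seen in outcomes.items() if len(seen) == 2}
    return len(flaky) / len(tests) if tests else 0.0

runs = [
    ("test_checkout", "abc123", True),
    ("test_checkout", "abc123", False),  # same commit, different outcome -> flaky
    ("test_login", "abc123", True),
]
print(flaky_test_rate(runs))  # 0.5
```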

Staging or Integration Environment

Staging or integration environments combine changes across teams and are often used for end-to-end validation and, in some organisations, for performance testing and UAT.

Stability expectations are therefore higher than in development, so explicit reliability targets are valuable.

Some indicators:

  • Automated test success rate: percentage of automated tests passing.
  • Availability: uptime of the environment.
  • Latency (relevant if performance testing is done here): API response times, such as p99 <= a defined threshold.
  • Load testing: ability to simulate production and, where appropriate, stress conditions above the production baseline.
  • Critical journeys: success rate of critical journeys, sometimes also referred to as critical user flows or smoke tests.
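
As a sketch of the last indicator, journey results can be rolled up into a success rate and compared against an environment-specific target; the 98% figure below is an assumption:

```python
STAGING_JOURNEY_TARGET = 0.98  # assumed staging target, deliberately looser than production

def journey_success_rate(results: list[bool]) -> float:
    """Fraction of critical-journey runs that passed in the reporting window."""
    return sum(results) / len(results) if results else 1.0

def staging_is_healthy(results: list[bool]) -> bool:
    return journey_success_rate(results) >= STAGING_JOURNEY_TARGET

# Example: 97 of 100 checkout-journey runs passed this week
print(staging_is_healthy([True] * 97 + [False] * 3))  # False: below the 98% target
```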

Error Budgets for Non-production Environments

Since the error budget is the unreliability an SLO tolerates before it is breached, each environment should interpret that budget according to its purpose.

  • Production - breaches may require rollback, incident response, or hotfixes.
  • Staging - breaches may pause merges or deployments until critical paths are stable.
  • Development - recurring CI or test failures may require immediate stabilisation to restore team flow.
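
One lightweight way to make this explicit is an error-budget policy keyed by environment. The sketch below simply restates the list above in code form; the exhaustion check is an assumption about how a team might wire it up:

```python
BUDGET_POLICY = {
    # agreed action once an environment's error budget is exhausted
    "production":  "roll back, open an incident, or hotfix",
    "staging":     "pause merges and deployments until critical paths are stable",
    "development": "stop feature work and stabilise CI and tests to restore flow",
}

def action_on_exhaustion(environment: str, budget_remaining: float) -> str | None:
    """Return the agreed action when the budget is exhausted, otherwise None."""
    if budget_remaining > 0:
        return None
    return BUDGET_POLICY[environment]

print(action_on_exhaustion("staging", budget_remaining=-0.1))
```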

Conclusion

SRE practices from organisations such as Google, Microsoft, Amazon, and Netflix show that measurable reliability targets improve engineering outcomes.

Defining SLIs and SLOs early helps teams make better design choices, test against clear criteria, and use error budgets as objective release signals. In practice, this makes reliability work more predictable and easier to operationalise.