
System Reliability - Building Dependable Systems
A complete guide to system reliability, covering MTBF (Mean Time Between Failures), MTTR (Mean Time To Recovery), fault tolerance, and building dependable distributed systems.

What is SRE? SRE stands for Site Reliability Engineering. That’s a lot of words, so what does it actually mean? Site Reliability Engineering is what IT operations would be if it were run by software engineers. That’s an interesting take, but it doesn’t clarify much about SRE yet. Let’s probe further. How did we go from development to SRE? You know the part where people deploy software and then make sure things run fine in production. ...