Site reliability engineering
Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.
The field of site reliability engineering originated at Google with Ben Treynor Sloss, who founded a site reliability team after joining the company in 2003. In 2016, Google employed more than 1,000 site reliability engineers. The concept subsequently spread into the broader software development industry, and other companies began to employ site reliability engineers. The position is more common at larger web companies, as small companies often do not operate at a scale that would require dedicated SREs. Companies that have adopted the concept include Dropbox, Airbnb, IBM, and Netflix. According to a 2021 report by the DevOps Institute, 22% of organizations in a survey of 2,000 respondents had adopted the SRE model.
Site reliability engineering is the application of software engineering to IT subjects including infrastructure and operations, with the goal of creating and maintaining scalable and reliable systems. Site reliability engineers often have backgrounds in software engineering, system engineering, or system administration. Focuses of site reliability engineering include automation, system design, and improvements to system resilience. SRE teams are responsible for system availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
Site reliability engineering focuses specifically on building reliable systems, whereas DevOps is more broadly focused on infrastructure. The definition of the role varies somewhat by company, and Stephen Gossett wrote in Built In that some companies have rebranded their operations teams as SRE teams with little meaningful change.
- Beyer, Betsy; Jones, Chris; Petoff, Jennifer; Murphy, Niall, eds. (2016). Site Reliability Engineering: How Google Runs Production Systems. Sebastopol, CA: O'Reilly Media. ISBN 978-1-4919-5118-7. OCLC 945577030.
- Vargo, Seth; Fong-Jones, Liz (March 1, 2018). What's the Difference Between DevOps and SRE? (class SRE implements DevOps) (Video). Google.
- Hill, Patrick. "Love DevOps? Wait until you meet SRE". Atlassian. Retrieved June 17, 2021.
- "What is SRE?". Red Hat. Retrieved June 17, 2021.
- Treynor, Ben (2014). "Keys to SRE". USENIX SREcon14. Retrieved June 17, 2021.
- Fischer, Donald (March 2, 2016). "Are site reliability engineers the next data scientists?". TechCrunch. Retrieved June 17, 2021.
- Gossett, Stephen (June 1, 2020). "What Is a Site Reliability Engineer? What Does an SRE Do?". Built In. Retrieved June 17, 2021.
- "Site Reliability Engineering". IBM Cloud Education. IBM. November 12, 2020. Retrieved June 21, 2021.
- Oehrlich, Eveline; Groll, Jayne; Garbani, Jean-Pierre (2021). Upskilling 2021 Enterprise DevOps Skills Report (PDF) (Report). DevOps Institute. Retrieved June 17, 2021.
- Oehrlich, Eveline (May 4, 2021). "What it takes to be a site reliability engineer". TechBeacon. Micro Focus. Retrieved June 17, 2021.
- Jones, Chris; Underwood, Todd; Nukala, Shylaja (June 2015). "Hiring Site Reliability Engineers" (PDF). ;login:. Vol. 40 no. 3. pp. 35–39. Retrieved June 17, 2021.
- Treynor, Ben. "In Conversation" (Interview). Interviewed by Niall Murphy. Google Site Reliability Engineering.
- "Usenix SREcon". USENIX. 2021. Retrieved June 17, 2021.
- Blank-Edelman, David N., ed. (2018). Seeking SRE: Conversations About Running Production Systems at Scale (1 ed.). Sebastopol, CA: O'Reilly Media. ISBN 978-1491978863. OCLC 1052565720.
- Limoncelli, Tom; Chalup, Strata R.; Hogan, Christina J. (September 2014). The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services. Vol. 2. Upper Saddle River, NJ: Addison-Wesley. ISBN 978-0-13-347854-9. OCLC 891786231.