What is site reliability engineering (SRE) and what are its principles? As Agile and DevOps approaches became more established in software engineering, gaps emerged in how software engineers handle IT infrastructure and application resilience, and these gaps needed to be filled. This is where SRE came into the picture.
Introduction to Site Reliability Engineering (SRE)
SRE is neither an old nor a new concept. The term was coined by Benjamin Treynor Sloss, a VP of engineering at Google, and refers to the use of a software engineering approach to automate IT operations such as production system management, change management, incident response, and emergency response.
The principles in SRE
In practice, SRE teams inherit the tasks that were previously done by operations teams. SRE teams work according to a set of principles, which provide useful guidelines for organisations deciding which best practices to implement. What do the SRE principles include?
1. Embracing Risk
Perfection is not always the best option. In some fields, aiming for perfection may give a better outcome, but in software engineering the story differs. A more reliable service often comes at a higher cost, be it time, money, or energy. Yet beyond a certain point, users may not notice the difference between a service that offers 99% reliability and a similar service that offers 99.9%. Embracing risk is a practice that allows an organisation to spend more wisely and increase development velocity instead of overspending on reliability improvements that add little or no business value.
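To make the trade-off concrete, the following minimal sketch converts a reliability target into the downtime it permits. The function name `allowed_downtime_minutes` and the 30-day period are assumptions chosen for illustration, not part of any SRE standard.

```python
def allowed_downtime_minutes(availability_target: float, period_days: int = 30) -> float:
    """Return the minutes of downtime a target permits over the period.

    availability_target is a fraction, e.g. 0.999 for 99.9%.
    The 30-day default period is an illustrative assumption.
    """
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    total_minutes = period_days * 24 * 60
    # The permitted unreliability (1 - target) is the "risk" being embraced.
    return total_minutes * (1.0 - availability_target)


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} target allows {minutes:.1f} minutes of downtime per 30 days")
```

Under these assumptions, a 99% target leaves room for 432 minutes of downtime in 30 days, while 99.9% leaves about 43 minutes. Each additional nine shrinks the allowance tenfold, which is why it usually costs far more to achieve.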
2. Service Level Objectives
A service level objective (SLO) defines the target level of service an organisation aims for in order to keep customers satisfied. An SLO is based on service level indicators (SLIs), a set of metrics that capture what matters to customers. Adopting suitable metrics allows the team to take the right measures during downtime.
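As a rough illustration of how an SLI feeds an SLO, the sketch below computes an availability SLI as the share of successful requests and reports how much of the error budget is left. The names `RequestCounts`, `availability_sli`, and `error_budget_remaining` are hypothetical, and real services may prefer other SLIs such as latency.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Request totals for one measurement window (hypothetical structure)."""
    good: int   # requests served successfully
    total: int  # all requests received


def availability_sli(counts: RequestCounts) -> float:
    """SLI: fraction of requests that succeeded in the window."""
    if counts.total == 0:
        return 1.0  # assumption: a window with no traffic counts as fully available
    return counts.good / counts.total


def error_budget_remaining(counts: RequestCounts, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = counts.total * (1.0 - slo_target)
    actual_failures = counts.total - counts.good
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 1.0 if actual_failures == 0 else 0.0
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    window = RequestCounts(good=99_950, total=100_000)
    slo = 0.999  # illustrative target of 99.9% successful requests
    print(f"SLI: {availability_sli(window):.4%}")
    print(f"SLO met: {availability_sli(window) >= slo}")
    print(f"Error budget remaining: {error_budget_remaining(window, slo):.0%}")
```

In this example 50 of 100,000 requests failed against a budget of roughly 100 failures, so about half of the error budget remains for the window.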
3. Eliminating Toil
Toil refers to tasks related to running a production service that tend to be manual, repetitive, tactical, and automatable. When toil is reduced or eliminated, the SRE team can focus more on innovation.
4. Monitoring
Monitoring is important in production because it provides information such as long-term trend analysis and alerts for device or tool malfunctions. Without monitoring, an organisation cannot ensure the reliability of its service.
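The alerting side of monitoring can be sketched as a simple rule over recent measurements. The class `ErrorRateMonitor`, its window size, and its threshold below are illustrative assumptions. Production systems normally rely on a dedicated monitoring stack rather than hand-written code like this.

```python
from collections import deque


class ErrorRateMonitor:
    """Raise an alert when the average error rate over a sliding window is too high.

    window_size and threshold are hypothetical defaults chosen for illustration.
    """

    def __init__(self, window_size: int = 5, threshold: float = 0.05) -> None:
        self.samples: deque = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, error_rate: float) -> bool:
        """Store one sample and return True if an alert should fire."""
        self.samples.append(error_rate)
        # Wait for a full window so a single early spike does not page anyone.
        window_full = len(self.samples) == self.samples.maxlen
        average = sum(self.samples) / len(self.samples)
        return window_full and average > self.threshold


if __name__ == "__main__":
    monitor = ErrorRateMonitor(window_size=3, threshold=0.05)
    for rate in (0.01, 0.02, 0.01, 0.20, 0.20):
        alert = monitor.record(rate)
        print(f"error rate {rate:.2f} -> alert: {alert}")
```

Averaging over a window rather than alerting on each sample is one common way to reduce noisy alerts, at the cost of reacting slightly later to a real problem.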
5. Automation
Automation completes repetitive tasks with minimal human intervention, which increases the speed at which tasks are completed. Areas that can be improved by automation include deployment, testing, incident response, and communication.
6. Release Engineering
A reliable release process allows an organisation to run a reliable service. Release engineering refers to building and delivering software in a consistent and repeatable way. In product development, release engineering is the responsibility of release engineers, who work closely with software engineers to determine the best practices for releasing software.
7. Simplicity
“Everything should be made as simple as possible, but not simpler,” as a saying often attributed to Einstein goes. A system should therefore be built as simply as possible while still delivering the service its users require. A simple system is also easier to monitor, repair, and upgrade.
To sum up, customer satisfaction is what drives businesses these days. Without proper practices, an organisation cannot deliver a reliable service, and customer satisfaction becomes unachievable. If your organisation builds software, adopting SRE should be on your checklist.
E-SPIN Group has been in the business of enterprise ICT solutions supply, consulting, project management, training, and maintenance for multinational corporations and government agencies across the region since 2005. Feel free to contact E-SPIN for your requirements and project inquiries.