Metrics to measure in DevOps
Many of the metrics we saw in the traditional world no longer apply when we onboard DevOps with a new development life cycle (Agile).
The key to metrics is data collection and reporting formats. In this post, I recommend a few metrics for DevOps teams; select the metrics you need from these, not all of them! Always start with 4 or 5 metrics for tracking; later you can add more as maturity and tool-based data capture increase.
1. Mean Lead Time (Idea to Value): This is a very important metric for DevOps. It measures the mean time taken by all work items in a release to make their way from the created state to the completed/done state. It will indicate whether teams are taking a long lead time to complete the workflow.
Note: Ensure work items move to the completed/done state when they are deployed in the production environment.
Mean Lead Time = Mean (duration of each work item in a sprint or release from the created state to the completed/done state).
2. Mean Cycle Time (Dev to Value): This is like Mean Lead Time, but instead of the created state we use the In-Progress state; i.e. cycle time is calculated from a work item first entering an In-Progress state to entering a completed/done state.
Note: Ensure work items move to the completed/done state when they are deployed in the production environment.
Mean Cycle Time = Mean (duration of each work item in a sprint or release from the In-Progress state to the completed/done state).
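Both formulas can be sketched from work-item timestamps. A minimal Python example, assuming hypothetical field names ("created", "in_progress", "done"); adapt them to whatever your work-item tracker actually exports:

```python
from datetime import datetime
from statistics import mean

# Hypothetical work-item records exported from a tracker; the field
# names here are assumptions, not any specific tool's schema.
work_items = [
    {"created": "2024-01-02", "in_progress": "2024-01-05", "done": "2024-01-10"},
    {"created": "2024-01-03", "in_progress": "2024-01-04", "done": "2024-01-08"},
]

def days_between(start, end):
    """Duration in whole days between two date strings."""
    fmt = "%Y-%m-%d"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).days

# Mean Lead Time: created state -> completed/done state
mean_lead_time = mean(days_between(w["created"], w["done"]) for w in work_items)

# Mean Cycle Time: In-Progress state -> completed/done state
mean_cycle_time = mean(days_between(w["in_progress"], w["done"]) for w in work_items)
```

In practice you would pull these timestamps from your tracker's API or export rather than hard-coding them.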
3. Automated Test Failure Rate: Measuring the results of automated tests helps to find development bugs or environment issues early.
Automated Test Failure % = (Failed automated test cases / Total test cases (passed + failed)) * 100
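As a quick sketch of the formula (the counts here are illustrative, not from any real test run):

```python
def failure_rate(failed, total):
    """Automated test failure percentage; total = passed + failed."""
    if total == 0:
        return 0.0
    return failed / total * 100

# e.g. 12 failed cases out of 300 executed test cases
print(failure_rate(12, 300))  # 4.0
```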
4. Defect Escape Ratio: Also known as the defect leakage ratio, this measures the defects escaping from the system test/SIT environment into the production environment. It is important in a DevOps culture as it indicates testing effectiveness.
Defect escape ratio = (Bugs or incidents reported in the production environment for a release) / (Total bugs or incidents for the release (bugs found in DEV, SIT, PRE-PROD and PROD))
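A minimal sketch of the ratio, assuming you already have per-environment bug counts for the release (the numbers below are illustrative):

```python
def defect_escape_ratio(prod_bugs, pre_prod_bugs):
    """prod_bugs: bugs found in PROD for the release.
    pre_prod_bugs: bugs found in DEV, SIT and PRE-PROD for the same release."""
    total = prod_bugs + pre_prod_bugs
    if total == 0:
        return 0.0
    return prod_bugs / total

# e.g. 5 production bugs against 45 caught before production
print(defect_escape_ratio(5, 45))  # 0.1, i.e. 10% of defects escaped
```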
5. Mean Time to Recovery: MTTR is a metric that measures the availability of systems. Operations and development teams use MTTR to support contractual agreements. The lower the MTTR, the better.
- Mean Time to Recovery for Application: the mean time between the reported time of an application outage and the recovery time of the service.
MTTR for Application = Mean (recovery duration of reported application outages)
- Mean Time to Recovery for Environment: the mean time between the reported time of an environment outage and the recovery time of the service.
MTTR for Environment = Mean (recovery duration of reported environment outages)
Note: recovery duration = recovery time – reported time.
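The note above maps directly to code. A minimal sketch, assuming a hypothetical outage log of (reported time, recovered time) pairs; adapt the format to your incident tooling:

```python
from datetime import datetime
from statistics import mean

# Hypothetical outage log for one application; times are illustrative.
outages = [
    ("2024-03-01 10:00", "2024-03-01 10:45"),
    ("2024-03-07 14:00", "2024-03-07 15:15"),
]

def recovery_minutes(reported, recovered):
    """recovery duration = recovery time - reported time, in minutes."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(recovered, fmt) - datetime.strptime(reported, fmt)
    return delta.total_seconds() / 60

# MTTR = mean of the recovery durations of all reported outages
mttr_minutes = mean(recovery_minutes(r, c) for r, c in outages)
```

The same calculation applies to environment outages; only the source of the log changes.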
6. Build Failure Rate: Measures the percentage of builds that fail over a given period (e.g. a sprint or release). An increase in the build failure rate likely indicates that the application’s overall “health” or the team’s focus is deteriorating.
Build failure % = ((Number of failed builds in a sprint or release) / (Total builds in the sprint or release)) * 100
Note: Try to include both build failures and build deployment failures in the build failure rate, or else track them separately.
- Infrastructure failure rate: the percentage of build or deployment failures related to infrastructure issues.
- Change failure rate: measured as the total number of failed deployments divided by the total number of deployments.
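Both rates fall out of CI records directly. A minimal sketch, assuming hypothetical lists of build and deployment statuses; in practice these would come from your CI server's API:

```python
# Hypothetical build/deployment outcomes for one sprint; the status
# strings are assumptions, not any specific CI tool's vocabulary.
builds = ["success", "failed", "success", "success", "failed"]
deploys = ["success", "failed", "success", "success"]

# Build failure % = failed builds / total builds * 100
build_failure_pct = builds.count("failed") / len(builds) * 100

# Change failure rate = failed deployments / total deployments
change_failure_rate = deploys.count("failed") / len(deploys)
```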
7. Unplanned Work Rate: Measures how much unplanned work arrives from the business, production environments, management, or any other stakeholder after sprint planning is done. This helps show how much the team is pulled away from the agreed sprint goal.
Unplanned work % = ((Total effort of unplanned work in a sprint) / (Total sprint effort)) * 100
8. Rework Effort: Measures the actual effort spent on all rework activities in a sprint or release. The lower, the better.
Rework effort % = ((Total effort for rework on reverted work items or reported bugs) / (Total effort for the sprint or release)) * 100
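Both metrics are the same effort-share calculation with different numerators. A minimal sketch (the hour figures are illustrative):

```python
def effort_percentage(part_effort, total_effort):
    """Share of total sprint/release effort, as a percentage.
    Works for both Unplanned work % and Rework effort %."""
    if total_effort == 0:
        return 0.0
    return part_effort / total_effort * 100

# e.g. 16 hours of unplanned work in a 160-hour sprint
print(effort_percentage(16, 160))  # 10.0
```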
9. Flow Efficiency: Measures the actual time spent on a work item. It compares productive “work time” against non-productive “wait time” (e.g. waiting for approvals, waiting for infrastructure/environments, resources, etc.) as a work item travels from the start state (e.g. Committed) to the end state (e.g. deployed to PROD). The higher the flow-efficiency percentage, the better and smoother the process.
Flow efficiency % = (Work Time / (Work Time + Wait Time)) * 100
Flow efficiency above 40% is generally considered good (as guidance!); this also indirectly helps us look into the constraints.
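The formula above as a short sketch (the day figures are illustrative):

```python
def flow_efficiency(work_time, wait_time):
    """Flow efficiency % = Work Time / (Work Time + Wait Time) * 100."""
    total = work_time + wait_time
    if total == 0:
        return 0.0
    return work_time / total * 100

# e.g. 3 days of active work and 5 days spent waiting
print(flow_efficiency(3, 5))  # 37.5, below the ~40% guidance
```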
There are many more metrics used by DevOps teams, but those mentioned above are good enough to measure and internally control timelines, cost, and quality. Once the team has matured and built in automated measurements, it can add further metrics such as Mean Time to Acknowledge, Mean Time Between Failures, etc.