Measuring Engineering Performance: Beyond the Trap of Vanity Metrics
Measuring engineering performance through vanity metrics like velocity creates a false sense of progress and weakens trust. This article explores better alternatives: six metrics that reveal flow, quality, resilience, and team health.
Introduction
Every organisation wants to know if its engineering teams are performing well. Leaders naturally reach for metrics: they convert uncertainty into charts, create a sense of clarity, and make progress appear tangible.
But metrics are not all the same. Some reveal what truly drives outcomes, while others mislead, distort behaviour, and slowly undermine performance.
Velocity is one of the most common examples. Another is the classic lines of code (LOC). Both look great on a chart and are easy to collect, but when used as measures of performance they create perverse incentives, demotivate teams, and offer a false sense of progress.
The problem is not unique to velocity. It is part of a wider challenge: the over-reliance on vanity metrics, numbers that look good in dashboards but fail to capture value, quality, resilience, or sustainability.
If we want to measure engineering performance effectively, we need to move beyond these traps and focus on metrics that guide teams towards outcomes that matter.
The Trap of Vanity Metrics
In an agile context, velocity tracks how many story points a team completes in a sprint. In its intended role, it is a planning tool, useful for forecasting what a team might achieve next sprint.
The issues arise when it is lifted out of context and turned into a scorecard:
- compared across teams,
- applied to individuals,
- or reported as proof of productivity.
In those moments, velocity stops being a neutral planning aid and becomes a vanity metric. It looks great on a chart, but tells you little about customer value, product quality, or team health.
Lines of code (LOC) is a typical example of a vanity metric. It looks measurable and objective, but says very little about engineering performance or value. More lines of code can mean:
- over-engineering instead of simplicity,
- verbose solutions where concise ones would be clearer,
- accumulated technical debt rather than progress.
In fact, in many cases, the fewer lines of code required to solve a problem, the better the engineering.
This makes LOC very similar to velocity: easy to count, but misleading as a measure of performance. At best, it offers a rough sense of project size; at worst, it incentivises waste and poor design.
This is the wider risk of vanity metrics: they create the illusion of progress whilst quietly distorting behaviour.
The Cost of Measuring the Wrong Things
The consequences of vanity metrics build slowly but steadily:
- Distorted focus. Teams optimise for the metric instead of the outcome.
- Lost transparency. Leaders believe tasks and deliveries are on track while risks grow unnoticed.
- Weakened morale. Engineers feel judged on the wrong aspects and disengage.
- Slower progress overall. Time goes into managing optics rather than improving flow.
Optimising for velocity, or any vanity metric, may feel like control, but it often undermines the very performance it claims to measure.
How to Measure Instead
What we need are metrics that balance speed with quality and outcomes with sustainability: measures that expose systemic issues rather than pressuring individuals.
Here are six metrics that do just that, with practical guidance on how to use them well, and how not to.
1. Cycle Time
Cycle time is the duration between when work begins and when it is completed. It is usually measured from when a ticket moves into In Progress until it reaches Done.
It shows how quickly work flows through the system and how long customers wait once something is actively being worked on.
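As a rough illustration, here is a minimal sketch of how cycle time could be computed from ticket timestamps. The record shape and field names (in_progress_at, done_at) are illustrative assumptions, not any particular tracker's export format.

```python
from datetime import datetime
from statistics import median

# Hypothetical ticket records; the field names are assumptions, not a
# specific issue tracker's API.
tickets = [
    {"key": "ENG-101", "in_progress_at": "2024-03-01T09:00", "done_at": "2024-03-04T17:00"},
    {"key": "ENG-102", "in_progress_at": "2024-03-02T10:00", "done_at": "2024-03-03T12:00"},
]

def cycle_time_days(ticket: dict) -> float:
    """Elapsed time from 'In Progress' to 'Done', in days."""
    started = datetime.fromisoformat(ticket["in_progress_at"])
    finished = datetime.fromisoformat(ticket["done_at"])
    return (finished - started).total_seconds() / 86400

# Look at the distribution over time; the median is less skewed by outliers
# than the mean, and the trend matters more than any single ticket.
print(f"median cycle time: {median(cycle_time_days(t) for t in tickets):.1f} days")
```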
Why it matters: Long cycle times are often caused by hidden bottlenecks: slow reviews, unresolved dependencies, or tasks being too large. Shorter cycle times mean faster feedback and predictability.
How to apply:
- Track trends, not absolute numbers. Every team has a different baseline.
- Use spikes as signals of systemic issues, not of individual failure.
- Discuss in retrospectives to ask why work is stalling, not who is stalling it.
Good use: Identifying that review queues are delaying delivery and improving the review process.
Bad use: Pressuring an engineer to complete tasks faster because their items took longer than the average cycle time.
2. Lead Time for Changes
Lead time for changes measures how long it takes for code to move from commit to production. It is often captured automatically via the CI/CD pipeline.
It highlights how quickly value actually reaches customers, from the moment it is ready to ship.
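A minimal sketch of the calculation, assuming commit and deployment timestamps are already available; in practice they would come from version control and the CI/CD pipeline, and the field names here are hypothetical.

```python
from datetime import datetime
from statistics import median

# Hypothetical change records with commit and production-deploy timestamps.
changes = [
    {"sha": "a1b2c3d", "committed_at": "2024-03-01T09:15", "deployed_at": "2024-03-01T15:40"},
    {"sha": "e4f5a6b", "committed_at": "2024-03-02T11:00", "deployed_at": "2024-03-05T10:30"},
]

def lead_time_hours(change: dict) -> float:
    """Lead time for a change: from commit to running in production, in hours."""
    committed = datetime.fromisoformat(change["committed_at"])
    deployed = datetime.fromisoformat(change["deployed_at"])
    return (deployed - committed).total_seconds() / 3600

print(f"median lead time: {median(lead_time_hours(c) for c in changes):.1f} hours")
```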
Why it matters: Lead time is a direct reflection of organisational agility. Short lead times mean fast response to customer feedback. Long lead times indicate that value is stuck inside the system.
How to apply:
- Break down the pipeline into stages (build, test, review, deploy) to pinpoint delays.
- Invest in automation to reduce manual steps.
- Track improvements over time, not as absolute targets.
Good use: Discovering that manual approvals add days to the process and automating them.
Bad use: Comparing lead times between teams and declaring one “underperforming” without considering context.
3. Deployment Frequency
Deployment frequency tracks how often changes are released into production, typically measured weekly or daily.
It reflects how regularly customers receive updates, fixes, and improvements.
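As a sketch, deployment frequency can be read straight from a release log by grouping deployments per ISO week; the dates below are made up for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical production deployment dates taken from a release log.
deploy_dates = [
    date(2024, 3, 1), date(2024, 3, 1), date(2024, 3, 4),
    date(2024, 3, 6), date(2024, 3, 12),
]

# Group deployments by (ISO year, ISO week) to see how often value ships.
per_week = Counter(d.isocalendar()[:2] for d in deploy_dates)
for (year, week), count in sorted(per_week.items()):
    print(f"{year}-W{week:02d}: {count} deployments")
```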
Why it matters: Smaller, more frequent deployments reduce risk and accelerate feedback. Infrequent deployments often signal brittle pipelines or fear of breaking production.
How to apply:
- Normalise by context. Not every service needs daily deployments, but no product benefits from quarterly ones.
- Encourage small, safe releases over large, risky ones.
- Use frequency as a conversation starter, not as a competition.
Good use: Encouraging teams to ship smaller, incremental improvements weekly rather than one giant release.
Bad use: Setting a target number of deployments per sprint and demanding that teams hit it regardless of context.
4. Change Failure Rate
Change failure rate is the proportion of deployments that result in incidents, rollbacks, or hotfixes. It is calculated as failed deployments divided by total deployments.
It balances speed with stability, ensuring that moving quickly does not mean breaking things for customers.
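The arithmetic is simple; the harder part is agreeing on what counts as a failure. A minimal sketch, assuming each deployment record already carries a failure flag derived from incidents, rollbacks, or hotfixes (the field names are hypothetical):

```python
# Hypothetical deployment records; the "failed" flag would be derived from
# incident reports, rollbacks, or hotfix labels.
deployments = [
    {"id": 101, "failed": False},
    {"id": 102, "failed": True},   # rolled back after an alert fired
    {"id": 103, "failed": False},
    {"id": 104, "failed": False},
]

failed = sum(1 for d in deployments if d["failed"])
change_failure_rate = failed / len(deployments)
print(f"change failure rate: {change_failure_rate:.0%}")  # 25% in this example
```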
Why it matters: High failure rates undermine trust in the release process and create hidden costs through firefighting. Low rates indicate quality in development and resilience in operations.
How to apply:
- Track failures through incident reports and blameless post-mortems.
- Investigate causes such as testing gaps, large release sizes, or weak observability.
- Use it as a learning tool, not a stick.
Good use: Learning from repeated rollback patterns and investing in better automated testing.
Bad use: Naming and shaming teams whose deployment caused an incident.
5. Mean Time to Restore (MTTR)
MTTR is the average time taken to restore service after an incident. It starts at detection and ends when the system is fully recovered.
It reflects how resilient your systems and your teams are when things go wrong.
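A minimal sketch of the calculation, assuming an incident log with detection and recovery timestamps; the record shape is an assumption for illustration.

```python
from datetime import datetime

# Hypothetical incident log entries with detection and recovery timestamps.
incidents = [
    {"id": "INC-1", "detected_at": "2024-03-01T14:00", "recovered_at": "2024-03-01T14:45"},
    {"id": "INC-2", "detected_at": "2024-03-07T03:10", "recovered_at": "2024-03-07T05:10"},
]

def restore_minutes(incident: dict) -> float:
    """Time from detection to full recovery, in minutes."""
    detected = datetime.fromisoformat(incident["detected_at"])
    recovered = datetime.fromisoformat(incident["recovered_at"])
    return (recovered - detected).total_seconds() / 60

mttr = sum(restore_minutes(i) for i in incidents) / len(incidents)
print(f"MTTR: {mttr:.1f} minutes")  # mean of 45 and 120 minutes
```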
Why it matters: Incidents are inevitable. The key question is not how rarely they happen, but how briefly customers are affected and how fast service is restored.
How to apply:
- Ensure monitoring and alerting detect issues immediately.
- Document clear on-call and rollback procedures.
- Share learnings from incidents across the organisation.
Good use: Reducing MTTR by improving observability and automating rollbacks.
Bad use: Pressuring on-call engineers or support to fix issues faster without addressing systemic gaps.
6. Team Health and Engagement
Team health is qualitative. It is measured through surveys, retrospectives, and conversations. Common questions cover workload, clarity of goals, and collaboration.
It shows whether delivery is sustainable and whether teams are engaged in their work.
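Even a qualitative signal can be summarised lightly. As a sketch, pulse-survey responses on a simple 1 to 5 scale could be averaged per question to prompt a conversation; the questions and scale here are illustrative assumptions, and the numbers should feed a discussion, not a scorecard.

```python
from statistics import mean

# Hypothetical pulse-survey responses on a 1-5 scale (higher is better).
responses = [
    {"workload": 3, "clarity_of_goals": 4, "collaboration": 5},
    {"workload": 2, "clarity_of_goals": 4, "collaboration": 4},
    {"workload": 2, "clarity_of_goals": 3, "collaboration": 4},
]

# Average each question across the team and flag anything trending low
# as a topic for the next retrospective.
for question in responses[0]:
    score = mean(r[question] for r in responses)
    flag = "  <- worth discussing" if score < 3 else ""
    print(f"{question}: {score:.1f}{flag}")
```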
Why it matters: Burnout and disengagement rarely appear in metrics until it is too late. Healthy teams deliver consistently and are more innovative. Unhealthy ones eventually fail regardless of what velocity says.
How to apply:
- Run regular pulse surveys and discuss results openly.
- Combine quantitative scores with qualitative feedback.
- Act visibly on feedback to build trust.
Good use: Adjusting priorities after surveys reveal overcommitment.
Bad use: Collecting survey data and filing it away without change.
Why These Metrics Work Better
Together, these measures provide a balanced view:
- Cycle time and lead time reflect flow.
- Deployment frequency shows delivery of value.
- Change failure rate and MTTR safeguard stability.
- Team health ensures sustainability.
Unlike vanity metrics, they cannot be gamed easily. They highlight systemic issues, tie engineering work to customer outcomes, and protect the human side of performance.
Conclusion: Measuring What Matters
Metrics like velocity aren’t evil; they are simply misused. In their proper place (as planning tools for teams) they can be useful, but they become damaging when turned into performance scorecards.
The bigger lesson is that vanity metrics of any kind mislead. They seduce with simplicity but fail to guide organisations towards genuine progress.
“The question isn’t ‘How many points did we complete?’ It’s ‘How quickly, safely, and sustainably are we delivering value to our customers?’”
Leaders have a choice: measure what is easy, or measure what is meaningful. Those who choose the latter stop confusing effort with effectiveness. They stop tracking motion and start measuring impact.
Because true engineering performance isn’t about chasing numbers. It’s about building resilient systems, empowered teams, and outcomes that matter.