The Problem with Test Coverage

There are some in the industry who swear by the criticality of test coverage. The implied conclusion seems to be that 100% test coverage equates to 100% correctness. However, that simply isn't true in all cases.

Blasphemy!

I know, I know. You might be upset about this take already, but let me explain how I got here. When I first wrote the post about Test Driven Development and Effective Unit Testing, I had been using TDD on problems where it happened to be a really good fit. At the time, I conflated my experience with those problems with the experience I would have on others. In short, I was over-generalizing. That doesn't take away from the fact that TDD works really well in the right situation, but that's not every situation. I still test my code, but the level of effort I put into testing correlates with the level of complexity and risk. The experiences below, with testing both as a concept and in practice, should help qualify those statements.

The problem with "correct"

Different engineering domains require different levels of verification. We wouldn't expect the same approach to testing in aeronautic software as we'd apply to a Slack notification bot (hopefully?). The stakes are very different.

I don't work in code where a bug can cost lives. So if I try to create a mathematical proof of correctness for every change I commit to a code base, not only is every change going to take much longer, those proofs will also be very difficult to maintain as the code evolves. As a more senior person on the team, the standards I promote for how we understand correctness tend to get incorporated into other devs' workflows, so it's not just me who would have to contend with this problem over time. A miscalculated standard of correctness hurts the whole team's velocity.

The other side of this is becoming complacent with our view of correctness. If I'm not testing at all, the bug density in production can get out of hand very quickly. We usually see the impact of not testing sooner than the impact of testing too much. I think that's part of why opinions on testing are so divisive in the industry, and why teams swing wildly between the two ends of that spectrum.

Also, for anyone out there thinking "just use Rust": no. Stop it. A language alone doesn't prevent bugs; good practices, workflows, and standards do. Rust can eliminate certain classes of bugs, but it doesn't automatically make your code work as intended.

The problem with "covered"

It's worth spelling out what coverage actually means. When you get a coverage report from running your tests, what you're seeing is a record of which code paths were executed. If I run a function in my test, the coverage output will show that the function was run.

Now let's note what is not in the definition of coverage: assertions. Coverage doesn't show whether I made good assertions about the behavior of those paths. That's not a factor in coverage at all. I can just run all the code, make no assertions at all, and have 100% test coverage.
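
To make this concrete, here's a minimal sketch in Python (the function, the file names, and the bug scenario are all invented for illustration). The test below executes every line of apply_discount, so a tool like coverage.py will happily report 100% coverage for it, yet it never checks a single result:

```python
# discount.py
def apply_discount(price: float, percent: float) -> float:
    """Return the price after applying a percentage discount."""
    if percent < 0 or percent > 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (1 - percent / 100)
```

```python
# test_discount.py -- 100% line coverage, zero assertions.
from discount import apply_discount

def test_apply_discount_runs():
    apply_discount(100.0, 25.0)  # result ignored; a wrong formula passes silently
    try:
        apply_discount(100.0, 150.0)  # error branch executed, outcome never checked
    except ValueError:
        pass
```

If someone later flipped the formula to price * (1 + percent / 100), this suite would stay green, still at 100% coverage.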

More often than not, if I run my tests and make good assertions, then I can be confident about the quality of the result. However, coverage alone gives me zero information to make those kinds of qualitative assessments.

Say it with me:

Requiring a certain code coverage percentage neither guarantees, nor necessarily correlates with, code quality.

In fact, in an environment with a lot of delivery pressure, it creates a strong incentive to cheese coverage, producing the opposite of the intended effect.

The problem with "100%"

100% coverage is a very high bar, and I think we can all agree on that. This is especially true for a team supporting non-trivial code in production for more than a few years. That said, some teams still work diligently to make the 100% standard a reality. It's completely possible, and it may even be sustainable and appropriate given the domain. However, there can be a darker side to the standard.

Working on one of these teams that maintained 100% coverage, while my own view of coverage was evolving, got me thinking about why we maintained the standard. I thought about it for a while, and eventually the answer became clear: ego. We loved looking at that shiny 100% coverage number and feeling the pride of ownership that came with it. We identified with that number not just as a metric of coverage, but as a measure of our capability.

The problem with ego

Pride, as deadly a sin as it can be, eventually became a real problem. There came a time when we were under significant pressure to get a feature out the door. We worked tirelessly to make sure we had met all the requirements and had reviewed the feature with stakeholders to get their approval (more eyes make bugs shallow). We got it into production on time, but we had to make up the time somewhere, and that somewhere was our tests. The users loved the feature and were overjoyed that we were able to ship it. They told us how worried they had been for a bit, and how proud they were that we got it done... but the team felt defeated. Our golden 100% was no more.

After some decompression, we were able to gain a little perspective. We talked it out as a team and realized that we weren't even chasing quality anymore. Bugs still happened at 100% coverage, so how important could that number be? We found that putting so much pressure on ourselves to hit 100% coverage likely contributed to putting us behind in the first place. That's when we decided to revise the standard for the sake of sustainability, quality, and peace of mind.

A new standard - Critical Path

Keep in mind, this might not be right for you and your team, but it's what works for us right now. I think it's healthy to regularly review standards to make sure they're still providing value and solving problems we still have.

Risk vs Complexity

Before we make a change to a code base, we first try to quantify the risk and complexity of the change. Not only does this help us qualify the time the change will take, it also helps us understand the test burden. A quick definition of terms in this context:

  • Risk: this can mean a few different things. It can relate to the "risk of the unknown," when we really aren't sure how well the change will work out (it probably needs more discovery or prototyping). It can also relate to the risk or consequence of failure: if the code supports or provides customer interactions, there's an inherent optics risk that needs to be accounted for, and if you work in a domain where life and limb are at stake, your baseline risk is much higher. A foundational aspect of risk is understanding how difficult it would be to identify and resolve a problem in the affected area of the code. We've spun up new repos just to test the behavior of dependencies that weren't well understood, as a way to address this particular risk.
  • Complexity: this isn't just a measure of "difficulty"; it's a little more pragmatic about the realities of software engineering. A change that has to interact with more "legacy" systems counts as higher complexity, because we don't often make changes around those systems. If there's a lot of technical debt in that area of the code base, we may incorporate some tech-debt remediation at the cost of higher complexity: when we know the affected area of code isn't in the best shape, we want to leave things better than we found them, so we'll often take a hit on complexity in favor of code cleanliness.

Categorizing Code Paths

Once we have an idea of the risk and complexity of a change, we can identify which code paths contribute most to that assessment. These are the "critical paths" of the code: the paths that must be covered by tests with good assertions. We can still write more tests where it makes sense, but the critical path is a hard requirement validated in peer review. With this approach we're able to strike a balance between test coverage and time to delivery, with good results so far.
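
To sketch the difference in practice, here are the hypothetical discount tests again, this time written the way we'd expect for a flagged critical path: they pin down concrete behavior instead of merely executing it.

```python
# test_discount_critical.py -- critical-path tests assert concrete behavior.
import pytest

from discount import apply_discount

def test_discount_reduces_price_by_the_exact_percentage():
    assert apply_discount(100.0, 25.0) == 75.0

def test_zero_percent_discount_is_a_no_op():
    assert apply_discount(100.0, 0.0) == 100.0

def test_out_of_range_percent_is_rejected():
    with pytest.raises(ValueError):
        apply_discount(100.0, 150.0)
```

Coverage-wise, these tests are indistinguishable from the assertion-free version earlier, which is exactly the point: peer review focuses on the assertions, not the percentage.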

This also means that our understanding of code coverage has fundamentally changed. When we see >80% coverage in a repo where we've applied this method for a while, we look at that as a warning sign because it indicates a large proportion of critical path code.

High test coverage is an indicator that this is "hazmat code."

Testing Hazmat Code

When we know we're operating in code with high inherent risk and complexity, we shift our testing approach. A change to this code can have wide-reaching impact, and refactoring must be done carefully. This is the environment where it's much more common to see TDD in action, and where there are usually a lot more comments explaining how things are supposed to behave. High risk and complexity mean we need to set aside significant time to work in this code, and to habitually push back on reactive feature requests that touch hazmat code. Intentionally taking it slow where it really matters means we can allow ourselves to deliver faster everywhere else. If this approach becomes a problem, it could be a sign that we need to de-risk the hazmat code with better abstractions or architectural decisions.
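
In practice, that often means writing the test before touching the hazmat code at all and letting it fail first. Here's a hypothetical sketch (the ledger module, the post_transaction function, and the invariant are all invented for illustration):

```python
# test_ledger.py -- written before the change, TDD-style. In hazmat code,
# comment-heavy tests double as documentation of intended behavior.
from decimal import Decimal

from ledger import post_transaction  # hypothetical high-risk module

def test_posting_preserves_the_zero_sum_invariant():
    # Invariant: the signed entries of a posted transaction must net to
    # zero (double-entry bookkeeping). This test fails until the posting
    # logic enforces it, which is the point: the expected behavior is
    # pinned down before the risky change is made.
    entries = post_transaction(
        debit=("accounts_receivable", Decimal("100.00")),
        credit=("revenue", Decimal("100.00")),
    )
    assert sum(amount for _account, amount in entries) == 0
```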

Conclusion

Hopefully I've helped you understand that 100% test coverage does not equate to 100% correctness, and can in fact be a harmful standard. Blanket coverage-percentage standards are misguided: they don't actually ensure quality, and in the wrong environment they can actively hurt it. Taking a balanced, informed approach lets us focus on the right things: quality and delivery. Measuring risk and complexity helps us calibrate our view of correctness in a way that is healthy and sustainable. If you see others chasing a number, share this post with them to offer another perspective.