Quality control in a world of agentic software development

Written by ClearPoint | Jun 15, 2026 2:31:38 AM

Agents now write a meaningful share of production code. Most teams' quality controls were designed for human authors. For an agentic SDLC, the required throughput and failure modes are different.

Three failure modes come up repeatedly:

Agents that write both tests and the implementation code tend to produce tests that pass for the wrong reasons;
Agents without organisational context write code that's technically correct but doesn't fit your team's conventions or decisions;
Agents without tight feedback loops drift, turning small wrong turns into significant rewrites.

None of the controls that address these problems are new. What's changed is which ones are now load-bearing, and how cheap they've become to run.

Walking the SDLC

To illustrate, I will walk through the software development lifecycle (SDLC), naming the failure mode at each stage and a control that addresses it. Think of the pipeline in five stages:

Plan → Design → Build → Verify → Ship

Each stage has a natural point where the right control costs little and catches a lot.

Plan: give agents organisational knowledge

The most upstream control isn't a tool or a test. It's giving the agent your team's accumulated knowledge before it starts. Agents are good at reading text, and a small knowledge base sitting alongside your codebase that captures customer context, architectural direction, your threat model, and stylistic conventions improves output relevance before a single line of code is written. This shifts left on context you'd otherwise repeat in every code review.

Critical to this is ensuring that the agent uses that knowledge at the right moment. An approach we’ve been experimenting with is structuring knowledge as agent Skills, which load a brief description of each knowledge file into every session. This makes agents aware of what knowledge exists, while letting them pull the full text into their context window only when it is relevant. The result is an agent that cares about the things your organisation cares about, not just the things that compile.
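As a rough illustration of that "short description always, full text on demand" pattern, here is a minimal Python sketch. It is not the Skills format itself: the `KnowledgeFile` class and both function names are hypothetical, chosen only to show the shape of the idea.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class KnowledgeFile:
    """One knowledge file: a short summary plus the path to the full text."""
    name: str
    description: str  # always shown to the agent
    path: Path        # read only when the agent asks for it


def session_preamble(files: list[KnowledgeFile]) -> str:
    """Build the short index that is loaded into every session."""
    lines = ["Available knowledge (ask for a file by name to load it):"]
    lines += [f"- {f.name}: {f.description}" for f in files]
    return "\n".join(lines)


def load_full(files: list[KnowledgeFile], name: str) -> str:
    """Return the full text of one file, on demand."""
    for f in files:
        if f.name == name:
            return f.path.read_text(encoding="utf-8")
    raise KeyError(f"no knowledge file named {name!r}")
```

The preamble costs a line per file in every session, while the full documents only consume context when the agent decides it needs them.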

Design: run threat modelling on every feature

Threat modelling (the practice of identifying cybersecurity risks that your system must control for) used to be too expensive for most organisations to perform at all, let alone on every new feature. It required dedicated time, cybersec people in the room, and often specialist software.

Because threat modelling is largely an exercise in reflection and idea generation (guided by structures like STRIDE), AI makes it cheap enough to do for everything. Producing threat models as a design-stage deliverable ensures that required controls are in place. It also prevents the waste of “shooting from the hip” - adding security controls that aren’t needed. It encourages engineers to step back and consider their software more holistically at design time.

Frontier models are trained on specialist knowledge and are well suited to surfacing risks that an engineer focused on a single feature wouldn't see: trust boundaries, data flows, and assumptions that look fine in isolation but leave the door unlocked for attackers in real conditions.

Threat modelling Skills exist to get consistent results, but getting started requires no specialist tooling. Just ask your agent to build a threat model with you. Build this into your process. The cost is low, and the value can be high.
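If you want slightly more structure than a free-form request, one option is to generate the prompt from the six STRIDE categories. The sketch below is only an assumption about how you might do that: the question wording and the `threat_model_prompt` function are illustrative, not taken from any particular Skill.

```python
# The six STRIDE categories, each with an illustrative prompting question.
STRIDE = {
    "Spoofing": "Can someone pretend to be another user or service?",
    "Tampering": "Can data or code be modified without authorisation?",
    "Repudiation": "Can someone deny an action because it was not logged?",
    "Information disclosure": "Can data reach someone who should not see it?",
    "Denial of service": "Can the component be made unavailable?",
    "Elevation of privilege": "Can someone gain rights they were not granted?",
}


def threat_model_prompt(feature: str, components: list[str]) -> str:
    """Build a prompt that walks every component through every category."""
    lines = [f"Build a threat model for: {feature}", ""]
    for component in components:
        lines.append(f"Component: {component}")
        for category, question in STRIDE.items():
            lines.append(f"  [{category}] {question}")
    lines.append("")
    lines.append(
        "For each item, state the risk and the control needed, "
        "or say that no control is needed."
    )
    return "\n".join(lines)
```

Asking explicitly for "no control is needed" answers is what keeps the exercise from turning into a list of controls nobody asked for.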

Build: use separate agents for tests and code

Software engineers have been using test-driven development (TDD) to improve code quality for decades, but many see it as too expensive and cumbersome. In an AI-assisted workflow, the return on investment is much higher.

For best results, have one agent write the tests and a different agent write the code. When one agent does both, it naturally writes tests that prove the code it just wrote works. When a separate agent writes the tests first, the implementation agent has an independent specification to satisfy. Evergreen tests (tests that stay green whether or not the code is correct) are a real risk - tests written without knowledge of the implementation have a higher chance of passing for the right reasons.
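One cheap way to spot evergreen tests is the classic red-green check: run the independently written tests against a do-nothing stub before the implementation exists. Any test that already passes is suspect. The sketch below assumes a hypothetical `apply_discount(price, percent)` spec; every name in it is illustrative.

```python
from typing import Callable

Impl = Callable[[float, float], float]
Spec = Callable[[Impl], None]  # a test: raises if the behaviour is wrong


def passes(spec: Spec, impl: Impl) -> bool:
    """True if the test completes without raising."""
    try:
        spec(impl)
    except Exception:
        return False
    return True


def red_green(specs: dict[str, Spec], stub: Impl, impl: Impl) -> dict[str, list[str]]:
    """Report tests that pass against the stub (suspect) or fail the real code."""
    return {
        "suspect": [name for name, s in specs.items() if passes(s, stub)],
        "failing": [name for name, s in specs.items() if not passes(s, impl)],
    }


def stub(price: float, percent: float) -> float:
    return 0.0  # deliberately wrong placeholder


def apply_discount(price: float, percent: float) -> float:
    return price * (100 - percent) / 100


def check_ten_percent(impl: Impl) -> None:
    assert impl(200.0, 10.0) == 180.0


def check_returns_number(impl: Impl) -> None:
    assert isinstance(impl(200.0, 10.0), float)  # too weak: the stub passes it


report = red_green(
    {"ten_percent": check_ten_percent, "returns_number": check_returns_number},
    stub,
    apply_discount,
)
```

Here `report["suspect"]` contains `returns_number`, because that test cannot tell the stub from the real implementation.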

This discipline is easy to lose - agents quietly stop following workflows as context is compressed - so codifying TDD as a documented skill is important. Frameworks like Superpowers provide battle-tested implementations of such workflows for free.

Build: have an agent review code before asking a human

The #1 AI improvement we recommend to clients is to turn on automatic agentic PR review. A second agent brings a fresh context window, and prevents an agent from “marking its own homework”. When agents find the obvious issues, the human reviewer can focus on what they do best. AI also catches classes of errors that the author agent is structurally unlikely to catch itself: tests which do something different to their name, obvious security issues, or fighting the framework.

The next evolution is having the reviewer agent leave comments, then having the author agent address them and reply with rationale. They can then go back and forth until they’re both satisfied that it’s ready for a human. By the time a human reviewer looks at the code, the obvious problems are already resolved and the human review becomes genuinely higher-value.

Another improvement is multiple AI reviewers - each with a speciality. This multiplies what the loop catches: a security-focused reviewer notices vulnerabilities, an architecture-focused reviewer spots bad abstractions, and a domain expert catches code that misses the customer problem. Loading each agent with different context and a different lens is what makes the adversarial review more than just running the same check twice.
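The review loop and the specialist reviewers can be sketched together in a few lines of Python. This is a shape, not an implementation: the reviewers and the author are plain callables standing in for agent calls, and `review_loop`, `Comment`, and the `max_rounds` cap are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Comment:
    reviewer: str
    text: str


Reviewer = Callable[[str], list[str]]         # diff -> comment texts
Author = Callable[[str, list[Comment]], str]  # diff + comments -> revised diff


def collect(diff: str, reviewers: dict[str, Reviewer]) -> list[Comment]:
    """Ask every specialist reviewer for comments on the current diff."""
    return [
        Comment(name, text)
        for name, review in reviewers.items()
        for text in review(diff)
    ]


def review_loop(
    diff: str,
    reviewers: dict[str, Reviewer],
    author: Author,
    max_rounds: int = 3,
) -> tuple[str, list[Comment]]:
    """Iterate until no reviewer objects or the round limit is reached.

    Returns the final diff plus any comments still open for the human.
    """
    for _ in range(max_rounds):
        comments = collect(diff, reviewers)
        if not comments:
            return diff, []
        diff = author(diff, comments)
    return diff, collect(diff, reviewers)
```

The round limit matters: without it, two agents that disagree can go back and forth indefinitely, so anything unresolved is handed to the human reviewer instead.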

Verify: turn your existing tools into a real-time feedback loop

Your existing linters, static analysis tools, and automated tests were useful for human authors. For agents making more frequent changes they are even more important: they become a real-time feedback mechanism that keeps the agent pointed in the right direction during the work, not after the pull request is open. Tell your agents about your tools and workflows in AGENTS.md/CLAUDE.md, and wire lightweight checks to run automatically as a pre-commit hook. This helps the agent to continuously course correct rather than accumulating errors.
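A pre-commit hook can be as small as a script that runs each check and returns a non-zero status if any of them fail. The sketch below assumes a Python project that uses `ruff` and `pytest`; those two commands are placeholders for whatever your team already runs.

```python
import subprocess
import sys

# Assumed tooling for illustration only; substitute your own commands.
CHECKS = [
    ["ruff", "check", "."],
    ["pytest", "-q", "-x"],
]


def run_checks(checks: list[list[str]]) -> list[tuple[list[str], str]]:
    """Run each command; return (command, output) for every failure."""
    failures = []
    for cmd in checks:
        try:
            result = subprocess.run(cmd, capture_output=True, text=True)
        except OSError as exc:
            # A missing tool counts as a failure, not a silent pass.
            failures.append((cmd, str(exc)))
            continue
        if result.returncode != 0:
            failures.append((cmd, result.stdout + result.stderr))
    return failures


def main(checks: list[list[str]]) -> int:
    """Print failures so the agent can read them; return the exit status."""
    failures = run_checks(checks)
    for cmd, output in failures:
        print(f"FAILED: {' '.join(cmd)}\n{output}", file=sys.stderr)
    return 1 if failures else 0
```

To use it as a hook, call `sys.exit(main(CHECKS))` from an executable `.git/hooks/pre-commit` script. Git aborts the commit when the hook exits non-zero, and the printed output is the feedback the agent uses to correct course.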

If you haven’t heard of mutation testing, it’s another old technique that has proven even more important with AI-generated tests. Mutation testing deliberately introduces small changes to production code behaviour, then runs your test suite to confirm that those errors are caught. If the tests stay green despite the introduced errors, the tests aren't doing what you think they are. You can fully automate this with a scheduled GitHub Action - from discovery to an agent raising a PR to improve the tests.
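To make the idea concrete, here is a toy mutation test in Python using only the standard library. It applies a single mutation (turning `>=` into `>`) to a hypothetical `is_adult` function and checks whether two illustrative test suites notice. Real mutation testing tools apply many such operators across a whole codebase; this sketch only shows the principle.

```python
import ast

SOURCE = """
def is_adult(age):
    return age >= 18
"""


class SwapGtE(ast.NodeTransformer):
    """Mutation operator: turn every `>=` into `>`."""

    def visit_Compare(self, node: ast.Compare) -> ast.Compare:
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node


def load(tree: ast.Module):
    """Compile a syntax tree and return the is_adult function it defines."""
    namespace: dict = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace["is_adult"]


def weak_suite(fn) -> bool:
    # Never probes the boundary value, so it cannot see the mutation.
    return fn(30) is True and fn(5) is False


def strong_suite(fn) -> bool:
    # Adds the boundary case that the mutation breaks.
    return weak_suite(fn) and fn(18) is True


def mutant_survives(suite) -> bool:
    """True if the suite still passes against the mutated code."""
    tree = ast.fix_missing_locations(SwapGtE().visit(ast.parse(SOURCE)))
    return suite(load(tree))
```

The weak suite lets the mutant survive, which is the signal that a boundary test is missing; the strong suite kills it.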

Your custom prompts and Skills also benefit from their own automated tests. Tools like Promptfoo allow you to write straightforward configuration files that verify your Skills and prompts behave as expected. It's worth treating these the same way you'd treat unit tests for code.
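The underlying idea is the same as a unit test: fixed inputs, and assertions on the output. The sketch below is a hand-rolled Python illustration of that idea, not Promptfoo's configuration format. The `check_prompt` function, the case dictionary keys, and the model callable are all hypothetical.

```python
from typing import Callable

Model = Callable[[str], str]  # prompt text -> model reply


def check_prompt(template: str, model: Model, cases: list[dict]) -> list[str]:
    """Run each case through the model and return a list of failed assertions."""
    failures = []
    for case in cases:
        reply = model(template.format(**case["vars"])).lower()
        for needle in case.get("contains", []):
            if needle.lower() not in reply:
                failures.append(f"{case['name']}: missing {needle!r}")
        for needle in case.get("not_contains", []):
            if needle.lower() in reply:
                failures.append(f"{case['name']}: unexpected {needle!r}")
    return failures


TEMPLATE = "Review this diff for security issues:\n{diff}"

CASES = [
    {"name": "flags eval", "vars": {"diff": "x = eval(data)"}, "contains": ["eval"]},
    {"name": "clean diff", "vars": {"diff": "x = 1"}, "not_contains": ["eval"]},
]
```

In practice `model` would call your real provider. Substring checks like these are crude, but they catch a prompt edit that silently stops the Skill from doing its job.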

Ship: pipeline checks must block, not advise

Everything in the previous section needs to run in your CI/CD pipeline, not just in the development environment. In-session checks give the agent fast feedback during the work, but agents can choose to ignore them.

Pipeline controls like linting, unit tests, SAST and SCA gate what reaches production. The single most important principle across all of this: checks must block the pipeline, not advise. Agents must not have permissions that allow them to override these controls. Humans and agents get tunnel vision on delivering the current feature, and ignore advisory checks under pressure. If a check can be overridden, it isn't a control.
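One way to keep yourself honest is to audit the pipeline definition itself. The sketch below assumes you can express each check as a small record; the `Check` fields and the `audit` function are hypothetical, not the schema of any CI system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    blocking: bool            # a failure stops the pipeline
    agent_can_override: bool  # an agent identity may skip or force-pass it


def audit(checks: list[Check]) -> list[str]:
    """List every check that is advisory or that an agent could bypass."""
    problems = []
    for check in checks:
        if not check.blocking:
            problems.append(f"{check.name}: advisory only")
        if check.agent_can_override:
            problems.append(f"{check.name}: agents can override")
    return problems


PIPELINE = [
    Check("lint", blocking=True, agent_can_override=False),
    Check("unit tests", blocking=True, agent_can_override=True),
    Check("SAST", blocking=False, agent_can_override=False),
    Check("SCA", blocking=True, agent_can_override=False),
]

findings = audit(PIPELINE)
```

For this example pipeline, `findings` reports that the unit tests can be overridden by agents and that SAST is advisory only; by the principle above, neither is yet a control.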

What to do on Monday morning

Most of these controls aren't new. TDD, threat modelling, and pipeline gates all predate AI-assisted development by years. What's changed is how cheap they are to run and how much weight they now carry when agents are doing the building.

If your team is shipping AI-written code without tacit knowledge encoded in Skills, adversarial review in the loop, and blocking pipeline gates, start with those three: each is a cheap investment that vests quickly.
