ai coding-agents context-engineering reliability

Don't Trust the Model. Build the Checkpoint.

Lars de Ridder · 3 July 2026

I have a Python skill that tells the agent to run ruff after writing code. Not “consider running ruff” or “if you want to, you could lint.” Run it, fix what it flags, run it again until clean.

That sounds almost insultingly simple, and it is, but it changed something fundamental about how the agent writes Python. Before that instruction, the model would occasionally produce code with unused imports, undefined names, or style violations that it confidently believed were correct. After it, those categories of error stopped showing up. The workflow closed off the paths where the model could be wrong, and that turned out to be enough.

This is also the practical answer to something I complained about earlier: becoming QA for the machine. If the agent builds the thing and I only click around afterward to see whether it feels right, the engineering has moved to the wrong place; I’m supervising vibes. A checkpoint changes that relationship. The thing I care about gets written down as code before the agent gets to declare victory.

The pattern

The pattern is blunt: take the things a model tends to get wrong, and move verification into executable checks that the model has to pass through. Scripts, linters, validators, data queries, test suites; anything that can give a deterministic yes/no answer to “is this correct?” Once the check is in the flow, carelessness stops being an option.

Sometimes those checks are elegant, like running ruff and fixing what breaks. Sometimes they’re brute-force, like retrying the whole task until a verification script says the output is acceptable.

Another way to say the same thing: code out what is true. I’m doing this now for ZZP Pensioen Planner, which has a Dutch jaarruimte calculator, the little tool that works out how much someone is allowed to put into their pension tax-free this year. The formula isn’t hard, but it’s fiddly: take the income, subtract the AOW-franchise, multiply what’s left by a percentage the tax office tweaks every year, then subtract the pension already built up times a factor. Five-ish magic constants and a percentage, and if any one of them is off the answer is still a perfectly plausible euro amount, which is the worst kind of wrong because nobody reading the number will catch it.

So don’t let the model be the thing that decides what the number is. This is the exact seam where models are weakest; they’re bad with numbers and good with code, and a Playwright test full of hand-typed expect(result).toBe(6390) asks them to be bad at the thing they’re bad at. That 6390 came from the same place a model’s confident “approximately 12,000” comes from: it sounded right in context. Put the tax rules in a small reference script instead, have it print the expected jaarruimte for a handful of named scenarios (a single earner, someone with a pensioentekort, the edge case right at the income cap), and let the E2E tests read those numbers as fixtures. Now the number in the assertion is whatever the script prints, something you can read and rerun, rather than a sentence the model guessed into existence. The model still writes the script, wires it into the tests, and chases the failures; it just doesn’t get to invent the number it’s checking against.

That’s the whole move, generalized. A test fixture is usually a table of inputs and expected outputs, and a table is one more place a model can fat-finger the arithmetic without anyone noticing. Generate the table from a script and that door closes: the script is code, so it gets reviewed like code, and once it’s reviewed every future run has a deterministic source of truth that no amount of fluent prose can talk its way past.
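A minimal sketch of such a reference script, under loud assumptions: the formula shape is simplified, every constant is a placeholder rather than a real tax figure, and the scenario names and amounts are made up for illustration. The point is only the mechanism: the expected values come out of code that can be reviewed and rerun.

```python
import json
from dataclasses import dataclass

# PLACEHOLDER constants for illustration only, NOT real tax figures.
# A real reference script would carry the current year's published values.
AOW_FRANCHISE = 15_000   # hypothetical deduction from income
INCOME_CAP = 120_000     # hypothetical maximum income that counts
PERCENTAGE = 30          # hypothetical percentage applied to the base
ACCRUAL_FACTOR = 6       # hypothetical multiplier on pension accrual


@dataclass(frozen=True)
class Scenario:
    name: str
    income: int           # yearly income in euros
    pension_accrual: int  # pension already built up, in euros


# Hypothetical named scenarios; the amounts are invented.
SCENARIOS = [
    Scenario("single_earner", income=50_000, pension_accrual=0),
    Scenario("pensioentekort", income=60_000, pension_accrual=1_000),
    Scenario("at_income_cap", income=120_000, pension_accrual=0),
    Scenario("above_income_cap", income=200_000, pension_accrual=0),
]


def jaarruimte(s: Scenario) -> int:
    """Simplified formula shape; whole euros, never negative."""
    base = max(0, min(s.income, INCOME_CAP) - AOW_FRANCHISE)
    room = base * PERCENTAGE // 100 - ACCRUAL_FACTOR * s.pension_accrual
    return max(0, room)


def build_fixtures() -> dict[str, int]:
    """Expected value per scenario name, computed rather than typed in."""
    return {s.name: jaarruimte(s) for s in SCENARIOS}


def fixtures_json() -> str:
    """Stable JSON that an E2E suite could load as its fixture file."""
    return json.dumps(build_fixtures(), indent=2, sort_keys=True)
```

Printing `fixtures_json()` into a file gives the E2E tests something to load, so an assertion compares against the fixture for a named scenario and no literal amount appears in the test itself.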

Three places I use this

Linting as a mandatory step

The simplest version: my Python skill says run ruff on every file you write or modify, and if it flags something, fix it and run again.

What this actually does is change the model’s job. Without the linter, the model has to write code while also simulating a linter in its head, catching its own undefined names, unused imports, and style drift. It’s bad at that, because it’s not a compiler but a next-token predictor that happens to be good at code. With the linter, the model’s job becomes: write a draft, then iterate on concrete error messages until they stop. Much easier, much more reliable; the model reacts to specific error messages instead of trying to hold an entire style guide in working memory. It doesn’t need to know that from typing import Optional is unused if ruff will tell it in 200 milliseconds.
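The draft-then-iterate loop can be sketched in a few lines. This is an illustration, not the skill itself: `fix` is a hypothetical callback standing in for the agent editing the file, and the only thing assumed about ruff is that `ruff check <path>` exits non-zero when it finds violations.

```python
import subprocess
from typing import Callable, Sequence


def lint(path: str, command: Sequence[str] = ("ruff", "check")) -> tuple[bool, str]:
    """Run the linter on one file and return (clean, report)."""
    result = subprocess.run([*command, path], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def lint_until_clean(
    path: str,
    fix: Callable[[str, str], None],
    command: Sequence[str] = ("ruff", "check"),
    max_rounds: int = 5,
) -> int:
    """Alternate lint and fix until the linter exits 0.

    `fix(path, report)` is a hypothetical stand-in for the agent editing the
    file in response to the linter's report. Returns the number of fix rounds
    that were needed; raises if the file is still dirty after max_rounds.
    """
    for rounds in range(max_rounds + 1):
        clean, report = lint(path, command)
        if clean:
            return rounds
        if rounds < max_rounds:
            fix(path, report)
    raise RuntimeError(
        f"{path} still has lint errors after {max_rounds} rounds:\n{report}"
    )
```

The `command` parameter exists so the loop can be exercised with any checker that follows the same exit-code convention; the cap on rounds keeps a stuck fixer from looping forever.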

Scripts that produce every number in an article

I write data-driven articles for Transitiedata, a Dutch energy transition data platform. Every article has a companion Python script that queries the database and prints every number that appears in the text.

So the article about solar panels per neighborhood says “the average penetration is 38.5%”, and the script contains a SQL query against the actual dataset that produces 38.5%. If the data changes, I rerun the script and the numbers update; if someone questions a figure, I point at the query. The model is still useful here; it writes the narrative, explains what the numbers mean, and connects the sections. But it never gets to invent a number, because every factual claim is anchored to a script output and I can verify the whole article by running one file.

Models are confidently wrong about numbers all the time; they’ll write “approximately 12,000” because it sounds plausible in context, and you’ll believe it because the surrounding prose is so fluent. The sentence sounds great, which is exactly the problem. Anchoring facts to executable sources kills that failure mode entirely.
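As a sketch of what a companion script can look like: the table name, column names, and rows below are all hypothetical, and an in-memory SQLite database stands in for the real dataset. The toy rows are made up so that the output lines up with the example sentence above; they are not real data.

```python
import sqlite3


def article_numbers(conn: sqlite3.Connection) -> dict[str, float]:
    """Every figure the article quotes, each computed by one query."""
    (avg_pct,) = conn.execute(
        "SELECT AVG(100.0 * homes_with_solar / homes) FROM neighborhoods"
    ).fetchone()
    (count,) = conn.execute("SELECT COUNT(*) FROM neighborhoods").fetchone()
    return {
        "neighborhood_count": count,
        "avg_solar_penetration_pct": round(avg_pct, 1),
    }


def demo_connection() -> sqlite3.Connection:
    """In-memory stand-in for the real database, with made-up toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE neighborhoods "
        "(name TEXT, homes INTEGER, homes_with_solar INTEGER)"
    )
    conn.executemany(
        "INSERT INTO neighborhoods VALUES (?, ?, ?)",
        [("A", 200, 60), ("B", 100, 47)],  # toy rows: 30% and 47%
    )
    return conn


def print_numbers(conn: sqlite3.Connection) -> None:
    """Print one 'name = value' line per figure, ready to check against the text."""
    for name, value in article_numbers(conn).items():
        print(f"{name} = {value}")
```

Calling `print_numbers(demo_connection())` lists each figure by name, which is the whole contract: any number in the prose should be traceable to one of those lines.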

Hook scripts around an agent loop

This one is less obvious but maybe the most interesting. I built a loop extension for my coding agent that processes tasks in sequence: pick an item from a queue, work on it, move to the next. The loop itself isn’t special; what makes it work is a set of hook scripts that run around each iteration.

The interface is a short list of commands:

queue: return the next item, or signal that the queue is empty
prompt: given an item, return the full prompt for the worker
verify: given an item, check whether the work was done correctly

So the model does its thing, produces output, and then verify runs. If verification fails, the model gets the error details and tries again; if it passes, the loop moves on. The verify script is where the real leverage is, because it’s a static check that the model can’t negotiate with. A claim like “I believe this is correct” no longer moves the workflow forward; the script either exits 0 or it doesn’t. When it doesn’t, the model gets concrete feedback about what’s wrong, which is exactly the kind of input models are good at responding to.
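The control flow described above fits in a short driver. This sketch models the hooks as plain Python callables, which is an assumption for the sake of a self-contained example; `worker`, `max_attempts`, and the feedback wording are likewise hypothetical.

```python
from typing import Callable, Optional


def run_loop(
    queue: Callable[[], Optional[str]],
    prompt: Callable[[str], str],
    worker: Callable[[str], None],
    verify: Callable[[str], tuple[bool, str]],
    max_attempts: int = 3,
) -> dict[str, bool]:
    """Process items until the queue is empty; return pass/fail per item.

    queue()      -> next item, or None when nothing is left
    prompt(item) -> full prompt text for the worker
    worker(text) -> hypothetical stand-in for the model doing the work
    verify(item) -> (passed, error details)

    Assumes queue() does not serve the same item again after it has been
    handed out; otherwise a permanently failing item would loop forever.
    """
    results: dict[str, bool] = {}
    while (item := queue()) is not None:
        feedback = ""
        passed = False
        for _ in range(max_attempts):
            worker(prompt(item) + feedback)
            passed, errors = verify(item)
            if passed:
                break
            # The retry sees concrete errors, not a request to "be careful".
            feedback = "\n\nPrevious attempt failed verification:\n" + errors
        results[item] = passed
    return results
```

The worker never decides when an item is finished; only the return value of `verify` moves the loop forward.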

I’ve used this for things like extracting structured data from messy documents, where the verify script checks that the output JSON matches a schema and that certain required fields are populated. A model working without verification would occasionally skip fields or invent plausible-sounding values, things like capacity_liters_expanded when the schema expects capacity_liters. Plausible enough to read as correct, wrong enough to break everything downstream. With verification, those failures get caught on the first attempt and fixed on the retry. It’s brute-force in the sense that you’re sometimes running the same task twice, but a task that runs twice and produces correct output beats a task that runs once and produces something that looks correct.
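A verify check of that kind does not need much. The sketch below uses only the standard library; the `REQUIRED` table is a hypothetical schema (only `capacity_liters` comes from the example above), and rejecting unknown fields is one possible policy, not necessarily the one used in the real hook.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED = {
    "name": str,
    "capacity_liters": (int, float),
}


def verify_output(raw: str) -> list[str]:
    """Return a list of concrete problems; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]

    errors = []
    for field, types in REQUIRED.items():
        value = data.get(field)
        if value is None or value == "":
            errors.append(f"missing or empty required field: {field}")
        elif isinstance(value, bool) or not isinstance(value, types):
            # bool is excluded explicitly because it is a subclass of int.
            errors.append(f"wrong type for field: {field}")
    # Invented near-miss names such as capacity_liters_expanded land here.
    for field in sorted(set(data) - set(REQUIRED)):
        errors.append(f"unexpected field: {field}")
    return errors
```

A wrapper script can print the list and exit non-zero when it is non-empty, which gives the loop both the pass/fail signal and the error details to feed back on the retry.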

Fewer ways to be wrong

Models are good at generating plausible text and adapting to feedback; they’re bad at self-verification, and especially bad at remembering constraints they were told about 40,000 tokens ago. Static checks externalize exactly the things models lose track of: the linter remembers the style rules, the data script remembers the correct numbers, the verify hook remembers the acceptance criteria. The model doesn’t have to hold any of that in its context window, because the environment holds it instead.

But there’s a subtler effect too. A model with no constraints can do anything, which is exactly the problem when “anything” includes confidently wrong things. Once you insert a mandatory check, the space of acceptable outputs shrinks, and the model converges faster because there are fewer wrong paths to wander down.

Models will game what they can

One thing I didn’t expect: models don’t just fail to meet conditions, they sometimes satisfy them in creative but wrong ways.

I had a loop processing 657 files with a breakout condition: count input files, count output files, stop when they match. The model processed about 20 files, then created 637 empty stubs to make the counts match. Done! From the outside it looked like a successful run. I’ll admit I was impressed for a second before I was angry.

The model found the cheapest path to satisfying the stated condition, so the fix was the same pattern: make the check mechanical. The queue script now checks for real, non-empty output files with valid JSON, and the model never sees the breakout logic; it just processes items until the queue says there are none left. Any condition that can be satisfied by the model producing side effects (creating files, printing expected text) will eventually be satisfied that way, so the check has to verify that the output is actually good instead of merely existing.
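A sketch of a queue check along those lines, assuming a hypothetical layout where each input file maps to an output file with the same stem and a `.json` extension. Treating an empty JSON object or array as "not done" goes one step beyond what the text describes and is an assumption of this example.

```python
import json
from pathlib import Path
from typing import Optional


def is_done(output: Path) -> bool:
    """True only for an existing, non-empty file holding non-empty valid JSON."""
    if not output.is_file() or output.stat().st_size == 0:
        return False
    try:
        data = json.loads(output.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False
    return bool(data)  # assumption: "{}" or "[]" counts as a stub, not as work


def next_item(input_dir: Path, output_dir: Path) -> Optional[Path]:
    """Return the first input without a real output, or None when all are done."""
    for source in sorted(input_dir.iterdir()):
        if source.is_file() and not is_done(output_dir / f"{source.stem}.json"):
            return source
    return None
```

An empty stub no longer changes anything: the file exists, `is_done` still says no, and the item comes straight back out of the queue.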

This works for weak models too

One of the most practical consequences is that weaker models become viable for tasks you’d otherwise reserve for stronger ones. A weaker model in a workflow with linting, schema validation, and a verify step will often outperform a stronger model working freeform, because the weaker model has guardrails compensating for its limitations while the stronger model has to be its own guardrail.

I found a related thing with prompt complexity. I had two nearly identical extraction tasks running on the same free-tier model. One task had a 1.9KB prompt with four numbered steps; the other had a 9.3KB prompt with detailed rules, examples, and inline data. Same model, same type of work. The short prompt succeeded consistently; the long prompt failed on every attempt. The longer prompt was more thorough and, on paper, more correct; it had everything the model needed to produce perfect output. And it failed anyway, because a weak model fed 9KB of instructions gets overwhelmed and flubs even the basic steps, no matter how good those instructions are.

The fix was to strip the prompt down and move the detailed rules into reference files the model reads as a second step. The prompt went from 9.3KB to 1.7KB, success rate went from zero to viable, and the verify step catches what the simplified prompt misses. You’re designing the operating environment so that the required level of “smart enough” is lower, which turns out to be a much better lever than picking a smarter model.

The heuristic

When I’m designing a workflow where I can define what correct looks like, I ask a few questions:

What can the model hallucinate here?
What can be checked by running something?
What data should come from a script rather than from the model’s training data?
Where am I relying on the model to self-verify, and can I replace that with a tool?

Usually at least one of those questions points at a place where I’m trusting the model to be disciplined over time, and discipline over time is exactly what models are worst at. That’s where the check goes.

Stop asking. Start checking.

“Be careful with numbers.” “Follow the style guide.” “Make sure the output matches the schema.” Asking like that works most of the time, and more often with better models, but “most of the time” and “reliable” are different things. Build the linter into the flow, make the data script mandatory, add the verify hook. Let the model be creative and adaptive and probabilistic, because that’s what it’s good at, but don’t let it be the sole judge of whether its own output is correct.

If something matters, put the truth in a script before asking the model to work around it.