Abhishek Kumar

A smarter model makes your agent more confidently wrong

2026-05-26T00:00:00+00:00

Upgrade the model. That’s the reflex when an agent gives a wrong answer in production. A new release drops, you change the model string, the demo looks sharper, everyone moves on to the next ticket.

It’s the wrong lever. And on a frontier model, pulling it makes the failure harder to catch, not easier.

The gap isn’t where you think it is

A few years ago, most agent failures were model failures. The model wasn’t capable enough, full stop. That gap has closed. The frontier models everyone’s shipping on can reason, plan, and call tools fluently enough that raw capability is rarely the thing breaking your system.

What breaks it now is context. The wrong documents got retrieved. The memory is stale. A tool returned nothing and nobody handled it. The instructions contradicted themselves fifteen turns into a session. The model was fine — it just acted on a bad picture of the world.

Call it context debt: the gap between what your agent needs to know at a given step and what you actually put in its window. Like any debt, it’s invisible until it’s called in. And the interest compounds in a way most teams never see coming.

The confidence inversion

Here’s the part that catches people.

A weaker model running on incomplete context produces obvious garbage. The answer is incoherent, the reasoning is visibly broken, and it gets caught in review in about four seconds. The bad context is still there — but the weak model is doing you a favor by failing loudly.

Put a frontier model on that same incomplete context and you get something coherent, well-structured, and confidently wrong. It reads like a careful senior engineer wrote it. Every sentence is plausible. The citation looks real. And it sails straight through review, because nothing about it looks like an error.

So the upgrade didn’t fix the context problem. It raised the stakes of it. You made your failures more expensive to detect, and you spent budget feeling like you’d solved something.

A stronger model doesn’t remove the error. It removes the tell.

That’s the inversion: the better the model, the later you find out it was wrong, and the more it costs when you do.

Where the debt actually accumulates

I once ran a head-to-head retrieval evaluation on a set of 300-page compliance documents: standard chunked retrieval against a more structure-aware approach. Same model on both sides. The thing that decided output quality wasn’t the model at all. It was whether retrieval surfaced the governing clause or a plausible-looking neighbor two pages away. One strategy grounded the answer. The other produced a citation that looked perfect and pointed at the wrong rule.

The model never knew the difference. It can’t. It reasons over whatever you hand it.

A few places the debt piles up:

Similarity-only retrieval. Semantically close isn’t the same as relevant. A document about “regulatory compliance risk” looks similar to a query about “audit exposure” and answers neither. Cosine similarity is not comprehension.
Stale memory. Context that was true forty turns ago and quietly isn’t anymore.
Unbounded scope. Give an agent access to everything and it pulls in noise it can’t distinguish from signal.
Silent tool failures. A tool returns malformed output, the agent doesn’t catch it, and reasons confidently across the gap.

None of these improve when you swap the model. Some get worse — a stronger model is just better at making the gap invisible.
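
To make the silent-tool-failure case concrete, here is a minimal sketch of a guard that turns empty or malformed tool output into an explicit failure before the agent reasons over it. The names ToolResult and check_tool_output, and the required-fields convention, are assumptions for illustration, not part of any particular agent framework.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolResult:
    ok: bool
    data: Optional[dict] = None
    error: Optional[str] = None


def check_tool_output(raw: Any, required_keys: set) -> ToolResult:
    """Convert empty or malformed tool output into an explicit failure.

    Hypothetical helper: assumes tools are expected to return a dict
    containing at least the fields named in required_keys.
    """
    # "A tool returned nothing" becomes a visible error, not an empty context.
    if raw is None or (isinstance(raw, (dict, list, str)) and len(raw) == 0):
        return ToolResult(ok=False, error="tool returned nothing")
    # Wrong shape: surface it instead of letting the model guess.
    if not isinstance(raw, dict):
        return ToolResult(ok=False, error=f"expected dict, got {type(raw).__name__}")
    # Right shape, missing fields: still a failure.
    missing = required_keys - set(raw)
    if missing:
        return ToolResult(ok=False, error=f"missing fields: {sorted(missing)}")
    return ToolResult(ok=True, data=raw)
```

The point of the design is that the failure lands in the window as a failure. An agent that sees "tool returned nothing" can retry or escalate; an agent that sees an empty result tends to fill the gap itself.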

What I’d build instead

The lever that actually moves the failure rate is the context layer. The unglamorous discipline is treating it as an engineering problem, not a search problem.

Concretely, that’s four moves. Score sources on freshness and authority, not just vector similarity. Bound the agent’s scope to the least context that answers the question, and let it refuse anything outside that boundary. Put a verification layer between retrieval and generation that checks retrieved context against a system of record before the model ever sees it. And trace every decision, with low-confidence answers routed to a human instead of shipped.
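
The first two moves can be sketched in a few lines. This is a rough illustration under stated assumptions: the Candidate fields, the weights, the 90-day half-life, and the 0.6 threshold are all made-up values you would tune against your own data, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    similarity: float  # 0..1, from the vector search
    age_days: float    # days since the source was last confirmed current
    authority: float   # 0..1, e.g. 1.0 for the system of record


def context_score(c: Candidate, half_life_days: float = 90.0,
                  weights: tuple = (0.5, 0.2, 0.3)) -> float:
    """Blend similarity with freshness and authority (weights are assumptions)."""
    # Freshness halves every half_life_days.
    freshness = 0.5 ** (c.age_days / half_life_days)
    w_sim, w_fresh, w_auth = weights
    return w_sim * c.similarity + w_fresh * freshness + w_auth * c.authority


def select_context(candidates, k: int = 3, min_score: float = 0.6):
    """Keep at most k candidates that clear the threshold.

    An empty result is a signal to refuse or route to a human,
    not to fall back to whatever was closest.
    """
    ranked = sorted(candidates, key=context_score, reverse=True)
    return [c for c in ranked[:k] if context_score(c) >= min_score]


# A stale, low-authority page that is the closest match by similarity alone,
# next to a current entry from the system of record.
wiki = Candidate("old-wiki", similarity=0.95, age_days=720, authority=0.2)
policy = Candidate("policy-v7", similarity=0.80, age_days=0, authority=1.0)
selected = select_context([wiki, policy])  # only the policy clears the threshold
```

On similarity alone the stale wiki page wins. With freshness and authority in the score it drops below the threshold, and if nothing clears the bar the agent has an explicit reason to decline.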

The trade-off, named

This isn’t free, and pretending it is would be the kind of thing I’m arguing against.

A verification layer adds latency and cost — you’re spending tokens and milliseconds to check work before generation. You give up the seductive simplicity of “change the string, ship the release.” Bounding scope means your agent will decline things it could plausibly attempt, which looks worse in a demo and is better in production.

I’ll take that trade every time. A slower agent that’s wrong in ways I can see beats a fast one that’s wrong in ways only a domain expert catches three weeks later, after it’s quietly shaped a decision nobody traced.

The hard part of agents was never the model. It’s the discipline to engineer what the model knows at the exact moment it acts. The teams shipping reliable agents in 2026 aren’t running smarter models than you. They’re running better context.

So before you reach for the next release, ask the only question that matters: would a smarter model fix this — or just hide it better?

Context engineering is just RAG with a compliance layer

2026-05-23T00:00:00+00:00

“RAG is dead, context engineering won.”

It’s the take of the month. Budget data backs it up — retrieval optimization spend just overtook evaluation spend for the first time. The framing is that we’ve stopped stuffing chunks into a pipeline and started letting the model pull what it needs.

Half of that is right. The wrong half is the half that matters if you ship into a regulated industry.

The rebrand hides the actual decision

Strip the marketing and “context engineering” is an umbrella term. RAG is still in there — embedded queries, similarity search, top-k injection. What changed is that retrieval is now one tool among several for deciding what the model sees, instead of the whole strategy. Long context windows handle some of it. Structured business data handles some. Memory handles some. Fine.

But that’s a capability story, not an architecture decision. The architecture decision is which of three things you’re actually building:

Plain RAG. Retrieves top-k chunks. Fast, cheap, and opaque: you can’t easily say why a chunk showed up or whether the user was allowed to see it.
Full context engineering. Assembles the whole working set the model needs for the task. Richer, more flexible, and you pay for it in tokens and latency.
Governed context engineering. Does all of that and adds the part nobody puts on a slide: it decides what’s eligible before anything gets retrieved, and it records what happened.

Most “RAG vs context engineering” posts stop at the first two. For a consumer chatbot, that’s the whole conversation. For anything that touches regulated data, the third option is the only one that survives contact with a compliance review.

The part the demos skip

Here’s the trap I’ve watched teams walk into. The retrieval-then-filter order feels natural: pull the relevant chunks, then redact or filter whatever the user shouldn’t see before you respond.

By the time you’re filtering the output, the restricted passage has already entered the prompt. The model has already conditioned on it. You’re not preventing a leak, you’re hoping to catch one. In a domain with row-level access rules — who can see which document, which clause, which customer’s record — that’s not a bug you patch later. It’s the wrong order of operations baked into the foundation.

Governed context engineering flips it. Permission scoping runs first, retrieval runs inside that scope, and the model only ever sees what this specific user is allowed to see. Same components, inverted order, completely different audit posture. The directional flip is the whole point — and it’s the part that’s invisible in a demo, because demos run as one all-powerful user.
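
A minimal sketch of that ordering, assuming a role-based permission model. The Doc shape, the role names, and the keyword-overlap ranking are stand-ins for illustration; a real system would scope a vector index or database query instead of a Python list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    allowed_roles: frozenset


def eligible_docs(corpus, user_roles):
    """Permission scoping. Runs before any retrieval happens."""
    return [d for d in corpus if d.allowed_roles & user_roles]


def retrieve(query, docs, k=2):
    """Toy keyword-overlap ranking; a placeholder for a real retriever."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(d.text.lower().split())), d) for d in docs]
    scored = [(s, d) for s, d in scored if s > 0]
    # Ties broken by doc_id so the ordering is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1].doc_id))
    return [d for _, d in scored[:k]]


def governed_retrieve(query, corpus, user_roles, k=2):
    """Retrieval only ever searches what this user is allowed to see."""
    return retrieve(query, eligible_docs(corpus, user_roles), k)


corpus = [
    Doc("pricing-public", "standard pricing tiers", frozenset({"support", "finance"})),
    Doc("pricing-acme", "negotiated pricing for acme contract", frozenset({"finance"})),
]
# A support user asking about acme never has the restricted doc in scope.
support_view = governed_retrieve("acme pricing", corpus, {"support"})
```

The restricted document is not filtered out of the answer. It is never a candidate, so it cannot reach the prompt in the first place.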

Reproducibility is the feature, not the metric

The deeper reason regulated teams can’t treat this as a rebrand: an answer is only as defensible as your ability to recreate it.

Say a model gave a wrong classification last Tuesday and someone’s now asking why. With chunk-based RAG, you usually can’t reconstruct the exact context that produced it. The index has been re-embedded since. The documents changed. The retrieval was non-deterministic. You’re left explaining a decision you can’t reproduce, which in a compliance setting is indistinguishable from having no answer at all.

A governed context system treats the assembled context as a recorded artifact. You can replay precisely what the model saw at that moment — same documents, same permissions, same ordering. That single requirement reshapes the whole stack. It’s why you log the context, not just the output. It’s why retrieval order is deterministic where it can be. It’s why “which chunks, for which user, at what time” is a first-class field and not an afterthought.
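
One way to picture the recorded artifact, as a sketch rather than a prescribed schema: the field names below are assumptions, and the SHA-256 fingerprint is just one option for proving later that a replayed context matches what was logged.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ContextRecord:
    user_id: str
    roles: tuple        # permissions in force when the context was assembled
    query: str
    chunks: tuple       # (doc_id, doc_version, text), in prompt order
    assembled_at: str   # ISO 8601 timestamp


def fingerprint(record: ContextRecord) -> str:
    """Stable hash over everything the model saw, including chunk order."""
    payload = json.dumps(asdict(record), sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def to_log_line(record: ContextRecord) -> str:
    """Serialize the record with its fingerprint for an append-only log."""
    entry = asdict(record)
    entry["fingerprint"] = fingerprint(record)
    return json.dumps(entry, sort_keys=True, ensure_ascii=False)
```

Storing the chunk text and document version, not just the IDs, is what lets the record survive a re-embedded index or an edited source. Ordering is part of the hash on purpose: the same chunks in a different order is a different context.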

I ran a head-to-head retrieval evaluation on 300-page compliance documents — chunked RAG against a structure-aware approach — and the accuracy numbers were close enough to argue about. What wasn’t close was explainability. One approach could tell me why a passage was retrieved and tie it back to a section of the source. The other gave me a similarity score and a shrug. For a workflow that has to defend its outputs to an auditor, that gap decided the architecture before precision@k ever entered the conversation.

The call I’d actually make

If you’re building for a static corpus, a single agent, and no compliance pressure: use plain RAG and move on. Don’t let anyone talk you into a context-graph cathedral for a FAQ bot. The simplest thing that retrieves well is the right thing.

If you’re in a regulated domain — telecom, healthcare, pharma, finance, anything audited — start from governed context engineering and treat the governance layer as load-bearing, not a wrapper you add at the end. Build the permission filter before the retriever. Log the assembled context as an artifact. Make reproducibility a design constraint from day one, because retrofitting it is brutal and usually means a rebuild.

What you sacrifice is real: setup time, more moving parts, a slower path to the first impressive demo. The governed path looks worse in week one and decisively better the first time someone challenges an answer. That’s the trade — you’re spending early velocity to buy the ability to stand behind your system later.

The mistake I see is teams picking based on the demo, where governance is invisible and the lean pipeline always looks faster. Then the first security review lands and the project stalls — not because the model was wrong, but because the layer around the model can’t answer the questions a regulator asks.

The retrieval mechanism was never the hard part. It hasn’t been for a while. The hard part is proving, after the fact, that the system only ever saw what it was allowed to see and can show its work. Call that context engineering if you want. In regulated AI it has an older name: doing it properly.

So before you join the “RAG is dead” chorus, ask the only question that ends the debate — if one of your answers gets challenged six months from now, can you recreate exactly what the model saw? If the answer is no, you didn’t skip RAG. You skipped the part that mattered.

On shipping a website in 2 hours

2026-05-19T00:00:00+00:00

The deployment call was crawling. I was bored.

Three months ago, I bought a domain. iamabhishek.cloud. Paid for it. Forgot about it. It had been sitting there doing nothing while my actual work kept piling up.

Last night, mid-deployment call, I built and shipped a personal website on that domain.

Total time: under 2 hours.

What I actually did

The workflow was three steps, three Claude surfaces:

Step 1 — prompt generation. I attached my resume to Claude and asked it to write me a detailed prompt for a personal site using Anthropic’s design system. Five minutes.

Step 2 — design. I took that prompt to Claude Designs. Iterated twice — fixed some project descriptions, added a partner strip, dropped in a portrait. Thirty minutes. Design done.

Step 3 — deploy. Took the design to Claude Code. Generated the full codebase. Pushed to GitHub Pages. Pointed my domain. Live.

What changed

Three years ago this was a two-week project. Hire a designer. Brief a developer. Iterate. Debug. Pay invoices.

Last night: one engineer, three Claude surfaces, a domain that was rotting on a shelf.

The recursion isn’t lost on me — I’m an ML engineer who builds AI systems for a living, and I used AI to build the website that explains I build AI systems for a living.

But here’s the part worth saying out loud:

I’m not a designer. I shouldn’t be able to ship something that looks like this. But the tools meet you where you are now. The gap between “I have an idea” and “I have a thing on the internet” has collapsed to a single boring night on a deployment call.

That’s the part to pay attention to.