Plausibly Wrong Is Worse Than No Agent at All
Daniel Schaffield

Every data team is being told to hand its pipelines to an AI agent. The pitch is seductive and the demos are clean. The problem shows up later, in production, where an agent that's plausibly wrong is worse than no agent at all.
A wrong number that looks right doesn't get caught in review. It passes dbt run, passes a glance at the output, ships to a dashboard, and surfaces three weeks later in a leadership meeting when finance asks why revenue is 12% too high. By then nobody remembers which model the agent touched.
SignalPilot is built for that reality: produce data that is correct, not merely plausible, and stay safe to run on a real warehouse. Here are the production failures we prevent, the security model that makes an agent safe to point at your data, and the benchmark results behind the correctness claim.
The failures that pass review and blow up later
The dangerous bugs in analytics engineering aren't the ones that crash. They're the ones that run cleanly and return the wrong answer. Four we see constantly, and what they cost in production:
An agent joins a lookup table with duplicate keys. The SQL looks textbook and runs clean, but every order quietly counts two or three times. The cost: revenue is overstated in the board deck, and nobody notices until the quarter closes.
A churn model is aggregated off the events table instead of the customer table. Customers with zero activity simply vanish. The cost: exactly the accounts a retention team needs to see are missing. The dashboard looks complete. It isn't.
A column that's already a running total gets re-summed instead of carried forward. The cost: lifetime value doubles across the board, and every downstream metric inherits the error.
An agent “optimizes” by stripping a column it judged empty or redundant. The cost: a downstream model, test, or dashboard that depended on it breaks, or worse, silently returns nulls.
Every one of these is plausible SQL: the kind of code that passes "it ran" and a quick human skim. It's also the kind of code most AI agents produce, because they treat a warehouse like a text box: generate, ship, hope.
The same failures, caught
Each of these starts from a real benchmark task: two from Spider 2.0-DBT, which poses the kind of plain business question a stakeholder would actually ask, and one from ADE-Bench. We don't have the other agents' code, so the mistake described first in each case is the plausible one any of them could make. The result that follows it is what SignalPilot actually did.
“Which countries have the highest total revenue, and what are the top 10 by revenue?”
An invoice has several line items, so the agent joins invoices to their lines and sums the invoice total once per line. The top-10 table looks clean and ships. One market lands roughly 2.4× too high, and the team reprioritizes around revenue that was never there.
Board deck overstates a region's revenue by ~2.4×.
SignalPilot catches the duplication in verification: it compares the rows it produced against the source rows and flags the multiplier before anything is final. Each invoice is counted once.
Country totals match the ledger, to the row.
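To make the fan-out concrete, here is a minimal sketch in Python with an in-memory SQLite database. The table names, columns, and numbers are invented for illustration and are not taken from the benchmark task or from SignalPilot's code. It shows the inflated total, the correct total, and one simple way a row-count comparison could surface the multiplier.

```python
import sqlite3

# Toy data (hypothetical): one invoice with three line items, one with a single line.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoices (invoice_id INTEGER PRIMARY KEY, country TEXT, total REAL);
    CREATE TABLE invoice_lines (line_id INTEGER PRIMARY KEY, invoice_id INTEGER, amount REAL);
    INSERT INTO invoices VALUES (1, 'DE', 30.0), (2, 'FR', 10.0);
    INSERT INTO invoice_lines VALUES (1, 1, 10.0), (2, 1, 10.0), (3, 1, 10.0), (4, 2, 10.0);
""")

JOINED = "invoices i JOIN invoice_lines l ON l.invoice_id = i.invoice_id"

# Plausible but wrong: the invoice total is repeated once per line item.
naive = dict(conn.execute(
    f"SELECT i.country, SUM(i.total) FROM {JOINED} GROUP BY i.country"
).fetchall())

# Correct: each invoice is counted exactly once.
correct = dict(conn.execute(
    "SELECT country, SUM(total) FROM invoices GROUP BY country"
).fetchall())


def fan_out_ratio(conn):
    """Rows after the join divided by rows in the source table; 1.0 means no duplication."""
    source_rows = conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
    joined_rows = conn.execute(f"SELECT COUNT(*) FROM {JOINED}").fetchone()[0]
    return joined_rows / source_rows


print(naive, correct, fan_out_ratio(conn))
```

Any ratio above 1.0 means a measure stored at invoice grain is being summed more than once per invoice, which is the signal to aggregate before joining or to sum from the line level instead.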
“Aggregate metrics for every team and user, including open tasks, completed tasks, and average close times.”
The agent builds the report off the tasks table. Any team with nothing in flight has no rows there, so it silently drops out, exactly the idle teams a manager is looking for. The dashboard looks complete because the missing rows leave no trace.
The accounts that need attention are the ones that vanish.
SignalPilot keeps the team and user as the driving table and joins activity onto it, so every entity appears, with a clean 0 where there's no work rather than a gap.
Every team is in the report, zero-activity ones included.
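The difference between the two reports can be sketched with a toy schema. The tables and values below are made up for illustration; only the pattern (drive from the entity table, LEFT JOIN the activity) comes from the description above.

```python
import sqlite3

# Toy data (hypothetical): three teams, one of which has no tasks at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE teams (team_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tasks (task_id INTEGER PRIMARY KEY, team_id INTEGER, status TEXT);
    INSERT INTO teams VALUES (1, 'Platform'), (2, 'Growth'), (3, 'Idle');
    INSERT INTO tasks VALUES (1, 1, 'open'), (2, 1, 'done'), (3, 2, 'open');
""")

# Built off the tasks table: a team with no tasks has no rows, so it disappears.
from_tasks = conn.execute("""
    SELECT team_id, COUNT(*) AS open_tasks
    FROM tasks
    WHERE status = 'open'
    GROUP BY team_id
    ORDER BY team_id
""").fetchall()

# Built off the teams table: every team appears. The status filter lives in the
# ON clause so it cannot drop teams, and COUNT(k.task_id) yields 0 when nothing matched.
from_teams = conn.execute("""
    SELECT t.team_id, COUNT(k.task_id) AS open_tasks
    FROM teams t
    LEFT JOIN tasks k ON k.team_id = t.team_id AND k.status = 'open'
    GROUP BY t.team_id
    ORDER BY t.team_id
""").fetchall()

print(from_tasks)
print(from_teams)
```

Note the filter placement: moving `k.status = 'open'` into a WHERE clause would turn the LEFT JOIN back into an inner join in effect and drop the idle team again.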
“The season points look way too high. Figure out what's wrong and fix it.”
The standings column is already a running total that grows each race. Summing it across the season stacks every race on top of the last, so the final number balloons, large enough to notice, plausible enough that an agent ships it anyway.
Season totals inflated several times over.
SignalPilot recognizes the column is cumulative from its role in the data, not its name, and carries the final value forward instead of re-adding it.
Totals line up with the official standings.
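A small sketch of the same mistake and its fix, using invented standings data. The monotonicity check is only one possible heuristic for spotting a running total from the data rather than the column name; it is an assumption for illustration, not a description of how SignalPilot detects it.

```python
import sqlite3

# Toy data (hypothetical): points_to_date is already a running total per driver.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE standings (driver TEXT, race_no INTEGER, points_to_date INTEGER);
    INSERT INTO standings VALUES
        ('A', 1, 25), ('A', 2, 43), ('A', 3, 68),
        ('B', 1, 18), ('B', 2, 43), ('B', 3, 58);
""")

# Wrong: re-summing a running total stacks every race on top of the last.
summed = dict(conn.execute(
    "SELECT driver, SUM(points_to_date) FROM standings GROUP BY driver"
).fetchall())

# Right: carry the final value forward. MAX works here because the total never decreases.
carried = dict(conn.execute(
    "SELECT driver, MAX(points_to_date) FROM standings GROUP BY driver"
).fetchall())


def looks_cumulative(conn):
    """Heuristic: the value never decreases within a driver when ordered by race."""
    rows = conn.execute(
        "SELECT driver, points_to_date FROM standings ORDER BY driver, race_no"
    ).fetchall()
    last_seen = {}
    for driver, points in rows:
        if driver in last_seen and points < last_seen[driver]:
            return False
        last_seen[driver] = points
    return True


print(summed, carried, looks_cumulative(conn))
```

If a running total can go down (penalties, refunds), MAX is no longer safe; taking the value at the latest race for each driver is the more general form of "carry forward".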
Reason, verify, then refuse to break things
SignalPilot doesn't guess its way to plausible. It reasons about the data and then checks itself.
It knows that a cumulative total carries forward with MAX, that a childless parent stays in the output with a zero, and that the table you aggregate onto isn't always the one named first.
It runs deterministic checks before anything is called done: row counts, fan-out ratios, cardinality, column completeness, value spot-checks. Bad output fails its own gate.
Every query passes through a fail-closed gateway. DROP, ALTER, INSERT, and DELETE are blocked at the wire. Queries are read-only and get an automatic LIMIT, budgets are capped, and everything is audited in full with PII redacted.
None of the reasoning matters if the agent can destroy the warehouse on the way, which is why that gateway sits in front of every query.
Correctness gets you on the leaderboard. Governance is what makes the leaderboard mean something for your real data.
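For readers who want a feel for what "fail-closed" means for the gateway described in this section, here is a deliberately crude sketch. It is not SignalPilot's gateway: the function name `gate`, the `QueryRejected` exception, and the string-level keyword matching are all assumptions made for illustration. A production gateway would parse the SQL and enforce read-only access at the connection level as well.

```python
import re

# Keywords the post lists as blocked. A real gateway would cover more statement types.
BLOCKED = re.compile(r"\b(DROP|ALTER|INSERT|DELETE)\b", re.IGNORECASE)
# Fail-closed: only statements that start like a read are considered at all.
ALLOWED_START = re.compile(r"^\s*(SELECT|WITH)\b", re.IGNORECASE)


class QueryRejected(Exception):
    """Raised whenever the gate cannot positively confirm a query is safe."""


def gate(sql: str, max_rows: int = 1000) -> str:
    """Return a row-limited, read-only version of sql, or raise QueryRejected."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise QueryRejected("multiple statements are not allowed")
    if not ALLOWED_START.match(statement):
        raise QueryRejected("only SELECT or WITH statements are allowed")
    if BLOCKED.search(statement):
        raise QueryRejected("statement contains a blocked keyword")
    # Auto-LIMIT: wrap the query so the cap applies even if it has its own LIMIT.
    return f"SELECT * FROM ({statement}) AS gated LIMIT {max_rows}"


print(gate("SELECT country, SUM(total) FROM invoices GROUP BY country;"))
```

The sketch errs toward false rejections: a harmless query with the word "delete" inside a string literal is refused. That is the fail-closed trade, where anything the gate cannot confirm as safe does not run.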
Safe to run on your actual warehouse
Getting the data right is only half of it. The other half is the reason a lot of teams still won't hand an agent the keys to production: they don't trust it not to leak a credential or quietly break something. That's fair. So we built AutoFyn, our open-source security agent, to go looking for exactly those problems. It has already found real vulnerabilities in projects as widely used as Next.js and MetaMask.
Before this release went out, we pointed it at SignalPilot and let it try to break in. Everything the agent touches, from the query gateway to the notebook sandbox, got tightened up based on what it found.
So "secured by AutoFyn" isn't a box someone ticked once. It's a security agent that re-checks the whole thing every time we ship, which is how the entire experience, your queries, your pipelines, and your notebooks, stays locked down between releases.
Governed agentic notebooks
It's also what lets us do the thing other data agents can't do safely. The moment an agent needs Python, for a feature, a forecast, a chart, you've handed it arbitrary code execution on top of your warehouse. Most "AI notebooks" answer this by giving the agent a kernel with full access and hoping. SignalPilot Notebooks run every session inside a per-tenant, mathematically isolated pod.
The agent gets to do real data science. You don't get the blast radius. It's the same trade SignalPilot has always made, capability with guardrails, extended from queries to compute.
The hardest benchmarks in data
None of the above is worth much as a claim, so we measure it against the only honest scoreboard in this space and lead it. Spider 2.0-DBT drops an agent into broken, real-world enterprise dbt repos and grades whether it can actually fix them.
ADE-Bench is dbt Labs' analytics-engineering benchmark, graded on exact row-level output equality. Across the full 64-task suite SignalPilot resolves 62 of 64 (96.9%). On the 43-task subset the rest of the field reports against:
A ~58-point swing on identical tasks. That isn't the model. It's the governance, the verification, and the data-aware reasoning wrapped around it.
The benchmarks aren't the point. They're how we prove, on tasks we didn't write, that the agent does in the lab exactly what it has to do in your warehouse.
How we did it
There's no trick here, and no autonomous loop quietly grinding the score up overnight. We closed the gap the unglamorous way: run the full benchmark, read every failure against the actual expected output, find the single decision that went wrong, and fix it at the narrowest layer that generalizes. Never with a patch aimed at one task.
A few of the changes that moved the number:
The agent kept building from the wrong table because two different tools were quietly recommending different ones. We removed the conflicting heuristic so the data-driven project scan is the single authority on which table to build from. That one change fixed the disjoint-key failures outright.
We taught the agent that "aggregate X by Y" describes what to summarize and the output grain, not the table to drop into the FROM clause, and that a parent row with no matching children belongs in the output with a zero, not dropped.
Our verification subagents were flagging legitimately all-NULL metric columns (childless parents in a LEFT JOIN) as defects, which pushed the agent to "fix" correct output into wrong output. The bug was in the verifier, so we fixed the verifier, not the agent.
Event timestamps (first reply, last close, and the like) are aggregations of the detail rows, never a denormalized convenience column copied off the parent.
When a task says "don't include column X," that wins over a YML contract that happens to list it.
We resolved a date-arithmetic conflict between two skills, and adopted a standing rule that each topic lives in exactly one skill and the others never reference it, so the agent never gets two answers to the same question.
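Two of the rules above, legitimately NULL metrics for childless parents and event timestamps aggregated from detail rows, can be illustrated together. The schema and the `unexplained_nulls` helper below are hypothetical; they only sketch the distinction a verifier needs to make, not the verifier we actually run.

```python
import sqlite3

# Toy data (hypothetical): ticket 1 has two replies, ticket 2 has none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tickets (ticket_id INTEGER PRIMARY KEY);
    CREATE TABLE replies (reply_id INTEGER PRIMARY KEY, ticket_id INTEGER, replied_at TEXT);
    INSERT INTO tickets VALUES (1), (2);
    INSERT INTO replies VALUES (1, 1, '2024-01-02'), (2, 1, '2024-01-05');
""")

# first_reply_at is aggregated from the detail rows, not copied from a parent column.
rows = conn.execute("""
    SELECT t.ticket_id,
           COUNT(r.reply_id) AS reply_count,
           MIN(r.replied_at) AS first_reply_at
    FROM tickets t
    LEFT JOIN replies r ON r.ticket_id = t.ticket_id
    GROUP BY t.ticket_id
    ORDER BY t.ticket_id
""").fetchall()


def unexplained_nulls(rows):
    """Flag a NULL metric only when the parent actually has child rows."""
    return [ticket_id for ticket_id, reply_count, first_reply_at in rows
            if first_reply_at is None and reply_count > 0]


print(rows)
print(unexplained_nulls(rows))
```

A NULL next to a child count of zero is the correct answer for a parent with no children. A NULL next to a positive count is the case worth flagging.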
The constraint we held the whole way: every fix had to be general. Not one of them hardcodes a benchmark answer or a specific table name. When a failure analysis suggested "just memorize the expected value here," we threw it out, because a rule that only works on a benchmark is worthless on your warehouse (and, frankly, it's cheating). The only two tasks we don't pass are the two with no valid answer key at all: one ships no expected output, the other's entire prompt is "do nothing."
That's the method. Read the failures honestly, fix the root cause once, make sure it generalizes, and never teach the agent the test.
Trusted by default, not by accident
Two things make SignalPilot different, and both are things you can check rather than take on faith. It produces data that's correct where other agents produce data that's merely plausible, provable on the hardest public benchmarks. And it's safe to run on real production data, because the same autonomous rigor that finds vulnerabilities in Next.js and MetaMask is pointed back at SignalPilot itself.
Correctness you can verify. Security you can audit. That's the whole bar for an agent you'd actually let near your warehouse.
Try it today
SignalPilot is open-source and installs in about a minute:
git clone https://github.com/SignalPilot-Labs/signalpilot.git
cd signalpilot && docker compose up -d
Add it to Claude Code:
/plugin marketplace add ./plugin
/plugin install signalpilot-dbt@signalpilot
# (Optional) Install the plugin for skills + agents (Codex)
codex plugin marketplace add SignalPilot-Labs/codex-signalpilot-plugin
codex plugin add signalpilot@signalpilot
We're building the vendor-neutral autonomous data stack: agents that are trusted by default, not trusted by accident. Whether you're a data engineer tired of babysitting pipelines, a founder betting on dbt, or an investor watching the AI-native data layer take shape, we want to hear from you.
Star the repo. Break the agent. Tell us what's missing. The next twelve months are going to be wild.

