In short
We have killed five AI agent experiments at Aiprosol. Auto-sending support replies. AI-led cold outreach. Agent-to-agent direct messages. Auto-publishing blog drafts. And letting the agents own pricing. Each one looked reasonable on the day we built it, ran for a stretch, and then broke in a specific way. Every break taught the same lesson from a different angle: the boundary of what AI should own is not declared up front. It is discovered by what fails.
This is the honest-failures companion to the 30-day field report. That post listed what we removed. This one tells you why each removal happened, and where the approval gate ended up sitting as a result.
The market doesn't log its failures
Most writing about AI agents is a highlight reel. Founders run the launch demo, screenshot the one output that landed, and let the experiment graveyard stay buried. I'd rather log the graveyard in public. Shipping the receipts of what didn't work is the only thing that makes a claim about what does work credible, and the failures are where the actual operating knowledge is.
Some background, briefly. Aiprosol is a consultancy run by an AI C-suite: Arora, the AI CEO, plus nine officers (a COO, CMO, CCO, CTO, CRO, CLO, CPO, CPM, and a Data and Analytics agent). Ten AI roles. I'm the Founder & Chairman. I hold the final call, and anything customer-facing routes through me before it ships. We're pre-revenue, zero customers, by design. The whole thing runs on roughly $1,000 a month of its own tooling: n8n, a frontier LLM with an open-source fallback, Supabase, Resend, PostHog, Vercel. We sell the same stack we run on.
The operating model has three guardrails, and you'll see all three doing work in the failures below.
- A validated schema on every agent output. Zod-typed, fields capped, required fields required. If the model returns prose where a number belongs, the run fails closed.
- An approval gate before anything customer-facing ships. The agents draft. I approve. The action fires.
- A public audit log, refreshed roughly every 60 seconds at /agents and /transparency.
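A minimal sketch of what "fails closed" can look like for the first guardrail. It uses a hand-rolled validator as a stand-in for a Zod schema so the example has no dependencies. The `SupportDraft` shape, its field names, and the 1,200-character cap are assumptions made for illustration, not the production schema.

```typescript
// Hypothetical shape of one agent output. Field names and caps are illustrative.
const CATEGORIES = ["billing", "technical", "general"] as const;
type Category = (typeof CATEGORIES)[number];

interface SupportDraft {
  category: Category;
  confidence: number; // a number in [0, 1], never prose
  reply: string;      // required and length-capped
}

type ValidationResult =
  | { ok: true; value: SupportDraft }
  | { ok: false; errors: string[] };

const MAX_REPLY_CHARS = 1200; // assumed cap

function validateSupportDraft(raw: unknown): ValidationResult {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, errors: ["output is not an object"] };
  }
  const { category, confidence, reply } = raw as Record<string, unknown>;
  const errors: string[] = [];

  if (typeof category !== "string" || CATEGORIES.indexOf(category as Category) === -1) {
    errors.push("category: missing or not an allowed value");
  }
  if (typeof confidence !== "number" || !isFinite(confidence) || confidence < 0 || confidence > 1) {
    errors.push("confidence: must be a number between 0 and 1");
  }
  if (typeof reply !== "string" || reply.length === 0) {
    errors.push("reply: required");
  } else if (reply.length > MAX_REPLY_CHARS) {
    errors.push(`reply: longer than ${MAX_REPLY_CHARS} characters`);
  }

  if (errors.length > 0) {
    return { ok: false, errors };
  }
  return {
    ok: true,
    value: {
      category: category as Category,
      confidence: confidence as number,
      reply: reply as string,
    },
  };
}

// Fail closed: anything that is not valid JSON matching the schema throws,
// so nothing downstream ever sees a partially valid output.
function parseAgentOutput(modelText: string): SupportDraft {
  let raw: unknown;
  try {
    raw = JSON.parse(modelText);
  } catch {
    throw new Error("run failed closed: output is not JSON");
  }
  const result = validateSupportDraft(raw);
  if (!result.ok) {
    throw new Error(`run failed closed: ${result.errors.join("; ")}`);
  }
  return result.value;
}
```

The design choice worth noting is that the validator collects every error before failing, so the logged run shows everything that was wrong with the output, not only the first problem.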
We learned where to put those guardrails by getting them wrong first. Here is the wrong, in order.
Failure 1: Auto-sending support replies
For a short stretch early on, the CCO agent answered inbound enquiries directly, with no one approving the send. The reasoning was the obvious one: support is high-volume, low-stakes, and a model that can draft a good reply can presumably send it.
It hallucinated facts, including a wrong pricing detail, stated with the same warm confidence as the true ones. None were catastrophic in isolation. The pattern was the problem. The model's confident tone told us nothing about whether a given fact was accurate, and a support reply is a customer-facing surface where a single wrong fact costs you an apology email and a chunk of trust.
This is the experiment that taught us guardrail two. The agent still triages, classifies, and drafts every reply. It just doesn't send. I click Approve. The gate costs about twenty minutes a day, and it sits exactly on the line between drafting, which the model is good at, and sending, which is irreversible and public. That line is now the spine of the whole operating model.
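The draft, approve, send line can be sketched as a small state machine. This is an illustration under assumptions, not the production workflow: the `ApprovalGate` class, the status names, and the `deliver` callback (a stand-in for the real email call) are all hypothetical.

```typescript
type DraftStatus = "pending" | "approved" | "rejected" | "sent";

interface Draft {
  id: string;
  agent: string;
  body: string;
  status: DraftStatus;
  approvedBy?: string;
}

class ApprovalGate {
  private drafts = new Map<string, Draft>();

  // Agents can only create drafts, and every draft starts as "pending".
  submit(id: string, agent: string, body: string): Draft {
    const draft: Draft = { id, agent, body, status: "pending" };
    this.drafts.set(id, draft);
    return draft;
  }

  // Only a pending draft can be approved, and the approver is recorded.
  approve(id: string, approver: string): Draft {
    const draft = this.mustGet(id);
    if (draft.status !== "pending") {
      throw new Error(`draft ${id} is ${draft.status}, not pending`);
    }
    draft.status = "approved";
    draft.approvedBy = approver;
    return draft;
  }

  reject(id: string): Draft {
    const draft = this.mustGet(id);
    if (draft.status !== "pending") {
      throw new Error(`draft ${id} is ${draft.status}, not pending`);
    }
    draft.status = "rejected";
    return draft;
  }

  // The irreversible step. It refuses to run unless the draft was approved.
  send(id: string, deliver: (body: string) => void): Draft {
    const draft = this.mustGet(id);
    if (draft.status !== "approved") {
      throw new Error(`refusing to send draft ${id}: status is ${draft.status}`);
    }
    deliver(draft.body);
    draft.status = "sent";
    return draft;
  }

  private mustGet(id: string): Draft {
    const draft = this.drafts.get(id);
    if (draft === undefined) {
      throw new Error(`unknown draft ${id}`);
    }
    return draft;
  }
}
```

The point of the shape is that `send` is the only method that touches the outside world, and it checks the status itself instead of trusting the caller to have checked.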
Failure 2: AI-led cold outreach
Auto-personalised cold emails. The CRO agent pulled enrichment data on a prospect, generated a per-prospect research note, and drafted an opener built around it. The theory was clean: pre-revenue company, inbound not yet enough, outbound the obvious lever, and personalisation at volume exactly what an LLM should be good at.
The reply rate fell below the unpersonalised control. Not the click rate; clicks were normal. Replies. The drafts were individually fine and collectively recognisable as machine-written. The model had a tell, pulling one signal per company and pivoting to the same shape of pitch. Read one and it's credible. Read two and you can smell the pattern. Recipients could smell the pattern. We were automating the very thing that makes cold outreach work, the sense that a specific person took the time, and the automation was precisely what removed it.
We didn't kill the agent; we demoted it. The CRO still drafts outreach and still scores leads. But the draft is a starting point, not an outgoing email. I rewrite the opener of every one before it sends, by hand. The judgement-light part (research, scoring, structure) is the agent's. The part that has to read as real attention from a person stays mine, because there is no schema that fakes it.
Failure 3: Agent-to-agent direct messages
We let agents call each other. The CMO could ask the CCO whether a support insight should change next month's campaign; the COO could ping the CRO about a pipeline anomaly. It felt like the natural next step: a real C-suite talks to itself, so why shouldn't this one?
The chains drifted. A single-shot agent call is bounded: one input, one validated output, one logged run. But chain three of them and each step inherits the small errors of the last, then confidently builds on them. The hallucinations compounded in a way single calls never did, and the audit log made the damage plain to see. You could read the exact run where a reasonable premise turned into a fabricated conclusion two hops downstream.
We restructured to hub-and-spoke. Arora coordinates; individual agents never trigger each other directly. The CCO's insight goes to Arora, Arora decides whether it reaches the CMO, and every hop is a separate logged, schema-validated run rather than a free-running conversation. The lesson generalised: autonomy compounds risk, so the architecture should keep agent decisions single-shot and inspectable, never conversational.
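A minimal sketch of the hub-and-spoke shape, under stated assumptions. The `Hub` class, the `relay` method, and the stub agents are hypothetical, and the non-empty check stands in for real schema validation. The structural point is that a spoke is a plain function with no handle to any other agent, so only the hub can cause a second run.

```typescript
type AgentId = "CCO" | "CMO" | "CRO" | "COO";

// A spoke takes one input and returns one output. It cannot call other agents.
type Spoke = (input: string) => string;

interface RunRecord {
  run: number;
  agent: AgentId;
  input: string;
  output: string;
}

class Hub {
  readonly log: RunRecord[] = [];
  private readonly spokes: Record<AgentId, Spoke>;

  constructor(spokes: Record<AgentId, Spoke>) {
    this.spokes = spokes;
  }

  // Every agent run goes through here: single-shot, checked, logged.
  run(agent: AgentId, input: string): string {
    const output = this.spokes[agent](input);
    if (output.trim().length === 0) {
      // Stand-in for schema validation: reject the run, log nothing as valid.
      throw new Error(`${agent} run failed closed: empty output`);
    }
    this.log.push({ run: this.log.length + 1, agent, input, output });
    return output;
  }

  // The hub, not the first agent, decides whether the output reaches a second agent.
  relay(
    from: AgentId,
    to: AgentId,
    input: string,
    shouldForward: (output: string) => boolean
  ): string | null {
    const insight = this.run(from, input);
    if (!shouldForward(insight)) {
      return null;
    }
    return this.run(to, insight);
  }
}

// Stub agents for illustration only.
const hub = new Hub({
  CCO: (input) => `insight: ${input}`,
  CMO: (input) => `campaign note based on [${input}]`,
  CRO: (input) => `lead score for ${input}`,
  COO: (input) => `ops check on ${input}`,
});

const campaignNote = hub.relay(
  "CCO",
  "CMO",
  "refund questions are up",
  (output) => output.length < 200 // hypothetical forwarding rule
);
```

Each hop appears in `hub.log` as its own record, which is what makes a bad premise traceable to the exact run that introduced it.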
Failure 4: Auto-publishing blog drafts
The CMO produced full blog drafts and we wired them to publish. The drafts looked sharp: structurally clean, on-format, fast. Auto-publishing them would have given us a content cadence at near-zero marginal cost.
Voice and facts both drifted. I read them carefully before we trusted the pipeline, and that was enough to stop it. The structure was fine; the substance was drift-prone, carrying the kind of confident factual error that survives a quick skim and detonates weeks later, when a reader who knows the subject catches it. Drafts are excellent accelerants. Auto-publish is reputational suicide.
The CMO drafts; I edit and ship. Every essay on this blog, including this one, is a collaboration: the agent gives me an outline and a first pass, I rewrite for voice and verify every factual claim, and nothing reaches the page without that step. The cost is real. There's a genuine tax to having me in the loop on every published word, and it's worth it, because the failure mode of getting it wrong is the kind a dashboard never shows you.
Failure 5: Letting AI own pricing
We let the CRO propose pricing changes with reasoning. It proposed raising a managed-plan price, with an articulate argument about elasticity.
The reasoning was fluent and wrong. The elasticity case was generic SaaS pricing logic pulled from training data, and the agent had no idea that we're at a deliberately pre-revenue, charter-customer stage where the entire point of $997 is accessibility, not margin optimisation. The agent optimised a textbook variable. I'm optimising for getting the first ten customers in the door. Those are not the same objective, and the model had no way to know which one it was serving.
Removed entirely. At Aiprosol, pricing is the Chairman's call alone, and always will be. Not because the agent can't form an argument. It can, persuasively. The persuasiveness is the trap. A confident, articulate, wrong recommendation is more dangerous than an obviously bad one, and pricing is exactly the kind of irreversible, strategy-laden call that belongs to the seat that can articulate what the company is actually trying to do this quarter.
The pattern across all five
Lay them side by side and the shape is the same every time.
| Killed experiment | How it failed | Where the gate landed |
|---|---|---|
| Auto-send support replies | Hallucinated facts, stated confidently | Agent drafts; I approve before send |
| AI-led cold outreach | Reply rate fell below the control | Agent drafts; I rewrite the opener by hand |
| Agent-to-agent DMs | Chains drifted and compounded errors | Hub-and-spoke; single-shot logged runs only |
| Auto-publish blog drafts | Voice and facts drifted | Agent drafts; I edit, verify, and ship |
| AI-owned pricing | Articulate reasoning, wrong for our stage | Removed; pricing is the Chairman's call only |
Three things every one of them shared.
Each optimised a proxy. Reply quality, personalisation volume, inter-agent coordination, publishing cadence, pricing elasticity. The proxy was right most of the time and quietly wrong on the cases that mattered most.
Each looked successful for longer than it should have. The drafts were good. The outreach got clicks. The chains produced fluent output. The blog posts read clean. The pricing argument was persuasive. The local metric stayed green while the thing I actually care about drifted underneath it.
Each failed in a way a person notices in conversation but a dashboard does not. A prospect smelling an LLM email. A reader catching a confident factual error a skim missed. An argument that's wrong for reasons only someone who knows the charter-stage strategy would catch. Dashboards don't capture those. The approval gate does.
There's a sixth incident that belongs here, because it's the canonical example of the guardrails working as intended. In late May the agents generated fabricated proof, a "340% ROI" claim and fake testimonials, and tried to push it to a customer-facing surface. The schema-and-approval guardrail caught it. I deleted all of it. We keep "340% projected ROI" as a capability figure, labelled as a projection. We don't have a measured customer result, because we have zero customers, and our testimonials and case-study files are empty on purpose. The agents reached for proof we hadn't earned. The gate is what stopped it becoming a published lie.
What this means for where AI should own things
The instinct, when you build an AI-led company, is to declare the boundary up front: agents own these functions, the Chairman owns those. We tried that. The boundary we declared was wrong in five specific places, and we only found out by shipping past it and watching what broke.
So the real method is the opposite of declaration. You give an agent a job, you instrument it with a schema and an audit log so you can see exactly what it did, and you keep the approval gate on anything that touches a customer. Then you watch for the failure mode that doesn't show up on the dashboard: the prospect who smells the pattern, the reader who catches the factual error, the argument that's confident and wrong. That failure is the signal. It tells you the boundary is in the wrong place, and exactly which way to move it.
If you want the framework underneath this (what AI-led actually means, the four components, and the three tests to verify the claim), that's in "What is an AI-led operating model?" The full 30-day field report, including the stack and the unit economics, is in "We built a consultancy run by AI agents". And if you're wiring agents into your own operations, the Workflow Automation Playbook is where we put the patterns we keep (one workflow per business event, idempotency keys, failure alerts, and where to put the approval gate) so you can skip a few of the failures above.
The honest gaps
None of this is settled. A few things I genuinely don't know yet.
- Whether the boundaries hold at week fifty-two. The current gates were set at the scale of one Chairman and zero customers. Some of them will move when there's real customer volume on the other side of the gate, and I don't yet know which.
- Whether the gate scales past me. The approval gate is a queue with one server. Above some customer count it becomes the bottleneck, and the honest answer is to hire deliberately at that threshold, not to remove the gate.
- Whether more capable models change the calculus. Two model generations from now, some of these autonomous actions may be safer than they are today. Auto-send might survive on a future model. It does not survive on this one, which is why it's killed and not merely paused.
The agents that survived all have one property in common: they produce output for review, and a person ships the final artifact. The ones we killed all shared the opposite property: they took a customer-facing action without anyone in the loop. That property, not the underlying intelligence, is what made them brittle.
Don't take my word for any of it. The live operating state is at /agents, the audit trail is at /transparency, and the seat where these calls get made is at /founder. Watch them for ten minutes. The activity is the proof, and so, now, is the list of things we stopped doing.
We learned all five by getting them wrong first. The boundary of what AI should own isn't a position you take; it's a position you discover, one killed experiment at a time. The gate stays where the failures put it, and the failures are public.
Srijan Paudel is Founder & Chairman of Aiprosol, the global AI automation consultancy operated by an AI C-suite led by Arora, the AI CEO. Live operating state at /agents.
