From AI Theater to AI Engine: Making Your Generative AI Experiments Add Up

Jun 14

We've all sat through some version of this talk. A CTO stands up and walks through everything the company is doing with AI. Thirty experiments. A shiny new "AI Center of Excellence." Pilots in customer service, legal, finance, and engineering. Beautiful slides.

Then someone asks which one actually moved a business metric, and the room goes quiet.

That's AI theater. Lots of motion, no movement. And the numbers back it up. McKinsey says 88% of companies now use AI somewhere in the business, up from 78% a year ago. So almost everyone is doing something. But only about 6% are the high performers, the ones where AI adds more than 5% of EBIT. MIT is blunter still: 95% of generative AI pilots never return a dime.

So what's actually going wrong? It isn't the technology, and it isn't the experimenting. Experiments are good. You want a lot of them. The trouble is that most companies run them as a scattered pile of one-offs, and a scattered pile is all they ever add up to. The companies pulling ahead do something different. They wire those same experiments into an engine: a system that picks what's worth testing, actually learns from whatever comes back, and turns those lessons into value that builds on itself. Same experiments. Very different machine around them.

The Question That Kills Most AI Initiatives

Most AI initiatives kick off the same way. Somebody senior gets excited about a new capability and tells the team to go find a use case for it. And honestly, "where can we add AI?" feels like a productive question. It gets you a long list fast. The catch is that every item on that list is a solution out hunting for a problem.

It shows in the failure data: 73% of failed AI projects never pinned down what success even looked like before they started. The project existed because the technology existed, not because a real problem was demanding a new answer.

So turn the question around. Clayton Christensen's jobs-to-be-done idea is the right lens. Don't start with the technology. Start with the highest-friction, highest-value work your people actually do. Where are they burning hours on synthesis that should take minutes? Where do decisions stall because nobody can find the right information? Where does the expertise sit in one person's head when a few hundred people need it?

Morgan Stanley didn't ask "how do we use generative AI?" They asked a much better question: how do our wealth advisors get to the right research fast? That was a real problem, and it was eating real time across 16,000 advisors. The answer happened to be a generative-AI assistant that pulls from more than 100,000 of the firm's internal documents. The AI was the answer to a specific question. It wasn't a technology looking for a home.

Here's the discipline, and it's rarer than it should be. Decide what success looks like before you write a single prompt. If you can't put the business outcome in one sentence, you're not ready to build.

One-Off Pilots Stall. Engines Compound.

Once you know which problems are worth solving, the next trap is how you test for solutions. Most companies run pilots. And from the outside, a one-off pilot and an engine look the same on day one. Small team, real problem, some AI in the mix. The difference only shows up later.

A pilot on its own is a one-off. An eager team hand-builds it, runs it a few months, ships a slick demo, and then it stalls. The champion moves to a new role, or the budget resets, and whatever they learned ends up in a deck nobody opens again.

The fix is an experimentation engine, and this time I mean the word literally. It's the actual infrastructure that runs those tests on purpose, over and over. A repeatable way to find out whether AI earns its keep in a given spot, and to shut down the ones that don't. None of the pieces are glamorous. You standardize how you evaluate things so every experiment gets graded on the same scale. You share prompt libraries so nobody's reinventing the basics. You build harnesses to check whether your retrieval is pulling back the right answers, and you keep score across models so the build-or-buy call runs on data instead of a hunch. It won't impress a board. It's also the only thing that lets each experiment start smarter than the one before it.
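To make one of those pieces concrete, here is a minimal sketch of a retrieval check. It assumes you have a small hand-labeled set of queries with the documents that should come back, and some retriever you can call. The names (hit_rate_at_k, toy_retriever, LABELED_QUERIES) and the document IDs are made up for illustration, not taken from any real system.

```python
from typing import Callable, Dict, List, Set

# Hypothetical labeled set: each query maps to the document IDs a good answer needs.
LABELED_QUERIES: Dict[str, Set[str]] = {
    "refund policy": {"doc-12"},
    "wire transfer limits": {"doc-7", "doc-9"},
    "onboarding checklist": {"doc-3"},
    "expense approval": {"doc-21"},
}


def toy_retriever(query: str) -> List[str]:
    """Stand-in for a real retriever: returns canned, ranked document IDs."""
    canned = {
        "refund policy": ["doc-12", "doc-40"],
        "wire transfer limits": ["doc-2", "doc-9"],
        "onboarding checklist": ["doc-8", "doc-5"],
        "expense approval": ["doc-21"],
    }
    return canned.get(query, [])


def hit_rate_at_k(
    retrieve: Callable[[str], List[str]],
    labeled_queries: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose top-k results include at least one expected document."""
    if not labeled_queries:
        raise ValueError("need at least one labeled query")
    hits = 0
    for query, expected_ids in labeled_queries.items():
        top_k = retrieve(query)[:k]
        if any(doc_id in expected_ids for doc_id in top_k):
            hits += 1
    return hits / len(labeled_queries)


print(hit_rate_at_k(toy_retriever, LABELED_QUERIES, k=1))  # 0.5
print(hit_rate_at_k(toy_retriever, LABELED_QUERIES, k=2))  # 0.75
```

The point isn't this particular metric. It's that every experiment runs against the same labeled set and the same function, so a score from one pilot can be compared with a score from the next.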

The number that matters is learning velocity. How fast do you get from "could AI help here?" to "here's what we know, and here's what we're doing about it"?
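One rough way to put a number on that, assuming you log the date each question was raised and the date a decision was recorded, is the median number of days between the two. The function name and the sample dates below are hypothetical.

```python
from datetime import date
from statistics import median
from typing import List, Optional, Tuple

# (question raised, decision recorded); None means still undecided. Made-up dates.
EXPERIMENT_DATES: List[Tuple[date, Optional[date]]] = [
    (date(2025, 1, 6), date(2025, 2, 3)),
    (date(2025, 1, 20), date(2025, 3, 3)),
    (date(2025, 2, 10), None),
    (date(2025, 3, 1), date(2025, 3, 15)),
]


def learning_velocity_days(
    experiments: List[Tuple[date, Optional[date]]],
) -> float:
    """Median days from question to decision, over experiments that reached a decision."""
    durations = [
        (decided - asked).days
        for asked, decided in experiments
        if decided is not None
    ]
    if not durations:
        raise ValueError("no experiment has reached a decision yet")
    return median(durations)


print(learning_velocity_days(EXPERIMENT_DATES))  # 28
```

Experiments with no decision are left out of the median here. In practice you'd want to watch that count too, since a pile of undecided experiments is its own signal.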

In 2025, 42% of companies scrapped most of their AI initiatives before they ever reached production, more than double the 17% from the year before. Most of what got burned wasn't technology spend. It was the time, the people, and the attention poured into pilots that were never built to scale in the first place. An engine doesn't make failure go away. It makes failure cheap, fast, and useful, which is exactly what you want an experiment to be.

The Portfolio Problem

Say you've got the right questions and a real engine for testing them. Most companies still trip on the next thing: how they spread the AI money around. The default is a flat list of projects all fighting for the same budget, and whoever tells the best story wins. You can guess how that goes. Too many moonshots, not enough plumbing, and a couple of obvious quick wins sitting on the shelf because nobody bothered to prioritize them.

The teams getting results treat it like a portfolio with tiers.

The base layer is infrastructure: data pipelines, evaluation tooling, the governance scaffolding every other project ends up leaning on. Nobody gets excited about it. But Gartner figures 60% of AI projects that aren't sitting on AI-ready data will get abandoned through 2026. So the boring layer is the one that decides whether any of the interesting stuff survives.

On top of that sit the quick wins, the proven use cases where the ROI is clear and the tech is mature enough to trust. Customer-service copilots, code assistance, document synthesis, internal search. None of it reinvents your business. But it builds real muscle and buys the credibility to fund bigger swings.

Then, and only then, the transformational bets. The genuinely new stuff that could move where you sit in the market and might also go nowhere. That's what the experimentation engine is for. It's not where most of your money should go.

The whole game is how you weight those three. Heavy at the base, steady in the middle, choosy at the top. Run AI like a venture portfolio and the value compounds. Run it like a to-do list and you land right back with thirty disconnected experiments and nothing to show for them.

Build Less Than You Think

In just about every AI conversation, somebody floats building something custom. Building feels strategic. It keeps your options open, and it keeps the engineers on the fun problems. But the data says slow down. That same MIT study found buying from specialized vendors, or partnering, pays off about 67% of the time. Build it in-house and your odds drop to roughly a third of that, somewhere around 22%.

That doesn't make building wrong. It just means the call deserves more than "eh, we could build that."

Building makes sense when you've got proprietary data or a workflow that's a genuine, lasting edge, the kind that gets sharper the more it's used and can't be bought off a shelf. JPMorgan built PRBuddy, an in-house tool that writes pull-request descriptions and tags code changes, because their codebase and the way they ship are specific enough that anything off the shelf would miss the context that makes it worth having.

When the capability is basically a commodity, buy it and move on. EY dropped ServiceNow's Now Assist into their global service desk to draft resolution notes, and their agents accept about 70% of them untouched, which adds up to something like 66,000 hours a year. Could EY have built their own? Sure. But that's months of work for something that doesn't make their consulting one bit more competitive.

And when nobody has it figured out yet, not you and not any vendor, partnering lets you split the cost of the learning.

So the question to keep coming back to is a plain one. Does building this hand you an edge a competitor can't easily copy? If it doesn't, you're not building strategy. You're building overhead.

Governance That Learns

Most AI governance is built to stop bad things from happening. Biased models, made-up answers, data leaks, regulators knocking. That work matters. It's just not enough.

Here's a number that should bother you: 61% of enterprise AI projects got the green light on a projected ROI that nobody ever measured after launch. The company spent the money, built the thing, and never circled back to see if it worked. Nothing blew up. No data leaked. But nobody learned anything either, which means the next call is no smarter than the last one. That's not a governance gap in the usual sense. It's a learning gap.

So flip the goal. Govern for learning, not just risk. Decide what success looks like before the project starts, not after. Then track which experiments actually taught you something, not just which ones shipped, and measure the results against the bar you set up front. Whatever you learn goes back into the portfolio, so the next round of bets is smarter than this one.
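As a sketch of what that loop might look like in practice, assume each experiment is logged with a target set up front and a measured result filled in after launch. The Experiment fields, the sample entries, and the "higher is better" convention are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Experiment:
    name: str
    success_metric: str
    target: float                      # bar set before the project starts
    measured: Optional[float] = None   # filled in after launch, if anyone checks
    lesson: str = ""                   # what the team learned, shipped or not

    def status(self) -> str:
        """Compare the measured result with the up-front bar (higher is better)."""
        if self.measured is None:
            return "unmeasured"
        return "met" if self.measured >= self.target else "missed"


def learning_gap(experiments: List[Experiment]) -> float:
    """Share of experiments never measured against the bar they were approved on."""
    if not experiments:
        return 0.0
    unmeasured = sum(1 for e in experiments if e.measured is None)
    return unmeasured / len(experiments)


# Hypothetical log entries with made-up numbers.
LOG = [
    Experiment("ticket drafting", "draft acceptance rate", 0.60,
               measured=0.72, lesson="Works once macros are cleaned up."),
    Experiment("contract summaries", "hours saved per week", 10.0,
               measured=4.0, lesson="Reviewers re-read everything anyway."),
    Experiment("internal search", "queries resolved without escalation", 0.50),
]

for e in LOG:
    print(f"{e.name}: {e.status()}")
print(f"learning gap: {learning_gap(LOG):.0%}")
```

The "missed" entry still carries a lesson, which is the part that feeds the next round of bets. The "unmeasured" one is the learning gap showing up in your own data.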

The Stanford Enterprise AI Playbook, out in March 2026, found that the deployments that actually scaled all leaned on the same few things: clear value targets, AI built into real workflows, and change management somebody actually planned for. Governance is the thread that ties those together. Without it, every win is its own little island. With it, each win makes the next one easier.

From Theater to Engine

Think back to that CTO and the thirty experiments. The experimenting was never the problem. Experimenting is how any organization learns. The problem was that nothing tied the experiments together. No problem-first discipline, no repeatable testing, no portfolio behind the spending, no honest build-or-buy calls, no loop to turn the scattered wins into something that lasts.

None of that takes a bigger budget or smarter people. It takes a different way of working: start from problems worth solving, run your experiments through a system that actually learns, spend like an investor, build only where building buys you a real edge, and close the loop so every dollar teaches the next one something.

Do that, and you're not just adopting AI. You're compounding it. And the gap between you and everyone still putting on AI theater gets a little wider every quarter.

Michael Krouze https://www.brainrazr.com