The 'Vibe Check' is Not a Strategy

Published on 1/11/2026

Why Evals are the New Product Requirement

Most AI agent projects follow a predictable, doomed path.

The Product Owner (PO) watches a demo. The agent successfully researches a company, finds a key contact, and drafts a personalized email. It looks like magic. The PO says, “Great, let’s ship it.”

Then, in production, the agent hallucinates a CEO’s name, tries to email a dead domain, and gets stuck in an infinite loop searching for a LinkedIn profile that doesn’t exist. The project stalls. Stakeholders lose faith. The "magic" evaporates.

The problem wasn't the model. The problem was that the product was built on a "vibe check" rather than engineering rigor.

As a Product Owner in the agentic age, your role has fundamentally shifted. You are no longer just managing a backlog of features; you are managing a Success Rubric.

The End of Deterministic Thinking

In traditional software, if a user clicks "Submit," a specific row is created in a database. It is deterministic. You test it once; if the code doesn't change, the result doesn't change.

AI agents are probabilistic. A single prompt adjustment designed to fix a minor formatting issue might inadvertently break the agent’s ability to handle a complex edge case three steps later. This is a regression, and in systems built on Large Language Models (LLMs) it is usually silent: nothing crashes, the output simply gets worse. Left undetected, it is a killer of product momentum.

If your current testing process involves you or a QA lead manually running five prompts to see if the output "looks right," you aren't building a product. You're running a science experiment.

Your New Primary Artifact: The Grader

For a PO, the most important artifact is no longer just the PRD. It’s the Eval Harness and the graders that run inside it.

Evaluations (evals) are automated tests that run your agent through hundreds of scenarios to see where it breaks. But for these to work, someone has to define what "good" actually looks like. That is the PO’s new core responsibility.

Borrowing from the framework recently highlighted by Anthropic, we need to move from manual spot-checks to automated graders. As a PO, you must define the rubrics that these graders use to judge the agent:

  • The Accuracy Rubric: Did the agent actually solve the user's problem? (e.g., "Did the refund amount match the invoice?")
  • The Constraint Rubric: Did it stay within legal and brand guardrails? (e.g., "Did the agent refrain from offering a discount it wasn't authorized to give?")
  • The Efficiency Rubric: Did it take 10 steps to do something that should take two?
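
To make the three rubrics concrete, here is a minimal Python sketch of what programmatic graders could look like for the refund example. Everything in it is an assumption for illustration: the `AgentRun` record, its field names, and the thresholds (no authorized discount, a two-step budget) are hypothetical, not part of any specific framework.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Hypothetical record of one agent run; field names are illustrative."""
    refund_amount: float      # amount the agent refunded
    invoice_amount: float     # amount on the customer's invoice
    discount_offered: float   # discount the agent offered, as a fraction
    steps: list[str] = field(default_factory=list)  # tool calls, in order


def grade_accuracy(run: AgentRun) -> bool:
    # Accuracy rubric: the refund must match the invoice to the cent.
    return round(run.refund_amount, 2) == round(run.invoice_amount, 2)


def grade_constraint(run: AgentRun, max_discount: float = 0.0) -> bool:
    # Constraint rubric: no discount beyond what the agent is authorized to give.
    return run.discount_offered <= max_discount


def grade_efficiency(run: AgentRun, max_steps: int = 2) -> bool:
    # Efficiency rubric: the task should fit within the step budget.
    return len(run.steps) <= max_steps


def grade(run: AgentRun) -> dict[str, bool]:
    """Apply all three rubrics and report a pass/fail verdict for each."""
    return {
        "accuracy": grade_accuracy(run),
        "constraint": grade_constraint(run),
        "efficiency": grade_efficiency(run),
    }


# Illustrative run: correct refund, two steps, but an unauthorized 10% discount.
run = AgentRun(
    refund_amount=49.99,
    invoice_amount=49.99,
    discount_offered=0.10,
    steps=["lookup_invoice", "issue_refund"],
)
result = grade(run)
print(result)  # {'accuracy': True, 'constraint': False, 'efficiency': True}
```

Reporting each rubric separately, rather than one overall score, keeps it visible which kind of failure occurred: here the agent solved the problem efficiently but broke a guardrail.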

When you define a feature, you must simultaneously define the Grader for that feature. If you cannot define how to measure success programmatically, you haven’t defined the feature well enough to automate it.

Why Evals are a Competitive Advantage

Building an eval harness often feels like "extra work" that slows down the initial demo. This is a fallacy. Evals are actually the only way to move fast.

Speed of Iteration: When you have 100+ evals that run in minutes, your developers can experiment with new prompts or models without fear. They know within minutes, not days, if they broke a "gold standard" case.

Operational Intelligence: Evals give you a data-driven way to tell stakeholders exactly how reliable the system is. "The agent is 94% successful on refund requests" is a business statement. "It seems to work pretty well" is a liability.

Model Agnosticism: When a cheaper or faster model is released (like the jump from Claude 3 Opus to Claude 3.5 Sonnet), your evals tell you as soon as the suite finishes whether you can switch without losing quality. You are no longer locked into a specific provider's "vibes."
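
As a rough sketch of how eval results turn into a reliability number and a model-switch decision, the Python below computes a pass rate per category and compares a candidate model against the current baseline. The function names, the 1-point tolerance, and the result counts are all made up for illustration (the 94% simply mirrors the refund example above); a real harness would feed in actual grader verdicts.

```python
from collections import defaultdict


def pass_rates(results):
    """Turn (category, passed) pairs into a {category: pass rate} mapping."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


def safe_to_switch(baseline, candidate, tolerance=0.01):
    """True only if the candidate stays within `tolerance` of the baseline
    on every category the baseline covers (a missing category counts as 0)."""
    return all(
        candidate.get(category, 0.0) >= rate - tolerance
        for category, rate in baseline.items()
    )


# Illustrative verdicts for 100 refund scenarios per model (not real data).
baseline_results = [("refund", True)] * 94 + [("refund", False)] * 6
candidate_results = [("refund", True)] * 90 + [("refund", False)] * 10

baseline = pass_rates(baseline_results)
candidate = pass_rates(candidate_results)

print(f"baseline refund pass rate: {baseline['refund']:.0%}")    # 94%
print(f"candidate refund pass rate: {candidate['refund']:.0%}")  # 90%
print("safe to switch:", safe_to_switch(baseline, candidate))    # False
```

The same comparison works for a prompt change instead of a model change: run the suite before and after, and block the change if any category drops by more than the tolerance.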

Start with the Failure

At FM, we tell our partners: Don't start with the happy path.

If you're building an agentic system, your first sprint shouldn't be the "perfect demo." It should be identifying the 20 ways the agent is most likely to fail and writing an eval for each one.
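
A failure-first eval suite can start very small. The sketch below, in Python, writes one check for each of the three production failures described earlier (a hallucinated contact, a dead email domain, a search loop). The output format, the field names, the repeat limit of three, and the sample data are all hypothetical, chosen only to show the shape of the idea.

```python
def contact_is_sourced(output):
    # Guards against hallucinated names: the contact must appear in the source text.
    return output["contact_name"] in output["source_text"]


def domain_is_verified(output):
    # Guards against dead domains: the email domain must be on the verified list.
    domain = output["email"].rsplit("@", 1)[-1]
    return domain in output["verified_domains"]


def no_search_loop(output, max_repeats=3):
    # Guards against loops: no single tool call may repeat more than max_repeats times.
    calls = output["tool_calls"]
    return all(calls.count(call) <= max_repeats for call in set(calls))


# One named eval per known failure mode; the list grows as new failures are found.
FAILURE_EVALS = {
    "hallucinated_contact": contact_is_sourced,
    "dead_email_domain": domain_is_verified,
    "search_loop": no_search_loop,
}


def run_failure_evals(output):
    """Return the names of every failure mode this agent output triggers."""
    return [name for name, check in FAILURE_EVALS.items() if not check(output)]


# Illustrative agent output (fabricated for this example).
sample = {
    "contact_name": "Jane Doe",
    "source_text": "Acme Corp was founded in 2010. Leadership: John Smith, CEO.",
    "email": "jane@acme.example",
    "verified_domains": {"acme.example"},
    "tool_calls": ["search_linkedin"] * 5,
}

failed = run_failure_evals(sample)
print(failed)  # ['hallucinated_contact', 'search_loop']
```

Each new production incident becomes one more entry in the dictionary, so the suite records what has gone wrong before and keeps checking that it does not happen again.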

The companies that win the AI race will be the ones with the most rigorous feedback loops. They will be the ones who turned "vibes" into versioned, tested, and scalable operations.

Is your product built on a vibe or a rubric?


At FM, we help companies build operational intelligence through AI and systems thinking. If you're ready to move your AI agents from prototype to production-grade operations, let's talk.