BUILDER_PM | Applied AI Product Manager Portfolio

Every team I talk to has the same blind spot. They have a model. They have users. They have vibes. What they don't have is a systematic way to know whether the model is getting better, worse, or just different.

Vibes don't scale

In the early days of any AI product, "ship and feel" works. The PM uses the feature, the engineer uses the feature, a few friendly users try it. If it feels smart, you ship.

But the moment you have multiple prompts in production, multiple model versions to compare, multiple use cases to satisfy — vibes break down. You need evals.

Three layers of evals worth building

Unit evals — small, fast, deterministic checks for specific behaviors (refusals, format, factuality on a known set).
Regression evals — a frozen golden set of inputs that you re-run on every model swap.
Online evals — sampled production traffic scored by a stronger model or by humans, fed back into the next iteration.

The teams that win in AI won't have the smartest models. They'll have the tightest feedback loops.

Evals Are the New Product-Market Fit

Vibes don't scale

Three layers of evals worth building

Got a sharper take?

More in Blog

The Golden Age of PM: From Writing PRDs to Shipping Production Code