This is a walkthrough of the eval harness I use whenever I'm prototyping a new AI feature. It's deliberately minimal — about 200 lines of TypeScript — but it covers 80% of what most teams actually need on day one.
## What it does
- Loads a JSON list of test cases
- Runs each through a configurable LLM call
- Scores the output against an expected answer using a rubric
- Writes a CSV report with pass/fail and per-case latency
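The test-case file can be as simple as a JSON array. Here is a minimal sketch of the shapes involved; the field names (`id`, `input`, `expected`, and so on) are my assumptions for illustration, not a fixed schema the harness requires:

```typescript
// Hypothetical shapes for the test-case file and the report rows;
// the field names are illustrative, not a required schema.
interface TestCase {
  id: string;
  input: string;
  expected: string;
}

interface ResultRow extends TestCase {
  output: string;
  score: number;    // 0..1 from the rubric grader
  latencyMs: number;
}

// Parse a JSON string into test cases; reading the file from disk
// is left to fs so this stays self-contained.
function parseTestCases(json: string): TestCase[] {
  return JSON.parse(json) as TestCase[];
}
```

Keeping the parse step separate from file I/O makes it trivial to unit-test the loader with inline fixtures.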
## What it doesn't do

It doesn't replace a real eval platform. Once you have more than a few hundred cases, you'll want something like Braintrust, LangSmith, or your own internal tool. But for the first hundred cases, this is plenty.

## The core loop
```ts
for (const testCase of testCases) {
  const output = await llm(testCase.input);
  const score = await rubricGrader(output, testCase.expected);
  results.push({ ...testCase, output, score });
}
```
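The `rubricGrader` in the loop is whatever grading you wire in. As a stand-in, here's a trivial token-overlap scorer; this is purely illustrative, since in practice the grader would usually be an LLM-as-judge call against the rubric:

```typescript
// Placeholder grader: fraction of the expected answer's tokens that
// appear in the output. Illustrative only; a real rubric grader
// would typically be an LLM-as-judge call.
async function rubricGrader(output: string, expected: string): Promise<number> {
  const want = new Set(expected.toLowerCase().split(/\s+/).filter(Boolean));
  if (want.size === 0) return 0;
  const got = new Set(output.toLowerCase().split(/\s+/));
  let hits = 0;
  for (const token of want) {
    if (got.has(token)) hits++;
  }
  return hits / want.size;
}
```

It's async purely so it's a drop-in for the awaited call in the loop; swapping in a model-based grader later doesn't change the loop at all.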
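For the CSV report, a sketch of the serialization step, assuming "pass" means any score at or above a threshold; the column names, the threshold, and the `Result` shape are my choices for illustration, not part of the harness:

```typescript
// Hypothetical result shape and CSV serializer. Column names and the
// pass threshold are illustrative assumptions.
type Result = {
  id: string;
  output: string;
  score: number;
  latencyMs: number;
};

function toCsv(results: Result[], passThreshold = 0.5): string {
  const header = "id,pass,score,latencyMs";
  const rows = results.map(r =>
    [r.id, r.score >= passThreshold ? "pass" : "fail", r.score, r.latencyMs].join(",")
  );
  return [header, ...rows].join("\n");
}
```

A flat CSV like this opens directly in a spreadsheet, which is usually how the first round of eval triage happens anyway.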
That's it. Everything else is wrapping, logging, and reporting around this loop. The hardest part isn't the code — it's writing the test cases. Start with 20 cases drawn from real user requests, and you'll learn more in an afternoon than in a week of strategy meetings.