This is a walkthrough of the eval harness I use whenever I'm prototyping a new AI feature. It's deliberately minimal — about 200 lines of TypeScript — but it covers 80% of what most teams actually need on day one.
## What it does
- Loads a JSON list of test cases
- Runs each through a configurable LLM call
- Scores the output against an expected answer using a rubric
- Writes a CSV report with pass/fail and per-case latency
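The test-case file can be as simple as a JSON array. Here is a minimal sketch of the shapes involved; the field names (`id`, `input`, `expected`, and so on) are my assumptions for illustration, not a fixed schema the harness requires:

```typescript
// Hypothetical shapes for the test-case file and the report rows;
// the field names are illustrative, not a required schema.
interface TestCase {
  id: string;
  input: string;
  expected: string;
}

interface ResultRow extends TestCase {
  output: string;
  score: number;    // 0..1 from the rubric grader
  latencyMs: number;
}

// Parse a JSON string into test cases; reading the file from disk
// is left to fs so this stays self-contained.
function parseTestCases(json: string): TestCase[] {
  return JSON.parse(json) as TestCase[];
}
```

Keeping the parse step separate from file I/O makes it trivial to unit-test the loader with inline fixtures.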
## What it doesn't do

It doesn't replace a real eval platform. Once you have more than a few hundred cases, you'll want something like Braintrust, LangSmith, or your own internal tool. But for the first hundred cases, this is plenty.

## The core loop
```ts
for (const testCase of testCases) {
  const output = await llm(testCase.input);
  const score = await rubricGrader(output, testCase.expected);
  results.push({ ...testCase, output, score });
}
```
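The `rubricGrader` in the loop is whatever grading you wire in. As a stand-in, here's a trivial token-overlap scorer; this is purely illustrative, since in practice the grader would usually be an LLM-as-judge call against the rubric:

```typescript
// Placeholder grader: fraction of the expected answer's tokens that
// appear in the output. Illustrative only; a real rubric grader
// would typically be an LLM-as-judge call.
async function rubricGrader(output: string, expected: string): Promise<number> {
  const want = new Set(expected.toLowerCase().split(/\s+/).filter(Boolean));
  if (want.size === 0) return 0;
  const got = new Set(output.toLowerCase().split(/\s+/));
  let hits = 0;
  for (const token of want) {
    if (got.has(token)) hits++;
  }
  return hits / want.size;
}
```

It's async purely so it's a drop-in for the awaited call in the loop; swapping in a model-based grader later doesn't change the loop at all.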
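For the CSV report, a sketch of the serialization step, assuming "pass" means any score at or above a threshold; the column names, the threshold, and the `Result` shape are my choices for illustration, not part of the harness:

```typescript
// Hypothetical result shape and CSV serializer. Column names and the
// pass threshold are illustrative assumptions.
type Result = {
  id: string;
  output: string;
  score: number;
  latencyMs: number;
};

function toCsv(results: Result[], passThreshold = 0.5): string {
  const header = "id,pass,score,latencyMs";
  const rows = results.map(r =>
    [r.id, r.score >= passThreshold ? "pass" : "fail", r.score, r.latencyMs].join(",")
  );
  return [header, ...rows].join("\n");
}
```

A flat CSV like this opens directly in a spreadsheet, which is usually how the first round of eval triage happens anyway.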
That's it. Everything else is wrapping, logging, and reporting around this loop. The hardest part isn't the code — it's writing the test cases. Start with 20 cases drawn from real user requests, and you'll learn more in an afternoon than in a week of strategy meetings.