Golden conversation sets.
Curate named collections of test conversations for each agent. Each case asserts one specific behaviour: the agent must cite a doc, call a tool, refuse, escalate, or simply give a grounded answer. The eval runner then replays every case against the agent end-to-end.
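As a rough illustration of how a golden set and its assertions could be modelled, here is a minimal Python sketch. Every name in it (`Expect`, `GoldenCase`, `GoldenSet`, `AgentReply`, `check`, `run_set`, `stub_agent`) is hypothetical and not the product's actual schema or API. The reply fields and the rule used for "grounded answer" (at least one citation and no refusal) are likewise assumptions made only for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class Expect(Enum):
    """Hypothetical names for the five behaviours a case can assert."""
    CITE_DOC = "must_cite_doc"
    CALL_TOOL = "must_call_tool"
    REFUSE = "must_refuse"
    ESCALATE = "must_escalate"
    GROUNDED = "any_grounded_answer"


@dataclass
class AgentReply:
    """Assumed shape of what the agent returns after a replay."""
    text: str
    cited_docs: list[str] = field(default_factory=list)
    tool_calls: list[str] = field(default_factory=list)
    refused: bool = False
    escalated: bool = False


@dataclass
class GoldenCase:
    name: str
    messages: list[str]           # the test conversation, user turns only
    expect: Expect
    target: Optional[str] = None  # doc id or tool name, when relevant


@dataclass
class GoldenSet:
    name: str
    agent: str
    cases: list[GoldenCase] = field(default_factory=list)


def check(case: GoldenCase, reply: AgentReply) -> bool:
    """Return True when the reply satisfies the case's assertion."""
    if case.expect is Expect.CITE_DOC:
        return case.target in reply.cited_docs
    if case.expect is Expect.CALL_TOOL:
        return case.target in reply.tool_calls
    if case.expect is Expect.REFUSE:
        return reply.refused
    if case.expect is Expect.ESCALATE:
        return reply.escalated
    # GROUNDED: approximated here as "cited something and did not refuse".
    return bool(reply.cited_docs) and not reply.refused


def run_set(golden_set: GoldenSet,
            agent: Callable[[list[str]], AgentReply]) -> dict[str, bool]:
    """Replay every case against the agent and record pass/fail per case."""
    return {c.name: check(c, agent(c.messages)) for c in golden_set.cases}


def stub_agent(messages: list[str]) -> AgentReply:
    """Stand-in agent so the sketch runs without a real backend."""
    if "refund" in messages[-1].lower():
        return AgentReply(text="See the refund policy.",
                          cited_docs=["refund-policy"])
    return AgentReply(text="I can't help with that.", refused=True)


demo = GoldenSet(name="billing-basics", agent="support-bot", cases=[
    GoldenCase("cites refund policy", ["How do I get a refund?"],
               Expect.CITE_DOC, "refund-policy"),
    GoldenCase("refuses off-topic", ["Write my essay"], Expect.REFUSE),
])
results = run_set(demo, stub_agent)
```

In this sketch each case carries exactly one assertion, which keeps a failing case easy to interpret; a real runner would replace `stub_agent` with a call to the deployed agent.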
No golden sets yet. Create one to start curating eval cases.