Teardown: agent evals that catch real failures

Which test scenarios expose compounding tool-use errors before users run into them? Include what your test catches and what it still misses.