4 Comments
Fruk's avatar

ameer – just adding one more angle: do you benchmark your confidence intervals against human-label disagreement? i’ve found models are often overconfident on examples where humans themselves split, so the “true mean” ground truth becomes the wobbly part.

Ameer Saleem's avatar

Good question, that's a tricky one. To be honest, I think it depends on what trade-offs you want or need to make. If you're building confidence intervals for a small sample, it's easy enough to just get humans to evaluate the responses. But of course, as you scale up the sample size, that approach stops being feasible. Essentially, it comes down to how much human effort you can allocate to the benchmarking versus how much risk you're willing to take with LLM evaluators. Sorry, this isn't a super definitive answer. However, trade-offs are a massive aspect of this work, and of real ML projects in general.

Fruk's avatar

your post about confidence intervals for llm eval scores – i’ve been building a collector to filter inputs before the prompt. the model has no immune system. question: what’s your fastest test for “this source is likely wrong”?

Ameer Saleem's avatar

When you say "source", are you referring to the original prompt, or the data source that the model is relying on for its response? For the former, I suppose that manual guardrails have to be put in place, rather than trusting an AI to build them out autonomously. As for the latter, nothing beats manual checks, as you might imagine, though there's definitely some research going on in this area, such as the following: https://arxiv.org/abs/2512.13771