Discussion about this post

Fruk

ameer – just adding one more angle: do you benchmark your confidence intervals against human-label disagreement? i’ve found models are often overconfident on examples where humans themselves split, so the “true mean” ground truth becomes the wobbly part.

Fruk

your post about confidence intervals for llm eval scores – i’ve been building a collector to filter inputs before the prompt. the model has no immune system. question: what’s your fastest test for “this source is likely wrong”?
