FID-002
FMG-Bench Human Calibration and Construct Validity
Do FMG-Bench dimensions measure stable, decision-relevant constructs when scored by calibrated human reviewers, and where do model judges diverge from expert human judgment?
Why this matters
The question behind the brief.
Benchmarks are easy to publish and hard to validate. Fide AI should not rely on synthetic judges or aggregate scores unless it knows which dimensions are reliable enough to guide deployment, procurement, or public claims.
Metadata
How to place this idea.
expert reviewers, statistics, reviewer, researcher
Ways to help
Move this from question to evidence.
Serve as an expert reviewer.
Review scoring rubrics.
Help with reliability analysis.
Build annotation and adjudication tooling.
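As one possible starting point for the reliability analysis, the sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic, between a human reviewer and a model judge on a single dimension. The function name, the binary pass/fail labels, and the sample data are illustrative assumptions; the brief does not specify FMG-Bench's rating scale or a preferred statistic.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement if each rater assigned labels independently,
    # at their own observed label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail scores on one dimension (made-up data, not FMG-Bench results).
human_scores = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
judge_scores = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
print(f"kappa = {cohen_kappa(human_scores, judge_scores):.2f}")
```

Computing a statistic like this per dimension, rather than on aggregate scores, is what would show which dimensions are reliable enough to use. If the rubric turns out to be ordinal or to involve more than two reviewers, a weighted kappa or Krippendorff's alpha would likely be a better fit than this unweighted two-rater version.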