Fide AI
FID-002

FMG-Bench Human Calibration and Construct Validity

Do FMG-Bench dimensions measure stable, decision-relevant constructs when scored by calibrated human reviewers, and where do model judges diverge from expert human judgment?

Why this matters

The question behind the brief.

Benchmarks are easy to publish and hard to validate. Fide AI should not rely on synthetic judges or aggregate scores unless it knows which dimensions are reliable enough to guide deployment, procurement, or public claims.

Ways to help

Move this from question to evidence.

Serve as an expert reviewer.

Review scoring rubrics.

Help with reliability analysis.

Build annotation and adjudication tooling.
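
For contributors interested in the reliability analysis, one possible starting point is a chance-corrected agreement statistic between a human reviewer and a model judge on a single dimension. The sketch below computes Cohen's kappa in plain Python. The function name, the pass/fail label scheme, and the example scores are assumptions made for illustration; they are not part of FMG-Bench or any Fide AI tooling, and the brief does not prescribe a particular statistic.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled independently at their own base rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical scores (1 = pass, 0 = fail) on one dimension, for illustration only.
human = [1, 1, 0, 1, 0, 0, 1, 0]
judge = [1, 1, 0, 0, 0, 1, 1, 0]
kappa = cohen_kappa(human, judge)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.50"
```

In this made-up example the two raters agree on 6 of 8 items (0.75), but half of that agreement is expected by chance, so kappa is 0.5. Computing the statistic per dimension, rather than on an aggregate score, is one way to see which dimensions a model judge tracks well and which it does not. Ordinal rubrics or more than two raters would call for a different statistic, such as weighted kappa or Krippendorff's alpha.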

Contribute

Choose a public issue path or contact Fide AI.