FID-002

FMG-Bench Human Calibration and Construct Validity

Do FMG-Bench dimensions measure stable, decision-relevant constructs when scored by calibrated human reviewers, and where do model judges diverge from expert human judgment?

Why this matters

The question behind the brief.

Benchmarks are easy to publish and hard to validate. Fide AI should not rely on synthetic judges or aggregate scores unless it knows which dimensions are reliable enough to guide deployment, procurement, or public claims.
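
One way to make "reliable enough" concrete is a chance-corrected agreement statistic computed separately for each dimension. The sketch below uses Cohen's kappa between two raters, for example a calibrated human reviewer and a model judge. It is an illustration under stated assumptions, not the method the brief commits to: the function names, the dimension names, and the toy labels are all hypothetical, and the brief does not specify which statistic or scoring scale FMG-Bench will use.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def kappa_by_dimension(scores_a, scores_b):
    """Map each dimension name to the kappa between the two raters' labels."""
    return {dim: cohen_kappa(scores_a[dim], scores_b[dim]) for dim in scores_a}


# Toy labels only (hypothetical dimension names, made-up pass/fail scores).
human = {"dimension_a": [1, 1, 0, 0], "dimension_b": [1, 0, 1, 0]}
judge = {"dimension_a": [1, 0, 0, 0], "dimension_b": [1, 0, 1, 0]}
per_dimension = kappa_by_dimension(human, judge)
```

A per-dimension breakdown like this shows where a model judge tracks human reviewers and where it diverges, which an aggregate score would hide. Kappa treats labels as unordered categories, so an ordinal rubric would likely call for a weighted variant or a different statistic.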

Metadata

How to place this idea.

expert reviewers, statistics, reviewer, researcher

Program

Faith-facing evaluation platform

Benchmarks, harness comparisons, reviewer calibration, scorer reliability, red-team suites, agent-security tests, and public evidence infrastructure.

Ways to help

Move this from question to evidence.

Serve as an expert reviewer.

Review scoring rubrics.

Help with reliability analysis.

Build annotation and adjudication tooling.

Contribute

Choose a public issue path or contact Fide AI.

Comment on methodology.

Claim or help.

Open GitHub source.

Contact or sponsor.
