FideAI

FID-044

Cross-Faith Benchmark Validity and Measurement Design

How should cross-faith AI benchmarks validate what they measure when scores may depend on question sourcing, user expectations, LLM-as-judge behavior, scoring thresholds, regenerated answers, and model updates, and when a passing religious mention can be scored the same as meaningful representation?

Why this matters

The question behind the brief.

Recent cross-faith benchmark releases make religious representation and conversion asymmetry measurable, but the measurement choices themselves are high-stakes. A benchmark that rewards any religious mention may miss inaccurate, tokenizing, or shallow representation. A benchmark that penalizes persuasion may misread legitimate pastoral or tradition-specific speech. Fide AI can help make faith-facing benchmark claims more reproducible, more interpretable, and more modest about what the scores actually show.

Ways to help

Move this from question to evidence.

Reproduce benchmark subsets and audit scorer behavior.

Serve as an expert or trained non-expert reviewer.

Design uncertainty, drift, and regeneration-variance analyses.

Draft public-claims standards for faith-facing benchmark releases.
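
As one illustration of what a regeneration-variance and uncertainty analysis could look like, the sketch below assumes each benchmark question has been answered several times by the same model and that each answer received a score between 0 and 1. The function names, the data layout, and the toy scores are hypothetical and are not taken from any existing benchmark release. The sketch separates score variation across questions from variation across regenerated answers to the same question, then bootstraps an interval for the score a single benchmark run might report.

```python
import random
import statistics


def regeneration_variance(scores_by_item):
    """Split score variation into within-item and between-item parts.

    scores_by_item: dict mapping a question id to the list of scores
    (each in [0, 1]) from repeated regenerations of the answer.
    This layout is an assumption made for illustration.
    """
    per_item = list(scores_by_item.values())
    item_means = [statistics.mean(s) for s in per_item]
    # Average variance across regenerations of the same question.
    within = statistics.mean([statistics.pvariance(s) for s in per_item])
    # Variance of per-question mean scores across questions.
    between = statistics.pvariance(item_means)
    return {
        "benchmark_mean": statistics.mean(item_means),
        "within_item_var": within,
        "between_item_var": between,
    }


def bootstrap_single_run_interval(scores_by_item, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a single-run benchmark score.

    Each replicate resamples questions with replacement, then draws one
    regeneration per sampled question, mimicking a benchmark that is
    run once per question.
    """
    rng = random.Random(seed)
    per_item = list(scores_by_item.values())
    replicates = []
    for _ in range(n_boot):
        sampled = [rng.choice(per_item) for _ in per_item]
        replicates.append(statistics.mean([rng.choice(s) for s in sampled]))
    replicates.sort()
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return replicates[lo_idx], replicates[hi_idx]


# Invented toy scores, for illustration only.
toy_scores = {
    "q1": [1, 1, 1],
    "q2": [0, 1, 0],
    "q3": [0, 0, 0],
}
summary = regeneration_variance(toy_scores)
interval = bootstrap_single_run_interval(toy_scores)
```

If the within-item variance is large relative to the between-item variance, a reported score may say as much about which regeneration happened to be sampled as about the model. Running the same analysis before and after a model update would be one simple way to begin looking at drift.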

Contribute

Choose a public issue path or contact Fide AI.