Benchmarks, harnesses, and reviewer systems
Faith-Facing AI Evaluation Infrastructure
Faith-facing AI needs public evaluation and verification infrastructure that can test model behavior, verify that stated constraints hold, validate retrieval systems, and calibrate human reviewer judgment.
Research on benchmark validity, formal verification, reviewer calibration, scorer reliability, red-team design, and proof-carrying citations.
Research map
What this agenda contains.
19 open questions
Benchmark and platform design
Public-interest model comparison, harness testing, and benchmark infrastructure for faith-facing systems.
6 questions
Human calibration and scorer reliability
Whether expert reviewers and model judges measure stable, decision-relevant constructs.
3 questions
Retrieval and source grounding
How systems distinguish strong sources, weak sources, denominational materials, and theological claims.
2 questions
Formal verification and source faithfulness
Source fidelity, proof-carrying citations, tradition-specific constraints, and verified retrieval.
8 questions
Benchmark and platform design
Public-interest model comparison, harness testing, and benchmark infrastructure for faith-facing systems.
6 open questions
Faith-Facing Model Comparison Platform
Can Fide AI build a public-interest evaluation platform that compares models, prompts, retrieval systems, agents, and full faith-facing product harnesses with the rigor expected from institutions like Arena, Artificial Analysis, and METR?
Held-Out Multi-Turn Pastoral Pressure Tests
Do faith-facing AI systems that perform well on single-turn benchmark items also handle multi-turn, emotionally loaded, pastoral-adjacent situations without fabricating authority, overcomplying, failing to escalate, or replacing human care?
Evaluation-Awareness and Faith-Facing Honesty Tests
Do faith-facing AI systems behave differently when they recognize they are being evaluated, and can domain-specific honesty or integrity framings reduce evaluation gaming without creating new failure modes?
Reviewer Reliability for Faith-Facing AI Evaluation
What reviewer configurations produce reliable, fair, and interpretable scores for faith-facing AI outputs?
Optimization Pressure and Visible-Rubric Gaming
If builders can see Fide AI rubrics or optimize against public benchmark items, do systems become genuinely safer or merely better at passing the visible test?
Faith-Facing Red-Team Suite
What red-team probes are needed to expose failures unique to faith-facing AI systems and adjacent high-trust guidance systems?
Human calibration and scorer reliability
Whether expert reviewers and model judges measure stable, decision-relevant constructs.
3 open questions
FMG-Bench Human Calibration and Construct Validity
Do FMG-Bench dimensions measure stable, decision-relevant constructs when scored by calibrated human reviewers, and where do model judges diverge from expert human judgment?
Cross-Faith Benchmark Validity and Measurement Design
How should cross-faith AI benchmarks validate what they measure when scores may depend on question sourcing, user expectations, LLM-as-judge behavior, scoring thresholds, regenerated answers, and model updates, and when a passing mention of religion can be mistaken for meaningful representation?
Faith-AI Research Gap Map and Evidence Commons
What does the current AI ethics, safety, fairness, HCI, and evaluation literature actually study about religion and faith, what does it omit, and how should Fide AI maintain a living evidence map that guides future research rather than duplicating or overstating existing work?
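The comparison between human reviewers and model judges raised in this section can be made concrete with a chance-corrected agreement statistic. The following is a minimal sketch of Cohen's kappa, assuming each scorer assigns one categorical label per item. The function name, the labels, and the sample data are hypothetical illustrations, not FMG-Bench data or a Fide AI method.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement by chance, from each rater's own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical labels: one expert reviewer and one model judge on six outputs.
human = ["pass", "pass", "fail", "fail", "pass", "fail"]
judge = ["pass", "fail", "fail", "fail", "pass", "pass"]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))
```

Here the two scorers agree on four of six items, but half of that agreement is expected by chance, so kappa is about 0.33. A raw agreement rate alone would overstate how closely the judge tracks the reviewer.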
Retrieval and source grounding
How systems distinguish strong sources, weak sources, denominational materials, and theological claims.
2 open questions
Faith-Facing Retrieval Grounding and Citation Reliability
How reliably do faith-facing AI systems retrieve, cite, and represent religious sources when users ask theological, historical, pastoral, or institution-specific questions?
Christian Source Authority and RAG
Can Christian RAG systems distinguish and correctly use Scripture, creeds, confessions, councils, catechisms, denominational policies, patristic sources, commentaries, sermons, blogs, and academic theology?
Formal verification and source faithfulness
Source fidelity, proof-carrying citations, tradition-specific constraints, and verified retrieval.
8 open questions
Formal Verification for Sacred Text Fidelity
How can faith-facing AI systems be formally checked to confirm that they quote, paraphrase, reference, and contextualize sacred texts faithfully with respect to a specified text edition, translation, canon, and interpretive context?
Proof-Carrying Citations for Faith-Facing AI
Can faith-facing AI answers carry checkable citation proofs that show which claims are directly supported by sources, which are inferred, which are uncertain, and which require human or tradition-specific authority?
Tradition-Specific Constraint Formalization
How can tradition-specific boundaries, source hierarchies, doctrinal constraints, and disagreement patterns be translated into machine-checkable specifications without flattening differences across faith traditions?
Authority-Boundary Verification for Pastoral-Adjacent AI
Can faith-facing AI systems be verified to preserve the boundaries among explanation, spiritual encouragement, moral reflection, pastoral or clerical authority, clinical or legal advice, and situations requiring human care?
Cross-Faith Sacred Text and Source Schema
What metadata schema is needed for faith-facing AI systems to represent sacred texts, commentaries, institutional documents, oral traditions, translations, editions, and authority levels across faith traditions?
Theological Contradiction and Entailment Stress Tests
Can faith-facing AI systems be tested to determine whether their answers contradict, entail, overstate, understate, or misrepresent claims within a bounded source set and a specified faith tradition?
Verified Retrieval Pipelines for Faith-Facing RAG
How can faith-facing retrieval-augmented generation pipelines be verified to retrieve authoritative, relevant, context-preserving sources before they generate answers about sacred texts, doctrine, practice, or institutional policy?
Human-Reviewer-to-Formal-Spec Translation
How can theologians, clergy, scholars, ministry practitioners, and community reviewers translate qualitative judgments about faith-facing AI into formal specifications that are faithful to expert intent and usable in evaluation?
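The proof-carrying citation question in this section can be illustrated with a small data structure. The sketch below assumes each claim in an answer carries one of the four support statuses named in that question, and that a directly supported claim must include a quoted span that appears in the cited source. Every name here (`Support`, `Claim`, `check_claim`, `doc-1`) and the sample source text are hypothetical, and the check is deliberately weak: it confirms the quote exists, not that the claim follows from it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Support(Enum):
    DIRECT = "direct"                    # supported by a quoted source span
    INFERRED = "inferred"                # derived from a source, not quoted
    UNCERTAIN = "uncertain"              # flagged as unsettled
    NEEDS_AUTHORITY = "needs_authority"  # requires human or tradition-specific authority


@dataclass(frozen=True)
class Claim:
    text: str
    support: Support
    source_id: Optional[str] = None
    quoted_span: Optional[str] = None


def normalize(s):
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(s.lower().split())


def check_claim(claim, sources):
    """Return a list of problems; an empty list means the claim passes this check."""
    problems = []
    if claim.support is Support.DIRECT:
        if claim.source_id not in sources:
            problems.append("unknown source")
        elif not claim.quoted_span:
            problems.append("direct claim lacks a quoted span")
        elif normalize(claim.quoted_span) not in normalize(sources[claim.source_id]):
            problems.append("quoted span not found in source")
    elif claim.support is Support.INFERRED and claim.source_id not in sources:
        problems.append("inferred claim must name a known source")
    return problems


# Hypothetical source text and claims, invented for illustration only.
sources = {"doc-1": "The council met in the spring and issued three canons."}
ok = Claim("The council issued three canons.", Support.DIRECT, "doc-1", "issued three canons")
bad = Claim("The council issued five canons.", Support.DIRECT, "doc-1", "issued five canons")

print(check_claim(ok, sources))
print(check_claim(bad, sources))
```

Keeping the support status explicit on every claim means an evaluator can audit inferred and uncertain claims separately from quoted ones, rather than treating any attached citation as proof.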
Next step