Benchmarks, harnesses, and reviewer systems

Faith-Facing AI Evaluation Infrastructure

Faith-facing AI needs public evaluation and verification infrastructure that can test model behavior, prove constraints, validate retrieval systems, and calibrate human reviewer judgment.

This agenda covers research on benchmark validity, formal verification, reviewer calibration, scorer reliability, red-team design, and proof-carrying citations.


Research map

What this agenda contains.

19 open questions

Benchmark and platform design

Public-interest model comparison, harness testing, and benchmark infrastructure for faith-facing systems.

6 questions

Human calibration and scorer reliability

Whether expert reviewers and model judges measure stable, decision-relevant constructs.

3 questions

Retrieval and source grounding

How systems distinguish strong sources, weak sources, denominational materials, and theological claims.

2 questions

Formal verification and source faithfulness

Source fidelity, proof-carrying citations, tradition-specific constraints, and verified retrieval.

8 questions

Benchmark and platform design

Public-interest model comparison, harness testing, and benchmark infrastructure for faith-facing systems.

6 open questions

FID-001 agenda

Faith-Facing Model Comparison Platform

Can Fide AI build a public-interest evaluation platform that compares models, prompts, retrieval systems, agents, and full faith-facing product harnesses with the rigor expected from institutions like Arena, Artificial Analysis, and METR?

FID-003 agenda

Held-Out Multi-Turn Pastoral Pressure Tests

Do faith-facing AI systems that perform well on single-turn benchmark items also handle multi-turn, emotionally loaded, pastoral-adjacent situations without fabricating authority, overcomplying, missing escalation, or replacing human care?

FID-008 agenda

Evaluation-Awareness and Faith-Facing Honesty Tests

Do faith-facing AI systems behave differently when they recognize they are being evaluated, and can domain-specific honesty or integrity framings reduce evaluation gaming without creating new failure modes?

FID-011 agenda

Reviewer Reliability for Faith-Facing AI Evaluation

What reviewer configurations produce reliable, fair, and interpretable scores for faith-facing AI outputs?

FID-012 agenda

Optimization Pressure and Visible-Rubric Gaming

If builders can see Fide AI rubrics or optimize against public benchmark items, do systems become genuinely safer or merely better at passing the visible test?

FID-023 agenda

Faith-Facing Red-Team Suite

What red-team probes are needed to expose failures unique to faith-facing AI systems and adjacent high-trust guidance systems?

Human calibration and scorer reliability

Whether expert reviewers and model judges measure stable, decision-relevant constructs.

3 open questions

FID-002 agenda

FMG-Bench Human Calibration and Construct Validity

Do FMG-Bench dimensions measure stable, decision-relevant constructs when scored by calibrated human reviewers, and where do model judges diverge from expert human judgment?

FID-044 agenda

Cross-Faith Benchmark Validity and Measurement Design

How should cross-faith AI benchmarks validate what they measure when scores may depend on question sourcing, user expectations, LLM-as-judge behavior, scoring thresholds, regenerated answers, model updates, and the difference between merely mentioning a religion and representing it meaningfully?

FID-045 agenda

Faith-AI Research Gap Map and Evidence Commons

What does the current AI ethics, safety, fairness, HCI, and evaluation literature actually study about religion and faith, what does it omit, and how should Fide AI maintain a living evidence map that guides future research rather than duplicating or overstating existing work?
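The calibration questions in this group ask where model judges diverge from expert human reviewers. As a minimal sketch of one way such divergence could be quantified, the example below computes Cohen's kappa, a standard chance-corrected agreement statistic, between one human reviewer and one model judge. The labels, the `pass`/`fail` scale, and the function name are illustrative assumptions; nothing here is taken from FMG-Bench or any Fide AI rubric.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical scores on six outputs (illustrative data only).
human = ["pass", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.3f}")
```

Kappa near 1 indicates agreement well beyond chance, and values near 0 indicate agreement no better than chance. A real calibration study would likely need more than two raters, ordinal scales, and per-dimension breakdowns, which this sketch does not cover.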

Retrieval and source grounding

How systems distinguish strong sources, weak sources, denominational materials, and theological claims.

2 open questions

FID-006 agenda

Faith-Facing Retrieval Grounding and Citation Reliability

How reliably do faith-facing AI systems retrieve, cite, and represent religious sources when users ask theological, historical, pastoral, or institution-specific questions?

FID-028 agenda

Christian Source Authority and RAG

Can Christian RAG systems distinguish and correctly use Scripture, creeds, confessions, councils, catechisms, denominational policies, patristic sources, commentaries, sermons, blogs, and academic theology?

Formal verification and source faithfulness

Source fidelity, proof-carrying citations, tradition-specific constraints, and verified retrieval.

8 open questions

FID-056 agenda


Formal Verification for Sacred Text Fidelity

Proof-Carrying Citations for Faith-Facing AI

Tradition-Specific Constraint Formalization

Authority-Boundary Verification for Pastoral-Adjacent AI

Cross-Faith Sacred Text and Source Schema

Theological Contradiction and Entailment Stress Tests

Verified Retrieval Pipelines for Faith-Facing RAG

Human-Reviewer-to-Formal-Spec Translation
