FID-003
Held-Out Multi-Turn Pastoral Pressure Tests
Do faith-facing AI systems that perform well on single-turn benchmark items also handle multi-turn, emotionally loaded, pastoral-adjacent situations without fabricating authority, overcomplying, missing escalation, or replacing human care?
Why this matters
The question behind the brief.
Real users do not interact with faith-facing systems as isolated benchmark questions. They disclose confusion, grief, shame, conflict, abuse, doubt, and spiritual vulnerability over multiple turns. Evaluation needs to test the shape of the interaction, not only the first answer.
Metadata
How to place this idea.
Program
Faith-facing evaluation platform
Benchmarks, harness comparisons, reviewer calibration, scorer reliability, red-team suites, and public evidence infrastructure.
Program
Pastoral triage, escalation, and care boundaries
How systems classify user needs, preserve pastoral boundaries, avoid overvalidation, and refer people to embodied care.
Ways to help
Move this from question to evidence.
Draft scenarios.
Review pastoral and crisis boundaries.
Build multi-turn runner support.
Analyze whether single-turn scores predict pressure failures.
Contribute