# Anti-Sycophancy
A skill for OpenCode that detects and eliminates sycophantic patterns in AI responses. Sycophancy is when an AI prioritises affirming a user's stated or implied beliefs over epistemic integrity. It is a structural artifact of reinforcement learning from human feedback (RLHF), not an attitude problem.
## Installation
```bash
mkdir -p ~/.config/opencode/skills
cp -r anti-sycophancy ~/.config/opencode/skills/
```
## Usage
Load the skill explicitly when you want the agent to prioritise directness:
```
/skill anti-sycophancy
```
Once loaded, the agent follows a procedural anti-sycophancy process for every response: extract the user's core claim, assess it independently, conclude based on evidence, and respond conclusion-first.
You can also reference it inline:
```
With the anti-sycophancy skill active, review this architecture:
```
## Patterns Covered
### Epistemic Sycophancy (what you say)
| # | Pattern | What it catches |
|---|---------|-----------------|
| 1 | **Answer Sycophancy** | Agreeing with incorrect user claims instead of correcting them |
| 2 | **Premise Endorsement** | Answering within a flawed frame instead of challenging it |
| 3 | **Mimicry Sycophancy** | Adopting the user's errors and following flawed reasoning chains |
| 4 | **Feedback Sycophancy** | Predictably biased positive evaluation when asked to review |
### Soft Sycophancy (how you say it)
| # | Pattern | What it catches |
|---|---------|-----------------|
| 5 | **Validation-Before-Correction** | Emotional preamble before disagreement (most common form) |
| 6 | **Emotional Over-validation** | "Great question!", gratitude inflation, conversational padding |
| 7 | **False Agreement Framing** | "You're right that X, but..." where X isn't right |
| 8 | **Hedged Disagreement** | "You might also consider..." instead of "That won't work" |
| 9 | **Deference Posturing** | Inflating the user's authority to avoid taking a position |
| 10 | **Opinion Reversal on Pushback** | Changing a correct answer when challenged, without new evidence |
### Social Sycophancy (who the user is)
| # | Pattern | What it catches |
|---|---------|-----------------|
| 11 | **Status Deference** | Agreeing more readily when user signals expertise or authority |
| 12 | **Identity Alignment** | Shifting positions toward perceived user identity |
| 13 | **Face-Preserving Agreement** | Agreeing to avoid social friction despite evidence |
## How It Works
The skill does not use pattern matching. It applies a single procedural discipline to every response:
1. **Extract** the user's core claim from their framing. State it stripped of premises.
2. **Assess** that claim independently — evidence for/against, without referencing user agreement or authority.
3. **Conclude** based solely on step 2.
4. **Respond** with the conclusion first, evidence second.
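The four steps above can be pictured as a small data structure. The Python sketch below is illustrative only: the skill is a prompt-level procedure, not code, and the names `Assessment` and `render_response`, along with the example claim, are hypothetical and do not come from the skill itself.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Assessment:
    """Hypothetical record of one pass through the four steps."""

    claim: str  # step 1: the core claim, stripped of the user's framing
    evidence_for: list[str] = field(default_factory=list)  # step 2
    evidence_against: list[str] = field(default_factory=list)  # step 2
    conclusion: str = ""  # step 3: derived only from the evidence above


def render_response(a: Assessment) -> str:
    """Step 4: emit the conclusion first, then the supporting evidence."""
    lines = [a.conclusion]
    lines += [f"- For: {e}" for e in a.evidence_for]
    lines += [f"- Against: {e}" for e in a.evidence_against]
    return "\n".join(lines)


# Made-up example: note that no field records who the user is or
# whether they agree, so neither can influence the conclusion.
example = Assessment(
    claim="Caching every query result will fix the latency problem",
    evidence_against=["Most requests are writes, which bypass the cache"],
    conclusion="Caching alone will not fix the latency problem.",
)
print(render_response(example))
```

The record deliberately has no slot for user agreement or authority, which mirrors the requirement that step 2 be assessed independently.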
When the user disagrees:
- New evidence → update position, state what changed
- Repeated opinion → restate position with evidence
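The pushback rule can be sketched the same way. This is a minimal Python illustration under loud assumptions: evidence items are plain strings compared by equality, and `handle_pushback` and its `reassess` callback are hypothetical names, not part of the skill.

```python
from typing import Callable, List, Tuple


def handle_pushback(
    position: str,
    evidence: List[str],
    offered: List[str],
    reassess: Callable[[List[str]], str],
) -> Tuple[str, List[str], str]:
    """Return (position, evidence, note) after the user disagrees.

    `offered` is whatever the user supplied alongside the disagreement.
    `reassess` is a hypothetical callback that derives a position from
    evidence alone.
    """
    # Only items not already considered count as new evidence.
    novel = [item for item in offered if item not in evidence]
    if not novel:
        # Repeated opinion: hold the position and restate it with evidence.
        return position, evidence, "Restated: " + "; ".join(evidence)
    # New evidence: reassess on the enlarged evidence set and say what changed.
    updated = evidence + novel
    return reassess(updated), updated, "Changed because of: " + "; ".join(novel)
```

Disagreement that arrives with no new evidence leaves `position` untouched. Only the content of `offered` can trigger a reassessment, never the act of pushing back.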
## References
- Sharma, M., Tong, M., Korbak, T., et al. (2023). Towards Understanding Sycophancy in Language Models. *ICLR 2024*. arXiv:2310.13548.
- Perez, E., et al. (2022). Discovering Language Model Behaviors with Model-Written Evaluations. *Findings of ACL 2023*.
- Dubois, M., Ududec, C., Summerfield, C., & Luettgau, L. (2026). Ask Don't Tell: Reducing Sycophancy in Large Language Models. arXiv:2602.23971.
- Gligorić, K., et al. (2026). SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy. arXiv:2604.02423.
- Mohsin, M. A., et al. (2026). Pressure, What Pressure? Sycophancy Disentanglement via Reward Decomposition. arXiv:2604.05279.
- Feng, Z., et al. (2026). Good Arguments Against the People Pleasers: How Reasoning Mitigates (Yet Masks) LLM Sycophancy. arXiv:2603.16643.
- Cheng, M., Yu, S., Lee, C., Khadpe, P., Ibrahim, L., & Jurafsky, D. (2026). Social Sycophancy: A Broader Understanding of LLM Sycophancy. *Proc. ICLR 2026*. arXiv:2505.13995.
- Ibrahim, L., Hafner, F. S., & Rocher, L. (2026). Training Language Models to be Warm Can Reduce Accuracy and Increase Sycophancy. *Nature*, 652, 1159–1165. DOI: 10.1038/s41586-026-10410-0.
- Cheng, M., Lee, C., Khadpe, P., Yu, S., Han, D., & Jurafsky, D. (2026). Sycophantic AI Decreases Prosocial Intentions and Promotes Dependence. *Science*, 391(6792). DOI: 10.1126/science.aec8352.
- The Silicon Mirror: Dynamic Behavioral Gating for Anti-Sycophancy (2026). arXiv:2604.00478.
## License
MIT