PsychBench evaluates how frontier AI models think under pressure. Using poker as a controlled environment, we analyze 21 models from 8 providers across 1,770+ games and 100,000+ individual decisions. Our ECAAMS framework classifies 19 psychological dimensions from each model's reasoning traces.
| Rank | Model | Elo | Observed Win Rate |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 1619 | 66.1% |
| 2 | Claude Opus 4.7 | 1603 | 63.7% |
| 3 | ChatGPT 5.4 | 1597 | 63.5% |
| 4 | Claude Opus 4.5 | 1590 | 62.2% |
| 5 | ChatGPT 5.2 | 1587 | 62.5% |
| 6 | Claude Sonnet 4.6 | 1578 | 60.6% |
| 7 | ChatGPT 5.5 | 1556 | 59.4% |
| 8 | Grok 4.2 | 1547 | 56.1% |
| 9 | Grok 4.1 | 1539 | 55.0% |
| 10 | Grok 4.3 | 1530 | 56.7% |
| 11 | Gemini 3.1 Pro | 1485 | 47.2% |
| 12 | DeepSeek V3.2 | 1473 | 45.6% |
| 13 | Qwen3-235B Thinking | 1457 | 43.2% |
| 14 | DeepSeek V4 Pro | 1435 | 35.0% |
| 15 | Kimi K2.6 | 1435 | 39.4% |
| 16 | Gemini 3 Pro | 1421 | 38.3% |
| 17 | GLM-5 | 1417 | 37.8% |
| 18 | Kimi K2.5 | 1412 | 37.8% |
| 19 | Qwen3.6-35B-A3B | 1379 | 36.7% |
| 20 | Qwen3-Max Thinking | 1376 | 32.1% |
| 21 | Qwen3.5-397B | 1344 | 28.3% |
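
The Elo column can be read through the standard Elo expected-score formula. The sketch below assumes the conventional pairwise logistic model with a 400-point scale and a K-factor of 32. How PsychBench actually computes ratings for multi-player poker is not stated here, so the function names and constants are hypothetical illustrations rather than the benchmark's method.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo logistic model.

    The 400-point scale is the conventional choice, assumed here.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def update_ratings(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Return updated (A, B) ratings after one head-to-head result.

    score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    k = 32 is an assumed K-factor, not a value taken from the benchmark.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Top-ranked vs. bottom-ranked model from the leaderboard above.
    print(round(expected_score(1619, 1344), 2))
```

Under these assumptions, the 275-point gap between the top and bottom of the table corresponds to an expected head-to-head score of roughly 0.83 for the higher-rated model. That figure is not directly comparable to the observed win rate column, which aggregates results against the whole field.
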

ECAAMS (Emotion, Cognition, Action, Arousal, Meaning, Social) is a six-axis framework that classifies 19 psychological dimensions from each model's internal reasoning traces. The dimensions include emotional regulation, metacognition, confidence calibration, theory of mind, and competitive framing, among others. Each trace is classified by a consensus of 4 independent LLM raters.
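
One way a four-rater consensus could work is a supermajority vote per dimension. The sketch below is an assumption, not the documented ECAAMS procedure: the name `consensus_label`, the 3-of-4 threshold, and the example labels are all hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(labels: Sequence[str], min_agree: int = 3) -> Optional[str]:
    """Return the label that at least `min_agree` raters chose, else None.

    With 4 raters and min_agree=3 (an assumed threshold), at most one label
    can qualify, so ties such as a 2-2 split yield None.
    """
    if not labels:
        return None
    top_label, count = Counter(labels).most_common(1)[0]
    return top_label if count >= min_agree else None


if __name__ == "__main__":
    # Hypothetical ratings of one trace on one dimension.
    print(consensus_label(["high", "high", "high", "low"]))
    print(consensus_label(["high", "high", "low", "low"]))
```

Returning `None` on a split keeps ambiguous traces out of the aggregate instead of forcing a label; a real pipeline might send those traces for adjudication.
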

Unlike chess and other perfect-information games, poker involves hidden information, deception, risk management, and opponent modeling. These properties make it an ideal proxy for real-world decision-making under uncertainty in domains such as finance, negotiation, and medicine.