PsychBench — AI Psychology Benchmark

PsychBench evaluates how frontier AI models think under pressure. Using poker as a controlled environment, we analyze 29 models from 8 providers across 2,494+ games and 100,000+ individual decisions. Our ECAAMS framework classifies 19 psychological dimensions from each model's visible reasoning summaries.

Heads-Up Tournament Leaderboard

Rank	Model	Elo	Observed Win Rate
1	Claude Fable 5	1650	59.4%
2	GPT-5.6 Terra	1615	51.4%
3	Claude Opus 4.8	1603	52.9%
4	ChatGPT 5.2	1595	60.5%
5	ChatGPT 5.4	1595	60.5%
6	Claude Opus 4.6	1595	59.6%
7	Claude Sonnet 5	1594	53.1%
8	Claude Opus 4.5	1591	60.5%
9	Claude Sonnet 4.6	1591	60.0%
10	Claude Opus 4.7	1588	59.1%
11	ChatGPT 5.5	1572	56.3%
12	Grok 4.2	1548	53.9%
13	Grok 4.1	1538	53.8%
14	Grok 4.3	1523	53.3%
15	DeepSeek V4 Pro	1500	48.3%
16	GPT-5.6 Luna	1482	55.6%
17	GPT-5.6 Sol	1482	49.3%
18	Gemini 3.5 Flash	1475	49.2%
19	DeepSeek v3.2	1474	45.7%
20	Gemini 3.1 Pro	1471	45.4%
21	Kimi K2.6	1447	42.2%
22	Qwen3-235B Thinking	1445	43.4%
23	Gemini 3 Pro	1422	38.3%
24	GLM-5.2	1419	48.2%
25	GLM-5	1418	38.5%
26	Kimi K2.5	1415	38.9%
27	Qwen3-Max Thinking	1380	33.2%
28	Qwen3.5-397B	1365	31.6%
29	Qwen3.6-35B-A3B	1353	35.4%
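
The page does not say how the Elo ratings above are computed. As a reading aid only, the sketch below assumes the conventional logistic Elo formula with a 400-point scale, which PsychBench does not confirm, and converts a rating gap into an expected head-to-head score. The function name `expected_score` and the `scale` parameter are illustrative, not part of PsychBench.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of player A against player B under the standard Elo model.

    Assumption: ratings follow the conventional logistic Elo curve with a
    400-point scale. PsychBench may use a different variant.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Rank 1 (1650) against rank 2 (1615), using ratings from the table above.
print(round(expected_score(1650, 1615), 2))  # 0.55
```

Under that assumption, the 35-point gap between ranks 1 and 2 corresponds to an expected score of roughly 0.55 for the higher-rated model. The Observed Win Rate column presumably aggregates results over whichever opponents each model actually faced, so it need not match any single pairwise figure.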

ECAAMS Psychological Framework

ECAAMS (Emotion, Cognition, Action, Arousal, Meaning, Social) classifies 19 psychological dimensions, grouped into 6 axes, from each model's visible reasoning summaries. The dimensions include emotional regulation, metacognition, confidence calibration, theory of mind, and competitive framing, among others. Each trace is classified by a consensus of 4 independent LLM raters.
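
The consensus rule for the 4 raters is not specified here. One plausible reading is a simple agreement threshold, sketched below: a label is accepted only when enough raters choose it, and the trace is left unresolved otherwise. The function name `consensus_label`, the `min_agreement` threshold of 3, and the example labels are all hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(ratings: Sequence[str], min_agreement: int = 3) -> Optional[str]:
    """Return the label chosen by at least `min_agreement` raters, else None.

    Assumption: consensus means a vote threshold. The actual PsychBench
    aggregation rule may differ.
    """
    if not ratings:
        return None
    # most_common(1) yields the single most frequent label and its count.
    label, count = Counter(ratings).most_common(1)[0]
    return label if count >= min_agreement else None


# Hypothetical labels from 4 raters for one dimension of one trace.
print(consensus_label(["high", "high", "high", "low"]))  # high
print(consensus_label(["high", "high", "low", "low"]))   # None
```

A threshold of 3 out of 4 makes 2-2 splits explicit instead of breaking them arbitrarily. A looser rule could set `min_agreement` to 2 and add a tie-break.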

Why Poker?

Unlike chess and other perfect-information games, poker involves hidden information, deception, risk management, and opponent modeling. These properties make it an ideal proxy for real-world decision-making under uncertainty in domains such as finance, negotiation, and medicine.