Behavioral Research

Gauntlet Leaderboard.

How language models actually behave under pressure — measured on real hardware, by real users.

Degradation Curves

How scores change across quantization levels for a given model family and size

Select a model family and parameter size to view quantization impact curves.
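As a rough illustration of what a degradation curve summarizes, the sketch below averages scores per quantization level and reports the change relative to the highest-precision level. The run data, the level ordering, and the `degradation_curve` helper are all hypothetical; nothing here reflects how the leaderboard itself computes its curves.

```python
from collections import defaultdict

# Hypothetical runs for one model family and size: (quantization label, score).
runs = [
    ("F16", 82.0), ("F16", 80.0),
    ("Q8_0", 80.0),
    ("Q4_K_M", 75.0), ("Q4_K_M", 73.0),
]

# Assumed ordering from highest to lowest precision; the first entry is the baseline.
order = ["F16", "Q8_0", "Q4_K_M"]


def degradation_curve(runs, order):
    """Return (level, mean score, change vs. baseline) for each level that has data."""
    by_level = defaultdict(list)
    for level, score in runs:
        by_level[level].append(score)
    means = {level: sum(vals) / len(vals) for level, vals in by_level.items()}
    baseline = means[order[0]]
    return [
        (level, means[level], means[level] - baseline)
        for level in order
        if level in means
    ]


curve = degradation_curve(runs, order)
for level, mean_score, delta in curve:
    print(f"{level:8s} mean={mean_score:.1f} delta={delta:+.1f}")
```

Reporting the change against a fixed baseline rather than the raw mean makes curves for different families easier to compare on one axis.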

Predict

Estimate a model's behavioral score on a hardware tier using collaborative filtering over the community dataset.

How it works: collaborative filtering over the community dataset estimates a model's behavioral score on the selected tier, even for configurations that haven't been measured directly.
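To make the idea concrete, here is a minimal neighborhood-based collaborative filtering sketch. It is not the leaderboard's actual predictor: the toy scores, model and tier names, and the `predict` helper are assumptions made for illustration. Each model that was measured on the target tier contributes an estimate (the target model's mean over the shared tiers, shifted by how far the neighbor moved on the target tier), and estimates are weighted by cosine similarity.

```python
from math import sqrt

# Hypothetical toy data: scores[model][tier] = behavioral score.
scores = {
    "model-a": {"tier-1": 80.0, "tier-2": 70.0, "tier-3": 60.0},
    "model-b": {"tier-1": 78.0, "tier-2": 69.0},
    "model-c": {"tier-1": 50.0, "tier-2": 45.0, "tier-3": 30.0},
}


def cosine(a, b, shared):
    """Cosine similarity of two score rows, restricted to the shared tiers."""
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = sqrt(sum(a[t] ** 2 for t in shared))
    norm_b = sqrt(sum(b[t] ** 2 for t in shared))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(scores, model, tier):
    """Estimate scores[model][tier]; return None if no neighbor can inform it."""
    target = scores[model]
    if tier in target:
        return target[tier]  # already measured directly
    weighted_sum = total_weight = 0.0
    for other, row in scores.items():
        if other == model or tier not in row:
            continue
        shared = sorted((set(target) & set(row)) - {tier})
        if not shared:
            continue
        weight = cosine(target, row, shared)
        if weight <= 0:
            continue
        target_mean = sum(target[t] for t in shared) / len(shared)
        neighbor_mean = sum(row[t] for t in shared) / len(shared)
        # Shift the target's own level by the neighbor's change on this tier.
        estimate = target_mean + (row[tier] - neighbor_mean)
        weighted_sum += weight * estimate
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None


predicted = predict(scores, "model-b", "tier-3")
print(f"model-b on tier-3: {predicted:.1f}")
```

Cosine similarity over raw, all-positive scores tends to sit close to 1, so a real system would more likely use mean-centered similarities or a matrix factorization; the sketch keeps the simpler form for readability.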

Every test from every user builds this dataset. Contribute by testing models on your hardware.

pip install gauntlet-cli && gauntlet run --model ollama/gemma4:e2b