Behavioral Research
Gauntlet Leaderboard.
How language models actually behave under pressure — measured on real hardware, by real users.
Degradation Curves
How scores change across quantization levels for a given model family and size
Select a model family and parameter size to view quantization impact curves.
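As a rough illustration of what a degradation curve captures, the sketch below computes how far each quantization level falls below the highest-precision entry for one model family and size. The quantization labels and scores are made-up placeholders, not measurements from the leaderboard, and `degradation` is a hypothetical helper, not part of gauntlet-cli.

```python
# Hypothetical scores for one model family and size, ordered from highest
# to lowest precision. Labels and numbers are illustrative only.
curve = [("F16", 82.0), ("Q8_0", 81.0), ("Q4_K_M", 76.0), ("Q2_K", 61.5)]


def degradation(curve):
    """Return (level, score, drop relative to the first entry) for each point."""
    baseline = curve[0][1]
    return [(level, score, round(baseline - score, 2)) for level, score in curve]


for level, score, drop in degradation(curve):
    print(f"{level:8s} score={score:5.1f} drop={drop:5.1f}")
```

Plotting the drop against quantization level gives the curve; a sharp bend marks the level at which behavior starts to degrade quickly.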
Predict
Estimate a model's behavioral score on a hardware tier using collaborative filtering over the community dataset.
How it works
Uses collaborative filtering over the community dataset to estimate a model's behavioral score on the selected tier, even for configurations that haven't been measured directly.
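To make the idea concrete, here is a minimal sketch of neighborhood-based collaborative filtering over a small score table. It is not the leaderboard's actual predictor, whose details are not described here. The model names, tier names, scores, and the `similarity` and `predict` functions are all hypothetical. The sketch weights each comparable model by the Pearson correlation of its scores with the target's scores on shared tiers, then shifts the neighbor's score on the requested tier by the difference between the two models' averages.

```python
from math import sqrt

# Hypothetical community data: behavioral score per model configuration and tier.
# All names and numbers are made up for illustration.
scores = {
    "model-a:q4": {"tier-1": 60.0, "tier-2": 70.0, "tier-3": 80.0},
    "model-b:q4": {"tier-1": 50.0, "tier-2": 60.0, "tier-3": 70.0},
    "model-c:q8": {"tier-1": 65.0, "tier-2": 75.0},  # tier-3 not measured
}


def similarity(a, b, shared):
    """Pearson correlation of two models' scores over the tiers both were tested on."""
    mean_a = sum(a[t] for t in shared) / len(shared)
    mean_b = sum(b[t] for t in shared) / len(shared)
    num = sum((a[t] - mean_a) * (b[t] - mean_b) for t in shared)
    den = sqrt(sum((a[t] - mean_a) ** 2 for t in shared)
               * sum((b[t] - mean_b) ** 2 for t in shared))
    return num / den if den else 0.0


def predict(scores, model, tier):
    """Estimate a (model, tier) score from similar models; None if no neighbor helps."""
    target = scores[model]
    weighted_sum = total_weight = 0.0
    for other, other_scores in scores.items():
        if other == model or tier not in other_scores:
            continue
        shared = sorted(set(target) & set(other_scores))
        if len(shared) < 2:
            continue  # too little overlap to compare the two models
        weight = similarity(target, other_scores, shared)
        if weight <= 0:
            continue  # ignore dissimilar or uncorrelated neighbors
        target_mean = sum(target[t] for t in shared) / len(shared)
        other_mean = sum(other_scores[t] for t in shared) / len(shared)
        # Shift the neighbor's score by the gap between the two models' averages.
        estimate = target_mean + (other_scores[tier] - other_mean)
        weighted_sum += weight * estimate
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None


print(predict(scores, "model-c:q8", "tier-3"))  # 85.0
```

In this toy table, model-c tracks the other two models on the tiers they share, so the estimate for the unmeasured tier follows the same trend. A real predictor would need far more data and some guard against sparse or noisy neighbors.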
Every test from every user builds this dataset. Contribute by testing models on your hardware.
pip install gauntlet-cli && gauntlet run --model ollama/gemma4:e2b