Benchmark Tracker
8881 score entries
1803 models tracked
2741 benchmarks
181 discrepancies detected
Discrepancies
| Model | Benchmark | Δ (max − min, pp) | Reported scores | N scores |
|---|---|---|---|---|
| Codegen-Mono | Cwe Detection Recall | 100.0 | 100.0% / 0.0% / evidence | 2 |
| Codegen-Mono | Cwe Detection F1-Score | 99.0 | 99.0% / 0.0% / evidence | 2 |
| Codegen-Mono | Cwe Detection Accuracy | 99.0 | 99.0% / 0.0% / evidence | 2 |
| Codegen-Mono | Cwe Detection Precision | 98.1 | 98.1% / 0.0% / evidence | 2 |
| Llama-3 | GSM8K | 95.8 | 95.8% / 95.4% / 74.8% / 62.5% / 60.7% / 54.3% / 41.7% / 0.0% / evidence | 8 |
| Llama-3 | Longbench | 94.2 | 100.0% / 49.4% / 5.8% / evidence | 3 |
| Qwen2.5 | Ruler | 93.8 | 95.7% / 1.9% / evidence | 2 |
| Llama-3 | Ruler | 84.0 | 100.0% / 16.0% / evidence | 2 |
| Llama-3.1-8B | Ruler | 83.7 | 85.6% / 32.0% / 3.5% / 1.9% / 1.9% / evidence | 5 |
| GPT-4o | HumanEval | 83.2 | 86.7% / 86.2% / 86.2% / 27.7% / 3.4% / evidence | 5 |
| Qwen2.5 | Docvqa | 80.3 | 94.3% / 94.3% / 14.1% / evidence | 3 |
| GPT-3.5 | Rouge-L | 78.2 | 80.0% / 18.9% / 18.1% / 1.8% / evidence | 4 |
| ChatGPT | HumanEval | 77.2 | 85.2% / 8.0% / evidence | 2 |
| GPT-4o | SWE-bench | 76.4 | 83.4% / 7.0% / evidence | 2 |
| DeepSeek-R1 | MBPP | 75.8 | 92.6% / 16.8% / evidence | 2 |
| GPT-4 | Babilong | 75.0 | 85.0% / 10.0% / evidence | 2 |
| Qwen3 | MATH | 75.0 | 75.0% / 0.0% / evidence | 2 |
| mBERT | Pos | 74.6 | 88.9% / 70.3% / 14.3% / evidence | 3 |
| DeepSeek-Coder | MBPP | 74.0 | 94.2% / 85.5% / 35.1% / 20.2% / evidence | 4 |
| Llama-2 | MATH | 73.4 | 88.3% / 29.1% / 28.9% / 19.2% / 14.9% / evidence | 5 |
| Qwen2.5 | GSM8K | 70.7 | 95.0% / 91.5% / 69.9% / 47.5% / 40.0% / 37.7% / 24.3% / evidence | 7 |
| GPT-3.5 | Rouge-2 | 70.7 | 83.0% / 12.3% / evidence | 2 |
| mBERT | Paws-X | 67.8 | 81.9% / 14.1% / evidence | 2 |
| DeepSeek-R1 | MATH | 67.3 | 97.3% / 79.8% / 72.2% / 30.0% / evidence | 4 |
| DeepSeek-R1 | SWE-bench | 67.3 | 72.1% / 44.8% / 4.8% / evidence | 3 |
| Llama-3.1-8B | F1 | 66.6 | 87.6% / 21.0% / evidence | 2 |
| GPT-4o | MATH | 66.0 | 95.0% / 88.7% / 85.9% / 76.7% / 73.4% / 72.7% / 29.0% / evidence | 7 |
| GPT-3.5 | MATH | 64.8 | 93.7% / 72.2% / 42.5% / 28.9% / evidence | 4 |
| GPT-4 | GSM8K | 63.9 | 95.0% / 94.9% / 92.7% / 86.0% / 84.2% / 80.0% / 56.2% / 40.0% / 31.1% / evidence | 9 |
| Qwen2.5 | HumanEval | 63.7 | 96.3% / 82.2% / 79.6% / 59.6% / 59.1% / 41.0% / 32.6% / evidence | 7 |
| Llama-3 | HumanEval | 63.3 | 81.7% / 61.0% / 18.4% / evidence | 3 |
| GPT-4 | Hlce | 61.8 | 76.9% / 15.1% / evidence | 2 |
| CodeLlama-7B | MBPP | 61.8 | 65.0% / 3.2% / 3.2% / evidence | 3 |
| Llama-3.1-70B | GSM8K | 61.5 | 91.6% / 30.1% / evidence | 2 |
| Qwen2.5 | MMLU | 61.1 | 86.1% / 74.9% / 71.0% / 70.0% / 25.0% / evidence | 5 |
| XLM-R | Mlqa | 60.9 | 80.0% / 19.1% / evidence | 2 |
| Mixtral | Ruler | 60.8 | 92.8% / 32.0% / evidence | 2 |
| XLM-R | Bucc | 60.1 | 66.0% / 5.9% / evidence | 2 |
| GPT-3.5 | Medqa | 59.7 | 69.7% / 10.0% / evidence | 2 |
| LLaVA-Video-7B | Longvideobench | 59.4 | 62.7% / 56.0% / 3.3% / 3.3% / evidence | 4 |
| Phi-3 | Ruler | 59.0 | 91.0% / 32.0% / evidence | 2 |
| Qwen3 | GSM8K | 58.8 | 88.5% / 86.5% / 79.9% / 29.7% / evidence | 4 |
| LLaVA-Video-72B | Longvideobench | 58.4 | 61.9% / 3.5% / evidence | 2 |
| Llama-3 | MATH | 58.0 | 87.0% / 76.6% / 72.2% / 52.8% / 48.9% / 48.2% / 39.3% / 34.1% / 29.0% / evidence | 9 |
| Llama-3.1-8B | MATH | 57.6 | 82.9% / 52.2% / 51.0% / 30.0% / 25.3% / evidence | 5 |
| DeepSeek-R1 | Codereval | 57.3 | 59.2% / 1.9% / evidence | 2 |
| Llama-2 | Svamp | 57.3 | 88.0% / 52.4% / 45.2% / 30.7% / evidence | 4 |
| Qwen3 | Math500 | 57.0 | 57.0% / 0.0% / evidence | 2 |
| GPT-5 | GSM8K | 55.9 | 97.4% / 41.5% / evidence | 2 |
| Gemini-1.5-Flash | HumanEval | 55.6 | 73.0% / 17.4% / evidence | 2 |
| Llama-3 | Piqa | 55.4 | 85.4% / 30.0% / evidence | 2 |
| Llama-3 | Belebele | 54.2 | 90.0% / 35.8% / evidence | 2 |
| Qwen2.5 | MATH | 54.0 | 81.0% / 75.0% / 71.3% / 62.1% / 46.4% / 45.7% / 44.1% / 34.8% / 27.0% / evidence | 9 |
| GPT-4o | GSM8K | 54.0 | 95.9% / 94.3% / 93.7% / 91.4% / 41.9% / evidence | 5 |
| GPT-4o | Loong | 53.7 | 74.0% / 20.3% / evidence | 2 |
| Llama-3 | MMLU | 53.0 | 100.0% / 69.6% / 63.5% / 59.9% / 54.5% / 50.7% / 50.2% / 47.0% / evidence | 8 |
| Gemini-1.5-Pro | HumanEval | 52.1 | 75.0% / 48.0% / 22.9% / evidence | 3 |
| Qwen2.5 | Amc | 50.5 | 58.5% / 8.0% / evidence | 2 |
| Llama-3.1-8B | Rouge-L | 50.4 | 53.0% / 2.6% / evidence | 2 |
| mBERT | Xnli | 48.9 | 65.4% / 16.5% / evidence | 2 |
| Qwen2.5 | HellaSwag | 48.6 | 87.6% / 39.0% / evidence | 2 |
| DeepSeek-Coder | HumanEval | 48.6 | 97.6% / 79.3% / 59.6% / 49.4% / 49.0% / evidence | 5 |
| Yi-34B | Ruler | 48.0 | 80.0% / 32.0% / evidence | 2 |
| Llama-2 | Strategyqa | 47.0 | 50.0% / 35.6% / 3.0% / evidence | 3 |
| Qwen2.5 | MBPP | 46.6 | 85.2% / 69.2% / 38.6% / evidence | 3 |
| GPT-4o | MMLU | 46.5 | 95.9% / 49.4% / evidence | 2 |
| mBERT | Ner | 46.1 | 62.2% / 16.1% / evidence | 2 |
| GPT-4o | Recall | 45.7 | 98.0% / 52.3% / evidence | 2 |
| DeepSeek-V3 | SWE-bench | 45.7 | 57.0% / 11.3% / evidence | 2 |
| GPT-4 | Leetcode | 45.3 | 85.0% / 39.7% / evidence | 2 |
| Llama-3.1-8B | Avg. Accuracy | 43.8 | 62.8% / 19.0% / evidence | 2 |
| Llama-2 | GSM8K | 43.2 | 72.0% / 70.6% / 68.4% / 66.6% / 53.3% / 52.1% / 42.5% / 40.0% / 35.0% / 28.8% / evidence | 10 |
| PaLM | BBH | 43.2 | 49.2% / 6.0% / evidence | 2 |
| Llama-3 | Mt-Bench | 41.0 | 50.0% / 9.0% / evidence | 2 |
| Llama-2 | MMLU | 40.2 | 46.7% / 46.7% / 44.8% / 43.9% / 37.8% / 6.5% / evidence | 6 |
| Llama-3.1-8B | Easy | 40.0 | 76.4% / 36.4% / evidence | 2 |
| LongChat-7B-v1.5-32K | Ruler | 40.0 | 100.0% / 60.0% / evidence | 2 |
| GPT-4 | MATH | 39.9 | 69.7% / 60.1% / 48.1% / 42.5% / 33.7% / 29.8% / evidence | 6 |
| CodeLlama-7B | HumanEval | 38.3 | 67.0% / 28.7% / 28.7% / evidence | 3 |
| Llama-2 | Multiarith | 36.2 | 66.0% / 29.8% / evidence | 2 |
| DeepSeek-R1 | Rouge-L | 36.0 | 60.0% / 24.0% / evidence | 2 |
| Gemini-1.5-Flash | MMLU | 35.4 | 43.6% / 8.2% / evidence | 2 |
| Llama-3.1-70B | MMLU | 35.2 | 85.2% / 74.0% / 50.0% / evidence | 3 |
| Phi-3 | MMLU | 34.8 | 69.0% / 63.7% / 34.2% / evidence | 3 |
| Qwen2 | GSM8K | 34.5 | 89.5% / 55.0% / evidence | 2 |
| Llama-3 | Average | 34.4 | 56.6% / 22.2% / evidence | 2 |
| Qwen2.5 | Rouge-L | 34.0 | 52.0% / 18.0% / evidence | 2 |
| Llama-3.1-8B | Longbench | 33.5 | 36.0% / 34.1% / 2.5% / evidence | 3 |
| Llama-3.1-8B | MMLU | 33.3 | 78.2% / 72.3% / 66.9% / 63.5% / 44.9% / evidence | 5 |
| Gemini-1.5-Pro | GSM8K | 33.2 | 91.7% / 85.0% / 58.5% / evidence | 3 |
| Claude-3.5 | Codereval | 33.1 | 85.9% / 52.8% / evidence | 2 |
| Llama-3.1-70B | BBH | 33.0 | 93.0% / 60.0% / evidence | 2 |
| GPT-4 | F1 Score | 32.6 | 71.0% / 38.4% / evidence | 2 |
| GPT-3.5 | GSM8K | 32.0 | 92.0% / 80.1% / 74.9% / 60.0% / evidence | 4 |
| Llama-3 | Rewardbench | 31.5 | 88.8% / 57.3% / evidence | 2 |
| Qwen3 | Longbench | 31.4 | 34.0% / 2.6% / evidence | 2 |
| Gemini-1.5-Pro | MMLU | 30.3 | 75.3% / 45.0% / evidence | 2 |
| GPT-4 | MMLU | 30.3 | 87.3% / 87.3% / 87.3% / 87.3% / 87.3% / 87.3% / 87.3% / 87.3% / 86.4% / 83.9% / 57.0% / evidence | 11 |
| Llama-3.1-8B | Safety | 30.2 | 75.1% / 44.9% / evidence | 2 |
| BLIP | Coco | 30.2 | 69.9% / 39.7% / evidence | 2 |
| Llama-3.1-70B | Hotpotqa | 30.2 | 80.2% / 50.0% / evidence | 2 |
| Llama-3.1-70B | Ruler | 30.0 | 94.0% / 64.0% / evidence | 2 |
| WizardCoder | HumanEval | 30.0 | 87.3% / 57.3% / evidence | 2 |
| Llama-2 | Csqa | 29.6 | 35.6% / 6.0% / evidence | 2 |
| Llama-3.1-8B | GSM8K | 29.5 | 83.9% / 77.6% / 59.1% / 54.5% / evidence | 4 |
| Llama-3 | WinoGrande | 28.8 | 90.0% / 61.2% / evidence | 2 |
| Llama-3.1-8B | Hard | 27.7 | 33.6% / 5.9% / evidence | 2 |
| Gemini-2.0 | GSM8K | 27.6 | 95.6% / 68.0% / evidence | 2 |
| Gemma-2-2B | MMLU | 26.5 | 65.4% / 38.9% / 38.9% / evidence | 3 |
| GPT-4o | Codereval | 25.4 | 84.6% / 59.2% / evidence | 2 |
| CodeLLaMA | MBPP | 24.8 | 64.8% / 40.0% / evidence | 2 |
| LLaVA-Video-7B | Videomme | 24.7 | 90.0% / 65.3% / evidence | 2 |
| Qwen2 | MATH | 24.4 | 67.4% / 54.8% / 43.0% / evidence | 3 |
| GPT-3.5 | HumanEval | 24.3 | 86.8% / 65.2% / 62.5% / evidence | 3 |
| Llama-3.1-70B | MATH | 24.2 | 67.0% / 62.2% / 60.0% / 47.1% / 42.8% / evidence | 5 |
| Phi-3 | MMLU-Pro | 24.1 | 78.9% / 54.8% / evidence | 2 |
| LongAlpaca-13B | Ruler | 24.0 | 40.0% / 16.0% / evidence | 2 |
| Mistral-7B | Rouge-L | 24.0 | 57.0% / 33.0% / evidence | 2 |
| Qwen3 | Olympiadbench | 23.4 | 23.4% / 0.0% / evidence | 2 |
| LLaVA-1.5-7B | Mme | 23.4 | 76.0% / 52.6% / 52.6% / evidence | 3 |
| GPT-4 | MBPP | 22.8 | 39.6% / 16.8% / evidence | 2 |
| Qwen2.5 | Medical | 21.8 | 88.7% / 66.9% / evidence | 2 |
| Llama-3 | AIME | 20.0 | 100.0% / 80.0% / evidence | 2 |
| GPT-4o | Longvideobench | 19.6 | 66.7% / 47.1% / evidence | 2 |
| Llama-3.1-70B | F1 | 19.3 | 80.3% / 61.0% / evidence | 2 |
| Qwen2.5 | Bleu | 19.2 | 23.8% / 23.8% / 4.6% / evidence | 3 |
| Llama-3 | Boolq | 19.1 | 90.0% / 70.9% / evidence | 2 |
| Qwen2.5 | AIME | 18.1 | 45.6% / 42.8% / 27.5% / evidence | 3 |
| Llama-3 | Svamp | 17.9 | 72.2% / 54.3% / evidence | 2 |
| GPT-4 | HumanEval | 17.8 | 87.3% / 87.2% / 80.0% / 78.5% / 76.9% / 69.5% / evidence | 6 |
| CLIP | Mm-Bright | 17.6 | 28.0% / 10.4% / evidence | 2 |
| Gemini-2.0 | MATH | 17.5 | 91.5% / 74.1% / evidence | 2 |
| Llama-2 | Coin Flip | 17.0 | 26.0% / 9.0% / evidence | 2 |
| Qwen3 | Accuracy | 17.0 | 94.0% / 77.0% / evidence | 2 |
| LLaMA | Ppl | 16.0 | 17.0% / 1.1% / evidence | 2 |
| Llama-2 | Ruler | 14.4 | 100.0% / 85.6% / evidence | 2 |
| Llama-3 | MMLU-Pro | 14.2 | 56.2% / 42.0% / evidence | 2 |
| XLM-R | Xquad | 14.2 | 16.3% / 2.1% / evidence | 2 |
| LLaVA-1.5-13B | Pope | 14.1 | 100.0% / 85.9% / evidence | 2 |
| GPT-4o | Precision | 13.9 | 58.0% / 44.1% / evidence | 2 |
| LLaVA-1.5-7B | Gqa | 13.8 | 54.9% / 41.1% / evidence | 2 |
| Llama-2 | Accuracy | 13.3 | 83.3% / 70.0% / evidence | 2 |
| Llama-3 | Ifeval | 12.9 | 57.8% / 44.9% / evidence | 2 |
| GLM-4 | SWE-bench | 11.9 | 47.6% / 35.7% / evidence | 2 |
| GPT-4 | MMLU-Pro | 11.3 | 81.0% / 69.7% / evidence | 2 |
| Gemini-1.5-Pro | Egoschema | 10.7 | 72.2% / 61.5% / evidence | 2 |
| LLaMA-7B | C4 | 10.7 | 17.8% / 7.1% / evidence | 2 |
| LLaDA-MoE-7B-A1B | MATH | 10.4 | 55.0% / 44.6% / evidence | 2 |
| Qwen3 | HellaSwag | 10.4 | 52.0% / 41.6% / evidence | 2 |
| Baseline | Imagenet-1K | 9.7 | 78.7% / 69.0% / evidence | 2 |
| Gemma-2-9B | MMLU | 9.6 | 75.0% / 75.0% / 65.4% / evidence | 3 |
| LLaMA-13B | C4 | 9.6 | 16.2% / 6.6% / evidence | 2 |
| LLAVA-NEXT-34B-NH | Logicvista | 9.4 | 29.9% / 20.6% / evidence | 2 |
| LoRA-FAIR | Domainnet | 8.6 | 77.1% / 68.5% / evidence | 2 |
| DeepSeek-R1 | MMLU | 8.3 | 90.8% / 82.5% / evidence | 2 |
| Llama-3 | Obqa | 8.1 | 90.0% / 81.9% / evidence | 2 |
| GPT-3.5 | MMLU | 7.7 | 67.3% / 59.6% / evidence | 2 |
| RoBERTa | Mnli | 7.5 | 83.5% / 76.0% / evidence | 2 |
| RoBERTa | Cola | 7.4 | 82.0% / 74.6% / evidence | 2 |
| GPT-4o | AIME | 7.3 | 9.3% / 2.0% / evidence | 2 |
| DeepSeek-R1 | HumanEval | 7.3 | 90.7% / 83.4% / evidence | 2 |
| LLaDA-MoE-7B-A1B | GSM8K | 7.0 | 65.8% / 58.8% / evidence | 2 |
| GPT-4 | SWE-bench | 6.8 | 78.8% / 72.0% / evidence | 2 |
| DeepSeek-33B | HumanEval | 6.7 | 78.6% / 71.9% / evidence | 2 |
| Llama-3 | Accuracy | 6.6 | 79.4% / 72.8% / evidence | 2 |
| Qwen3 | LiveCodeBench | 6.6 | 54.6% / 53.6% / 53.4% / 48.0% / evidence | 4 |
| Gemini-1.5-Pro | MATH | 6.5 | 65.0% / 62.2% / 58.5% / evidence | 3 |
| Llama-3.1-8B | Rewardbench | 6.2 | 75.0% / 68.8% / evidence | 2 |
| Llama-3.1-8B | Ifeval | 6.1 | 78.9% / 72.8% / evidence | 2 |
| Llama-3.1-70B | Musique | 6.0 | 50.0% / 44.0% / evidence | 2 |
| SmolLM-3B | GSM8K | 5.2 | 61.9% / 56.7% / evidence | 2 |
| Claude-3.5 | SWE-bench | 5.1 | 56.4% / 51.3% / evidence | 2 |
| Qwen3 | MMLU | 4.8 | 53.0% / 48.2% / evidence | 2 |
| Llama-3.1-70B | Rewardbench | 4.7 | 92.7% / 88.0% / evidence | 2 |
| Llama-3.1-8B | HellaSwag | 4.7 | 77.2% / 72.5% / evidence | 2 |
| Claude-3.5 | HumanEval | 4.4 | 89.0% / 84.6% / evidence | 2 |
| ZSJoint | Multiatis++ | 4.4 | 87.0% / 82.6% / evidence | 2 |
| Codegen-2.7b | Cwe Detection | 3.5 | 9.0% / 5.5% / evidence | 2 |
| Qwen2.5 | SWE-bench | 3.5 | 43.5% / 40.0% / evidence | 2 |
| Qwen3 | AIME | 3.3 | 3.3% / 0.0% / evidence | 2 |
| Gemma-2-9B | MMLU-Pro | 3.1 | 69.1% / 69.1% / 66.0% / evidence | 3 |
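The Δ values in the table are consistent with the gap between the highest and lowest reported score for a model/benchmark pair, in percentage points, with N as the number of reported scores. The page does not state how the tracker computes these columns, so the following is only a minimal sketch under that assumption; the function name `spread_pp` is hypothetical. Because the displayed scores are already rounded, a value recomputed this way can differ from the table by 0.1 on some rows.

```python
def spread_pp(scores):
    """Return (max - min, count) for a list of reported scores in percent.

    Assumption: the tracker's delta column is the max-min spread in
    percentage points, rounded to one decimal place.
    """
    if len(scores) < 2:
        raise ValueError("a discrepancy needs at least two reported scores")
    return round(max(scores) - min(scores), 1), len(scores)


# Scores copied from the Llama-3 / GSM8K row of the table above.
llama3_gsm8k = [95.8, 95.4, 74.8, 62.5, 60.7, 54.3, 41.7, 0.0]
delta, n = spread_pp(llama3_gsm8k)
print(f"delta = {delta} pp across {n} reported scores")
```

A max-min spread is sensitive to a single outlier (for example one 0.0% entry), which may explain why several of the largest rows have N = 2.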
Score Database
Showing top 15 benchmarks by coverage (of 2741 total).