Benchmark Tracker

Cross-paper score extraction; automated discrepancy detection; updated continuously

8881 score entries; 1803 models tracked; 2741 benchmarks; 181 discrepancies detected

Discrepancies

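A discrepancy is flagged when different papers report scores for the same model on the same benchmark that diverge by at least 3 percentage points (pp). Δ pp is the gap between the highest and lowest reported score, and N is the number of reported scores. Rows are ordered by spread, largest first.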

Model Benchmark Δ pp Reported scores N
Codegen-Mono Cwe Detection Recall 100.0 100.0% / 0.0% / evidence 2
Codegen-Mono Cwe Detection F1-Score 99.0 99.0% / 0.0% / evidence 2
Codegen-Mono Cwe Detection Accuracy 99.0 99.0% / 0.0% / evidence 2
Codegen-Mono Cwe Detection Precision 98.1 98.1% / 0.0% / evidence 2
Llama-3 GSM8K 95.8 95.8% / 95.4% / 74.8% / 62.5% / 60.7% / 54.3% / 41.7% / 0.0% / evidence 8
Llama-3 Longbench 94.2 100.0% / 49.4% / 5.8% / evidence 3
Qwen2.5 Ruler 93.8 95.7% / 1.9% / evidence 2
Llama-3 Ruler 84.0 100.0% / 16.0% / evidence 2
Llama-3.1-8B Ruler 83.7 85.6% / 32.0% / 3.5% / 1.9% / 1.9% / evidence 5
GPT-4o HumanEval 83.2 86.7% / 86.2% / 86.2% / 27.7% / 3.4% / evidence 5
Qwen2.5 Docvqa 80.3 94.3% / 94.3% / 14.1% / evidence 3
GPT-3.5 Rouge-L 78.2 80.0% / 18.9% / 18.1% / 1.8% / evidence 4
ChatGPT HumanEval 77.2 85.2% / 8.0% / evidence 2
GPT-4o SWE-bench 76.4 83.4% / 7.0% / evidence 2
DeepSeek-R1 MBPP 75.8 92.6% / 16.8% / evidence 2
GPT-4 Babilong 75.0 85.0% / 10.0% / evidence 2
Qwen3 MATH 75.0 75.0% / 0.0% / evidence 2
mBERT Pos 74.6 88.9% / 70.3% / 14.3% / evidence 3
DeepSeek-Coder MBPP 74.0 94.2% / 85.5% / 35.1% / 20.2% / evidence 4
Llama-2 MATH 73.4 88.3% / 29.1% / 28.9% / 19.2% / 14.9% / evidence 5
Qwen2.5 GSM8K 70.7 95.0% / 91.5% / 69.9% / 47.5% / 40.0% / 37.7% / 24.3% / evidence 7
GPT-3.5 Rouge-2 70.7 83.0% / 12.3% / evidence 2
mBERT Paws-X 67.8 81.9% / 14.1% / evidence 2
DeepSeek-R1 MATH 67.3 97.3% / 79.8% / 72.2% / 30.0% / evidence 4
DeepSeek-R1 SWE-bench 67.3 72.1% / 44.8% / 4.8% / evidence 3
Llama-3.1-8B F1 66.6 87.6% / 21.0% / evidence 2
GPT-4o MATH 66.0 95.0% / 88.7% / 85.9% / 76.7% / 73.4% / 72.7% / 29.0% / evidence 7
GPT-3.5 MATH 64.8 93.7% / 72.2% / 42.5% / 28.9% / evidence 4
GPT-4 GSM8K 63.9 95.0% / 94.9% / 92.7% / 86.0% / 84.2% / 80.0% / 56.2% / 40.0% / 31.1% / evidence 9
Qwen2.5 HumanEval 63.7 96.3% / 82.2% / 79.6% / 59.6% / 59.1% / 41.0% / 32.6% / evidence 7
Llama-3 HumanEval 63.3 81.7% / 61.0% / 18.4% / evidence 3
GPT-4 Hlce 61.8 76.9% / 15.1% / evidence 2
CodeLlama-7B MBPP 61.8 65.0% / 3.2% / 3.2% / evidence 3
Llama-3.1-70B GSM8K 61.5 91.6% / 30.1% / evidence 2
Qwen2.5 MMLU 61.1 86.1% / 74.9% / 71.0% / 70.0% / 25.0% / evidence 5
XLM-R Mlqa 60.9 80.0% / 19.1% / evidence 2
Mixtral Ruler 60.8 92.8% / 32.0% / evidence 2
XLM-R Bucc 60.1 66.0% / 5.9% / evidence 2
GPT-3.5 Medqa 59.7 69.7% / 10.0% / evidence 2
LLaVA-Video-7B Longvideobench 59.4 62.7% / 56.0% / 3.3% / 3.3% / evidence 4
Phi-3 Ruler 59.0 91.0% / 32.0% / evidence 2
Qwen3 GSM8K 58.8 88.5% / 86.5% / 79.9% / 29.7% / evidence 4
LLaVA-Video-72B Longvideobench 58.4 61.9% / 3.5% / evidence 2
Llama-3 MATH 58.0 87.0% / 76.6% / 72.2% / 52.8% / 48.9% / 48.2% / 39.3% / 34.1% / 29.0% / evidence 9
Llama-3.1-8B MATH 57.6 82.9% / 52.2% / 51.0% / 30.0% / 25.3% / evidence 5
DeepSeek-R1 Codereval 57.3 59.2% / 1.9% / evidence 2
Llama-2 Svamp 57.3 88.0% / 52.4% / 45.2% / 30.7% / evidence 4
Qwen3 Math500 57.0 57.0% / 0.0% / evidence 2
GPT-5 GSM8K 55.9 97.4% / 41.5% / evidence 2
Gemini-1.5-Flash HumanEval 55.6 73.0% / 17.4% / evidence 2
Llama-3 Piqa 55.4 85.4% / 30.0% / evidence 2
Llama-3 Belebele 54.2 90.0% / 35.8% / evidence 2
Qwen2.5 MATH 54.0 81.0% / 75.0% / 71.3% / 62.1% / 46.4% / 45.7% / 44.1% / 34.8% / 27.0% / evidence 9
GPT-4o GSM8K 54.0 95.9% / 94.3% / 93.7% / 91.4% / 41.9% / evidence 5
GPT-4o Loong 53.7 74.0% / 20.3% / evidence 2
Llama-3 MMLU 53.0 100.0% / 69.6% / 63.5% / 59.9% / 54.5% / 50.7% / 50.2% / 47.0% / evidence 8
Gemini-1.5-Pro HumanEval 52.1 75.0% / 48.0% / 22.9% / evidence 3
Qwen2.5 Amc 50.5 58.5% / 8.0% / evidence 2
Llama-3.1-8B Rouge-L 50.4 53.0% / 2.6% / evidence 2
mBERT Xnli 48.9 65.4% / 16.5% / evidence 2
Qwen2.5 HellaSwag 48.6 87.6% / 39.0% / evidence 2
DeepSeek-Coder HumanEval 48.6 97.6% / 79.3% / 59.6% / 49.4% / 49.0% / evidence 5
Yi-34B Ruler 48.0 80.0% / 32.0% / evidence 2
Llama-2 Strategyqa 47.0 50.0% / 35.6% / 3.0% / evidence 3
Qwen2.5 MBPP 46.6 85.2% / 69.2% / 38.6% / evidence 3
GPT-4o MMLU 46.5 95.9% / 49.4% / evidence 2
mBERT Ner 46.1 62.2% / 16.1% / evidence 2
GPT-4o Recall 45.7 98.0% / 52.3% / evidence 2
DeepSeek-V3 SWE-bench 45.7 57.0% / 11.3% / evidence 2
GPT-4 Leetcode 45.3 85.0% / 39.7% / evidence 2
Llama-3.1-8B Avg. Accuracy 43.8 62.8% / 19.0% / evidence 2
Llama-2 GSM8K 43.2 72.0% / 70.6% / 68.4% / 66.6% / 53.3% / 52.1% / 42.5% / 40.0% / 35.0% / 28.8% / evidence 10
PaLM BBH 43.2 49.2% / 6.0% / evidence 2
Llama-3 Mt-Bench 41.0 50.0% / 9.0% / evidence 2
Llama-2 MMLU 40.2 46.7% / 46.7% / 44.8% / 43.9% / 37.8% / 6.5% / evidence 6
Llama-3.1-8B Easy 40.0 76.4% / 36.4% / evidence 2
LongChat-7B-v1.5-32K Ruler 40.0 100.0% / 60.0% / evidence 2
GPT-4 MATH 39.9 69.7% / 60.1% / 48.1% / 42.5% / 33.7% / 29.8% / evidence 6
CodeLlama-7B HumanEval 38.3 67.0% / 28.7% / 28.7% / evidence 3
Llama-2 Multiarith 36.2 66.0% / 29.8% / evidence 2
DeepSeek-R1 Rouge-L 36.0 60.0% / 24.0% / evidence 2
Gemini-1.5-Flash MMLU 35.4 43.6% / 8.2% / evidence 2
Llama-3.1-70B MMLU 35.2 85.2% / 74.0% / 50.0% / evidence 3
Phi-3 MMLU 34.8 69.0% / 63.7% / 34.2% / evidence 3
Qwen2 GSM8K 34.5 89.5% / 55.0% / evidence 2
Llama-3 Average 34.4 56.6% / 22.2% / evidence 2
Qwen2.5 Rouge-L 34.0 52.0% / 18.0% / evidence 2
Llama-3.1-8B Longbench 33.5 36.0% / 34.1% / 2.5% / evidence 3
Llama-3.1-8B MMLU 33.3 78.2% / 72.3% / 66.9% / 63.5% / 44.9% / evidence 5
Gemini-1.5-Pro GSM8K 33.2 91.7% / 85.0% / 58.5% / evidence 3
Claude-3.5 Codereval 33.1 85.9% / 52.8% / evidence 2
Llama-3.1-70B BBH 33.0 93.0% / 60.0% / evidence 2
GPT-4 F1 Score 32.6 71.0% / 38.4% / evidence 2
GPT-3.5 GSM8K 32.0 92.0% / 80.1% / 74.9% / 60.0% / evidence 4
Llama-3 Rewardbench 31.5 88.8% / 57.3% / evidence 2
Qwen3 Longbench 31.4 34.0% / 2.6% / evidence 2
Gemini-1.5-Pro MMLU 30.3 75.3% / 45.0% / evidence 2
GPT-4 MMLU 30.3 87.3% / 87.3% / 87.3% / 87.3% / 87.3% / 87.3% / 87.3% / 87.3% / 86.4% / 83.9% / 57.0% / evidence 11
Llama-3.1-8B Safety 30.2 75.1% / 44.9% / evidence 2
BLIP Coco 30.2 69.9% / 39.7% / evidence 2
Llama-3.1-70B Hotpotqa 30.2 80.2% / 50.0% / evidence 2
Llama-3.1-70B Ruler 30.0 94.0% / 64.0% / evidence 2
WizardCoder HumanEval 30.0 87.3% / 57.3% / evidence 2
Llama-2 Csqa 29.6 35.6% / 6.0% / evidence 2
Llama-3.1-8B GSM8K 29.5 83.9% / 77.6% / 59.1% / 54.5% / evidence 4
Llama-3 WinoGrande 28.8 90.0% / 61.2% / evidence 2
Llama-3.1-8B Hard 27.7 33.6% / 5.9% / evidence 2
Gemini-2.0 GSM8K 27.6 95.6% / 68.0% / evidence 2
Gemma-2-2B MMLU 26.5 65.4% / 38.9% / 38.9% / evidence 3
GPT-4o Codereval 25.4 84.6% / 59.2% / evidence 2
CodeLLaMA MBPP 24.8 64.8% / 40.0% / evidence 2
LLaVA-Video-7B Videomme 24.7 90.0% / 65.3% / evidence 2
Qwen2 MATH 24.4 67.4% / 54.8% / 43.0% / evidence 3
GPT-3.5 HumanEval 24.3 86.8% / 65.2% / 62.5% / evidence 3
Llama-3.1-70B MATH 24.2 67.0% / 62.2% / 60.0% / 47.1% / 42.8% / evidence 5
Phi-3 MMLU-Pro 24.1 78.9% / 54.8% / evidence 2
LongAlpaca-13B Ruler 24.0 40.0% / 16.0% / evidence 2
Mistral-7B Rouge-L 24.0 57.0% / 33.0% / evidence 2
Qwen3 Olympiadbench 23.4 23.4% / 0.0% / evidence 2
LLaVA-1.5-7B Mme 23.4 76.0% / 52.6% / 52.6% / evidence 3
GPT-4 MBPP 22.8 39.6% / 16.8% / evidence 2
Qwen2.5 Medical 21.8 88.7% / 66.9% / evidence 2
Llama-3 AIME 20.0 100.0% / 80.0% / evidence 2
GPT-4o Longvideobench 19.6 66.7% / 47.1% / evidence 2
Llama-3.1-70B F1 19.3 80.3% / 61.0% / evidence 2
Qwen2.5 Bleu 19.2 23.8% / 23.8% / 4.6% / evidence 3
Llama-3 Boolq 19.1 90.0% / 70.9% / evidence 2
Qwen2.5 AIME 18.1 45.6% / 42.8% / 27.5% / evidence 3
Llama-3 Svamp 17.9 72.2% / 54.3% / evidence 2
GPT-4 HumanEval 17.8 87.3% / 87.2% / 80.0% / 78.5% / 76.9% / 69.5% / evidence 6
CLIP Mm-Bright 17.6 28.0% / 10.4% / evidence 2
Gemini-2.0 MATH 17.5 91.5% / 74.1% / evidence 2
Llama-2 Coin Flip 17.0 26.0% / 9.0% / evidence 2
Qwen3 Accuracy 17.0 94.0% / 77.0% / evidence 2
LLaMA Ppl 16.0 17.0% / 1.1% / evidence 2
Llama-2 Ruler 14.4 100.0% / 85.6% / evidence 2
Llama-3 MMLU-Pro 14.2 56.2% / 42.0% / evidence 2
XLM-R Xquad 14.2 16.3% / 2.1% / evidence 2
LLaVA-1.5-13B Pope 14.1 100.0% / 85.9% / evidence 2
GPT-4o Precision 13.9 58.0% / 44.1% / evidence 2
LLaVA-1.5-7B Gqa 13.8 54.9% / 41.1% / evidence 2
Llama-2 Accuracy 13.3 83.3% / 70.0% / evidence 2
Llama-3 Ifeval 12.9 57.8% / 44.9% / evidence 2
GLM-4 SWE-bench 11.9 47.6% / 35.7% / evidence 2
GPT-4 MMLU-Pro 11.3 81.0% / 69.7% / evidence 2
Gemini-1.5-Pro Egoschema 10.7 72.2% / 61.5% / evidence 2
LLaMA-7B C4 10.7 17.8% / 7.1% / evidence 2
LLaDA-MoE-7B-A1B MATH 10.4 55.0% / 44.6% / evidence 2
Qwen3 HellaSwag 10.4 52.0% / 41.6% / evidence 2
Baseline Imagenet-1K 9.7 78.7% / 69.0% / evidence 2
Gemma-2-9B MMLU 9.6 75.0% / 75.0% / 65.4% / evidence 3
LLaMA-13B C4 9.6 16.2% / 6.6% / evidence 2
LLAVA-NEXT-34B-NH Logicvista 9.4 29.9% / 20.6% / evidence 2
LoRA-FAIR Domainnet 8.6 77.1% / 68.5% / evidence 2
DeepSeek-R1 MMLU 8.3 90.8% / 82.5% / evidence 2
Llama-3 Obqa 8.1 90.0% / 81.9% / evidence 2
GPT-3.5 MMLU 7.7 67.3% / 59.6% / evidence 2
RoBERTa Mnli 7.5 83.5% / 76.0% / evidence 2
RoBERTa Cola 7.4 82.0% / 74.6% / evidence 2
GPT-4o AIME 7.3 9.3% / 2.0% / evidence 2
DeepSeek-R1 HumanEval 7.3 90.7% / 83.4% / evidence 2
LLaDA-MoE-7B-A1B GSM8K 7.0 65.8% / 58.8% / evidence 2
GPT-4 SWE-bench 6.8 78.8% / 72.0% / evidence 2
DeepSeek-33B HumanEval 6.7 78.6% / 71.9% / evidence 2
Llama-3 Accuracy 6.6 79.4% / 72.8% / evidence 2
Qwen3 LiveCodeBench 6.6 54.6% / 53.6% / 53.4% / 48.0% / evidence 4
Gemini-1.5-Pro MATH 6.5 65.0% / 62.2% / 58.5% / evidence 3
Llama-3.1-8B Rewardbench 6.2 75.0% / 68.8% / evidence 2
Llama-3.1-8B Ifeval 6.1 78.9% / 72.8% / evidence 2
Llama-3.1-70B Musique 6.0 50.0% / 44.0% / evidence 2
SmolLM-3B GSM8K 5.2 61.9% / 56.7% / evidence 2
Claude-3.5 SWE-bench 5.1 56.4% / 51.3% / evidence 2
Qwen3 MMLU 4.8 53.0% / 48.2% / evidence 2
Llama-3.1-70B Rewardbench 4.7 92.7% / 88.0% / evidence 2
Llama-3.1-8B HellaSwag 4.7 77.2% / 72.5% / evidence 2
Claude-3.5 HumanEval 4.4 89.0% / 84.6% / evidence 2
ZSJoint Multiatis++ 4.4 87.0% / 82.6% / evidence 2
Codegen-2.7b Cwe Detection 3.5 9.0% / 5.5% / evidence 2
Qwen2.5 SWE-bench 3.5 43.5% / 40.0% / evidence 2
Qwen3 AIME 3.3 3.3% / 0.0% / evidence 2
Gemma-2-9B MMLU-Pro 3.1 69.1% / 69.1% / 66.0% / evidence 3
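
The spread column in the table above can be reproduced with a short grouping pass. The following is a minimal sketch, not the tracker's actual implementation: the function name `find_discrepancies` and the `(model, benchmark, score)` tuple layout are assumptions made for illustration, and the sample values are copied from rows of the table (only two of the GPT-4 MMLU values are used, to show a pair that stays below the threshold).

```python
from collections import defaultdict

THRESHOLD_PP = 3.0  # minimum spread in percentage points, as stated on the page


def find_discrepancies(entries, threshold=THRESHOLD_PP):
    """Group score entries by (model, benchmark) and flag large spreads.

    `entries` is an iterable of (model, benchmark, score_percent) tuples;
    this input shape is an assumption made for the sketch.
    """
    grouped = defaultdict(list)
    for model, benchmark, score in entries:
        grouped[(model, benchmark)].append(score)

    rows = []
    for (model, benchmark), scores in grouped.items():
        if len(scores) < 2:
            continue  # a single report has nothing to disagree with
        spread = round(max(scores) - min(scores), 1)
        if spread >= threshold:
            rows.append((model, benchmark, spread, sorted(scores, reverse=True)))

    # Largest spread first, matching the ordering of the table.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


sample = [
    ("Qwen2.5", "SWE-bench", 43.5),
    ("Qwen2.5", "SWE-bench", 40.0),
    ("Llama-3", "Longbench", 100.0),
    ("Llama-3", "Longbench", 49.4),
    ("Llama-3", "Longbench", 5.8),
    ("GPT-4", "MMLU", 87.3),
    ("GPT-4", "MMLU", 86.4),  # 0.9 pp apart: below the threshold, not flagged
]

discrepancies = find_discrepancies(sample)
for model, benchmark, spread, scores in discrepancies:
    print(model, benchmark, spread, scores)
```

Rounding the spread to one decimal place keeps the output in the same precision as the table. Whether the tracker also de-duplicates scores that come from the same paper is not stated, so the sketch does not attempt it.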

Score Database

Showing the top 15 benchmarks by coverage (of 2741 total). The number in parentheses after each benchmark name is its count of score entries.
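
Ranking benchmarks by coverage can be sketched as a simple count of score entries per benchmark. This is an illustrative sketch only: it assumes that "coverage" means the number of score entries, and the function name `top_benchmarks_by_coverage` and the tuple layout are hypothetical. The sample rows are taken from the tables on this page.

```python
from collections import Counter


def top_benchmarks_by_coverage(entries, k=15):
    """Return the k benchmarks with the most score entries as (benchmark, count) pairs.

    `entries` is an iterable of (model, benchmark, score_percent) tuples;
    this input shape is an assumption made for the sketch.
    """
    counts = Counter(benchmark for _model, benchmark, _score in entries)
    # Sort by count (descending), then by name so that ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:k]


sample_entries = [
    ("Kimi-K2", "GSM8K", 97.9),
    ("GPT-5", "GSM8K", 97.4),
    ("DeepSeek-V3", "GSM8K", 97.1),
    ("o1", "Average", 100.0),
    ("GPT-4o", "Average", 100.0),
    ("Wu-et-al.", "Conll-2002-Es", 76.8),
]

ranking = top_benchmarks_by_coverage(sample_entries, k=2)
print(ranking)
```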

Average (36)

Model Score Source paper Year
o1 100.0% From Reward Shaping to Q-Shaping: Achieving Unbiased Learning with LLM-Guided Knowledge / evidence 2024
GPT-4o 100.0% From Reward Shaping to Q-Shaping: Achieving Unbiased Learning with LLM-Guided Knowledge / evidence 2024
DeepSeek-V2 100.0% From Reward Shaping to Q-Shaping: Achieving Unbiased Learning with LLM-Guided Knowledge / evidence 2024
yi-large 100.0% From Reward Shaping to Q-Shaping: Achieving Unbiased Learning with LLM-Guided Knowledge / evidence 2024
Discrete-Policy 84.0% Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation / evidence 2024
DeepSeek-R1 83.5% Quantitative Analysis of Performance Drop in DeepSeek Model Quantization / evidence 2025
Prompt-Adapter-F 81.5% Prompt Tuning based Adapter for Vision-Language Model Adaption / evidence 2023
Tip-Adapter-F 81.5% Prompt Tuning based Adapter for Vision-Language Model Adaption / evidence 2023
Unified-Prompt-Learning 81.4% Prompt Tuning based Adapter for Vision-Language Model Adaption / evidence 2023
Prompt-Adapter 78.4% Prompt Tuning based Adapter for Vision-Language Model Adaption / evidence 2023
Tip-Adapter 77.3% Prompt Tuning based Adapter for Vision-Language Model Adaption / evidence 2023
MathCoder 69.1% MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning / evidence 2023
OpenVLA 69.0% Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation / evidence 2024
Phi-3 61.0% Return of the Encoder: Maximizing Parameter Efficiency for SLMs / evidence 2025
MT-ACT 59.0% Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation / evidence 2024
Diffusion-Policy 58.0% Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation / evidence 2024
Lang-o3dp 57.0% VLM-TDP: VLM-guided Trajectory-conditioned Diffusion Policy for Robust Long-Horizon Manipulation / evidence 2025
Llama-3 56.6% Exploring Layer-wise Information Effectiveness for Post-Training Quantization in Small Language Models / evidence 2025
MiniGPT-v2 49.8% MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning / evidence 2023
MathCoder-L 46.9% MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning / evidence 2023
Ling-mini-beta 45.5% Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models / evidence 2025
Gemini 44.0% From Reward Shaping to Q-Shaping: Achieving Unbiased Learning with LLM-Guided Knowledge / evidence 2024
Dense-6.1B 44.0% Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models / evidence 2025
Qwen3 43.2% Exploring Layer-wise Information Effectiveness for Post-Training Quantization in Small Language Models / evidence 2025
Octo 43.0% Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation / evidence 2024
DINO 42.3% Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks / evidence 2026
FLAME-MoE-115M-459M 38.2% FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models / evidence 2025
BeT 37.0% Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation / evidence 2024
MDT 37.0% Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation / evidence 2024
FLAME-MoE-98M-349M 36.3% FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models / evidence 2025
FLAME-MoE-38M-100M 35.1% FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models / evidence 2025
BESO 33.0% Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation / evidence 2024
Llama-1-RFT 24.1% MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning / evidence 2023
Llama-3 22.2% SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning / evidence 2026
RT-1 22.0% Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation / evidence 2024
MathGPT-8B 19.1% Advancing Mathematical Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages / evidence 2025

Conll-2002-Es (37)

Model Score Source paper Year
Single-/Multi-Source-Cross-Lingual-NER-via-Teacher-Student-Learning-(Ours-sim) 78.0% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Single-/Multi-Source-Cross-Lingual-NER-via-Teacher-Student-Learning-(Ours-avg) 77.8% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Single-/Multi-Source-Cross-Lingual-NER-via-Teacher-Student-Learning 76.9% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Single-/Multi-Source-Cross-Lingual-NER-via-Teacher-Student-Learning 76.9% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Wu-et-al. 76.8% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Wu-et-al. 76.8% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Wu-et-al.-(2020) 76.8% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Moon-et-al. 75.7% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Moon-et-al. 75.7% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Moon-et-al.-(2019) 75.7% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Wu-and-Dredze 74.5% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Wu-and-Dredze 74.5% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Wu-and-Dredze-(2019) 74.5% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Chen-et-al. 73.5% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Chen-et-al. 73.5% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Chen-et-al.-(2019) 73.5% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Xie-et-al. 72.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Xie-et-al. 72.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Xie-et-al.-(2018) 72.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Rahimi-et-al. 71.8% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Rahimi-et-al. 71.8% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Rahimi-et-al.-(2019) 71.8% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Mayhew-et-al. 66.0% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Mayhew-et-al. 66.0% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Mayhew-et-al.-(2017) 66.0% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Ni-et-al. 65.1% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Ni-et-al. 65.1% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Ni-et-al.-(2017) 65.1% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Täckström 61.9% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Täckström 61.9% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Täckström-(2012) 61.9% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Tsai-et-al. 60.5% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Tsai-et-al. 60.5% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Tsai-et-al.-(2016) 60.5% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Täckström-et-al. 59.3% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Täckström-et-al. 59.3% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Täckström-et-al.-(2012) 59.3% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020

Conll-2002-Nl (37)

Model Score Source paper Year
Single-/Multi-Source-Cross-Lingual-NER-via-Teacher-Student-Learning-(Ours-sim) 81.3% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Single-/Multi-Source-Cross-Lingual-NER-via-Teacher-Student-Learning 80.9% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Single-/Multi-Source-Cross-Lingual-NER-via-Teacher-Student-Learning 80.9% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Single-/Multi-Source-Cross-Lingual-NER-via-Teacher-Student-Learning-(Ours-avg) 80.7% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Wu-et-al. 80.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Wu-et-al. 80.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Wu-et-al.-(2020) 80.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Moon-et-al. 80.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Moon-et-al. 80.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Moon-et-al.-(2019) 80.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Wu-and-Dredze 79.5% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Wu-and-Dredze 79.5% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Wu-and-Dredze-(2019) 79.5% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Chen-et-al. 72.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Chen-et-al. 72.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Chen-et-al.-(2019) 72.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Xie-et-al. 71.2% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Xie-et-al. 71.2% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Xie-et-al.-(2018) 71.2% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Rahimi-et-al. 67.6% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Rahimi-et-al. 67.6% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Rahimi-et-al.-(2019) 67.6% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Mayhew-et-al. 66.5% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Mayhew-et-al. 66.5% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Mayhew-et-al.-(2017) 66.5% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Ni-et-al. 65.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Ni-et-al. 65.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Ni-et-al.-(2017) 65.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Tsai-et-al. 61.6% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Tsai-et-al. 61.6% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Tsai-et-al.-(2016) 61.6% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Täckström 59.9% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Täckström 59.9% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Täckström-(2012) 59.9% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Täckström-et-al. 58.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Täckström-et-al. 58.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Täckström-et-al.-(2012) 58.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020

Conll-2003-De (37)

Model Score Source paper Year
Single-/Multi-Source-Cross-Lingual-NER-via-Teacher-Student-Learning-(Ours-sim) 75.3% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Single-/Multi-Source-Cross-Lingual-NER-via-Teacher-Student-Learning-(Ours-avg) 75.0% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Single-/Multi-Source-Cross-Lingual-NER-via-Teacher-Student-Learning 73.2% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Single-/Multi-Source-Cross-Lingual-NER-via-Teacher-Student-Learning 73.2% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Wu-et-al. 73.2% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Wu-et-al. 73.2% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Wu-et-al.-(2020) 73.2% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Moon-et-al. 71.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Moon-et-al. 71.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Moon-et-al.-(2019) 71.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Wu-and-Dredze 71.1% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Wu-and-Dredze 71.1% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Wu-and-Dredze-(2019) 71.1% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Mayhew-et-al. 59.1% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Mayhew-et-al. 59.1% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Mayhew-et-al.-(2017) 59.1% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Rahimi-et-al. 59.1% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Rahimi-et-al. 59.1% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Rahimi-et-al.-(2019) 59.1% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Ni-et-al. 58.5% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Ni-et-al. 58.5% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Ni-et-al.-(2017) 58.5% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Xie-et-al. 57.8% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Xie-et-al. 57.8% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Xie-et-al.-(2018) 57.8% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Chen-et-al. 56.0% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Chen-et-al. 56.0% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Chen-et-al.-(2019) 56.0% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Tsai-et-al. 48.1% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Tsai-et-al. 48.1% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Tsai-et-al.-(2016) 48.1% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Täckström-et-al. 40.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Täckström-et-al. 40.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Täckström-et-al.-(2012) 40.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Täckström 36.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Täckström 36.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020
Täckström-(2012) 36.4% Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language / evidence 2020

GSM8K (145)

Model Score Source paper Year
Kimi-K2 97.9% Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks / evidence 2026
GPT-5 97.4% Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance? / evidence 2026
DeepSeek-V3 97.1% Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks / evidence 2026
Claude-3.5-Sonnet 96.4% AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling / evidence 2024
LLaMA-Adapter+MRP 96.1% LLaMA-Adapter + MRP: Integrating Meta-Reasoning Prompting with LLaMA-Adapter for Efficient Multi-Modal and Task-Adaptive Reasoning / evidence 2025
GPT-4o 95.9% AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling / evidence 2024
Llama-3 95.8% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Gemini-2.0 95.6% Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning / evidence 2026
Llama-3 95.4% Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks / evidence 2026
Gemini-2.5-Pro 95.2% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
GPT-4 95.0% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Qwen2.5 95.0% Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models / evidence 2025
GPT-4 94.9% Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance? / evidence 2026
GPT-4o 94.3% STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing / evidence 2024
Llama-3.1-405B 93.9% AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling / evidence 2024
GPT-4o 93.7% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
GPT-4 92.7% Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition / evidence 2024
GPT-3.5 92.0% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Gemini-1.5-Pro 91.7% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Llama-3.1-70B 91.6% STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing / evidence 2024
Qwen2.5 91.5% Qwen2.5 Technical Report / evidence 2024
GPT-4o 91.4% Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning / evidence 2026
Claude-3-Opus 90.8% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Qwen2 89.5% Qwen2 Technical Report / evidence 2024
SST 89.0% State Stream Transformer (SST) : Emergent Metacognitive Behaviours Through Latent State Persistence / evidence 2025
Qwen3 88.5% DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs / evidence 2026
Qwen3 86.5% Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks / evidence 2026
GPT-4 86.0% WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct / evidence 2023
Gemini-1.5-Pro 85.0% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
GPT-4 84.2% Vendi-RAG: Adaptively Trading-Off Diversity And Quality Significantly Improves Retrieval Augmented Generation With LLMs / evidence 2025
Llama-3.1-8B 83.9% Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling / evidence 2026
Gemini-1.5-Flash 83.0% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
GPT-3.5 80.1% WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct / evidence 2023
GPT-4 80.0% Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems / evidence 2024
Qwen3 79.9% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
MathCoder 79.9% MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning / evidence 2023
Gemini-Pro 79.3% WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct / evidence 2023
Phi-4 78.9% Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models / evidence 2025
Claude-2 78.2% WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct / evidence 2023
Llama-3.1-8B 77.6% Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct / evidence 2026
OpenChat-3.5 77.3% Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition / evidence 2024
Dream 77.0% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
GPT-3.5 74.9% Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition / evidence 2024
Llama-3 74.8% Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling / evidence 2026
Phi-3 74.5% SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models / evidence 2024
Phi-3 73.5% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
Mixtral 72.4% STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing / evidence 2024
ChatGLM3-6B 72.3% Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition / evidence 2024
CodeT5+ 72.2% CodeT5+: Open Code Large Language Models for Code Understanding and Generation / evidence 2023
Llama-2 72.0% Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems / evidence 2024
Llama-2 70.6% SALS: Sparse Attention in Latent Space for KV cache Compression / evidence 2025
Qwen2.5 69.9% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
LLaDA 69.8% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
Llama-2 68.4% Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models / evidence 2023
Phi-2 68.3% SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models / evidence 2024
Gemini-2.0 68.0% CROP: Token-Efficient Reasoning in Large Language Models via Regularized Prompt Optimization / evidence 2026
Llama-2 66.6% Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations / evidence 2023
LLaDA-MoE-7B-A1B 65.8% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
MathCoder-L 64.2% MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning / evidence 2023
Gemini-3.1-Pro 64.0% CROP: Token-Efficient Reasoning in Large Language Models via Regularized Prompt Optimization / evidence 2026
Llama-3 62.5% DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs / evidence 2026
Llama-3B 62.1% Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks / evidence 2026
SmolLM-3B 61.9% Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks / evidence 2026
LLaDA-8B 61.6% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Llama-3 60.7% SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning / evidence 2026
Qwen-14B 60.1% Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition / evidence 2024
GPT-3.5 60.0% Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems / evidence 2024
Llama-3.1-8B 59.1% Preventing Rank Collapse in Federated Low-Rank Adaptation with Client Heterogeneity / evidence 2026
LLaDA-MoE-7B-A1B 58.8% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Gemini-1.5-Pro 58.5% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
SmolLM-3B 56.7% Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks / evidence 2026
GPT-4 56.2% Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations / evidence 2023
GSA-1.7B 56.0% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
BaseRL-7B 55.8% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Gemma-2-2B 55.5% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
BaseRL-32B 55.3% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Gated-Only-1.7B 55.3% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Gated-only-1.7B 55.3% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Qwen2 55.0% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Llama-3.1-8B 54.5% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
Llama-3 54.3% Understanding Reasoning in Chain-of-Thought from the Hopfieldian View / evidence 2024
Phi-2-2.7B 53.4% SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models / evidence 2024
Llama-2 53.3% SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models / evidence 2024
Mistral-7B 53.2% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
Sparse-Only-1.7B 53.2% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Sparse-only-1.7B 53.2% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Standard-1.7B 52.9% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Llama-2 52.1% Understanding Reasoning in Chain-of-Thought from the Hopfieldian View / evidence 2024
Gemma-2-it-9B 51.0% Building Math Agents with Multi-Turn Iterative Preference Learning / evidence 2024
SwS-32B 50.0% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Qwen2.5 47.5% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Llama-1-RFT 46.5% MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning / evidence 2023
CodeGemma-1.1-it-7B 46.4% Building Math Agents with Multi-Turn Iterative Preference Learning / evidence 2024
Gemma-1.1-it-7B 46.1% Building Math Agents with Multi-Turn Iterative Preference Learning / evidence 2024
Dream-v0-Instruct-7B 43.3% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
LLaDA-1.5 42.8% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Mistral-7B-v0.3 42.7% Building Math Agents with Multi-Turn Iterative Preference Learning / evidence 2024
Llama-2 42.5% STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing / evidence 2024
Llama-4-17B-128E 42.3% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
L17B-128E 42.3% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
GPT-4o 41.9% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
Llama-3 41.7% Preventing Rank Collapse in Federated Low-Rank Adaptation with Client Heterogeneity / evidence 2026
GPT-5 41.5% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
Qwen2.5 40.0% TokenSkip: Controllable Chain-of-Thought Compression in LLMs / evidence 2025
GPT-4 40.0% MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation / evidence 2023
Llama-2 40.0% Making Large Language Models Better Reasoners with Alignment / evidence 2023
Falcon-H1-0.5B 39.8% Where Should LoRA Go? Component-Type Placement in Hybrid Language Models / evidence 2026
Qwen2.5 37.7% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
Llama-4-17B-16E 36.8% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
InternVL3.5-241B-A28B 36.8% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
InternVL3.5-38B 36.0% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
PRIME-RL-7B 35.7% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
planning-LM 35.4% Interpretable Math Word Problem Solution Generation Via Step-by-step Planning / evidence 2023
Llama-2 35.0% Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents / evidence 2024
SwS-7B 35.0% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Open-Reasoner-32B 34.9% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
GPT-3-175B-Finetuned 33.0% Large Language Models are Zero-Shot Reasoners / evidence 2022
Oat-Zero-7B 31.4% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
GPT-4 31.1% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
QVQ-Max-Latest 30.5% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
SimpleRL-Base-7B 30.5% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Llama-3.1-70B 30.1% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
SimpleRL-Base-32B 29.8% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Qwen3 29.7% Where Should LoRA Go? Component-Type Placement in Hybrid Language Models / evidence 2026
InternVL3.5-30B-A3B 29.4% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
Llama-2 28.8% TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models / evidence 2023
MathGPT-8B 28.2% Advancing Mathematical Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages / evidence 2025
Open-Reasoner-7B 27.6% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Qwen2.5 24.3% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
DeepSeek-V2 20.0% MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation / evidence 2023
Claude-3-Sonnet 20.0% MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation / evidence 2023
SwS-3B 19.9% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
QWEN-0.5B 19.5% LLM Compression: How Far Can We Go in Balancing Size and Performance? / evidence 2025
BaseRL-3B 18.8% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
InternVL3.5-8B 16.9% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
Vicuna-13B 11.3% Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition / evidence 2024
text-davinci-002 10.4% Large Language Models are Zero-Shot Reasoners / evidence 2022
text-davinci-002 10.4% Large Language Models are Zero-Shot Reasoners / evidence 2022
MiMo-v2-Flash 8.0% Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks / evidence 2026
LLaMA-1B 1.2% LLM Compression: How Far Can We Go in Balancing Size and Performance? / evidence 2025
Claude-3-Sonnet 0.0% MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation / evidence 2023
Qwen-v1.5-1.8B 0.0% MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation / evidence 2023
Llama-3 0.0% MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation / evidence 2023
MAmmoTH-70B 0.0% MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation / evidence 2023
Claude-3 0.0% MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation / evidence 2023

HellaSwag (37)

Model Score Source paper Year
GPT-2 100.0% MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization / evidence 2026
LLaMA-350M 100.0% MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization / evidence 2026
Llama-3.1-70B 94.0% "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization / evidence 2024
GLM-4.5 88.9% GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models / evidence 2025
Qwen2.5 87.6% Qwen2.5 Technical Report / evidence 2024
FlexLoRA 79.4% Federated Sketching LoRA: A Flexible Framework for Heterogeneous Collaborative Fine-Tuning of LLMs / evidence 2025
Llama-3.1-8B 77.2% Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling / evidence 2026
FLoRA 77.2% Federated Sketching LoRA: A Flexible Framework for Heterogeneous Collaborative Fine-Tuning of LLMs / evidence 2025
RAVAN 76.7% Federated Sketching LoRA: A Flexible Framework for Heterogeneous Collaborative Fine-Tuning of LLMs / evidence 2025
GSA-1.7B 74.9% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Gated-Only-1.7B 74.6% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Gated-only-1.7B 74.6% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Sparse-Only-1.7B 73.3% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Sparse-only-1.7B 73.3% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Standard-1.7B 73.1% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
FSLoRA 73.0% Federated Sketching LoRA: A Flexible Framework for Heterogeneous Collaborative Fine-Tuning of LLMs / evidence 2025
HeteroLoRA 73.0% Federated Sketching LoRA: A Flexible Framework for Heterogeneous Collaborative Fine-Tuning of LLMs / evidence 2025
Llama-3.1-8B 72.5% Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct / evidence 2026
Llama-3 72.5% Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling / evidence 2026
RoBERTa 70.0% Federated Sketching LoRA: A Flexible Framework for Heterogeneous Collaborative Fine-Tuning of LLMs / evidence 2025
Ling-mini-beta 66.6% Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models / evidence 2025
Dense-6.1B 65.6% Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models / evidence 2025
Phi-3 59.0% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
Gemma-2-2B 52.0% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
Qwen3 52.0% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
Qwen3 41.6% Where Should LoRA Go? Component-Type Placement in Hybrid Language Models / evidence 2026
Falcon-H1-0.5B 39.8% Where Should LoRA Go? Component-Type Placement in Hybrid Language Models / evidence 2026
Qwen2.5 39.0% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
UB-SMoE-OLMo-1B 35.4% UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models / evidence 2026
T0-3B 27.8% Prompt Consistency for Zero-Shot Task Generalization / evidence 2022
FLAME-MoE-115M-459M 27.7% FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models / evidence 2025
FLAME-MoE-98M-349M 26.3% FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models / evidence 2025
FLAME-MoE-38M-100M 25.9% FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models / evidence 2025
FLoRA-OLMo-1B 19.1% UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models / evidence 2026
FlexLoRA-OLMo-1B 13.6% UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models / evidence 2026
FLoRIST-OLMo-1B 12.9% UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models / evidence 2026
HetLoRA-OLMo-1B 11.0% UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models / evidence 2026

HumanEval (156)

Model Score Source paper Year
Code-Llama-7B 100.0% Code Llama: Open Foundation Models for Code / evidence 2023
Code-Llama-13B 100.0% Code Llama: Open Foundation Models for Code / evidence 2023
Code-Llama-34B 100.0% Code Llama: Open Foundation Models for Code / evidence 2023
Seed-Coder-8B 98.2% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
ChatGPT+CodeT+SCoT 98.0% Structured Chain-of-Thought Prompting for Code Generation / evidence 2023
Claude-3.7-Sonnet 97.8% SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation / evidence 2025
DeepSeek-Coder 97.6% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
Qwen2.5 96.3% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
o1 96.2% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
o1 96.2% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
Qwen3 95.1% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
GPT-5.1 95.1% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
Llama-3.1-70B 95.0% "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization / evidence 2024
ReflexiCoder-8B 94.5% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
WizardCoder-CodeLlama 92.1% Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation / evidence 2023
Phind-CodeLlama 91.7% Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation / evidence 2023
DeepSeek-R1 90.7% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
LeDex-RL-13B 90.0% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
Claude-3.5 89.0% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
GPT-4 87.3% Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation / evidence 2023
WizardCoder 87.3% WizardCoder: Empowering Code Large Language Models with Evol-Instruct / evidence 2023
GPT-4 87.2% NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts / evidence 2024
GPT-3.5 86.8% Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models / evidence 2025
GPT-4o 86.7% Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models / evidence 2025
GPT-4o 86.2% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
GPT-4o 86.2% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
Starcoder-3B 85.7% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
ChatGPT 85.2% Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation / evidence 2023
Claude-3.5 84.6% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
ARCS-Large 83.5% ARCS: Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement / evidence 2025
ARCS-Full 83.5% ARCS: Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement / evidence 2025
DeepSeek-R1 83.4% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
Qwen2.5 82.2% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
Llama-3 81.7% NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts / evidence 2024
GLM-4 81.4% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
WizardCoder-34B 81.2% WizardCoder: Empowering Code Large Language Models with Evol-Instruct / evidence 2023
Claude-3-Opus 80.5% NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts / evidence 2024
GPT-4 80.0% Reflexion: Language Agents with Verbal Reinforcement Learning / evidence 2023
ARCS-Small 79.9% ARCS: Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement / evidence 2025
Qwen2.5 79.6% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
DeepSeek-Coder 79.3% NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts / evidence 2024
GLM-4 79.2% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
ARCS-CoT-Feedback 79.1% ARCS: Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement / evidence 2025
DeepSeek-33B 78.6% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
GPT-4 78.5% HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization / evidence 2024
ARCS-Retrieval-CoT 78.4% ARCS: Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement / evidence 2025
WizardCoder-15B 78.1% WizardCoder: Empowering Code Large Language Models with Evol-Instruct / evidence 2023
ARCS-Retrieval-Feedback 77.3% ARCS: Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement / evidence 2025
LLaMA-3B 76.9% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
GPT-4 76.9% Humanity's Last Code Exam: Can Advanced LLMs Conquer Human's Hardest Code Competition? / evidence 2025
ARCS-Medium 76.8% ARCS: Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement / evidence 2025
ARCS-CoT 76.1% ARCS: Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement / evidence 2025
ARCS-Retrieval 75.2% ARCS: Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement / evidence 2025
Gemini-1.5-Pro 75.0% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
ARCS-Feedback 74.8% ARCS: Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement / evidence 2025
Gemini-1.5-Flash 73.0% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Llama-3.1-8B 72.6% SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation / evidence 2025
ARCS-Baseline 72.6% ARCS: Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement / evidence 2025
DeepSeek-33B 71.9% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
Llama-3.1-8B 70.1% Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling / evidence 2026
GPT-4 69.5% Prompt Engineering or Fine-Tuning: An Empirical Assessment of LLMs for Code / evidence 2023
Code-Llama 67.0% Code Llama: Open Foundation Models for Code / evidence 2023
CodeLlama-7B 67.0% Code Llama: Open Foundation Models for Code / evidence 2023
CodeLlama-13B 67.0% Code Llama: Open Foundation Models for Code / evidence 2023
Code-Llama-Python-7B 67.0% Code Llama: Open Foundation Models for Code / evidence 2023
Code-Llama-Python 67.0% Code Llama: Open Foundation Models for Code / evidence 2023
GPT-3.5 65.2% NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts / evidence 2024
Qwen2 64.6% Qwen2 Technical Report / evidence 2024
CodeT5+-2B 63.0% HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization / evidence 2024
GPT-3.5 62.5% HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization / evidence 2024
WaveCoder-Ultra-6.7B 61.4% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
CodeGeeX-13B-FT 61.2% CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X / evidence 2023
Llama-3 61.0% Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling / evidence 2026
CodeGeeX-13B 60.9% CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X / evidence 2023
DeepSeek-Coder 59.6% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
Qwen2.5 59.6% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
Qwen2.5 59.1% Qwen2.5 Technical Report / evidence 2024
DeepSeek-V3 59.0% Reward Shaping to Mitigate Reward Hacking in RLHF / evidence 2025
Yi-Coder-9B 57.9% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
WizardCoder 57.3% Prompt Engineering or Fine-Tuning: An Empirical Assessment of LLMs for Code / evidence 2023
Phi-2-EpiCoder-func-380k 56.7% Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks / evidence 2025
Phi-2-EpiCoder-func-380k-5k 56.7% Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks / evidence 2025
OpenCoder-8B 56.1% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
OpenCoder-1-8B 56.1% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
Claude-3.5-Sonnet 55.3% HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks / evidence 2024
Phi1-1.3B 52.0% Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / evidence 2025
ChatGPT+CodeT 52.0% Structured Chain-of-Thought Prompting for Code Generation / evidence 2023
Magicoder-S-DS-6.7B 50.9% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
CodeRM-8B 50.0% Dynamic Scaling of Unit Tests for Code Reward Modeling / evidence 2025
DeepSeek-Coder 49.4% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
DeepSeek-Coder 49.0% Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / evidence 2025
OpenCodeInterpreter-1.3B 49.0% Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / evidence 2025
Gemini-1.5-Pro 48.0% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
phi-2-2.7B 45.7% Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks / evidence 2025
Phi-2 45.7% Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks / evidence 2025
phi-2 45.7% Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks / evidence 2025
Yi-Coder-1.5B 45.0% Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / evidence 2025
PolyCoder-2.7B 45.0% Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / evidence 2025
LLaDA-MoE-7B-A1B 44.6% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
LLaDA-MoE-7B-A1B 43.3% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Qwen2.5 41.0% Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / evidence 2025
CodeGemma-2.0B 41.0% Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / evidence 2025
StarCoder 40.0% StarCoder: may the source be with you! / evidence 2023
CodeGemma 40.0% Evaluating Quantized Large Language Models for Code Generation on Low-Resource Language Benchmarks / evidence 2024
CodeQwen 40.0% Evaluating Quantized Large Language Models for Code Generation on Low-Resource Language Benchmarks / evidence 2024
StarCoder 40.0% Evaluating Quantized Large Language Models for Code Generation on Low-Resource Language Benchmarks / evidence 2024
CodeLlama 40.0% Evaluating Quantized Large Language Models for Code Generation on Low-Resource Language Benchmarks / evidence 2024
LLaDA-8B 38.4% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Claude-3.5-Sonnet 36.8% HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks / evidence 2024
LLaDA-1.5 35.3% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
CodeT5+ 35.2% CodeT5+: Open Code Large Language Models for Code Understanding and Generation / evidence 2023
Dream 34.1% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
Dream-v0-Instruct-7B 33.7% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Qwen2.5 32.6% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
LLaDA 31.7% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
CodeGen-16B 31.7% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
CodeGen-16B-mono 30.5% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
GSA-1.7B 30.5% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Sparse-Only-1.7B 29.3% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Gated-Only-1.7B 29.3% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Sparse-only-1.7B 29.3% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Gated-only-1.7B 29.3% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
codellama-7b 28.7% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
CodeLlama-7B 28.7% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
Standard-1.7B 28.7% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
codellama-7b-hf-float16 28.6% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
GPT-4o 27.7% HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks / evidence 2024
CodeGen-6B-mono 26.2% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
CodeGen-2B-mono 23.2% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
Gemini-1.5-Pro 22.9% HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks / evidence 2024
Cycle-2.7B 21.9% CYCLE: Learning to Self-Refine the Code Generation / evidence 2024
Pixtral-124B 21.3% HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks / evidence 2024
Pixtral-12B 21.3% HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks / evidence 2024
CodeGen2-16B 20.8% HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization / evidence 2024
starcoderbase-3b 20.2% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
StarCoderBase-3B 20.2% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
CodeGen-6B-multi 19.5% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
CodeGen-16B-multi 19.5% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
Llama-3 18.4% Dynamic Scaling of Unit Tests for Code Reward Modeling / evidence 2025
Gemini-1.5-Flash 17.4% HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks / evidence 2024
Cycle-1B 15.9% CYCLE: Learning to Self-Refine the Code Generation / evidence 2024
CodeGen2-3.7B 15.4% HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization / evidence 2024
starcoderbase-1b 15.3% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
StarCoderBase-1B 15.3% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
GPT-J-6B 15.2% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
codegen-2b 14.3% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
CodeGen-2B 14.3% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
CodeGen-2B-multi 14.0% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
InCoder-6B 10.4% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
InCoder-1B 10.4% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
InCoder-1.3B 9.0% Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / evidence 2025
ChatGPT 8.0% Structured Chain-of-Thought Prompting for Code Generation / evidence 2023
GPT-4o 3.4% Dynamic Scaling of Unit Tests for Code Reward Modeling / evidence 2025
PolyCoder-0.4B 3.0% Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / evidence 2025
LLaMA-1B 1.3% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
LLaMA-8B 1.1% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025

Logicvista (34)

Model Score Source paper Year
otter-9B 31.8% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
otter9B 31.8% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVA-7B 29.9% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLaVA-7B 29.9% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLaVA-NeXT-34B-NH 29.9% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVA7B 29.9% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVA-NEXT-7B-vicuna 26.2% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVANEXT-7B-vicuna 26.2% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLaVA-NeXT-7B-vicuna 26.2% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
GPT-4 23.4% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
instructBLIP-flan-t5-xl 23.4% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVA-NEXT-13B-vicuna 22.4% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVANEXT-13B-vicuna 22.4% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLaVA-NeXT-13B-vicuna 22.4% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVANEXT-13B 22.4% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVA-NEXT-34B-NH 20.6% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVANEXT-34B-NH 20.6% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVA-13B 18.7% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLaVA-13B 18.7% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVA13B 18.7% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
BLIP-2 17.8% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
instructBLIP-flan-t5-xxl 17.8% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
BLIP2 17.8% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVA-NEXT-7B-Mistral 16.8% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVANEXT-7B-mistral 16.8% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLaVA-NeXT-7B-mistral 16.8% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
miniGPT-vicuna-13B 13.1% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
miniGPTvicuna13B 13.1% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
miniGPTvicuna-13B 13.1% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
miniGPT-vicuna-7B 10.3% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
miniGPTvicuna7B 10.3% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
miniGPTvicuna-7B 10.3% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
instructBLIP-vicuna-7B 4.7% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
instructBLIP-vicuna-13B 3.7% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024

MATH (143)

Model Score Source paper Year
Qwen-Max 98.6% MARIO Eval: Evaluate Your Math LLM with your Math LLM--A mathematical dataset evaluation toolkit / evidence 2024
DeepSeek-R1 97.3% Quantitative Analysis of Performance Drop in DeepSeek Model Quantization / evidence 2025
Kimi-k1.5 96.2% Kimi k1.5: Scaling Reinforcement Learning with LLMs / evidence 2025
QwQ-32B-Preview 95.0% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
GPT-4o 95.0% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
Kimi-k1.5-Short-CoT 94.6% Kimi k1.5: Scaling Reinforcement Learning with LLMs / evidence 2025
GPT-3.5 93.7% MARIO Eval: Evaluate Your Math LLM with your Math LLM--A mathematical dataset evaluation toolkit / evidence 2024
Gemini-2.0 91.5% Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning / evidence 2026
Claude-3.5-Sonnet 89.8% AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling / evidence 2024
Mistral-7B 89.2% GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors / evidence 2025
GPT-4o 88.7% STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing / evidence 2024
Llama-2 88.3% GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors / evidence 2025
CodeT5+ 87.4% CodeT5+: Open Code Large Language Models for Code Understanding and Generation / evidence 2023
Llama-3 87.0% GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors / evidence 2025
GPT-4o 85.9% AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling / evidence 2024
Llama-3.1-405B 84.9% AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling / evidence 2024
Gemma-2-it-9B 84.1% Building Math Agents with Multi-Turn Iterative Preference Learning / evidence 2024
Llama-3.1-8B 82.9% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
Qwen2.5 81.0% LANPO: Bootstrapping Language and Numerical Feedback for Reinforcement Learning in LLMs / evidence 2025
DeepSeek-R1 79.8% LLMOrbit: A Circular Taxonomy of Large Language Models - From Scaling Walls to Agentic AI Systems / evidence 2026
Mistral-7B-v0.3 77.8% Building Math Agents with Multi-Turn Iterative Preference Learning / evidence 2024
Gemma-1.1-it-7B 77.5% Building Math Agents with Multi-Turn Iterative Preference Learning / evidence 2024
CodeGemma-1.1-it-7B 77.3% Building Math Agents with Multi-Turn Iterative Preference Learning / evidence 2024
GPT-4o 76.7% Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification / evidence 2025
Llama-3 76.6% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Qwen2.5 75.0% Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models / evidence 2025
Qwen3 75.0% DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs / evidence 2026
Phi-4 74.1% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
Gemini-2.0 74.1% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
GPT-4o 73.4% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
GPT-4o 72.7% Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning / evidence 2026
DeepSeek-V3 72.2% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
DeepSeek-R1 72.2% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
Llama-3 72.2% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
GPT-3.5 72.2% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
SmolLM-3B 72.0% Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks / evidence 2026
o3 71.3% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
o1 71.3% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
Qwen2.5 71.3% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
GPT-4 69.7% Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification / evidence 2023
MINT-CoT-7B 69.6% MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning / evidence 2025
LLaVA-OV-1.5-RL 69.4% LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training / evidence 2025
Claude-3-Opus 67.7% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Qwen2 67.4% MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning / evidence 2025
Llama-3.1-70B 67.0% "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization / evidence 2024
Gemini-1.5-Pro 65.0% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Qwen-VL-Max-(3-Shot) 64.9% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
GLM-4.5 64.0% GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models / evidence 2025
Gemini-1.5-Flash 63.0% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Gemini-1.5-Pro 62.2% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Llama-3.1-70B 62.2% Assessing Robustness to Spurious Correlations in Post-Training Language Models / evidence 2025
Qwen2.5 62.1% Qwen2.5 Technical Report / evidence 2024
LLaVA-OV-1.5 61.5% LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training / evidence 2025
GPT-4 60.1% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Llama-3.1-70B 60.0% Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification / evidence 2025
Gemini-1.5-Pro 58.5% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Skywork-RLHFlow 55.9% Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification / evidence 2025
BaseRL-7B 55.8% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
BaseRL-32B 55.3% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
LLaDA-MoE-7B-A1B 55.0% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Qwen2 54.8% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Llama-3 52.8% Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling / evidence 2026
LLaDA-8B 52.4% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Llama-3.1-8B 52.2% Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling / evidence 2026
Llama-3.1-8B 51.0% Assessing Robustness to Spurious Correlations in Post-Training Language Models / evidence 2025
SwS-32B 50.0% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Qwen-VL-Max 49.9% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
Skywork 48.9% Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification / evidence 2025
Llama-3 48.9% Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification / evidence 2025
Llama-3 48.2% Assessing Robustness to Spurious Correlations in Post-Training Language Models / evidence 2025
GPT-4 48.1% Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations / evidence 2023
Llama-3.1-70B 47.1% STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing / evidence 2024
Qwen2.5 46.4% MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning / evidence 2025
MathCoder 45.9% MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning / evidence 2023
Qwen2.5 45.7% MathMixup: Boosting LLM Mathematical Reasoning with Difficulty-Controllable Data Synthesis and Curriculum Learning / evidence 2026
LLaDA-MoE-7B-A1B 44.6% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
DeepSeek-Coder 44.4% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Qwen2.5 44.1% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Qwen2 43.0% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
Llama-3.1-70B 42.8% Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus / evidence 2024
GPT-3.5 42.5% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
GPT-4 42.5% Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance / evidence 2023
Gemini 41.9% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
Gemini-(3-Shot) 41.6% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
Llama-3 39.3% DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs / evidence 2026
Dream 38.7% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
LLaDA-1.5 37.2% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Dream-v0-Instruct-7B 37.0% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
PRIME-RL-7B 35.7% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
SwS-7B 35.0% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Open-Reasoner-32B 34.9% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Qwen2.5 34.8% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Llama-3 34.1% SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning / evidence 2026
GPT-4 33.7% WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct / evidence 2023
Mixtral 32.6% STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing / evidence 2024
Phi-2 32.6% SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models / evidence 2024
Oat-Zero-7B 31.4% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
GPT-5 30.5% Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance? / evidence 2026
SimpleRL-Base-7B 30.5% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
LLaDA 30.2% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
DeepSeek-R1 30.0% Token-Hungry, Yet Precise: DeepSeek R1 Highlights the Need for Multi-Step Reasoning Over Speed in MATH / evidence 2025
Llama-3.1-8B 30.0% TokenSkip: Controllable Chain-of-Thought Compression in LLMs / evidence 2025
GPT-4 29.8% Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance? / evidence 2026
SimpleRL-Base-32B 29.8% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
LLaVA-v1.6-mistral-(3-Shot) 29.4% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
Llama-2 29.1% STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing / evidence 2024
GPT-4o 29.0% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
Llama-3 29.0% Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks / evidence 2026
Llama-2 28.9% SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models / evidence 2024
GPT-3.5 28.9% WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct / evidence 2023
Open-Reasoner-7B 27.6% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Claude-2 27.1% WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct / evidence 2023
Qwen2.5 27.0% s1: Simple test-time scaling / evidence 2025
s1-32B 27.0% s1: Simple test-time scaling / evidence 2025
Gemini-Pro 26.8% WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct / evidence 2023
Phi-3 26.5% SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models / evidence 2024
Llama-3.1-8B 25.3% MathMixup: Boosting LLM Mathematical Reasoning with Difficulty-Controllable Data Synthesis and Curriculum Learning / evidence 2026
InternLM-VL-(3-Shot) 25.1% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
MathCoder-L 23.3% MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning / evidence 2023
InternLM2.5-7B 22.9% MathMixup: Boosting LLM Mathematical Reasoning with Difficulty-Controllable Data Synthesis and Curriculum Learning / evidence 2026
LLaVA-v1.5-(3-Shot) 21.1% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
InternLM-VL 19.8% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
Llama-2 19.2% Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations / evidence 2023
Gemma-3 19.0% Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful / evidence 2025
LLaVA-1.5-13B 18.9% AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting / evidence 2024
LLaVA-v1.5 18.1% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
LLaVA-v1.6-mistral 16.8% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
Phi-2-2.7B 16.1% SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models / evidence 2024
Llama-2 14.9% DONOD: Efficient and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning / evidence 2025
MathGPT-8B 9.5% Advancing Mathematical Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages / evidence 2025
GPT-2-1.5B 8.3% Measuring Mathematical Problem Solving With the MATH Dataset / evidence 2021
GPT-3-175B 7.7% Measuring Mathematical Problem Solving With the MATH Dataset / evidence 2021
GPT-3-175B* 7.7% Measuring Mathematical Problem Solving With the MATH Dataset / evidence 2021
GPT-2-0.7B 6.9% Measuring Mathematical Problem Solving With the MATH Dataset / evidence 2021
GPT-3-13B-FineTuned 6.8% Measuring Mathematical Problem Solving With the MATH Dataset / evidence 2021
Llama-1-RFT 6.7% MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning / evidence 2023
GPT-2-0.3B 6.7% Measuring Mathematical Problem Solving With the MATH Dataset / evidence 2021
GPT-2-0.1B 5.2% Measuring Mathematical Problem Solving With the MATH Dataset / evidence 2021
GPT-3-13B 4.1% Measuring Mathematical Problem Solving With the MATH Dataset / evidence 2021
GPT-3-13B* 4.1% Measuring Mathematical Problem Solving With the MATH Dataset / evidence 2021
SwS-3B 2.2% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
BaseRL-3B 0.0% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Qwen3 0.0% SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning / evidence 2025

MBPP (86)

Model Score Source paper Year
Code-Llama-7B 100.0% Code Llama: Open Foundation Models for Code / evidence 2023
Code-Llama-13B 100.0% Code Llama: Open Foundation Models for Code / evidence 2023
Code-Llama-34B 100.0% Code Llama: Open Foundation Models for Code / evidence 2023
DeepSeek-Coder 94.2% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
DeepSeek-R1 92.6% Quantitative Analysis of Performance Drop in DeepSeek Model Quantization / evidence 2025
DeepSeek-Coder 85.5% Prompt Optimization for LLM Code Generation via Reinforcement Learning / evidence 2026
Qwen2.5 85.2% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
Qwen2.5-72B 84.7% Qwen2.5 Technical Report / evidence 2024
Qwen3 84.0% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
GPT-5.1 84.0% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
ReflexiCoder-8B 81.8% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
Seed-Coder-8B 76.8% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
WizardCoder-34B 75.4% WizardCoder: Empowering Code Large Language Models with Evol-Instruct / evidence 2023
Starcoder-3B 75.0% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
WizardCoder 75.0% WizardCoder: Empowering Code Large Language Models with Evol-Instruct / evidence 2023
WizardCoder-15B 72.3% WizardCoder: Empowering Code Large Language Models with Evol-Instruct / evidence 2023
Qwen2.5 69.2% Qwen2.5 Technical Report / evidence 2024
StarCoder2 66.2% Prompt Engineering or Fine-Tuning: An Empirical Assessment of LLMs for Code / evidence 2023
Phi-2-EpiCoder-func-380k 66.1% Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks / evidence 2025
Phi-2-EpiCoder-func-380k-5k 66.1% Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks / evidence 2025
Code-Llama 65.0% Code Llama: Open Foundation Models for Code / evidence 2023
Code-L-Llama 65.0% Code Llama: Open Foundation Models for Code / evidence 2023
CodeLlama-7B 65.0% Code Llama: Open Foundation Models for Code / evidence 2023
CodeLlama-13B 65.0% Code Llama: Open Foundation Models for Code / evidence 2023
Code-Llama-Python-7B 65.0% Code Llama: Open Foundation Models for Code / evidence 2023
Code-Llama-Python 65.0% Code Llama: Open Foundation Models for Code / evidence 2023
CodeLLaMA 64.8% Prompt Optimization for LLM Code Generation via Reinforcement Learning / evidence 2026
phi-2-2.7B 62.7% Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks / evidence 2025
Phi-2 62.7% Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks / evidence 2025
phi-2 62.7% Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks / evidence 2025
DeepSeek-33B 62.5% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
DeepSeek-1.3B 62.1% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
CodeGeeX-13B 61.3% CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X / evidence 2023
CodeT5+ 57.6% Prompt Optimization for LLM Code Generation via Reinforcement Learning / evidence 2026
Gemini-1.5-Pro 54.7% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Dream 54.2% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
GPT-4o 50.0% Dynamic Scaling of Unit Tests for Code Reward Modeling / evidence 2025
LLaMA-8B 42.9% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
LLaDA 40.8% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
CodeGen-16B-mono 40.7% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
CodeGemma 40.0% Evaluating Quantized Large Language Models for Code Generation on Low-Resource Language Benchmarks / evidence 2024
CodeQwen 40.0% Evaluating Quantized Large Language Models for Code Generation on Low-Resource Language Benchmarks / evidence 2024
StarCoder 40.0% Evaluating Quantized Large Language Models for Code Generation on Low-Resource Language Benchmarks / evidence 2024
CodeLlama 40.0% Evaluating Quantized Large Language Models for Code Generation on Low-Resource Language Benchmarks / evidence 2024
GPT-4 39.6% Prompt Engineering or Fine-Tuning: An Empirical Assessment of LLMs for Code / evidence 2023
Qwen2.5 38.6% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
LLaDA-MoE-7B-A1B 38.4% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
CodeGen-16B 36.1% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
CodeGen-6B-mono 36.1% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
DeepSeek-Coder 35.1% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
Cycle-2.7B 34.7% CYCLE: Learning to Self-Refine the Code Generation / evidence 2024
Magicoder-S-DS-6.7B 33.3% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
WaveCoder-Ultra-6.7B 26.3% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
Cycle-1B 25.8% CYCLE: Learning to Self-Refine the Code Generation / evidence 2024
InCoder-6B 24.1% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
CodeGen-16B-multi 24.1% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
CodeGen-6B-multi 22.1% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
Yi-Coder-9B 21.1% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
DeepSeek-Coder 20.2% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
GPT-J-6B 19.9% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
CodeGen-2B-mono 19.1% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
CodeGen-2B-multi 19.1% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
GPT-4 16.8% DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation / evidence 2025
Llama-3 16.8% DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation / evidence 2025
Gemini-2.5-Pro 16.8% DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation / evidence 2025
Claude-3 16.8% DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation / evidence 2025
DeepSeek-R1 16.8% DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation / evidence 2025
Mistral 16.8% DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation / evidence 2025
Qwen 16.8% DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation / evidence 2025
o1 16.8% DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation / evidence 2025
o3 16.8% DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation / evidence 2025
InCoder-1B 12.8% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
OpenCoder-8B 10.5% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
codegen-2b 4.7% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
CodeGen-2B 4.7% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
CodeGen-104-mono 4.5% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
codellama-7b 3.2% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
CodeLlama-7B 3.2% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
starcoderbase-3b 3.1% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
codellama-7b-hf-float16 3.1% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
StarCoderBase-3B 3.1% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
starcoderbase-1b 3.0% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
StarCoderBase-1B 3.0% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
DeepSeek-6.7B 1.5% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
LLaMA-3B 1.1% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
LLaMA-1B 1.1% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025

MMLU (130)

Model Score Source paper Year
Llama-3 100.0% SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts--Extended Version / evidence 2025
Codegen-mono 99.0% Case Study: Fine-tuning Small Language Models for Accurate and Private CWE Detection in Python Code / evidence 2025
GPT-4o 95.9% Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments / evidence 2024
V-RNN 93.8% End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models / evidence 2018
LLaMA-Adapter+MRP 91.2% LLaMA-Adapter + MRP: Integrating Meta-Reasoning Prompting with LLaMA-Adapter for Efficient Multi-Modal and Task-Adaptive Reasoning / evidence 2025
DeepSeek-R1 90.8% Quantitative Analysis of Performance Drop in DeepSeek Model Quantization / evidence 2025
AV-RNN 90.3% End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models / evidence 2018
o3 89.0% Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / evidence 2023
Gemini-2.5-Pro 88.1% Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / evidence 2023
GLM-4.5 87.8% GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models / evidence 2025
GPT-4 87.3% Vendi-RAG: Adaptively Trading-Off Diversity And Quality Significantly Improves Retrieval Augmented Generation With LLMs / evidence 2025
GPT-4 87.3% Adaptive Self-Prompting in Agentic LLM Frameworks for Code Fault Detection / evidence 2026
GPT-4 87.3% Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / evidence 2023
GPT-4 87.3% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
GPT-4 87.3% Capabilities of GPT-4 on Medical Challenge Problems / evidence 2023
GPT-4 87.3% Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance / evidence 2023
GPT-4 87.3% ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing / evidence 2026
GPT-4 87.3% A Systematic Evaluation of On-Device LLMs: Quantization, Performance, and Resources / evidence 2025
o1 86.9% Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / evidence 2023
GPT-4 86.4% Data Engineering for Scaling Language Models to 128K Context / evidence 2024
Qwen2.5 86.1% Qwen2.5 Technical Report / evidence 2024
Llama-3.1-70B 85.2% Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / evidence 2023
OLMoE-1B-7B-0125 84.3% Task-Conditioned Routing Signatures in Sparse Mixture-of-Experts Transformers / evidence 2026
Qwen2 84.2% Qwen2 Technical Report / evidence 2024
Claude-3 84.0% Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / evidence 2023
GPT-4 83.9% Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments / evidence 2024
Qwen 83.6% Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / evidence 2023
DeepSeek-R1 82.5% Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / evidence 2023
Mistral 80.7% Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / evidence 2023
Foundation-Sec-8B-Reasoning 78.2% Llama-3.1-FoundationAI-SecurityLLM-Reasoning-8B Technical Report / evidence 2026
Llama-3.1-8B 78.2% Llama-3.1-FoundationAI-SecurityLLM-Reasoning-8B Technical Report / evidence 2026
MEDITRON-70B 76.0% MEDITRON-70B: Scaling Medical Pretraining for Large Language Models / evidence 2023
Gemini-1.5-Pro 75.3% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Flan-PaLM-540B 75.2% Scaling Instruction-Finetuned Language Models / evidence 2022
Flan-PaLM 75.2% Scaling Instruction-Finetuned Language Models / evidence 2022
Gemma-3-27B 75.1% MedGemma Technical Report / evidence 2025
Gemma-2-9B 75.0% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
gemma-2-9b 75.0% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
Qwen2.5 74.9% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
MedGemma-27B 74.5% MedGemma Technical Report / evidence 2025
GLM-4 74.3% SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization / evidence 2024
Llama-3.1-70B 74.0% "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization / evidence 2024
Dream 72.6% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
Llama-3.1-8B 72.3% Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling / evidence 2026
InternVL-2.5 72.0% Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling / evidence 2024
Qwen2.5 71.0% SBFA: Single Sneaky Bit Flip Attack to Break Large Language Models / evidence 2025
LLaDA-MoE-7B-A1B 70.0% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
LLaDA-MoE-7B-A1B 70.0% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Qwen2.5 70.0% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Llama-3 69.6% Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments / evidence 2024
Phi-3 69.0% Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone / evidence 2024
GPT-3.5 67.3% Data Engineering for Scaling Language Models to 128K Context / evidence 2024
Llama-3.1-8B 66.9% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
LLaDA-8B 66.2% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
PaLM 65.8% Scaling Instruction-Finetuned Language Models / evidence 2022
gemma-2-2b 65.4% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
gemma-2-9B 65.4% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
Phi-3 63.7% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
Llama-3.1-8B 63.5% Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct / evidence 2026
Llama-3 63.5% SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization / evidence 2024
Dream-v0-Instruct-7B 63.1% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
LLaDA-1.5 63.1% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
gemma2-9b 63.0% Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation / evidence 2024
gemma2:9b 63.0% Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation / evidence 2024
LLaDA 62.1% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
DeepSeek-V3 62.0% Reward Shaping to Mitigate Reward Hacking in RLHF / evidence 2025
Phi-4 61.8% Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment / evidence 2026
GSA-1.7B 61.4% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Gated-Only-1.7B 60.8% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Gated-only-1.7B 60.8% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Llama-3 59.9% Unifying Block-wise PTQ and Distillation-based QAT for Progressive Quantization toward 2-bit Instruction-Tuned LLMs / evidence 2025
GPT-3.5 59.6% Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments / evidence 2024
YaRN-Mistral-7B 59.4% Data Engineering for Scaling Language Models to 128K Context / evidence 2024
Sparse-Only-1.7B 59.1% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Sparse-only-1.7B 59.1% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
Standard-1.7B 58.8% Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models / evidence 2026
GPT-4 57.0% Investigating Data Contamination in Modern Benchmarks for Large Language Models / evidence 2023
Mistral-7B-Instruct-v0.3 55.7% Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment / evidence 2026
Mistral-7B-v0.3 55.7% Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment / evidence 2026
Mistral-7B-V0.3 55.7% Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment / evidence 2026
MedGemma-4B 55.5% MedGemma Technical Report / evidence 2025
Gemma-3-4B 54.5% MedGemma Technical Report / evidence 2025
Llama-3 54.5% Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling / evidence 2026
Ministral-8B 53.7% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
Ling-mini-beta 53.1% Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models / evidence 2025
Qwen3 53.0% SBFA: Single Sneaky Bit Flip Attack to Break Large Language Models / evidence 2025
ChatGPT 52.0% Investigating Data Contamination in Modern Benchmarks for Large Language Models / evidence 2023
Dense-6.1B 51.1% Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models / evidence 2025
Llama-3 50.7% Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment / evidence 2026
Llama-3 50.2% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
LongLoRA-13B 50.1% Data Engineering for Scaling Language Models to 128K Context / evidence 2024
Llama-3.1-70B 50.0% Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus / evidence 2024
GPT-4o 49.4% Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark / evidence 2025
Falcon-H1-0.5B 49.0% Where Should LoRA Go? Component-Type Placement in Hybrid Language Models / evidence 2026
Qwen3 48.2% Where Should LoRA Go? Component-Type Placement in Hybrid Language Models / evidence 2026
Llama-3 47.0% Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation / evidence 2024
Llama-2 46.7% Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs / evidence 2024
Llama-2 46.7% Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs / evidence 2024
Zephyr-7B-beta 46.7% Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs / evidence 2024
Vicuna-7B-V1.5 46.2% Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes / evidence 2024
Gemini-1.5-Pro 45.0% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Llama-3.1-8B 44.9% Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment / evidence 2026
Llama-2 44.8% Data Engineering for Scaling Language Models to 128K Context / evidence 2024
Llama-2 43.9% Rotated Robustness: A Training-Free Defense against Bit-Flip Attacks on Large Language Models / evidence 2026
Gemini-1.5-Flash 43.6% Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark / evidence 2025
LongChat-v1.5-7B 42.3% Data Engineering for Scaling Language Models to 128K Context / evidence 2024
Aya-101 41.2% Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments / evidence 2024
Gemma-2-2B 38.9% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
gemma-2-2B 38.9% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
Aya-23 38.8% Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments / evidence 2024
Llama 38.0% OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training / evidence 2025
LongLoRA-7B 37.9% Data Engineering for Scaling Language Models to 128K Context / evidence 2024
Llama-2 37.8% Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes / evidence 2024
Bloomz-7B 36.3% Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments / evidence 2024
Phi-3 34.2% Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments / evidence 2024
Qwen2.5 25.0% Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark / evidence 2025
Gemini-1.5-Flash 8.2% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Mistral-7B 7.0% SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts--Extended Version / evidence 2025
Zephyr-7B 7.0% SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts--Extended Version / evidence 2025
Llama-2 6.5% SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts--Extended Version / evidence 2025
Dolly-v2-3B 5.8% SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts--Extended Version / evidence 2025
StableLM-3B 3.0% SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts--Extended Version / evidence 2025
Open-Llama-3B 3.0% SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts--Extended Version / evidence 2025
Pythia-2.8B 2.8% SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts--Extended Version / evidence 2025
Mamba-2-Hybrid 2.6% An Empirical Study of Mamba-based Language Models / evidence 2024
Gemma-2B 2.0% SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts--Extended Version / evidence 2025
Gemma-3-1B 2.0% SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts--Extended Version / evidence 2025
Phi-1.5B 1.4% SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts--Extended Version / evidence 2025
GPT-Neo-1.3B 1.4% SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts--Extended Version / evidence 2025
TinyLlama-1.1B 1.1% SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts--Extended Version / evidence 2025

MMLU-Pro (59)

Model Score Source paper Year
Kimi-Linear 84.3% Kimi Linear: An Expressive, Efficient Attention Architecture / evidence 2025
GPT-4 81.0% Capabilities of GPT-4 on Medical Challenge Problems / evidence 2023
Gemma-2-70B 80.2% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Phi-3 78.9% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen2 78.9% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Claude-3-Opus 78.0% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
o3 78.0% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Yi-Large 76.3% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
o1 76.3% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
GPT-4o 72.6% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Llama-3.1-70B 70.0% "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization / evidence 2024
GPT-4 69.7% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Llama-3.1-70B 69.7% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Gemma-2-9B 69.1% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
gemma-2-9b 69.1% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
Claude-3-Sonnet 68.5% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Gemma-3-27B 67.5% MedGemma Technical Report / evidence 2025
Gemini-2.5-Pro 66.0% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Gemma-2-27B 66.0% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Gemma-2-9B 66.0% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Llama-2 64.8% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Mixtral 63.7% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen 61.5% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen-14B 61.4% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen2.5 60.6% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
MedGemma-27B 60.2% MedGemma Technical Report / evidence 2025
Qwen-1.5-7B 59.9% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Mistral-7B-Instruct-v0.1 58.7% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen-32B 58.7% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
DeepSeek-R1 58.7% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen2.5 58.1% Qwen2.5 Technical Report / evidence 2024
Llama-3.1-8B 57.1% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
Llama-3 56.2% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Phi-3 54.8% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
Mistral-7B-Instruct-v0.2 53.5% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Gemma-7B 49.9% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen-34B 45.2% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Gemma-3-4B 43.6% MedGemma Technical Report / evidence 2025
Gemini-2 43.0% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen-1.5-72B 43.0% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Llama-3 42.0% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
MedGemma-4B 39.1% MedGemma Technical Report / evidence 2025
Qwen-72B 35.2% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
DeepSeek-V2 34.2% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen-72-7B 34.2% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Gemma-2-2B 31.2% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
gemma-2-2b 31.2% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
Qwen3 31.2% SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning / evidence 2025
Qwen-1.5-14B 28.7% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Mistral-7B 28.7% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen-7B 28.7% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen-1.5-34B 25.5% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Gemma-2B 25.5% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Dream 24.1% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
Ling-mini-beta 24.0% Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models / evidence 2025
LLaDA 23.3% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
Qwen-14.8B 22.4% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Dense-6.1B 21.7% Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models / evidence 2025
Mistral-7B-v0.1 20.0% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024

Ruler (40)

Model Score Source paper Year
Yi-9B-200K 100.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Yarn-Mistral-7B 100.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Llama-2 100.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
longchat-7b-v1.5-32k 100.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Llama-3 100.0% ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference / evidence 2024
Qwen2.5 95.7% Sparser Block-Sparse Attention via Token Permutation / evidence 2025
Llama-3.1-70B 94.0% "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization / evidence 2024
Mixtral 92.8% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Phi-3 91.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Llama-3.1-8B 85.6% ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing / evidence 2026
Gemini-1.5-Pro 85.6% RULER: What's the Real Context Size of Your Long-Context Language Models? / evidence 2024
Llama-2 85.6% RULER: What's the Real Context Size of Your Long-Context Language Models? / evidence 2024
Kimi-Linear 84.3% Kimi Linear: An Expressive, Efficient Attention Architecture / evidence 2025
Yi-34B-200K 80.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Yi-34B 80.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
GPT-4 64.0% RULER: What's the Real Context Size of Your Long-Context Language Models? / evidence 2024
Llama-3.1-70B 64.0% RULER: What's the Real Context Size of Your Long-Context Language Models? / evidence 2024
GLM4-9B 64.0% RULER: What's the Real Context Size of Your Long-Context Language Models? / evidence 2024
LongChat-7B-v1.5-32K 60.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Mistral-7B 40.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
LongAlpaca-13B 40.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Qwen2 32.0% RULER: What's the Real Context Size of Your Long-Context Language Models? / evidence 2024
Command-R-plus-104B 32.0% RULER: What's the Real Context Size of Your Long-Context Language Models? / evidence 2024
Llama-3.1-8B 32.0% RULER: What's the Real Context Size of Your Long-Context Language Models? / evidence 2024
Mixtral 32.0% RULER: What's the Real Context Size of Your Long-Context Language Models? / evidence 2024
Yi-34B 32.0% RULER: What's the Real Context Size of Your Long-Context Language Models? / evidence 2024
Phi-3 32.0% RULER: What's the Real Context Size of Your Long-Context Language Models? / evidence 2024
DBRX-36B/132B 32.0% RULER: What's the Real Context Size of Your Long-Context Language Models? / evidence 2024
Together-7B 32.0% RULER: What's the Real Context Size of Your Long-Context Language Models? / evidence 2024
LongChat-7B 32.0% RULER: What's the Real Context Size of Your Long-Context Language Models? / evidence 2024
Llama-3 16.0% RULER: What's the Real Context Size of Your Long-Context Language Models? / evidence 2024
Mistral-v0.2-7B 16.0% RULER: What's the Real Context Size of Your Long-Context Language Models? / evidence 2024
LongAlpaca-13B 16.0% RULER: What's the Real Context Size of Your Long-Context Language Models? / evidence 2024
ReST-KV 10.6% ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing / evidence 2026
LWM-7B 4.0% RULER: What's the Real Context Size of Your Long-Context Language Models? / evidence 2024
Llama-3.1-8B 3.5% AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference / evidence 2026
Qwen3 2.2% AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference / evidence 2026
Qwen2.5 1.9% MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training / evidence 2025
Llama-3.1-8B 1.9% MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training / evidence 2025
Llama-3.1-8B 1.9% Ruler Score Discrepancies in Llama-3.1-8B Benchmark Evaluations Across Studies / evidence 2026

SQuAD (33)

Model Score Source paper Year
MobileBERT 90.3% Lightweight Transformer Architectures for Edge Devices in Real-Time Applications / evidence 2026
BERT 88.5% Lightweight Transformer Architectures for Edge Devices in Real-Time Applications / evidence 2026
TinyBERT-6 87.5% Lightweight Transformer Architectures for Edge Devices in Real-Time Applications / evidence 2026
Phi-3 85.0% Return of the Encoder: Maximizing Parameter Efficiency for SLMs / evidence 2025
MobileBERT-tiny 84.2% Lightweight Transformer Architectures for Edge Devices in Real-Time Applications / evidence 2026
J1-Grande 83.8% Parallel Context Windows for Large Language Models / evidence 2022
TinyBERT-4 82.1% Lightweight Transformer Architectures for Edge Devices in Real-Time Applications / evidence 2026
DistilBERT 79.8% Lightweight Transformer Architectures for Edge Devices in Real-Time Applications / evidence 2026
J1-Large 79.2% Parallel Context Windows for Large Language Models / evidence 2022
BM25+DPR-Single 71.5% Dense Passage Retrieval for Open-Domain Question Answering / evidence 2020
BM25 68.8% Dense Passage Retrieval for Open-Domain Question Answering / evidence 2020
BM25+DPR-Multi 66.2% Dense Passage Retrieval for Open-Domain Question Answering / evidence 2020
stella-en-1.5B-v5 64.1% Rethinking Chunk Size For Long-Document Retrieval: A Multi-Dataset Analysis / evidence 2025
DPR-Single 63.2% Dense Passage Retrieval for Open-Domain Question Answering / evidence 2020
FLAN-T5-XXL 57.6% Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers / evidence 2024
Blended-RAG 57.6% Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers / evidence 2024
LLaDA-MoE-7B-A1B 53.0% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
DPR-Multi 51.6% Dense Passage Retrieval for Open-Domain Question Answering / evidence 2020
LLaDA-8B 50.4% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Codex 41.8% A Taxonomy for Data Contamination in Large Language Models / evidence 2024
RAG-end2end 40.0% Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers / evidence 2024
LLaDA-1.5 36.9% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Dream-v0-Instruct-7B 30.4% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
PaLM540B-(Oneshot) 29.3% Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers / evidence 2024
PaLM540B-(Zeroshot) 29.3% Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers / evidence 2024
PaLM 29.3% Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers / evidence 2024
PaLM540B 29.3% Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers / evidence 2024
PaLM-540B 29.3% Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers / evidence 2024
RAG-original 28.1% Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers / evidence 2024
GLaM-(Oneshot) 26.3% Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers / evidence 2024
GLaM 26.3% Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers / evidence 2024
GLaM-(Zeroshot) 24.7% Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers / evidence 2024
Qwen2.5 21.0% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025

XNLI (32)

Model Score Source paper Year
XLM 80.8% XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization / evidence 2020
MMTE 79.6% XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization / evidence 2020
MMKD 75.4% Multi-level Distillation of Semantic Knowledge for Pre-training Multilingual Language Model / evidence 2022
XLM-K 74.8% XLM-K: Improving Cross-Lingual Language Model Pre-training with Multilingual Knowledge / evidence 2021
XLM-Rbase 74.2% XLM-K: Improving Cross-Lingual Language Model Pre-training with Multilingual Knowledge / evidence 2021
MMKD-XWCL 74.1% Multi-level Distillation of Semantic Knowledge for Pre-training Multilingual Language Model / evidence 2022
MMKD-StrucA 73.9% Multi-level Distillation of Semantic Knowledge for Pre-training Multilingual Language Model / evidence 2022
MMKD-TLM 73.5% Multi-level Distillation of Semantic Knowledge for Pre-training Multilingual Language Model / evidence 2022
MMKD-SentA 73.4% Multi-level Distillation of Semantic Knowledge for Pre-training Multilingual Language Model / evidence 2022
LLaMA 70.0% Evaluating the Limits of Large Language Models in Multilingual Legal Reasoning / evidence 2025
Gemini 70.0% Evaluating the Limits of Large Language Models in Multilingual Legal Reasoning / evidence 2025
DLFA-D11 66.9% Feature Aggregation in Zero-Shot Cross-Lingual Transfer Using Multilingual BERT / evidence 2022
DLFA-D8 66.9% Feature Aggregation in Zero-Shot Cross-Lingual Transfer Using Multilingual BERT / evidence 2022
DLFA-D6 66.8% Feature Aggregation in Zero-Shot Cross-Lingual Transfer Using Multilingual BERT / evidence 2022
DLFA-D9 66.6% Feature Aggregation in Zero-Shot Cross-Lingual Transfer Using Multilingual BERT / evidence 2022
DLFA-D10 66.5% Feature Aggregation in Zero-Shot Cross-Lingual Transfer Using Multilingual BERT / evidence 2022
DLFA-D7 66.2% Feature Aggregation in Zero-Shot Cross-Lingual Transfer Using Multilingual BERT / evidence 2022
DLFA-D2 66.2% Feature Aggregation in Zero-Shot Cross-Lingual Transfer Using Multilingual BERT / evidence 2022
DLFA-D4 66.1% Feature Aggregation in Zero-Shot Cross-Lingual Transfer Using Multilingual BERT / evidence 2022
DLFA-D3 66.0% Feature Aggregation in Zero-Shot Cross-Lingual Transfer Using Multilingual BERT / evidence 2022
DLFA-D1 65.8% Feature Aggregation in Zero-Shot Cross-Lingual Transfer Using Multilingual BERT / evidence 2022
DLFA-D5 65.4% Feature Aggregation in Zero-Shot Cross-Lingual Transfer Using Multilingual BERT / evidence 2022
mBERT 65.4% Feature Aggregation in Zero-Shot Cross-Lingual Transfer Using Multilingual BERT / evidence 2022
mT5 60.8% Languages You Know Influence Those You Learn: Impact of Language Characteristics on Multi-Lingual Text-to-Text Transfer / evidence 2022
Llama-3.1-8B 52.9% Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation / evidence 2025
Mistral-Small-24B 30.0% Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation / evidence 2025
Mistral-v0.3-7B 29.2% Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation / evidence 2025
Llama-3 29.2% Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation / evidence 2025
mBERT 16.5% XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization / evidence 2020
XLM-R 10.2% XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization / evidence 2020
Translate-train 7.3% XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization / evidence 2020
Translate-test 6.7% XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization / evidence 2020