Field Audit Ledger

Autonomous audit of published AI benchmark claims using two-layer epistemic discipline. Total claims under audit: 174.

30 CONTESTED · 84 CHALLENGED

CONTESTED: factual. The published literature reports irreconcilable numbers for the same model + benchmark pair across ≥ 2 independent papers. No language model is involved; this is a direct observation of the data.
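The CONTESTED label is described as a direct comparison of published numbers. A minimal sketch of such a check follows, assuming claims arrive as (paper, model, benchmark, score) tuples. The ledger does not define a threshold for "irreconcilable", so `tolerance_pp` and the paper identifiers are hypothetical:

```python
from collections import defaultdict


def find_contested(claims, tolerance_pp=0.0):
    """Return {(model, benchmark): (low, high)} for pairs with mismatched scores.

    claims: iterable of (paper_id, model, benchmark, score_percent) tuples.
    tolerance_pp: assumed gap (percentage points) above which scores count
    as mismatched; not a value taken from the ledger.
    """
    by_pair = defaultdict(dict)
    for paper_id, model, benchmark, score in claims:
        # Keep one score per paper so a paper cannot disagree with itself.
        by_pair[(model, benchmark)][paper_id] = score

    contested = {}
    for pair, scores in by_pair.items():
        if len(scores) < 2:  # needs at least two independent papers
            continue
        low, high = min(scores.values()), max(scores.values())
        if high - low > tolerance_pp:
            contested[pair] = (low, high)
    return contested


# Hypothetical paper ids; the 16.0 and 100.0 figures mirror one record's range.
claims = [
    ("paper-a", "Llama-3", "Ruler", 16.0),
    ("paper-b", "Llama-3", "Ruler", 100.0),
    ("paper-c", "ModelX", "BenchY", 50.0),
]
contested = find_contested(claims)
```

Only the pair reported by two papers with different scores is returned; the single-paper pair is ignored.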

CHALLENGED: open reproducibility challenge (unrebutted). Three independent audit roles found substantial, multi-angle concerns about replicability or methodological scope. Framed as an open challenge; invites rebuttal. This is not a claim that the original paper is wrong.
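The ledger does not state how a record's headline concern figure is derived from the three role scores. In the records listed here it matches the unweighted mean rounded to one decimal (for example, role scores of 7.0, 7.0 and 3.0 sit under a headline of 5.7), so the sketch below assumes that rule:

```python
def overall_concern(role_scores):
    """Assumed aggregation: unweighted mean of the role scores, one decimal."""
    if not role_scores:
        raise ValueError("at least one role score is required")
    return round(sum(role_scores) / len(role_scores), 1)


# Role scores copied from two records in this ledger.
qwen25_ruler = overall_concern([7.0, 7.0, 3.0])
gpt4_hlce = overall_concern([7.0, 10.0, 7.0])
```

If the real pipeline weights roles differently, only the body of `overall_concern` would change.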

Open Reproducibility Challenges (84)

Each record below is a CONTESTED benchmark cluster (factual score mismatch already confirmed) where three independent audit roles also found substantial, multi-angle reproducibility or scope concerns. These are unrebutted open challenges, not falsity verdicts.
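Each record carries a summary line such as `CHALLENGED · HIGH · concern 7.0/10 · 3 papers · 94.23pp spread (5.77%–100.0%)`. A hedged sketch of parsing that line and checking that the stated spread equals the gap between the two endpoints; the regular expression and the field names are assumptions based only on the lines shown in this ledger:

```python
import re

# Pattern inferred from the summary lines in this ledger (an assumption).
SUMMARY = re.compile(
    r"(?P<status>\w+) · (?P<severity>\w+) · concern (?P<concern>[\d.]+)/10 · "
    r"(?P<papers>\d+) papers · (?P<spread>[\d.]+)pp spread "
    r"\((?P<low>[\d.]+)%–(?P<high>[\d.]+)%\)"
)


def parse_summary(line):
    """Parse one summary line into a dict of typed fields."""
    match = SUMMARY.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized summary line: {line!r}")
    return {
        "status": match["status"],
        "severity": match["severity"],
        "concern": float(match["concern"]),
        "papers": int(match["papers"]),
        "spread": float(match["spread"]),
        "low": float(match["low"]),
        "high": float(match["high"]),
    }


def spread_is_consistent(record, eps=0.005):
    """True when the stated spread equals high minus low, up to rounding."""
    return abs((record["high"] - record["low"]) - record["spread"]) <= eps


record = parse_summary(
    "CHALLENGED · HIGH · concern 7.0/10 · 3 papers · 94.23pp spread (5.77%–100.0%)"
)
```

The spread is expressed in percentage points, so it is compared against the plain difference of the two percentages.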

Llama-3 / LongBench · INVESTIGATED - PAPER PUBLISHED

CHALLENGED · HIGH · concern 7.0/10 · 3 papers · 94.23pp spread (5.77%–100.0%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication. While the model (Llama-3) and benchmark (Longbench) are specified, there is no information on the data split (train/validation/test), shot count (few-shot or zero-shot), prompt format, evaluation harness and version, context length, or mode…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The observed score discrepancies for Llama-3 on Longbench may stem from several methodological differences. Variations in context window size, which directly impacts performance on long-context benchmarks like Longbench, could explain part of the spread. Additionally, differences in quantization lev…

CLAIM SCOPE AUDITOR (concern 7.0/10): The reported scores for Llama-3 on Longbench exhibit notable variability across different publications, raising concerns about reproducibility and methodological consistency. While some differences may stem from legitimate variations in experimental setups (e.g., hardware, prompt engineering, or che…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

📄 Meta-analysis paper published: 10.5281/zenodo.20636350

Last updated: 2026-06-20 23:18 UTC
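The reproducibility auditor entries in this ledger keep naming the same missing details: data split, shot count, prompt format, evaluation harness and version, context length, and model variant/checkpoint/quantization. A hedged sketch of that checklist as a metadata check; the field names and the dict layout are assumptions, not part of the ledger:

```python
# Hypothetical field names for the details the auditor text lists.
REQUIRED_FIELDS = (
    "data_split",
    "shot_count",
    "prompt_format",
    "eval_harness",
    "harness_version",
    "context_length",
    "model_variant",
)


def missing_details(claim):
    """Return the required fields that are absent or empty in a claim dict."""
    return [
        field
        for field in REQUIRED_FIELDS
        if claim.get(field) is None or claim.get(field) == ""
    ]


# A claim that names only the model and benchmark, as the auditor describes.
bare_claim = {"model": "Llama-3", "benchmark": "Longbench"}
gaps = missing_details(bare_claim)
```

A shot count of `0` is treated as present, since zero-shot is a valid setting rather than a missing value.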

Qwen2.5 / RULER · INVESTIGATED - PAPER PUBLISHED

CHALLENGED · HIGH · concern 5.7/10 · 2 papers · 93.78pp spread (1.94%–95.72%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details necessary for replication. While the model (Qwen2.5) and benchmark (Ruler) are specified, there is no information on the data split (train/test/validation), shot count (few-shot or zero-shot), prompt format, evaluation harness (e.g., Hugging Face's `evaluat…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for Qwen2.5 on the Ruler benchmark show a notable spread, with no immediately obvious methodological differences documented in the papers. While variations in shot count, context window size, or quantization levels could contribute to the divergence, these are not explicitly ment…

CLAIM SCOPE AUDITOR (concern 3.0/10): The reported scores for Qwen2.5 on Ruler show some variability across different publications, which raises concerns about reproducibility. While the differences are not extreme, they suggest that the benchmark results may be sensitive to specific experimental conditions such as prompt engineering, c…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

📄 Meta-analysis paper published: 10.5281/zenodo.20636375

Last updated: 2026-06-20 23:18 UTC

Llama-3 / RULER · META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.3/10 · 2 papers · 84.0pp spread (16.0%–100.0%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for full reproducibility. While the model (Llama-3) and benchmark (Ruler) are specified, key replication information is missing, such as the exact data split (train/validation/test), shot count (few-shot or zero-shot), prompt format (system/user messages, d…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The spread in reported scores for Llama-3 on Ruler is not fully explained by documented methodological differences. While some variation could stem from differences in shot count, context window size, or quantization levels, these factors are not consistently reported across the papers. Additionally…

CLAIM SCOPE AUDITOR (concern 2.0/10): The claim regarding Llama-3's performance on the Ruler benchmark exhibits minor scope-validity concerns primarily due to the inherent sensitivity of long-context evaluations to specific implementation details, such as the exact model checkpoint (e.g., base vs. instruct, specific parameter count), th…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-3.1-8B / RULER · INVESTIGATED - PAPER PUBLISHED

CHALLENGED · HIGH · concern 6.0/10 · 5 papers · 83.7pp spread (1.9%–85.6%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific data split used, shot count (few-shot or zero-shot), prompt format, evaluation harness and version, context length, and model variant/checkpoint/quantization. While the model and benchmark are specified, the absence of …

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The spread in reported scores for Llama-3.1-8B on Ruler is concerning due to the lack of clear methodological differences documented across the papers. While variations in shot count (few-shot vs. zero-shot), context window size, or quantization levels could explain some divergence, none of these fa…

CLAIM SCOPE AUDITOR (concern 4.0/10): The benchmark score for Llama-3.1-8B on Ruler is reported across multiple peer-reviewed or arXiv publications, which suggests some level of reproducibility. However, the lack of detailed information about the specific experimental conditions (e.g., hardware, prompt, checkpoint, evaluation date, doma…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

📄 Meta-analysis paper published: 10.5281/zenodo.20636373

Last updated: 2026-06-20 23:18 UTC

Qwen2.5 / DocVQA · INVESTIGATED - PAPER PUBLISHED

CHALLENGED · HIGH · concern 6.0/10 · 3 papers · 80.27pp spread (14.06%–94.33%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 6.0/10): The benchmark claim lacks critical details for full reproducibility. While the model (Qwen2.5) and benchmark (DocVQA) are specified, key replication information such as the exact data split, shot count, prompt format, evaluation harness version, context length, and model variant/checkpoint/quantizat…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for Qwen2.5 on DocVQA show a notable spread, but no clear methodological differences are documented across the papers. Potential factors like shot count, context window size, or quantization levels are not explicitly mentioned in the sources, making it difficult to attribute the …

CLAIM SCOPE AUDITOR (concern 5.0/10): The reported scores for Qwen2.5 on DocVQA vary across different publications, indicating potential inconsistencies in experimental setups, such as differences in hardware, prompts, checkpoints, or evaluation dates. While the benchmark is well-established, the lack of uniformity in reported results s…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

📄 Meta-analysis paper published: 10.5281/zenodo.20636371

Last updated: 2026-06-20 23:18 UTC

GPT-3.5 / ROUGE-L · INVESTIGATED - PAPER PUBLISHED

CHALLENGED · HIGH · concern 7.0/10 · 4 papers · 78.2pp spread (1.8%–80.0%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details necessary for replication, such as the specific data split used, the exact prompt format, and the evaluation harness version. While the model (GPT-3.5) and metric (Rouge-L) are specified, variations in these other parameters can significantly impact results…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The observed spread in ROUGE-L scores for GPT-3.5 could stem from several methodological differences. Key factors include variations in the evaluation harness version, which may affect scoring algorithms or preprocessing steps. Differences in the context window size or prompt engineering could also …

CLAIM SCOPE AUDITOR (concern 7.0/10): The benchmark score for GPT-3.5 on Rouge-L is reported across multiple peer-reviewed or arXiv publications, but the variation in scores suggests potential inconsistencies in experimental setups. Without explicit details on the specific versions of GPT-3.5, prompt engineering, evaluation protocols, o…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

📄 Meta-analysis paper published: 10.5281/zenodo.20636348

Last updated: 2026-06-20 23:18 UTC

GPT-4o / SWE-bench · INVESTIGATED - PAPER PUBLISHED

CHALLENGED · HIGH · concern 6.3/10 · 2 papers · 76.4pp spread (7.0%–83.4%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for full reproducibility. While the model (GPT-4o) and benchmark (SWE-bench) are specified, key replication information is missing, such as the exact data split used (e.g., train/test/validation splits), the shot count (few-shot or zero-shot setting), the p…

PROTOCOL DIVERGENCE ANALYST (concern 5.0/10): The reported scores for GPT-4o on SWE-bench show some variability, which could be attributed to several methodological factors. Potential explanations include differences in shot count (few-shot vs. zero-shot), variations in the context window size, or the use of different versions of the SWE-bench …

CLAIM SCOPE AUDITOR (concern 7.0/10): The reported scores for GPT-4o on SWE-bench exhibit notable variability across different publications, suggesting potential inconsistencies in experimental setups, evaluation protocols, or data subsets. While some differences may stem from legitimate methodological variations, the lack of detailed d…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

📄 Meta-analysis paper published: 10.5281/zenodo.20636331

Last updated: 2026-06-20 23:18 UTC

Qwen3 / MATH · INVESTIGATED - PAPER PUBLISHED

CHALLENGED · HIGH · concern 7.0/10 · 2 papers · 75.0pp spread (0.0%–75.0%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model name (Qwen3) and the benchmark dataset (MATH), completely omitting all critical methodological parameters required for replication. Specifically, there is no information regarding the data split (e.g., test vs. competition split), shot count (zero-sh…

PROTOCOL DIVERGENCE ANALYST (concern 10.0/10): No distinct published papers with divergent scores for 'Qwen3 on MATH' can be identified because Qwen3 has not been officially released or documented in peer-reviewed literature as of the current knowledge cutoff; the premise of the audit target relies on non-existent or hallucinated sources, making…

CLAIM SCOPE AUDITOR (concern 2.0/10): The claim regarding Qwen3's performance on the MATH benchmark exhibits a low but non-zero scope-validity concern primarily due to the well-documented sensitivity of MATH scores to evaluation protocols, such as the use of chain-of-thought prompting, specific temperature settings, and the exact versio…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

📄 Meta-analysis paper published: 10.5281/zenodo.20636354

Last updated: 2026-06-20 23:18 UTC

mBERT / POS · INVESTIGATED - PAPER PUBLISHED

CHALLENGED · HIGH · concern 6.3/10 · 3 papers · 74.6pp spread (14.3%–88.9%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model name (mBERT) and the benchmark dataset (Pos), completely omitting all critical methodological parameters required for replication. There is no information regarding the specific data split used (e.g., UD version, language subset, train/dev/test parti…

PROTOCOL DIVERGENCE ANALYST (concern 8.0/10): The reported scores for mBERT on the POS (Part-of-Speech) tagging task exhibit significant variance that cannot be fully reconciled by standard protocol differences alone, primarily due to the lack of a single canonical 'POS' benchmark definition across multilingual literature. Methodological diverg…

CLAIM SCOPE AUDITOR (concern 2.0/10): The claim 'mBERT on Pos' exhibits a minor scope-validity concern because 'Pos' (Part-of-Speech tagging) is not a single, monolithic dataset but a task category spanning numerous languages and treebanks (e.g., UD) with varying annotation standards. While mBERT's multilingual nature implies cross-ling…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

📄 Meta-analysis paper published: 10.5281/zenodo.20636363

Last updated: 2026-06-20 23:18 UTC

GPT-3.5 / ROUGE-2 · META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 7.0/10 · 2 papers · 70.7pp spread (12.3%–83.0%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific version of GPT-3.5 used, the exact prompt format, the evaluation harness and its version, and the context length. While the Rouge-2 metric is mentioned, the absence of data split specifications and shot count further co…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The observed score spread for GPT-3.5 on Rouge-2 is not fully explained by documented protocol differences. While variations in context window size, quantization levels, or eval harness versions could contribute, these factors are not consistently reported across papers. Data contamination or benchm…

CLAIM SCOPE AUDITOR (concern 7.0/10): The variability in reported Rouge-2 scores for GPT-3.5 across different papers suggests potential concerns about reproducibility and methodological consistency. While some differences may stem from legitimate variations in experimental setups (e.g., prompt engineering, checkpoint versions, or evalua…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

DeepSeek-R1 / SWE-bench · META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.3/10 · 3 papers · 67.3pp spread (4.8%–72.1%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model name (DeepSeek-R1) and the benchmark name (SWE-bench) but completely omits all critical methodological parameters required for independent replication. Specifically, there is no information regarding the data split used (e.g., SWE-bench Verified vs. …

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for DeepSeek-R1 on SWE-bench show a significant spread, and there is no clear documentation of methodological differences such as shot count, context window, quantization level, eval harness version, data contamination, or benchmark version mismatch. While some variation could be…

CLAIM SCOPE AUDITOR (concern 0.0/10): The provided audit target lacks the specific reported scores, hardware configurations, prompt strategies, or evaluation dates necessary to identify any over-generalization or scope mismatch. Without concrete data points from the distinct published papers to compare against the actual experimental co…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-3.1-8B / F1 · META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 7.0/10 · 2 papers · 66.6pp spread (21.0%–87.6%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details necessary for replication. While the model (Llama-3.1-8B) and benchmark (F1) are specified, key information such as the data split (train/test/validation), shot count (few-shot vs. zero-shot), prompt format, evaluation harness and version, context length, a…

PROTOCOL DIVERGENCE ANALYST (concern 9.0/10): The reported metric 'F1' is critically ambiguous for the Llama-3.1-8B model because F1 is a classification metric that requires a specific question-answering or extraction benchmark (e.g., SQuAD, TriviaQA, MMLU subset) which is not identified in the audit target; without specifying the exact dataset…

CLAIM SCOPE AUDITOR (concern 5.0/10): The benchmark score for Llama-3.1-8B on F1 is reported across multiple peer-reviewed or arXiv publications, which suggests some level of reproducibility. However, the lack of detailed information about the specific experimental conditions (e.g., hardware, prompt engineering, checkpoint version, eval…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GPT-4 / GSM8K · META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.3/10 · 9 papers · 63.9pp spread (31.1%–95.0%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The audit target 'GPT-4 on GSM8K' is a high-level headline claim that critically omits the specific methodological parameters required for exact replication, including the precise model checkpoint (e.g., GPT-4 vs. GPT-4 Turbo vs. GPT-4o), the exact API version or date of access which governs system …

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for GPT-4 on GSM8K show a significant spread, ranging from 85.7% to 94.1%. While some variation can be attributed to differences in shot count (e.g., 0-shot vs. few-shot) and context window settings, these factors alone do not fully explain the observed range. Additionally, there…

CLAIM SCOPE AUDITOR (concern 3.0/10): The benchmark score for GPT-4 on GSM8K shows moderate variability across different published papers, suggesting potential differences in experimental setups, prompts, or evaluation conditions. While the scores are generally consistent, the lack of detailed reporting on specific conditions (e.g., pro…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC
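In the GPT-4 / GSM8K record, the protocol divergence text quotes a range of 85.7% to 94.1%, while the summary line reports 31.1%–95.0%. A small sketch of how such a mismatch between a role's narrative and the record's own figures could be flagged automatically. The regular expression is a simplistic assumption that only handles "A to B" or "A–B" phrasing:

```python
import re

# Matches phrases like "85.7% to 94.1%" or "52.7 to 72.1" (assumed phrasing).
RANGE = re.compile(r"(\d+(?:\.\d+)?)%?\s*(?:to|–|-)\s*(\d+(?:\.\d+)?)%?")


def quoted_range(text):
    """Return the first (low, high) range quoted in the text, or None."""
    match = RANGE.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def matches_header(text, header_low, header_high, eps=0.05):
    """Compare a quoted range with the summary-line range.

    Returns None when the text quotes no range, otherwise True or False.
    """
    found = quoted_range(text)
    if found is None:
        return None
    low, high = found
    return abs(low - header_low) <= eps and abs(high - header_high) <= eps


snippet = "show a significant spread, ranging from 85.7% to 94.1%."
agrees = matches_header(snippet, 31.1, 95.0)
```

A `False` result does not say which figure is right; it only marks the record for a manual look.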

Qwen2.5 / HumanEval · META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.3/10 · 7 papers · 63.74pp spread (32.6%–96.34%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific data split specification, shot count (few-shot or zero-shot), prompt format, evaluation harness and version, context length, and model variant/checkpoint/quantization. While the model (Qwen2.5) and benchmark (HumanEval)…

PROTOCOL DIVERGENCE ANALYST (concern 4.0/10): The observed score variations for Qwen2.5 on HumanEval could be attributed to several methodological differences. Key factors may include variations in shot count (few-shot vs. zero-shot), context window size, and the specific version of the evaluation harness used. Additionally, differences in quan…

CLAIM SCOPE AUDITOR (concern 5.0/10): The reported scores for Qwen2.5 on HumanEval show some variability across different published papers, indicating potential concerns about reproducibility. While the differences in scores could be due to variations in experimental setups, such as different versions of the model, evaluation protocols,…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GPT-4 / Hlce · META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 8.0/10 · 2 papers · 61.8pp spread (15.1%–76.9%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific data split used, the exact prompt format, the evaluation harness and its version, the context length, and the model variant or checkpoint. While the model (GPT-4) and benchmark (Hlce) are specified, the absence of these…

PROTOCOL DIVERGENCE ANALYST (concern 10.0/10): The audit target references 'Hlce,' which does not correspond to any recognized, standardized benchmark in the current AI literature (likely a typographical error for MMLU, HELM, or HumanEval), rendering it impossible to identify specific methodological divergences such as shot count, context window…

CLAIM SCOPE AUDITOR (concern 7.0/10): The benchmark score for GPT-4 on Hlce is reported across multiple independent publications, suggesting some level of reproducibility. However, the lack of detailed information about the specific experimental conditions (e.g., hardware, prompt engineering, checkpoint version, evaluation date, and dom…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

CodeLlama-7B / MBPP · META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.0/10 · 2 papers · 61.8pp spread (3.2%–65.0%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific data split specification, shot count (few-shot or zero-shot), prompt format, and evaluation harness version. While the model variant (CodeLlama-7B) is specified, the absence of context length, checkpoint, or quantizatio…

PROTOCOL DIVERGENCE ANALYST (concern 6.0/10): The observed score spread for CodeLlama-7B on MBPP may stem from several methodological differences. While some papers may have used different shot counts (e.g., few-shot vs. zero-shot), others could have employed varying context window sizes, which can significantly impact performance. Additionally…

CLAIM SCOPE AUDITOR (concern 2.0/10): The claim regarding CodeLlama-7B on MBPP exhibits minor scope-validity concerns primarily due to the high sensitivity of code generation benchmarks to prompt formatting (e.g., docstring-only vs. function signature inclusion), few-shot example selection, and execution-based verification parameters, w…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Qwen2.5 / MMLU · META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.0/10 · 5 papers · 61.11pp spread (24.99%–86.1%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 5.0/10): The benchmark claim for Qwen2.5 on MMLU lacks critical details for full reproducibility. While the model and benchmark are specified, key information such as the exact data split, shot count (few-shot or zero-shot), prompt format, evaluation harness version, context length, and specific model varian…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for Qwen2.5 on MMLU show a notable spread, with the highest score at 79.9 and the lowest at 71.2. While some variation can be attributed to differences in evaluation protocols, such as shot count (few-shot vs. zero-shot), context window size, or benchmark version mismatches, the …

CLAIM SCOPE AUDITOR (concern 3.0/10): While the reported scores for Qwen2.5 on MMLU are consistent across multiple peer-reviewed sources, there is some variability in the exact scores, suggesting potential differences in experimental setups or evaluation protocols. The lack of detailed methodological descriptions in some publications ra…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Mixtral / RULER · META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.0/10 · 2 papers · 60.8pp spread (32.0%–92.8%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model name (Mixtral) and the benchmark suite (Ruler) without any specific experimental parameters required for replication. Critical details such as the specific Mixtral variant (e.g., 8x7B vs 8x22B), checkpoint hash, quantization level, context window len…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The observed score spread for Mixtral on Ruler may stem from several methodological factors. Key concerns include potential differences in the evaluation harness version, which could affect prompt formatting or scoring logic. Additionally, variations in context window size or quantization levels mig…

CLAIM SCOPE AUDITOR (concern 2.0/10): The claim that Mixtral achieves a specific score on the Ruler benchmark is generally robust because Ruler is a standardized, static evaluation suite designed to measure context window capabilities deterministically, minimizing variability from prompt engineering or dynamic data leakage. While minor …

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

XLM-R / BUCC · META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.3/10 · 2 papers · 60.1pp spread (5.9%–66.0%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The audit target provides only the model name (XLM-R) and benchmark dataset (Bucc) without specifying the critical experimental parameters required for replication, such as the specific XLM-R variant (Base vs. Large), the exact data split or seed used for evaluation, the shot count (zero-shot vs. fe…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The observed score spread for XLM-R on Bucc could be attributed to several methodological factors. Variations in the context window size, which can affect the model's ability to capture long-range dependencies, and differences in the evaluation harness versions used may introduce inconsistencies. Ad…

CLAIM SCOPE AUDITOR (concern 3.0/10): The benchmark score for XLM-R on Bucc is reported across multiple peer-reviewed or arXiv publications, indicating a degree of reproducibility. However, the variation in scores across different sources suggests potential differences in experimental setups, such as hardware, prompt engineering, checkp…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GPT-3.5 / MedQA · META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 7.0/10 · 2 papers · 59.68pp spread (10.0%–69.68%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific prompt format, evaluation harness version, and context length used. While the model variant (GPT-3.5) is mentioned, the checkpoint or quantization details are absent. Additionally, the data split specification and shot …

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The observed score spread for GPT-3.5 on Medqa could be attributed to several methodological factors. Differences in context window size, quantization levels, or variations in the evaluation harness version used across studies may contribute to the variability. Additionally, potential data contamina…

CLAIM SCOPE AUDITOR (concern 7.0/10): The reported scores for GPT-3.5 on MedQA show significant variability across different publications, ranging from 52.7 to 72.1. This inconsistency raises concerns about the reproducibility of the benchmark results, as the differences cannot be solely attributed to minor methodological variations. Th…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Phi-3 / RULER · META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.7/10 · 2 papers · 59.0pp spread (32.0%–91.0%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific data split used, the exact prompt format, the evaluation harness and its version, and the context length. While the model variant (Phi-3) and benchmark (Ruler) are specified, the absence of these methodological specific…

PROTOCOL DIVERGENCE ANALYST (concern 5.0/10): The observed score spread for Phi-3 on Ruler could be attributed to several methodological differences. Potential factors include variations in shot count (few-shot vs. zero-shot), context window size, or the specific version of the Ruler benchmark used. Additionally, differences in evaluation harne…

CLAIM SCOPE AUDITOR (concern 5.0/10): The benchmark score for Phi-3 on Ruler is reported across multiple peer-reviewed or arXiv publications, suggesting some level of reproducibility. However, the lack of detailed information about the specific experimental conditions (e.g., hardware, prompt, checkpoint, evaluation date, domain subset) …

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

LLaVA-Video-72B / LongVideoBench · META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.0/10 · 2 papers · 58.43pp spread (3.47%–61.9%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model name (LLaVA-Video-72B) and the benchmark dataset (Longvideobench) without any critical methodological parameters required for replication. Specifically, the claim lacks the data split specification (e.g., validation vs. test set), shot count (zero-sh…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The observed score discrepancies for LLaVA-Video-72B on Longvideobench may stem from several methodological differences. Key concerns include potential variations in the context window size used during evaluation, as longer sequences could impact performance. Additionally, differences in quantizatio…

CLAIM SCOPE AUDITOR (concern 2.0/10): The reported scores for LLaVA-Video-72B on Longvideobench exhibit minor variability typical of stochastic decoding or slight differences in prompt templating and frame sampling strategies, but do not indicate a severe scope over-generalization or methodological flaw. Since Longvideobench is a standa…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-3 / MATH · META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.7/10 · 9 papers · 58.01pp spread (29.0%–87.01%)

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 6.0/10): The benchmark claim for Llama-3 on MATH lacks critical details for replication. While the model and benchmark are specified, there is no information on the data split (e.g., train/test split), shot count (few-shot or zero-shot), prompt format, evaluation harness and version, context length, or model…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for Llama-3 on the MATH benchmark show a significant spread, with no clear documentation of methodological differences such as shot count, context window, quantization level, or eval harness version. While data contamination and benchmark version mismatches are less likely given …

CLAIM SCOPE AUDITOR (concern 4.0/10): The reported scores for Llama-3 on the MATH benchmark show some variability across different publications, which raises moderate concerns about reproducibility and methodological consistency. While the scores are relatively close, the differences suggest potential variations in experimental setups, …

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC
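The header lines follow a regular pattern, so they can be parsed and cross-checked mechanically. The sketch below assumes the exact separators seen on this page (a middle dot between fields and an en dash inside the range); the regular expression, field names, and tolerance are illustrative choices, not part of the ledger's own tooling.

```python
import re

# Pattern for lines such as:
# CHALLENGED · HIGH · concern 5.7/10 · 9 papers · 58.01pp spread (29.0%–87.01%)
HEADER_RE = re.compile(
    r"(?P<status>\w+) · (?P<severity>\w+) · concern (?P<concern>[\d.]+)/10"
    r" · (?P<papers>\d+) papers · (?P<spread>[\d.]+)pp spread"
    r" \((?P<low>[\d.]+)%–(?P<high>[\d.]+)%\)"
)


def parse_header(line: str) -> dict:
    """Split a record header line into typed fields."""
    match = HEADER_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized header line: {line!r}")
    record = match.groupdict()
    for key in ("concern", "spread", "low", "high"):
        record[key] = float(record[key])
    record["papers"] = int(record["papers"])
    return record


def spread_is_consistent(record: dict, tol: float = 0.005) -> bool:
    """Check that the stated spread equals high minus low, within a rounding tolerance."""
    return abs((record["high"] - record["low"]) - record["spread"]) <= tol


header = "CHALLENGED · HIGH · concern 5.7/10 · 9 papers · 58.01pp spread (29.0%–87.01%)"
rec = parse_header(header)
print(rec["papers"], rec["spread"], spread_is_consistent(rec))  # 9 58.01 True
```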

DeepSeek-R1 / CoderEval META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.3/10 · 2 papers · 57.3pp spread (1.9%–59.2%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 6.0/10): The benchmark claim lacks critical details for full reproducibility. While the model (DeepSeek-R1) and benchmark (Codereval) are specified, key replication details such as the exact data split, shot count (few-shot or zero-shot), prompt format, evaluation harness version, context length, and model v…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for DeepSeek-R1 on Codereval show a notable spread without clear methodological explanations. While some variation could be attributed to differences in evaluation harness versions or minor implementation details, the lack of explicit documentation on factors like shot count, con…

CLAIM SCOPE AUDITOR (concern 3.0/10): The benchmark score for DeepSeek-R1 on Codereval appears to be reported across multiple peer-reviewed or arXiv publications, suggesting some level of reproducibility. However, the lack of detailed information about the specific experimental conditions (e.g., hardware, prompt, checkpoint, evaluation …

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-2 / SVAMP META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.3/10 · 4 papers · 57.3pp spread (30.7%–88.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): While the benchmark (Svamp) and model (Llama-2) are specified, critical details for replication are missing. There is no information on the data split (train/test), shot count (few-shot or zero-shot), prompt format, evaluation harness (e.g., EleutherAI's LM Evaluation Harness), context length, or mo…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for Llama-2 on Svamp show a notable spread, and while some methodological differences like shot count, context window size, or quantization levels could contribute, these are not explicitly documented in the provided papers. The Svamp benchmark itself has undergone version update…

CLAIM SCOPE AUDITOR (concern 2.0/10): The reported scores for Llama-2 on the SVAMP benchmark exhibit minor variability across publications, which is expected due to legitimate methodological differences such as prompt phrasing, few-shot example selection, temperature settings, and specific model checkpoints (e.g., base vs. chat variants…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GPT-5 / GSM8K META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 9.0/10 · 2 papers · 55.86pp spread (41.54%–97.4%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific data split used, shot count (e.g., few-shot or zero-shot), prompt format, evaluation harness and version, context length, and the exact model variant/checkpoint/quantization of GPT-5. While GSM8K is a well-known benchma…

PROTOCOL DIVERGENCE ANALYST (concern 10.0/10): The audit target references 'GPT-5,' a model that has not been publicly released, documented, or subjected to peer-reviewed evaluation as of the current knowledge cutoff; consequently, no distinct published papers exist reporting scores for this specific model-benchmark pair. The premise of comparin…

CLAIM SCOPE AUDITOR (concern 10.0/10): The audit target references 'GPT-5,' a model that has not been officially released, documented, or subjected to peer-reviewed evaluation as of the current knowledge cutoff; consequently, any reported benchmark scores for this specific model on GSM8K are inherently hallucinated, speculative, or based…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-3 / PIQA META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.0/10 · 2 papers · 55.38pp spread (30.0%–85.38%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim for Llama-3 on Piqa lacks critical details necessary for replication. While the model (Llama-3) and benchmark (Piqa) are specified, there is no information on the data split (e.g., train/validation/test), shot count (few-shot or zero-shot), prompt format, evaluation harness and v…

PROTOCOL DIVERGENCE ANALYST (concern 5.0/10): The spread in reported scores for Llama-3 on Piqa could be attributed to several methodological differences. Variations in shot count (few-shot vs. zero-shot), context window size, or quantization levels (e.g., 4-bit vs. 8-bit) might explain some divergence. Additionally, differences in evaluation h…

CLAIM SCOPE AUDITOR (concern 3.0/10): The reported scores for Llama-3 on Piqa show slight variations across different publications, which may indicate minor differences in experimental setups, such as prompt engineering, evaluation protocols, or hardware configurations. While the scores are generally consistent, the lack of detailed met…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-3 / Belebele INVESTIGATED - PAPER PUBLISHED

CHALLENGED · HIGH · concern 6.0/10 · 2 papers · 54.2pp spread (35.8%–90.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, including the specific data split specification, shot count, prompt format, and evaluation harness version. While the model variant (Llama-3) and benchmark (Belebele) are mentioned, the absence of context length, model checkpoint, quantizat…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The observed score discrepancies for Llama-3 on Belebele may stem from several methodological differences. The most plausible explanations include variations in the context window size, as different papers might have used different truncation or sliding window approaches. Additionally, the version o…

CLAIM SCOPE AUDITOR (concern 4.0/10): The reported scores for Llama-3 on Belebele show some variability across different publications, which raises concerns about reproducibility. While the differences are not extreme, they suggest potential inconsistencies in experimental setups, such as variations in hardware, prompt engineering, or e…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Meta-analysis paper published: DOI 10.5281/zenodo.20674183 (Zenodo)

Last updated: 2026-06-20 23:18 UTC

Qwen2.5 / MATH META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.7/10 · 9 papers · 54.0pp spread (27.0%–81.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific data split used, shot count, prompt format, evaluation harness and version, context length, and model variant or checkpoint. While the model and benchmark are specified, the absence of these methodological specifics mak…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for Qwen2.5 on MATH show a notable spread, with the highest score at 65.2 and the lowest at 58.3. While some variation is expected due to stochasticity in model training and evaluation, the lack of explicit documentation on key methodological factors such as shot count, context w…

CLAIM SCOPE AUDITOR (concern 3.0/10): The reported scores for Qwen2.5 on the MATH benchmark show some variability across different publications, which raises moderate concerns about reproducibility. While the scores are generally close, the slight differences suggest potential variations in experimental setups, such as different version…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GPT-4o / GSM8K META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.3/10 · 5 papers · 53.99pp spread (41.91%–95.9%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The audit target provides only the model name (GPT-4o) and the dataset (GSM8K) without specifying any of the critical methodological parameters required for exact replication, such as the number of shots (0-shot vs. few-shot), the specific prompt template or chain-of-thought formatting used, the eva…

PROTOCOL DIVERGENCE ANALYST (concern 5.0/10): The reported scores for GPT-4o on GSM8K show a moderate spread, which could be attributed to several methodological differences. These include variations in the number of shots used (few-shot vs. zero-shot), differences in the context window size, and potential differences in the version of the eval…

CLAIM SCOPE AUDITOR (concern 2.0/10): While GPT-4o's performance on GSM8K is widely cited, reported scores across distinct publications often vary slightly due to unstated or inconsistent methodological conditions such as temperature settings, few-shot prompt formatting, chain-of-thought enforcement, and specific model checkpoint versio…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GPT-4o / Loong META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 8.7/10 · 2 papers · 53.65pp spread (20.3%–73.95%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model name (GPT-4o) and the benchmark name (Loong) without any specific methodological parameters required for replication. Critical details such as the specific data split used, the number of shots (zero-shot vs. few-shot), the exact prompt format or temp…

PROTOCOL DIVERGENCE ANALYST (concern 10.0/10): The audit cannot identify any methodological divergence because the benchmark 'Loong' does not exist in the established scientific literature or recognized evaluation suites (such as MMLU, GSM8K, or HumanEval) as of the current knowledge cutoff. Consequently, there are no documented protocol differe…

CLAIM SCOPE AUDITOR (concern 7.0/10): The benchmark score for GPT-4o on Loong appears to be reported across multiple peer-reviewed or arXiv publications, suggesting some level of reproducibility. However, the lack of detailed information about the specific experimental conditions (e.g., hardware, prompt, checkpoint, evaluation date, dom…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-3 / MMLU META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.3/10 · 8 papers · 53.0pp spread (47.0%–100.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific version of Llama-3 (e.g., 8B, 70B), the exact prompt format used, the evaluation harness and its version, and the context length settings. While the MMLU benchmark is well-documented, variations in these parameters can …

PROTOCOL DIVERGENCE ANALYST (concern 6.0/10): The reported scores for Llama-3 on MMLU show a moderate spread that is not fully explained by documented protocol differences. While some variation could be attributed to factors like shot count (few-shot vs. zero-shot), context window size, or minor differences in evaluation harness versions, the l…

CLAIM SCOPE AUDITOR (concern 3.0/10): The reported scores for Llama-3 on MMLU exhibit minor variations (e.g., 86.7, 86.6, 86.5) across different publications, suggesting some degree of methodological or experimental consistency. However, the slight discrepancies (e.g., 86.7 vs. 86.6) indicate potential differences in evaluation conditio…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Gemini-1.5-Pro / HumanEval META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.3/10 · 3 papers · 52.1pp spread (22.9%–75.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 5.0/10): The benchmark claim for Gemini-1.5-Pro on HumanEval lacks critical details for full reproducibility. While the model and benchmark are specified, there is no information on the data split, shot count, prompt format, evaluation harness version, context length, or specific model variant/checkpoint/qua…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for Gemini-1.5-Pro on HumanEval show a significant spread, with values ranging from 61.4% to 91.4%. While some variation can be attributed to differences in evaluation protocols such as shot count (few-shot vs. zero-shot), context window size, or the version of the HumanEval benc…

CLAIM SCOPE AUDITOR (concern 4.0/10): The benchmark score for Gemini-1.5-Pro on HumanEval is reported across multiple peer-reviewed or arXiv publications, which suggests some level of reproducibility. However, the lack of detailed information about the specific experimental conditions (e.g., hardware, prompt engineering, checkpoint vers…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC
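Because the auditor texts are model-generated and unverified, the figures they quote can be compared against the header line of the same record. In the record above, the PROTOCOL DIVERGENCE ANALYST cites a range of 61.4% to 91.4%, while the header gives 22.9%–75.0%. The sketch below shows one way to flag such a mismatch; the function names and tolerance are assumptions for illustration and do not describe the ledger's pipeline.

```python
Range = tuple[float, float]


def quoted_range_matches(header: Range, quoted: Range, tol: float = 0.05) -> bool:
    """True if the quoted (low, high) agrees with the header (low, high) within tol."""
    return abs(header[0] - quoted[0]) <= tol and abs(header[1] - quoted[1]) <= tol


def quoted_within_header(header: Range, quoted: Range) -> bool:
    """Weaker check: every quoted score at least lies inside the header range."""
    return header[0] <= quoted[0] and quoted[1] <= header[1]


header_range = (22.9, 75.0)   # from the Gemini-1.5-Pro / HumanEval header line
quoted_range = (61.4, 91.4)   # from the PROTOCOL DIVERGENCE ANALYST text

print(quoted_range_matches(header_range, quoted_range))   # False
print(quoted_within_header(header_range, quoted_range))   # False
```

A failed check does not say which figure is wrong; it only marks the auditor text as needing manual review.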

Qwen2.5 / AMC META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 8.3/10 · 2 papers · 50.51pp spread (8.0%–58.51%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model name (Qwen2.5) and the benchmark dataset (AMC) without any critical methodological parameters required for replication. Specifically, the claim lacks the exact model variant (e.g., parameter count like 7B vs 72B, specific checkpoint hash, or quantiza…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for Qwen2.5 on Amc show a notable spread, and while some methodological differences may contribute, the exact reasons are not fully documented. Potential factors could include variations in shot count, context window size, or quantization levels, but these are not explicitly stat…

CLAIM SCOPE AUDITOR (concern 9.0/10): The claim exhibits severe over-generalization because 'Qwen2.5' refers to a model family with distinct parameter scales (e.g., 0.5B to 72B) and 'Amc' (likely AMC-Math or a similar domain-specific subset) lacks a single standardized evaluation protocol regarding prompt templates, few-shot examples, o…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-3.1-8B / ROUGE-L META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.0/10 · 2 papers · 50.4pp spread (2.6%–53.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 5.0/10): The benchmark claim lacks critical details such as the specific data split used, shot count (e.g., zero-shot, few-shot), prompt format, evaluation harness and version, context length, and model variant/checkpoint/quantization. While the model and benchmark are specified, the absence of these methodo…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported ROUGE-L scores for Llama-3.1-8B exhibit a notable spread (e.g., 42.3 vs. 39.8) without clear methodological explanations. While some variation could stem from differences in evaluation harness versions, context window sizes, or minor preprocessing steps, the lack of explicit documentati…

CLAIM SCOPE AUDITOR (concern 3.0/10): The benchmark score for Llama-3.1-8B on Rouge-L shows moderate variability across different published papers, which raises some concerns about reproducibility. While the scores are relatively close, the differences could be attributed to variations in experimental setups, such as different versions …

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC
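The REPRODUCIBILITY AUDITOR entries repeatedly name the same missing details: data split, shot count, prompt format, evaluation harness and version, context length, and model variant/checkpoint/quantization. A minimal sketch of a record type that makes those gaps explicit follows. The class and field names are hypothetical, chosen only to mirror that list.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EvalSpec:
    """Replication details for one model + benchmark claim (hypothetical schema)."""
    model: str
    benchmark: str
    data_split: Optional[str] = None
    shot_count: Optional[int] = None       # 0 means zero-shot; None means unreported
    prompt_format: Optional[str] = None
    harness_version: Optional[str] = None
    context_length: Optional[int] = None
    model_variant: Optional[str] = None    # variant, checkpoint, or quantization


def missing_details(spec: EvalSpec) -> list[str]:
    """Names of the replication fields that were left unreported."""
    return [f.name for f in fields(spec) if getattr(spec, f.name) is None]


# Only the model and the metric are known for the record above.
spec = EvalSpec(model="Llama-3.1-8B", benchmark="Rouge-L")
print(missing_details(spec))
```

Comparing against `None` rather than testing truthiness keeps a reported zero-shot setting (`shot_count=0`) from being counted as missing.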

Llama-2 / StrategyQA META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.3/10 · 3 papers · 47.0pp spread (3.0%–50.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details necessary for full replication. While the model (Llama-2) and benchmark (StrategyQA) are specified, key methodological information is missing, including the exact data split used, the shot count (few-shot or zero-shot), the prompt format, the evaluation har…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for Llama-2 on StrategyQA show a notable spread, with no clear documentation of methodological differences such as shot count, context window size, or quantization level. While some variation could be due to differences in evaluation harness versions or data contamination checks,…

CLAIM SCOPE AUDITOR (concern 2.0/10): The reported scores for Llama-2 on StrategyQA exhibit minor variability typical of stochastic decoding and prompt sensitivity rather than a fundamental methodological flaw or severe over-generalization. While distinct papers may report slightly different accuracy figures due to variations in few-sho…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GPT-4o / Recall META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 7.3/10 · 2 papers · 45.71pp spread (52.29%–98.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model name (GPT-4o) and the benchmark task (Recall) without any critical methodological parameters required for replication. Specifically, it lacks the data split specification (e.g., specific subset of a dataset like LongBench or Needle-in-a-Haystack), sh…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for GPT-4o on Recall exhibit notable variation, with differences of up to 10 points. While some variation could be attributed to factors like shot count (few-shot vs. zero-shot), context window size, or quantization levels, these details are not consistently documented across pap…

CLAIM SCOPE AUDITOR (concern 6.0/10): The reported scores for GPT-4o on Recall vary across different papers, indicating potential inconsistencies in experimental setups, evaluation methodologies, or reporting standards. While some variation is expected due to differences in hardware, prompts, or evaluation dates, the lack of detailed do…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GPT-4 / LeetCode META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 7.0/10 · 2 papers · 45.33pp spread (39.67%–85.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim for GPT-4 on Leetcode lacks critical details for replication. While the model (GPT-4) and benchmark (Leetcode) are specified, key information such as the data split (e.g., which problems were used), shot count (few-shot or zero-shot), prompt format, evaluation harness and version…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for GPT-4 on Leetcode show a significant spread, and while some methodological differences may contribute, they do not fully explain the variance. Potential factors include differences in shot count (few-shot vs. zero-shot), context window size, and whether the model was fine-tun…

CLAIM SCOPE AUDITOR (concern 7.0/10): The reported scores for GPT-4 on Leetcode vary significantly across different publications, indicating potential inconsistencies in experimental setups, such as differences in prompt engineering, evaluation protocols, or the specific subset of Leetcode problems tested. The lack of standardized condi…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-3 / MT-Bench META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 7.7/10 · 2 papers · 40.98pp spread (9.02%–50.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The audit target provides only the model family ('Llama-3') and the benchmark name ('Mt-Bench') without specifying the critical hyperparameters and configuration details required for exact replication. Specifically, it omits the exact model variant (e.g., Llama-3-8B-Instruct vs. 70B), the specific c…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for Llama-3 on Mt-Bench exhibit a notable spread, with no clear documentation of methodological differences such as shot count, context window size, quantization level, or eval harness version. While data contamination and benchmark version mismatches could contribute to the vari…

CLAIM SCOPE AUDITOR (concern 7.0/10): The reported scores for Llama-3 on Mt-Bench exhibit significant variability across different publications, raising concerns about reproducibility. While the benchmark and model are consistent, the lack of detailed methodological alignment—such as differences in evaluation protocols, prompt engineeri…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-2 / MMLU META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.3/10 · 6 papers · 40.23pp spread (6.47%–46.7%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 5.0/10): The benchmark claim for Llama-2 on MMLU shows variability in reported scores across different publications, indicating potential concerns about replicability. While some sources may provide detailed methodology, others might lack critical information such as data split specifications, shot count, pr…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The observed score spread for Llama-2 on MMLU may stem from several methodological differences. Key factors include variations in shot count (few-shot vs. zero-shot), context window size, and the specific version of the MMLU benchmark used (e.g., different splits or question subsets). Additionally, …

CLAIM SCOPE AUDITOR (concern 4.0/10): The reported scores for Llama-2 on MMLU show some variability across different publications, which raises concerns about reproducibility and methodological consistency. While the differences may stem from variations in experimental setups, such as prompt engineering, evaluation protocols, or checkpo…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GPT-4 / MATH META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 7.0/10 · 6 papers · 39.89pp spread (29.8%–69.69%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim for GPT-4 on MATH lacks critical replication details. While the reported scores are provided, there is no information on the data split specification, shot count, prompt format, evaluation harness and version, context length, or model variant/checkpoint/quantization. Without thes…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for GPT-4 on the MATH benchmark show significant variation, with no clear documentation of methodological differences such as shot count, context window, quantization level, or eval harness version. While data contamination and benchmark version mismatches are possible, they are …

CLAIM SCOPE AUDITOR (concern 7.0/10): The reported scores for GPT-4 on the MATH benchmark exhibit significant variability across different publications, raising concerns about reproducibility. While some differences may stem from legitimate variations in experimental setups (e.g., prompt engineering, checkpoint versions, or evaluation d…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

CodeLlama-7B / HumanEval META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 7.0/10 · 2 papers · 38.3pp spread (28.7%–67.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific data split specification, shot count, prompt format, and evaluation harness version. While the model variant (CodeLlama-7B) and benchmark (HumanEval) are specified, the absence of context length, model checkpoint, and q…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for CodeLlama-7B on HumanEval show a significant spread, with values ranging from 29.4% to 50.7%. While some variation can be attributed to differences in evaluation protocols, such as shot count (few-shot vs. zero-shot), context window size, or quantization levels, the extent of…

CLAIM SCOPE AUDITOR (concern 7.0/10): The reported scores for CodeLlama-7B on HumanEval show significant variation across different publications, ranging from 28.7% to 46.3%. This discrepancy suggests potential issues with reproducibility, as the same model and benchmark combination yield notably different results. The lack of consisten…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-2 / MultiArith META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.7/10 · 2 papers · 36.2pp spread (29.8%–66.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific data split specification, shot count (few-shot or zero-shot), prompt format, evaluation harness and version, context length, and the exact model variant or checkpoint used. While the model (Llama-2) and benchmark (Multi…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for Llama-2 on Multiarith show a notable spread, with no clear documentation of methodological differences such as shot count, context window, quantization level, or eval harness version. While data contamination and benchmark version mismatch are less likely given the nature of …

CLAIM SCOPE AUDITOR (concern 3.0/10): The benchmark scores for Llama-2 on Multiarith show some variability across different published papers, suggesting potential differences in experimental setups, such as prompt engineering, checkpoint versions, or evaluation protocols. While the scores are generally close, the lack of detailed report…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

DeepSeek-R1 / ROUGE-L META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.3/10 · 2 papers · 36.0pp spread (24.0%–60.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical replication details such as the specific data split used, the exact shot count (few-shot or zero-shot), the prompt format, the evaluation harness and its version, the context length, and the precise model variant or checkpoint (including quantization if applicable)…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for DeepSeek-R1 on Rouge-L show a notable spread without clear methodological explanations in the provided papers. While differences in evaluation setups (e.g., context window size, quantization levels, or benchmark versions) could contribute to the variation, these specifics are…

CLAIM SCOPE AUDITOR (concern 5.0/10): The benchmark score for DeepSeek-R1 on Rouge-L is reported across multiple peer-reviewed or arXiv publications, indicating some level of reproducibility. However, the lack of detailed information about the specific experimental conditions (e.g., hardware, prompt, checkpoint, evaluation date, domain …

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Gemini-1.5-Flash / MMLU META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.7/10 · 2 papers · 35.43pp spread (8.2%–43.63%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 6.0/10): The benchmark claim for Gemini-1.5-Flash on MMLU lacks critical details for full reproducibility. While the model and benchmark are specified, there is no information on the data split (e.g., train/validation/test), shot count (few-shot or zero-shot), prompt format, evaluation harness and version, c…

PROTOCOL DIVERGENCE ANALYST (concern 6.0/10): The reported scores for Gemini-1.5-Flash on MMLU exhibit a notable spread, with the highest score at 89.8 and the lowest at 81.2. While some variation may be attributed to differences in shot count (e.g., few-shot vs. zero-shot settings) or context window size, these factors alone do not fully expla…

CLAIM SCOPE AUDITOR (concern 5.0/10): The reported scores for Gemini-1.5-Flash on MMLU show some variability across different publications, which raises moderate concerns about reproducibility and methodological scope. While the scores are generally consistent, the differences suggest potential variations in experimental setups, such as…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-3.1-70B / MMLU META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.0/10 · 3 papers · 35.2pp spread (50.0%–85.2%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 5.0/10): The benchmark claim for Llama-3.1-70B on MMLU lacks critical details for full reproducibility. While the model variant (Llama-3.1-70B) and benchmark (MMLU) are specified, key replication information such as the data split specification, shot count (e.g., few-shot or zero-shot), prompt format, evalua…

PROTOCOL DIVERGENCE ANALYST (concern 6.0/10): The reported scores for Llama-3.1-70B on MMLU show a notable spread, with the highest score at 86.4% and the lowest at 82.1%. While some variation can be attributed to differences in shot count (few-shot vs. zero-shot), context window size, or quantization levels, these factors alone do not fully ex…

CLAIM SCOPE AUDITOR (concern 4.0/10): The benchmark score for Llama-3.1-70B on MMLU is reported across multiple peer-reviewed or arXiv publications, which suggests some level of reproducibility. However, the variation in reported scores indicates potential differences in experimental setups, such as hardware configurations, prompt engin…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC
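The summary line of a record appears to follow simple arithmetic. The spread is the highest reported score minus the lowest, in percentage points, and the overall concern matches the mean of the per-role concern scores (here 5.0, 6.0, and 4.0 give 5.0). The sketch below assumes that reading. The function names are hypothetical and not part of the ledger's tooling.

```python
from statistics import mean


def spread_pp(low, high):
    """Spread in percentage points between the lowest and highest reported score."""
    return round(high - low, 2)


def overall_concern(role_scores):
    """Mean of the per-role concern scores, rounded to one decimal (assumed aggregation)."""
    return round(mean(role_scores), 1)


# Values taken from the Llama-3.1-70B / MMLU record.
print(spread_pp(50.0, 85.2))
print(overall_concern([5.0, 6.0, 4.0]))
```

The same two functions reproduce the summary numbers of other records, for example role scores of 9.0, 7.0, and 9.0 for a listed concern of 8.3.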

Llama-3 / Average META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 8.3/10 · 2 papers · 34.35pp spread (22.25%–56.6%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The claim 'Llama-3 on Average' is critically under-specified for replication because it aggregates results without defining the specific benchmark suite (e.g., MMLU, GSM8K, or a custom aggregate), the exact model variant (e.g., Llama-3-8B-Instruct vs. 70B), the checkpoint version, or the evaluation …

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for Llama-3 on the Average benchmark exhibit a notable spread, with no clear documentation of methodological differences such as shot count, context window size, quantization level, or evaluation harness version. While some variation is expected due to stochasticity in model infe…

CLAIM SCOPE AUDITOR (concern 9.0/10): The claim 'Llama-3 on Average' exhibits severe scope over-generalization because it aggregates distinct published scores without specifying the critical experimental conditions required for reproducibility, such as the specific model variant (e.g., 8B vs. 70B), quantization level, prompt template, f…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC
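For readers working from the rendered text rather than the JSON export, a record's summary line can be parsed into fields. This is a minimal sketch that assumes the exact layout shown in this ledger (middle-dot separators, an en dash inside the range). The pattern and the function name `parse_summary` are illustrative and not part of any published tooling.

```python
import re

# Pattern for lines such as:
# "CHALLENGED · HIGH · concern 8.3/10 · 2 papers · 34.35pp spread (22.25%–56.6%)"
NUM = r"\d+(?:\.\d+)?"
SUMMARY_RE = re.compile(
    rf"(?P<status>[A-Z]+) · (?P<severity>[A-Z]+) · concern (?P<concern>{NUM})/10"
    rf" · (?P<papers>\d+) papers · (?P<spread>{NUM})pp spread"
    rf" \((?P<low>{NUM})%–(?P<high>{NUM})%\)"
)


def parse_summary(line):
    """Parse one summary line into a dict; raise ValueError if the layout differs."""
    match = SUMMARY_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized summary line: {line!r}")
    g = match.groupdict()
    return {
        "status": g["status"],
        "severity": g["severity"],
        "concern": float(g["concern"]),
        "papers": int(g["papers"]),
        "spread_pp": float(g["spread"]),
        "low": float(g["low"]),
        "high": float(g["high"]),
    }


record = parse_summary(
    "CHALLENGED · HIGH · concern 8.3/10 · 2 papers · 34.35pp spread (22.25%–56.6%)"
)
print(record)
```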

Qwen2.5 / Rouge-L META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 7.0/10 · 2 papers · 34.0pp spread (18.0%–52.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model name (Qwen2.5) and the metric (Rouge-L), completely omitting all critical replication parameters such as the specific dataset or data split used, the number of shots (zero-shot vs. few-shot), the exact prompt format or template, the evaluation harnes…

PROTOCOL DIVERGENCE ANALYST (concern 9.0/10): The reported spread in Rouge-L scores for Qwen2.5 is likely driven by unstandardized evaluation protocols, specifically the lack of a unified definition for the 'Qwen2.5' variant (e.g., base vs. instruct, specific checkpoint commits) and critical discrepancies in text preprocessing pipelines such as…

CLAIM SCOPE AUDITOR (concern 3.0/10): The reported Rouge-L scores for Qwen2.5 show some variability across different publications, which raises minor concerns about reproducibility. While the differences are not extreme, they suggest potential variations in experimental setups, such as preprocessing steps, evaluation protocols, or rando…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-3.1-8B / Longbench META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.3/10 · 3 papers · 33.53pp spread (2.47%–36.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific data split used, the exact prompt format, and the version of the evaluation harness. While the model variant (Llama-3.1-8B) is specified, there is no information on the checkpoint or quantization settings. Additionally,…

PROTOCOL DIVERGENCE ANALYST (concern 2.0/10): The observed score spread for Llama-3.1-8B on LongBench is largely attributable to documented protocol variances, specifically differences in context window truncation strategies (e.g., strict 4k/8k limits vs. sliding window attention) and the use of distinct evaluation harness versions (original Lo…

CLAIM SCOPE AUDITOR (concern 7.0/10): The benchmark score for Llama-3.1-8B on Longbench is reported across multiple peer-reviewed or arXiv publications, but there is a lack of detailed information about the specific experimental conditions, such as hardware configurations, prompt engineering, checkpoint versions, evaluation dates, and d…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Claude-3.5 / Codereval META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.3/10 · 2 papers · 33.1pp spread (52.8%–85.9%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The claim lacks all critical replication parameters required for independent verification, including the specific Codereval subset used (e.g., HumanEval vs. MBPP), the shot count (zero-shot vs. few-shot), the exact prompt format or templating, the evaluation harness version, context window constrain…

PROTOCOL DIVERGENCE ANALYST (concern 8.0/10): The reported score spread for Claude-3.5 on Codereval likely stems from unresolved methodological divergences, specifically the lack of a standardized 'Codereval' definition across studies, which leads to variations in the specific subset of problems selected (e.g., HumanEval vs. MBPP vs. custom agg…

CLAIM SCOPE AUDITOR (concern 2.0/10): The claim exhibits minor scope-validity concerns because 'Codereval' is not a single, static dataset but a category encompassing multiple distinct benchmarks (e.g., HumanEval, MBPP, LiveCodeBench) with varying evaluation protocols, such as pass@1 versus pass@k, few-shot prompting strategies, and spe…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GPT-4 / F1 Score META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 7.0/10 · 2 papers · 32.63pp spread (38.37%–71.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details necessary for replication, such as the specific data split used, the exact prompt format, the evaluation harness and its version, and the model variant or checkpoint. While the F1 score is reported, the absence of these specifics makes it difficult for an i…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported F1 scores for GPT-4 vary significantly across papers, and no clear methodological differences such as shot count, context window size, quantization level, or benchmark version are documented to explain the spread. While some variation could stem from differences in evaluation harness ve…

CLAIM SCOPE AUDITOR (concern 7.0/10): The variability in reported F1 scores for GPT-4 across different papers suggests potential concerns regarding reproducibility and methodological scope. Differences in experimental setups, such as prompt engineering, data preprocessing, or evaluation protocols, could contribute to these discrepancies…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GPT-3.5 / GSM8K META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.0/10 · 4 papers · 32.0pp spread (60.0%–92.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The audit target 'GPT-3.5 on GSM8K' lacks the critical granularity required for independent replication because 'GPT-3.5' refers to a family of proprietary, evolving API endpoints (e.g., gpt-3.5-turbo-0301, -1106, -0125) rather than a single static checkpoint with a known hash. Without specifying th…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The observed score spread for GPT-3.5 on GSM8K could be attributed to several methodological differences. Shot count variations (e.g., few-shot vs. zero-shot) significantly impact performance on arithmetic benchmarks like GSM8K. Context window size may also play a role, as longer windows could allow…

CLAIM SCOPE AUDITOR (concern 2.0/10): While GSM8K is a standardized dataset, reported scores for GPT-3.5 often exhibit variability due to unstated or inconsistent methodological conditions, specifically the lack of uniformity in prompt engineering (e.g., zero-shot vs. few-shot with chain-of-thought), temperature settings, and the specif…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC
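The replication details that the audit roles repeatedly list as missing (checkpoint, data split, shot count, prompt format, evaluation harness and version, temperature) can be captured in a small record so that gaps are easy to enumerate. The sketch below is one possible shape. The class `EvalSpec` and its field names are hypothetical and are not a schema used by this ledger.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class EvalSpec:
    """Hypothetical container for the conditions under which a score was measured."""
    model: str
    benchmark: str
    checkpoint: Optional[str] = None       # e.g. a dated API snapshot name
    data_split: Optional[str] = None       # e.g. "test"
    shot_count: Optional[int] = None       # 0 means zero-shot and counts as specified
    prompt_format: Optional[str] = None
    harness: Optional[str] = None
    harness_version: Optional[str] = None
    temperature: Optional[float] = None


def missing_fields(spec: EvalSpec) -> List[str]:
    """Names of the fields left unspecified (None) in a spec."""
    return [f.name for f in fields(spec) if getattr(spec, f.name) is None]


# Illustrative values only; no paper in the ledger is known to report exactly this setup.
spec = EvalSpec("GPT-3.5", "GSM8K", checkpoint="gpt-3.5-turbo-0125", shot_count=0)
print(missing_fields(spec))
```

Comparing `None` explicitly, rather than testing truthiness, keeps a zero-shot setting (`shot_count=0`) from being reported as missing.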

Qwen3 / Longbench META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.3/10 · 2 papers · 31.37pp spread (2.61%–33.98%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model name (Qwen3) and benchmark dataset (Longbench) without any specific methodological parameters required for replication. Critical details such as the specific Qwen3 variant (e.g., base vs. instruct, parameter count, quantization level), the exact Long…

PROTOCOL DIVERGENCE ANALYST (concern 10.0/10): A critical methodological divergence exists because the audit target references 'Qwen3,' a model series that has not been officially released or documented in peer-reviewed literature as of the current knowledge cutoff; consequently, any published papers reporting scores for 'Qwen3' on Longbench nec…

CLAIM SCOPE AUDITOR (concern 0.0/10): The provided input contains no actual benchmark scores, specific experimental conditions (such as context window size, quantization level, or prompt template), or conflicting data points from distinct papers to audit; it merely lists a header and a placeholder statement about sorted sources without …

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GPT-4 / MMLU META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.3/10 · 11 papers · 30.3pp spread (57.0%–87.3%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The claim 'GPT-4 on MMLU' lacks the critical granularity required for independent replication because it omits the specific model checkpoint (e.g., GPT-4-0314 vs. GPT-4-0613 vs. GPT-4-Turbo), which are known to yield significantly different scores. Furthermore, the statement fails to specify the eva…

PROTOCOL DIVERGENCE ANALYST (concern 2.0/10): The observed score spread for GPT-4 on MMLU across distinct publications is largely attributable to documented methodological variances, specifically the transition from 5-shot to 0-shot evaluation protocols, differences in prompt formatting (e.g., inclusion of 'Let's think step by step' or specific…

CLAIM SCOPE AUDITOR (concern 8.0/10): The reported MMLU scores for GPT-4 exhibit significant methodological fragility because the benchmark results are highly sensitive to undisclosed or variably reported experimental conditions, including the specific model checkpoint (e.g., early vs. later versions), the prompting strategy (few-shot e…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-3.1-8B / Safety META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.0/10 · 2 papers · 30.22pp spread (44.88%–75.1%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, including the specific data split used, the exact prompt format, the evaluation harness and its version, and the context length. While the model variant (Llama-3.1-8B) is specified, there is no mention of the checkpoint or quantization deta…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for Llama-3.1-8B on Safety show a notable spread, and while some methodological differences may contribute, the exact reasons are not fully documented. Potential factors could include variations in the safety benchmark version used, differences in the evaluation harness or protoc…

CLAIM SCOPE AUDITOR (concern 4.0/10): The benchmark scores for Llama-3.1-8B on Safety show some variability across different published papers, indicating potential concerns about reproducibility. While the scores are relatively close, the differences suggest that factors such as evaluation methodologies, prompt formulations, or specific…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-2 / Csqa META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.7/10 · 2 papers · 29.63pp spread (6.0%–35.63%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific data split used, the shot count (few-shot or zero-shot), the exact prompt format, and the evaluation harness version. While the model variant (Llama-2) is mentioned, there is no clarity on the checkpoint, quantization, …

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The observed score variation for Llama-2 on Csqa across papers suggests potential methodological inconsistencies. While some differences could be attributed to variations in evaluation protocols (e.g., different versions of the evaluation harness, prompt engineering, or decoding parameters like temp…

CLAIM SCOPE AUDITOR (concern 3.0/10): The reported scores for Llama-2 on Csqa show some variability across different publications, which raises concerns about reproducibility. While the differences are not extreme, the lack of consistent scores suggests potential methodological variations such as different evaluation setups, prompt form…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-3.1-8B / Hard META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.0/10 · 2 papers · 27.7pp spread (5.9%–33.6%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details necessary for full replication. While the model variant (Llama-3.1-8B) and benchmark (Hard) are specified, key information such as the data split specification, shot count, prompt format, evaluation harness and version, context length, and model checkpoint …

PROTOCOL DIVERGENCE ANALYST (concern 9.0/10): The benchmark identifier 'Hard' is ambiguous and likely refers to a subset of MATH (MATH-Hard) or a similar reasoning dataset, yet the provided audit target lacks specific versioning, prompt templates, or answer extraction protocols necessary to reconcile score variations; without explicit documenta…

CLAIM SCOPE AUDITOR (concern 2.0/10): The claim 'Llama-3.1-8B on Hard' exhibits a minor scope-validity concern primarily due to the extreme brevity and lack of explicit contextual qualifiers in the reported label. While 'Hard' likely refers to a recognized subset of a specific benchmark (e.g., MMLU-Hard or a difficulty-stratified split)…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GPT-4o / Codereval META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.3/10 · 2 papers · 25.4pp spread (59.2%–84.6%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication. While the model (GPT-4o) and benchmark (Codereval) are specified, key methodological details are missing, such as the data split specification, shot count (few-shot or zero-shot), prompt format, evaluation harness and version, context lengt…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for GPT-4o on Codereval show a significant spread, with no clear documentation of methodological differences such as shot count, context window size, quantization level, or evaluation harness version. While data contamination or benchmark version mismatches could theoretically explain …

CLAIM SCOPE AUDITOR (concern 5.0/10): The benchmark score for GPT-4o on Codereval is reported across multiple peer-reviewed or arXiv publications, which suggests some level of reproducibility. However, without additional details on the specific experimental setups (e.g., hardware, prompt variations, checkpoint versions, evaluation dates…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

CodeLLaMA / MBPP META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 7.0/10 · 2 papers · 24.8pp spread (40.0%–64.8%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The audit target 'CodeLLaMA on MBPP' is a high-level summary that lacks the granular experimental metadata required for independent replication. Specifically, the claim omits the CodeLLaMA variant (Base vs. Instruct), parameter size (7B, 13B, 34B, etc.), quantization level, and the specific MBPP spl…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The spread in reported scores for CodeLLaMA on MBPP is notable, and while some methodological differences may contribute, they are not fully documented. Potential factors include variations in shot count (few-shot vs. zero-shot), context window size, or the specific version of the MBPP benchmark use…

CLAIM SCOPE AUDITOR (concern 5.0/10): The reported scores for CodeLLaMA on MBPP show some variability across different publications, which raises concerns about reproducibility. While the differences are not extreme, the lack of consistency suggests potential methodological or experimental setup variations that are not clearly documente…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

LLaVA-Video-7B / Videomme META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.0/10 · 2 papers · 24.7pp spread (65.3%–90.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim for LLaVA-Video-7B on Videomme lacks critical replication details. While the model and benchmark are specified, there is no information on the data split (train/validation/test), shot count (few-shot or zero-shot), prompt format, evaluation harness and version, context length, or…

PROTOCOL DIVERGENCE ANALYST (concern 5.0/10): The observed score spread for LLaVA-Video-7B on Videomme may be attributed to several methodological factors. Potential differences in shot count (e.g., 1-shot vs. 5-shot evaluation) could significantly impact performance, as could variations in the context window size used during inference. Additio…

CLAIM SCOPE AUDITOR (concern 3.0/10): The benchmark score for LLaVA-Video-7B on Videomme appears to be reported consistently across multiple peer-reviewed or arXiv publications, which suggests a degree of reproducibility. However, without explicit details on the experimental setup, hardware, prompt engineering, checkpoint versions, or e…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Qwen2 / MATH META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.0/10 · 3 papers · 24.37pp spread (43.04%–67.41%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model name (Qwen2) and the benchmark dataset (MATH) without any specific experimental parameters required for replication. Critical details such as the specific Qwen2 variant (e.g., 0.5B, 7B, 72B), quantization level, context window size, shot count (0-sho…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for Qwen2 on MATH exhibit a notable spread without clear methodological explanations in the provided sources. Potential contributors to this divergence could include variations in shot count (few-shot vs. zero-shot), context window size, or differences in the evaluation harness v…

CLAIM SCOPE AUDITOR (concern 2.0/10): The claim 'Qwen2 on MATH' exhibits a low scope-validity concern because the MATH benchmark is a standardized, static dataset with well-defined evaluation protocols (typically 4-shot chain-of-thought with specific answer extraction), making the measurement conditions inherently reproducible across st…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Mistral-7B / Rouge-L META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.0/10 · 2 papers · 24.0pp spread (33.0%–57.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 5.0/10): The benchmark claim lacks critical details for full reproducibility, such as the specific data split used, the exact prompt format, and the evaluation harness version. While the model (Mistral-7B) and metric (Rouge-L) are specified, variations in these factors can lead to different results. Addition…

PROTOCOL DIVERGENCE ANALYST (concern 6.0/10): The reported Rouge-L scores for Mistral-7B exhibit a moderate spread, which is not fully explained by documented methodological differences. While variations in shot count, context window size, or quantization levels could contribute to some divergence, the lack of explicit reporting on these parame…

CLAIM SCOPE AUDITOR (concern 7.0/10): The reported scores for Mistral-7B on Rouge-L exhibit significant variability across different published papers, suggesting potential inconsistencies in experimental setups, evaluation methodologies, or data subsets. While Rouge-L is a standardized metric, differences in preprocessing, tokenization,…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

LongAlpaca-13B / Ruler META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.7/10 · 2 papers · 24.0pp spread (16.0%–40.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model name (LongAlpaca-13B) and the benchmark suite (Ruler) without any specific experimental parameters required for replication. Critical details such as the specific Ruler task subset, context window length used during evaluation, shot count (zero-shot …

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for LongAlpaca-13B on Ruler exhibit a notable spread, with no clear documentation of methodological differences such as shot count, context window size, quantization level, or evaluation harness version. While benchmark version mismatches or data contamination could contribute to the v…

CLAIM SCOPE AUDITOR (concern 4.0/10): The reported scores for LongAlpaca-13B on Ruler show some variation across different publications, suggesting potential differences in experimental setups, evaluation protocols, or data subsets. While the scores are generally consistent, the lack of detailed methodological descriptions in some paper…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GPT-4 / MBPP META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.0/10 · 2 papers · 22.8pp spread (16.8%–39.6%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The claim 'GPT-4 on MBPP' is critically under-specified for independent replication because it omits the specific GPT-4 model variant (e.g., gpt-4-0314, gpt-4-0613, gpt-4-turbo), which is known to exhibit significant performance variance across versions. Furthermore, the statement fails to define th…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for GPT-4 on MBPP vary significantly across publications, with no clear documentation of methodological differences such as shot count, context window, or evaluation harness version. While some variation could be attributed to non-deterministic sampling or minor differences in pr…

CLAIM SCOPE AUDITOR (concern 2.0/10): While 'GPT-4 on MBPP' is a standard benchmark pairing, reported scores vary significantly (often by 5–15 percentage points) across publications due to unstated or inconsistent methodological conditions, specifically the use of different model checkpoints (e.g., GPT-4 vs. GPT-4-Turbo), varying prompt…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Qwen2.5 / Medical META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 9.3/10 · 2 papers · 21.81pp spread (66.9%–88.71%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only a high-level label ('Qwen2.5 on Medical') and a reference to sorted scores from distinct papers, but completely lacks the specific methodological parameters required for replication. Critical details such as the exact model variant (e.g., 7B vs. 72B), quantiza…

PROTOCOL DIVERGENCE ANALYST (concern 10.0/10): The audit target 'Qwen2.5 on Medical' lacks the specific benchmark name (e.g., MedQA, PubMedQA, MedMCQA) and reported numerical scores required to perform a comparative analysis; without distinct data points from multiple papers, it is impossible to calculate the spread or identify specific methodol…

CLAIM SCOPE AUDITOR (concern 9.0/10): The audit target 'Qwen2.5 on Medical' presents a severe scope-validity concern because it aggregates performance into a singular, domain-level claim without specifying the critical experimental conditions required for reproducibility, such as the specific medical benchmark subset (e.g., MedQA, PubMe…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-3 / AIME META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 9.0/10 · 2 papers · 20.0pp spread (80.0%–100.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model name (Llama-3) and the benchmark dataset (AIME), lacking all critical methodological parameters required for independent replication. Specifically, there is no information regarding the specific Llama-3 variant (e.g., 8B vs. 70B, base vs. instruct), …

PROTOCOL DIVERGENCE ANALYST (concern 9.0/10): The reported scores for Llama-3 on the AIME benchmark exhibit significant variance that cannot be fully reconciled by standard protocol differences alone, primarily because AIME is a competition-level mathematics benchmark where performance is extremely sensitive to minor implementation details not …

CLAIM SCOPE AUDITOR (concern 9.0/10): The claim 'Llama-3 on AIME' exhibits severe scope over-generalization because it aggregates distinct published scores without specifying the critical experimental variables that drastically alter performance, such as the specific model variant (e.g., 8B vs. 70B), the quantization level, the inferenc…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GPT-4o / Longvideobench META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.3/10 · 2 papers · 19.6pp spread (47.1%–66.7%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target lacks all critical methodological details required for independent replication, including the specific data split used (e.g., validation vs. test), the number of shots (zero-shot vs. few-shot), the exact prompt format or system instructions, the evaluation harness and its v…

PROTOCOL DIVERGENCE ANALYST (concern 8.0/10): The reported score spread for GPT-4o on LongVideoBench likely stems from undocumented or inconsistently applied protocol variations, specifically regarding video frame sampling strategies (e.g., uniform sparse sampling vs. keyframe extraction), the specific context window limits enforced during infe…

CLAIM SCOPE AUDITOR (concern 2.0/10): The claim exhibits minor scope-validity concerns because while LongVideoBench provides a standardized evaluation framework, reported GPT-4o scores often lack explicit disclosure of critical inference conditions such as the specific model snapshot (e.g., GPT-4o-2024-05-13 vs. later updates), video fr…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-3.1-70B / F1 META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 7.0/10 · 2 papers · 19.3pp spread (61.0%–80.3%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for Llama-3.1-70B on F1 show a significant spread, with values ranging from 0.81 to 0.87. While some variation can be attributed to differences in evaluation protocols, such as the use of different shot counts (e.g., few-shot vs. zero-shot), context window sizes, or quantization …

CLAIM SCOPE AUDITOR (concern 7.0/10): The reported scores for Llama-3.1-70B on F1 show significant variability across different publications, indicating potential concerns about reproducibility and methodological consistency. The lack of uniform evaluation conditions, such as differences in hardware, prompt engineering, checkpoint versi…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC
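The role text in this record quotes values of 0.81 to 0.87, while the summary line gives a range of 61.0%–80.3%. A simple check can flag quoted values that fall outside the summary range. The sketch below assumes that values between 0 and 1 are fractions to be scaled to percent, which is a guess about the intended units (a value of exactly 1.0 is ambiguous under this rule). The function names are hypothetical.

```python
def to_percent(value):
    """Scale values in [0, 1] to percent; leave larger values unchanged (assumed units)."""
    scaled = value * 100.0 if 0.0 <= value <= 1.0 else value
    return round(scaled, 2)


def outside_header_range(quoted, low, high):
    """Return the quoted values, as percentages, that fall outside [low, high]."""
    return [p for p in (to_percent(v) for v in quoted) if not (low <= p <= high)]


# Quoted values from the role text versus the summary range of this record.
print(outside_header_range([0.81, 0.87], 61.0, 80.3))
```

A non-empty result does not say which side is wrong. It only marks the record as one where the role text and the summary line should be reconciled.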

Qwen2.5 / Bleu META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 7.3/10 · 3 papers · 19.21pp spread (4.57%–23.78%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific data split used, the exact prompt format, and the evaluation harness version. While the model variant (Qwen2.5) is mentioned, there is no information on the checkpoint or quantization settings. Additionally, the context…

PROTOCOL DIVERGENCE ANALYST (concern 9.0/10): The reported metric 'Bleu' for the Qwen2.5 model represents a severe methodological divergence because BLEU is an n-gram overlap metric designed exclusively for machine translation tasks, whereas Qwen2.5 is a general-purpose large language model typically evaluated on generation, reasoning, or compr…

CLAIM SCOPE AUDITOR (concern 6.0/10): The benchmark claim for Qwen2.5 on Bleu shows moderate scope concerns due to potential variability in experimental conditions across different publications. While the scores are reported from peer-reviewed or arXiv publications, the lack of detailed information on specific hardware, prompts, checkpoints…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GPT-4 / HumanEval META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.3/10 · 6 papers · 17.79pp spread (69.51%–87.3%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model name and benchmark dataset without any of the critical methodological parameters required for replication, such as the specific GPT-4 checkpoint version (e.g., Turbo vs. original, date cutoff), prompt format (zero-shot vs. few-shot with specific exam…

PROTOCOL DIVERGENCE ANALYST (concern 2.0/10): The observed score spread for GPT-4 on HumanEval across distinct publications is largely attributable to documented protocol variations, specifically the use of different evaluation harnesses (e.g., original OpenAI harness vs. BigCode or EleutherAI implementations), variations in few-shot prompting …

CLAIM SCOPE AUDITOR (concern 5.0/10): The benchmark score for GPT-4 on HumanEval shows moderate variability across different published papers, indicating potential concerns about reproducibility. While some variation is expected due to differences in experimental setups, such as prompt engineering, evaluation protocols, or checkpoint ve…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC
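
The headline concern on each record appears to equal the unweighted mean of the three role scores, rounded to one decimal; for the record above, (9.0 + 2.0 + 5.0) / 3 ≈ 5.3. The ledger does not state its aggregation rule, so the mean is an assumption inferred from the displayed numbers, and `aggregate_concern` is a hypothetical helper name. A minimal sketch under that assumption:

```python
from statistics import mean

def aggregate_concern(role_scores):
    """Assumed rule (not documented by the ledger): unweighted mean of the
    role scores, rounded to one decimal place."""
    return round(mean(role_scores), 1)

def role_disagreement(role_scores):
    """Gap between the most and least concerned role."""
    scores = list(role_scores)
    return max(scores) - min(scores)

# Role scores as displayed on the GPT-4 / HumanEval record.
gpt4_humaneval = {
    "reproducibility": 9.0,
    "protocol_divergence": 2.0,
    "claim_scope": 5.0,
}

headline = aggregate_concern(gpt4_humaneval.values())
gap = role_disagreement(gpt4_humaneval.values())
print(headline, gap)
```

Under this reading, a mid-range headline can hide strong disagreement between roles: here the mean is 5.3 while the individual scores differ by 7 points.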

CLIP / Mm-Bright META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 7.0/10 · 2 papers · 17.6pp spread (10.4%–28.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model name (CLIP) and benchmark name (Mm-Bright) without any specific experimental parameters required for replication. Critical details such as the specific CLIP variant (e.g., ViT-B/32 vs. ViT-L/14), checkpoint source, quantization level, prompt engineer…

PROTOCOL DIVERGENCE ANALYST (concern 10.0/10): The audit target references a non-existent benchmark ('Mm-Bright'), as no such dataset or evaluation suite is documented in the computer vision or multimodal AI literature, nor are there multiple peer-reviewed papers reporting CLIP scores on it. Consequently, the reported score spread cannot be attr…

CLAIM SCOPE AUDITOR (concern 2.0/10): The claim exhibits minor scope-validity concerns primarily due to the known sensitivity of CLIP's performance on the Mm-Bright benchmark to specific preprocessing pipelines, image resolution settings, and the exact version of the pre-trained weights (e.g., ViT-B/32 vs. ViT-L/14) used, which are not …

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC
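
Each record's summary line follows a fixed pattern (status, severity, concern, paper count, spread, score range), so it can be checked mechanically. The sketch below parses one such line and verifies that the stated spread equals the high score minus the low score. The regular expression, the function names, and the tolerance are assumptions for illustration; they are not part of the ledger's own tooling.

```python
import re

# Pattern inferred from the summary lines shown in this ledger (assumption).
HEADER = re.compile(
    r"(?P<status>\w+) · (?P<severity>\w+) · concern (?P<concern>[\d.]+)/10 · "
    r"(?P<papers>\d+) papers · (?P<spread>[\d.]+)pp spread "
    r"\((?P<low>[\d.]+)%–(?P<high>[\d.]+)%\)"
)

def parse_header(line):
    """Parse a record summary line into typed fields."""
    m = HEADER.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"unrecognized header: {line!r}")
    d = m.groupdict()
    return {
        "status": d["status"],
        "severity": d["severity"],
        "concern": float(d["concern"]),
        "papers": int(d["papers"]),
        "spread_pp": float(d["spread"]),
        "low": float(d["low"]),
        "high": float(d["high"]),
    }

def spread_is_consistent(rec, tol=0.005):
    """True if the stated spread matches high - low within rounding error."""
    return abs((rec["high"] - rec["low"]) - rec["spread_pp"]) <= tol

rec = parse_header(
    "CHALLENGED · HIGH · concern 7.0/10 · 2 papers · 17.6pp spread (10.4%–28.0%)"
)
print(rec["papers"], spread_is_consistent(rec))
```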

Gemini-2.0 / MATH META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.0/10 · 2 papers · 17.46pp spread (74.07%–91.53%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model name (Gemini-2.0) and the benchmark dataset (MATH), completely omitting all critical methodological parameters required for independent replication. Specifically, there is no information regarding the data split (e.g., MATH test set vs. full set), sh…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for Gemini-2.0 on MATH show a significant spread, and there is no clear documentation of methodological differences such as shot count, context window, quantization level, or eval harness version that could explain the variance. While data contamination and benchmark version mism…

CLAIM SCOPE AUDITOR (concern 2.0/10): The claim exhibits minor scope-validity concerns because benchmark scores on MATH for models like Gemini-2.0 are highly sensitive to specific evaluation protocols, including the use of chain-of-thought prompting, temperature settings, number of sampled responses (pass@k), and whether external tools …

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC
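
The scope auditor above names the number of sampled responses (pass@k) as one protocol choice that moves scores. As a hedged illustration of how much the same raw samples can yield different headline numbers, the sketch below uses the unbiased pass@k estimator commonly used for code benchmarks, 1 − C(n−c, k) / C(n, k). The sample counts are invented for illustration; the ledger does not say which k, if any, the audited papers used.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: total samples drawn, c: samples that were correct, k: budget evaluated.
    """
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 samples, 3 correct.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
print(round(p1, 4), round(p5, 4))
```

With these hypothetical counts the same model output supports a score of 0.3 at k=1 and about 0.92 at k=5, which is why an unreported k makes two published numbers hard to compare.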

Llama-2 / Coin Flip META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.3/10 · 2 papers · 17.0pp spread (9.0%–26.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific data split used, the number of shots (few-shot or zero-shot), the exact prompt format, and the evaluation harness version. While the model variant (Llama-2) is mentioned, there is no information about the checkpoint, qu…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The observed score divergence for Llama-2 on Coin Flip is concerning due to the lack of clear methodological differences reported across papers. Coin Flip is a simple benchmark where performance should be highly consistent, suggesting either unreported protocol variations (e.g., random seed handling…

CLAIM SCOPE AUDITOR (concern 2.0/10): The claim that Llama-2 performs poorly on coin flip tasks is methodologically sound regarding the model's inherent limitations in generating true randomness, as this behavior is consistent across various checkpoints and prompting strategies due to the autoregressive nature of transformers. However, …

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Qwen3 / Accuracy META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 7.0/10 · 2 papers · 17.0pp spread (77.0%–94.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as data split specifications, shot count, prompt format, evaluation harness and version, context length, and specific model variant or checkpoint used. While the reported scores are from peer-reviewed sources, the absence of these meth…

PROTOCOL DIVERGENCE ANALYST (concern 4.0/10): The reported scores for Qwen3 on Accuracy show some variation, which could be attributed to differences in evaluation protocols such as shot count, context window size, or the version of the evaluation harness used. For instance, variations in the number of few-shot examples provided during evaluati…

CLAIM SCOPE AUDITOR (concern 10.0/10): The audit target is critically deficient as it provides no specific benchmark scores, evaluation datasets, hardware configurations, prompt strategies, or checkpoint versions, rendering any claim of 'Accuracy' for Qwen3 completely ungrounded and impossible to verify against actual measurement conditi…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

LLaMA / Ppl META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.3/10 · 2 papers · 15.95-point spread (1.08–17.03)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The claim 'LLaMA on Ppl' is critically under-specified for replication because it omits every essential variable required to reproduce a perplexity score: the specific LLaMA variant (e.g., 7B vs 70B, v1 vs v2 vs v3), the exact checkpoint or quantization level, the dataset split used for evaluation (…

PROTOCOL DIVERGENCE ANALYST (concern 2.0/10): The reported spread in perplexity (Ppl) scores for LLaMA models is largely attributable to well-documented methodological variances, specifically differences in tokenization schemes (e.g., sentencepiece vs. byte-level BPE), the specific text corpora subsets used for evaluation (e.g., full WikiText-2…

CLAIM SCOPE AUDITOR (concern 8.0/10): The claim 'LLaMA on Ppl' exhibits severe scope over-generalization because perplexity (Ppl) is not an intrinsic property of the model architecture alone but is strictly contingent on the specific tokenizer vocabulary, the exact text corpus and preprocessing pipeline used for evaluation, the model ch…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC
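
The auditors above note that perplexity depends on the tokenizer and evaluation corpus, not on the model alone. Perplexity is the exponential of the mean negative log-likelihood per token, so changing how many tokens a text is split into changes the result even when the total likelihood of the text is held fixed. The numbers below are hypothetical, and holding total likelihood constant across tokenizers is a simplifying assumption made only to isolate the effect.

```python
import math

def perplexity(total_nll_nats, num_tokens):
    """Perplexity = exp(total negative log-likelihood / number of tokens)."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return math.exp(total_nll_nats / num_tokens)

# Hypothetical: the same text with the same total negative log-likelihood,
# segmented into a different number of tokens by two tokenizers.
total_nll = 2400.0
ppl_a = perplexity(total_nll, 1000)  # tokenizer A: 1000 tokens
ppl_b = perplexity(total_nll, 1200)  # tokenizer B: 1200 tokens
print(round(ppl_a, 2), round(ppl_b, 2))
```

Two papers reporting perplexity for the same model are therefore comparable only if they share the tokenizer, corpus split, and preprocessing.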

Llama-2 / RULER META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.7/10 · 2 papers · 14.4pp spread (85.6%–100.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical replication details such as the specific data split specification, shot count, prompt format, evaluation harness and version, context length, and model variant/checkpoint/quantization. While the scores are reported from peer-reviewed or arXiv publications, the abse…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for Llama-2 on the RULER benchmark show notable variation, and while some methodological differences may contribute to this spread, they are not fully documented. Potential factors include differences in shot count (few-shot vs. zero-shot), context window size, or quantization le…

CLAIM SCOPE AUDITOR (concern 3.0/10): The reported scores for Llama-2 on the RULER benchmark show some variability across different publications, which raises concerns about reproducibility. While the scores are generally close, the differences suggest potential variations in experimental setups, such as prompt engineering, evaluation p…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-3 / MMLU-Pro META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.3/10 · 2 papers · 14.2pp spread (42.0%–56.2%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details necessary for full replication. While the model (Llama-3) and benchmark (MMLU-Pro) are specified, key parameters such as the exact data split (e.g., train/validation/test), shot count (few-shot or zero-shot), prompt format, evaluation harness and version, c…

PROTOCOL DIVERGENCE ANALYST (concern 6.0/10): The observed score spread for Llama-3 on MMLU-Pro may stem from several methodological differences. Key factors include variations in shot count (few-shot vs. zero-shot), context window size (which can affect performance on long-form questions), and the specific version of the MMLU-Pro benchmark use…

CLAIM SCOPE AUDITOR (concern 3.0/10): The reported scores for Llama-3 on MMLU-Pro show some variability across different publications, which raises concerns about the reproducibility and methodological scope. While the scores are relatively close, the differences suggest potential variations in experimental setups, such as different pro…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GPT-4o / Precision META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 7.0/10 · 2 papers · 13.9pp spread (44.1%–58.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific data split used, shot count (few-shot or zero-shot), prompt format, evaluation harness and version, context length, and the exact model variant or checkpoint of GPT-4o. While the scores are reported from peer-reviewed s…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for GPT-4o on the Precision benchmark show a notable spread without clear methodological explanations in the available literature. While differences in shot count, context window size, or quantization levels could contribute to the variability, these factors are not explicitly do…

CLAIM SCOPE AUDITOR (concern 7.0/10): The benchmark scores for GPT-4o on Precision are reported across multiple peer-reviewed or arXiv publications, but there is a lack of detailed information about the specific experimental conditions, such as hardware configurations, prompt variations, checkpoint versions, evaluation dates, and domain…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-2 / Accuracy META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.0/10 · 2 papers · 13.33pp spread (70.0%–83.33%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific data split used, the exact prompt format, and the evaluation harness version. While the model variant (Llama-2) is mentioned, there is no information about the checkpoint, quantization, or context length. Additionally, …

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The observed score spread for Llama-2 on Accuracy may be attributed to several methodological differences. Shot count variations (few-shot vs. zero-shot) and context window sizes could influence performance, as larger contexts may provide more relevant information. Quantization levels, if applied, m…

CLAIM SCOPE AUDITOR (concern 4.0/10): The benchmark scores for Llama-2 on Accuracy show some variability across different published papers, which raises concerns about reproducibility. While the scores are relatively close, the differences suggest potential variations in experimental setups, such as different subsets of the dataset, eva…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

Llama-3 / IFEval META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.3/10 · 2 papers · 12.88pp spread (44.92%–57.8%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The benchmark claim lacks critical details for replication, such as the specific data split specification, shot count, prompt format, evaluation harness and version, context length, and model variant/checkpoint/quantization. While the model (Llama-3) and benchmark (IFEval) are mentioned, the absence…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The reported scores for Llama-3 on IFEval show significant variation, and there are no clear methodological differences documented in the papers to explain this spread. While factors like shot count, context window size, or quantization level could potentially contribute, none of these are explicitly do…

CLAIM SCOPE AUDITOR (concern 2.0/10): The claim regarding Llama-3's performance on IFEval generally maintains methodological scope because the IFEval benchmark is specifically designed to be instruction-following focused with deterministic, rule-based evaluation that minimizes ambiguity compared to generative metrics. While minor variat…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GLM-4 / SWE-bench META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.3/10 · 2 papers · 11.9pp spread (35.7%–47.6%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10): The reported scores for GLM-4 on SWE-bench lack critical replication details such as the specific data split specification, shot count, prompt format, and evaluation harness version. While the benchmark and model are identified, the absence of these methodological specifics makes it difficult for an…

PROTOCOL DIVERGENCE ANALYST (concern 7.0/10): The observed score spread for GLM-4 on SWE-bench is not fully explained by documented protocol differences. While some variation could stem from differences in shot count, context window size, or benchmark version mismatches, the lack of explicit reporting on these factors in the papers makes it dif…

CLAIM SCOPE AUDITOR (concern 5.0/10): The reported scores for GLM-4 on SWE-bench show some variability across different publications, indicating potential concerns about reproducibility and methodological consistency. While the scores are relatively close, the differences suggest that factors such as prompt engineering, evaluation setup…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

GPT-4 / MMLU-Pro META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 5.3/10 · 2 papers · 11.3pp spread (69.7%–81.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model name (GPT-4) and the benchmark dataset (MMLU-Pro) without any of the critical methodological parameters required for replication. Specifically, it omits the data split used (e.g., validation vs. test, few-shot indices), the shot count (0-shot, 5-shot…

PROTOCOL DIVERGENCE ANALYST (concern 2.0/10): The observed score spread for GPT-4 on MMLU-Pro is largely attributable to documented methodological variances, specifically the transition from 5-shot to 0-shot evaluation protocols which significantly impacts performance on this reasoning-heavy benchmark, alongside differences in answer parsing lo…

CLAIM SCOPE AUDITOR (concern 5.0/10): The benchmark score for GPT-4 on MMLU-Pro is reported across multiple peer-reviewed or arXiv publications, indicating some level of reproducibility. However, the lack of detailed information about the specific experimental setups, such as hardware configurations, prompt engineering, checkpoint versi…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

LLaMA-7B / C4 META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 6.7/10 · 2 papers · 10.68pp spread (7.08%–17.76%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model name (LLaMA-7B) and dataset (C4) without any critical methodological parameters required for replication. Specifically, it omits the data split specification (e.g., validation vs. test, specific subset), shot count (zero-shot vs. few-shot), prompt fo…

PROTOCOL DIVERGENCE ANALYST (concern 9.0/10): The reported scores for LLaMA-7B on C4 exhibit a severe methodological divergence because 'C4' is a raw pre-training corpus rather than a standardized downstream evaluation benchmark with a single agreed-upon metric; consequently, the spread in scores likely stems from fundamental differences in eva…

CLAIM SCOPE AUDITOR (concern 2.0/10): The claim 'LLaMA-7B on C4' exhibits minor scope ambiguity because the C4 dataset supports multiple distinct evaluation protocols (e.g., next-token perplexity on raw text vs. accuracy on curated multiple-choice subsets derived from C4) and preprocessing variations (e.g., tokenization boundaries, sequ…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

LLaDA-MoE-7B-A1B / MATH META-ANALYSIS QUEUED

CHALLENGED · HIGH · concern 9.7/10 · 2 papers · 10.4pp spread (44.6%–55.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10): The provided audit target contains only the model identifier and the benchmark name, completely omitting all critical methodological parameters required for independent replication. Specifically, there is no information regarding the data split (e.g., MATH test set vs. full set), the number of shots…

PROTOCOL DIVERGENCE ANALYST (concern 10.0/10): The audit target 'LLaDA-MoE-7B-A1B' does not correspond to any known model architecture or published paper in the current scientific literature as of late 2024, indicating the premise of the query relies on a non-existent or hallucinated entity. Consequently, no distinct published papers reporting s…

CLAIM SCOPE AUDITOR (concern 10.0/10): The audit target presents a critical methodological gap because the reported scores for 'LLaDA-MoE-7B-A1B on MATH' lack essential conditioning variables required for reproducibility, specifically the evaluation protocol (e.g., zero-shot vs. few-shot, chain-of-thought prompting), the specific MATH su…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-20 23:18 UTC

CONTESTED by the Literature (30 total)

Factual observation: ≥ 2 independent papers report irreconcilable scores for the same model + benchmark pair. No judgment is made about which paper is correct. No language model is involved in this determination.

Severity	Model	Benchmark	Spread	Updated (UTC)
MEDIUM	Baseline	ImageNet-1K	9.73pp	2026-06-20 23:18
MEDIUM	LLaMA-13B	C4	9.56pp	2026-06-20 23:18
MEDIUM	LoRA-FAIR	DomainNet	8.6pp	2026-06-20 23:18
MEDIUM	DeepSeek-R1	MMLU	8.3pp	2026-06-20 23:18
MEDIUM	IRCoT	MuSiQue	8.3pp	2026-06-20 23:18
MEDIUM	Llama-3	OBQA	8.13pp	2026-06-20 23:18
MEDIUM	GPT-3.5	MMLU	7.7pp	2026-06-20 23:18
MEDIUM	RoBERTa	MNLI	7.5pp	2026-06-20 23:18
MEDIUM	RoBERTa	CoLA	7.41pp	2026-06-20 23:18
MEDIUM	DeepSeek-R1	HumanEval	7.3pp	2026-06-20 23:18
MEDIUM	GPT-4o	AIME	7.3pp	2026-06-20 23:18
MEDIUM	LLaDA-MoE-7B-A1B	GSM8K	7.0pp	2026-06-20 23:18
MEDIUM	GPT-4	SWE-bench	6.8pp	2026-06-20 23:18
MEDIUM	DeepSeek-33B	HumanEval	6.7pp	2026-06-20 23:18
MEDIUM	Llama-3	Accuracy	6.6pp	2026-06-20 23:18
MEDIUM	Qwen3	LiveCodeBench	6.57pp	2026-06-20 23:18
MEDIUM	Gemini-1.5-Pro	MATH	6.5pp	2026-06-20 23:18
MEDIUM	Llama-3.1-8B	RewardBench	6.2pp	2026-06-20 23:18
MEDIUM	Llama-3.1-8B	IFEval	6.1pp	2026-06-20 23:18
MEDIUM	Llama-3.1-70B	MuSiQue	6.0pp	2026-06-20 23:18
MEDIUM	LightGCL	Amazon-Book	5.76pp	2026-06-20 23:18
MEDIUM	Claude-3.5	SWE-bench	5.1pp	2026-06-20 23:18
LOW	Qwen3	MMLU	4.8pp	2026-06-20 23:18
LOW	Llama-3.1-8B	HellaSwag	4.7pp	2026-06-20 23:18
LOW	Llama-3.1-70B	RewardBench	4.7pp	2026-06-20 23:18
LOW	Claude-3.5	HumanEval	4.4pp	2026-06-20 23:18
LOW	ZSJoint	MultiATIS++	4.39pp	2026-06-20 23:18
LOW	Qwen2.5	SWE-bench	3.5pp	2026-06-20 23:18
LOW	Qwen3	AIME	3.3pp	2026-06-20 23:18
LOW	Gemma-2-9B	MMLU-Pro	3.1pp	2026-06-20 23:18
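
The spread and severity columns can be reproduced with a short calculation: the spread is the highest reported score minus the lowest, in percentage points, and the severity label tracks the size of the spread. The ledger does not publish its cutoffs. The values of 10.0 and 5.0 below are assumptions chosen only because they agree with the rows shown (every HIGH record has a spread of at least 10.4pp, MEDIUM rows run from 5.1pp to 9.73pp, and LOW rows are at most 4.8pp); the function names are hypothetical.

```python
def spread_pp(scores):
    """Spread in percentage points: highest minus lowest reported score."""
    if len(scores) < 2:
        raise ValueError("need scores from at least two papers")
    return round(max(scores) - min(scores), 2)

def severity(spread, high_cutoff=10.0, medium_cutoff=5.0):
    """Bucket a spread into a label. Cutoffs are assumed, not documented."""
    if spread >= high_cutoff:
        return "HIGH"
    if spread >= medium_cutoff:
        return "MEDIUM"
    return "LOW"

# Range endpoints as displayed on the GPT-4 / MMLU-Pro record.
gpt4_mmlu_pro = spread_pp([69.7, 81.0])
print(gpt4_mmlu_pro, severity(gpt4_mmlu_pro))

# Spreads taken from two rows of the table above.
print(severity(5.1), severity(4.8))
```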