Longterm Wiki

All Source Checks

Automated source checking of wiki data against original sources. Each record is checked against one or more external sources to confirm accuracy.


Verified Correct: 65 (97% of checked)
Has Issues: 0 (0% of checked)
Can't Verify: 2 (3% of checked)
Not Yet Checked: 0 (of 67 total)
Contradicted: 0 (none found)
Outdated: 0 (all current)
Accuracy Rate: 100% (confirmed / (confirmed + wrong + outdated))
Needs Recheck: 0 (all up to date)
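
The summary figures above can be reproduced from a flat list of verdicts. The following is a minimal sketch, not the dashboard's actual implementation: the function name `summarize` and the verdict labels `wrong` and `outdated` are assumptions taken from the accuracy formula, and "checked" is assumed to mean every record that has any verdict, including unverifiable ones.

```python
from collections import Counter


def summarize(verdicts):
    """Compute dashboard-style percentages from a list of verdict labels.

    Assumed labels: "confirmed", "wrong", "outdated", "unverifiable".
    """
    counts = Counter(verdicts)
    confirmed = counts["confirmed"]
    wrong = counts["wrong"]
    outdated = counts["outdated"]
    unverifiable = counts["unverifiable"]

    # Assumption: "checked" counts every record that received a verdict.
    checked = confirmed + wrong + outdated + unverifiable
    # The accuracy rate excludes unverifiable records from the denominator.
    decided = confirmed + wrong + outdated

    def pct(part, whole):
        # Return a whole-number percentage, or None when the denominator is zero.
        return round(100 * part / whole) if whole else None

    return {
        "checked": checked,
        "confirmed_pct_of_checked": pct(confirmed, checked),
        "unverifiable_pct_of_checked": pct(unverifiable, checked),
        "accuracy_rate": pct(confirmed, decided),
    }


# Counts taken from the summary above: 65 confirmed, 2 unverifiable.
summary = summarize(["confirmed"] * 65 + ["unverifiable"] * 2)
print(summary)
```

Because unverifiable records are left out of the accuracy denominator, the accuracy rate can be 100% even though only 97% of checked records are confirmed.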

67 results
Benchmark Result · confirmed · sid_Ac7c55KtVw / IFEval: 91.2 · Claude Opus 4.6 · 4/24/2026
Benchmark Result · confirmed · sid_Ac7c55KtVw / GSM8K: 98.4 · Claude Opus 4.6 · 4/24/2026
Benchmark Result · confirmed · sid_Ac7c55KtVw / MMMU: 76.5 · Claude Opus 4.6 · 4/24/2026
Benchmark Result · confirmed · sid_Ac7c55KtVw / BrowseComp: 84 · Claude Opus 4.6 · 4/24/2026
Benchmark Result · confirmed · sid_Ac7c55KtVw / HumanEval: 95.4 · Claude Opus 4.6 · 4/24/2026
Benchmark Result · confirmed · sid_Ac7c55KtVw / MMLU: 92.1 · Claude Opus 4.6 · 4/24/2026
Benchmark Result · confirmed · sid_tppPAkJqjQ / MGSM: 92.5 · Claude Opus 4.5 · 4/24/2026
Benchmark Result · confirmed · sid_tppPAkJqjQ / SimpleQA: 36 · Claude Opus 4.5 · 4/24/2026
Benchmark Result · confirmed · sid_tppPAkJqjQ / LiveCodeBench: 70.3 · Claude Opus 4.5 · 4/24/2026
Benchmark Result · confirmed · sid_tppPAkJqjQ / Humanity's Last Exam: 43.2 · Claude Opus 4.5 · 4/24/2026
Benchmark Result · confirmed · sid_tppPAkJqjQ / HumanEval: 92 · Claude Opus 4.5 · 4/24/2026
Benchmark Result · confirmed · sid_tppPAkJqjQ / GSM8K: 95 · Claude Opus 4.5 · 4/24/2026
Benchmark Result · confirmed · sid_tppPAkJqjQ / MMLU-Pro: 89.5 · Claude Opus 4.5 · 4/24/2026
Benchmark Result · confirmed · sid_ePVee3jidQ / MMMU: 69.1 · Claude 3.7 Sonnet · 4/24/2026
Benchmark Result · confirmed · sid_ePVee3jidQ / LiveCodeBench: 65.4 · Claude 3.7 Sonnet · 4/24/2026
Benchmark Result · confirmed · sid_ePVee3jidQ / GSM8K: 96.4 · Claude 3.7 Sonnet · 4/24/2026
Benchmark Result · confirmed · sid_ePVee3jidQ / MMLU-Pro: 78.4 · Claude 3.7 Sonnet · 4/24/2026
Benchmark Result · confirmed · sid_ePVee3jidQ / HumanEval: 94 · Claude 3.7 Sonnet · 4/24/2026
Benchmark Result · confirmed · sid_ISfAiImMYg / SWE-bench Verified: 49 · Claude 3.5 Sonnet · 4/24/2026
Benchmark Result · confirmed · sid_ISfAiImMYg / GSM8K: 96.4 · Claude 3.5 Sonnet · 4/24/2026
Benchmark Result · confirmed · sid_v1e1ZwDwoA / HellaSwag: 84 · Mistral · 4/24/2026
Benchmark Result · confirmed · sid_v1e1ZwDwoA / GSM8K: 40.3 · Mistral · 4/24/2026
Benchmark Result · confirmed · sid_v1e1ZwDwoA / HumanEval: 30.5 · Mistral · 4/24/2026
Benchmark Result · confirmed · sid_v1e1ZwDwoA / MMLU: 60.1 · Mistral · 4/24/2026
Benchmark Result · confirmed · sid_kWPQCvjKSg / HumanEval: 89 · Llama · 4/24/2026
Benchmark Result · confirmed · sid_kWPQCvjKSg / MMLU: 87.3 · Llama · 4/24/2026
Benchmark Result · confirmed · sid_kWPQCvjKSg / MATH: 73.8 · Llama · 4/24/2026
Benchmark Result · confirmed · sid_nnv09Wl5OQ / Chatbot Arena Elo: 1402 · Grok · 4/24/2026
Benchmark Result · confirmed · sid_nnv09Wl5OQ / LiveCodeBench: 79.4 · Grok · 4/24/2026
Benchmark Result · confirmed · sid_nnv09Wl5OQ / HumanEval: 86.5 · Grok · 4/24/2026
Benchmark Result · confirmed · sid_nnv09Wl5OQ / MMLU-Pro: 79.9 · Grok · 4/24/2026
Benchmark Result · confirmed · sid_nnv09Wl5OQ / GSM8K: 89.3 · Grok · 4/24/2026
Benchmark Result · confirmed · sid_oSG59ppF7g / Aider Polyglot: 9.8 · GPT-4.1 nano · 4/24/2026
Benchmark Result · confirmed · sid_oSG59ppF7g / MMLU: 80.1 · GPT-4.1 nano · 4/24/2026
Benchmark Result · unverifiable · sid_nywmt9QdsA / MathVista: 73.1 · GPT-4.1 mini · 4/24/2026
Benchmark Result · confirmed · sid_nywmt9QdsA / MMLU: 80.1 · GPT-4.1 mini · 4/24/2026
Benchmark Result · confirmed · sid_bFjrDfX8rQ / TruthfulQA: 47 · GPT-3.5 Turbo · 4/24/2026
Benchmark Result · confirmed · sid_bFjrDfX8rQ / GSM8K: 57.1 · GPT-3.5 Turbo · 4/24/2026
Benchmark Result · confirmed · sid_bFjrDfX8rQ / DROP: 61.4 · GPT-3.5 Turbo · 4/24/2026
Benchmark Result · confirmed · sid_bFjrDfX8rQ / HellaSwag: 85.5 · GPT-3.5 Turbo · 4/24/2026
Benchmark Result · confirmed · sid_bFjrDfX8rQ / WinoGrande: 81.6 · GPT-3.5 Turbo · 4/24/2026
Benchmark Result · confirmed · sid_Gqv7h9oEwA / HellaSwag: 95 · GPT · 4/24/2026
Benchmark Result · confirmed · sid_Gqv7h9oEwA / GSM8K: 92 · GPT · 4/24/2026
Benchmark Result · confirmed · sid_Gqv7h9oEwA / MATH: 76.6 · GPT · 4/24/2026
Benchmark Result · confirmed · sid_Gqv7h9oEwA / MGSM: 90.5 · GPT · 4/24/2026
Benchmark Result · confirmed · sid_Gqv7h9oEwA / HumanEval: 90.2 · GPT · 4/24/2026
Benchmark Result · confirmed · sid_Gqv7h9oEwA / MMLU: 88.7 · GPT · 4/24/2026
Benchmark Result · confirmed · sid_PaKhQQNPkg / MMLU: 92.4 · Gemini · 4/24/2026
Benchmark Result · confirmed · sid_PaKhQQNPkg / HumanEval: 89.7 · Gemini · 4/24/2026
Benchmark Result · confirmed · sid_PaKhQQNPkg / MATH: 78.3 · Gemini · 4/24/2026
Showing 1-50 of 67 (page 1 of 2)

Data from the source_check_verdicts table.