Longterm Wiki

All Source Checks

Automated checks of wiki data against original sources. Each record is compared with one or more external sources to confirm its accuracy.


Verified Correct: 50 (96% of checked)
Has Issues: 0 (0% of checked)
Can't Verify: 2 (4% of checked)
Not Yet Checked: 0 (of 52 total)
Contradicted: 0 (None found)
Outdated: 0 (All current)
Accuracy Rate: 100% (confirmed / (confirmed + wrong + outdated))
Needs Recheck: 0 (All up to date)
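
The summary figures above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function and variable names are hypothetical, and it assumes that "wrong" in the accuracy formula corresponds to the Contradicted count and that "checked" is the sum of confirmed, wrong, outdated, and unverifiable records. Under those assumptions, records that cannot be verified lower the "of checked" share but are left out of the accuracy denominator.

```python
def accuracy_rate(confirmed: int, wrong: int, outdated: int) -> float:
    """confirmed / (confirmed + wrong + outdated), as stated on the page."""
    denominator = confirmed + wrong + outdated
    # Guard against an empty denominator (nothing confirmed, wrong, or outdated).
    return confirmed / denominator if denominator else 0.0


def share_of_checked(count: int, checked: int) -> float:
    """Fraction of checked records that fall into one verdict bucket."""
    return count / checked if checked else 0.0


# Counts taken from the summary; the mapping to variable names is an assumption.
confirmed, wrong, outdated, unverifiable = 50, 0, 0, 2
checked = confirmed + wrong + outdated + unverifiable  # 52 records

print(f"Accuracy rate:    {accuracy_rate(confirmed, wrong, outdated):.0%}")
print(f"Verified correct: {share_of_checked(confirmed, checked):.0%} of checked")
print(f"Can't verify:     {share_of_checked(unverifiable, checked):.0%} of checked")
```

With these counts the accuracy rate is 50 / 50, while the "Verified Correct" share is 50 / 52, which is why the page can show 100% accuracy alongside 96% of checked.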

45 results
Benchmark Result · confirmed · sid_v1e1ZwDwoA / GSM8K: 40.3 · Mistral · 4/24/2026
Benchmark Result · confirmed · sid_v1e1ZwDwoA / HumanEval: 30.5 · Mistral · 4/24/2026
Benchmark Result · confirmed · sid_v1e1ZwDwoA / HellaSwag: 84 · Mistral · 4/24/2026
Benchmark Result · confirmed · sid_v1e1ZwDwoA / MMLU: 60.1 · Mistral · 4/24/2026
Benchmark Result · confirmed · sid_kWPQCvjKSg / MATH: 73.8 · Llama · 4/24/2026
Benchmark Result · confirmed · sid_kWPQCvjKSg / HumanEval: 89 · Llama · 4/24/2026
Benchmark Result · confirmed · sid_kWPQCvjKSg / MMLU: 87.3 · Llama · 4/24/2026
Benchmark Result · confirmed · sid_nnv09Wl5OQ / Chatbot Arena Elo: 1402 · Grok · 4/24/2026
Benchmark Result · confirmed · sid_nnv09Wl5OQ / LiveCodeBench: 79.4 · Grok · 4/24/2026
Benchmark Result · confirmed · sid_nnv09Wl5OQ / GSM8K: 89.3 · Grok · 4/24/2026
Benchmark Result · confirmed · sid_nnv09Wl5OQ / MMLU-Pro: 79.9 · Grok · 4/24/2026
Benchmark Result · confirmed · sid_nnv09Wl5OQ / HumanEval: 86.5 · Grok · 4/24/2026
Benchmark Result · confirmed · sid_oSG59ppF7g / Aider Polyglot: 9.8 · GPT-4.1 nano · 4/24/2026
Benchmark Result · confirmed · sid_oSG59ppF7g / MMLU: 80.1 · GPT-4.1 nano · 4/24/2026
Benchmark Result · confirmed · sid_nywmt9QdsA / MMLU: 80.1 · GPT-4.1 mini · 4/24/2026
Benchmark Result · confirmed · sid_bFjrDfX8rQ / GSM8K: 57.1 · GPT-3.5 Turbo · 4/24/2026
Benchmark Result · confirmed · sid_bFjrDfX8rQ / DROP: 61.4 · GPT-3.5 Turbo · 4/24/2026
Benchmark Result · confirmed · sid_bFjrDfX8rQ / WinoGrande: 81.6 · GPT-3.5 Turbo · 4/24/2026
Benchmark Result · confirmed · sid_bFjrDfX8rQ / TruthfulQA: 47 · GPT-3.5 Turbo · 4/24/2026
Benchmark Result · confirmed · sid_bFjrDfX8rQ / HellaSwag: 85.5 · GPT-3.5 Turbo · 4/24/2026
Benchmark Result · confirmed · sid_Gqv7h9oEwA / HellaSwag: 95 · GPT · 4/24/2026
Benchmark Result · confirmed · sid_Gqv7h9oEwA / GSM8K: 92 · GPT · 4/24/2026
Benchmark Result · confirmed · sid_Gqv7h9oEwA / MATH: 76.6 · GPT · 4/24/2026
Benchmark Result · confirmed · sid_Gqv7h9oEwA / MGSM: 90.5 · GPT · 4/24/2026
Benchmark Result · confirmed · sid_Gqv7h9oEwA / HumanEval: 90.2 · GPT · 4/24/2026
Benchmark Result · confirmed · sid_Gqv7h9oEwA / MMLU: 88.7 · GPT · 4/24/2026
Benchmark Result · confirmed · sid_PaKhQQNPkg / MATH: 78.3 · Gemini · 4/24/2026
Benchmark Result · confirmed · sid_PaKhQQNPkg / MMLU: 92.4 · Gemini · 4/24/2026
Benchmark Result · confirmed · sid_PaKhQQNPkg / HumanEval: 89.7 · Gemini · 4/24/2026
Benchmark Result · confirmed · sid_PaKhQQNPkg / MMLU-Pro: 90.99 · Gemini · 4/24/2026
Benchmark Result · confirmed · sid_PaKhQQNPkg / SWE-bench Verified: 80.6 · Gemini · 4/24/2026
Benchmark Result · confirmed · sid_PaKhQQNPkg / Humanity's Last Exam: 44.4 · Gemini · 4/24/2026
Benchmark Result · confirmed · sid_PaKhQQNPkg / ARC-AGI-2: 77.1 · Gemini · 4/24/2026
Benchmark Result · confirmed · sid_PaKhQQNPkg / GPQA Diamond: 94.3 · Gemini · 4/24/2026
Benchmark Result · confirmed · sid_svlbcrT5oQ / BBH: 87.5 · DeepSeek Models · 4/24/2026
Benchmark Result · confirmed · sid_svlbcrT5oQ / GPQA Diamond: 59.1 · DeepSeek Models · 4/24/2026
Benchmark Result · confirmed · sid_svlbcrT5oQ / DROP: 91.6 · DeepSeek Models · 4/24/2026
Benchmark Result · confirmed · sid_svlbcrT5oQ / MMLU-Pro: 75.9 · DeepSeek Models · 4/24/2026
Benchmark Result · confirmed · sid_svlbcrT5oQ / HumanEval: 65.2 · DeepSeek Models · 4/24/2026
Benchmark Result · confirmed · sid_svlbcrT5oQ / GSM8K: 89.3 · DeepSeek Models · 4/24/2026
Benchmark Result · confirmed · sid_svlbcrT5oQ / MATH: 61.6 · DeepSeek Models · 4/24/2026
Benchmark Result · confirmed · sid_svlbcrT5oQ / MMLU: 88.5 · DeepSeek Models · 4/24/2026
Benchmark Result · confirmed · sid_dHgSM46fMw / SWE-bench Verified: 74.5 · Claude Opus 4.1 · 4/24/2026
Benchmark Result · confirmed · sid_dHgSM46fMw / GPQA Diamond: 80.9 · Claude Opus 4.1 · 4/24/2026
Benchmark Result · confirmed · sid_y87VxEBBIA / SWE-bench Verified: 73.3 · Claude Haiku 4.5 · 4/24/2026

Data from source_check_verdicts table.