Fact System Strategy

Status

The fact system is now the KB (Knowledge Base) system. After an early iteration in data/facts/ with ~139 facts across 18 entities, the system was restructured into packages/kb/ with a much broader scope: 362 entity files in packages/kb/data/things/ and 100+ property definitions in packages/kb/data/properties.yaml.

The early scaling attempt in February 2026 (batch-enriching ~30 pages with <F> tags via crux enrich fact-refs) was reverted after discovering that 81% of the tags used the wrapping form, which hardcodes the display text. That lesson directly shaped KB's design: self-closing <FBF> is the standard form, and wrapping form is discouraged.

This page documents what we learned, what similar systems do, and the operating principles behind KB.

Current infrastructure (KB):

362 entity YAML files in packages/kb/data/things/ (e.g., anthropic.yaml has ~60 facts with clean temporal structure)
100+ property definitions in packages/kb/data/properties.yaml with display config (format, units, divisors)
<FBF entity="x" property="y" /> component renders inline values with hover tooltips (self-closing is the standard form)
<FBFactValue> component for programmatic/dashboard use
<Calc> component derives values from multiple KB facts
Build-time system computes fact timeseries, usage tracking, source resolution
Fact dashboard (E898, removed — superseded by FactBase)

Current usage: <FBF> tags appear across ~19 pages, still concentrated on Anthropic pages (valuation, IPO, investors, stakeholders, headcount progression). The Anthropic entity alone has ~60 facts with temporal structure covering valuation rounds, revenue growth, and headcount over time.

Key finding (still true): The top fact (Anthropic valuation) appears on 9+ pages, which shows the system works when used correctly. The three-criteria test (volatile, entity-attributable, worth tracking) is now the operating principle for which values get <FBF> tags.

Lessons from Similar Systems

Before defining our strategy, it's worth understanding how other knowledge systems solve the "single source of truth for volatile data" problem.

Wikipedia / Wikidata

Wikipedia's approach is a hierarchical override: articles can hardcode values, pull from Wikidata via templates, or use infobox parameters that default to Wikidata. The key insight is that Wikidata doesn't try to tag every number in article prose — it stores structured data separately, and templates pull specific fields into specific locations (infoboxes, comparison tables).

SPARQL queries enable cross-entity consistency checks ("show me all AI companies with valuation > $10B"). This is analogous to our crux query facts — structured data living outside prose, queryable for dashboards.

Lesson for us: KB YAML facts are our Wikidata. They feed structured locations (InfoBox, DataInfoBox, comparison tables) and inline prose via self-closing <FBF> tags — the equivalent of Wikidata template calls. KB now implements this pattern with 362 entity files.

Obsidian Dataview

Obsidian's Dataview plugin demonstrates dual metadata: YAML frontmatter for structured fields, plus inline Key:: Value annotations in prose. Dataview queries pull values from both sources into dynamic tables and lists.

The critical pattern is that inline annotations are write-once, read-many — you write Revenue:: $9.7B once in a page, and queries on other pages display it. The annotation IS the single source of truth.

Lesson for us: KB applies the same write-once, read-many pattern, with KB YAML rather than an inline annotation as the single source of truth. Self-closing <FBF entity="x" property="y" /> tags are inline queries pulling from structured KB data. The wrapping form <FBF>...$9.7B...</FBF> is the anti-pattern: it creates TWO sources of truth (KB data and display text). This lesson directly led to KB standardizing on self-closing form.

Data Journalism (OWID, FiveThirtyEight)

Data journalism organizations keep the data lifecycle completely separate from content. Our World in Data stores datasets in git repos with automated update pipelines. Charts pull from datasets; prose references charts. A journalist never hardcodes "238 million" in article text; the chart renders the current value.

Lesson for us: KB implements this workflow. For volatile numbers (company valuations, funding totals), the update path is: edit KB YAML → pages automatically reflect the change via self-closing <FBF> tags and structured components (tables, InfoBox) that read from KB data.

Temporal Knowledge Graphs (DBpedia-TKG)

Academic temporal knowledge graphs attach explicit timestamps to every fact: (Anthropic, valuation, 2025-03, $61.5B). This enables historical queries and staleness detection.

Lesson for us: KB's asOf field on facts implements this pattern. Adding expectedUpdateFrequency to properties in packages/kb/data/properties.yaml would enable automatic staleness detection ("this valuation is 6 months old but should update quarterly").

Relationship to Statements

Facts and Statements are complementary, not competing systems:

Facts (KB YAML) are the volatile numeric data layer: rendered via <FBF> and <Calc> components at build time, optimized for "update one KB entry, correct 9 pages."
Statements (PostgreSQL) are the broader structured assertion layer — covering all assertion types (numeric, text, entity-valued, attributed) with property taxonomy, citations, and temporal validity.

Statements link to Facts via source_fact_key (e.g., "anthropic.valuation"), providing traceability without requiring migration of the KB build-time pipeline. The property taxonomy in KB property definitions is shared — it seeds the properties table that Statements use.

Long-term convergence: Facts may eventually become a subset of Statements (numeric statements with canonical status), but the KB YAML → <FBF> → build pipeline works well and doesn't need immediate migration.

The Statements System Architecture page (E1006, now removed) previously provided the full Statements system reference.

What Facts Are Actually Good For

The fact system provides value in two scenarios:

Primary value: volatile numbers that appear on multiple pages

When Anthropic raises a new round, updating one YAML entry should update the company page, the IPO analysis, the investor returns page, the stakeholder page, and competitor comparison tables. This is the "update once, correct everywhere" proposition.

Evidence it works: The Anthropic valuation fact appears on 9+ pages. When it updates, one YAML edit propagates everywhere. This is the system working as designed.

Even single-page volatile facts have value, because the update workflow becomes "edit YAML, not grep prose." When a funding amount changes, finding and updating the right YAML field is faster and more reliable than searching across MDX files for hardcoded numbers.

Secondary value: computed derived values

<Calc> expressions that combine facts (revenue multiples, growth rates, funding ratios) auto-update when any input changes. This is genuinely useful and only possible with structured fact data.
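A minimal TypeScript sketch of why derived values stay current when their inputs live in structured data. This is not the <Calc> implementation: the `FactTable` shape, the `revenueMultiple` helper, the "entity.property" key format (borrowed from the source_fact_key example), and the `example-co` numbers are all hypothetical placeholders.

```typescript
// Hypothetical lookup table keyed by "entity.property"; the real KB schema may differ.
type FactTable = { [key: string]: number };

// Derived value: valuation divided by revenue. Returns null when an input is missing.
function revenueMultiple(table: FactTable, entity: string): number | null {
  const valuation = table[entity + ".valuation"];
  const revenue = table[entity + ".revenue"];
  if (valuation === undefined || revenue === undefined || revenue === 0) {
    return null;
  }
  return valuation / revenue;
}

// Placeholder numbers for a made-up entity, not real data.
const exampleTable: FactTable = {
  "example-co.valuation": 100e9,
  "example-co.revenue": 5e9,
};

const multipleBefore = revenueMultiple(exampleTable, "example-co"); // 20

// Updating one input is enough; the derived value is recomputed from the table.
exampleTable["example-co.revenue"] = 10e9;
const multipleAfter = revenueMultiple(exampleTable, "example-co"); // 10
```

Because the multiple is computed from the table rather than typed into prose, neither input has to be hunted down in page text when it changes.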

Problems with the Current Approach

1. Most `<FBF>` tags should use self-closing form

The wrapping form duplicates the display text in the MDX:

{/* Wrapping form — the number is hardcoded in display text */}
xAI raised <FBF entity="xai" property="total-funding">\$26 billion</FBF> in funding

{/* If KB updates to $30B, display still says "$26 billion" */}

This is worse than plain text: it adds JSX complexity without providing the core benefit of single-source-of-truth updates. Prefer the self-closing form: <FBF entity="xai" property="total-funding" />.
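A rough TypeScript sketch of how the two tag forms could be counted in a page, for example to measure what share of tags use the wrapping form. It is a regex approximation rather than an MDX parser, and the names `countFbfForms` and `wrappingShare` are hypothetical, not part of crux.

```typescript
interface TagFormCounts {
  selfClosing: number;
  wrapping: number;
}

// Count <FBF> usages by form. Attribute values containing ">" are not handled.
function countFbfForms(mdx: string): TagFormCounts {
  // Self-closing: <FBF ... />
  const selfClosing = mdx.match(/<FBF\b[^>]*\/>/g) || [];
  // Wrapping: <FBF ...>display text</FBF>; the opening tag must not end in "/>".
  const wrapping = mdx.match(/<FBF\b(?:[^>]*[^\/>])?>[\s\S]*?<\/FBF>/g) || [];
  return { selfClosing: selfClosing.length, wrapping: wrapping.length };
}

// Fraction of tags that hardcode their display text (0 when there are no tags).
function wrappingShare(counts: TagFormCounts): number {
  const total = counts.selfClosing + counts.wrapping;
  return total === 0 ? 0 : counts.wrapping / total;
}

const samplePage = [
  'xAI raised <FBF entity="xai" property="total-funding">$26 billion</FBF> in funding.',
  'xAI is valued at <FBF entity="xai" property="valuation" />.',
].join("\n");

const sampleCounts = countFbfForms(samplePage); // { selfClosing: 1, wrapping: 1 }
const sampleShare = wrappingShare(sampleCounts); // 0.5
```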

2. The enrichment pipeline tags numbers indiscriminately

crux enrich fact-refs uses an LLM to match inline numbers to facts, but it has no judgment about whether tagging adds value. It wraps benchmark scores (will never change), historical funding amounts (fixed in time), and random contextual numbers.

3. Entity and property naming is now consistent

<FBF entity="ssi" property="headcount" /> uses entity slugs and human-readable property names, aligning with KB conventions. EntityLink continues to use numeric IDs like E338 for entity pages.

4. The extraction pipeline creates too many low-value facts

crux facts extract proposes facts for any entity-attributable number, including historical dates, one-time events, and fixed measurements. This dilutes the fact database with entries that never need updating.

5. Limited display format control

The <FBF> component uses KB property display rules for formatting, which handles most cases. However, there is limited ability to specify format variants in context — in prose you might want "$380 billion" while in a table "$380B" is better. Property-level display config in packages/kb/data/properties.yaml provides the default format, but per-usage overrides remain limited.

Revised Strategy

Core principle: facts are for volatile, structured data

A number should be a fact only when it satisfies all three criteria:

Volatile: Will change with the next data update (quarterly earnings, annual headcount, latest valuation)
Entity-attributable: Tied to a specific entity in our data layer
Worth tracking: Either appears on 2+ pages, or changes frequently enough that YAML-based update workflow saves effort vs. grep-and-replace

Examples of good facts: current valuation, latest revenue, current headcount, total funding raised, user counts.

Examples of bad facts: founding dates, historical funding round amounts, benchmark scores from specific papers, one-time event figures.
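The three criteria can be read as a simple predicate. The TypeScript sketch below is only an illustration of that reading: the `FactCandidate` fields are hypothetical, and in practice each flag is a human judgment rather than something computed.

```typescript
// Hypothetical description of a candidate number; field names are illustrative only.
interface FactCandidate {
  label: string;
  volatile: boolean; // will change with the next data update
  entityAttributable: boolean; // tied to a specific entity in the data layer
  pageCount: number; // pages on which the number appears
  changesFrequently: boolean; // YAML update beats grep-and-replace
}

// Third criterion: appears on 2+ pages, or changes often enough to justify YAML.
function worthTracking(c: FactCandidate): boolean {
  return c.pageCount >= 2 || c.changesFrequently;
}

// A number becomes a fact only when all three criteria hold.
function shouldBeFact(c: FactCandidate): boolean {
  return c.volatile && c.entityAttributable && worthTracking(c);
}

const currentValuation: FactCandidate = {
  label: "current valuation",
  volatile: true,
  entityAttributable: true,
  pageCount: 9,
  changesFrequently: true,
};

const foundingDate: FactCandidate = {
  label: "founding date",
  volatile: false,
  entityAttributable: true,
  pageCount: 5,
  changesFrequently: false,
};

const valuationQualifies = shouldBeFact(currentValuation); // true
const foundingDateQualifies = shouldBeFact(foundingDate); // false: not volatile
```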

Current scope: KB has grown to 362 entity files in packages/kb/data/things/, but the three-criteria test remains the gatekeeper for which values get <FBF> tags. Not every entity file needs volatile facts; many contain stable reference data. The volatile, high-value facts (valuations, revenue, headcount) remain the system's primary strength.

Self-closing `<FBF>` is the only acceptable inline form

Self-closing <FBF entity="x" property="y" /> is the default and strongly preferred form:

{/* Correct — value comes from KB, updates automatically */}
xAI is valued at <FBF entity="xai" property="valuation" /> as of its Series E round.

{/* Wrong — value is hardcoded, defeats the purpose */}
xAI is valued at <FBF entity="xai" property="valuation">\$230 billion</FBF> as of its Series E round.

Wrapping form should only be used when the prose requires a format that cannot be expressed via the component's formatting. Even then, consider whether the number should just be plain text instead.

Entity and property are both required

The <FBF> component requires both entity (slug) and property (human-readable name):

<FBF entity="anthropic" property="valuation" />
<FBF entity="anthropic" property="revenue" asOf="2025-12" />

Display format control

The <FBF> component uses KB property display rules for formatting. For historical values, use the asOf prop:

{/* Latest value */}
<FBF entity="anthropic" property="valuation" />

{/* Specific historical value */}
<FBF entity="anthropic" property="valuation" asOf="2025-11" />
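One plausible reading of asOf, sketched in TypeScript: with no asOf the latest fact is shown, and with asOf the most recent fact dated at or before that month is shown. This is an assumption; the real component may instead require an exact date match. The `DatedFact` shape and `resolveFact` name are hypothetical, and the values are placeholders.

```typescript
interface DatedFact {
  asOf: string; // "YYYY-MM"
  value: number;
}

// Pick the fact to display: latest overall, or latest at or before `asOf` when given.
function resolveFact(facts: DatedFact[], asOf?: string): DatedFact | null {
  let best: DatedFact | null = null;
  for (const f of facts) {
    // "YYYY-MM" strings sort chronologically under plain string comparison.
    if (asOf !== undefined && f.asOf > asOf) continue; // newer than the requested date
    if (best === null || f.asOf > best.asOf) best = f;
  }
  return best;
}

// Placeholder values (1, 2, 3), not real valuations.
const valuationTimeline: DatedFact[] = [
  { asOf: "2025-03", value: 1 },
  { asOf: "2026-02", value: 3 },
  { asOf: "2025-11", value: 2 },
];

const latestFact = resolveFact(valuationTimeline); // the 2026-02 entry
const historicalFact = resolveFact(valuationTimeline, "2025-11"); // the 2025-11 entry
const tooEarly = resolveFact(valuationTimeline, "2025-01"); // null: nothing that old
```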

Format rules are defined per-property in packages/kb/data/properties.yaml:

valuation:
  display:
    divisor: 1000000000
    prefix: "$"
    suffix: "B"
    longSuffix: " billion"    # Used by format="long"
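A minimal TypeScript sketch of how the display fields above could be applied to a raw value. The `formatFactValue` name, the "short" and "long" format values, and the one-decimal rounding rule are assumptions made for illustration; the actual component logic may differ.

```typescript
// Mirrors the display fields shown in the YAML above; other fields may exist.
interface DisplayConfig {
  divisor: number;
  prefix: string;
  suffix: string;
  longSuffix?: string;
}

type DisplayFormat = "short" | "long";

function formatFactValue(
  raw: number,
  display: DisplayConfig,
  format: DisplayFormat = "short"
): string {
  const scaled = raw / display.divisor;
  // Assumed rule: keep at most one decimal place (380 stays "380", 61.5 stays "61.5").
  const rounded = Math.round(scaled * 10) / 10;
  const suffix =
    format === "long" && display.longSuffix !== undefined
      ? display.longSuffix
      : display.suffix;
  return display.prefix + String(rounded) + suffix;
}

const valuationDisplay: DisplayConfig = {
  divisor: 1000000000,
  prefix: "$",
  suffix: "B",
  longSuffix: " billion",
};

const shortForm = formatFactValue(380e9, valuationDisplay); // "$380B"
const longForm = formatFactValue(380e9, valuationDisplay, "long"); // "$380 billion"
```

Keeping both suffixes on the property means a table and a sentence can render the same underlying number differently without either one hardcoding it.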

Scale deliberately, not automatically

Don't batch-enrich the entire wiki. Instead:

Identify the numbers that actually change frequently (company valuations, revenue, headcount, total funding)
Create KB facts for those, with self-closing <FBF> tags
Use <Calc> for derived values that reference those facts (revenue multiples, growth rates)
Leave everything else as plain text

The extraction pipeline (crux facts extract) is useful for proposing candidates, but a human should decide which ones become facts.

Staleness Detection

The problem

Facts without freshness expectations silently go stale. The Anthropic valuation fact says "as of Feb 2026" but nothing warns you when it's 6 months old. For a fast-moving company, a 6-month-old valuation is misleading.

The solution: `expectedUpdateFrequency` on properties

Add an update frequency field to each property in packages/kb/data/properties.yaml:

valuation:
  label: "Valuation"
  expectedUpdateFrequency: 90  # days — expect updates quarterly
  # ...

headcount:
  label: "Headcount"
  expectedUpdateFrequency: 180  # days — expect updates biannually

This enables:

Dashboard warnings: The fact dashboard shows stale facts (asOf older than expected)
Auto-update targeting: The auto-update pipeline prioritizes stale facts for news scanning
CLI alerts: crux facts audit flags facts past their expected freshness window
Build-time notices: Non-blocking warnings during build-data.mjs for stale facts (informational, not CI-blocking)

Staleness tiers

Frequency	Cadence	Examples
30 days	Monthly	Revenue (for companies reporting monthly ARR)
90 days	Quarterly	Valuation, funding
180 days	Biannually	Headcount, safety researcher counts, team sizes
365 days	Annually	Market share, net worth estimates
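The staleness check itself is a date comparison. The TypeScript sketch below assumes asOf is a "YYYY-MM" string measured from the first day of that month, and that properties without a configured frequency are never flagged; `TrackedFact`, `ageInDays`, and `isStale` are hypothetical names, and the frequencies are copied from the YAML example above.

```typescript
interface TrackedFact {
  key: string; // e.g. "anthropic.valuation"
  property: string;
  asOf: string; // "YYYY-MM"
}

// Days allowed between updates, keyed by property name.
const expectedUpdateFrequency: { [property: string]: number } = {
  valuation: 90,
  headcount: 180,
};

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Age in whole days, measured from the first day of the asOf month (UTC).
function ageInDays(asOf: string, now: Date): number {
  const year = parseInt(asOf.slice(0, 4), 10);
  const month = parseInt(asOf.slice(5, 7), 10);
  const start = Date.UTC(year, month - 1, 1);
  return Math.floor((now.getTime() - start) / MS_PER_DAY);
}

// A fact is stale when it is older than its property's expected update window.
function isStale(fact: TrackedFact, now: Date): boolean {
  const limit = expectedUpdateFrequency[fact.property];
  if (limit === undefined) return false; // no expectation configured
  return ageInDays(fact.asOf, now) > limit;
}

const checkDate = new Date(Date.UTC(2026, 7, 15)); // 15 Aug 2026
const valuationFact: TrackedFact = {
  key: "anthropic.valuation",
  property: "valuation",
  asOf: "2026-02",
};

const valuationAge = ageInDays(valuationFact.asOf, checkDate); // 195
const valuationIsStale = isStale(valuationFact, checkDate); // true: 195 > 90
```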

Auto-Update Integration

Phased approach

The auto-update system (crux auto-update) already fetches news and routes it to wiki pages. Facts should integrate in phases:

Phase A: Detection only. When the auto-update pipeline detects a potential fact change (e.g., "Anthropic raises new round at $X valuation"), it flags it in the update plan but does not write to YAML. A human reviews and updates.

Phase B: Proposed YAML diffs. The pipeline generates a proposed diff to the fact YAML file, included in the auto-update PR for review. The reviewer approves or adjusts the value before merging.

Phase C: Automated writes behind PR review. The pipeline writes directly to fact YAML and creates a PR. If confidence is high (e.g., a valuation from a reliable source with an exact number), auto-merge is possible. Low-confidence changes require human review.

Each phase is gated on the previous phase working reliably. Start with detection, which is low-risk and immediately useful for prioritizing manual updates.
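The phases can be sketched as a routing decision per detected change. This TypeScript outline is illustrative only: the action names, the numeric confidence score, and the 0.9 cutoff are hypothetical, since the phases above do not define what counts as high confidence.

```typescript
type RolloutPhase = "A" | "B" | "C";

type UpdateAction =
  | "flag-in-plan" // Phase A: detection only, a human edits YAML
  | "propose-diff" // Phase B: diff included in the auto-update PR
  | "write-and-request-review" // Phase C, low confidence
  | "write-and-allow-auto-merge"; // Phase C, high confidence

interface DetectedChange {
  factKey: string; // e.g. "anthropic.valuation"
  confidence: number; // 0..1, hypothetical score
}

// Hypothetical cutoff for "high" confidence.
const AUTO_MERGE_CONFIDENCE = 0.9;

function actionFor(phase: RolloutPhase, change: DetectedChange): UpdateAction {
  if (phase === "A") return "flag-in-plan";
  if (phase === "B") return "propose-diff";
  return change.confidence >= AUTO_MERGE_CONFIDENCE
    ? "write-and-allow-auto-merge"
    : "write-and-request-review";
}

const detected: DetectedChange = { factKey: "anthropic.valuation", confidence: 0.95 };

const phaseAAction = actionFor("A", detected); // "flag-in-plan"
const phaseCAction = actionFor("C", detected); // "write-and-allow-auto-merge"
```

Only Phase C ever consults the confidence score, so the earlier phases stay fully human-reviewed regardless of how confident the detector is.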

Integration points

data/auto-update/sources.yaml: Sources that produce structured financial data (SEC filings, Crunchbase, PitchBook) should be tagged with the properties they can update
crux auto-update plan: Should show which facts are stale and which sources might have updates
Fact update PRs should include the source URL, confidence level, and old vs. new value in the PR description

Implementation Priorities

Done: KB system restructure

Migrated from data/facts/ (139 facts, 18 entities) to packages/kb/data/things/ (362 entity files)
Migrated from data/fact-measures.yaml (25 measures) to packages/kb/data/properties.yaml (95 properties with display config)
Renamed <F> component to <FBF> with entity and property props as required attributes
Added <FBFactValue> component for programmatic/dashboard use
Self-closing <FBF entity="x" property="y" /> is the standard form
Built fact dashboard at /internal/facts (E898, since removed)

Done: Prune and migrate tags

Wrapping <F> tags converted to self-closing <FBF> form
Historical/fixed numbers removed from fact system (replaced with plain text)
Entity and property naming standardized on slugs and human-readable names

Remaining: Staleness infrastructure

Add expectedUpdateFrequency to properties in packages/kb/data/properties.yaml
Surface staleness warnings in FactBase, which superseded the fact dashboard
Add CLI command to list overdue facts

Remaining: Auto-update integration

Phase A: Detection — flag potential fact updates in auto-update plan
Phase B: Proposed diffs — generate KB YAML changes in auto-update PRs
Gate Phase C (automated writes) on successful Phase A+B operation

Remaining: Enrichment pipeline repurposing

Repurpose as a migration tool: scan MDX for hardcoded numbers that match existing KB fact values, propose self-closing <FBF> replacements
Tighten extraction to only propose facts meeting the three criteria

Remaining: Display format improvements

Per-usage format overrides (short vs long display in different contexts)
Currently property-level display config handles most cases, but context-sensitive formatting is limited

Resolved Open Questions

Q1: Should KB YAML files exist for entities with no volatile facts?

Answer: In KB, entity files serve a broader purpose. The original answer was "prune files with only fixed facts," but KB's 362 entity files include both volatile tracked numbers and stable reference data. Entity files with only stable data still contribute to the structured knowledge layer (enabling queries, dashboards, and cross-entity comparisons). The three-criteria test applies to which facts get <FBF> tags in prose, not to which entity files exist.

Q2: Should the enrichment pipeline be removed or repurposed?

Answer: Repurpose (still TODO). The wrapping behavior was killed. The pipeline should be rebuilt as a migration tool: scan MDX for hardcoded numbers that match existing KB fact values, and propose replacements with self-closing <FBF entity="x" property="y" /> tags. This is the inverse of what it originally did — instead of adding redundant wrappers, it would remove hardcoded values.
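A minimal TypeScript sketch of the proposed migration scan, assuming each KB fact can be rendered into the display strings it might appear as in prose. The `KnownFact` and `Proposal` shapes and the `proposeReplacements` name are hypothetical. A string match does not prove the number refers to that entity, so the output is a list of proposals for human review rather than an automatic rewrite.

```typescript
interface KnownFact {
  entity: string;
  property: string;
  displayForms: string[]; // renderings to search for, e.g. "$26 billion", "$26B"
}

interface Proposal {
  found: string; // hardcoded text found in the page
  replacement: string; // self-closing tag to suggest instead
  index: number; // character offset of the match
}

function proposeReplacements(mdx: string, facts: KnownFact[]): Proposal[] {
  const proposals: Proposal[] = [];
  for (const fact of facts) {
    const tag =
      '<FBF entity="' + fact.entity + '" property="' + fact.property + '" />';
    for (const form of fact.displayForms) {
      if (form.length === 0) continue; // an empty string would match everywhere
      let index = mdx.indexOf(form);
      while (index !== -1) {
        proposals.push({ found: form, replacement: tag, index: index });
        index = mdx.indexOf(form, index + form.length);
      }
    }
  }
  // Report matches in document order.
  proposals.sort((a, b) => a.index - b.index);
  return proposals;
}

const pageText = "xAI raised $26 billion in funding.";
const knownFacts: KnownFact[] = [
  { entity: "xai", property: "total-funding", displayForms: ["$26 billion", "$26B"] },
];

const migrationProposals = proposeReplacements(pageText, knownFacts);
// One proposal: replace "$26 billion" at offset 11 with
// <FBF entity="xai" property="total-funding" />
```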

Q3: How should facts interact with the auto-update system?

Answer: Phased — detection first, then automated writes behind PR review. See the Auto-Update Integration section above. Don't jump to automated YAML writes until detection is proven reliable.

Q4: What's the right display format for self-closing `<FBF>` tags?

Answer (implemented): The <FBF> component uses KB property display rules from packages/kb/data/properties.yaml for formatting. The asOf prop handles historical values. This eliminates the main reason people used wrapping form (to control display text). Remaining gap: per-usage format overrides for different contexts (prose vs table).