Technical Deep-Dive

Eureka Device

Autonomous scientific claim extraction from ML papers: how it actually works, what the numbers are, and where it fails.

~4,200 words 15 min read January 2026

§1 What It Actually Does

Eureka is a pipeline that takes ML papers (PDFs from arXiv) and extracts structured claims with evidence pointers, builds a graph of relationships between claims, and generates testable hypotheses from patterns in that graph.

The concrete output is:

Claims: Structured statements with confidence scores, linked to evidence
Evidence: Pointers to exact locations (page, table, figure, section)
Regimes: Experimental conditions under which claims hold
Edges: Relationships between claims (supports, contradicts, extends)
Hypotheses: Generated from graph patterns (contradictions, gaps, synthesis opportunities)

Key insight: The system's value isn't in any single extraction — it's in the cross-paper graph. A claim from Paper A only becomes interesting when it contradicts Paper B or enables hypothesis C.

§2 Measured Performance

These numbers are from processing 197 papers across 8 topic-specific databases. All measurements on a single machine (M2 Max, 64GB RAM) using Claude 3.5 Sonnet for extraction.

47s

Avg extraction time

Two-pass: 8s triage + 39s deep

~12K

Tokens per paper

Input: ~10K, Output: ~2K

$0.18

Cost per paper

Sonnet at $3/$15 per 1M

1.37

Claims per paper

271 claims from 197 papers

Extraction Quality

Evaluated against 50 manually annotated papers (human baseline). Inter-annotator agreement (2 annotators) was 0.73 Cohen's kappa on claim boundaries.

Metric	Value	Notes
Claim Precision	0.81	% of extracted claims that are valid
Claim Recall	0.64	% of human-identified claims found
Evidence Accuracy	0.89	% of evidence pointers that resolve correctly
Regime Extraction	0.72	F1 on experimental condition fields
Edge Precision	0.68	% of generated edges that humans agree with
Triage Accuracy	0.91	Fast pass correctly filters irrelevant papers

The recall problem: 0.64 recall means we miss ~36% of claims. This is acceptable for hypothesis generation (we need coverage, not completeness) but would be fatal for systematic review. The main failure mode is claims buried in dense methodology sections.

§3 How Extraction Works

The Two-Pass Architecture

PDF Input │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ PASS 1: TRIAGE (Haiku, ~1.5K tokens, 8s) │ │ ───────────────────────────────────────── │ │ Prompt: "Is this paper about [topic]? Does it contain │ │ empirical claims with quantitative results?" │ │ │ │ Output: { relevant: bool, confidence: float, reason: str } │ │ │ │ If not relevant OR confidence < 0.7 → SKIP │ └─────────────────────────────────────────────────────────────┘ │ (60% pass) ▼ ┌─────────────────────────────────────────────────────────────┐ │ PASS 2: DEEP EXTRACTION (Sonnet, ~12K tokens, 39s) │ │ ────────────────────────────────────────────────── │ │ Input: Full paper text (chunked if >100K tokens) │ │ Output: Structured JSON with claims, evidence, regimes │ │ │ │ Key: Schema is in the prompt, model fills it │ └─────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ POST-PROCESSING │ │ ─────────────── │ │ 1. Validate evidence pointers (page/table exists) │ │ 2. Dedupe claims by semantic similarity (>0.92 = merge) │ │ 3. Link to ontology terms (methods, datasets, metrics) │ │ 4. Compute confidence calibration adjustment │ └─────────────────────────────────────────────────────────────┘

Evidence Grounding: Why It Matters

Every claim must have an evidence pointer that resolves to a real location. This is enforced in post-processing:

func validateEvidence(claim Claim, paper Paper) error {
    for _, ev := range claim.Evidence {
        switch ev.Type {
        case "table":
            if !paper.HasTable(ev.TableID) {
                return fmt.Errorf("table %s not found", ev.TableID)
            }
        case "figure":
            if !paper.HasFigure(ev.FigureID) {
                return fmt.Errorf("figure %s not found", ev.FigureID)
            }
        case "text":
            if ev.Page < 1 || ev.Page > paper.PageCount {
                return fmt.Errorf("page %d out of range", ev.Page)
            }
            // Fuzzy match the quote in the page text
            if !fuzzyMatch(paper.PageText(ev.Page), ev.Quote, 0.8) {
                return fmt.Errorf("quote not found on page %d", ev.Page)
            }
        }
    }
    return nil
}

11% of claims fail evidence validation and are either fixed (LLM retry with error message) or discarded. This is the main quality gate.

§4 Regime Compatibility

The most important engineering decision in Eureka is regime-gated edges. Two claims can only be compared if their experimental regimes are compatible.

"Flash Attention is 2x faster" (tested on A100, 7B model, seq_len=2048) doesn't contradict "Flash Attention has no speedup" (tested on T4, 125M model, seq_len=512). These are claims in different regimes.

Edge Gating Rules

Edge Type	Min Compatibility	Fallback
CONTRADICTS	0.70	→ DIFFERENT_REGIME
SUPPORTS	0.50	→ POSSIBLY_SUPPORTS
EXTENDS	0.30	→ RELATED
RELATED	0.00	Always allowed

Result: Of 847 initial CONTRADICTS edges detected by semantic similarity, only 312 (37%) survived regime gating. The rest were demoted to DIFFERENT_REGIME — not contradictions, just different experimental contexts.

§5 Hypothesis Generation

Hypotheses are generated from four graph patterns. Each pattern has a template and scoring criteria.

Pattern 1: Contradiction Resolution

When two claims contradict within compatible regimes, generate a hypothesis explaining the discrepancy.

// Example from actual data
Claim A (Paper: FlashAttention-2):
  "FlashAttention achieves 2.4x speedup on A100 for seq_len >= 1024"
  Regime: A100-80GB, 7B params, seq_len=2048

Claim B (Paper: Attention Benchmark Study):
  "FlashAttention shows <1.1x speedup on A100 for decoder-only models"
  Regime: A100-40GB, 7B params, seq_len=2048

Regime compatibility: 0.85 (same gen hardware, same size, same task)
Edge type: CONTRADICTS (survives gating)

Generated hypothesis:
  "The FlashAttention speedup discrepancy may be explained by A100-80GB vs
   A100-40GB memory bandwidth differences (2.0 TB/s vs 1.6 TB/s). Hypothesis:
   FlashAttention speedup scales with memory bandwidth."

  Testable: Yes (run same benchmark on both SKUs)
  Estimated effort: Low (no training, just inference benchmarks)
  Priority score: 0.82

Anti-Hype Scoring

Every hypothesis is scored to prefer practical, testable ideas over hype. The formula penalizes high cost and fragility:

Score(H) = (N^α × E^β × U^γ × L^δ) / (C^κ × F^λ)

Where:
  N = Novelty        // 0-1, based on cluster distance from existing work
  E = Evidence       // 0-1, median confidence of supporting claims
  U = Utility        // 0-1, favors efficiency gains over marginal accuracy
  L = Leverage       // 0-1, how many other hypotheses this would unlock
  C = Cost           // 1-10, estimated compute to test (1=inference, 10=pretrain)
  F = Fragility      // 1-5, number of assumptions that must hold

Exponents (tuned on human rankings):
  α=0.8, β=1.2, γ=1.0, δ=0.6, κ=1.5, λ=1.3

Key insight: The cost penalty (κ=1.5) is aggressive. A hypothesis requiring pretraining (C=10) needs 31x more evidence/novelty/utility to match one testable with inference (C=1).

§6 Data Model

Full SQLite schema. All IDs are typed strings to prevent cross-table confusion.

-- Papers table
CREATE TABLE papers (
    id TEXT PRIMARY KEY,       -- "arxiv:2205.14135"
    arxiv_id TEXT,
    title TEXT NOT NULL,
    authors TEXT,               -- JSON array
    abstract TEXT,
    pdf_path TEXT,
    extracted_text TEXT,
    page_count INTEGER,
    triage_relevant BOOLEAN,
    triage_confidence REAL,
    extraction_status TEXT,   -- pending|extracted|failed
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

-- Claims table
CREATE TABLE claims (
    id TEXT PRIMARY KEY,       -- "claim:flash-speedup-2205"
    paper_id TEXT REFERENCES papers(id),
    statement TEXT NOT NULL,
    claim_type TEXT,           -- performance|efficiency|correctness|...
    confidence REAL,
    confidence_calibrated REAL,
    evidence_json TEXT,       -- JSON array of evidence objects
    regime_id TEXT REFERENCES regimes(id),
    embedding BLOB,            -- 1536-dim float32 (text-embedding-3-small)
    created_at TIMESTAMP
);

-- Edges table
CREATE TABLE edges (
    id TEXT PRIMARY KEY,
    from_claim_id TEXT REFERENCES claims(id),
    to_claim_id TEXT REFERENCES claims(id),
    relation_type TEXT,       -- supports|contradicts|extends|...
    relation_original TEXT,  -- before regime gating
    regime_compatibility REAL,
    confidence REAL,
    rationale TEXT,
    created_at TIMESTAMP
);

-- Indexes for common queries
CREATE INDEX idx_claims_paper ON claims(paper_id);
CREATE INDEX idx_edges_type ON edges(relation_type);

§7 Failure Modes

Where Eureka breaks down, and what we're doing about it.

1. Methodology Section Blindness

Problem: Claims buried in dense methodology text are missed. The model extracts well from abstract, intro, results — poorly from methods.

Impact: ~18% of missed claims are methodology-specific (hyperparameters, training details, ablation conditions).

Mitigation: Separate methodology-focused extraction pass with different prompt. Not yet implemented.

2. Table Parsing Failures

Problem: PDF table extraction is lossy. Complex tables (merged cells, rotated headers) often parse incorrectly.

Impact: ~23% of evidence pointer failures are table-related.

Mitigation: Using table detection + OCR fallback (Tesseract). Considering switch to vision model for tables.

3. Confidence Calibration

Problem: Raw LLM confidence scores are poorly calibrated. Model says 0.9 confidence for claims that are actually 0.7 reliable.

Impact: Scoring formula weights evidence confidence heavily (β=1.2), so miscalibration directly hurts hypothesis ranking.

Mitigation: Platt scaling on held-out human-annotated set. calibrated_confidence = σ(A × raw_confidence + B) where A=1.3, B=-0.4 from our data.

§8 Implementation Notes

Why SQLite, Not a Graph Database?

Portability: Single file, no server, works offline
Query patterns: Most queries are "get claims by paper" or "get edges by claim" — simple JOINs, not graph traversals
Scale: At 271 claims, Neo4j overhead isn't worth it. SQLite handles 100K+ claims fine
Embeddings: sqlite-vec extension for vector similarity (0.3ms for top-10 nearest neighbors in 10K vectors)

Why Typed String IDs?

All IDs are strings with type prefixes: paper:arxiv:2205.14135, claim:flash-speedup-001. This prevents accidentally using a ClaimID where a PaperID is expected — caught at query time, not runtime.

Sources & Related Work

ORKG (Open Research Knowledge Graph) — Structured extraction of research contributions. Jaradeh et al., 2019.
SciREX — Scientific information extraction dataset. Jain et al., 2020.
SCITE — Citation context classification. Nicholson et al., 2021.
LLM Calibration — On calibration of modern neural networks. Guo et al., 2017.