Autonomous scientific claim extraction from ML papers: how it actually works, what the numbers are, and where it fails.
Eureka is a pipeline that takes ML papers (PDFs from arXiv) and extracts structured claims with evidence pointers, builds a graph of relationships between claims, and generates testable hypotheses from patterns in that graph.
The concrete output is three artifacts: a database of structured claims with evidence pointers, a typed cross-paper claim graph, and a ranked list of testable hypotheses.
Key insight: The system's value isn't in any single extraction — it's in the cross-paper graph. A claim from Paper A only becomes interesting when it contradicts Paper B or enables hypothesis C.
These numbers are from processing 197 papers across 8 topic-specific databases. All measurements on a single machine (M2 Max, 64GB RAM) using Claude 3.5 Sonnet for extraction.
Evaluated against 50 manually annotated papers (human baseline). Inter-annotator agreement (2 annotators) was 0.73 Cohen's kappa on claim boundaries.
| Metric | Value | Notes |
|---|---|---|
| Claim Precision | 0.81 | % of extracted claims that are valid |
| Claim Recall | 0.64 | % of human-identified claims found |
| Evidence Accuracy | 0.89 | % of evidence pointers that resolve correctly |
| Regime Extraction | 0.72 | F1 on experimental condition fields |
| Edge Precision | 0.68 | % of generated edges that humans agree with |
| Triage Accuracy | 0.91 | Fast pass correctly filters irrelevant papers |
The recall problem: 0.64 recall means we miss ~36% of claims. This is acceptable for hypothesis generation (we need coverage, not completeness) but would be fatal for systematic review. The main failure mode is claims buried in dense methodology sections.
Every claim must have an evidence pointer that resolves to a real location. This is enforced in post-processing:
func validateEvidence(claim Claim, paper Paper) error {
	for _, ev := range claim.Evidence {
		switch ev.Type {
		case "table":
			if !paper.HasTable(ev.TableID) {
				return fmt.Errorf("table %s not found", ev.TableID)
			}
		case "figure":
			if !paper.HasFigure(ev.FigureID) {
				return fmt.Errorf("figure %s not found", ev.FigureID)
			}
		case "text":
			if ev.Page < 1 || ev.Page > paper.PageCount {
				return fmt.Errorf("page %d out of range", ev.Page)
			}
			// Fuzzy match the quote in the page text
			if !fuzzyMatch(paper.PageText(ev.Page), ev.Quote, 0.8) {
				return fmt.Errorf("quote not found on page %d", ev.Page)
			}
		}
	}
	return nil
}
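The `fuzzyMatch` helper is not defined in the snippet above. A minimal token-overlap sketch, assuming the matcher works on word overlap (the real implementation may instead use edit distance over a sliding window):

```go
package main

import "strings"

// fuzzyMatch is a hypothetical sketch of the helper used by
// validateEvidence: it reports whether at least `threshold` of the
// quote's tokens appear somewhere in the page text.
func fuzzyMatch(pageText, quote string, threshold float64) bool {
	page := make(map[string]bool)
	for _, tok := range strings.Fields(strings.ToLower(pageText)) {
		page[strings.Trim(tok, ".,;:()")] = true
	}
	toks := strings.Fields(strings.ToLower(quote))
	if len(toks) == 0 {
		return false
	}
	hits := 0
	for _, tok := range toks {
		if page[strings.Trim(tok, ".,;:()")] {
			hits++
		}
	}
	return float64(hits)/float64(len(toks)) >= threshold
}
```

Token overlap is forgiving of PDF whitespace and hyphenation noise, which is why an exact substring match is too strict here.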
11% of claims fail evidence validation and are either fixed (LLM retry with error message) or discarded. This is the main quality gate.
The most important engineering decision in Eureka is regime-gated edges. Two claims can only be compared if their experimental regimes are compatible.
"Flash Attention is 2x faster" (tested on A100, 7B model, seq_len=2048) doesn't contradict "Flash Attention has no speedup" (tested on T4, 125M model, seq_len=512). These are claims in different regimes.
| Edge Type | Min Compatibility | Fallback |
|---|---|---|
| CONTRADICTS | 0.70 | → DIFFERENT_REGIME |
| SUPPORTS | 0.50 | → POSSIBLY_SUPPORTS |
| EXTENDS | 0.30 | → RELATED |
| RELATED | 0.00 | Always allowed |
Result: Of 847 initial CONTRADICTS edges detected by semantic similarity, only 312 (37%) survived regime gating. The rest were demoted to DIFFERENT_REGIME — not contradictions, just different experimental contexts.
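The gating table above reduces to a small lookup. A sketch of the demotion logic, not the production code:

```go
package main

// gateRule pairs the minimum regime compatibility for an edge type with
// the fallback type it is demoted to below that threshold.
type gateRule struct {
	minCompat float64
	fallback  string
}

var gateRules = map[string]gateRule{
	"CONTRADICTS": {0.70, "DIFFERENT_REGIME"},
	"SUPPORTS":    {0.50, "POSSIBLY_SUPPORTS"},
	"EXTENDS":     {0.30, "RELATED"},
	"RELATED":     {0.00, "RELATED"},
}

// gateEdge returns the edge type after regime gating: the original type
// if compatibility clears its threshold, otherwise the fallback.
func gateEdge(relation string, compat float64) string {
	rule, ok := gateRules[relation]
	if !ok || compat >= rule.minCompat {
		return relation
	}
	return rule.fallback
}
```

Keeping the demoted edge (rather than deleting it) preserves the original relation in `relation_original` for later re-evaluation if regime extraction improves.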
Hypotheses are generated from four graph patterns. Each pattern has a template and scoring criteria.
When two claims contradict within compatible regimes, generate a hypothesis explaining the discrepancy.
// Example from actual data
Claim A (Paper: FlashAttention-2):
"FlashAttention achieves 2.4x speedup on A100 for seq_len >= 1024"
Regime: A100-80GB, 7B params, seq_len=2048
Claim B (Paper: Attention Benchmark Study):
"FlashAttention shows <1.1x speedup on A100 for decoder-only models"
Regime: A100-40GB, 7B params, seq_len=2048
Regime compatibility: 0.85 (same gen hardware, same size, same task)
Edge type: CONTRADICTS (survives gating)
Generated hypothesis:
"The FlashAttention speedup discrepancy may be explained by A100-80GB vs
A100-40GB memory bandwidth differences (2.0 TB/s vs 1.6 TB/s). Hypothesis:
FlashAttention speedup scales with memory bandwidth."
Testable: Yes (run same benchmark on both SKUs)
Estimated effort: Low (no training, just inference benchmarks)
Priority score: 0.82
Every hypothesis is scored to prefer practical, testable ideas over hype. The formula penalizes high cost and fragility:
Score(H) = (N^α × E^β × U^γ × L^δ) / (C^κ × F^λ)
Where:
N = Novelty // 0-1, based on cluster distance from existing work
E = Evidence // 0-1, median confidence of supporting claims
U = Utility // 0-1, favors efficiency gains over marginal accuracy
L = Leverage // 0-1, how many other hypotheses this would unlock
C = Cost // 1-10, estimated compute to test (1=inference, 10=pretrain)
F = Fragility // 1-5, number of assumptions that must hold
Exponents (tuned on human rankings):
α=0.8, β=1.2, γ=1.0, δ=0.6, κ=1.5, λ=1.3
Key insight: The cost penalty (κ=1.5) is aggressive. A hypothesis requiring pretraining (C=10) needs 31x more evidence/novelty/utility to match one testable with inference (C=1).
Full SQLite schema. All IDs are typed strings to prevent cross-table confusion.
-- Papers table
CREATE TABLE papers (
    id TEXT PRIMARY KEY,        -- "arxiv:2205.14135"
    arxiv_id TEXT,
    title TEXT NOT NULL,
    authors TEXT,               -- JSON array
    abstract TEXT,
    pdf_path TEXT,
    extracted_text TEXT,
    page_count INTEGER,
    triage_relevant BOOLEAN,
    triage_confidence REAL,
    extraction_status TEXT,     -- pending|extracted|failed
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);
-- Claims table
CREATE TABLE claims (
    id TEXT PRIMARY KEY,        -- "claim:flash-speedup-2205"
    paper_id TEXT REFERENCES papers(id),
    statement TEXT NOT NULL,
    claim_type TEXT,            -- performance|efficiency|correctness|...
    confidence REAL,
    confidence_calibrated REAL,
    evidence_json TEXT,         -- JSON array of evidence objects
    regime_id TEXT REFERENCES regimes(id),
    embedding BLOB,             -- 1536-dim float32 (text-embedding-3-small)
    created_at TIMESTAMP
);
-- Edges table
CREATE TABLE edges (
    id TEXT PRIMARY KEY,
    from_claim_id TEXT REFERENCES claims(id),
    to_claim_id TEXT REFERENCES claims(id),
    relation_type TEXT,         -- supports|contradicts|extends|...
    relation_original TEXT,     -- before regime gating
    regime_compatibility REAL,
    confidence REAL,
    rationale TEXT,
    created_at TIMESTAMP
);
-- Indexes for common queries
CREATE INDEX idx_claims_paper ON claims(paper_id);
CREATE INDEX idx_edges_type ON edges(relation_type);
Where Eureka breaks down, and what we're doing about it.
Problem: Claims buried in dense methodology text are missed. The model extracts well from abstract, intro, results — poorly from methods.
Impact: ~18% of missed claims are methodology-specific (hyperparameters, training details, ablation conditions).
Mitigation: Separate methodology-focused extraction pass with different prompt. Not yet implemented.
Problem: PDF table extraction is lossy. Complex tables (merged cells, rotated headers) often parse incorrectly.
Impact: ~23% of evidence pointer failures are table-related.
Mitigation: Using table detection + OCR fallback (Tesseract). Considering switch to vision model for tables.
Problem: Raw LLM confidence scores are poorly calibrated. Model says 0.9 confidence for claims that are actually 0.7 reliable.
Impact: Scoring formula weights evidence confidence heavily (β=1.2), so miscalibration directly hurts hypothesis ranking.
Mitigation: Platt scaling on held-out human-annotated set. calibrated_confidence = σ(A × raw_confidence + B) where A=1.3, B=-0.4 from our data.
All IDs are strings with type prefixes: paper:arxiv:2205.14135,
claim:flash-speedup-001. Combined with distinct Go string types, this
prevents accidentally using a ClaimID where a PaperID is expected: the
compiler rejects the type mix-up, and the visible prefix makes any
remaining mismatch obvious at query time rather than failing silently.
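In Go the convention looks like distinct string types plus prefix constructors. `NewPaperID` and `NewClaimID` are hypothetical names for illustration:

```go
package main

import "strings"

// PaperID and ClaimID are distinct types: passing a ClaimID where a
// PaperID is expected fails to compile, and the prefix makes every ID
// self-describing in logs and queries.
type PaperID string
type ClaimID string

func NewPaperID(arxivID string) PaperID { return PaperID("paper:arxiv:" + arxivID) }
func NewClaimID(slug string) ClaimID    { return ClaimID("claim:" + slug) }

// Valid reports whether the ID carries its expected type prefix.
func (p PaperID) Valid() bool { return strings.HasPrefix(string(p), "paper:") }
func (c ClaimID) Valid() bool { return strings.HasPrefix(string(c), "claim:") }
```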