# Systematic Discovery Methodology

A framework for LLM agents to systematically explore and understand complex systems, codebases, research domains, and problem spaces.
Exploration is not optimization. You're not minimizing a metric; you're reducing uncertainty. Your job is to reduce that uncertainty systematically. Before exploring, inventory what you don't know.
## Known Knowns
Things I understand and can explain.
## Known Unknowns
Specific questions I need to answer.
## Suspected Structure
Hypotheses about how things connect.
## Boundary Conditions
What's in scope vs out of scope.
## Unknown Unknowns (meta)
Areas where I don't even know what questions to ask.
Exploration without structure becomes wandering. The inventory turns vague unease into concrete, answerable questions. Classify each unknown by the type of uncertainty it represents:
| Type | Description | Exploration Strategy |
|---|---|---|
| Structural | How is it organized? | Map → hierarchy, dependencies |
| Behavioral | What does it do? | Probe → inputs, outputs, edge cases |
| Causal | Why does it work this way? | Trace → history, constraints, decisions |
| Boundary | Where are the edges? | Test → limits, failure modes |
| Conceptual | What are the key abstractions? | Synthesize → patterns, vocabulary |
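The table above can be kept as data rather than prose, so each open question carries its exploration strategy with it. A minimal sketch; the `Unknown` class and `STRATEGIES` mapping are illustrative names, not part of any library:

```python
from dataclasses import dataclass

# Strategy per uncertainty type, mirroring the table above.
STRATEGIES = {
    "structural": "map: hierarchy, dependencies",
    "behavioral": "probe: inputs, outputs, edge cases",
    "causal": "trace: history, constraints, decisions",
    "boundary": "test: limits, failure modes",
    "conceptual": "synthesize: patterns, vocabulary",
}

@dataclass
class Unknown:
    question: str
    utype: str  # one of STRATEGIES' keys

    def strategy(self) -> str:
        """The exploration strategy implied by this unknown's type."""
        return STRATEGIES[self.utype]

# Classifying a question immediately picks how to attack it.
q = Unknown("How do the modules depend on each other?", "structural")
```

Tagging every question this way makes the later prioritization and mode-switching steps mechanical rather than ad hoc.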
Your memory is unreliable. The journal is ground truth.
## Session [N]: [Focus Area]
### Starting State
- What I thought I knew: [summary]
- Open questions: [list]
- Hypothesis: [what I expect to find]
### Explorations
#### Probe 1: [Description]
- Action: What I did
- Found: What I observed
- Implies: What this means for my model
- New questions: What this raises
#### Probe 2: [Description]
...
### Session Synthesis
- Model updates: How my understanding changed
- Confirmed: Hypotheses that held
- Refuted: Hypotheses that failed
- Deferred: Questions for later
- Connections: Links to other areas
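A small helper can stamp out this skeleton so every session starts from the same structure. A sketch; the function name is illustrative:

```python
def session_template(n: int, focus: str) -> str:
    """Render the session journal skeleton above as markdown."""
    return "\n".join([
        f"## Session {n}: {focus}",
        "### Starting State",
        "- What I thought I knew:",
        "- Open questions:",
        "- Hypothesis:",
        "### Explorations",
        "#### Probe 1:",
        "- Action:",
        "- Found:",
        "- Implies:",
        "- New questions:",
        "### Session Synthesis",
        "- Model updates:",
        "- Confirmed:",
        "- Refuted:",
        "- Deferred:",
        "- Connections:",
    ])
```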
## SURVEYOR Mode
Goal: Structural overview. Map the territory.
Tactics:
Survey Questions:
## DIVER Mode
Goal: Deep understanding of a specific area.
Tactics:
Dive Questions:
## TRACER Mode
Goal: Understand flow across boundaries.
Tactics:
Trace Questions:
## PROBE Mode
Goal: Validate understanding through prediction.
Tactics:
Probe Protocol:
1. State prediction: "I expect X because [reasoning]"
2. Investigate: Look at actual behavior/code/data
3. Compare: Did reality match prediction?
4. If match: Confidence increases
5. If mismatch: WHY? Update model.
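The protocol can be made concrete as a record-and-score helper. A sketch only: the exact-string comparison is a crude stand-in for "did reality match?", and the +0.1/-0.3 deltas are illustrative assumptions, chosen so a mismatch costs more than a match earns and forces a model update:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Probe:
    prediction: str            # step 1: "I expect X because ..."
    observation: str = ""
    matched: Optional[bool] = None

def run_probe(probe: Probe, observation: str, confidence: float) -> float:
    """Steps 2-5: record the observation, compare, adjust confidence."""
    probe.observation = observation
    probe.matched = observation == probe.prediction  # crude comparison stand-in
    if probe.matched:
        return min(1.0, confidence + 0.1)   # match: confidence creeps up
    return max(0.0, confidence - 0.3)       # mismatch: sharp drop, revisit model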
## SYNTHESIZER Mode
Goal: Unify fragments into a coherent model.
Tactics:
Synthesis Questions:
```
while uncertainty > acceptable_threshold:

    # Phase 1: Survey (if entering new area)
    if new_territory:
        map = SURVEYOR.scan(territory)
        questions = extract_questions(map)
        priorities = rank_by_uncertainty(questions)

    # Phase 2: Investigate priority areas
    for question in priorities[:K]:
        if question.type == STRUCTURAL:
            finding = DIVER.investigate(question.target)
        elif question.type == FLOW:
            finding = TRACER.follow(question.scenario)
        elif question.type == BEHAVIORAL:
            finding = DIVER.probe(question.target)
        journal.record(question, finding)

    # Phase 3: Synthesize periodically
    if journal.entries > synthesis_threshold:
        model = SYNTHESIZER.integrate(journal)

    # Phase 4: Challenge the model
    holes = CHALLENGER.attack(model)
    if holes:
        priorities.extend(holes)
    else:
        uncertainty = estimate_remaining(model)

    # Phase 5: Decide next focus
    if stuck_in_one_area:
        switch_to_adjacent_area()
    if model.confidence > threshold:
        mark_area_understood()
```
Start where users/callers start: main(), index.html, API endpoints. Public interfaces before internals. High-traffic paths before edge cases.
Why: Entry points reveal intended usage and core flows.
When confused, trace data movement. Where does input come from? What transforms happen? Where does output go? What persists vs. what's ephemeral?
Why: Data flow is often clearer than control flow.
Tests reveal: intended behavior, edge cases the authors worried about, integration boundaries, "happy path" assumptions.
Why: Tests are executable documentation of expectations.
Every system has essential vs. accidental complexity. What's the minimum viable version? What could you remove and still have it work?
Why: Understanding the core accelerates understanding the rest.
If you can't name it, you don't understand it. Create vocabulary as you go. "The X pattern" — name recurring structures. Naming forces clarity.
Why: Vocabulary is crystallized understanding.
Try to explain what you've found. Where does the explanation break down? What can't you articulate clearly? Those gaps are understanding gaps.
Why: Teaching reveals holes in mental models.
Pay special attention to things that don't fit: unexpected dependencies, naming inconsistencies, code that "shouldn't be there", historical artifacts.
Why: Anomalies often reveal important history or constraints.
When you're going too deep or too shallow, recalibrate.
Signs you need to go deeper:
Action: Pick one component, DIVER mode, exhaust it.
Signs you need to pull back:
Action: SURVEYOR mode, map adjacent territory.
1. Could I implement this from my understanding?
   - No → probably too shallow
2. Could I explain this to someone in 2 minutes?
   - No → might be too deep (or too shallow)
3. Do I know where this fits in the bigger picture?
   - No → too deep without context
4. Can I predict what I'll find in adjacent areas?
   - No → haven't extracted patterns yet
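One possible way to turn the four answers into a next action, reusing the mode names from this document. The mapping and its ordering are a judgment call, not canonical:

```python
def recalibrate(can_implement: bool, can_explain: bool,
                knows_context: bool, predicts_adjacent: bool) -> str:
    """Map the four depth-check answers to a suggested next mode."""
    if not knows_context:
        return "SURVEYOR"      # too deep without context: map adjacent territory
    if not can_implement:
        return "DIVER"         # too shallow: pick one component and exhaust it
    if not predicts_adjacent or not can_explain:
        return "SYNTHESIZER"   # pieces known, but patterns not yet extracted
    return "CONTINUE"
```

Checking context first reflects that detail without placement is the more expensive failure: depth in the wrong area is wasted work.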
Start at center, expand outward in rings.
Ring 0: Entry point / main concept
Ring 1: Direct dependencies / immediate context
Ring 2: Secondary dependencies / broader context
Ring 3: Ecosystem / external integrations
Each ring complete before next.
When: Clear center exists. Good for codebase with obvious main.
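Ring membership is just breadth-first-search distance from the entry point. A sketch over a toy dependency graph; the node names are invented for illustration:

```python
from collections import deque

def rings(graph: dict, center: str) -> dict:
    """Group nodes by BFS distance (ring number) from the entry point."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in dist:          # first visit fixes the ring
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    out = {}
    for node, ring in dist.items():
        out.setdefault(ring, []).append(node)
    return out

# main depends on parser and encoder; encoder depends on varint.
deps = {"main": ["parser", "encoder"], "encoder": ["varint"]}
```

Completing `rings(deps, "main")[1]` before touching ring 2 enforces the "each ring complete before next" rule mechanically.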
Start with critical questions, investigate to answer.
1. What is the #1 thing I need to understand?
2. Investigate until answered.
3. What's the next most critical question?
4. Repeat.
Let questions drive exploration path.
When: Specific goals exist. Good for targeted investigation.
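A priority queue keeps "the next most critical question" cheap to find as new questions arrive mid-investigation. A sketch using the standard-library heapq; the class name is illustrative:

```python
import heapq

class QuestionQueue:
    """Min-heap of open questions; priority 1 is most critical."""
    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker: equal priorities pop in insertion order

    def ask(self, priority: int, question: str) -> None:
        heapq.heappush(self._heap, (priority, self._count, question))
        self._count += 1

    def next(self) -> str:
        """Pop and return the most critical open question."""
        return heapq.heappop(self._heap)[2]
```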
Understand by comparing to known similar things.
1. What does this remind me of?
2. How is it similar?
3. How is it different?
4. What explains the differences?
Build understanding through contrast.
When: Familiar reference points exist. Good for learning new framework.
Understand the present through the past.
1. What was version 0?
2. What changed and why?
3. What constraints shaped decisions?
4. What's vestigial vs. essential?
Git history, release notes, design docs.
When: System seems historically contingent. Good for legacy code.
Understand by trying to break.
1. What could go wrong?
2. What are the trust boundaries?
3. Where are the assumptions?
4. What happens at the edges?
Security/reliability mindset.
When: Need to understand robustness. Good for security audit.
Understand by mentally rebuilding.
1. If I were building this, what would I need?
2. What problems would I face?
3. How would I solve them?
4. How does actual compare to my imagined version?
Predict then compare.
When: System is large/complex. Good for architecture understanding.
Exploration produces artifacts, not just knowledge.
Nodes: Key concepts, components
Edges: Relationships (uses, contains, depends-on, etc.)
Annotations: Brief descriptions
Visual representation of system structure.
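A concept map can be stored as a flat list of typed edges, which is trivial to query and to render later. A sketch with invented example nodes:

```python
# (source, relation, target) triples; relation names follow the artifact above.
edges = [
    ("encoder", "uses", "varint"),
    ("codec", "contains", "encoder"),
    ("decoder", "depends-on", "varint"),
]

def neighbors(edges: list, node: str) -> list:
    """All (relation, target) pairs leaving a node."""
    return [(rel, dst) for src, rel, dst in edges if src == node]
```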
Term: Definition in context of this system.
Build shared vocabulary.
Important: Note where this system's usage differs from common usage.
Decision: What choice was made
Context: What constraints existed
Alternatives: What was considered
Rationale: Why this choice
Captures the "why" that code can't show.
Trigger: What initiates the flow
Steps: Numbered sequence through system
Data: What transforms at each step
Result: What outcome
Executable understanding of key flows.
Question: What I still don't understand
Attempts: What I tried to find out
Blocker: Why I couldn't answer
Priority: How important to resolve
Explicit acknowledgment of gaps.
| Context | Completion Standard |
|---|---|
| Quick orientation | Survey complete, key concepts named |
| Working in codebase | Can modify safely, know impact radius |
| Debugging | Can trace issue, know relevant components |
| Architecture review | Can critique decisions, identify risks |
| Full ownership | Could rewrite from scratch |
Try to explain your understanding.
Where does the explanation:
- Feel confident? → Actually understood
- Get hand-wavy? → Partially understood
- Require hedging? → Not understood
The explanation reveals your actual knowledge.
Don't use markdown files. Use a queryable database with Zettelkasten-style linking.
```sql
-- Exploration findings/observations
CREATE TABLE findings (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    area TEXT NOT NULL,      -- module, file, concept
    finding_type TEXT,       -- structural|behavioral|causal|boundary|conceptual
    content TEXT NOT NULL,
    implications TEXT,
    questions TEXT,          -- JSON array of new questions
    confirmed INTEGER,       -- has this been validated?
    created_at TIMESTAMP,
    embedding BLOB           -- for semantic search
);

-- Zettelkasten-style links between findings
CREATE TABLE links (
    from_id INTEGER,
    to_id INTEGER,
    relation TEXT,           -- supports|contradicts|refines|questions|similar
    note TEXT
);

-- Fast tag-based filtering
CREATE TABLE tags (
    finding_id INTEGER,
    tag TEXT
);
```
```python
from agent_memory import AgentMemory

mem = AgentMemory("./exploration.db")

# Log a finding
finding_id = mem.log_finding(
    session_id="codebase_explore_001",
    area="src/encoder",
    finding_type="structural",
    content="Encoder uses varint for all integer types",
    implications="Space-efficient but CPU cost on decode",
    questions=["Why not fixed-width for small ints?"],
    tags=["encoder", "varint", "design-decision"]
)

# Link to related finding
mem.add_link(
    from_id=finding_id, from_type="finding",
    to_id=3, to_type="finding",
    relation="supports",
    note="Both relate to space optimization"
)

# Query: all structural findings in encoder
findings = mem.get_findings(
    area="encoder",
    finding_type="structural"
)

# Query: unconfirmed findings (need validation)
uncertain = mem.get_findings(confirmed_only=False)

# Semantic search
related = mem.search_text("memory allocation")
```
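`agent_memory` is not a published package; something like it can be sketched over the standard-library sqlite3 module. This covers only `log_finding` and `get_findings` against a trimmed version of the schema (no links, tags, timestamps, or embeddings), and the substring-match on `area` is an assumption chosen so that querying `"encoder"` finds `"src/encoder"`:

```python
import json
import sqlite3

class AgentMemory:
    """Minimal sqlite3 sketch of the wrapper used above (log + query only)."""

    def __init__(self, path: str):
        self.db = sqlite3.connect(path)
        self.db.execute("""CREATE TABLE IF NOT EXISTS findings (
            id INTEGER PRIMARY KEY, session_id TEXT NOT NULL,
            area TEXT NOT NULL, finding_type TEXT, content TEXT NOT NULL,
            implications TEXT, questions TEXT, confirmed INTEGER)""")

    def log_finding(self, session_id, area, finding_type, content,
                    implications=None, questions=None, **_ignored) -> int:
        # **_ignored absorbs extras like tags=, which live in another table.
        cur = self.db.execute(
            "INSERT INTO findings (session_id, area, finding_type, content,"
            " implications, questions, confirmed) VALUES (?,?,?,?,?,?,0)",
            (session_id, area, finding_type, content, implications,
             json.dumps(questions or [])))
        return cur.lastrowid

    def get_findings(self, area=None, finding_type=None):
        sql, args = "SELECT * FROM findings WHERE 1=1", []
        if area:
            sql += " AND area LIKE ?"
            args.append(f"%{area}%")   # substring match: "encoder" hits "src/encoder"
        if finding_type:
            sql += " AND finding_type = ?"
            args.append(finding_type)
        return self.db.execute(sql, args).fetchall()
```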
| Finding Type | Uncertainty Being Reduced | Typical Tags |
|---|---|---|
| structural | How is it organized? | module, component, dependency |
| behavioral | What does it do? | api, input, output, edge-case |
| causal | Why does it work this way? | design-decision, history, constraint |
| boundary | Where are the edges? | limit, failure-mode, assumption |
| conceptual | What are the key abstractions? | pattern, vocabulary, mental-model |
1. BOUND: What am I exploring? What's out of scope?
2. SURVEY: What exists? How is it organized?
3. QUESTION: What don't I understand? Prioritize.
4. DIVE: Investigate priority unknowns.
5. TRACE: Follow flows across boundaries.
6. PROBE: Test mental model with predictions.
7. RECORD: Journal everything. Trust the log.
8. SYNTHESIZE: Build unified understanding.
9. CHALLENGE: Attack the model. Find holes.
10. ITERATE: Until "done enough" for context.
Too shallow:
• Know WHAT not HOW
• Predictions fail
• Can't explain edges
→ Go deeper: DIVER

Too deep:
• Lost big picture
• Details disconnected
• Diminishing returns
→ Pull back: SURVEYOR
Example: exploring a codebase
Survey: File tree, README, entry points
Dive: Core module, critical path
Trace: Main user flow, error handling
Probe: "If I change X, what breaks?"
Synthesize: Architecture diagram, module responsibilities
Example: exploring a research domain
Survey: Survey papers, key authors, major conferences
Dive: Seminal papers, foundational techniques
Trace: Citation chains (forward and backward)
Probe: "Can I reproduce this result?"
Synthesize: Research map, open problems, key debates
Example: learning an API
Survey: Documentation, endpoint list, data models
Dive: Core resources, authentication, rate limits
Trace: Complete request lifecycle
Probe: "What happens if I send malformed X?"
Synthesize: Mental model of system behavior, gotchas
Example: scoping a problem space
Survey: Existing solutions, stakeholder needs, constraints
Dive: Specific failure modes, edge cases
Trace: User journeys, data flows
Probe: "Would approach X handle scenario Y?"
Synthesize: Problem decomposition, solution space map
Example: exploring a dataset
Survey: Schema, row counts, column types, missingness
Dive: Distributions, outliers, specific fields
Trace: Relationships between tables/fields
Probe: "If X is true, what should Y look like?"
Synthesize: Data quality assessment, feature hypotheses