The Meta-Pattern
Every hard optimization problem has the same structure: a measurable metric, hard physical or mathematical constraints that imply a theoretical floor, and a gap between actual performance and that floor which you close iteratively.
This works for GPU kernels, compiler optimization, database queries, network throughput, build times, algorithm complexity — anything with a measurable metric and physical/mathematical constraints.
Theoretical Floors First
Before optimizing anything, calculate what's physically possible.
The Floor Formula

For each resource that could bind, compute the minimum time it imposes; the overall floor is the maximum of these:

floor = max(resource_floors)

Domain Examples
Network:
    data_floor    = total_bytes / bandwidth
    latency_floor = round_trips * RTT
    floor         = max(data_floor, latency_floor)

Database:
    io_floor  = pages_to_read / pages_per_second
    cpu_floor = rows_to_process / rows_per_second
    floor     = max(io_floor, cpu_floor)

Build:
    compile_floor = total_units / (cores * rate)
    link_floor    = total_link_work / link_rate
    floor         = max(compile_floor, link_floor)

Algorithm:
    floor = theoretical_complexity(input_size)  # e.g., O(n log n) for comparison sort
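These floors can be computed mechanically. A minimal sketch for the network case (the function name and the numbers are illustrative, not part of the framework):

```python
def transfer_floor(total_bytes, bandwidth_bps, round_trips, rtt_s):
    # Bandwidth-bound component: time to push the bytes through the link
    data_floor = total_bytes / bandwidth_bps
    # Latency-bound component: serialized round trips
    latency_floor = round_trips * rtt_s
    # The binding resource determines the floor
    return max(data_floor, latency_floor)

# 1 GB over 1 Gbit/s (125 MB/s) with 10 round trips at 50 ms RTT:
# the 8.0 s data floor dominates the 0.5 s latency floor
print(transfer_floor(1e9, 125e6, 10, 0.050))  # 8.0
```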
Why This Matters
If your floor is 100ms and you're at 150ms, you have 50ms of optimization headroom. If your floor is 100ms and you're at 10s, you have catastrophic structural problems.
The ratio actual/floor tells you which situation you're in:
| Ratio | Situation | Action |
|---|---|---|
| 1.0 - 1.5x | Near optimal | Micro-optimizations only |
| 1.5 - 3x | Scheduling/pipelining problems | Mid-level optimizations |
| 3x - 10x | Architectural problems | Need structural changes |
| >10x | Fundamental algorithmic problems | Rethink the approach |
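The table folds into a tiny triage helper; the thresholds and wording below follow the table directly:

```python
def diagnose(actual, floor):
    # Map the actual/floor ratio onto the triage table above
    ratio = actual / floor
    if ratio <= 1.5:
        return "near optimal: micro-optimizations only"
    if ratio <= 3:
        return "scheduling/pipelining: mid-level optimizations"
    if ratio <= 10:
        return "architectural: structural changes needed"
    return "algorithmic: rethink the approach"

print(diagnose(150, 100))     # near optimal: micro-optimizations only
print(diagnose(10_000, 100))  # algorithmic: rethink the approach
```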
Structured Worklog
Your memory is unreliable. The worklog is not.
Schema
## Attempt [N]: [Descriptive Name]

Hypothesis: What you believe is causing the gap, stated falsifiably.
Change: Concrete, minimal modification to test hypothesis.
Baseline: [metric before]
Result: [metric after]
Delta: [+X% / -X%]
Verdict: KEEP | REVERT | PARTIAL
Explanation: Why did this work/fail? What does it tell you about the system?
Unlocks: What new optimizations does this enable?
Blocks: What does this prevent or make harder?
Worklog Discipline
- Write BEFORE implementing — Forces clear hypothesis formation
- Write AFTER measuring — Captures actual results, not assumptions
- Never edit old entries — Append corrections, don't rewrite history
- Tag dependencies — "Requires Attempt 5" / "Incompatible with Attempt 3"
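One way to enforce the append-only discipline is to write each attempt as a JSON line; this sketch (the file format and field names are assumptions, not part of the framework) computes the delta automatically:

```python
import json
import time

def log_attempt(path, hypothesis, change, baseline, result, verdict, **extra):
    # Append-only: corrections become new entries; history is never rewritten
    entry = {
        "ts": time.time(),
        "hypothesis": hypothesis,
        "change": change,
        "baseline": baseline,
        "result": result,
        "delta_pct": round(100 * (result - baseline) / baseline, 1),
        "verdict": verdict,
        **extra,  # e.g., unlocks=..., blocks=..., requires=...
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Because entries are only ever appended, replaying the file top to bottom reconstructs the full decision history in order.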
Why This Works
Optimization is not linear. You will:
- Try things that fail
- Find combinations that interact unexpectedly
- Revisit abandoned approaches after other changes
The worklog lets you:
- Avoid repeating failed experiments
- Understand interaction effects
- Reconstruct the reasoning for any decision
Agent Architecture
Role Decomposition
PLANNER
Input: Current state, metrics, worklog
Output: Prioritized hypothesis list
When: Start, milestones, when stuck

EXECUTOR
Input: Single hypothesis
Output: Modified code/config
Key: Always work on an ISOLATED COPY

VALIDATOR
Input: Original + modified versions
Output: Correctness, performance
When: After every attempt

IDEATOR
Input: Problem, failed approaches
Output: Alternative techniques
When: 3+ failed attempts
Coordination Protocol
while not (at_floor or out_of_time):
    # Phase 1: Plan
    hypotheses = PLANNER.analyze(current_state, metrics, worklog)

    # Phase 2: Execute in parallel
    # (Result is a small record: hypothesis, modified, valid, perf)
    results = []
    for h in hypotheses[:K]:  # K = parallelism budget
        copy = isolate(current_state)
        modified = EXECUTOR.implement(copy, h)
        valid, perf = VALIDATOR.check(current_state, modified)
        results.append(Result(h, modified, valid, perf))

    # Phase 3: Select
    valid_results = [r for r in results if r.valid]
    if not valid_results:
        # All failed - escalate to ideator
        new_directions = IDEATOR.search(
            problem=metrics.bottleneck,
            failures=[r.hypothesis for r in results],
        )
        worklog.add_failure_batch(results, new_directions)
        continue

    best = select_best(valid_results)
    current_state = best.modified
    worklog.add_success(best)
Parallelism Strategy
Isolated Copies: Never let parallel attempts contaminate each other.
main_branch/
├── attempt_7_tiling/ # Executor A working here
├── attempt_8_fusion/ # Executor B working here
└── attempt_9_unrolling/ # Executor C working here
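The isolate() step can be as simple as copying the working tree into a throwaway directory; a standard-library-only sketch:

```python
import pathlib
import shutil
import tempfile

def isolate(src_dir):
    # Copy the current state into a fresh directory so parallel
    # attempts can never contaminate each other or the main branch
    dst = pathlib.Path(tempfile.mkdtemp(prefix="attempt_")) / "work"
    shutil.copytree(src_dir, dst)
    return dst
```

For large trees, git worktrees or copy-on-write filesystem snapshots achieve the same isolation far more cheaply than a full copy.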
Combine Wins: After validation, combine successful independent changes.
Conflict Resolution: If two wins are incompatible, try:
- Sequential application (A then B, or B then A)
- A hybrid approach
- Keeping only the higher-impact one
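Sequential application can be tested mechanically: run both orders on isolated copies and keep whichever result measures best. In this sketch the patches and the measure callable stand in for your executor and validator:

```python
import copy

def resolve_conflict(base, patch_a, patch_b, measure):
    # Try A-then-B and B-then-A on isolated copies of the state;
    # keep the order with the best (lowest) metric, or base if neither helps
    best_state, best_score = base, measure(base)
    for order in ((patch_a, patch_b), (patch_b, patch_a)):
        state = copy.deepcopy(base)
        for patch in order:
            state = patch(state)
        score = measure(state)
        if score < best_score:
            best_state, best_score = state, score
    return best_state, best_score

# Toy example: two "optimizations" that interact on a latency parameter
halve = lambda s: {**s, "latency": s["latency"] // 2}
minus2 = lambda s: {**s, "latency": s["latency"] - 2}
print(resolve_conflict({"latency": 10}, halve, minus2, lambda s: s["latency"]))
# ({'latency': 3}, 3): halving first, then subtracting, wins
```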
Bottleneck Hierarchy
Fix problems in the right order. Don't micro-optimize a broken architecture.
Universal Hierarchy

L0: Correctness → Is it right?
L1: Algorithm → Is complexity optimal?
L2: Data structure → Are operations efficient?
L3: Memory → Is cache behavior good?
L4: Parallelism → Is utilization high?
L5: Instructions → Are ops efficient?

Fix in order. Don't skip levels.
How to Identify Current Level
Profile first. Look for:
- Algorithm problems: Metrics scale worse than expected with input size
- Data structure problems: Operations that should be O(1) are O(n)
- Memory problems: Low cache hit rate, high memory bandwidth
- Parallelism problems: Low CPU/GPU utilization, high contention
- Instruction problems: High CPI, specific unit saturation
Don't guess. Measure.
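One cheap scaling check: fit the slope of log(time) against log(n). A slope near 2 on a workload you expected to be linear flags an algorithm-level problem. A sketch (function name and sample numbers are illustrative):

```python
import math

def empirical_order(sizes, times):
    # Least-squares slope of log(time) vs log(n),
    # which approximates the polynomial order of growth
    pts = [(math.log(n), math.log(t)) for n, t in zip(sizes, times)]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    num = sum((x - mx) * (y - my) for x, y in pts)
    den = sum((x - mx) ** 2 for x, _ in pts)
    return num / den

# Runtime quadruples each time n doubles: order ~2, not the O(n) you hoped for
print(round(empirical_order([1000, 2000, 4000], [1.0, 4.0, 16.0]), 2))  # 2.0
```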
The Wall Protocol
When you're stuck, don't thrash. Systematically expand your search.
Stuck Detection
You are stuck when:
- 3+ attempts with no improvement
- Profiler shows clear bottleneck but no obvious fix
- Within 2x of floor but can't close gap
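The first condition is easy to automate; a sketch (the window and threshold are arbitrary defaults, tune per project):

```python
def is_stuck(deltas, window=3, min_improvement_pct=1.0):
    # deltas: percent change per attempt (negative means faster/better).
    # Stuck = the last `window` attempts all failed to improve the metric
    # by at least `min_improvement_pct` percent.
    if len(deltas) < window:
        return False
    return all(d > -min_improvement_pct for d in deltas[-window:])

print(is_stuck([-5.0, 0.2, 0.1, -0.3]))  # True: three non-improvements in a row
print(is_stuck([0.2, -4.0, 0.1]))        # False: a real win inside the window
```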
Wall Protocol Steps

1. Step back: re-check assumptions
2. Widen: search the literature
3. Simplify: build a minimal repro case
4. Escalate: get a fresh perspective
5. Pivot: try an alternative approach
Ideator Query Templates
# Bottleneck-focused:
"{bottleneck_type} optimization techniques {domain}"
"reducing {resource} usage in {algorithm/system}"
# Reference-seeking:
"{similar_system} implementation {vendor/project}"
"state of the art {problem_type} {year}"
# Alternative approaches:
"alternative to {current_approach} for {goal}"
"{goal} without {constraint_you_assumed}"
Optimization Patterns Library

A handful of patterns recur across every domain:

- Batching: amortize per-operation overhead across many operations
- Caching: store results and reuse them
- Precompute: move work from runtime to build time
- Lazy: defer work until it's actually needed
- Fusion: combine multiple passes into one
- Trade: spend a plentiful resource to save a scarce one
- Avoid: eliminate the work entirely
Documentation for Continuity
The goal is not just to solve the problem, but to make the solution understandable and reproducible.
Causal Chain Documentation
After achieving a result, explain HOW you got there:
## Optimization Journey: [X] → [Y] ([Z]x improvement)

### Stage 1: [First Major Win]
- What: Description of change
- Why it helped: Causal mechanism
- Impact: [metric_before] → [metric_after]
- Enabled: What this unlocked

### Stage 2: [Second Major Win]
...

### Interactions
- [Optimization A] + [Optimization B] = synergy because [reason]
- [Optimization C] blocks [Optimization D] because [reason]

### Dead Ends
- Tried [X], failed because [Y]
- [Approach Z] looked promising but [limitation]

### Remaining Headroom
- Current: [metric]
- Floor: [theoretical]
- Gap explained by: [reasons]
- Would need [X] to close further
Why Document
- Future you — Will forget why you did things
- Other agents — Need context to continue
- Verification — Reviewers need to understand choices
- Learning — Patterns extracted become reusable
Memory System
Don't use markdown files. Use a queryable database with Zettelkasten-style linking.
Architecture: SQLite + Embeddings + Links
Schema
-- Core worklog for debugging attempts
CREATE TABLE attempts (
id INTEGER PRIMARY KEY,
session_id TEXT NOT NULL,
hypothesis TEXT NOT NULL,
change_description TEXT,
baseline_metric REAL,
result_metric REAL,
delta_pct REAL,
verdict TEXT, -- keep|revert|partial|pending
explanation TEXT,
unlocks TEXT,
blocks TEXT,
created_at TIMESTAMP,
embedding BLOB -- for semantic search
);
-- Zettelkasten-style links between memories
CREATE TABLE links (
from_id INTEGER,
from_type TEXT, -- 'attempt' or 'finding'
to_id INTEGER,
to_type TEXT,
relation TEXT, -- enables|blocks|similar|contradicts
note TEXT
);
-- Fast tag-based filtering
CREATE TABLE tags (
memory_id INTEGER,
memory_type TEXT,
tag TEXT
);
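This schema runs unchanged on Python's built-in sqlite3; a minimal sketch of initializing and querying it directly (the embedding column is omitted here for brevity, and AgentMemory in the Usage section is the document's own wrapper over calls like these):

```python
import sqlite3

# Condensed version of the schema above
SCHEMA = """
CREATE TABLE IF NOT EXISTS attempts (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    hypothesis TEXT NOT NULL,
    delta_pct REAL,
    verdict TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS links (
    from_id INTEGER, from_type TEXT,
    to_id INTEGER, to_type TEXT,
    relation TEXT, note TEXT
);
CREATE TABLE IF NOT EXISTS tags (
    memory_id INTEGER, memory_type TEXT, tag TEXT
);
"""

def open_memory(path):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

conn = open_memory(":memory:")
conn.execute(
    "INSERT INTO attempts (session_id, hypothesis, delta_pct, verdict)"
    " VALUES (?, ?, ?, ?)",
    ("cuda_opt_001", "32x32 tiling improves L2 hit rate", -37.4, "keep"),
)
wins = conn.execute(
    "SELECT hypothesis FROM attempts WHERE verdict = 'keep'"
).fetchall()
print(wins)  # [('32x32 tiling improves L2 hit rate',)]
```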
Usage
from agent_memory import AgentMemory

mem = AgentMemory("./session.db")

# Log an attempt
attempt_id = mem.log_attempt(
    session_id="cuda_opt_001",
    hypothesis="Tiling with 32x32 blocks improves L2 hit rate",
    change="Added TILE_SIZE=32 loop",
    baseline=142.5,
    result=89.2,
    verdict="keep",
    explanation="40% reduction in L2 misses",
    tags=["cache", "tiling", "gpu"],
)

# Link to previous attempt
mem.add_link(
    from_id=attempt_id, from_type="attempt",
    to_id=5, to_type="attempt",
    relation="enables",
    note="Tiling enables further fusion",
)

# Query: all wins with 'cache' tag
wins = mem.get_attempts(verdict="keep", tag="cache")

# Query: what does attempt 7 enable?
enabled = mem.get_linked(7, "attempt", relation="enables")

# Export session as markdown (for humans)
report = mem.export_session_markdown("cuda_opt_001")
Why This Works
| Feature | Benefit |
|---|---|
| SQLite | Portable (single file), queryable, auditable, no separate service to run or pay for |
| Embeddings | Semantic search: "find attempts similar to cache optimization" |
| Links | Knowledge graph: trace enables/blocks relationships |
| Tags | Fast filtering without full-text search |
| Sessions | Isolate different optimization projects |
Link Relations
- enables → This discovery enables that approach
- blocks → This prevents that from working
- similar → These are conceptually related
- contradicts → These findings conflict
- supports → This evidence supports that hypothesis
- refines → This is a more detailed version
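Because enables is transitive, it's useful to be able to walk the whole chain; a small BFS sketch over (from_id, relation, to_id) edges (the data is illustrative):

```python
def transitively_enabled(edges, root):
    # BFS over 'enables' edges: everything the root attempt ultimately unlocks
    frontier, reached = [root], set()
    while frontier:
        node = frontier.pop()
        for src, relation, dst in edges:
            if src == node and relation == "enables" and dst not in reached:
                reached.add(dst)
                frontier.append(dst)
    return reached

edges = [(1, "enables", 2), (2, "enables", 3), (2, "blocks", 4)]
print(transitively_enabled(edges, 1))  # {2, 3}
```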
Quick Reference Card
The Optimization Loop

1. BOUND: What's theoretically possible? Floor = max(resource_floors).
2. MEASURE: What's actually happening? Profile before guessing.
3. GAP: Why is actual > theoretical? Decompose into components.
4. PLAN: What hypothesis to test? Use the PLANNER agent. Prioritize.
5. EXECUTE: Implement on an isolated copy. Never corrupt the main branch.
6. VALIDATE: Correctness first, then perf. Fast + wrong = worthless.
7. LOG: Record everything. Trust the worklog. Your memory lies.
8. ITERATE: Until at floor or stuck. If stuck → Wall Protocol.
Agent Roles
- PLANNER: Strategy, hypotheses (slow, smart)
- EXECUTOR: Implementation (fast, parallel)
- VALIDATOR: Correctness + metrics (thorough)
- IDEATOR: Literature, alternatives (when stuck)
Bottleneck Hierarchy
L0: Correctness → Is it right?
L1: Algorithm → Is complexity optimal?
L2: Data structure → Are operations efficient?
L3: Memory → Is cache behavior good?
L4: Parallelism → Is utilization high?
L5: Instructions → Are ops efficient?

FIX IN ORDER. DON'T SKIP LEVELS.
Wall Protocol
3+ failures → STOP thrashing

1. STEP BACK: Check assumptions
2. WIDEN: Search literature
3. SIMPLIFY: Minimal repro case
4. ESCALATE: Fresh perspective
5. PIVOT: Alternative approach
Optimization Patterns
- Batching: Amortize overhead
- Caching: Store & reuse
- Precompute: Build-time vs runtime
- Lazy: Defer until needed
- Fusion: Single pass
- Trade: Balance resources
- Avoid: Don't do the work
Domain Instantiation
To apply this framework to a new domain:
- Identify the metric — What are you optimizing? (time, memory, cost, ...)
- Enumerate resources — What are the physical/mathematical constraints?
- Define floors — For each resource, what's the theoretical minimum?
- Build profiling — How do you measure what's happening?
- Catalog patterns — What are known optimization techniques in this domain?
- Set up isolation — How do you safely experiment without breaking things?
- Establish validation — How do you verify correctness?
Once you have these, the meta-pattern applies directly.
Example: GPU kernels
Metric: Execution time (μs)
Resources: Memory bandwidth, compute, occupancy
Floors: BW floor = bytes / peak_bw
Compute floor = FLOPs / peak_throughput
Profiling: Nsight Compute, nvprof
Patterns: Coalescing, tiling, shared mem, warp shuffle
Isolation: Separate kernel files, benchmark harness
Validation: Diff against reference implementation
Example: Database queries
Metric: Query latency (ms)
Resources: Disk I/O, CPU, memory
Floors: I/O floor = pages * page_read_time
CPU floor = rows * row_process_time
Profiling: EXPLAIN ANALYZE, pg_stat_statements
Patterns: Indexing, denormalization, query rewrite
Isolation: Test database, query sandbox
Validation: Result set comparison, row counts
Example: Build systems
Metric: Build time (seconds)
Resources: CPU cores, disk I/O, network (for deps)
Floors: Critical path through dependency graph
Parallel floor = total_work / cores
Profiling: Build traces, timing logs
Patterns: Caching, incremental builds, parallelization
Isolation: Clean build environment, Docker
Validation: Artifact hashes, test suite
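The build-system floors above (critical path vs. parallel work) can be computed directly from the dependency graph; a sketch with illustrative durations:

```python
from functools import lru_cache

def build_floor(duration, deps, cores):
    # Critical-path floor: the longest dependency chain, in seconds
    @lru_cache(maxsize=None)
    def finish(target):
        return duration[target] + max(
            (finish(d) for d in deps.get(target, ())), default=0.0
        )
    critical_path = max(finish(t) for t in duration)
    # Parallel floor: total work spread perfectly across all cores
    parallel = sum(duration.values()) / cores
    return max(critical_path, parallel)

# c depends on a and b; the chain a->c (3s + 4s) dominates 9s / 4 cores
print(build_floor({"a": 3.0, "b": 2.0, "c": 4.0}, {"c": ("a", "b")}, cores=4))  # 7.0
```

No amount of extra parallelism helps once the critical path is the binding floor; only shortening the chain itself does.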