Agentic Debugging Protocol

Generalized Methodology

A systematic framework for LLM agents to tackle hard optimization and debugging problems across any domain.

🔧 Framework v2.0 🎯 For AI Agents 📐 8 Principles

Contents

§0 The Meta-Pattern
§1 Theoretical Floors First
§2 Structured Worklog
§3 Agent Architecture
§4 Bottleneck Hierarchy
§5 The Wall Protocol
§6 Optimization Patterns
§7 Documentation for Continuity
§8 Memory System
Quick Reference Card
A Domain Instantiation

The Meta-Pattern

Every hard optimization problem has the same structure:

ACTUAL_PERFORMANCE → [UNKNOWN GAP] → THEORETICAL_LIMIT

Your job is to:

  1. Bound — Establish the theoretical limit
  2. Measure — Determine actual performance
  3. Decompose — Break the gap into components
  4. Attack — Systematically close each component

This works for GPU kernels, compiler optimization, database queries, network throughput, build times, algorithm complexity — anything with a measurable metric and physical/mathematical constraints.

Theoretical Floors First

Before optimizing anything, calculate what's physically possible.

The Floor Formula

Floor = max(Resource_1_floor, Resource_2_floor, ..., Resource_N_floor)
Where each Resource_floor = Required_units / Units_per_time
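As a concrete sketch, the formula translates directly to code. The helper names and the example numbers here are illustrative, not from any particular system:

```python
def resource_floor(required_units, units_per_time):
    """Minimum time this one resource alone needs: required / rate."""
    return required_units / units_per_time

def theoretical_floor(resources):
    """Floor = max over all resource floors.

    resources: {name: (required_units, units_per_time)}
    """
    return max(resource_floor(req, rate) for req, rate in resources.values())

# Illustrative: transfer 1 GB over a 10 GB/s link, plus 20 round trips at 5 ms
floor = theoretical_floor({
    "bandwidth": (1e9, 10e9),       # bytes / (bytes per second) = 0.1 s
    "latency":   (20 * 5e-3, 1.0),  # total RTT in seconds, rate 1 = 0.1 s
})
```

Whichever resource dominates sets the floor; in this made-up example both come out to 0.1 s, so neither resource alone can be blamed.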

Domain Examples

Network Throughput
data_floor = total_bytes / bandwidth
latency_floor = round_trips * RTT
floor = max(data_floor, latency_floor)
Database Query
io_floor = pages_to_read / pages_per_second
cpu_floor = rows_to_process / rows_per_second
floor = max(io_floor, cpu_floor)
Build System
compile_floor = total_units / (cores * rate)
link_floor = total_link_work / link_rate
floor = max(compile_floor, link_floor)
Algorithm Complexity
floor = theoretical_complexity(input_size)
# e.g., O(n log n) for comparison sort

Why This Matters

If your floor is 100ms and you're at 150ms, you have 50ms of optimization headroom. If your floor is 100ms and you're at 10s, you have catastrophic structural problems.

The ratio actual/floor tells you which situation you're in:

Ratio       Situation                         Action
1.0 - 1.5x  Near optimal                      Micro-optimizations only
1.5 - 3x    Scheduling/pipelining problems    Mid-level optimizations
3x - 10x    Architectural problems            Need structural changes
>10x        Fundamental algorithmic problems  Rethink the approach
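One way to read this table programmatically. The thresholds are the table's; the label strings are my own shorthand:

```python
def diagnose(actual, floor):
    """Classify actual/floor per the ratio table (lower metric = better)."""
    ratio = actual / floor
    if ratio <= 1.5:
        return "near optimal: micro-optimizations only"
    if ratio <= 3.0:
        return "scheduling/pipelining: mid-level optimizations"
    if ratio <= 10.0:
        return "architectural: structural changes needed"
    return "algorithmic: rethink the approach"
```

For example, `diagnose(150, 100)` falls in the first band, while `diagnose(10_000, 100)` signals an algorithmic rethink.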

Structured Worklog

Your memory is unreliable. The worklog is not.

Schema

## Attempt [N]: [Descriptive Name]

Hypothesis:
What you believe is causing the gap, stated falsifiably.

Change:
Concrete, minimal modification to test hypothesis.

Baseline: [metric before]
Result:   [metric after]
Delta:    [+X% / -X%]

Verdict: KEEP | REVERT | PARTIAL

Explanation:
Why did this work/fail? What does it tell you about the system?

Unlocks:
What new optimizations does this enable?

Blocks:
What does this prevent or make harder?

Worklog Discipline

Why This Works

Optimization is not linear. You will:

- Revert changes that initially looked promising
- Revisit ideas you abandoned several attempts ago
- Forget why the current state exists

The worklog lets you:

- Avoid repeating failed experiments
- Trace which change caused which effect
- Resume after an interruption without losing context

Agent Architecture

Role Decomposition

🧠
Planner
reasoning-heavy, slow, expensive

Input: Current state, metrics, worklog
Output: Prioritized hypothesis list
When: Start, milestones, when stuck

Executor
code-generation, fast, cheap

Input: Single hypothesis
Output: Modified code/config
Key: Always work on ISOLATED COPY

Validator
systematic, thorough

Input: Original + modified versions
Output: Correctness, performance
When: After every attempt

💡
Ideator
creative, external knowledge

Input: Problem, failed approaches
Output: Alternative techniques
When: 3+ failed attempts

Coordination Protocol

Result = namedtuple("Result", "hypothesis modified valid perf")
# (from collections import namedtuple)

while not (at_floor or out_of_time):

    # Phase 1: Plan
    hypotheses = PLANNER.analyze(
        current_state, metrics, worklog
    )

    # Phase 2: Execute in parallel
    results = []
    for h in hypotheses[:K]:  # K = parallelism budget
        copy = isolate(current_state)  # isolated working copy
        modified = EXECUTOR.implement(copy, h)
        valid, perf = VALIDATOR.check(current_state, modified)
        results.append(Result(h, modified, valid, perf))

    # Phase 3: Select
    valid_results = [r for r in results if r.valid]

    if not valid_results:
        # All failed: escalate to the ideator
        new_directions = IDEATOR.search(
            problem=metrics.bottleneck,
            failures=[r.hypothesis for r in results]
        )
        worklog.add_failure_batch(results, new_directions)
        continue

    best = select_best(valid_results)
    current_state = best.modified
    worklog.add_success(best)
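`select_best` and the per-attempt records are left abstract in the loop above. A minimal concrete version, assuming the metric is "lower is better" (e.g. latency); the field names are illustrative:

```python
from collections import namedtuple

# One record per attempted hypothesis
Result = namedtuple("Result", "hypothesis modified valid perf")

def select_best(valid_results):
    """Pick the validated attempt with the best (lowest) metric."""
    return min(valid_results, key=lambda r: r.perf)

best = select_best([
    Result("tiling", "<code A>", True, 89.2),
    Result("fusion", "<code B>", True, 101.4),
])
# best.hypothesis is "tiling" (89.2 < 101.4)
```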

Parallelism Strategy

Isolated Copies: Never let parallel attempts contaminate each other.

main_branch/
    ├── attempt_7_tiling/       # Executor A working here
    ├── attempt_8_fusion/       # Executor B working here
    └── attempt_9_unrolling/    # Executor C working here

Combine Wins: After validation, combine successful independent changes.

Conflict Resolution: If two wins are incompatible, try:

  1. Sequential application (A then B, or B then A)
  2. Hybrid approach
  3. Choose the higher-impact one
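The three options can be tried mechanically. A sketch, assuming each win is a function from state to state that raises when incompatible, and `score` is lower-is-better; all names here are mine:

```python
def resolve_conflict(state, win_a, win_b, score):
    """Try A-then-B, B-then-A, then each alone; keep the best-scoring result."""
    candidates = []
    for order in [(win_a, win_b), (win_b, win_a), (win_a,), (win_b,)]:
        s = state
        try:
            for apply_win in order:
                s = apply_win(s)
        except Exception:
            continue  # this ordering is incompatible; skip it
        candidates.append(s)
    return min(candidates, key=score)
```

With toy integer "states", `resolve_conflict(10, lambda s: s - 3, lambda s: s // 2, lambda s: s)` returns 2: the B-then-A ordering wins.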

Bottleneck Hierarchy

Fix problems in the right order. Don't micro-optimize a broken architecture.

Universal Hierarchy

L0: CORRECTNESS        → Is it even doing the right thing?
L1: ALGORITHM          → Is the asymptotic complexity correct? O(n²) vs O(n log n) dominates everything else.
L2: DATA STRUCTURE     → Are operations efficient for the access patterns? Array vs hash vs tree changes constant factors 10-100x.
L3: MEMORY HIERARCHY   → Are you cache-friendly? Avoiding unnecessary copies? Cache miss vs hit: ~100x difference.
L4: PARALLELISM        → Are you using available compute? Single-threaded vs parallel: Nx difference.
L5: INSTRUCTION LEVEL  → Are individual operations efficient? Usually a 2-5x opportunity here.

How to Identify Current Level

Profile first. Look for:

- Where the time actually goes (hot functions, hot loops)
- Which resource is saturated (CPU, memory bandwidth, I/O)
- Whether cost scales with input size the way the algorithm predicts

Don't guess. Measure.
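A minimal way to "measure, don't guess" in Python is the standard-library profiler; `slow_sum` here is just a stand-in workload:

```python
import cProfile
import io
import pstats

def profile_report(fn, *args, top=5):
    """Run fn under cProfile and return the top cumulative-time entries."""
    prof = cProfile.Profile()
    prof.enable()
    fn(*args)
    prof.disable()
    buf = io.StringIO()
    pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(top)
    return buf.getvalue()

def slow_sum(n):
    return sum(i * i for i in range(n))

print(profile_report(slow_sum, 100_000))
```

The report names the functions where cumulative time concentrates, which tells you which hierarchy level to attack.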

The Wall Protocol

When you're stuck, don't thrash. Systematically expand your search.

Stuck Detection

You are stuck when:

- 3+ consecutive attempts show no improvement
- New hypotheses are rephrasings of already-failed ones
- Measured deltas are within run-to-run noise

Wall Protocol Steps

1. STEP BACK   Re-examine assumptions. Is your floor calculation correct?
2. WIDEN       Search literature, reference implementations, vendor docs.
3. SIMPLIFY    Create a minimal reproduction case. Remove complexity.
4. ESCALATE    Bring in a different model/agent. Ask a human expert.
5. PIVOT       Consider alternative approaches entirely.

Ideator Query Templates

# Bottleneck-focused:
"{bottleneck_type} optimization techniques {domain}"
"reducing {resource} usage in {algorithm/system}"

# Reference-seeking:
"{similar_system} implementation {vendor/project}"
"state of the art {problem_type} {year}"

# Alternative approaches:
"alternative to {current_approach} for {goal}"
"{goal} without {constraint_you_assumed}"

Optimization Patterns Library

Batching
Symptom: High per-item overhead dominates
Solution: Amortize overhead across multiple items
Before: for each item: setup() + work() + teardown()
After:  setup() + (for each item: work()) + teardown()
When: Setup/teardown cost > work cost
Caching / Memoization
Symptom: Repeated computation of same values
Solution: Store and reuse
Before: result = expensive(x)  // called 1000x
After:  result = cache.get_or_compute(x, expensive)
When: Computation cost > storage cost, repetition exists
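In Python the pattern is one decorator from the standard library; the call counter is only there to make the effect visible:

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def expensive(x):
    global calls
    calls += 1       # counts real computations, not cache hits
    return x * x     # stand-in for expensive work

for _ in range(1000):
    expensive(7)     # computed once; 999 cache hits

assert calls == 1
```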
Precomputation
Symptom: Runtime computation of values known at build/init time
Solution: Compute once, store forever
Before: runtime: result = compute(static_params)
After:  build:   TABLE = [compute(p) for p in all_params]
        runtime: result = TABLE[params_index]
When: Parameter space is bounded and enumerable
Lazy Evaluation
Symptom: Computing values that might not be needed
Solution: Defer until actually required
Before: all_results = [expensive(x) for x in items]
        return all_results[0] if condition else None
After:  if condition: return expensive(items[0])
        return None
When: Not all computed values are used
Fusion
Symptom: Multiple passes over same data
Solution: Combine into single pass
Before: temp = [f(x) for x in data]
        result = [g(t) for t in temp]
After:  result = [g(f(x)) for x in data]
When: Intermediate results only used immediately, memory bandwidth limited
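The before/after above, runnable; the temporary list is exactly what fusion eliminates:

```python
def f(x):
    return x + 1

def g(x):
    return x * 2

data = range(100_000)

# Two passes: materializes a temporary the size of the input
temp = [f(x) for x in data]
two_pass = [g(t) for t in temp]

# Fused: one pass, no intermediate allocation
fused = [g(f(x)) for x in data]

assert two_pass == fused
```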
Trading Resources
Symptom: One resource saturated, others idle
Solution: Transform to use idle resource
Compute ↔ Memory:      Recompute vs cache
Time ↔ Space:          Online vs batch
Accuracy ↔ Speed:      Approximate vs exact
Latency ↔ Throughput:  Streaming vs batching
When: Profiler shows imbalanced utilization
Avoiding Work
Symptom: Doing unnecessary computation
Solution: Don't do it
• Filter before transform (process only what's needed)
• Early exit (stop when the answer is known)
• Incremental update (reuse the previous result)
• Approximate (good enough, faster)
When: Always check this first. The fastest code is code that doesn't run.
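A small example of early exit: building a full list just to test for emptiness does all the work, while `any()` stops at the first match:

```python
records = [{"id": i, "flagged": i == 7} for i in range(100_000)]

# Does the work for all 100,000 records
has_flagged_slow = len([r for r in records if r["flagged"]]) > 0

# Stops after 8 records: early exit
has_flagged = any(r["flagged"] for r in records)

assert has_flagged == has_flagged_slow == True
```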

Documentation for Continuity

The goal is not just to solve the problem, but to make the solution understandable and reproducible.

Causal Chain Documentation

After achieving a result, explain HOW you got there:

## Optimization Journey: [X] → [Y] ([Z]x improvement)

### Stage 1: [First Major Win]
- What: Description of change
- Why it helped: Causal mechanism
- Impact: [metric_before] → [metric_after]
- Enabled: What this unlocked

### Stage 2: [Second Major Win]
...

### Interactions
- [Optimization A] + [Optimization B] = synergy because [reason]
- [Optimization C] blocks [Optimization D] because [reason]

### Dead Ends
- Tried [X], failed because [Y]
- [Approach Z] looked promising but [limitation]

### Remaining Headroom
- Current: [metric]
- Floor: [theoretical]
- Gap explained by: [reasons]
- Would need [X] to close further

Why Document

- Future sessions (yours or another agent's) resume without rediscovering everything
- Reviewers can check the causal chain, not just the final diff
- Dead ends recorded once don't get re-explored

Memory System

Don't use markdown files. Use a queryable database with Zettelkasten-style linking.

Architecture: SQLite + Embeddings + Links

AGENT MEMORY SYSTEM

    ATTEMPTS (worklog) ◄─► FINDINGS (exploration) ◄─► LINKS (zettelkasten)
          │                       │
          ▼                       ▼
    TAGS (fast filter)      EMBEDDINGS (semantic search)

Storage: SQLite (portable, queryable, auditable)
Search:  sqlite-vec for vector similarity
Links:   Zettelkasten-style bidirectional relations

Schema

-- Core worklog for debugging attempts
CREATE TABLE attempts (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    hypothesis TEXT NOT NULL,
    change_description TEXT,
    baseline_metric REAL,
    result_metric REAL,
    delta_pct REAL,
    verdict TEXT,  -- keep|revert|partial|pending
    explanation TEXT,
    unlocks TEXT,
    blocks TEXT,
    created_at TIMESTAMP,
    embedding BLOB   -- for semantic search
);

-- Zettelkasten-style links between memories
CREATE TABLE links (
    from_id INTEGER,
    from_type TEXT,  -- 'attempt' or 'finding'
    to_id INTEGER,
    to_type TEXT,
    relation TEXT,   -- enables|blocks|similar|contradicts
    note TEXT
);

-- Fast tag-based filtering
CREATE TABLE tags (
    memory_id INTEGER,
    memory_type TEXT,
    tag TEXT
);
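The schema runs as-is against the standard-library `sqlite3` module; this sketch trims a few columns for brevity and uses an in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path to persist across sessions
conn.executescript("""
    CREATE TABLE attempts (
        id INTEGER PRIMARY KEY,
        session_id TEXT NOT NULL,
        hypothesis TEXT NOT NULL,
        baseline_metric REAL,
        result_metric REAL,
        verdict TEXT
    );
    CREATE TABLE links (
        from_id INTEGER, from_type TEXT,
        to_id INTEGER, to_type TEXT,
        relation TEXT, note TEXT
    );
    CREATE TABLE tags (memory_id INTEGER, memory_type TEXT, tag TEXT);
""")

conn.execute(
    "INSERT INTO attempts (session_id, hypothesis, baseline_metric, "
    "result_metric, verdict) VALUES (?, ?, ?, ?, ?)",
    ("demo", "tiling improves cache hits", 142.5, 89.2, "keep"),
)
rows = conn.execute("SELECT hypothesis, verdict FROM attempts").fetchall()
# rows == [("tiling improves cache hits", "keep")]
```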

Usage

from agent_memory import AgentMemory

mem = AgentMemory("./session.db")

# Log an attempt
attempt_id = mem.log_attempt(
    session_id="cuda_opt_001",
    hypothesis="Tiling with 32x32 blocks improves L2 hit rate",
    change="Added TILE_SIZE=32 loop",
    baseline=142.5,
    result=89.2,
    verdict="keep",
    explanation="40% reduction in L2 misses",
    tags=["cache", "tiling", "gpu"]
)

# Link to previous attempt
mem.add_link(
    from_id=attempt_id, from_type="attempt",
    to_id=5, to_type="attempt",
    relation="enables",
    note="Tiling enables further fusion"
)

# Query: all wins with 'cache' tag
wins = mem.get_attempts(verdict="keep", tag="cache")

# Query: what does attempt 7 enable?
enabled = mem.get_linked(7, "attempt", relation="enables")

# Export session as markdown (for humans)
report = mem.export_session_markdown("cuda_opt_001")
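The `agent_memory` module above is not included in this document. A minimal sketch of the API it implies, without embeddings or markdown export, so the shape is concrete; column names and signatures are assumptions:

```python
import sqlite3

class AgentMemory:
    """Minimal sketch: worklog + links + tags on SQLite (no embeddings)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.executescript("""
            CREATE TABLE IF NOT EXISTS attempts (
                id INTEGER PRIMARY KEY, session_id TEXT, hypothesis TEXT,
                baseline REAL, result REAL, verdict TEXT, explanation TEXT);
            CREATE TABLE IF NOT EXISTS links (
                from_id INTEGER, from_type TEXT, to_id INTEGER,
                to_type TEXT, relation TEXT, note TEXT);
            CREATE TABLE IF NOT EXISTS tags (
                memory_id INTEGER, memory_type TEXT, tag TEXT);
        """)

    def log_attempt(self, session_id, hypothesis, baseline, result,
                    verdict, explanation="", tags=()):
        cur = self.db.execute(
            "INSERT INTO attempts (session_id, hypothesis, baseline, result, "
            "verdict, explanation) VALUES (?, ?, ?, ?, ?, ?)",
            (session_id, hypothesis, baseline, result, verdict, explanation))
        for tag in tags:
            self.db.execute("INSERT INTO tags VALUES (?, 'attempt', ?)",
                            (cur.lastrowid, tag))
        return cur.lastrowid

    def add_link(self, from_id, from_type, to_id, to_type, relation, note=""):
        self.db.execute("INSERT INTO links VALUES (?, ?, ?, ?, ?, ?)",
                        (from_id, from_type, to_id, to_type, relation, note))

    def get_attempts(self, verdict=None, tag=None):
        # NULL parameters act as wildcards
        return self.db.execute(
            "SELECT a.id, a.hypothesis FROM attempts a "
            "LEFT JOIN tags t ON t.memory_id = a.id "
            "AND t.memory_type = 'attempt' "
            "WHERE (? IS NULL OR a.verdict = ?) "
            "AND (? IS NULL OR t.tag = ?) GROUP BY a.id",
            (verdict, verdict, tag, tag)).fetchall()

    def get_linked(self, from_id, from_type, relation):
        return self.db.execute(
            "SELECT to_id, to_type, note FROM links WHERE from_id = ? "
            "AND from_type = ? AND relation = ?",
            (from_id, from_type, relation)).fetchall()
```

A production version would add the `embedding` column and vector search (e.g. via sqlite-vec), but the relational core is this small.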

Why This Works

Feature     Benefit
SQLite      Portable (single file), queryable, auditable, far cheaper to operate than a hosted vector DB
Embeddings  Semantic search: "find attempts similar to cache optimization"
Links       Knowledge graph: trace enables/blocks relationships
Tags        Fast filtering without full-text search
Sessions    Isolate different optimization projects

Link Relations

enables      A makes B possible or more profitable
blocks       A prevents B or makes it harder
similar      A and B share a mechanism or symptom
contradicts  A's result conflicts with B's

※ Quick Reference Card

The Optimization Loop

1. BOUND     What's theoretically possible?
             Floor = max(resource_floors)

2. MEASURE   What's actually happening?
             Profile before guessing.

3. GAP       Why is actual > theoretical?
             Decompose into components.

4. PLAN      What hypothesis to test?
             Use PLANNER agent. Prioritize.

5. EXECUTE   Implement on isolated copy.
             Never corrupt main branch.

6. VALIDATE  Correctness first. Then perf.
             Fast + wrong = worthless.

7. LOG       Record everything. Trust worklog.
             Your memory lies.

8. ITERATE   Until at floor or stuck.
             If stuck → Wall Protocol.

Agent Roles

PLANNER   Strategy, hypotheses (slow, smart)
EXECUTOR  Implementation (fast, parallel)
VALIDATOR Correctness + metrics (thorough)
IDEATOR   Literature, alternatives (when stuck)

Bottleneck Hierarchy

L0: Correctness    → Is it right?
L1: Algorithm      → Is complexity optimal?
L2: Data structure → Are operations efficient?
L3: Memory         → Is cache behavior good?
L4: Parallelism    → Is utilization high?
L5: Instructions   → Are ops efficient?

FIX IN ORDER. DON'T SKIP LEVELS.

Wall Protocol

3+ failures → STOP thrashing

1. STEP BACK   Check assumptions
2. WIDEN       Search literature
3. SIMPLIFY    Minimal repro case
4. ESCALATE    Fresh perspective
5. PIVOT       Alternative approach

Optimization Patterns

Batching    Amortize overhead
Caching     Store & reuse
Precompute  Build-time vs runtime
Lazy        Defer until needed
Fusion      Single pass
Trade       Balance resources
Avoid       Don't do the work

Domain Instantiation

To apply this framework to a new domain:

  1. Identify the metric — What are you optimizing? (time, memory, cost, ...)
  2. Enumerate resources — What are the physical/mathematical constraints?
  3. Define floors — For each resource, what's the theoretical minimum?
  4. Build profiling — How do you measure what's happening?
  5. Catalog patterns — What are known optimization techniques in this domain?
  6. Set up isolation — How do you safely experiment without breaking things?
  7. Establish validation — How do you verify correctness?

Once you have these, the meta-pattern applies directly.

Example: GPU Kernel Optimization
Metric:     Execution time (μs)
Resources:  Memory bandwidth, compute, occupancy
Floors:     BW floor = bytes / peak_bw
            Compute floor = FLOPs / peak_throughput
Profiling:  Nsight Compute, nvprof
Patterns:   Coalescing, tiling, shared mem, warp shuffle
Isolation:  Separate kernel files, benchmark harness
Validation: Diff against reference implementation
Example: Database Query Optimization
Metric:     Query latency (ms)
Resources:  Disk I/O, CPU, memory
Floors:     I/O floor = pages * page_read_time
            CPU floor = rows * row_process_time
Profiling:  EXPLAIN ANALYZE, pg_stat_statements
Patterns:   Indexing, denormalization, query rewrite
Isolation:  Test database, query sandbox
Validation: Result set comparison, row counts
Example: Build System Optimization
Metric:     Build time (seconds)
Resources:  CPU cores, disk I/O, network (for deps)
Floors:     Critical path through dependency graph
            Parallel floor = total_work / cores
Profiling:  Build traces, timing logs
Patterns:   Caching, incremental builds, parallelization
Isolation:  Clean build environment, Docker
Validation: Artifact hashes, test suite