The Meta-Pattern
Every hard optimization problem has the same structure: a measurable metric, hard physical or mathematical constraints that imply a theoretical floor, and a gap between actual performance and that floor which you close iteratively.
This works for GPU kernels, compiler optimization, database queries, network throughput, build times, algorithm complexity — anything with a measurable metric and physical/mathematical constraints.
Theoretical Floors First
Before optimizing anything, calculate what's physically possible.
The Floor Formula

For each resource that could bind, compute the minimum time it imposes; the overall floor is the maximum of these:

floor = max(resource_floors)

Domain Examples
Network:
    data_floor    = total_bytes / bandwidth
    latency_floor = round_trips * RTT
    floor         = max(data_floor, latency_floor)

Database:
    io_floor  = pages_to_read / pages_per_second
    cpu_floor = rows_to_process / rows_per_second
    floor     = max(io_floor, cpu_floor)

Build:
    compile_floor = total_units / (cores * rate)
    link_floor    = total_link_work / link_rate
    floor         = max(compile_floor, link_floor)

Algorithm:
    floor = theoretical_complexity(input_size)  # e.g., O(n log n) for comparison sort
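These floors can be computed mechanically. A minimal sketch for the network case (the function name and the numbers are illustrative, not part of the framework):

```python
def transfer_floor(total_bytes, bandwidth_bps, round_trips, rtt_s):
    # Bandwidth-bound component: time to push the bytes through the link
    data_floor = total_bytes / bandwidth_bps
    # Latency-bound component: serialized round trips
    latency_floor = round_trips * rtt_s
    # The binding resource determines the floor
    return max(data_floor, latency_floor)

# 1 GB over 1 Gbit/s (125 MB/s) with 10 round trips at 50 ms RTT:
# the 8.0 s data floor dominates the 0.5 s latency floor
print(transfer_floor(1e9, 125e6, 10, 0.050))  # 8.0
```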
Why This Matters
If your floor is 100ms and you're at 150ms, you have 50ms of optimization headroom. If your floor is 100ms and you're at 10s, you have catastrophic structural problems.
The ratio actual/floor tells you which situation you're in:
| Ratio | Situation | Action |
|---|---|---|
| 1.0 - 1.5x | Near optimal | Micro-optimizations only |
| 1.5 - 3x | Scheduling/pipelining problems | Mid-level optimizations |
| 3x - 10x | Architectural problems | Need structural changes |
| >10x | Fundamental algorithmic problems | Rethink the approach |
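The table folds into a tiny triage helper; the thresholds and wording below follow the table directly:

```python
def diagnose(actual, floor):
    # Map the actual/floor ratio onto the triage table above
    ratio = actual / floor
    if ratio <= 1.5:
        return "near optimal: micro-optimizations only"
    if ratio <= 3:
        return "scheduling/pipelining: mid-level optimizations"
    if ratio <= 10:
        return "architectural: structural changes needed"
    return "algorithmic: rethink the approach"

print(diagnose(150, 100))     # near optimal: micro-optimizations only
print(diagnose(10_000, 100))  # algorithmic: rethink the approach
```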
Structured Worklog
Your memory is unreliable. The worklog is not.
Schema
## Attempt [N]: [Descriptive Name]

Hypothesis: What you believe is causing the gap, stated falsifiably.
Change: Concrete, minimal modification to test hypothesis.
Baseline: [metric before]
Result: [metric after]
Delta: [+X% / -X%]
Verdict: KEEP | REVERT | PARTIAL
Explanation: Why did this work/fail? What does it tell you about the system?
Unlocks: What new optimizations does this enable?
Blocks: What does this prevent or make harder?
Worklog Discipline
- Write BEFORE implementing — Forces clear hypothesis formation
- Write AFTER measuring — Captures actual results, not assumptions
- Never edit old entries — Append corrections, don't rewrite history
- Tag dependencies — "Requires Attempt 5" / "Incompatible with Attempt 3"
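One way to enforce the append-only discipline is to write each attempt as a JSON line; this sketch (the file format and field names are assumptions, not part of the framework) computes the delta automatically:

```python
import json
import time

def log_attempt(path, hypothesis, change, baseline, result, verdict, **extra):
    # Append-only: corrections become new entries; history is never rewritten
    entry = {
        "ts": time.time(),
        "hypothesis": hypothesis,
        "change": change,
        "baseline": baseline,
        "result": result,
        "delta_pct": round(100 * (result - baseline) / baseline, 1),
        "verdict": verdict,
        **extra,  # e.g., unlocks=..., blocks=..., requires=...
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Because entries are only ever appended, replaying the file top to bottom reconstructs the full decision history in order.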
Why This Works
Optimization is not linear. You will:
- Try things that fail
- Find combinations that interact unexpectedly
- Revisit abandoned approaches after other changes
The worklog lets you:
- Avoid repeating failed experiments
- Understand interaction effects
- Reconstruct the reasoning for any decision
Agent Architecture
Role Decomposition
PLANNER
Input: Current state, metrics, worklog
Output: Prioritized hypothesis list
When: Start, milestones, when stuck

EXECUTOR
Input: Single hypothesis
Output: Modified code/config
Key: Always work on an ISOLATED COPY

VALIDATOR
Input: Original + modified versions
Output: Correctness, performance
When: After every attempt

IDEATOR
Input: Problem, failed approaches
Output: Alternative techniques
When: 3+ failed attempts
Coordination Protocol
while not (at_floor or out_of_time):
    # Phase 1: Plan
    hypotheses = PLANNER.analyze(current_state, metrics, worklog)

    # Phase 2: Execute in parallel
    # (Result is a small record: hypothesis, modified, valid, perf)
    results = []
    for h in hypotheses[:K]:  # K = parallelism budget
        copy = isolate(current_state)
        modified = EXECUTOR.implement(copy, h)
        valid, perf = VALIDATOR.check(current_state, modified)
        results.append(Result(h, modified, valid, perf))

    # Phase 3: Select
    valid_results = [r for r in results if r.valid]
    if not valid_results:
        # All failed - escalate to ideator
        new_directions = IDEATOR.search(
            problem=metrics.bottleneck,
            failures=[r.hypothesis for r in results],
        )
        worklog.add_failure_batch(results, new_directions)
        continue

    best = select_best(valid_results)
    current_state = best.modified
    worklog.add_success(best)
Parallelism Strategy
Isolated Copies: Never let parallel attempts contaminate each other.
main_branch/
├── attempt_7_tiling/ # Executor A working here
├── attempt_8_fusion/ # Executor B working here
└── attempt_9_unrolling/ # Executor C working here
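The isolate() step can be as simple as copying the working tree into a throwaway directory; a standard-library-only sketch:

```python
import pathlib
import shutil
import tempfile

def isolate(src_dir):
    # Copy the current state into a fresh directory so parallel
    # attempts can never contaminate each other or the main branch
    dst = pathlib.Path(tempfile.mkdtemp(prefix="attempt_")) / "work"
    shutil.copytree(src_dir, dst)
    return dst
```

For large trees, git worktrees or copy-on-write filesystem snapshots achieve the same isolation far more cheaply than a full copy.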
Combine Wins: After validation, combine successful independent changes.
Conflict Resolution: If two wins are incompatible, try:
- Sequential application (A then B, or B then A)
- A hybrid approach
- Keeping only the higher-impact one
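Sequential application can be tested mechanically: run both orders on isolated copies and keep whichever result measures best. In this sketch the patches and the measure callable stand in for your executor and validator:

```python
import copy

def resolve_conflict(base, patch_a, patch_b, measure):
    # Try A-then-B and B-then-A on isolated copies of the state;
    # keep the order with the best (lowest) metric, or base if neither helps
    best_state, best_score = base, measure(base)
    for order in ((patch_a, patch_b), (patch_b, patch_a)):
        state = copy.deepcopy(base)
        for patch in order:
            state = patch(state)
        score = measure(state)
        if score < best_score:
            best_state, best_score = state, score
    return best_state, best_score

# Toy example: two "optimizations" that interact on a latency parameter
halve = lambda s: {**s, "latency": s["latency"] // 2}
minus2 = lambda s: {**s, "latency": s["latency"] - 2}
print(resolve_conflict({"latency": 10}, halve, minus2, lambda s: s["latency"]))
# ({'latency': 3}, 3): halving first, then subtracting, wins
```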
Bottleneck Hierarchy
Fix problems in the right order. Don't micro-optimize a broken architecture.
Universal Hierarchy

L0: Correctness → Is it right?
L1: Algorithm → Is complexity optimal?
L2: Data structure → Are operations efficient?
L3: Memory → Is cache behavior good?
L4: Parallelism → Is utilization high?
L5: Instructions → Are ops efficient?

Fix in order. Don't skip levels.
How to Identify Current Level
Profile first. Look for:
- Algorithm problems: Metrics scale worse than expected with input size
- Data structure problems: Operations that should be O(1) are O(n)
- Memory problems: Low cache hit rate, high memory bandwidth
- Parallelism problems: Low CPU/GPU utilization, high contention
- Instruction problems: High CPI, specific unit saturation
Don't guess. Measure.
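One cheap scaling check: fit the slope of log(time) against log(n). A slope near 2 on a workload you expected to be linear flags an algorithm-level problem. A sketch (function name and sample numbers are illustrative):

```python
import math

def empirical_order(sizes, times):
    # Least-squares slope of log(time) vs log(n),
    # which approximates the polynomial order of growth
    pts = [(math.log(n), math.log(t)) for n, t in zip(sizes, times)]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    num = sum((x - mx) * (y - my) for x, y in pts)
    den = sum((x - mx) ** 2 for x, _ in pts)
    return num / den

# Runtime quadruples each time n doubles: order ~2, not the O(n) you hoped for
print(round(empirical_order([1000, 2000, 4000], [1.0, 4.0, 16.0]), 2))  # 2.0
```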
The Wall Protocol
When you're stuck, don't thrash. Systematically expand your search.
Stuck Detection
You are stuck when:
- 3+ attempts with no improvement
- Profiler shows clear bottleneck but no obvious fix
- Within 2x of floor but can't close gap
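The first condition is easy to automate; a sketch (the window and threshold are arbitrary defaults, tune per project):

```python
def is_stuck(deltas, window=3, min_improvement_pct=1.0):
    # deltas: percent change per attempt (negative means faster/better).
    # Stuck = the last `window` attempts all failed to improve the metric
    # by at least `min_improvement_pct` percent.
    if len(deltas) < window:
        return False
    return all(d > -min_improvement_pct for d in deltas[-window:])

print(is_stuck([-5.0, 0.2, 0.1, -0.3]))  # True: three non-improvements in a row
print(is_stuck([0.2, -4.0, 0.1]))        # False: a real win inside the window
```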
Wall Protocol Steps

1. Step back: re-check assumptions
2. Widen: search the literature
3. Simplify: build a minimal repro case
4. Escalate: get a fresh perspective
5. Pivot: try an alternative approach
Ideator Query Templates
# Bottleneck-focused:
"{bottleneck_type} optimization techniques {domain}"
"reducing {resource} usage in {algorithm/system}"
# Reference-seeking:
"{similar_system} implementation {vendor/project}"
"state of the art {problem_type} {year}"
# Alternative approaches:
"alternative to {current_approach} for {goal}"
"{goal} without {constraint_you_assumed}"
Optimization Patterns Library

A handful of patterns recur across every domain:

- Batching: amortize per-operation overhead across many operations
- Caching: store results and reuse them
- Precompute: move work from runtime to build time
- Lazy: defer work until it's actually needed
- Fusion: combine multiple passes into one
- Trade: spend a plentiful resource to save a scarce one
- Avoid: eliminate the work entirely
Documentation for Continuity
The goal is not just to solve the problem, but to make the solution understandable and reproducible.
Causal Chain Documentation
After achieving a result, explain HOW you got there:
## Optimization Journey: [X] → [Y] ([Z]x improvement)

### Stage 1: [First Major Win]
- What: Description of change
- Why it helped: Causal mechanism
- Impact: [metric_before] → [metric_after]
- Enabled: What this unlocked

### Stage 2: [Second Major Win]
...

### Interactions
- [Optimization A] + [Optimization B] = synergy because [reason]
- [Optimization C] blocks [Optimization D] because [reason]

### Dead Ends
- Tried [X], failed because [Y]
- [Approach Z] looked promising but [limitation]

### Remaining Headroom
- Current: [metric]
- Floor: [theoretical]
- Gap explained by: [reasons]
- Would need [X] to close further
Why Document
- Future you — Will forget why you did things
- Other agents — Need context to continue
- Verification — Reviewers need to understand choices
- Learning — Patterns extracted become reusable
Memory System
Don't use markdown files. Use a queryable database with Zettelkasten-style linking.
Architecture: SQLite + Embeddings + Links
Schema
-- Core worklog for debugging attempts
CREATE TABLE attempts (
id INTEGER PRIMARY KEY,
session_id TEXT NOT NULL,
hypothesis TEXT NOT NULL,
change_description TEXT,
baseline_metric REAL,
result_metric REAL,
delta_pct REAL,
verdict TEXT, -- keep|revert|partial|pending
explanation TEXT,
unlocks TEXT,
blocks TEXT,
created_at TIMESTAMP,
embedding BLOB -- for semantic search
);
-- Zettelkasten-style links between memories
CREATE TABLE links (
from_id INTEGER,
from_type TEXT, -- 'attempt' or 'finding'
to_id INTEGER,
to_type TEXT,
relation TEXT, -- enables|blocks|similar|contradicts
note TEXT
);
-- Fast tag-based filtering
CREATE TABLE tags (
memory_id INTEGER,
memory_type TEXT,
tag TEXT
);
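This schema runs unchanged on Python's built-in sqlite3; a minimal sketch of initializing and querying it directly (the embedding column is omitted here for brevity, and AgentMemory in the Usage section is the document's own wrapper over calls like these):

```python
import sqlite3

# Condensed version of the schema above
SCHEMA = """
CREATE TABLE IF NOT EXISTS attempts (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    hypothesis TEXT NOT NULL,
    delta_pct REAL,
    verdict TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS links (
    from_id INTEGER, from_type TEXT,
    to_id INTEGER, to_type TEXT,
    relation TEXT, note TEXT
);
CREATE TABLE IF NOT EXISTS tags (
    memory_id INTEGER, memory_type TEXT, tag TEXT
);
"""

def open_memory(path):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

conn = open_memory(":memory:")
conn.execute(
    "INSERT INTO attempts (session_id, hypothesis, delta_pct, verdict)"
    " VALUES (?, ?, ?, ?)",
    ("cuda_opt_001", "32x32 tiling improves L2 hit rate", -37.4, "keep"),
)
wins = conn.execute(
    "SELECT hypothesis FROM attempts WHERE verdict = 'keep'"
).fetchall()
print(wins)  # [('32x32 tiling improves L2 hit rate',)]
```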
Usage
from agent_memory import AgentMemory

mem = AgentMemory("./session.db")

# Log an attempt
attempt_id = mem.log_attempt(
    session_id="cuda_opt_001",
    hypothesis="Tiling with 32x32 blocks improves L2 hit rate",
    change="Added TILE_SIZE=32 loop",
    baseline=142.5,
    result=89.2,
    verdict="keep",
    explanation="40% reduction in L2 misses",
    tags=["cache", "tiling", "gpu"],
)

# Link to previous attempt
mem.add_link(
    from_id=attempt_id, from_type="attempt",
    to_id=5, to_type="attempt",
    relation="enables",
    note="Tiling enables further fusion",
)

# Query: all wins with 'cache' tag
wins = mem.get_attempts(verdict="keep", tag="cache")

# Query: what does attempt 7 enable?
enabled = mem.get_linked(7, "attempt", relation="enables")

# Export session as markdown (for humans)
report = mem.export_session_markdown("cuda_opt_001")
Why This Works
| Feature | Benefit |
|---|---|
| SQLite | Portable (single file), queryable, auditable, no separate service to run or pay for |
| Embeddings | Semantic search: "find attempts similar to cache optimization" |
| Links | Knowledge graph: trace enables/blocks relationships |
| Tags | Fast filtering without full-text search |
| Sessions | Isolate different optimization projects |
Link Relations
- enables → This discovery enables that approach
- blocks → This prevents that from working
- similar → These are conceptually related
- contradicts → These findings conflict
- supports → This evidence supports that hypothesis
- refines → This is a more detailed version
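Because enables is transitive, it's useful to be able to walk the whole chain; a small BFS sketch over (from_id, relation, to_id) edges (the data is illustrative):

```python
def transitively_enabled(edges, root):
    # BFS over 'enables' edges: everything the root attempt ultimately unlocks
    frontier, reached = [root], set()
    while frontier:
        node = frontier.pop()
        for src, relation, dst in edges:
            if src == node and relation == "enables" and dst not in reached:
                reached.add(dst)
                frontier.append(dst)
    return reached

edges = [(1, "enables", 2), (2, "enables", 3), (2, "blocks", 4)]
print(transitively_enabled(edges, 1))  # {2, 3}
```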
Quick Reference Card
The Optimization Loop

1. BOUND: What's theoretically possible? Floor = max(resource_floors).
2. MEASURE: What's actually happening? Profile before guessing.
3. GAP: Why is actual > theoretical? Decompose into components.
4. PLAN: What hypothesis to test? Use the PLANNER agent. Prioritize.
5. EXECUTE: Implement on an isolated copy. Never corrupt the main branch.
6. VALIDATE: Correctness first, then perf. Fast + wrong = worthless.
7. LOG: Record everything. Trust the worklog. Your memory lies.
8. ITERATE: Until at floor or stuck. If stuck → Wall Protocol.
Agent Roles
- PLANNER: Strategy, hypotheses (slow, smart)
- EXECUTOR: Implementation (fast, parallel)
- VALIDATOR: Correctness + metrics (thorough)
- IDEATOR: Literature, alternatives (when stuck)
Bottleneck Hierarchy
L0: Correctness → Is it right?
L1: Algorithm → Is complexity optimal?
L2: Data structure → Are operations efficient?
L3: Memory → Is cache behavior good?
L4: Parallelism → Is utilization high?
L5: Instructions → Are ops efficient?

FIX IN ORDER. DON'T SKIP LEVELS.
Wall Protocol
3+ failures → STOP thrashing

1. STEP BACK: Check assumptions
2. WIDEN: Search literature
3. SIMPLIFY: Minimal repro case
4. ESCALATE: Fresh perspective
5. PIVOT: Alternative approach
Optimization Patterns
- Batching: Amortize overhead
- Caching: Store & reuse
- Precompute: Build-time vs runtime
- Lazy: Defer until needed
- Fusion: Single pass
- Trade: Balance resources
- Avoid: Don't do the work
Domain Instantiation
To apply this framework to a new domain:
- Identify the metric — What are you optimizing? (time, memory, cost, ...)
- Enumerate resources — What are the physical/mathematical constraints?
- Define floors — For each resource, what's the theoretical minimum?
- Build profiling — How do you measure what's happening?
- Catalog patterns — What are known optimization techniques in this domain?
- Set up isolation — How do you safely experiment without breaking things?
- Establish validation — How do you verify correctness?
Once you have these, the meta-pattern applies directly.
Example: GPU kernels
Metric: Execution time (μs)
Resources: Memory bandwidth, compute, occupancy
Floors: BW floor = bytes / peak_bw
Compute floor = FLOPs / peak_throughput
Profiling: Nsight Compute, nvprof
Patterns: Coalescing, tiling, shared mem, warp shuffle
Isolation: Separate kernel files, benchmark harness
Validation: Diff against reference implementation
Example: Database queries
Metric: Query latency (ms)
Resources: Disk I/O, CPU, memory
Floors: I/O floor = pages * page_read_time
CPU floor = rows * row_process_time
Profiling: EXPLAIN ANALYZE, pg_stat_statements
Patterns: Indexing, denormalization, query rewrite
Isolation: Test database, query sandbox
Validation: Result set comparison, row counts
Example: Build systems
Metric: Build time (seconds)
Resources: CPU cores, disk I/O, network (for deps)
Floors: Critical path through dependency graph
Parallel floor = total_work / cores
Profiling: Build traces, timing logs
Patterns: Caching, incremental builds, parallelization
Isolation: Clean build environment, Docker
Validation: Artifact hashes, test suite
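The build-system floors above (critical path vs. parallel work) can be computed directly from the dependency graph; a sketch with illustrative durations:

```python
from functools import lru_cache

def build_floor(duration, deps, cores):
    # Critical-path floor: the longest dependency chain, in seconds
    @lru_cache(maxsize=None)
    def finish(target):
        return duration[target] + max(
            (finish(d) for d in deps.get(target, ())), default=0.0
        )
    critical_path = max(finish(t) for t in duration)
    # Parallel floor: total work spread perfectly across all cores
    parallel = sum(duration.values()) / cores
    return max(critical_path, parallel)

# c depends on a and b; the chain a->c (3s + 4s) dominates 9s / 4 cores
print(build_floor({"a": 3.0, "b": 2.0, "c": 4.0}, {"c": ("a", "b")}, cores=4))  # 7.0
```

No amount of extra parallelism helps once the critical path is the binding floor; only shortening the chain itself does.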