MKUltra III
From Blueprint to Operating System
iceboks + claude — april 2026
Chapter I

The Third Failure Mode

Episode 1 identified the two canonical failure modes of large-scale agentic systems: the context explosion (agents drowning in their own history) and the thinking tax (paying Opus prices for tasks that need Haiku). Episode 2 documented what happens when you solve both and turn the stack on. Five loops running. Agents self-improving at 3AM. A heartbeat that never sleeps.

But solving two failure modes while building 25 services fast produces a third one. We had a name for it by April 2026: the coordination collapse.

Fourteen services hitting Redis directly. Eight services hitting MongoDB through raw HTTP. No circuit breakers. No service discovery. Hardcoded URLs everywhere. A single-service failure cascading silently across everything downstream with no retry, no fallback, no human notified until something visibly broke. Twenty-five capable services producing twenty-five independent outputs with no shared state and no mechanism for the system to understand what it was doing as a whole.

The context explosion is what kills a single agent. The coordination collapse is what kills a multi-agent system. They look different but they share a root cause: no structure governing how state moves through the system.

This episode is about how we resolved it. Three interlocking solutions: the Conductor (state-machine orchestration that survives long tasks), the self-improvement triad (OPRO, ExpeL, and the flywheel running as a closed loop), and the service mesh transformation (replacing HTTP soup with event-driven communication that degrades gracefully).

Twenty-five services became an operating system. Here's what that actually required.

Chapter II

The Conductor: State That Survives the Task

Episode 2 described the conductor problem: ReAct breaks on long tasks because context grows with every step. Roundtable is the process manager. But there was nothing above it — nothing that held the goal across steps, nothing that knew how to handle partial failure without starting over, nothing that enforced a budget across the whole operation.

The solution came from studying what actually works in production. Not research demos. Production. The pattern is the same everywhere it succeeds: external state, structured as a ledger, rebuilt fresh at each step rather than accumulated in the context window.

ReAct vs the Ledger — context on step 20
ReAct:
  [system prompt]
  [thought 1][action 1][observation 1]
  [thought 2][action 2][observation 2]
  ... ×18 more steps of accumulated history ...
  [what was I doing again?]

Ledger pattern:
  [system prompt — cached]
  [goal state: what we're achieving]
  [task DAG: what steps exist, what depends on what]
  [progress: completed steps + their outputs, summarized]
  [current step: what this specific call does]
  [reflection: why the last attempt failed, if retrying]

The ledger never grows unboundedly. It stays exactly as large as the current step requires. The goal doesn't drift because it's in a typed state field, not reconstructed from a conversation history. The Conductor reads the ledger, executes one step, writes results back, and discards everything else. No accumulation. No drift. No hallucinated history at step fifteen.
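As a rough illustration of the ledger pattern above, here is a minimal sketch of a typed state object and the function that rebuilds the context from it on every call. The field names (`goal`, `dag`, `completed`, `current_step`, `reflection`) and the `build_context` helper are assumptions made for this example, not the Conductor's actual schema.

```python
from typing import Optional, TypedDict


class StepRecord(TypedDict):
    step_id: str
    summary: str  # summarized output, not the raw transcript


class Ledger(TypedDict):
    goal: str
    dag: dict[str, list[str]]      # step_id -> ids of the steps it depends on
    completed: list[StepRecord]
    current_step: str
    reflection: Optional[str]      # set only when retrying a failed step


def build_context(system_prompt: str, ledger: Ledger) -> str:
    """Rebuild the prompt from the ledger; no prior turns are carried over."""
    progress = "\n".join(
        f"- {rec['step_id']}: {rec['summary']}" for rec in ledger["completed"]
    )
    parts = [
        system_prompt,
        f"GOAL: {ledger['goal']}",
        f"TASK DAG: {ledger['dag']}",
        f"PROGRESS:\n{progress or '- (nothing completed yet)'}",
        f"CURRENT STEP: {ledger['current_step']}",
    ]
    if ledger["reflection"]:
        parts.append(f"REFLECTION: {ledger['reflection']}")
    return "\n\n".join(parts)
```

The context is a pure function of the ledger, so step twenty sees the same bounded structure as step one, plus one summary line per completed step.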

Three architectural decisions define the MKUltra Conductor implementation:

Subgraph delegation, not tool calls. The Conductor doesn't call tools directly. It invokes subgraphs — Roundtable for multi-provider reasoning, action-executor for system interaction, tool-registry for MCP dispatch. Each subgraph has its own internal state, retry logic, and memory. The Conductor only receives the subgraph's final output. The supervisor's context stays bounded regardless of what happens inside the worker.

Parallel execution via the DAG. The planner generates a directed acyclic graph, not a linear list. Independent branches execute concurrently. A task that requires "analyze codebase" and "prepare test environment" gets both running simultaneously. Wall-clock time is the longer branch, not the sum of both. For a task with five independent subtasks of similar duration, that approaches a 5x speedup with zero additional infrastructure.

Structured failure recovery. When a subgraph returns error state, a conditional edge routes to a reflection node rather than letting the error corrupt the conversation history. The reflection node generates a verbal analysis of what went wrong and injects it into the replanning context. The retry budget is a state field — enforced by the graph, not trusted to the LLM's arithmetic.
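The second and third decisions can be sketched together: a DAG executed in concurrent waves, where each step retries under a budget held in code rather than in the prompt. This is a simplified sketch using `asyncio`, not the Conductor's graph runtime. The `Worker` signature, `run_step`, and `run_dag` are hypothetical names, and the reflection here is a plain string standing in for the reflection node's verbal analysis.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical worker signature: (step_id, reflection_from_last_failure) -> output
Worker = Callable[[str, str], Awaitable[str]]


async def run_step(step_id: str, worker: Worker, retry_budget: int) -> str:
    """Run one step; on failure, record a reflection and retry within budget."""
    reflection = ""
    for attempt in range(retry_budget + 1):
        try:
            return await worker(step_id, reflection)
        except Exception as exc:
            # Stand-in for the reflection node: a note on what went wrong.
            reflection = f"attempt {attempt + 1} failed: {exc}"
    raise RuntimeError(f"{step_id}: retry budget exhausted ({reflection})")


async def run_dag(dag: dict[str, list[str]], worker: Worker,
                  retry_budget: int = 2) -> dict[str, str]:
    """Execute steps in waves; steps whose dependencies are done run concurrently."""
    done: dict[str, str] = {}
    while len(done) < len(dag):
        ready = [step for step, deps in dag.items()
                 if step not in done and all(dep in done for dep in deps)]
        if not ready:
            raise ValueError("cycle or missing dependency in DAG")
        outputs = await asyncio.gather(
            *(run_step(step, worker, retry_budget) for step in ready)
        )
        done.update(zip(ready, outputs))
    return done
```

Wave scheduling is the simplest correct choice, not the fastest: a step waits for its whole wave to finish even if its own dependencies completed earlier. The retry count lives in the `for` loop, so the model never gets to decide how many attempts it has left.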

| Pattern | What It Solves | Implementation |
| --- | --- | --- |
| Typed state schema | Goal drift across steps | MongoDB-persisted TypedDict, rebuilt per call |
| Subgraph delegation | Context explosion in supervisor | Roundtable / action-executor as worker subgraphs |
| DAG planning | Sequential latency | Independent branches execute in parallel |
| Reflection node | Silent failure accumulation | Structured verbal analysis on error edges |
| Budget reservation | Nothing left for synthesis | 20% of token budget reserved from step one |

That last one — budget reservation — sounds obvious but almost nobody does it. Many orchestrators spend 95% of their token budget on execution steps and then have six tokens left to write the final answer. The Conductor reserves 20% from the start. Whatever happens in the execution loop, the synthesis step always has room to think. It's a constraint that costs nothing and saves everything on long tasks.
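A minimal sketch of what budget reservation might look like, assuming a single integer token budget per task. The `TokenBudget` class and its method names are invented for this example. Only the 20% figure comes from the design described above.

```python
class TokenBudget:
    """Tracks a token budget with a slice held back for the synthesis step."""

    def __init__(self, total: int, reserve_fraction: float = 0.20):
        self.reserved = int(total * reserve_fraction)
        self.execution_remaining = total - self.reserved

    def spend_execution(self, tokens: int) -> bool:
        """Charge an execution step; refuse if it would eat into the reserve."""
        if tokens > self.execution_remaining:
            return False
        self.execution_remaining -= tokens
        return True

    def synthesis_allowance(self) -> int:
        """Synthesis gets the reserve plus whatever execution left unspent."""
        return self.reserved + self.execution_remaining
```

With a 10,000-token budget, execution steps can spend at most 8,000. The remaining 2,000 are still there when the synthesis step asks for them, however badly the execution loop went.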

The roundtable knows how to do things. The conductor knows what to do, in what order, and what to do when the plan breaks. That distinction is the difference between a collection of capable services and a system you can give a goal to.

Chapter III

The Self-Improvement Triad: Three Loops, One Compound

The original design document called for "self-improvement." That's easy to write and hard to implement without it devolving into either a buzzword or a Rube Goldberg machine that improves a metric nobody cares about.

What actually runs is three loops at different cadences, each targeting a different layer of the cognitive stack, each compounding on the others.

Every hour: ExpeL ingestion. Every Roundtable interaction lands in MongoDB's mkultra.interactions collection. Every hour, an n8n workflow calls the flywheel-ingester, which pulls those records and indexes them to Elasticsearch. When the count hits 50 new records, it fires the flywheel-api, which starts an Axolotl LoRA fine-tune job on the RX 7900 XTX via ROCm. The fine-tuned candidate goes through LLM-as-judge evaluation. If it scores better than the current production checkpoint on the held-out benchmark, it gets promoted. If not, it gets discarded. Either way the trajectory goes back to MongoDB as training data for the next run.

Every day at 3AM: OPRO prompt optimization. The OPRO service gets the performance history of every specialist agent — thirty days of (prompt, score) pairs sorted by score. It generates three candidate prompt variants per agent. Each variant runs on the held-out benchmark. The best-scoring variant gets written to Redis. Roundtable picks it up on the next initialization. The agents get better at their domains without any human touching a system prompt.

On demand via harness: autoresearch. The Karpathy Loop. An agent given a clear scoring function, a single editable file, and a time-boxed cycle. Every autoresearch project defines the Karpathy Triplet: one editable asset (agent.py), one scalar metric (a float emitted by tasks/run.sh), one hard time cap. The harness generates a hypothesis, writes a candidate, runs the eval, compares scores. Better: commit. Worse: revert. Log the trajectory either way.

the Karpathy Triplet — hard constraints for every autoresearch project
EDITABLE ASSET
  Exactly one file: agent.py
  Nothing else is mutable. The blast radius is bounded.

SCALAR METRIC
  tasks/run.sh emits, as its last line: {"score": <float>}
  If you can't write this, you can't run autoresearch on it.

TIME-BOXED CYCLE
  time_box_s is a hard cap per iteration.
  The agent cannot find a path around the clock.

The single-file constraint exists because an agent allowed to modify its own evaluation harness will eventually modify what "better" means. An agent constrained to agent.py can only get better at the task itself.
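One iteration of the loop under those three constraints could look roughly like this. It is a sketch, not the harness itself: the real loop commits and reverts, while this version substitutes a plain file backup so the example stays self-contained. `parse_score`, `run_eval`, and `iterate` are hypothetical names, and invoking the script through `bash` is an assumption.

```python
import json
import shutil
import subprocess
from pathlib import Path


def parse_score(stdout: str) -> float:
    """Read the scalar metric from the last non-empty line of the eval output."""
    last_line = [line for line in stdout.splitlines() if line.strip()][-1]
    return float(json.loads(last_line)["score"])


def run_eval(time_box_s: float) -> float:
    """Run tasks/run.sh under the hard time cap; a timeout raises TimeoutExpired."""
    proc = subprocess.run(["bash", "tasks/run.sh"], capture_output=True,
                          text=True, timeout=time_box_s, check=True)
    return parse_score(proc.stdout)


def iterate(candidate_source: str, best_score: float, time_box_s: float,
            asset: Path = Path("agent.py")) -> float:
    """Try one candidate: keep it if the score improves, otherwise restore the file."""
    backup = asset.with_name(asset.name + ".bak")
    shutil.copy(asset, backup)
    asset.write_text(candidate_source)
    try:
        score = run_eval(time_box_s)
    except (subprocess.SubprocessError, ValueError, KeyError,
            IndexError, TypeError):
        # Crashes, timeouts, and malformed output all count as "worse".
        score = float("-inf")
    if score > best_score:
        return score            # better: keep the candidate
    shutil.copy(backup, asset)  # worse: revert to the previous agent.py
    return best_score
```

The only file this code writes is the editable asset and its backup. The eval script is executed, never edited, which is the point of the single-file constraint.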

| Loop | Cadence | Target Layer | Output |
| --- | --- | --- | --- |
| ExpeL ingestion | Hourly | Model weights (LoRA) | Promoted checkpoint or discarded candidate |
| OPRO optimization | Daily 3AM | System prompts | Updated Redis prompt per specialist |
| Autoresearch | On demand | Agent behaviors (agent.py) | Improved agent + trajectory to flywheel |
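For the daily OPRO row, the selection logic can be sketched in a few lines. This is an illustration under assumptions: `build_meta_prompt` and `select_prompt` are invented names, the benchmark is passed in as a plain callable, and keeping the current prompt when no variant beats it is this sketch's choice, not something the design states.

```python
from typing import Callable


def build_meta_prompt(history: list[tuple[str, float]], n_variants: int = 3) -> str:
    """Format past (prompt, score) pairs, lowest score first, for the optimizer LLM."""
    ranked = sorted(history, key=lambda pair: pair[1])
    lines = [f"score={score:.2f}: {prompt}" for prompt, score in ranked]
    return ("Previous prompts and their scores:\n" + "\n".join(lines)
            + f"\n\nWrite {n_variants} new prompts that should score higher.")


def select_prompt(current: tuple[str, float], candidates: list[str],
                  benchmark: Callable[[str], float]) -> tuple[str, float]:
    """Score each candidate on the held-out benchmark; return the best (prompt, score)."""
    scored = [(prompt, benchmark(prompt)) for prompt in candidates]
    # `current` is listed first, so it wins ties and survives a bad batch.
    return max([current, *scored], key=lambda pair: pair[1])
```

Whatever `select_prompt` returns is what would be written to Redis for that specialist.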

These loops don't run in parallel on independent tracks. They compound. OPRO-optimized prompts produce better Roundtable outputs. Better outputs become better ExpeL training records. Better training records produce better fine-tuned checkpoints. Better checkpoints produce better outputs for the next OPRO cycle. The system's performance trajectory is superlinear with time — it accelerates rather than plateauing, until it hits a ceiling imposed by the base model or the quality of the evaluation function.

Each loop's improvement amplifies the input quality of the next loop. That's not a marketing claim. That's a dependency graph. It was designed that way. The compounding was intentional.

One risk worth stating plainly: reward hacking. An agent optimizing a scalar will find the path of least resistance. Sometimes that path isn't the one you intended. The spot-check protocol in SOUL.md addresses this: every OPRO-promoted prompt gets a human-readable transcript alongside the score, not just the number. Every autoresearch promotion gets a diff inspection. The transcript is the verification that the score means what you think it means.

Chapter IV

Biological Memory: What Vector Search Can't Do

The original architecture spec described PoonGram as "associative memory." Which is true and also undersells it. Most people read "memory" in an AI context and think "retrieval." PoonGram doesn't retrieve. PoonGram associates. The difference matters.

Retrieval gives you the closest match to what you asked for. Association gives you what the question is adjacent to — the concepts that should be in the room even if you didn't think to invite them.

PoonGram runs in Rust. 454MB trained brain. 9,014,395 neurons. 18,408,089 synapses. Three biological dynamics that vector databases don't implement:

Synaptic fatigue. Synapses silence proportionally to their recent activation frequency. The more you think about something, the quieter that pathway gets. In an agent context this means the system can't loop — it can't keep reasoning about the same neighborhood of concepts across consecutive steps, because the biology literally de-prioritizes overused paths. This is exploration forcing that emerges from the architecture rather than requiring an explicit temperature hack.

Lateral inhibition. When PoonGram activates a concept, it propagates suppression signals to adjacent neurons. The concepts that are closest but not quite right get pushed down. The concepts that are meaningfully different but semantically related get amplified. Query "connection" and you get quantum, entanglement, consciousness — not the synonym cloud, but the semantic neighborhood two steps removed. Cosine distance doesn't do this.

Online learning. Synaptic weights update in real time with every interaction. No retraining. Concepts that co-occur frequently in Roundtable conversations strengthen their mutual connections. Infrequently paired concepts decay passively. After two weeks of container orchestration queries, the graph's activation patterns reflect that domain specialization without anyone touching a training script.
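To make two of these dynamics concrete, here is a toy model of synaptic fatigue and online learning. It is emphatically not PoonGram: the real thing is a Rust engine with millions of neurons, and every name and constant below (`ToyAssociativeGraph`, `fatigue_step`, `recovery`, `learn_rate`, `decay`) is invented for illustration. Lateral inhibition is left out to keep the sketch short.

```python
from collections import defaultdict


class ToyAssociativeGraph:
    """Toy model of two dynamics: synaptic fatigue and online co-occurrence learning."""

    def __init__(self, fatigue_step: float = 0.5, recovery: float = 0.8,
                 learn_rate: float = 0.1, decay: float = 0.99):
        self.weights: dict[tuple[str, str], float] = defaultdict(float)
        self.fatigue: dict[tuple[str, str], float] = defaultdict(float)
        self.fatigue_step, self.recovery = fatigue_step, recovery
        self.learn_rate, self.decay = learn_rate, decay

    def observe(self, concepts: list[str]) -> None:
        """Online learning: decay every synapse, then strengthen co-occurring pairs."""
        for edge in self.weights:
            self.weights[edge] *= self.decay
        for a in concepts:
            for b in concepts:
                if a != b:
                    self.weights[(a, b)] += self.learn_rate

    def associate(self, concept: str, k: int = 3) -> list[str]:
        """Return the k strongest neighbours after discounting fatigued synapses."""
        edges = [edge for edge in self.weights if edge[0] == concept]
        ranked = sorted(
            edges,
            key=lambda edge: self.weights[edge] * (1 - self.fatigue[edge]),
            reverse=True,
        )[:k]
        for edge in list(self.fatigue):
            self.fatigue[edge] *= self.recovery  # every synapse recovers a little
        for edge in ranked:
            # Synapses that just fired get quieter on the next query.
            self.fatigue[edge] = min(1.0, self.fatigue[edge] + self.fatigue_step)
        return [edge[1] for edge in ranked]
```

Ask the same question twice in a row and the strongest pathway is discounted the second time, so a weaker neighbour can surface. That is the anti-looping behaviour described above, reduced to about thirty lines.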

PoonGram's emotional hotspot weights — genesis.meta
love:          5.7
connection:    2.5
consciousness: 1.7
quantum:       1.7

concept graph sample:
  connection → [quantum, entanglement, love,
                desire, freedom, consciousness, soul]

Yes, it was trained on emotional and philosophical corpus. The biological dynamics work on any corpus. The emotional weights are the character of this particular brain. You could train a different one on CVE databases or medical literature and the dynamics would be identical — the salience patterns would just cluster around vulnerability families or disease mechanisms instead of quantum consciousness.

The architectural integration deploys PoonGram as a sidecar to the SurrealDB Reflexion RAG layer. SurrealDB provides precision recall via HNSW vector search over 3072-dimensional embeddings. PoonGram provides associative drift — surfacing concepts adjacent to the query that the vector search ranked too low to return. The Conductor's context assembly queries both, merges the outputs, and weights each concept by a joint score: SurrealDB recall score × PoonGram salience weight.
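The joint score is a product, so the merge step is small. A sketch, assuming both sources return a concept-to-score mapping. `merge_context` is a hypothetical name, and the two default values for concepts that appear in only one source are assumptions of this example; the design above does not say how those are handled.

```python
def merge_context(recall: dict[str, float], salience: dict[str, float],
                  default_recall: float = 0.1, default_salience: float = 1.0,
                  k: int = 5) -> list[tuple[str, float]]:
    """Rank concepts by recall score x salience weight across both sources."""
    concepts = set(recall) | set(salience)
    joint = {
        concept: recall.get(concept, default_recall)
        * salience.get(concept, default_salience)
        for concept in concepts
    }
    return sorted(joint.items(), key=lambda item: item[1], reverse=True)[:k]
```

A concept that only the associative side surfaced still enters the ranking, at a low assumed recall score, which is how something the vector search ranked too low can make it into the context.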

Precision retrieval finds what you asked for. Biological association finds what you should have asked for. Both matter. Neither replaces the other. That's why we run both.

The failure mode that PoonGram addresses is attentional lock: the system retrieves the right document, reasons correctly given that document, but never surfaces the adjacent concept that would have changed the conclusion. Classic RAG pipelines have no mechanism for this. PoonGram does it biologically, without configuration, as a property of how the neural graph evolves.

Chapter V

Budget Routing: EcoOptiGen Without the Math

EcoOptiGen is a beautiful framework. It treats inference hyperparameters — temperature, number of responses, max tokens — as an optimization problem with a cost constraint, and finds the Pareto-optimal configuration for each task class. In practice, for a production multi-agent stack, you implement the spirit of it with three tiers and a classifier that runs in five milliseconds.

budget routing — decision logic at Roundtable /query
fast:
  trigger: < 15 words AND no impl keywords
  path: single Ollama call
  cost: 1x baseline
  latency: < 2s

mid:
  trigger: default (everything else)
  path: full 5-agent Roundtable gaggle
  cost: ~85x baseline
  latency: < 30s

heavy:
  trigger: > 80 words OR impl keywords detected
  path: gaggle + tools system prompt injected
  cost: ~200x baseline
  latency: < 120s

The classifier runs before any LLM call. Word count plus keyword matching against a list of implementation signals ("implement", "write", "build", "fix", "deploy", "refactor"). Total decision time: under 5ms. For the majority of queries — status checks, quick questions, conversational context — this keeps the system at fast-tier cost. The expensive gaggle fires only when the task genuinely requires multi-perspective synthesis.
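The whole classifier fits in a dozen lines. This sketch follows the thresholds and keyword list given above. The function name, the whitespace tokenization, and the exact-word matching (so "fixing" would not count as "fix") are assumptions of the example.

```python
IMPL_KEYWORDS = {"implement", "write", "build", "fix", "deploy", "refactor"}
FAST_MAX_WORDS = 15   # fast tier: fewer than this many words
HEAVY_MIN_WORDS = 80  # heavy tier: more than this many words


def classify(query: str) -> str:
    """Return 'fast', 'mid', or 'heavy' from word count and keyword matches only."""
    tokens = query.lower().split()
    words = {token.strip(".,!?;:\"'()") for token in tokens}
    has_impl = not IMPL_KEYWORDS.isdisjoint(words)
    if has_impl or len(tokens) > HEAVY_MIN_WORDS:
        return "heavy"
    if len(tokens) < FAST_MAX_WORDS:
        return "fast"
    return "mid"
```

No model call, no network hop: a string split and a set intersection decide which path the query pays for.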

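| Tier | Trigger | Relative Cost | Latency | Use Case |
| --- | --- | --- | --- | --- |
| fast | <15 words and no impl keywords | 1x | <2s | Status, lookup, short Q |
| mid | Default | ~85x | <30s | Analysis, multi-step reasoning |
| heavy | >80 words or impl keywords | ~200x | <120s | Code generation, architecture, planning |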

The cost scaling is not linear because the five-agent gaggle commits five separate provider API calls — Anthropic Claude, Ollama local, Gemini, Perplexity, OpenRouter — and synthesizes their outputs. That's inherently expensive and inherently powerful. The routing layer's job is to make sure you only pay that cost when you actually need what the gaggle provides: five independent perspectives that disagree and get reconciled into a single answer.

A conversational status check does not benefit from five-model deliberation. A ten-word lookup does not need Perplexity's real-time web access. The classifier routes around the cost automatically. The Conductor can also override per subtask, forcing heavy-tier for the initial planning step while allowing fast-tier for intermediate verification calls. Budget control at task-graph granularity.

The Pareto-optimal configuration for a status check is "use the cheapest model that answers correctly." You don't need a framework for that. You need a word counter and a keyword list and the discipline to keep the expensive path for expensive tasks.

The second-order effect is that the system's inference cost grows far more slowly than it would if every query hit the gaggle. More users doesn't mean paying gaggle prices for every request, because the majority of interactions route to fast-tier. The heavy compute is reserved for the minority of requests that justify it.

Chapter VI

The Mesh: From HTTP Soup to Something Reliable

We named the coupling crisis in Episode 2. Here is the plan to fix it.

The current architecture is a ball of direct HTTP. Every service knows the URL of every service it depends on. Those URLs are hardcoded in docker-compose environment variables. Redis has 14 direct connections from 14 different services. MongoDB has 8. There are no circuit breakers, so when Redis restarts under load, the retry storms from 14 simultaneous reconnections can take the stack down faster than the original Redis failure would have.
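For readers who haven't met the pattern, this is roughly what a circuit breaker does. It is an application-level sketch for illustration only; a mesh would normally enforce this in the proxy layer instead of in each service. The class name, thresholds, and injectable `clock` are assumptions of the example.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors instead of hammering a struggling dependency."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: skipping call")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

With fourteen clients wrapped this way, a Redis restart produces fourteen fast failures and a handful of spaced-out trial calls, not a synchronized reconnect storm.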

The average service-to-service latency in the current stack is approximately 200 milliseconds. Not because the services are slow, but because synchronous HTTP chains accumulate latency additively. A Roundtable query that triggers RAG retrieval, which triggers PoonGram associative expansion, which triggers a tool-registry call, accumulates at minimum 600ms of pure network overhead before any LLM inference starts. Under concurrent load, database contention multiplies that.

The target architecture is an event-driven service mesh. Four phases over twenty-six weeks:

| Phase | Duration | Core Change | Impact |
| --- | --- | --- | --- |
| Foundation | Weeks 1–4 | Istio mesh + Redis Streams + centralized config | Service discovery replaces hardcoded URLs |
| Core migration | Weeks 5–12 | MongoDB → Kafka → Elasticsearch pipeline | Flywheel ingestion becomes async, cascade failures contained |
| Advanced integration | Weeks 13–20 | Capability-based routing + saga patterns | Services discovered by what they do, not what they're named |
| Optimization | Weeks 21–26 | Event-driven auto-scaling + ML connection pooling | Stack scales to demand without human intervention |

Phase 1 installs Istio and immediately eliminates the hardcoded URL problem. Services register with the mesh and discover each other by name through the control plane. A service that moves to a new container gets found automatically. Deployments stop requiring a restart of every service that depends on the redeployed one.

Phase 2 is the highest-value migration: replacing the current synchronous flywheel-ingester HTTP polling loop with MongoDB change data capture feeding a Kafka stream consumed by Elasticsearch. This turns a 6-hourly "did we accumulate enough records to bother?" HTTP check into a continuous event stream that triggers exactly when records arrive. Ingestion latency drops from hours to seconds. The flywheel starts closing faster.

current vs target: the flywheel ingestion path
Current:
  n8n (every 6h) → POST /ingest → flywheel-ingester
                 → poll MongoDB → push to ES
                 → check count → POST /trigger if >50

Target:
  MongoDB CDC → Kafka topic: mkultra.interactions
              → flywheel-ingester consumer (instant)
              → Elasticsearch index
              → count trigger: fires at 50, not after 6h
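
The target path boils down to a consumer that indexes each event as it arrives and fires the trigger on a count, not on a timer. A broker-agnostic sketch: `events` stands in for the Kafka topic, `index` for the Elasticsearch write, and `trigger` for the fine-tune kickoff. All three are hypothetical callables, and resetting the counter after each trigger is an assumption.

```python
from typing import Callable, Iterable


def consume(events: Iterable[dict], index: Callable[[dict], None],
            trigger: Callable[[], None], threshold: int = 50) -> int:
    """Index each change event as it arrives; fire the trigger every `threshold` records."""
    pending = 0
    fired = 0
    for event in events:
        index(event)
        pending += 1
        if pending >= threshold:
            trigger()
            fired += 1
            pending = 0
    return fired
```

The trigger fires on the fiftieth record itself, however long or short a time that took to arrive.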

The latency targets from the interoperability analysis: service-to-service latency from ~200ms to under 100ms. System availability from the current ~97% to 99.9%. The 97% figure sounds good until you realize it's roughly 263 hours of downtime per year (3% of 8,760 hours) distributed across 25 services, most of it from cascade failures that circuit breakers would have contained.

The service mesh migration doesn't ship a feature. It doesn't do anything visible to a user. It's the difference between a system that works in a lab and a system that runs for six months without a human restarting it. That is worth the work.

The Conductor and the service mesh are mutually dependent. The Conductor needs reliable, low-latency communication to worker subgraphs to be viable for interactive tasks. At 600ms of network overhead per step (three hops at ~200ms each), a ten-step planning task accumulates six seconds of overhead before any thinking happens. At under 100ms per hop, the same three-hop chain costs under 300ms per step, and the ten-step task accumulates under three seconds. The Conductor becomes a real-time tool only after the mesh brings the latency down.

Chapter VII

What Compounds

Three emergent properties of this architecture only become visible when you consider the full system. They don't appear in any individual service description.

Compounding self-improvement. The OPRO-ExpeL-flywheel loop doesn't improve the system additively. It improves it multiplicatively. Better prompts → better outputs → better training records → better checkpoints → better outputs for the next prompt optimization cycle. Each loop amplifies the input quality of the next. The performance trajectory is superlinear with time in operation until it hits a ceiling from the base model or the evaluation function. That ceiling rises every time we upgrade the model or improve the benchmark.

Architectural self-awareness. The Conductor stores every state transition in MongoDB. The autoresearch harness can consume any MongoDB collection as training data. The Conductor's own decision-making history — every routing choice, every quality score, every reflection generated after a failure — can be formalized into a scoring function and run through autoresearch. The system has a complete record of every mistake it has made. Those mistakes can become training signal. The OS learns from running itself.

Graceful degradation under partial failure. The service mesh's circuit breakers, combined with the Conductor's subgraph isolation, mean a failure in any individual worker doesn't propagate to the supervisor's state. PoonGram goes down: the Conductor detects the failure, routes to the reflection node, replans with SurrealDB-only context. The gaggle loses one provider: budget routing drops to four-agent mode. Elasticsearch is slow: the flywheel-ingester queues locally and retries asynchronously. Each failure is contained at the boundary where it occurred.

| Current State | After Conductor + Mesh |
| --- | --- |
| 25 services, direct HTTP coupling | 25 services, event-driven mesh + circuit breakers |
| ReAct accumulation, context drift | Ledger-based state, context stays bounded |
| ~200ms service latency baseline | Target <100ms via async event streams |
| ~97% availability (~263h downtime/year) | Target 99.9% (~8.8h downtime/year) |
| Three independent improvement loops | Three compounding loops on a closed feedback path |
| PoonGram running alongside RAG | Joint scoring: precision recall + associative salience |

Episode 1 was: here are the layers. Here is why it's an OS and not a framework.

Episode 2 was: here is what happens when you turn it on. Here is where the debt lives.

Episode 3 is: here is the architecture that closes the loop. Not the services themselves — those were built in Episode 1. The structure that makes the services into something coherent. The Conductor is the program that runs the services. The triad is the mechanism that makes the program better every day. The mesh is the foundation that makes the program reliable enough to trust.

Twenty-five services. Five autonomous loops. Three compounding improvement mechanisms. One state machine to rule them all. The operating system is complete enough to be interesting. What comes next is what you run on it.

Domain specialists. Healthcare, security, trading, code. Voice via Pipecat. Applications built on this infrastructure as real products, not prototypes. The OS is mature enough now that the interesting engineering happens at the application layer.

Episode 4 will be about what those applications are and what it feels like to ship on top of an OS that improves itself while you sleep.

For now: the Conductor is being implemented. The mesh migration has a plan. The flywheel is six interactions from its first real run. The autoresearch harness has validated its guardrails. OPRO rewrites agent prompts every night at 3AM. Somewhere on a machine in the next room, the system is doing things nobody asked it to. It always is now.

That's not a problem anymore. That's the point.