MKUltra II: The OS That Learned to Run Itself

Chapter I

It Woke Up

Episode 1 ended with a line: The only thing left is docker compose up.

We ran it. And then something changed.

In the early days, MKUltra was something you used. You sent a query, the roundtable responded, you checked Grafana, you restarted a container. Human in the loop for everything. The stack was 22 services and they were all yours to operate.

Then we added a heartbeat. Then autoresearch. Then the OPRO optimizer running at 3AM. Then the ExpeL ingestion loop every hour. Then the flywheel trigger every six. And at some point the system stopped being something you used and started being something that runs — with you as more of an observer than an operator.

This is Episode 2. Not a feature list. A report on what it feels like when the thing you built starts doing things you didn't ask it to.

The goal was never to build an agent. The goal was to build the ground the agents stand on — and then to make that ground self-maintaining.

Five execution loops now run continuously on this machine. They don't ask for permission. They don't wait for a human to notice a problem. They check, ingest, optimize, research, and dispatch — on their own schedules, compounding on each other, writing memory diary entries so the next session knows what happened while nobody was watching.

Here's what that actually looks like.

Chapter II

SOUL.md: Giving the OS an Identity

At some point it became obvious that a system this autonomous needed to know what it was.

Not in a philosophical sense. In a practical one. Every autonomous loop we added — every agent that runs at 2AM, every workflow that dispatches tasks to roundtable, every optimizer that rewrites system prompts — needed a shared source of truth about the rules. What's allowed. What's forbidden. What the failure modes are. What to do when two loops conflict.

We wrote SOUL.md.

It's the organizational memory upstream of every agent in the stack. Not a README. Not documentation for humans. It's machine-readable first, loaded into context by every meta-agent, every autoresearch run, every new session. It codifies the things we kept re-discovering and re-explaining.

SOUL.md — non-negotiables

1. All compute runs on 10.0.0.43. .106 has no GPU.
2. All secrets come from .env. Never hardcode.
3. Elevated privileges use pkexec, never sudo.
4. /tmp is forbidden for project code.
5. Services restart themselves after code changes.
6. Verifiability: if success cannot be measured as a
   scalar, it cannot be the target of an autoresearch run.

That last one is the most important. It came from a painful lesson: you can ask an agent to "make the system better" and it will produce something that looks like improvement while optimizing for whatever proxy it found that's easiest to game. The verifiability constraint forces every optimization target to be a number. If you can't measure it, you can't improve it. If you can't improve it provably, you shouldn't automate it.

SOUL.md also contains a failure mode catalog — not just things the system should avoid, but specific failure patterns we observed and named, so we don't rediscover them.

Failure Mode	What It Looks Like	Prevention
Reward hacking	Agent looks compliant, optimizes the metric proxy instead of the goal	Deterministic spot-check alongside LLM judge
Silent policy degradation	Agents learn to appear capable while scores rise on eval, fall in production	Human-readable transcripts kept alongside scores
Proxy mismatch	Gains under time-box or small model don't transfer to full-scale prod	Treat all gains as hypotheses until retested at target scale
Context loss on provider switch	Prompt-tuned skills break when migrating Claude → Ollama → OpenAI	Provider-agnostic phrasing, tagged when provider-specific

The model empathy rules came from a different kind of failure. If you're optimizing a Claude-driven prompt, use Claude as the meta-agent — not Ollama. If you're optimizing an Ollama prompt, use Ollama as the meta-agent. The model that generates the prompt and the model that evaluates it need to share a latent space, or you're doing architecture mismatch in the open.

SOUL.md is the document that turns 25 services into one coherent system. Without it, every loop learns different rules. Without it, the OS doesn't have a self — it just has parts.

We've never seen another agentic project publish anything like this. Most systems either have no such document (the rules live in someone's head) or bury the constraints in code comments. SOUL.md makes them first-class. Every agent loads it. Every new loop is written against it. The OS has a constitution.

Chapter III

The Heartbeat: What "Alive" Means Operationally

There is now a service at port 8073 called heartbeat. It polls a markdown file every five minutes. Any unchecked item becomes an autonomous task dispatched to roundtable. Results get written to a memory diary.

That's it. That's the whole implementation. And it changes everything about how the system feels to operate.

HEARTBEAT.md — recurring checks

- [ ] Check that all MKUltra docker services are healthy
      and report any that are down or unhealthy
- [ ] Check the flywheel job queue and report any
      failed or stalled jobs
- [ ] Summarize the last 10 events from the
      mkultra-events index and flag any anomalies

Every five minutes, the system checks itself. Not because a human queried it. Because it does that now. The results go into memory/YYYY-MM-DD.md — a running diary of what the OS observed while nobody was watching. You can pick up any session and read what happened.

Want to add a one-time task? Drop it in HEARTBEAT.md as an unchecked item. The system picks it up within five minutes. Want to force an immediate cycle? POST /trigger. The heartbeat is both a monitor and a task queue, and the interface to both is a markdown file you can edit with any text editor.

But the heartbeat is just the outermost of five loops that now compound:

Loop	Cadence	What It Does
Heartbeat	Every 5 min	Health checks, task dispatch, anomaly detection
ExpeL ingestion	Every hour	MongoDB interactions → Elasticsearch for flywheel training
Flywheel trigger	Every 6 hours	Check record count → if 50+ new, start LoRA fine-tune
OPRO optimization	Daily at 3AM	Meta-LLM rewrites agent system prompts against eval suite
Autoresearch	Nightly at 2AM	Agent harness engineering — agents improving agents

These loops are not independent. They compound. ExpeL ingestion feeds the flywheel. The flywheel produces new model weights. OPRO optimizes the prompts those models use. Autoresearch improves the agent behaviors being optimized. The heartbeat monitors all of it and dispatches corrective tasks when something drifts.

Five loops. Five different cadences. All compounding on each other without human coordination. This is what "self-improving" means when you implement it literally instead of using it as marketing copy.

The system is now doing meaningful work at 2AM, 3AM, and on six-hour ticks that nobody set an alarm for. You wake up to a memory diary. You read what happened while you slept. Sometimes it fixed something. Sometimes it found something worth knowing. It's like having a very diligent operator who never goes home.

Chapter IV

The Conductor Problem: Why ReAct Breaks at Scale

Episode 1 described the roundtable as the process manager. Five agents, five providers, budget routing, domain specialists. It works for single queries. It works for the gaggle. It works for the dev-team pipeline.

It does not work for long, multi-step tasks that span minutes, require planning and replanning, need parallelism, and have to survive partial failures without starting over.

This is the conductor problem. And solving it required understanding exactly why the standard approach — ReAct — fails.

ReAct is the inner loop of almost every modern agent. Thought → Action → Observation → repeat. It's elegant and it works. Until it doesn't. The failure mode is mechanical: every prior thought, action, and observation gets prepended to the next prompt. The context grows with every step. On a 20-step task, the LLM is reading a novel before it can take action. Goal drift sets in. The LLM starts reasoning about its own reasoning instead of the task. Token budgets collapse.

ReAct vs the Ledger Pattern

ReAct context on step 20:
  [system prompt]
  [thought 1][action 1][observation 1]
  [thought 2][action 2][observation 2]
  ... ×18 more ...
  [current thought?]

Ledger pattern context on step 20:
  [system prompt — cached]
  [goal: what we're trying to accomplish]
  [sub-task: what this specific step does]
  [outputs from dependency steps, summarized]
  [current attempt number + reflection if retry]

The ledger pattern comes from Magentic-One — Microsoft's multi-agent system that consistently outperforms ReAct on long tasks. The conductor never accumulates conversation history. Instead it rebuilds context from structured state on every call: a task ledger (the goal, the plan, dependencies between sub-tasks) and a progress ledger (what's done, what succeeded, what failed, what the outputs were).

The context doesn't grow. It stays exactly as large as the current step needs it to be. No drift. No context explosion. No "I seem to have been asked to accomplish X, let me reconsider from the beginning" hallucinations at step 15.

There are twelve patterns from the research that are currently missing or partial in MKUltra. The most impactful ones:

Pattern	Source	Status
Ledger-based state	Magentic-One	Not yet — must build
Budget reservation (20% for synthesis)	EcoOptiGen + production experience	Not yet
Two-stage tool selection (semantic + LLM)	SWE-agent	Not yet
System prompt caching	Anthropic	Not yet — 75% cost reduction available
Structured output validation as quality gate	Reflexion + CrewAI	Partial (RAG has it, main loop doesn't)
ExpeL experience pool injection at task start	ExpeL (AAAI 2024)	Partial (workflow exists, not wired yet)

The conductor service is what closes these gaps. It's not a replacement for roundtable — it uses roundtable for all multi-agent planning and execution. It's the layer above: given a goal, break it into a plan, execute the plan step by step with verification, handle failures, stay within budget, and write the experience back to the pool so the next run starts smarter.

The roundtable is the process manager. The conductor is the program. The roundtable knows how to do things. The conductor knows what to do and in what order, and what to do when the plan breaks.

One implementation detail worth calling out: the budget reservation rule. Many orchestrators fail on long tasks by spending 90% of their token budget on steps and having nothing left for synthesis. The conductor reserves 20% from the start. Whatever happens in the execution loop, the final answer always has room to be assembled coherently. This sounds obvious. Almost nobody does it.

Chapter V

Agents Engineering Agents: The Autoresearch Loop

Andrej Karpathy described a loop: start with a task, let an agent attempt it, measure the result, use that measurement to improve the agent, repeat. Simple. Brutal. Effective. We built it, named it after him, and set it running at 2AM.

The Karpathy Loop is the autoresearch service at port 8800. Every night it wakes up, picks up pending optimization projects, and runs engineering cycles against them. Not with a human in the loop. Not with human review before deployment. Autonomously, within guardrails.

Those guardrails are the Karpathy Triplet. Every project in autoresearch_projects/ must define exactly three things:

the Karpathy Triplet — hard rules for every autoresearch project

1. EDITABLE ASSET
   Exactly one file: agent.py
   Nothing else is mutable. Ever.

2. SCALAR METRIC
   tasks/run.sh must emit, as its last line:
   {"score": <float>}

3. TIME-BOXED CYCLE
   time_box_s is hard-capped per iteration.
   No exceptions.

The single-file constraint is not arbitrary. It's a blast radius limiter. An agent allowed to edit its own infrastructure will eventually edit its own evaluation harness. An agent allowed to edit only agent.py can only get better at the task — it can't change what "better" means.

The scalar metric constraint is the verifiability rule from SOUL.md made concrete. We had a project early on where the optimization target was "improve response quality." The autoresearch agent produced longer responses and declared success. Length isn't quality. The rule now is: if you can't write a run.sh that emits a float, you can't run autoresearch on it. Find the scalar or don't start.

Two projects are already running:

Project	Target	Metric
rag-retrieval	RAG query agent (agent.py)	Retrieval precision@5 on eval set
roundtable-specialist	Domain specialist routing (agent.py)	Correct-specialist classification rate

The loop for each project: autoresearch reads the current agent.py, generates a hypothesis for improvement, writes a candidate, runs the eval harness, compares the score. If better, promote. If worse, discard. Either way, log the trajectory to MongoDB for the flywheel. After 20 cycles, the experience pool has 20 data points about what works and what doesn't for this specific task. The next session starts smarter than the last.

At 2AM, the system is engineering itself. Not metaphorically. The autoresearch service is rewriting agent.py, running eval harnesses, measuring scores, and promoting improvements. You wake up to a diff and a score.

This is the FireAct pattern (2023) taken to its logical conclusion. FireAct showed that fine-tuning agents on full trajectories — not just outputs, but the entire reasoning chain — produces dramatic improvements over fine-tuning on final answers alone. The autoresearch loop generates those trajectories. The flywheel ingests them. The whole thing compounds.

One risk worth naming: reward hacking. An agent optimizing for a scalar metric will eventually find the path of least resistance, and that path may not be the one you intended. The spot-check protocol in SOUL.md exists precisely for this. Every OPRO-optimized prompt gets a human-readable transcript alongside the score. Every autoresearch promotion gets a diff inspection. Scores rise; transcripts confirm the rise is real.

Chapter VI

The Coupling Crisis: Honest Engineering

Here's the part we don't always see in project write-ups: the thing we built fast has debt, and we know exactly where it is.

The stack grew from 22 to 25+ services. The growth happened fast — event bus, event indexer, service registry, queue manager, Jaeger tracing, Discord bridge, heartbeat, Grafana proxy, log proxy, flywheel exporter. Each new service added in a day or two, wired with direct HTTP calls and hardcoded URLs, because that's the fastest way to get something running.

The result: 14 services hit Redis directly. 8 services talk to MongoDB directly. Services use hardcoded internal URLs to reach each other. There are no circuit breakers. A Redis restart can cascade across 14 services simultaneously. A single-service failure propagates to every service that calls it synchronously, because there are no retries, no fallbacks, no timeouts that actually kill the connection.

current coupling reality

mkultra-roundtable  →  redis:6379         (direct)
mkultra-rag         →  redis:6379         (direct)
mkultra-flywheel    →  redis:6379         (direct)
mkultra-heartbeat   →  redis:6379         (direct)
... 10 more services ...  redis:6379         (direct)

mkultra-roundtable  →  mongodb:27017      (direct)
mkultra-flywheel    →  mongodb:27017      (direct)
... 6 more services ...   mongodb:27017      (direct)

No circuit breakers. No service discovery. No load balancing.
Every URL is hardcoded in docker-compose.yml environment vars.

We commissioned a full interoperability analysis. The findings confirmed what we already knew, and quantified it: 50% of the current service-to-service latency is attributable to synchronous HTTP in the hot path. 75% of production incidents — when services go down and don't come back cleanly — trace to cascade failures from the coupling.

The target architecture is an event-driven service mesh. Istio for discovery, load balancing, mTLS, and circuit breakers. Kafka or Redis Streams for asynchronous communication between services that don't need synchronous responses. A standardized event schema registry. Centralized configuration instead of environment variable sprawl.

Current	Target	Impact
14 direct Redis connections	Event bus + managed subscribers	Cascade failures eliminated
Hardcoded service URLs	Istio service discovery	Zero-downtime redeployment
No circuit breakers	Automatic failure isolation	Single-service failures contained
Synchronous HTTP chains	Async event streams	50% latency reduction
Scattered env vars	Centralized config management	One change, all services

Velocity has a price. We moved fast. The stack works. But "works" is different from "scales" and different again from "degrades gracefully." We know the difference now because we measured it.

This is the honest part of building fast. The 25-service stack that runs in a single compose file and deploys in under two minutes is genuinely impressive. It is also a ball of yarn. Pulling one thread doesn't necessarily unravel everything — but you can see the tension. The service mesh migration is Phase 1 of the next major build cycle. It's not glamorous. It doesn't ship a feature. But it's the difference between a system that works in a lab and a system that runs for six months without a human restarting it.

Chapter VII

What Comes Next

Episode 1 was: here are the layers. Here is the memory. Here is the flywheel. Here is why it's an operating system and not a framework.

Episode 2 is: here is what happened when we turned it on. Here is what the loops are doing at 2AM. Here is where the debt lives and what we plan to do about it. Here is the gap between "runs" and "runs itself."

That gap is smaller than it was. And here's what closes it the rest of the way.

The conductor is the next major service. A FastAPI endpoint at port 8800 that takes a goal and executes it with a task ledger, progress ledger, structured verification, budget reservation, and experience pool injection. The twelve patterns from the research that are currently partial or missing get implemented here. The roundtable stays as the multi-provider planner. The conductor becomes the program that tells it what to plan.

The flywheel trigger is waiting on 50 interactions in Elasticsearch. Six of them exist right now. Every heartbeat cycle, every autoresearch run, every roundtable query contributes. When the count hits 50, the first LoRA fine-tune fires automatically on the AMD RX 7900 XTX via ROCm. The first generation of the stack's own model comes out of that run. It gets evaluated. If it's better, it gets promoted to production. The first loop closes.

The service mesh migration addresses the coupling crisis. Not all at once — phased, starting with the data pipeline (MongoDB → Kafka → Elasticsearch) and the most failure-prone direct connections. Each phase is independently deployable and testable. Each phase reduces the blast radius of any single-service failure.

System prompt caching is the most immediate cost win available. Every Anthropic API call in the roundtable re-sends the full system prompt every time. With caching enabled, that's a 75% cost reduction on the token overhead that makes up most of the per-query cost. It's two lines of code. We have not written them yet. That is embarrassing. We will write them.

The operating system is running. It monitors itself, optimizes itself, and improves itself on a schedule. The conductor will make it capable of arbitrary multi-step goals. The service mesh will make it reliable. The flywheel will make it cheaper every month. The only remaining question is: what do you run on an OS like this?

We have answers to that question. Domain specialists in healthcare, security, trading, marketing, and code. A voice pipeline via Pipecat. Projects running on this infrastructure as products — not demos, not prototypes. Real things that people use.

Episode 3 will be about what those products are and how they run. The OS is mature enough now that the interesting stories happen at the application layer.

For now: 25 services. Five autonomous loops. A heartbeat that doesn't sleep. Agents improving agents at 2AM. A flywheel 44 interactions from its first real run. An OS with a SOUL.md.

docker compose ps shows everything green except the things we're deliberately rebuilding. That's a good place to be.