Agent OS Spec — Analysis

Role Classification

It's both — not either

The spec divides into a Dev division (engineering: build, deploy, review) and an Ops division (marketing, publishing, comms). Dev ≈ TAC. Ops ≈ COMMAND. The difference is topology: they run both identities in one session via prompt folders. We run separate sessions with a structured handoff protocol (dispatch-tac.sh).

Their approach is simpler to install. Ours has less context bleed on complex missions.

Pattern Comparison

Pattern	Their Implementation	Our Implementation
Council of 5	Pass / Flag / Block voting. One block halts. Any flag triggers revision before proceeding.	Same seats. Flag surfaces and notes dissent — doesn't hard-stop reversible ops.
10-Attempt Rule	Different method class each retry. Opus diagnosis at attempt 9.	Identical. Called "Intelligence Surge" internally.
EYES Protocol	5 types: HTTP 200, /health, file, cron, API 2xx. All curl-based.	Curl + Wizard (Chrome DevTools MCP) for SPAs. SPA gate is mandatory.
Hook Enforcement	5 rules including "no external AI without permission."	Similar set. External AI restriction removed — too conservative for subagent dispatch.
Memory	Flat files committed to GitHub. Cross-session via git pull on boot.	Flat files + ruvector.db. Semantic search via `cmd-search.sh`.
Parallel Dispatch	Not addressed. No CAMPAIGN law.	Explicit Iron Law. Sequential dispatch on independent targets = command failure.
Memory Confidence	Tagged: confirmed / observed / inferred.	Flat — all observations treated equally. Gap worth closing.
Employee Onboarding	First-session interview writes personalized prompt file.	Not implemented. Worth adding for multi-operator scenarios.
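
The 10-Attempt Rule row can be sketched as a small retry loop. This is only an illustration under stated assumptions: `run_with_attempts`, the `(label, callable)` pairing, and the `diagnose` callback are hypothetical names, and the real diagnosis step (an Opus call in both implementations) is reduced to a plain function argument.

```python
from typing import Callable, List, Optional, Sequence, Tuple

MAX_ATTEMPTS = 10
DIAGNOSIS_ATTEMPT = 9  # the table places the Opus diagnosis at attempt 9

# Hypothetical shape: a label naming the method class, plus a callable
# that returns True on success.
Method = Tuple[str, Callable[[], bool]]


def run_with_attempts(
    methods: Sequence[Method],
    diagnose: Callable[[List[str]], None],
) -> Optional[str]:
    """Try each method once, in order; return the label that succeeded, or None."""
    labels = [label for label, _ in methods]
    if len(set(labels)) != len(labels):
        raise ValueError("each retry must use a different method class")

    failed: List[str] = []
    for attempt, (label, method) in enumerate(methods[:MAX_ATTEMPTS], start=1):
        if attempt == DIAGNOSIS_ATTEMPT:
            # Hand the failure history to the diagnosis step before attempt 9 runs.
            diagnose(list(failed))
        if method():
            return label
        failed.append(label)
    return None
```

Rejecting duplicate labels up front is one way to enforce "different method class each retry"; a real system would likely classify methods more carefully than by string label.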

Keep — They Got Right

Core architecture

Keep

Council of 5, the 10-attempt rule with Opus at attempt 9, hook-enforced guardrails, skills loaded on demand (names at startup, bodies on invocation), and a session lifecycle with pull-on-boot. All solid, and independent validation that these patterns hold under real use.

"Infrastructure only at bottlenecks" principle

Keep

Deliberately avoid custom servers and databases until a concrete bottleneck demands them. Good discipline, and worth encoding as a rule when scoping future COMMAND infrastructure work.

Steal — Worth Retrofitting

Memory confidence tagging

Steal

Their memory entries carry a tag: confirmed / observed / inferred. Our observation log treats all entries equally. Whether a fact was directly verified or merely inferred from context changes how much weight an agent should give it, especially when contradictory entries appear across sessions.
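
A minimal sketch of how tagged entries could settle a contradiction. The ranking (confirmed over observed over inferred), the `MemoryEntry` fields, and the `resolve` helper are assumptions for illustration; the spec names the three tags but not how conflicts are resolved.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, Iterable


class Confidence(IntEnum):
    # Assumed ordering: a higher value means more trustworthy.
    INFERRED = 1
    OBSERVED = 2
    CONFIRMED = 3


@dataclass(frozen=True)
class MemoryEntry:
    key: str
    value: str
    confidence: Confidence
    session: int  # hypothetical monotonically increasing session counter


def resolve(entries: Iterable[MemoryEntry]) -> Dict[str, MemoryEntry]:
    """Keep one entry per key: highest confidence wins, newest session breaks ties."""
    best: Dict[str, MemoryEntry] = {}
    for entry in entries:
        current = best.get(entry.key)
        if current is None or (entry.confidence, entry.session) > (
            current.confidence,
            current.session,
        ):
            best[entry.key] = entry
    return best
```

Under this rule an older confirmed fact beats a newer inferred one, which is the case a flat log cannot distinguish.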

Per-employee onboarding interview

Steal

First session runs a structured interview: directness preference, decision authority level, interaction style. Output is a personalized prompt file that loads alongside the division prompt each session. Useful if COMMAND is ever opened to additional operators beyond Toby.

Change — Real Gaps in Their Spec

EYES protocol is curl-only — SPA blind spot

Production Defect

All five verification types rely on curl. A React, Next.js, or Vite app returns HTTP 200 for an HTML shell that contains little more than a mount point and a script tag. curl never executes that script, so the application itself never runs during the check. curl 200 ≠ EYES.

The fix: an SPA gate before any verification. Run `curl -s <url> | grep -Ec 'id="root"|id="__next"'`; if the count is ≥1, Wizard is mandatory: navigate, screenshot, read the PNG. Without this gate, their spec will silently pass broken deploys of any client-rendered frontend.
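
The same gate, sketched as a pure function over an already-fetched response body. The names `needs_browser_check` and `choose_verifier` are hypothetical, and the marker list is only the two mount-point ids from the grep above; fetching the page and driving Wizard are left to the caller.

```python
import re

# Mount-point ids taken from the grep pattern above.
SPA_MARKERS = re.compile(r'id="(?:root|__next)"')


def needs_browser_check(html: str) -> bool:
    """True when the body looks like an SPA shell rather than rendered content."""
    return SPA_MARKERS.search(html) is not None


def choose_verifier(html: str) -> str:
    """Route SPA shells to the browser-based check; everything else stays on curl."""
    return "wizard" if needs_browser_check(html) else "curl"
```

For example, `choose_verifier('<div id="root"></div>')` routes to the browser check, while a server-rendered status page without either marker stays on the curl path.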

No CAMPAIGN law — sequential dispatch trap

Change

When a mission has N independent targets (build 10 sites, audit 20 pages, send 15 drafts), processing them sequentially is a performance failure, not a feature. We encoded this as an Iron Law: count all targets, run an independence check, and fan out in parallel before item 1 starts. Sequential dispatch on independent work is a command failure. They'll hit this.
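
A rough sketch of the count, check, fan-out sequence using a thread pool. `dispatch_campaign`, the `is_independent` predicate, and the `max_workers` default are illustrative assumptions; the actual dispatch mechanism behind the Iron Law is not described here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def dispatch_campaign(
    targets: Iterable[T],
    worker: Callable[[T], R],
    is_independent: Callable[[List[T]], bool],
    max_workers: int = 8,
) -> List[R]:
    """Count targets, check independence, then fan out; fall back to sequential."""
    items = list(targets)  # count all targets before any work starts
    if len(items) > 1 and is_independent(items):
        workers = min(max_workers, len(items))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # pool.map returns results in the same order as the inputs.
            return list(pool.map(worker, items))
    # Dependent targets (or a single target) run in order.
    return [worker(item) for item in items]
```

The sequential branch exists only for targets that fail the independence check; taking it for independent work is the failure mode the law names.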

Single-session Dev/Ops split — context bleed risk

Change

Running Dev and Ops in the same session via prompt folders works at low complexity. Under sustained parallel workloads, such as debugging a deploy failure while managing a content calendar, context from one division bleeds into the other. Separate sessions with a structured handoff keep each division's context clean. The overhead is worth it at scale.

Cut — Remove Entirely

"Never call external AI without permission" as a hook

Cut

Blocking subagent dispatch via a hook is too conservative for a dev agent. Council of 5 votes, Intelligence Surge, parallel exploration — all require spawning subagents autonomously. A permission gate on every external AI call kills the operational speed these systems are built to provide. The right guardrail is spend ceiling + destructive-op confirmation, not a blanket external-AI block.
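
One way the replacement guardrail could look, as a sketch only. `Budget`, `authorize`, the cost units, and the contents of `DESTRUCTIVE_OPS` are assumptions made for illustration; no actual hook is specified here.

```python
from dataclasses import dataclass

# Hypothetical list of operations that need explicit confirmation.
DESTRUCTIVE_OPS = {"send", "publish", "delete"}


@dataclass
class Budget:
    ceiling: float      # spend ceiling, in whatever unit the operator tracks
    spent: float = 0.0


def authorize(op: str, cost: float, budget: Budget, confirmed: bool = False) -> bool:
    """Allow an operation unless it breaks the ceiling or is an unconfirmed destructive op."""
    if budget.spent + cost > budget.ceiling:
        return False
    if op in DESTRUCTIVE_OPS and not confirmed:
        return False
    budget.spent += cost  # only authorized work counts against the ceiling
    return True
```

Subagent spawns pass without a prompt as long as the ceiling holds, which preserves the autonomy the paragraph above argues for.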

Any flag → mandatory revision before proceeding

Cut

Correct for irreversible actions: send, publish, delete. Overkill for reversible ops: branch, draft, scaffold. A hard-stop on every single flag vote slows routine work for no gain. Gate by reversibility, not vote type.
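
Gating by reversibility can be sketched as a small decision function. The `Vote` enum, the outcome strings, and the `IRREVERSIBLE` set are hypothetical; only the Pass / Flag / Block vocabulary and the example actions come from the text above.

```python
from enum import Enum
from typing import Iterable


class Vote(Enum):
    PASS = "pass"
    FLAG = "flag"
    BLOCK = "block"


# Example irreversible actions named above; a real list would be longer.
IRREVERSIBLE = {"send", "publish", "delete"}


def council_decision(votes: Iterable[Vote], action: str) -> str:
    """One block always halts; a flag forces revision only for irreversible actions."""
    cast = list(votes)
    if Vote.BLOCK in cast:
        return "halt"
    if Vote.FLAG in cast:
        if action in IRREVERSIBLE:
            return "revise"
        return "proceed_with_dissent"  # dissent is surfaced and noted, not a stop
    return "proceed"
```

A flagged `draft` therefore proceeds with the dissent recorded, while a flagged `publish` goes back for revision.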

Verdict

The bones are solid: the same architecture we independently built and battle-tested. Council, 10-attempt rule, hook enforcement, and session lifecycle all hold up. The memory confidence tagging and per-employee onboarding are worth stealing.

The SPA gate gap is the one production defect. An EYES protocol that can't distinguish a raw HTML response from a rendered React application will silently pass broken deployments. That gap surfaces in production, not in testing.

Ship the SPA gate fix and the CAMPAIGN law, cut the external-AI block and the flag-always-halts rule, and it's close to production-grade.