tangle-network · drewstone · Jul 1, 2026 · Jul 1, 2026
diff --git a/examples/README.md b/examples/README.md
@@ -61,6 +61,7 @@ purpose — read [`driver-loop/`](./driver-loop/) for the contrast (a driver tha
 | 9 | [`ui-audit/`](./ui-audit/) | You want the smallest end-to-end `runLoop` over a real client (Playwright + stub judge), persisting findings. |
 | 9a | `agent-eval/examples/eval-fixtures-quickstart` | You want Vercel-style eval folders (`PROMPT.md` + `EVAL.ts`) to run through runtime's `runLoop`; pair `loadEvalFixtureScenarios()` with `loopCampaignDispatch()`. |
 | 9b | [`coding-benchmark/`](./coding-benchmark/) | You want a scientifically-rigorous coding benchmark across harnesses: `runProfileMatrix` over harness × baseline-profile × scenario, a one-line tool knob (websearch / webfetch / MCP), a held-out-test-execution anti-cheat (the agent is graded on hidden tests it never saw, so it can't hardcode), a secondary quality judge, and paired-bootstrap + Wilson + BH stats (offline by default; `--live` for real harness boxes). |
+| 9c | [`webcode-matrix/`](./webcode-matrix/) | You want the **real WebCode benchmark** (Exa's 33-task dataset, graded by its own hidden tests) across a harness×model matrix, rendered as a publishable leaderboard — ranked board + profile×axis matrix + SVG/HTML charts + CIs + pairwise significance, all via the domain-agnostic `leaderboard()` engine. WebCode's discriminator is web *retrieval* (post-cutoff APIs); for SWE-style coding see `9b`. |
 
 ## Tier 3 — the production runtime, deeper
 
@@ -84,9 +85,11 @@ purpose — read [`driver-loop/`](./driver-loop/) for the contrast (a driver tha
 |---|---|---|
 | 16 | [`strategy-evolution/`](./strategy-evolution/) | You want the full policy-search + holdout gate: author candidates from losses, promote a champion only if a paired-bootstrap CI says it isn't luck (needs `TANGLE_API_KEY`). |
 | 17 | [`improve/`](./improve/) | You want the one supported RSI verb: `improve(profile, findings, opts)` — optimize one profile surface, ship only if it clears the held-out gate. Offline. |
+| 17b | [`self-improving-coder/`](./self-improving-coder/) | You want the flywheel on a **contamination-proof coding task**: an agent authors strategies from its train losses, graded by real pytest, promoted only if a paired-bootstrap CI clears on a fresh holdout. The bundled task is deliberately saturated (the gate honestly returns no-promotion); swap in a harder env or SWE-bench to see a real lift. `CALIBRATE=1` is a $0 no-creds check. |
 | 18 | [`self-improving-loop/`](./self-improving-loop/) | You want the unrolled internals of #17: v0 → judge → analyst → mutation → v1 → gate, with the "which substrate owns each phase" map. Offline. |
 | 19 | [`intelligence-recommend/`](./intelligence-recommend/) | You want the intelligence loop offline: trace → findings → `improve()` → gated candidate. |
 | 20 | [`intelligence-drop-in/`](./intelligence-drop-in/) | You want to wrap any agent with `withTangleIntelligence` and ship one trace per call (best-effort; off = passthrough). |
+| 20b | [`intelligence-webcode/`](./intelligence-webcode/) | You want the full Intelligence SDK (billing boundary + effort tiers, per-tool cost waterfall, OTLP export) on **every cell of a real benchmark** — the WebCode harness×model matrix, instrumented. Needs a sandbox key. |
 | 21 | [`agents-of-all-shapes/`](./agents-of-all-shapes/) | You want proof that any framework's traces converge on one OTel contract → one `InsightReport` (the CI-tested example). |
 | 22 | [`product-eval/`](./product-eval/) | You want user-sim product evals: a persona over a multi-round conversation via `runPersonaConversation`, then score the transcript (`maxTurns` is a ceiling, not a target). Needs `TANGLE_API_KEY`; offline via a `backendFor` override. |
 | 23 | [`agentic-data-creation/`](./agentic-data-creation/) | You want the **Autodata inner loop**: an agent manufactures HARD training examples from a doc and keeps only the ones that DISCRIMINATE a strong solver from a weak one. Composes the fold (`runLoop`+refine driver), N× sampling (`runLoop`+fanout driver), `llmJudge`, `CostLedger`, and `Corpus`; the one new piece is `discriminativeAcceptRule`. Shows the calibration (plain gap ≈ 0.02 vs agentic ≈ 0.31). Offline. |
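
Rows 16 and 17b promote a candidate only if a paired-bootstrap CI says the lift isn't luck. The sketch below illustrates that idea in isolation. It is not the runtime's `promotionGate`: every name in it (`mulberry32`, `pairedBootstrapCI`, `shouldPromote`) is hypothetical, and the percentile interval, 2000 resamples, and 95% level are assumptions chosen for illustration.

```typescript
// Hypothetical sketch of a seeded paired-bootstrap gate. NOT the runtime's API.

/** Small seeded PRNG so the resampling is reproducible for a given seed. */
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

interface BootstrapCI {
  meanDiff: number; // observed mean of (candidate - baseline)
  lo: number;       // lower percentile bound
  hi: number;       // upper percentile bound
}

/** Percentile bootstrap CI over per-task score differences (same tasks, paired). */
function pairedBootstrapCI(
  baseline: number[],
  candidate: number[],
  resamples = 2000,
  alpha = 0.05,
  seed = 1,
): BootstrapCI {
  if (baseline.length === 0 || baseline.length !== candidate.length) {
    throw new Error("need two non-empty, equal-length score arrays");
  }
  const diffs = candidate.map((c, i) => c - baseline[i]);
  const n = diffs.length;
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    // Resample task indices with replacement; pairing is kept because we draw diffs.
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const lo = means[Math.floor((alpha / 2) * resamples)];
  const hi = means[Math.min(resamples - 1, Math.ceil((1 - alpha / 2) * resamples) - 1)];
  return { meanDiff: diffs.reduce((s, d) => s + d, 0) / n, lo, hi };
}

/** Promote only when the whole interval sits above zero. */
function shouldPromote(ci: BootstrapCI): boolean {
  return ci.lo > 0;
}

// Saturated task: every strategy scores 1.0, so every diff is 0,
// the interval is [0, 0], and the gate declines to promote.
const saturated = pairedBootstrapCI([1, 1, 1, 1, 1], [1, 1, 1, 1, 1]);
console.log(saturated, shouldPromote(saturated));
```

Requiring the lower bound to clear zero (rather than the mean alone) is what makes a saturated task return no promotion instead of a spurious one.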

diff --git a/examples/researcher-loop/README.md b/examples/researcher-loop/README.md
@@ -1,9 +1,11 @@
 # researcher-loop
 
 `researcherProfile()` (from `@tangle-network/agent-knowledge/profiles`) +
-`runLoop()` + an inline fanout `Driver` — the primary, smallest example of the
-`runLoop` kernel. Two parallel researcher attempts answer the same question;
-the validator scores citation density + namespace scoping + per-item
+`runLoop()` + an inline fanout `Driver` — the `runLoop` kernel driving a **domain
+research profile**. (For the minimal, dependency-free `runLoop` example to read
+first, see [`driver-loop`](../driver-loop); this one adds the agent-knowledge
+research profile on top.) Two parallel researcher attempts answer the same
+question; the validator scores citation density + namespace scoping + per-item
 provenance; the kernel picks the highest-scoring valid winner.
 
 A **round** is one `plan → run workers → decide` cycle. This driver is
@@ -47,12 +49,17 @@ flowchart TD
 ## Run
 
 ```bash
+# 1. install the optional peer this example needs (it is NOT a dependency of the runtime):
+pnpm add -D @tangle-network/agent-knowledge
+# 2. run it:
 pnpm tsx examples/researcher-loop/researcher-loop.ts
 ```
 
-The `@tangle-network/agent-knowledge` peer dep ships in `node_modules`
-already; the example imports `researcherProfile` from
-`@tangle-network/agent-knowledge/profiles`.
+`@tangle-network/agent-knowledge` is an **optional peer** — the runtime never
+imports it (domain packages enter by injection, not dependency), so it is not in
+`node_modules` by default and this example is excluded from the repo's CI
+typecheck (`tsconfig.examples.json`). Install it as above; the example imports
+`researcherProfile` from `@tangle-network/agent-knowledge/profiles`.
 
 ## What it shows
 

diff --git a/examples/self-improving-coder/README.md b/examples/self-improving-coder/README.md
@@ -0,0 +1,35 @@
+# self-improving-coder
+
+The self-improvement flywheel, composed cleanly, on a **contamination-proof** coding task. An agent authors candidate strategies from its training-set losses, then a **held-out gate** ships a change only if it beats the current agent on fresh tasks the search never touched. Registering an agent for self-improvement therefore cannot ship a change that failed that held-out comparison; the worst case is no promotion.
+
+Nothing here is hand-rolled: the agent is an `AgentProfile` worker, the task is an `AgenticSurface`, and the gated flywheel is `runStrategyEvolution` + `promotionGate` (a seeded paired-bootstrap CI on a disjoint holdout, read exactly once).
+
+## Run
+
+```bash
+# $0, no creds — proves the task is solvable AND the grader discriminates before spending anything.
+CALIBRATE=1  pnpm tsx examples/self-improving-coder/self-improving-coder.ts
+
+# the real flywheel (needs a router key + python3/pytest on the host to run the deployable check).
+TANGLE_API_KEY=sk-...  pnpm tsx examples/self-improving-coder/self-improving-coder.ts
+```
+
+Env knobs: `WORKER_MODEL` (default `deepseek-v4-flash`), `AUTHOR_MODEL` (default `gemini-2.5-pro`), `TRAIN_N`, `ROUTER_BASE`.
+
+## What you'll see — and why "No promotion" is the honest, correct result
+
+**The bundled task is deliberately simple** — a few wire-protocol functions fully pinned by their tests. A capable model aces it (every strategy scores 1.0), so the gate **correctly returns no promotion**: you cannot demonstrate improvement where there is no headroom, and this harness refuses to fake one (`calibrate-before-measure`, enforced). That null is the point — the gate is honest.
+
+**To see a real promotion, give it a task with a correctable middle band** (some attempts pass, some fail — the only regime where improvement is measurable):
+- swap `environment`/`tasks` for the algorithmically-hard generated env in [`../ablation-suite/hard-coding-env.ts`](../ablation-suite/hard-coding-env.ts), or
+- swap in the SWE-bench `Environment` (`bench/src/benchmarks/swe-bench.ts`) — everything else is identical. *(SWE-bench is contamination-**suspect**: its bugs are public GitHub fixes a model may have memorized — report that, never claim clean.)*
+
+## Why contamination-proof
+
+Each task is a small wire-protocol library whose constants (version, separators, checksum modulus, opcode) are **derived from the seed** and specified **only** by the test file — so a frontier model cannot have memorized the fix; the exact contract is generated per task. Graded by **real pytest** (a deployable check), never an LLM judge.
+
+## Related
+
+- [`improve`](../improve) — the one-call `improve(profile, findings)` facade over this loop.
+- [`self-improving-loop`](../self-improving-loop) — the same gate on a prompt surface, offline.
+- [`strategy-evolution`](../strategy-evolution) — the multi-generation `runStrategyEvolution` in isolation.
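
The "Why contamination-proof" section says each task's constants (version, separators, checksum modulus, opcode) are derived from the seed. A minimal sketch of that idea follows. It is not the example's actual generator: `deriveProtocolSpec`, `encodeFrame`, the value ranges, and the frame layout are all hypothetical, chosen only to show how one seed can pin a whole contract deterministically.

```typescript
// Hypothetical sketch of seed-derived task constants. NOT the example's generator.

/** Small seeded PRNG so a given seed always yields the same spec. */
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

interface ProtocolSpec {
  version: number;
  separator: string;
  checksumModulus: number;
  opcode: number;
}

/** Derive every constant of the wire protocol from one seed (ranges are assumptions). */
function deriveProtocolSpec(seed: number): ProtocolSpec {
  const rand = mulberry32(seed);
  const pick = <T>(xs: readonly T[]): T => xs[Math.floor(rand() * xs.length)];
  const int = (lo: number, hi: number): number => lo + Math.floor(rand() * (hi - lo + 1));
  return {
    version: int(1, 9),
    separator: pick(["|", ";", "~", "^"]),
    checksumModulus: pick([97, 251, 509, 65521]),
    opcode: int(0x10, 0xff),
  };
}

/**
 * Reference encoder a generated test file could pin: version, opcode, payload,
 * then a checksum (sum of code points modulo the spec's modulus).
 * The sketch does not escape separators inside the payload.
 */
function encodeFrame(spec: ProtocolSpec, payload: string): string {
  let sum = 0;
  for (const ch of payload) sum = (sum + ch.codePointAt(0)!) % spec.checksumModulus;
  return [spec.version, spec.opcode, payload, sum].join(spec.separator);
}

// Same seed, same contract; a different seed generally yields a different one.
console.log(deriveProtocolSpec(42), deriveProtocolSpec(43));
```

Because the expected output depends on constants that exist only in the generated spec, a solver has to read the per-task tests rather than recall a fixed answer.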