diff --git a/examples/README.md b/examples/README.md
index c81a1087..e8fdd854 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -61,6 +61,7 @@ purpose — read [`driver-loop/`](./driver-loop/) for the contrast (a driver tha
 | 9 | [`ui-audit/`](./ui-audit/) | You want the smallest end-to-end `runLoop` over a real client (Playwright + stub judge), persisting findings. |
 | 9a | `agent-eval/examples/eval-fixtures-quickstart` | You want Vercel-style eval folders (`PROMPT.md` + `EVAL.ts`) to run through runtime's `runLoop`; pair `loadEvalFixtureScenarios()` with `loopCampaignDispatch()`. |
 | 9b | [`coding-benchmark/`](./coding-benchmark/) | You want a scientifically-rigorous coding benchmark across harnesses: `runProfileMatrix` over harness × baseline-profile × scenario, a one-line tool knob (websearch / webfetch / MCP), a held-out-test-execution anti-cheat (the agent is graded on hidden tests it never saw, so it can't hardcode), a secondary quality judge, and paired-bootstrap + Wilson + BH stats (offline by default; `--live` for real harness boxes). |
+| 9c | [`webcode-matrix/`](./webcode-matrix/) | You want the **real WebCode benchmark** (Exa's 33-task dataset, graded by its own hidden tests) across a harness×model matrix, rendered as a publishable leaderboard — ranked board + profile×axis matrix + SVG/HTML charts + CIs + pairwise significance, all via the domain-agnostic `leaderboard()` engine. WebCode's discriminator is web *retrieval* (post-cutoff APIs); for SWE-style coding see `9b`. |
 
 ## Tier 3 — the production runtime, deeper
 
@@ -84,9 +85,11 @@ purpose — read [`driver-loop/`](./driver-loop/) for the contrast (a driver tha
 |---|---|---|
 | 16 | [`strategy-evolution/`](./strategy-evolution/) | You want the full policy-search + holdout gate: author candidates from losses, promote a champion only if a paired-bootstrap CI says it isn't luck (needs `TANGLE_API_KEY`). |
 | 17 | [`improve/`](./improve/) | You want the one supported RSI verb: `improve(profile, findings, opts)` — optimize one profile surface, ship only if it clears the held-out gate. Offline. |
+| 17b | [`self-improving-coder/`](./self-improving-coder/) | You want the flywheel on a **contamination-proof coding task**: an agent authors strategies from its train losses, graded by real pytest, promoted only if a paired-bootstrap CI clears on a fresh holdout. The bundled task is deliberately saturated (the gate honestly returns no-promotion); swap in a harder env or SWE-bench to see a real lift. `CALIBRATE=1` is a $0 no-creds check. |
 | 18 | [`self-improving-loop/`](./self-improving-loop/) | You want the unrolled internals of #17: v0 → judge → analyst → mutation → v1 → gate, with the "which substrate owns each phase" map. Offline. |
 | 19 | [`intelligence-recommend/`](./intelligence-recommend/) | You want the intelligence loop offline: trace → findings → `improve()` → gated candidate. |
 | 20 | [`intelligence-drop-in/`](./intelligence-drop-in/) | You want to wrap any agent with `withTangleIntelligence` and ship one trace per call (best-effort; off = passthrough). |
+| 20b | [`intelligence-webcode/`](./intelligence-webcode/) | You want the full Intelligence SDK (billing boundary + effort tiers, per-tool cost waterfall, OTLP export) on **every cell of a real benchmark** — the WebCode harness×model matrix, instrumented. Needs a sandbox key. |
 | 21 | [`agents-of-all-shapes/`](./agents-of-all-shapes/) | You want proof that any framework's traces converge on one OTel contract → one `InsightReport` (the CI-tested example). |
 | 22 | [`product-eval/`](./product-eval/) | You want user-sim product evals: a persona over a multi-round conversation via `runPersonaConversation`, then score the transcript (`maxTurns` is a ceiling, not a target). Needs `TANGLE_API_KEY`; offline via a `backendFor` override. |
 | 23 | [`agentic-data-creation/`](./agentic-data-creation/) | You want the **Autodata inner loop**: an agent manufactures HARD training examples from a doc and keeps only the ones that DISCRIMINATE a strong solver from a weak one. Composes the fold (`runLoop`+refine driver), N× sampling (`runLoop`+fanout driver), `llmJudge`, `CostLedger`, and `Corpus`; the one new piece is `discriminativeAcceptRule`. Shows the calibration (plain gap ≈ 0.02 vs agentic ≈ 0.31). Offline. |
diff --git a/examples/researcher-loop/README.md b/examples/researcher-loop/README.md
index 96193a07..acfe9283 100644
--- a/examples/researcher-loop/README.md
+++ b/examples/researcher-loop/README.md
@@ -1,9 +1,11 @@
 # researcher-loop
 
 `researcherProfile()` (from `@tangle-network/agent-knowledge/profiles`) +
-`runLoop()` + an inline fanout `Driver` — the primary, smallest example of the
-`runLoop` kernel. Two parallel researcher attempts answer the same question;
-the validator scores citation density + namespace scoping + per-item
+`runLoop()` + an inline fanout `Driver` — the `runLoop` kernel driving a **domain
+research profile**. (For the minimal, dependency-free `runLoop` example to read
+first, see [`driver-loop`](../driver-loop); this one adds the agent-knowledge
+research profile on top.) Two parallel researcher attempts answer the same
+question; the validator scores citation density + namespace scoping + per-item
 provenance; the kernel picks the highest-scoring valid winner.
 
 A **round** is one `plan → run workers → decide` cycle. This driver is
@@ -47,12 +49,17 @@ flowchart TD
 ## Run
 
 ```bash
+# 1. install the optional peer this example needs (it is NOT a dependency of the runtime):
+pnpm add -D @tangle-network/agent-knowledge
+# 2. run it:
 pnpm tsx examples/researcher-loop/researcher-loop.ts
 ```
 
-The `@tangle-network/agent-knowledge` peer dep ships in `node_modules`
-already; the example imports `researcherProfile` from
-`@tangle-network/agent-knowledge/profiles`.
+`@tangle-network/agent-knowledge` is an **optional peer** — the runtime never
+imports it (domain packages enter by injection, not dependency), so it is not in
+`node_modules` by default and this example is excluded from the repo's CI
+typecheck (`tsconfig.examples.json`). Install it as above; the example imports
+`researcherProfile` from `@tangle-network/agent-knowledge/profiles`.
 
 ## What it shows
 
diff --git a/examples/self-improving-coder/README.md b/examples/self-improving-coder/README.md
new file mode 100644
index 00000000..363eac41
--- /dev/null
+++ b/examples/self-improving-coder/README.md
@@ -0,0 +1,35 @@
+# self-improving-coder
+
+The self-improvement flywheel, composed cleanly, on a **contamination-proof** coding task. An agent authors candidate strategies from its training-set losses, then a **held-out gate** ships a change only if it beats the current agent on fresh tasks the search never touched — so registering an agent for self-improvement can never make it worse.
+
+Nothing here is hand-rolled: the agent is an `AgentProfile` worker, the task is an `AgenticSurface`, and the gated flywheel is `runStrategyEvolution` + `promotionGate` (a seeded paired-bootstrap CI on a disjoint holdout, read exactly once).
+
+## Run
+
+```bash
+# $0, no creds — proves the task is solvable AND the grader discriminates before spending anything.
+CALIBRATE=1 pnpm tsx examples/self-improving-coder/self-improving-coder.ts
+
+# the real flywheel (needs a router key + python3/pytest on the host to run the deployable check).
+TANGLE_API_KEY=sk-... pnpm tsx examples/self-improving-coder/self-improving-coder.ts
+```
+
+Env knobs: `WORKER_MODEL` (default `deepseek-v4-flash`), `AUTHOR_MODEL` (default `gemini-2.5-pro`), `TRAIN_N`, `ROUTER_BASE`.
+
+## What you'll see — and why "No promotion" is the honest, correct result
+
+**The bundled task is deliberately simple** — a few wire-protocol functions fully pinned by their tests. A capable model aces it (every strategy scores 1.0), so the gate **correctly returns no promotion**: you cannot demonstrate improvement where there is no headroom, and this harness refuses to fake one (`calibrate-before-measure`, enforced). That null is the point — the gate is honest.
+
+**To see a real promotion, give it a task with a correctable middle band** (some attempts pass, some fail — the only regime where improvement is measurable):
+- swap `environment`/`tasks` for the algorithmically-hard generated env in [`../ablation-suite/hard-coding-env.ts`](../ablation-suite/hard-coding-env.ts), or
+- swap in the SWE-bench `Environment` (`bench/src/benchmarks/swe-bench.ts`) — everything else is identical. *(SWE-bench is contamination-**suspect**: its bugs are public GitHub fixes a model may have memorized — report that, never claim clean.)*
+
+## Why contamination-proof
+
+Each task is a small wire-protocol library whose constants (version, separators, checksum modulus, opcode) are **derived from the seed** and specified **only** by the test file — so a frontier model cannot have memorized the fix; the exact contract is generated per task. Graded by **real pytest** (a deployable check), never an LLM judge.
+
+## Related
+
+- [`improve`](../improve) — the one-call `improve(profile, findings)` facade over this loop.
+- [`self-improving-loop`](../self-improving-loop) — the same gate on a prompt surface, offline.
+- [`strategy-evolution`](../strategy-evolution) — the multi-generation `runStrategyEvolution` in isolation.
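
The new `self-improving-coder` README describes the gate as a seeded paired-bootstrap CI on a disjoint holdout, and explains why a saturated task yields no promotion. The sketch below illustrates that idea only. It is not the repository's `promotionGate`: the function name `pairedBootstrapGate`, its options, the PRNG choice (mulberry32), and the percentile-interval rule are all assumptions made for illustration.

```typescript
// Hypothetical sketch of a seeded paired-bootstrap promotion gate.
// None of these names come from the runtime; they are illustrative only.

/** Small seeded PRNG (mulberry32) so the resampling is reproducible. */
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface GateResult {
  meanDelta: number; // mean per-task score difference (candidate - champion)
  ciLow: number; // lower bound of the percentile bootstrap interval
  ciHigh: number; // upper bound of the percentile bootstrap interval
  promote: boolean; // true only if the whole interval lies above zero
}

/**
 * Scores are paired: index i in both arrays is the same holdout task.
 * Tasks are resampled with replacement, and the candidate is promoted
 * only if the lower CI bound of the mean delta is strictly positive.
 */
function pairedBootstrapGate(
  champion: number[],
  candidate: number[],
  opts: { seed?: number; resamples?: number; alpha?: number } = {},
): GateResult {
  if (champion.length === 0 || champion.length !== candidate.length) {
    throw new Error("need two non-empty, equal-length score arrays");
  }
  const { seed = 1, resamples = 2000, alpha = 0.05 } = opts;
  const deltas = candidate.map((score, i) => score - champion[i]);
  const n = deltas.length;
  const rand = mulberry32(seed);

  const means: number[] = [];
  for (let b = 0; b < resamples; b++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);

  const ciLow = means[Math.floor((alpha / 2) * (resamples - 1))];
  const ciHigh = means[Math.ceil((1 - alpha / 2) * (resamples - 1))];
  const meanDelta = deltas.reduce((acc, d) => acc + d, 0) / n;
  return { meanDelta, ciLow, ciHigh, promote: ciLow > 0 };
}

// Saturated task: every strategy scores 1.0, so every delta is 0 and the
// interval collapses to [0, 0]. The gate does not promote.
const saturated = pairedBootstrapGate([1, 1, 1, 1], [1, 1, 1, 1]);
console.log("saturated promote:", saturated.promote);

// Headroom: the candidate is better on every holdout task, so every
// resampled mean is positive and the gate promotes.
const lift = pairedBootstrapGate(
  [0.2, 0.4, 0.5, 0.3],
  [0.5, 0.6, 0.9, 0.5],
  { seed: 7 },
);
console.log("lift promote:", lift.promote);
```

Requiring the lower bound to be strictly above zero is what makes the saturated case return no promotion: with no per-task difference there is nothing for the interval to clear.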