Merged
3 changes: 3 additions & 0 deletions examples/README.md
@@ -61,6 +61,7 @@ purpose — read [`driver-loop/`](./driver-loop/) for the contrast (a driver tha
| 9 | [`ui-audit/`](./ui-audit/) | You want the smallest end-to-end `runLoop` over a real client (Playwright + stub judge), persisting findings. |
| 9a | `agent-eval/examples/eval-fixtures-quickstart` | You want Vercel-style eval folders (`PROMPT.md` + `EVAL.ts`) to run through runtime's `runLoop`; pair `loadEvalFixtureScenarios()` with `loopCampaignDispatch()`. |
| 9b | [`coding-benchmark/`](./coding-benchmark/) | You want a scientifically rigorous coding benchmark across harnesses: `runProfileMatrix` over harness × baseline-profile × scenario, a one-line tool knob (websearch / webfetch / MCP), a held-out-test-execution anti-cheat (the agent is graded on hidden tests it never saw, so it can't hardcode), a secondary quality judge, and paired-bootstrap + Wilson + BH stats (offline by default; `--live` for real harness boxes). |
| 9c | [`webcode-matrix/`](./webcode-matrix/) | You want the **real WebCode benchmark** (Exa's 33-task dataset, graded by its own hidden tests) across a harness×model matrix, rendered as a publishable leaderboard — ranked board + profile×axis matrix + SVG/HTML charts + CIs + pairwise significance, all via the domain-agnostic `leaderboard()` engine. WebCode's discriminator is web *retrieval* (post-cutoff APIs); for SWE-style coding see `9b`. |

## Tier 3 — the production runtime, deeper

@@ -84,9 +85,11 @@ purpose — read [`driver-loop/`](./driver-loop/) for the contrast (a driver tha
|---|---|---|
| 16 | [`strategy-evolution/`](./strategy-evolution/) | You want the full policy-search + holdout gate: author candidates from losses, promote a champion only if a paired-bootstrap CI says it isn't luck (needs `TANGLE_API_KEY`). |
| 17 | [`improve/`](./improve/) | You want the one supported RSI verb: `improve(profile, findings, opts)` — optimize one profile surface, ship only if it clears the held-out gate. Offline. |
| 17b | [`self-improving-coder/`](./self-improving-coder/) | You want the flywheel on a **contamination-proof coding task**: an agent authors strategies from its train losses, graded by real pytest, promoted only if a paired-bootstrap CI clears on a fresh holdout. The bundled task is deliberately saturated (the gate honestly returns no-promotion); swap in a harder env or SWE-bench to see a real lift. `CALIBRATE=1` is a $0 no-creds check. |
| 18 | [`self-improving-loop/`](./self-improving-loop/) | You want the unrolled internals of #17: v0 → judge → analyst → mutation → v1 → gate, with the "which substrate owns each phase" map. Offline. |
| 19 | [`intelligence-recommend/`](./intelligence-recommend/) | You want the intelligence loop offline: trace → findings → `improve()` → gated candidate. |
| 20 | [`intelligence-drop-in/`](./intelligence-drop-in/) | You want to wrap any agent with `withTangleIntelligence` and ship one trace per call (best-effort; off = passthrough). |
| 20b | [`intelligence-webcode/`](./intelligence-webcode/) | You want the full Intelligence SDK (billing boundary + effort tiers, per-tool cost waterfall, OTLP export) on **every cell of a real benchmark** — the WebCode harness×model matrix, instrumented. Needs a sandbox key. |
| 21 | [`agents-of-all-shapes/`](./agents-of-all-shapes/) | You want proof that any framework's traces converge on one OTel contract → one `InsightReport` (the CI-tested example). |
| 22 | [`product-eval/`](./product-eval/) | You want user-sim product evals: a persona over a multi-round conversation via `runPersonaConversation`, then score the transcript (`maxTurns` is a ceiling, not a target). Needs `TANGLE_API_KEY`; offline via a `backendFor` override. |
| 23 | [`agentic-data-creation/`](./agentic-data-creation/) | You want the **Autodata inner loop**: an agent manufactures HARD training examples from a doc and keeps only the ones that DISCRIMINATE a strong solver from a weak one. Composes the fold (`runLoop`+refine driver), N× sampling (`runLoop`+fanout driver), `llmJudge`, `CostLedger`, and `Corpus`; the one new piece is `discriminativeAcceptRule`. Shows the calibration (plain gap ≈ 0.02 vs agentic ≈ 0.31). Offline. |
19 changes: 13 additions & 6 deletions examples/researcher-loop/README.md
@@ -1,9 +1,11 @@
# researcher-loop

`researcherProfile()` (from `@tangle-network/agent-knowledge/profiles`) +
-`runLoop()` + an inline fanout `Driver` — the primary, smallest example of the
-`runLoop` kernel. Two parallel researcher attempts answer the same question;
-the validator scores citation density + namespace scoping + per-item
+`runLoop()` + an inline fanout `Driver` — the `runLoop` kernel driving a **domain
+research profile**. (For the minimal, dependency-free `runLoop` example to read
+first, see [`driver-loop`](../driver-loop); this one adds the agent-knowledge
+research profile on top.) Two parallel researcher attempts answer the same
+question; the validator scores citation density + namespace scoping + per-item
provenance; the kernel picks the highest-scoring valid winner.
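
The "validate, then pick the highest-scoring valid winner" step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the validator shipped with `researcherProfile()`: the `Attempt` and `Verdict` shapes, the equal weighting of the three signals, and the rule that out-of-namespace output is invalid are all hypothetical.

```typescript
// Hypothetical attempt shape; field names are illustrative, not the profile's real output type.
interface Attempt {
  id: string;
  claims: number; // factual claims made
  citations: number; // claims backed by a citation
  items: number; // knowledge items produced
  itemsWithProvenance: number; // items carrying a source reference
  inNamespace: boolean; // every item stayed inside the requested namespace
}

interface Verdict {
  id: string;
  valid: boolean;
  score: number; // 0..1
}

// Clamp part/whole into 0..1; an empty denominator scores 0.
function ratio(part: number, whole: number): number {
  return whole > 0 ? Math.min(1, part / whole) : 0;
}

function validate(a: Attempt): Verdict {
  const citationDensity = ratio(a.citations, a.claims);
  const provenance = ratio(a.itemsWithProvenance, a.items);
  const scoping = a.inNamespace ? 1 : 0;
  // Assumed rule: out-of-namespace or empty output is invalid, however well cited.
  const valid = a.inNamespace && a.items > 0;
  // Assumed equal weighting of the three signals.
  const score = (citationDensity + provenance + scoping) / 3;
  return { id: a.id, valid, score };
}

// Returns the best valid verdict, or undefined when no attempt is valid.
function pickWinner(attempts: Attempt[]): Verdict | undefined {
  let best: Verdict | undefined;
  for (const verdict of attempts.map(validate)) {
    if (!verdict.valid) continue;
    if (best === undefined || verdict.score > best.score) best = verdict;
  }
  return best;
}
```

Separating validity from score means an invalid attempt can never win on score alone.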

A **round** is one `plan → run workers → decide` cycle. This driver is
@@ -47,12 +49,17 @@ flowchart TD
## Run

```bash
# 1. install the optional peer this example needs (it is NOT a dependency of the runtime):
pnpm add -D @tangle-network/agent-knowledge
# 2. run it:
pnpm tsx examples/researcher-loop/researcher-loop.ts
```

-The `@tangle-network/agent-knowledge` peer dep ships in `node_modules`
-already; the example imports `researcherProfile` from
-`@tangle-network/agent-knowledge/profiles`.
+`@tangle-network/agent-knowledge` is an **optional peer** — the runtime never
+imports it (domain packages enter by injection, not dependency), so it is not in
+`node_modules` by default and this example is excluded from the repo's CI
+typecheck (`tsconfig.examples.json`). Install it as above; the example imports
+`researcherProfile` from `@tangle-network/agent-knowledge/profiles`.

## What it shows

35 changes: 35 additions & 0 deletions examples/self-improving-coder/README.md
@@ -0,0 +1,35 @@
# self-improving-coder

The self-improvement flywheel, composed cleanly, on a **contamination-proof** coding task. An agent authors candidate strategies from its training-set losses, then a **held-out gate** ships a change only if it beats the current agent on fresh tasks the search never touched, so registering an agent for self-improvement never ships a change the holdout does not support.

Nothing here is hand-rolled: the agent is an `AgentProfile` worker, the task is an `AgenticSurface`, and the gated flywheel is `runStrategyEvolution` + `promotionGate` (a seeded paired-bootstrap CI on a disjoint holdout, read exactly once).
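
To make the "seeded paired-bootstrap CI on a disjoint holdout" concrete, here is a minimal, self-contained sketch of that kind of gate. It is not the `promotionGate` implementation: the function name `pairedBootstrapGate`, its options, the percentile interval, and the `ciLow > 0` promotion rule are assumptions chosen for illustration.

```typescript
// Hypothetical sketch; not the library's promotionGate.

// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface GateResult {
  meanDelta: number;
  ciLow: number;
  ciHigh: number;
  promote: boolean;
}

// Scores are paired: champion[i] and candidate[i] come from the same holdout task.
function pairedBootstrapGate(
  champion: number[],
  candidate: number[],
  opts: { seed?: number; resamples?: number; alpha?: number } = {},
): GateResult {
  if (champion.length !== candidate.length || champion.length === 0) {
    throw new Error("need equal-length, non-empty paired scores");
  }
  const { seed = 42, resamples = 2000, alpha = 0.05 } = opts;
  const deltas = candidate.map((c, i) => c - champion[i]);
  const n = deltas.length;
  const rand = mulberry32(seed);

  // Resample per-task deltas with replacement and record each resample's mean.
  const means: number[] = [];
  for (let b = 0; b < resamples; b++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);

  // Percentile interval over the resampled means.
  const ciLow = means[Math.floor((alpha / 2) * resamples)];
  const ciHigh = means[Math.min(resamples - 1, Math.floor((1 - alpha / 2) * resamples))];
  const meanDelta = deltas.reduce((s, d) => s + d, 0) / n;

  // Assumed rule: promote only when the whole interval sits above zero.
  return { meanDelta, ciLow, ciHigh, promote: ciLow > 0 };
}
```

On a saturated task, where champion and candidate both score 1.0 everywhere, every delta is zero, the interval collapses to `[0, 0]`, and the sketch declines to promote. Fixing the seed makes the decision reproducible across runs.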

## Run

```bash
# $0, no creds — proves the task is solvable AND the grader discriminates before spending anything.
CALIBRATE=1 pnpm tsx examples/self-improving-coder/self-improving-coder.ts

# the real flywheel (needs a router key + python3/pytest on the host to run the deployable check).
TANGLE_API_KEY=sk-... pnpm tsx examples/self-improving-coder/self-improving-coder.ts
```

Env knobs: `WORKER_MODEL` (default `deepseek-v4-flash`), `AUTHOR_MODEL` (default `gemini-2.5-pro`), `TRAIN_N`, `ROUTER_BASE`.
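
A rough sketch of how these knobs might be read is shown below. The two model defaults come from the line above; the `Knobs` shape, the `readKnobs` helper, and the positive-integer check on `TRAIN_N` are assumptions. No defaults are stated for `TRAIN_N` or `ROUTER_BASE`, so the sketch leaves them unset rather than guessing.

```typescript
// Hypothetical config shape; the example script may structure this differently.
interface Knobs {
  workerModel: string;
  authorModel: string;
  trainN?: number; // no documented default, left unset when absent
  routerBase?: string; // no documented default, left unset when absent
}

function readKnobs(env: Record<string, string | undefined>): Knobs {
  const knobs: Knobs = {
    workerModel: env["WORKER_MODEL"] || "deepseek-v4-flash",
    authorModel: env["AUTHOR_MODEL"] || "gemini-2.5-pro",
  };

  const rawTrainN = env["TRAIN_N"];
  if (rawTrainN !== undefined && rawTrainN !== "") {
    const n = Number(rawTrainN);
    // Assumed validation: a training-set size only makes sense as a positive integer.
    if (!Number.isInteger(n) || n <= 0) {
      throw new Error(`TRAIN_N must be a positive integer, got "${rawTrainN}"`);
    }
    knobs.trainN = n;
  }

  const routerBase = env["ROUTER_BASE"];
  if (routerBase) knobs.routerBase = routerBase;
  return knobs;
}
```

In a Node script this would be called as `readKnobs(process.env)`; taking the environment as a parameter keeps the function testable with a plain object.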

## What you'll see — and why "No promotion" is the honest, correct result

**The bundled task is deliberately simple** — a few wire-protocol functions fully pinned by their tests. A capable model aces it (every strategy scores 1.0), so the gate **correctly returns no promotion**: you cannot demonstrate improvement where there is no headroom, and this harness refuses to fake one (`calibrate-before-measure`, enforced). That null is the point — the gate is honest.

**To see a real promotion, give it a task with a correctable middle band** (some attempts pass, some fail — the only regime where improvement is measurable):
- swap `environment`/`tasks` for the algorithmically hard generated env in [`../ablation-suite/hard-coding-env.ts`](../ablation-suite/hard-coding-env.ts), or
- swap in the SWE-bench `Environment` (`bench/src/benchmarks/swe-bench.ts`) — everything else is identical. *(SWE-bench is contamination-**suspect**: its bugs are public GitHub fixes a model may have memorized — report that, never claim clean.)*
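
One way to check for a "correctable middle band" before spending on a real run is to sample each task a few times and count the tasks with mixed outcomes. The helper names and the `Map` input shape below are hypothetical; the harness's own calibration check may work differently.

```typescript
// attempts: task id -> pass/fail outcome of each sampled attempt (hypothetical shape).
function middleBandTasks(attempts: Map<string, boolean[]>): string[] {
  const mixed: string[] = [];
  attempts.forEach((outcomes, taskId) => {
    const passes = outcomes.filter(Boolean).length;
    // Mixed outcomes: at least one pass and at least one fail.
    if (passes > 0 && passes < outcomes.length) mixed.push(taskId);
  });
  return mixed;
}

// Assumed rule of thumb: improvement is only measurable if some task is in the middle band.
function hasHeadroom(attempts: Map<string, boolean[]>, minMixedTasks = 1): boolean {
  return middleBandTasks(attempts).length >= minMixedTasks;
}
```

A task set where every attempt passes (the bundled task with a capable model) or every attempt fails yields no mixed tasks, so `hasHeadroom` returns `false` in both regimes.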

## Why contamination-proof

Each task is a small wire-protocol library whose constants (version, separators, checksum modulus, opcode) are **derived from the seed** and specified **only** by the test file — so a frontier model cannot have memorized the fix; the exact contract is generated per task. Graded by **real pytest** (a deployable check), never an LLM judge.
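
The idea of deriving the protocol constants from the seed can be illustrated with a short sketch. Everything here is an assumption made for illustration: the `WireSpec` shape, the value ranges, the LCG used as the seeded generator, and the frame layout in `encodeFrame`. The bundled environment generates its own contract and a pytest file that pins it.

```typescript
// Hypothetical shapes and ranges, chosen for illustration only.
interface WireSpec {
  version: number;
  separator: string;
  checksumModulus: number;
  opcode: number;
}

// Tiny linear congruential generator: same seed, same sequence.
function lcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Every constant comes from the seed, so each task gets its own contract.
function deriveWireSpec(seed: number): WireSpec {
  const rand = lcg(seed);
  const separators = "|;~^";
  return {
    version: 1 + Math.floor(rand() * 9), // 1..9
    separator: separators.charAt(Math.floor(rand() * separators.length)),
    checksumModulus: 97 + Math.floor(rand() * 160), // 97..256
    opcode: Math.floor(rand() * 256), // 0..255
  };
}

// Reference encoder a generated test file could pin: version, opcode, payload, checksum.
function encodeFrame(spec: WireSpec, payload: string): string {
  let sum = 0;
  for (let i = 0; i < payload.length; i++) sum += payload.charCodeAt(i);
  const checksum = sum % spec.checksumModulus;
  return [spec.version, spec.opcode, payload, checksum].join(spec.separator);
}
```

Because the expected output depends on constants that exist only in the generated test file, a memorized solution to some public protocol would not match; the agent has to read the tests to learn the contract.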

## Related

- [`improve`](../improve) — the one-call `improve(profile, findings)` facade over this loop.
- [`self-improving-loop`](../self-improving-loop) — the same gate on a prompt surface, offline.
- [`strategy-evolution`](../strategy-evolution) — the multi-generation `runStrategyEvolution` in isolation.