tangle-network · drewstone · Jul 1, 2026 · Jul 1, 2026
diff --git a/examples/agentic-data-creation/README.md b/examples/agentic-data-creation/README.md
@@ -1,12 +1,12 @@
 # agentic-data-creation
 
-**An agent manufactures its own hard training data.** This is the INNER loop of Autodata /
-Agentic Self-Instruct (Meta FAIR, arXiv 2606.25996): instead of hand-writing examples, an agent
+**An agent manufactures its own hard training data.** This is the INNER loop of **agentic
+self-instruct** (the self-instruct pattern, Wang et al. 2022, taken agentic): instead of hand-writing examples, an agent
 *writes* candidate {context, question, reference, rubric} examples from a grounding doc and keeps
 only the ones that are **hard for a weak solver but doable for a strong one**. The hard ones are
 exactly the examples worth training on.
 
-> This example builds only the paper's **data-creation** half (the inner loop). The RL-training
+> This example builds only the **data-creation** half (the inner loop). The RL-training
 > outer half needs a trainer this repo does not have, so it is out of scope here.
 
 Runs fully offline (scripted solvers + a mocked judge, no credentials):
@@ -47,7 +47,7 @@ flowchart TD
 ## The one new piece — `discriminativeAcceptRule`
 
 Everything else is composed from primitives this repo already ships. The genuinely new piece is the
-paper's reward, written as a small, Validator-shaped accept/reject:
+accept rule, written as a small, Validator-shaped accept/reject:
 
 ```ts
 discriminativeAcceptRule({ strongScore, weakScore, minStrong = 0.65, maxWeak = 0.5, minGap = 0.2 })
@@ -86,7 +86,7 @@ this before trusting it (the `calibrate-before-measure` discipline): it measures
 challenger's **first (un-refined) draft** — plain generation — and on the **loop-accepted** example,
 and shows the accept rule separates them. **Offline the solvers are scripted, so this proves the
 wiring + that the rule discriminates by construction — it is NOT an empirical reproduction of the
-paper's Table 1.** Reproducing Table 1 for real (the loop actually producing harder data) needs the
+target gap separation (≈ 0.02 vs ≈ 0.31).** Reproducing that separation for real (the loop actually producing harder data) needs the
 live run below, with real two-tier solver models:
 
 ```
@@ -101,7 +101,7 @@ examples would be uninformative, and the loop would be optimizing noise.
 
 `offline-fixtures.ts` is the credentialless stand-in (the same pattern `examples/driver-loop` and
 `examples/self-improving-loop` use): deterministic scripted challenger/solvers and a **mocked judge
-transport** bound into a *real* `llmJudge` `JudgeConfig`, tuned to reproduce Table 1. The judge,
+transport** bound into a *real* `llmJudge` `JudgeConfig`, tuned to the illustrative target separation. The judge,
 sampler, fold, cost ledger, and corpus are all the real primitives — only the LLM responses are
 scripted. To run live, swap the mock transport for `createChatClient({ transport: 'router', apiKey })`
 (glm-5.2) and the scripted workers for real sandbox/cli-bridge clients; the loop is unchanged.

diff --git a/examples/agentic-data-creation/agentic-data-creation.ts b/examples/agentic-data-creation/agentic-data-creation.ts
@@ -1,7 +1,7 @@
 /**
- * agentic-data-creation — the INNER loop of Autodata / Agentic Self-Instruct (Meta FAIR,
- * arXiv 2606.25996): an agent MANUFACTURES hard training examples from a grounding doc, and keeps
- * only the ones that DISCRIMINATE a strong solver from a weak one.
+ * agentic-data-creation — the INNER loop of AGENTIC SELF-INSTRUCT (the self-instruct pattern,
+ * Wang et al. 2022, taken agentic): an agent MANUFACTURES hard training examples from a grounding doc,
+ * and keeps only the ones that DISCRIMINATE a strong solver from a weak one.
  *
  * This file is the SUBJECT. The whole method is four roles + one accept rule, composed from
  * primitives this package already ships — nothing here re-implements a judge, a sampler, a cost
@@ -78,7 +78,7 @@ export interface AcceptDecision {
 // THE ONE NEW PIECE — the paper's discriminative reward, as a small Validator-shaped rule.
 // ═══════════════════════════════════════════════════════════════════════════════════════════
 //
-// Autodata keeps an example ONLY IF it separates a strong solver from a weak one: the strong
+// The accept rule keeps an example ONLY IF it separates a strong solver from a weak one: the strong
 // solver should mostly get it (>= minStrong), the weak solver should mostly miss it (< maxWeak),
 // and the margin between them (the "gap") must clear minGap. That is the whole objective — make
 // examples too hard for the weak solver — so the rule is the LITERAL accept criterion, never
@@ -340,7 +340,7 @@ export interface DataCreationResult {
 }
 
 /**
- * Run the Autodata inner loop: manufacture `target` discriminating examples from `doc`, refining
+ * Run the agentic self-instruct inner loop: manufacture `target` discriminating examples from `doc`, refining
  * each via the challenger fold until it is accepted (or its retry budget runs out). Returns the
  * accepted set, the per-example gap for the accepted (agentic) AND the first-draft (plain) examples
  * for calibration, the corpus they accreted into, and the cost ledger.

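The accept rule described in the hunks above (strong solver at or above `minStrong`, weak solver below `maxWeak`, gap at or above `minGap`) can be sketched in isolation as below. This is a minimal illustration under stated assumptions, not the example's `discriminativeAcceptRule`: the names `sketchAcceptRule`, `SketchInput`, `SketchDecision`, and the `reasons` field are hypothetical. Only the three thresholds and their defaults (0.65, 0.5, 0.2) come from the README.

```ts
// Hypothetical names throughout; only the thresholds and defaults are taken from the README.
interface SketchInput {
  strongScore: number
  weakScore: number
  minStrong?: number
  maxWeak?: number
  minGap?: number
}

interface SketchDecision {
  accept: boolean
  gap: number
  reasons: string[] // one entry per failed condition; empty when accepted
}

function sketchAcceptRule({
  strongScore,
  weakScore,
  minStrong = 0.65,
  maxWeak = 0.5,
  minGap = 0.2,
}: SketchInput): SketchDecision {
  const gap = strongScore - weakScore
  const reasons: string[] = []
  // Negated comparisons so that a NaN score fails every check instead of slipping through.
  if (!(strongScore >= minStrong)) reasons.push('strong solver scored below minStrong')
  if (!(weakScore < maxWeak)) reasons.push('weak solver scored at or above maxWeak')
  if (!(gap >= minGap)) reasons.push('gap between solvers is below minGap')
  return { accept: reasons.length === 0, gap, reasons }
}

// An easy draft barely separates the solvers; a hard one separates them widely.
const easyDraft = sketchAcceptRule({ strongScore: 0.8, weakScore: 0.78 })
const hardDraft = sketchAcceptRule({ strongScore: 0.78, weakScore: 0.47 })
console.log(easyDraft.accept, hardDraft.accept)
```

All three conditions must hold at once: a wide gap alone is not enough if the strong solver also fails the example, because then the example may simply be broken rather than hard.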
diff --git a/examples/agentic-data-creation/offline-fixtures.ts b/examples/agentic-data-creation/offline-fixtures.ts
@@ -13,8 +13,9 @@
  *     [0,1] rubric score from (strength, difficulty) with a small per-sample jitter so the N× mean
  *     is a genuine average. LIVE mode (glm-5.2) instead reads the real answer text against the rubric.
  *
- * The scores are tuned to reproduce the paper's Table 1 separation: an EASY (plain) example barely
- * separates the two solvers (gap ≈ 0.02); a HARD (agentic) one separates them widely (gap ≈ 0.31).
+ * The scores are tuned to an ILLUSTRATIVE target separation: an EASY (plain) example barely separates
+ * the two solvers (gap ≈ 0.02); a HARD (agentic) one separates them widely (gap ≈ 0.31). These numbers hold
+ * by construction offline; a live run produces the real ones.
  */
 
 import { createChatClient, llmJudge } from '@tangle-network/agent-eval'
@@ -157,7 +158,7 @@ export function solverClient(strength: 'weak' | 'strong'): SandboxClient {
 // `llmJudge` builds the system+user messages, makes ONE chat() call, and parses the model's
 // `{ dimensions, notes }` JSON into a canonical [0,1] `JudgeScore` (real composite math). Offline,
 // the transport returns a scripted score from the answer's grade marker; live, a real model scores
-// the prose. Tuned so EASY → gap ≈ 0.02, HARD → gap ≈ 0.31 (the paper's Table 1).
+// the prose. Tuned so EASY → gap ≈ 0.02, HARD → gap ≈ 0.31 (illustrative targets, by construction).
 export function buildRubricJudge(): JudgeConfig<SolverArtifact> {
   const chat = createChatClient({
     transport: 'mock',

diff --git a/examples/agentic-data-creation/run.ts b/examples/agentic-data-creation/run.ts
@@ -5,7 +5,7 @@
  * training examples from one grounding doc, and prints:
  *   1. each accepted example with its weak/strong solver scores and the gap,
  *   2. the CALIBRATION — does the gap metric actually separate? A plainly-generated (first-draft)
- *      example should show a SMALL gap; an agentic-loop-accepted one a LARGE gap (the paper's Table 1),
+ *      example should show a SMALL gap; an agentic-loop-accepted one a LARGE gap (the illustrative target),
  *   3. the cost ledger, split by role (challenger vs each solver) — composed, never hand-counted.
  *
  * Run:  pnpm tsx examples/agentic-data-creation/run.ts
@@ -55,9 +55,9 @@ async function main(): Promise<void> {
   const plain = mean(result.plainGaps)
   const agentic = mean(result.agenticGaps)
   console.log('\n— Calibration: does the gap metric discriminate? —')
-  console.log(`  plain   (first-draft examples)  mean gap = ${plain.toFixed(2)}   (paper ≈ 0.02)`)
+  console.log(`  plain   (first-draft examples)  mean gap = ${plain.toFixed(2)}   (target ≈ 0.02)`)
   console.log(
-    `  agentic (loop-accepted examples) mean gap = ${agentic.toFixed(2)}   (paper ≈ 0.31)`,
+    `  agentic (loop-accepted examples) mean gap = ${agentic.toFixed(2)}   (target ≈ 0.31)`,
   )
   const separates = Number.isFinite(plain) && Number.isFinite(agentic) && agentic - plain >= 0.15
   console.log(

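The calibration check in `run.ts` compares the mean gap of first-draft examples against the mean gap of loop-accepted ones and requires a margin of at least 0.15. A self-contained sketch of that comparison follows. The helper names `meanGap` and `gapMetricSeparates` are hypothetical, and returning `NaN` for an empty list is an assumption about how the example's own `mean` behaves (its body is not shown in this diff).

```ts
// Hypothetical helper: arithmetic mean. Returns NaN for an empty list (an assumption),
// so a run that accepted nothing cannot pass the calibration check by accident.
function meanGap(gaps: readonly number[]): number {
  if (gaps.length === 0) return Number.NaN
  return gaps.reduce((sum, g) => sum + g, 0) / gaps.length
}

// Mirrors the condition in run.ts: both means must be finite and the agentic mean
// must exceed the plain mean by at least `minSeparation` (0.15 in the diff).
function gapMetricSeparates(
  plainGaps: readonly number[],
  agenticGaps: readonly number[],
  minSeparation = 0.15,
): boolean {
  const plain = meanGap(plainGaps)
  const agentic = meanGap(agenticGaps)
  return Number.isFinite(plain) && Number.isFinite(agentic) && agentic - plain >= minSeparation
}

// Offline-style numbers near the illustrative targets (≈ 0.02 and ≈ 0.31).
console.log(gapMetricSeparates([0.02, 0.03, 0.01], [0.31, 0.3, 0.32]))
```

The `Number.isFinite` guards matter: without them, an empty `agenticGaps` would yield `NaN`, and the failure would surface as a silently false comparison rather than as an explicit part of the condition.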
diff --git a/examples/agents-of-all-shapes/README.md b/examples/agents-of-all-shapes/README.md
@@ -20,7 +20,7 @@ No sandbox. No deploy. No server. The analysis runs **in-process**.
 
 ```bash
 # Verified QA path — in-process, no key, no infra:
-npx tsx examples/agents-of-all-shapes/run.ts
+pnpm tsx examples/agents-of-all-shapes/run.ts
 
 # CI verification (what proves it):
 pnpm test -- tests/agents-of-all-shapes.test.ts

diff --git a/examples/agents-of-all-shapes/shared/intelligence.ts b/examples/agents-of-all-shapes/shared/intelligence.ts
@@ -17,6 +17,7 @@
 
 import { analyzeRuns, fromOtelSpans, type InsightReport } from '@tangle-network/agent-eval/contract'
 import type { TraceSpanEvent } from '@tangle-network/agent-eval/hosted'
+import { createOtelExporter } from '@tangle-network/agent-runtime'
 
 export type { InsightReport, TraceSpanEvent }
 
@@ -101,51 +102,33 @@ export interface ShipOptions {
   serviceName?: string
 }
 
-/** Optional hosted path: POST the same OTel spans to Tangle Intelligence's
- *  OTLP/HTTP ingest. Identical analysis runs server-side. */
+/** Optional hosted path: POST the same OTel spans to Tangle Intelligence's OTLP/HTTP ingest via the
+ *  runtime's OWN exporter — `createOtelExporter` builds the resourceSpans envelope, appends `/v1/traces`,
+ *  and batches the POST. No hand-rolled wire format; the same primitive the runtime uses in production. */
 export async function shipToTangleOtlp(spans: TraceSpanEvent[], opts: ShipOptions): Promise<void> {
-  const res = await fetch(`${opts.endpoint}/v1/traces`, {
-    method: 'POST',
-    headers: {
-      'content-type': 'application/json',
-      authorization: `Bearer ${opts.apiKey}`,
-    },
-    body: JSON.stringify({
-      resourceSpans: [
-        {
-          resource: {
-            attributes: [
-              {
-                key: 'service.name',
-                value: { stringValue: opts.serviceName ?? 'agents-of-all-shapes' },
-              },
-            ],
-          },
-          scopeSpans: [
-            {
-              scope: { name: 'agents-of-all-shapes' },
-              spans: spans.map((s) => ({
-                traceId: s.traceId,
-                spanId: s.spanId,
-                name: s.name,
-                startTimeUnixNano: String(s.startTimeUnixNano),
-                endTimeUnixNano: String(s.endTimeUnixNano),
-                attributes: Object.entries(s.attributes).map(([key, value]) => ({
-                  key,
-                  value:
-                    typeof value === 'number'
-                      ? { doubleValue: value }
-                      : { stringValue: String(value) },
-                })),
-                status: s.status,
-              })),
-            },
-          ],
-        },
-      ],
-    }),
+  const exporter = createOtelExporter({
+    endpoint: opts.endpoint,
+    headers: { authorization: `Bearer ${opts.apiKey}` },
+    serviceName: opts.serviceName ?? 'agents-of-all-shapes',
   })
-  if (!res.ok) {
-    throw new Error(`intelligence ingest failed: ${res.status} ${await res.text()}`)
+  if (!exporter) throw new Error('shipToTangleOtlp: no OTLP endpoint configured')
+  // OTLP status.code is numeric (UNSET=0, OK=1, ERROR=2); TraceSpanEvent carries the string enum.
+  const statusCode = { UNSET: 0, OK: 1, ERROR: 2 } as const
+  for (const s of spans) {
+    exporter.exportSpan({
+      traceId: s.traceId,
+      spanId: s.spanId,
+      name: s.name,
+      startTimeUnixNano: String(s.startTimeUnixNano),
+      endTimeUnixNano: String(s.endTimeUnixNano),
+      attributes: Object.entries(s.attributes).map(([key, value]) => ({
+        key,
+        value: typeof value === 'number' ? { doubleValue: value } : { stringValue: String(value) },
+      })),
+      ...(s.status
+        ? { status: { code: statusCode[s.status.code], message: s.status.message } }
+        : {}),
+    })
   }
+  await exporter.flush()
 }
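
The per-span conversion inside `shipToTangleOtlp` (string status enum to numeric OTLP code, attribute values to `doubleValue` or `stringValue`) can be isolated as a pure function, which makes it testable without an exporter. The sketch below is an illustration only: `SpanLike` is a local stand-in for the fields of `TraceSpanEvent` that the diff reads, and `toOtlpSpan` and `OtlpSpan` are hypothetical names, not exports of `@tangle-network/agent-runtime`.

```ts
// Local stand-in for the TraceSpanEvent fields used in the diff (an assumption).
type StatusName = 'UNSET' | 'OK' | 'ERROR'

interface SpanLike {
  traceId: string
  spanId: string
  name: string
  startTimeUnixNano: number | string
  endTimeUnixNano: number | string
  attributes: Record<string, string | number | boolean>
  status?: { code: StatusName; message?: string }
}

type OtlpValue = { doubleValue: number } | { stringValue: string }

interface OtlpSpan {
  traceId: string
  spanId: string
  name: string
  startTimeUnixNano: string
  endTimeUnixNano: string
  attributes: Array<{ key: string; value: OtlpValue }>
  status?: { code: 0 | 1 | 2; message?: string | undefined }
}

// OTLP status codes are numeric: UNSET = 0, OK = 1, ERROR = 2.
const STATUS_CODE = { UNSET: 0, OK: 1, ERROR: 2 } as const

function toOtlpSpan(s: SpanLike): OtlpSpan {
  const span: OtlpSpan = {
    traceId: s.traceId,
    spanId: s.spanId,
    name: s.name,
    // Timestamps are serialized as strings, as in the diff.
    startTimeUnixNano: String(s.startTimeUnixNano),
    endTimeUnixNano: String(s.endTimeUnixNano),
    // Numbers become doubleValue; everything else is stringified.
    attributes: Object.entries(s.attributes).map(([key, value]) => ({
      key,
      value: typeof value === 'number' ? { doubleValue: value } : { stringValue: String(value) },
    })),
  }
  // Only attach a status when the source span carries one.
  if (s.status) span.status = { code: STATUS_CODE[s.status.code], message: s.status.message }
  return span
}
```

Note that booleans fall into the `stringValue` branch here, matching the diff's behaviour; a stricter mapping could emit OTLP's `boolValue` instead.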
diff --git a/examples/intelligence-drop-in/README.md b/examples/intelligence-drop-in/README.md
@@ -1,30 +1,38 @@
 # intelligence-drop-in
 
-The Observe + Mode-0 slice of the Tangle Intelligence SDK: wrap an existing
-agent, ship one trace per call, and pay only inference at the OFF tier. The
-wrapper is best-effort — a live agent never fails because Intelligence is down.
+The Observe + Mode-0 slice of the Tangle Intelligence SDK: wrap an existing agent, ship one trace per
+call, and pay **only inference** (the base model stream) at the OFF tier. **Why it matters:** you get
+per-call observability + billing with a one-line wrapper, and you can prove — not just assert — that
+turning intelligence *off* charges nothing extra. The wrapper is best-effort: a live agent never fails
+because Intelligence is down.
+
+> **Mode 0** = the OFF tier: telemetry stays on, but intelligence spend (analysts, corpus, extra spawns)
+> is clamped to 0. **Inference spend** = the base model stream you'd pay anyway; **intelligence spend** =
+> what the SDK's extra reasoning adds on top.
 
 ## Run
 
 ```bash
+# $0, no creds — stands up a throwaway local OTLP collector so the trace is visible without a key.
 pnpm tsx examples/intelligence-drop-in/intelligence-drop-in.ts
 ```
 
-The example stands up a throwaway local OTLP collector, so it runs with no
-credentials.
+It prints three proofs and **asserts** the last two (throws if they don't hold):
+1. wrap any `(input) => Promise<output>` in one line and it ships a trace;
+2. point it at a dead endpoint — the agent still answers (export failure swallowed);
+3. at `effort: 'off'`, read the exported span BACK off the collector and confirm `intelligence_usd = 0`.
 
 ## What it shows
 
-- `withTangleIntelligence(agent, { project, apiKey, endpoint })` — wrap any
-  `(input) => Promise<output>` agent; the shape is preserved and one trace span
-  is exported per call.
-- `createIntelligenceClient(...).traceRun(meta, fn)` — the explicit-trace API:
-  `trace.recordOutput` / `trace.recordOutcome` inside the body.
-- **Best-effort export** — pointed at a dead endpoint, the agent still returns
-  its answer; the export failure is swallowed.
-- **Mode 0 / OFF** (`effort: 'off'`) — pure passthrough, zero intelligence
-  spawns. The exported trace carries `{ inferenceUsd, intelligenceUsd }` and
-  `intelligenceUsd` is clamped to `0` — the mechanism that proves an OFF
-  customer paid inference-only.
-- `client.doctor()` — network-free readiness: Observe is always reachable;
-  Recommend and Gated-PR report the inputs they still need.
+- `withTangleIntelligence(agent, { project, apiKey, endpoint })` — wrap any agent; the call shape is
+  preserved and one trace span is exported per call, fire-and-forget.
+- `createIntelligenceClient(...).traceRun(meta, fn)` — the explicit-trace API (`trace.recordOutput` /
+  `trace.recordOutcome`), used here so we can `flush()` and read the span back.
+- **The OFF proof, by execution** — at OFF there is no intelligence spawn, so the exported span's
+  `intelligence_usd` is `0` by construction. The example digs it out of the OTLP payload and asserts it.
+
+## Going live
+
+Drop the local collector: set `TANGLE_API_KEY` and point `endpoint` at your real OTLP/HTTP collector
+(or omit `endpoint` to use `OTEL_EXPORTER_OTLP_ENDPOINT`). Raise `effort` from `off` to `standard`/`max`
+to enable the intelligence tiers — the same wrapper, one field changed.
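
The OFF proof reads the exported span back from the collector and checks `intelligence_usd`. A minimal sketch of that read-back against an OTLP/HTTP JSON payload follows. The function names `readNumericAttr` and `assertOffTier` are hypothetical, and how the SDK encodes the attribute (`doubleValue` versus `intValue`) is an assumption, so the sketch accepts both. The `resourceSpans` / `scopeSpans` / `spans` nesting is the standard OTLP JSON envelope.

```ts
// Minimal OTLP/HTTP JSON envelope types: resourceSpans -> scopeSpans -> spans -> attributes.
interface OtlpAttr {
  key: string
  // OTLP JSON encodes int64 values as strings, so intValue may be a string.
  value: { doubleValue?: number; intValue?: string | number; stringValue?: string }
}

interface OtlpPayload {
  resourceSpans: Array<{
    scopeSpans: Array<{
      spans: Array<{ name: string; attributes?: OtlpAttr[] }>
    }>
  }>
}

// Collects every numeric value recorded under `key`, one per span that carries it.
function readNumericAttr(payload: OtlpPayload, key: string): number[] {
  const values: number[] = []
  for (const rs of payload.resourceSpans) {
    for (const ss of rs.scopeSpans) {
      for (const span of ss.spans) {
        for (const attr of span.attributes ?? []) {
          if (attr.key !== key) continue
          const raw = attr.value.doubleValue ?? attr.value.intValue
          if (raw !== undefined) values.push(Number(raw))
        }
      }
    }
  }
  return values
}

// Throws unless at least one span reports `intelligence_usd` and every reported value is 0.
function assertOffTier(payload: OtlpPayload): void {
  const spend = readNumericAttr(payload, 'intelligence_usd')
  if (spend.length === 0) throw new Error('no span carried intelligence_usd')
  const nonZero = spend.filter((v) => v !== 0)
  if (nonZero.length > 0) throw new Error(`expected intelligence_usd = 0, got ${nonZero.join(', ')}`)
}
```

Failing when the attribute is missing is deliberate: an absent `intelligence_usd` proves nothing about spend, so the check treats it as an error rather than as zero.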