Merged
12 changes: 6 additions & 6 deletions examples/agentic-data-creation/README.md
@@ -1,12 +1,12 @@
# agentic-data-creation

**An agent manufactures its own hard training data.** This is the INNER loop of Autodata /
Agentic Self-Instruct (Meta FAIR, arXiv 2606.25996): instead of hand-writing examples, an agent
**An agent manufactures its own hard training data.** This is the INNER loop of **agentic
self-instruct** (the self-instruct pattern, Wang et al. 2022, taken agentic): instead of hand-writing examples, an agent
*writes* candidate {context, question, reference, rubric} examples from a grounding doc and keeps
only the ones that are **hard for a weak solver but doable for a strong one**. The hard ones are
exactly the examples worth training on.

> This example builds only the paper's **data-creation** half (the inner loop). The RL-training
> This example builds only the **data-creation** half (the inner loop). The RL-training
> outer half needs a trainer this repo does not have, so it is out of scope here.

Runs fully offline (scripted solvers + a mocked judge, no credentials):
@@ -47,7 +47,7 @@ flowchart TD
## The one new piece — `discriminativeAcceptRule`

Everything else is composed from primitives this repo already ships. The genuinely new piece is the
paper's reward, written as a small, Validator-shaped accept/reject:
accept rule, written as a small, Validator-shaped accept/reject:

```ts
discriminativeAcceptRule({ strongScore, weakScore, minStrong = 0.65, maxWeak = 0.5, minGap = 0.2 })
@@ -86,7 +86,7 @@ this before trusting it (the `calibrate-before-measure` discipline): it measures
challenger's **first (un-refined) draft** — plain generation — and on the **loop-accepted** example,
and shows the accept rule separates them. **Offline the solvers are scripted, so this proves the
wiring + that the rule discriminates by construction — it is NOT an empirical reproduction of the
paper's Table 1.** Reproducing Table 1 for real (the loop actually producing harder data) needs the
illustrative target.** Reproducing that separation for real (the loop actually producing harder data) needs the
live run below, with real two-tier solver models:

```
@@ -101,7 +101,7 @@ examples would be uninformative, and the loop would be optimizing noise.

`offline-fixtures.ts` is the credentialless stand-in (the same pattern `examples/driver-loop` and
`examples/self-improving-loop` use): deterministic scripted challenger/solvers and a **mocked judge
transport** bound into a *real* `llmJudge` `JudgeConfig`, tuned to reproduce Table 1. The judge,
transport** bound into a *real* `llmJudge` `JudgeConfig`, tuned to the illustrative target separation. The judge,
sampler, fold, cost ledger, and corpus are all the real primitives — only the LLM responses are
scripted. To run live, swap the mock transport for `createChatClient({ transport: 'router', apiKey })`
(glm-5.2) and the scripted workers for real sandbox/cli-bridge clients; the loop is unchanged.
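
The accept rule this README describes can be illustrated with a minimal sketch. The parameter names and defaults below are taken from the `discriminativeAcceptRule` signature quoted in the diff; the function name `discriminativeAcceptSketch`, the result shape, the reading of "clear minGap" as `gap >= minGap`, and the sample scores are all assumptions made for illustration, not the repo's actual implementation.

```ts
// Hypothetical sketch: parameter names and defaults mirror the README signature.
// The body, the result shape, and the function name are assumptions for illustration.
interface AcceptInput {
  strongScore: number
  weakScore: number
  minStrong?: number
  maxWeak?: number
  minGap?: number
}

interface AcceptSketchResult {
  accepted: boolean
  gap: number
  reasons: string[] // why the example was rejected; empty when accepted
}

function discriminativeAcceptSketch({
  strongScore,
  weakScore,
  minStrong = 0.65,
  maxWeak = 0.5,
  minGap = 0.2,
}: AcceptInput): AcceptSketchResult {
  const gap = strongScore - weakScore
  const reasons: string[] = []
  // The strong solver should mostly get it.
  if (strongScore < minStrong) reasons.push(`strong ${strongScore} < minStrong ${minStrong}`)
  // The weak solver should mostly miss it.
  if (weakScore >= maxWeak) reasons.push(`weak ${weakScore} >= maxWeak ${maxWeak}`)
  // The margin must clear minGap (assumed here to mean gap >= minGap).
  if (gap < minGap) reasons.push(`gap ${gap.toFixed(2)} < minGap ${minGap}`)
  return { accepted: reasons.length === 0, gap, reasons }
}

// Made-up scores chosen only to match the illustrative gaps (about 0.02 and 0.31).
const plainDraft = discriminativeAcceptSketch({ strongScore: 0.8, weakScore: 0.78 })
const loopAccepted = discriminativeAcceptSketch({ strongScore: 0.75, weakScore: 0.44 })

console.log(plainDraft.accepted, plainDraft.reasons)
console.log(loopAccepted.accepted, loopAccepted.gap.toFixed(2))
```

Returning the rejection reasons alongside the boolean is one possible design choice: it would let a refinement step see which threshold a draft missed instead of only learning that it failed.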
10 changes: 5 additions & 5 deletions examples/agentic-data-creation/agentic-data-creation.ts
@@ -1,7 +1,7 @@
/**
* agentic-data-creation — the INNER loop of Autodata / Agentic Self-Instruct (Meta FAIR,
* arXiv 2606.25996): an agent MANUFACTURES hard training examples from a grounding doc, and keeps
* only the ones that DISCRIMINATE a strong solver from a weak one.
* agentic-data-creation — the INNER loop of AGENTIC SELF-INSTRUCT (the self-instruct pattern,
* Wang et al. 2022, taken agentic): an agent MANUFACTURES hard training examples from a grounding doc,
* and keeps only the ones that DISCRIMINATE a strong solver from a weak one.
*
* This file is the SUBJECT. The whole method is four roles + one accept rule, composed from
* primitives this package already ships — nothing here re-implements a judge, a sampler, a cost
@@ -78,7 +78,7 @@ export interface AcceptDecision {
// THE ONE NEW PIECE — the paper's discriminative reward, as a small Validator-shaped rule.
// ═══════════════════════════════════════════════════════════════════════════════════════════
//
// Autodata keeps an example ONLY IF it separates a strong solver from a weak one: the strong
// The accept rule keeps an example ONLY IF it separates a strong solver from a weak one: the strong
// solver should mostly get it (>= minStrong), the weak solver should mostly miss it (< maxWeak),
// and the margin between them (the "gap") must clear minGap. That is the whole objective — make
// examples too hard for the weak solver — so the rule is the LITERAL accept criterion, never
@@ -340,7 +340,7 @@ export interface DataCreationResult {
}

/**
* Run the Autodata inner loop: manufacture `target` discriminating examples from `doc`, refining
* Run the self-instruct inner loop: manufacture `target` discriminating examples from `doc`, refining
* each via the challenger fold until it is accepted (or its retry budget runs out). Returns the
* accepted set, the per-example gap for the accepted (agentic) AND the first-draft (plain) examples
* for calibration, the corpus they accreted into, and the cost ledger.
7 changes: 4 additions & 3 deletions examples/agentic-data-creation/offline-fixtures.ts
@@ -13,8 +13,9 @@
* [0,1] rubric score from (strength, difficulty) with a small per-sample jitter so the N× mean
* is a genuine average. LIVE mode (glm-5.2) instead reads the real answer text against the rubric.
*
* The scores are tuned to reproduce the paper's Table 1 separation: an EASY (plain) example barely
* separates the two solvers (gap ≈ 0.02); a HARD (agentic) one separates them widely (gap ≈ 0.31).
* The scores are tuned to an ILLUSTRATIVE target separation: an EASY (plain) example barely separates
* the two solvers (gap ≈ 0.02); a HARD (agentic) one separates them widely (gap ≈ 0.31). These numbers are
* by construction here — a live run produces the real ones.
*/

import { createChatClient, llmJudge } from '@tangle-network/agent-eval'
@@ -157,7 +158,7 @@ export function solverClient(strength: 'weak' | 'strong'): SandboxClient {
// `llmJudge` builds the system+user messages, makes ONE chat() call, and parses the model's
// `{ dimensions, notes }` JSON into a canonical [0,1] `JudgeScore` (real composite math). Offline,
// the transport returns a scripted score from the answer's grade marker; live, a real model scores
// the prose. Tuned so EASY → gap ≈ 0.02, HARD → gap ≈ 0.31 (the paper's Table 1).
// the prose. Tuned so EASY → gap ≈ 0.02, HARD → gap ≈ 0.31 (illustrative targets, by construction).
export function buildRubricJudge(): JudgeConfig<SolverArtifact> {
const chat = createChatClient({
transport: 'mock',
6 changes: 3 additions & 3 deletions examples/agentic-data-creation/run.ts
@@ -5,7 +5,7 @@
* training examples from one grounding doc, and prints:
* 1. each accepted example with its weak/strong solver scores and the gap,
* 2. the CALIBRATION — does the gap metric actually separate? A plainly-generated (first-draft)
* example should show a SMALL gap; an agentic-loop-accepted one a LARGE gap (the paper's Table 1),
* example should show a SMALL gap; an agentic-loop-accepted one a LARGE gap (the illustrative target),
* 3. the cost ledger, split by role (challenger vs each solver) — composed, never hand-counted.
*
* Run: pnpm tsx examples/agentic-data-creation/run.ts
@@ -55,9 +55,9 @@ async function main(): Promise<void> {
const plain = mean(result.plainGaps)
const agentic = mean(result.agenticGaps)
console.log('\n— Calibration: does the gap metric discriminate? —')
console.log(` plain (first-draft examples) mean gap = ${plain.toFixed(2)} (paper ≈ 0.02)`)
console.log(` plain (first-draft examples) mean gap = ${plain.toFixed(2)} (target ≈ 0.02)`)
console.log(
` agentic (loop-accepted examples) mean gap = ${agentic.toFixed(2)} (paper ≈ 0.31)`,
` agentic (loop-accepted examples) mean gap = ${agentic.toFixed(2)} (target ≈ 0.31)`,
)
const separates = Number.isFinite(plain) && Number.isFinite(agentic) && agentic - plain >= 0.15
console.log(
2 changes: 1 addition & 1 deletion examples/agents-of-all-shapes/README.md
@@ -20,7 +20,7 @@ No sandbox. No deploy. No server. The analysis runs **in-process**.

```bash
# Verified QA path — in-process, no key, no infra:
npx tsx examples/agents-of-all-shapes/run.ts
pnpm tsx examples/agents-of-all-shapes/run.ts

# CI verification (what proves it):
pnpm test -- tests/agents-of-all-shapes.test.ts
71 changes: 27 additions & 44 deletions examples/agents-of-all-shapes/shared/intelligence.ts
@@ -17,6 +17,7 @@

import { analyzeRuns, fromOtelSpans, type InsightReport } from '@tangle-network/agent-eval/contract'
import type { TraceSpanEvent } from '@tangle-network/agent-eval/hosted'
import { createOtelExporter } from '@tangle-network/agent-runtime'

export type { InsightReport, TraceSpanEvent }

@@ -101,51 +102,33 @@ export interface ShipOptions {
serviceName?: string
}

/** Optional hosted path: POST the same OTel spans to Tangle Intelligence's
* OTLP/HTTP ingest. Identical analysis runs server-side. */
/** Optional hosted path: POST the same OTel spans to Tangle Intelligence's OTLP/HTTP ingest via the
* runtime's OWN exporter — `createOtelExporter` builds the resourceSpans envelope, appends `/v1/traces`,
* and batches the POST. No hand-rolled wire format; the same primitive the runtime uses in production. */
export async function shipToTangleOtlp(spans: TraceSpanEvent[], opts: ShipOptions): Promise<void> {
const res = await fetch(`${opts.endpoint}/v1/traces`, {
method: 'POST',
headers: {
'content-type': 'application/json',
authorization: `Bearer ${opts.apiKey}`,
},
body: JSON.stringify({
resourceSpans: [
{
resource: {
attributes: [
{
key: 'service.name',
value: { stringValue: opts.serviceName ?? 'agents-of-all-shapes' },
},
],
},
scopeSpans: [
{
scope: { name: 'agents-of-all-shapes' },
spans: spans.map((s) => ({
traceId: s.traceId,
spanId: s.spanId,
name: s.name,
startTimeUnixNano: String(s.startTimeUnixNano),
endTimeUnixNano: String(s.endTimeUnixNano),
attributes: Object.entries(s.attributes).map(([key, value]) => ({
key,
value:
typeof value === 'number'
? { doubleValue: value }
: { stringValue: String(value) },
})),
status: s.status,
})),
},
],
},
],
}),
const exporter = createOtelExporter({
endpoint: opts.endpoint,
headers: { authorization: `Bearer ${opts.apiKey}` },
serviceName: opts.serviceName ?? 'agents-of-all-shapes',
})
if (!res.ok) {
throw new Error(`intelligence ingest failed: ${res.status} ${await res.text()}`)
if (!exporter) throw new Error('shipToTangleOtlp: no OTLP endpoint configured')
// OTLP status.code is numeric (UNSET=0, OK=1, ERROR=2); TraceSpanEvent carries the string enum.
const statusCode = { UNSET: 0, OK: 1, ERROR: 2 } as const
for (const s of spans) {
exporter.exportSpan({
traceId: s.traceId,
spanId: s.spanId,
name: s.name,
startTimeUnixNano: String(s.startTimeUnixNano),
endTimeUnixNano: String(s.endTimeUnixNano),
attributes: Object.entries(s.attributes).map(([key, value]) => ({
key,
value: typeof value === 'number' ? { doubleValue: value } : { stringValue: String(value) },
})),
...(s.status
? { status: { code: statusCode[s.status.code], message: s.status.message } }
: {}),
})
}
await exporter.flush()
}
44 changes: 26 additions & 18 deletions examples/intelligence-drop-in/README.md
@@ -1,30 +1,38 @@
# intelligence-drop-in

The Observe + Mode-0 slice of the Tangle Intelligence SDK: wrap an existing
agent, ship one trace per call, and pay only inference at the OFF tier. The
wrapper is best-effort — a live agent never fails because Intelligence is down.
The Observe + Mode-0 slice of the Tangle Intelligence SDK: wrap an existing agent, ship one trace per
call, and pay **only inference** (the base model stream) at the OFF tier. **Why it matters:** you get
per-call observability + billing with a one-line wrapper, and you can prove — not just assert — that
turning intelligence *off* charges nothing extra. The wrapper is best-effort: a live agent never fails
because Intelligence is down.

> **Mode 0** = the OFF tier: telemetry stays on, but intelligence spend (analysts, corpus, extra spawns)
> is clamped to 0. **Inference spend** = the base model stream you'd pay anyway; **intelligence spend** =
> what the SDK's extra reasoning adds on top.

## Run

```bash
# $0, no creds — stands up a throwaway local OTLP collector so the trace is visible without a key.
pnpm tsx examples/intelligence-drop-in/intelligence-drop-in.ts
```

The example stands up a throwaway local OTLP collector, so it runs with no
credentials.
It prints three proofs and **asserts** the last two (throws if they don't hold):
1. wrap any `(input) => Promise<output>` in one line and it ships a trace;
2. point it at a dead endpoint — the agent still answers (export failure swallowed);
3. at `effort: 'off'`, read the exported span BACK off the collector and confirm `intelligence_usd = 0`.

## What it shows

- `withTangleIntelligence(agent, { project, apiKey, endpoint })` — wrap any
`(input) => Promise<output>` agent; the shape is preserved and one trace span
is exported per call.
- `createIntelligenceClient(...).traceRun(meta, fn)` — the explicit-trace API:
`trace.recordOutput` / `trace.recordOutcome` inside the body.
- **Best-effort export** — pointed at a dead endpoint, the agent still returns
its answer; the export failure is swallowed.
- **Mode 0 / OFF** (`effort: 'off'`) — pure passthrough, zero intelligence
spawns. The exported trace carries `{ inferenceUsd, intelligenceUsd }` and
`intelligenceUsd` is clamped to `0` — the mechanism that proves an OFF
customer paid inference-only.
- `client.doctor()` — network-free readiness: Observe is always reachable;
Recommend and Gated-PR report the inputs they still need.
- `withTangleIntelligence(agent, { project, apiKey, endpoint })` — wrap any agent; the call shape is
preserved and one trace span is exported per call, fire-and-forget.
- `createIntelligenceClient(...).traceRun(meta, fn)` — the explicit-trace API (`trace.recordOutput` /
`trace.recordOutcome`), used here so we can `flush()` and read the span back.
- **The OFF proof, by execution** — at OFF there is no intelligence spawn, so the exported span's
`intelligence_usd` is `0` by construction. The example digs it out of the OTLP payload and asserts it.

## Going live

Drop the local collector: set `TANGLE_API_KEY` and point `endpoint` at your real OTLP/HTTP collector
(or omit `endpoint` to use `OTEL_EXPORTER_OTLP_ENDPOINT`). Raise `effort` from `off` to `standard`/`max`
to enable the intelligence tiers — the same wrapper, one field changed.
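
The OFF proof described in this README (read the exported span back and confirm `intelligence_usd = 0`) could look roughly like the sketch below. It assumes the collector hands back a standard OTLP/JSON body (`resourceSpans` → `scopeSpans` → `spans` → `attributes`). The helper names `numericAttribute` and `assertOffTier`, the span name, and the sample payload are hypothetical; only the `intelligence_usd` attribute key comes from the README.

```ts
// Minimal OTLP/JSON shapes, limited to the fields this sketch reads.
type OtlpValue = { doubleValue?: number; intValue?: string | number; stringValue?: string }
interface OtlpAttribute { key: string; value: OtlpValue }
interface OtlpSpan { name: string; attributes?: OtlpAttribute[] }
interface OtlpPayload {
  resourceSpans: { scopeSpans: { spans: OtlpSpan[] }[] }[]
}

// Collect every value of `key` across all exported spans, coerced to a number.
// A non-numeric value becomes NaN, which the assertion below treats as a failure.
function numericAttribute(payload: OtlpPayload, key: string): number[] {
  const found: number[] = []
  for (const rs of payload.resourceSpans) {
    for (const ss of rs.scopeSpans) {
      for (const span of ss.spans) {
        for (const attr of span.attributes ?? []) {
          if (attr.key !== key) continue
          const v = attr.value
          found.push(
            v.doubleValue ?? (v.intValue !== undefined ? Number(v.intValue) : Number(v.stringValue)),
          )
        }
      }
    }
  }
  return found
}

// Throw unless at least one span carries `intelligence_usd` and every value is exactly 0.
function assertOffTier(payload: OtlpPayload): void {
  const values = numericAttribute(payload, 'intelligence_usd')
  if (values.length === 0) throw new Error('no intelligence_usd attribute was exported')
  for (const v of values) {
    if (v !== 0) throw new Error(`OFF tier charged intelligence spend: ${v}`)
  }
}

// Hypothetical payload standing in for what the local collector received.
const samplePayload: OtlpPayload = {
  resourceSpans: [
    {
      scopeSpans: [
        {
          spans: [
            {
              name: 'agent.call', // assumed span name
              attributes: [{ key: 'intelligence_usd', value: { doubleValue: 0 } }],
            },
          ],
        },
      ],
    },
  ],
}

assertOffTier(samplePayload) // passes silently when the OFF claim holds
```

Failing when the attribute is missing, rather than treating absence as zero, keeps the check from passing when no span was exported at all.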