Run the real WebCode benchmark across a matrix of harnesses × models, scored by Exa's own graders, rendered as a publishable leaderboard.
WebCode is 33 coding tasks across 9 languages, each targeting a library API released after the model's training cutoff, so the agent must web-search to find the current signatures. Each task ships a prompt, the file to produce, and a pytest grader (test_patch).
Exa open-sources the dataset only: "No agent harness included", with the advice to bring your own (e.g. mini-swe-agent). So this example is the harness: it sweeps the Cartesian product of harness (what drives the agent: claude-code / codex / opencode / gemini) × model, runs each (harness×model, task) cell in its own sandbox with web search on, writes the agent's solution, and runs Exa's exact test_patch. A cell passes if and only if pytest exits 0. No LLM judge; no invented tasks.
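The sweep described above can be sketched as two nested Cartesian products. This is only an illustration: the type names, `buildProfiles`, `buildCells`, `passed`, and the placeholder model ids `model-a` / `model-b` are hypothetical and are not taken from webcode-matrix.ts.

```typescript
// Hypothetical types for illustration; the real example may model these differently.
type Harness = "claude-code" | "codex" | "opencode" | "gemini";

interface Profile {
  harness: Harness;
  model: string;
}

interface Cell {
  profile: Profile;
  taskId: string;
}

// Every (harness, model) pair is one profile.
function buildProfiles(harnesses: Harness[], models: string[]): Profile[] {
  return harnesses.flatMap((harness) => models.map((model) => ({ harness, model })));
}

// Every (profile, task) pair is one cell, meant to run in its own sandbox.
function buildCells(profiles: Profile[], taskIds: string[]): Cell[] {
  return profiles.flatMap((profile) => taskIds.map((taskId) => ({ profile, taskId })));
}

// The only pass criterion: the grader's pytest process exits with code 0.
function passed(pytestExitCode: number): boolean {
  return pytestExitCode === 0;
}

// Placeholder model ids; substitute whatever your router exposes.
const profiles = buildProfiles(["claude-code", "codex"], ["model-a", "model-b"]);
const cells = buildCells(profiles, ["task-1", "task-2", "task-3"]);
console.log(`${profiles.length} profiles, ${cells.length} cells`);
```

With 2 harnesses, 2 models, and 3 tasks this yields 4 profiles and 12 cells, which is why LIMIT matters for a cheap smoke run: cost scales with the full product.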
The real 33-task dataset (MIT) is fetched, not committed (it carries secret-shaped test fixtures) — run data/fetch.sh first, or set WEBCODE_DATASET. See data/SOURCE.md for provenance.
Each task ships a Dockerfile pinning its toolchain (Swift 6.1, Go 1.23, …) plus python/pytest to run the grader. How the sandbox provides that toolchain:
- environment: 'universal' (default here): one multi-language Nix stack (python+pytest + Go/Py/TS/Java/C++), the same default the commit0/clbench gates use. Zero per-task work; covers the common languages.
- Per-task image: exotic toolchains that universal lacks (Swift/Elixir/Kotlin) ship their own base in task.baseImage (parsed from the task's FROM); pass it as sandboxOverrides.image. Already plumbed; no new code.
- Pre-built per-task image: for byte-exact parity with Exa's Dockerfile, pre-build each into a registry image and reference it by tag.
A missing toolchain surfaces as a failing test, never a fake pass.
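A minimal sketch of the per-task image choice, assuming the base image is whatever the final FROM instruction names and that each task carries a language tag. `parseBaseImage`, `sandboxOverridesFor`, `TaskLike`, the language ids, and the sample image tags are hypothetical; only task.baseImage and sandboxOverrides.image come from the text above.

```typescript
// Languages the text says the universal stack covers (ids are illustrative).
const UNIVERSAL_LANGS = new Set(["python", "go", "typescript", "java", "cpp"]);

// Hypothetical task shape; the real dataset loader may expose different fields.
interface TaskLike {
  language: string;
  dockerfile: string;
}

// Returns the image named by the last FROM instruction, or undefined if there is none.
// Assumption: for multi-stage Dockerfiles the final stage is the one that matters.
function parseBaseImage(dockerfile: string): string | undefined {
  let image: string | undefined;
  for (const rawLine of dockerfile.split(/\r?\n/)) {
    // Dockerfile syntax: FROM [--platform=<platform>] <image> [AS <name>]
    const match = /^FROM\s+(?:--\S+\s+)*(\S+)/i.exec(rawLine.trim());
    if (match) image = match[1];
  }
  return image;
}

// Use the universal environment when it covers the language; otherwise
// fall back to the task's own base image if one can be parsed.
function sandboxOverridesFor(task: TaskLike): { image?: string } {
  if (UNIVERSAL_LANGS.has(task.language)) return {};
  const image = parseBaseImage(task.dockerfile);
  return image ? { image } : {};
}

const swiftTask: TaskLike = {
  language: "swift",
  dockerfile: "FROM swift:6.1\nRUN apt-get update",
};
console.log(sandboxOverridesFor(swiftTask));
```

If no image can be parsed, the sketch returns no override and the cell runs on universal, where a missing toolchain shows up as a failing test rather than a pass.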
examples/webcode-matrix/data/fetch.sh # one-time: download the 33-task dataset
TANGLE_API_KEY=… # ONE key — sandbox + model router + router-backed web_search
SANDBOX_API_KEY=… # the sandbox service (omit if your TANGLE_API_KEY also provisions sandboxes)
LIMIT=3 # optional: first N tasks for a cheap smoke (omit for all 33)
tsx examples/webcode-matrix/webcode-matrix.ts

Search is router-backed: the agent's web_search goes through the Tangle router on TANGLE_API_KEY (provider picked by TANGLE_SEARCH_DEFAULT_PROVIDER, default exa), so no separate Exa key is needed. The you, perplexity, tavily, parallel, and brave providers work the same way.
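One way the environment variables above might be resolved, as a sketch only. `resolveConfig`, `applyLimit`, and `RunConfig` are hypothetical names, and the fallback from SANDBOX_API_KEY to TANGLE_API_KEY is an assumption based on the "omit if your TANGLE_API_KEY also provisions sandboxes" note, not confirmed behavior of the script.

```typescript
type Env = Record<string, string | undefined>;

interface RunConfig {
  routerKey: string;
  sandboxKey: string;
  searchProvider: string;
  limit: number | undefined; // undefined means "run every task"
}

function resolveConfig(env: Env): RunConfig {
  const routerKey = env["TANGLE_API_KEY"];
  if (!routerKey) throw new Error("TANGLE_API_KEY is required");

  // LIMIT is optional; when present it must be a positive integer.
  const rawLimit = env["LIMIT"];
  const limit =
    rawLimit === undefined || rawLimit === "" ? undefined : Number.parseInt(rawLimit, 10);
  if (limit !== undefined && (!Number.isInteger(limit) || limit < 1)) {
    throw new Error(`LIMIT must be a positive integer, got "${rawLimit}"`);
  }

  return {
    routerKey,
    // Assumed fallback: reuse the router key when no sandbox key is set.
    sandboxKey: env["SANDBOX_API_KEY"] ?? routerKey,
    // The text states exa is the default provider.
    searchProvider: env["TANGLE_SEARCH_DEFAULT_PROVIDER"] ?? "exa",
    limit,
  };
}

// Keep only the first N tasks for a cheap smoke run.
function applyLimit<T>(tasks: T[], limit: number | undefined): T[] {
  return limit === undefined ? tasks : tasks.slice(0, limit);
}

const config = resolveConfig({ TANGLE_API_KEY: "example-key", LIMIT: "3" });
console.log(config.searchProvider, applyLimit(["t1", "t2", "t3", "t4", "t5"], config.limit));
```

In a Node script the argument would typically be process.env; taking it as a parameter keeps the function easy to test.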
Writes report.md + report.svg + report.html to RUN_DIR — a ranked leaderboard, the full profile×task score matrix, and embeddable charts. That rendering is the general leaderboard engine (@tangle-network/agent-runtime/loops), which turns any runProfileMatrix result into the same report for any benchmark in any domain.
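To make the ranked leaderboard concrete, here is a rough sketch of turning per-cell pass/fail results into a ranked markdown table. It does not use or reproduce the @tangle-network/agent-runtime/loops API; `CellResult`, `Row`, `rank`, and `toMarkdown` are hypothetical, and ranking by pass rate with a name tie-break is an assumption about how such a report could be ordered.

```typescript
interface CellResult {
  profile: string; // e.g. "codex × model-a" (placeholder label)
  taskId: string;
  passed: boolean;
}

interface Row {
  profile: string;
  passed: number;
  total: number;
  passRate: number;
}

// Tally passes per profile, then sort by pass rate (descending), ties by name.
function rank(results: CellResult[]): Row[] {
  const tallies = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const tally = tallies.get(r.profile) ?? { passed: 0, total: 0 };
    tally.total += 1;
    if (r.passed) tally.passed += 1;
    tallies.set(r.profile, tally);
  }
  return Array.from(tallies.entries())
    .map(([profile, t]) => ({ profile, passed: t.passed, total: t.total, passRate: t.passed / t.total }))
    .sort((a, b) => b.passRate - a.passRate || a.profile.localeCompare(b.profile));
}

// Render the ranked rows as a markdown table.
function toMarkdown(rows: Row[]): string {
  const header = "| Rank | Profile | Passed | Pass rate |\n|---|---|---|---|";
  const body = rows.map(
    (r, i) => `| ${i + 1} | ${r.profile} | ${r.passed}/${r.total} | ${(r.passRate * 100).toFixed(1)}% |`,
  );
  return [header, ...body].join("\n");
}

const ranked = rank([
  { profile: "codex × model-a", taskId: "task-1", passed: true },
  { profile: "codex × model-a", taskId: "task-2", passed: false },
  { profile: "opencode × model-b", taskId: "task-1", passed: true },
  { profile: "opencode × model-b", taskId: "task-2", passed: true },
]);
console.log(toMarkdown(ranked));
```

The sample results are made up for the sketch and say nothing about how any real harness or model performs.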
- webcode-dataset.ts: loadWebCodeTasks() loads the real 33 tasks (prompt + test_patch + solution file + dockerfile).
- webcode-matrix.ts: the harness×model grid, the dispatch (run → write solution → run Exa's grader), and the leaderboard render.
- ../intelligence-webcode: the same benchmark with the full Tangle Intelligence SDK (cost, per-tool waterfall, OTLP) on every cell.
WebCode isolates one thing: can an agent retrieve and apply knowledge it was never trained on? The matrix answers which harness+model does it best — the harness controls how the agent searches and iterates, the model controls how well it reasons over what it finds, and Exa's hidden tests keep everyone honest.