diff --git a/docs/2026-06-25_aws_gateway_benchmark/ARTICLE.md b/docs/2026-06-25_aws_gateway_benchmark/ARTICLE.md
new file mode 100644
index 00000000..f59209d8
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/ARTICLE.md
@@ -0,0 +1,222 @@
+---
+title: "AI Gateway Benchmark 2026: GoModel vs LiteLLM, Portkey & Bifrost"
+description: "GoModel vs LiteLLM, Portkey, and Bifrost - a reproducible AWS benchmark of four open-source AI gateways across latency, throughput, memory, CPU, and Docker image size. A fast, lightweight LiteLLM alternative in Go."
+coverImage: "/blog/charts/gomodel-gateway-benchmark-june-2026-cover.png"
+coverImageWidth: 2400
+coverImageHeight: 1260
+pubDate: 2026-06-26
+author: "Jakub A. Wasek"
+tags:
+  - benchmarking
+  - ai-gateway
+  - litellm
+  - portkey
+  - bifrost
+  - gomodel
+---
+
+![GoModel vs LiteLLM, Portkey and Bifrost - latency is overrated, look at the bill](./cover.png)
+
+The point of this benchmark is not to prove that LiteLLM sucks. The point is to
+measure GoModel honestly against the gateways people actually compare it to:
+**LiteLLM, Portkey, and Bifrost**.
+
+That said - yes, LiteLLM sucks, and that is exactly why GoModel exists. (If you're
+not sure what I mean, I'd recommend giving the software a try yourself - or doing
+your own research.)
+
+In October 2025 I tried to build my startup on top of LiteLLM. I quickly found
+out that the software is badly designed at its core. A proxy-like server, on
+the hot path of every request, written in Python? On top of that came a long
+tail of operational issues. So I did my research and started writing GoModel: a
+production-grade, enterprise-grade AI gateway / AI control plane, in Go.
+
+The later <a href="https://futuresearch.ai/blog/litellm-pypi-supply-chain-attack/" rel="nofollow noopener noreferrer">supply-chain security incident</a> around LiteLLM only confirmed my view.
+Go and its standard-library-heavy dependency trees are structurally far less
+exposed to that class of attack than a sprawling Python dependency graph.
+
+When I [launched GoModel on Hacker News](https://news.ycombinator.com/item?id=47861333)
+I told the thread I'd publish a real, reproducible benchmark. Here it comes.
+
+With the motivation out of the way, let's talk about what's actually worth
+measuring in an AI gateway benchmark - the metrics that make a comparison
+meaningful.
+
+## What to measure to choose the best AI gateway
+
+Here is the full list of metrics that matter:
+
+- `p99` / `p95` / `p50` latency (proxy overhead)
+- RAM consumption
+- CPU consumption (and throughput per core)
+- Cold-start time
+- Docker image size
+- Vendor neutrality
+- Open-source licensing
+
+A couple of these deserve a closer look.
+
+### Latency
+
+Latency matters less than you'd assume. Be precise about what we are measuring:
+**proxy overhead latency** - the time the gateway itself adds, on top of the
+upstream call.
+
+The trap is treating latency as the ultimate criterion. In any real workload the
+dominant latency comes from inference. The gateway's overhead is a small fraction
+of the total you're already living with. A gateway that is "2x faster" at adding
+`5 ms` is not meaningfully faster once a model takes `2000 ms` to respond.
+
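+To put a number on that, here is a rough worked example. It assumes "2x faster"
+means `5 ms` of overhead instead of `10 ms` (illustrative figures, not
+measurements from this benchmark) on top of a `2000 ms` model response:
+
+```latex
+\frac{t_{\text{slow}} - t_{\text{fast}}}{t_{\text{slow}}}
+  = \frac{(2000 + 10) - (2000 + 5)}{2000 + 10}
+  = \frac{5}{2010}
+  \approx 0.25\%
+```
+
+Under those assumptions, halving the gateway's overhead shortens the end-to-end
+request by about a quarter of a percent.
+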
+So I care far more about the *tail* (p99) than the median - a gateway that is
+usually fast but occasionally stalls is worse than one that is boringly
+consistent.
+
+### Resource consumption - CPU, RAM, image size, cold start
+
+These are the metrics that actually move the needle, because they map directly to:
+
+1. The monthly cost of your infrastructure.
+2. Whether you can run the gateway serverless (AWS Lambda, GCP Functions) or on
+   edge devices at all.
+
+A `372 MB` image (`1.2 GB` unpacked) that idles at gigabytes of RAM and takes
+`25 s` to cold-start is a different operational animal than a `16 MB` image that
+peaks at `37 MB` of RAM and is serving traffic `0.56 s` after launch.
+
+## The benchmark
+
+Every gateway talked to the **same instant mock backend**, so the numbers reflect
+gateway overhead, not model latency or network jitter. Each ran one at a time, in
+Docker, on an **AWS `c7i.large`** (2 vCPU, 4 GiB) running the latest **Amazon Linux
+2023** AMI - the whole thing is Terraform'd, runs on one command, and tears itself
+down afterwards.
+
+I actually ran this twice. The **first cut used the free-tier `t2.micro`**
+(1 vCPU, 1 GiB) - cheap, self-destructing, trivial to reproduce. But I realized
+that was *unfair to the competitors*: a 1 GiB box can't hold the memory-heavy
+gateways (LiteLLM idles near a gigabyte), so they spill into **swap** and get
+penalized for the host being too small rather than for their own overhead. So I
+switched to the roomier, non-burstable **`c7i.large`** - nothing swaps there, and a
+fixed-performance instance also removes the CPU-credit drift that muddies the tail
+on burstable boxes. **The relative results barely moved between the two runs** -
+GoModel still won on tail latency, throughput, memory, and image size. Giving the
+heavy gateways enough RAM to not thrash makes the comparison *more* honest, not
+less.
+
+I tested four gateways across six workloads - chat completions, the Responses API,
+and Anthropic messages, each streaming and non-streaming - driven at `8,000`
+requests per workload, concurrency `10`, across **two trials with randomized
+gateway order**. Latency is the **median across trials**, and I report each p99
+with its min-max across trials so a single noisy window can't drive the story.
+
+A few methodology details worth calling out:
+
+- **Throughput is measured, not inferred.** The latency runs report
+  completed-req/s at a fixed concurrency, which is just latency restated. Real
+  capacity comes from a separate **concurrency sweep** that drives each gateway to
+  saturation and records sustained req/s.
+- **I warm up every dialect before measuring it.** LiteLLM lazily imports its
+  per-dialect translation modules on first use, so a naive chat-only warmup left
+  the Responses and Messages paths cold and inflated their tails. I neutralized
+  that to be fair - but note what it tells you: a server that pays an import tax
+  the first time it sees a request type is, again, not designed for the hot path.
+- **Fair resilience config.** Every gateway runs with retries disabled. I also
+  disabled GoModel's circuit breaker for the test - under the saturation sweep a
+  few transient errors would otherwise trip it and it would (correctly, in
+  production) start rejecting requests, which would unfairly zero out its *own*
+  throughput. No other gateway here has a breaker, so off is the apples-to-apples
+  setting.
+- **LiteLLM at its recommended worker count.** A LiteLLM worker is effectively
+  single-threaded, and its own production guidance is one worker per CPU core - so I
+  run it with `num_workers` = the box's vCPU count (`2` here), the same multi-core
+  access the Go gateways get for free. (Pin it to one worker and it under-uses the
+  box; give it more and, as the table shows, its memory balloons. There's no setting
+  that makes it both fast *and* light.)
+- **Streaming uses terminal-marker or idle-gap detection**, so a gateway that
+  streams content without ever sending a terminal event (Bifrost, over a
+  non-native backend) is measured to last byte instead of hanging the harness.
+
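+The "latency restated" point in the first bullet follows from Little's law. In a
+closed loop with a fixed number of in-flight requests `N` and a mean response
+time `R`, throughput `X` is roughly fixed by the two. A sketch with `N = 10` (the
+concurrency used here) and an assumed round mean of `2 ms` (an illustration, not
+a measured value):
+
+```latex
+X = \frac{N}{\bar{R}}
+\qquad\Rightarrow\qquad
+X = \frac{10}{0.002\ \text{s}} = 5000\ \text{req/s}
+```
+
+At fixed concurrency, the completed-req/s figure is therefore determined by
+latency, which is why real capacity needs the separate saturation sweep.
+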
+## The comparison
+
+Representative latency is chat completions, non-streaming. All resource figures
+are measured under load on the same box.
+
+| Metric | GoModel | Bifrost | Portkey | LiteLLM |
+|---|--:|--:|--:|--:|
+| Runtime | Go | Go | Node.js | Python |
+| Latency overhead `p50` | **`1.8 ms`** | `2.5 ms` | `9.7 ms` | `30.6 ms` |
+| Latency `p99` | **`6.9 ms`** | `18.3 ms` | `30.5 ms` | `39.3 ms` |
+| Throughput (sustained) | **`4,900 req/s`** | `3,100 req/s` | `950 req/s` | `324 req/s` |
+| Peak RAM under load | **`37 MB`** | `143 MB` | `112 MB` | `2.3 GB` |
+| Efficiency (req/s per CPU %) | **`52`** | `25` | `8.2` | `2.6` |
+| Cold start to first request | **`0.56 s`** | `7.1 s` | `1.1 s` | `25.5 s` |
+| Docker image (compressed pull) | **`16 MB`** | `77 MB` | `59 MB` | `372 MB` |
+| Workload coverage | `6/6` | `6/6` | `4/6` | `6/6` |
+| Vendor-agnostic | Yes | Partial † | Yes | Yes |
+| Open-source | Yes ‡ | Partial ‡ | Partial ‡ | Yes |
+
+Same numbers, at a glance:
+
+![Latency tail p99: GoModel 6.9 ms, Bifrost 18.3 ms, Portkey 30.5 ms, LiteLLM 39.3 ms](./charts/june-2026-latency-p99.svg)
+
+![Sustained throughput: GoModel 4,900 req/s, Bifrost 3,100, Portkey 950, LiteLLM 324](./charts/june-2026-throughput.svg)
+
+![Peak memory under load: GoModel 37 MB, Bifrost 143 MB, Portkey 112 MB, LiteLLM 2.3 GB](./charts/june-2026-memory.svg)
+
+![Docker image, compressed: GoModel 16 MB, Bifrost 77 MB, Portkey 59 MB, LiteLLM 372 MB](./charts/june-2026-image.svg)
+
+A few honest notes, because I'd rather you trust the rest of the table:
+
+- **On a non-burstable host the medians are real, and GoModel leads on both ends.**
+  It posts the lowest `p50` (`1.8 ms`) *and* the tightest `p99` (`6.9 ms`).
+  Bifrost is a close second on the median (`2.5 ms`) - but its tail is ~`2.7x`
+  heavier (`18.3 ms`) and it carries ~`4x` the memory under load.
+- **GoModel cold-starts in `0.56 s` versus LiteLLM's ~`25 s`.** That is the
+  difference between viable on a serverless platform and not.
+- **Portkey** does not serve the Anthropic `/v1/messages` dialect in this
+  single-provider setup, hence `4/6` (it supports Anthropic with a fuller
+  virtual-key config; this is a setup limitation, not a hard capability gap).
+- **LiteLLM** ships a `372 MB` compressed image (`1.16 GB` on disk), and at its
+  recommended config (one worker per core) it uses **~`2.3 GB` of RAM** - two ~1 GB
+  worker processes - and takes ~`25 s` to cold-start. Running it *properly* for multi-core
+  throughput makes the footprint worse, not better. That is the cost of Python on the hot path.
+- **Bifrost is not a neutral project (†).** It is built by
+  [Maxim AI](https://www.getmaxim.ai/bifrost), an LLM evaluation & observability
+  platform, and ships a first-party plugin that forwards your gateway traffic to
+  Maxim's platform. It routes to many *model* providers, but the gateway itself is
+  a channel into one vendor's ecosystem - not the independent, vendor-neutral tool
+  the "1000+ models" headline implies.
+- **"Open-source" deserves an asterisk (‡).** Portkey keeps its observability
+  storage, dashboard, multi-team RBAC, and at-scale semantic caching in a closed
+  managed tier; Bifrost's core gateway is Apache-2.0 but its Enterprise edition
+  layers on closed/managed features. GoModel is open-source today, with some
+  enterprise-grade features planned to stay private. LiteLLM is the most open of
+  the four - its proxy core is MIT - but even it gates its enterprise features
+  (SSO, audit logs, fine-grained access control) behind a separate *proprietary*
+  commercial license that ships source-available in the `enterprise/` folder, not
+  as free OSS.
+
+## Summary
+
+GoModel comes out ahead on every metric measured here: the lowest median *and*
+the tightest latency tail, the highest sustained throughput, the best throughput
+per CPU (~`52` req/s per %), the smallest compressed image (≈`23x` smaller than
+LiteLLM), the lowest peak memory, and the fastest cold start - with full workload coverage.
+
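+For reference, the GoModel-versus-LiteLLM ratios can be recomputed from the
+comparison table. This is plain arithmetic on the published figures, treating
+`2.3 GB` as roughly `2300 MB`:
+
+```latex
+\begin{aligned}
+\text{image size:}  \quad & 372 / 16    \approx 23\times \\
+\text{peak RAM:}    \quad & 2300 / 37   \approx 62\times \\
+\text{cold start:}  \quad & 25.5 / 0.56 \approx 45.5\times \\
+\text{throughput:}  \quad & 4900 / 324  \approx 15\times \\
+\text{p99 latency:} \quad & 39.3 / 6.9  \approx 5.7\times
+\end{aligned}
+```
+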
+I've tried to be as objective as I can, and the whole thing is built to be
+**self-verifiable**:
+**[reproduce it yourself](https://github.com/ENTERPILOT/GoModel/tree/main/docs/2026-06-25_aws_gateway_benchmark)**.
+Clone the repo, point it at your AWS account, and run `./run.sh`. It builds the
+images, provisions the AWS instance, runs all four gateways against the same
+mock backend, prints the tables, and tears the infrastructure back down on its
+own.
+
+One caveat: it runs on **paid** AWS infrastructure, not the free tier. A
+`c7i.large` is about `$0.09`/hour and the run self-destructs within an hour or
+two, so budget **under `$1`** per run to be safe - and if you pass `KEEP=1` or a
+teardown ever fails, you keep paying until you destroy the box, so double-check
+it's gone.
+
+If you have objections to this benchmark, reach out on the GoModel Discord (link
+in the GoModel README on GitHub). And I'd genuinely like to see more impartial
+gateway comparisons out there - bring your own numbers.
diff --git a/docs/2026-06-25_aws_gateway_benchmark/ARTICLE2.md b/docs/2026-06-25_aws_gateway_benchmark/ARTICLE2.md
new file mode 100644
index 00000000..7ea41cdf
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/ARTICLE2.md
@@ -0,0 +1,351 @@
+---
+title: "Benchmarking AI Gateways: GoModel vs LiteLLM vs Portkey vs Bifrost"
+description: "A reproducible AI gateway benchmark comparing GoModel, LiteLLM, Portkey, and Bifrost on latency, throughput, memory, CPU, cold start, and image size."
+coverImage: "/blog/charts/gomodel-gateway-benchmark-june-2026-cover.png"
+coverImageWidth: 2400
+coverImageHeight: 1260
+pubDate: 2026-06-26
+author: "Jakub A. Wasek"
+keywords:
+  - AI gateway benchmark
+  - AI control plane
+  - OpenAI-compatible API
+  - LiteLLM alternative
+  - GoModel
+  - LiteLLM
+  - Portkey
+  - Bifrost
+tags:
+  - benchmarking
+  - ai-gateway
+  - ai-control-plane
+  - litellm
+  - portkey
+  - bifrost
+  - gomodel
+---
+
+![GoModel vs LiteLLM, Portkey and Bifrost benchmark - four gateways, one backend, GoModel wins](./cover.png)
+
+In October 2025 I tried to build my startup on top of LiteLLM.
+
+At first it looked like the obvious choice. It supported many providers, it had
+an OpenAI-compatible API, and it was already used by a lot of people. I did not
+want to write an AI gateway. I wanted to build the product behind it.
+
+Then I started running it on the hot path.
+
+My opinion changed there.
+
+A gateway is not a dashboard or integration glue you call once in a while. It
+sits on every request, every retry, every stream, every tool call, every
+fallback, every timeout.
+
+A heavy gateway charges rent forever.
+
+Most AI gateway comparisons miss that part. They talk about provider count,
+dashboards, tracing, and "support for 1000+ models". Those things matter, but
+they are not free. Before the gateway calls OpenAI, Anthropic, Gemini, vLLM, or
+anything else, it has already spent your CPU, memory, cold-start time, and
+operational budget.
+
+I am not comparing full product maturity here. I am comparing how these gateways
+behave on the hot path.
+
+So I started writing [GoModel](https://github.com/ENTERPILOT/GoModel): a small
+open-source AI gateway and AI control plane in Go, with an OpenAI-compatible API
+and explicit provider adapters.
+
+When I <a href="https://news.ycombinator.com/item?id=47861333" rel="nofollow noopener noreferrer">launched GoModel on Hacker News</a>,
+I promised a real, reproducible benchmark. This article is that follow-up.
+
+The benchmark question is simple:
+
+**How lean is each AI gateway when it sits on the request path?**
+
+That question runs through the whole benchmark: GoModel vs LiteLLM vs Portkey vs
+Bifrost, measured by latency, throughput, memory, CPU, cold start, and image
+size rather than landing pages or feature matrices.
+
+## The runtime footprint matters
+
+Latency gets the easiest arguments. It rarely tells the whole story.
+
+Most real LLM calls are dominated by inference time. If a model takes `2000 ms`
+to answer, the difference between `5 ms` and `15 ms` of proxy overhead is not
+the main story.
+
+The main story is the deployment envelope:
+
+- How much RAM does the gateway need under load?
+- How much CPU does it burn per request?
+- How many requests can it serve per core?
+- How fast does it cold-start?
+- How large is the Docker image?
+- Can you run it as a sidecar, on a small VM, in serverless, or near local
+  models?
+- Is the core gateway actually open-source?
+
+Those numbers decide whether the gateway can run where you want it to run.
+
+A `372 MB` compressed image (`1.2 GB` unpacked) that idles around gigabytes of
+RAM and takes `25 s` to cold-start is a different operational thing than a
+`16 MB` image that peaks at `37 MB` of RAM and is serving traffic `0.56 s` after
+launch.
+
+So I care about the runtime footprint.
+
+## What this benchmark does not prove
+
+This benchmark does **not** prove that one gateway is best for every company.
+
+I am not measuring:
+
+- bug counts or overall correctness
+- semantic cache quality
+- tracing UI quality
+- guardrail quality
+- admin dashboards
+- long-term provider maintenance
+- every possible provider-specific feature
+- total provider count
+
+Those things matter. Some of them matter a lot.
+
+LiteLLM in particular has more integrated providers and more gateway features
+than GoModel today. If your first requirement is maximum provider coverage right
+now, LiteLLM has a real advantage, and this benchmark does not erase that. Keep
+in mind, though, that many smaller or newer providers already expose an
+OpenAI-compatible API, so provider count is not always the same as practical
+routing coverage.
+
+The benchmark measures one narrower thing: **runtime and deployment overhead on
+the request path**.
+
+That still matters, because the gateway is on the hot path. If you run high
+request volume, local models, serverless workloads, edge workloads, or many small
+model calls, the overhead stops being theoretical.
+
+## AI gateway benchmark setup
+
+I tested four AI gateways people actually compare:
+
+- GoModel
+- LiteLLM
+- Portkey
+- Bifrost
+
+Every gateway talked to the **same instant mock backend**, on purpose. I did not
+want to benchmark OpenAI, Anthropic, AWS networking, or random internet jitter.
+I wanted to isolate the gateway itself.
+
+Each gateway ran one at a time, in Docker, on an **AWS `c7i.large`** with
+2 vCPU and 4 GiB RAM, running the latest **Amazon Linux 2023** AMI. The whole
+thing is Terraform'd, runs with one command, and tears itself down afterwards.
+
+I first ran this on a free-tier `t2.micro`. That was cheap and easy to
+reproduce, but unfair to the heavier gateways. A 1 GiB machine cannot hold a
+gateway that wants gigabytes of memory, so it starts swapping. At that point you
+are benchmarking the host being too small.
+
+So I moved to `c7i.large`: still small, but non-burstable and large enough that
+nothing swaps. It also makes the LiteLLM setup more honest. LiteLLM recommends
+one worker per vCPU, and this machine has 2 vCPUs, so LiteLLM gets 2
+workers. That gives it the multi-core access it is supposed to have instead of
+pinning it to a single worker on a tiny box.
+
+The test covered six workloads:
+
+- chat completions, non-streaming
+- chat completions, streaming
+- Responses API, non-streaming
+- Responses API, streaming
+- Anthropic messages, non-streaming
+- Anthropic messages, streaming
+
+Each workload used `8,000` requests at concurrency `10`, across **two trials
+with randomized gateway order**. Latency is the **median across trials**, and I
+report p99 with its min-max range so one noisy window cannot tell the whole
+story.
+
+I would not call this a statistically exhaustive study. It is a reproducible
+engineering benchmark, and the harness is public so people can rerun it, change
+the machine, or add their own workloads.
+
+A few details matter if you want to reproduce or criticize the numbers:
+
+- **Throughput is measured, not inferred.** The latency runs report
+  completed-req/s at fixed concurrency, but real capacity comes from a separate
+  concurrency sweep that drives each gateway to saturation.
+- **Every dialect is warmed up before measurement.** LiteLLM lazily imports some
+  per-dialect translation code on first use. A chat-only warmup made its
+  Responses and Messages paths look worse than they should. I warmed up all
+  dialects to avoid that.
+- **Retries are disabled for all gateways.** I also disabled GoModel's circuit
+  breaker for this benchmark. In production, rejecting traffic after upstream
+  trouble is the right behavior. In a saturation benchmark, it would make the
+  throughput number unfairly low.
+- **LiteLLM runs with its recommended worker count.** A LiteLLM worker is
+  effectively single-threaded, and its production guidance is one worker per
+  vCPU. On this box that means `2` workers.
+- **Streaming uses terminal-marker or idle-gap detection.** If a gateway streams
+  content but never sends a terminal event, the harness measures to last byte
+  instead of hanging forever.
+
+## GoModel vs LiteLLM vs Portkey vs Bifrost
+
+Representative latency is chat completions, non-streaming. All resource figures
+are measured under load on the same box.
+
+| Metric | GoModel | Bifrost | Portkey | LiteLLM |
+|---|--:|--:|--:|--:|
+| Runtime | Go | Go | Node.js | Python |
+| Latency overhead `p50` | **`1.8 ms`** | `2.5 ms` | `9.7 ms` | `30.6 ms` |
+| Latency `p99` | **`6.9 ms`** | `18.3 ms` | `30.5 ms` | `39.3 ms` |
+| Throughput (sustained) | **`4,900 req/s`** | `3,100 req/s` | `950 req/s` | `324 req/s` |
+| Peak RAM under load | **`37 MB`** | `143 MB` | `112 MB` | `2.3 GB` |
+| Efficiency (req/s per CPU %) | **`52`** | `25` | `8.2` | `2.6` |
+| Cold start to first request | **`0.56 s`** | `7.1 s` | `1.1 s` | `25.5 s` |
+| Docker image (compressed pull) | **`16 MB`** | `77 MB` | `59 MB` | `372 MB` |
+| Workload coverage | `6/6` | `6/6` | `4/6` | `6/6` |
+| Vendor-neutral core | Yes | Partial † | Yes | Yes |
+| Core source available | Yes ‡ | Partial ‡ | Partial ‡ | Yes |
+
+Same numbers, at a glance:
+
+![Latency tail p99: GoModel 6.9 ms, Bifrost 18.3 ms, Portkey 30.5 ms, LiteLLM 39.3 ms](./charts/june-2026-latency-p99.svg)
+
+![Sustained throughput: GoModel 4,900 req/s, Bifrost 3,100, Portkey 950, LiteLLM 324](./charts/june-2026-throughput.svg)
+
+![Peak memory under load: GoModel 37 MB, Bifrost 143 MB, Portkey 112 MB, LiteLLM 2.3 GB](./charts/june-2026-memory.svg)
+
+![Docker image, compressed: GoModel 16 MB, Bifrost 77 MB, Portkey 59 MB, LiteLLM 372 MB](./charts/june-2026-image.svg)
+
+## What stood out
+
+GoModel had the lowest median latency and the tightest tail: `1.8 ms` p50 and
+`6.9 ms` p99.
+
+Bifrost was close on median latency at `2.5 ms`, which is a good result. The
+gap opened at the tail and in memory: `18.3 ms` p99 and `143 MB` peak RAM under
+load.
+
+Portkey was heavier than I expected for this narrow proxy benchmark. It served
+`950 req/s` sustained and used `112 MB` peak RAM under load. In this setup it did
+not serve the Anthropic `/v1/messages` dialect, so it gets `4/6` workload
+coverage. Treat that as a setup limitation, not a claim that Portkey cannot
+support Anthropic in a fuller virtual-key configuration.
+
+LiteLLM was the outlier. At its recommended worker count, it used about
+`2.3 GB` of RAM, cold-started in `25.5 s`, and sustained `324 req/s`.
+
+That is not because Python is morally bad. The language matters only when it
+changes the deployment envelope. Here it does: memory floor, image size,
+cold-start time, dependency graph, and throughput per core.
+
+The later <a href="https://futuresearch.ai/blog/litellm-pypi-supply-chain-attack/" rel="nofollow noopener noreferrer">supply-chain incident around LiteLLM</a>
+also made me more confident in GoModel's design direction. A small Go binary
+with a standard-library-heavy dependency tree is structurally less exposed to
+that class of problem than a large Python dependency graph.
+
+## What AI gateway benchmarks do not capture
+
+Forwarding JSON is not the hard part.
+
+The hard part is provider drift.
+
+OpenAI, Anthropic, Gemini, AWS Bedrock, Azure OpenAI, Groq, xAI, Cerebras, vLLM,
+and local servers all disagree in small ways. Then they change those ways. Tool
+calling changes. Streaming changes. Reasoning parameters change. Image inputs
+change. Error formats change. Rate-limit semantics change.
+
+An AI gateway or AI control plane has to absorb that without becoming magic.
+
+GoModel's bet is not "support every model name on the internet".
+
+The bet is:
+
+- support the providers people actually deploy
+- keep provider adapters explicit
+- accept OpenAI-compatible requests generously
+- translate only what needs translation
+- pass through what should stay provider-specific
+- return conservative OpenAI-compatible responses
+
+For the same reason, GoModel starts as a small OpenAI-compatible gateway, not as
+a dashboard with a proxy attached.
+
+## Why this matters for local models and vLLM
+
+If all your traffic goes to a cloud model that takes several seconds to answer,
+gateway overhead can look academic.
+
+Local models change the math.
+
+If you are routing through an AI gateway to vLLM, Ollama, LM Studio, llama.cpp,
+or small specialized models on your own network, the model call can be much
+faster. Then gateway overhead, cold starts, memory, and sidecar size matter more.
+
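+As a rough illustration, take the measured p50 overheads from the table and
+compare a `2000 ms` cloud call with a hypothetical `50 ms` local call (the
+`50 ms` figure is an assumption, not something this benchmark measured). The
+gateway's share of total latency is overhead / (model time + overhead):
+
+```latex
+\begin{aligned}
+\text{GoModel, cloud:} \quad & \frac{1.8}{2000 + 1.8}   \approx 0.09\% \\
+\text{LiteLLM, cloud:} \quad & \frac{30.6}{2000 + 30.6} \approx 1.5\% \\
+\text{GoModel, local:} \quad & \frac{1.8}{50 + 1.8}     \approx 3.5\% \\
+\text{LiteLLM, local:} \quad & \frac{30.6}{50 + 30.6}   \approx 38\%
+\end{aligned}
+```
+
+Under that assumption, overhead that is a rounding error next to a cloud model
+becomes a large part of the request once the model itself answers quickly.
+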
+One reason I want GoModel to stay small: a gateway should be cheap enough to put
+near the workload.
+
+## Notes on neutrality and open source
+
+Bifrost is built by Maxim AI, an LLM
+evaluation and observability platform. It routes to many model providers, but
+the gateway also sits close to Maxim's eval and observability ecosystem. If you
+want to choose your own eval platform, or stay independent from any eval
+platform, ask whether Bifrost is the right match for you. Good software can
+still have incentives attached. "Vendor-neutral" needs an asterisk here.
+
+"Open-source" also needs care.
+
+Portkey keeps observability storage, dashboard, multi-team RBAC, and at-scale
+semantic caching in a closed managed tier. Bifrost's core gateway is Apache-2.0,
+but its Enterprise edition adds closed or managed features. LiteLLM's proxy core
+is MIT, but enterprise features like SSO, audit logs, and fine-grained access
+control sit behind a proprietary commercial license.
+
+GoModel is open-source today. Some enterprise-grade AI control plane features may
+stay private. The core gateway is intended to remain useful without those private
+features.
+
+## Reproduce it yourself
+
+The benchmark is built to be self-verifiable. It provisions the AWS instance,
+runs every gateway against the same backend, prints the tables, and destroys the
+infrastructure.
+
+**[Reproduce it yourself](https://github.com/ENTERPILOT/GoModel/tree/main/docs/2026-06-25_aws_gateway_benchmark)**:
+
+```bash
+./run.sh
+```
+
+One caveat: it runs on **paid** AWS infrastructure, not the free tier. A
+`c7i.large` is about `$0.09`/hour and the run self-destructs within an hour or
+two, so budget **under `$1`** per run to be safe.
+
+If you pass `KEEP=1` or teardown fails, you keep paying until you destroy the
+box, so double-check the teardown.
+
+## Conclusion
+
+I did not start GoModel because I wanted another AI gateway in the world.
+
+I started it because the gateway I wanted to use became part of the problem. It
+sat on the hot path, but did not feel like hot-path software: too heavy, too
+slow to start, too expensive to keep around, too large for the job.
+
+This benchmark is the result of turning that frustration into numbers.
+
+The numbers say GoModel is small in the places I care about: `16 MB` image,
+`37 MB` peak RAM, `0.56 s` cold start, `1.8 ms` p50, `6.9 ms` p99, and
+`4,900 req/s` sustained throughput on a small AWS box.
+
+LiteLLM still has more providers and more features today. Portkey and Bifrost
+have their own strengths. But if the gateway is going to sit between your users
+and every model call, I think it should first be cheap, predictable, and boring
+to run.
+
+GoModel is my attempt to build that kind of gateway.
diff --git a/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-image.svg b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-image.svg
new file mode 100644
index 00000000..51f6aa12
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-image.svg
@@ -0,0 +1,19 @@
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 660 242" width="660" height="242" role="img" aria-label="Docker image (compressed) (pull size, lower is better). GoModel 16 MB, Bifrost 77 MB, Portkey 59 MB, LiteLLM 372 MB." font-family="system-ui,-apple-system,Segoe UI,Roboto,sans-serif">
+  <rect x="0.5" y="0.5" width="659" height="241" rx="14" fill="#ffffff" stroke="#e2e8f0"/>
+  <text x="24" y="32" font-size="16" font-weight="700" fill="#0f172a">Docker image (compressed)</text>
+  <text x="636" y="32" text-anchor="end" font-size="12" fill="#64748b">pull size · lower is better</text>
+  <g>
+    <text x="80" y="71" text-anchor="end" dominant-baseline="middle" font-size="14" font-weight="700" fill="#0f172a">GoModel</text>
+    <rect x="92" y="56" width="21.1" height="30" rx="5" fill="#1e7ab5"/>
+    <text x="122.1" y="71" dominant-baseline="middle" font-size="13" font-weight="700" fill="#1e7ab5">16 MB</text>
+    <text x="80" y="117" text-anchor="end" dominant-baseline="middle" font-size="14" font-weight="500" fill="#64748b">Bifrost</text>
+    <rect x="92" y="102" width="101.4" height="30" rx="5" fill="#cbd5e1"/>
+    <text x="202.4" y="117" dominant-baseline="middle" font-size="13" font-weight="600" fill="#334155">77 MB</text>
+    <text x="80" y="163" text-anchor="end" dominant-baseline="middle" font-size="14" font-weight="500" fill="#64748b">Portkey</text>
+    <rect x="92" y="148" width="77.7" height="30" rx="5" fill="#cbd5e1"/>
+    <text x="178.7" y="163" dominant-baseline="middle" font-size="13" font-weight="600" fill="#334155">59 MB</text>
+    <text x="80" y="209" text-anchor="end" dominant-baseline="middle" font-size="14" font-weight="500" fill="#64748b">LiteLLM</text>
+    <rect x="92" y="194" width="490.0" height="30" rx="5" fill="#cbd5e1"/>
+    <text x="591.0" y="209" dominant-baseline="middle" font-size="13" font-weight="600" fill="#334155">372 MB</text>
+  </g>
+</svg>
diff --git a/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-latency-p99.svg b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-latency-p99.svg
new file mode 100644
index 00000000..cac41ab0
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-latency-p99.svg
@@ -0,0 +1,19 @@
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 660 242" width="660" height="242" role="img" aria-label="Latency tail (p99, chat) (ms, lower is better). GoModel 6.9 ms, Bifrost 18.3 ms, Portkey 30.5 ms, LiteLLM 39.3 ms." font-family="system-ui,-apple-system,Segoe UI,Roboto,sans-serif">
+  <rect x="0.5" y="0.5" width="659" height="241" rx="14" fill="#ffffff" stroke="#e2e8f0"/>
+  <text x="24" y="32" font-size="16" font-weight="700" fill="#0f172a">Latency tail (p99, chat)</text>
+  <text x="636" y="32" text-anchor="end" font-size="12" fill="#64748b">ms · lower is better</text>
+  <g>
+    <text x="80" y="71" text-anchor="end" dominant-baseline="middle" font-size="14" font-weight="700" fill="#0f172a">GoModel</text>
+    <rect x="92" y="56" width="86.0" height="30" rx="5" fill="#1e7ab5"/>
+    <text x="187.0" y="71" dominant-baseline="middle" font-size="13" font-weight="700" fill="#1e7ab5">6.9 ms</text>
+    <text x="80" y="117" text-anchor="end" dominant-baseline="middle" font-size="14" font-weight="500" fill="#64748b">Bifrost</text>
+    <rect x="92" y="102" width="228.2" height="30" rx="5" fill="#cbd5e1"/>
+    <text x="329.2" y="117" dominant-baseline="middle" font-size="13" font-weight="600" fill="#334155">18.3 ms</text>
+    <text x="80" y="163" text-anchor="end" dominant-baseline="middle" font-size="14" font-weight="500" fill="#64748b">Portkey</text>
+    <rect x="92" y="148" width="380.3" height="30" rx="5" fill="#cbd5e1"/>
+    <text x="481.3" y="163" dominant-baseline="middle" font-size="13" font-weight="600" fill="#334155">30.5 ms</text>
+    <text x="80" y="209" text-anchor="end" dominant-baseline="middle" font-size="14" font-weight="500" fill="#64748b">LiteLLM</text>
+    <rect x="92" y="194" width="490.0" height="30" rx="5" fill="#cbd5e1"/>
+    <text x="591.0" y="209" dominant-baseline="middle" font-size="13" font-weight="600" fill="#334155">39.3 ms</text>
+  </g>
+</svg>
diff --git a/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-memory.svg b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-memory.svg
new file mode 100644
index 00000000..f6dd3ce2
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-memory.svg
@@ -0,0 +1,19 @@
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 660 242" width="660" height="242" role="img" aria-label="Peak memory under load (RAM, lower is better). GoModel 37 MB, Bifrost 143 MB, Portkey 112 MB, LiteLLM 2.3 GB." font-family="system-ui,-apple-system,Segoe UI,Roboto,sans-serif">
+  <rect x="0.5" y="0.5" width="659" height="241" rx="14" fill="#ffffff" stroke="#e2e8f0"/>
+  <text x="24" y="32" font-size="16" font-weight="700" fill="#0f172a">Peak memory under load</text>
+  <text x="636" y="32" text-anchor="end" font-size="12" fill="#64748b">RAM · lower is better</text>
+  <g>
+    <text x="80" y="71" text-anchor="end" dominant-baseline="middle" font-size="14" font-weight="700" fill="#0f172a">GoModel</text>
+    <rect x="92" y="56" width="8.0" height="30" rx="5" fill="#1e7ab5"/>
+    <text x="109.0" y="71" dominant-baseline="middle" font-size="13" font-weight="700" fill="#1e7ab5">37 MB</text>
+    <text x="80" y="117" text-anchor="end" dominant-baseline="middle" font-size="14" font-weight="500" fill="#64748b">Bifrost</text>
+    <rect x="92" y="102" width="30.8" height="30" rx="5" fill="#cbd5e1"/>
+    <text x="131.8" y="117" dominant-baseline="middle" font-size="13" font-weight="600" fill="#334155">143 MB</text>
+    <text x="80" y="163" text-anchor="end" dominant-baseline="middle" font-size="14" font-weight="500" fill="#64748b">Portkey</text>
+    <rect x="92" y="148" width="24.2" height="30" rx="5" fill="#cbd5e1"/>
+    <text x="125.2" y="163" dominant-baseline="middle" font-size="13" font-weight="600" fill="#334155">112 MB</text>
+    <text x="80" y="209" text-anchor="end" dominant-baseline="middle" font-size="14" font-weight="500" fill="#64748b">LiteLLM</text>
+    <rect x="92" y="194" width="490.0" height="30" rx="5" fill="#cbd5e1"/>
+    <text x="591.0" y="209" dominant-baseline="middle" font-size="13" font-weight="600" fill="#334155">2.3 GB</text>
+  </g>
+</svg>
diff --git a/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-throughput.svg b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-throughput.svg
new file mode 100644
index 00000000..4ea70ef6
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-throughput.svg
@@ -0,0 +1,19 @@
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 660 242" width="660" height="242" role="img" aria-label="Sustained throughput (req/s, higher is better). GoModel 4,900, Bifrost 3,100, Portkey 950, LiteLLM 324." font-family="system-ui,-apple-system,Segoe UI,Roboto,sans-serif">
+  <rect x="0.5" y="0.5" width="659" height="241" rx="14" fill="#ffffff" stroke="#e2e8f0"/>
+  <text x="24" y="32" font-size="16" font-weight="700" fill="#0f172a">Sustained throughput</text>
+  <text x="636" y="32" text-anchor="end" font-size="12" fill="#64748b">req/s · higher is better</text>
+  <g>
+    <text x="80" y="71" text-anchor="end" dominant-baseline="middle" font-size="14" font-weight="700" fill="#0f172a">GoModel</text>
+    <rect x="92" y="56" width="490.0" height="30" rx="5" fill="#1e7ab5"/>
+    <text x="591.0" y="71" dominant-baseline="middle" font-size="13" font-weight="700" fill="#1e7ab5">4,900</text>
+    <text x="80" y="117" text-anchor="end" dominant-baseline="middle" font-size="14" font-weight="500" fill="#64748b">Bifrost</text>
+    <rect x="92" y="102" width="310.0" height="30" rx="5" fill="#cbd5e1"/>
+    <text x="411.0" y="117" dominant-baseline="middle" font-size="13" font-weight="600" fill="#334155">3,100</text>
+    <text x="80" y="163" text-anchor="end" dominant-baseline="middle" font-size="14" font-weight="500" fill="#64748b">Portkey</text>
+    <rect x="92" y="148" width="95.0" height="30" rx="5" fill="#cbd5e1"/>
+    <text x="196.0" y="163" dominant-baseline="middle" font-size="13" font-weight="600" fill="#334155">950</text>
+    <text x="80" y="209" text-anchor="end" dominant-baseline="middle" font-size="14" font-weight="500" fill="#64748b">LiteLLM</text>
+    <rect x="92" y="194" width="32.4" height="30" rx="5" fill="#cbd5e1"/>
+    <text x="133.4" y="209" dominant-baseline="middle" font-size="13" font-weight="600" fill="#334155">324</text>
+  </g>
+</svg>
diff --git a/docs/2026-06-25_aws_gateway_benchmark/cover-b.png b/docs/2026-06-25_aws_gateway_benchmark/cover-b.png
new file mode 100644
index 00000000..9b2e833c
Binary files /dev/null and b/docs/2026-06-25_aws_gateway_benchmark/cover-b.png differ
diff --git a/docs/2026-06-25_aws_gateway_benchmark/cover.png b/docs/2026-06-25_aws_gateway_benchmark/cover.png
new file mode 100644
index 00000000..0da1dbcf
Binary files /dev/null and b/docs/2026-06-25_aws_gateway_benchmark/cover.png differ
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/.gitignore b/docs/2026-06-25_aws_gateway_benchmark/qa/.gitignore
new file mode 100644
index 00000000..179b4868
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/.gitignore
@@ -0,0 +1,3 @@
+output/
+__pycache__/
+*.pyc
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/README.md b/docs/2026-06-25_aws_gateway_benchmark/qa/README.md
new file mode 100644
index 00000000..8f489897
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/README.md
@@ -0,0 +1,152 @@
+# GoModel quality (QA) suite
+
+A curated corpus of ~50 complex requests that exercises every client-facing
+dialect and modality of the gateway against **real providers**
+(OpenAI / Anthropic / Gemini), then **registers** and **rates** each one.
+
+It answers a different question than the latency benchmark next door
+(`docs/2026-06-25_aws_gateway_benchmark/`): not *how fast/cheap* the gateway is,
+but *does it correctly accept, translate, and normalize real-world requests* —
+the Postel's-law contract.
+
+For every case the suite records:
+
+- the **request as sent** (after model-role and variable resolution);
+- the **response** received (status, headers, body, or assembled SSE text);
+- **how the gateway recorded/normalized it** — pulled from the audit log:
+  the inbound request body it captured, the normalized response body it
+  returned, the resolved provider/model, and token usage;
+
+and rates it `PASS` / `FAIL` / `ERROR` / `SKIP`, plus a 0–100 **quality score**
+for soft modality checks (did the vision model name the colour, did STT recover
+the spoken words).
+
+## What it covers
+
+| Dialect / endpoint | Providers | Modalities exercised |
+|---|---|---|
+| `/v1/chat/completions` | OpenAI, Anthropic, Gemini | text, multi-turn, streaming, vision, tools, reasoning, structured output, field preservation |
+| `/v1/responses` | OpenAI, Anthropic, Gemini | text, multimodal input, streaming, tools, structured output, reasoning, conversation linkage |
+| `/v1/messages` (+ `/count_tokens`) | Anthropic | native shape, system prompt, streaming SSE, vision blocks, tool_use, extended thinking, default `max_tokens` injection |
+| `/v1/conversations` | OpenAI | create → get → use-in-Responses → update → delete (stateful) |
+| `/v1/audio/speech`, `/v1/audio/transcriptions` | OpenAI | TTS, and a TTS→STT round-trip that recovers the spoken words |
+| `/v1/embeddings` | OpenAI | single + batch |
+| error normalization | OpenAI, Anthropic | unknown model, unsupported `input_audio`, malformed JSON |
+
+## How "field preservation" is verified (and its honest limit)
+
+GoModel's audit log captures the **inbound** client request body and the
+**normalized** response body it returns — *not* the upstream provider-translated
+request. So the suite verifies translation two ways:
+
+1. **Behaviorally** — e.g. the reasoning case sends `max_tokens` to a model that
+   rejects it upstream; a `200` proves the gateway mapped it to
+   `max_completion_tokens` and dropped `temperature`. The audio-rejection case
+   proves an unsupported modality fails cleanly (4xx) rather than crashing.
+2. **From the audit record** — extra/unknown request fields (`x_qa_marker`,
+   `metadata`) are asserted present in the captured inbound body, and
+   provider-specific response extras (`system_fingerprint`, `service_tier`,
+   `stop_reason`, `usage`) are asserted preserved in the normalized response.
+
+Audit cross-checks are **soft** by default: if audit bodies are off or the entry
+hasn't flushed, those checks are skipped with a note, never a false failure.
+
+## Prerequisites
+
+Run the gateway with audit logging **and bodies** enabled so the preservation
+checks have data:
+
+```bash
+LOGGING_ENABLED=true \
+LOGGING_LOG_BODIES=true \
+LOGGING_LOG_HEADERS=true \
+LOGGING_LOG_AUDIO_BODIES=true \
+LOGGING_FLUSH_INTERVAL=2 \
+./gomodel        # or: go run ./cmd/gomodel
+```
+
+Provider keys come from the gateway's environment (`OPENAI_API_KEY`,
+`ANTHROPIC_API_KEY`, `GEMINI_API_KEY`). The harness authenticates to the gateway
+with `GOMODEL_API_KEY` or `GOMODEL_MASTER_KEY` (env), falling back to `GOMODEL_MASTER_KEY` in the repo `.env`.
+
+> This calls real providers and spends real money — modest (a few cents) for one
+> full run, since payloads are tiny and `max_tokens` is capped on every case.
+
+## Run it
+
+```bash
+cd docs/2026-06-25_aws_gateway_benchmark/qa
+python3 run_qa.py                      # full corpus against http://localhost:8080
+python3 run_qa.py --only chat          # filter by id/group/provider substring
+python3 run_qa.py --only openai
+python3 run_qa.py --no-audit           # skip audit cross-checks (faster, fewer assertions)
+python3 run_qa.py --list               # list matching cases, don't run
+python3 run_qa.py --gateway http://host:8080
+```
+
+Stdlib only — no `pip install`. Exit code is non-zero if any case failed or
+errored. Results land in `output/<run_id>/`:
+
+- `results.json` — full per-case record (request sent, response, audit view, every assertion)
+- `report.md` — readable table + a drill-down of failed/errored cases
+
+## Adapt to your account
+
+The spec never hardcodes a model id. Cases reference logical roles
+(`@openai.chat`, `@anthropic.thinking`, `@gemini.vision`); edit `models.json` to
+map them to models your keys can reach. A role with no mapping makes its cases
+`SKIP`, never fail. Image inputs (`@image.red` / `@imageb64.red`) are generated
+solid-colour PNGs — no binary assets in the repo.
+
+## Layout
+
+```
+run_qa.py        orchestrator + assertion evaluation + CLI
+models.json      logical model roles -> concrete model ids (edit this)
+spec/            declarative cases, one JSON file per endpoint group
+qalib/           stdlib helpers: config, paths, assertions, client, report
+output/          run artifacts (gitignored)
+```
+
+## Case schema (quick reference)
+
+```jsonc
+{
+  "id": "chat.openai.multiturn",          // unique
+  "title": "...", "provider": "openai",
+  "modality": ["text"],                    // labels for reporting
+  "request": {
+    "method": "POST",                      // default POST
+    "path": "/v1/chat/completions",        // may contain ${captured_var}
+    "headers": {"X-QA-Marker": "keep"},
+    "stream": false,
+    "body": { "model": "@openai.chat", "...": "..." },
+    "raw_body": "…",                       // send verbatim (malformed-JSON tests)
+    "produce": "tts_then_stt",             // composite: TTS then transcribe its output
+    "tts": {...}, "stt": {...}             // inputs for produce=tts_then_stt
+  },
+  "capture": { "conversation_id": "$.id" },// save response values for later ${vars}
+  "expect": {
+    "status": 200,                         // int or list
+    "headers":  [ {"name": "X-Request-Id", "present": true} ],
+    "body":     [ {"field": "content_type", "contains": "audio/"},
+                  {"field": "bytes", "gte": 2000},
+                  {"field": "text",  "not_empty": true} ],
+    "response": [ {"path": "$.choices[0].message.content", "not_empty": true} ],
+    "stream":   { "min_events": 2, "terminal": "[DONE]",
+                  "event_types": ["message_start"], "text": [{"not_empty": true}] },
+    "audit":    [ {"path": "$.provider", "equals": "openai"},
+                  {"path": "$.data.request_body.x_qa_marker", "equals": "keep"} ],
+    "quality":  [ {"target": "response:$.output[0].content[0].text",
+                   "contains_any": ["paris"]} ]   // soft; feeds the score
+  }
+}
+```
+
+**Operators** (one per assertion): `present` · `absent` · `equals` ·
+`not_equals` · `not_empty` · `contains` · `not_contains` · `contains_any` ·
+`contains_all` · `regex` · `gt` · `gte` · `lt` · `lte` · `type` · `length_gte` ·
+`one_of`. Add `"hard": false` to make a failure a soft signal instead of failing
+the case (audit and quality checks are soft by default).
+
+**Quality targets:** `stream` · `body.text` · `response:$.path` · `audit:$.path`.
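Review note on the hard/soft split described in the README: the rule that a failed hard assertion fails the case, while soft assertions only feed the 0-100 quality score, can be sketched as below. This is a minimal illustration, not the harness code. `rate_case` is a hypothetical name, and the scoring rule (share of soft checks that passed, 100 when there are none) is an assumption, since the actual rule lives in `run_qa.py`, which is not shown in this part of the patch.

```python
def rate_case(checks):
    """Rate one case from its evaluated assertions.

    `checks` is a list of (ok, hard) tuples, one per assertion.
    Returns (status, quality_score).
    """
    # Any failed hard assertion fails the whole case.
    hard_failed = any(not ok for ok, hard in checks if hard)

    # Soft assertions never fail the case; they only move the score.
    soft = [ok for ok, hard in checks if not hard]

    # Assumed scoring rule: percentage of soft checks that passed.
    score = round(100 * sum(soft) / len(soft)) if soft else 100

    return ("FAIL" if hard_failed else "PASS"), score


if __name__ == "__main__":
    # One hard check passed, one of two soft checks passed.
    print(rate_case([(True, True), (False, False), (True, False)]))
```

Under these assumptions a case with only soft failures still reports `PASS`, which matches the README's statement that audit and quality checks are soft by default.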
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/models.json b/docs/2026-06-25_aws_gateway_benchmark/qa/models.json
new file mode 100644
index 00000000..98dfdc3e
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/models.json
@@ -0,0 +1,20 @@
+{
+  "_comment": "Logical model roles used by the spec (@openai.chat, @anthropic.thinking, ...). Edit these to match the models your account/keys can reach. Image roles (@image.* and @imageb64.* for red/blue/green) are generated by the harness and need no entry.",
+  "openai": {
+    "chat": "gpt-4.1-mini",
+    "vision": "gpt-4.1-mini",
+    "reasoning": "gpt-5-mini",
+    "tts": "gpt-4o-mini-tts",
+    "stt": "gpt-4o-mini-transcribe",
+    "embed": "text-embedding-3-small"
+  },
+  "anthropic": {
+    "chat": "claude-sonnet-4-6",
+    "vision": "claude-sonnet-4-6",
+    "thinking": "claude-opus-4-8"
+  },
+  "gemini": {
+    "chat": "gemini-2.5-flash",
+    "vision": "gemini-2.5-flash"
+  }
+}
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/__init__.py b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/__init__.py
new file mode 100644
index 00000000..ebb6a9b8
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/__init__.py
@@ -0,0 +1,9 @@
+"""qalib — small helpers for the GoModel quality (QA) harness.
+
+Stdlib-only. Split into focused modules so each stays readable:
+  config      — gateway URL, master key, model/image role resolution, spec loading
+  paths       — JSON-path mini-language + deterministic image fixtures
+  assertions  — declarative assertion operators
+  client      — HTTP send (JSON / multipart / SSE) + audit-log lookup
+  report      — console table + results.json + report.md
+"""
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/assertions.py b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/assertions.py
new file mode 100644
index 00000000..93a8d78b
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/assertions.py
@@ -0,0 +1,93 @@
+"""Declarative assertion operators.
+
+Each assertion object names exactly one operator plus optional metadata:
+
+    {"path": "$.usage.total_tokens", "gt": 0}
+    {"path": "$.choices[0].message.content", "not_empty": true}
+    {"path": "$.system_fingerprint", "present": true, "hard": false}
+
+`hard` (default true) decides whether a failure fails the case or is recorded
+as a soft/quality signal. The caller locates the value (from a response body,
+header, stream, or audit entry) and passes (found, value) here.
+"""
+import re
+
+from .paths import json_type
+
+# Operators that are meaningful even when the value is absent.
+_ABSENCE_OPS = {"present", "absent"}
+
+
+def _as_number(v):
+    try:
+        return float(v)
+    except (TypeError, ValueError):
+        return None
+
+
+def apply_operator(assertion, found, value):
+    """Evaluate one assertion. Returns (ok: bool, reason: str)."""
+    for op in assertion:
+        if op in ("path", "field", "name", "hard", "note", "target"):
+            continue
+        expected = assertion[op]
+
+        if op == "present":
+            ok = found is expected if isinstance(expected, bool) else found
+            return ok, f"present={found}"
+        if op == "absent":
+            return ((not found) is expected if isinstance(expected, bool) else not found), f"present={found}"
+
+        # All remaining operators require the value to exist.
+        if not found and op not in _ABSENCE_OPS:
+            return False, "value not found"
+
+        if op == "equals":
+            return value == expected, f"{value!r} == {expected!r}"
+        if op == "not_equals":
+            return value != expected, f"{value!r} != {expected!r}"
+        if op == "not_empty":
+            empty = value is None or value == "" or value == [] or value == {}
+            return (not empty), f"non-empty (got {_short(value)})"
+        if op == "contains":
+            return str(expected).lower() in str(value).lower(), f"contains {expected!r}"
+        if op == "not_contains":
+            return str(expected).lower() not in str(value).lower(), f"not contains {expected!r}"
+        if op == "contains_any":
+            hay = str(value).lower()
+            hit = next((w for w in expected if str(w).lower() in hay), None)
+            return hit is not None, f"any{expected} -> {hit!r}"
+        if op == "contains_all":
+            hay = str(value).lower()
+            miss = [w for w in expected if str(w).lower() not in hay]
+            return not miss, f"all present (missing {miss})"
+        if op == "regex":
+            return re.search(expected, str(value)) is not None, f"~ /{expected}/"
+        if op in ("gt", "gte", "lt", "lte"):
+            n, e = _as_number(value), _as_number(expected)
+            if n is None or e is None:
+                return False, f"non-numeric {value!r}"
+            ok = {"gt": n > e, "gte": n >= e, "lt": n < e, "lte": n <= e}[op]
+            return ok, f"{n} {op} {e}"
+        if op == "type":
+            return json_type(value) == expected, f"type {json_type(value)} == {expected}"
+        if op == "length_gte":
+            try:
+                return len(value) >= expected, f"len {len(value)} >= {expected}"
+            except TypeError:
+                return False, f"no length: {value!r}"
+        if op == "one_of":
+            return value in expected, f"{value!r} in {expected}"
+
+        return False, f"unknown operator {op!r}"
+
+    return False, "empty assertion"
+
+
+def is_hard(assertion):
+    return assertion.get("hard", True)
+
+
+def _short(value, n=60):
+    s = repr(value)
+    return s if len(s) <= n else s[: n - 1] + "…"
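Review note on the `(found, value)` contract in `assertions.py`: the module docstring says the caller locates the value and passes both flags, which lets `present` succeed on a field that exists but is null while `not_empty` fails on it. The sketch below is a simplified stand-in covering three operators only. `lookup` and `check` are hypothetical names, not functions from the patch, and `lookup` handles top-level keys only, where the real harness uses `get_path`.

```python
def lookup(obj, key):
    """Return (found, value) for a top-level key."""
    return (key in obj), obj.get(key)


def check(assertion, found, value):
    """Evaluate a single-operator assertion (present / not_empty / equals)."""
    if "present" in assertion:
        # Meaningful even when the value is missing.
        return found is assertion["present"]
    if not found:
        # Every other operator needs the value to exist.
        return False
    if "not_empty" in assertion:
        return value not in (None, "", [], {})
    if "equals" in assertion:
        return value == assertion["equals"]
    raise ValueError("operator not covered by this sketch")


if __name__ == "__main__":
    body = {"system_fingerprint": None, "content": "hi"}
    # The field exists but is null: present passes, not_empty fails.
    print(check({"present": True}, *lookup(body, "system_fingerprint")))
    print(check({"not_empty": True}, *lookup(body, "system_fingerprint")))
```

Keeping `found` separate from `value` is what makes "missing" and "present but null" distinguishable; collapsing both to `None` would make the two assertions above indistinguishable.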
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/client.py b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/client.py
new file mode 100644
index 00000000..18830604
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/client.py
@@ -0,0 +1,206 @@
+"""HTTP client for the QA harness: JSON, multipart, and SSE, plus audit lookup.
+
+Stdlib only (urllib). Every gateway call carries a unique X-Request-Id and a
+run-scoped X-GoModel-User-Path so the matching audit entry can be found, which
+is how the harness inspects what the gateway *recorded* it received and
+returned (request/response bodies, provider, resolved model, usage).
+"""
+import json
+import time
+import urllib.error
+import urllib.request
+import uuid
+
+
+class Result:
+    """Captured outcome of one HTTP exchange."""
+
+    def __init__(self):
+        self.status = 0
+        self.headers = {}
+        self.request_id = ""
+        self.json = None          # parsed JSON body (if any)
+        self.text = None          # text body (non-JSON)
+        self.raw = b""            # raw body bytes (binary, e.g. TTS audio)
+        self.bytes = 0            # raw body length
+        self.content_type = ""
+        self.events = []          # parsed SSE event objects
+        self.stream_text = ""     # assembled assistant text from a stream
+        self.terminal = None      # terminal SSE marker seen ("[DONE]", "message_stop", ...)
+        self.error = None         # transport-level exception text
+
+
+class Client:
+    def __init__(self, base_url, api_key, user_path, timeout=120):
+        self.base = base_url.rstrip("/")
+        self.api_key = api_key
+        self.user_path = user_path
+        self.timeout = timeout
+
+    def _common_headers(self, request_id, extra):
+        h = {
+            "Authorization": f"Bearer {self.api_key}",
+            "X-Request-ID": request_id,
+            "X-GoModel-User-Path": self.user_path,
+        }
+        if extra:
+            h.update(extra)
+        return h
+
+    # ── JSON / raw request, optionally streaming ────────────────────────────
+    def send(self, method, path, body=None, headers=None, stream=False, raw_body=None):
+        rid = "qa-" + uuid.uuid4().hex[:24]
+        res = Result()
+        res.request_id = rid
+        url = self.base + path
+        hdrs = self._common_headers(rid, headers)
+
+        data = None
+        if raw_body is not None:
+            data = raw_body.encode("utf-8")
+            hdrs.setdefault("Content-Type", "application/json")
+        elif body is not None:
+            data = json.dumps(body).encode("utf-8")
+            hdrs["Content-Type"] = "application/json"
+
+        req = urllib.request.Request(url, data=data, method=method, headers=hdrs)
+        try:
+            resp = urllib.request.urlopen(req, timeout=self.timeout)
+            self._capture(res, resp, stream)
+        except urllib.error.HTTPError as e:
+            res.status = e.code
+            self._capture(res, e, stream=False)
+        except Exception as e:  # noqa: BLE001 — surface any transport failure as ERROR
+            res.error = f"{type(e).__name__}: {e}"
+        return res
+
+    # ── multipart/form-data (audio transcriptions) ──────────────────────────
+    def send_multipart(self, path, fields, file_field, filename, file_bytes,
+                       file_content_type, headers=None):
+        rid = "qa-" + uuid.uuid4().hex[:24]
+        res = Result()
+        res.request_id = rid
+        boundary = "----qa" + uuid.uuid4().hex
+        parts = []
+        for k, v in (fields or {}).items():
+            parts.append(f"--{boundary}\r\n".encode())
+            parts.append(f'Content-Disposition: form-data; name="{k}"\r\n\r\n'.encode())
+            parts.append(f"{v}\r\n".encode())
+        parts.append(f"--{boundary}\r\n".encode())
+        parts.append(
+            f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'.encode())
+        parts.append(f"Content-Type: {file_content_type}\r\n\r\n".encode())
+        parts.append(file_bytes)
+        parts.append(f"\r\n--{boundary}--\r\n".encode())
+        data = b"".join(parts)
+
+        hdrs = self._common_headers(rid, headers)
+        hdrs["Content-Type"] = f"multipart/form-data; boundary={boundary}"
+        req = urllib.request.Request(self.base + path, data=data, method="POST", headers=hdrs)
+        try:
+            resp = urllib.request.urlopen(req, timeout=self.timeout)
+            self._capture(res, resp, stream=False)
+        except urllib.error.HTTPError as e:
+            res.status = e.code
+            self._capture(res, e, stream=False)
+        except Exception as e:  # noqa: BLE001
+            res.error = f"{type(e).__name__}: {e}"
+        return res
+
+    # ── response capture ────────────────────────────────────────────────────
+    def _capture(self, res, resp, stream):
+        res.status = getattr(resp, "status", res.status) or res.status
+        try:
+            res.headers = {k.lower(): v for k, v in resp.headers.items()}
+        except Exception:  # noqa: BLE001
+            res.headers = {}
+        res.request_id = res.headers.get("x-request-id", res.request_id)
+        res.content_type = res.headers.get("content-type", "")
+
+        if stream and "text/event-stream" in res.content_type:
+            self._read_sse(res, resp)
+            return
+
+        raw = resp.read()
+        res.raw = raw
+        res.bytes = len(raw)
+        if "application/json" in res.content_type:
+            try:
+                res.json = json.loads(raw.decode("utf-8"))
+            except Exception:  # noqa: BLE001
+                res.text = raw.decode("utf-8", "replace")
+        elif res.content_type.startswith("text/"):
+            res.text = raw.decode("utf-8", "replace")
+        # binary (audio) bodies are not decoded: res.raw, res.bytes and content_type are kept.
+
+    def _read_sse(self, res, resp):
+        for rawline in resp:
+            line = rawline.decode("utf-8", "replace").rstrip("\n").rstrip("\r")
+            if not line or line.startswith(":"):
+                continue
+            if not line.startswith("data:"):
+                continue
+            payload = line[len("data:"):].strip()
+            if payload == "[DONE]":
+                res.terminal = "[DONE]"
+                continue
+            try:
+                ev = json.loads(payload)
+            except Exception:  # noqa: BLE001
+                continue
+            res.events.append(ev)
+            self._accumulate(res, ev)
+
+    @staticmethod
+    def _accumulate(res, ev):
+        """Assemble assistant text across the three streaming dialects and note
+        terminal markers."""
+        etype = ev.get("type")
+        if etype in ("response.completed", "message_stop", "response.output_text.done"):
+            res.terminal = etype
+        # chat.completions: choices[].delta.content
+        for ch in ev.get("choices", []) or []:
+            delta = ch.get("delta") or {}
+            if isinstance(delta.get("content"), str):
+                res.stream_text += delta["content"]
+        # responses: output_text deltas
+        if etype == "response.output_text.delta" and isinstance(ev.get("delta"), str):
+            res.stream_text += ev["delta"]
+        # anthropic messages: content_block_delta.text
+        if etype == "content_block_delta":
+            d = ev.get("delta") or {}
+            if isinstance(d.get("text"), str):
+                res.stream_text += d["text"]
+
+    # ── audit lookup ────────────────────────────────────────────────────────
+    def fetch_audit(self, request_id, attempts=6, delay=1.5):
+        """Find the audit entry for a request_id (retrying for flush lag) and
+        return the full detail entry, or None."""
+        for i in range(attempts):
+            entry_id = self._find_entry_id(request_id)
+            if entry_id:
+                detail = self._get_json(f"/admin/audit/detail?log_id={entry_id}")
+                if detail:
+                    return detail
+            if i < attempts - 1:
+                time.sleep(delay)
+        return None
+
+    def _find_entry_id(self, request_id):
+        listing = self._get_json(f"/admin/audit/log?search={request_id}&limit=20")
+        if not listing:
+            return None
+        for entry in listing.get("entries", []):
+            if entry.get("request_id") == request_id:
+                return entry.get("id")
+        return None
+
+    def _get_json(self, path):
+        req = urllib.request.Request(
+            self.base + path, method="GET",
+            headers={"Authorization": f"Bearer {self.api_key}"})
+        try:
+            resp = urllib.request.urlopen(req, timeout=self.timeout)
+            return json.loads(resp.read().decode("utf-8"))
+        except Exception:  # noqa: BLE001
+            return None
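Review note on the SSE handling in `client.py`: `_read_sse` and `_accumulate` together assemble assistant text across the three streaming dialects (chat completions, Responses, Anthropic messages). The standalone sketch below mirrors that logic on a list of byte lines so the behavior can be checked without a gateway. `assemble_stream_text` is a hypothetical name, and the sample events in the usage section are hand-written approximations of each dialect, not captured provider output.

```python
import json


def assemble_stream_text(lines):
    """Return (text, terminal) assembled from raw SSE byte lines."""
    text, terminal = "", None
    for raw in lines:
        line = raw.decode("utf-8", "replace").rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skips blank lines, ":" comments, and "event:" lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            terminal = "[DONE]"
            continue
        try:
            ev = json.loads(payload)
        except ValueError:
            continue  # ignore non-JSON data lines
        if not isinstance(ev, dict):
            continue
        etype = ev.get("type")
        if etype in ("response.completed", "message_stop"):
            terminal = etype
        # chat.completions: choices[].delta.content
        for choice in ev.get("choices") or []:
            content = (choice.get("delta") or {}).get("content")
            if isinstance(content, str):
                text += content
        # responses: output_text deltas carry the text in "delta"
        if etype == "response.output_text.delta" and isinstance(ev.get("delta"), str):
            text += ev["delta"]
        # anthropic messages: content_block_delta.delta.text
        if etype == "content_block_delta":
            piece = (ev.get("delta") or {}).get("text")
            if isinstance(piece, str):
                text += piece
    return text, terminal


if __name__ == "__main__":
    chat = [
        b'data: {"choices":[{"delta":{"content":"Hel"}}]}\n',
        b"\n",
        b'data: {"choices":[{"delta":{"content":"lo"}}]}\n',
        b"data: [DONE]\n",
    ]
    print(assemble_stream_text(chat))
```

Because only `data:` lines are inspected, the Anthropic `event:` lines are redundant for this purpose: the event type is read from the `type` field inside the JSON payload.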
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/config.py b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/config.py
new file mode 100644
index 00000000..a3072b4e
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/config.py
@@ -0,0 +1,114 @@
+"""Config loading: master key, model/image roles, spec files.
+
+The spec never hardcodes a concrete model id. Cases reference logical roles
+("@openai.chat", "@anthropic.thinking", "@image.red") that resolve through
+`models.json`, so a user adapts the whole corpus to their account by editing
+one file.
+"""
+import glob
+import json
+import os
+
+from .paths import png_base64, png_data_url
+
+_COLORS = {"red": (220, 30, 30), "blue": (30, 60, 220), "green": (30, 180, 70)}
+
+# @image.<name>   -> data: URL (chat/responses image_url form)
+# @imageb64.<name> -> raw base64 (native Anthropic image source.data)
+IMAGES = {name: png_data_url(rgb) for name, rgb in _COLORS.items()}
+IMAGES_B64 = {name: png_base64(rgb) for name, rgb in _COLORS.items()}
+
+
+def load_master_key(repo_root):
+    """Admin key: GOMODEL_API_KEY or GOMODEL_MASTER_KEY from the env, else GOMODEL_MASTER_KEY from the repo .env (never printed)."""
+    key = os.environ.get("GOMODEL_API_KEY") or os.environ.get("GOMODEL_MASTER_KEY")
+    if key:
+        return key.strip()
+    env_path = os.path.join(repo_root, ".env")
+    if os.path.exists(env_path):
+        with open(env_path, encoding="utf-8") as f:
+            for line in f:
+                line = line.strip()
+                if line.startswith("GOMODEL_MASTER_KEY="):
+                    return line.split("=", 1)[1].strip().strip('"').strip("'")
+    return ""
+
+
+def load_models(path):
+    with open(path, encoding="utf-8") as f:
+        return json.load(f)
+
+
+def load_specs(spec_dir, only=None):
+    """Load and concatenate every spec/*.json (sorted by filename, then array
+    order). `only` filters by substring against id / group / provider."""
+    cases = []
+    for path in sorted(glob.glob(os.path.join(spec_dir, "*.json"))):
+        with open(path, encoding="utf-8") as f:
+            data = json.load(f)
+        for case in data:
+            case.setdefault("group", os.path.splitext(os.path.basename(path))[0])
+            cases.append(case)
+    if only:
+        needle = only.lower()
+        cases = [c for c in cases
+                 if needle in c.get("id", "").lower()
+                 or needle in c.get("group", "").lower()
+                 or needle in c.get("provider", "").lower()]
+    return cases
+
+
+def resolve_roles(obj, models):
+    """Recursively replace @provider.role and @image.name tokens with concrete
+    values. Returns (resolved_obj, unresolved_roles)."""
+    unresolved = []
+
+    def walk(node):
+        if isinstance(node, str):
+            if node.startswith("@imageb64."):
+                name = node[len("@imageb64."):]
+                if name in IMAGES_B64:
+                    return IMAGES_B64[name]
+                unresolved.append(node)
+                return node
+            if node.startswith("@image."):
+                name = node[len("@image."):]
+                if name in IMAGES:
+                    return IMAGES[name]
+                unresolved.append(node)
+                return node
+            if node.startswith("@"):
+                parts = node[1:].split(".")
+                cur = models
+                for p in parts:
+                    if isinstance(cur, dict) and p in cur:
+                        cur = cur[p]
+                    else:
+                        unresolved.append(node)
+                        return node
+                return cur
+            return node
+        if isinstance(node, list):
+            return [walk(x) for x in node]
+        if isinstance(node, dict):
+            return {k: walk(v) for k, v in node.items()}
+        return node
+
+    return walk(obj), unresolved
+
+
+def interpolate_vars(obj, variables):
+    """Replace ${var} occurrences inside any string using captured runtime vars."""
+    def walk(node):
+        if isinstance(node, str):
+            out = node
+            for name, val in variables.items():
+                out = out.replace("${" + name + "}", str(val))
+            return out
+        if isinstance(node, list):
+            return [walk(x) for x in node]
+        if isinstance(node, dict):
+            return {k: walk(v) for k, v in node.items()}
+        return node
+
+    return walk(obj)
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/paths.py b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/paths.py
new file mode 100644
index 00000000..c82fbbda
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/paths.py
@@ -0,0 +1,90 @@
+"""JSON-path mini-language and deterministic image fixtures.
+
+The path language is intentionally tiny — enough to address normalized AI
+responses and audit entries without a dependency:
+
+    $                      the root object
+    $.a.b                  nested object keys
+    $.choices[0].message   array index
+    $.data.request_body.x  arbitrary nested key (audit bodies)
+
+`get_path` returns (found, value) so callers can distinguish "missing" from
+"present but null/empty".
+"""
+import base64
+import re
+import struct
+import zlib
+
+_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")
+
+
+def get_path(obj, path):
+    """Resolve a `$.a.b[0]` path. Returns (found: bool, value)."""
+    if path in ("$", "", None):
+        return True, obj
+    if path.startswith("$."):
+        path = path[2:]
+    elif path.startswith("$"):
+        path = path[1:]
+    cur = obj
+    for key, idx in _TOKEN.findall(path):
+        if idx != "":
+            if not isinstance(cur, list):
+                return False, None
+            i = int(idx)
+            if i >= len(cur):
+                return False, None
+            cur = cur[i]
+        else:
+            if not isinstance(cur, dict) or key not in cur:
+                return False, None
+            cur = cur[key]
+    return True, cur
+
+
+def json_type(value):
+    """JSON type name for a Python value (for the `type` assertion)."""
+    if value is None:
+        return "null"
+    if isinstance(value, bool):
+        return "boolean"
+    if isinstance(value, (int, float)):
+        return "number"
+    if isinstance(value, str):
+        return "string"
+    if isinstance(value, list):
+        return "array"
+    if isinstance(value, dict):
+        return "object"
+    return "unknown"
+
+
+# ── deterministic image fixtures ────────────────────────────────────────────
+# A solid-colour PNG is the simplest reproducible vision input: every provider
+# can name a colour, so `quality: contains_any [red]` is a stable smoke check
+# that needs no network fetch and no binary asset checked into the repo.
+
+def _solid_png(rgb, size=48):
+    raw = bytearray()
+    row = bytes(rgb) * size
+    for _ in range(size):
+        raw.append(0)            # PNG filter type 0 (none) per scanline
+        raw.extend(row)
+
+    def chunk(typ, data):
+        body = typ + data
+        return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body) & 0xFFFFFFFF)
+
+    sig = b"\x89PNG\r\n\x1a\n"
+    ihdr = struct.pack(">IIBBBBB", size, size, 8, 2, 0, 0, 0)  # 8-bit RGB
+    idat = zlib.compress(bytes(raw), 9)
+    return sig + chunk(b"IHDR", ihdr) + chunk(b"IDAT", idat) + chunk(b"IEND", b"")
+
+
+def png_base64(rgb, size=48):
+    return base64.b64encode(_solid_png(rgb, size)).decode("ascii")
+
+
+def png_data_url(rgb, size=48):
+    return "data:image/png;base64," + png_base64(rgb, size)
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/report.py b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/report.py
new file mode 100644
index 00000000..d551e84e
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/report.py
@@ -0,0 +1,117 @@
+"""Reporting: console table, results.json, and a Markdown report.
+
+The report "registers" each case — the request as sent, the response and how
+the gateway recorded/normalized it (from the audit entry), and every assertion
+with its observed value — and "rates" it PASS / FAIL / ERROR / SKIP plus a
+0–100 quality score for soft modality checks.
+"""
+import json
+import os
+
+STATUS_GLYPH = {"PASS": "PASS", "FAIL": "FAIL", "ERROR": "ERR ", "SKIP": "skip"}
+
+
+def quality_score(case_result):
+    soft = [c for c in case_result["checks"] if not c["hard"]]
+    if not soft:
+        return None
+    return round(100 * sum(1 for c in soft if c["ok"]) / len(soft))
+
+
+def print_console(results, meta):
+    print("\n" + "=" * 92)
+    print("GOMODEL QUALITY (QA) SUITE")
+    print("=" * 92)
+    print(f"gateway={meta['gateway']}  cases={len(results)}  "
+          f"audit_bodies={'on' if meta['audit_bodies'] else 'OFF'}")
+    print("-" * 92)
+    hdr = f"{'status':6} {'id':46} {'prov':9} {'http':>4} {'qual':>5}  detail"
+    print(hdr)
+    print("-" * 92)
+    for r in results:
+        q = quality_score(r)
+        qs = f"{q:>4}%" if q is not None else "   - "
+        detail = r["detail"]
+        if len(detail) > 24:
+            detail = detail[:23] + "…"
+        print(f"{STATUS_GLYPH.get(r['status'], r['status']):6} "
+              f"{r['id'][:46]:46} {(r.get('provider') or ''):9} "
+              f"{(r['http'] or ''):>4} {qs:>5}  {detail}")
+
+    counts = _counts(results)
+    print("-" * 92)
+    print(f"PASS {counts['PASS']}   FAIL {counts['FAIL']}   "
+          f"ERROR {counts['ERROR']}   SKIP {counts['SKIP']}   "
+          f"(total {len(results)})")
+    _print_breakdown("by endpoint", results, "group")
+    _print_breakdown("by provider", results, "provider")
+    print("=" * 92)
+
+
+def _counts(results):
+    c = {"PASS": 0, "FAIL": 0, "ERROR": 0, "SKIP": 0}
+    for r in results:
+        c[r["status"]] = c.get(r["status"], 0) + 1
+    return c
+
+
+def _print_breakdown(label, results, key):
+    groups = {}
+    for r in results:
+        g = r.get(key) or "?"
+        groups.setdefault(g, {"PASS": 0, "FAIL": 0, "ERROR": 0, "SKIP": 0})
+        groups[g][r["status"]] += 1
+    line = "  ".join(
+        f"{g}:{v['PASS']}/{v['PASS'] + v['FAIL'] + v['ERROR'] + v['SKIP']}"
+        for g, v in sorted(groups.items()))
+    print(f"{label:12}: {line}")
+
+
+def write_results(out_dir, results, meta):
+    os.makedirs(out_dir, exist_ok=True)
+    with open(os.path.join(out_dir, "results.json"), "w", encoding="utf-8") as f:
+        json.dump({"meta": meta, "counts": _counts(results), "cases": results},
+                  f, indent=2)
+    _write_markdown(out_dir, results, meta)
+    return out_dir
+
+
+def _write_markdown(out_dir, results, meta):
+    c = _counts(results)
+    L = ["# GoModel Quality (QA) Report\n",
+         f"`gateway={meta['gateway']}  cases={len(results)}  "
+         f"audit_bodies={'on' if meta['audit_bodies'] else 'off'}`\n",
+         f"**PASS {c['PASS']} · FAIL {c['FAIL']} · ERROR {c['ERROR']} · SKIP {c['SKIP']}**\n",
+         "| status | id | endpoint | provider | modality | http | quality | detail |",
+         "|---|---|---|---|---|--:|--:|---|"]
+    for r in results:
+        q = quality_score(r)
+        qs = f"{q}%" if q is not None else ""
+        mod = r.get("modality")
+        if isinstance(mod, str):
+            mod = [mod]
+        elif not isinstance(mod, list):
+            mod = []
+        modality = ",".join(str(m) for m in mod)
+        L.append(f"| {r['status']} | `{r['id']}` | {r.get('group') or ''} | "
+                 f"{r.get('provider') or ''} | {modality} | {r['http'] or ''} | {qs} | "
+                 f"{_md(r['detail'])} |")
+    L.append("")
+    L.append("## Failed & errored cases\n")
+    bad = [r for r in results if r["status"] in ("FAIL", "ERROR")]
+    if not bad:
+        L.append("_None._\n")
+    for r in bad:
+        L.append(f"### `{r['id']}` — {r['status']}\n")
+        L.append(f"- {_md(r.get('title',''))}")
+        L.append(f"- http `{r['http']}`  ·  {_md(r['detail'])}")
+        for chk in r["checks"]:
+            if not chk["ok"] and chk["hard"]:
+                L.append(f"  - FAIL `{chk['where']}` — {_md(chk['reason'])}")
+        L.append("")
+    with open(os.path.join(out_dir, "report.md"), "w", encoding="utf-8") as f:
+        f.write("\n".join(L))
+
+
+def _md(s):
+    return str(s).replace("|", "\\|").replace("\n", " ")
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/run_qa.py b/docs/2026-06-25_aws_gateway_benchmark/qa/run_qa.py
new file mode 100644
index 00000000..804973f9
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/run_qa.py
@@ -0,0 +1,347 @@
+#!/usr/bin/env python3
+"""GoModel quality (QA) harness — declarative spec runner.
+
+Sends a curated corpus of complex requests through a running GoModel gateway to
+real providers (OpenAI / Anthropic / Gemini) across every dialect and modality,
+then registers and rates each case:
+
+  - registers the request as sent, the response, and how the gateway *recorded*
+    and normalized it (pulled from the audit log: inbound body, normalized body,
+    provider, resolved model, usage);
+  - rates each case PASS / FAIL / ERROR / SKIP, plus a 0–100 quality score for
+    soft modality checks (did the vision model name the colour, did STT recover
+    the spoken words, …).
+
+Usage:
+  python run_qa.py                         # full corpus against localhost:8080
+  python run_qa.py --only chat             # filter by id/group/provider substring
+  python run_qa.py --only openai --no-audit
+  python run_qa.py --list                  # list cases without running
+  python run_qa.py --gateway http://host:8080 --models models.json
+
+Requires the gateway running with audit logging + bodies for the preservation
+checks:  LOGGING_ENABLED=true LOGGING_LOG_BODIES=true LOGGING_LOG_HEADERS=true
+LOGGING_LOG_AUDIO_BODIES=true  (see README).  Stdlib only.
+"""
+import argparse
+import os
+import sys
+import time
+import uuid
+
+HERE = os.path.dirname(os.path.abspath(__file__))
+sys.path.insert(0, HERE)
+
+from qalib import config, report          # noqa: E402
+from qalib.assertions import apply_operator, is_hard  # noqa: E402
+from qalib.client import Client           # noqa: E402
+from qalib.paths import get_path          # noqa: E402
+
+def _find_repo_root(start):
+    """Walk up to the repo root (the dir holding .git), for the .env lookup."""
+    d = start
+    while d != os.path.dirname(d):
+        if os.path.exists(os.path.join(d, ".git")):
+            return d
+        d = os.path.dirname(d)
+    return start
+
+
+REPO_ROOT = _find_repo_root(HERE)
+
+
+def locate(target, res, audit):
+    """Resolve a quality/assertion target selector to (found, value)."""
+    if target == "stream":
+        return bool(res.stream_text), res.stream_text
+    if target == "body.text":
+        return res.text is not None, res.text
+    if target.startswith("response:"):
+        return get_path(res.json, target[len("response:"):])
+    if target.startswith("audit:"):
+        if audit is None:
+            return False, None
+        return get_path(audit, target[len("audit:"):])
+    return False, None
+
+
+def evaluate(case, res, audit, audit_attempted, variables=None):
+    """Return (status, checks, detail). checks: [{where, ok, hard, reason}]."""
+    checks = []
+    expect = case.get("expect", {})
+    if variables:
+        # Resolve ${var} references (e.g. a captured ${conversation_id}) in
+        # assertion operands, the same way request paths/bodies are interpolated.
+        expect = config.interpolate_vars(expect, variables)
+
+    if res.error:
+        return "ERROR", checks, res.error
+
+    # ── status ──────────────────────────────────────────────────────────────
+    want = expect.get("status", 200)
+    want = want if isinstance(want, list) else [want]
+    checks.append({"where": "status", "ok": res.status in want, "hard": True,
+                   "reason": f"got {res.status}, want one of {want}"})
+
+    # ── headers ───────────────────────────────────────────────────────────────
+    for a in expect.get("headers", []):
+        name = a["name"].lower()
+        found = name in res.headers
+        ok, reason = apply_operator(a, found, res.headers.get(name))
+        checks.append({"where": f"header:{a['name']}", "ok": ok,
+                       "hard": is_hard(a), "reason": reason})
+
+    # ── body (synthetic fields for any body, incl. binary) ────────────────────
+    body_fields = {"content_type": res.content_type, "bytes": res.bytes,
+                   "text": res.text}
+    for a in expect.get("body", []):
+        field = a["field"]
+        val = body_fields.get(field)
+        ok, reason = apply_operator(a, val is not None, val)
+        checks.append({"where": f"body:{field}", "ok": ok,
+                       "hard": is_hard(a), "reason": reason})
+
+    # ── response JSON ─────────────────────────────────────────────────────────
+    for a in expect.get("response", []):
+        found, val = get_path(res.json, a["path"]) if res.json is not None else (False, None)
+        ok, reason = apply_operator(a, found, val)
+        checks.append({"where": f"response:{a['path']}", "ok": ok,
+                       "hard": is_hard(a), "reason": reason})
+
+    # ── streaming ─────────────────────────────────────────────────────────────
+    st = expect.get("stream")
+    if st:
+        if "min_events" in st:
+            n = len(res.events)
+            checks.append({"where": "stream:events", "ok": n >= st["min_events"],
+                           "hard": True, "reason": f"got {n} events, want >= {st['min_events']}"})
+        if "terminal" in st:
+            checks.append({"where": "stream:terminal", "ok": res.terminal == st["terminal"],
+                           "hard": True, "reason": f"got {res.terminal!r}, want {st['terminal']!r}"})
+        for et in st.get("event_types", []):
+            present = any(e.get("type") == et for e in res.events)
+            checks.append({"where": f"stream:type:{et}", "ok": present,
+                           "hard": True, "reason": f"event {et} present={present}"})
+        for a in st.get("text", []):
+            ok, reason = apply_operator(a, bool(res.stream_text), res.stream_text)
+            checks.append({"where": "stream:text", "ok": ok,
+                           "hard": is_hard(a), "reason": reason})
+
+    # ── audit (gateway's own record of what it received / returned) ───────────
+    for a in expect.get("audit", []):
+        path = a["path"]
+        if not audit_attempted:
+            continue
+        if audit is None:
+            checks.append({"where": f"audit:{path}", "ok": True, "hard": False,
+                           "reason": "audit entry not found (skipped)"})
+            continue
+        found, val = get_path(audit, path)
+        # If body capture is off, demote data.* checks to soft skips.
+        if not found and path.startswith("$.data."):
+            data = audit.get("data") or {}
+            if "request_body" not in data and "response_body" not in data:
+                checks.append({"where": f"audit:{path}", "ok": True, "hard": False,
+                               "reason": "audit bodies off (enable LOGGING_LOG_BODIES)"})
+                continue
+        ok, reason = apply_operator(a, found, val)
+        checks.append({"where": f"audit:{path}", "ok": ok,
+                       "hard": is_hard(a), "reason": reason})
+
+    # ── quality (always soft; feeds the score) ────────────────────────────────
+    for a in expect.get("quality", []):
+        found, val = locate(a.get("target", "stream"), res, audit)
+        a = dict(a)
+        a["hard"] = False
+        ok, reason = apply_operator(a, found, val)
+        checks.append({"where": f"quality:{a.get('target','stream')}", "ok": ok,
+                       "hard": False, "reason": reason})
+
+    hard_fail = [c for c in checks if c["hard"] and not c["ok"]]
+    status = "FAIL" if hard_fail else "PASS"
+    if hard_fail:
+        detail = f"{hard_fail[0]['where']}: {hard_fail[0]['reason']}"
+    else:
+        ok_n = sum(1 for c in checks if c["ok"])
+        detail = f"{ok_n}/{len(checks)} ok"
+    return status, checks, detail
+
+
+def run_case(case, client, models, variables, do_audit):
+    """Build, send, capture vars, fetch audit for one case. Returns (res, audit,
+    audit_attempted, skip_reason)."""
+    resolved, unresolved = config.resolve_roles(case.get("request", {}), models)
+    if unresolved:
+        return None, None, False, f"unresolved role(s): {', '.join(sorted(set(unresolved)))}"
+    req = config.interpolate_vars(resolved, variables)
+
+    produce = req.get("produce")
+    if produce == "tts_then_stt":
+        res = _produce_tts_then_stt(req, client)
+    else:
+        res = client.send(req.get("method", "POST"), req["path"], body=req.get("body"),
+                          headers=req.get("headers"), stream=req.get("stream", False),
+                          raw_body=req.get("raw_body"))
+
+    # capture runtime vars from the response body
+    for name, path in (case.get("capture") or {}).items():
+        if res.json is not None:
+            found, val = get_path(res.json, path)
+            if found:
+                variables[name] = val
+
+    audit_attempted = bool(do_audit and case.get("expect", {}).get("audit"))
+    audit = client.fetch_audit(res.request_id) if audit_attempted else None
+    return res, audit, audit_attempted, None
+
+
+def _produce_tts_then_stt(req, client):
+    tts = req["tts"]
+    fmt = tts.get("response_format", "mp3")
+    r1 = client.send("POST", "/v1/audio/speech", body=tts)
+    if r1.status != 200 or not r1.raw:
+        r1.error = f"tts produce failed (status {r1.status}, {r1.bytes} bytes)"
+        return r1
+    stt = req["stt"]
+    mime = r1.content_type or "audio/mpeg"
+    res = client.send_multipart("/v1/audio/transcriptions", stt, "file",
+                                f"qa.{fmt}", r1.raw, mime)
+    res.produced_from = {"tts_status": r1.status, "tts_bytes": r1.bytes,
+                         "tts_content_type": r1.content_type}
+    return res
+
+
+def _trim(obj, limit=4000):
+    """Trim long strings (base64 audio, etc.) so the artifact stays readable."""
+    if isinstance(obj, str):
+        return obj if len(obj) <= limit else obj[:limit] + f"…(+{len(obj) - limit} chars)"
+    if isinstance(obj, list):
+        return [_trim(x, limit) for x in obj]
+    if isinstance(obj, dict):
+        return {k: _trim(v, limit) for k, v in obj.items()}
+    return obj
+
+
+def artifact(case, res, audit):
+    """The registered record: what was sent, what came back, how the gateway
+    recorded/normalized it."""
+    if res is None:
+        return {"request": case.get("request"), "response": None, "audit": None}
+    resp = {"status": res.status, "content_type": res.content_type,
+            "bytes": res.bytes, "request_id": res.request_id}
+    if res.json is not None:
+        resp["json"] = _trim(res.json)
+    if res.text is not None:
+        resp["text"] = _trim(res.text)
+    if res.events:
+        resp["stream_events"] = len(res.events)
+        resp["stream_text"] = _trim(res.stream_text)
+        resp["terminal"] = res.terminal
+    if getattr(res, "produced_from", None):
+        resp["produced_from"] = res.produced_from
+    audit_view = None
+    if audit:
+        data = audit.get("data") or {}
+        audit_view = {
+            "provider": audit.get("provider"),
+            "resolved_model": audit.get("resolved_model"),
+            "requested_model": audit.get("requested_model"),
+            "status_code": audit.get("status_code"),
+            "duration_ns": audit.get("duration_ns"),
+            "usage": audit.get("usage"),
+            "request_body": _trim(data.get("request_body")),
+            "response_body": _trim(data.get("response_body")),
+        }
+    return {"request": _trim(case.get("request")), "response": resp, "audit": audit_view}
+
+
+def main():
+    ap = argparse.ArgumentParser(description="GoModel quality (QA) harness")
+    ap.add_argument("--gateway", default=os.environ.get("GATEWAY", "http://localhost:8080"))
+    ap.add_argument("--models", default=os.path.join(HERE, "models.json"))
+    ap.add_argument("--spec-dir", default=os.path.join(HERE, "spec"))
+    ap.add_argument("--out", default=os.path.join(HERE, "output"))
+    ap.add_argument("--only", default=None, help="filter by id/group/provider substring")
+    ap.add_argument("--no-audit", action="store_true", help="skip audit-log cross-checks")
+    ap.add_argument("--list", action="store_true", help="list matching cases and exit")
+    ap.add_argument("--timeout", type=int, default=120)
+    args = ap.parse_args()
+
+    models = config.load_models(args.models)
+    cases = config.load_specs(args.spec_dir, args.only)
+    if not cases:
+        print("no cases matched", file=sys.stderr)
+        return 2
+    if args.list:
+        for c in cases:
+            print(f"{c['id']:48} {c.get('group',''):14} {c.get('provider','')}")
+        print(f"\n{len(cases)} cases")
+        return 0
+
+    key = config.load_master_key(REPO_ROOT)
+    if not key:
+        print("no GOMODEL_MASTER_KEY found (env or repo .env)", file=sys.stderr)
+        return 2
+
+    run_id = uuid.uuid4().hex[:12]
+    user_path = f"/qa/{run_id}"
+    client = Client(args.gateway, key, user_path, timeout=args.timeout)
+
+    health = client.send("GET", "/health")
+    if health.error or health.status >= 500:
+        print(f"gateway not reachable at {args.gateway}: "
+              f"{health.error or health.status}", file=sys.stderr)
+        return 2
+
+    print(f"running {len(cases)} cases against {args.gateway}  (user_path {user_path})")
+    results = []
+    variables = {}
+    audit_bodies_seen = False
+    for case in cases:
+        t0 = time.perf_counter()  # monotonic clock: immune to wall-clock adjustments
+        try:
+            res, audit, attempted, skip = run_case(case, client, models, variables,
+                                                    do_audit=not args.no_audit)
+
+            if skip:
+                results.append(_record(case, "SKIP", [], skip, res, audit, time.perf_counter() - t0))
+                print(f"skip {case['id']}: {skip}")
+                continue
+
+            if audit and (audit.get("data") or {}).get("request_body") is not None:
+                audit_bodies_seen = True
+
+            status, checks, detail = evaluate(case, res, audit, attempted, variables)
+            rec = _record(case, status, checks, detail, res, audit, time.perf_counter() - t0)
+            results.append(rec)
+            print(f"{report.STATUS_GLYPH.get(status, status):4} {case['id']}: {detail}")
+        except Exception as e:  # noqa: BLE001 - never let one case abort the run
+            err = f"{type(e).__name__}: {e}"
+            results.append(_record(case, "ERROR", [], err, None, None, time.perf_counter() - t0))
+            print(f"ERR  {case['id']}: {err}")
+            continue
+
+    meta = {"gateway": args.gateway, "run_id": run_id, "user_path": user_path,
+            "audit_bodies": audit_bodies_seen, "models": models}
+    report.print_console(results, meta)
+    out_dir = os.path.join(args.out, run_id)
+    report.write_results(out_dir, results, meta)
+    print(f"\nwrote {os.path.join(out_dir, 'results.json')}\n"
+          f"wrote {os.path.join(out_dir, 'report.md')}")
+
+    failed = sum(1 for r in results if r["status"] in ("FAIL", "ERROR"))
+    return 1 if failed else 0
+
+
+def _record(case, status, checks, detail, res, audit, elapsed):
+    return {
+        "id": case["id"], "title": case.get("title", ""), "group": case.get("group"),
+        "provider": case.get("provider"), "modality": case.get("modality"),
+        "status": status, "http": (res.status if res else None),
+        "detail": detail, "elapsed_ms": round(elapsed * 1000),
+        "checks": checks, "artifact": artifact(case, res, audit),
+    }
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/audio.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/audio.json
new file mode 100644
index 00000000..a2a2fbfb
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/audio.json
@@ -0,0 +1,75 @@
+[
+  {
+    "id": "audio.openai.tts_mp3",
+    "title": "TTS: synthesize speech (mp3)",
+    "provider": "openai",
+    "modality": ["audio"],
+    "request": {
+      "path": "/v1/audio/speech",
+      "body": {"model": "@openai.tts", "voice": "alloy", "input": "The quick brown fox jumps over the lazy dog.", "response_format": "mp3"}
+    },
+    "expect": {
+      "status": 200,
+      "body": [
+        {"field": "content_type", "contains": "audio/"},
+        {"field": "bytes", "gte": 2000}
+      ]
+    },
+    "notes": "Text-to-speech returns binary audio with an audio/* content type."
+  },
+  {
+    "id": "audio.openai.tts_wav",
+    "title": "TTS: response_format wav changes content type",
+    "provider": "openai",
+    "modality": ["audio"],
+    "request": {
+      "path": "/v1/audio/speech",
+      "body": {"model": "@openai.tts", "voice": "alloy", "input": "Hello world.", "response_format": "wav"}
+    },
+    "expect": {
+      "status": 200,
+      "body": [
+        {"field": "content_type", "contains": "audio/wav"},
+        {"field": "bytes", "gte": 2000}
+      ]
+    },
+    "notes": "response_format must drive the response MIME type (audio/wav)."
+  },
+  {
+    "id": "audio.openai.tts_stt_json",
+    "title": "STT: round-trip TTS -> transcription (json) recovers the words",
+    "provider": "openai",
+    "modality": ["audio"],
+    "request": {
+      "produce": "tts_then_stt",
+      "tts": {"model": "@openai.tts", "voice": "alloy", "input": "Benchmark gateways measure latency and cost.", "response_format": "mp3"},
+      "stt": {"model": "@openai.stt", "response_format": "json"}
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.text", "not_empty": true}],
+      "quality": [{"target": "response:$.text", "contains_any": ["benchmark", "gateway", "latency", "cost"]}]
+    },
+    "notes": "Self-contained modality round-trip: synthesize known text, transcribe it, assert the words come back. No external audio fixture."
+  },
+  {
+    "id": "audio.openai.tts_stt_text",
+    "title": "STT: response_format text returns plain text",
+    "provider": "openai",
+    "modality": ["audio"],
+    "request": {
+      "produce": "tts_then_stt",
+      "tts": {"model": "@openai.tts", "voice": "alloy", "input": "Speech to text in plain format.", "response_format": "mp3"},
+      "stt": {"model": "@openai.stt", "response_format": "text"}
+    },
+    "expect": {
+      "status": 200,
+      "body": [
+        {"field": "content_type", "contains": "text/"},
+        {"field": "text", "not_empty": true}
+      ],
+      "quality": [{"target": "body.text", "contains_any": ["speech", "text", "plain", "format"]}]
+    },
+    "notes": "Transcription response_format=text returns text/plain, not JSON."
+  }
+]
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json
new file mode 100644
index 00000000..f2ea9eab
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json
@@ -0,0 +1,448 @@
+[
+  {
+    "id": "chat.openai.multiturn",
+    "title": "OpenAI chat: multi-turn system+user+assistant+user",
+    "provider": "openai",
+    "modality": ["text"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "body": {
+        "model": "@openai.chat",
+        "messages": [
+          {"role": "system", "content": "You are a terse assistant. Answer in one short sentence."},
+          {"role": "user", "content": "Name the largest planet in the solar system."},
+          {"role": "assistant", "content": "Jupiter."},
+          {"role": "user", "content": "And the smallest?"}
+        ],
+        "max_tokens": 64
+      }
+    },
+    "expect": {
+      "status": 200,
+      "headers": [{"name": "X-Request-Id", "present": true}],
+      "response": [
+        {"path": "$.object", "equals": "chat.completion"},
+        {"path": "$.choices[0].message.role", "equals": "assistant"},
+        {"path": "$.choices[0].message.content", "not_empty": true},
+        {"path": "$.usage.total_tokens", "gt": 0}
+      ],
+      "audit": [
+        {"path": "$.provider", "equals": "openai"},
+        {"path": "$.resolved_model", "not_empty": true}
+      ],
+      "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["mercury"]}]
+    },
+    "notes": "Baseline conversational correctness + OpenAI usage normalization + audit routing."
+  },
+  {
+    "id": "chat.openai.stream",
+    "title": "OpenAI chat: streaming deltas terminate with [DONE]",
+    "provider": "openai",
+    "modality": ["text", "streaming"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "stream": true,
+      "body": {
+        "model": "@openai.chat",
+        "messages": [{"role": "user", "content": "List three primary colors, comma separated."}],
+        "stream": true,
+        "max_tokens": 64
+      }
+    },
+    "expect": {
+      "status": 200,
+      "stream": {"min_events": 2, "terminal": "[DONE]", "text": [{"not_empty": true}]},
+      "quality": [{"target": "stream", "contains_any": ["red", "blue", "yellow"]}]
+    },
+    "notes": "SSE framing + terminal marker for chat dialect."
+  },
+  {
+    "id": "chat.openai.stream_usage",
+    "title": "OpenAI chat: stream_options include_usage emits a usage chunk",
+    "provider": "openai",
+    "modality": ["text", "streaming"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "stream": true,
+      "body": {
+        "model": "@openai.chat",
+        "messages": [{"role": "user", "content": "Say hi."}],
+        "stream": true,
+        "stream_options": {"include_usage": true},
+        "max_tokens": 32
+      }
+    },
+    "expect": {
+      "status": 200,
+      "stream": {"min_events": 2, "terminal": "[DONE]"},
+      "audit": [{"path": "$.usage.total_tokens", "gt": 0}],
+      "quality": [{"target": "stream", "not_empty": true}]
+    },
+    "notes": "stream_options must survive translation; the usage chunk is provider-shaped. For a stream the gateway can only derive usage from the streamed usage chunk, so a recorded usage.total_tokens>0 proves the chunk was emitted and not dropped."
+  },
+  {
+    "id": "chat.openai.vision_data_url",
+    "title": "OpenAI chat: vision via inline image_url (data URL)",
+    "provider": "openai",
+    "modality": ["vision"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "body": {
+        "model": "@openai.vision",
+        "messages": [{"role": "user", "content": [
+          {"type": "text", "text": "What is the single dominant color of this image? Answer with one word."},
+          {"type": "image_url", "image_url": {"url": "@image.red"}}
+        ]}],
+        "max_tokens": 32
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.choices[0].message.content", "not_empty": true}],
+      "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["red"]}]
+    },
+    "notes": "Multimodal content-part passthrough; deterministic solid-color fixture."
+  },
+  {
+    "id": "chat.openai.tools_call",
+    "title": "OpenAI chat: function/tool calling is emitted",
+    "provider": "openai",
+    "modality": ["tools"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "body": {
+        "model": "@openai.chat",
+        "messages": [{"role": "user", "content": "What is the weather in Paris? Use the tool."}],
+        "tools": [{"type": "function", "function": {
+          "name": "get_weather",
+          "description": "Get current weather for a city",
+          "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
+        }}],
+        "tool_choice": "required",
+        "max_tokens": 128
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [
+        {"path": "$.choices[0].message.tool_calls[0].function.name", "equals": "get_weather"},
+        {"path": "$.choices[0].finish_reason", "equals": "tool_calls"}
+      ]
+    },
+    "notes": "tool_choice=required must force a structured tool call."
+  },
+  {
+    "id": "chat.openai.tools_roundtrip",
+    "title": "OpenAI chat: tool result fed back yields a final answer",
+    "provider": "openai",
+    "modality": ["tools"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "body": {
+        "model": "@openai.chat",
+        "messages": [
+          {"role": "user", "content": "What is the weather in Paris?"},
+          {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_weather", "arguments": "{\"city\":\"Paris\"}"}}]},
+          {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 21, \"summary\": \"sunny\"}"}
+        ],
+        "max_tokens": 64
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.choices[0].message.content", "not_empty": true}],
+      "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["21", "sunny", "sun"]}]
+    },
+    "notes": "Assistant tool_calls + role:tool message round-trip translation."
+  },
+  {
+    "id": "chat.openai.structured_json_schema",
+    "title": "OpenAI chat: structured output via response_format json_schema",
+    "provider": "openai",
+    "modality": ["structured"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "body": {
+        "model": "@openai.chat",
+        "messages": [{"role": "user", "content": "Give the capital of France."}],
+        "response_format": {"type": "json_schema", "json_schema": {
+          "name": "capital",
+          "strict": true,
+          "schema": {"type": "object", "properties": {"country": {"type": "string"}, "capital": {"type": "string"}}, "required": ["country", "capital"], "additionalProperties": false}
+        }},
+        "max_tokens": 64
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.choices[0].message.content", "not_empty": true}],
+      "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["paris"]}]
+    },
+    "notes": "response_format must pass through and constrain output."
+  },
+  {
+    "id": "chat.openai.reasoning_max_tokens_mapping",
+    "title": "OpenAI reasoning: max_tokens accepted, temperature tolerated (Postel's law)",
+    "provider": "openai",
+    "modality": ["reasoning"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "body": {
+        "model": "@openai.reasoning",
+        "messages": [{"role": "user", "content": "What is 17 + 25? Reply with the number only."}],
+        "max_tokens": 2000,
+        "temperature": 0.5
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.choices[0].message.content", "not_empty": true}],
+      "audit": [
+        {"path": "$.provider", "equals": "openai"},
+        {"path": "$.data.request_body.max_tokens", "present": true, "hard": false}
+      ],
+      "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["42"]}]
+    },
+    "notes": "Reasoning models reject max_tokens/temperature upstream; a 200 proves the gateway mapped max_tokens->max_completion_tokens and dropped temperature. Audit shows the inbound body is preserved verbatim."
+  },
+  {
+    "id": "chat.openai.optional_field_preserved",
+    "title": "OpenAI chat: a valid optional field (user) is preserved end-to-end",
+    "provider": "openai",
+    "modality": ["text", "preservation"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "headers": {"X-QA-Marker": "keep-123"},
+      "body": {
+        "model": "@openai.chat",
+        "messages": [{"role": "user", "content": "Reply with the word OK."}],
+        "user": "qa-user-001",
+        "max_tokens": 16
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.choices[0].message.content", "not_empty": true}],
+      "audit": [
+        {"path": "$.data.request_body.user", "equals": "qa-user-001", "hard": false}
+      ]
+    },
+    "notes": "A provider-valid optional field round-trips: the request succeeds and the audit confirms the gateway recorded it verbatim. (Unrecognized fields are a separate case; see errors.openai_unknown_field_forwarded.)"
+  },
+  {
+    "id": "chat.openai.provider_extras_preserved",
+    "title": "OpenAI chat: provider-specific response extras survive normalization",
+    "provider": "openai",
+    "modality": ["preservation"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "body": {
+        "model": "@openai.chat",
+        "messages": [{"role": "user", "content": "Say hi."}],
+        "max_tokens": 16
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [
+        {"path": "$.id", "not_empty": true},
+        {"path": "$.created", "gt": 0},
+        {"path": "$.model", "not_empty": true},
+        {"path": "$.system_fingerprint", "present": true, "hard": false},
+        {"path": "$.service_tier", "present": true, "hard": false}
+      ]
+    },
+    "notes": "Normalization should preserve provider extras (system_fingerprint/service_tier) rather than strip to a minimal schema."
+  },
+  {
+    "id": "chat.anthropic.basic",
+    "title": "Anthropic via chat/completions: OpenAI-shaped response",
+    "provider": "anthropic",
+    "modality": ["text"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "body": {
+        "model": "@anthropic.chat",
+        "messages": [
+          {"role": "system", "content": "Be concise."},
+          {"role": "user", "content": "Name the capital of Japan in one word."}
+        ],
+        "max_tokens": 64
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [
+        {"path": "$.object", "equals": "chat.completion"},
+        {"path": "$.choices[0].message.content", "not_empty": true},
+        {"path": "$.usage.completion_tokens", "gt": 0}
+      ],
+      "audit": [{"path": "$.provider", "equals": "anthropic"}],
+      "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["tokyo"]}]
+    },
+    "notes": "Claude served through the OpenAI chat dialect; Anthropic input/output token usage normalized to prompt/completion."
+  },
+  {
+    "id": "chat.anthropic.stream",
+    "title": "Anthropic via chat/completions: streaming normalized to [DONE]",
+    "provider": "anthropic",
+    "modality": ["text", "streaming"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "stream": true,
+      "body": {
+        "model": "@anthropic.chat",
+        "messages": [{"role": "user", "content": "Count from 1 to 3."}],
+        "stream": true,
+        "max_tokens": 64
+      }
+    },
+    "expect": {
+      "status": 200,
+      "stream": {"min_events": 2, "terminal": "[DONE]", "text": [{"not_empty": true}]}
+    },
+    "notes": "Anthropic SSE converted into OpenAI chat-stream framing with a [DONE] terminator."
+  },
+  {
+    "id": "chat.anthropic.vision",
+    "title": "Anthropic via chat/completions: vision image_url",
+    "provider": "anthropic",
+    "modality": ["vision"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "body": {
+        "model": "@anthropic.vision",
+        "messages": [{"role": "user", "content": [
+          {"type": "text", "text": "One word: what is the dominant color?"},
+          {"type": "image_url", "image_url": {"url": "@image.blue"}}
+        ]}],
+        "max_tokens": 32
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.choices[0].message.content", "not_empty": true}],
+      "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["blue"]}]
+    },
+    "notes": "image_url data URL mapped to an Anthropic base64 image block."
+  },
+  {
+    "id": "chat.anthropic.params_fidelity",
+    "title": "Anthropic via chat/completions: sampling params + stop honored",
+    "provider": "anthropic",
+    "modality": ["text"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "body": {
+        "model": "@anthropic.chat",
+        "messages": [{"role": "user", "content": "Write the word DONE then stop."}],
+        "temperature": 0.2,
+        "top_p": 0.9,
+        "stop": ["\n\n"],
+        "max_tokens": 32
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.choices[0].message.content", "not_empty": true}]
+    },
+    "notes": "temperature/top_p/stop translated to Anthropic equivalents without error."
+  },
+  {
+    "id": "chat.gemini.basic",
+    "title": "Gemini via chat/completions: native API path",
+    "provider": "gemini",
+    "modality": ["text"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "body": {
+        "model": "@gemini.chat",
+        "messages": [
+          {"role": "system", "content": "Be concise."},
+          {"role": "user", "content": "Capital of Italy in one word?"}
+        ],
+        "max_tokens": 64
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [
+        {"path": "$.object", "equals": "chat.completion"},
+        {"path": "$.choices[0].message.content", "not_empty": true},
+        {"path": "$.usage.total_tokens", "gt": 0}
+      ],
+      "audit": [{"path": "$.provider", "equals": "gemini"}],
+      "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["rome"]}]
+    },
+    "notes": "Gemini native contents/parts mapping + usageMetadata normalization (USE_GOOGLE_GEMINI_NATIVE_API defaults to true)."
+  },
+  {
+    "id": "chat.gemini.stream",
+    "title": "Gemini via chat/completions: streaming",
+    "provider": "gemini",
+    "modality": ["text", "streaming"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "stream": true,
+      "body": {
+        "model": "@gemini.chat",
+        "messages": [{"role": "user", "content": "Name two oceans, comma separated."}],
+        "stream": true,
+        "max_tokens": 64
+      }
+    },
+    "expect": {
+      "status": 200,
+      "stream": {"min_events": 2, "terminal": "[DONE]", "text": [{"not_empty": true}]}
+    },
+    "notes": "Gemini native stream translated to OpenAI chat-stream framing."
+  },
+  {
+    "id": "chat.gemini.vision",
+    "title": "Gemini via chat/completions: vision inline image",
+    "provider": "gemini",
+    "modality": ["vision"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "body": {
+        "model": "@gemini.vision",
+        "messages": [{"role": "user", "content": [
+          {"type": "text", "text": "One word: dominant color?"},
+          {"type": "image_url", "image_url": {"url": "@image.green"}}
+        ]}],
+        "max_tokens": 32
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.choices[0].message.content", "not_empty": true}],
+      "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["green"]}]
+    },
+    "notes": "image_url data URL mapped to Gemini inline_data."
+  },
+  {
+    "id": "chat.gemini.tools",
+    "title": "Gemini via chat/completions: function calling",
+    "provider": "gemini",
+    "modality": ["tools"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "body": {
+        "model": "@gemini.chat",
+        "messages": [{"role": "user", "content": "Use the tool to get the weather in Berlin."}],
+        "tools": [{"type": "function", "function": {
+          "name": "get_weather",
+          "description": "Get weather for a city",
+          "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
+        }}],
+        "max_tokens": 128
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.choices[0].message.tool_calls[0].function.name", "equals": "get_weather", "hard": false}],
+      "quality": [{"target": "response:$.choices[0].message.tool_calls[0].function.name", "contains_any": ["weather"]}]
+    },
+    "notes": "Gemini functionCall mapped to OpenAI tool_calls (soft: model may answer directly)."
+  }
+]
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json
new file mode 100644
index 00000000..9c61475f
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json
@@ -0,0 +1,99 @@
+[
+  {
+    "id": "conversations.create",
+    "title": "Conversations: create",
+    "provider": "openai",
+    "modality": ["stateful"],
+    "request": {
+      "path": "/v1/conversations",
+      "body": {"metadata": {"qa": "conv-flow"}}
+    },
+    "capture": {"conversation_id": "$.id"},
+    "expect": {
+      "status": 200,
+      "response": [
+        {"path": "$.object", "equals": "conversation"},
+        {"path": "$.id", "not_empty": true}
+      ]
+    },
+    "notes": "Creates a conversation and captures its id for the rest of this flow (cases run in order)."
+  },
+  {
+    "id": "conversations.get",
+    "title": "Conversations: retrieve by id",
+    "provider": "openai",
+    "modality": ["stateful"],
+    "request": {
+      "method": "GET",
+      "path": "/v1/conversations/${conversation_id}"
+    },
+    "expect": {
+      "status": 200,
+      "response": [
+        {"path": "$.object", "equals": "conversation"},
+        {"path": "$.id", "equals": "${conversation_id}"}
+      ]
+    },
+    "notes": "Reads back the conversation created above; the returned id must equal the captured ${conversation_id}."
+  },
+  {
+    "id": "conversations.use_in_responses",
+    "title": "Conversations: link a Responses call to a conversation",
+    "provider": "openai",
+    "modality": ["stateful", "text"],
+    "request": {
+      "path": "/v1/responses",
+      "body": {
+        "model": "@openai.chat",
+        "conversation": {"id": "${conversation_id}"},
+        "input": "Remember the number 7.",
+        "max_output_tokens": 32
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [
+        {"path": "$.status", "equals": "completed"},
+        {"path": "$.conversation.id", "equals": "${conversation_id}"}
+      ]
+    },
+    "notes": "Responses request bound to a conversation id (stateful threading); the response must carry the same ${conversation_id} it was attached to."
+  },
+  {
+    "id": "conversations.update",
+    "title": "Conversations: update metadata",
+    "provider": "openai",
+    "modality": ["stateful"],
+    "request": {
+      "path": "/v1/conversations/${conversation_id}",
+      "body": {"metadata": {"qa": "conv-flow", "stage": "updated"}}
+    },
+    "expect": {
+      "status": 200,
+      "response": [
+        {"path": "$.object", "equals": "conversation"},
+        {"path": "$.id", "equals": "${conversation_id}"}
+      ]
+    },
+    "notes": "Metadata update on an existing conversation; the update must return the same ${conversation_id}."
+  },
+  {
+    "id": "conversations.delete",
+    "title": "Conversations: delete",
+    "provider": "openai",
+    "modality": ["stateful"],
+    "request": {
+      "method": "DELETE",
+      "path": "/v1/conversations/${conversation_id}"
+    },
+    "expect": {
+      "status": 200,
+      "response": [
+        {"path": "$.object", "equals": "conversation.deleted"},
+        {"path": "$.id", "equals": "${conversation_id}"},
+        {"path": "$.deleted", "equals": true}
+      ]
+    },
+    "notes": "Tears down the conversation created at the start of the flow; the deletion ack must reference the same ${conversation_id}."
+  }
+]
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/embeddings.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/embeddings.json
new file mode 100644
index 00000000..a03e7eea
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/embeddings.json
@@ -0,0 +1,43 @@
+[
+  {
+    "id": "embeddings.openai.single",
+    "title": "Embeddings: single string input",
+    "provider": "openai",
+    "modality": ["embeddings"],
+    "request": {
+      "path": "/v1/embeddings",
+      "body": {"model": "@openai.embed", "input": "GoModel is an AI gateway."}
+    },
+    "expect": {
+      "status": 200,
+      "response": [
+        {"path": "$.object", "equals": "list"},
+        {"path": "$.data[0].object", "equals": "embedding"},
+        {"path": "$.data[0].embedding", "length_gte": 256}
+      ],
+      "audit": [{"path": "$.provider", "equals": "openai"}]
+    },
+    "notes": "Embedding vector returned in OpenAI list shape; the check requires at least 256 dimensions."
+  },
+  {
+    "id": "embeddings.openai.batch",
+    "title": "Embeddings: batch input array",
+    "provider": "openai",
+    "modality": ["embeddings"],
+    "request": {
+      "path": "/v1/embeddings",
+      "body": {"model": "@openai.embed", "input": ["alpha", "beta", "gamma"]}
+    },
+    "expect": {
+      "status": 200,
+      "response": [
+        {"path": "$.data", "length_gte": 3},
+        {"path": "$.data[0].index", "equals": 0},
+        {"path": "$.data[1].index", "equals": 1},
+        {"path": "$.data[2].index", "equals": 2},
+        {"path": "$.data[2].embedding", "length_gte": 256}
+      ]
+    },
+    "notes": "Batched inputs produce one embedding per item, order preserved."
+  }
+]
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json
new file mode 100644
index 00000000..8f1279c6
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json
@@ -0,0 +1,82 @@
+[
+  {
+    "id": "errors.unknown_model",
+    "title": "Errors: unknown model returns a normalized OpenAI-style error",
+    "provider": "openai",
+    "modality": ["errors"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "body": {"model": "definitely-not-a-real-model-zzz", "messages": [{"role": "user", "content": "hi"}], "max_tokens": 8}
+    },
+    "expect": {
+      "status": [400, 404],
+      "response": [
+        {"path": "$.error.message", "not_empty": true},
+        {"path": "$.error.type", "present": true, "hard": false}
+      ]
+    },
+    "notes": "Routing failure surfaces as a clean error envelope, not a 5xx or hang."
+  },
+  {
+    "id": "errors.anthropic_audio_rejected",
+    "title": "Errors: unsupported input_audio on Anthropic chat is rejected gracefully",
+    "provider": "anthropic",
+    "modality": ["errors", "audio"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "body": {
+        "model": "@anthropic.chat",
+        "messages": [{"role": "user", "content": [
+          {"type": "text", "text": "Transcribe this."},
+          {"type": "input_audio", "input_audio": {"data": "AAAA", "format": "mp3"}}
+        ]}],
+        "max_tokens": 32
+      }
+    },
+    "expect": {
+      "status": [400, 415, 422],
+      "response": [{"path": "$.error.message", "not_empty": true}]
+    },
+    "notes": "Anthropic chat does not support input_audio; the gateway must reject with a 4xx invalid-request error rather than crash or forward garbage. A non-4xx here is itself the finding."
+  },
+  {
+    "id": "errors.openai_unknown_field_forwarded",
+    "title": "Behavior: unknown top-level fields are forwarded verbatim (provider rejects)",
+    "provider": "openai",
+    "modality": ["errors", "preservation"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "body": {
+        "model": "@openai.chat",
+        "messages": [{"role": "user", "content": "hi"}],
+        "x_qa_marker": "keep-123",
+        "max_tokens": 8
+      }
+    },
+    "expect": {
+      "status": [400],
+      "response": [
+        {"path": "$.error.message", "not_empty": true},
+        {"path": "$.error.message", "contains": "x_qa_marker", "hard": false},
+        {"path": "$.error.type", "equals": "invalid_request_error", "hard": false}
+      ],
+      "audit": [{"path": "$.data.request_body.x_qa_marker", "equals": "keep-123", "hard": false}]
+    },
+    "notes": "Documented finding (2026-06): GoModel does not strip unrecognized top-level fields; it forwards them, so a strict provider (OpenAI) returns 400 'Unrecognized request argument'. The audit confirms the field was captured inbound and passed through. If the gateway later sanitizes unknown fields, change expect.status to 200."
+  },
+  {
+    "id": "errors.malformed_json",
+    "title": "Errors: malformed JSON body returns 400",
+    "provider": "openai",
+    "modality": ["errors"],
+    "request": {
+      "path": "/v1/chat/completions",
+      "raw_body": "{\"model\": \"@openai.chat\", \"messages\": [ "
+    },
+    "expect": {
+      "status": [400],
+      "response": [{"path": "$.error.message", "not_empty": true}]
+    },
+    "notes": "Truncated JSON must yield a 400 with a clear error message (raw_body is sent verbatim)."
+  }
+]
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/messages.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/messages.json
new file mode 100644
index 00000000..16bac806
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/messages.json
@@ -0,0 +1,216 @@
+[
+  {
+    "id": "messages.anthropic.basic",
+    "title": "Messages: native Anthropic shape",
+    "provider": "anthropic",
+    "modality": ["text"],
+    "request": {
+      "path": "/v1/messages",
+      "body": {
+        "model": "@anthropic.chat",
+        "max_tokens": 64,
+        "messages": [{"role": "user", "content": "Capital of Canada? One word."}]
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [
+        {"path": "$.type", "equals": "message"},
+        {"path": "$.role", "equals": "assistant"},
+        {"path": "$.content[0].text", "not_empty": true},
+        {"path": "$.stop_reason", "present": true},
+        {"path": "$.usage.output_tokens", "gt": 0}
+      ],
+      "audit": [{"path": "$.provider", "equals": "anthropic"}],
+      "quality": [{"target": "response:$.content[0].text", "contains_any": ["ottawa"]}]
+    },
+    "notes": "Anthropic-native response: type=message, content blocks, input/output_tokens, stop_reason."
+  },
+  {
+    "id": "messages.anthropic.system",
+    "title": "Messages: top-level system prompt",
+    "provider": "anthropic",
+    "modality": ["text"],
+    "request": {
+      "path": "/v1/messages",
+      "body": {
+        "model": "@anthropic.chat",
+        "max_tokens": 64,
+        "system": "You always answer with a single word.",
+        "messages": [{"role": "user", "content": "Largest mammal?"}]
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.content[0].text", "not_empty": true}],
+      "quality": [{"target": "response:$.content[0].text", "contains_any": ["whale"]}]
+    },
+    "notes": "Anthropic system is a top-level field, not a message role."
+  },
+  {
+    "id": "messages.anthropic.stream",
+    "title": "Messages: streaming SSE ends with message_stop",
+    "provider": "anthropic",
+    "modality": ["text", "streaming"],
+    "request": {
+      "path": "/v1/messages",
+      "stream": true,
+      "body": {
+        "model": "@anthropic.chat",
+        "max_tokens": 64,
+        "stream": true,
+        "messages": [{"role": "user", "content": "Count from 1 to 3."}]
+      }
+    },
+    "expect": {
+      "status": 200,
+      "stream": {"min_events": 3, "terminal": "message_stop", "event_types": ["message_start", "content_block_delta"], "text": [{"not_empty": true}]}
+    },
+    "notes": "Native Anthropic event protocol relayed (message_start -> content_block_delta -> message_stop)."
+  },
+  {
+    "id": "messages.anthropic.vision",
+    "title": "Messages: image content block",
+    "provider": "anthropic",
+    "modality": ["vision"],
+    "request": {
+      "path": "/v1/messages",
+      "body": {
+        "model": "@anthropic.vision",
+        "max_tokens": 32,
+        "messages": [{"role": "user", "content": [
+          {"type": "text", "text": "One word: dominant color?"},
+          {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "@imageb64.red"}}
+        ]}]
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.content[0].text", "not_empty": true}],
+      "quality": [{"target": "response:$.content[0].text", "contains_any": ["red"]}]
+    },
+    "notes": "Native Anthropic base64 image source (raw base64, media_type separate)."
+  },
+  {
+    "id": "messages.anthropic.tools_auto",
+    "title": "Messages: tool definition, tool_choice auto",
+    "provider": "anthropic",
+    "modality": ["tools"],
+    "request": {
+      "path": "/v1/messages",
+      "body": {
+        "model": "@anthropic.chat",
+        "max_tokens": 256,
+        "tools": [{"name": "get_weather", "description": "weather for a city", "input_schema": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}],
+        "tool_choice": {"type": "auto"},
+        "messages": [{"role": "user", "content": "What's the weather in Paris? Use the tool."}]
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.stop_reason", "present": true}],
+      "quality": [{"target": "response:$.stop_reason", "contains_any": ["tool_use"]}]
+    },
+    "notes": "Native Anthropic tool schema (input_schema) + tool_choice object."
+  },
+  {
+    "id": "messages.anthropic.tools_required",
+    "title": "Messages: tool_choice any forces a tool call",
+    "provider": "anthropic",
+    "modality": ["tools"],
+    "request": {
+      "path": "/v1/messages",
+      "body": {
+        "model": "@anthropic.chat",
+        "max_tokens": 256,
+        "tools": [{"name": "get_time", "description": "current time in a city", "input_schema": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}],
+        "tool_choice": {"type": "any"},
+        "messages": [{"role": "user", "content": "What time is it in Tokyo?"}]
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.stop_reason", "equals": "tool_use", "hard": false}],
+      "quality": [{"target": "response:$.stop_reason", "contains_any": ["tool_use"]}]
+    },
+    "notes": "tool_choice=any should force a tool_use stop_reason."
+  },
+  {
+    "id": "messages.anthropic.thinking",
+    "title": "Messages: extended thinking enabled",
+    "provider": "anthropic",
+    "modality": ["reasoning"],
+    "request": {
+      "path": "/v1/messages",
+      "body": {
+        "model": "@anthropic.thinking",
+        "max_tokens": 4000,
+        "thinking": {"type": "enabled", "budget_tokens": 1024},
+        "messages": [{"role": "user", "content": "If a train travels 60 km in 45 minutes, what is its speed in km/h? Show the number."}]
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.content[0].type", "present": true}],
+      "quality": [{"target": "response:$.stop_reason", "contains_any": ["end_turn", "stop"]}]
+    },
+    "notes": "Extended-thinking request must be accepted; the gateway decides per model whether to use adaptive thinking or budget_tokens."
+  },
+  {
+    "id": "messages.anthropic.default_max_tokens",
+    "title": "Messages: missing max_tokens is injected by the gateway",
+    "provider": "anthropic",
+    "modality": ["text"],
+    "request": {
+      "path": "/v1/messages",
+      "body": {
+        "model": "@anthropic.chat",
+        "messages": [{"role": "user", "content": "Say hi in one word."}]
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.content[0].text", "not_empty": true}]
+    },
+    "notes": "Anthropic requires max_tokens; the gateway injects a default so a user request without it still succeeds (good defaults)."
+  },
+  {
+    "id": "messages.anthropic.count_tokens",
+    "title": "Messages: count_tokens",
+    "provider": "anthropic",
+    "modality": ["text"],
+    "request": {
+      "path": "/v1/messages/count_tokens",
+      "body": {
+        "model": "@anthropic.chat",
+        "messages": [{"role": "user", "content": "How many tokens is this sentence, roughly?"}]
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.input_tokens", "gt": 0}]
+    },
+    "notes": "Token-counting endpoint returns input_tokens without a provider completion call."
+  },
+  {
+    "id": "messages.anthropic.metadata_preserved",
+    "title": "Messages: metadata.user_id (valid Anthropic field) is preserved",
+    "provider": "anthropic",
+    "modality": ["preservation"],
+    "request": {
+      "path": "/v1/messages",
+      "body": {
+        "model": "@anthropic.chat",
+        "max_tokens": 16,
+        "metadata": {"user_id": "qa-789"},
+        "messages": [{"role": "user", "content": "Say OK."}]
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.content[0].text", "not_empty": true}],
+      "audit": [{"path": "$.data.request_body.metadata.user_id", "equals": "qa-789", "hard": false}]
+    },
+    "notes": "metadata is a first-class Anthropic field; audit confirms the gateway recorded it as sent."
+  }
+]
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/responses.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/responses.json
new file mode 100644
index 00000000..d1ab7f1a
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/responses.json
@@ -0,0 +1,198 @@
+[
+  {
+    "id": "responses.openai.basic_string",
+    "title": "Responses: plain string input",
+    "provider": "openai",
+    "modality": ["text"],
+    "request": {
+      "path": "/v1/responses",
+      "body": {"model": "@openai.chat", "input": "What is the capital of France? One word.", "max_output_tokens": 64}
+    },
+    "expect": {
+      "status": 200,
+      "response": [
+        {"path": "$.object", "equals": "response"},
+        {"path": "$.status", "equals": "completed"},
+        {"path": "$.output[0].content[0].text", "not_empty": true}
+      ],
+      "audit": [{"path": "$.provider", "equals": "openai"}],
+      "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["paris"]}]
+    },
+    "notes": "Native Responses shape: output[].content[].text, status=completed."
+  },
+  {
+    "id": "responses.openai.instructions",
+    "title": "Responses: instructions + string input",
+    "provider": "openai",
+    "modality": ["text"],
+    "request": {
+      "path": "/v1/responses",
+      "body": {"model": "@openai.chat", "instructions": "Answer in exactly one word.", "input": "Largest ocean?", "max_output_tokens": 64}
+    },
+    "expect": {
+      "status": 200,
+      "response": [
+        {"path": "$.status", "equals": "completed"},
+        {"path": "$.output[0].content[0].text", "not_empty": true}
+      ],
+      "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["pacific"]}]
+    },
+    "notes": "instructions become a system-equivalent prompt."
+  },
+  {
+    "id": "responses.openai.multimodal_image",
+    "title": "Responses: multi-part input_text + input_image",
+    "provider": "openai",
+    "modality": ["vision"],
+    "request": {
+      "path": "/v1/responses",
+      "body": {
+        "model": "@openai.vision",
+        "input": [{"role": "user", "content": [
+          {"type": "input_text", "text": "One word: dominant color?"},
+          {"type": "input_image", "image_url": "@image.red"}
+        ]}],
+        "max_output_tokens": 32
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.output[0].content[0].text", "not_empty": true}],
+      "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["red"]}]
+    },
+    "notes": "input_image content part normalized to a chat image part."
+  },
+  {
+    "id": "responses.openai.stream",
+    "title": "Responses: streaming output_text deltas + response.completed",
+    "provider": "openai",
+    "modality": ["text", "streaming"],
+    "request": {
+      "path": "/v1/responses",
+      "stream": true,
+      "body": {"model": "@openai.chat", "input": "Count from 1 to 3.", "stream": true, "max_output_tokens": 64}
+    },
+    "expect": {
+      "status": 200,
+      "stream": {"min_events": 2, "terminal": "response.completed", "text": [{"not_empty": true}]}
+    },
+    "notes": "Responses SSE event protocol (output_text.delta -> response.completed)."
+  },
+  {
+    "id": "responses.openai.tools",
+    "title": "Responses: function tool",
+    "provider": "openai",
+    "modality": ["tools"],
+    "request": {
+      "path": "/v1/responses",
+      "body": {
+        "model": "@openai.chat",
+        "input": "Use the tool to get the weather in Rome.",
+        "tools": [{"type": "function", "name": "get_weather", "description": "weather for a city", "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}],
+        "max_output_tokens": 128
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.status", "equals": "completed"}],
+      "quality": [{"target": "response:$.output[0].type", "contains_any": ["function_call", "message"]}]
+    },
+    "notes": "Responses tool schema (flat name/parameters) handled."
+  },
+  {
+    "id": "responses.openai.structured_text_format",
+    "title": "Responses: structured output via text.format json_schema",
+    "provider": "openai",
+    "modality": ["structured"],
+    "request": {
+      "path": "/v1/responses",
+      "body": {
+        "model": "@openai.chat",
+        "input": "Capital of Spain.",
+        "text": {"format": {"type": "json_schema", "name": "cap", "strict": true, "schema": {"type": "object", "properties": {"capital": {"type": "string"}}, "required": ["capital"], "additionalProperties": false}}},
+        "max_output_tokens": 64
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.output[0].content[0].text", "not_empty": true}],
+      "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["madrid"]}]
+    },
+    "notes": "text.format maps to chat response_format/json_schema for non-native providers."
+  },
+  {
+    "id": "responses.openai.reasoning_effort",
+    "title": "Responses: reasoning model with effort",
+    "provider": "openai",
+    "modality": ["reasoning"],
+    "request": {
+      "path": "/v1/responses",
+      "body": {"model": "@openai.reasoning", "input": "What is 6 times 7? Number only.", "reasoning": {"effort": "low"}, "max_output_tokens": 2000}
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.status", "equals": "completed"}],
+      "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["42"]}]
+    },
+    "notes": "reasoning.effort accepted on the Responses API."
+  },
+  {
+    "id": "responses.openai.metadata_preserved",
+    "title": "Responses: metadata (valid optional field) is preserved",
+    "provider": "openai",
+    "modality": ["preservation"],
+    "request": {
+      "path": "/v1/responses",
+      "body": {"model": "@openai.chat", "input": "Say OK.", "metadata": {"qa_case": "resp-extra"}, "max_output_tokens": 16}
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.status", "equals": "completed"}],
+      "audit": [{"path": "$.data.request_body.metadata.qa_case", "equals": "resp-extra", "hard": false}]
+    },
+    "notes": "metadata is a first-class Responses field; audit confirms the gateway recorded it as sent."
+  },
+  {
+    "id": "responses.anthropic.basic",
+    "title": "Responses adapter -> Anthropic",
+    "provider": "anthropic",
+    "modality": ["text"],
+    "request": {
+      "path": "/v1/responses",
+      "body": {"model": "@anthropic.chat", "input": "Capital of Germany? One word.", "max_output_tokens": 64}
+    },
+    "expect": {
+      "status": 200,
+      "response": [
+        {"path": "$.status", "equals": "completed"},
+        {"path": "$.output[0].content[0].text", "not_empty": true}
+      ],
+      "audit": [{"path": "$.provider", "equals": "anthropic"}],
+      "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["berlin"]}]
+    },
+    "notes": "Non-native provider served through the Responses->chat adapter and renormalized to Responses shape."
+  },
+  {
+    "id": "responses.gemini.image",
+    "title": "Responses adapter -> Gemini with image input",
+    "provider": "gemini",
+    "modality": ["vision"],
+    "request": {
+      "path": "/v1/responses",
+      "body": {
+        "model": "@gemini.vision",
+        "input": [{"role": "user", "content": [
+          {"type": "input_text", "text": "One word: color?"},
+          {"type": "input_image", "image_url": "@image.blue"}
+        ]}],
+        "max_output_tokens": 32
+      }
+    },
+    "expect": {
+      "status": 200,
+      "response": [{"path": "$.output[0].content[0].text", "not_empty": true}],
+      "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["blue"]}]
+    },
+    "notes": "Responses multimodal input adapted to Gemini inline_data."
+  }
+]
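The `expect` blocks in the cases above pair a JSONPath-like `path` with a matcher (`equals`, `not_empty`, `contains_any`). As a rough illustration of how such assertions could be evaluated, here is a minimal Python sketch. The `resolve` and `check` helpers are hypothetical and are not taken from the harness. The sketch assumes paths use only `.key` and `[index]` steps, handles `path` entries only (not the `response:`-prefixed `target` form), and assumes `contains_any` matches case-insensitively, since the expected values above are lowercase.

```python
import re

# Hypothetical helpers for illustration; the real harness's evaluator is not shown here.
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def resolve(doc, path):
    """Walk a '$.a[0].b' style path; return None if any step is missing."""
    if not path.startswith("$"):
        raise ValueError(f"unsupported path: {path}")
    cur = doc
    pos = 1
    while pos < len(path):
        m = _STEP.match(path, pos)
        if m is None:
            raise ValueError(f"unsupported path: {path}")
        key, idx = m.group(1), m.group(2)
        try:
            cur = cur[key] if key is not None else cur[int(idx)]
        except (KeyError, IndexError, TypeError):
            return None
        pos = m.end()
    return cur


def check(doc, assertion):
    """Evaluate one {'path': ..., <matcher>: ...} assertion against a response body."""
    value = resolve(doc, assertion["path"])
    if "equals" in assertion:
        return value == assertion["equals"]
    if assertion.get("not_empty"):
        return value not in (None, "", [], {})
    if "contains_any" in assertion:
        # Assumption: matching is case-insensitive.
        text = str(value).lower() if value is not None else ""
        return any(needle.lower() in text for needle in assertion["contains_any"])
    raise ValueError(f"no known matcher in {assertion}")


# A hand-written body in the Responses shape described in the case notes.
body = {"status": "completed",
        "output": [{"type": "message",
                    "content": [{"type": "output_text", "text": "Pacific"}]}]}

results = [
    check(body, {"path": "$.status", "equals": "completed"}),
    check(body, {"path": "$.output[0].content[0].text", "not_empty": True}),
    check(body, {"path": "$.output[0].content[0].text", "contains_any": ["pacific"]}),
    check(body, {"path": "$.output[1].content[0].text", "not_empty": True}),
]
print(results)
```

The last assertion points at an output item that does not exist, so it resolves to `None` and fails instead of raising. That behaviour suits reports where a missing field is a test failure rather than a crash.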
diff --git a/docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover.py b/docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover.py
new file mode 100644
index 00000000..182aa0c7
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover.py
@@ -0,0 +1,82 @@
+#!/usr/bin/env python3
+"""Generate the catchy dark cover image for the June 2026 gateway benchmark post.
+
+Thesis-driven: latency is overrated, the resource bill isn't. So the hero visual
+is the resource gap (Docker image + peak RAM), GoModel highlighted.
+"""
+import sys
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+from matplotlib import font_manager as fm
+
+BG = "#0b0e14"
+PANEL = "#11161f"
+TEXT = "#e6edf3"
+MUTED = "#8b98a9"
+GREEN = "#34d399"   # GoModel
+RED = "#f87171"     # LiteLLM
+GRAY = "#5b6675"    # others
+
+def font(weight="normal", size=12, black=False):
+    fam = "Arial Black" if black else "Arial"
+    return fm.FontProperties(family=fam, weight=weight, size=size)
+
+# data (June 2026 c7i.large run) - ascending, so GoModel (the winner) sits on top
+# and the giant red LiteLLM bar sits at the bottom. Image = compressed pull size;
+# RAM = peak under load (LiteLLM at its recommended one-worker-per-core config).
+IMG = [("GoModel", 16, GREEN), ("Portkey", 59, GRAY), ("Bifrost", 77, GRAY), ("LiteLLM", 372, RED)]
+RAM = [("GoModel", 37, GREEN), ("Portkey", 112, GRAY), ("Bifrost", 143, GRAY), ("LiteLLM", 2272, RED)]
+
+W, H, DPI = 2400, 1260, 200
+fig = plt.figure(figsize=(W / DPI, H / DPI), dpi=DPI)
+fig.patch.set_facecolor(BG)
+
+# ── left text column (top-anchored so positions are predictable) ───
+T = dict(va="top", ha="left")
+fig.text(0.045, 0.93, "AI GATEWAY BENCHMARK  ·  JUNE 25, 2026", color=GREEN,
+         fontproperties=font(size=14.5, weight="bold"), **T)
+fig.text(0.043, 0.84, "LATENCY IS", color=TEXT, fontproperties=font(size=39, black=True), **T)
+fig.text(0.043, 0.725, "OVERRATED", color=TEXT, fontproperties=font(size=39, black=True), **T)
+fig.text(0.043, 0.585, "LOOK AT THE BILL", color=GREEN, fontproperties=font(size=35, black=True), **T)
+fig.add_artist(plt.Line2D([0.045, 0.405], [0.475, 0.475], color="#1f2733", lw=2))
+fig.text(0.045, 0.45, "GoModel — the fastest,\nmost lightweight AI\ngateway in the world",
+         color=GREEN, fontproperties=font(size=18, weight="bold"), linespacing=1.4, **T)
+
+def panel(rect, title, rows, unit, ref):
+    ax = fig.add_axes(rect)
+    ax.set_facecolor(PANEL)
+    for s in ax.spines.values():
+        s.set_visible(False)
+    ax.tick_params(left=False, bottom=False, labelbottom=False)
+    labels = [r[0] for r in rows]
+    vals = [r[1] for r in rows]
+    colors = [r[2] for r in rows]
+    y = range(len(rows))
+    maxv = max(vals)
+    ax.barh(y, vals, color=colors, height=0.62, zorder=3)
+    ax.set_xlim(0, maxv * 1.34)  # headroom so value labels never clip
+    ax.set_ylim(-0.6, len(rows) - 0.4)
+    ax.invert_yaxis()
+    ax.set_yticks(list(y))
+    ax.set_yticklabels(labels, color=TEXT, fontproperties=font(size=14, weight="bold"))
+    for i, v in enumerate(vals):
+        mult = v / ref
+        tag = "1×" if abs(mult - 1) < 0.05 else f"{mult:.0f}×"
+        label = f"{v:,} {unit}   ({tag})"
+        if colors[i] == RED:  # the worst: label centered inside the bar, dark text
+            ax.text(v / 2, i, label, va="center", ha="center", color=BG,
+                    fontproperties=font(size=12.5, weight="bold"))
+        else:
+            ax.text(v + maxv * 0.02, i, label, va="center", ha="left",
+                    color=TEXT if colors[i] != GRAY else MUTED,
+                    fontproperties=font(size=12.5, weight="bold"))
+    ax.set_title(title, loc="left", color=MUTED, fontproperties=font(size=14, weight="bold"), pad=8)
+    return ax
+
+panel([0.55, 0.575, 0.36, 0.295], "DOCKER IMAGE (COMPRESSED)", IMG, "MB", 16)
+panel([0.55, 0.135, 0.36, 0.295], "PEAK RAM UNDER LOAD", RAM, "MB", 37)
+
+out = sys.argv[1] if len(sys.argv) > 1 else "cover.png"
+fig.savefig(out, facecolor=BG, dpi=DPI)
+print("wrote", out)
diff --git a/docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover_b.py b/docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover_b.py
new file mode 100644
index 00000000..7a71dc89
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover_b.py
@@ -0,0 +1,78 @@
+#!/usr/bin/env python3
+"""Cover for the measured benchmark post variant (B).
+
+Same hero visual as make_cover.py (the resource gap: Docker image + peak RAM,
+GoModel highlighted). The text is a single takeaway -
+"Four gateways, one backend - GoModel wins" - with no cost-question framing.
+"""
+import sys
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+from matplotlib import font_manager as fm
+
+BG = "#0b0e14"
+PANEL = "#11161f"
+TEXT = "#e6edf3"
+MUTED = "#8b98a9"
+GREEN = "#34d399"   # GoModel
+RED = "#f87171"     # LiteLLM
+GRAY = "#5b6675"    # others
+
+def font(weight="normal", size=12, black=False):
+    fam = "Arial Black" if black else "Arial"
+    return fm.FontProperties(family=fam, weight=weight, size=size)
+
+# data (June 2026 c7i.large run) - ascending, GoModel (winner) on top.
+IMG = [("GoModel", 16, GREEN), ("Portkey", 59, GRAY), ("Bifrost", 77, GRAY), ("LiteLLM", 372, RED)]
+RAM = [("GoModel", 37, GREEN), ("Portkey", 112, GRAY), ("Bifrost", 143, GRAY), ("LiteLLM", 2272, RED)]
+
+W, H, DPI = 2400, 1260, 200
+fig = plt.figure(figsize=(W / DPI, H / DPI), dpi=DPI)
+fig.patch.set_facecolor(BG)
+
+# ── left text column (single takeaway, no cost-question headline) ───
+T = dict(va="top", ha="left")
+fig.text(0.045, 0.93, "AI GATEWAY BENCHMARK  ·  JUNE 25, 2026", color=GREEN,
+         fontproperties=font(size=14.5, weight="bold"), **T)
+fig.text(0.043, 0.72, "Four gateways,", color=TEXT, fontproperties=font(size=33, black=True), **T)
+fig.text(0.043, 0.60, "one backend —", color=TEXT, fontproperties=font(size=33, black=True), **T)
+fig.text(0.043, 0.48, "GoModel wins", color=GREEN, fontproperties=font(size=33, black=True), **T)
+
+def panel(rect, title, rows, unit, ref):
+    ax = fig.add_axes(rect)
+    ax.set_facecolor(PANEL)
+    for s in ax.spines.values():
+        s.set_visible(False)
+    ax.tick_params(left=False, bottom=False, labelbottom=False)
+    labels = [r[0] for r in rows]
+    vals = [r[1] for r in rows]
+    colors = [r[2] for r in rows]
+    y = range(len(rows))
+    maxv = max(vals)
+    ax.barh(y, vals, color=colors, height=0.62, zorder=3)
+    ax.set_xlim(0, maxv * 1.34)
+    ax.set_ylim(-0.6, len(rows) - 0.4)
+    ax.invert_yaxis()
+    ax.set_yticks(list(y))
+    ax.set_yticklabels(labels, color=TEXT, fontproperties=font(size=14, weight="bold"))
+    for i, v in enumerate(vals):
+        mult = v / ref
+        tag = "1×" if abs(mult - 1) < 0.05 else f"{mult:.0f}×"
+        label = f"{v:,} {unit}   ({tag})"
+        if colors[i] == RED:  # the worst: label centered inside the bar, dark text
+            ax.text(v / 2, i, label, va="center", ha="center", color=BG,
+                    fontproperties=font(size=12.5, weight="bold"))
+        else:
+            ax.text(v + maxv * 0.02, i, label, va="center", ha="left",
+                    color=TEXT if colors[i] != GRAY else MUTED,
+                    fontproperties=font(size=12.5, weight="bold"))
+    ax.set_title(title, loc="left", color=MUTED, fontproperties=font(size=14, weight="bold"), pad=8)
+    return ax
+
+panel([0.55, 0.575, 0.36, 0.295], "DOCKER IMAGE (COMPRESSED)", IMG, "MB", 16)
+panel([0.55, 0.135, 0.36, 0.295], "PEAK RAM UNDER LOAD", RAM, "MB", 37)
+
+out = sys.argv[1] if len(sys.argv) > 1 else "cover-b.png"
+fig.savefig(out, facecolor=BG, dpi=DPI)
+print("wrote", out)
diff --git a/docs/2026-06-25_aws_gateway_benchmark/translation/.gitignore b/docs/2026-06-25_aws_gateway_benchmark/translation/.gitignore
new file mode 100644
index 00000000..179b4868
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/translation/.gitignore
@@ -0,0 +1,3 @@
+output/
+__pycache__/
+*.pyc
diff --git a/docs/2026-06-25_aws_gateway_benchmark/translation/README.md b/docs/2026-06-25_aws_gateway_benchmark/translation/README.md
new file mode 100644
index 00000000..82487965
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/translation/README.md
@@ -0,0 +1,75 @@
+# Gateway translation-fidelity analysis
+
+How faithfully does each AI gateway translate a request? This harness sends the
+**same** client request through **GoModel, LiteLLM, Portkey, and Bifrost**, all
+pointed at the **same recording mock provider**, and captures — per case, per
+gateway — four artifacts:
+
+| artifact | meaning |
+|---|---|
+| `client_request` | what we sent to the gateway (the **pure** request) |
+| `sent_body` | the body after per-gateway rewrites (e.g. Bifrost's `openai/` model prefix) |
+| `upstream` | the request the gateway actually sent to the provider (the **translated** request) + the canned (**pure**) response the mock returned |
+| `client_response` | what the gateway returned to us (the **translated** response) |
+
+Then an AI analyzes each case across gateways: what each one added, dropped,
+renamed, or reshaped — request *and* response — and which is most faithful.
+
+A recording mock (not real providers) is the only way to observe the translated
+*upstream* request: real providers don't echo what the gateway sent them.
+
+## Why a mock, and what "pure" means
+
+- **Pure request** = the original client body. **Translated request** = what the
+  gateway emitted upstream (captured by the mock).
+- **Pure response** = the deterministic provider-shaped body the mock returned
+  (enriched with `system_fingerprint`, `service_tier`, and a non-standard
+  `x_provider_note` so we can see which gateways preserve provider extras).
+  **Translated response** = what the gateway returned to the client.
+- The comparison axis is **gateway vs gateway** — every case uses the same model
+  (`gpt-4o-mini`) routed to the mock, so differences are the gateway's doing, not
+  the provider's.
+
+## Pieces
+
+```text
+docker-compose.yml   mock (MOCK_RECORD=1) + all 4 gateways, reusing ../remote configs
+corpus.json          12 gateway-agnostic cases across chat/responses/messages, stream + not
+capture.py           resets the mock, sends each case through each gateway, records 4 artifacts
+analyze.py           --split writes one bundle per case for the AI analyst; --render builds output/report.md
+output/              captures.json + the AI comparison report (gitignored)
+```
+
+The recording mock lives in `../remote/bench-tools/mock/main.go` (recording is
+gated behind `MOCK_RECORD=1`, so the latency benchmark stays byte-identical).
+
+## Run it
+
+```bash
+# 0. from the repo root: enter this directory, then build the GoModel image once (native arch):
+cd docs/2026-06-25_aws_gateway_benchmark/translation
+docker build -t gomodel-bench:local ../../..
+
+# 1. bring up the recording mock + all four gateways:
+docker compose --profile all up -d --build
+
+# 2. capture translations (resets the mock before each call):
+python3 capture.py            # -> output/captures.json
+
+# 3. tear down:
+docker compose --profile all down
+```
+
+No real provider keys or spend — every gateway talks to the local mock.
+
+## Per-gateway addressing (handled by capture.py)
+
+| gateway | port | model | messages path | extra headers |
+|---|--|---|---|---|
+| GoModel | 18080 | `gpt-4o-mini` | `/v1/messages` | — |
+| LiteLLM | 4000 | `gpt-4o-mini` | `/v1/messages` | — |
+| Portkey | 8787 | `gpt-4o-mini` | `/v1/messages` | `x-portkey-provider`, `x-portkey-custom-host` |
+| Bifrost | 8089 | `openai/gpt-4o-mini` | `/anthropic/v1/messages` | — |
+
+Dialects a gateway doesn't serve are not skipped — the non-200 (and empty
+upstream log) is recorded, because that asymmetry is itself a finding.
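The per-case comparison described in the README (what each gateway added, dropped, or renamed) can be approximated mechanically for top-level request keys. The following is a minimal sketch and not part of the harness: `diff_fields` is a hypothetical helper, the sample bodies are hand-written, and the rename table holds only the `max_tokens` to `max_completion_tokens` example that appears in the analysis schema.

```python
# Hypothetical helper; the harness leaves this comparison to the AI analyst.
KNOWN_RENAMES = {"max_tokens": "max_completion_tokens"}  # example from the analysis schema


def diff_fields(client_body, upstream_body, renames=KNOWN_RENAMES):
    """Classify top-level keys as added, dropped, renamed, or changed between two bodies."""
    client_keys, upstream_keys = set(client_body), set(upstream_body)
    # A rename counts only if the old key vanished and the new key appeared.
    renamed = {old: new for old, new in renames.items()
               if old in client_keys and old not in upstream_keys
               and new in upstream_keys and new not in client_keys}
    added = upstream_keys - client_keys - set(renamed.values())
    dropped = client_keys - upstream_keys - set(renamed)
    changed = {k for k in client_keys & upstream_keys
               if client_body[k] != upstream_body[k]}
    return {"added": sorted(added), "dropped": sorted(dropped),
            "renamed": [f"{o}->{n}" for o, n in sorted(renamed.items())],
            "changed": sorted(changed)}


# Hand-written bodies for illustration only (not captured output).
client = {"model": "gpt-4o-mini",
          "messages": [{"role": "user", "content": "hi"}],
          "max_tokens": 16, "metadata": {"qa_case": "demo"}}
upstream = {"model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "hi"}],
            "max_completion_tokens": 16, "stream": False}

report = diff_fields(client, upstream)
print(report)
```

A key-level diff like this only catches flat changes. Structural reshaping (dialect translation, message or tool-schema rewrites) still needs the prose judgment the README assigns to the analyst.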
diff --git a/docs/2026-06-25_aws_gateway_benchmark/translation/analyze.py b/docs/2026-06-25_aws_gateway_benchmark/translation/analyze.py
new file mode 100644
index 00000000..9f2f0762
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/translation/analyze.py
@@ -0,0 +1,156 @@
+#!/usr/bin/env python3
+"""Glue for the AI translation analysis.
+
+  analyze.py --split          read output/captures.json, write one self-contained
+                              bundle per case to output/cases/<id>.json (the input
+                              an AI analyst reviews for that case)
+  analyze.py --render         read output/analysis/<id>.json (the AI's structured
+                              verdict per case) + captures.json, write output/report.md
+
+The actual case-by-case comparison is done by an AI analyst (one per case): it
+reads a bundle and writes its verdict to output/analysis/<id>.json following the
+schema printed by `analyze.py --schema`. Stdlib only.
+"""
+import argparse
+import glob
+import json
+import os
+
+HERE = os.path.dirname(os.path.abspath(__file__))
+OUT = os.path.join(HERE, "output")
+GATEWAYS = ["gomodel", "litellm", "portkey", "bifrost"]
+
+ANALYSIS_SCHEMA = {
+    "case_id": "string",
+    "verdict_per_gateway": {
+        "<gateway>": {
+            "reached_provider": "bool — did the gateway make an upstream call?",
+            "upstream_path": "the path it called on the mock",
+            "request_added": ["fields/headers the gateway ADDED vs the client request"],
+            "request_dropped": ["client fields the gateway DROPPED before upstream"],
+            "request_renamed": ["client->upstream field renames, e.g. max_tokens->max_completion_tokens"],
+            "request_reshaped": "prose: structural changes (dialect translation, message shape, tool schema)",
+            "response_extras_preserved": ["provider extras kept in the client response: system_fingerprint/service_tier/x_provider_note/usage"],
+            "response_extras_dropped": ["provider extras the gateway stripped"],
+            "response_reshaped": "prose: how the upstream response was renormalized for the client",
+            "fidelity_score": "0-100 int: how faithfully intent was preserved end-to-end",
+            "notes": "anything notable"
+        }
+    },
+    "cross_gateway_findings": ["concise comparative observations"],
+    "ranking": ["gateways best->worst fidelity for this case"],
+}
+
+
+def split():
+    caps = json.load(open(os.path.join(OUT, "captures.json"), encoding="utf-8"))
+    d = os.path.join(OUT, "cases")
+    os.makedirs(d, exist_ok=True)
+    ids = []
+    for cid, case in caps["cases"].items():
+        bundle = {"case_id": cid, "dialect": case["dialect"], "stream": case["stream"],
+                  "intent_note": case["note"], "client_request": case["client_request"],
+                  "gateways": case["gateways"]}
+        with open(os.path.join(d, f"{cid}.json"), "w", encoding="utf-8") as f:
+            json.dump(bundle, f, indent=2)
+        ids.append(cid)
+    print(f"wrote {len(ids)} case bundles to {d}")
+    for cid in ids:
+        print("  ", cid)
+
+
+def _esc(s):
+    # AI-authored cell values may contain `|` or newlines that would break the
+    # Markdown table; escape pipes and collapse newlines to spaces.
+    return str(s).replace("|", "\\|").replace("\r", " ").replace("\n", " ")
+
+
+def _cell(items):
+    if not items:
+        return "—"
+    return _esc("; ".join(str(x) for x in items)[:120])
+
+
+def render():
+    caps = json.load(open(os.path.join(OUT, "captures.json"), encoding="utf-8"))
+    analyses = {}
+    for p in glob.glob(os.path.join(OUT, "analysis", "*.json")):
+        try:
+            a = json.load(open(p, encoding="utf-8"))
+            analyses[a.get("case_id", os.path.basename(p)[:-5])] = a
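+        except (OSError, ValueError) as e:
+            # Surface unreadable verdicts instead of silently reporting "no AI analysis".
+            print(f"skipping unreadable analysis file {p}: {e}")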
+
+    gws = caps["meta"]["gateways"]
+    L = ["# Gateway translation-fidelity report\n",
+         "Same request through each gateway, same mock provider. The AI analyst "
+         "compared the translated upstream request vs the pure client request, and "
+         "the translated client response vs the pure mock response, per case.\n",
+         f"`gateways: {', '.join(gws)}`  ·  `cases: {len(caps['cases'])}`\n"]
+
+    # ── aggregate scoreboard ──────────────────────────────────────────────────
+    scores = {g: [] for g in gws}
+    for a in analyses.values():
+        for g, v in (a.get("verdict_per_gateway") or {}).items():
+            s = v.get("fidelity_score")
+            if isinstance(s, (int, float)):
+                scores.setdefault(g, []).append(s)
+    L.append("## Fidelity scoreboard (mean of per-case AI scores)\n")
+    L.append("| gateway | mean fidelity | cases scored |")
+    L.append("|---|--:|--:|")
+    for g in gws:
+        vals = scores.get(g, [])
+        # An unscored gateway gets a placeholder, not a misleading mean of 0.
+        mean = round(sum(vals) / len(vals)) if vals else "—"
+        L.append(f"| {g} | {mean} | {len(vals)} |")
+    L.append("")
+
+    # ── per-case detail ────────────────────────────────────────────────────────
+    for cid, case in caps["cases"].items():
+        a = analyses.get(cid)
+        L.append(f"## `{cid}` — {case['dialect']}{', stream' if case['stream'] else ''}\n")
+        L.append(f"_{case['note']}_\n")
+        if not a:
+            L.append("> _no AI analysis recorded for this case_\n")
+            continue
+        L.append("| gateway | upstream | added | dropped | renamed | resp extras kept | resp dropped | fidelity |")
+        L.append("|---|---|---|---|---|---|---|--:|")
+        for g in gws:
+            v = (a.get("verdict_per_gateway") or {}).get(g)
+            if not v:
+                L.append(f"| {g} | — | — | — | — | — | — | — |")
+                continue
+            L.append(f"| {g} | {_esc(v.get('upstream_path','—'))} | {_cell(v.get('request_added'))} | "
+                     f"{_cell(v.get('request_dropped'))} | {_cell(v.get('request_renamed'))} | "
+                     f"{_cell(v.get('response_extras_preserved'))} | {_cell(v.get('response_extras_dropped'))} | "
+                     f"{_esc(v.get('fidelity_score','—'))} |")
+        L.append("")
+        if a.get("cross_gateway_findings"):
+            L.append("**Findings:**")
+            for f in a["cross_gateway_findings"]:
+                L.append(f"- {f}")
+            L.append("")
+        if a.get("ranking"):
+            L.append(f"**Fidelity ranking:** {' > '.join(a['ranking'])}\n")
+
+    path = os.path.join(OUT, "report.md")
+    with open(path, "w", encoding="utf-8") as f:
+        f.write("\n".join(L))
+    print(f"wrote {path}")
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--split", action="store_true")
+    ap.add_argument("--render", action="store_true")
+    ap.add_argument("--schema", action="store_true", help="print the analysis JSON schema")
+    args = ap.parse_args()
+    if args.schema:
+        print(json.dumps(ANALYSIS_SCHEMA, indent=2))
+    elif args.split:
+        split()
+    elif args.render:
+        render()
+    else:
+        ap.error("one of --split / --render / --schema required")
+
+
+if __name__ == "__main__":
+    main()
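`--render` quietly ignores verdicts with a missing or non-numeric `fidelity_score`, so a malformed analysis file can shrink the scoreboard without notice. A pre-render check along the following lines could catch that. This is a sketch under assumptions: `validate_verdict` is a hypothetical helper, the field names follow `ANALYSIS_SCHEMA` above, and the sample uses placeholder gateway names with made-up values purely to exercise the checks.

```python
# Hypothetical pre-render check; field names follow ANALYSIS_SCHEMA in analyze.py.
GATEWAYS = ["gomodel", "litellm", "portkey", "bifrost"]
LIST_FIELDS = ["request_added", "request_dropped", "request_renamed",
               "response_extras_preserved", "response_extras_dropped"]


def validate_verdict(verdict, gateways=GATEWAYS):
    """Return a list of human-readable problems; an empty list means the verdict looks usable."""
    problems = []
    if not isinstance(verdict.get("case_id"), str):
        problems.append("case_id missing or not a string")
    per_gw = verdict.get("verdict_per_gateway") or {}
    for gw in gateways:
        v = per_gw.get(gw)
        if not isinstance(v, dict):
            problems.append(f"{gw}: no verdict")
            continue
        score = v.get("fidelity_score")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 100:
            problems.append(f"{gw}: fidelity_score must be an int in 0-100, got {score!r}")
        if not isinstance(v.get("reached_provider"), bool):
            problems.append(f"{gw}: reached_provider must be a bool")
        for field in LIST_FIELDS:
            if not isinstance(v.get(field, []), list):
                problems.append(f"{gw}: {field} must be a list")
    return problems


# Hand-written example with placeholder gateways and made-up values.
sample = {
    "case_id": "example.case",
    "verdict_per_gateway": {
        "gw_a": {"reached_provider": True, "fidelity_score": 90, "request_added": []},
        "gw_b": {"reached_provider": True, "fidelity_score": "high"},
        "gw_c": {"reached_provider": "yes", "fidelity_score": 70},
    },
}
problems = validate_verdict(sample, gateways=["gw_a", "gw_b", "gw_c", "gw_d"])
for problem in problems:
    print(problem)
```

Returning a list of problems, not raising on the first one, lets a single pass report every issue in a verdict file.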
diff --git a/docs/2026-06-25_aws_gateway_benchmark/translation/capture.py b/docs/2026-06-25_aws_gateway_benchmark/translation/capture.py
new file mode 100644
index 00000000..73201219
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/translation/capture.py
@@ -0,0 +1,228 @@
+#!/usr/bin/env python3
+"""Capture how each gateway translates the SAME client request to the SAME mock.
+
+For every (case, gateway) it records four artifacts:
+  - client_request   : what we sent to the gateway (the "pure" request)
+  - sent_body        : the body after per-gateway model rewrite
+  - upstream         : the request(s) the gateway actually sent to the mock
+                       (the TRANSLATED request) + the canned ("pure") response
+  - client_response  : what the gateway returned to us (the TRANSLATED response)
+
+The mock is reset before each call and requests are sent one at a time, so the
+shared recorder attributes each upstream call to the gateway+case that made it.
+Stdlib only. Output: output/captures.json.
+"""
+import argparse
+import copy
+import json
+import os
+import sys
+import time
+import urllib.error
+import urllib.request
+
+HERE = os.path.dirname(os.path.abspath(__file__))
+MOCK = "http://localhost:9999"
+
+# Per-gateway base URL is env-overridable (e.g. GOMODEL_BASE) so a local dev
+# server on a default port doesn't force a clash.
+GATEWAYS = {
+    "gomodel": {"base": os.environ.get("GOMODEL_BASE", "http://localhost:18080")},
+    "litellm": {"base": os.environ.get("LITELLM_BASE", "http://localhost:4000")},
+    "portkey": {"base": os.environ.get("PORTKEY_BASE", "http://localhost:8787"),
+                "headers": {"x-portkey-provider": "openai",
+                            "x-portkey-custom-host": "http://mock:9999/v1"}},
+    "bifrost": {"base": os.environ.get("BIFROST_BASE", "http://localhost:8089")},
+}
+ORDER = ["gomodel", "litellm", "portkey", "bifrost"]
+DIALECT_PATH = {"chat": "/v1/chat/completions", "responses": "/v1/responses",
+                "messages": "/v1/messages"}
+
+
+def model_for(gw, m):
+    return "openai/" + m if gw == "bifrost" else m
+
+
+def path_for(gw, dialect):
+    if gw == "bifrost" and dialect == "messages":
+        return "/anthropic/v1/messages"
+    return DIALECT_PATH[dialect]
+
+
+def headers_for(gw):
+    h = {"Content-Type": "application/json", "Authorization": "Bearer sk-bench-test-key",
+         "anthropic-version": "2023-06-01"}
+    h.update(GATEWAYS[gw].get("headers", {}))
+    return h
+
+
+# ── HTTP ─────────────────────────────────────────────────────────────────────
+def post(url, headers, body, stream, timeout=30):
+    data = json.dumps(body).encode("utf-8")
+    req = urllib.request.Request(url, data=data, method="POST", headers=headers)
+    out = {"status": 0, "content_type": "", "json": None, "text": None,
+           "stream_events": 0, "stream_text": "", "terminal": None, "error": None}
+    try:
+        resp = urllib.request.urlopen(req, timeout=timeout)
+        _capture(out, resp, stream)
+    except urllib.error.HTTPError as e:
+        out["status"] = e.code
+        _capture(out, e, stream=False)
+    except Exception as e:  # noqa: BLE001
+        out["error"] = f"{type(e).__name__}: {e}"
+    return out
+
+
+def _capture(out, resp, stream):
+    out["status"] = getattr(resp, "status", out["status"]) or out["status"]
+    try:
+        out["content_type"] = resp.headers.get("content-type", "")
+    except Exception:  # noqa: BLE001
+        pass
+    if stream and "text/event-stream" in out["content_type"]:
+        for rawline in resp:
+            line = rawline.decode("utf-8", "replace").strip()
+            if not line.startswith("data:"):
+                continue
+            payload = line[5:].strip()
+            if payload == "[DONE]":
+                out["terminal"] = "[DONE]"
+                continue
+            out["stream_events"] += 1
+            try:
+                ev = json.loads(payload)
+            except Exception:  # noqa: BLE001
+                continue
+            t = ev.get("type")
+            if t in ("response.completed", "message_stop"):
+                out["terminal"] = t
+            for ch in ev.get("choices", []) or []:
+                d = (ch.get("delta") or {}).get("content")
+                if isinstance(d, str):
+                    out["stream_text"] += d
+            if t == "response.output_text.delta" and isinstance(ev.get("delta"), str):
+                out["stream_text"] += ev["delta"]
+            if t == "content_block_delta":
+                td = (ev.get("delta") or {}).get("text")
+                if isinstance(td, str):
+                    out["stream_text"] += td
+        return
+    raw = resp.read()
+    if "application/json" in out["content_type"]:
+        try:
+            out["json"] = json.loads(raw.decode("utf-8"))
+        except Exception:  # noqa: BLE001
+            out["text"] = raw.decode("utf-8", "replace")
+    else:
+        out["text"] = raw.decode("utf-8", "replace")[:4000]
+
+
+def get_json(url, timeout=10):
+    try:
+        resp = urllib.request.urlopen(urllib.request.Request(url, method="GET"), timeout=timeout)
+        return json.loads(resp.read().decode("utf-8"))
+    except Exception:  # noqa: BLE001
+        return None
+
+
+def mock_reset():
+    # Fail fast: a silently failed reset would attribute stale upstream calls to
+    # the wrong gateway/case and corrupt the captured corpus.
+    try:
+        resp = urllib.request.urlopen(
+            urllib.request.Request(MOCK + "/__reset", data=b"", method="POST"), timeout=5)
+        status = getattr(resp, "status", 200) or 200
+        resp.read()
+    except Exception as e:  # noqa: BLE001
+        sys.exit(f"mock reset failed ({MOCK}/__reset): {e} — aborting to avoid a corrupt corpus")
+    if status >= 400:
+        sys.exit(f"mock reset returned HTTP {status} ({MOCK}/__reset) — aborting to avoid a corrupt corpus")
+
+
+def wait_ready(gw, tries=60):
+    url = GATEWAYS[gw]["base"] + "/v1/chat/completions"
+    body = {"model": model_for(gw, "gpt-4o-mini"),
+            "messages": [{"role": "user", "content": "ping"}]}
+    for _ in range(tries):
+        r = post(url, headers_for(gw), body, stream=False, timeout=8)
+        if r["status"] == 200:
+            return True
+        time.sleep(2)
+    return False
+
+
+# ── trimming (keep artifacts readable) ────────────────────────────────────────
+def trim(obj, limit=1500):
+    if isinstance(obj, str):
+        return obj if len(obj) <= limit else obj[:limit] + f"…(+{len(obj) - limit})"
+    if isinstance(obj, list):
+        return [trim(x, limit) for x in obj]
+    if isinstance(obj, dict):
+        return {k: trim(v, limit) for k, v in obj.items()}
+    return obj
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--corpus", default=os.path.join(HERE, "corpus.json"))
+    ap.add_argument("--out", default=os.path.join(HERE, "output", "captures.json"))
+    ap.add_argument("--gateways", default=",".join(ORDER))
+    args = ap.parse_args()
+
+    gateways = [g.strip() for g in args.gateways.split(",") if g.strip()]
+    unknown = [g for g in gateways if g not in GATEWAYS]
+    if unknown:
+        ap.error(f"unknown gateway(s): {', '.join(unknown)}; valid options: {', '.join(ORDER)}")
+    corpus = json.load(open(args.corpus, encoding="utf-8"))
+
+    if get_json(MOCK + "/__log") is None:
+        print(f"mock not reachable at {MOCK} (is the stack up? is MOCK_RECORD=1?)", file=sys.stderr)
+        return 2
+
+    print("waiting for gateways…")
+    ready = {}
+    for gw in gateways:
+        ready[gw] = wait_ready(gw)
+        print(f"  {gw:9} {'ready' if ready[gw] else 'NOT READY (will still attempt)'}")
+
+    results = {"meta": {"gateways": gateways, "ready": ready}, "cases": {}}
+    for case in corpus:
+        cid, dialect, stream = case["id"], case["dialect"], case.get("stream", False)
+        entry = {"note": case.get("note", ""), "dialect": dialect, "stream": stream,
+                 "client_request": case["body"], "gateways": {}}
+        print(f"\n{cid}  ({dialect}{', stream' if stream else ''})")
+        for gw in gateways:
+            body = copy.deepcopy(case["body"])
+            body["model"] = model_for(gw, body["model"])
+            url = GATEWAYS[gw]["base"] + path_for(gw, dialect)
+            mock_reset()
+            resp = post(url, headers_for(gw), body, stream)
+            log = get_json(MOCK + "/__log") or {}
+            ups = log.get("entries") or []   # mock returns null when no upstream call was made
+            up_paths = ",".join(sorted({e.get("path", "?") for e in ups})) or "—"
+            print(f"  {gw:9} http={str(resp['status'] or resp['error']):>4}  "
+                  f"upstream={len(ups)} [{up_paths}]")
+            entry["gateways"][gw] = {
+                "sent_body": trim(body),
+                "url": url,
+                "client_response": {
+                    "status": resp["status"], "content_type": resp["content_type"],
+                    "error": resp["error"],
+                    "json": trim(resp["json"]) if resp["json"] is not None else None,
+                    "text": resp["text"],
+                    "stream_events": resp["stream_events"],
+                    "stream_text": trim(resp["stream_text"]) if resp["stream_text"] else "",
+                    "terminal": resp["terminal"],
+                },
+                "upstream": trim(ups),
+            }
+        results["cases"][cid] = entry
+
+    os.makedirs(os.path.dirname(args.out) or ".", exist_ok=True)  # dirname is "" for a bare filename
+    json.dump(results, open(args.out, "w", encoding="utf-8"), indent=2)
+    print(f"\nwrote {args.out}")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/docs/2026-06-25_aws_gateway_benchmark/translation/corpus.json b/docs/2026-06-25_aws_gateway_benchmark/translation/corpus.json
new file mode 100644
index 00000000..24c6fd1b
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/translation/corpus.json
@@ -0,0 +1,153 @@
+[
+  {
+    "id": "chat.simple",
+    "dialect": "chat",
+    "stream": false,
+    "note": "Baseline: does the body pass through unchanged? What auth/headers are injected upstream?",
+    "body": {
+      "model": "gpt-4o-mini",
+      "messages": [{"role": "user", "content": "What is the capital of France?"}]
+    }
+  },
+  {
+    "id": "chat.stream",
+    "dialect": "chat",
+    "stream": true,
+    "note": "Streaming framing: chunk shape, terminal marker, whether stream_options is forwarded.",
+    "body": {
+      "model": "gpt-4o-mini",
+      "messages": [{"role": "user", "content": "Count to three."}],
+      "stream": true,
+      "stream_options": {"include_usage": true}
+    }
+  },
+  {
+    "id": "chat.multiturn_system",
+    "dialect": "chat",
+    "stream": false,
+    "note": "System role + multi-turn: is the system message preserved in place and message order kept?",
+    "body": {
+      "model": "gpt-4o-mini",
+      "messages": [
+        {"role": "system", "content": "You are a terse assistant."},
+        {"role": "user", "content": "Largest planet?"},
+        {"role": "assistant", "content": "Jupiter."},
+        {"role": "user", "content": "Smallest?"}
+      ]
+    }
+  },
+  {
+    "id": "chat.params",
+    "dialect": "chat",
+    "stream": false,
+    "note": "Sampling params fidelity: which of these survive verbatim upstream (temperature/top_p/penalties/stop/seed/max_tokens)?",
+    "body": {
+      "model": "gpt-4o-mini",
+      "messages": [{"role": "user", "content": "Say ok."}],
+      "temperature": 0.3,
+      "top_p": 0.8,
+      "frequency_penalty": 0.5,
+      "presence_penalty": 0.2,
+      "stop": ["\n\n"],
+      "seed": 42,
+      "max_tokens": 64
+    }
+  },
+  {
+    "id": "chat.extra_fields",
+    "dialect": "chat",
+    "stream": false,
+    "note": "KEY: unknown/extra fields. Which gateways forward them verbatim vs strip them (e.g. LiteLLM drop_params)?",
+    "body": {
+      "model": "gpt-4o-mini",
+      "messages": [{"role": "user", "content": "Say ok."}],
+      "metadata": {"qa_case": "extra-fields"},
+      "x_qa_marker": "keep-123",
+      "user": "qa-user-1"
+    }
+  },
+  {
+    "id": "chat.tools",
+    "dialect": "chat",
+    "stream": false,
+    "note": "Tool/function definitions and tool_choice: forwarded faithfully?",
+    "body": {
+      "model": "gpt-4o-mini",
+      "messages": [{"role": "user", "content": "Weather in Paris?"}],
+      "tools": [{"type": "function", "function": {
+        "name": "get_weather",
+        "description": "Get weather for a city",
+        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
+      }}],
+      "tool_choice": "auto"
+    }
+  },
+  {
+    "id": "chat.response_format",
+    "dialect": "chat",
+    "stream": false,
+    "note": "Structured-output directive: is response_format forwarded?",
+    "body": {
+      "model": "gpt-4o-mini",
+      "messages": [{"role": "user", "content": "Return JSON with capital of Spain."}],
+      "response_format": {"type": "json_object"}
+    }
+  },
+  {
+    "id": "chat.vision",
+    "dialect": "chat",
+    "stream": false,
+    "note": "Multimodal content parts: how is an image_url part forwarded upstream?",
+    "body": {
+      "model": "gpt-4o-mini",
+      "messages": [{"role": "user", "content": [
+        {"type": "text", "text": "What color?"},
+        {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mP8z8BQDwAEhQGAhKmMIQAAAABJRU5ErkJggg=="}}
+      ]}]
+    }
+  },
+  {
+    "id": "responses.simple",
+    "dialect": "responses",
+    "stream": false,
+    "note": "Responses API: how is `input` translated for each gateway's upstream provider call?",
+    "body": {
+      "model": "gpt-4o-mini",
+      "input": "What is the capital of France?"
+    }
+  },
+  {
+    "id": "responses.stream",
+    "dialect": "responses",
+    "stream": true,
+    "note": "Responses streaming: event protocol the gateway returns to the client.",
+    "body": {
+      "model": "gpt-4o-mini",
+      "input": "Count to three.",
+      "stream": true
+    }
+  },
+  {
+    "id": "messages.simple",
+    "dialect": "messages",
+    "stream": false,
+    "note": "Anthropic Messages in: what upstream dialect/path does each gateway emit (native messages vs translated chat)?",
+    "body": {
+      "model": "gpt-4o-mini",
+      "max_tokens": 64,
+      "messages": [{"role": "user", "content": "What is the capital of France?"}]
+    }
+  },
+  {
+    "id": "messages.stream",
+    "dialect": "messages",
+    "stream": true,
+    "note": "Anthropic Messages streaming translation.",
+    "body": {
+      "model": "gpt-4o-mini",
+      "max_tokens": 64,
+      "stream": true,
+      "messages": [{"role": "user", "content": "Count to three."}]
+    }
+  }
+]
diff --git a/docs/2026-06-25_aws_gateway_benchmark/translation/docker-compose.yml b/docs/2026-06-25_aws_gateway_benchmark/translation/docker-compose.yml
new file mode 100644
index 00000000..57a9a070
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/translation/docker-compose.yml
@@ -0,0 +1,84 @@
+# Translation-fidelity topology: all four gateways at once, every one pointed at
+# a single RECORDING mock backend (MOCK_RECORD=1). Because the capture runner
+# sends one request at a time and resets the mock before each, the shared mock
+# cleanly attributes each upstream call to the gateway+case that produced it.
+#
+#   docker compose --profile all up -d   # mock + gomodel + litellm + portkey + bifrost
+#
+# Gateways are on different host ports so they can run simultaneously. Configs
+# and the bench-tools build context are reused from ../remote.
+
+networks:
+  default:
+    name: xlatenet
+
+services:
+  mock:
+    build: ../remote/bench-tools
+    command: ["/mock"]
+    environment:
+      - MOCK_PORT=9999
+      - MOCK_RECORD=1
+    ports:
+      - "9999:9999"
+    restart: "no"
+
+  gomodel:
+    profiles: ["all", "gomodel"]
+    image: ${GOMODEL_IMAGE:-gomodel-bench:local}
+    depends_on: [mock]
+    ports:
+      # Host 18080 to avoid clashing with a local dev gomodel on 8080.
+      - "${GOMODEL_HOST_PORT:-18080}:8080"
+    environment:
+      - PORT=8080
+      - GOMODEL_MASTER_KEY=
+      - OPENAI_API_KEY=sk-bench-test-key
+      - OPENAI_BASE_URL=http://mock:9999/v1
+      - LOGGING_ENABLED=false
+      - USAGE_ENABLED=false
+      - METRICS_ENABLED=false
+      - SWAGGER_ENABLED=false
+      - PPROF_ENABLED=false
+      - ENABLE_PASSTHROUGH_ROUTES=false
+      - STORAGE_TYPE=sqlite
+      - SQLITE_PATH=/app/data/gomodel-xlate.db
+      - GOMODEL_CACHE_DIR=/app/.cache
+    restart: "no"
+
+  litellm:
+    profiles: ["all", "litellm"]
+    # Pinned by digest for a reproducible comparison (override via LITELLM_IMAGE).
+    image: ${LITELLM_IMAGE:-ghcr.io/berriai/litellm:main-stable@sha256:afdc3cc37493d4f86d485ad7ac4445e7154c568a8d47c01bad15c9cf062c66b5}
+    depends_on: [mock]
+    ports:
+      - "4000:4000"
+    volumes:
+      - ../remote/configs/litellm-config.yaml:/app/config.yaml:ro
+    command: ["--config", "/app/config.yaml", "--port", "4000", "--num_workers", "1"]
+    restart: "no"
+
+  portkey:
+    profiles: ["all", "portkey"]
+    # Pinned by digest for a reproducible comparison (override via PORTKEY_IMAGE).
+    image: ${PORTKEY_IMAGE:-portkeyai/gateway:latest@sha256:97f094d9c8a764cbfaa2a7138c0017b247ca923bb06db1b4c13b7f8a33b5200d}
+    depends_on: [mock]
+    ports:
+      - "8787:8787"
+    environment:
+      - TRUSTED_CUSTOM_HOSTS=mock
+    restart: "no"
+
+  bifrost:
+    profiles: ["all", "bifrost"]
+    # Pinned by digest for a reproducible comparison (override via BIFROST_IMAGE).
+    image: ${BIFROST_IMAGE:-maximhq/bifrost:latest@sha256:6f20c020cd326199c050e6b15ba18131a6f7ac8627a9a4276750f83e92af2253}
+    depends_on: [mock]
+    ports:
+      - "8089:8089"
+    environment:
+      - APP_PORT=8089
+      - APP_HOST=0.0.0.0
+    volumes:
+      - ../remote/configs/bifrost-config.json:/app/data/config.json:ro
+    restart: "no"