diff --git a/docs/2026-06-25_aws_gateway_benchmark/ARTICLE.md b/docs/2026-06-25_aws_gateway_benchmark/ARTICLE.md
new file mode 100644
index 00000000..f59209d8
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/ARTICLE.md
@@ -0,0 +1,222 @@
+---
+title: "AI Gateway Benchmark 2026: GoModel vs LiteLLM, Portkey & Bifrost"
+description: "GoModel vs LiteLLM, Portkey, and Bifrost - a reproducible AWS benchmark of four open-source AI gateways across latency, throughput, memory, CPU, and Docker image size. A fast, lightweight LiteLLM alternative in Go."
+coverImage: "/blog/charts/gomodel-gateway-benchmark-june-2026-cover.png"
+coverImageWidth: 2400
+coverImageHeight: 1260
+pubDate: 2026-06-26
+author: "Jakub A. Wasek"
+tags:
+ - benchmarking
+ - ai-gateway
+ - litellm
+ - portkey
+ - bifrost
+ - gomodel
+---
+
+
+
+The point of this benchmark is not to prove that LiteLLM sucks. The point is to
+measure GoModel honestly against the gateways people actually compare it to:
+**LiteLLM, Portkey, and Bifrost**.
+
+That said - yes, LiteLLM sucks, and that is exactly why GoModel exists. (If you're
+not sure what I mean, I'd recommend giving the software a try yourself - or doing
+your own research.)
+
+In October 2025 I tried to build my startup on top of LiteLLM. I quickly found
+out that the software is fundamentally badly designed. A proxy-like server, on
+the hot path of every request, written in Python? On top of that came a long
+tail of operational issues. So I did my research and started writing GoModel: a
+production-grade and enterprise-grade AI gateway / AI control plane, in Go.
+
+The later supply-chain security incident around LiteLLM only confirmed my view.
+Go and its standard-library-heavy dependency trees are structurally far less
+exposed to that class of attack than a sprawling Python dependency graph.
+
+With the motivation out of the way, let's talk about what's actually worth
+measuring in an AI gateway benchmark - the metrics that make a comparison
+meaningful.
+
+When I [launched GoModel on Hacker News](https://news.ycombinator.com/item?id=47861333)
+I told the thread I'd publish a real, reproducible benchmark. Here it is.
+
+## What to measure to choose the best AI gateway
+
+Here is the full list of metrics that matter:
+
+- `p99` / `p95` / `p50` latency (proxy overhead)
+- RAM consumption
+- CPU consumption (and throughput per core)
+- Cold-start time
+- Docker image size
+- Vendor neutrality
+- Open-source licensing
+
+A couple of these deserve a closer look.
+
+### Latency
+
+Latency matters less than you'd assume. Be precise about what we are measuring:
+**proxy overhead latency** - the time the gateway itself adds, on top of the
+upstream call.
+
+The trap is treating latency as the ultimate criterion. In any real workload the
+dominant latency comes from inference, and the gateway's overhead is a small
+fraction of the total. A gateway that is "2x faster" at adding `5 ms` saves about
+`2.5 ms`, roughly 0.1% of the wait once a model takes `2000 ms` to respond.
+
+So I care far more about the *tail* (p99) than the median - a gateway that is
+usually fast but occasionally stalls is worse than one that is boringly
+consistent.
+
+### Resource consumption - CPU, RAM, image size, cold start
+
+These are the metrics that actually move the needle, because they map directly to:
+
+1. The monthly cost of your infrastructure.
+2. Whether you can run the gateway serverless (AWS Lambda, Google Cloud Functions) or on
+ edge devices at all.
+
+A `372 MB` image (`1.2 GB` unpacked) that idles at gigabytes of RAM and takes
+`25 s` to cold-start is a different operational animal than a `16 MB` image that
+peaks at `37 MB` of RAM and is serving traffic `0.56 s` after launch.
+
+## The benchmark
+
+Every gateway talked to the **same instant mock backend**, so the numbers reflect
+gateway overhead, not model latency or network jitter. Each ran one at a time, in
+Docker, on an **AWS `c7i.large`** (2 vCPU, 4 GiB) running the latest **Amazon Linux
+2023** AMI - the whole thing is Terraform'd, runs with one command, and tears itself
+down afterwards.
+
+I actually ran this twice. The **first cut used the free-tier `t2.micro`**
+(1 vCPU, 1 GiB) - cheap, self-destructing, trivial to reproduce. But I realized
+that was *unfair to the competitors*: a 1 GiB box can't hold the memory-heavy
+gateways (LiteLLM idles near a gigabyte), so they spill into **swap** and get
+penalized for the host being too small rather than for their own overhead. So I
+switched to the roomier, non-burstable **`c7i.large`** - nothing swaps there, and a
+fixed-performance instance also removes the CPU-credit drift that muddies the tail
+on burstable boxes. **The relative results barely moved between the two runs** -
+GoModel still won on tail latency, throughput, memory, and image size. Giving the
+heavy gateways enough RAM to not thrash makes the comparison *more* honest, not
+less.
+
+I tested four gateways across six workloads - chat completions, the Responses API,
+and Anthropic messages, each streaming and non-streaming - driven at `8,000`
+requests per workload, concurrency `10`, across **two trials with randomized
+gateway order**. Latency is the **median across trials**, and I report each p99
+with its min-max across trials so a single noisy window can't drive the story.
+
+A few methodology details worth calling out:
+
+- **Throughput is measured, not inferred.** The latency runs report
+ completed-req/s at a fixed concurrency, which is just latency restated
+ (roughly concurrency divided by mean latency). Real capacity comes from a separate
+ **concurrency sweep** that drives each gateway to saturation and records sustained req/s.
+- **I warm up every dialect before measuring it.** LiteLLM lazily imports its
+ per-dialect translation modules on first use, so a naive chat-only warmup left
+ the Responses and Messages paths cold and inflated their tails. I neutralized
+ that to be fair - but note what it tells you: a server that pays an import tax
+ the first time it sees a request type is, again, not designed for the hot path.
+- **Fair resilience config.** Every gateway runs with retries disabled. I also
+ disabled GoModel's circuit breaker for the test - under the saturation sweep a
+ few transient errors would otherwise trip it and it would (correctly, in
+ production) start rejecting requests, which would unfairly zero out its *own*
+ throughput. No other gateway here has a breaker, so off is the apples-to-apples
+ setting.
+- **LiteLLM at its recommended worker count.** A LiteLLM worker is effectively
+ single-threaded, and its own production guidance is one worker per CPU core - so I
+ run it with `num_workers` = the box's vCPU count (`2` here), the same multi-core
+ access the Go gateways get for free. (Pin it to one worker and it under-uses the
+ box; give it more and, as the table shows, its memory balloons. There's no setting
+ that makes it both fast *and* light.)
+- **Streaming uses terminal-marker or idle-gap detection**, so a gateway that
+ streams content without ever sending a terminal event (Bifrost, over a
+ non-native backend) is measured to last byte instead of hanging the harness.
+
+## The comparison
+
+The latency rows show the representative workload: chat completions,
+non-streaming. All resource figures are measured under load on the same box.
+
+| Metric | GoModel | Bifrost | Portkey | LiteLLM |
+|---|--:|--:|--:|--:|
+| Runtime | Go | Go | Node.js | Python |
+| Latency overhead `p50` | **`1.8 ms`** | `2.5 ms` | `9.7 ms` | `30.6 ms` |
+| Latency overhead `p99` | **`6.9 ms`** | `18.3 ms` | `30.5 ms` | `39.3 ms` |
+| Throughput (sustained) | **`4900 req/s`** | `3100 req/s` | `950 req/s` | `324 req/s` |
+| Peak RAM under load | **`37 MB`** | `143 MB` | `112 MB` | `2.3 GB` |
+| Efficiency (req/s per CPU %) | **`52`** | `25` | `8.2` | `2.6` |
+| Cold start to first request | **`0.56 s`** | `7.1 s` | `1.1 s` | `25.5 s` |
+| Docker image (compressed pull) | **`16 MB`** | `77 MB` | `59 MB` | `372 MB` |
+| Workload coverage | `6/6` | `6/6` | `4/6` | `6/6` |
+| Vendor-agnostic | Yes | Partial † | Yes | Yes |
+| Open-source | Yes ‡ | Partial ‡ | Partial ‡ | Yes |
+
+Same numbers, at a glance:
+
+
+
+
+
+
+
+
+
+A few honest notes, because I'd rather you trust the rest of the table:
+
+- **On a non-burstable host the medians are real, and GoModel leads on both ends.**
+ It posts the lowest `p50` (`1.8 ms`) *and* the tightest `p99` (`6.9 ms`).
+ Bifrost is a close second on the median (`2.5 ms`) - but its tail is ~`2.7x`
+ heavier (`18.3 ms`) and it carries ~`4x` the memory under load.
+- **GoModel cold-starts in `0.56 s` versus LiteLLM's ~`25 s`.** That is the
+ difference between viable on a serverless platform and not.
+- **Portkey** does not serve the Anthropic `/v1/messages` dialect in this
+ single-provider setup, hence `4/6` (it supports Anthropic with a fuller
+ virtual-key config; this is a setup limitation, not a hard capability gap).
+- **LiteLLM** ships a `372 MB` compressed image (`1.16 GB` on disk). At its recommended
+ config (one worker per core) it uses **~`2.3 GB` of RAM** - two ~1 GB worker processes -
+ and takes ~`25 s` to cold-start. Running it *properly* for multi-core throughput makes
+ the footprint worse, not better. That is the cost of Python on the hot path.
+- **Bifrost is not a neutral project (†).** It is built by
+ [Maxim AI](https://www.getmaxim.ai/bifrost), an LLM evaluation & observability
+ platform, and ships a first-party plugin that forwards your gateway traffic to
+ Maxim's platform. It routes to many *model* providers, but the gateway itself is
+ a channel into one vendor's ecosystem - not the independent, vendor-neutral tool
+ the "1000+ models" headline implies.
+- **"Open-source" deserves an asterisk (‡).** Portkey keeps its observability
+ storage, dashboard, multi-team RBAC, and at-scale semantic caching in a closed
+ managed tier; Bifrost's core gateway is Apache-2.0 but its Enterprise edition
+ layers on closed/managed features. GoModel is open-source today, with some
+ enterprise-grade features planned to stay private. LiteLLM is the most open of
+ the four - its proxy core is MIT - but even it gates its enterprise features
+ (SSO, audit logs, fine-grained access control) behind a separate *proprietary*
+ commercial license that ships source-available in the `enterprise/` folder, not
+ as free OSS.
+
+## Summary
+
+GoModel leads every measured metric in this comparison: the lowest median *and*
+the tightest latency tail, the highest sustained throughput, the best throughput
+per CPU (~`52` req/s per %), the smallest compressed image (≈`23x` smaller than
+LiteLLM), the lowest peak memory, and the fastest cold start - with full workload coverage.
+
+I've tried to be as objective as I can, and the whole thing is built to be
+**self-verifiable**.
+**[Reproduce it yourself](https://github.com/ENTERPILOT/GoModel/tree/main/docs/2026-06-25_aws_gateway_benchmark)** -
+clone the repo, point it at your AWS account, and run `./run.sh`. It builds the
+images, provisions the AWS instance, runs all four gateways against the same
+backend, prints the tables, and tears the infrastructure back down on its
+own.
+
+One caveat: it runs on **paid** AWS infrastructure, not the free tier. A
+`c7i.large` is about `$0.09`/hour and the run self-destructs within an hour or
+two, so budget **under `$1`** per run to be safe - and if you pass `KEEP=1` or a
+teardown ever fails, you keep paying until you destroy the box, so double-check
+it's gone.
+
+If you have objections to this benchmark, reach out on the GoModel Discord (link
+in the GoModel README on GitHub). And I'd genuinely like to see more impartial
+gateway comparisons out there - bring your own numbers.
diff --git a/docs/2026-06-25_aws_gateway_benchmark/ARTICLE2.md b/docs/2026-06-25_aws_gateway_benchmark/ARTICLE2.md
new file mode 100644
index 00000000..7ea41cdf
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/ARTICLE2.md
@@ -0,0 +1,351 @@
+---
+title: "Benchmarking AI Gateways: GoModel vs LiteLLM vs Portkey vs Bifrost"
+description: "A reproducible AI gateway benchmark comparing GoModel, LiteLLM, Portkey, and Bifrost on latency, throughput, memory, CPU, cold start, and image size."
+coverImage: "/blog/charts/gomodel-gateway-benchmark-june-2026-cover.png"
+coverImageWidth: 2400
+coverImageHeight: 1260
+pubDate: 2026-06-26
+author: "Jakub A. Wasek"
+keywords:
+ - AI gateway benchmark
+ - AI control plane
+ - OpenAI-compatible API
+ - LiteLLM alternative
+ - GoModel
+ - LiteLLM
+ - Portkey
+ - Bifrost
+tags:
+ - benchmarking
+ - ai-gateway
+ - ai-control-plane
+ - litellm
+ - portkey
+ - bifrost
+ - gomodel
+---
+
+
+
+In October 2025 I tried to build my startup on top of LiteLLM.
+
+At first it looked like the obvious choice. It supported many providers, it had
+an OpenAI-compatible API, and it was already used by a lot of people. I did not
+want to write an AI gateway. I wanted to build the product behind it.
+
+Then I started running it on the hot path.
+
+My opinion changed there.
+
+A gateway is not a dashboard or integration glue you call once in a while. It
+sits on every request, every retry, every stream, every tool call, every
+fallback, every timeout.
+
+A heavy gateway charges rent forever.
+
+Most AI gateway comparisons miss that part. They talk about provider count,
+dashboards, tracing, and "support for 1000+ models". Those things matter, but
+they are not free. Before the gateway calls OpenAI, Anthropic, Gemini, vLLM, or
+anything else, it has already spent your CPU, memory, cold-start time, and
+operational budget.
+
+I am not comparing full product maturity here. I am comparing how these gateways
+behave on the hot path.
+
+So I started writing [GoModel](https://github.com/ENTERPILOT/GoModel): a small
+open-source AI gateway and AI control plane in Go, with an OpenAI-compatible API
+and explicit provider adapters.
+
+When I launched GoModel on Hacker News,
+I promised a real, reproducible benchmark. This article is that follow-up.
+
+The benchmark question is simple:
+
+**How lean is each AI gateway when it sits on the request path?**
+
+That question runs through the whole benchmark: GoModel vs LiteLLM vs Portkey vs
+Bifrost, measured by latency, throughput, memory, CPU, cold start, and image
+size rather than landing pages or feature matrices.
+
+## The runtime footprint matters
+
+Latency gets the easiest arguments. It rarely tells the whole story.
+
+Most real LLM calls are dominated by inference time. If a model takes `2000 ms`
+to answer, the difference between `5 ms` and `15 ms` of proxy overhead is not
+the main story.
+
+The main story is the deployment envelope:
+
+- How much RAM does the gateway need under load?
+- How much CPU does it burn per request?
+- How many requests can it serve per core?
+- How fast does it cold-start?
+- How large is the Docker image?
+- Can you run it as a sidecar, on a small VM, in serverless, or near local
+ models?
+- Is the core gateway actually open-source?
+
+Those numbers decide whether the gateway can run where you want it to run.
+
+A `372 MB` compressed image (`1.2 GB` unpacked) that idles at gigabytes of RAM
+and takes `25 s` to cold-start is a different operational proposition from a
+`16 MB` image that peaks at `37 MB` of RAM and is serving traffic `0.56 s` after
+launch.
+
+So I care about the runtime footprint.
+
+## What this benchmark does not prove
+
+This benchmark does **not** prove that one gateway is best for every company.
+
+I am not measuring:
+
+- bug counts or overall correctness
+- semantic cache quality
+- tracing UI quality
+- guardrail quality
+- admin dashboards
+- long-term provider maintenance
+- every possible provider-specific feature
+- total provider count
+
+Those things matter. Some of them matter a lot.
+
+LiteLLM in particular has more integrated providers and more gateway features
+than GoModel today. If your first requirement is maximum provider coverage right
+now, LiteLLM has a real advantage, and this benchmark does not erase that. One
+caveat on provider counts: many smaller or newer providers already expose an
+OpenAI-compatible API, so provider count is not always the same as practical
+routing coverage.
+
+The benchmark measures one narrower thing: **runtime and deployment overhead on
+the request path**.
+
+That still matters, because the gateway is on the hot path. If you run high
+request volume, local models, serverless workloads, edge workloads, or many small
+model calls, the overhead stops being theoretical.
+
+## AI gateway benchmark setup
+
+I tested four AI gateways people actually compare:
+
+- GoModel
+- LiteLLM
+- Portkey
+- Bifrost
+
+Every gateway talked to the **same instant mock backend**, on purpose. I did not
+want to benchmark OpenAI, Anthropic, AWS networking, or random internet jitter.
+I wanted to isolate the gateway itself.
+
+Each gateway ran one at a time, in Docker, on an **AWS `c7i.large`** with
+2 vCPU and 4 GiB RAM, running the latest **Amazon Linux 2023** AMI. The whole
+thing is Terraform'd, runs with one command, and tears itself down afterwards.
+
+I first ran this on a free-tier `t2.micro`. That was cheap and easy to
+reproduce, but unfair to the heavier gateways. A 1 GiB machine cannot hold a
+gateway that wants gigabytes of memory, so it starts swapping. At that point you
+are benchmarking the host being too small.
+
+So I moved to `c7i.large`: still small, but non-burstable and large enough that
+nothing swaps. It also makes the LiteLLM setup more honest. LiteLLM recommends
+one worker per vCPU, and this machine has 2 vCPUs, so LiteLLM gets 2
+workers. That gives it the multi-core access it is supposed to have instead of
+pinning it to a single worker on a tiny box.
+
+The test covered six workloads:
+
+- chat completions, non-streaming
+- chat completions, streaming
+- Responses API, non-streaming
+- Responses API, streaming
+- Anthropic messages, non-streaming
+- Anthropic messages, streaming
+
+Each workload used `8,000` requests at concurrency `10`, across **two trials
+with randomized gateway order**. Latency is the **median across trials**, and I
+report p99 with its min-max range so one noisy window cannot tell the whole
+story.
+
+I would not call this a statistically exhaustive study. It is a reproducible
+engineering benchmark, and the harness is public so people can rerun it, change
+the machine, or add their own workloads.
+
+A few details matter if you want to reproduce or criticize the numbers:
+
+- **Throughput is measured, not inferred.** The latency runs report
+ completed-req/s at fixed concurrency, but real capacity comes from a separate
+ concurrency sweep that drives each gateway to saturation.
+- **Every dialect is warmed up before measurement.** LiteLLM lazily imports some
+ per-dialect translation code on first use. A chat-only warmup made its
+ Responses and Messages paths look worse than they should. I warmed up all
+ dialects to avoid that.
+- **Retries are disabled for all gateways.** I also disabled GoModel's circuit
+ breaker for this benchmark. In production, rejecting traffic after upstream
+ trouble is the right behavior. In a saturation benchmark, it would make the
+ throughput number unfairly low.
+- **LiteLLM runs with its recommended worker count.** A LiteLLM worker is
+ effectively single-threaded, and its production guidance is one worker per
+ vCPU. On this box that means `2` workers.
+- **Streaming uses terminal-marker or idle-gap detection.** If a gateway streams
+ content but never sends a terminal event, the harness measures to last byte
+ instead of hanging forever.
+
+## GoModel vs LiteLLM vs Portkey vs Bifrost
+
+The latency rows show the representative workload: chat completions,
+non-streaming. All resource figures are measured under load on the same box.
+
+| Metric | GoModel | Bifrost | Portkey | LiteLLM |
+|---|--:|--:|--:|--:|
+| Runtime | Go | Go | Node.js | Python |
+| Latency overhead `p50` | **`1.8 ms`** | `2.5 ms` | `9.7 ms` | `30.6 ms` |
+| Latency overhead `p99` | **`6.9 ms`** | `18.3 ms` | `30.5 ms` | `39.3 ms` |
+| Throughput (sustained) | **`4900 req/s`** | `3100 req/s` | `950 req/s` | `324 req/s` |
+| Peak RAM under load | **`37 MB`** | `143 MB` | `112 MB` | `2.3 GB` |
+| Efficiency (req/s per CPU %) | **`52`** | `25` | `8.2` | `2.6` |
+| Cold start to first request | **`0.56 s`** | `7.1 s` | `1.1 s` | `25.5 s` |
+| Docker image (compressed pull) | **`16 MB`** | `77 MB` | `59 MB` | `372 MB` |
+| Workload coverage | `6/6` | `6/6` | `4/6` | `6/6` |
+| Vendor-neutral core | Yes | Partial † | Yes | Yes |
+| Core source available | Yes ‡ | Partial ‡ | Partial ‡ | Yes |
+
+Same numbers, at a glance:
+
+
+
+
+
+
+
+
+
+## What stood out
+
+GoModel had the lowest median latency and the tightest tail: `1.8 ms` p50 and
+`6.9 ms` p99.
+
+Bifrost was close on median latency at `2.5 ms`, which is a good result. The
+gap opened at the tail and in memory: `18.3 ms` p99 and `143 MB` peak RAM under
+load.
+
+Portkey was heavier than I expected for this narrow proxy benchmark. It served
+`950 req/s` sustained and used `112 MB` peak RAM under load. In this setup it did
+not serve the Anthropic `/v1/messages` dialect, so it gets `4/6` workload
+coverage. Treat that as a setup limitation, not a claim that Portkey cannot
+support Anthropic in a fuller virtual-key configuration.
+
+LiteLLM was the outlier. At its recommended worker count, it used about
+`2.3 GB` of RAM, cold-started in `25.5 s`, and sustained `324 req/s`.
+
+That is not because Python is morally bad. The language matters only when it
+changes the deployment envelope. Here it does: memory floor, image size,
+cold-start time, dependency graph, and throughput per core.
+
+The later supply-chain incident around LiteLLM
+also made me more confident in GoModel's design direction. A small Go binary
+with a standard-library-heavy dependency tree is structurally less exposed to
+that class of problem than a large Python dependency graph.
+
+## What AI gateway benchmarks do not capture
+
+Forwarding JSON is not the hard part.
+
+The hard part is provider drift.
+
+OpenAI, Anthropic, Gemini, AWS Bedrock, Azure OpenAI, Groq, xAI, Cerebras, vLLM,
+and local servers all disagree in small ways. Then they change those ways. Tool
+calling changes. Streaming changes. Reasoning parameters change. Image inputs
+change. Error formats change. Rate-limit semantics change.
+
+An AI gateway or AI control plane has to absorb that without becoming magic.
+
+GoModel's bet is not "support every model name on the internet".
+
+The bet is:
+
+- support the providers people actually deploy
+- keep provider adapters explicit
+- accept OpenAI-compatible requests generously
+- translate only what needs translation
+- pass through what should stay provider-specific
+- return conservative OpenAI-compatible responses
+
+For the same reason, GoModel starts as a small OpenAI-compatible gateway, not as
+a dashboard with a proxy attached.
+
+## Why this matters for local models and vLLM
+
+If all your traffic goes to a cloud model that takes several seconds to answer,
+gateway overhead can look academic.
+
+Local models change the math.
+
+If you are routing through an AI gateway to vLLM, Ollama, LM Studio, llama.cpp,
+or small specialized models on your own network, the model call can be much
+faster. Then gateway overhead, cold starts, memory, and sidecar size matter more.
+
+One reason I want GoModel to stay small: a gateway should be cheap enough to put
+near the workload.
+
+## Notes on neutrality and open source
+
+Bifrost is built by Maxim AI, an LLM
+evaluation and observability platform. It routes to many model providers, but
+the gateway also sits close to Maxim's eval and observability ecosystem. If you
+want to choose your own eval platform, or stay independent from any eval
+platform, ask whether Bifrost is the right match for you. Good software can
+still have incentives attached. "Vendor-neutral" needs an asterisk here.
+
+"Open-source" also needs care.
+
+Portkey keeps observability storage, dashboard, multi-team RBAC, and at-scale
+semantic caching in a closed managed tier. Bifrost's core gateway is Apache-2.0,
+but its Enterprise edition adds closed or managed features. LiteLLM's proxy core
+is MIT, but enterprise features like SSO, audit logs, and fine-grained access
+control sit behind a proprietary commercial license.
+
+GoModel is open-source today. Some enterprise-grade AI control plane features may
+stay private. The core gateway is intended to remain useful without those private
+features.
+
+## Reproduce it yourself
+
+The benchmark is built to be self-verifiable. It provisions the AWS instance,
+runs every gateway against the same backend, prints the tables, and destroys the
+infrastructure.
+
+**[Reproduce it yourself](https://github.com/ENTERPILOT/GoModel/tree/main/docs/2026-06-25_aws_gateway_benchmark)**:
+
+```bash
+./run.sh
+```
+
+One caveat: it runs on **paid** AWS infrastructure, not the free tier. A
+`c7i.large` is about `$0.09`/hour and the run self-destructs within an hour or
+two, so budget **under `$1`** per run to be safe.
+
+If you pass `KEEP=1` or teardown fails, you keep paying until you destroy the
+box, so double-check the teardown.
+
+## Conclusion
+
+I did not start GoModel because I wanted another AI gateway in the world.
+
+I started it because the gateway I wanted to use became part of the problem. It
+sat on the hot path, but did not feel like hot-path software: too heavy, too
+slow to start, too expensive to keep around, too large for the job.
+
+This benchmark is the result of turning that frustration into numbers.
+
+The numbers say GoModel is small in the places I care about: `16 MB` image,
+`37 MB` peak RAM, `0.56 s` cold start, `1.8 ms` p50, `6.9 ms` p99, and
+`4900 req/s` sustained throughput on a small AWS box.
+
+LiteLLM still has more providers and more features today. Portkey and Bifrost
+have their own strengths. But if the gateway is going to sit between your users
+and every model call, I think it should first be cheap, predictable, and boring
+to run.
+
+GoModel is my attempt to build that kind of gateway.
diff --git a/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-image.svg b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-image.svg
new file mode 100644
index 00000000..51f6aa12
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-image.svg
@@ -0,0 +1,19 @@
+
diff --git a/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-latency-p99.svg b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-latency-p99.svg
new file mode 100644
index 00000000..cac41ab0
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-latency-p99.svg
@@ -0,0 +1,19 @@
+
diff --git a/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-memory.svg b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-memory.svg
new file mode 100644
index 00000000..f6dd3ce2
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-memory.svg
@@ -0,0 +1,19 @@
+
diff --git a/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-throughput.svg b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-throughput.svg
new file mode 100644
index 00000000..4ea70ef6
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-throughput.svg
@@ -0,0 +1,19 @@
+
diff --git a/docs/2026-06-25_aws_gateway_benchmark/cover-b.png b/docs/2026-06-25_aws_gateway_benchmark/cover-b.png
new file mode 100644
index 00000000..9b2e833c
Binary files /dev/null and b/docs/2026-06-25_aws_gateway_benchmark/cover-b.png differ
diff --git a/docs/2026-06-25_aws_gateway_benchmark/cover.png b/docs/2026-06-25_aws_gateway_benchmark/cover.png
new file mode 100644
index 00000000..0da1dbcf
Binary files /dev/null and b/docs/2026-06-25_aws_gateway_benchmark/cover.png differ
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/.gitignore b/docs/2026-06-25_aws_gateway_benchmark/qa/.gitignore
new file mode 100644
index 00000000..179b4868
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/.gitignore
@@ -0,0 +1,3 @@
+output/
+__pycache__/
+*.pyc
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/README.md b/docs/2026-06-25_aws_gateway_benchmark/qa/README.md
new file mode 100644
index 00000000..8f489897
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/README.md
@@ -0,0 +1,152 @@
+# GoModel quality (QA) suite
+
+A curated corpus of ~50 complex requests that exercises every client-facing
+dialect and modality of the gateway against **real providers**
+(OpenAI / Anthropic / Gemini), then **registers** and **rates** each one.
+
+It answers a different question than the latency benchmark next door
+(`docs/2026-06-25_aws_gateway_benchmark/`): not *how fast/cheap* the gateway is,
+but *does it correctly accept, translate, and normalize real-world requests* —
+the Postel's-law contract.
+
+For every case the suite records:
+
+- the **request as sent** (after model-role and variable resolution);
+- the **response** received (status, headers, body, or assembled SSE text);
+- **how the gateway recorded/normalized it** — pulled from the audit log:
+ the inbound request body it captured, the normalized response body it
+ returned, the resolved provider/model, and token usage;
+
+and rates it `PASS` / `FAIL` / `ERROR` / `SKIP`, plus a 0–100 **quality score**
+for soft modality checks (did the vision model name the colour, did STT recover
+the spoken words).
+
+## What it covers
+
+| Dialect / endpoint | Providers | Modalities exercised |
+|---|---|---|
+| `/v1/chat/completions` | OpenAI, Anthropic, Gemini | text, multi-turn, streaming, vision, tools, reasoning, structured output, field preservation |
+| `/v1/responses` | OpenAI, Anthropic, Gemini | text, multimodal input, streaming, tools, structured output, reasoning, conversation linkage |
+| `/v1/messages` (+ `/count_tokens`) | Anthropic | native shape, system prompt, streaming SSE, vision blocks, tool_use, extended thinking, default `max_tokens` injection |
+| `/v1/conversations` | OpenAI | create → get → use-in-Responses → update → delete (stateful) |
+| `/v1/audio/speech`, `/v1/audio/transcriptions` | OpenAI | TTS, and a TTS→STT round-trip that recovers the spoken words |
+| `/v1/embeddings` | OpenAI | single + batch |
+| error normalization | OpenAI, Anthropic | unknown model, unsupported `input_audio`, malformed JSON |
+
+## How "field preservation" is verified (and its honest limit)
+
+GoModel's audit log captures the **inbound** client request body and the
+**normalized** response body it returns — *not* the upstream provider-translated
+request. So the suite verifies translation two ways:
+
+1. **Behaviorally** — e.g. the reasoning case sends `max_tokens` to a model that
+ rejects it upstream; a `200` proves the gateway mapped it to
+ `max_completion_tokens` and dropped `temperature`. The audio-rejection case
+ proves an unsupported modality fails cleanly (4xx) rather than crashing.
+2. **From the audit record** — extra/unknown request fields (`x_qa_marker`,
+ `metadata`) are asserted present in the captured inbound body, and
+ provider-specific response extras (`system_fingerprint`, `service_tier`,
+ `stop_reason`, `usage`) are asserted preserved in the normalized response.
+
+Audit cross-checks are **soft** by default: if audit bodies are off or the entry
+hasn't flushed, those checks are skipped with a note, never a false failure.
+
+## Prerequisites
+
+Run the gateway with audit logging **and bodies** enabled so the preservation
+checks have data:
+
+```bash
+LOGGING_ENABLED=true \
+LOGGING_LOG_BODIES=true \
+LOGGING_LOG_HEADERS=true \
+LOGGING_LOG_AUDIO_BODIES=true \
+LOGGING_FLUSH_INTERVAL=2 \
+./gomodel # or: go run ./cmd/gomodel
+```
+
+Provider keys come from the gateway's environment (`OPENAI_API_KEY`,
+`ANTHROPIC_API_KEY`, `GEMINI_API_KEY`). The harness authenticates with
+`GOMODEL_API_KEY` or `GOMODEL_MASTER_KEY` (env first, then `GOMODEL_MASTER_KEY` in the repo `.env`).
+
+> This calls real providers and spends real money — modest (a few cents) for one
+> full run, since payloads are tiny and `max_tokens` is capped on every case.
+
+## Run it
+
+```bash
+cd docs/2026-06-25_aws_gateway_benchmark/qa
+python3 run_qa.py # full corpus against http://localhost:8080
+python3 run_qa.py --only chat # filter by id/group/provider substring
+python3 run_qa.py --only openai
+python3 run_qa.py --no-audit # skip audit cross-checks (faster, fewer assertions)
+python3 run_qa.py --list # list matching cases, don't run
+python3 run_qa.py --gateway http://host:8080
+```
+
+Stdlib only — no `pip install`. Exit code is non-zero if any case failed or
+errored. Results land in `output//`:
+
+- `results.json` — full per-case record (request sent, response, audit view, every assertion)
+- `report.md` — readable table + a drill-down of failed/errored cases
+
+## Adapt to your account
+
+The spec never hardcodes a model id. Cases reference logical roles
+(`@openai.chat`, `@anthropic.thinking`, `@gemini.vision`); edit `models.json` to
+map them to models your keys can reach. A role with no mapping makes its cases
+`SKIP`, never fail. Image inputs (`@image.red` / `@imageb64.red`) are generated
+solid-colour PNGs — no binary assets in the repo.
+
+## Layout
+
+```
+run_qa.py     orchestrator + assertion evaluation + CLI
+models.json   logical model roles -> concrete model ids (edit this)
+spec/         declarative cases, one JSON file per endpoint group
+qalib/        stdlib helpers: config, paths, assertions, client, report
+output/       run artifacts (gitignored)
+```
+
+## Case schema (quick reference)
+
+```jsonc
+{
+  "id": "chat.openai.multiturn",             // unique
+  "title": "...", "provider": "openai",
+  "modality": ["text"],                      // labels for reporting
+  "request": {
+    "method": "POST",                        // default POST
+    "path": "/v1/chat/completions",          // may contain ${captured_var}
+    "headers": {"X-QA-Marker": "keep"},
+    "stream": false,
+    "body": { "model": "@openai.chat", "...": "..." },
+    "raw_body": "…",                         // send verbatim (malformed-JSON tests)
+    "produce": "tts_then_stt",               // composite: TTS then transcribe its output
+    "tts": {...}, "stt": {...}               // inputs for produce=tts_then_stt
+  },
+  "capture": { "conversation_id": "$.id" },  // save response values for later ${vars}
+  "expect": {
+    "status": 200,                           // int or list
+    "headers": [ {"name": "X-Request-Id", "present": true} ],
+    "body": [ {"field": "content_type", "contains": "audio/"},
+              {"field": "bytes", "gte": 2000},
+              {"field": "text", "not_empty": true} ],
+    "response": [ {"path": "$.choices[0].message.content", "not_empty": true} ],
+    "stream": { "min_events": 2, "terminal": "[DONE]",
+                "event_types": ["message_start"], "text": [{"not_empty": true}] },
+    "audit": [ {"path": "$.provider", "equals": "openai"},
+               {"path": "$.data.request_body.x_qa_marker", "equals": "keep"} ],
+    "quality": [ {"target": "response:$.output[0].content[0].text",
+                  "contains_any": ["paris"]} ]  // soft; feeds the score
+  }
+}
+```
+
+**Operators** (one per assertion): `present` · `absent` · `equals` ·
+`not_equals` · `not_empty` · `contains` · `not_contains` · `contains_any` ·
+`contains_all` · `regex` · `gt` · `gte` · `lt` · `lte` · `type` · `length_gte` ·
+`one_of`. Add `"hard": false` to make a failure a soft signal instead of failing
+the case (audit and quality checks are soft by default).
+
+**Quality targets:** `stream` · `body.text` · `response:$.path` · `audit:$.path`.
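
Reviewer note on the case schema: the README asks for one operator per assertion, and `apply_operator` in `qalib/assertions.py` returns as soon as it meets the first non-metadata key, so a second operator in the same object would be silently ignored. A minimal sketch of a pre-flight lint for that, assuming the operator list and metadata keys shown in this PR; `operators_in` and `lint_case` are hypothetical helpers, not part of the harness.

```python
# Hypothetical pre-flight lint for spec cases; not part of the harness.
OPERATORS = {
    "present", "absent", "equals", "not_equals", "not_empty", "contains",
    "not_contains", "contains_any", "contains_all", "regex", "gt", "gte",
    "lt", "lte", "type", "length_gte", "one_of",
}
# Keys that locate or annotate a value rather than test it.
METADATA = {"path", "field", "name", "hard", "note", "target"}


def operators_in(assertion):
    """Return the sorted non-metadata keys of one assertion object."""
    return sorted(k for k in assertion if k not in METADATA)


def lint_case(case):
    """Return readable problems for the list-valued `expect` sections."""
    problems = []
    expect = case.get("expect", {})
    for section in ("headers", "body", "response", "audit", "quality"):
        for i, assertion in enumerate(expect.get(section, [])):
            ops = operators_in(assertion)
            unknown = [op for op in ops if op not in OPERATORS]
            if unknown:
                problems.append(f"{section}[{i}]: unknown operator(s) {unknown}")
            elif len(ops) != 1:
                problems.append(f"{section}[{i}]: expected 1 operator, found {len(ops)}")
    return problems


case = {
    "id": "chat.openai.multiturn",
    "expect": {
        "status": 200,
        "response": [
            {"path": "$.choices[0].message.content", "not_empty": True},
            {"path": "$.usage.total_tokens", "gt": 0, "lt": 500},  # two operators
        ],
        "audit": [{"path": "$.provider", "equal": "openai"}],  # typo: "equal"
    },
}
for problem in lint_case(case):
    print(problem)
```

The sketch only walks the flat assertion lists; the nested `stream.text` list would need the same treatment.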
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/models.json b/docs/2026-06-25_aws_gateway_benchmark/qa/models.json
new file mode 100644
index 00000000..98dfdc3e
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/models.json
@@ -0,0 +1,20 @@
+{
+  "_comment": "Logical model roles used by the spec (@openai.chat, @anthropic.thinking, ...). Edit these to match the models your account/keys can reach. Image roles (@image.red/blue/green) are generated by the harness and need no entry.",
+  "openai": {
+    "chat": "gpt-4.1-mini",
+    "vision": "gpt-4.1-mini",
+    "reasoning": "gpt-5-mini",
+    "tts": "gpt-4o-mini-tts",
+    "stt": "gpt-4o-mini-transcribe",
+    "embed": "text-embedding-3-small"
+  },
+  "anthropic": {
+    "chat": "claude-sonnet-4-6",
+    "vision": "claude-sonnet-4-6",
+    "thinking": "claude-opus-4-8"
+  },
+  "gemini": {
+    "chat": "gemini-2.5-flash",
+    "vision": "gemini-2.5-flash"
+  }
+}
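
Reviewer note on `models.json`: a role token such as `@openai.chat` is a dotted lookup into this file. The sketch below is a trimmed-down illustration of that lookup, not the harness code (the real one is `resolve_roles` in `qalib/config.py`, which returns the token unchanged and records it as unresolved). `resolve_role` is a hypothetical name, and returning `None` for an unmapped role is a simplification made here.

```python
# Illustrative only: a simplified role lookup, not the harness implementation.
MODELS = {
    "openai": {"chat": "gpt-4.1-mini", "embed": "text-embedding-3-small"},
    "anthropic": {"chat": "claude-sonnet-4-6"},
}


def resolve_role(token, models):
    """Map '@provider.role' to a model id; return None when it is unmapped."""
    if not token.startswith("@"):
        return token  # plain strings pass through untouched
    node = models
    for part in token[1:].split("."):
        if not isinstance(node, dict) or part not in node:
            return None  # the harness would mark the case SKIP here
        node = node[part]
    return node


print(resolve_role("@openai.chat", MODELS))
print(resolve_role("@gemini.vision", MODELS))
```

Removing a provider block from the mapping is therefore enough to skip every case that depends on it.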
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/__init__.py b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/__init__.py
new file mode 100644
index 00000000..ebb6a9b8
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/__init__.py
@@ -0,0 +1,9 @@
+"""qalib — small helpers for the GoModel quality (QA) harness.
+
+Stdlib-only. Split into focused modules so each stays readable:
+ config — gateway URL, master key, model/image role resolution, spec loading
+ paths — JSON-path mini-language + deterministic image fixtures
+ assertions — declarative assertion operators
+ client — HTTP send (JSON / multipart / SSE) + audit-log lookup
+ report — console table + results.json + report.md
+"""
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/assertions.py b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/assertions.py
new file mode 100644
index 00000000..93a8d78b
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/assertions.py
@@ -0,0 +1,93 @@
+"""Declarative assertion operators.
+
+Each assertion object names exactly one operator plus optional metadata:
+
+    {"path": "$.usage.total_tokens", "gt": 0}
+    {"path": "$.choices[0].message.content", "not_empty": true}
+    {"path": "$.system_fingerprint", "present": true, "hard": false}
+
+`hard` (default true) decides whether a failure fails the case or is recorded
+as a soft/quality signal. The caller locates the value (from a response body,
+header, stream, or audit entry) and passes (found, value) here.
+"""
+import re
+
+from .paths import json_type
+
+# Operators that are meaningful even when the value is absent.
+_ABSENCE_OPS = {"present", "absent"}
+
+
+def _as_number(v):
+    try:
+        return float(v)
+    except (TypeError, ValueError):
+        return None
+
+
+def apply_operator(assertion, found, value):
+    """Evaluate one assertion. Returns (ok: bool, reason: str)."""
+    for op in assertion:
+        if op in ("path", "field", "name", "hard", "note", "target"):
+            continue
+        expected = assertion[op]
+
+        if op == "present":
+            ok = found is expected if isinstance(expected, bool) else found
+            return ok, f"present={found}"
+        if op == "absent":
+            return (not found), f"present={found}"
+
+        # All remaining operators require the value to exist.
+        if not found and op not in _ABSENCE_OPS:
+            return False, "value not found"
+
+        if op == "equals":
+            return value == expected, f"{value!r} == {expected!r}"
+        if op == "not_equals":
+            return value != expected, f"{value!r} != {expected!r}"
+        if op == "not_empty":
+            empty = value is None or value == "" or value == [] or value == {}
+            return (not empty), f"non-empty (got {_short(value)})"
+        if op == "contains":
+            return str(expected).lower() in str(value).lower(), f"contains {expected!r}"
+        if op == "not_contains":
+            return str(expected).lower() not in str(value).lower(), f"not contains {expected!r}"
+        if op == "contains_any":
+            hay = str(value).lower()
+            hit = next((w for w in expected if str(w).lower() in hay), None)
+            return hit is not None, f"any{expected} -> {hit!r}"
+        if op == "contains_all":
+            hay = str(value).lower()
+            miss = [w for w in expected if str(w).lower() not in hay]
+            return not miss, f"all present (missing {miss})"
+        if op == "regex":
+            return re.search(expected, str(value)) is not None, f"~ /{expected}/"
+        if op in ("gt", "gte", "lt", "lte"):
+            n, e = _as_number(value), _as_number(expected)
+            if n is None or e is None:
+                return False, f"non-numeric {value!r}"
+            ok = {"gt": n > e, "gte": n >= e, "lt": n < e, "lte": n <= e}[op]
+            return ok, f"{n} {op} {e}"
+        if op == "type":
+            return json_type(value) == expected, f"type {json_type(value)} == {expected}"
+        if op == "length_gte":
+            try:
+                return len(value) >= expected, f"len {len(value)} >= {expected}"
+            except TypeError:
+                return False, f"no length: {value!r}"
+        if op == "one_of":
+            return value in expected, f"{value!r} in {expected}"
+
+        return False, f"unknown operator {op!r}"
+
+    return False, "empty assertion"
+
+
+def is_hard(assertion):
+    return assertion.get("hard", True)
+
+
+def _short(value, n=60):
+    s = repr(value)
+    return s if len(s) <= n else s[: n - 1] + "…"
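
Reviewer note on operator semantics: two behaviours of `apply_operator` are easy to miss when writing specs. `contains` stringifies and lowercases both sides, so it matches against the `str()` text of lists and dicts, and the numeric comparisons push both sides through `float()`. The sketch below re-creates just those two rules in isolation so they can be tried without the harness; `contains` and `compare` are illustrative names, not harness functions.

```python
def contains(value, expected):
    # Same rule as the `contains` operator: stringify, lowercase, substring test.
    return str(expected).lower() in str(value).lower()


def compare(op, value, expected):
    # Same rule as gt/gte/lt/lte: coerce both sides with float(); failures are False.
    try:
        n, e = float(value), float(expected)
    except (TypeError, ValueError):
        return False
    return {"gt": n > e, "gte": n >= e, "lt": n < e, "lte": n <= e}[op]


print(contains(["Paris", "Lyon"], "paris"))  # a list is matched via its str() text
print(contains({"city": "Paris"}, "CITY"))   # so dict keys can match as well
print(compare("gte", "2048", 2000))          # numeric strings are coerced
print(compare("gt", True, 0))                # booleans coerce through float() too
print(compare("lt", None, 5))                # None cannot be coerced
```

When an exact, case-sensitive match matters, `equals` or `regex` is probably the safer operator.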
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/client.py b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/client.py
new file mode 100644
index 00000000..18830604
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/client.py
@@ -0,0 +1,206 @@
+"""HTTP client for the QA harness: JSON, multipart, and SSE, plus audit lookup.
+
+Stdlib only (urllib). Every gateway call carries a unique X-Request-Id and a
+run-scoped X-GoModel-User-Path so the matching audit entry can be found, which
+is how the harness inspects what the gateway *recorded* it received and
+returned (request/response bodies, provider, resolved model, usage).
+"""
+import json
+import time
+import urllib.error
+import urllib.request
+import uuid
+
+
+class Result:
+    """Captured outcome of one HTTP exchange."""
+
+    def __init__(self):
+        self.status = 0
+        self.headers = {}
+        self.request_id = ""
+        self.json = None          # parsed JSON body (if any)
+        self.text = None          # text body (non-JSON)
+        self.raw = b""            # raw body bytes (binary, e.g. TTS audio)
+        self.bytes = 0            # raw body length
+        self.content_type = ""
+        self.events = []          # parsed SSE event objects
+        self.stream_text = ""     # assembled assistant text from a stream
+        self.terminal = None      # terminal SSE marker seen ("[DONE]", "message_stop", ...)
+        self.error = None         # transport-level exception text
+
+
+class Client:
+    def __init__(self, base_url, api_key, user_path, timeout=120):
+        self.base = base_url.rstrip("/")
+        self.api_key = api_key
+        self.user_path = user_path
+        self.timeout = timeout
+
+    def _common_headers(self, request_id, extra):
+        h = {
+            "Authorization": f"Bearer {self.api_key}",
+            "X-Request-ID": request_id,
+            "X-GoModel-User-Path": self.user_path,
+        }
+        if extra:
+            h.update(extra)
+        return h
+
+    # ── JSON / raw request, optionally streaming ────────────────────────────
+    def send(self, method, path, body=None, headers=None, stream=False, raw_body=None):
+        rid = "qa-" + uuid.uuid4().hex[:24]
+        res = Result()
+        res.request_id = rid
+        url = self.base + path
+        hdrs = self._common_headers(rid, headers)
+
+        data = None
+        if raw_body is not None:
+            data = raw_body.encode("utf-8")
+            hdrs.setdefault("Content-Type", "application/json")
+        elif body is not None:
+            data = json.dumps(body).encode("utf-8")
+            hdrs["Content-Type"] = "application/json"
+
+        req = urllib.request.Request(url, data=data, method=method, headers=hdrs)
+        try:
+            resp = urllib.request.urlopen(req, timeout=self.timeout)
+            self._capture(res, resp, stream)
+        except urllib.error.HTTPError as e:
+            res.status = e.code
+            self._capture(res, e, stream=False)
+        except Exception as e:  # noqa: BLE001 — surface any transport failure as ERROR
+            res.error = f"{type(e).__name__}: {e}"
+        return res
+
+    # ── multipart/form-data (audio transcriptions) ──────────────────────────
+    def send_multipart(self, path, fields, file_field, filename, file_bytes,
+                       file_content_type, headers=None):
+        rid = "qa-" + uuid.uuid4().hex[:24]
+        res = Result()
+        res.request_id = rid
+        boundary = "----qa" + uuid.uuid4().hex
+        parts = []
+        for k, v in (fields or {}).items():
+            parts.append(f"--{boundary}\r\n".encode())
+            parts.append(f'Content-Disposition: form-data; name="{k}"\r\n\r\n'.encode())
+            parts.append(f"{v}\r\n".encode())
+        parts.append(f"--{boundary}\r\n".encode())
+        parts.append(
+            f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'.encode())
+        parts.append(f"Content-Type: {file_content_type}\r\n\r\n".encode())
+        parts.append(file_bytes)
+        parts.append(f"\r\n--{boundary}--\r\n".encode())
+        data = b"".join(parts)
+
+        hdrs = self._common_headers(rid, headers)
+        hdrs["Content-Type"] = f"multipart/form-data; boundary={boundary}"
+        req = urllib.request.Request(self.base + path, data=data, method="POST", headers=hdrs)
+        try:
+            resp = urllib.request.urlopen(req, timeout=self.timeout)
+            self._capture(res, resp, stream=False)
+        except urllib.error.HTTPError as e:
+            res.status = e.code
+            self._capture(res, e, stream=False)
+        except Exception as e:  # noqa: BLE001
+            res.error = f"{type(e).__name__}: {e}"
+        return res
+
+    # ── response capture ────────────────────────────────────────────────────
+    def _capture(self, res, resp, stream):
+        res.status = getattr(resp, "status", res.status) or res.status
+        try:
+            res.headers = {k.lower(): v for k, v in resp.headers.items()}
+        except Exception:  # noqa: BLE001
+            res.headers = {}
+        res.request_id = res.headers.get("x-request-id", res.request_id)
+        res.content_type = res.headers.get("content-type", "")
+
+        if stream and "text/event-stream" in res.content_type:
+            self._read_sse(res, resp)
+            return
+
+        raw = resp.read()
+        res.raw = raw
+        res.bytes = len(raw)
+        if "application/json" in res.content_type:
+            try:
+                res.json = json.loads(raw.decode("utf-8"))
+            except Exception:  # noqa: BLE001
+                res.text = raw.decode("utf-8", "replace")
+        elif res.content_type.startswith("text/"):
+            res.text = raw.decode("utf-8", "replace")
+        # binary (audio) bodies: json/text stay None; the raw bytes remain in res.raw.
+
+    def _read_sse(self, res, resp):
+        for rawline in resp:
+            line = rawline.decode("utf-8", "replace").rstrip("\n").rstrip("\r")
+            if not line or line.startswith(":"):
+                continue
+            if not line.startswith("data:"):
+                continue
+            payload = line[len("data:"):].strip()
+            if payload == "[DONE]":
+                res.terminal = "[DONE]"
+                continue
+            try:
+                ev = json.loads(payload)
+            except Exception:  # noqa: BLE001
+                continue
+            res.events.append(ev)
+            self._accumulate(res, ev)
+
+    @staticmethod
+    def _accumulate(res, ev):
+        """Assemble assistant text across the three streaming dialects and note
+        terminal markers."""
+        etype = ev.get("type")
+        if etype in ("response.completed", "message_stop", "response.output_text.done"):
+            res.terminal = etype
+        # chat.completions: choices[].delta.content
+        for ch in ev.get("choices", []) or []:
+            delta = ch.get("delta") or {}
+            if isinstance(delta.get("content"), str):
+                res.stream_text += delta["content"]
+        # responses: output_text deltas
+        if etype == "response.output_text.delta" and isinstance(ev.get("delta"), str):
+            res.stream_text += ev["delta"]
+        # anthropic messages: content_block_delta.text
+        if etype == "content_block_delta":
+            d = ev.get("delta") or {}
+            if isinstance(d.get("text"), str):
+                res.stream_text += d["text"]
+
+    # ── audit lookup ────────────────────────────────────────────────────────
+    def fetch_audit(self, request_id, attempts=6, delay=1.5):
+        """Find the audit entry for a request_id (retrying for flush lag) and
+        return the full detail entry, or None."""
+        for i in range(attempts):
+            entry_id = self._find_entry_id(request_id)
+            if entry_id:
+                detail = self._get_json(f"/admin/audit/detail?log_id={entry_id}")
+                if detail:
+                    return detail
+            if i < attempts - 1:
+                time.sleep(delay)
+        return None
+
+    def _find_entry_id(self, request_id):
+        listing = self._get_json(f"/admin/audit/log?search={request_id}&limit=20")
+        if not listing:
+            return None
+        for entry in listing.get("entries", []):
+            if entry.get("request_id") == request_id:
+                return entry.get("id")
+        return None
+
+    def _get_json(self, path):
+        req = urllib.request.Request(
+            self.base + path, method="GET",
+            headers={"Authorization": f"Bearer {self.api_key}"})
+        try:
+            resp = urllib.request.urlopen(req, timeout=self.timeout)
+            return json.loads(resp.read().decode("utf-8"))
+        except Exception:  # noqa: BLE001
+            return None
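
Reviewer note on the SSE reader: `_read_sse` keeps only `data:` lines, treats `[DONE]` as a terminal marker, and hands each JSON payload to `_accumulate`. The sketch below replays that loop over canned bytes for the chat-completions dialect only, as a way to exercise the parsing rules without a gateway. `parse_sse` is a hypothetical stand-alone function, and the sample stream is made up.

```python
import json


def parse_sse(lines):
    """Return (events, text, terminal) from an iterable of raw SSE byte lines."""
    events, text, terminal = [], "", None
    for raw in lines:
        line = raw.decode("utf-8", "replace").rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separators, ":" comments, "event:" lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            terminal = "[DONE]"
            continue
        try:
            event = json.loads(payload)
        except ValueError:
            continue  # non-JSON payloads are ignored
        events.append(event)
        # chat.completions dialect: text arrives in choices[].delta.content
        for choice in event.get("choices") or []:
            content = (choice.get("delta") or {}).get("content")
            if isinstance(content, str):
                text += content
    return events, text, terminal


stream = [
    b": keep-alive\n",
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}\n',
    b"\n",
    b'data: {"choices": [{"delta": {"content": "Par"}}]}\n',
    b'data: {"choices": [{"delta": {"content": "is"}}]}\r\n',
    b"data: [DONE]\n",
]
events, text, terminal = parse_sse(stream)
print(len(events), repr(text), terminal)
```

The Responses and Anthropic dialects would add their own branches keyed on the event `type`, as `_accumulate` does.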
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/config.py b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/config.py
new file mode 100644
index 00000000..a3072b4e
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/config.py
@@ -0,0 +1,114 @@
+"""Config loading: master key, model/image roles, spec files.
+
+The spec never hardcodes a concrete model id. Cases reference logical roles
+("@openai.chat", "@anthropic.thinking", "@image.red") that resolve through
+`models.json`, so a user adapts the whole corpus to their account by editing
+one file.
+"""
+import glob
+import json
+import os
+
+from .paths import png_base64, png_data_url
+
+_COLORS = {"red": (220, 30, 30), "blue": (30, 60, 220), "green": (30, 180, 70)}
+
+# @image. -> data: URL (chat/responses image_url form)
+# @imageb64. -> raw base64 (native Anthropic image source.data)
+IMAGES = {name: png_data_url(rgb) for name, rgb in _COLORS.items()}
+IMAGES_B64 = {name: png_base64(rgb) for name, rgb in _COLORS.items()}
+
+
+def load_master_key(repo_root):
+    """Master/admin key: env first, then the repo .env (never printed)."""
+    key = os.environ.get("GOMODEL_API_KEY") or os.environ.get("GOMODEL_MASTER_KEY")
+    if key:
+        return key.strip()
+    env_path = os.path.join(repo_root, ".env")
+    if os.path.exists(env_path):
+        with open(env_path, encoding="utf-8") as f:
+            for line in f:
+                line = line.strip()
+                if line.startswith("GOMODEL_MASTER_KEY="):
+                    return line.split("=", 1)[1].strip().strip('"').strip("'")
+    return ""
+
+
+def load_models(path):
+    with open(path, encoding="utf-8") as f:
+        return json.load(f)
+
+
+def load_specs(spec_dir, only=None):
+    """Load and concatenate every spec/*.json (sorted by filename, then array
+    order). `only` filters by substring against id / group / provider."""
+    cases = []
+    for path in sorted(glob.glob(os.path.join(spec_dir, "*.json"))):
+        with open(path, encoding="utf-8") as f:
+            data = json.load(f)
+        for case in data:
+            case.setdefault("group", os.path.splitext(os.path.basename(path))[0])
+            cases.append(case)
+    if only:
+        needle = only.lower()
+        cases = [c for c in cases
+                 if needle in c.get("id", "").lower()
+                 or needle in c.get("group", "").lower()
+                 or needle in c.get("provider", "").lower()]
+    return cases
+
+
+def resolve_roles(obj, models):
+    """Recursively replace @provider.role and @image.name tokens with concrete
+    values. Returns (resolved_obj, unresolved_roles)."""
+    unresolved = []
+
+    def walk(node):
+        if isinstance(node, str):
+            if node.startswith("@imageb64."):
+                name = node[len("@imageb64."):]
+                if name in IMAGES_B64:
+                    return IMAGES_B64[name]
+                unresolved.append(node)
+                return node
+            if node.startswith("@image."):
+                name = node[len("@image."):]
+                if name in IMAGES:
+                    return IMAGES[name]
+                unresolved.append(node)
+                return node
+            if node.startswith("@"):
+                parts = node[1:].split(".")
+                cur = models
+                for p in parts:
+                    if isinstance(cur, dict) and p in cur:
+                        cur = cur[p]
+                    else:
+                        unresolved.append(node)
+                        return node
+                return cur
+            return node
+        if isinstance(node, list):
+            return [walk(x) for x in node]
+        if isinstance(node, dict):
+            return {k: walk(v) for k, v in node.items()}
+        return node
+
+    return walk(obj), unresolved
+
+
+def interpolate_vars(obj, variables):
+    """Replace ${var} occurrences inside any string using captured runtime vars."""
+    def walk(node):
+        if isinstance(node, str):
+            out = node
+            for name, val in variables.items():
+                out = out.replace("${" + name + "}", str(val))
+            return out
+        if isinstance(node, list):
+            return [walk(x) for x in node]
+        if isinstance(node, dict):
+            return {k: walk(v) for k, v in node.items()}
+        return node
+
+    return walk(obj)
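
Reviewer note on captured variables: stateful flows such as the conversations case rely on `capture` storing a response value and `interpolate_vars` substituting it into later requests. The sketch below walks through that hand-off with a stand-alone copy of the substitution rule. `interpolate`, the response payload, and the `/v1/conversations/${conversation_id}` path are assumptions made for illustration.

```python
def interpolate(node, variables):
    """Replace ${name} inside every string of a nested dict/list structure."""
    if isinstance(node, str):
        for name, value in variables.items():
            node = node.replace("${" + name + "}", str(value))
        return node
    if isinstance(node, list):
        return [interpolate(item, variables) for item in node]
    if isinstance(node, dict):
        return {key: interpolate(value, variables) for key, value in node.items()}
    return node


# Step 1: a "create" case captures the id from its response (made-up payload).
create_response = {"id": "conv_123", "object": "conversation"}
variables = {"conversation_id": create_response["id"]}

# Step 2: a later case references the captured value in its path and body.
later_request = {
    "method": "GET",
    "path": "/v1/conversations/${conversation_id}",
    "body": {"note": "id=${conversation_id}", "limit": 5},
}
print(interpolate(later_request, variables))
```

Two properties follow from the rule: substituted values always end up as strings, and a `${name}` with no captured value is left in place rather than raising.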
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/paths.py b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/paths.py
new file mode 100644
index 00000000..c82fbbda
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/paths.py
@@ -0,0 +1,90 @@
+"""JSON-path mini-language and deterministic image fixtures.
+
+The path language is intentionally tiny — enough to address normalized AI
+responses and audit entries without a dependency:
+
+    $                          the root object
+    $.a.b                      nested object keys
+    $.choices[0].message       array index
+    $.data.request_body.x      arbitrary nested key (audit bodies)
+
+`get_path` returns (found, value) so callers can distinguish "missing" from
+"present but null/empty".
+"""
+import base64
+import re
+import struct
+import zlib
+
+_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")
+
+
+def get_path(obj, path):
+    """Resolve a `$.a.b[0]` path. Returns (found: bool, value)."""
+    if path in ("$", "", None):
+        return True, obj
+    if path.startswith("$."):
+        path = path[2:]
+    elif path.startswith("$"):
+        path = path[1:]
+    cur = obj
+    for key, idx in _TOKEN.findall(path):
+        if idx != "":
+            if not isinstance(cur, list):
+                return False, None
+            i = int(idx)
+            if i >= len(cur):
+                return False, None
+            cur = cur[i]
+        else:
+            if not isinstance(cur, dict) or key not in cur:
+                return False, None
+            cur = cur[key]
+    return True, cur
+
+
+def json_type(value):
+    """JSON type name for a Python value (for the `type` assertion)."""
+    if value is None:
+        return "null"
+    if isinstance(value, bool):
+        return "boolean"
+    if isinstance(value, (int, float)):
+        return "number"
+    if isinstance(value, str):
+        return "string"
+    if isinstance(value, list):
+        return "array"
+    if isinstance(value, dict):
+        return "object"
+    return "unknown"
+
+
+# ── deterministic image fixtures ────────────────────────────────────────────
+# A solid-colour PNG is the simplest reproducible vision input: every provider
+# can name a colour, so `quality: contains_any [red]` is a stable smoke check
+# that needs no network fetch and no binary asset checked into the repo.
+
+def _solid_png(rgb, size=48):
+    raw = bytearray()
+    row = bytes(rgb) * size
+    for _ in range(size):
+        raw.append(0)  # PNG filter type 0 (none) per scanline
+        raw.extend(row)
+
+    def chunk(typ, data):
+        body = typ + data
+        return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body) & 0xFFFFFFFF)
+
+    sig = b"\x89PNG\r\n\x1a\n"
+    ihdr = struct.pack(">IIBBBBB", size, size, 8, 2, 0, 0, 0)  # 8-bit RGB
+    idat = zlib.compress(bytes(raw), 9)
+    return sig + chunk(b"IHDR", ihdr) + chunk(b"IDAT", idat) + chunk(b"IEND", b"")
+
+
+def png_base64(rgb, size=48):
+    return base64.b64encode(_solid_png(rgb, size)).decode("ascii")
+
+
+def png_data_url(rgb, size=48):
+    return "data:image/png;base64," + png_base64(rgb, size)
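
Reviewer note on the image fixtures: since the PNG bytes are assembled by hand, a round-trip check is a cheap way to confirm the chunk layout. The sketch below rebuilds a small solid-colour PNG with the same signature, IHDR, IDAT, and IEND layout, then parses it back and verifies each CRC. `solid_png` and `read_chunks` are stand-alone illustrations written for this note, not the harness functions.

```python
import struct
import zlib

SIGNATURE = b"\x89PNG\r\n\x1a\n"


def solid_png(rgb, size=4):
    """Build an 8-bit RGB PNG of one colour (signature, IHDR, IDAT, IEND)."""
    row = b"\x00" + bytes(rgb) * size  # filter byte 0, then `size` RGB pixels

    def chunk(kind, data):
        body = kind + data
        return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body))

    ihdr = struct.pack(">IIBBBBB", size, size, 8, 2, 0, 0, 0)
    return (SIGNATURE + chunk(b"IHDR", ihdr)
            + chunk(b"IDAT", zlib.compress(row * size)) + chunk(b"IEND", b""))


def read_chunks(png):
    """Yield (type, data) for each chunk, checking the CRC along the way."""
    if png[:8] != SIGNATURE:
        raise ValueError("bad PNG signature")
    pos = 8
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        kind = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        (crc,) = struct.unpack(">I", png[pos + 8 + length:pos + 12 + length])
        if crc != zlib.crc32(kind + data):
            raise ValueError(f"CRC mismatch in {kind!r}")
        yield kind, data
        pos += 12 + length


chunks = dict(read_chunks(solid_png((220, 30, 30))))
width, height, depth, colour_type = struct.unpack(">IIBB", chunks[b"IHDR"][:10])
pixels = zlib.decompress(chunks[b"IDAT"])
print(width, height, depth, colour_type, tuple(pixels[1:4]))
```

Colour type 2 with bit depth 8 is plain RGB, so each scanline is one filter byte followed by three bytes per pixel.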
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/report.py b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/report.py
new file mode 100644
index 00000000..d551e84e
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/report.py
@@ -0,0 +1,117 @@
+"""Reporting: console table, results.json, and a Markdown report.
+
+The report "registers" each case — the request as sent, the response and how
+the gateway recorded/normalized it (from the audit entry), and every assertion
+with its observed value — and "rates" it PASS / FAIL / ERROR / SKIP plus a
+0–100 quality score for soft modality checks.
+"""
+import json
+import os
+
+STATUS_GLYPH = {"PASS": "PASS", "FAIL": "FAIL", "ERROR": "ERR ", "SKIP": "skip"}
+
+
+def quality_score(case_result):
+    soft = [c for c in case_result["checks"] if not c["hard"]]
+    if not soft:
+        return None
+    return round(100 * sum(1 for c in soft if c["ok"]) / len(soft))
+
+
+def print_console(results, meta):
+    print("\n" + "=" * 92)
+    print("GOMODEL QUALITY (QA) SUITE")
+    print("=" * 92)
+    print(f"gateway={meta['gateway']} cases={len(results)} "
+          f"audit_bodies={'on' if meta['audit_bodies'] else 'OFF'}")
+    print("-" * 92)
+    hdr = f"{'status':6} {'id':46} {'prov':9} {'http':>4} {'qual':>5} detail"
+    print(hdr)
+    print("-" * 92)
+    for r in results:
+        q = quality_score(r)
+        qs = f"{q:>4}%" if q is not None else " - "
+        detail = r["detail"]
+        if len(detail) > 24:
+            detail = detail[:23] + "…"
+        print(f"{STATUS_GLYPH.get(r['status'], r['status']):6} "
+              f"{r['id'][:46]:46} {(r.get('provider') or ''):9} "
+              f"{(r['http'] or ''):>4} {qs:>5} {detail}")
+
+    counts = _counts(results)
+    print("-" * 92)
+    print(f"PASS {counts['PASS']} FAIL {counts['FAIL']} "
+          f"ERROR {counts['ERROR']} SKIP {counts['SKIP']} "
+          f"(total {len(results)})")
+    _print_breakdown("by endpoint", results, "group")
+    _print_breakdown("by provider", results, "provider")
+    print("=" * 92)
+
+
+def _counts(results):
+    c = {"PASS": 0, "FAIL": 0, "ERROR": 0, "SKIP": 0}
+    for r in results:
+        c[r["status"]] = c.get(r["status"], 0) + 1
+    return c
+
+
+def _print_breakdown(label, results, key):
+    groups = {}
+    for r in results:
+        g = r.get(key) or "?"
+        groups.setdefault(g, {"PASS": 0, "FAIL": 0, "ERROR": 0, "SKIP": 0})
+        groups[g][r["status"]] += 1
+    line = " ".join(
+        f"{g}:{v['PASS']}/{v['PASS'] + v['FAIL'] + v['ERROR'] + v['SKIP']}"
+        for g, v in sorted(groups.items()))
+    print(f"{label:12}: {line}")
+
+
+def write_results(out_dir, results, meta):
+    os.makedirs(out_dir, exist_ok=True)
+    with open(os.path.join(out_dir, "results.json"), "w", encoding="utf-8") as f:
+        json.dump({"meta": meta, "counts": _counts(results), "cases": results},
+                  f, indent=2)
+    _write_markdown(out_dir, results, meta)
+    return out_dir
+
+
+def _write_markdown(out_dir, results, meta):
+    c = _counts(results)
+    L = ["# GoModel Quality (QA) Report\n",
+         f"`gateway={meta['gateway']} cases={len(results)} "
+         f"audit_bodies={'on' if meta['audit_bodies'] else 'off'}`\n",
+         f"**PASS {c['PASS']} · FAIL {c['FAIL']} · ERROR {c['ERROR']} · SKIP {c['SKIP']}**\n",
+         "| status | id | endpoint | provider | modality | http | quality | detail |",
+         "|---|---|---|---|--:|--:|--:|---|"]
+    for r in results:
+        q = quality_score(r)
+        qs = f"{q}%" if q is not None else ""
+        mod = r.get("modality")
+        if isinstance(mod, str):
+            mod = [mod]
+        elif not isinstance(mod, list):
+            mod = []
+        modality = ",".join(str(m) for m in mod)
+        L.append(f"| {r['status']} | `{r['id']}` | {r.get('group','')} | "
+                 f"{r.get('provider','')} | {modality} | {r['http'] or ''} | {qs} | "
+                 f"{_md(r['detail'])} |")
+    L.append("")
+    L.append("## Failed & errored cases\n")
+    bad = [r for r in results if r["status"] in ("FAIL", "ERROR")]
+    if not bad:
+        L.append("_None._\n")
+    for r in bad:
+        L.append(f"### `{r['id']}` — {r['status']}\n")
+        L.append(f"- {_md(r.get('title',''))}")
+        L.append(f"- http `{r['http']}` · {_md(r['detail'])}")
+        for chk in r["checks"]:
+            if not chk["ok"] and chk["hard"]:
+                L.append(f" - FAIL `{chk['where']}` — {_md(chk['reason'])}")
+        L.append("")
+    with open(os.path.join(out_dir, "report.md"), "w", encoding="utf-8") as f:
+        f.write("\n".join(L))
+
+
+def _md(s):
+    return str(s).replace("|", "\\|").replace("\n", " ")
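
Reviewer note on the quality score: it is the share of soft checks that passed, rounded to a whole percent. Because `evaluate` in `run_qa.py` records a skipped audit check as a passing soft check, skips raise the score rather than leave it unchanged. The sketch below reproduces the arithmetic on a hand-written check list; the function here takes the list directly (the real one takes the case result), and the sample checks are made up.

```python
def quality_score(checks):
    """Percentage of soft checks that passed, or None when there are none."""
    soft = [c for c in checks if not c["hard"]]
    if not soft:
        return None
    return round(100 * sum(1 for c in soft if c["ok"]) / len(soft))


checks = [
    {"where": "status", "ok": True, "hard": True},
    # A skipped audit check is stored as ok=True, hard=False.
    {"where": "audit:$.provider", "ok": True, "hard": False},
    {"where": "audit:$.data.request_body.x_qa_marker", "ok": True, "hard": False},
    {"where": "quality:stream", "ok": False, "hard": False},
]
print(quality_score(checks))       # two of three soft checks passed
print(quality_score(checks[:1]))   # only a hard check, so no score
```

A score is therefore best compared between runs made with the same audit settings.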
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/run_qa.py b/docs/2026-06-25_aws_gateway_benchmark/qa/run_qa.py
new file mode 100644
index 00000000..804973f9
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/run_qa.py
@@ -0,0 +1,347 @@
+#!/usr/bin/env python3
+"""GoModel quality (QA) harness — declarative spec runner.
+
+Sends a curated corpus of complex requests through a running GoModel gateway to
+real providers (OpenAI / Anthropic / Gemini) across every dialect and modality,
+then registers and rates each case:
+
+ - registers the request as sent, the response, and how the gateway *recorded*
+ and normalized it (pulled from the audit log: inbound body, normalized body,
+ provider, resolved model, usage);
+ - rates each case PASS / FAIL / ERROR / SKIP, plus a 0–100 quality score for
+ soft modality checks (did the vision model name the colour, did STT recover
+ the spoken words, …).
+
+Usage:
+ python run_qa.py # full corpus against localhost:8080
+ python run_qa.py --only chat # filter by id/group/provider substring
+ python run_qa.py --only openai --no-audit
+ python run_qa.py --list # list cases without running
+ python run_qa.py --gateway http://host:8080 --models models.json
+
+Requires the gateway running with audit logging + bodies for the preservation
+checks: LOGGING_ENABLED=true LOGGING_LOG_BODIES=true LOGGING_LOG_HEADERS=true
+LOGGING_LOG_AUDIO_BODIES=true (see README). Stdlib only.
+"""
+import argparse
+import os
+import sys
+import time
+import uuid
+
+HERE = os.path.dirname(os.path.abspath(__file__))
+sys.path.insert(0, HERE)
+
+from qalib import config, report # noqa: E402
+from qalib.assertions import apply_operator, is_hard # noqa: E402
+from qalib.client import Client # noqa: E402
+from qalib.paths import get_path # noqa: E402
+
+def _find_repo_root(start):
+    """Walk up to the repo root (the dir holding .git), for the .env lookup."""
+    d = start
+    while d != os.path.dirname(d):
+        if os.path.exists(os.path.join(d, ".git")):
+            return d
+        d = os.path.dirname(d)
+    return start
+
+
+REPO_ROOT = _find_repo_root(HERE)
+
+
+def locate(target, res, audit):
+    """Resolve a quality/assertion target selector to (found, value)."""
+    if target == "stream":
+        return bool(res.stream_text), res.stream_text
+    if target == "body.text":
+        return res.text is not None, res.text
+    if target.startswith("response:"):
+        return get_path(res.json, target[len("response:"):])
+    if target.startswith("audit:"):
+        if audit is None:
+            return False, None
+        return get_path(audit, target[len("audit:"):])
+    return False, None
+
+
+def evaluate(case, res, audit, audit_attempted, variables=None):
+    """Return (status, checks, detail). checks: [{where, ok, hard, reason}]."""
+    checks = []
+    expect = case.get("expect", {})
+    if variables:
+        # Resolve ${var} references (e.g. a captured ${conversation_id}) in
+        # assertion operands, the same way request paths/bodies are interpolated.
+        expect = config.interpolate_vars(expect, variables)
+
+    if res.error:
+        return "ERROR", checks, res.error
+
+    # ── status ──────────────────────────────────────────────────────────────
+    want = expect.get("status", 200)
+    want = want if isinstance(want, list) else [want]
+    checks.append({"where": "status", "ok": res.status in want, "hard": True,
+                   "reason": f"{res.status} in {want}"})
+
+    # ── headers ───────────────────────────────────────────────────────────────
+    for a in expect.get("headers", []):
+        name = a["name"].lower()
+        found = name in res.headers
+        ok, reason = apply_operator(a, found, res.headers.get(name))
+        checks.append({"where": f"header:{a['name']}", "ok": ok,
+                       "hard": is_hard(a), "reason": reason})
+
+    # ── body (synthetic fields for any body, incl. binary) ────────────────────
+    body_fields = {"content_type": res.content_type, "bytes": res.bytes,
+                   "text": res.text}
+    for a in expect.get("body", []):
+        field = a["field"]
+        val = body_fields.get(field)
+        ok, reason = apply_operator(a, val is not None, val)
+        checks.append({"where": f"body:{field}", "ok": ok,
+                       "hard": is_hard(a), "reason": reason})
+
+    # ── response JSON ─────────────────────────────────────────────────────────
+    for a in expect.get("response", []):
+        found, val = get_path(res.json, a["path"]) if res.json is not None else (False, None)
+        ok, reason = apply_operator(a, found, val)
+        checks.append({"where": f"response:{a['path']}", "ok": ok,
+                       "hard": is_hard(a), "reason": reason})
+
+    # ── streaming ─────────────────────────────────────────────────────────────
+    st = expect.get("stream")
+    if st:
+        if "min_events" in st:
+            n = len(res.events)
+            checks.append({"where": "stream:events", "ok": n >= st["min_events"],
+                           "hard": True, "reason": f"{n} events >= {st['min_events']}"})
+        if "terminal" in st:
+            checks.append({"where": "stream:terminal", "ok": res.terminal == st["terminal"],
+                           "hard": True, "reason": f"{res.terminal!r} == {st['terminal']!r}"})
+        for et in st.get("event_types", []):
+            present = any(e.get("type") == et for e in res.events)
+            checks.append({"where": f"stream:type:{et}", "ok": present,
+                           "hard": True, "reason": f"event {et} present={present}"})
+        for a in st.get("text", []):
+            ok, reason = apply_operator(a, bool(res.stream_text), res.stream_text)
+            checks.append({"where": "stream:text", "ok": ok,
+                           "hard": is_hard(a), "reason": reason})
+
+    # ── audit (gateway's own record of what it received / returned) ───────────
+    for a in expect.get("audit", []):
+        path = a["path"]
+        if not audit_attempted:
+            continue
+        if audit is None:
+            checks.append({"where": f"audit:{path}", "ok": True, "hard": False,
+                           "reason": "audit entry not found (skipped)"})
+            continue
+        found, val = get_path(audit, path)
+        # If body capture is off, demote data.* checks to soft skips.
+        if not found and path.startswith("$.data."):
+            data = audit.get("data") or {}
+            if "request_body" not in data and "response_body" not in data:
+                checks.append({"where": f"audit:{path}", "ok": True, "hard": False,
+                               "reason": "audit bodies off (enable LOGGING_LOG_BODIES)"})
+                continue
+        ok, reason = apply_operator(a, found, val)
+        checks.append({"where": f"audit:{path}", "ok": ok,
+                       "hard": is_hard(a), "reason": reason})
+
+    # ── quality (always soft; feeds the score) ────────────────────────────────
+    for a in expect.get("quality", []):
+        found, val = locate(a.get("target", "stream"), res, audit)
+        a = dict(a)
+        a["hard"] = False
+        ok, reason = apply_operator(a, found, val)
+        checks.append({"where": f"quality:{a.get('target','stream')}", "ok": ok,
+                       "hard": False, "reason": reason})
+
+    hard_fail = [c for c in checks if c["hard"] and not c["ok"]]
+    status = "FAIL" if hard_fail else "PASS"
+    if hard_fail:
+        detail = f"{hard_fail[0]['where']}: {hard_fail[0]['reason']}"
+    else:
+        ok_n = sum(1 for c in checks if c["ok"])
+        detail = f"{ok_n}/{len(checks)} ok"
+    return status, checks, detail
+
+
+def run_case(case, client, models, variables, do_audit):
+    """Build, send, capture vars, fetch audit for one case. Returns (res, audit,
+    audit_attempted, skip_reason)."""
+    resolved, unresolved = config.resolve_roles(case.get("request", {}), models)
+    if unresolved:
+        return None, None, False, f"unresolved role(s): {', '.join(sorted(set(unresolved)))}"
+    req = config.interpolate_vars(resolved, variables)
+
+    produce = req.get("produce")
+    if produce == "tts_then_stt":
+        res = _produce_tts_then_stt(req, client)
+    else:
+        res = client.send(req.get("method", "POST"), req["path"], body=req.get("body"),
+                          headers=req.get("headers"), stream=req.get("stream", False),
+                          raw_body=req.get("raw_body"))
+
+    # capture runtime vars from the response body
+    for name, path in (case.get("capture") or {}).items():
+        if res.json is not None:
+            found, val = get_path(res.json, path)
+            if found:
+                variables[name] = val
+
+    audit_attempted = bool(do_audit and case.get("expect", {}).get("audit"))
+    audit = client.fetch_audit(res.request_id) if audit_attempted else None
+    return res, audit, audit_attempted, None
+
+
+def _produce_tts_then_stt(req, client):
+    tts = req["tts"]
+    fmt = tts.get("response_format", "mp3")
+    r1 = client.send("POST", "/v1/audio/speech", body=tts)
+    if r1.status != 200 or not r1.raw:
+        r1.error = f"tts produce failed (status {r1.status}, {r1.bytes} bytes)"
+        return r1
+    stt = req["stt"]
+    mime = r1.content_type or "audio/mpeg"
+    res = client.send_multipart("/v1/audio/transcriptions", stt, "file",
+                                f"qa.{fmt}", r1.raw, mime)
+    res.produced_from = {"tts_status": r1.status, "tts_bytes": r1.bytes,
+                         "tts_content_type": r1.content_type}
+    return res
+
+
+def _trim(obj, limit=4000):
+    """Trim long strings (base64 audio, etc.) so the artifact stays readable."""
+    if isinstance(obj, str):
+        return obj if len(obj) <= limit else obj[:limit] + f"…(+{len(obj) - limit} chars)"
+    if isinstance(obj, list):
+        return [_trim(x, limit) for x in obj]
+    if isinstance(obj, dict):
+        return {k: _trim(v, limit) for k, v in obj.items()}
+    return obj
+
+
+def artifact(case, res, audit):
+    """The registered record: what was sent, what came back, how the gateway
+    recorded/normalized it."""
+    if res is None:
+        return {"request": case.get("request"), "response": None, "audit": None}
+    resp = {"status": res.status, "content_type": res.content_type,
+            "bytes": res.bytes, "request_id": res.request_id}
+    if res.json is not None:
+        resp["json"] = _trim(res.json)
+    if res.text is not None:
+        resp["text"] = _trim(res.text)
+    if res.events:
+        resp["stream_events"] = len(res.events)
+        resp["stream_text"] = _trim(res.stream_text)
+        resp["terminal"] = res.terminal
+    if getattr(res, "produced_from", None):
+        resp["produced_from"] = res.produced_from
+    audit_view = None
+    if audit:
+        data = audit.get("data") or {}
+        audit_view = {
+            "provider": audit.get("provider"),
+            "resolved_model": audit.get("resolved_model"),
+            "requested_model": audit.get("requested_model"),
+            "status_code": audit.get("status_code"),
+            "duration_ns": audit.get("duration_ns"),
+            "usage": audit.get("usage"),
+            "request_body": _trim(data.get("request_body")),
+            "response_body": _trim(data.get("response_body")),
+        }
+    return {"request": _trim(case.get("request")), "response": resp, "audit": audit_view}
+
+
+def main():
+    ap = argparse.ArgumentParser(description="GoModel quality (QA) harness")
+    ap.add_argument("--gateway", default=os.environ.get("GATEWAY", "http://localhost:8080"))
+    ap.add_argument("--models", default=os.path.join(HERE, "models.json"))
+    ap.add_argument("--spec-dir", default=os.path.join(HERE, "spec"))
+    ap.add_argument("--out", default=os.path.join(HERE, "output"))
+    ap.add_argument("--only", default=None, help="filter by id/group/provider substring")
+    ap.add_argument("--no-audit", action="store_true", help="skip audit-log cross-checks")
+    ap.add_argument("--list", action="store_true", help="list matching cases and exit")
+    ap.add_argument("--timeout", type=int, default=120)
+    args = ap.parse_args()
+
+    models = config.load_models(args.models)
+    cases = config.load_specs(args.spec_dir, args.only)
+    if not cases:
+        print("no cases matched", file=sys.stderr)
+        return 2
+    if args.list:
+        for c in cases:
+            print(f"{c['id']:48} {c.get('group',''):14} {c.get('provider','')}")
+        print(f"\n{len(cases)} cases")
+        return 0
+
+    key = config.load_master_key(REPO_ROOT)
+    if not key:
+        print("no GOMODEL_MASTER_KEY found (env or repo .env)", file=sys.stderr)
+        return 2
+
+    run_id = uuid.uuid4().hex[:12]
+    user_path = f"/qa/{run_id}"
+    client = Client(args.gateway, key, user_path, timeout=args.timeout)
+
+    health = client.send("GET", "/health")
+    if health.error or health.status >= 500:
+        print(f"gateway not reachable at {args.gateway}: "
+              f"{health.error or health.status}", file=sys.stderr)
+        return 2
+
+    print(f"running {len(cases)} cases against {args.gateway} (user_path {user_path})")
+    results = []
+    variables = {}
+    audit_bodies_seen = False
+    for case in cases:
+        t0 = time.time()
+        try:
+            res, audit, attempted, skip = run_case(case, client, models, variables,
+                                                   do_audit=not args.no_audit)
+
+            if skip:
+                results.append(_record(case, "SKIP", [], skip, res, audit, time.time() - t0))
+                print(f"skip {case['id']}: {skip}")
+                continue
+
+            if audit and (audit.get("data") or {}).get("request_body") is not None:
+                audit_bodies_seen = True
+
+            status, checks, detail = evaluate(case, res, audit, attempted, variables)
+            rec = _record(case, status, checks, detail, res, audit, time.time() - t0)
+            results.append(rec)
+            print(f"{report.STATUS_GLYPH.get(status, status):4} {case['id']}: {detail}")
+        except Exception as e:  # noqa: BLE001 - never let one case abort the run
+            err = f"{type(e).__name__}: {e}"
+            results.append(_record(case, "ERROR", [], err, None, None, time.time() - t0))
+            print(f"ERR {case['id']}: {err}")
+            continue
+
+    meta = {"gateway": args.gateway, "run_id": run_id, "user_path": user_path,
+            "audit_bodies": audit_bodies_seen, "models": models}
+    report.print_console(results, meta)
+    out_dir = os.path.join(args.out, run_id)
+    report.write_results(out_dir, results, meta)
+    print(f"\nwrote {os.path.join(out_dir, 'results.json')}\n"
+          f"wrote {os.path.join(out_dir, 'report.md')}")
+
+    failed = sum(1 for r in results if r["status"] in ("FAIL", "ERROR"))
+    return 1 if failed else 0
+
+
+def _record(case, status, checks, detail, res, audit, elapsed):
+    return {
+        "id": case["id"], "title": case.get("title", ""), "group": case.get("group"),
+        "provider": case.get("provider"), "modality": case.get("modality"),
+        "status": status, "http": (res.status if res else None),
+        "detail": detail, "elapsed_ms": round(elapsed * 1000),
+        "checks": checks, "artifact": artifact(case, res, audit),
+    }
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/audio.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/audio.json
new file mode 100644
index 00000000..a2a2fbfb
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/audio.json
@@ -0,0 +1,75 @@
+[
+ {
+ "id": "audio.openai.tts_mp3",
+ "title": "TTS: synthesize speech (mp3)",
+ "provider": "openai",
+ "modality": ["audio"],
+ "request": {
+ "path": "/v1/audio/speech",
+ "body": {"model": "@openai.tts", "voice": "alloy", "input": "The quick brown fox jumps over the lazy dog.", "response_format": "mp3"}
+ },
+ "expect": {
+ "status": 200,
+ "body": [
+ {"field": "content_type", "contains": "audio/"},
+ {"field": "bytes", "gte": 2000}
+ ]
+ },
+ "notes": "Text-to-speech returns binary audio with an audio/* content type."
+ },
+ {
+ "id": "audio.openai.tts_wav",
+ "title": "TTS: response_format wav changes content type",
+ "provider": "openai",
+ "modality": ["audio"],
+ "request": {
+ "path": "/v1/audio/speech",
+ "body": {"model": "@openai.tts", "voice": "alloy", "input": "Hello world.", "response_format": "wav"}
+ },
+ "expect": {
+ "status": 200,
+ "body": [
+ {"field": "content_type", "contains": "audio/wav"},
+ {"field": "bytes", "gte": 2000}
+ ]
+ },
+ "notes": "response_format must drive the response MIME type (audio/wav)."
+ },
+ {
+ "id": "audio.openai.tts_stt_json",
+ "title": "STT: round-trip TTS -> transcription (json) recovers the words",
+ "provider": "openai",
+ "modality": ["audio"],
+ "request": {
+ "produce": "tts_then_stt",
+ "tts": {"model": "@openai.tts", "voice": "alloy", "input": "Benchmark gateways measure latency and cost.", "response_format": "mp3"},
+ "stt": {"model": "@openai.stt", "response_format": "json"}
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.text", "not_empty": true}],
+ "quality": [{"target": "response:$.text", "contains_any": ["benchmark", "gateway", "latency", "cost"]}]
+ },
+ "notes": "Self-contained modality round-trip: synthesize known text, transcribe it, assert the words come back. No external audio fixture."
+ },
+ {
+ "id": "audio.openai.tts_stt_text",
+ "title": "STT: response_format text returns plain text",
+ "provider": "openai",
+ "modality": ["audio"],
+ "request": {
+ "produce": "tts_then_stt",
+ "tts": {"model": "@openai.tts", "voice": "alloy", "input": "Speech to text in plain format.", "response_format": "mp3"},
+ "stt": {"model": "@openai.stt", "response_format": "text"}
+ },
+ "expect": {
+ "status": 200,
+ "body": [
+ {"field": "content_type", "contains": "text/"},
+ {"field": "text", "not_empty": true}
+ ],
+ "quality": [{"target": "body.text", "contains_any": ["speech", "text", "plain", "format"]}]
+ },
+ "notes": "Transcription response_format=text returns text/plain, not JSON."
+ }
+]
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json
new file mode 100644
index 00000000..f2ea9eab
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json
@@ -0,0 +1,448 @@
+[
+ {
+ "id": "chat.openai.multiturn",
+ "title": "OpenAI chat: multi-turn system+user+assistant+user",
+ "provider": "openai",
+ "modality": ["text"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "body": {
+ "model": "@openai.chat",
+ "messages": [
+ {"role": "system", "content": "You are a terse assistant. Answer in one short sentence."},
+ {"role": "user", "content": "Name the largest planet in the solar system."},
+ {"role": "assistant", "content": "Jupiter."},
+ {"role": "user", "content": "And the smallest?"}
+ ],
+ "max_tokens": 64
+ }
+ },
+ "expect": {
+ "status": 200,
+ "headers": [{"name": "X-Request-Id", "present": true}],
+ "response": [
+ {"path": "$.object", "equals": "chat.completion"},
+ {"path": "$.choices[0].message.role", "equals": "assistant"},
+ {"path": "$.choices[0].message.content", "not_empty": true},
+ {"path": "$.usage.total_tokens", "gt": 0}
+ ],
+ "audit": [
+ {"path": "$.provider", "equals": "openai"},
+ {"path": "$.resolved_model", "not_empty": true}
+ ],
+ "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["mercury"]}]
+ },
+ "notes": "Baseline conversational correctness + OpenAI usage normalization + audit routing."
+ },
+ {
+ "id": "chat.openai.stream",
+ "title": "OpenAI chat: streaming deltas terminate with [DONE]",
+ "provider": "openai",
+ "modality": ["text", "streaming"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "stream": true,
+ "body": {
+ "model": "@openai.chat",
+ "messages": [{"role": "user", "content": "List three primary colors, comma separated."}],
+ "stream": true,
+ "max_tokens": 64
+ }
+ },
+ "expect": {
+ "status": 200,
+ "stream": {"min_events": 2, "terminal": "[DONE]", "text": [{"not_empty": true}]},
+ "quality": [{"target": "stream", "contains_any": ["red", "blue", "yellow"]}]
+ },
+ "notes": "SSE framing + terminal marker for chat dialect."
+ },
+ {
+ "id": "chat.openai.stream_usage",
+ "title": "OpenAI chat: stream_options include_usage emits a usage chunk",
+ "provider": "openai",
+ "modality": ["text", "streaming"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "stream": true,
+ "body": {
+ "model": "@openai.chat",
+ "messages": [{"role": "user", "content": "Say hi."}],
+ "stream": true,
+ "stream_options": {"include_usage": true},
+ "max_tokens": 32
+ }
+ },
+ "expect": {
+ "status": 200,
+ "stream": {"min_events": 2, "terminal": "[DONE]"},
+ "audit": [{"path": "$.usage.total_tokens", "gt": 0}],
+ "quality": [{"target": "stream", "not_empty": true}]
+ },
+ "notes": "stream_options must survive translation; the usage chunk is provider-shaped. For a stream the gateway can only derive usage from the streamed usage chunk, so a recorded usage.total_tokens>0 proves the chunk was emitted and not dropped."
+ },
+ {
+ "id": "chat.openai.vision_data_url",
+ "title": "OpenAI chat: vision via inline image_url (data URL)",
+ "provider": "openai",
+ "modality": ["vision"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "body": {
+ "model": "@openai.vision",
+ "messages": [{"role": "user", "content": [
+ {"type": "text", "text": "What is the single dominant color of this image? Answer with one word."},
+ {"type": "image_url", "image_url": {"url": "@image.red"}}
+ ]}],
+ "max_tokens": 32
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.choices[0].message.content", "not_empty": true}],
+ "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["red"]}]
+ },
+ "notes": "Multimodal content-part passthrough; deterministic solid-color fixture."
+ },
+ {
+ "id": "chat.openai.tools_call",
+ "title": "OpenAI chat: function/tool calling is emitted",
+ "provider": "openai",
+ "modality": ["tools"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "body": {
+ "model": "@openai.chat",
+ "messages": [{"role": "user", "content": "What is the weather in Paris? Use the tool."}],
+ "tools": [{"type": "function", "function": {
+ "name": "get_weather",
+ "description": "Get current weather for a city",
+ "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
+ }}],
+ "tool_choice": "required",
+ "max_tokens": 128
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [
+ {"path": "$.choices[0].message.tool_calls[0].function.name", "equals": "get_weather"},
+ {"path": "$.choices[0].finish_reason", "equals": "tool_calls"}
+ ]
+ },
+ "notes": "tool_choice=required must force a structured tool call."
+ },
+ {
+ "id": "chat.openai.tools_roundtrip",
+ "title": "OpenAI chat: tool result fed back yields a final answer",
+ "provider": "openai",
+ "modality": ["tools"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "body": {
+ "model": "@openai.chat",
+ "messages": [
+ {"role": "user", "content": "What is the weather in Paris?"},
+ {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_weather", "arguments": "{\"city\":\"Paris\"}"}}]},
+ {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 21, \"summary\": \"sunny\"}"}
+ ],
+ "max_tokens": 64
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.choices[0].message.content", "not_empty": true}],
+ "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["21", "sunny", "sun"]}]
+ },
+ "notes": "Assistant tool_calls + role:tool message round-trip translation."
+ },
+ {
+ "id": "chat.openai.structured_json_schema",
+ "title": "OpenAI chat: structured output via response_format json_schema",
+ "provider": "openai",
+ "modality": ["structured"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "body": {
+ "model": "@openai.chat",
+ "messages": [{"role": "user", "content": "Give the capital of France."}],
+ "response_format": {"type": "json_schema", "json_schema": {
+ "name": "capital",
+ "strict": true,
+ "schema": {"type": "object", "properties": {"country": {"type": "string"}, "capital": {"type": "string"}}, "required": ["country", "capital"], "additionalProperties": false}
+ }},
+ "max_tokens": 64
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.choices[0].message.content", "not_empty": true}],
+ "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["paris"]}]
+ },
+ "notes": "response_format must pass through and constrain output."
+ },
+ {
+ "id": "chat.openai.reasoning_max_tokens_mapping",
+ "title": "OpenAI reasoning: max_tokens accepted, temperature tolerated (Postel)",
+ "provider": "openai",
+ "modality": ["reasoning"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "body": {
+ "model": "@openai.reasoning",
+ "messages": [{"role": "user", "content": "What is 17 + 25? Reply with the number only."}],
+ "max_tokens": 2000,
+ "temperature": 0.5
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.choices[0].message.content", "not_empty": true}],
+ "audit": [
+ {"path": "$.provider", "equals": "openai"},
+ {"path": "$.data.request_body.max_tokens", "present": true, "hard": false}
+ ],
+ "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["42"]}]
+ },
+ "notes": "Reasoning models reject max_tokens/temperature upstream; a 200 proves the gateway mapped max_tokens->max_completion_tokens and dropped temperature. Audit shows the inbound body is preserved verbatim."
+ },
+ {
+ "id": "chat.openai.optional_field_preserved",
+ "title": "OpenAI chat: a valid optional field (user) is preserved end-to-end",
+ "provider": "openai",
+ "modality": ["text", "preservation"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "headers": {"X-QA-Marker": "keep-123"},
+ "body": {
+ "model": "@openai.chat",
+ "messages": [{"role": "user", "content": "Reply with the word OK."}],
+ "user": "qa-user-001",
+ "max_tokens": 16
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.choices[0].message.content", "not_empty": true}],
+ "audit": [
+ {"path": "$.data.request_body.user", "equals": "qa-user-001", "hard": false}
+ ]
+ },
+ "notes": "A provider-valid optional field round-trips: request succeeds and the audit confirms the gateway recorded it verbatim. (Unknown/unrecognized fields are a separate case — see errors.openai_unknown_field_forwarded.)"
+ },
+ {
+ "id": "chat.openai.provider_extras_preserved",
+ "title": "OpenAI chat: provider-specific response extras survive normalization",
+ "provider": "openai",
+ "modality": ["preservation"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "body": {
+ "model": "@openai.chat",
+ "messages": [{"role": "user", "content": "Say hi."}],
+ "max_tokens": 16
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [
+ {"path": "$.id", "not_empty": true},
+ {"path": "$.created", "gt": 0},
+ {"path": "$.model", "not_empty": true},
+ {"path": "$.system_fingerprint", "present": true, "hard": false},
+ {"path": "$.service_tier", "present": true, "hard": false}
+ ]
+ },
+ "notes": "Normalization should preserve provider extras (system_fingerprint/service_tier) rather than strip to a minimal schema."
+ },
+ {
+ "id": "chat.anthropic.basic",
+ "title": "Anthropic via chat/completions: OpenAI-shaped response",
+ "provider": "anthropic",
+ "modality": ["text"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "body": {
+ "model": "@anthropic.chat",
+ "messages": [
+ {"role": "system", "content": "Be concise."},
+ {"role": "user", "content": "Name the capital of Japan in one word."}
+ ],
+ "max_tokens": 64
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [
+ {"path": "$.object", "equals": "chat.completion"},
+ {"path": "$.choices[0].message.content", "not_empty": true},
+ {"path": "$.usage.completion_tokens", "gt": 0}
+ ],
+ "audit": [{"path": "$.provider", "equals": "anthropic"}],
+ "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["tokyo"]}]
+ },
+ "notes": "Claude served through the OpenAI chat dialect; Anthropic input/output token usage normalized to prompt/completion."
+ },
+ {
+ "id": "chat.anthropic.stream",
+ "title": "Anthropic via chat/completions: streaming normalized to [DONE]",
+ "provider": "anthropic",
+ "modality": ["text", "streaming"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "stream": true,
+ "body": {
+ "model": "@anthropic.chat",
+ "messages": [{"role": "user", "content": "Count from 1 to 3."}],
+ "stream": true,
+ "max_tokens": 64
+ }
+ },
+ "expect": {
+ "status": 200,
+ "stream": {"min_events": 2, "terminal": "[DONE]", "text": [{"not_empty": true}]}
+ },
+ "notes": "Anthropic SSE converted into OpenAI chat-stream framing with a [DONE] terminator."
+ },
+ {
+ "id": "chat.anthropic.vision",
+ "title": "Anthropic via chat/completions: vision image_url",
+ "provider": "anthropic",
+ "modality": ["vision"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "body": {
+ "model": "@anthropic.vision",
+ "messages": [{"role": "user", "content": [
+ {"type": "text", "text": "One word: what is the dominant color?"},
+ {"type": "image_url", "image_url": {"url": "@image.blue"}}
+ ]}],
+ "max_tokens": 32
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.choices[0].message.content", "not_empty": true}],
+ "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["blue"]}]
+ },
+ "notes": "image_url data URL mapped to an Anthropic base64 image block."
+ },
+ {
+ "id": "chat.anthropic.params_fidelity",
+ "title": "Anthropic via chat/completions: sampling params + stop honored",
+ "provider": "anthropic",
+ "modality": ["text"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "body": {
+ "model": "@anthropic.chat",
+ "messages": [{"role": "user", "content": "Write the word DONE then stop."}],
+ "temperature": 0.2,
+ "top_p": 0.9,
+ "stop": ["\n\n"],
+ "max_tokens": 32
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.choices[0].message.content", "not_empty": true}]
+ },
+ "notes": "temperature/top_p/stop translated to Anthropic equivalents without error."
+ },
+ {
+ "id": "chat.gemini.basic",
+ "title": "Gemini via chat/completions: native API path",
+ "provider": "gemini",
+ "modality": ["text"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "body": {
+ "model": "@gemini.chat",
+ "messages": [
+ {"role": "system", "content": "Be concise."},
+ {"role": "user", "content": "Capital of Italy in one word?"}
+ ],
+ "max_tokens": 64
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [
+ {"path": "$.object", "equals": "chat.completion"},
+ {"path": "$.choices[0].message.content", "not_empty": true},
+ {"path": "$.usage.total_tokens", "gt": 0}
+ ],
+ "audit": [{"path": "$.provider", "equals": "gemini"}],
+ "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["rome"]}]
+ },
+ "notes": "Gemini native contents/parts mapping + usageMetadata normalization (USE_GOOGLE_GEMINI_NATIVE_API default true)."
+ },
+ {
+ "id": "chat.gemini.stream",
+ "title": "Gemini via chat/completions: streaming",
+ "provider": "gemini",
+ "modality": ["text", "streaming"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "stream": true,
+ "body": {
+ "model": "@gemini.chat",
+ "messages": [{"role": "user", "content": "Name two oceans, comma separated."}],
+ "stream": true,
+ "max_tokens": 64
+ }
+ },
+ "expect": {
+ "status": 200,
+ "stream": {"min_events": 2, "terminal": "[DONE]", "text": [{"not_empty": true}]}
+ },
+ "notes": "Gemini native stream translated to OpenAI chat-stream framing."
+ },
+ {
+ "id": "chat.gemini.vision",
+ "title": "Gemini via chat/completions: vision inline image",
+ "provider": "gemini",
+ "modality": ["vision"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "body": {
+ "model": "@gemini.vision",
+ "messages": [{"role": "user", "content": [
+ {"type": "text", "text": "One word: dominant color?"},
+ {"type": "image_url", "image_url": {"url": "@image.green"}}
+ ]}],
+ "max_tokens": 32
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.choices[0].message.content", "not_empty": true}],
+ "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["green"]}]
+ },
+ "notes": "image_url data URL mapped to Gemini inline_data."
+ },
+ {
+ "id": "chat.gemini.tools",
+ "title": "Gemini via chat/completions: function calling",
+ "provider": "gemini",
+ "modality": ["tools"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "body": {
+ "model": "@gemini.chat",
+ "messages": [{"role": "user", "content": "Use the tool to get the weather in Berlin."}],
+ "tools": [{"type": "function", "function": {
+ "name": "get_weather",
+ "description": "Get weather for a city",
+ "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
+ }}],
+ "max_tokens": 128
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.choices[0].message.tool_calls[0].function.name", "equals": "get_weather", "hard": false}],
+ "quality": [{"target": "response:$.choices[0].message.tool_calls[0].function.name", "contains": "weather"}]
+ },
+ "notes": "Gemini functionCall mapped to OpenAI tool_calls (soft: model may answer directly)."
+ }
+]
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json
new file mode 100644
index 00000000..9c61475f
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json
@@ -0,0 +1,99 @@
+[
+ {
+ "id": "conversations.create",
+ "title": "Conversations: create",
+ "provider": "openai",
+ "modality": ["stateful"],
+ "request": {
+ "path": "/v1/conversations",
+ "body": {"metadata": {"qa": "conv-flow"}}
+ },
+ "capture": {"conversation_id": "$.id"},
+ "expect": {
+ "status": 200,
+ "response": [
+ {"path": "$.object", "equals": "conversation"},
+ {"path": "$.id", "not_empty": true}
+ ]
+ },
+ "notes": "Creates a conversation and captures its id for the rest of this flow (cases run in order)."
+ },
+ {
+ "id": "conversations.get",
+ "title": "Conversations: retrieve by id",
+ "provider": "openai",
+ "modality": ["stateful"],
+ "request": {
+ "method": "GET",
+ "path": "/v1/conversations/${conversation_id}"
+ },
+ "expect": {
+ "status": 200,
+ "response": [
+ {"path": "$.object", "equals": "conversation"},
+ {"path": "$.id", "equals": "${conversation_id}"}
+ ]
+ },
+ "notes": "Reads back the conversation created above; the returned id must equal the captured ${conversation_id}."
+ },
+ {
+ "id": "conversations.use_in_responses",
+ "title": "Conversations: link a Responses call to a conversation",
+ "provider": "openai",
+ "modality": ["stateful", "text"],
+ "request": {
+ "path": "/v1/responses",
+ "body": {
+ "model": "@openai.chat",
+ "conversation": {"id": "${conversation_id}"},
+ "input": "Remember the number 7.",
+ "max_output_tokens": 32
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [
+ {"path": "$.status", "equals": "completed"},
+ {"path": "$.conversation.id", "equals": "${conversation_id}"}
+ ]
+ },
+ "notes": "Responses request bound to a conversation id (stateful threading); the response must carry the same ${conversation_id} it was attached to."
+ },
+ {
+ "id": "conversations.update",
+ "title": "Conversations: update metadata",
+ "provider": "openai",
+ "modality": ["stateful"],
+ "request": {
+ "path": "/v1/conversations/${conversation_id}",
+ "body": {"metadata": {"qa": "conv-flow", "stage": "updated"}}
+ },
+ "expect": {
+ "status": 200,
+ "response": [
+ {"path": "$.object", "equals": "conversation"},
+ {"path": "$.id", "equals": "${conversation_id}"}
+ ]
+ },
+ "notes": "Metadata update on an existing conversation; the update must return the same ${conversation_id}."
+ },
+ {
+ "id": "conversations.delete",
+ "title": "Conversations: delete",
+ "provider": "openai",
+ "modality": ["stateful"],
+ "request": {
+ "method": "DELETE",
+ "path": "/v1/conversations/${conversation_id}"
+ },
+ "expect": {
+ "status": 200,
+ "response": [
+ {"path": "$.object", "equals": "conversation.deleted"},
+ {"path": "$.id", "equals": "${conversation_id}"},
+ {"path": "$.deleted", "equals": true}
+ ]
+ },
+ "notes": "Tears down the conversation created at the start of the flow; the deletion ack must reference the same ${conversation_id}."
+ }
+]
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/embeddings.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/embeddings.json
new file mode 100644
index 00000000..a03e7eea
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/embeddings.json
@@ -0,0 +1,43 @@
+[
+ {
+ "id": "embeddings.openai.single",
+ "title": "Embeddings: single string input",
+ "provider": "openai",
+ "modality": ["embeddings"],
+ "request": {
+ "path": "/v1/embeddings",
+ "body": {"model": "@openai.embed", "input": "GoModel is an AI gateway."}
+ },
+ "expect": {
+ "status": 200,
+ "response": [
+ {"path": "$.object", "equals": "list"},
+ {"path": "$.data[0].object", "equals": "embedding"},
+ {"path": "$.data[0].embedding", "length_gte": 256}
+ ],
+ "audit": [{"path": "$.provider", "equals": "openai"}]
+ },
+ "notes": "Embedding vector of expected dimensionality returned in OpenAI list shape."
+ },
+ {
+ "id": "embeddings.openai.batch",
+ "title": "Embeddings: batch input array",
+ "provider": "openai",
+ "modality": ["embeddings"],
+ "request": {
+ "path": "/v1/embeddings",
+ "body": {"model": "@openai.embed", "input": ["alpha", "beta", "gamma"]}
+ },
+ "expect": {
+ "status": 200,
+ "response": [
+ {"path": "$.data", "length_gte": 3},
+ {"path": "$.data[0].index", "equals": 0},
+ {"path": "$.data[1].index", "equals": 1},
+ {"path": "$.data[2].index", "equals": 2},
+ {"path": "$.data[2].embedding", "length_gte": 256}
+ ]
+ },
+ "notes": "Batched inputs produce one embedding per item, order preserved."
+ }
+]
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json
new file mode 100644
index 00000000..8f1279c6
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json
@@ -0,0 +1,82 @@
+[
+ {
+ "id": "errors.unknown_model",
+ "title": "Errors: unknown model returns a normalized OpenAI-style error",
+ "provider": "openai",
+ "modality": ["errors"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "body": {"model": "definitely-not-a-real-model-zzz", "messages": [{"role": "user", "content": "hi"}], "max_tokens": 8}
+ },
+ "expect": {
+ "status": [400, 404],
+ "response": [
+ {"path": "$.error.message", "not_empty": true},
+ {"path": "$.error.type", "present": true, "hard": false}
+ ]
+ },
+ "notes": "Routing failure surfaces as a clean error envelope, not a 5xx or hang."
+ },
+ {
+ "id": "errors.anthropic_audio_rejected",
+ "title": "Errors: unsupported input_audio on Anthropic chat is rejected gracefully",
+ "provider": "anthropic",
+ "modality": ["errors", "audio"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "body": {
+ "model": "@anthropic.chat",
+ "messages": [{"role": "user", "content": [
+ {"type": "text", "text": "Transcribe this."},
+ {"type": "input_audio", "input_audio": {"data": "AAAA", "format": "mp3"}}
+ ]}],
+ "max_tokens": 32
+ }
+ },
+ "expect": {
+ "status": [400, 415, 422],
+ "response": [{"path": "$.error.message", "not_empty": true}]
+ },
+ "notes": "Anthropic chat does not support input_audio; the gateway must reject with a 4xx invalid-request error rather than crash or forward garbage. A non-4xx here is itself the finding."
+ },
+ {
+ "id": "errors.openai_unknown_field_forwarded",
+ "title": "Behavior: unknown top-level fields are forwarded verbatim (provider rejects)",
+ "provider": "openai",
+ "modality": ["errors", "preservation"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "body": {
+ "model": "@openai.chat",
+ "messages": [{"role": "user", "content": "hi"}],
+ "x_qa_marker": "keep-123",
+ "max_tokens": 8
+ }
+ },
+ "expect": {
+ "status": [400],
+ "response": [
+ {"path": "$.error.message", "not_empty": true},
+ {"path": "$.error.message", "contains": "x_qa_marker", "hard": false},
+ {"path": "$.error.type", "equals": "invalid_request_error", "hard": false}
+ ],
+ "audit": [{"path": "$.data.request_body.x_qa_marker", "equals": "keep-123", "hard": false}]
+ },
+ "notes": "Documented finding (2026-06): GoModel does not strip unrecognized top-level fields; it forwards them, so a strict provider (OpenAI) returns 400 'Unrecognized request argument'. The audit confirms the field was captured inbound and passed through. If the gateway later sanitizes unknown fields, change expect.status to 200."
+ },
+ {
+ "id": "errors.malformed_json",
+ "title": "Errors: malformed JSON body returns 400",
+ "provider": "openai",
+ "modality": ["errors"],
+ "request": {
+ "path": "/v1/chat/completions",
+ "raw_body": "{\"model\": \"@openai.chat\", \"messages\": [ "
+ },
+ "expect": {
+ "status": [400],
+ "response": [{"path": "$.error.message", "not_empty": true}]
+ },
+ "notes": "Truncated JSON must yield a 400 with a clear error message (raw_body is sent verbatim)."
+ }
+]
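The `expect.response` entries in these spec files pair a JSONPath-like `path` with a matcher (`equals`, `not_empty`, `present`, `length_gte`, `gt`, `contains`) and an optional `"hard": false`. The runner that interprets them is not shown in this diff, so what follows is only a minimal sketch of one plausible reading. `resolve` and `check` are hypothetical names, only the `$.a.b[0]` subset of path syntax is handled, and treating `"hard": false` as a soft (report-only) check is an assumption.

```python
import re

_MISSING = object()  # sentinel: the path did not resolve
_TOKEN = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def resolve(doc, path):
    """Resolve a '$.a.b[0]' style path; return _MISSING if any step is absent."""
    if not path.startswith("$"):
        raise ValueError(f"unsupported path: {path}")
    cur, pos = doc, 1
    while pos < len(path):
        m = _TOKEN.match(path, pos)
        if m is None:
            raise ValueError(f"unsupported path: {path}")
        key, idx = m.group(1), m.group(2)
        if key is not None:
            if not isinstance(cur, dict) or key not in cur:
                return _MISSING
            cur = cur[key]
        else:
            i = int(idx)
            if not isinstance(cur, list) or i >= len(cur):
                return _MISSING
            cur = cur[i]
        pos = m.end()
    return cur


def check(doc, assertion):
    """Return (ok, hard) for one assertion dict (assumed semantics)."""
    value = resolve(doc, assertion["path"])
    hard = assertion.get("hard", True)
    found = value is not _MISSING
    ok = True
    if "present" in assertion:
        ok = ok and (found == assertion["present"])
    if "equals" in assertion:
        ok = ok and found and value == assertion["equals"]
    if "not_empty" in assertion:
        non_empty = found and value not in (None, "", [], {})
        ok = ok and (non_empty == assertion["not_empty"])
    if "length_gte" in assertion:
        ok = ok and found and hasattr(value, "__len__") and len(value) >= assertion["length_gte"]
    if "gt" in assertion:
        ok = ok and found and isinstance(value, (int, float)) and value > assertion["gt"]
    if "contains" in assertion:
        ok = ok and isinstance(value, str) and assertion["contains"] in value
    return ok, hard
```

Returning `hard` alongside the result lets a caller fail the case on hard misses while merely reporting soft ones, which matches how the `errors.openai_unknown_field_forwarded` case marks its message-content checks as `"hard": false`.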
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/messages.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/messages.json
new file mode 100644
index 00000000..16bac806
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/messages.json
@@ -0,0 +1,216 @@
+[
+ {
+ "id": "messages.anthropic.basic",
+ "title": "Messages: native Anthropic shape",
+ "provider": "anthropic",
+ "modality": ["text"],
+ "request": {
+ "path": "/v1/messages",
+ "body": {
+ "model": "@anthropic.chat",
+ "max_tokens": 64,
+ "messages": [{"role": "user", "content": "Capital of Canada? One word."}]
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [
+ {"path": "$.type", "equals": "message"},
+ {"path": "$.role", "equals": "assistant"},
+ {"path": "$.content[0].text", "not_empty": true},
+ {"path": "$.stop_reason", "present": true},
+ {"path": "$.usage.output_tokens", "gt": 0}
+ ],
+ "audit": [{"path": "$.provider", "equals": "anthropic"}],
+ "quality": [{"target": "response:$.content[0].text", "contains_any": ["ottawa"]}]
+ },
+ "notes": "Anthropic-native response: type=message, content blocks, input/output_tokens, stop_reason."
+ },
+ {
+ "id": "messages.anthropic.system",
+ "title": "Messages: top-level system prompt",
+ "provider": "anthropic",
+ "modality": ["text"],
+ "request": {
+ "path": "/v1/messages",
+ "body": {
+ "model": "@anthropic.chat",
+ "max_tokens": 64,
+ "system": "You always answer with a single word.",
+ "messages": [{"role": "user", "content": "Largest mammal?"}]
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.content[0].text", "not_empty": true}],
+ "quality": [{"target": "response:$.content[0].text", "contains_any": ["whale"]}]
+ },
+ "notes": "Anthropic system is a top-level field, not a message role."
+ },
+ {
+ "id": "messages.anthropic.stream",
+ "title": "Messages: streaming SSE ends with message_stop",
+ "provider": "anthropic",
+ "modality": ["text", "streaming"],
+ "request": {
+ "path": "/v1/messages",
+ "stream": true,
+ "body": {
+ "model": "@anthropic.chat",
+ "max_tokens": 64,
+ "stream": true,
+ "messages": [{"role": "user", "content": "Count from 1 to 3."}]
+ }
+ },
+ "expect": {
+ "status": 200,
+ "stream": {"min_events": 3, "terminal": "message_stop", "event_types": ["message_start", "content_block_delta"], "text": [{"not_empty": true}]}
+ },
+ "notes": "Native Anthropic event protocol relayed (message_start -> content_block_delta -> message_stop)."
+ },
+ {
+ "id": "messages.anthropic.vision",
+ "title": "Messages: image content block",
+ "provider": "anthropic",
+ "modality": ["vision"],
+ "request": {
+ "path": "/v1/messages",
+ "body": {
+ "model": "@anthropic.vision",
+ "max_tokens": 32,
+ "messages": [{"role": "user", "content": [
+ {"type": "text", "text": "One word: dominant color?"},
+ {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "@imageb64.red"}}
+ ]}]
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.content[0].text", "not_empty": true}],
+ "quality": [{"target": "response:$.content[0].text", "contains_any": ["red"]}]
+ },
+ "notes": "Native Anthropic base64 image source (raw base64, media_type separate)."
+ },
+ {
+ "id": "messages.anthropic.tools_auto",
+ "title": "Messages: tool definition, tool_choice auto",
+ "provider": "anthropic",
+ "modality": ["tools"],
+ "request": {
+ "path": "/v1/messages",
+ "body": {
+ "model": "@anthropic.chat",
+ "max_tokens": 256,
+ "tools": [{"name": "get_weather", "description": "weather for a city", "input_schema": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}],
+ "tool_choice": {"type": "auto"},
+ "messages": [{"role": "user", "content": "What's the weather in Paris? Use the tool."}]
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.stop_reason", "present": true}],
+ "quality": [{"target": "response:$.stop_reason", "contains_any": ["tool_use"]}]
+ },
+ "notes": "Native Anthropic tool schema (input_schema) + tool_choice object."
+ },
+ {
+ "id": "messages.anthropic.tools_required",
+ "title": "Messages: tool_choice any forces a tool call",
+ "provider": "anthropic",
+ "modality": ["tools"],
+ "request": {
+ "path": "/v1/messages",
+ "body": {
+ "model": "@anthropic.chat",
+ "max_tokens": 256,
+ "tools": [{"name": "get_time", "description": "current time in a city", "input_schema": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}],
+ "tool_choice": {"type": "any"},
+ "messages": [{"role": "user", "content": "What time is it in Tokyo?"}]
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.stop_reason", "equals": "tool_use", "hard": false}],
+ "quality": [{"target": "response:$.stop_reason", "contains_any": ["tool_use"]}]
+ },
+ "notes": "tool_choice=any should force a tool_use stop_reason."
+ },
+ {
+ "id": "messages.anthropic.thinking",
+ "title": "Messages: extended thinking enabled",
+ "provider": "anthropic",
+ "modality": ["reasoning"],
+ "request": {
+ "path": "/v1/messages",
+ "body": {
+ "model": "@anthropic.thinking",
+ "max_tokens": 4000,
+ "thinking": {"type": "enabled", "budget_tokens": 1024},
+ "messages": [{"role": "user", "content": "If a train travels 60 km in 45 minutes, what is its speed in km/h? Show the number."}]
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.content[0].type", "present": true}],
+ "quality": [{"target": "response:$.stop_reason", "contains_any": ["end_turn", "stop"]}]
+ },
+ "notes": "Extended-thinking request must be accepted (adaptive vs budget_tokens handled by the gateway per model)."
+ },
+ {
+ "id": "messages.anthropic.default_max_tokens",
+ "title": "Messages: missing max_tokens is injected by the gateway",
+ "provider": "anthropic",
+ "modality": ["text"],
+ "request": {
+ "path": "/v1/messages",
+ "body": {
+ "model": "@anthropic.chat",
+ "messages": [{"role": "user", "content": "Say hi in one word."}]
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.content[0].text", "not_empty": true}]
+ },
+ "notes": "Anthropic requires max_tokens; the gateway injects a default so a user request without it still succeeds (good defaults)."
+ },
+ {
+ "id": "messages.anthropic.count_tokens",
+ "title": "Messages: count_tokens",
+ "provider": "anthropic",
+ "modality": ["text"],
+ "request": {
+ "path": "/v1/messages/count_tokens",
+ "body": {
+ "model": "@anthropic.chat",
+ "messages": [{"role": "user", "content": "How many tokens is this sentence, roughly?"}]
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.input_tokens", "gt": 0}]
+ },
+ "notes": "Token-counting endpoint returns input_tokens without a provider completion call."
+ },
+ {
+ "id": "messages.anthropic.metadata_preserved",
+ "title": "Messages: metadata.user_id (valid Anthropic field) is preserved",
+ "provider": "anthropic",
+ "modality": ["preservation"],
+ "request": {
+ "path": "/v1/messages",
+ "body": {
+ "model": "@anthropic.chat",
+ "max_tokens": 16,
+ "metadata": {"user_id": "qa-789"},
+ "messages": [{"role": "user", "content": "Say OK."}]
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.content[0].text", "not_empty": true}],
+ "audit": [{"path": "$.data.request_body.metadata.user_id", "equals": "qa-789", "hard": false}]
+ },
+ "notes": "metadata is a first-class Anthropic field; audit confirms the gateway recorded it as sent."
+ }
+]
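The streaming cases describe their expectations as `{"min_events": ..., "terminal": ..., "event_types": [...]}`. As a rough illustration of how such a block could be checked against a raw SSE body, here is a small sketch. `parse_sse` and `check_stream` are hypothetical helpers, not the harness's own code, and the sketch assumes `\n` line endings, JSON `data:` payloads (apart from an OpenAI-style `[DONE]` sentinel, which is skipped), and that `terminal` means "the last event's type".

```python
import json


def parse_sse(raw):
    """Split a raw SSE body into (event_name, data_dict) pairs."""
    events = []
    for block in raw.strip().split("\n\n"):
        name, data_lines = None, []
        for line in block.splitlines():
            if line.startswith("event:"):
                name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        payload = "\n".join(data_lines)
        if not payload or payload == "[DONE]":
            continue
        data = json.loads(payload)
        # Fall back to the payload's own "type" when no event: line is present.
        events.append((name or data.get("type"), data))
    return events


def check_stream(events, expect):
    """Return a list of human-readable failures (an empty list means pass)."""
    failures = []
    names = [n for n, _ in events]
    min_events = expect.get("min_events", 0)
    if len(events) < min_events:
        failures.append(f"expected >= {min_events} events, got {len(events)}")
    terminal = expect.get("terminal")
    last = names[-1] if names else None
    if terminal and last != terminal:
        failures.append(f"last event is {last!r}, expected {terminal!r}")
    for wanted in expect.get("event_types", []):
        if wanted not in names:
            failures.append(f"event type {wanted!r} never seen")
    return failures
```

Collecting failures in a list rather than raising on the first one keeps a single run informative when a gateway relays the right events but terminates the stream differently.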
diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/responses.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/responses.json
new file mode 100644
index 00000000..d1ab7f1a
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/responses.json
@@ -0,0 +1,198 @@
+[
+ {
+ "id": "responses.openai.basic_string",
+ "title": "Responses: plain string input",
+ "provider": "openai",
+ "modality": ["text"],
+ "request": {
+ "path": "/v1/responses",
+ "body": {"model": "@openai.chat", "input": "What is the capital of France? One word.", "max_output_tokens": 64}
+ },
+ "expect": {
+ "status": 200,
+ "response": [
+ {"path": "$.object", "equals": "response"},
+ {"path": "$.status", "equals": "completed"},
+ {"path": "$.output[0].content[0].text", "not_empty": true}
+ ],
+ "audit": [{"path": "$.provider", "equals": "openai"}],
+ "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["paris"]}]
+ },
+ "notes": "Native Responses shape: output[].content[].text, status=completed."
+ },
+ {
+ "id": "responses.openai.instructions",
+ "title": "Responses: instructions + string input",
+ "provider": "openai",
+ "modality": ["text"],
+ "request": {
+ "path": "/v1/responses",
+ "body": {"model": "@openai.chat", "instructions": "Answer in exactly one word.", "input": "Largest ocean?", "max_output_tokens": 64}
+ },
+ "expect": {
+ "status": 200,
+ "response": [
+ {"path": "$.status", "equals": "completed"},
+ {"path": "$.output[0].content[0].text", "not_empty": true}
+ ],
+ "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["pacific"]}]
+ },
+ "notes": "instructions become a system-equivalent prompt."
+ },
+ {
+ "id": "responses.openai.multimodal_image",
+ "title": "Responses: multi-part input_text + input_image",
+ "provider": "openai",
+ "modality": ["vision"],
+ "request": {
+ "path": "/v1/responses",
+ "body": {
+ "model": "@openai.vision",
+ "input": [{"role": "user", "content": [
+ {"type": "input_text", "text": "One word: dominant color?"},
+ {"type": "input_image", "image_url": "@image.red"}
+ ]}],
+ "max_output_tokens": 32
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.output[0].content[0].text", "not_empty": true}],
+ "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["red"]}]
+ },
+ "notes": "input_image content part normalized to a chat image part."
+ },
+ {
+ "id": "responses.openai.stream",
+ "title": "Responses: streaming output_text deltas + response.completed",
+ "provider": "openai",
+ "modality": ["text", "streaming"],
+ "request": {
+ "path": "/v1/responses",
+ "stream": true,
+ "body": {"model": "@openai.chat", "input": "Count from 1 to 3.", "stream": true, "max_output_tokens": 64}
+ },
+ "expect": {
+ "status": 200,
+ "stream": {"min_events": 2, "terminal": "response.completed", "text": [{"not_empty": true}]}
+ },
+ "notes": "Responses SSE event protocol (output_text.delta -> response.completed)."
+ },
+ {
+ "id": "responses.openai.tools",
+ "title": "Responses: function tool",
+ "provider": "openai",
+ "modality": ["tools"],
+ "request": {
+ "path": "/v1/responses",
+ "body": {
+ "model": "@openai.chat",
+ "input": "Use the tool to get the weather in Rome.",
+ "tools": [{"type": "function", "name": "get_weather", "description": "weather for a city", "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}],
+ "max_output_tokens": 128
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.status", "equals": "completed"}],
+ "quality": [{"target": "response:$.output[0].type", "contains_any": ["function_call", "message"]}]
+ },
+ "notes": "Responses tool schema (flat name/parameters) handled."
+ },
+ {
+ "id": "responses.openai.structured_text_format",
+ "title": "Responses: structured output via text.format json_schema",
+ "provider": "openai",
+ "modality": ["structured"],
+ "request": {
+ "path": "/v1/responses",
+ "body": {
+ "model": "@openai.chat",
+ "input": "Capital of Spain.",
+ "text": {"format": {"type": "json_schema", "name": "cap", "strict": true, "schema": {"type": "object", "properties": {"capital": {"type": "string"}}, "required": ["capital"], "additionalProperties": false}}},
+ "max_output_tokens": 64
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.output[0].content[0].text", "not_empty": true}],
+ "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["madrid"]}]
+ },
+ "notes": "text.format maps to chat response_format/json_schema for non-native providers."
+ },
+ {
+ "id": "responses.openai.reasoning_effort",
+ "title": "Responses: reasoning model with effort",
+ "provider": "openai",
+ "modality": ["reasoning"],
+ "request": {
+ "path": "/v1/responses",
+ "body": {"model": "@openai.reasoning", "input": "What is 6 times 7? Number only.", "reasoning": {"effort": "low"}, "max_output_tokens": 2000}
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.status", "equals": "completed"}],
+ "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["42"]}]
+ },
+ "notes": "reasoning.effort accepted on the Responses API."
+ },
+ {
+ "id": "responses.openai.metadata_preserved",
+ "title": "Responses: metadata (valid optional field) is preserved",
+ "provider": "openai",
+ "modality": ["preservation"],
+ "request": {
+ "path": "/v1/responses",
+ "body": {"model": "@openai.chat", "input": "Say OK.", "metadata": {"qa_case": "resp-extra"}, "max_output_tokens": 16}
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.status", "equals": "completed"}],
+ "audit": [{"path": "$.data.request_body.metadata.qa_case", "equals": "resp-extra", "hard": false}]
+ },
+ "notes": "metadata is a first-class Responses field; audit confirms the gateway recorded it as sent."
+ },
+ {
+ "id": "responses.anthropic.basic",
+ "title": "Responses adapter -> Anthropic",
+ "provider": "anthropic",
+ "modality": ["text"],
+ "request": {
+ "path": "/v1/responses",
+ "body": {"model": "@anthropic.chat", "input": "Capital of Germany? One word.", "max_output_tokens": 64}
+ },
+ "expect": {
+ "status": 200,
+ "response": [
+ {"path": "$.status", "equals": "completed"},
+ {"path": "$.output[0].content[0].text", "not_empty": true}
+ ],
+ "audit": [{"path": "$.provider", "equals": "anthropic"}],
+ "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["berlin"]}]
+ },
+ "notes": "Non-native provider served through the Responses->chat adapter and renormalized to Responses shape."
+ },
+ {
+ "id": "responses.gemini.image",
+ "title": "Responses adapter -> Gemini with image input",
+ "provider": "gemini",
+ "modality": ["vision"],
+ "request": {
+ "path": "/v1/responses",
+ "body": {
+ "model": "@gemini.vision",
+ "input": [{"role": "user", "content": [
+ {"type": "input_text", "text": "One word: color?"},
+ {"type": "input_image", "image_url": "@image.blue"}
+ ]}],
+ "max_output_tokens": 32
+ }
+ },
+ "expect": {
+ "status": 200,
+ "response": [{"path": "$.output[0].content[0].text", "not_empty": true}],
+ "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["blue"]}]
+ },
+ "notes": "Responses multimodal input adapted to Gemini inline_data."
+ }
+]
diff --git a/docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover.py b/docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover.py
new file mode 100644
index 00000000..182aa0c7
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover.py
@@ -0,0 +1,82 @@
+#!/usr/bin/env python3
+"""Generate the catchy dark cover image for the June 2026 gateway benchmark post.
+
+Thesis-driven: latency is overrated, the resource bill isn't. So the hero visual
+is the resource gap (Docker image + peak RAM), GoModel highlighted.
+"""
+import sys
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+from matplotlib import font_manager as fm
+
+BG = "#0b0e14"
+PANEL = "#11161f"
+TEXT = "#e6edf3"
+MUTED = "#8b98a9"
+GREEN = "#34d399" # GoModel
+RED = "#f87171" # LiteLLM
+GRAY = "#5b6675" # others
+
+def font(weight="normal", size=12, black=False):
+    fam = "Arial Black" if black else "Arial"
+    return fm.FontProperties(family=fam, weight=weight, size=size)
+
+# data (June 2026 c7i.large run) - ascending, so GoModel (the winner) sits on top
+# and the giant red LiteLLM bar at the bottom. Image = compressed pull size; RAM =
+# peak under load (LiteLLM at its recommended one-worker-per-core config).
+IMG = [("GoModel", 16, GREEN), ("Portkey", 59, GRAY), ("Bifrost", 77, GRAY), ("LiteLLM", 372, RED)]
+RAM = [("GoModel", 37, GREEN), ("Portkey", 112, GRAY), ("Bifrost", 143, GRAY), ("LiteLLM", 2272, RED)]
+
+W, H, DPI = 2400, 1260, 200
+fig = plt.figure(figsize=(W / DPI, H / DPI), dpi=DPI)
+fig.patch.set_facecolor(BG)
+
+# ── left text column (top-anchored so positions are predictable) ───
+T = dict(va="top", ha="left")
+fig.text(0.045, 0.93, "AI GATEWAY BENCHMARK · JUNE 25, 2026", color=GREEN,
+         fontproperties=font(size=14.5, weight="bold"), **T)
+fig.text(0.043, 0.84, "LATENCY IS", color=TEXT, fontproperties=font(size=39, black=True), **T)
+fig.text(0.043, 0.725, "OVERRATED", color=TEXT, fontproperties=font(size=39, black=True), **T)
+fig.text(0.043, 0.585, "LOOK AT THE BILL", color=GREEN, fontproperties=font(size=35, black=True), **T)
+fig.add_artist(plt.Line2D([0.045, 0.405], [0.475, 0.475], color="#1f2733", lw=2))
+fig.text(0.045, 0.45, "GoModel — the fastest,\nmost lightweight AI\ngateway in the world",
+         color=GREEN, fontproperties=font(size=18, weight="bold"), linespacing=1.4, **T)
+
+def panel(rect, title, rows, unit, ref):
+    ax = fig.add_axes(rect)
+    ax.set_facecolor(PANEL)
+    for s in ax.spines.values():
+        s.set_visible(False)
+    ax.tick_params(left=False, bottom=False, labelbottom=False)
+    labels = [r[0] for r in rows]
+    vals = [r[1] for r in rows]
+    colors = [r[2] for r in rows]
+    y = range(len(rows))
+    maxv = max(vals)
+    ax.barh(y, vals, color=colors, height=0.62, zorder=3)
+    ax.set_xlim(0, maxv * 1.34)  # headroom so value labels never clip
+    ax.set_ylim(-0.6, len(rows) - 0.4)
+    ax.invert_yaxis()
+    ax.set_yticks(list(y))
+    ax.set_yticklabels(labels, color=TEXT, fontproperties=font(size=14, weight="bold"))
+    for i, v in enumerate(vals):
+        mult = v / ref
+        tag = "1×" if abs(mult - 1) < 0.05 else f"{mult:.0f}×"
+        label = f"{v:,} {unit} ({tag})"
+        if colors[i] == RED:  # the worst: label centered inside the bar, dark text
+            ax.text(v / 2, i, label, va="center", ha="center", color=BG,
+                    fontproperties=font(size=12.5, weight="bold"))
+        else:
+            ax.text(v + maxv * 0.02, i, label, va="center", ha="left",
+                    color=TEXT if colors[i] != GRAY else MUTED,
+                    fontproperties=font(size=12.5, weight="bold"))
+    ax.set_title(title, loc="left", color=MUTED, fontproperties=font(size=14, weight="bold"), pad=8)
+    return ax
+
+panel([0.55, 0.575, 0.36, 0.295], "DOCKER IMAGE (COMPRESSED)", IMG, "MB", 16)
+panel([0.55, 0.135, 0.36, 0.295], "PEAK RAM UNDER LOAD", RAM, "MB", 37)
+
+out = sys.argv[1] if len(sys.argv) > 1 else "cover.png"
+fig.savefig(out, facecolor=BG, dpi=DPI)
+print("wrote", out)
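The "N×" tags on the cover come from dividing each value by the GoModel reference and rounding to a whole number. The sketch below reruns that arithmetic outside matplotlib so the printed multipliers can be checked by hand. `multiplier_tag` is a hypothetical standalone name for the expression used inside `panel`, and the numbers are the `IMG` and `RAM` values from the script above.

```python
def multiplier_tag(value, ref):
    """Format value/ref as a whole-number multiplier, as the cover labels do."""
    mult = value / ref
    return "1×" if abs(mult - 1) < 0.05 else f"{mult:.0f}×"


IMG = {"GoModel": 16, "Portkey": 59, "Bifrost": 77, "LiteLLM": 372}
RAM = {"GoModel": 37, "Portkey": 112, "Bifrost": 143, "LiteLLM": 2272}

for title, rows in (("image", IMG), ("ram", RAM)):
    ref = rows["GoModel"]
    for name, mb in rows.items():
        print(f"{title:5} {name:8} {mb:>5,} MB  {multiplier_tag(mb, ref)}")
```

Because the tag is rounded to zero decimals, Portkey's image ratio of 59/16 ≈ 3.69 is displayed as 4×, and LiteLLM's RAM ratio of 2272/37 ≈ 61.4 as 61×.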
diff --git a/docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover_b.py b/docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover_b.py
new file mode 100644
index 00000000..7a71dc89
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover_b.py
@@ -0,0 +1,78 @@
+#!/usr/bin/env python3
+"""Cover for the measured benchmark post variant (B).
+
+Same hero visual as make_cover.py (the resource gap: Docker image + peak RAM,
+GoModel highlighted). The text is a single takeaway -
+"Four gateways, one backend - GoModel wins" - with no cost-question framing.
+"""
+import sys
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+from matplotlib import font_manager as fm
+
+BG = "#0b0e14"
+PANEL = "#11161f"
+TEXT = "#e6edf3"
+MUTED = "#8b98a9"
+GREEN = "#34d399" # GoModel
+RED = "#f87171" # LiteLLM
+GRAY = "#5b6675" # others
+
+def font(weight="normal", size=12, black=False):
+    fam = "Arial Black" if black else "Arial"
+    return fm.FontProperties(family=fam, weight=weight, size=size)
+
+# data (June 2026 c7i.large run) - ascending, GoModel (winner) on top.
+IMG = [("GoModel", 16, GREEN), ("Portkey", 59, GRAY), ("Bifrost", 77, GRAY), ("LiteLLM", 372, RED)]
+RAM = [("GoModel", 37, GREEN), ("Portkey", 112, GRAY), ("Bifrost", 143, GRAY), ("LiteLLM", 2272, RED)]
+
+W, H, DPI = 2400, 1260, 200
+fig = plt.figure(figsize=(W / DPI, H / DPI), dpi=DPI)
+fig.patch.set_facecolor(BG)
+
+# ── left text column (single takeaway, no cost-question headline) ───
+T = dict(va="top", ha="left")
+fig.text(0.045, 0.93, "AI GATEWAY BENCHMARK · JUNE 25, 2026", color=GREEN,
+         fontproperties=font(size=14.5, weight="bold"), **T)
+fig.text(0.043, 0.72, "Four gateways,", color=TEXT, fontproperties=font(size=33, black=True), **T)
+fig.text(0.043, 0.60, "one backend —", color=TEXT, fontproperties=font(size=33, black=True), **T)
+fig.text(0.043, 0.48, "GoModel wins", color=GREEN, fontproperties=font(size=33, black=True), **T)
+
+def panel(rect, title, rows, unit, ref):
+    ax = fig.add_axes(rect)
+    ax.set_facecolor(PANEL)
+    for s in ax.spines.values():
+        s.set_visible(False)
+    ax.tick_params(left=False, bottom=False, labelbottom=False)
+    labels = [r[0] for r in rows]
+    vals = [r[1] for r in rows]
+    colors = [r[2] for r in rows]
+    y = range(len(rows))
+    maxv = max(vals)
+    ax.barh(y, vals, color=colors, height=0.62, zorder=3)
+    ax.set_xlim(0, maxv * 1.34)
+    ax.set_ylim(-0.6, len(rows) - 0.4)
+    ax.invert_yaxis()
+    ax.set_yticks(list(y))
+    ax.set_yticklabels(labels, color=TEXT, fontproperties=font(size=14, weight="bold"))
+    for i, v in enumerate(vals):
+        mult = v / ref
+        tag = "1×" if abs(mult - 1) < 0.05 else f"{mult:.0f}×"
+        label = f"{v:,} {unit} ({tag})"
+        if colors[i] == RED:  # the worst: label centered inside the bar, dark text
+            ax.text(v / 2, i, label, va="center", ha="center", color=BG,
+                    fontproperties=font(size=12.5, weight="bold"))
+        else:
+            ax.text(v + maxv * 0.02, i, label, va="center", ha="left",
+                    color=TEXT if colors[i] != GRAY else MUTED,
+                    fontproperties=font(size=12.5, weight="bold"))
+    ax.set_title(title, loc="left", color=MUTED, fontproperties=font(size=14, weight="bold"), pad=8)
+    return ax
+
+panel([0.55, 0.575, 0.36, 0.295], "DOCKER IMAGE (COMPRESSED)", IMG, "MB", 16)
+panel([0.55, 0.135, 0.36, 0.295], "PEAK RAM UNDER LOAD", RAM, "MB", 37)
+
+out = sys.argv[1] if len(sys.argv) > 1 else "cover-b.png"
+fig.savefig(out, facecolor=BG, dpi=DPI)
+print("wrote", out)
diff --git a/docs/2026-06-25_aws_gateway_benchmark/translation/.gitignore b/docs/2026-06-25_aws_gateway_benchmark/translation/.gitignore
new file mode 100644
index 00000000..179b4868
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/translation/.gitignore
@@ -0,0 +1,3 @@
+output/
+__pycache__/
+*.pyc
diff --git a/docs/2026-06-25_aws_gateway_benchmark/translation/README.md b/docs/2026-06-25_aws_gateway_benchmark/translation/README.md
new file mode 100644
index 00000000..82487965
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/translation/README.md
@@ -0,0 +1,75 @@
+# Gateway translation-fidelity analysis
+
+How faithfully does each AI gateway translate a request? This harness sends the
+**same** client request through **GoModel, LiteLLM, Portkey, and Bifrost**, all
+pointed at the **same recording mock provider**, and captures — per case, per
+gateway — four artifacts:
+
+| artifact | meaning |
+|---|---|
+| `client_request` | what we sent to the gateway (the **pure** request) |
+| `sent_body` | the body after per-gateway rewrites (e.g. Bifrost's `openai/` model prefix) |
+| `upstream` | the request the gateway actually sent to the provider (the **translated** request) + the canned (**pure**) response the mock returned |
+| `client_response` | what the gateway returned to us (the **translated** response) |
+
+Then an AI analyzes each case across gateways: what each one added, dropped,
+renamed, or reshaped — request *and* response — and which is most faithful.
+
+A recording mock (not real providers) is the only way to observe the translated
+*upstream* request: real providers don't echo what the gateway sent them.
+
+## Why a mock, and what "pure" means
+
+- **Pure request** = the original client body. **Translated request** = what the
+ gateway emitted upstream (captured by the mock).
+- **Pure response** = the deterministic provider-shaped body the mock returned
+ (enriched with `system_fingerprint`, `service_tier`, and a non-standard
+ `x_provider_note` so we can see which gateways preserve provider extras).
+ **Translated response** = what the gateway returned to the client.
+- The comparison axis is **gateway vs gateway** — every case uses the same model
+ (`gpt-4o-mini`) routed to the mock, so differences are the gateway's doing, not
+ the provider's.
+
+## Pieces
+
+```text
+docker-compose.yml   mock (MOCK_RECORD=1) + all 4 gateways, reusing ../remote configs
+corpus.json          12 gateway-agnostic cases across chat/responses/messages, streaming and non-streaming
+capture.py           resets the mock, sends each case through each gateway, records 4 artifacts
+analyze.py           splits the captures into per-case bundles for the AI analyst, then renders the report
+output/              captures.json + the AI comparison report (gitignored)
+```
+
+The recording mock lives in `../remote/bench-tools/mock/main.go` (recording is
+gated behind `MOCK_RECORD=1`, so the latency benchmark stays byte-identical).
+
+## Run it
+
+```bash
+# 0. from the repo root, enter this directory and build the GoModel image once (native arch):
+cd docs/2026-06-25_aws_gateway_benchmark/translation
+docker build -t gomodel-bench:local ../../..
+
+# 1. bring up the recording mock + all four gateways:
+docker compose --profile all up -d --build
+
+# 2. capture translations (resets the mock before each call):
+python3 capture.py # -> output/captures.json
+
+# 3. tear down:
+docker compose --profile all down
+```
+
+No real provider keys or spend — every gateway talks to the local mock.
+
+## Per-gateway addressing (handled by capture.py)
+
+| gateway | port | model | messages path | extra headers |
+|---|--|---|---|---|
+| GoModel | 18080 | `gpt-4o-mini` | `/v1/messages` | — |
+| LiteLLM | 4000 | `gpt-4o-mini` | `/v1/messages` | — |
+| Portkey | 8787 | `gpt-4o-mini` | `/v1/messages` | `x-portkey-provider`, `x-portkey-custom-host` |
+| Bifrost | 8089 | `openai/gpt-4o-mini` | `/anthropic/v1/messages` | — |
+
+Dialects a gateway doesn't serve are not skipped — the non-200 (and empty
+upstream log) is recorded, because that asymmetry is itself a finding.
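The README frames each case as a comparison between the pure `client_request` and the translated `upstream` request: what a gateway added, dropped, or renamed. A first mechanical pass over two request bodies might look like the sketch below. `diff_request` is a hypothetical helper, not part of `capture.py` or `analyze.py`; it only inspects top-level keys and guesses renames by matching values, so it is a starting point for the AI analyst rather than a replacement.

```python
def diff_request(client_body, upstream_body):
    """Compare top-level keys of the pure and the translated request body."""
    client_keys, upstream_keys = set(client_body), set(upstream_body)
    added = sorted(upstream_keys - client_keys)
    dropped = sorted(client_keys - upstream_keys)
    # A rename shows up as one dropped key and one added key carrying the same value.
    renamed = sorted(
        (d, a) for d in dropped for a in added
        if client_body[d] == upstream_body[a]
    )
    changed = sorted(
        k for k in client_keys & upstream_keys
        if client_body[k] != upstream_body[k]
    )
    return {"added": added, "dropped": dropped,
            "renamed_candidates": renamed, "changed": changed}


# Illustrative bodies only; real ones would come from output/captures.json.
client = {"model": "gpt-4o-mini", "max_tokens": 8,
          "messages": [{"role": "user", "content": "hi"}]}
upstream = {"model": "openai/gpt-4o-mini", "max_completion_tokens": 8,
            "messages": [{"role": "user", "content": "hi"}]}
print(diff_request(client, upstream))
```

For these illustrative bodies the result flags `model` as changed and `max_tokens` to `max_completion_tokens` as a rename candidate, which are the kinds of observations the per-case analysis records under `request_renamed` and `request_reshaped`.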
diff --git a/docs/2026-06-25_aws_gateway_benchmark/translation/analyze.py b/docs/2026-06-25_aws_gateway_benchmark/translation/analyze.py
new file mode 100644
index 00000000..9f2f0762
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/translation/analyze.py
@@ -0,0 +1,156 @@
+#!/usr/bin/env python3
+"""Glue for the AI translation analysis.
+
+  analyze.py --split    read output/captures.json, write one self-contained
+                        bundle per case to output/cases/<case_id>.json (the input
+                        an AI analyst reviews for that case)
+  analyze.py --render   read output/analysis/<case_id>.json (the AI's structured
+                        verdict per case) + captures.json, write output/report.md
+
+The actual case-by-case comparison is done by an AI analyst (one per case): it
+reads a bundle and writes its verdict to output/analysis/<case_id>.json following the
+schema documented in --split's banner. Stdlib only.
+"""
+import argparse
+import glob
+import json
+import os
+
+HERE = os.path.dirname(os.path.abspath(__file__))
+OUT = os.path.join(HERE, "output")
+GATEWAYS = ["gomodel", "litellm", "portkey", "bifrost"]
+
+ANALYSIS_SCHEMA = {
+    "case_id": "string",
+    "verdict_per_gateway": {
+        "<gateway>": {
+            "reached_provider": "bool — did the gateway make an upstream call?",
+            "upstream_path": "the path it called on the mock",
+            "request_added": ["fields/headers the gateway ADDED vs the client request"],
+            "request_dropped": ["client fields the gateway DROPPED before upstream"],
+            "request_renamed": ["client->upstream field renames, e.g. max_tokens->max_completion_tokens"],
+            "request_reshaped": "prose: structural changes (dialect translation, message shape, tool schema)",
+            "response_extras_preserved": ["provider extras kept in the client response: system_fingerprint/service_tier/x_provider_note/usage"],
+            "response_extras_dropped": ["provider extras the gateway stripped"],
+            "response_reshaped": "prose: how the upstream response was renormalized for the client",
+            "fidelity_score": "0-100 int: how faithfully intent was preserved end-to-end",
+            "notes": "anything notable"
+        }
+    },
+    "cross_gateway_findings": ["concise comparative observations"],
+    "ranking": ["gateways best->worst fidelity for this case"],
+}
+
+
+def split():
+    caps = json.load(open(os.path.join(OUT, "captures.json"), encoding="utf-8"))
+    d = os.path.join(OUT, "cases")
+    os.makedirs(d, exist_ok=True)
+    ids = []
+    for cid, case in caps["cases"].items():
+        bundle = {"case_id": cid, "dialect": case["dialect"], "stream": case["stream"],
+                  "intent_note": case["note"], "client_request": case["client_request"],
+                  "gateways": case["gateways"]}
+        json.dump(bundle, open(os.path.join(d, f"{cid}.json"), "w", encoding="utf-8"), indent=2)
+        ids.append(cid)
+    print(f"wrote {len(ids)} case bundles to {d}")
+    for cid in ids:
+        print(" ", cid)
+
+
+def _esc(s):
+    # AI-authored cell values may contain `|` or newlines that would break the
+    # Markdown table; escape pipes and collapse newlines to spaces.
+    return str(s).replace("|", "\\|").replace("\r", " ").replace("\n", " ")
+
+
+def _cell(items):
+    if not items:
+        return "—"
+    return _esc("; ".join(str(x) for x in items)[:120])
+
+
+def render():
+    caps = json.load(open(os.path.join(OUT, "captures.json"), encoding="utf-8"))
+    analyses = {}
+    for p in glob.glob(os.path.join(OUT, "analysis", "*.json")):
+        try:
+            a = json.load(open(p, encoding="utf-8"))
+            analyses[a.get("case_id", os.path.basename(p)[:-5])] = a
+        except (OSError, ValueError):
+            pass
+
+    gws = caps["meta"]["gateways"]
+    L = ["# Gateway translation-fidelity report\n",
+         "Same request through each gateway, same mock provider. The AI analyst "
+         "compared the translated upstream request vs the pure client request, and "
+         "the translated client response vs the pure mock response, per case.\n",
+         f"`gateways: {', '.join(gws)}` · `cases: {len(caps['cases'])}`\n"]
+
+    # ── aggregate scoreboard ──────────────────────────────────────────────────
+    scores = {g: [] for g in gws}
+    for a in analyses.values():
+        for g, v in (a.get("verdict_per_gateway") or {}).items():
+            s = v.get("fidelity_score")
+            if isinstance(s, (int, float)):
+                scores.setdefault(g, []).append(s)
+    L.append("## Fidelity scoreboard (mean of per-case AI scores)\n")
+    L.append("| gateway | mean fidelity | cases scored |")
+    L.append("|---|--:|--:|")
+    for g in gws:
+        vals = scores.get(g, [])
+        mean = round(sum(vals) / len(vals)) if vals else 0
+        L.append(f"| {g} | {mean} | {len(vals)} |")
+    L.append("")
+
+    # ── per-case detail ────────────────────────────────────────────────────────
+    for cid, case in caps["cases"].items():
+        a = analyses.get(cid)
+        L.append(f"## `{cid}` — {case['dialect']}{', stream' if case['stream'] else ''}\n")
+        L.append(f"_{case['note']}_\n")
+        if not a:
+            L.append("> _no AI analysis recorded for this case_\n")
+            continue
+        L.append("| gateway | upstream | added | dropped | renamed | resp extras kept | resp dropped | fidelity |")
+        L.append("|---|---|---|---|---|---|---|--:|")
+        for g in gws:
+            v = (a.get("verdict_per_gateway") or {}).get(g)
+            if not v:
+                L.append(f"| {g} | — | — | — | — | — | — | — |")
+                continue
+            L.append(f"| {g} | {_esc(v.get('upstream_path','—'))} | {_cell(v.get('request_added'))} | "
+                     f"{_cell(v.get('request_dropped'))} | {_cell(v.get('request_renamed'))} | "
+                     f"{_cell(v.get('response_extras_preserved'))} | {_cell(v.get('response_extras_dropped'))} | "
+                     f"{_esc(v.get('fidelity_score','—'))} |")
+        L.append("")
+        if a.get("cross_gateway_findings"):
+            L.append("**Findings:**")
+            for f in a["cross_gateway_findings"]:
+                L.append(f"- {f}")
+            L.append("")
+        if a.get("ranking"):
+            L.append(f"**Fidelity ranking:** {' > '.join(a['ranking'])}\n")
+
+    path = os.path.join(OUT, "report.md")
+    open(path, "w", encoding="utf-8").write("\n".join(L))
+    print(f"wrote {path}")
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--split", action="store_true")
+    ap.add_argument("--render", action="store_true")
+    ap.add_argument("--schema", action="store_true", help="print the analysis JSON schema")
+    args = ap.parse_args()
+    if args.schema:
+        print(json.dumps(ANALYSIS_SCHEMA, indent=2))
+    elif args.split:
+        split()
+    elif args.render:
+        render()
+    else:
+        ap.error("one of --split / --render / --schema required")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/docs/2026-06-25_aws_gateway_benchmark/translation/capture.py b/docs/2026-06-25_aws_gateway_benchmark/translation/capture.py
new file mode 100644
index 00000000..73201219
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/translation/capture.py
@@ -0,0 +1,228 @@
+#!/usr/bin/env python3
+"""Capture how each gateway translates the SAME client request to the SAME mock.
+
+For every (case, gateway) it records four artifacts:
+  - client_request  : what we sent to the gateway (the "pure" request)
+  - sent_body       : the body after per-gateway model rewrite
+  - upstream        : the request(s) the gateway actually sent to the mock
+                      (the TRANSLATED request) + the canned ("pure") response
+  - client_response : what the gateway returned to us (the TRANSLATED response)
+
+The mock is reset before each call and requests are sent one at a time, so the
+shared recorder attributes each upstream call to the gateway+case that made it.
+Stdlib only. Output: output/captures.json.
+"""
+import argparse
+import copy
+import json
+import os
+import sys
+import time
+import urllib.error
+import urllib.request
+
+HERE = os.path.dirname(os.path.abspath(__file__))
+MOCK = "http://localhost:9999"
+
+# Per-gateway base URL is env-overridable (e.g. GOMODEL_BASE) so a local dev
+# server on a default port doesn't force a clash.
+GATEWAYS = {
+    "gomodel": {"base": os.environ.get("GOMODEL_BASE", "http://localhost:18080")},
+    "litellm": {"base": os.environ.get("LITELLM_BASE", "http://localhost:4000")},
+    "portkey": {"base": os.environ.get("PORTKEY_BASE", "http://localhost:8787"),
+                "headers": {"x-portkey-provider": "openai",
+                            "x-portkey-custom-host": "http://mock:9999/v1"}},
+    "bifrost": {"base": os.environ.get("BIFROST_BASE", "http://localhost:8089")},
+}
+ORDER = ["gomodel", "litellm", "portkey", "bifrost"]
+DIALECT_PATH = {"chat": "/v1/chat/completions", "responses": "/v1/responses",
+                "messages": "/v1/messages"}
+
+
+def model_for(gw, m):
+    return "openai/" + m if gw == "bifrost" else m
+
+
+def path_for(gw, dialect):
+    if gw == "bifrost" and dialect == "messages":
+        return "/anthropic/v1/messages"
+    return DIALECT_PATH[dialect]
+
+
+def headers_for(gw):
+    h = {"Content-Type": "application/json", "Authorization": "Bearer sk-bench-test-key",
+         "anthropic-version": "2023-06-01"}
+    h.update(GATEWAYS[gw].get("headers", {}))
+    return h
+
+
+# ── HTTP ─────────────────────────────────────────────────────────────────────
+def post(url, headers, body, stream, timeout=30):
+    data = json.dumps(body).encode("utf-8")
+    req = urllib.request.Request(url, data=data, method="POST", headers=headers)
+    out = {"status": 0, "content_type": "", "json": None, "text": None,
+           "stream_events": 0, "stream_text": "", "terminal": None, "error": None}
+    try:
+        resp = urllib.request.urlopen(req, timeout=timeout)
+        _capture(out, resp, stream)
+    except urllib.error.HTTPError as e:
+        out["status"] = e.code
+        _capture(out, e, stream=False)
+    except Exception as e:  # noqa: BLE001
+        out["error"] = f"{type(e).__name__}: {e}"
+    return out
+
+
+def _capture(out, resp, stream):
+    out["status"] = getattr(resp, "status", out["status"]) or out["status"]
+    try:
+        out["content_type"] = resp.headers.get("content-type", "")
+    except Exception:  # noqa: BLE001
+        pass
+    if stream and "text/event-stream" in out["content_type"]:
+        for rawline in resp:
+            line = rawline.decode("utf-8", "replace").strip()
+            if not line.startswith("data:"):
+                continue
+            payload = line[5:].strip()
+            if payload == "[DONE]":
+                out["terminal"] = "[DONE]"
+                continue
+            out["stream_events"] += 1
+            try:
+                ev = json.loads(payload)
+            except Exception:  # noqa: BLE001
+                continue
+            t = ev.get("type")
+            if t in ("response.completed", "message_stop"):
+                out["terminal"] = t
+            for ch in ev.get("choices", []) or []:
+                d = (ch.get("delta") or {}).get("content")
+                if isinstance(d, str):
+                    out["stream_text"] += d
+            if t == "response.output_text.delta" and isinstance(ev.get("delta"), str):
+                out["stream_text"] += ev["delta"]
+            if t == "content_block_delta":
+                td = (ev.get("delta") or {}).get("text")
+                if isinstance(td, str):
+                    out["stream_text"] += td
+        return
+    raw = resp.read()
+    if "application/json" in out["content_type"]:
+        try:
+            out["json"] = json.loads(raw.decode("utf-8"))
+        except Exception:  # noqa: BLE001
+            out["text"] = raw.decode("utf-8", "replace")
+    else:
+        out["text"] = raw.decode("utf-8", "replace")[:4000]
+
+
+def get_json(url, timeout=10):
+    try:
+        resp = urllib.request.urlopen(urllib.request.Request(url, method="GET"), timeout=timeout)
+        return json.loads(resp.read().decode("utf-8"))
+    except Exception:  # noqa: BLE001
+        return None
+
+
+def mock_reset():
+    # Fail fast: a silently failed reset would attribute stale upstream calls to
+    # the wrong gateway/case and corrupt the captured corpus.
+    try:
+        resp = urllib.request.urlopen(
+            urllib.request.Request(MOCK + "/__reset", data=b"", method="POST"), timeout=5)
+        status = getattr(resp, "status", 200) or 200
+        resp.read()
+    except Exception as e:  # noqa: BLE001
+        sys.exit(f"mock reset failed ({MOCK}/__reset): {e} — aborting to avoid a corrupt corpus")
+    if status >= 400:
+        sys.exit(f"mock reset returned HTTP {status} ({MOCK}/__reset) — aborting to avoid a corrupt corpus")
+
+
+def wait_ready(gw, tries=60):
+    url = GATEWAYS[gw]["base"] + "/v1/chat/completions"
+    body = {"model": model_for(gw, "gpt-4o-mini"),
+            "messages": [{"role": "user", "content": "ping"}]}
+    for _ in range(tries):
+        r = post(url, headers_for(gw), body, stream=False, timeout=8)
+        if r["status"] == 200:
+            return True
+        time.sleep(2)
+    return False
+
+
+# ── trimming (keep artifacts readable) ────────────────────────────────────────
+def trim(obj, limit=1500):
+    if isinstance(obj, str):
+        return obj if len(obj) <= limit else obj[:limit] + f"…(+{len(obj) - limit})"
+    if isinstance(obj, list):
+        return [trim(x, limit) for x in obj]
+    if isinstance(obj, dict):
+        return {k: trim(v, limit) for k, v in obj.items()}
+    return obj
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--corpus", default=os.path.join(HERE, "corpus.json"))
+    ap.add_argument("--out", default=os.path.join(HERE, "output", "captures.json"))
+    ap.add_argument("--gateways", default=",".join(ORDER))
+    args = ap.parse_args()
+
+    gateways = [g.strip() for g in args.gateways.split(",") if g.strip()]
+    unknown = [g for g in gateways if g not in GATEWAYS]
+    if unknown:
+        ap.error(f"unknown gateway(s): {', '.join(unknown)}; valid options: {', '.join(ORDER)}")
+    corpus = json.load(open(args.corpus, encoding="utf-8"))
+
+    if get_json(MOCK + "/__log") is None:
+        print(f"mock not reachable at {MOCK} (is the stack up? is MOCK_RECORD=1?)", file=sys.stderr)
+        return 2
+
+    print("waiting for gateways…")
+    ready = {}
+    for gw in gateways:
+        ready[gw] = wait_ready(gw)
+        print(f" {gw:9} {'ready' if ready[gw] else 'NOT READY (will still attempt)'}")
+
+    results = {"meta": {"gateways": gateways, "ready": ready}, "cases": {}}
+    for case in corpus:
+        cid, dialect, stream = case["id"], case["dialect"], case.get("stream", False)
+        entry = {"note": case.get("note", ""), "dialect": dialect, "stream": stream,
+                 "client_request": case["body"], "gateways": {}}
+        print(f"\n{cid} ({dialect}{', stream' if stream else ''})")
+        for gw in gateways:
+            body = copy.deepcopy(case["body"])
+            body["model"] = model_for(gw, body["model"])
+            url = GATEWAYS[gw]["base"] + path_for(gw, dialect)
+            mock_reset()
+            resp = post(url, headers_for(gw), body, stream)
+            log = get_json(MOCK + "/__log") or {}
+            ups = log.get("entries") or []  # mock returns null when no upstream call was made
+            up_paths = ",".join(sorted({e.get("path", "?") for e in ups})) or "—"
+            print(f" {gw:9} http={resp['status'] or resp['error']:>4} "
+                  f"upstream={len(ups)} [{up_paths}]")
+            entry["gateways"][gw] = {
+                "sent_body": trim(body),
+                "url": url,
+                "client_response": {
+                    "status": resp["status"], "content_type": resp["content_type"],
+                    "error": resp["error"],
+                    "json": trim(resp["json"]) if resp["json"] is not None else None,
+                    "text": resp["text"],
+                    "stream_events": resp["stream_events"],
+                    "stream_text": trim(resp["stream_text"]) if resp["stream_text"] else "",
+                    "terminal": resp["terminal"],
+                },
+                "upstream": trim(ups),
+            }
+        results["cases"][cid] = entry
+
+    os.makedirs(os.path.dirname(args.out), exist_ok=True)
+    json.dump(results, open(args.out, "w", encoding="utf-8"), indent=2)
+    print(f"\nwrote {args.out}")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/docs/2026-06-25_aws_gateway_benchmark/translation/corpus.json b/docs/2026-06-25_aws_gateway_benchmark/translation/corpus.json
new file mode 100644
index 00000000..24c6fd1b
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/translation/corpus.json
@@ -0,0 +1,153 @@
+[
+ {
+ "id": "chat.simple",
+ "dialect": "chat",
+ "stream": false,
+ "note": "Baseline: does the body pass through unchanged? what auth/headers are injected upstream?",
+ "body": {
+ "model": "gpt-4o-mini",
+ "messages": [{"role": "user", "content": "What is the capital of France?"}]
+ }
+ },
+ {
+ "id": "chat.stream",
+ "dialect": "chat",
+ "stream": true,
+ "note": "Streaming framing: chunk shape, terminal marker, whether stream_options is forwarded.",
+ "body": {
+ "model": "gpt-4o-mini",
+ "messages": [{"role": "user", "content": "Count to three."}],
+ "stream": true,
+ "stream_options": {"include_usage": true}
+ }
+ },
+ {
+ "id": "chat.multiturn_system",
+ "dialect": "chat",
+ "stream": false,
+ "note": "System role + multi-turn: is the system message preserved in place and message order kept?",
+ "body": {
+ "model": "gpt-4o-mini",
+ "messages": [
+ {"role": "system", "content": "You are a terse assistant."},
+ {"role": "user", "content": "Largest planet?"},
+ {"role": "assistant", "content": "Jupiter."},
+ {"role": "user", "content": "Smallest?"}
+ ]
+ }
+ },
+ {
+ "id": "chat.params",
+ "dialect": "chat",
+ "stream": false,
+ "note": "Sampling params fidelity: which of these survive verbatim upstream (temperature/top_p/penalties/stop/seed/max_tokens)?",
+ "body": {
+ "model": "gpt-4o-mini",
+ "messages": [{"role": "user", "content": "Say ok."}],
+ "temperature": 0.3,
+ "top_p": 0.8,
+ "frequency_penalty": 0.5,
+ "presence_penalty": 0.2,
+ "stop": ["\n\n"],
+ "seed": 42,
+ "max_tokens": 64
+ }
+ },
+ {
+ "id": "chat.extra_fields",
+ "dialect": "chat",
+ "stream": false,
+ "note": "KEY: unknown/extra fields. Which gateways forward them verbatim vs strip them (e.g. LiteLLM drop_params)?",
+ "body": {
+ "model": "gpt-4o-mini",
+ "messages": [{"role": "user", "content": "Say ok."}],
+ "metadata": {"qa_case": "extra-fields"},
+ "x_qa_marker": "keep-123",
+ "user": "qa-user-1"
+ }
+ },
+ {
+ "id": "chat.tools",
+ "dialect": "chat",
+ "stream": false,
+ "note": "Tool/function definitions and tool_choice: forwarded faithfully?",
+ "body": {
+ "model": "gpt-4o-mini",
+ "messages": [{"role": "user", "content": "Weather in Paris?"}],
+ "tools": [{"type": "function", "function": {
+ "name": "get_weather",
+ "description": "Get weather for a city",
+ "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
+ }}],
+ "tool_choice": "auto"
+ }
+ },
+ {
+ "id": "chat.response_format",
+ "dialect": "chat",
+ "stream": false,
+ "note": "Structured-output directive: is response_format forwarded?",
+ "body": {
+ "model": "gpt-4o-mini",
+ "messages": [{"role": "user", "content": "Return JSON with capital of Spain."}],
+ "response_format": {"type": "json_object"}
+ }
+ },
+ {
+ "id": "chat.vision",
+ "dialect": "chat",
+ "stream": false,
+ "note": "Multimodal content parts: how is an image_url part forwarded upstream?",
+ "body": {
+ "model": "gpt-4o-mini",
+ "messages": [{"role": "user", "content": [
+ {"type": "text", "text": "What color?"},
+ {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mP8z8BQDwAEhQGAhKmMIQAAAABJRU5ErkJggg=="}}
+ ]}]
+ }
+ },
+ {
+ "id": "responses.simple",
+ "dialect": "responses",
+ "stream": false,
+ "note": "Responses API: how is `input` translated for each gateway's upstream provider call?",
+ "body": {
+ "model": "gpt-4o-mini",
+ "input": "What is the capital of France?"
+ }
+ },
+ {
+ "id": "responses.stream",
+ "dialect": "responses",
+ "stream": true,
+ "note": "Responses streaming: event protocol the gateway returns to the client.",
+ "body": {
+ "model": "gpt-4o-mini",
+ "input": "Count to three.",
+ "stream": true
+ }
+ },
+ {
+ "id": "messages.simple",
+ "dialect": "messages",
+ "stream": false,
+ "note": "Anthropic Messages in: what upstream dialect/path does each gateway emit (native messages vs translated chat)?",
+ "body": {
+ "model": "gpt-4o-mini",
+ "max_tokens": 64,
+ "messages": [{"role": "user", "content": "What is the capital of France?"}]
+ }
+ },
+ {
+ "id": "messages.stream",
+ "dialect": "messages",
+ "stream": true,
+ "note": "Anthropic Messages streaming translation.",
+ "body": {
+ "model": "gpt-4o-mini",
+ "max_tokens": 64,
+ "stream": true,
+ "messages": [{"role": "user", "content": "Count to three."}]
+ }
+ }
+]
diff --git a/docs/2026-06-25_aws_gateway_benchmark/translation/docker-compose.yml b/docs/2026-06-25_aws_gateway_benchmark/translation/docker-compose.yml
new file mode 100644
index 00000000..57a9a070
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/translation/docker-compose.yml
@@ -0,0 +1,84 @@
+# Translation-fidelity topology: all four gateways at once, every one pointed at
+# a single RECORDING mock backend (MOCK_RECORD=1). Because the capture runner
+# sends one request at a time and resets the mock before each, the shared mock
+# cleanly attributes each upstream call to the gateway+case that produced it.
+#
+# docker compose --profile all up -d # mock + gomodel + litellm + portkey + bifrost
+#
+# Gateways are on different host ports so they can run simultaneously. Configs
+# and the bench-tools build context are reused from ../remote.
+
+networks:
+  default:
+    name: xlatenet
+
+services:
+  mock:
+    build: ../remote/bench-tools
+    command: ["/mock"]
+    environment:
+      - MOCK_PORT=9999
+      - MOCK_RECORD=1
+    ports:
+      - "9999:9999"
+    restart: "no"
+
+  gomodel:
+    profiles: ["all", "gomodel"]
+    image: ${GOMODEL_IMAGE:-gomodel-bench:local}
+    depends_on: [mock]
+    ports:
+      # Host 18080 to avoid clashing with a local dev gomodel on 8080.
+      - "${GOMODEL_HOST_PORT:-18080}:8080"
+    environment:
+      - PORT=8080
+      - GOMODEL_MASTER_KEY=
+      - OPENAI_API_KEY=sk-bench-test-key
+      - OPENAI_BASE_URL=http://mock:9999/v1
+      - LOGGING_ENABLED=false
+      - USAGE_ENABLED=false
+      - METRICS_ENABLED=false
+      - SWAGGER_ENABLED=false
+      - PPROF_ENABLED=false
+      - ENABLE_PASSTHROUGH_ROUTES=false
+      - STORAGE_TYPE=sqlite
+      - SQLITE_PATH=/app/data/gomodel-xlate.db
+      - GOMODEL_CACHE_DIR=/app/.cache
+    restart: "no"
+
+  litellm:
+    profiles: ["all", "litellm"]
+    # Pinned by digest for a reproducible comparison (override via LITELLM_IMAGE).
+    image: ${LITELLM_IMAGE:-ghcr.io/berriai/litellm:main-stable@sha256:afdc3cc37493d4f86d485ad7ac4445e7154c568a8d47c01bad15c9cf062c66b5}
+    depends_on: [mock]
+    ports:
+      - "4000:4000"
+    volumes:
+      - ../remote/configs/litellm-config.yaml:/app/config.yaml:ro
+    command: ["--config", "/app/config.yaml", "--port", "4000", "--num_workers", "1"]
+    restart: "no"
+
+  portkey:
+    profiles: ["all", "portkey"]
+    # Pinned by digest for a reproducible comparison (override via PORTKEY_IMAGE).
+    image: ${PORTKEY_IMAGE:-portkeyai/gateway:latest@sha256:97f094d9c8a764cbfaa2a7138c0017b247ca923bb06db1b4c13b7f8a33b5200d}
+    depends_on: [mock]
+    ports:
+      - "8787:8787"
+    environment:
+      - TRUSTED_CUSTOM_HOSTS=mock
+    restart: "no"
+
+  bifrost:
+    profiles: ["all", "bifrost"]
+    # Pinned by digest for a reproducible comparison (override via BIFROST_IMAGE).
+    image: ${BIFROST_IMAGE:-maximhq/bifrost:latest@sha256:6f20c020cd326199c050e6b15ba18131a6f7ac8627a9a4276750f83e92af2253}
+    depends_on: [mock]
+    ports:
+      - "8089:8089"
+    environment:
+      - APP_PORT=8089
+      - APP_HOST=0.0.0.0
+    volumes:
+      - ../remote/configs/bifrost-config.json:/app/data/config.json:ro
+    restart: "no"