diff --git a/docs/2026-06-25_aws_gateway_benchmark/ARTICLE.md b/docs/2026-06-25_aws_gateway_benchmark/ARTICLE.md new file mode 100644 index 00000000..f59209d8 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/ARTICLE.md @@ -0,0 +1,222 @@ +--- +title: "AI Gateway Benchmark 2026: GoModel vs LiteLLM, Portkey & Bifrost" +description: "GoModel vs LiteLLM, Portkey, and Bifrost - a reproducible AWS benchmark of four open-source AI gateways across latency, throughput, memory, CPU, and Docker image size. A fast, lightweight LiteLLM alternative in Go." +coverImage: "/blog/charts/gomodel-gateway-benchmark-june-2026-cover.png" +coverImageWidth: 2400 +coverImageHeight: 1260 +pubDate: 2026-06-26 +author: "Jakub A. Wasek" +tags: + - benchmarking + - ai-gateway + - litellm + - portkey + - bifrost + - gomodel +--- + +![GoModel vs LiteLLM, Portkey and Bifrost - latency is overrated, look at the bill](./cover.png) + +The point of this benchmark is not to prove that LiteLLM sucks. The point is to +measure GoModel honestly against the gateways people actually compare it to: +**LiteLLM, Portkey, and Bifrost**. + +That said - yes, LiteLLM sucks, and that is exactly why GoModel exists. (If you're +not sure what I mean, I'd recommend giving the software a try yourself - or doing +your own research.) + +In October 2025 I tried to build my startup on top of LiteLLM. I quickly found +out that the software is fundamentally badly designed. A proxy-like server, on +the hot path of every request, written in Python? On top of that came a long +tail of operational issues. So I did my research and started writing GoModel: a +production-grade and enterprise-grade AI gateway / AI control plane, in Go. + +The later supply-chain security incident around LiteLLM only confirmed my view. +Go and its standard-library-heavy dependency trees are structurally far less +exposed to that class of attack than a sprawling Python dependency graph. 
+ +With the motivation out of the way, let's talk about what's actually worth +measuring in an AI gateway benchmark - the metrics that make a comparison +meaningful. + +When I [launched GoModel on Hacker News](https://news.ycombinator.com/item?id=47861333) +I told the thread I'd publish a real, reproducible benchmark. Here it comes. + +## What to measure to choose the best AI gateway + +Here is the full list of metrics that matter: + +- `p99` / `p95` / `p50` latency (proxy overhead) +- RAM consumption +- CPU consumption (and throughput per core) +- Cold-start time +- Docker image size +- Vendor-agnostic +- Open-source + +A couple of these deserve a closer look. + +### Latency + +Latency matters less than you'd assume. Be precise about what we are measuring: +**proxy overhead latency** - the time the gateway itself adds, on top of the +upstream call. + +The trap is treating latency as the ultimate criterion. In any real workload the +dominant latency comes from inference. The gateway's overhead is a small fraction +of the total you're already living with. A gateway that is "2x faster" at adding +`5 ms` is not meaningfully faster once a model takes `2000 ms` to respond. + +So I care far more about the *tail* (p99) than the median - a gateway that is +usually fast but occasionally stalls is worse than one that is boringly +consistent. + +### Resource consumption - CPU, RAM, image size, cold start + +These are the metrics that actually move the needle, because they map directly to: + +1. The monthly cost of your infrastructure. +2. Whether you can run the gateway serverless (AWS Lambda, GCP Functions) or on + edge devices at all. + +A `372 MB` image (`1.2 GB` unpacked) that idles at gigabytes of RAM and takes +`25 s` to cold-start is a different operational animal than a `16 MB` image that +peaks at `37 MB` of RAM and is serving traffic `0.56 s` after launch. 
+ +## The benchmark + +Every gateway talked to the **same instant mock backend**, so the numbers reflect +gateway overhead, not model latency or network jitter. Each ran one at a time, in +Docker, on an **AWS `c7i.large`** (2 vCPU, 4 GiB) running the latest **Amazon Linux +2023** AMI - the whole thing is Terraform'd, runs on one command, and tears itself +down afterwards. + +I actually ran this twice. The **first cut used the free-tier `t2.micro`** +(1 vCPU, 1 GiB) - cheap, self-destructing, trivial to reproduce. But I realized +that was *unfair to the competitors*: a 1 GiB box can't hold the memory-heavy +gateways (LiteLLM idles near a gigabyte), so they spill into **swap** and get +penalized for the host being too small rather than for their own overhead. So I +switched to the roomier, non-burstable **`c7i.large`** - nothing swaps there, and a +fixed-performance instance also removes the CPU-credit drift that muddies the tail +on burstable boxes. **The relative results barely moved between the two runs** - +GoModel still won on tail latency, throughput, memory, and image size. Giving the +heavy gateways enough RAM to not thrash makes the comparison *more* honest, not +less. + +I tested four gateways across six workloads - chat completions, the Responses API, +and Anthropic messages, each streaming and non-streaming - driven at `8,000` +requests per workload, concurrency `10`, across **two trials with randomized +gateway order**. Latency is the **median across trials**, and I report each p99 +with its min-max across trials so a single noisy window can't drive the story. + +A few methodology details worth calling out: + +- **Throughput is measured, not inferred.** The latency runs report + completed-req/s at a fixed concurrency, which is just latency restated. Real + capacity comes from a separate **concurrency sweep** that drives each gateway to + saturation and records sustained req/s. 
+- **I warm up every dialect before measuring it.** LiteLLM lazily imports its + per-dialect translation modules on first use, so a naive chat-only warmup left + the Responses and Messages paths cold and inflated their tails. I neutralized + that to be fair - but note what it tells you: a server that pays an import tax + the first time it sees a request type is, again, not designed for the hot path. +- **Fair resilience config.** Every gateway runs with retries disabled. I also + disabled GoModel's circuit breaker for the test - under the saturation sweep a + few transient errors would otherwise trip it and it would (correctly, in + production) start rejecting requests, which would unfairly zero out its *own* + throughput. No other gateway here has a breaker, so off is the apples-to-apples + setting. +- **LiteLLM at its recommended worker count.** A LiteLLM worker is effectively + single-threaded, and its own production guidance is one worker per CPU core - so I + run it with `num_workers` = the box's vCPU count (`2` here), the same multi-core + access the Go gateways get for free. (Pin it to one worker and it under-uses the + box; give it more and, as the table shows, its memory balloons. There's no setting + that makes it both fast *and* light.) +- **Streaming uses terminal-marker or idle-gap detection**, so a gateway that + streams content without ever sending a terminal event (Bifrost, over a + non-native backend) is measured to last byte instead of hanging the harness. + +## The comparison + +Representative latency is chat completions, non-streaming. All resource figures +are measured under load on the same box. 
+ +| Metric | GoModel | Bifrost | Portkey | LiteLLM | +|---|--:|--:|--:|--:| +| Runtime | Go | Go | Node.js | Python | +| Latency overhead `p50` | **`1.8 ms`** | `2.5 ms` | `9.7 ms` | `30.6 ms` | +| Latency `p99` | **`6.9 ms`** | `18.3 ms` | `30.5 ms` | `39.3 ms` | +| Throughput (sustained) | **`4900 req/s`** | `3100 req/s` | `950 req/s` | `324 req/s` | +| Peak RAM under load | **`37 MB`** | `143 MB` | `112 MB` | `2.3 GB` | +| Efficiency (req/s per CPU %) | **`52`** | `25` | `8.2` | `2.6` | +| Cold start to first request | **`0.56 s`** | `7.1 s` | `1.1 s` | `25.5 s` | +| Docker image (compressed pull) | **`16 MB`** | `77 MB` | `59 MB` | `372 MB` | +| Workload coverage | `6/6` | `6/6` | `4/6` | `6/6` | +| Vendor-agnostic | Yes | Partial † | Yes | Yes | +| Open-source | Yes ‡ | Partial ‡ | Partial ‡ | Yes | + +Same numbers, at a glance: + +![Latency tail p99: GoModel 6.9 ms, Bifrost 18.3 ms, Portkey 30.5 ms, LiteLLM 39.3 ms](./charts/june-2026-latency-p99.svg) + +![Sustained throughput: GoModel 4,900 req/s, Bifrost 3,100, Portkey 950, LiteLLM 324](./charts/june-2026-throughput.svg) + +![Peak memory under load: GoModel 37 MB, Bifrost 143 MB, Portkey 112 MB, LiteLLM 2.3 GB](./charts/june-2026-memory.svg) + +![Docker image, compressed: GoModel 16 MB, Bifrost 77 MB, Portkey 59 MB, LiteLLM 372 MB](./charts/june-2026-image.svg) + +A few honest notes, because I'd rather you trust the rest of the table: + +- **On a non-burstable host the medians are real, and GoModel leads on both ends.** + It posts the lowest `p50` (`1.8 ms`) *and* the tightest `p99` (`6.9 ms`). + Bifrost is a close second on the median (`2.5 ms`) - but its tail is ~`2.7x` + heavier (`18 ms`) and it carries ~`4x` the memory under load. +- **GoModel cold-starts in `0.56 s` versus LiteLLM's ~`25 s`.** That is the + difference between viable on a serverless platform and not. 
+- **Portkey** does not serve the Anthropic `/v1/messages` dialect in this + single-provider setup, hence `4/6` (it supports Anthropic with a fuller + virtual-key config; this is a setup limitation, not a hard capability gap). +- **LiteLLM** ships a `372 MB` compressed image (`1.16 GB` on disk), and at its + recommended config (one worker per core) it uses **~`2.3 GB` of RAM** - two ~1 GB + worker processes - and takes ~`25 s` to cold-start. Running it *properly* for multi-core throughput makes the footprint + worse, not better. That is the cost of Python on the hot path. +- **Bifrost is not a neutral project (†).** It is built by + [Maxim AI](https://www.getmaxim.ai/bifrost), an LLM evaluation & observability + platform, and ships a first-party plugin that forwards your gateway traffic to + Maxim's platform. It routes to many *model* providers, but the gateway itself is + a channel into one vendor's ecosystem - not the independent, vendor-neutral tool + the "1000+ models" headline implies. +- **"Open-source" deserves an asterisk (‡).** Portkey keeps its observability + storage, dashboard, multi-team RBAC, and at-scale semantic caching in a closed + managed tier; Bifrost's core gateway is Apache-2.0 but its Enterprise edition + layers on closed/managed features. GoModel is open-source today, with some + enterprise-grade features planned to stay private. LiteLLM is the most open of + the four - its proxy core is MIT - but even it gates its enterprise features + (SSO, audit logs, fine-grained access control) behind a separate *proprietary* + commercial license that ships source-available in the `enterprise/` folder, not + as free OSS. + +## Summary + +GoModel is the best gateway in this comparison: the lowest median *and* the +tightest latency tail, the highest sustained throughput, the best throughput per +CPU % (~`52` req/s per CPU %), the smallest compressed image (≈`23x` smaller than +LiteLLM), the smallest memory footprint, and the fastest cold start - with full workload coverage. 
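As a quick sanity check on the ratios quoted above (~`2.7x` tail, ~`4x` memory, ≈`23x` image), here is a minimal sketch that recomputes them from the comparison table. The dictionary layout and the `ratio` helper are illustrative names, not part of the benchmark harness; only the numbers come from the table.

```python
# Figures copied from the comparison table (chat completions, non-streaming).
# The structure and helper below are illustrative, not harness code.
table = {
    "GoModel": {"p99_ms": 6.9, "peak_ram_mb": 37, "image_mb": 16},
    "Bifrost": {"p99_ms": 18.3, "peak_ram_mb": 143, "image_mb": 77},
    "LiteLLM": {"p99_ms": 39.3, "image_mb": 372},
}


def ratio(metric, other, base="GoModel"):
    """How many times larger `other` is than `base` on one metric."""
    return table[other][metric] / table[base][metric]


bifrost_tail = ratio("p99_ms", "Bifrost")       # p99 tail vs GoModel
bifrost_ram = ratio("peak_ram_mb", "Bifrost")   # peak RAM vs GoModel
litellm_image = ratio("image_mb", "LiteLLM")    # compressed image vs GoModel

print(f"Bifrost p99 tail:  {bifrost_tail:.2f}x GoModel")
print(f"Bifrost peak RAM:  {bifrost_ram:.2f}x GoModel")
print(f"LiteLLM image:     {litellm_image:.2f}x GoModel")
```

The rounded figures in the prose are these ratios to one or two significant digits; rerunning the benchmark on a different host would change the inputs, so treat them as a snapshot of this run.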
+ +I've tried to be as objective as I can, and the whole thing is built to be +**self-verifiable**: the harness provisions the AWS instance, runs every gateway +against the same backend, prints the table, and destroys the infrastructure. +**[Reproduce it yourself](https://github.com/ENTERPILOT/GoModel/tree/main/docs/2026-06-25_aws_gateway_benchmark)** - +clone the repo, point it at your AWS account, and run `./run.sh`. It builds the +images, provisions the box, runs all four gateways, prints the tables, and tears +the infrastructure back down on its own. + +One caveat: it runs on **paid** AWS infrastructure, not the free tier. A +`c7i.large` is about `$0.09`/hour and the run self-destructs within an hour or +two, so budget **under `$1`** per run to be safe - and if you pass `KEEP=1` or a +teardown ever fails, you keep paying until you destroy the box, so double-check +it's gone. + +If you have objections to this benchmark, reach out on the GoModel Discord (link +in the GoModel README on GitHub). And I'd genuinely like to see more impartial +gateway comparisons out there - bring your own numbers. diff --git a/docs/2026-06-25_aws_gateway_benchmark/ARTICLE2.md b/docs/2026-06-25_aws_gateway_benchmark/ARTICLE2.md new file mode 100644 index 00000000..7ea41cdf --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/ARTICLE2.md @@ -0,0 +1,351 @@ +--- +title: "Benchmarking AI Gateways: GoModel vs LiteLLM vs Portkey vs Bifrost" +description: "A reproducible AI gateway benchmark comparing GoModel, LiteLLM, Portkey, and Bifrost on latency, throughput, memory, CPU, cold start, and image size." +coverImage: "/blog/charts/gomodel-gateway-benchmark-june-2026-cover.png" +coverImageWidth: 2400 +coverImageHeight: 1260 +pubDate: 2026-06-26 +author: "Jakub A. 
Wasek" +keywords: + - AI gateway benchmark + - AI control plane + - OpenAI-compatible API + - LiteLLM alternative + - GoModel + - LiteLLM + - Portkey + - Bifrost +tags: + - benchmarking + - ai-gateway + - ai-control-plane + - litellm + - portkey + - bifrost + - gomodel +--- + +![GoModel vs LiteLLM, Portkey and Bifrost benchmark - four gateways, one backend, GoModel wins](./cover.png) + +In October 2025 I tried to build my startup on top of LiteLLM. + +At first it looked like the obvious choice. It supported many providers, it had +an OpenAI-compatible API, and it was already used by a lot of people. I did not +want to write an AI gateway. I wanted to build the product behind it. + +Then I started running it on the hot path. + +My opinion changed there. + +A gateway is not a dashboard or integration glue you call once in a while. It +sits on every request, every retry, every stream, every tool call, every +fallback, every timeout. + +A heavy gateway charges rent forever. + +Most AI gateway comparisons miss that part. They talk about provider count, +dashboards, tracing, and "support for 1000+ models". Those things matter, but +they are not free. Before the gateway calls OpenAI, Anthropic, Gemini, vLLM, or +anything else, it has already spent your CPU, memory, cold-start time, and +operational budget. + +I am not comparing full product maturity here. I am comparing how these gateways +behave on the hot path. + +So I started writing [GoModel](https://github.com/ENTERPILOT/GoModel): a small +open-source AI gateway and AI control plane in Go, with an OpenAI-compatible API +and explicit provider adapters. + +When I launched GoModel on Hacker News, +I promised a real, reproducible benchmark. This article is that follow-up. 
+ +The benchmark question is simple: + +**How lean is each AI gateway when it sits on the request path?** + +That question runs through the whole benchmark: GoModel vs LiteLLM vs Portkey vs +Bifrost, measured by latency, throughput, memory, CPU, cold start, and image +size rather than landing pages or feature matrices. + +## The runtime footprint matters + +Latency gets the easiest arguments. It rarely tells the whole story. + +Most real LLM calls are dominated by inference time. If a model takes `2000 ms` +to answer, the difference between `5 ms` and `15 ms` of proxy overhead is not +the main story. + +The main story is the deployment envelope: + +- How much RAM does the gateway need under load? +- How much CPU does it burn per request? +- How many requests can it serve per core? +- How fast does it cold-start? +- How large is the Docker image? +- Can you run it as a sidecar, on a small VM, in serverless, or near local + models? +- Is the core gateway actually open-source? + +Those numbers decide whether the gateway can run where you want it to run. + +A `372 MB` compressed image (`1.2 GB` unpacked) that idles around gigabytes of +RAM and takes `25 s` to cold-start is a different operational thing than a +`16 MB` image that peaks at `37 MB` of RAM and is serving traffic `0.56 s` after +launch. + +So I care about the runtime footprint. + +## What this benchmark does not prove + +This benchmark does **not** prove that one gateway is best for every company. + +I am not measuring: + +- bug counts or overall correctness +- semantic cache quality +- tracing UI quality +- guardrail quality +- admin dashboards +- long-term provider maintenance +- every possible provider-specific feature +- total provider count + +Those things matter. Some of them matter a lot. + +LiteLLM in particular has more integrated providers and more gateway features +than GoModel today. If your first requirement is maximum provider coverage right +now, LiteLLM has a real advantage. 
This benchmark does not erase that. It +measures the runtime footprint of putting each gateway on the request path. In +practice, many smaller or newer providers already expose an OpenAI-compatible +API, so provider count is not always the same as practical routing coverage. + +The benchmark measures one narrower thing: **runtime and deployment overhead on +the request path**. + +That still matters, because the gateway is on the hot path. If you run high +request volume, local models, serverless workloads, edge workloads, or many small +model calls, the overhead stops being theoretical. + +## AI gateway benchmark setup + +I tested four AI gateways people actually compare: + +- GoModel +- LiteLLM +- Portkey +- Bifrost + +Every gateway talked to the **same instant mock backend**, on purpose. I did not +want to benchmark OpenAI, Anthropic, AWS networking, or random internet jitter. +I wanted to isolate the gateway itself. + +Each gateway ran one at a time, in Docker, on an **AWS `c7i.large`** with +2 vCPU and 4 GiB RAM, running the latest **Amazon Linux 2023** AMI. The whole +thing is Terraform'd, runs with one command, and tears itself down afterwards. + +I first ran this on a free-tier `t2.micro`. That was cheap and easy to +reproduce, but unfair to the heavier gateways. A 1 GiB machine cannot hold a +gateway that wants gigabytes of memory, so it starts swapping. At that point you +are benchmarking the host being too small. + +So I moved to `c7i.large`: still small, but non-burstable and large enough that +nothing swaps. It also makes the LiteLLM setup more honest. LiteLLM recommends +one worker per vCPU, and this machine has 2 vCPUs, so LiteLLM gets 2 +workers. That gives it the multi-core access it is supposed to have instead of +pinning it to a single worker on a tiny box. 
+ +The test covered six workloads: + +- chat completions, non-streaming +- chat completions, streaming +- Responses API, non-streaming +- Responses API, streaming +- Anthropic messages, non-streaming +- Anthropic messages, streaming + +Each workload used `8,000` requests at concurrency `10`, across **two trials +with randomized gateway order**. Latency is the **median across trials**, and I +report p99 with its min-max range so one noisy window cannot tell the whole +story. + +I would not call this a statistically exhaustive study. It is a reproducible +engineering benchmark, and the harness is public so people can rerun it, change +the machine, or add their own workloads. + +A few details matter if you want to reproduce or criticize the numbers: + +- **Throughput is measured, not inferred.** The latency runs report + completed-req/s at fixed concurrency, but real capacity comes from a separate + concurrency sweep that drives each gateway to saturation. +- **Every dialect is warmed up before measurement.** LiteLLM lazily imports some + per-dialect translation code on first use. A chat-only warmup made its + Responses and Messages paths look worse than they should. I warmed up all + dialects to avoid that. +- **Retries are disabled for all gateways.** I also disabled GoModel's circuit + breaker for this benchmark. In production, rejecting traffic after upstream + trouble is the right behavior. In a saturation benchmark, it would make the + throughput number unfairly low. +- **LiteLLM runs with its recommended worker count.** A LiteLLM worker is + effectively single-threaded, and its production guidance is one worker per + vCPU. On this box that means `2` workers. +- **Streaming uses terminal-marker or idle-gap detection.** If a gateway streams + content but never sends a terminal event, the harness measures to last byte + instead of hanging forever. + +## GoModel vs LiteLLM vs Portkey vs Bifrost + +Representative latency is chat completions, non-streaming. 
All resource figures +are measured under load on the same box. + +| Metric | GoModel | Bifrost | Portkey | LiteLLM | +|---|--:|--:|--:|--:| +| Runtime | Go | Go | Node.js | Python | +| Latency overhead `p50` | **`1.8 ms`** | `2.5 ms` | `9.7 ms` | `30.6 ms` | +| Latency `p99` | **`6.9 ms`** | `18.3 ms` | `30.5 ms` | `39.3 ms` | +| Throughput (sustained) | **`4900 req/s`** | `3100 req/s` | `950 req/s` | `324 req/s` | +| Peak RAM under load | **`37 MB`** | `143 MB` | `112 MB` | `2.3 GB` | +| Efficiency (req/s per CPU %) | **`52`** | `25` | `8.2` | `2.6` | +| Cold start to first request | **`0.56 s`** | `7.1 s` | `1.1 s` | `25.5 s` | +| Docker image (compressed pull) | **`16 MB`** | `77 MB` | `59 MB` | `372 MB` | +| Workload coverage | `6/6` | `6/6` | `4/6` | `6/6` | +| Vendor-neutral core | Yes | Partial † | Yes | Yes | +| Core source available | Yes ‡ | Partial ‡ | Partial ‡ | Yes | + +Same numbers, at a glance: + +![Latency tail p99: GoModel 6.9 ms, Bifrost 18.3 ms, Portkey 30.5 ms, LiteLLM 39.3 ms](./charts/june-2026-latency-p99.svg) + +![Sustained throughput: GoModel 4,900 req/s, Bifrost 3,100, Portkey 950, LiteLLM 324](./charts/june-2026-throughput.svg) + +![Peak memory under load: GoModel 37 MB, Bifrost 143 MB, Portkey 112 MB, LiteLLM 2.3 GB](./charts/june-2026-memory.svg) + +![Docker image, compressed: GoModel 16 MB, Bifrost 77 MB, Portkey 59 MB, LiteLLM 372 MB](./charts/june-2026-image.svg) + +## What stood out + +GoModel had the lowest median latency and the tightest tail: `1.8 ms` p50 and +`6.9 ms` p99. + +Bifrost was close on median latency at `2.5 ms`, which is a good result. The +gap opened at the tail and in memory: `18.3 ms` p99 and `143 MB` peak RAM under +load. + +Portkey was heavier than I expected for this narrow proxy benchmark. It served +`950 req/s` sustained and used `112 MB` peak RAM under load. In this setup it did +not serve the Anthropic `/v1/messages` dialect, so it gets `4/6` workload +coverage. 
Treat that as a setup limitation, not a claim that Portkey cannot +support Anthropic in a fuller virtual-key configuration. + +LiteLLM was the outlier. At its recommended worker count, it used about +`2.3 GB` of RAM, cold-started in `25.5 s`, and sustained `324 req/s`. + +Not because Python is morally bad. The language matters only when it changes the +deployment envelope. Here it does: memory floor, image size, cold-start time, +dependency graph, and throughput per core. + +The later supply-chain incident around LiteLLM +also made me more confident in GoModel's design direction. A small Go binary +with a standard-library-heavy dependency tree is structurally less exposed to +that class of problem than a large Python dependency graph. + +## What AI gateway benchmarks do not capture + +Forwarding JSON is not the hard part. + +The hard part is provider drift. + +OpenAI, Anthropic, Gemini, AWS Bedrock, Azure OpenAI, Groq, xAI, Cerebras, vLLM, +and local servers all disagree in small ways. Then they change those ways. Tool +calling changes. Streaming changes. Reasoning parameters change. Image inputs +change. Error formats change. Rate-limit semantics change. + +An AI gateway or AI control plane has to absorb that without becoming magic. + +GoModel's bet is not "support every model name on the internet". + +The bet is: + +- support the providers people actually deploy +- keep provider adapters explicit +- accept OpenAI-compatible requests generously +- translate only what needs translation +- pass through what should stay provider-specific +- return conservative OpenAI-compatible responses + +For the same reason, GoModel starts as a small OpenAI-compatible gateway, not as +a dashboard with a proxy attached. + +## Why this matters for local models and vLLM + +If all your traffic goes to a cloud model that takes several seconds to answer, +gateway overhead can look academic. + +Local models change the math. 
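To put rough numbers on that, the sketch below computes what share of end-to-end latency the gateway accounts for. The `p50` overheads come from the table and the `2000 ms` cloud figure from earlier in this article; the `50 ms` local-model latency is an assumption chosen purely for illustration, and `overhead_share` is a hypothetical helper, not part of the harness.

```python
def overhead_share(gateway_ms, model_ms):
    """Fraction of total request latency added by the gateway."""
    return gateway_ms / (gateway_ms + model_ms)


# p50 proxy overhead from the comparison table.
P50_OVERHEAD_MS = {"GoModel": 1.8, "LiteLLM": 30.6}

CLOUD_MODEL_MS = 2000.0  # cloud inference figure used earlier in the article
LOCAL_MODEL_MS = 50.0    # ASSUMED latency of a fast local model (illustrative)

for name, overhead in P50_OVERHEAD_MS.items():
    cloud = overhead_share(overhead, CLOUD_MODEL_MS)
    local = overhead_share(overhead, LOCAL_MODEL_MS)
    print(f"{name}: {cloud:.1%} of a cloud call, {local:.1%} of a local call")
```

Under these assumptions the same overhead that rounds to noise against a slow cloud model becomes a visible slice of a fast local call, which is the shift this section is describing.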
+ +If you are routing through an AI gateway to vLLM, Ollama, LM Studio, llama.cpp, +or small specialized models on your own network, the model call can be much +faster. Then gateway overhead, cold starts, memory, and sidecar size matter more. + +One reason I want GoModel to stay small: a gateway should be cheap enough to put +near the workload. + +## Notes on neutrality and open source + +Bifrost is built by Maxim AI, an LLM +evaluation and observability platform. It routes to many model providers, but +the gateway also sits close to Maxim's eval and observability ecosystem. If you +want to choose your own eval platform, or stay independent from any eval +platform, ask whether Bifrost is the right match for you. Good software can +still have incentives attached. "Vendor-neutral" needs an asterisk here. + +"Open-source" also needs care. + +Portkey keeps observability storage, dashboard, multi-team RBAC, and at-scale +semantic caching in a closed managed tier. Bifrost's core gateway is Apache-2.0, +but its Enterprise edition adds closed or managed features. LiteLLM's proxy core +is MIT, but enterprise features like SSO, audit logs, and fine-grained access +control sit behind a proprietary commercial license. + +GoModel is open-source today. Some enterprise-grade AI control plane features may +stay private. The core gateway is intended to remain useful without those private +features. + +## Reproduce it yourself + +The benchmark is built to be self-verifiable. It provisions the AWS instance, +runs every gateway against the same backend, prints the tables, and destroys the +infrastructure. + +**[Reproduce it yourself](https://github.com/ENTERPILOT/GoModel/tree/main/docs/2026-06-25_aws_gateway_benchmark)**: + +```bash +./run.sh +``` + +One caveat: it runs on **paid** AWS infrastructure, not the free tier. A +`c7i.large` is about `$0.09`/hour and the run self-destructs within an hour or +two, so budget **under `$1`** per run to be safe. 
+ +If you pass `KEEP=1` or teardown fails, you keep paying until you destroy the +box, so double-check the teardown. + +## Conclusion + +I did not start GoModel because I wanted another AI gateway in the world. + +I started it because the gateway I wanted to use became part of the problem. It +sat on the hot path, but did not feel like hot-path software: too heavy, too +slow to start, too expensive to keep around, too large for the job. + +This benchmark is the result of turning that frustration into numbers. + +The numbers say GoModel is small in the places I care about: `16 MB` image, +`37 MB` peak RAM, `0.56 s` cold start, `1.8 ms` p50, `6.9 ms` p99, and +`4900 req/s` sustained throughput on a small AWS box. + +LiteLLM still has more providers and more features today. Portkey and Bifrost +have their own strengths. But if the gateway is going to sit between your users +and every model call, I think it should first be cheap, predictable, and boring +to run. + +GoModel is my attempt to build that kind of gateway. 
diff --git a/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-image.svg b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-image.svg new file mode 100644 index 00000000..51f6aa12 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-image.svg @@ -0,0 +1,19 @@ + + + Docker image (compressed) + pull size · lower is better + + GoModel + + 16 MB + Bifrost + + 77 MB + Portkey + + 59 MB + LiteLLM + + 372 MB + + diff --git a/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-latency-p99.svg b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-latency-p99.svg new file mode 100644 index 00000000..cac41ab0 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-latency-p99.svg @@ -0,0 +1,19 @@ + + + Latency tail (p99, chat) + ms · lower is better + + GoModel + + 6.9 ms + Bifrost + + 18.3 ms + Portkey + + 30.5 ms + LiteLLM + + 39.3 ms + + diff --git a/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-memory.svg b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-memory.svg new file mode 100644 index 00000000..f6dd3ce2 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-memory.svg @@ -0,0 +1,19 @@ + + + Peak memory under load + RAM · lower is better + + GoModel + + 37 MB + Bifrost + + 143 MB + Portkey + + 112 MB + LiteLLM + + 2.3 GB + + diff --git a/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-throughput.svg b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-throughput.svg new file mode 100644 index 00000000..4ea70ef6 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-throughput.svg @@ -0,0 +1,19 @@ + + + Sustained throughput + req/s · higher is better + + GoModel + + 4,900 + Bifrost + + 3,100 + Portkey + + 950 + LiteLLM + + 324 + + diff --git a/docs/2026-06-25_aws_gateway_benchmark/cover-b.png b/docs/2026-06-25_aws_gateway_benchmark/cover-b.png new file mode 100644 index 00000000..9b2e833c Binary files /dev/null and 
b/docs/2026-06-25_aws_gateway_benchmark/cover-b.png differ diff --git a/docs/2026-06-25_aws_gateway_benchmark/cover.png b/docs/2026-06-25_aws_gateway_benchmark/cover.png new file mode 100644 index 00000000..0da1dbcf Binary files /dev/null and b/docs/2026-06-25_aws_gateway_benchmark/cover.png differ diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/.gitignore b/docs/2026-06-25_aws_gateway_benchmark/qa/.gitignore new file mode 100644 index 00000000..179b4868 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/qa/.gitignore @@ -0,0 +1,3 @@ +output/ +__pycache__/ +*.pyc diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/README.md b/docs/2026-06-25_aws_gateway_benchmark/qa/README.md new file mode 100644 index 00000000..8f489897 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/qa/README.md @@ -0,0 +1,152 @@ +# GoModel quality (QA) suite + +A curated corpus of ~50 complex requests that exercises every client-facing +dialect and modality of the gateway against **real providers** +(OpenAI / Anthropic / Gemini), then **registers** and **rates** each one. + +It answers a different question than the latency benchmark next door +(`docs/2026-06-25_aws_gateway_benchmark/`): not *how fast/cheap* the gateway is, +but *does it correctly accept, translate, and normalize real-world requests* — +the Postel's-law contract. + +For every case the suite records: + +- the **request as sent** (after model-role and variable resolution); +- the **response** received (status, headers, body, or assembled SSE text); +- **how the gateway recorded/normalized it** — pulled from the audit log: + the inbound request body it captured, the normalized response body it + returned, the resolved provider/model, and token usage; + +and rates it `PASS` / `FAIL` / `ERROR` / `SKIP`, plus a 0–100 **quality score** +for soft modality checks (did the vision model name the colour, did STT recover +the spoken words). 
+ +## What it covers + +| Dialect / endpoint | Providers | Modalities exercised | +|---|---|---| +| `/v1/chat/completions` | OpenAI, Anthropic, Gemini | text, multi-turn, streaming, vision, tools, reasoning, structured output, field preservation | +| `/v1/responses` | OpenAI, Anthropic, Gemini | text, multimodal input, streaming, tools, structured output, reasoning, conversation linkage | +| `/v1/messages` (+ `/count_tokens`) | Anthropic | native shape, system prompt, streaming SSE, vision blocks, tool_use, extended thinking, default `max_tokens` injection | +| `/v1/conversations` | OpenAI | create → get → use-in-Responses → update → delete (stateful) | +| `/v1/audio/speech`, `/v1/audio/transcriptions` | OpenAI | TTS, and a TTS→STT round-trip that recovers the spoken words | +| `/v1/embeddings` | OpenAI | single + batch | +| error normalization | OpenAI, Anthropic | unknown model, unsupported `input_audio`, malformed JSON | + +## How "field preservation" is verified (and its honest limit) + +GoModel's audit log captures the **inbound** client request body and the +**normalized** response body it returns — *not* the upstream provider-translated +request. So the suite verifies translation two ways: + +1. **Behaviorally** — e.g. the reasoning case sends `max_tokens` to a model that + rejects it upstream; a `200` proves the gateway mapped it to + `max_completion_tokens` and dropped `temperature`. The audio-rejection case + proves an unsupported modality fails cleanly (4xx) rather than crashing. +2. **From the audit record** — extra/unknown request fields (`x_qa_marker`, + `metadata`) are asserted present in the captured inbound body, and + provider-specific response extras (`system_fingerprint`, `service_tier`, + `stop_reason`, `usage`) are asserted preserved in the normalized response. + +Audit cross-checks are **soft** by default: if audit bodies are off or the entry +hasn't flushed, those checks are skipped with a note, never a false failure. 
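A minimal sketch of what a "soft" audit check could look like, assuming the audit entry is a dict with a `data.request_body` object (the shape implied by paths such as `$.data.request_body.x_qa_marker`). The function name and result format are hypothetical and not taken from `run_qa.py`.

```python
def soft_audit_check(audit_entry, field, expected):
    """Check one captured request-body field without producing false failures.

    Hypothetical helper: returns SKIP (with a note) when the audit data is
    unavailable, and only PASS/FAIL when there is a captured body to inspect.
    """
    if audit_entry is None:
        # Entry not flushed yet, or audit logging disabled.
        return {"status": "SKIP", "note": "no audit entry found"}

    body = (audit_entry.get("data") or {}).get("request_body")
    if body is None:
        # Entry exists but bodies were not logged.
        return {"status": "SKIP", "note": "audit bodies are off"}

    actual = body.get(field)
    if actual == expected:
        return {"status": "PASS", "note": ""}
    return {"status": "FAIL", "note": f"{field}: expected {expected!r}, got {actual!r}"}


entry = {"provider": "openai", "data": {"request_body": {"x_qa_marker": "keep"}}}
print(soft_audit_check(entry, "x_qa_marker", "keep"))
print(soft_audit_check(None, "x_qa_marker", "keep"))
```

The design point is the distinction between "no data to check" (skip) and "data present but field dropped" (fail); collapsing the two would either hide real preservation bugs or fail runs that simply had body logging turned off.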
+ +## Prerequisites + +Run the gateway with audit logging **and bodies** enabled so the preservation +checks have data: + +```bash +LOGGING_ENABLED=true \ +LOGGING_LOG_BODIES=true \ +LOGGING_LOG_HEADERS=true \ +LOGGING_LOG_AUDIO_BODIES=true \ +LOGGING_FLUSH_INTERVAL=2 \ +./gomodel # or: go run ./cmd/gomodel +``` + +Provider keys come from the gateway's environment (`OPENAI_API_KEY`, +`ANTHROPIC_API_KEY`, `GEMINI_API_KEY`). The harness authenticates to the gateway +with `GOMODEL_MASTER_KEY` (read from the env or the repo `.env`). + +> This calls real providers and spends real money — modest (a few cents) for one +> full run, since payloads are tiny and `max_tokens` is capped on every case. + +## Run it + +```bash +cd docs/2026-06-25_aws_gateway_benchmark/qa +python3 run_qa.py # full corpus against http://localhost:8080 +python3 run_qa.py --only chat # filter by id/group/provider substring +python3 run_qa.py --only openai +python3 run_qa.py --no-audit # skip audit cross-checks (faster, fewer assertions) +python3 run_qa.py --list # list matching cases, don't run +python3 run_qa.py --gateway http://host:8080 +``` + +Stdlib only — no `pip install`. Exit code is non-zero if any case failed or +errored. Results land in `output//`: + +- `results.json` — full per-case record (request sent, response, audit view, every assertion) +- `report.md` — readable table + a drill-down of failed/errored cases + +## Adapt to your account + +The spec never hardcodes a model id. Cases reference logical roles +(`@openai.chat`, `@anthropic.thinking`, `@gemini.vision`); edit `models.json` to +map them to models your keys can reach. A role with no mapping makes its cases +`SKIP`, never fail. Image inputs (`@image.red` / `@imageb64.red`) are generated +solid-colour PNGs — no binary assets in the repo. 
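
A minimal sketch of how a logical role might resolve through `models.json`, simplified from the role resolution in `qalib/config.py`. The name `resolve_role` and the trimmed mapping are assumptions made for this example; a `None` result stands in for the "no mapping, so `SKIP`" case.

```python
import json

# Trimmed, illustrative mapping; the real file lists more roles.
MODELS_JSON = """
{
  "openai": {"chat": "gpt-4.1-mini"},
  "gemini": {"vision": "gemini-2.5-flash"}
}
"""


def resolve_role(token, models):
    """Map '@provider.role' to a concrete model id, or None if unmapped."""
    if not token.startswith("@"):
        return token  # already a concrete value, leave untouched
    node = models
    for part in token[1:].split("."):
        if not isinstance(node, dict) or part not in node:
            return None  # no mapping: the case would be skipped
        node = node[part]
    return node


models = json.loads(MODELS_JSON)
print(resolve_role("@openai.chat", models))         # mapped role
print(resolve_role("@anthropic.thinking", models))  # unmapped role
```

Keeping the mapping in one file means swapping a model touches `models.json` only, never the spec cases.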
+ +## Layout + +``` +run_qa.py orchestrator + assertion evaluation + CLI +models.json logical model roles -> concrete model ids (edit this) +spec/ declarative cases, one JSON file per endpoint group +qalib/ stdlib helpers: config, paths, assertions, client, report +output/ run artifacts (gitignored) +``` + +## Case schema (quick reference) + +```jsonc +{ + "id": "chat.openai.multiturn", // unique + "title": "...", "provider": "openai", + "modality": ["text"], // labels for reporting + "request": { + "method": "POST", // default POST + "path": "/v1/chat/completions", // may contain ${captured_var} + "headers": {"X-QA-Marker": "keep"}, + "stream": false, + "body": { "model": "@openai.chat", "...": "..." }, + "raw_body": "…", // send verbatim (malformed-JSON tests) + "produce": "tts_then_stt", // composite: TTS then transcribe its output + "tts": {...}, "stt": {...} // inputs for produce=tts_then_stt + }, + "capture": { "conversation_id": "$.id" },// save response values for later ${vars} + "expect": { + "status": 200, // int or list + "headers": [ {"name": "X-Request-Id", "present": true} ], + "body": [ {"field": "content_type", "contains": "audio/"}, + {"field": "bytes", "gte": 2000}, + {"field": "text", "not_empty": true} ], + "response": [ {"path": "$.choices[0].message.content", "not_empty": true} ], + "stream": { "min_events": 2, "terminal": "[DONE]", + "event_types": ["message_start"], "text": [{"not_empty": true}] }, + "audit": [ {"path": "$.provider", "equals": "openai"}, + {"path": "$.data.request_body.x_qa_marker", "equals": "keep"} ], + "quality": [ {"target": "response:$.output[0].content[0].text", + "contains_any": ["paris"]} ] // soft; feeds the score + } +} +``` + +**Operators** (one per assertion): `present` · `absent` · `equals` · +`not_equals` · `not_empty` · `contains` · `not_contains` · `contains_any` · +`contains_all` · `regex` · `gt` · `gte` · `lt` · `lte` · `type` · `length_gte` · +`one_of`. 
Add `"hard": false` to make a failure a soft signal instead of failing +the case (quality checks are always soft; audit checks skip when audit data is missing). + +**Quality targets:** `stream` · `body.text` · `response:$.path` · `audit:$.path`. diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/models.json b/docs/2026-06-25_aws_gateway_benchmark/qa/models.json new file mode 100644 index 00000000..98dfdc3e --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/qa/models.json @@ -0,0 +1,20 @@ +{ + "_comment": "Logical model roles used by the spec (@openai.chat, @anthropic.thinking, ...). Edit these to match the models your account/keys can reach. Image roles (@image.red/blue/green) are generated by the harness and need no entry.", + "openai": { + "chat": "gpt-4.1-mini", + "vision": "gpt-4.1-mini", + "reasoning": "gpt-5-mini", + "tts": "gpt-4o-mini-tts", + "stt": "gpt-4o-mini-transcribe", + "embed": "text-embedding-3-small" + }, + "anthropic": { + "chat": "claude-sonnet-4-6", + "vision": "claude-sonnet-4-6", + "thinking": "claude-opus-4-8" + }, + "gemini": { + "chat": "gemini-2.5-flash", + "vision": "gemini-2.5-flash" + } +} diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/__init__.py b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/__init__.py new file mode 100644 index 00000000..ebb6a9b8 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/__init__.py @@ -0,0 +1,9 @@ +"""qalib — small helpers for the GoModel quality (QA) harness. + +Stdlib-only. 
Split into focused modules so each stays readable: + config — gateway URL, master key, model/image role resolution, spec loading + paths — JSON-path mini-language + deterministic image fixtures + assertions — declarative assertion operators + client — HTTP send (JSON / multipart / SSE) + audit-log lookup + report — console table + results.json + report.md +""" diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/assertions.py b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/assertions.py new file mode 100644 index 00000000..93a8d78b --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/assertions.py @@ -0,0 +1,93 @@ +"""Declarative assertion operators. + +Each assertion object names exactly one operator plus optional metadata: + + {"path": "$.usage.total_tokens", "gt": 0} + {"path": "$.choices[0].message.content", "not_empty": true} + {"path": "$.system_fingerprint", "present": true, "hard": false} + +`hard` (default true) decides whether a failure fails the case or is recorded +as a soft/quality signal. The caller locates the value (from a response body, +header, stream, or audit entry) and passes (found, value) here. +""" +import re + +from .paths import json_type + +# Operators that are meaningful even when the value is absent. +_ABSENCE_OPS = {"present", "absent"} + + +def _as_number(v): + try: + return float(v) + except (TypeError, ValueError): + return None + + +def apply_operator(assertion, found, value): + """Evaluate one assertion. Returns (ok: bool, reason: str).""" + for op in assertion: + if op in ("path", "field", "name", "hard", "note", "target"): + continue + expected = assertion[op] + + if op == "present": + ok = found is expected if isinstance(expected, bool) else found + return ok, f"present={found}" + if op == "absent": + return (not found), f"present={found}" + + # All remaining operators require the value to exist. 
+ if not found and op not in _ABSENCE_OPS: + return False, "value not found" + + if op == "equals": + return value == expected, f"{value!r} == {expected!r}" + if op == "not_equals": + return value != expected, f"{value!r} != {expected!r}" + if op == "not_empty": + empty = value is None or value == "" or value == [] or value == {} + return (not empty), f"non-empty (got {_short(value)})" + if op == "contains": + return str(expected).lower() in str(value).lower(), f"contains {expected!r}" + if op == "not_contains": + return str(expected).lower() not in str(value).lower(), f"not contains {expected!r}" + if op == "contains_any": + hay = str(value).lower() + hit = next((w for w in expected if str(w).lower() in hay), None) + return hit is not None, f"any{expected} -> {hit!r}" + if op == "contains_all": + hay = str(value).lower() + miss = [w for w in expected if str(w).lower() not in hay] + return not miss, f"all present (missing {miss})" + if op == "regex": + return re.search(expected, str(value)) is not None, f"~ /{expected}/" + if op in ("gt", "gte", "lt", "lte"): + n, e = _as_number(value), _as_number(expected) + if n is None or e is None: + return False, f"non-numeric {value!r}" + ok = {"gt": n > e, "gte": n >= e, "lt": n < e, "lte": n <= e}[op] + return ok, f"{n} {op} {e}" + if op == "type": + return json_type(value) == expected, f"type {json_type(value)} == {expected}" + if op == "length_gte": + try: + return len(value) >= expected, f"len {len(value)} >= {expected}" + except TypeError: + return False, f"no length: {value!r}" + if op == "one_of": + return value in expected, f"{value!r} in {expected}" + + return False, f"unknown operator {op!r}" + + return False, "empty assertion" + + +def is_hard(assertion): + return assertion.get("hard", True) + + +def _short(value, n=60): + s = repr(value) + return s if len(s) <= n else s[: n - 1] + "…" diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/client.py b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/client.py 
new file mode 100644 index 00000000..18830604 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/client.py @@ -0,0 +1,206 @@ +"""HTTP client for the QA harness: JSON, multipart, and SSE, plus audit lookup. + +Stdlib only (urllib). Every gateway call carries a unique X-Request-Id and a +run-scoped X-GoModel-User-Path so the matching audit entry can be found, which +is how the harness inspects what the gateway *recorded* it received and +returned (request/response bodies, provider, resolved model, usage). +""" +import json +import time +import urllib.error +import urllib.request +import uuid + + +class Result: + """Captured outcome of one HTTP exchange.""" + + def __init__(self): + self.status = 0 + self.headers = {} + self.request_id = "" + self.json = None # parsed JSON body (if any) + self.text = None # text body (non-JSON) + self.raw = b"" # raw body bytes (binary, e.g. TTS audio) + self.bytes = 0 # raw body length + self.content_type = "" + self.events = [] # parsed SSE event objects + self.stream_text = "" # assembled assistant text from a stream + self.terminal = None # terminal SSE marker seen ("[DONE]", "message_stop", ...) 
+ self.error = None # transport-level exception text + + +class Client: + def __init__(self, base_url, api_key, user_path, timeout=120): + self.base = base_url.rstrip("/") + self.api_key = api_key + self.user_path = user_path + self.timeout = timeout + + def _common_headers(self, request_id, extra): + h = { + "Authorization": f"Bearer {self.api_key}", + "X-Request-ID": request_id, + "X-GoModel-User-Path": self.user_path, + } + if extra: + h.update(extra) + return h + + # ── JSON / raw request, optionally streaming ──────────────────────────── + def send(self, method, path, body=None, headers=None, stream=False, raw_body=None): + rid = "qa-" + uuid.uuid4().hex[:24] + res = Result() + res.request_id = rid + url = self.base + path + hdrs = self._common_headers(rid, headers) + + data = None + if raw_body is not None: + data = raw_body.encode("utf-8") + hdrs.setdefault("Content-Type", "application/json") + elif body is not None: + data = json.dumps(body).encode("utf-8") + hdrs["Content-Type"] = "application/json" + + req = urllib.request.Request(url, data=data, method=method, headers=hdrs) + try: + resp = urllib.request.urlopen(req, timeout=self.timeout) + self._capture(res, resp, stream) + except urllib.error.HTTPError as e: + res.status = e.code + self._capture(res, e, stream=False) + except Exception as e: # noqa: BLE001 — surface any transport failure as ERROR + res.error = f"{type(e).__name__}: {e}" + return res + + # ── multipart/form-data (audio transcriptions) ────────────────────────── + def send_multipart(self, path, fields, file_field, filename, file_bytes, + file_content_type, headers=None): + rid = "qa-" + uuid.uuid4().hex[:24] + res = Result() + res.request_id = rid + boundary = "----qa" + uuid.uuid4().hex + parts = [] + for k, v in (fields or {}).items(): + parts.append(f"--{boundary}\r\n".encode()) + parts.append(f'Content-Disposition: form-data; name="{k}"\r\n\r\n'.encode()) + parts.append(f"{v}\r\n".encode()) + 
parts.append(f"--{boundary}\r\n".encode()) + parts.append( + f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'.encode()) + parts.append(f"Content-Type: {file_content_type}\r\n\r\n".encode()) + parts.append(file_bytes) + parts.append(f"\r\n--{boundary}--\r\n".encode()) + data = b"".join(parts) + + hdrs = self._common_headers(rid, headers) + hdrs["Content-Type"] = f"multipart/form-data; boundary={boundary}" + req = urllib.request.Request(self.base + path, data=data, method="POST", headers=hdrs) + try: + resp = urllib.request.urlopen(req, timeout=self.timeout) + self._capture(res, resp, stream=False) + except urllib.error.HTTPError as e: + res.status = e.code + self._capture(res, e, stream=False) + except Exception as e: # noqa: BLE001 + res.error = f"{type(e).__name__}: {e}" + return res + + # ── response capture ──────────────────────────────────────────────────── + def _capture(self, res, resp, stream): + res.status = getattr(resp, "status", res.status) or res.status + try: + res.headers = {k.lower(): v for k, v in resp.headers.items()} + except Exception: # noqa: BLE001 + res.headers = {} + res.request_id = res.headers.get("x-request-id", res.request_id) + res.content_type = res.headers.get("content-type", "") + + if stream and "text/event-stream" in res.content_type: + self._read_sse(res, resp) + return + + raw = resp.read() + res.raw = raw + res.bytes = len(raw) + if "application/json" in res.content_type: + try: + res.json = json.loads(raw.decode("utf-8")) + except Exception: # noqa: BLE001 + res.text = raw.decode("utf-8", "replace") + elif res.content_type.startswith("text/"): + res.text = raw.decode("utf-8", "replace") + # binary (audio) bodies: only size + content-type are kept. 
+ + def _read_sse(self, res, resp): + for rawline in resp: + line = rawline.decode("utf-8", "replace").rstrip("\n").rstrip("\r") + if not line or line.startswith(":"): + continue + if not line.startswith("data:"): + continue + payload = line[len("data:"):].strip() + if payload == "[DONE]": + res.terminal = "[DONE]" + continue + try: + ev = json.loads(payload) + except Exception: # noqa: BLE001 + continue + res.events.append(ev) + self._accumulate(res, ev) + + @staticmethod + def _accumulate(res, ev): + """Assemble assistant text across the three streaming dialects and note + terminal markers.""" + etype = ev.get("type") + if etype in ("response.completed", "message_stop", "response.output_text.done"): + res.terminal = etype + # chat.completions: choices[].delta.content + for ch in ev.get("choices", []) or []: + delta = ch.get("delta") or {} + if isinstance(delta.get("content"), str): + res.stream_text += delta["content"] + # responses: output_text deltas + if etype == "response.output_text.delta" and isinstance(ev.get("delta"), str): + res.stream_text += ev["delta"] + # anthropic messages: content_block_delta.text + if etype == "content_block_delta": + d = ev.get("delta") or {} + if isinstance(d.get("text"), str): + res.stream_text += d["text"] + + # ── audit lookup ──────────────────────────────────────────────────────── + def fetch_audit(self, request_id, attempts=6, delay=1.5): + """Find the audit entry for a request_id (retrying for flush lag) and + return the full detail entry, or None.""" + for i in range(attempts): + entry_id = self._find_entry_id(request_id) + if entry_id: + detail = self._get_json(f"/admin/audit/detail?log_id={entry_id}") + if detail: + return detail + if i < attempts - 1: + time.sleep(delay) + return None + + def _find_entry_id(self, request_id): + listing = self._get_json(f"/admin/audit/log?search={request_id}&limit=20") + if not listing: + return None + for entry in listing.get("entries", []): + if entry.get("request_id") == request_id: 
+ return entry.get("id") + return None + + def _get_json(self, path): + req = urllib.request.Request( + self.base + path, method="GET", + headers={"Authorization": f"Bearer {self.api_key}"}) + try: + resp = urllib.request.urlopen(req, timeout=self.timeout) + return json.loads(resp.read().decode("utf-8")) + except Exception: # noqa: BLE001 + return None diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/config.py b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/config.py new file mode 100644 index 00000000..a3072b4e --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/config.py @@ -0,0 +1,114 @@ +"""Config loading: master key, model/image roles, spec files. + +The spec never hardcodes a concrete model id. Cases reference logical roles +("@openai.chat", "@anthropic.thinking", "@image.red") that resolve through +`models.json`, so a user adapts the whole corpus to their account by editing +one file. +""" +import glob +import json +import os + +from .paths import png_base64, png_data_url + +_COLORS = {"red": (220, 30, 30), "blue": (30, 60, 220), "green": (30, 180, 70)} + +# @image. -> data: URL (chat/responses image_url form) +# @imageb64. 
-> raw base64 (native Anthropic image source.data) +IMAGES = {name: png_data_url(rgb) for name, rgb in _COLORS.items()} +IMAGES_B64 = {name: png_base64(rgb) for name, rgb in _COLORS.items()} + + +def load_master_key(repo_root): + """Master/admin key: env first, then the repo .env (never printed).""" + key = os.environ.get("GOMODEL_API_KEY") or os.environ.get("GOMODEL_MASTER_KEY") + if key: + return key.strip() + env_path = os.path.join(repo_root, ".env") + if os.path.exists(env_path): + with open(env_path, encoding="utf-8") as f: + for line in f: + line = line.strip() + if line.startswith("GOMODEL_MASTER_KEY="): + return line.split("=", 1)[1].strip().strip('"').strip("'") + return "" + + +def load_models(path): + with open(path, encoding="utf-8") as f: + return json.load(f) + + +def load_specs(spec_dir, only=None): + """Load and concatenate every spec/*.json (sorted by filename, then array + order). `only` filters by substring against id / group / provider.""" + cases = [] + for path in sorted(glob.glob(os.path.join(spec_dir, "*.json"))): + with open(path, encoding="utf-8") as f: + data = json.load(f) + for case in data: + case.setdefault("group", os.path.splitext(os.path.basename(path))[0]) + cases.append(case) + if only: + needle = only.lower() + cases = [c for c in cases + if needle in c.get("id", "").lower() + or needle in c.get("group", "").lower() + or needle in c.get("provider", "").lower()] + return cases + + +def resolve_roles(obj, models): + """Recursively replace @provider.role and @image.name tokens with concrete + values. 
Returns (resolved_obj, unresolved_roles).""" + unresolved = [] + + def walk(node): + if isinstance(node, str): + if node.startswith("@imageb64."): + name = node[len("@imageb64."):] + if name in IMAGES_B64: + return IMAGES_B64[name] + unresolved.append(node) + return node + if node.startswith("@image."): + name = node[len("@image."):] + if name in IMAGES: + return IMAGES[name] + unresolved.append(node) + return node + if node.startswith("@"): + parts = node[1:].split(".") + cur = models + for p in parts: + if isinstance(cur, dict) and p in cur: + cur = cur[p] + else: + unresolved.append(node) + return node + return cur + return node + if isinstance(node, list): + return [walk(x) for x in node] + if isinstance(node, dict): + return {k: walk(v) for k, v in node.items()} + return node + + return walk(obj), unresolved + + +def interpolate_vars(obj, variables): + """Replace ${var} occurrences inside any string using captured runtime vars.""" + def walk(node): + if isinstance(node, str): + out = node + for name, val in variables.items(): + out = out.replace("${" + name + "}", str(val)) + return out + if isinstance(node, list): + return [walk(x) for x in node] + if isinstance(node, dict): + return {k: walk(v) for k, v in node.items()} + return node + + return walk(obj) diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/paths.py b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/paths.py new file mode 100644 index 00000000..c82fbbda --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/paths.py @@ -0,0 +1,90 @@ +"""JSON-path mini-language and deterministic image fixtures. + +The path language is intentionally tiny — enough to address normalized AI +responses and audit entries without a dependency: + + $ the root object + $.a.b nested object keys + $.choices[0].message array index + $.data.request_body.x arbitrary nested key (audit bodies) + +`get_path` returns (found, value) so callers can distinguish "missing" from +"present but null/empty". 
+""" +import base64 +import re +import struct +import zlib + +_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]") + + +def get_path(obj, path): + """Resolve a `$.a.b[0]` path. Returns (found: bool, value).""" + if path in ("$", "", None): + return True, obj + if path.startswith("$."): + path = path[2:] + elif path.startswith("$"): + path = path[1:] + cur = obj + for key, idx in _TOKEN.findall(path): + if idx != "": + if not isinstance(cur, list): + return False, None + i = int(idx) + if i >= len(cur): + return False, None + cur = cur[i] + else: + if not isinstance(cur, dict) or key not in cur: + return False, None + cur = cur[key] + return True, cur + + +def json_type(value): + """JSON type name for a Python value (for the `type` assertion).""" + if value is None: + return "null" + if isinstance(value, bool): + return "boolean" + if isinstance(value, (int, float)): + return "number" + if isinstance(value, str): + return "string" + if isinstance(value, list): + return "array" + if isinstance(value, dict): + return "object" + return "unknown" + + +# ── deterministic image fixtures ──────────────────────────────────────────── +# A solid-colour PNG is the simplest reproducible vision input: every provider +# can name a colour, so `quality: contains_any [red]` is a stable smoke check +# that needs no network fetch and no binary asset checked into the repo. 
+ +def _solid_png(rgb, size=48): + raw = bytearray() + row = bytes(rgb) * size + for _ in range(size): + raw.append(0) # PNG filter type 0 (none) per scanline + raw.extend(row) + + def chunk(typ, data): + body = typ + data + return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body) & 0xFFFFFFFF) + + sig = b"\x89PNG\r\n\x1a\n" + ihdr = struct.pack(">IIBBBBB", size, size, 8, 2, 0, 0, 0) # 8-bit RGB + idat = zlib.compress(bytes(raw), 9) + return sig + chunk(b"IHDR", ihdr) + chunk(b"IDAT", idat) + chunk(b"IEND", b"") + + +def png_base64(rgb, size=48): + return base64.b64encode(_solid_png(rgb, size)).decode("ascii") + + +def png_data_url(rgb, size=48): + return "data:image/png;base64," + png_base64(rgb, size) diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/report.py b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/report.py new file mode 100644 index 00000000..d551e84e --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/qa/qalib/report.py @@ -0,0 +1,117 @@ +"""Reporting: console table, results.json, and a Markdown report. + +The report "registers" each case — the request as sent, the response and how +the gateway recorded/normalized it (from the audit entry), and every assertion +with its observed value — and "rates" it PASS / FAIL / ERROR / SKIP plus a +0–100 quality score for soft modality checks. 
+""" +import json +import os + +STATUS_GLYPH = {"PASS": "PASS", "FAIL": "FAIL", "ERROR": "ERR ", "SKIP": "skip"} + + +def quality_score(case_result): + soft = [c for c in case_result["checks"] if not c["hard"]] + if not soft: + return None + return round(100 * sum(1 for c in soft if c["ok"]) / len(soft)) + + +def print_console(results, meta): + print("\n" + "=" * 92) + print("GOMODEL QUALITY (QA) SUITE") + print("=" * 92) + print(f"gateway={meta['gateway']} cases={len(results)} " + f"audit_bodies={'on' if meta['audit_bodies'] else 'OFF'}") + print("-" * 92) + hdr = f"{'status':6} {'id':46} {'prov':9} {'http':>4} {'qual':>5} detail" + print(hdr) + print("-" * 92) + for r in results: + q = quality_score(r) + qs = f"{q:>4}%" if q is not None else " - " + detail = r["detail"] + if len(detail) > 24: + detail = detail[:23] + "…" + print(f"{STATUS_GLYPH.get(r['status'], r['status']):6} " + f"{r['id'][:46]:46} {(r.get('provider') or ''):9} " + f"{(r['http'] or ''):>4} {qs:>5} {detail}") + + counts = _counts(results) + print("-" * 92) + print(f"PASS {counts['PASS']} FAIL {counts['FAIL']} " + f"ERROR {counts['ERROR']} SKIP {counts['SKIP']} " + f"(total {len(results)})") + _print_breakdown("by endpoint", results, "group") + _print_breakdown("by provider", results, "provider") + print("=" * 92) + + +def _counts(results): + c = {"PASS": 0, "FAIL": 0, "ERROR": 0, "SKIP": 0} + for r in results: + c[r["status"]] = c.get(r["status"], 0) + 1 + return c + + +def _print_breakdown(label, results, key): + groups = {} + for r in results: + g = r.get(key) or "?" 
+ groups.setdefault(g, {"PASS": 0, "FAIL": 0, "ERROR": 0, "SKIP": 0}) + groups[g][r["status"]] += 1 + line = " ".join( + f"{g}:{v['PASS']}/{v['PASS'] + v['FAIL'] + v['ERROR'] + v['SKIP']}" + for g, v in sorted(groups.items())) + print(f"{label:12}: {line}") + + +def write_results(out_dir, results, meta): + os.makedirs(out_dir, exist_ok=True) + with open(os.path.join(out_dir, "results.json"), "w", encoding="utf-8") as f: + json.dump({"meta": meta, "counts": _counts(results), "cases": results}, + f, indent=2) + _write_markdown(out_dir, results, meta) + return out_dir + + +def _write_markdown(out_dir, results, meta): + c = _counts(results) + L = ["# GoModel Quality (QA) Report\n", + f"`gateway={meta['gateway']} cases={len(results)} " + f"audit_bodies={'on' if meta['audit_bodies'] else 'off'}`\n", + f"**PASS {c['PASS']} · FAIL {c['FAIL']} · ERROR {c['ERROR']} · SKIP {c['SKIP']}**\n", + "| status | id | endpoint | provider | modality | http | quality | detail |", + "|---|---|---|---|--:|--:|--:|---|"] + for r in results: + q = quality_score(r) + qs = f"{q}%" if q is not None else "" + mod = r.get("modality") + if isinstance(mod, str): + mod = [mod] + elif not isinstance(mod, list): + mod = [] + modality = ",".join(str(m) for m in mod) + L.append(f"| {r['status']} | `{r['id']}` | {r.get('group','')} | " + f"{r.get('provider','')} | {modality} | {r['http'] or ''} | {qs} | " + f"{_md(r['detail'])} |") + L.append("") + L.append("## Failed & errored cases\n") + bad = [r for r in results if r["status"] in ("FAIL", "ERROR")] + if not bad: + L.append("_None._\n") + for r in bad: + L.append(f"### `{r['id']}` — {r['status']}\n") + L.append(f"- {_md(r.get('title',''))}") + L.append(f"- http `{r['http']}` · {_md(r['detail'])}") + for chk in r["checks"]: + if not chk["ok"] and chk["hard"]: + L.append(f" - FAIL `{chk['where']}` — {_md(chk['reason'])}") + L.append("") + with open(os.path.join(out_dir, "report.md"), "w", encoding="utf-8") as f: + f.write("\n".join(L)) + + +def _md(s): 
+ return str(s).replace("|", "\\|").replace("\n", " ") diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/run_qa.py b/docs/2026-06-25_aws_gateway_benchmark/qa/run_qa.py new file mode 100644 index 00000000..804973f9 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/qa/run_qa.py @@ -0,0 +1,347 @@ +#!/usr/bin/env python3 +"""GoModel quality (QA) harness — declarative spec runner. + +Sends a curated corpus of complex requests through a running GoModel gateway to +real providers (OpenAI / Anthropic / Gemini) across every dialect and modality, +then registers and rates each case: + + - registers the request as sent, the response, and how the gateway *recorded* + and normalized it (pulled from the audit log: inbound body, normalized body, + provider, resolved model, usage); + - rates each case PASS / FAIL / ERROR / SKIP, plus a 0–100 quality score for + soft modality checks (did the vision model name the colour, did STT recover + the spoken words, …). + +Usage: + python run_qa.py # full corpus against localhost:8080 + python run_qa.py --only chat # filter by id/group/provider substring + python run_qa.py --only openai --no-audit + python run_qa.py --list # list cases without running + python run_qa.py --gateway http://host:8080 --models models.json + +Requires the gateway running with audit logging + bodies for the preservation +checks: LOGGING_ENABLED=true LOGGING_LOG_BODIES=true LOGGING_LOG_HEADERS=true +LOGGING_LOG_AUDIO_BODIES=true (see README). Stdlib only. 
+""" +import argparse +import os +import sys +import time +import uuid + +HERE = os.path.dirname(os.path.abspath(__file__)) +sys.path.insert(0, HERE) + +from qalib import config, report # noqa: E402 +from qalib.assertions import apply_operator, is_hard # noqa: E402 +from qalib.client import Client # noqa: E402 +from qalib.paths import get_path # noqa: E402 + +def _find_repo_root(start): + """Walk up to the repo root (the dir holding .git), for the .env lookup.""" + d = start + while d != os.path.dirname(d): + if os.path.exists(os.path.join(d, ".git")): + return d + d = os.path.dirname(d) + return start + + +REPO_ROOT = _find_repo_root(HERE) + + +def locate(target, res, audit): + """Resolve a quality/assertion target selector to (found, value).""" + if target == "stream": + return bool(res.stream_text), res.stream_text + if target == "body.text": + return res.text is not None, res.text + if target.startswith("response:"): + return get_path(res.json, target[len("response:"):]) + if target.startswith("audit:"): + if audit is None: + return False, None + return get_path(audit, target[len("audit:"):]) + return False, None + + +def evaluate(case, res, audit, audit_attempted, variables=None): + """Return (status, checks, detail). checks: [{where, ok, hard, reason}].""" + checks = [] + expect = case.get("expect", {}) + if variables: + # Resolve ${var} references (e.g. a captured ${conversation_id}) in + # assertion operands, the same way request paths/bodies are interpolated. 
+ expect = config.interpolate_vars(expect, variables) + + if res.error: + return "ERROR", checks, res.error + + # ── status ────────────────────────────────────────────────────────────── + want = expect.get("status", 200) + want = want if isinstance(want, list) else [want] + checks.append({"where": "status", "ok": res.status in want, "hard": True, + "reason": f"{res.status} in {want}"}) + + # ── headers ─────────────────────────────────────────────────────────────── + for a in expect.get("headers", []): + name = a["name"].lower() + found = name in res.headers + ok, reason = apply_operator(a, found, res.headers.get(name)) + checks.append({"where": f"header:{a['name']}", "ok": ok, + "hard": is_hard(a), "reason": reason}) + + # ── body (synthetic fields for any body, incl. binary) ──────────────────── + body_fields = {"content_type": res.content_type, "bytes": res.bytes, + "text": res.text} + for a in expect.get("body", []): + field = a["field"] + val = body_fields.get(field) + ok, reason = apply_operator(a, val is not None, val) + checks.append({"where": f"body:{field}", "ok": ok, + "hard": is_hard(a), "reason": reason}) + + # ── response JSON ───────────────────────────────────────────────────────── + for a in expect.get("response", []): + found, val = get_path(res.json, a["path"]) if res.json is not None else (False, None) + ok, reason = apply_operator(a, found, val) + checks.append({"where": f"response:{a['path']}", "ok": ok, + "hard": is_hard(a), "reason": reason}) + + # ── streaming ───────────────────────────────────────────────────────────── + st = expect.get("stream") + if st: + if "min_events" in st: + n = len(res.events) + checks.append({"where": "stream:events", "ok": n >= st["min_events"], + "hard": True, "reason": f"{n} events >= {st['min_events']}"}) + if "terminal" in st: + checks.append({"where": "stream:terminal", "ok": res.terminal == st["terminal"], + "hard": True, "reason": f"{res.terminal!r} == {st['terminal']!r}"}) + for et in 
st.get("event_types", []): + present = any(e.get("type") == et for e in res.events) + checks.append({"where": f"stream:type:{et}", "ok": present, + "hard": True, "reason": f"event {et} present={present}"}) + for a in st.get("text", []): + ok, reason = apply_operator(a, bool(res.stream_text), res.stream_text) + checks.append({"where": "stream:text", "ok": ok, + "hard": is_hard(a), "reason": reason}) + + # ── audit (gateway's own record of what it received / returned) ─────────── + for a in expect.get("audit", []): + path = a["path"] + if not audit_attempted: + continue + if audit is None: + checks.append({"where": f"audit:{path}", "ok": True, "hard": False, + "reason": "audit entry not found (skipped)"}) + continue + found, val = get_path(audit, path) + # If body capture is off, demote data.* checks to soft skips. + if not found and path.startswith("$.data."): + data = audit.get("data") or {} + if "request_body" not in data and "response_body" not in data: + checks.append({"where": f"audit:{path}", "ok": True, "hard": False, + "reason": "audit bodies off (enable LOGGING_LOG_BODIES)"}) + continue + ok, reason = apply_operator(a, found, val) + checks.append({"where": f"audit:{path}", "ok": ok, + "hard": is_hard(a), "reason": reason}) + + # ── quality (always soft; feeds the score) ──────────────────────────────── + for a in expect.get("quality", []): + found, val = locate(a.get("target", "stream"), res, audit) + a = dict(a) + a["hard"] = False + ok, reason = apply_operator(a, found, val) + checks.append({"where": f"quality:{a.get('target','stream')}", "ok": ok, + "hard": False, "reason": reason}) + + hard_fail = [c for c in checks if c["hard"] and not c["ok"]] + status = "FAIL" if hard_fail else "PASS" + if hard_fail: + detail = f"{hard_fail[0]['where']}: {hard_fail[0]['reason']}" + else: + ok_n = sum(1 for c in checks if c["ok"]) + detail = f"{ok_n}/{len(checks)} ok" + return status, checks, detail + + +def run_case(case, client, models, variables, do_audit): + 
"""Build, send, capture vars, fetch audit for one case. Returns (res, audit, + audit_attempted, skip_reason).""" + resolved, unresolved = config.resolve_roles(case.get("request", {}), models) + if unresolved: + return None, None, False, f"unresolved role(s): {', '.join(sorted(set(unresolved)))}" + req = config.interpolate_vars(resolved, variables) + + produce = req.get("produce") + if produce == "tts_then_stt": + res = _produce_tts_then_stt(req, client) + else: + res = client.send(req.get("method", "POST"), req["path"], body=req.get("body"), + headers=req.get("headers"), stream=req.get("stream", False), + raw_body=req.get("raw_body")) + + # capture runtime vars from the response body + for name, path in (case.get("capture") or {}).items(): + if res.json is not None: + found, val = get_path(res.json, path) + if found: + variables[name] = val + + audit_attempted = bool(do_audit and case.get("expect", {}).get("audit")) + audit = client.fetch_audit(res.request_id) if audit_attempted else None + return res, audit, audit_attempted, None + + +def _produce_tts_then_stt(req, client): + tts = req["tts"] + fmt = tts.get("response_format", "mp3") + r1 = client.send("POST", "/v1/audio/speech", body=tts) + if r1.status != 200 or not r1.raw: + r1.error = f"tts produce failed (status {r1.status}, {r1.bytes} bytes)" + return r1 + stt = req["stt"] + mime = r1.content_type or "audio/mpeg" + res = client.send_multipart("/v1/audio/transcriptions", stt, "file", + f"qa.{fmt}", r1.raw, mime) + res.produced_from = {"tts_status": r1.status, "tts_bytes": r1.bytes, + "tts_content_type": r1.content_type} + return res + + +def _trim(obj, limit=4000): + """Trim long strings (base64 audio, etc.) 
so the artifact stays readable.""" + if isinstance(obj, str): + return obj if len(obj) <= limit else obj[:limit] + f"…(+{len(obj) - limit} chars)" + if isinstance(obj, list): + return [_trim(x, limit) for x in obj] + if isinstance(obj, dict): + return {k: _trim(v, limit) for k, v in obj.items()} + return obj + + +def artifact(case, res, audit): + """The registered record: what was sent, what came back, how the gateway + recorded/normalized it.""" + if res is None: + return {"request": case.get("request"), "response": None, "audit": None} + resp = {"status": res.status, "content_type": res.content_type, + "bytes": res.bytes, "request_id": res.request_id} + if res.json is not None: + resp["json"] = _trim(res.json) + if res.text is not None: + resp["text"] = _trim(res.text) + if res.events: + resp["stream_events"] = len(res.events) + resp["stream_text"] = _trim(res.stream_text) + resp["terminal"] = res.terminal + if getattr(res, "produced_from", None): + resp["produced_from"] = res.produced_from + audit_view = None + if audit: + data = audit.get("data") or {} + audit_view = { + "provider": audit.get("provider"), + "resolved_model": audit.get("resolved_model"), + "requested_model": audit.get("requested_model"), + "status_code": audit.get("status_code"), + "duration_ns": audit.get("duration_ns"), + "usage": audit.get("usage"), + "request_body": _trim(data.get("request_body")), + "response_body": _trim(data.get("response_body")), + } + return {"request": _trim(case.get("request")), "response": resp, "audit": audit_view} + + +def main(): + ap = argparse.ArgumentParser(description="GoModel quality (QA) harness") + ap.add_argument("--gateway", default=os.environ.get("GATEWAY", "http://localhost:8080")) + ap.add_argument("--models", default=os.path.join(HERE, "models.json")) + ap.add_argument("--spec-dir", default=os.path.join(HERE, "spec")) + ap.add_argument("--out", default=os.path.join(HERE, "output")) + ap.add_argument("--only", default=None, help="filter by 
id/group/provider substring") + ap.add_argument("--no-audit", action="store_true", help="skip audit-log cross-checks") + ap.add_argument("--list", action="store_true", help="list matching cases and exit") + ap.add_argument("--timeout", type=int, default=120) + args = ap.parse_args() + + models = config.load_models(args.models) + cases = config.load_specs(args.spec_dir, args.only) + if not cases: + print("no cases matched", file=sys.stderr) + return 2 + if args.list: + for c in cases: + print(f"{c['id']:48} {c.get('group',''):14} {c.get('provider','')}") + print(f"\n{len(cases)} cases") + return 0 + + key = config.load_master_key(REPO_ROOT) + if not key: + print("no GOMODEL_MASTER_KEY found (env or repo .env)", file=sys.stderr) + return 2 + + run_id = uuid.uuid4().hex[:12] + user_path = f"/qa/{run_id}" + client = Client(args.gateway, key, user_path, timeout=args.timeout) + + health = client.send("GET", "/health") + if health.error or health.status >= 500: + print(f"gateway not reachable at {args.gateway}: " + f"{health.error or health.status}", file=sys.stderr) + return 2 + + print(f"running {len(cases)} cases against {args.gateway} (user_path {user_path})") + results = [] + variables = {} + audit_bodies_seen = False + for case in cases: + t0 = time.time() + try: + res, audit, attempted, skip = run_case(case, client, models, variables, + do_audit=not args.no_audit) + + if skip: + results.append(_record(case, "SKIP", [], skip, res, audit, time.time() - t0)) + print(f"skip {case['id']}: {skip}") + continue + + if audit and (audit.get("data") or {}).get("request_body") is not None: + audit_bodies_seen = True + + status, checks, detail = evaluate(case, res, audit, attempted, variables) + rec = _record(case, status, checks, detail, res, audit, time.time() - t0) + results.append(rec) + print(f"{report.STATUS_GLYPH.get(status, status):4} {case['id']}: {detail}") + except Exception as e: # noqa: BLE001 — never let one case abort the run + err = f"{type(e).__name__}: {e}" + 
results.append(_record(case, "ERROR", [], err, None, None, time.time() - t0)) + print(f"ERR {case['id']}: {err}") + continue + + meta = {"gateway": args.gateway, "run_id": run_id, "user_path": user_path, + "audit_bodies": audit_bodies_seen, "models": models} + report.print_console(results, meta) + out_dir = os.path.join(args.out, run_id) + report.write_results(out_dir, results, meta) + print(f"\nwrote {os.path.join(out_dir, 'results.json')}\n" + f"wrote {os.path.join(out_dir, 'report.md')}") + + failed = sum(1 for r in results if r["status"] in ("FAIL", "ERROR")) + return 1 if failed else 0 + + +def _record(case, status, checks, detail, res, audit, elapsed): + return { + "id": case["id"], "title": case.get("title", ""), "group": case.get("group"), + "provider": case.get("provider"), "modality": case.get("modality"), + "status": status, "http": (res.status if res else None), + "detail": detail, "elapsed_ms": round(elapsed * 1000), + "checks": checks, "artifact": artifact(case, res, audit), + } + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/audio.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/audio.json new file mode 100644 index 00000000..a2a2fbfb --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/audio.json @@ -0,0 +1,75 @@ +[ + { + "id": "audio.openai.tts_mp3", + "title": "TTS: synthesize speech (mp3)", + "provider": "openai", + "modality": ["audio"], + "request": { + "path": "/v1/audio/speech", + "body": {"model": "@openai.tts", "voice": "alloy", "input": "The quick brown fox jumps over the lazy dog.", "response_format": "mp3"} + }, + "expect": { + "status": 200, + "body": [ + {"field": "content_type", "contains": "audio/"}, + {"field": "bytes", "gte": 2000} + ] + }, + "notes": "Text-to-speech returns binary audio with an audio/* content type." 
+ }, + { + "id": "audio.openai.tts_wav", + "title": "TTS: response_format wav changes content type", + "provider": "openai", + "modality": ["audio"], + "request": { + "path": "/v1/audio/speech", + "body": {"model": "@openai.tts", "voice": "alloy", "input": "Hello world.", "response_format": "wav"} + }, + "expect": { + "status": 200, + "body": [ + {"field": "content_type", "contains": "audio/wav"}, + {"field": "bytes", "gte": 2000} + ] + }, + "notes": "response_format must drive the response MIME type (audio/wav)." + }, + { + "id": "audio.openai.tts_stt_json", + "title": "STT: round-trip TTS -> transcription (json) recovers the words", + "provider": "openai", + "modality": ["audio"], + "request": { + "produce": "tts_then_stt", + "tts": {"model": "@openai.tts", "voice": "alloy", "input": "Benchmark gateways measure latency and cost.", "response_format": "mp3"}, + "stt": {"model": "@openai.stt", "response_format": "json"} + }, + "expect": { + "status": 200, + "response": [{"path": "$.text", "not_empty": true}], + "quality": [{"target": "response:$.text", "contains_any": ["benchmark", "gateway", "latency", "cost"]}] + }, + "notes": "Self-contained modality round-trip: synthesize known text, transcribe it, assert the words come back. No external audio fixture." + }, + { + "id": "audio.openai.tts_stt_text", + "title": "STT: response_format text returns plain text", + "provider": "openai", + "modality": ["audio"], + "request": { + "produce": "tts_then_stt", + "tts": {"model": "@openai.tts", "voice": "alloy", "input": "Speech to text in plain format.", "response_format": "mp3"}, + "stt": {"model": "@openai.stt", "response_format": "text"} + }, + "expect": { + "status": 200, + "body": [ + {"field": "content_type", "contains": "text/"}, + {"field": "text", "not_empty": true} + ], + "quality": [{"target": "body.text", "contains_any": ["speech", "text", "plain", "format"]}] + }, + "notes": "Transcription response_format=text returns text/plain, not JSON." 
+ } +] diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json new file mode 100644 index 00000000..f2ea9eab --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json @@ -0,0 +1,448 @@ +[ + { + "id": "chat.openai.multiturn", + "title": "OpenAI chat: multi-turn system+user+assistant+user", + "provider": "openai", + "modality": ["text"], + "request": { + "path": "/v1/chat/completions", + "body": { + "model": "@openai.chat", + "messages": [ + {"role": "system", "content": "You are a terse assistant. Answer in one short sentence."}, + {"role": "user", "content": "Name the largest planet in the solar system."}, + {"role": "assistant", "content": "Jupiter."}, + {"role": "user", "content": "And the smallest?"} + ], + "max_tokens": 64 + } + }, + "expect": { + "status": 200, + "headers": [{"name": "X-Request-Id", "present": true}], + "response": [ + {"path": "$.object", "equals": "chat.completion"}, + {"path": "$.choices[0].message.role", "equals": "assistant"}, + {"path": "$.choices[0].message.content", "not_empty": true}, + {"path": "$.usage.total_tokens", "gt": 0} + ], + "audit": [ + {"path": "$.provider", "equals": "openai"}, + {"path": "$.resolved_model", "not_empty": true} + ], + "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["mercury"]}] + }, + "notes": "Baseline conversational correctness + OpenAI usage normalization + audit routing." 
+ }, + { + "id": "chat.openai.stream", + "title": "OpenAI chat: streaming deltas terminate with [DONE]", + "provider": "openai", + "modality": ["text", "streaming"], + "request": { + "path": "/v1/chat/completions", + "stream": true, + "body": { + "model": "@openai.chat", + "messages": [{"role": "user", "content": "List three primary colors, comma separated."}], + "stream": true, + "max_tokens": 64 + } + }, + "expect": { + "status": 200, + "stream": {"min_events": 2, "terminal": "[DONE]", "text": [{"not_empty": true}]}, + "quality": [{"target": "stream", "contains_any": ["red", "blue", "yellow"]}] + }, + "notes": "SSE framing + terminal marker for chat dialect." + }, + { + "id": "chat.openai.stream_usage", + "title": "OpenAI chat: stream_options include_usage emits a usage chunk", + "provider": "openai", + "modality": ["text", "streaming"], + "request": { + "path": "/v1/chat/completions", + "stream": true, + "body": { + "model": "@openai.chat", + "messages": [{"role": "user", "content": "Say hi."}], + "stream": true, + "stream_options": {"include_usage": true}, + "max_tokens": 32 + } + }, + "expect": { + "status": 200, + "stream": {"min_events": 2, "terminal": "[DONE]"}, + "audit": [{"path": "$.usage.total_tokens", "gt": 0}], + "quality": [{"target": "stream", "not_empty": true}] + }, + "notes": "stream_options must survive translation; the usage chunk is provider-shaped. For a stream the gateway can only derive usage from the streamed usage chunk, so a recorded usage.total_tokens>0 proves the chunk was emitted and not dropped." + }, + { + "id": "chat.openai.vision_data_url", + "title": "OpenAI chat: vision via inline image_url (data URL)", + "provider": "openai", + "modality": ["vision"], + "request": { + "path": "/v1/chat/completions", + "body": { + "model": "@openai.vision", + "messages": [{"role": "user", "content": [ + {"type": "text", "text": "What is the single dominant color of this image? 
Answer with one word."}, + {"type": "image_url", "image_url": {"url": "@image.red"}} + ]}], + "max_tokens": 32 + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.choices[0].message.content", "not_empty": true}], + "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["red"]}] + }, + "notes": "Multimodal content-part passthrough; deterministic solid-color fixture." + }, + { + "id": "chat.openai.tools_call", + "title": "OpenAI chat: function/tool calling is emitted", + "provider": "openai", + "modality": ["tools"], + "request": { + "path": "/v1/chat/completions", + "body": { + "model": "@openai.chat", + "messages": [{"role": "user", "content": "What is the weather in Paris? Use the tool."}], + "tools": [{"type": "function", "function": { + "name": "get_weather", + "description": "Get current weather for a city", + "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]} + }}], + "tool_choice": "required", + "max_tokens": 128 + } + }, + "expect": { + "status": 200, + "response": [ + {"path": "$.choices[0].message.tool_calls[0].function.name", "equals": "get_weather"}, + {"path": "$.choices[0].finish_reason", "equals": "tool_calls"} + ] + }, + "notes": "tool_choice=required must force a structured tool call." 
+ }, + { + "id": "chat.openai.tools_roundtrip", + "title": "OpenAI chat: tool result fed back yields a final answer", + "provider": "openai", + "modality": ["tools"], + "request": { + "path": "/v1/chat/completions", + "body": { + "model": "@openai.chat", + "messages": [ + {"role": "user", "content": "What is the weather in Paris?"}, + {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_weather", "arguments": "{\"city\":\"Paris\"}"}}]}, + {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 21, \"summary\": \"sunny\"}"} + ], + "max_tokens": 64 + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.choices[0].message.content", "not_empty": true}], + "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["21", "sunny", "sun"]}] + }, + "notes": "Assistant tool_calls + role:tool message round-trip translation." + }, + { + "id": "chat.openai.structured_json_schema", + "title": "OpenAI chat: structured output via response_format json_schema", + "provider": "openai", + "modality": ["structured"], + "request": { + "path": "/v1/chat/completions", + "body": { + "model": "@openai.chat", + "messages": [{"role": "user", "content": "Give the capital of France."}], + "response_format": {"type": "json_schema", "json_schema": { + "name": "capital", + "strict": true, + "schema": {"type": "object", "properties": {"country": {"type": "string"}, "capital": {"type": "string"}}, "required": ["country", "capital"], "additionalProperties": false} + }}, + "max_tokens": 64 + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.choices[0].message.content", "not_empty": true}], + "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["paris"]}] + }, + "notes": "response_format must pass through and constrain output." 
+ }, + { + "id": "chat.openai.reasoning_max_tokens_mapping", + "title": "OpenAI reasoning: max_tokens accepted, temperature tolerated (Postel)", + "provider": "openai", + "modality": ["reasoning"], + "request": { + "path": "/v1/chat/completions", + "body": { + "model": "@openai.reasoning", + "messages": [{"role": "user", "content": "What is 17 + 25? Reply with the number only."}], + "max_tokens": 2000, + "temperature": 0.5 + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.choices[0].message.content", "not_empty": true}], + "audit": [ + {"path": "$.provider", "equals": "openai"}, + {"path": "$.data.request_body.max_tokens", "present": true, "hard": false} + ], + "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["42"]}] + }, + "notes": "Reasoning models reject max_tokens/temperature upstream; a 200 proves the gateway mapped max_tokens->max_completion_tokens and dropped temperature. Audit shows the inbound body is preserved verbatim." + }, + { + "id": "chat.openai.optional_field_preserved", + "title": "OpenAI chat: a valid optional field (user) is preserved end-to-end", + "provider": "openai", + "modality": ["text", "preservation"], + "request": { + "path": "/v1/chat/completions", + "headers": {"X-QA-Marker": "keep-123"}, + "body": { + "model": "@openai.chat", + "messages": [{"role": "user", "content": "Reply with the word OK."}], + "user": "qa-user-001", + "max_tokens": 16 + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.choices[0].message.content", "not_empty": true}], + "audit": [ + {"path": "$.data.request_body.user", "equals": "qa-user-001", "hard": false} + ] + }, + "notes": "A provider-valid optional field round-trips: request succeeds and the audit confirms the gateway recorded it verbatim. 
(Unknown/unrecognized fields are a separate case — see errors.openai_unknown_field_forwarded.)" + }, + { + "id": "chat.openai.provider_extras_preserved", + "title": "OpenAI chat: provider-specific response extras survive normalization", + "provider": "openai", + "modality": ["preservation"], + "request": { + "path": "/v1/chat/completions", + "body": { + "model": "@openai.chat", + "messages": [{"role": "user", "content": "Say hi."}], + "max_tokens": 16 + } + }, + "expect": { + "status": 200, + "response": [ + {"path": "$.id", "not_empty": true}, + {"path": "$.created", "gt": 0}, + {"path": "$.model", "not_empty": true}, + {"path": "$.system_fingerprint", "present": true, "hard": false}, + {"path": "$.service_tier", "present": true, "hard": false} + ] + }, + "notes": "Normalization should preserve provider extras (system_fingerprint/service_tier) rather than strip to a minimal schema." + }, + { + "id": "chat.anthropic.basic", + "title": "Anthropic via chat/completions: OpenAI-shaped response", + "provider": "anthropic", + "modality": ["text"], + "request": { + "path": "/v1/chat/completions", + "body": { + "model": "@anthropic.chat", + "messages": [ + {"role": "system", "content": "Be concise."}, + {"role": "user", "content": "Name the capital of Japan in one word."} + ], + "max_tokens": 64 + } + }, + "expect": { + "status": 200, + "response": [ + {"path": "$.object", "equals": "chat.completion"}, + {"path": "$.choices[0].message.content", "not_empty": true}, + {"path": "$.usage.completion_tokens", "gt": 0} + ], + "audit": [{"path": "$.provider", "equals": "anthropic"}], + "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["tokyo"]}] + }, + "notes": "Claude served through the OpenAI chat dialect; Anthropic input/output token usage normalized to prompt/completion." 
+ }, + { + "id": "chat.anthropic.stream", + "title": "Anthropic via chat/completions: streaming normalized to [DONE]", + "provider": "anthropic", + "modality": ["text", "streaming"], + "request": { + "path": "/v1/chat/completions", + "stream": true, + "body": { + "model": "@anthropic.chat", + "messages": [{"role": "user", "content": "Count from 1 to 3."}], + "stream": true, + "max_tokens": 64 + } + }, + "expect": { + "status": 200, + "stream": {"min_events": 2, "terminal": "[DONE]", "text": [{"not_empty": true}]} + }, + "notes": "Anthropic SSE converted into OpenAI chat-stream framing with a [DONE] terminator." + }, + { + "id": "chat.anthropic.vision", + "title": "Anthropic via chat/completions: vision image_url", + "provider": "anthropic", + "modality": ["vision"], + "request": { + "path": "/v1/chat/completions", + "body": { + "model": "@anthropic.vision", + "messages": [{"role": "user", "content": [ + {"type": "text", "text": "One word: what is the dominant color?"}, + {"type": "image_url", "image_url": {"url": "@image.blue"}} + ]}], + "max_tokens": 32 + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.choices[0].message.content", "not_empty": true}], + "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["blue"]}] + }, + "notes": "image_url data URL mapped to an Anthropic base64 image block." + }, + { + "id": "chat.anthropic.params_fidelity", + "title": "Anthropic via chat/completions: sampling params + stop honored", + "provider": "anthropic", + "modality": ["text"], + "request": { + "path": "/v1/chat/completions", + "body": { + "model": "@anthropic.chat", + "messages": [{"role": "user", "content": "Write the word DONE then stop."}], + "temperature": 0.2, + "top_p": 0.9, + "stop": ["\n\n"], + "max_tokens": 32 + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.choices[0].message.content", "not_empty": true}] + }, + "notes": "temperature/top_p/stop translated to Anthropic equivalents without error." 
+ }, + { + "id": "chat.gemini.basic", + "title": "Gemini via chat/completions: native API path", + "provider": "gemini", + "modality": ["text"], + "request": { + "path": "/v1/chat/completions", + "body": { + "model": "@gemini.chat", + "messages": [ + {"role": "system", "content": "Be concise."}, + {"role": "user", "content": "Capital of Italy in one word?"} + ], + "max_tokens": 64 + } + }, + "expect": { + "status": 200, + "response": [ + {"path": "$.object", "equals": "chat.completion"}, + {"path": "$.choices[0].message.content", "not_empty": true}, + {"path": "$.usage.total_tokens", "gt": 0} + ], + "audit": [{"path": "$.provider", "equals": "gemini"}], + "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["rome"]}] + }, + "notes": "Gemini native contents/parts mapping + usageMetadata normalization (USE_GOOGLE_GEMINI_NATIVE_API default true)." + }, + { + "id": "chat.gemini.stream", + "title": "Gemini via chat/completions: streaming", + "provider": "gemini", + "modality": ["text", "streaming"], + "request": { + "path": "/v1/chat/completions", + "stream": true, + "body": { + "model": "@gemini.chat", + "messages": [{"role": "user", "content": "Name two oceans, comma separated."}], + "stream": true, + "max_tokens": 64 + } + }, + "expect": { + "status": 200, + "stream": {"min_events": 2, "terminal": "[DONE]", "text": [{"not_empty": true}]} + }, + "notes": "Gemini native stream translated to OpenAI chat-stream framing." 
+ }, + { + "id": "chat.gemini.vision", + "title": "Gemini via chat/completions: vision inline image", + "provider": "gemini", + "modality": ["vision"], + "request": { + "path": "/v1/chat/completions", + "body": { + "model": "@gemini.vision", + "messages": [{"role": "user", "content": [ + {"type": "text", "text": "One word: dominant color?"}, + {"type": "image_url", "image_url": {"url": "@image.green"}} + ]}], + "max_tokens": 32 + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.choices[0].message.content", "not_empty": true}], + "quality": [{"target": "response:$.choices[0].message.content", "contains_any": ["green"]}] + }, + "notes": "image_url data URL mapped to Gemini inline_data." + }, + { + "id": "chat.gemini.tools", + "title": "Gemini via chat/completions: function calling", + "provider": "gemini", + "modality": ["tools"], + "request": { + "path": "/v1/chat/completions", + "body": { + "model": "@gemini.chat", + "messages": [{"role": "user", "content": "Use the tool to get the weather in Berlin."}], + "tools": [{"type": "function", "function": { + "name": "get_weather", + "description": "Get weather for a city", + "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]} + }}], + "max_tokens": 128 + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.choices[0].message.tool_calls[0].function.name", "equals": "get_weather", "hard": false}], + "quality": [{"target": "response:$.choices[0].message.tool_calls[0].function.name", "contains": "weather"}] + }, + "notes": "Gemini functionCall mapped to OpenAI tool_calls (soft: model may answer directly)." 
+ } +] diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json new file mode 100644 index 00000000..9c61475f --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json @@ -0,0 +1,99 @@ +[ + { + "id": "conversations.create", + "title": "Conversations: create", + "provider": "openai", + "modality": ["stateful"], + "request": { + "path": "/v1/conversations", + "body": {"metadata": {"qa": "conv-flow"}} + }, + "capture": {"conversation_id": "$.id"}, + "expect": { + "status": 200, + "response": [ + {"path": "$.object", "equals": "conversation"}, + {"path": "$.id", "not_empty": true} + ] + }, + "notes": "Creates a conversation and captures its id for the rest of this flow (cases run in order)." + }, + { + "id": "conversations.get", + "title": "Conversations: retrieve by id", + "provider": "openai", + "modality": ["stateful"], + "request": { + "method": "GET", + "path": "/v1/conversations/${conversation_id}" + }, + "expect": { + "status": 200, + "response": [ + {"path": "$.object", "equals": "conversation"}, + {"path": "$.id", "equals": "${conversation_id}"} + ] + }, + "notes": "Reads back the conversation created above; the returned id must equal the captured ${conversation_id}." 
+ }, + { + "id": "conversations.use_in_responses", + "title": "Conversations: link a Responses call to a conversation", + "provider": "openai", + "modality": ["stateful", "text"], + "request": { + "path": "/v1/responses", + "body": { + "model": "@openai.chat", + "conversation": {"id": "${conversation_id}"}, + "input": "Remember the number 7.", + "max_output_tokens": 32 + } + }, + "expect": { + "status": 200, + "response": [ + {"path": "$.status", "equals": "completed"}, + {"path": "$.conversation.id", "equals": "${conversation_id}"} + ] + }, + "notes": "Responses request bound to a conversation id (stateful threading); the response must carry the same ${conversation_id} it was attached to." + }, + { + "id": "conversations.update", + "title": "Conversations: update metadata", + "provider": "openai", + "modality": ["stateful"], + "request": { + "path": "/v1/conversations/${conversation_id}", + "body": {"metadata": {"qa": "conv-flow", "stage": "updated"}} + }, + "expect": { + "status": 200, + "response": [ + {"path": "$.object", "equals": "conversation"}, + {"path": "$.id", "equals": "${conversation_id}"} + ] + }, + "notes": "Metadata update on an existing conversation; the update must return the same ${conversation_id}." + }, + { + "id": "conversations.delete", + "title": "Conversations: delete", + "provider": "openai", + "modality": ["stateful"], + "request": { + "method": "DELETE", + "path": "/v1/conversations/${conversation_id}" + }, + "expect": { + "status": 200, + "response": [ + {"path": "$.object", "equals": "conversation.deleted"}, + {"path": "$.id", "equals": "${conversation_id}"}, + {"path": "$.deleted", "equals": true} + ] + }, + "notes": "Tears down the conversation created at the start of the flow; the deletion ack must reference the same ${conversation_id}." 
+ } +] diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/embeddings.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/embeddings.json new file mode 100644 index 00000000..a03e7eea --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/embeddings.json @@ -0,0 +1,43 @@ +[ + { + "id": "embeddings.openai.single", + "title": "Embeddings: single string input", + "provider": "openai", + "modality": ["embeddings"], + "request": { + "path": "/v1/embeddings", + "body": {"model": "@openai.embed", "input": "GoModel is an AI gateway."} + }, + "expect": { + "status": 200, + "response": [ + {"path": "$.object", "equals": "list"}, + {"path": "$.data[0].object", "equals": "embedding"}, + {"path": "$.data[0].embedding", "length_gte": 256} + ], + "audit": [{"path": "$.provider", "equals": "openai"}] + }, + "notes": "Embedding vector of expected dimensionality returned in OpenAI list shape." + }, + { + "id": "embeddings.openai.batch", + "title": "Embeddings: batch input array", + "provider": "openai", + "modality": ["embeddings"], + "request": { + "path": "/v1/embeddings", + "body": {"model": "@openai.embed", "input": ["alpha", "beta", "gamma"]} + }, + "expect": { + "status": 200, + "response": [ + {"path": "$.data", "length_gte": 3}, + {"path": "$.data[0].index", "equals": 0}, + {"path": "$.data[1].index", "equals": 1}, + {"path": "$.data[2].index", "equals": 2}, + {"path": "$.data[2].embedding", "length_gte": 256} + ] + }, + "notes": "Batched inputs produce one embedding per item, order preserved." 
+ } +] diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json new file mode 100644 index 00000000..8f1279c6 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json @@ -0,0 +1,82 @@ +[ + { + "id": "errors.unknown_model", + "title": "Errors: unknown model returns a normalized OpenAI-style error", + "provider": "openai", + "modality": ["errors"], + "request": { + "path": "/v1/chat/completions", + "body": {"model": "definitely-not-a-real-model-zzz", "messages": [{"role": "user", "content": "hi"}], "max_tokens": 8} + }, + "expect": { + "status": [400, 404], + "response": [ + {"path": "$.error.message", "not_empty": true}, + {"path": "$.error.type", "present": true, "hard": false} + ] + }, + "notes": "Routing failure surfaces as a clean error envelope, not a 5xx or hang." + }, + { + "id": "errors.anthropic_audio_rejected", + "title": "Errors: unsupported input_audio on Anthropic chat is rejected gracefully", + "provider": "anthropic", + "modality": ["errors", "audio"], + "request": { + "path": "/v1/chat/completions", + "body": { + "model": "@anthropic.chat", + "messages": [{"role": "user", "content": [ + {"type": "text", "text": "Transcribe this."}, + {"type": "input_audio", "input_audio": {"data": "AAAA", "format": "mp3"}} + ]}], + "max_tokens": 32 + } + }, + "expect": { + "status": [400, 415, 422], + "response": [{"path": "$.error.message", "not_empty": true}] + }, + "notes": "Anthropic chat does not support input_audio; the gateway must reject with a 4xx invalid-request error rather than crash or forward garbage. A non-4xx here is itself the finding." 
+ }, + { + "id": "errors.openai_unknown_field_forwarded", + "title": "Behavior: unknown top-level fields are forwarded verbatim (provider rejects)", + "provider": "openai", + "modality": ["errors", "preservation"], + "request": { + "path": "/v1/chat/completions", + "body": { + "model": "@openai.chat", + "messages": [{"role": "user", "content": "hi"}], + "x_qa_marker": "keep-123", + "max_tokens": 8 + } + }, + "expect": { + "status": [400], + "response": [ + {"path": "$.error.message", "not_empty": true}, + {"path": "$.error.message", "contains": "x_qa_marker", "hard": false}, + {"path": "$.error.type", "equals": "invalid_request_error", "hard": false} + ], + "audit": [{"path": "$.data.request_body.x_qa_marker", "equals": "keep-123", "hard": false}] + }, + "notes": "Documented finding (2026-06): GoModel does not strip unrecognized top-level fields; it forwards them, so a strict provider (OpenAI) returns 400 'Unrecognized request argument'. The audit confirms the field was captured inbound and passed through. If the gateway later sanitizes unknown fields, change expect.status to 200." + }, + { + "id": "errors.malformed_json", + "title": "Errors: malformed JSON body returns 400", + "provider": "openai", + "modality": ["errors"], + "request": { + "path": "/v1/chat/completions", + "raw_body": "{\"model\": \"@openai.chat\", \"messages\": [ " + }, + "expect": { + "status": [400], + "response": [{"path": "$.error.message", "not_empty": true}] + }, + "notes": "Truncated JSON must yield a 400 with a clear error message (raw_body is sent verbatim)." 
+ } +] diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/messages.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/messages.json new file mode 100644 index 00000000..16bac806 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/messages.json @@ -0,0 +1,216 @@ +[ + { + "id": "messages.anthropic.basic", + "title": "Messages: native Anthropic shape", + "provider": "anthropic", + "modality": ["text"], + "request": { + "path": "/v1/messages", + "body": { + "model": "@anthropic.chat", + "max_tokens": 64, + "messages": [{"role": "user", "content": "Capital of Canada? One word."}] + } + }, + "expect": { + "status": 200, + "response": [ + {"path": "$.type", "equals": "message"}, + {"path": "$.role", "equals": "assistant"}, + {"path": "$.content[0].text", "not_empty": true}, + {"path": "$.stop_reason", "present": true}, + {"path": "$.usage.output_tokens", "gt": 0} + ], + "audit": [{"path": "$.provider", "equals": "anthropic"}], + "quality": [{"target": "response:$.content[0].text", "contains_any": ["ottawa"]}] + }, + "notes": "Anthropic-native response: type=message, content blocks, input/output_tokens, stop_reason." + }, + { + "id": "messages.anthropic.system", + "title": "Messages: top-level system prompt", + "provider": "anthropic", + "modality": ["text"], + "request": { + "path": "/v1/messages", + "body": { + "model": "@anthropic.chat", + "max_tokens": 64, + "system": "You always answer with a single word.", + "messages": [{"role": "user", "content": "Largest mammal?"}] + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.content[0].text", "not_empty": true}], + "quality": [{"target": "response:$.content[0].text", "contains_any": ["whale"]}] + }, + "notes": "Anthropic system is a top-level field, not a message role." 
+ }, + { + "id": "messages.anthropic.stream", + "title": "Messages: streaming SSE ends with message_stop", + "provider": "anthropic", + "modality": ["text", "streaming"], + "request": { + "path": "/v1/messages", + "stream": true, + "body": { + "model": "@anthropic.chat", + "max_tokens": 64, + "stream": true, + "messages": [{"role": "user", "content": "Count from 1 to 3."}] + } + }, + "expect": { + "status": 200, + "stream": {"min_events": 3, "terminal": "message_stop", "event_types": ["message_start", "content_block_delta"], "text": [{"not_empty": true}]} + }, + "notes": "Native Anthropic event protocol relayed (message_start -> content_block_delta -> message_stop)." + }, + { + "id": "messages.anthropic.vision", + "title": "Messages: image content block", + "provider": "anthropic", + "modality": ["vision"], + "request": { + "path": "/v1/messages", + "body": { + "model": "@anthropic.vision", + "max_tokens": 32, + "messages": [{"role": "user", "content": [ + {"type": "text", "text": "One word: dominant color?"}, + {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "@imageb64.red"}} + ]}] + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.content[0].text", "not_empty": true}], + "quality": [{"target": "response:$.content[0].text", "contains_any": ["red"]}] + }, + "notes": "Native Anthropic base64 image source (raw base64, media_type separate)." + }, + { + "id": "messages.anthropic.tools_auto", + "title": "Messages: tool definition, tool_choice auto", + "provider": "anthropic", + "modality": ["tools"], + "request": { + "path": "/v1/messages", + "body": { + "model": "@anthropic.chat", + "max_tokens": 256, + "tools": [{"name": "get_weather", "description": "weather for a city", "input_schema": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}], + "tool_choice": {"type": "auto"}, + "messages": [{"role": "user", "content": "What's the weather in Paris? 
Use the tool."}] + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.stop_reason", "present": true}], + "quality": [{"target": "response:$.stop_reason", "contains_any": ["tool_use"]}] + }, + "notes": "Native Anthropic tool schema (input_schema) + tool_choice object." + }, + { + "id": "messages.anthropic.tools_required", + "title": "Messages: tool_choice any forces a tool call", + "provider": "anthropic", + "modality": ["tools"], + "request": { + "path": "/v1/messages", + "body": { + "model": "@anthropic.chat", + "max_tokens": 256, + "tools": [{"name": "get_time", "description": "current time in a city", "input_schema": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}], + "tool_choice": {"type": "any"}, + "messages": [{"role": "user", "content": "What time is it in Tokyo?"}] + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.stop_reason", "equals": "tool_use", "hard": false}], + "quality": [{"target": "response:$.stop_reason", "contains_any": ["tool_use"]}] + }, + "notes": "tool_choice=any should force a tool_use stop_reason." + }, + { + "id": "messages.anthropic.thinking", + "title": "Messages: extended thinking enabled", + "provider": "anthropic", + "modality": ["reasoning"], + "request": { + "path": "/v1/messages", + "body": { + "model": "@anthropic.thinking", + "max_tokens": 4000, + "thinking": {"type": "enabled", "budget_tokens": 1024}, + "messages": [{"role": "user", "content": "If a train travels 60 km in 45 minutes, what is its speed in km/h? Show the number."}] + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.content[0].type", "present": true}], + "quality": [{"target": "response:$.stop_reason", "contains_any": ["end_turn", "stop"]}] + }, + "notes": "Extended-thinking request must be accepted (adaptive vs budget_tokens handled by the gateway per model)." 
+ }, + { + "id": "messages.anthropic.default_max_tokens", + "title": "Messages: missing max_tokens is injected by the gateway", + "provider": "anthropic", + "modality": ["text"], + "request": { + "path": "/v1/messages", + "body": { + "model": "@anthropic.chat", + "messages": [{"role": "user", "content": "Say hi in one word."}] + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.content[0].text", "not_empty": true}] + }, + "notes": "Anthropic requires max_tokens; the gateway injects a default so a user request without it still succeeds (good defaults)." + }, + { + "id": "messages.anthropic.count_tokens", + "title": "Messages: count_tokens", + "provider": "anthropic", + "modality": ["text"], + "request": { + "path": "/v1/messages/count_tokens", + "body": { + "model": "@anthropic.chat", + "messages": [{"role": "user", "content": "How many tokens is this sentence, roughly?"}] + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.input_tokens", "gt": 0}] + }, + "notes": "Token-counting endpoint returns input_tokens without a provider completion call." + }, + { + "id": "messages.anthropic.metadata_preserved", + "title": "Messages: metadata.user_id (valid Anthropic field) is preserved", + "provider": "anthropic", + "modality": ["preservation"], + "request": { + "path": "/v1/messages", + "body": { + "model": "@anthropic.chat", + "max_tokens": 16, + "metadata": {"user_id": "qa-789"}, + "messages": [{"role": "user", "content": "Say OK."}] + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.content[0].text", "not_empty": true}], + "audit": [{"path": "$.data.request_body.metadata.user_id", "equals": "qa-789", "hard": false}] + }, + "notes": "metadata is a first-class Anthropic field; audit confirms the gateway recorded it as sent." 
+ } +] diff --git a/docs/2026-06-25_aws_gateway_benchmark/qa/spec/responses.json b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/responses.json new file mode 100644 index 00000000..d1ab7f1a --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/qa/spec/responses.json @@ -0,0 +1,198 @@ +[ + { + "id": "responses.openai.basic_string", + "title": "Responses: plain string input", + "provider": "openai", + "modality": ["text"], + "request": { + "path": "/v1/responses", + "body": {"model": "@openai.chat", "input": "What is the capital of France? One word.", "max_output_tokens": 64} + }, + "expect": { + "status": 200, + "response": [ + {"path": "$.object", "equals": "response"}, + {"path": "$.status", "equals": "completed"}, + {"path": "$.output[0].content[0].text", "not_empty": true} + ], + "audit": [{"path": "$.provider", "equals": "openai"}], + "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["paris"]}] + }, + "notes": "Native Responses shape: output[].content[].text, status=completed." + }, + { + "id": "responses.openai.instructions", + "title": "Responses: instructions + string input", + "provider": "openai", + "modality": ["text"], + "request": { + "path": "/v1/responses", + "body": {"model": "@openai.chat", "instructions": "Answer in exactly one word.", "input": "Largest ocean?", "max_output_tokens": 64} + }, + "expect": { + "status": 200, + "response": [ + {"path": "$.status", "equals": "completed"}, + {"path": "$.output[0].content[0].text", "not_empty": true} + ], + "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["pacific"]}] + }, + "notes": "instructions become a system-equivalent prompt." 
+ }, + { + "id": "responses.openai.multimodal_image", + "title": "Responses: multi-part input_text + input_image", + "provider": "openai", + "modality": ["vision"], + "request": { + "path": "/v1/responses", + "body": { + "model": "@openai.vision", + "input": [{"role": "user", "content": [ + {"type": "input_text", "text": "One word: dominant color?"}, + {"type": "input_image", "image_url": "@image.red"} + ]}], + "max_output_tokens": 32 + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.output[0].content[0].text", "not_empty": true}], + "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["red"]}] + }, + "notes": "input_image content part normalized to a chat image part." + }, + { + "id": "responses.openai.stream", + "title": "Responses: streaming output_text deltas + response.completed", + "provider": "openai", + "modality": ["text", "streaming"], + "request": { + "path": "/v1/responses", + "stream": true, + "body": {"model": "@openai.chat", "input": "Count from 1 to 3.", "stream": true, "max_output_tokens": 64} + }, + "expect": { + "status": 200, + "stream": {"min_events": 2, "terminal": "response.completed", "text": [{"not_empty": true}]} + }, + "notes": "Responses SSE event protocol (output_text.delta -> response.completed)." 
+ }, + { + "id": "responses.openai.tools", + "title": "Responses: function tool", + "provider": "openai", + "modality": ["tools"], + "request": { + "path": "/v1/responses", + "body": { + "model": "@openai.chat", + "input": "Use the tool to get the weather in Rome.", + "tools": [{"type": "function", "name": "get_weather", "description": "weather for a city", "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}], + "max_output_tokens": 128 + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.status", "equals": "completed"}], + "quality": [{"target": "response:$.output[0].type", "contains_any": ["function_call", "message"]}] + }, + "notes": "Responses tool schema (flat name/parameters) handled." + }, + { + "id": "responses.openai.structured_text_format", + "title": "Responses: structured output via text.format json_schema", + "provider": "openai", + "modality": ["structured"], + "request": { + "path": "/v1/responses", + "body": { + "model": "@openai.chat", + "input": "Capital of Spain.", + "text": {"format": {"type": "json_schema", "name": "cap", "strict": true, "schema": {"type": "object", "properties": {"capital": {"type": "string"}}, "required": ["capital"], "additionalProperties": false}}}, + "max_output_tokens": 64 + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.output[0].content[0].text", "not_empty": true}], + "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["madrid"]}] + }, + "notes": "text.format maps to chat response_format/json_schema for non-native providers." + }, + { + "id": "responses.openai.reasoning_effort", + "title": "Responses: reasoning model with effort", + "provider": "openai", + "modality": ["reasoning"], + "request": { + "path": "/v1/responses", + "body": {"model": "@openai.reasoning", "input": "What is 6 times 7? 
Number only.", "reasoning": {"effort": "low"}, "max_output_tokens": 2000} + }, + "expect": { + "status": 200, + "response": [{"path": "$.status", "equals": "completed"}], + "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["42"]}] + }, + "notes": "reasoning.effort accepted on the Responses API." + }, + { + "id": "responses.openai.metadata_preserved", + "title": "Responses: metadata (valid optional field) is preserved", + "provider": "openai", + "modality": ["preservation"], + "request": { + "path": "/v1/responses", + "body": {"model": "@openai.chat", "input": "Say OK.", "metadata": {"qa_case": "resp-extra"}, "max_output_tokens": 16} + }, + "expect": { + "status": 200, + "response": [{"path": "$.status", "equals": "completed"}], + "audit": [{"path": "$.data.request_body.metadata.qa_case", "equals": "resp-extra", "hard": false}] + }, + "notes": "metadata is a first-class Responses field; audit confirms the gateway recorded it as sent." + }, + { + "id": "responses.anthropic.basic", + "title": "Responses adapter -> Anthropic", + "provider": "anthropic", + "modality": ["text"], + "request": { + "path": "/v1/responses", + "body": {"model": "@anthropic.chat", "input": "Capital of Germany? One word.", "max_output_tokens": 64} + }, + "expect": { + "status": 200, + "response": [ + {"path": "$.status", "equals": "completed"}, + {"path": "$.output[0].content[0].text", "not_empty": true} + ], + "audit": [{"path": "$.provider", "equals": "anthropic"}], + "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["berlin"]}] + }, + "notes": "Non-native provider served through the Responses->chat adapter and renormalized to Responses shape." 
+ }, + { + "id": "responses.gemini.image", + "title": "Responses adapter -> Gemini with image input", + "provider": "gemini", + "modality": ["vision"], + "request": { + "path": "/v1/responses", + "body": { + "model": "@gemini.vision", + "input": [{"role": "user", "content": [ + {"type": "input_text", "text": "One word: color?"}, + {"type": "input_image", "image_url": "@image.blue"} + ]}], + "max_output_tokens": 32 + } + }, + "expect": { + "status": 200, + "response": [{"path": "$.output[0].content[0].text", "not_empty": true}], + "quality": [{"target": "response:$.output[0].content[0].text", "contains_any": ["blue"]}] + }, + "notes": "Responses multimodal input adapted to Gemini inline_data." + } +] diff --git a/docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover.py b/docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover.py new file mode 100644 index 00000000..182aa0c7 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover.py @@ -0,0 +1,82 @@ +#!/usr/bin/env python3 +"""Generate the catchy dark cover image for the June 2026 gateway benchmark post. + +Thesis-driven: latency is overrated, the resource bill isn't. So the hero visual +is the resource gap (Docker image + peak RAM), GoModel highlighted. +""" +import sys +import matplotlib +matplotlib.use("Agg") +import matplotlib.pyplot as plt +from matplotlib import font_manager as fm + +BG = "#0b0e14" +PANEL = "#11161f" +TEXT = "#e6edf3" +MUTED = "#8b98a9" +GREEN = "#34d399" # GoModel +RED = "#f87171" # LiteLLM +GRAY = "#5b6675" # others + +def font(weight="normal", size=12, black=False): + fam = "Arial Black" if black else "Arial" + return fm.FontProperties(family=fam, weight=weight, size=size) + +# data (June 2026 c7i.large run) - ascending, so GoModel (the winner) sits on top +# and the giant red LiteLLM bar at the bottom. Image = compressed pull size; RAM = +# peak under load (LiteLLM at its recommended one-worker-per-core config). 
+IMG = [("GoModel", 16, GREEN), ("Portkey", 59, GRAY), ("Bifrost", 77, GRAY), ("LiteLLM", 372, RED)] +RAM = [("GoModel", 37, GREEN), ("Portkey", 112, GRAY), ("Bifrost", 143, GRAY), ("LiteLLM", 2272, RED)] + +W, H, DPI = 2400, 1260, 200 +fig = plt.figure(figsize=(W / DPI, H / DPI), dpi=DPI) +fig.patch.set_facecolor(BG) + +# ── left text column (top-anchored so positions are predictable) ─── +T = dict(va="top", ha="left") +fig.text(0.045, 0.93, "AI GATEWAY BENCHMARK · JUNE 25, 2026", color=GREEN, + fontproperties=font(size=14.5, weight="bold"), **T) +fig.text(0.043, 0.84, "LATENCY IS", color=TEXT, fontproperties=font(size=39, black=True), **T) +fig.text(0.043, 0.725, "OVERRATED", color=TEXT, fontproperties=font(size=39, black=True), **T) +fig.text(0.043, 0.585, "LOOK AT THE BILL", color=GREEN, fontproperties=font(size=35, black=True), **T) +fig.add_artist(plt.Line2D([0.045, 0.405], [0.475, 0.475], color="#1f2733", lw=2)) +fig.text(0.045, 0.45, "GoModel — the fastest,\nmost lightweight AI\ngateway in the world", + color=GREEN, fontproperties=font(size=18, weight="bold"), linespacing=1.4, **T) + +def panel(rect, title, rows, unit, ref): + ax = fig.add_axes(rect) + ax.set_facecolor(PANEL) + for s in ax.spines.values(): + s.set_visible(False) + ax.tick_params(left=False, bottom=False, labelbottom=False) + labels = [r[0] for r in rows] + vals = [r[1] for r in rows] + colors = [r[2] for r in rows] + y = range(len(rows)) + maxv = max(vals) + ax.barh(y, vals, color=colors, height=0.62, zorder=3) + ax.set_xlim(0, maxv * 1.34) # headroom so value labels never clip + ax.set_ylim(-0.6, len(rows) - 0.4) + ax.invert_yaxis() + ax.set_yticks(list(y)) + ax.set_yticklabels(labels, color=TEXT, fontproperties=font(size=14, weight="bold")) + for i, v in enumerate(vals): + mult = v / ref + tag = "1×" if abs(mult - 1) < 0.05 else f"{mult:.0f}×" + label = f"{v:,} {unit} ({tag})" + if colors[i] == RED: # the worst: label centered inside the bar, dark text + ax.text(v / 2, i, label, 
va="center", ha="center", color=BG, + fontproperties=font(size=12.5, weight="bold")) + else: + ax.text(v + maxv * 0.02, i, label, va="center", ha="left", + color=TEXT if colors[i] != GRAY else MUTED, + fontproperties=font(size=12.5, weight="bold")) + ax.set_title(title, loc="left", color=MUTED, fontproperties=font(size=14, weight="bold"), pad=8) + return ax + +panel([0.55, 0.575, 0.36, 0.295], "DOCKER IMAGE (COMPRESSED)", IMG, "MB", 16) +panel([0.55, 0.135, 0.36, 0.295], "PEAK RAM UNDER LOAD", RAM, "MB", 37) + +out = sys.argv[1] if len(sys.argv) > 1 else "cover.png" +fig.savefig(out, facecolor=BG, dpi=DPI) +print("wrote", out) diff --git a/docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover_b.py b/docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover_b.py new file mode 100644 index 00000000..7a71dc89 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover_b.py @@ -0,0 +1,78 @@ +#!/usr/bin/env python3 +"""Cover for the measured benchmark post variant (B). + +Same hero visual as make_cover.py (the resource gap: Docker image + peak RAM, +GoModel highlighted). The text is a single takeaway - +"Four gateways, one backend - GoModel wins" - with no cost-question framing. +""" +import sys +import matplotlib +matplotlib.use("Agg") +import matplotlib.pyplot as plt +from matplotlib import font_manager as fm + +BG = "#0b0e14" +PANEL = "#11161f" +TEXT = "#e6edf3" +MUTED = "#8b98a9" +GREEN = "#34d399" # GoModel +RED = "#f87171" # LiteLLM +GRAY = "#5b6675" # others + +def font(weight="normal", size=12, black=False): + fam = "Arial Black" if black else "Arial" + return fm.FontProperties(family=fam, weight=weight, size=size) + +# data (June 2026 c7i.large run) - ascending, GoModel (winner) on top. 
+IMG = [("GoModel", 16, GREEN), ("Portkey", 59, GRAY), ("Bifrost", 77, GRAY), ("LiteLLM", 372, RED)] +RAM = [("GoModel", 37, GREEN), ("Portkey", 112, GRAY), ("Bifrost", 143, GRAY), ("LiteLLM", 2272, RED)] + +W, H, DPI = 2400, 1260, 200 +fig = plt.figure(figsize=(W / DPI, H / DPI), dpi=DPI) +fig.patch.set_facecolor(BG) + +# ── left text column (single takeaway, no cost-question headline) ─── +T = dict(va="top", ha="left") +fig.text(0.045, 0.93, "AI GATEWAY BENCHMARK · JUNE 25, 2026", color=GREEN, + fontproperties=font(size=14.5, weight="bold"), **T) +fig.text(0.043, 0.72, "Four gateways,", color=TEXT, fontproperties=font(size=33, black=True), **T) +fig.text(0.043, 0.60, "one backend —", color=TEXT, fontproperties=font(size=33, black=True), **T) +fig.text(0.043, 0.48, "GoModel wins", color=GREEN, fontproperties=font(size=33, black=True), **T) + +def panel(rect, title, rows, unit, ref): + ax = fig.add_axes(rect) + ax.set_facecolor(PANEL) + for s in ax.spines.values(): + s.set_visible(False) + ax.tick_params(left=False, bottom=False, labelbottom=False) + labels = [r[0] for r in rows] + vals = [r[1] for r in rows] + colors = [r[2] for r in rows] + y = range(len(rows)) + maxv = max(vals) + ax.barh(y, vals, color=colors, height=0.62, zorder=3) + ax.set_xlim(0, maxv * 1.34) + ax.set_ylim(-0.6, len(rows) - 0.4) + ax.invert_yaxis() + ax.set_yticks(list(y)) + ax.set_yticklabels(labels, color=TEXT, fontproperties=font(size=14, weight="bold")) + for i, v in enumerate(vals): + mult = v / ref + tag = "1×" if abs(mult - 1) < 0.05 else f"{mult:.0f}×" + label = f"{v:,} {unit} ({tag})" + if colors[i] == RED: # the worst: label centered inside the bar, dark text + ax.text(v / 2, i, label, va="center", ha="center", color=BG, + fontproperties=font(size=12.5, weight="bold")) + else: + ax.text(v + maxv * 0.02, i, label, va="center", ha="left", + color=TEXT if colors[i] != GRAY else MUTED, + fontproperties=font(size=12.5, weight="bold")) + ax.set_title(title, loc="left", color=MUTED, 
fontproperties=font(size=14, weight="bold"), pad=8) + return ax + +panel([0.55, 0.575, 0.36, 0.295], "DOCKER IMAGE (COMPRESSED)", IMG, "MB", 16) +panel([0.55, 0.135, 0.36, 0.295], "PEAK RAM UNDER LOAD", RAM, "MB", 37) + +out = sys.argv[1] if len(sys.argv) > 1 else "cover-b.png" +fig.savefig(out, facecolor=BG, dpi=DPI) +print("wrote", out) diff --git a/docs/2026-06-25_aws_gateway_benchmark/translation/.gitignore b/docs/2026-06-25_aws_gateway_benchmark/translation/.gitignore new file mode 100644 index 00000000..179b4868 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/translation/.gitignore @@ -0,0 +1,3 @@ +output/ +__pycache__/ +*.pyc diff --git a/docs/2026-06-25_aws_gateway_benchmark/translation/README.md b/docs/2026-06-25_aws_gateway_benchmark/translation/README.md new file mode 100644 index 00000000..82487965 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/translation/README.md @@ -0,0 +1,75 @@ +# Gateway translation-fidelity analysis + +How faithfully does each AI gateway translate a request? This harness sends the +**same** client request through **GoModel, LiteLLM, Portkey, and Bifrost**, all +pointed at the **same recording mock provider**, and captures — per case, per +gateway — four artifacts: + +| artifact | meaning | +|---|---| +| `client_request` | what we sent to the gateway (the **pure** request) | +| `sent_body` | the body after per-gateway rewrites (e.g. Bifrost's `openai/` model prefix) | +| `upstream` | the request the gateway actually sent to the provider (the **translated** request) + the canned (**pure**) response the mock returned | +| `client_response` | what the gateway returned to us (the **translated** response) | + +Then an AI analyzes each case across gateways: what each one added, dropped, +renamed, or reshaped — request *and* response — and which is most faithful. 
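The added/dropped part of that comparison is mechanical enough to sketch in a few lines. The helper below is a hypothetical illustration, not part of `capture.py` or `analyze.py`: `diff_top_level` and both sample bodies are assumptions made up for this example. It only compares top-level JSON keys, so renames and nested reshaping still need the analyst.

```python
def diff_top_level(client_request: dict, upstream_request: dict) -> dict:
    """Compare top-level keys of the pure and the translated request body."""
    client_keys = set(client_request)
    upstream_keys = set(upstream_request)
    return {
        # keys the gateway introduced before calling the provider
        "added": sorted(upstream_keys - client_keys),
        # keys the client sent that never reached the provider
        "dropped": sorted(client_keys - upstream_keys),
        # keys present on both sides whose value was rewritten
        "changed": sorted(
            k for k in client_keys & upstream_keys
            if client_request[k] != upstream_request[k]
        ),
    }


# Illustrative bodies only; these are NOT captured output from any gateway.
pure = {
    "model": "gpt-4o-mini",
    "max_tokens": 8,
    "messages": [{"role": "user", "content": "hi"}],
}
translated = {
    "model": "gpt-4o-mini",
    "max_completion_tokens": 8,
    "messages": [{"role": "user", "content": "hi"}],
    "stream": False,
}

result = diff_top_level(pure, translated)
print(result)
```

A dropped key that sits next to a newly added key carrying the same value (here `max_tokens` and `max_completion_tokens`) is a rename candidate, which is the kind of judgment the AI pass is meant to confirm.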
+ +A recording mock (not real providers) is the only way to observe the translated +*upstream* request: real providers don't echo what the gateway sent them. + +## Why a mock, and what "pure" means + +- **Pure request** = the original client body. **Translated request** = what the + gateway emitted upstream (captured by the mock). +- **Pure response** = the deterministic provider-shaped body the mock returned + (enriched with `system_fingerprint`, `service_tier`, and a non-standard + `x_provider_note` so we can see which gateways preserve provider extras). + **Translated response** = what the gateway returned to the client. +- The comparison axis is **gateway vs gateway** — every case uses the same model + (`gpt-4o-mini`) routed to the mock, so differences are the gateway's doing, not + the provider's. + +## Pieces + +```text +docker-compose.yml mock (MOCK_RECORD=1) + all 4 gateways, reusing ../remote configs +corpus.json 12 gateway-agnostic cases across chat/responses/messages, stream + not +capture.py resets the mock, sends each case through each gateway, records 4 artifacts +analyze.py builds per-case AI-analysis prompts from the captures (one bundle per case) +output/ captures.json + the AI comparison report (gitignored) +``` + +The recording mock lives in `../remote/bench-tools/mock/main.go` (recording is +gated behind `MOCK_RECORD=1`, so the latency benchmark stays byte-identical). + +## Run it + +```bash +# 0. build the GoModel image once (native arch): +docker build -t gomodel-bench:local ../../.. + +# 1. bring up the recording mock + all four gateways: +cd docs/2026-06-25_aws_gateway_benchmark/translation +docker compose --profile all up -d --build + +# 2. capture translations (resets the mock before each call): +python3 capture.py # -> output/captures.json + +# 3. tear down: +docker compose --profile all down +``` + +No real provider keys or spend — every gateway talks to the local mock. 
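Once `output/captures.json` exists, the provider extras described above (`system_fingerprint`, `service_tier`, `x_provider_note`) can also be checked mechanically. A minimal sketch, assuming you have already pulled the pure mock response and one gateway's client response out of the captures as plain dicts. `extras_kept` and the sample values are hypothetical, and nothing here assumes the exact layout of `captures.json`.

```python
# Extras the mock adds to its canned response, per the section above.
PROVIDER_EXTRAS = ("system_fingerprint", "service_tier", "x_provider_note")


def extras_kept(pure_response: dict, client_response: dict) -> dict:
    """Report which top-level provider extras survived the gateway unchanged."""
    kept, dropped = [], []
    for name in PROVIDER_EXTRAS:
        if name not in pure_response:
            continue  # the mock did not send this extra for this case
        if client_response.get(name) == pure_response[name]:
            kept.append(name)
        else:
            dropped.append(name)  # missing or rewritten by the gateway
    return {"kept": kept, "dropped": dropped}


# Illustrative values only; these are NOT results from any gateway.
pure_response = {
    "id": "chatcmpl-mock",
    "system_fingerprint": "fp_mock",
    "service_tier": "default",
    "x_provider_note": "hello from the mock",
}
client_response = {
    "id": "chatcmpl-mock",
    "system_fingerprint": "fp_mock",
}

report = extras_kept(pure_response, client_response)
print(report)
```

This only inspects top-level fields. A gateway that moves an extra into a nested object would show up as dropped here, so treat the output as a first pass rather than a verdict.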
+ +## Per-gateway addressing (handled by capture.py) + +| gateway | port | model | messages path | extra headers | +|---|--|---|---|---| +| GoModel | 18080 | `gpt-4o-mini` | `/v1/messages` | — | +| LiteLLM | 4000 | `gpt-4o-mini` | `/v1/messages` | — | +| Portkey | 8787 | `gpt-4o-mini` | `/v1/messages` | `x-portkey-provider`, `x-portkey-custom-host` | +| Bifrost | 8089 | `openai/gpt-4o-mini` | `/anthropic/v1/messages` | — | + +Dialects a gateway doesn't serve are not skipped — the non-200 (and empty +upstream log) is recorded, because that asymmetry is itself a finding. diff --git a/docs/2026-06-25_aws_gateway_benchmark/translation/analyze.py b/docs/2026-06-25_aws_gateway_benchmark/translation/analyze.py new file mode 100644 index 00000000..9f2f0762 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/translation/analyze.py @@ -0,0 +1,156 @@ +#!/usr/bin/env python3 +"""Glue for the AI translation analysis. + + analyze.py --split read output/captures.json, write one self-contained + bundle per case to output/cases/.json (the input + an AI analyst reviews for that case) + analyze.py --render read output/analysis/.json (the AI's structured + verdict per case) + captures.json, write output/report.md + +The actual case-by-case comparison is done by an AI analyst (one per case): it +reads a bundle and writes its verdict to output/analysis/.json following the +schema documented in --split's banner. Stdlib only. 
+""" +import argparse +import glob +import json +import os + +HERE = os.path.dirname(os.path.abspath(__file__)) +OUT = os.path.join(HERE, "output") +GATEWAYS = ["gomodel", "litellm", "portkey", "bifrost"] + +ANALYSIS_SCHEMA = { + "case_id": "string", + "verdict_per_gateway": { + "": { + "reached_provider": "bool — did the gateway make an upstream call?", + "upstream_path": "the path it called on the mock", + "request_added": ["fields/headers the gateway ADDED vs the client request"], + "request_dropped": ["client fields the gateway DROPPED before upstream"], + "request_renamed": ["client->upstream field renames, e.g. max_tokens->max_completion_tokens"], + "request_reshaped": "prose: structural changes (dialect translation, message shape, tool schema)", + "response_extras_preserved": ["provider extras kept in the client response: system_fingerprint/service_tier/x_provider_note/usage"], + "response_extras_dropped": ["provider extras the gateway stripped"], + "response_reshaped": "prose: how the upstream response was renormalized for the client", + "fidelity_score": "0-100 int: how faithfully intent was preserved end-to-end", + "notes": "anything notable" + } + }, + "cross_gateway_findings": ["concise comparative observations"], + "ranking": ["gateways best->worst fidelity for this case"], +} + + +def split(): + caps = json.load(open(os.path.join(OUT, "captures.json"), encoding="utf-8")) + d = os.path.join(OUT, "cases") + os.makedirs(d, exist_ok=True) + ids = [] + for cid, case in caps["cases"].items(): + bundle = {"case_id": cid, "dialect": case["dialect"], "stream": case["stream"], + "intent_note": case["note"], "client_request": case["client_request"], + "gateways": case["gateways"]} + json.dump(bundle, open(os.path.join(d, f"{cid}.json"), "w", encoding="utf-8"), indent=2) + ids.append(cid) + print(f"wrote {len(ids)} case bundles to {d}") + for cid in ids: + print(" ", cid) + + +def _esc(s): + # AI-authored cell values may contain `|` or newlines that would break 
the + # Markdown table; escape pipes and collapse newlines to spaces. + return str(s).replace("|", "\\|").replace("\r", " ").replace("\n", " ") + + +def _cell(items): + if not items: + return "—" + return _esc("; ".join(str(x) for x in items)[:120]) + + +def render(): + caps = json.load(open(os.path.join(OUT, "captures.json"), encoding="utf-8")) + analyses = {} + for p in glob.glob(os.path.join(OUT, "analysis", "*.json")): + try: + a = json.load(open(p, encoding="utf-8")) + analyses[a.get("case_id", os.path.basename(p)[:-5])] = a + except (OSError, ValueError): + pass + + gws = caps["meta"]["gateways"] + L = ["# Gateway translation-fidelity report\n", + "Same request through each gateway, same mock provider. The AI analyst " + "compared the translated upstream request vs the pure client request, and " + "the translated client response vs the pure mock response, per case.\n", + f"`gateways: {', '.join(gws)}` · `cases: {len(caps['cases'])}`\n"] + + # ── aggregate scoreboard ────────────────────────────────────────────────── + scores = {g: [] for g in gws} + for a in analyses.values(): + for g, v in (a.get("verdict_per_gateway") or {}).items(): + s = v.get("fidelity_score") + if isinstance(s, (int, float)): + scores.setdefault(g, []).append(s) + L.append("## Fidelity scoreboard (mean of per-case AI scores)\n") + L.append("| gateway | mean fidelity | cases scored |") + L.append("|---|--:|--:|") + for g in gws: + vals = scores.get(g, []) + mean = round(sum(vals) / len(vals)) if vals else 0 + L.append(f"| {g} | {mean} | {len(vals)} |") + L.append("") + + # ── per-case detail ──────────────────────────────────────────────────────── + for cid, case in caps["cases"].items(): + a = analyses.get(cid) + L.append(f"## `{cid}` — {case['dialect']}{', stream' if case['stream'] else ''}\n") + L.append(f"_{case['note']}_\n") + if not a: + L.append("> _no AI analysis recorded for this case_\n") + continue + L.append("| gateway | upstream | added | dropped | renamed | resp extras kept 
| resp dropped | fidelity |") + L.append("|---|---|---|---|---|---|---|--:|") + for g in gws: + v = (a.get("verdict_per_gateway") or {}).get(g) + if not v: + L.append(f"| {g} | — | — | — | — | — | — | — |") + continue + L.append(f"| {g} | {_esc(v.get('upstream_path','—'))} | {_cell(v.get('request_added'))} | " + f"{_cell(v.get('request_dropped'))} | {_cell(v.get('request_renamed'))} | " + f"{_cell(v.get('response_extras_preserved'))} | {_cell(v.get('response_extras_dropped'))} | " + f"{_esc(v.get('fidelity_score','—'))} |") + L.append("") + if a.get("cross_gateway_findings"): + L.append("**Findings:**") + for f in a["cross_gateway_findings"]: + L.append(f"- {f}") + L.append("") + if a.get("ranking"): + L.append(f"**Fidelity ranking:** {' > '.join(a['ranking'])}\n") + + path = os.path.join(OUT, "report.md") + open(path, "w", encoding="utf-8").write("\n".join(L)) + print(f"wrote {path}") + + +def main(): + ap = argparse.ArgumentParser() + ap.add_argument("--split", action="store_true") + ap.add_argument("--render", action="store_true") + ap.add_argument("--schema", action="store_true", help="print the analysis JSON schema") + args = ap.parse_args() + if args.schema: + print(json.dumps(ANALYSIS_SCHEMA, indent=2)) + elif args.split: + split() + elif args.render: + render() + else: + ap.error("one of --split / --render / --schema required") + + +if __name__ == "__main__": + main() diff --git a/docs/2026-06-25_aws_gateway_benchmark/translation/capture.py b/docs/2026-06-25_aws_gateway_benchmark/translation/capture.py new file mode 100644 index 00000000..73201219 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/translation/capture.py @@ -0,0 +1,228 @@ +#!/usr/bin/env python3 +"""Capture how each gateway translates the SAME client request to the SAME mock. 
+ +For every (case, gateway) it records four artifacts: + - client_request : what we sent to the gateway (the "pure" request) + - sent_body : the body after per-gateway model rewrite + - upstream : the request(s) the gateway actually sent to the mock + (the TRANSLATED request) + the canned ("pure") response + - client_response : what the gateway returned to us (the TRANSLATED response) + +The mock is reset before each call and requests are sent one at a time, so the +shared recorder attributes each upstream call to the gateway+case that made it. +Stdlib only. Output: output/captures.json. +""" +import argparse +import copy +import json +import os +import sys +import time +import urllib.error +import urllib.request + +HERE = os.path.dirname(os.path.abspath(__file__)) +MOCK = "http://localhost:9999" + +# Per-gateway base URL is env-overridable (e.g. GOMODEL_BASE) so a local dev +# server on a default port doesn't force a clash. +GATEWAYS = { + "gomodel": {"base": os.environ.get("GOMODEL_BASE", "http://localhost:18080")}, + "litellm": {"base": os.environ.get("LITELLM_BASE", "http://localhost:4000")}, + "portkey": {"base": os.environ.get("PORTKEY_BASE", "http://localhost:8787"), + "headers": {"x-portkey-provider": "openai", + "x-portkey-custom-host": "http://mock:9999/v1"}}, + "bifrost": {"base": os.environ.get("BIFROST_BASE", "http://localhost:8089")}, +} +ORDER = ["gomodel", "litellm", "portkey", "bifrost"] +DIALECT_PATH = {"chat": "/v1/chat/completions", "responses": "/v1/responses", + "messages": "/v1/messages"} + + +def model_for(gw, m): + return "openai/" + m if gw == "bifrost" else m + + +def path_for(gw, dialect): + if gw == "bifrost" and dialect == "messages": + return "/anthropic/v1/messages" + return DIALECT_PATH[dialect] + + +def headers_for(gw): + h = {"Content-Type": "application/json", "Authorization": "Bearer sk-bench-test-key", + "anthropic-version": "2023-06-01"} + h.update(GATEWAYS[gw].get("headers", {})) + return h + + +# ── HTTP 
───────────────────────────────────────────────────────────────────── +def post(url, headers, body, stream, timeout=30): + data = json.dumps(body).encode("utf-8") + req = urllib.request.Request(url, data=data, method="POST", headers=headers) + out = {"status": 0, "content_type": "", "json": None, "text": None, + "stream_events": 0, "stream_text": "", "terminal": None, "error": None} + try: + resp = urllib.request.urlopen(req, timeout=timeout) + _capture(out, resp, stream) + except urllib.error.HTTPError as e: + out["status"] = e.code + _capture(out, e, stream=False) + except Exception as e: # noqa: BLE001 + out["error"] = f"{type(e).__name__}: {e}" + return out + + +def _capture(out, resp, stream): + out["status"] = getattr(resp, "status", out["status"]) or out["status"] + try: + out["content_type"] = resp.headers.get("content-type", "") + except Exception: # noqa: BLE001 + pass + if stream and "text/event-stream" in out["content_type"]: + for rawline in resp: + line = rawline.decode("utf-8", "replace").strip() + if not line.startswith("data:"): + continue + payload = line[5:].strip() + if payload == "[DONE]": + out["terminal"] = "[DONE]" + continue + out["stream_events"] += 1 + try: + ev = json.loads(payload) + except Exception: # noqa: BLE001 + continue + t = ev.get("type") + if t in ("response.completed", "message_stop"): + out["terminal"] = t + for ch in ev.get("choices", []) or []: + d = (ch.get("delta") or {}).get("content") + if isinstance(d, str): + out["stream_text"] += d + if t == "response.output_text.delta" and isinstance(ev.get("delta"), str): + out["stream_text"] += ev["delta"] + if t == "content_block_delta": + td = (ev.get("delta") or {}).get("text") + if isinstance(td, str): + out["stream_text"] += td + return + raw = resp.read() + if "application/json" in out["content_type"]: + try: + out["json"] = json.loads(raw.decode("utf-8")) + except Exception: # noqa: BLE001 + out["text"] = raw.decode("utf-8", "replace") + else: + out["text"] = 
raw.decode("utf-8", "replace")[:4000] + + +def get_json(url, timeout=10): + try: + resp = urllib.request.urlopen(urllib.request.Request(url, method="GET"), timeout=timeout) + return json.loads(resp.read().decode("utf-8")) + except Exception: # noqa: BLE001 + return None + + +def mock_reset(): + # Fail fast: a silently failed reset would attribute stale upstream calls to + # the wrong gateway/case and corrupt the captured corpus. + try: + resp = urllib.request.urlopen( + urllib.request.Request(MOCK + "/__reset", data=b"", method="POST"), timeout=5) + status = getattr(resp, "status", 200) or 200 + resp.read() + except Exception as e: # noqa: BLE001 + sys.exit(f"mock reset failed ({MOCK}/__reset): {e} — aborting to avoid a corrupt corpus") + if status >= 400: + sys.exit(f"mock reset returned HTTP {status} ({MOCK}/__reset) — aborting to avoid a corrupt corpus") + + +def wait_ready(gw, tries=60): + url = GATEWAYS[gw]["base"] + "/v1/chat/completions" + body = {"model": model_for(gw, "gpt-4o-mini"), + "messages": [{"role": "user", "content": "ping"}]} + for _ in range(tries): + r = post(url, headers_for(gw), body, stream=False, timeout=8) + if r["status"] == 200: + return True + time.sleep(2) + return False + + +# ── trimming (keep artifacts readable) ──────────────────────────────────────── +def trim(obj, limit=1500): + if isinstance(obj, str): + return obj if len(obj) <= limit else obj[:limit] + f"…(+{len(obj) - limit})" + if isinstance(obj, list): + return [trim(x, limit) for x in obj] + if isinstance(obj, dict): + return {k: trim(v, limit) for k, v in obj.items()} + return obj + + +def main(): + ap = argparse.ArgumentParser() + ap.add_argument("--corpus", default=os.path.join(HERE, "corpus.json")) + ap.add_argument("--out", default=os.path.join(HERE, "output", "captures.json")) + ap.add_argument("--gateways", default=",".join(ORDER)) + args = ap.parse_args() + + gateways = [g.strip() for g in args.gateways.split(",") if g.strip()] + unknown = [g for g in gateways if g 
not in GATEWAYS] + if unknown: + ap.error(f"unknown gateway(s): {', '.join(unknown)}; valid options: {', '.join(ORDER)}") + corpus = json.load(open(args.corpus, encoding="utf-8")) + + if get_json(MOCK + "/__log") is None: + print(f"mock not reachable at {MOCK} (is the stack up? is MOCK_RECORD=1?)", file=sys.stderr) + return 2 + + print("waiting for gateways…") + ready = {} + for gw in gateways: + ready[gw] = wait_ready(gw) + print(f" {gw:9} {'ready' if ready[gw] else 'NOT READY (will still attempt)'}") + + results = {"meta": {"gateways": gateways, "ready": ready}, "cases": {}} + for case in corpus: + cid, dialect, stream = case["id"], case["dialect"], case.get("stream", False) + entry = {"note": case.get("note", ""), "dialect": dialect, "stream": stream, + "client_request": case["body"], "gateways": {}} + print(f"\n{cid} ({dialect}{', stream' if stream else ''})") + for gw in gateways: + body = copy.deepcopy(case["body"]) + body["model"] = model_for(gw, body["model"]) + url = GATEWAYS[gw]["base"] + path_for(gw, dialect) + mock_reset() + resp = post(url, headers_for(gw), body, stream) + log = get_json(MOCK + "/__log") or {} + ups = log.get("entries") or [] # mock returns null when no upstream call was made + up_paths = ",".join(sorted({e.get("path", "?") for e in ups})) or "—" + print(f" {gw:9} http={resp['status'] or resp['error']:>4} " + f"upstream={len(ups)} [{up_paths}]") + entry["gateways"][gw] = { + "sent_body": trim(body), + "url": url, + "client_response": { + "status": resp["status"], "content_type": resp["content_type"], + "error": resp["error"], + "json": trim(resp["json"]) if resp["json"] is not None else None, + "text": resp["text"], + "stream_events": resp["stream_events"], + "stream_text": trim(resp["stream_text"]) if resp["stream_text"] else "", + "terminal": resp["terminal"], + }, + "upstream": trim(ups), + } + results["cases"][cid] = entry + + os.makedirs(os.path.dirname(args.out), exist_ok=True) + json.dump(results, open(args.out, "w", 
encoding="utf-8"), indent=2) + print(f"\nwrote {args.out}") + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/docs/2026-06-25_aws_gateway_benchmark/translation/corpus.json b/docs/2026-06-25_aws_gateway_benchmark/translation/corpus.json new file mode 100644 index 00000000..24c6fd1b --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/translation/corpus.json @@ -0,0 +1,153 @@ +[ + { + "id": "chat.simple", + "dialect": "chat", + "stream": false, + "note": "Baseline: does the body pass through unchanged? what auth/headers are injected upstream?", + "body": { + "model": "gpt-4o-mini", + "messages": [{"role": "user", "content": "What is the capital of France?"}] + } + }, + { + "id": "chat.stream", + "dialect": "chat", + "stream": true, + "note": "Streaming framing: chunk shape, terminal marker, whether stream_options is forwarded.", + "body": { + "model": "gpt-4o-mini", + "messages": [{"role": "user", "content": "Count to three."}], + "stream": true, + "stream_options": {"include_usage": true} + } + }, + { + "id": "chat.multiturn_system", + "dialect": "chat", + "stream": false, + "note": "System role + multi-turn: is the system message preserved in place and message order kept?", + "body": { + "model": "gpt-4o-mini", + "messages": [ + {"role": "system", "content": "You are a terse assistant."}, + {"role": "user", "content": "Largest planet?"}, + {"role": "assistant", "content": "Jupiter."}, + {"role": "user", "content": "Smallest?"} + ] + } + }, + { + "id": "chat.params", + "dialect": "chat", + "stream": false, + "note": "Sampling params fidelity: which of these survive verbatim upstream (temperature/top_p/penalties/stop/seed/max_tokens)?", + "body": { + "model": "gpt-4o-mini", + "messages": [{"role": "user", "content": "Say ok."}], + "temperature": 0.3, + "top_p": 0.8, + "frequency_penalty": 0.5, + "presence_penalty": 0.2, + "stop": ["\n\n"], + "seed": 42, + "max_tokens": 64 + } + }, + { + "id": "chat.extra_fields", + "dialect": "chat", + 
"stream": false, + "note": "KEY: unknown/extra fields. Which gateways forward them verbatim vs strip them (e.g. LiteLLM drop_params)?", + "body": { + "model": "gpt-4o-mini", + "messages": [{"role": "user", "content": "Say ok."}], + "metadata": {"qa_case": "extra-fields"}, + "x_qa_marker": "keep-123", + "user": "qa-user-1" + } + }, + { + "id": "chat.tools", + "dialect": "chat", + "stream": false, + "note": "Tool/function definitions and tool_choice: forwarded faithfully?", + "body": { + "model": "gpt-4o-mini", + "messages": [{"role": "user", "content": "Weather in Paris?"}], + "tools": [{"type": "function", "function": { + "name": "get_weather", + "description": "Get weather for a city", + "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]} + }}], + "tool_choice": "auto" + } + }, + { + "id": "chat.response_format", + "dialect": "chat", + "stream": false, + "note": "Structured-output directive: is response_format forwarded?", + "body": { + "model": "gpt-4o-mini", + "messages": [{"role": "user", "content": "Return JSON with capital of Spain."}], + "response_format": {"type": "json_object"} + } + }, + { + "id": "chat.vision", + "dialect": "chat", + "stream": false, + "note": "Multimodal content parts: how is an image_url part forwarded upstream?", + "body": { + "model": "gpt-4o-mini", + "messages": [{"role": "user", "content": [ + {"type": "text", "text": "What color?"}, + {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mP8z8BQDwAEhQGAhKmMIQAAAABJRU5ErkJggg=="}} + ]}] + } + }, + { + "id": "responses.simple", + "dialect": "responses", + "stream": false, + "note": "Responses API: how is `input` translated for each gateway's upstream provider call?", + "body": { + "model": "gpt-4o-mini", + "input": "What is the capital of France?" 
+ } + }, + { + "id": "responses.stream", + "dialect": "responses", + "stream": true, + "note": "Responses streaming: event protocol the gateway returns to the client.", + "body": { + "model": "gpt-4o-mini", + "input": "Count to three.", + "stream": true + } + }, + { + "id": "messages.simple", + "dialect": "messages", + "stream": false, + "note": "Anthropic Messages in: what upstream dialect/path does each gateway emit (native messages vs translated chat)?", + "body": { + "model": "gpt-4o-mini", + "max_tokens": 64, + "messages": [{"role": "user", "content": "What is the capital of France?"}] + } + }, + { + "id": "messages.stream", + "dialect": "messages", + "stream": true, + "note": "Anthropic Messages streaming translation.", + "body": { + "model": "gpt-4o-mini", + "max_tokens": 64, + "stream": true, + "messages": [{"role": "user", "content": "Count to three."}] + } + } +] diff --git a/docs/2026-06-25_aws_gateway_benchmark/translation/docker-compose.yml b/docs/2026-06-25_aws_gateway_benchmark/translation/docker-compose.yml new file mode 100644 index 00000000..57a9a070 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/translation/docker-compose.yml @@ -0,0 +1,84 @@ +# Translation-fidelity topology: all four gateways at once, every one pointed at +# a single RECORDING mock backend (MOCK_RECORD=1). Because the capture runner +# sends one request at a time and resets the mock before each, the shared mock +# cleanly attributes each upstream call to the gateway+case that produced it. +# +# docker compose --profile all up -d # mock + gomodel + litellm + portkey + bifrost +# +# Gateways are on different host ports so they can run simultaneously. Configs +# and the bench-tools build context are reused from ../remote. 
+ +networks: + default: + name: xlatenet + +services: + mock: + build: ../remote/bench-tools + command: ["/mock"] + environment: + - MOCK_PORT=9999 + - MOCK_RECORD=1 + ports: + - "9999:9999" + restart: "no" + + gomodel: + profiles: ["all", "gomodel"] + image: ${GOMODEL_IMAGE:-gomodel-bench:local} + depends_on: [mock] + ports: + # Host 18080 to avoid clashing with a local dev gomodel on 8080. + - "${GOMODEL_HOST_PORT:-18080}:8080" + environment: + - PORT=8080 + - GOMODEL_MASTER_KEY= + - OPENAI_API_KEY=sk-bench-test-key + - OPENAI_BASE_URL=http://mock:9999/v1 + - LOGGING_ENABLED=false + - USAGE_ENABLED=false + - METRICS_ENABLED=false + - SWAGGER_ENABLED=false + - PPROF_ENABLED=false + - ENABLE_PASSTHROUGH_ROUTES=false + - STORAGE_TYPE=sqlite + - SQLITE_PATH=/app/data/gomodel-xlate.db + - GOMODEL_CACHE_DIR=/app/.cache + restart: "no" + + litellm: + profiles: ["all", "litellm"] + # Pinned by digest for a reproducible comparison (override via LITELLM_IMAGE). + image: ${LITELLM_IMAGE:-ghcr.io/berriai/litellm:main-stable@sha256:afdc3cc37493d4f86d485ad7ac4445e7154c568a8d47c01bad15c9cf062c66b5} + depends_on: [mock] + ports: + - "4000:4000" + volumes: + - ../remote/configs/litellm-config.yaml:/app/config.yaml:ro + command: ["--config", "/app/config.yaml", "--port", "4000", "--num_workers", "1"] + restart: "no" + + portkey: + profiles: ["all", "portkey"] + # Pinned by digest for a reproducible comparison (override via PORTKEY_IMAGE). + image: ${PORTKEY_IMAGE:-portkeyai/gateway:latest@sha256:97f094d9c8a764cbfaa2a7138c0017b247ca923bb06db1b4c13b7f8a33b5200d} + depends_on: [mock] + ports: + - "8787:8787" + environment: + - TRUSTED_CUSTOM_HOSTS=mock + restart: "no" + + bifrost: + profiles: ["all", "bifrost"] + # Pinned by digest for a reproducible comparison (override via BIFROST_IMAGE). 
+ image: ${BIFROST_IMAGE:-maximhq/bifrost:latest@sha256:6f20c020cd326199c050e6b15ba18131a6f7ac8627a9a4276750f83e92af2253} + depends_on: [mock] + ports: + - "8089:8089" + environment: + - APP_PORT=8089 + - APP_HOST=0.0.0.0 + volumes: + - ../remote/configs/bifrost-config.json:/app/data/config.json:ro + restart: "no"
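
Once the stack above is up and `capture.py` has written `output/captures.json`, a quick way to sanity-check the corpus before running the AI analysis is to list which upstream paths each gateway hit per case. The following is a minimal sketch, not part of this PR. It assumes the layout `capture.py` writes (`cases` → case id → `gateways` → gateway name → `upstream`, a list of recorded entries that each carry a `path` key). The helper name `upstream_paths` and the sample values are hypothetical and exist only for illustration.

```python
import json


def upstream_paths(captures):
    """Map each case id to {gateway: sorted unique upstream paths}."""
    summary = {}
    for cid, case in captures.get("cases", {}).items():
        per_gateway = {}
        for gw, rec in case.get("gateways", {}).items():
            # `upstream` holds the entries the recording mock logged for this
            # call; treat a missing or null value as "no upstream call".
            entries = rec.get("upstream") or []
            per_gateway[gw] = sorted({e.get("path", "?") for e in entries})
        summary[cid] = per_gateway
    return summary


if __name__ == "__main__":
    # In-memory sample shaped like output/captures.json. Gateway names and
    # paths are invented placeholders, not measured results.
    sample = {
        "cases": {
            "example.case": {
                "gateways": {
                    "gw_a": {"upstream": [{"path": "/v1/chat/completions"}]},
                    "gw_b": {"upstream": []},
                }
            }
        }
    }
    print(json.dumps(upstream_paths(sample), indent=2))
```

To run it against a real capture, load the file with `json.load` and pass the resulting dict in place of `sample`. An empty list for a gateway means the mock recorded no upstream request for that case, which is worth checking before interpreting any fidelity score.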