Conversation
Trusted Server RSL-compliant AI crawler detection and licensing enforcement, MVP-ready. Six-signal classification (UA, IP, JA4, ASN, H2, robots/license.xml correlation), permissive-by-default with strict override, public license.toml + private license.private.toml split, standards-compliant 402/403 responses, debug endpoints, structured logging. Targets publishers already running TS. Phase 2 adds OLP license server for programmatic token issuance. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
aram356
left a comment
Summary
Design spec PR — single 1055-line markdown file in docs/superpowers/specs/, no code changes. The proposed RSL-compliant AI crawler licensing layer is well-scoped and the public/private config split is sound, but the spec misrepresents the current state of TS infrastructure and contains two architectural assumptions that don't hold on Fastly Compute. Requesting changes on those before merge.
Blocking
🔧 wrench
- §3.5 "Existing capability" table is materially inaccurate — JA4 signal, bot gate,
/robots.txthandling, and/_ts/debug/*auth pattern are all listed as existing but do not exist incrates/. Reframe as new infrastructure required. - §7.2 / §7.3 in-process ring buffer assumes long-lived process state — Fastly Compute WASM instances are short-lived per-request; the debug endpoints can't aggregate without KV/Config Store or external log-stream aggregation. Pick one and document the trade-off.
❓ question
- §4.1 / §3.5 — How does the WASM instance obtain JA4? Fastly Compute does not expose ClientHello bytes today. Without a concrete acquisition path the entire stealth-detection branch is unimplementable.
- §6.1 / §8.5 — `Link: rel="license"` on every response, including ad/RTB/integration responses? §8.5 says integrations are unaffected, but every response gaining a header is a change. Suggest scoping to HTML responses.
Non-blocking
♻️ refactor
- §3.7 module path — should be `crates/trusted-server-core/src/integrations/rsl/`, matching every other integration in the project.
- §3.4 IP allowlist lookup structure unspecified — a naive Vec scan over thousands of CIDRs would dominate hot-path latency; specify a radix/trie structure.
- §6.6 rendered XML drops `contact_url` — `license.toml` defines it but the example only renders `contactEmail`.
- §5.5 usage vocabulary missing `all` — RSL 1.0 defines `all`, `ai-all`, `ai-train`, `ai-input`, `ai-index`, `search`.
🤔 thinking
- §3.8 / §8.1 "no Fastly-specific dependencies in core" overstates current reality — `crates/trusted-server-core/Cargo.toml` already has `fastly` as a non-optional dep; PR #581/#609 are the in-progress abstraction work.
- §4.1 ASN database not in the §3.9 binary-size budget — MaxMind GeoLite2-ASN is ~10 MB; reconcile with the <100 KB budget.
- §4.7 mentions 401 but §6.7 matrix doesn't — drop or describe when 401 fires.
- §4.3 IP-allowlist refresh cadence couples to TS release train — staleness window or KV-based refresh path worth acknowledging.
- §6.6 RSL `max-age` is in days, HTTP `Cache-Control: max-age` in seconds — note the unit difference.
🌱 seedling
- §4.2 `purpose` likely belongs on bot identity, not request classification.
📌 out of scope
- §4.3 control-plane refresh job referenced but not designed — should appear in §2.2 if deferred.
📝 note
- IP-list URLs (`openai.com/{gptbot,searchbot,chatgpt-user}.json`) verified live (200 OK).
⛏ nitpick
- format-docs CI failure is a trivial prettier whitespace diff (asterisk italics → underscore italics, table column padding); fix with `cd docs && npx prettier --write superpowers/specs/2026-04-22-rsl-ai-crawler-licensing-design.md`.
- "Trusted Server" / "TS" used interchangeably mid-paragraph; pick one.
CI Status
- format-docs: FAIL (one-command prettier fix)
- cargo fmt: PASS
- cargo clippy: PASS
- cargo test: PASS
- vitest: PASS
- browser/integration tests: PASS
- CodeQL: PASS
| `/_ts/debug/*` auth pattern | Debug endpoints reuse existing token auth |
| Structured logging (`log-fastly`) | Classification events emitted as structured log lines |
| Settings (`trusted-server.toml`) | RSL config block added to existing settings parser |
🔧 wrench — "Existing capability" table is materially inaccurate.
Four of six items in this column do not exist in the codebase today (verified by searching crates/):
- JA4 signal from edge TLS — no JA4 code, no `client_hello` access, no TLS-fingerprint plumbing anywhere
- Bot gate (H2 + JA4) — no bot gate exists
- `/robots.txt` handling — no robots.txt handler in `crates/trusted-server-core/`
- `/_ts/debug/*` auth pattern — no such route family or token-auth pattern exists
A reader walks away believing the implementation reuses four existing systems. It actually builds them all from scratch — a materially different effort estimate.
Fix: split the table into two columns:
- Existing capability — `IntegrationRegistration` builder, Settings (`trusted-server.toml`), structured logging
- New infrastructure required — JA4 acquisition path, bot gate, `/robots.txt` handler, `/_ts/debug/*` framework
### 7.3 `GET /_ts/debug/rsl/recent`

Last N classified requests, newest first. Backed by an in-process ring buffer (no KV writes on hot path). Default 1000 entries, configurable.
🔧 wrench — In-process ring buffer assumes long-lived process state that Fastly Compute does not provide.
Fastly Compute WASM instances are short-lived per-request — there is no in-memory state shared across requests. As specified, /_ts/debug/rsl/recent and /_ts/debug/rsl/summary would only see the single classification of the request that hit the debug endpoint itself.
The spec promises both "no KV writes on the hot path" and "live counters / recent classifications" — these are mutually exclusive on Fastly Compute today.
Fix — pick one:
- Pipe `/summary` through an external aggregator over the structured log stream (Fastly log shipping → S3/BigQuery/Datadog), and document that the debug endpoints are not live edge state.
- Commit to KV/Config Store reads/writes on the hot path with the trade-offs §5.1 explicitly defers (availability, eventual consistency, auth, write QPS limits).
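For the first option, the per-request emission side already fits the platform model: each short-lived instance logs one classification event and the external pipeline does the aggregation. A minimal sketch, assuming the `log-fastly` crate and an endpoint named `rsl_classification` (both illustrative, not from the spec):

```rust
// Emit one structured classification event per request to a named Fastly log
// endpoint; an external pipeline (S3/BigQuery/Datadog) aggregates for /summary.
// Instances are per-request, so the logger is (re)initialized on every invocation.
use log::{info, LevelFilter};

fn log_classification(ua: &str, verdict: &str, signals: &[&str]) {
    log_fastly::init_simple("rsl_classification", LevelFilter::Info);
    info!(
        r#"{{"ua":{:?},"verdict":{:?},"signals":{:?}}}"#,
        ua, verdict, signals
    );
}
```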
| # | Signal | Source | Strength | Coverage |
|---|---|---|---|---|
| 1 | **Honest User-Agent match** | HTTP `User-Agent` header | Definitive when paired with #2 | GPTBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, Google-Extended, CCBot, Bytespider, Amazonbot, Applebot-Extended, OAI-SearchBot, ChatGPT-User, Meta-ExternalAgent |
| 2 | **Published IP allowlist match** | JSON lists from crawler operators | Definitive when paired with #1 | openai.com/gptbot.json, openai.com/searchbot.json, openai.com/chatgpt-user.json, Anthropic published ranges, Perplexity ranges |
| 3 | **JA4 TLS fingerprint match** | TLS ClientHello at edge | Strong (catches spoofed UAs) | Common LLM fetcher libraries: Python `requests`, `aiohttp`, `httpx`, Go `net/http`, Node `fetch`, cURL, Scrapy, Playwright, Puppeteer |
❓ question — How does the WASM instance actually obtain JA4?
JA4 is the only signal that catches spoofed-UA stealth crawlers; §4.5 ("Stealth Classification Example") and the entire stealth branch in the §6.7 response matrix hinge on it.
Fastly Compute does not expose ClientHello bytes or a precomputed JA4 to the WASM instance today. Without a concrete acquisition path the stealth-detection branch is unimplementable.
Please specify which mechanism is assumed:
- VCL pre-stage that hashes ClientHello and forwards as a `Fastly-JA4` request header?
- A closed-beta / experimental Fastly API?
- Upstream computation in a different layer?
- Phase-2 deferred until the edge platform exposes JA4?
Whichever it is, one paragraph describing it would unblock the design.
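If the answer is the VCL pre-stage (or any upstream layer), the WASM side reduces to a header read. A sketch assuming a hypothetical `Fastly-JA4` header name — not a shipped Fastly API:

```rust
// Hypothetical: an upstream stage computes JA4 and forwards it as a request
// header, so the classifier just reads it. "Fastly-JA4" is an assumed name.
use fastly::Request;

fn ja4_of(req: &Request) -> Option<String> {
    req.get_header("Fastly-JA4")
        .and_then(|v| v.to_str().ok())
        .map(str::to_owned)
}
```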
TS adds the `Link` header on every response so honest crawlers can discover license terms on any request, not just by fetching `robots.txt` first.
❓ question — Link: rel="license" on every response — including non-HTML responses?
§6.1 says "TS adds the Link header on every response." §8.5 says existing integrations "continue working unchanged."
In practice every ad-server response, RTB endpoint, integration proxy response (Permutive, Lockr, Datadome, Didomi), 204 beacon, and OPTIONS preflight will gain this header. Is that intentional?
- For top-level navigation HTML responses: yes, it's the goal.
- For JSON RTB bid responses or analytics 204s: it's noise that competes with `Link` headers used for HTTP/2 push hints, preconnect, etc.

Suggest scoping to `Content-Type: text/html` (or top-level navigation responses) and explicitly stating the scope in §6.1.
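The scoping itself is a one-predicate change wherever TS finalizes responses. A minimal sketch assuming the `fastly` crate's response API; the `Link` value is an example, not the spec's:

```rust
// Attach the license Link header only to HTML responses, leaving RTB/JSON/204
// integration responses untouched.
use fastly::http::header::{CONTENT_TYPE, LINK};
use fastly::Response;

fn add_license_link(resp: &mut Response) {
    let is_html = resp
        .get_header(CONTENT_TYPE)
        .and_then(|v| v.to_str().ok())
        .map_or(false, |ct| ct.starts_with("text/html"));
    if is_html {
        resp.set_header(LINK, r#"<https://example.com/license.xml>; rel="license""#);
    }
}
```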
```
├── enforcement.rs  # verdict + terms + mode → Action
├── endpoints.rs    # /license.xml, /robots.txt augmentation, debug routes
└── logging.rs      # structured log emission
```
♻️ refactor — Module path inconsistent with project layout.
Spec proposes crates/trusted-server-core/src/rsl/. Every other integration in TS lives under crates/trusted-server-core/src/integrations/{datadome,permutive,lockr,didomi,testlight,…} — verified in the filesystem and CLAUDE.md.
Fix: crates/trusted-server-core/src/integrations/rsl/{mod.rs, classifier.rs, …} for consistency with IntegrationRegistration discovery and the existing module structure.
```rust
    Ambiguous {
        signals: Vec<Signal>,
    },
}
```
🌱 seedling — purpose likely belongs on bot identity, not request classification.
Classification::HonestAiCrawler { purpose: AiPurpose } — but a bot's declared purpose is a property of its UA (GPTBot = training, ChatGPT-User = in-conversation, OAI-SearchBot = search), not of the individual request. Consider a static BotId → AiPurpose mapping table; the classification then carries bot_identity: BotId and purpose is a derived lookup.
Not blocking — flagging now so the type design isn't locked in before implementation.
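A sketch of the suggested shape — `AiPurpose` comes from the spec, `BotId` is the reviewer's proposed type, and the mapping mirrors the examples above:

```rust
// Purpose is a static property of the bot identity, derived by lookup rather
// than stored per classification.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum BotId {
    GptBot,
    ChatGptUser,
    OaiSearchBot,
}

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum AiPurpose {
    Train,
    Input,
    Search,
}

fn purpose_of(bot: BotId) -> AiPurpose {
    match bot {
        BotId::GptBot => AiPurpose::Train,        // crawls for training corpora
        BotId::ChatGptUser => AiPurpose::Input,   // in-conversation fetch
        BotId::OaiSearchBot => AiPurpose::Search, // search indexing
    }
}
```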
(community-maintained list).
- **ASN database:** updated via Maxmind or equivalent on publisher's own schedule.
📌 out of scope — Control-plane refresh job is referenced but not designed.
"Fetched by a control-plane job from each operator's published JSON endpoint" — this control plane is new infrastructure outside the WASM edge. Reasonable to defer, but should appear explicitly in §2.2 (Out of Scope) as a dependency for the "publishers always have fresh allowlists" promise. Otherwise the freshness story is implicitly "manual TS release cadence".
| # | Signal | Source | Strength | Coverage |
|---|---|---|---|---|
| 1 | **Honest User-Agent match** | HTTP `User-Agent` header | Definitive when paired with #2 | GPTBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, Google-Extended, CCBot, Bytespider, Amazonbot, Applebot-Extended, OAI-SearchBot, ChatGPT-User, Meta-ExternalAgent |
| 2 | **Published IP allowlist match** | JSON lists from crawler operators | Definitive when paired with #1 | openai.com/gptbot.json, openai.com/searchbot.json, openai.com/chatgpt-user.json, Anthropic published ranges, Perplexity ranges |
📝 note — Verified live (200 OK) for the IP-list URLs cited:
- `https://openai.com/gptbot.json`
- `https://openai.com/searchbot.json`
- `https://openai.com/chatgpt-user.json`
No action — recording the verification.
# Trusted Server AI Crawler Licensing (RSL-compliant)

*April 2026*
⛏ nitpick — format-docs CI failure is a trivial prettier whitespace diff.
- Italic syntax: `*April 2026*` → `_April 2026_` (this line)
- Markdown table column padding (§3.5 and §4.1)
Fix in one command:

```
cd docs && npx prettier --write superpowers/specs/2026-04-22-rsl-ai-crawler-licensing-design.md
```

crawlers that spoof user-agent strings.
4. **Publisher-owned config** — single `license.toml` file, version-controlled, no lock-in to a vendor's dashboard.
5. **Open source** — publishers can audit the enforcement behavior.
⛏ nitpick — "Trusted Server" / "TS" used interchangeably mid-paragraph throughout §1. Pick one and stick with it for readability. CLAUDE.md doesn't enforce, but the project prose elsewhere is consistent.
aram356
left a comment
Summary
Re-review against unchanged head 8d081287 (no commits since the 2026-04-30 review). The 17 prior findings still stand and remain blocking — re-verified against the current main (post PR #581 / #609 / #610). This review adds 17 new findings surfaced by a deeper re-read; verdict remains REQUEST_CHANGES.
The new findings cluster around three themes:
- **Hot-path infrastructure that doesn't exist** — Signal #6 fetch-correlation needs cross-request state, and `/robots.txt` augmentation needs an origin-fetch + cache pattern that TS doesn't have today (the only synthetic-body path is `/static/tsjs=`). Both are load-bearing and unaddressed.
- **Stealth detection has gaps the spec doesn't acknowledge** — §4.5's `ua_spoofed_chrome` requires a Chrome JA4 reference set the design doesn't budget for, and the JA4 set as scoped will false-positive against legitimate non-AI clients (CI, monitoring, internal scripts). Combined with the unanswered prior question about how WASM obtains JA4 in the first place, the entire stealth branch is non-functional as specified.
- **Decision logic is implicit and a success criterion is unmeetable** — The `bot_name → declared purpose → permits/prohibits` precedence table that drives every enforcement decision is never written down. §10 criterion #1 ("100% identification") is unmeetable given §4.3's release-bundled IP-allowlist refresh model.
Blocking
🔧 wrench
- Signal #6 needs cross-request state Fastly Compute lacks — line 274
- `/robots.txt` augmentation has no origin-fetch + cache mechanism — line 635
- `License:` directive is an RSL extension, not RFC 9309 — line 500
- §10 success criterion #1 unmeetable given §4.3 refresh model — line 1009
❓ question
- Stealth detection requires a Chrome JA4 reference set not in scope — line 345
- JA4 false-positive surface for legitimate non-AI clients — line 348
- Bot-name normalization (case folding, match style) unspecified — line 471
Non-blocking
🤔 thinking
- `Confidence` enum semantics undefined — line 297
- 30-day `Cache-Control` may strand publishers on stale terms — line 648
- `bot_name → purpose → permits/prohibits` decision table missing — line 705
- Phase-2 EMS contradicts §3.5 "no changes" promise — line 996
- Two-config-file split deviates from `trusted-server.toml`, no rationale — line 516
- Default Permissive can hide stealth-detection mis-fires — line 354
🌱 seedling
- No test plan or testing strategy — cross-cutting. 1055 lines of design with no §test-strategy. How are classifiers unit-tested without JA4 in the test runtime? How are integration tests structured against simulated Fastly geo / TLS metadata? Existing integrations test inline (e.g. `crates/trusted-server-core/src/integrations/permutive.rs:643+`) but there is no harness for TLS-fingerprint or ASN signals. A short §test-strategy (unit + integration + a sample-crawler end-to-end check) would close this.
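One way to keep classifier unit tests independent of JA4 availability in the test runtime is to classify from a plain signals struct that tests construct directly — a hedged sketch, all names and the toy decision logic illustrative:

```rust
// Classifier consumes a plain signals struct, so unit tests never need real
// TLS metadata; integration tests would populate it from Fastly request state.
#[derive(Default)]
struct Signals<'a> {
    ua: &'a str,
    ja4: Option<&'a str>,
    is_h2: bool,
}

fn classify(s: &Signals) -> &'static str {
    // Toy stand-in for the real six-signal classifier.
    match (s.ua.contains("GPTBot"), s.ja4) {
        (true, _) => "honest-ai-crawler",
        (false, Some(_)) => "check-fingerprint",
        (false, None) => "ambiguous",
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn honest_ua_classifies_without_tls_metadata() {
        let s = Signals { ua: "GPTBot/1.0", ..Default::default() };
        assert_eq!(classify(&s), "honest-ai-crawler");
    }
}
```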
📌 out of scope
- Phase-2 OLP key distribution / rotation unspecified — line 994
⛏ nitpick
- 402 RSL fragment omits `<copyright>` — line 568
- §5.3 example doesn't exercise the "prohibition wins" rule — line 494
CI Status (re-verified locally)
- `cargo fmt`: PASS
- `cargo clippy`: PASS
- `cargo test`: PASS
- `vitest`: PASS
- `format-docs`: FAIL — still red 14 days after the prior review flagged it. One-line fix: `cd docs && npx prettier --write superpowers/specs/2026-04-22-rsl-ai-crawler-licensing-design.md`
| 3 | **JA4 TLS fingerprint match** | TLS ClientHello at edge | Strong (catches spoofed UAs) | Common LLM fetcher libraries: Python `requests`, `aiohttp`, `httpx`, Go `net/http`, Node `fetch`, cURL, Scrapy, Playwright, Puppeteer |
| 4 | **ASN classification** | IP → ASN lookup | Supporting signal only (never decisive alone) | Datacenter/hosting ASNs (AWS, GCP, Azure, DigitalOcean, Hetzner, OVH), VPN/proxy ASNs, residential ASNs |
| 5 | **H2 handshake presence** | Edge TLS/HTTP layer | Supporting signal (humans nearly always H2; many scrapers still H1) | All traffic |
| 6 | **`/robots.txt` and `/license.xml` fetch correlation** | TS request logs | Supporting signal (honest bots fetch before crawling) | All traffic |
🔧 wrench — Signal #6 (/robots.txt and /license.xml fetch correlation) needs cross-request state Fastly Compute does not provide.
This is the same root issue as the in-process ring buffer flagged earlier, but here it is a classification signal on the hot path, not just observability. Tracking "did this IP fetch /robots.txt recently?" requires cross-request lookups against Fastly KV / Edge Dictionary / Config Store, with cost and latency consequences that aren't acknowledged. As written, the signal is non-functional in the proposed architecture.
Fix: either drop signal #6 from the table, or re-spec it explicitly as a KV-backed correlation store with the request-budget impact called out in §3.4 and §3.9.
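If the signal is kept, the KV-backed re-spec would look roughly like the sketch below — the store is abstracted behind a trait because the exact `fastly` KV API surface is version-dependent; key format and TTL are illustrative:

```rust
// Cross-request correlation needs an external store; each lookup is a hot-path
// KV read that §3.4/§3.9 would have to budget for explicitly.
trait CorrelationStore {
    fn put(&mut self, key: &str, ttl_secs: u64);
    fn exists(&self, key: &str) -> bool;
}

// Called from the /robots.txt handler: remember that this IP fetched it.
fn on_robots_fetch(store: &mut impl CorrelationStore, client_ip: &str) {
    store.put(&format!("robots:{client_ip}"), 3600);
}

// Called during classification: did this IP fetch /robots.txt recently?
fn fetch_correlation_signal(store: &impl CorrelationStore, client_ip: &str) -> bool {
    store.exists(&format!("robots:{client_ip}"))
}
```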
```
Sitemap: https://example.com/sitemap.xml
```

TS preserves the publisher's existing `robots.txt` content and prepends the `License:` directive.
🔧 wrench — /robots.txt augmentation needs origin-fetch + cache infrastructure that does not exist today.
"TS preserves the publisher's existing robots.txt content and prepends the License: directive" implies an origin subrequest, body mutation, and an edge cache with an invalidation strategy. The only synthetic-body endpoint pattern in TS today is /static/tsjs= (publisher.rs:114-150), which generates output entirely from in-binary modules — no origin fetch. §3.4 doesn't list robots.txt as compiled-in state, and §3.7 has no fetcher module.
Fix: specify the mechanism explicitly — origin subrequest path, cache key + TTL, surrogate-key purge on license.toml change, and behavior when origin returns 404 / 5xx (the §6.5 "TS generates a minimal one" case). Or constrain the design to a fully-generated robots.txt and acknowledge that publisher origin-side robots.txt content is dropped.
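The minimal origin-fetch shape, if that path is chosen — a sketch assuming a backend named `origin`; caching is deliberately omitted because it is the gap:

```rust
// Fetch origin robots.txt, prepend the RSL License: line, fall back to a
// minimal generated file on origin 404/5xx (the §6.5 case). No caching shown —
// cache key, TTL, and purge on license.toml change still need spec.
use fastly::{Error, Request, Response};

fn augmented_robots(license_url: &str) -> Result<Response, Error> {
    let origin = Request::get("https://example.com/robots.txt").send("origin")?;
    let body = if origin.get_status().is_success() {
        origin.into_body_str()
    } else {
        String::from("User-agent: *\nAllow: /\n") // minimal generated fallback
    };
    Ok(Response::from_body(format!("License: {license_url}\n\n{body}")))
}
```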
2. **`payment` types match RSL's payment vocabulary** — `purchase`, `subscription`, `training`, `crawl`, `use`, `contribution`, `attribution`, `free`.
3. **Route patterns are RFC 9309-compliant** (same syntax as robots.txt) —
🔧 wrench — License: directive is not RFC 9309.
This bullet says "Route patterns are RFC 9309-compliant (same syntax as robots.txt)". RFC 9309 defines User-agent, Disallow, Allow, Sitemap — there is no License: directive. The License: line in §6.5 is an RSL 1.0 extension to the robots.txt format, not part of RFC 9309. Conflating the two will mislead implementers and reviewers about standards posture.
Fix: clarify that the route-pattern syntax (wildcards, longest-match) is RFC 9309-compliant, while the License: line is an RSL 1.0 extension to robots.txt — and link the RSL spec section.
Set before running against real traffic:

1. **Classification accuracy:** 100% of honest AI crawlers (OpenAI, Anthropic, Perplexity, Google-Extended, CCBot) correctly identified by UA + IP
🔧 wrench — Success criterion #1 cannot hold given the §4.3 IP-allowlist refresh model.
"100% of honest AI crawlers correctly identified by UA + IP allowlist signals" is a hard claim, but §4.3 says IP allowlists are bundled into TS releases ("Recommended publisher refresh cadence: weekly"). Between an operator publishing new IPs and the publisher rolling a new TS build, requests from those IPs match UA but miss the allowlist — falling to Ambiguous or StealthAiCrawler. The criterion as stated is unmeetable for any week-long observation window.
Fix: either qualify the criterion ("…against the bundled allowlist current at deploy time, measured within X days of release") or commit to runtime-loading the IP allowlist from a Fastly Config Store / KV refreshed by the control-plane job mentioned in §4.3.
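The runtime-loading option could be as small as the sketch below — store name, key, and JSON-array-of-CIDRs value format are all assumptions about how the §4.3 control-plane job would publish the data:

```rust
// Read the IP allowlist from a Fastly Config Store refreshed out-of-band,
// instead of baking it into the release binary.
use fastly::ConfigStore;

fn load_allowlist_cidrs() -> Vec<String> {
    ConfigStore::open("rsl_ip_allowlists")
        .get("openai_gptbot") // value: JSON array of CIDR strings
        .and_then(|json| serde_json::from_str(&json).ok())
        .unwrap_or_default()
}
```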
- Signal: `asn:aws` ✓
- Signal: `ja4:python_requests` ✓
- Signal: `ua_spoofed_chrome` (UA claims Chrome but JA4 says Python) ✓
❓ question — How is ua_spoofed_chrome actually computed?
This stealth example detects "UA claims Chrome but JA4 says Python" — that requires both an LLM-library JA4 set and a known-good browser/Chrome JA4 reference set to detect the mismatch. §3.9 only budgets "~5–10 KB JA4 (a few hundred LLM fetcher fingerprints)" — there is no Chrome reference set in scope. Chrome JA4s additionally vary by version × OS × GREASE, so the reference set is non-trivial.
Without an answer here, the entire stealth branch (§4.5, §6.3) — which is the design's differentiator over plain UA matching — is non-functional as specified.
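For concreteness, here is what computing the signal minimally requires — both fingerprint sets are placeholders; building and maintaining the Chrome reference set (version × OS × GREASE) is exactly the unbudgeted work:

```rust
// ua_spoofed_chrome needs a known-LLM-library JA4 set AND a known-Chrome JA4
// reference set; the signal is the mismatch between UA claim and fingerprint.
use std::collections::HashSet;

fn ua_spoofed_chrome(
    ua: &str,
    ja4: &str,
    llm_ja4s: &HashSet<String>,
    chrome_ja4s: &HashSet<String>,
) -> bool {
    let claims_chrome = ua.contains("Chrome/");
    // Strict form: UA claims Chrome but the fingerprint is a known LLM fetcher.
    // A looser form (just !chrome_ja4s.contains(ja4)) is where the
    // reference-set completeness problem bites.
    claims_chrome && llm_ja4s.contains(ja4) && !chrome_ja4s.contains(ja4)
}
```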
```toml
# trusted-server.toml
[integrations.rsl]
```
🤔 thinking — Two-config-file split deviates from the trusted-server.toml-everything pattern, no rationale given.
Other integrations declare config inline in trusted-server.toml (verified at settings.rs). RSL introduces license.toml + license.private.toml as separate files. The likely reason — publishers want to git-version license.toml publicly while keeping internal settings private — is plausible, but the spec doesn't say so.
Fix: add a sentence to §5.1 explaining why RSL is the exception (public version-controlled terms vs. operational settings), or fold the contents under [integrations.rsl.public] / [integrations.rsl.private] blocks in trusted-server.toml for consistency.
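The fold-in alternative would deserialize straight from `trusted-server.toml` like the other integrations; a sketch with illustrative field names:

```rust
// [integrations.rsl.public] / [integrations.rsl.private] sub-tables in
// trusted-server.toml, deserialized with serde like other integration configs.
use serde::Deserialize;

#[derive(Deserialize)]
struct RslSettings {
    public: RslPublic,   // version-controlled, publishable license terms
    private: RslPrivate, // enforcement secrets, commercial overrides
}

#[derive(Deserialize)]
struct RslPublic {
    license_url: String,
}

#[derive(Deserialize)]
struct RslPrivate {
    signing_key: String,
}
```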
**Default:** Permissive. Block only confirmed crawlers whose license terms prohibit access. Stealth crawlers and ambiguous traffic are allowed through but logged for publisher review.
🤔 thinking — Default Permissive may hide stealth-detection mis-fires from publishers.
Most publishers will enable this integration to enforce. Defaulting to Permissive means false-negatives (stealth crawlers misclassified as Ambiguous, or honest crawlers blocked by a logic bug) are invisible until the publisher reads structured logs.
Worth justifying the rollout posture explicitly (safety-first staging, then opt-in to Strict per route) or recommending Strict-on-premium-routes as a default for new deployments. As written, the publisher gets minimal protection until they manually flip to Strict.
**Split architecture:**

- **Hot path (WASM at edge):** token validation only. HMAC check against a shared signing key. Sub-millisecond. No KV writes.
📌 out of scope — Phase-2 OLP key distribution / rotation unspecified.
HMAC against a shared signing key works only if (a) the WASM has the key, (b) the publisher can rotate it, and (c) revocation is possible after compromise. None of these are in scope today, but they should appear in §11 open questions or §2.2 explicitly so the dependency isn't lost between phases.
Likely answer for Fastly: load via PlatformSecretStore at request time. State this — and acknowledge that key rotation while tokens are in flight needs a grace window that the design doesn't yet specify.
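For reference, the hot-path check itself is small — a sketch assuming the `hmac`/`sha2` crates; key loading and rotation are the unspecified parts flagged above:

```rust
// Constant-time HMAC-SHA256 verification of a presented token signature. How
// the WASM gets `key` (secret store, rotation grace window) is the open part.
use hmac::{Hmac, Mac};
use sha2::Sha256;

fn token_valid(key: &[u8], payload: &[u8], signature: &[u8]) -> bool {
    let mut mac = Hmac::<Sha256>::new_from_slice(key)
        .expect("HMAC accepts keys of any length");
    mac.update(payload);
    mac.verify_slice(signature).is_ok() // constant-time comparison
}
```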
```xml
</payment>
</license>
</content>
</rsl>
```
⛏ nitpick — 402 RSL fragment omits <copyright>.
The 402 body inlines <content url="/premium/*"> with <license>...<payment> but no <copyright> element. The full /license.xml in §6.6 has it. RSL 1.0 attribution tracking expects <copyright> so consumers know the rights-holder for every fragment served.
Fix: keep at minimum a <copyright type="organization" contactEmail="..." contactUrl="...">Example Publisher, Inc.</copyright> element in 402 bodies.
### 5.5 Key Config Design Points

1. **`permits` / `prohibits` use RSL's usage vocabulary** — `search`, `ai-all`,
⛏ nitpick — §5.3 example doesn't actually exercise "prohibition wins."
permits = ["search", "ai-input"] and prohibits = ["ai-train", "ai-index"] are disjoint sets, so the precedence rule stated here is never tested by the example.
Fix: add an example with overlapping vocab — e.g., permits = ["ai-all"] paired with prohibits = ["ai-train"] — to demonstrate that prohibition narrows the broader permission. Otherwise readers may not realize the precedence is load-bearing.
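A worked version of that overlapping example, assuming `ai-all` subsumes the specific `ai-*` usages (the reviewer's reading of the RSL vocabulary):

```rust
// permits = ["ai-all"], prohibits = ["ai-train"]: prohibition narrows the
// broader permission, so ai-train is denied while ai-input stays allowed.
fn is_permitted(usage: &str, permits: &[&str], prohibits: &[&str]) -> bool {
    if prohibits.contains(&usage) {
        return false; // prohibition wins
    }
    permits.contains(&usage)
        || (usage.starts_with("ai-") && permits.contains(&"ai-all"))
        || permits.contains(&"all")
}

fn main() {
    let permits = ["ai-all"];
    let prohibits = ["ai-train"];
    assert!(is_permitted("ai-input", &permits, &prohibits));
    assert!(!is_permitted("ai-train", &permits, &prohibits)); // narrowed out
}
```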
Summary
Adds design spec for Trusted Server's RSL-compliant AI crawler detection and licensing enforcement layer.
- `/license.xml`, `robots.txt` augmentation, `Link` header
- `license.toml` for RSL terms; private `license.private.toml` for enforcement secrets and commercial overrides
- Debug endpoints (`/_ts/debug/rsl/summary`, `/_ts/debug/rsl/recent`, `/_ts/debug/rsl/license`) and structured logging

Test plan
Closes #649