
RSL AI crawler licensing design spec#650

Open
jevansnyc wants to merge 1 commit into main from rsl-ai-crawler-licensing-spec

Conversation


@jevansnyc jevansnyc commented Apr 22, 2026

Summary

Adds design spec for Trusted Server's RSL-compliant AI crawler detection and licensing enforcement layer.

  • Edge-deployed AI crawler classification using six signals (UA, IP allowlist, JA4, ASN, H2, robots/license fetch correlation)
  • RSL 1.0 standards-compliant license publishing (/license.xml, robots.txt augmentation, Link header)
  • Public license.toml for RSL terms; private license.private.toml for enforcement secrets and commercial overrides
  • Standards-compliant 402/403 enforcement responses with inline RSL fragments
  • Permissive-by-default with per-publisher/per-route Strict override
  • Debug endpoints (/_ts/debug/rsl/summary, /_ts/debug/rsl/recent, /_ts/debug/rsl/license) and structured logging
  • Integrates with existing TS architecture; no changes needed to Edge Cookie, auction orchestrator, consent, or other integrations
  • Phase 2 preview for Open License Protocol (OLP) token-based access

Test plan

  • Review spec for accuracy against current TS infrastructure (integration hooks, JA4 signals, bot gate)
  • Verify RSL usage/payment vocabulary matches RSL 1.0 spec (https://rslstandard.org/rsl)
  • Validate onboarding flow assumptions against an existing TS publisher deployment
  • Confirm binary size estimates (~100 KB additional for IP allowlists + JA4 DB + new code)

Closes #649

Trusted Server RSL-compliant AI crawler detection and licensing
enforcement, MVP-ready. Six-signal classification (UA, IP, JA4, ASN,
H2, robots/license.xml correlation), permissive-by-default with strict
override, public license.toml + private license.private.toml split,
standards-compliant 402/403 responses, debug endpoints, structured
logging. Targets publishers already running TS. Phase 2 adds OLP
license server for programmatic token issuance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jevansnyc jevansnyc linked an issue Apr 22, 2026 that may be closed by this pull request
@aram356 aram356 assigned aram356 and jevansnyc and unassigned aram356 Apr 22, 2026
@aram356 aram356 requested a review from prk-Jr April 27, 2026 15:36

@aram356 aram356 left a comment


Summary

Design spec PR — single 1055-line markdown file in docs/superpowers/specs/, no code changes. The proposed RSL-compliant AI crawler licensing layer is well-scoped and the public/private config split is sound, but the spec misrepresents the current state of TS infrastructure and contains two architectural assumptions that don't hold on Fastly Compute. Requesting changes on those before merge.

Blocking

🔧 wrench

  • §3.5 "Existing capability" table is materially inaccurate — JA4 signal, bot gate, /robots.txt handling, and /_ts/debug/* auth pattern are all listed as existing but do not exist in crates/. Reframe as new infrastructure required.
  • §7.2 / §7.3 in-process ring buffer assumes long-lived process state — Fastly Compute WASM instances are short-lived per-request; the debug endpoints can't aggregate without KV/Config Store or external log-stream aggregation. Pick one and document the trade-off.

❓ question

  • §4.1 / §3.5 — How does the WASM instance obtain JA4? Fastly Compute does not expose ClientHello bytes today. Without a concrete acquisition path the entire stealth-detection branch is unimplementable.
  • §6.1 / §8.5 — Link: rel="license" on every response, including ad/RTB/integration responses? §8.5 says integrations are unaffected, but every response gaining a header is a change. Suggest scoping to HTML responses.

Non-blocking

♻️ refactor

  • §3.7 module path — should be crates/trusted-server-core/src/integrations/rsl/, matching every other integration in the project.
  • §3.4 IP allowlist lookup structure unspecified — naive Vec scan over thousands of CIDRs would dominate hot-path latency; specify a radix/trie structure.
  • §6.6 rendered XML drops contact_url — license.toml defines it, but the example only renders contactEmail.
  • §5.5 usage vocabulary missing all — RSL 1.0 defines all, ai-all, ai-train, ai-input, ai-index, search.
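For the §3.4 point above, a hedged sketch of the kind of structure meant — here a binary search over sorted, non-overlapping IPv4 ranges rather than a full radix trie; all names and CIDR values are illustrative, not from the spec:

```rust
// Illustrative O(log n) CIDR allowlist lookup, replacing a naive Vec scan.
// Assumes IPv4 and non-overlapping ranges; a radix trie would be the next step.
fn cidr_to_range(cidr: &str) -> (u32, u32) {
    let (ip, bits) = cidr.split_once('/').unwrap();
    let o: Vec<u32> = ip.split('.').map(|x| x.parse().unwrap()).collect();
    let base = (o[0] << 24) | (o[1] << 16) | (o[2] << 8) | o[3];
    let bits: u32 = bits.parse().unwrap();
    let mask = if bits == 0 { 0 } else { u32::MAX << (32 - bits) };
    (base & mask, (base & mask) | !mask)
}

struct Allowlist {
    ranges: Vec<(u32, u32)>, // sorted (start, end) pairs
}

impl Allowlist {
    fn new(cidrs: &[&str]) -> Self {
        let mut ranges: Vec<_> = cidrs.iter().map(|c| cidr_to_range(c)).collect();
        ranges.sort();
        Allowlist { ranges }
    }

    fn contains(&self, ip: u32) -> bool {
        // Find the last range starting at or before `ip`, then bounds-check.
        match self.ranges.partition_point(|r| r.0 <= ip).checked_sub(1) {
            Some(i) => ip <= self.ranges[i].1,
            None => false,
        }
    }
}
```

Even without a trie, this keeps the hot-path cost logarithmic in the number of published CIDRs.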

🤔 thinking

  • §3.8 / §8.1 "no Fastly-specific dependencies in core" overstates current reality — crates/trusted-server-core/Cargo.toml already has fastly as a non-optional dep; PR #581/#609 are the in-progress abstraction work.
  • §4.1 ASN database not in the §3.9 binary-size budget — MaxMind GeoLite2-ASN is ~10 MB; reconcile with the <100 KB budget.
  • §4.7 mentions 401 but §6.7 matrix doesn't — drop or describe when 401 fires.
  • §4.3 IP-allowlist refresh cadence couples to TS release train — staleness window or KV-based refresh path worth acknowledging.
  • §6.6 RSL max-age is in days, HTTP Cache-Control: max-age in seconds — note the unit difference.

🌱 seedling

  • §4.2 purpose likely belongs on bot identity, not request classification.

📌 out of scope

  • §4.3 control-plane refresh job referenced but not designed — should appear in §2.2 if deferred.

📝 note

  • IP-list URLs (openai.com/{gptbot,searchbot,chatgpt-user}.json) verified live (200 OK).

⛏ nitpick

  • format-docs CI failure is a trivial prettier whitespace diff (asterisk italics → underscore italics, table column padding); fix with cd docs && npx prettier --write superpowers/specs/2026-04-22-rsl-ai-crawler-licensing-design.md.
  • "Trusted Server" / "TS" used interchangeably mid-paragraph; pick one.

CI Status

  • format-docs: FAIL (one-command prettier fix)
  • cargo fmt: PASS
  • cargo clippy: PASS
  • cargo test: PASS
  • vitest: PASS
  • browser/integration tests: PASS
  • CodeQL: PASS

| `/_ts/debug/*` auth pattern | Debug endpoints reuse existing token auth |
| Structured logging (`log-fastly`) | Classification events emitted as structured log lines |
| Settings (`trusted-server.toml`) | RSL config block added to existing settings parser |


🔧 wrench — "Existing capability" table is materially inaccurate.

Four of six items in this column do not exist in the codebase today (verified by searching crates/):

  • JA4 signal from edge TLS — no JA4 code, no client_hello access, no TLS-fingerprint plumbing anywhere
  • Bot gate (H2 + JA4) — no bot gate exists
  • /robots.txt handling — no robots.txt handler in crates/trusted-server-core/
  • /_ts/debug/* auth pattern — no such route family or token-auth pattern exists

A reader walks away believing the implementation reuses four existing systems. It actually builds them all from scratch — a materially different effort estimate.

Fix: split the table into two columns:

  • Existing capability — IntegrationRegistration builder, Settings (trusted-server.toml), structured logging
  • New infrastructure required — JA4 acquisition path, bot gate, /robots.txt handler, /_ts/debug/* framework

### 7.3 `GET /_ts/debug/rsl/recent`

Last N classified requests, newest first. Backed by an in-process ring buffer
(no KV writes on hot path). Default 1000 entries, configurable.

🔧 wrench — In-process ring buffer assumes long-lived process state that Fastly Compute does not provide.

Fastly Compute WASM instances are short-lived per-request — there is no in-memory state shared across requests. As specified, /_ts/debug/rsl/recent and /_ts/debug/rsl/summary would only see the single classification of the request that hit the debug endpoint itself.

The spec promises both "no KV writes on the hot path" and "live counters / recent classifications" — these are mutually exclusive on Fastly Compute today.

Fix — pick one:

  1. Pipe /summary through an external aggregator over the structured log stream (Fastly log shipping → S3/BigQuery/Datadog), and document that the debug endpoints are not live edge state.
  2. Commit to KV/Config Store reads/writes on the hot path with the trade-offs §5.1 explicitly defers (availability, eventual consistency, auth, write QPS limits).

|---|---|---|---|---|
| 1 | **Honest User-Agent match** | HTTP `User-Agent` header | Definitive when paired with #2 | GPTBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, Google-Extended, CCBot, Bytespider, Amazonbot, Applebot-Extended, OAI-SearchBot, ChatGPT-User, Meta-ExternalAgent |
| 2 | **Published IP allowlist match** | JSON lists from crawler operators | Definitive when paired with #1 | openai.com/gptbot.json, openai.com/searchbot.json, openai.com/chatgpt-user.json, Anthropic published ranges, Perplexity ranges |
| 3 | **JA4 TLS fingerprint match** | TLS ClientHello at edge | Strong (catches spoofed UAs) | Common LLM fetcher libraries: Python `requests`, `aiohttp`, `httpx`, Go `net/http`, Node `fetch`, cURL, Scrapy, Playwright, Puppeteer |

❓ question — How does the WASM instance actually obtain JA4?

JA4 is the only signal that catches spoofed-UA stealth crawlers; §4.5 ("Stealth Classification Example") and the entire stealth branch in the §6.7 response matrix hinge on it.

Fastly Compute does not expose ClientHello bytes or a precomputed JA4 to the WASM instance today. Without a concrete acquisition path the stealth-detection branch is unimplementable.

Please specify which mechanism is assumed:

  • VCL pre-stage that hashes ClientHello and forwards as Fastly-JA4 request header?
  • A closed-beta / experimental Fastly API?
  • Upstream computation in a different layer?
  • Phase-2 deferred until the edge platform exposes JA4?

Whichever it is, one paragraph describing it would unblock the design.

```

TS adds the `Link` header on every response so honest crawlers can discover
license terms on any request, not just by fetching `robots.txt` first.

❓ question — Link: rel="license" on every response — including non-HTML responses?

§6.1 says "TS adds the Link header on every response." §8.5 says existing integrations "continue working unchanged."

In practice every ad-server response, RTB endpoint, integration proxy response (Permutive, Lockr, Datadome, Didomi), 204 beacon, and OPTIONS preflight will gain this header. Is that intentional?

  • For top-level navigation HTML responses: yes, it's the goal.
  • For JSON RTB bid responses or analytics 204s: it's noise that competes with Link headers used for HTTP/2 push hints, preconnect, etc.

Suggest scoping to Content-Type: text/html (or top-level navigation responses) and explicitly stating the scope in §6.1.
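A minimal sketch of the scoping I'm suggesting, assuming hypothetical helper names and an illustrative header value (the spec would define the real ones):

```rust
// Illustrative: attach the RSL license Link header only to HTML responses,
// skipping beacons, preflights, and JSON/RTB payloads.
fn should_attach_license_link(content_type: Option<&str>, status: u16) -> bool {
    // 204 beacons and 304 revalidations carry no body worth licensing.
    if status == 204 || status == 304 {
        return false;
    }
    matches!(content_type, Some(ct) if ct.starts_with("text/html"))
}

// Header value shape is an example; the spec should pin the exact rel/type.
fn license_link_header(origin: &str) -> String {
    format!("<{origin}/license.xml>; rel=\"license\"")
}
```

This keeps honest-crawler discovery on every page fetch while leaving integration proxy responses byte-identical, which is what §8.5 actually promises.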

├── enforcement.rs # verdict + terms + mode → Action
├── endpoints.rs # /license.xml, /robots.txt augmentation, debug routes
└── logging.rs # structured log emission
```

♻️ refactor — Module path inconsistent with project layout.

Spec proposes crates/trusted-server-core/src/rsl/. Every other integration in TS lives under crates/trusted-server-core/src/integrations/{datadome,permutive,lockr,didomi,testlight,…} — verified in the filesystem and CLAUDE.md.

Fix: crates/trusted-server-core/src/integrations/rsl/{mod.rs, classifier.rs, …} for consistency with IntegrationRegistration discovery and the existing module structure.

Ambiguous {
signals: Vec<Signal>,
},
}

🌱 seedling — purpose likely belongs on bot identity, not request classification.

Classification::HonestAiCrawler { purpose: AiPurpose } — but a bot's declared purpose is a property of its UA (GPTBot = training, ChatGPT-User = in-conversation, OAI-SearchBot = search), not of the individual request. Consider a static BotId → AiPurpose mapping table; the classification then carries bot_identity: BotId and purpose is a derived lookup.

Not blocking — flagging now so the type design isn't locked in before implementation.
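A sketch of the shape I mean — variants and mappings are illustrative examples, not the spec's final list:

```rust
// Illustrative: purpose is a static property of the bot identity,
// derived by lookup rather than stored per classification.
#[derive(Debug, Clone, Copy, PartialEq)]
enum AiPurpose {
    Training,
    SearchIndexing,
    UserRequest,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum BotId {
    GptBot,
    OaiSearchBot,
    ChatGptUser,
    ClaudeBot,
    PerplexityBot,
}

// Example mapping only — the real table comes from operator documentation.
fn purpose_of(bot: BotId) -> AiPurpose {
    match bot {
        BotId::GptBot | BotId::ClaudeBot => AiPurpose::Training,
        BotId::OaiSearchBot | BotId::PerplexityBot => AiPurpose::SearchIndexing,
        BotId::ChatGptUser => AiPurpose::UserRequest,
    }
}
```

The classification then carries only `bot_identity: BotId`, and enforcement derives purpose at decision time.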

(community-maintained list).
- **ASN database:** updated via MaxMind or equivalent on publisher's own
schedule.


📌 out of scope — Control-plane refresh job is referenced but not designed.

"Fetched by a control-plane job from each operator's published JSON endpoint" — this control plane is new infrastructure outside the WASM edge. Reasonable to defer, but should appear explicitly in §2.2 (Out of Scope) as a dependency for the "publishers always have fresh allowlists" promise. Otherwise the freshness story is implicitly "manual TS release cadence".

| # | Signal | Source | Strength | Coverage |
|---|---|---|---|---|
| 1 | **Honest User-Agent match** | HTTP `User-Agent` header | Definitive when paired with #2 | GPTBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, Google-Extended, CCBot, Bytespider, Amazonbot, Applebot-Extended, OAI-SearchBot, ChatGPT-User, Meta-ExternalAgent |
| 2 | **Published IP allowlist match** | JSON lists from crawler operators | Definitive when paired with #1 | openai.com/gptbot.json, openai.com/searchbot.json, openai.com/chatgpt-user.json, Anthropic published ranges, Perplexity ranges |

📝 note — Verified live (200 OK) for the IP-list URLs cited:

  • https://openai.com/gptbot.json
  • https://openai.com/searchbot.json
  • https://openai.com/chatgpt-user.json

No action — recording the verification.

@@ -0,0 +1,1055 @@
# Trusted Server AI Crawler Licensing (RSL-compliant)

*April 2026*

⛏ nitpick — format-docs CI failure is a trivial prettier whitespace diff.

  • Italic syntax: *April 2026* → _April 2026_ (this line)
  • Markdown table column padding (§3.5 and §4.1)

Fix in one command:

cd docs && npx prettier --write superpowers/specs/2026-04-22-rsl-ai-crawler-licensing-design.md

crawlers that spoof user-agent strings.
4. **Publisher-owned config** — single `license.toml` file, version-controlled,
no lock-in to a vendor's dashboard.
5. **Open source** — publishers can audit the enforcement behavior.

⛏ nitpick — "Trusted Server" / "TS" used interchangeably mid-paragraph throughout §1. Pick one and stick with it for readability. CLAUDE.md doesn't enforce this, but the project's prose elsewhere is consistent.


@aram356 aram356 left a comment


Summary

Re-review against unchanged head 8d081287 (no commits since the 2026-04-30 review). The 17 prior findings still stand and remain blocking — re-verified against the current main (post PR #581 / #609 / #610). This review adds 17 new findings surfaced by a deeper re-read; verdict remains REQUEST_CHANGES.

The new findings cluster around three themes:

  1. Hot-path infrastructure that doesn't exist — Signal #6 fetch-correlation needs cross-request state, and /robots.txt augmentation needs an origin-fetch + cache pattern that TS doesn't have today (only synthetic body path is /static/tsjs=). Both are load-bearing and unaddressed.

  2. Stealth detection has gaps the spec doesn't acknowledge — §4.5's ua_spoofed_chrome requires a Chrome JA4 reference set the design doesn't budget for, and the JA4 set as scoped will false-positive against legitimate non-AI clients (CI, monitoring, internal scripts). Combined with the unanswered prior question about how WASM obtains JA4 in the first place, the entire stealth branch is non-functional as specified.

  3. Decision logic is implicit and a success criterion is unmeetable — The bot_name → declared purpose → permits/prohibits precedence table that drives every enforcement decision is never written down. §10 criterion #1 ("100% identification") is unmeetable given §4.3's release-bundled IP-allowlist refresh model.

Blocking

🔧 wrench

  • Signal #6 needs cross-request state Fastly Compute lacks — line 274
  • /robots.txt augmentation has no origin-fetch + cache mechanism — line 635
  • License: directive is an RSL extension, not RFC 9309 — line 500
  • §10 success criterion #1 unmeetable given §4.3 refresh model — line 1009

❓ question

  • Stealth detection requires a Chrome JA4 reference set not in scope — line 345
  • JA4 false-positive surface for legitimate non-AI clients — line 348
  • Bot-name normalization (case folding, match style) unspecified — line 471

Non-blocking

🤔 thinking

  • Confidence enum semantics undefined — line 297
  • 30-day Cache-Control may strand publishers on stale terms — line 648
  • bot_name → purpose → permits/prohibits decision table missing — line 705
  • Phase-2 EMS contradicts §3.5 "no changes" promise — line 996
  • Two-config-file split deviates from trusted-server.toml, no rationale — line 516
  • Default Permissive can hide stealth-detection mis-fires — line 354

🌱 seedling

  • No test plan or testing strategy — cross-cutting. 1055 lines of design with no §test-strategy. How are classifiers unit-tested without JA4 in the test runtime? How are integration tests structured against simulated Fastly geo / TLS metadata? Existing integrations test inline (e.g. crates/trusted-server-core/src/integrations/permutive.rs:643+) but there is no harness for TLS-fingerprint or ASN signals. A short §test-strategy (unit + integration + a sample-crawler end-to-end check) would close this.

📌 out of scope

  • Phase-2 OLP key distribution / rotation unspecified — line 994

⛏ nitpick

  • 402 RSL fragment omits <copyright> — line 568
  • §5.3 example doesn't exercise the "prohibition wins" rule — line 494

CI Status (re-verified locally)

  • cargo fmt: PASS

  • cargo clippy: PASS

  • cargo test: PASS

  • vitest: PASS

  • format-docs: FAIL — still red 14 days after the prior review flagged it. One-line fix:

    cd docs && npx prettier --write superpowers/specs/2026-04-22-rsl-ai-crawler-licensing-design.md

| 3 | **JA4 TLS fingerprint match** | TLS ClientHello at edge | Strong (catches spoofed UAs) | Common LLM fetcher libraries: Python `requests`, `aiohttp`, `httpx`, Go `net/http`, Node `fetch`, cURL, Scrapy, Playwright, Puppeteer |
| 4 | **ASN classification** | IP → ASN lookup | Supporting signal only (never decisive alone) | Datacenter/hosting ASNs (AWS, GCP, Azure, DigitalOcean, Hetzner, OVH), VPN/proxy ASNs, residential ASNs |
| 5 | **H2 handshake presence** | Edge TLS/HTTP layer | Supporting signal (humans nearly always H2; many scrapers still H1) | All traffic |
| 6 | **`/robots.txt` and `/license.xml` fetch correlation** | TS request logs | Supporting signal (honest bots fetch before crawling) | All traffic |

🔧 wrench — Signal #6 (/robots.txt and /license.xml fetch correlation) needs cross-request state Fastly Compute does not provide.

This is the same root issue as the in-process ring buffer flagged earlier, but here it is a classification signal on the hot path, not just observability. Tracking "did this IP fetch /robots.txt recently?" requires cross-request lookups against Fastly KV / Edge Dictionary / Config Store, with cost and latency consequences that aren't acknowledged. As written, the signal is non-functional in the proposed architecture.

Fix: either drop signal #6 from the table, or re-spec it explicitly as a KV-backed correlation store with the request-budget impact called out in §3.4 and §3.9.

Sitemap: https://example.com/sitemap.xml
```

TS preserves the publisher's existing `robots.txt` content and prepends the

🔧 wrench — /robots.txt augmentation needs origin-fetch + cache infrastructure that does not exist today.

"TS preserves the publisher's existing robots.txt content and prepends the License: directive" implies an origin subrequest, body mutation, and an edge cache with an invalidation strategy. The only synthetic-body endpoint pattern in TS today is /static/tsjs= (publisher.rs:114-150), which generates output entirely from in-binary modules — no origin fetch. §3.4 doesn't list robots.txt as compiled-in state, and §3.7 has no fetcher module.

Fix: specify the mechanism explicitly — origin subrequest path, cache key + TTL, surrogate-key purge on license.toml change, and behavior when origin returns 404 / 5xx (the §6.5 "TS generates a minimal one" case). Or constrain the design to a fully-generated robots.txt and acknowledge that publisher origin-side robots.txt content is dropped.
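For the "specify the mechanism" option, a hedged sketch of the body-mutation step only — the function name and the fallback body are illustrative, not the spec's design, and the origin subrequest/caching around it is the part that still needs designing:

```rust
// Illustrative: prepend the RSL License: line to an origin-fetched
// robots.txt body, or synthesize a minimal file when origin returned
// no usable body (the §6.5 404/5xx case).
fn augment_robots_txt(origin_body: Option<&str>, license_url: &str) -> String {
    let license_line = format!("License: {license_url}\n");
    match origin_body {
        // Preserve the publisher's existing content beneath the directive.
        Some(body) => format!("{license_line}\n{body}"),
        // Origin 404/5xx fallback: minimal generated robots.txt.
        None => format!("{license_line}\nUser-agent: *\nAllow: /\n"),
    }
}
```

Even this trivial transform presupposes the subrequest, cache key + TTL, and purge-on-license-change story the spec currently omits.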

2. **`payment` types match RSL's payment vocabulary** — `purchase`,
`subscription`, `training`, `crawl`, `use`, `contribution`, `attribution`,
`free`.
3. **Route patterns are RFC 9309-compliant** (same syntax as robots.txt) —

🔧 wrench — License: directive is not RFC 9309.

This bullet says "Route patterns are RFC 9309-compliant (same syntax as robots.txt)". RFC 9309 defines User-agent, Disallow, Allow, Sitemap — there is no License: directive. The License: line in §6.5 is an RSL 1.0 extension to the robots.txt format, not part of RFC 9309. Conflating the two will mislead implementers and reviewers about standards posture.

Fix: clarify that the route-pattern syntax (wildcards, longest-match) is RFC 9309-compliant, while the License: line is an RSL 1.0 extension to robots.txt — and link the RSL spec section.

Set before running against real traffic:

1. **Classification accuracy:** 100% of honest AI crawlers (OpenAI, Anthropic,
Perplexity, Google-Extended, CCBot) correctly identified by UA + IP

🔧 wrench — Success criterion #1 cannot hold given the §4.3 IP-allowlist refresh model.

"100% of honest AI crawlers correctly identified by UA + IP allowlist signals" is a hard claim, but §4.3 says IP allowlists are bundled into TS releases ("Recommended publisher refresh cadence: weekly"). Between an operator publishing new IPs and the publisher rolling a new TS build, requests from those IPs match UA but miss the allowlist — falling to Ambiguous or StealthAiCrawler. The criterion as stated is unmeetable for any week-long observation window.

Fix: either qualify the criterion ("…against the bundled allowlist current at deploy time, measured within X days of release") or commit to runtime-loading the IP allowlist from a Fastly Config Store / KV refreshed by the control-plane job mentioned in §4.3.


- Signal: `asn:aws` ✓
- Signal: `ja4:python_requests` ✓
- Signal: `ua_spoofed_chrome` (UA claims Chrome but JA4 says Python) ✓

❓ question — How is ua_spoofed_chrome actually computed?

This stealth example detects "UA claims Chrome but JA4 says Python" — that requires both an LLM-library JA4 set and a known-good browser/Chrome JA4 reference set to detect the mismatch. §3.9 only budgets "~5–10 KB JA4 (a few hundred LLM fetcher fingerprints)" — there is no Chrome reference set in scope. Chrome JA4s additionally vary by version × OS × GREASE, so the reference set is non-trivial.

Without an answer here, the entire stealth branch (§4.5, §6.3) — which is the design's differentiator over plain UA matching — is non-functional as specified.
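For concreteness, a sketch of the mismatch check the signal presumably implies — helper names and the fingerprint set are invented here, and the hard part (building and maintaining both JA4 sets) is exactly what the question above asks about:

```rust
// Illustrative only: ua_spoofed_chrome fires when the UA claims a
// browser but the JA4 hash maps to a known fetcher-library fingerprint.
fn ua_claims_chrome(ua: &str) -> bool {
    // Crude family check for illustration; real parsing is more involved.
    ua.contains("Chrome/") && !ua.contains("Edg/")
}

fn ja4_is_fetcher_library(ja4: &str, fetcher_set: &[&str]) -> bool {
    fetcher_set.contains(&ja4)
}

fn ua_spoofed_chrome(ua: &str, ja4: &str, fetcher_set: &[&str]) -> bool {
    ua_claims_chrome(ua) && ja4_is_fetcher_library(ja4, fetcher_set)
}
```

Note this formulation needs only the LLM-fetcher set, not a Chrome reference set — but then it cannot distinguish "Chrome UA + unknown JA4" (new Chrome build) from "Chrome UA + unlisted fetcher", which is why the reference-set question matters.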


```toml
# trusted-server.toml
[integrations.rsl]

🤔 thinking — Two-config-file split deviates from the trusted-server.toml-everything pattern, no rationale given.

Other integrations declare config inline in trusted-server.toml (verified at settings.rs). RSL introduces license.toml + license.private.toml as separate files. The likely reason — publishers want to git-version license.toml publicly while keeping internal settings private — is plausible, but the spec doesn't say so.

Fix: add a sentence to §5.1 explaining why RSL is the exception (public version-controlled terms vs. operational settings), or fold the contents under [integrations.rsl.public] / [integrations.rsl.private] blocks in trusted-server.toml for consistency.


**Default:** Permissive. Block only confirmed crawlers whose license terms
prohibit access. Stealth crawlers and ambiguous traffic are allowed through
but logged for publisher review.

🤔 thinking — Default Permissive may hide stealth-detection mis-fires from publishers.

Most publishers will enable this integration to enforce. Defaulting to Permissive means false-negatives (stealth crawlers misclassified as Ambiguous, or honest crawlers blocked by a logic bug) are invisible until the publisher reads structured logs.

Worth justifying the rollout posture explicitly (safety-first staging, then opt-in to Strict per route) or recommending Strict-on-premium-routes as a default for new deployments. As written, the publisher gets minimal protection until they manually flip to Strict.

**Split architecture:**

- **Hot path (WASM at edge):** token validation only. HMAC check against a
shared signing key. Sub-millisecond. No KV writes.

📌 out of scope — Phase-2 OLP key distribution / rotation unspecified.

HMAC against a shared signing key works only if (a) the WASM has the key, (b) the publisher can rotate it, and (c) revocation is possible after compromise. None of these are in scope today, but they should appear in §11 open questions or §2.2 explicitly so the dependency isn't lost between phases.

Likely answer for Fastly: load via PlatformSecretStore at request time. State this — and acknowledge that key rotation while tokens are in flight needs a grace window that the design doesn't yet specify.

</payment>
</license>
</content>
</rsl>

⛏ nitpick — 402 RSL fragment omits <copyright>.

The 402 body inlines <content url="/premium/*"> with <license>...<payment> but no <copyright> element. The full /license.xml in §6.6 has it. RSL 1.0 attribution tracking expects <copyright> so consumers know the rights-holder for every fragment served.

Fix: keep at minimum a <copyright type="organization" contactEmail="..." contactUrl="...">Example Publisher, Inc.</copyright> element in 402 bodies.


### 5.5 Key Config Design Points

1. **`permits` / `prohibits` use RSL's usage vocabulary** — `search`, `ai-all`,

⛏ nitpick — §5.3 example doesn't actually exercise "prohibition wins."

permits = ["search", "ai-input"] and prohibits = ["ai-train", "ai-index"] are disjoint sets, so the precedence rule stated here is never tested by the example.

Fix: add an example with overlapping vocab — e.g., permits = ["ai-all"] paired with prohibits = ["ai-train"] — to demonstrate that prohibition narrows the broader permission. Otherwise readers may not realize the precedence is load-bearing.
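To show why the overlap matters, a sketch of the precedence rule itself — the vocabulary-expansion map here is illustrative, not the full RSL set:

```rust
// Illustrative: does a granted/prohibited term cover a concrete usage?
// Only `all` and `ai-all` expand in this toy map.
fn expands_to(term: &str, usage: &str) -> bool {
    term == usage
        || term == "all"
        || (term == "ai-all" && usage.starts_with("ai-"))
}

// Prohibition wins: a usage is allowed only if some permit covers it
// AND no prohibit covers it.
fn is_permitted(usage: &str, permits: &[&str], prohibits: &[&str]) -> bool {
    let permitted = permits.iter().any(|p| expands_to(p, usage));
    let prohibited = prohibits.iter().any(|p| expands_to(p, usage));
    permitted && !prohibited
}
```

With `permits = ["ai-all"]` and `prohibits = ["ai-train"]`, ai-train is denied while ai-input stays allowed — exactly the narrowing the disjoint §5.3 example never demonstrates.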

Development

Successfully merging this pull request may close these issues.

Write Spec for RSL Support