Align with the latest techniques for bypassing Cloudflare (2026)

## Overview

This issue tracks a full refactor of the scraper's HTTP transport and browser automation layers to align with the [2026 Cloudflare bypass research](https://github.com/lncrawl/scraper/wiki/Bypassing-Cloudflare).

Modern Cloudflare uses a **composite trust score** — a scraper must pass *all active detection layers simultaneously*. The current codebase fails on several of these layers. This refactor brings the entire repository up to the current standard.

---

## Scope of Work

### 1. HTTP Client — Replace with `curl_cffi`

The current use of `requests` / `httpx` produces non-browser TLS fingerprints, wrong HTTP/2 frame ordering, and incorrect header sequences. These are flagged by Cloudflare before any page logic runs.

- [ ] Replace `requests` and `httpx` (where used for scraping) with `curl_cffi`
- [ ] Use `impersonate="chrome"` alias (not a pinned version like `chrome120`) — the alias auto-tracks the latest fingerprint including the `X25519MLKEM768` post-quantum key share
- [ ] Do **not** pass a custom `headers={}` dict when using `impersonate` — this breaks the correct header order that impersonation sets automatically
- [ ] Verify that `curl_cffi` reproduces browser HTTP/2 and HTTP/3 frame ordering under impersonation (expected by default; confirm against the installed version)
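
A minimal sketch of the request setup described in this checklist, assuming `curl_cffi` is installed. `build_request_kwargs` and `fetch` are hypothetical helper names, not existing scraper code. The point it illustrates is that the keyword arguments carry the `"chrome"` alias and deliberately no `headers` entry.

```python
from typing import Any, Optional


def build_request_kwargs(proxy: Optional[str] = None, timeout: float = 30.0) -> dict[str, Any]:
    """Build keyword arguments for a curl_cffi request (hypothetical helper).

    No "headers" key is set, so the impersonation profile keeps control
    of header names and their order.
    """
    kwargs: dict[str, Any] = {"impersonate": "chrome", "timeout": timeout}
    if proxy:
        # Same proxy for both schemes so the exit IP stays consistent.
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    return kwargs


def fetch(url: str, proxy: Optional[str] = None):
    """Fetch one URL. curl_cffi is imported lazily so the helper above
    stays importable (and testable) without the dependency."""
    from curl_cffi import requests  # third-party: pip install curl_cffi

    return requests.get(url, **build_request_kwargs(proxy))
```

Keeping the kwargs in one function gives a single place to assert in tests that nobody reintroduces a `headers` dict or a pinned version string.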

### 2. Proxy Layer — Enforce Sticky Sessions

`cf_clearance` is cryptographically bound to the IP it was issued on. Rotating proxies between challenge-solve and page-fetch breaks the cookie silently, causing 403s that are hard to debug.

- [ ] Enforce **sticky session proxies** throughout the pipeline — never rotate IP mid-session
- [ ] Store one browser profile directory per proxy IP to maintain consistent session history
- [ ] Document the sticky session requirement explicitly in the scraper's proxy configuration
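
One possible shape for the "one profile directory per proxy" rule, as a sketch only. It assumes each sticky session is identified by a stable string (for example the proxy URL including its session ID); `profile_dir_for` and `StickySession` are hypothetical names.

```python
import hashlib
from pathlib import Path


def profile_dir_for(proxy_session: str, root: Path = Path("profiles")) -> Path:
    """Map a sticky proxy session to a stable profile directory.

    The identifier is hashed so credentials embedded in a proxy URL
    never end up in a directory name.
    """
    digest = hashlib.sha256(proxy_session.encode("utf-8")).hexdigest()[:16]
    return root / digest


class StickySession:
    """Pins a scraping session to a single proxy for its whole lifetime."""

    def __init__(self, proxy_session: str, root: Path = Path("profiles")) -> None:
        self._proxy = proxy_session
        self.profile_dir = profile_dir_for(proxy_session, root)

    @property
    def proxy(self) -> str:
        return self._proxy

    def require_same_proxy(self, proxy_session: str) -> None:
        """Raise if a caller tries to switch proxy mid-session."""
        if proxy_session != self._proxy:
            raise ValueError("proxy changed mid-session; start a new session instead")
```

Failing loudly on a proxy change turns the hard-to-debug 403 described above into an immediate, local error.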

### 3. Browser Automation — Migrate to `nodriver`

Based on an independent June 2026 benchmark (651 verdicts, 31 Cloudflare targets), `nodriver` was the only tool to pass all 31 targets with zero blocks. All Playwright-based tools (Camoufox, Patchright, CloakBrowser) remain detectable at the automation-protocol level because they drive the browser through Playwright's own protocol layer (CDP for Chromium builds, Juggler for the Firefox-based Camoufox).

- [ ] Replace `playwright` / `playwright-extra` stealth plugin (deprecated Feb 2025, now reliably detected) with `nodriver`
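- [ ] Add `--disable-webrtc` to all browser launch args, because WebRTC STUN leaks the real machine IP even through a proxy. Verify that the installed Chromium build honors this flag; if it does not, `--force-webrtc-ip-handling-policy=disable_non_proxied_udp` is a Chromium switch that keeps WebRTC off non-proxied UDP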
- [ ] Use `headless=False` — headless mode produces wrong GPU renderer strings (`SwiftShader`/`Mesa`), wrong screen dimensions, and JS timing anomalies
- [ ] Enable persistent `user_data_dir` so browser profiles accumulate history across sessions
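
A sketch of how the launch settings in this checklist could be collected in one place. The keyword names (`headless`, `user_data_dir`, `browser_args`) follow `nodriver.start()` as I understand it and should be checked against the installed version. `build_launch_options` and `launch` are hypothetical helpers, and the WebRTC switch used here is a Chromium policy switch chosen for illustration; other flags can be supplied through `extra_args`.

```python
from pathlib import Path
from typing import Any, Optional, Sequence

# Chromium switch that keeps WebRTC from using non-proxied UDP.
WEBRTC_FLAG = "--force-webrtc-ip-handling-policy=disable_non_proxied_udp"


def build_launch_options(
    profile_dir: Path,
    proxy_server: Optional[str] = None,
    extra_args: Sequence[str] = (),
) -> dict[str, Any]:
    """Collect launch settings: headed mode, persistent profile, WebRTC policy."""
    args = [WEBRTC_FLAG, *extra_args]
    if proxy_server:
        # host:port form only; --proxy-server does not accept credentials.
        args.append(f"--proxy-server={proxy_server}")
    return {
        "headless": False,                  # headed mode, per the checklist
        "user_data_dir": str(profile_dir),  # persistent profile directory
        "browser_args": args,
    }


async def launch(profile_dir: Path, proxy_server: Optional[str] = None):
    """Start a browser with the options above (assumes nodriver is installed)."""
    import nodriver  # third-party: pip install nodriver

    return await nodriver.start(**build_launch_options(profile_dir, proxy_server))
```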

### 4. Challenge Solving — Cookie Reuse Pattern

Launching a full browser for every scraped page is expensive. The correct production pattern is: solve the JS challenge once with `nodriver`, extract `cf_clearance` + `__cf_bm`, then hand off to `curl_cffi` for all subsequent page requests.

- [ ] Implement the **solve-once / reuse-many** pattern (see Scenario B in the wiki)
- [ ] Extract **both** `cf_clearance` and `__cf_bm` after solving — missing `__cf_bm` causes re-challenges even when `cf_clearance` is still valid
- [ ] Pass the exact same User-Agent string from the browser solve session to `curl_cffi`
- [ ] Monitor for 403 / silent 200-with-challenge-body to detect cookie expiry and trigger a re-solve
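
The hand-off state for solve-once / reuse-many could be modelled roughly as below. This is a sketch with hypothetical names (`SolvedSession`, `REQUIRED_COOKIES`); it only captures the invariants listed above (both cookies present, same proxy, same User-Agent) and does not perform the solve itself.

```python
import time
from dataclasses import dataclass, field

REQUIRED_COOKIES = ("cf_clearance", "__cf_bm")


@dataclass(frozen=True)
class SolvedSession:
    """State captured after one browser solve, reused by the HTTP client."""

    cookies: dict
    user_agent: str
    proxy: str
    solved_at: float = field(default_factory=time.time)

    def missing_cookies(self) -> list:
        """Names of required cookies that are absent or empty."""
        return [name for name in REQUIRED_COOKIES if not self.cookies.get(name)]

    def usable_with(self, proxy: str, user_agent: str) -> bool:
        """True only if both cookies exist and proxy and UA match the solve."""
        return (
            not self.missing_cookies()
            and proxy == self.proxy
            and user_agent == self.user_agent
        )
```

When `usable_with` returns `False`, or a later response looks like a challenge, the caller would discard the object and trigger a fresh solve.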

### 5. Behavioral Signals — Session Hygiene

Per-zone ML models score behavioral signals independently per site. This cannot be fixed with a single tool — it requires consistent session hygiene across the board.

- [ ] Add session warm-up: visit the homepage first, let JS execute fully, accumulate cookies, then navigate to target pages via a realistic referrer chain
- [ ] Set `Referer` headers to match natural in-site navigation (e.g. listing page → detail page)
- [ ] Replace fixed `time.sleep()` delays with randomized intervals (`random.uniform(1.5, 5.0)` or gamma-distributed with mean ~3s)
- [ ] Cap parallel sessions at **1–3 per IP** — 10+ parallel sessions from one IP is itself a bot signal
- [ ] Persist cookies and session state across scraping runs — never start a fresh cookieless session per run
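
For the randomized-delay item, a gamma-distributed pause with a mean of about 3 seconds can be sketched with the standard library alone. `human_delay` is a hypothetical helper; the shape, minimum, and maximum values are illustrative assumptions rather than tuned numbers.

```python
import random
import time


def human_delay(
    mean: float = 3.0,
    shape: float = 2.0,
    minimum: float = 1.0,
    maximum: float = 15.0,
    rng=random,
) -> float:
    """Return a gamma-distributed delay in seconds, clamped to [minimum, maximum].

    For a gamma distribution mean = shape * scale, so scale = mean / shape.
    Clamping shifts the effective mean slightly above the nominal value.
    """
    scale = mean / shape
    return min(maximum, max(minimum, rng.gammavariate(shape, scale)))


def pause_between_requests(rng=random) -> None:
    """Drop-in replacement for a fixed time.sleep(n) call."""
    time.sleep(human_delay(rng=rng))
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests without touching the global generator.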

### 6. Error Handling — Detect Silent Challenges

Cloudflare sometimes serves a JS challenge as a `200 OK` response with challenge HTML in the body. The scraper currently has no detection for this, causing it to silently collect garbage.

- [ ] Add `is_challenge_page()` detection (check body for `cf-browser-verification`, `__cf_chl_`, `jschl_vc`, `checking your browser`)
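- [ ] Add explicit handling per HTTP status and per Cloudflare error code. Note that `1010` and `1020` are Cloudflare error codes reported in the body of a blocked response, not HTTP statuses, so check for them before applying the generic `403` rule:
  - `429` → exponential backoff; rotate IP after 3 failures
  - `403` → rotate IP and rebuild the session
  - `503` → wait 10–30s, then solve the challenge
  - error `1010` → switch to `nodriver`/Camoufox
  - error `1020` → switch to a mobile carrier IP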
- [ ] Add retry loop with a max attempt cap before raising
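
A minimal sketch of the detection and retry items above, using only the standard library. The marker strings come from the checklist; `backoff_delay`, `fetch_with_retry`, and `RetriesExhausted` are hypothetical names, and `fetch` is assumed to be any callable returning a `(status, body)` tuple. IP rotation and re-solving are intentionally left out.

```python
import random
import time

CHALLENGE_MARKERS = (
    "cf-browser-verification",
    "__cf_chl_",
    "jschl_vc",
    "checking your browser",
)


class RetriesExhausted(RuntimeError):
    """Raised when the attempt cap is reached without a clean response."""


def is_challenge_page(body: str) -> bool:
    """True if the body contains any known challenge marker (case-insensitive)."""
    lowered = body.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0, min(cap, base * (2 ** attempt)))


def fetch_with_retry(fetch, url: str, max_attempts: int = 4, sleep=time.sleep) -> str:
    """Call fetch(url) until it returns a 200 that is not a challenge page."""
    last_status = None
    for attempt in range(max_attempts):
        status, body = fetch(url)
        last_status = status
        if status == 200 and not is_challenge_page(body):
            return body
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    raise RetriesExhausted(
        f"{url}: gave up after {max_attempts} attempts (last status {last_status})"
    )
```

Treating a 200 with a challenge body the same as a failed attempt is what keeps challenge HTML out of the stored results.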

### 7. Link Crawling — Avoid AI Labyrinth

AI Labyrinth (launched March 2025) embeds hidden `nofollow` links that lead to a maze of AI-generated decoy content. A scraper that follows them silently fills its database with garbage and gets its fingerprint flagged network-wide.

- [ ] Filter out `rel="nofollow"` links at the link-extraction step
- [ ] Only follow links that are **visually present** in the rendered DOM (not hidden via `display:none`, zero dimensions, or off-screen positioning)
- [ ] Add content relevance validation — if topic drift is detected mid-crawl, abort the session and discard results
- [ ] Limit crawl depth; avoid aggressive depth-first traversal of every link found
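
The static part of the link filter can be sketched with the standard-library HTML parser. This only covers `rel="nofollow"`, the `hidden` attribute, and inline `display:none` / `visibility:hidden` styles; links hidden via stylesheets, zero dimensions, or off-screen positioning need a check against the rendered DOM, which this sketch does not attempt. `LinkExtractor` and `extract_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute hrefs from <a> tags that pass the static filters."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel may hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            return
        style = (attributes.get("style") or "").replace(" ", "").lower()
        if "hidden" in attributes or "display:none" in style or "visibility:hidden" in style:
            return
        self.links.append(urljoin(self.base_url, href))


def extract_links(html: str, base_url: str) -> list:
    """Return followable links from an HTML document, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```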

### 8. Remove / Deprecate Dead Tools

These tools are confirmed dead in 2026 and should be removed from the codebase and documentation:

- [ ] Remove `cloudscraper` (challenge format changed; not maintained)
- [ ] Remove `FlareSolverr` integration (fingerprint detected reliably since early 2025)
- [ ] Remove `playwright-extra` stealth plugin (deprecated Feb 2025)
- [ ] Remove hardcoded `impersonate` version strings (e.g. `chrome120`) — replace with `"chrome"` alias

### 9. Documentation Updates

- [ ] Update `README` / scraper docs to reflect the current recommended stack per scenario (A/B/C/D/E from the wiki)
- [ ] Document the Wayback Machine CDX API as Scenario E: the cheapest option for targets where slightly stale content is acceptable
- [ ] Document the API interception approach (Scenario C) as the first thing to attempt before investing in browser automation
- [ ] Add a note on Web Bot Auth (Layer 15) — currently fails open, but IETF standardization and multi-CDN rollout mean enforcement is coming; document per-target monitoring
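
For the Wayback Machine item (Scenario E), the docs could include something like the following sketch. It builds a CDX query URL and turns a JSON response into snapshot URLs without making any network call. `cdx_query_url` and `snapshot_urls` are hypothetical helpers, and the chosen parameters (`filter`, `fl`, negative `limit` for the most recent captures) are one reasonable configuration, not the only one.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(target_url: str, limit: int = 5) -> str:
    """Build a CDX query for the most recent successful captures of a URL."""
    params = {
        "url": target_url,
        "output": "json",
        "filter": "statuscode:200",   # only captures that returned 200
        "fl": "timestamp,original",   # fields to include in each row
        "limit": -limit,              # negative limit: last N results
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_urls(cdx_json: str) -> list:
    """Convert a CDX JSON response into replay URLs.

    With output=json the first row is a header naming the columns;
    an empty result is an empty list.
    """
    rows = json.loads(cdx_json)
    if not rows:
        return []
    header, *records = rows
    ts_i = header.index("timestamp")
    orig_i = header.index("original")
    return [f"https://web.archive.org/web/{r[ts_i]}/{r[orig_i]}" for r in records]
```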

---

## Reference

Full technical breakdown: https://github.com/lncrawl/scraper/wiki/Bypassing-Cloudflare

## Priority Order

1. Items 1–3 (HTTP client, proxies, browser) — these are the breaking changes with the highest impact
2. Item 4 (cookie reuse) — highest leverage for production efficiency
3. Items 5–7 (behavioral hygiene, error handling, link filtering) — essential for reliability
4. Items 8–9 (cleanup and docs) — can be done incrementally alongside the above
