Overview
This issue tracks a full refactor of the scraper's HTTP transport and browser automation layers to align with the 2026 Cloudflare bypass research.
Modern Cloudflare uses a composite trust score — a scraper must pass all active detection layers simultaneously. The current codebase fails on several of these layers. This refactor brings the entire repository up to the current standard.
Scope of Work
1. HTTP Client — Replace with curl_cffi
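The current use of requests / httpx produces non-browser TLS fingerprints, wrong HTTP/2 frame ordering, and incorrect header sequences. Cloudflare flags these before any page logic runs.
- Replace requests and httpx (where used for scraping) with curl_cffi
- Use the impersonate="chrome" alias, not a pinned version like chrome120; the alias auto-tracks the latest fingerprint, including the X25519MLKEM768 post-quantum key share
- Do not pass a headers={} dict when using impersonate; this breaks the correct header order that impersonation sets automatically
- […] curl_cffi (it is, by default)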
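A minimal sketch of the client swap, assuming curl_cffi is installed. The helper names `build_request_kwargs` and `fetch` are hypothetical and not part of the current codebase; the point is that the request carries `impersonate="chrome"` and no `headers` argument.

```python
from typing import Optional


def build_request_kwargs(proxy: Optional[str] = None, timeout: float = 30.0) -> dict:
    """Keyword arguments for a curl_cffi request (hypothetical helper).

    Deliberately omits `headers` so the impersonation profile keeps
    control of header order.
    """
    kwargs = {"impersonate": "chrome", "timeout": timeout}
    if proxy:
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    return kwargs


def fetch(url: str, proxy: Optional[str] = None) -> str:
    """Fetch a page body using curl_cffi's Chrome impersonation."""
    # Imported lazily so build_request_kwargs stays usable without curl_cffi.
    from curl_cffi import requests as cffi_requests

    response = cffi_requests.get(url, **build_request_kwargs(proxy))
    response.raise_for_status()
    return response.text
```

Keeping the keyword arguments in one function gives a single place to review if someone later tries to add a manual header dict.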
2. Proxy Layer — Enforce Sticky Sessions
cf_clearance is cryptographically bound to the IP it was issued on. Rotating proxies between challenge-solve and page-fetch breaks the cookie silently, causing 403s that are hard to debug.
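One way to make the IP binding hard to violate is to keep the proxy endpoint and the cookies issued through it in the same object. The sketch below uses a hypothetical `StickySession` class and assumes the proxy URL maps to one fixed exit IP (for example a provider's sticky-session endpoint); it is an illustration, not the repository's proxy layer.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StickySession:
    """Pairs one proxy endpoint with the cookies issued through it (hypothetical)."""

    proxy_url: str
    cookies: Dict[str, str] = field(default_factory=dict)

    def store_clearance(self, cookies: Dict[str, str]) -> None:
        """Remember cookies obtained while this proxy was in use."""
        self.cookies.update(cookies)

    def rotate_proxy(self, new_proxy_url: str) -> None:
        """Switch proxy and drop cookies, since they were bound to the old IP."""
        if new_proxy_url != self.proxy_url:
            self.proxy_url = new_proxy_url
            self.cookies.clear()

    def request_kwargs(self) -> dict:
        """Proxy and cookies always travel together into the HTTP client."""
        return {
            "proxies": {"http": self.proxy_url, "https": self.proxy_url},
            "cookies": dict(self.cookies),
        }
```

With this shape, rotating the proxy forces a fresh challenge solve instead of silently sending a cookie that no longer matches the exit IP.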
3. Browser Automation — Migrate to nodriver
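Based on an independent June 2026 benchmark (651 verdicts, 31 Cloudflare targets), nodriver was the only tool to pass all 31 targets with zero blocks. All Playwright-based tools (Camoufox, Patchright, CloakBrowser) remain detectable at the automation-protocol level because they drive the browser through Playwright's automation protocol internally.
- Replace the playwright / playwright-extra stealth plugin (deprecated Feb 2025, now reliably detected) with nodriver
- Add --disable-webrtc to all browser launch args; WebRTC STUN leaks the real machine IP even through a proxy
- Use headless=False; headless mode produces wrong GPU renderer strings (SwiftShader/Mesa), wrong screen dimensions, and JS timing anomalies
- Set user_data_dir so browser profiles accumulate history across sessions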
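The launch constraints for this item (non-headless, a reused profile directory, the WebRTC flag) can be collected in one place. This sketch assumes nodriver is installed and that `nodriver.start` accepts `headless`, `user_data_dir`, and `browser_args` keyword arguments. `build_launch_options` and `launch` are hypothetical names, and the `--disable-webrtc` flag is taken from this issue, not verified against Chromium's switch list.

```python
from pathlib import Path
from typing import Any, Dict


def build_launch_options(profile_dir: str) -> Dict[str, Any]:
    """Launch options for the browser automation layer (hypothetical helper)."""
    return {
        # Headed mode avoids the headless rendering and timing anomalies.
        "headless": False,
        # Reusing one directory lets the profile accumulate history.
        "user_data_dir": str(Path(profile_dir).expanduser()),
        # Flag as named in this issue; unverified assumption.
        "browser_args": ["--disable-webrtc"],
    }


async def launch(profile_dir: str):
    """Start a nodriver browser with the options above."""
    # Imported lazily so build_launch_options works without nodriver installed.
    import nodriver as uc

    return await uc.start(**build_launch_options(profile_dir))
```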
4. Challenge Solving — Cookie Reuse Pattern
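Launching a full browser for every scraped page is expensive. The production pattern is to solve the JS challenge once with nodriver, extract cf_clearance and __cf_bm, then hand off to curl_cffi for all subsequent page requests.
- Extract both cf_clearance and __cf_bm after solving; a missing __cf_bm causes re-challenges even when cf_clearance is still valid
- […] curl_cffi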
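The handoff step might look like the sketch below. It assumes the browser cookies have already been exported as a list of `{"name": ..., "value": ...}` mappings (that shape is an assumption, not nodriver's documented return type) and that curl_cffi's `Session` accepts `impersonate`, `cookies`, and `proxies`. `extract_clearance` and `make_session` are hypothetical names.

```python
from typing import Dict, Iterable, Mapping

REQUIRED_COOKIES = ("cf_clearance", "__cf_bm")


def extract_clearance(browser_cookies: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Pick out the two Cloudflare cookies; fail loudly if either is missing."""
    found = {
        cookie["name"]: cookie["value"]
        for cookie in browser_cookies
        if cookie.get("name") in REQUIRED_COOKIES
    }
    missing = [name for name in REQUIRED_COOKIES if name not in found]
    if missing:
        raise ValueError(f"challenge not solved, missing cookies: {missing}")
    return found


def make_session(cookies: Dict[str, str], proxy: str):
    """Build a curl_cffi session that reuses the cookies on the same proxy."""
    # Imported lazily so extract_clearance works without curl_cffi installed.
    from curl_cffi import requests as cffi_requests

    return cffi_requests.Session(
        impersonate="chrome",
        cookies=cookies,
        proxies={"http": proxy, "https": proxy},
    )
```

Raising when `__cf_bm` is absent surfaces an incomplete solve at handoff time, not later as unexplained re-challenges.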
5. Behavioral Signals — Session Hygiene
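Cloudflare's per-zone ML models score behavioral signals independently for each site. No single tool fixes this; it requires consistent session hygiene across the board.
- Set Referer headers to match natural in-site navigation (e.g. listing page → detail page)
- Replace fixed time.sleep() delays with randomized intervals (random.uniform(1.5, 5.0), or gamma-distributed with mean ~3s)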
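As an illustration of the pacing and navigation ideas, the sketch below draws gamma-distributed delays with a mean of about 3 seconds and pairs each detail URL with the listing page as its referer. The function names, the shape parameter, and the 1-second floor are assumptions chosen for the example. Because of the floor, the effective mean is slightly above `mean_seconds`.

```python
import random
from typing import Iterator, List, Optional, Tuple


def next_delay(mean_seconds: float = 3.0, shape: float = 2.0,
               floor: float = 1.0, rng: Optional[random.Random] = None) -> float:
    """Gamma-distributed pause in seconds, never shorter than `floor`."""
    if rng is None:
        rng = random.Random()
    # gammavariate(alpha, beta) has mean alpha * beta, so beta = mean / alpha.
    return max(floor, rng.gammavariate(shape, mean_seconds / shape))


def plan_visits(listing_url: str, detail_urls: List[str],
                rng: Optional[random.Random] = None) -> Iterator[Tuple[str, str, float]]:
    """Yield (url, referer, delay_before_request) for listing -> detail navigation."""
    for url in detail_urls:
        yield url, listing_url, next_delay(rng=rng)
```

A caller would sleep for the yielded delay before each request and send the yielded referer with it.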
6. Error Handling — Detect Silent Challenges
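Cloudflare sometimes serves a JS challenge as a 200 OK response with challenge HTML in the body. The scraper currently does not detect this, so it silently stores challenge pages as content.
- Add is_challenge_page() detection (check the body for cf-browser-verification, __cf_chl_, jschl_vc, "checking your browser")
- Handle error codes: 429 → exponential backoff + rotate IP after 3 failures; 403 → rotate IP + rebuild session; 503 → wait 10–30s + solve challenge; 1010 → switch to nodriver/Camoufox; 1020 → switch to mobile carrier IP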
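A sketch of body-based challenge detection and per-code handling. The markers are the four substrings this issue proposes checking for (`cf-browser-verification`, `__cf_chl_`, `jschl_vc`, `checking your browser`). The action labels, `classify`, and `backoff_seconds` are hypothetical names, and a real handler would map the labels to actual retry logic. Codes 1010 and 1020 are Cloudflare error codes and normally appear in the response body, not as HTTP statuses, so the caller is assumed to have parsed them out.

```python
CHALLENGE_MARKERS = (
    "cf-browser-verification",
    "__cf_chl_",
    "jschl_vc",
    "checking your browser",
)

# Hypothetical action labels, one per code listed in this issue.
ACTIONS = {
    429: "backoff_then_rotate_ip",
    403: "rotate_ip_and_rebuild_session",
    503: "wait_then_solve_challenge",
    1010: "switch_to_browser",
    1020: "switch_to_mobile_ip",
}


def is_challenge_page(body: str) -> bool:
    """True if the body contains any known challenge marker (case-insensitive)."""
    lowered = body.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2**attempt, limited to `cap` seconds."""
    return min(cap, base * (2 ** attempt))


def classify(status_code: int, body: str) -> str:
    """Decide what to do with a response; the body check runs before the status check."""
    if is_challenge_page(body):
        return "solve_challenge"
    return ACTIONS.get(status_code, "ok" if status_code < 400 else "fail")
```

Checking the body first is what catches the 200 OK case, since the status code alone looks healthy.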
7. Link Crawling — Avoid AI Labyrinth
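AI Labyrinth (launched March 2025) embeds hidden nofollow links that lead to a maze of AI-generated decoy content. A scraper that follows them gets its fingerprint flagged network-wide and silently fills its database with decoy content.
- Skip rel="nofollow" links at the link-extraction step
- Skip hidden links (display:none, zero dimensions, or off-screen positioning)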
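A minimal link filter along these lines, using only the standard library. It skips anchors marked `rel="nofollow"` and anchors whose inline `style` hides them. It does not see stylesheet rules, CSS classes, or hidden ancestor elements, so it is a first line of defence and not a complete check. The class and function names and the off-screen threshold (a negative offset of 100px or more) are assumptions for the example.

```python
import re
from html.parser import HTMLParser
from typing import List

# Inline-style patterns treated as "hidden": display:none, zero width/height,
# or a large negative left/top offset (off-screen positioning).
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none"
    r"|(?<![\w-])(?:width|height)\s*:\s*0(?:px)?\s*(?:;|$)"
    r"|(?<![\w-])(?:left|top)\s*:\s*-\d{3,}",
    re.IGNORECASE,
)


class VisibleLinkParser(HTMLParser):
    """Collects hrefs from <a> tags that are neither nofollow nor inline-hidden."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        if "nofollow" in (attr.get("rel") or "").lower().split():
            return
        if HIDDEN_STYLE.search(attr.get("style") or ""):
            return
        self.links.append(href)


def extract_followable_links(html: str) -> List[str]:
    """Return hrefs that pass the nofollow and inline-visibility checks."""
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links
```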
8. Remove / Deprecate Dead Tools
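These tools are confirmed dead in 2026 and should be removed from the codebase and documentation:
- cloudscraper (challenge format changed; not maintained)
- FlareSolverr integration (fingerprint detected reliably since early 2025)
- playwright-extra stealth plugin (deprecated Feb 2025)
- Pinned impersonate version strings (e.g. chrome120); replace with the "chrome" alias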
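9. Documentation Updates
- Update README / scraper docs to reflect the current recommended stack per scenario (A/B/C/D/E from the wiki)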
Reference
Full technical breakdown: https://github.com/lncrawl/scraper/wiki/Bypassing-Cloudflare
Priority Order
- Items 1–3 (HTTP client, proxies, browser) — these are the breaking changes with the highest impact
- Item 4 (cookie reuse) — highest leverage for production efficiency
- Items 5–7 (behavioral hygiene, error handling, link filtering) — essential for reliability
- Items 8–9 (cleanup and docs) — can be done incrementally alongside the above