Align with the latest techniques for bypassing Cloudflare (2026) #4

@dipu-bd

Overview

This issue tracks a full refactor of the scraper's HTTP transport and browser automation layers to align them with the 2026 Cloudflare bypass research documented in the project wiki (linked under Reference).

Modern Cloudflare uses a composite trust score — a scraper must pass all active detection layers simultaneously. The current codebase fails on several of these layers. This refactor brings the entire repository up to the current standard.


Scope of Work

1. HTTP Client — Replace with curl_cffi

The current use of requests / httpx produces non-browser TLS fingerprints, wrong HTTP/2 frame ordering, and incorrect header sequences. These are flagged by Cloudflare before any page logic runs.

  • Replace requests and httpx (where used for scraping) with curl_cffi
  • Use impersonate="chrome" alias (not a pinned version like chrome120) — the alias auto-tracks the latest fingerprint including the X25519MLKEM768 post-quantum key share
  • Do not pass a custom headers={} dict when using impersonate — this breaks the correct header order that impersonation sets automatically
  • Add a check confirming that HTTP/2 and HTTP/3 frame ordering matches the impersonated browser (curl_cffi is expected to handle this by default)

2. Proxy Layer — Enforce Sticky Sessions

cf_clearance is cryptographically bound to the IP it was issued on. Rotating proxies between challenge-solve and page-fetch breaks the cookie silently, causing 403s that are hard to debug.

  • Enforce sticky session proxies throughout the pipeline — never rotate IP mid-session
  • Store one browser profile directory per proxy IP to maintain consistent session history
  • Document the sticky session requirement explicitly in the scraper's proxy configuration
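
The one-profile-per-proxy rule above could be enforced with a small helper along these lines. This is a minimal sketch, not the project's implementation: `profile_dir_for_proxy`, `StickySession`, and the `profiles` base directory are hypothetical names, and it assumes that each sticky proxy endpoint URL maps to exactly one exit IP.

```python
import hashlib
from pathlib import Path


def profile_dir_for_proxy(proxy_url: str, base_dir: str = "profiles") -> Path:
    """Return a stable profile directory for one sticky proxy endpoint.

    Assumption: the proxy URL (including any session token) identifies one
    exit IP, so hashing it yields one directory per IP.
    """
    digest = hashlib.sha256(proxy_url.encode("utf-8")).hexdigest()[:16]
    return Path(base_dir) / f"proxy-{digest}"


class StickySession:
    """Pins one proxy and its profile directory for the life of a session."""

    def __init__(self, proxy_url: str, base_dir: str = "profiles") -> None:
        self._proxy_url = proxy_url
        self.profile_dir = profile_dir_for_proxy(proxy_url, base_dir)

    @property
    def proxy_url(self) -> str:
        # Read-only on purpose: changing the proxy mid-session would
        # invalidate cookies issued for the original IP.
        return self._proxy_url
```

Making `proxy_url` read-only turns an accidental mid-session rotation into an immediate `AttributeError` instead of a hard-to-debug 403 later.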

3. Browser Automation — Migrate to nodriver

Based on an independent June 2026 benchmark (651 verdicts, 31 Cloudflare targets), nodriver was the only tool to pass all 31 targets with zero blocks. The Playwright-based tools (Camoufox, Patchright, CloakBrowser) remain detectable at the automation-protocol level because Playwright drives the browser for them internally (over CDP in the Chromium-based tools).

  • Replace playwright / playwright-extra stealth plugin (deprecated Feb 2025, now reliably detected) with nodriver
  • Add --disable-webrtc to all browser launch args — WebRTC STUN leaks the real machine IP even through a proxy
  • Use headless=False — headless mode produces wrong GPU renderer strings (SwiftShader/Mesa), wrong screen dimensions, and JS timing anomalies
  • Enable persistent user_data_dir so browser profiles accumulate history across sessions

4. Challenge Solving — Cookie Reuse Pattern

Launching a full browser for every scraped page is expensive. The correct production pattern is: solve the JS challenge once with nodriver, extract cf_clearance + __cf_bm, then hand off to curl_cffi for all subsequent page requests.

  • Implement the solve-once / reuse-many pattern (see Scenario B in the wiki)
  • Extract both cf_clearance and __cf_bm after solving — missing __cf_bm causes re-challenges even when cf_clearance is still valid
  • Pass the exact same User-Agent string from the browser solve session to curl_cffi
  • Monitor for 403 / silent 200-with-challenge-body to detect cookie expiry and trigger a re-solve
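
The solve-once / reuse-many flow above can be sketched as a small cache around an injected solver. This is an illustrative outline only: `SolvedSession`, `ClearanceCache`, and the `solve` callable are hypothetical names, and the real browser solve and HTTP hand-off are deliberately left out. Following the bullets above, it treats a solve that lacks either cookie as a failure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Both cookies are required per the checklist above.
REQUIRED_COOKIES = ("cf_clearance", "__cf_bm")


@dataclass(frozen=True)
class SolvedSession:
    cookies: Dict[str, str]
    user_agent: str  # must be reused verbatim by the HTTP client
    proxy_url: str   # the sticky proxy the challenge was solved through


class ClearanceCache:
    """Calls the (expensive) solver once and reuses its result until invalidated."""

    def __init__(self, solve: Callable[[], SolvedSession]) -> None:
        self._solve = solve
        self._session: Optional[SolvedSession] = None
        self.solve_count = 0

    def get(self) -> SolvedSession:
        if self._session is None:
            session = self._solve()
            missing = [n for n in REQUIRED_COOKIES if n not in session.cookies]
            if missing:
                raise ValueError(f"solver returned no {', '.join(missing)} cookie")
            self._session = session
            self.solve_count += 1
        return self._session

    def invalidate(self) -> None:
        """Call on a 403 or a 200 challenge body to force a re-solve."""
        self._session = None
```

Page-fetch code would call `get()` before each request and `invalidate()` when it detects cookie expiry, so the browser is launched only when a re-solve is actually needed.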

5. Behavioral Signals — Session Hygiene

Cloudflare scores behavioral signals with per-zone ML models, so each site is evaluated independently. No single tool fixes this; it requires consistent session hygiene across the whole pipeline.

  • Add session warm-up: visit the homepage first, let JS execute fully, accumulate cookies, then navigate to target pages via a realistic referrer chain
  • Set Referer headers to match natural in-site navigation (e.g. listing page → detail page)
  • Replace fixed time.sleep() delays with randomized intervals (random.uniform(1.5, 5.0) or gamma-distributed with mean ~3s)
  • Cap parallel sessions at 1–3 per IP — 10+ parallel sessions from one IP is itself a bot signal
  • Persist cookies and session state across scraping runs — never start a fresh cookieless session per run
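
The randomized-interval bullet above could look roughly like this. It is a sketch under assumptions: `human_delay` is a hypothetical helper, and the shape, clamp bounds, and 3-second mean are illustrative values rather than tuned ones.

```python
import random
from typing import Optional

_DEFAULT_RNG = random.Random()


def human_delay(
    rng: Optional[random.Random] = None,
    mean: float = 3.0,
    shape: float = 2.0,
    low: float = 0.5,
    high: float = 15.0,
) -> float:
    """Return a gamma-distributed delay in seconds, clamped to [low, high].

    A gamma distribution has mean = shape * scale, so the scale is derived
    from the requested mean.
    """
    rng = rng or _DEFAULT_RNG
    scale = mean / shape
    return min(high, max(low, rng.gammavariate(shape, scale)))
```

A call site would replace `time.sleep(2)` with `time.sleep(human_delay())`. The gamma shape gives mostly short pauses with an occasional long one, which a uniform interval does not.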

6. Error Handling — Detect Silent Challenges

Cloudflare sometimes serves a JS challenge as a 200 OK response with challenge HTML in the body. The scraper currently has no detection for this, causing it to silently collect garbage.

  • Add is_challenge_page() detection (check body for cf-browser-verification, __cf_chl_, jschl_vc, checking your browser)
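  • Add proper handling for HTTP status codes and Cloudflare error codes (1010 and 1020 are Cloudflare error codes delivered in the body of a 403 response, not HTTP statuses):
      • 429 → exponential backoff; rotate IP after 3 failures
      • 403 (no Cloudflare error code) → rotate IP + rebuild session
      • 503 → wait 10–30s + solve challenge
      • 1010 → switch to nodriver/Camoufox
      • 1020 → switch to mobile carrier IP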
  • Add retry loop with a max attempt cap before raising
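
The detection and retry items above can be sketched as follows. This is an assumption-laden outline: the action names, `classify_response`, `backoff_delay`, and `fetch_with_retries` are hypothetical, the marker list is copied from the checklist, and the regex for reading a Cloudflare error code out of a 403 body is a heuristic that should be checked against real responses.

```python
import random
import re
from typing import Callable, Tuple

CHALLENGE_MARKERS = (
    "cf-browser-verification", "__cf_chl_", "jschl_vc", "checking your browser",
)
# Heuristic: matches text such as "Error 1020" or "error code: 1010".
_CF_ERROR = re.compile(r"error(?:\s+code)?:?\s*(\d{4})", re.IGNORECASE)


def is_challenge_page(body: str) -> bool:
    """True if the body contains any known challenge marker (case-insensitive)."""
    lowered = body.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def classify_response(status: int, body: str) -> str:
    """Map a response to the recovery action named in the checklist."""
    if status == 200:
        return "solve_challenge" if is_challenge_page(body) else "ok"
    if status == 429:
        return "backoff"
    if status == 503:
        return "solve_challenge"
    if status == 403:
        match = _CF_ERROR.search(body)
        code = match.group(1) if match else ""
        if code == "1010":
            return "switch_browser"
        if code == "1020":
            return "switch_ip_type"
        return "rotate_ip"
    return "fail"


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with jitter; attempt is zero-based."""
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)


def fetch_with_retries(
    fetch: Callable[[], Tuple[int, str]],
    recover: Callable[[str, int], None],
    max_attempts: int = 4,
) -> str:
    """Retry fetch() up to max_attempts, calling recover(action, attempt) between tries."""
    for attempt in range(max_attempts):
        status, body = fetch()
        action = classify_response(status, body)
        if action == "ok":
            return body
        if action == "fail":
            raise RuntimeError(f"unexpected status {status}")
        recover(action, attempt)
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

Here `fetch` and `recover` are injected so the loop stays independent of the HTTP client; `recover` is where sleeping for `backoff_delay(attempt)`, rotating the IP, or triggering a re-solve would happen.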

7. Link Crawling — Avoid AI Labyrinth

AI Labyrinth (launched March 2025) embeds hidden nofollow links that lead to a maze of AI-generated decoy content. A scraper that follows them gets its fingerprint flagged network-wide and fills its database with garbage — silently.

  • Filter out rel="nofollow" links at the link-extraction step
  • Only follow links that are visually present in the rendered DOM (not hidden via display:none, zero dimensions, or off-screen positioning)
  • Add content relevance validation — if topic drift is detected mid-crawl, abort the session and discard results
  • Limit crawl depth; avoid aggressive depth-first traversal of every link found
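
The first two filters above can be approximated at the HTML level as shown below. This is a rough sketch: `VisibleLinkExtractor` and `extract_followable_links` are hypothetical names, and the code only sees `rel`, the `hidden` attribute, and inline `style`. Links hidden through stylesheets, zero dimensions, or off-screen positioning need a check against the rendered DOM, which this example does not attempt.

```python
import re
from html.parser import HTMLParser
from typing import List

_HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE)


class VisibleLinkExtractor(HTMLParser):
    """Collects hrefs from <a> tags that are neither nofollow nor inline-hidden."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # Valueless attributes such as `hidden` arrive with value None.
        attr = {name: (value or "") for name, value in attrs}
        href = attr.get("href", "").strip()
        rel_tokens = attr.get("rel", "").lower().split()
        if not href or "nofollow" in rel_tokens:
            return
        if "hidden" in attr or _HIDDEN_STYLE.search(attr.get("style", "")):
            return
        self.links.append(href)


def extract_followable_links(html: str) -> List[str]:
    parser = VisibleLinkExtractor()
    parser.feed(html)
    return parser.links
```

Running this before links are queued keeps decoy URLs out of the crawl frontier entirely, instead of relying on content validation to catch them afterwards.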

8. Remove / Deprecate Dead Tools

The following are confirmed dead or obsolete in 2026 and should be removed from the codebase and documentation:

  • Remove cloudscraper (challenge format changed; not maintained)
  • Remove FlareSolverr integration (fingerprint detected reliably since early 2025)
  • Remove playwright-extra stealth plugin (deprecated Feb 2025)
  • Remove hardcoded impersonate version strings (e.g. chrome120) — replace with "chrome" alias

9. Documentation Updates

  • Update README / scraper docs to reflect the current recommended stack per scenario (A/B/C/D/E from the wiki)
  • Document the Wayback Machine CDX API as Scenario E: the cheapest bypass for targets where slightly stale content is acceptable
  • Document the API interception approach (Scenario C) as the first thing to attempt before investing in browser automation
  • Add a note on Web Bot Auth (Layer 15) — currently fails open, but IETF standardization and multi-CDN rollout mean enforcement is coming; document per-target monitoring
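
For the Scenario E documentation, a short example of building a CDX lookup may help readers. The sketch below only constructs URLs and performs no network requests. `cdx_query_url` and `snapshot_url` are hypothetical helper names, and the parameters used (`output=json`, a `statuscode` filter, and a negative `limit` to request the most recent captures) reflect the public CDX server interface as I understand it, so they should be verified against the Internet Archive's documentation before being copied into the docs.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(target_url: str, limit: int = 5) -> str:
    """Build a CDX query for the most recent successful captures of target_url."""
    params = {
        "url": target_url,
        "output": "json",
        "filter": "statuscode:200",
        # Assumption: a negative limit asks for the last N captures.
        "limit": -abs(limit),
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_url(timestamp: str, original: str) -> str:
    """Build the replay URL for one capture (timestamp as returned by CDX)."""
    return f"https://web.archive.org/web/{timestamp}/{original}"
```

The scraper would fetch the query URL with any ordinary HTTP client, pick a capture row, and request the corresponding snapshot URL, never contacting the origin site.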

Reference

Full technical breakdown: https://github.com/lncrawl/scraper/wiki/Bypassing-Cloudflare

Priority Order

  1. Items 1–3 (HTTP client, proxies, browser) — these are the breaking changes with the highest impact
  2. Item 4 (cookie reuse) — highest leverage for production efficiency
  3. Items 5–7 (behavioral hygiene, error handling, link filtering) — essential for reliability
  4. Items 8–9 (cleanup and docs) — can be done incrementally alongside the above

Labels

enhancement: New feature or request