HTTP scraper with Cloudflare bypass, browser fingerprint impersonation, stealth mode, proxy support, and a null-safe BeautifulSoup wrapper.
- Cloudflare handling - a convincing browser fingerprint (below) avoids most non-interactive challenges; managed / Turnstile / captcha challenges are detected and either auto-solved with a real browser (see Solving challenges) or surfaced as a clear, actionable exception
- Browser fingerprint impersonation (on by default) - a curl_cffi transport reproduces a real Chrome/Firefox TLS (JA3/JA4) and HTTP/2 fingerprint, with an automatic fallback to the httpx transport
- Composable engine - a middleware pipeline over a pluggable transport (scraper.engine), so throttling, stealth, proxy, challenge handling, and retries are independent, swappable stages
- Pluggable challenge solvers - auto-solve via a remote FlareSolverr/Byparr service (RemoteSolver, no extra deps) or an in-process browser (BrowserSolver, [browser] extra), or reuse a cf_clearance cookie solved elsewhere via apply_browser_clearance
- Accurate Client Hints - sec-ch-ua / sec-fetch-* derived from the chosen UA
- Stealth mode - human-like delays, randomized headers, browser quirks
- Proxy support - round-robin proxy rotation with Tor integration (rotate_proxy() for NEWNYM) and direct fallback
- Rate limiting - configurable per-request intervals and concurrency cap
- PageSoup - null-safe BeautifulSoup wrapper; selection methods never return None
- HTTP helpers - get_soup, get_json, get_image, get_file, and more

pip install lncrawl-scraper # includes curl_cffi for browser fingerprint impersonation
# optional extras:
pip install "lncrawl-scraper[brotli]" # decode brotli (br) responses (brotli)
pip install "lncrawl-scraper[image]" # get_image() support (Pillow)
pip install "lncrawl-scraper[browser]" # in-process BrowserSolver (nodriver + Xvfb)
pip install "lncrawl-scraper[all]" # all of the above

curl_cffi ships as a core dependency so impersonation works out of the box; the
remaining extras are optional and degrade gracefully when absent - without
brotli, the scraper simply stops advertising br encoding so responses stay
decodable.
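
The graceful degradation described above can be pictured with a small sketch. It is not the library's code: the helper name build_accept_encoding is hypothetical, and the only assumption is that the optional package is importable as brotli.

```python
from importlib.util import find_spec


def build_accept_encoding(have_brotli: bool) -> str:
    """Return an Accept-Encoding value that lists only decodable encodings."""
    encodings = ["gzip", "deflate"]
    if have_brotli:
        # Advertise br only when a decoder is actually installed.
        encodings.append("br")
    return ", ".join(encodings)


# Detect the optional dependency without importing it.
HAVE_BROTLI = find_spec("brotli") is not None
ACCEPT_ENCODING = build_accept_encoding(HAVE_BROTLI)
```

Because the header is derived from what is installed, a server never sends an encoding the client cannot decode.
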
from scraper import Scraper
s = Scraper(origin="https://example.com")
# HTML
soup = s.get_soup("https://example.com/page")
title = soup.select_one("h1.title").text # "" if not found, never raises
links = [a["href"] for a in soup.select("a")]
# JSON
data = s.get_json("https://example.com/api/data")
# File download
s.get_file("https://example.com/file.zip", output_file="file.zip")
# Image (returns PIL.Image)
img = s.get_image("https://example.com/cover.jpg")

Runnable examples live in examples/ - run any with
uv run python examples/<file>.py.
| Example | Shows |
|---|---|
| 01_basic_html.py | Fetch a page and extract data with get_soup / PageSoup |
| 02_pagesoup_parsing.py | PageSoup tour: CSS select, attrs, navigation, XPath |
| 03_json_api.py | get_json / post_json and raw Response access |
| 04_files_and_images.py | get_file (streamed, atomic) and get_image (Pillow) |
| 05_forms_cookies_headers.py | submit_form, set_header, set_cookie, reset |
| 06_configuration.py | ScraperConfig, default_config(), stealth, browser identity |
| 07_impersonation.py | Real browser TLS/HTTP-2 fingerprint via impersonate |
| 08_browser_clearance.py | Reuse a cf_clearance solved by a real browser |
| 09_proxies.py | Round-robin proxy rotation with direct fallback |
| 10_concurrency_and_abort.py | Threaded fetches and cooperative cancellation via close() |
| 11_error_handling.py | HTTP, Cloudflare, and abort error handling |
| 12_browser_auto_solve.py | Auto-solve challenges with BrowserSolver (nodriver) |
| 13_remote_auto_solve.py | Auto-solve challenges with RemoteSolver (FlareSolverr/Byparr) |
| 14_tor_proxy.py | Tor proxy with rotate_proxy() for a fresh exit circuit |
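
The 04_files_and_images.py row describes get_file as streamed and atomic. One common way to get that property, sketched here with the standard library only, is to write chunks to a temporary file in the destination directory and rename it into place at the end. The function name write_atomically and the chunk iterable are assumptions for illustration; the library's own implementation may differ.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def write_atomically(chunks: Iterable[bytes], output_file: str) -> Path:
    """Stream chunks to a temp file, then rename it over output_file."""
    target = Path(output_file)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        os.replace(tmp_name, target)  # rename into place in a single step
    except BaseException:
        # An error or abort mid-stream leaves no partial file behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

With this shape, a reader of the output path sees either the old file or the complete new one, never a half-written download.
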
Pass a ScraperConfig for full control:
from scraper import Scraper
from scraper.config import ScraperConfig, ProxyConfig, StealthConfig, BrowserConfig
config = ScraperConfig(
    min_request_interval=2.0,
    max_concurrent_requests=1,
    rotate_tls_ciphers=True,
    stealth=StealthConfig(
        enabled=True,
        min_delay=1.0,
        max_delay=3.0,
        human_like_delays=True,
        randomize_headers=True,
        browser_quirks=True,
    ),
    browser=BrowserConfig(
        browser="firefox",
        platform="windows",
        desktop=True,
    ),
    proxy=ProxyConfig(
        fallback_to_direct=True,
        proxy_urls=[
            "socks5://torproxy:9150",
            "http://proxy1:8080",
            "http://proxy2:8080",
        ],
    ),
)
s = Scraper(origin="https://example.com", config=config)

Or start from the library's tuned defaults and tweak:
from scraper import Scraper, default_config
config = default_config()
config.max_concurrent_requests = 4
s = Scraper(origin="https://example.com", config=config)

A plain requests stack has a fixed OpenSSL TLS fingerprint and only speaks
HTTP/1.1 - both of which modern Cloudflare detects. The curl_cffi transport
reproduces a real browser's TLS (JA3/JA4) and HTTP/2 fingerprint, and
default_config() enables it (impersonate.target = "chrome") out of the box.
Pick a different target, or disable impersonation to force the httpx transport:
from scraper import Scraper, default_config
config = default_config()
config.impersonate.target = "firefox" # or "chrome124", "safari17_0", "edge", ...
s = Scraper(origin="https://example.com", config=config)
# disable impersonation -> httpx transport
config.impersonate.target = None

The spoofed User-Agent family and Client Hints are aligned with the impersonation
target automatically. If curl_cffi cannot be imported, the engine transparently
falls back to the httpx transport.
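
To make the alignment between User-Agent and Client Hints concrete, here is a simplified sketch of deriving sec-ch-ua headers from a UA string. The function client_hints_for is hypothetical and not part of the library; real Chromium builds also vary the brand order and the placeholder brand between versions, which this sketch does not attempt to reproduce.

```python
import re


def client_hints_for(user_agent: str, platform: str = "Windows") -> dict[str, str]:
    """Derive sec-ch-ua headers from a Chromium UA; empty dict otherwise."""
    match = re.search(r"Chrome/(\d+)", user_agent)
    if match is None:
        # Firefox and Safari do not send sec-ch-ua headers.
        return {}
    major = match.group(1)
    # Edge UAs contain both "Chrome/" and "Edg/"; the brand follows the latter.
    brand = "Microsoft Edge" if "Edg/" in user_agent else "Google Chrome"
    return {
        # Simplified brand list; the placeholder brand is an assumption.
        "sec-ch-ua": f'"Chromium";v="{major}", "{brand}";v="{major}", "Not-A.Brand";v="99"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": f'"{platform}"',
    }
```

The point of deriving the hints rather than hard-coding them is that a Chrome 124 User-Agent paired with version 120 hints is itself a detectable inconsistency.
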
Modern Cloudflare challenges (managed challenge / Turnstile / captcha) cannot be
solved in pure Python - they require a real browser. The engine detects them and,
if no solver is configured, raises a clear exception (CloudflareChallengeError,
CloudflareTurnstileError, ...). Configure cloudflare.solvers to pass them
automatically: the engine tries each solver in order until one obtains a
cf_clearance cookie, then retries the request transparently.
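
The try-each-solver-then-retry flow described above can be sketched as follows. Everything here is illustrative: solvers are modelled as plain callables returning a cookie value or None, send is a stand-in transport, and treating status 403 as "challenge detected" is a deliberate simplification of what the engine does.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical solver shape: takes a URL, returns a cf_clearance value or None.
Solver = Callable[[str], Optional[str]]
# Hypothetical transport: takes a URL and cookies, returns (status, body).
Send = Callable[[str, Dict[str, str]], Tuple[int, str]]


def first_clearance(url: str, solvers: Sequence[Solver]) -> Optional[str]:
    """Try each solver in order; return the first cf_clearance obtained."""
    for solve in solvers:
        try:
            clearance = solve(url)
        except Exception:
            # A failing solver should not prevent the next one from running.
            continue
        if clearance:
            return clearance
    return None


def fetch_with_solvers(url: str, send: Send, solvers: Sequence[Solver]) -> Tuple[int, str]:
    """Send once; on a challenge, obtain clearance and retry a single time."""
    status, body = send(url, {})
    if status != 403:  # simplification: 403 stands for "challenge detected"
        return status, body
    clearance = first_clearance(url, solvers)
    if clearance is None:
        raise RuntimeError("challenge detected and no solver obtained clearance")
    return send(url, {"cf_clearance": clearance})
```

Ordering the list from cheapest to most expensive solver means the costly option only runs when the cheaper ones fail.
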
Remote solver (recommended for servers) - run a FlareSolverr or Byparr container
and point RemoteSolver at it. This keeps the scraper itself lightweight (no
browser in its image) and needs no extra dependencies:
docker run -d -p 8191:8191 ghcr.io/flaresolverr/flaresolverr:latest

from scraper import Scraper, RemoteSolver, default_config
config = default_config()
config.cloudflare.solvers = [RemoteSolver("http://localhost:8191")]
s = Scraper(origin="https://protected.example.com", config=config)

In-process browser solver - drives Chrome via nodriver
(pip install "lncrawl-scraper[browser]"). It opens a real browser window so the
user can view and interact with the challenge. Cloudflare detects true headless Chrome,
so use RemoteSolver for GUI-less server environments:
from scraper import Scraper, BrowserSolver, default_config
config = default_config()
config.cloudflare.solvers = [BrowserSolver()]
# Optional: persist Chrome profile to skip solved challenges on repeat runs
# config.cloudflare.solvers = [BrowserSolver(user_data_dir="/tmp/chrome-profile")]
s = Scraper(origin="https://protected.example.com", config=config)

Both implement the ClearanceSolver protocol, so you can plug in your own
(Camoufox, SeleniumBase, a captcha service, ...) by providing a solve() method.
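
As a rough illustration of what "providing a solve() method" means in Python terms, the sketch below uses structural typing. The exact ClearanceSolver signature is not shown in this README, so SolverLike, its url argument, and the returned dict are all assumptions; check the library source before writing a real solver.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SolverLike(Protocol):
    """Hypothetical stand-in for ClearanceSolver; the real signature may differ."""

    def solve(self, url: str) -> dict: ...


class StaticSolver:
    """Toy solver that returns a pre-obtained cookie and User-Agent."""

    def __init__(self, cf_clearance: str, user_agent: str) -> None:
        self.cf_clearance = cf_clearance
        self.user_agent = user_agent

    def solve(self, url: str) -> dict:
        return {"cf_clearance": self.cf_clearance, "user_agent": self.user_agent}


solver = StaticSolver("abc123", "Mozilla/5.0 (example)")
# No inheritance needed: having a solve() method is enough to match the protocol.
assert isinstance(solver, SolverLike)
```
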
Cloudflare binds cf_clearance to the User-Agent and the IP/TLS fingerprint. When a solver returns clearance, the engine automatically adopts the solver's exact User-Agent and aligns the curl_cffi impersonation to that browser's family + major version, so the reused cookie validates. The one thing it can't control is the egress IP - run the solver behind the same IP/proxy as the scraper.
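
Aligning to a browser's family and major version starts with reading both out of the solver's User-Agent. The sketch below shows one way to do that; family_and_major is a hypothetical helper, and how the engine actually maps the result onto a curl_cffi target is not covered here.

```python
import re
from typing import Optional

# Checked in order: Edge UAs also contain "Chrome/", and Chrome UAs also contain "Safari/".
_PATTERNS = [
    ("edge", re.compile(r"Edg/(\d+)")),
    ("chrome", re.compile(r"Chrome/(\d+)")),
    ("firefox", re.compile(r"Firefox/(\d+)")),
    ("safari", re.compile(r"Version/(\d+).*Safari/")),
]


def family_and_major(user_agent: str) -> Optional[tuple[str, int]]:
    """Return (family, major version) for a browser UA, or None if unrecognised."""
    for family, pattern in _PATTERNS:
        match = pattern.search(user_agent)
        if match:
            return family, int(match.group(1))
    return None
```

The pattern order matters: testing for Chrome before Edge, or Safari before Chrome, would misclassify the more specific browser as the more general one.
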
If you solve the challenge elsewhere, hand the cf_clearance cookie and the
browser's exact User-Agent to the session directly:
s.apply_browser_clearance(
    "https://protected.example.com",
    cf_clearance="<value from the browser>",
    user_agent="<the browser's exact UA>",
    cookies={"__cf_bm": "<optional>"},
)

| Method | Description |
|---|---|
| get(url, **kwargs) | GET request, returns Response |
| post(url, **kwargs) | POST request, returns Response |
| ping(url, timeout=5) | HEAD request for reachability check |
| submit_form(url, data, ...) | POST with form encoding or multipart |
| get_json(url, headers, ...) | GET and parse response as JSON |
| post_json(url, data, ...) | POST and parse response as JSON |
| get_soup(url, headers, ...) | GET and return a PageSoup |
| post_soup(url, data, ...) | POST and return a PageSoup |
| get_image(url, ...) | GET and return a PIL.Image |
| get_file(url, output_file, ...) | Stream download to file (abort-safe) |
| make_soup(data, encoding, ...) | Parse Response, bytes, or str into PageSoup |
| set_header(key, value) | Set a default session header |
| set_cookie(name, value) | Set a session cookie |
| put_cookie(name, value, ...) | Set a cookie on session + jar |
| apply_browser_clearance(...) | Reuse a browser cf_clearance |
| rotate_proxy() | New Tor circuit (NEWNYM) or advance to next proxy |
| reset() | Clear cookies, headers, and state |
| close() | Abort in-progress requests and release resources |

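
The "advance to next proxy" behaviour in the rotate_proxy() row, combined with the fallback_to_direct option, amounts to round-robin rotation with a direct connection as the last resort. The class below is a standalone sketch of that idea, not the library's implementation: ProxyRotator and its method names are hypothetical, and the Tor NEWNYM signalling is left out.

```python
from typing import Callable, Optional, Sequence


class ProxyRotator:
    """Round-robin over proxy URLs; None stands for a direct connection."""

    def __init__(self, proxy_urls: Sequence[str], fallback_to_direct: bool = True) -> None:
        self._urls = list(proxy_urls)
        self._fallback = fallback_to_direct
        self._index = 0

    def current(self) -> Optional[str]:
        return self._urls[self._index] if self._urls else None

    def rotate(self) -> Optional[str]:
        """Advance to the next proxy, wrapping around at the end."""
        if self._urls:
            self._index = (self._index + 1) % len(self._urls)
        return self.current()

    def attempt(self, request: Callable[[Optional[str]], str]) -> str:
        """Try each proxy once, starting from the current one, then go direct."""
        last_error: Optional[Exception] = None
        for _ in range(len(self._urls)):
            try:
                return request(self.current())
            except ConnectionError as exc:
                last_error = exc
                self.rotate()
        if self._fallback or last_error is None:
            return request(None)
        raise last_error
```

Keeping the index on the instance means a proxy that just failed is not the first one tried on the next request.
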
Scraper composes a scraper.engine.Engine (available as scraper.engine); it is
not an httpx.Client subclass, but mirrors the common verb methods (get,
post, head, put, patch, delete, options) plus headers/cookies.
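
For readers unfamiliar with the pattern, composing middleware stages over a transport can be sketched in a few lines. This is a generic illustration, not the Engine's real interface: build_pipeline, the request dict, and the two sample stages are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Request = Dict[str, Any]
Handler = Callable[[Request], str]
Middleware = Callable[[Request, Handler], str]


def _wrap(middleware: Middleware, call_next: Handler) -> Handler:
    def handler(req: Request) -> str:
        return middleware(req, call_next)
    return handler


def build_pipeline(middlewares: List[Middleware], transport: Handler) -> Handler:
    """Wrap the transport so the first middleware in the list runs outermost."""
    handler = transport
    for middleware in reversed(middlewares):
        handler = _wrap(middleware, handler)
    return handler


def add_header(req: Request, call_next: Handler) -> str:
    req.setdefault("headers", {})["x-demo"] = "1"  # mutate the request on the way in
    return call_next(req)


def retry_once(req: Request, call_next: Handler) -> str:
    try:
        return call_next(req)
    except ConnectionError:
        return call_next(req)  # one extra attempt, then let errors propagate


def fake_transport(req: Request) -> str:
    return f"GET {req['url']} headers={req.get('headers')}"


send = build_pipeline([retry_once, add_header], fake_transport)
```

Because each stage only sees the request and a call_next handler, stages can be reordered or swapped without touching the transport.
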
PageSoup wraps a BeautifulSoup Tag. Every selection method returns a PageSoup (never None); an empty PageSoup is falsy and returns safe defaults for all operations.
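
The null-object idea behind this can be shown without BeautifulSoup at all. The sketch below wraps the standard library's ElementTree instead; SafeNode is a hypothetical class for illustration and mirrors only a small part of what PageSoup offers.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


class SafeNode:
    """Null-object wrapper: lookups return a SafeNode, never None."""

    def __init__(self, element: Optional[ET.Element] = None) -> None:
        self._el = element

    def __bool__(self) -> bool:
        # An empty wrapper is falsy, so `if node:` still works as a presence check.
        return self._el is not None

    def find(self, path: str) -> "SafeNode":
        if self._el is None:
            return SafeNode()
        return SafeNode(self._el.find(path))

    def find_all(self, path: str) -> List["SafeNode"]:
        if self._el is None:
            return []
        return [SafeNode(el) for el in self._el.findall(path)]

    def __getitem__(self, name: str) -> str:
        return "" if self._el is None else self._el.get(name, "")

    @property
    def text(self) -> str:
        if self._el is None:
            return ""
        return "".join(self._el.itertext()).strip()
```

Chained lookups such as node.find("a").find("b").text then degrade to an empty string instead of raising AttributeError on None.
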
soup = s.get_soup("https://example.com")
# Selection
soup.select("ul li") # -> List[PageSoup]
soup.select_one(".title") # -> PageSoup (empty if not found)
soup.find("div", class_="content") # -> PageSoup
soup.find_all("a") # -> List[PageSoup]
soup.xpath("//div[@class='body']") # -> List[PageSoup]
soup.closest(".container") # -> nearest matching ancestor
soup.parents(".wrapper") # -> generator of matching ancestors
# Attribute access
el["href"] # get_attr shorthand, returns "" if missing
el.get_attr("src", default="/")
el.has_attr("data-id")
# Text / HTML
el.text # stripped text, always str
el.get_text(separator="\n")
el.inner_html
el.outer_html
# Navigation
el.parent
el.children # List[PageSoup], excludes text nodes
el.next_sibling
el.previous_sibling
# Mutation
soup.decompose(".ads") # remove elements matching selector
el.replace_with(new_el)
el.append(child)

uv is required. Clone the repo and install all dependencies including dev extras:
git clone https://github.com/lncrawl/scraper.git
cd scraper
uv sync --all-groups --all-extras

Tasks are managed with poethepoet:

| Command | Description |
|---|---|
| uv run poe lint | Run ruff + pyright |
| uv run poe lint-fix | Auto-fix ruff violations and reformat |
| uv run poe test | Run the test suite |
| uv run poe build | Lint -> test -> build wheel |
| uv run poe publish | Build -> publish to PyPI |

Tests live in tests/ and run with pytest:
uv run poe test
# or directly
uv run pytest
uv run pytest -v # verbose
uv run pytest tests/test_dummy.py # a single file

Mock HTTP with respx (a dev dependency) so tests make no real network calls.