Releases · lncrawl/scraper

11 Jun 09:01

github-actions

Immutable

v0.4.2

88a211a

v0.4.2 Latest

Changed

Engine.trigger_cancel and Scraper.trigger_cancel renamed to abort_on for
clarity: the method registers a signal that will trigger an abort; it does not
trigger one immediately.

Fixed

abort_on now uses an asyncio polling loop instead of run_in_executor, so no
non-daemon executor thread is created. The coroutine exits immediately when the
engine is aborted or closed, and no longer blocks process exit.
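
The polling approach described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: abort_when_set, the stopped event, the on_abort callback, and the 50 ms interval are all hypothetical names and values chosen for the example.

```python
import asyncio
import threading
from typing import Callable


async def abort_when_set(
    signal: threading.Event,
    stopped: asyncio.Event,
    on_abort: Callable[[], None],
    interval: float = 0.05,  # hypothetical polling interval
) -> bool:
    """Poll `signal` on the event loop and call `on_abort` once it is set.

    Returns True if the abort fired, False if `stopped` was set first.
    No executor thread is involved, so nothing outlives the loop.
    """
    while not stopped.is_set():
        if signal.is_set():
            on_abort()
            return True
        try:
            # Sleep for one interval, but wake at once if the engine stops.
            await asyncio.wait_for(stopped.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
    return False


async def demo() -> list:
    signal = threading.Event()
    stopped = asyncio.Event()
    calls = []
    # Another thread sets the signal shortly after the watcher starts.
    threading.Timer(0.1, signal.set).start()
    fired = await abort_when_set(signal, stopped, lambda: calls.append("aborted"))
    return [fired, calls]
```

Because the coroutine only ever awaits on the loop, cancelling it or setting stopped ends it promptly, whereas a blocking signal.wait() inside run_in_executor would keep a worker thread alive until the event is set.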

Full Changelog: v0.4.1...v0.4.2

11 Jun 07:30

github-actions

Immutable

v0.4.1

be33430

v0.4.1

Added

Engine.trigger_cancel(signal) and Scraper.trigger_cancel(signal): abort
the engine once a threading.Event is set. Uses run_in_executor on the
engine's own event loop -- no extra polling thread needed.

Full Changelog: v0.4.0...v0.4.1

08 Jun 19:11

github-actions

Immutable

v0.4.0

41e966d

v0.4.0

Breaking Changes

- requests dependency replaced with httpx[http2,socks]. Code that accesses
  scraper.engine.transport or constructs requests-style sessions directly
  must be updated.
- Scraper no longer has an abort_event attribute. Use CancelToken (now
  exported from scraper) and pass it as cancel_token= to any request method.
- EventLock removed from scraper.utils - it was an internal primitive
  superseded by CancelToken.
- JS-interpreter-based challenge handlers (CloudflareV1, CloudflareV2,
  CloudflareV3, TurnstileSolver) removed from scraper.engine.challenges.
  Challenge detection is now handled by CloudflareDetector; solving is
  delegated to pluggable ClearanceSolver implementations.
- scraper.engine.challenges module removed entirely; challenge classes are now
  at scraper.challenges (detection) and implemented by RemoteSolver /
  BrowserSolver.
- CloudflareConfig.solver (single) replaced by CloudflareConfig.solvers
  (list). The engine tries each in order.
- ClearanceSolver.solve_async renamed to solve (the sync solve wrapper is
  gone; the solver protocol is now purely async).
- ProxyManager.get_proxy() now returns str | None instead of
  dict[str, str] | None.
- RequestChain and RequestContext (engine.state / engine.context) merged
  into RequestState; import paths change accordingly.
- UrllibTransport removed; HttpxTransport is the new fallback when
  curl_cffi is unavailable.
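
To illustrate the "tries each in order" behaviour of a solver list with a purely async solve method, here is a hedged sketch. The Solver protocol, its solve(url) signature, the string return value, and the rule that an exception or None means "try the next one" are assumptions for this example; the release notes do not spell out the real ClearanceSolver interface.

```python
import asyncio
from typing import Optional, Protocol, Sequence


class Solver(Protocol):
    """Hypothetical async-only solver interface (not the library's ABC)."""

    async def solve(self, url: str) -> Optional[str]: ...


async def solve_in_order(solvers: Sequence[Solver], url: str) -> Optional[str]:
    """Return the first non-None result, trying solvers in list order."""
    for solver in solvers:
        try:
            result = await solver.solve(url)
        except Exception:
            # Assumption: a failing solver is skipped, not fatal.
            continue
        if result is not None:
            return result
    return None


class FailingSolver:
    async def solve(self, url: str) -> Optional[str]:
        raise RuntimeError("solver unavailable")


class StaticSolver:
    def __init__(self, cookie: str) -> None:
        self.cookie = cookie

    async def solve(self, url: str) -> Optional[str]:
        return self.cookie


clearance = asyncio.run(
    solve_in_order([FailingSolver(), StaticSolver("cf_clearance=abc")], "https://example.com")
)
```

Ordering the list from cheapest to most expensive solver keeps the common case fast while still allowing a heavier fallback.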

Added

- CancelToken - thread-safe per-request cancellation. Exported from scraper.
- ClearanceResult - dataclass holding a solved Cloudflare clearance
  (cookie, user_agent, domain, expires, cf_bm_expires, proxy_key).
  Exported from scraper.
- ClearanceSolver - ABC for pluggable challenge solvers. Exported from scraper.
- RemoteSolver - ClearanceSolver adapter for FlareSolverr / Byparr remote
  solving. Exported from scraper.
- BrowserSolver - ClearanceSolver using an in-process nodriver browser
  (requires the browser extra). Exported from scraper.
- scraper.challenges package: CloudflareDetector, CloudflareChallengeKind
  for programmatic challenge detection without a full solver.
- RequestHeaders utility (case-insensitive dict for HTTP headers) in
  scraper.utils.
- ClearanceStore - in-memory + optional on-disk cache for cf_clearance
  records, keyed by (domain, proxy_key).
- HttpxTransport - httpx.AsyncClient-backed fallback transport with
  per-proxy client pooling (replaces UrllibTransport).
- Engine pipeline is fully async: all middleware and the transport are
  coroutines; a daemon asyncio event loop runs in a background thread.
- anyio[trio] added as dev dependency; respx replaces responses for
  HTTP mocking in tests.
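
The "daemon asyncio event loop in a background thread" arrangement is a common way to offer a synchronous API on top of an async pipeline. The sketch below shows the general pattern using only the standard library; BackgroundLoop and fetch are hypothetical names, and the engine's actual wiring may differ.

```python
import asyncio
import threading
from typing import Any, Coroutine, Optional


class BackgroundLoop:
    """Run an asyncio loop in a daemon thread and submit coroutines to it."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        # daemon=True: the thread never keeps the process alive on exit.
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro: Coroutine[Any, Any, Any], timeout: Optional[float] = None) -> Any:
        """Block the calling thread until `coro` finishes on the loop."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout)

    def close(self) -> None:
        """Stop the loop from outside its thread, then release it."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fetch(value: int) -> int:
    # Stand-in for an async middleware + transport call.
    await asyncio.sleep(0.01)
    return value * 2


background = BackgroundLoop()
doubled = background.run(fetch(21))
background.close()
```

Synchronous callers only ever see run(), so helper methods can stay blocking while all I/O shares one loop.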

Changed

- SVG image support removed (cairosvg required a system-level library
  unavailable on Windows; dropped to keep the package installable on all
  platforms).
- build_transport now falls back to HttpxTransport when CurlCffiTransport
  initialization fails (e.g. missing native library), instead of raising.
- Package is now also published to GitHub Packages on each release (alongside
  PyPI), installable with
  pip install --index-url https://pypi.pkg.github.com/lncrawl/lncrawl-scraper.

Fixed

- Python 3.9 compatibility in CurlCffiTransport: the perk kwarg (absent in
  curl_cffi 0.13.x, the last release supporting Python 3.9) is now guarded by
  a runtime capability flag.
- Tor cooldown boundary: requests arriving exactly at the cooldown threshold are
  now correctly held back (<= comparison instead of <).
- Tor cooldown sentinel changed to float('-inf') so the first rotation always
  succeeds - the previous sentinel of 0 caused flakiness on fresh CI containers
  where time.monotonic() had not yet exceeded the cooldown window.
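
Both Tor cooldown fixes come down to a few lines of arithmetic. The following sketch assumes a hypothetical RotationCooldown class with an injectable clock (for testing); it is not the library's code, but it shows why -inf is a safer "never rotated" sentinel than 0 and what the <= boundary does.

```python
import time
from typing import Callable


class RotationCooldown:
    """Allow a rotation only when more than `cooldown` seconds have elapsed."""

    def __init__(self, cooldown: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.cooldown = cooldown
        self.clock = clock
        # -inf means "never rotated": elapsed time is +inf, so the first
        # rotation passes regardless of the current monotonic clock value.
        self.last_rotation = float("-inf")

    def try_rotate(self) -> bool:
        now = self.clock()
        # <= holds back a request arriving exactly at the threshold.
        if now - self.last_rotation <= self.cooldown:
            return False
        self.last_rotation = now
        return True


# Simulated clock on a freshly booted machine: monotonic time is still small.
fake_now = [5.0]
cooldown = RotationCooldown(10.0, clock=lambda: fake_now[0])

first = cooldown.try_rotate()        # elapsed = inf, allowed
fake_now[0] = 15.0
at_boundary = cooldown.try_rotate()  # elapsed = 10.0, exactly the threshold
fake_now[0] = 15.5
after_window = cooldown.try_rotate() # elapsed = 10.5, allowed
```

With a sentinel of 0 the first call above would compute an elapsed time of 5.0 and be rejected, which matches the flakiness described for fresh CI containers.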

Full Changelog: v0.3.0...v0.4.0

06 Jun 17:15

github-actions

Immutable

v0.3.0

c8c3faf

v0.3.0

Breaking Changes

- Scraper is now a composition facade rather than a requests.Session
  subclass. The public helper API (get_soup, get_json, get_file, etc.)
  is unchanged, but isinstance(scraper, requests.Session) no longer holds
  and Session-internal overrides will not work.
- engine/ and utils/ replace the former private _engine/ and _utils/
  packages. Direct imports from scraper._engine.* or scraper._utils.*
  must be updated to scraper.engine.* / scraper.utils.*.

Added

- CloudflareConfig, ImpersonateConfig, HttpVersion, ProxyUrl, and
  TorProxyUrl are now exported from the top-level scraper package.
- engine/ is now a documented public extension surface: the middleware
  pipeline and pluggable transport layer are importable and subclassable.
- New middleware pipeline with 11 single-concern middleware classes
  (throttle, stealth, proxy, retry_403, challenge, tls_rotation,
  concurrency, refresh, ssl_retry, hooks, abort) replacing the
  former monolithic engine.
- New Transport abstraction with CurlCffiTransport (primary) and
  UrllibTransport (fallback) implementations.
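
As a rough picture of what a pipeline of single-concern middleware in front of a pluggable transport can look like, consider the sketch below. Every name in it (HeaderMiddleware, RetryOn403, build_pipeline, the dict-based request and response) is hypothetical and chosen for illustration; the real middleware base class and call signature are not described in these notes.

```python
from functools import partial
from typing import Any, Callable, Dict, List

Request = Dict[str, Any]
Response = Dict[str, Any]
Handler = Callable[[Request], Response]


class HeaderMiddleware:
    """Single concern: stamp a header on every outgoing request."""

    def __call__(self, request: Request, next_handler: Handler) -> Response:
        request.setdefault("headers", {})["X-Demo"] = "1"
        return next_handler(request)


class RetryOn403:
    """Single concern: re-send the request while the response is a 403."""

    def __init__(self, attempts: int = 3) -> None:
        self.attempts = attempts

    def __call__(self, request: Request, next_handler: Handler) -> Response:
        response = next_handler(request)
        for _ in range(self.attempts - 1):
            if response["status"] != 403:
                break
            response = next_handler(request)
        return response


def build_pipeline(middlewares: List[Any], transport: Handler) -> Handler:
    """Wrap the transport so the first middleware in the list runs outermost."""
    handler = transport
    for middleware in reversed(middlewares):
        handler = partial(middleware, next_handler=handler)
    return handler


calls: List[Dict[str, str]] = []


def fake_transport(request: Request) -> Response:
    # Stand-in transport: rejects the first attempt, accepts the second.
    calls.append(dict(request["headers"]))
    return {"status": 403 if len(calls) == 1 else 200}


pipeline = build_pipeline([HeaderMiddleware(), RetryOn403(attempts=3)], fake_transport)
response = pipeline({"url": "https://example.com"})
```

Swapping fake_transport for another callable changes the backend without touching any middleware, which is the point of keeping the transport behind its own abstraction.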

Changed

- curl_cffi is now a core dependency (not an optional extra). The
  impersonate extra is retained for backwards compatibility but is a no-op;
  plain pip install lncrawl-scraper includes it automatically.
- Default impersonation target is "chrome": requests ride a real Chrome
  TLS/HTTP-2 fingerprint out of the box.
- Updated dependencies to latest compatible versions.

Fixed

- Atomic-group regex syntax that caused SyntaxError on Python 3.9 and 3.10.
- None-valued session options no longer forwarded to curl_cffi, fixing
  compatibility with older curl_cffi releases.
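
The second fix amounts to filtering unset options before they reach the backend. A minimal sketch, assuming a hypothetical drop_none helper and illustrative option names (not taken from the library):

```python
from typing import Any, Dict


def drop_none(options: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only options that were explicitly set (falsy values survive)."""
    return {key: value for key, value in options.items() if value is not None}


# Illustrative option names only.
session_options = {"impersonate": "chrome", "timeout": None, "verify": False}
forwarded = drop_none(session_options)
```

Testing with `is not None` rather than truthiness matters here: verify=False is a deliberate setting and must still be forwarded.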

Security Fixes

- Potential fix for code scanning alert no. 1: Workflow does not contain
  permissions by @dipu-bd in #1
- Fix for Redundant assignment by @dipu-bd in #2

Full Changelog: v0.2.0...v0.3.0

Contributors

dipu-bd

05 Jun 04:32

github-actions

Immutable

v0.2.0

e2574a3

v0.2.0

Changed

- brotli is now an optional extra instead of a core dependency. Install
  lncrawl-scraper[brotli] (or [all]) to decode brotli (br) responses;
  without it the scraper no longer advertises br encoding, so bodies stay
  decodable.
- default_config() no longer pins a Firefox/Windows identity - the default
  User-Agent (and its matching Client Hints) is now randomized across desktop
  browsers and platforms.
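
Advertising br only when a decoder is available can be done with a small capability check. The sketch below uses the standard library's importlib.util.find_spec; the accept_encoding function name and the exact list of encodings are assumptions, and the library may detect brotli support differently.

```python
import importlib.util


def accept_encoding() -> str:
    """Build an Accept-Encoding value that only lists decodable encodings."""
    encodings = ["gzip", "deflate"]
    # Only offer brotli if the optional `brotli` module can be imported.
    if importlib.util.find_spec("brotli") is not None:
        encodings.append("br")
    return ", ".join(encodings)


header_value = accept_encoding()
```

A server that is never offered br will not send it, so responses stay decodable even without the extra installed.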

Added

New all extra that installs every optional dependency
(brotli, image, impersonate).

Full Changelog: v0.1.0...v0.2.0

04 Jun 21:06

dipu-bd

Immutable

v0.1.0

c3950f4

v0.1.0

Initial public release of lncrawl-scraper, extracted from
lightnovel-crawler.

Added

- Scraper - a requests.Session subclass with transparent Cloudflare
  challenge handling (v1, v2, v3, Turnstile) and helpers: get_soup,
  post_soup, get_json, post_json, get_file, get_image, submit_form,
  ping.
- PageSoup - a null-safe BeautifulSoup wrapper; selection methods never return
  None and text/HTML accessors always return str.
- Typed configuration: ScraperConfig, StealthConfig, ProxyConfig,
  BrowserConfig, plus the default_config() factory.
- Browser fingerprint impersonation (impersonate extra): route requests
  through curl_cffi for a real Chrome/Firefox TLS (JA3/JA4) and HTTP/2
  fingerprint, with the spoofed User-Agent family aligned to the target.
- Browser-assisted clearance: apply_browser_clearance() to reuse a
  cf_clearance cookie + User-Agent solved by an external real browser.
- Accurate Client Hints: sec-ch-ua / platform / mobile derived from the
  chosen User-Agent (Chromium only) instead of hardcoded values.
- Stealth mode, proxy rotation with Tor identity refresh, TLS cipher rotation,
  rate limiting, and cooperative abort().
- py.typed marker (PEP 561) and full type coverage.
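
Deriving Client Hints from the chosen User-Agent, rather than hardcoding them, might look roughly like this. The client_hints function, the platform table, and in particular the brand list are simplifications made for this sketch: real Chrome sends a version-dependent brand list with a varying placeholder brand, so treat the sec-ch-ua value as illustrative.

```python
import re
from typing import Dict

# (token found in the User-Agent, value for sec-ch-ua-platform).
# Android is checked before Linux because Android UAs contain both tokens.
_PLATFORMS = (
    ("Windows", "Windows"),
    ("Android", "Android"),
    ("Macintosh", "macOS"),
    ("Linux", "Linux"),
)


def client_hints(user_agent: str) -> Dict[str, str]:
    """Return Client Hints headers for Chromium UAs, or {} for other browsers."""
    match = re.search(r"Chrome/(\d+)", user_agent)
    if match is None:
        # Firefox and Safari do not send sec-ch-ua headers.
        return {}
    major = match.group(1)
    platform = next((name for token, name in _PLATFORMS if token in user_agent), "Unknown")
    return {
        # Simplified brand list (assumption, see lead-in).
        "sec-ch-ua": f'"Chromium";v="{major}", "Not-A.Brand";v="99"',
        "sec-ch-ua-platform": f'"{platform}"',
        "sec-ch-ua-mobile": "?1" if "Mobile" in user_agent else "?0",
    }


chrome_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)
firefox_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0"

chrome_hints = client_hints(chrome_ua)
firefox_hints = client_hints(firefox_ua)
```

Keeping the hints consistent with the User-Agent avoids the mismatch (for example a Firefox UA carrying Chromium hints) that hardcoded values can produce.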

Full Changelog: https://github.com/lncrawl/scraper/commits/v0.1.0
