Releases: lncrawl/scraper

v0.4.2

11 Jun 09:01
Immutable release. Only release title and notes can be modified.

Changed

  • Engine.trigger_cancel and Scraper.trigger_cancel renamed to abort_on for
    clarity: the method registers a signal that will trigger an abort; it does
    not trigger one immediately.

Fixed

  • abort_on now uses an asyncio polling loop instead of run_in_executor, so no
    non-daemon executor thread is created. The coroutine exits immediately when the
    engine is aborted or closed, and no longer blocks process exit.
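
A minimal sketch of the polling approach described above, assuming nothing about the real implementation: the abort and is_done callables are hypothetical stand-ins for whatever the engine exposes, and the 50 ms interval is an arbitrary choice.

```python
import asyncio
import threading


async def abort_on(signal: threading.Event, abort, is_done, interval: float = 0.05) -> None:
    """Poll `signal` from the event loop and call `abort()` once it is set.

    Hypothetical sketch: `abort` and `is_done` are placeholders, not library API.
    """
    while not is_done():
        if signal.is_set():
            abort()
            return
        # asyncio.sleep yields to the loop, so no thread is parked in Event.wait().
        await asyncio.sleep(interval)


async def demo() -> bool:
    signal = threading.Event()
    aborted = threading.Event()
    watcher = asyncio.create_task(abort_on(signal, aborted.set, aborted.is_set))
    # Some other thread sets the signal shortly afterwards.
    threading.Timer(0.1, signal.set).start()
    await asyncio.wait_for(watcher, timeout=2)
    return aborted.is_set()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the coroutine only ever awaits asyncio.sleep, cancelling it or closing the loop ends it without waiting on a blocked thread.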

Full Changelog: v0.4.1...v0.4.2

v0.4.1

11 Jun 07:30

Added

  • Engine.trigger_cancel(signal) and Scraper.trigger_cancel(signal): abort
    the engine once a threading.Event is set. Uses run_in_executor on the
    engine's own event loop -- no extra polling thread needed.
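
For illustration, the run_in_executor pattern might look roughly like this. The function body is a guess at the general shape, not the library's code, and abort is a hypothetical callback.

```python
import asyncio
import threading


async def trigger_cancel(signal: threading.Event, abort) -> None:
    """Wait for `signal` on the loop's default executor, then call `abort()`."""
    loop = asyncio.get_running_loop()
    # Event.wait() blocks, so it is handed to a ThreadPoolExecutor worker.
    # That worker stays parked until the event is set.
    await loop.run_in_executor(None, signal.wait)
    abort()


async def demo() -> bool:
    signal = threading.Event()
    aborted = threading.Event()
    threading.Timer(0.1, signal.set).start()
    await asyncio.wait_for(trigger_cancel(signal, aborted.set), timeout=2)
    return aborted.is_set()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

No dedicated polling thread is written by hand here, but the executor worker still blocks in signal.wait() for as long as the event stays unset.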

Full Changelog: v0.4.0...v0.4.1

v0.4.0

08 Jun 19:11

Breaking Changes

  • requests dependency replaced with httpx[http2,socks]. Code that accesses
    scraper.engine.transport or constructs requests-style sessions directly
    must be updated.
  • Scraper no longer has an abort_event attribute. Use CancelToken (now
    exported from scraper) and pass it as cancel_token= to any request method.
  • EventLock removed from scraper.utils - it was an internal primitive
    superseded by CancelToken.
  • JS-interpreter-based challenge handlers (CloudflareV1, CloudflareV2,
    CloudflareV3, TurnstileSolver) removed from scraper.engine.challenges.
    Challenge detection is now handled by CloudflareDetector; solving is
    delegated to pluggable ClearanceSolver implementations.
  • scraper.engine.challenges module removed entirely; challenge detection now
    lives in scraper.challenges, and solving is implemented by RemoteSolver /
    BrowserSolver.
  • CloudflareConfig.solver (single) replaced by CloudflareConfig.solvers
    (list). The engine tries each in order.
  • ClearanceSolver.solve_async renamed to solve (the sync solve wrapper is
    gone; the solver protocol is now purely async).
  • ProxyManager.get_proxy() now returns str | None instead of
    dict[str, str] | None.
  • RequestChain and RequestContext (engine.state / engine.context) merged
    into RequestState; import paths change accordingly.
  • UrllibTransport removed; HttpxTransport is the new fallback when
    curl_cffi is unavailable.
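
The "solvers tried in order" behaviour with an async-only solve method can be sketched as follows. Everything here is hypothetical: the Solver class, the solve(url) signature, and the string return value are illustrative stand-ins, and the real ClearanceSolver protocol may differ.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Optional, Sequence


class Solver(ABC):
    """Hypothetical stand-in for a solver ABC with a single async `solve`."""

    @abstractmethod
    async def solve(self, url: str) -> Optional[str]:
        ...


class AlwaysFails(Solver):
    async def solve(self, url: str) -> Optional[str]:
        return None  # signals "could not solve"


class StaticSolver(Solver):
    def __init__(self, cookie: str) -> None:
        self.cookie = cookie

    async def solve(self, url: str) -> Optional[str]:
        return self.cookie


async def first_clearance(solvers: Sequence[Solver], url: str) -> Optional[str]:
    """Try each solver in list order and return the first non-None result."""
    for solver in solvers:
        result = await solver.solve(url)
        if result is not None:
            return result
    return None


if __name__ == "__main__":
    chain = [AlwaysFails(), StaticSolver("cf_clearance=abc")]
    print(asyncio.run(first_clearance(chain, "https://example.com")))
```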

Added

  • CancelToken - thread-safe per-request cancellation. Exported from scraper.
  • ClearanceResult - dataclass holding a solved Cloudflare clearance
    (cookie, user_agent, domain, expires, cf_bm_expires, proxy_key).
    Exported from scraper.
  • ClearanceSolver - ABC for pluggable challenge solvers. Exported from scraper.
  • RemoteSolver - ClearanceSolver adapter for FlareSolverr / Byparr remote
    solving. Exported from scraper.
  • BrowserSolver - ClearanceSolver using an in-process nodriver browser
    (requires the browser extra). Exported from scraper.
  • scraper.challenges package: CloudflareDetector, CloudflareChallengeKind
    for programmatic challenge detection without a full solver.
  • RequestHeaders utility (case-insensitive dict for HTTP headers) in
    scraper.utils.
  • ClearanceStore - in-memory + optional on-disk cache for cf_clearance
    records, keyed by (domain, proxy_key).
  • HttpxTransport - httpx.AsyncClient-backed fallback transport with
    per-proxy client pooling (replaces UrllibTransport).
  • Engine pipeline is fully async: all middleware and the transport are
    coroutines; a daemon asyncio event loop runs in a background thread.
  • anyio[trio] added as dev dependency; respx replaces responses for
    HTTP mocking in tests.
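
One common way to expose an async pipeline through a synchronous API is a daemon thread that owns an event loop. The sketch below shows that general pattern only; BackgroundLoop and fetch are hypothetical names and do not reflect the engine's actual classes.

```python
import asyncio
import threading


class BackgroundLoop:
    """Run an asyncio event loop in a daemon thread and submit coroutines to it."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        # daemon=True: the thread never keeps the interpreter alive at exit.
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro, timeout=None):
        """Schedule `coro` on the background loop and block for its result."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout)

    def close(self) -> None:
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fetch(url: str) -> str:
    # Placeholder for an async middleware + transport chain.
    await asyncio.sleep(0.01)
    return f"fetched {url}"


if __name__ == "__main__":
    bg = BackgroundLoop()
    print(bg.run(fetch("https://example.com")))
    bg.close()
```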

Changed

  • SVG image support removed (cairosvg required a system-level library unavailable
    on Windows; dropped to keep the package installable on all platforms).
  • build_transport now falls back to HttpxTransport when CurlCffiTransport
    initialization fails (e.g. missing native library), instead of raising.
  • Package is now also published to GitHub Packages on each release (alongside
    PyPI), installable with pip install --index-url https://pypi.pkg.github.com/lncrawl/lncrawl-scraper.

Fixed

  • Python 3.9 compatibility in CurlCffiTransport: the perk kwarg (absent in
    curl_cffi 0.13.x, the last release supporting Python 3.9) is now guarded by
    a runtime capability flag.
  • Tor cooldown boundary: requests arriving exactly at the cooldown threshold are
    now correctly held back (<= comparison instead of <).
  • Tor cooldown sentinel changed to float('-inf') so the first rotation always
    succeeds - the previous sentinel of 0 caused flakiness on fresh CI containers
    where time.monotonic() had not yet exceeded the cooldown window.
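
The two cooldown fixes can be illustrated with a small sketch. RotationCooldown and its injectable clock are hypothetical names used only to show the comparison and the sentinel; the real code may be structured differently.

```python
import time
from typing import Callable


class RotationCooldown:
    """Allow a rotation only when more than `cooldown` seconds have elapsed."""

    def __init__(self, cooldown: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.cooldown = cooldown
        self.clock = clock
        # With -inf the first elapsed time is infinite, so the first rotation
        # always passes. A sentinel of 0 would fail whenever the monotonic
        # clock itself is still below `cooldown`.
        self.last_rotation = float("-inf")

    def try_rotate(self) -> bool:
        now = self.clock()
        # <= holds back a request that lands exactly on the threshold.
        if now - self.last_rotation <= self.cooldown:
            return False
        self.last_rotation = now
        return True


if __name__ == "__main__":
    t = [1.0]  # simulated clock on a freshly booted machine
    cd = RotationCooldown(10.0, clock=lambda: t[0])
    print(cd.try_rotate())  # first rotation
    t[0] = 11.0             # exactly at the threshold
    print(cd.try_rotate())
```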

Full Changelog: v0.3.0...v0.4.0

v0.3.0

06 Jun 17:15

Breaking Changes

  • Scraper is now a composition facade rather than a requests.Session
    subclass. The public helper API (get_soup, get_json, get_file, etc.)
    is unchanged, but isinstance(scraper, requests.Session) no longer holds
    and Session-internal overrides will not work.
  • engine/ and utils/ replace the former private _engine/ and _utils/
    packages. Direct imports from scraper._engine.* or scraper._utils.*
    must be updated to scraper.engine.* / scraper.utils.*.

Added

  • CloudflareConfig, ImpersonateConfig, HttpVersion, ProxyUrl, and
    TorProxyUrl are now exported from the top-level scraper package.
  • engine/ is now a documented public extension surface: the middleware
    pipeline and pluggable transport layer are importable and subclassable.
  • New middleware pipeline with 11 single-concern middleware classes
    (throttle, stealth, proxy, retry_403, challenge, tls_rotation,
    concurrency, refresh, ssl_retry, hooks, abort) replacing the
    former monolithic engine.
  • New Transport abstraction with CurlCffiTransport (primary) and
    UrllibTransport (fallback) implementations.
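
As a rough picture of a single-concern middleware pipeline wrapped around a transport, consider the sketch below. The function names, the dict-based request, and the call signature are all assumptions made for illustration; they are not the engine's actual interfaces.

```python
from typing import Callable, Dict, List

Request = Dict[str, object]
Handler = Callable[[Request], str]
Middleware = Callable[[Request, Handler], str]


def build_pipeline(middlewares: List[Middleware], transport: Handler) -> Handler:
    """Compose middlewares around a transport; the first in the list runs first."""
    handler = transport
    # Wrap from the inside out so list order matches execution order.
    for mw in reversed(middlewares):
        handler = (lambda m, nxt: lambda req: m(req, nxt))(mw, handler)
    return handler


def throttle(req: Request, nxt: Handler) -> str:
    req.setdefault("trace", []).append("throttle")  # a real one would rate-limit
    return nxt(req)


def retry(req: Request, nxt: Handler) -> str:
    req.setdefault("trace", []).append("retry")  # a real one would re-call nxt
    return nxt(req)


def transport(req: Request) -> str:
    # Placeholder for the actual network call.
    return f"GET {req['url']} via {'>'.join(req.get('trace', []))}"


if __name__ == "__main__":
    send = build_pipeline([throttle, retry], transport)
    print(send({"url": "https://example.com"}))
```

Each middleware sees the request on the way in and the response on the way out, which is what lets concerns such as throttling or retrying stay in separate classes.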

Changed

  • curl_cffi is now a core dependency (not an optional extra). The
    impersonate extra is retained for backwards compatibility but is a no-op;
    plain pip install lncrawl-scraper includes it automatically.
  • Default impersonation target is "chrome" — requests ride a real Chrome
    TLS/HTTP-2 fingerprint out of the box.
  • Updated dependencies to latest compatible versions.

Fixed

  • Atomic-group regex syntax that caused SyntaxError on Python 3.9 and 3.10.
  • None-valued session options no longer forwarded to curl_cffi, fixing
    compatibility with older curl_cffi releases.

Security Fixes

  • Potential fix for code scanning alert no. 1: Workflow does not contain permissions by @dipu-bd in #1
  • Fix for Redundant assignment by @dipu-bd in #2

Full Changelog: v0.2.0...v0.3.0

v0.2.0

05 Jun 04:32

Changed

  • brotli is now an optional extra instead of a core dependency. Install
    lncrawl-scraper[brotli] (or [all]) to decode brotli (br) responses;
    without it the scraper no longer advertises br encoding, so bodies stay
    decodable.
  • default_config() no longer pins a Firefox/Windows identity — the default
    User-Agent (and its matching Client Hints) is now randomized across desktop
    browsers and platforms.
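
The brotli change amounts to advertising br only when a decoder is importable. A minimal sketch of that idea, with accept_encoding as a hypothetical helper name and the gzip/deflate baseline as an assumption:

```python
import importlib.util
from typing import Optional


def accept_encoding(has_brotli: Optional[bool] = None) -> str:
    """Build an Accept-Encoding value, adding `br` only if it can be decoded."""
    if has_brotli is None:
        # find_spec returns None when the module is not installed.
        has_brotli = importlib.util.find_spec("brotli") is not None
    encodings = ["gzip", "deflate"]
    if has_brotli:
        encodings.append("br")
    return ", ".join(encodings)


if __name__ == "__main__":
    print(accept_encoding())
```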

Added

  • An all extra that installs every optional dependency
    (brotli, image, impersonate).

Full Changelog: v0.1.0...v0.2.0

v0.1.0

04 Jun 21:06

Initial public release of lncrawl-scraper, extracted from
lightnovel-crawler.

Added

  • Scraper — a requests.Session subclass with transparent Cloudflare
    challenge handling (v1, v2, v3, Turnstile) and helpers: get_soup,
    post_soup, get_json, post_json, get_file, get_image, submit_form,
    ping.
  • PageSoup — a null-safe BeautifulSoup wrapper; selection methods never return
    None and text/HTML accessors always return str.
  • Typed configuration: ScraperConfig, StealthConfig, ProxyConfig,
    BrowserConfig, plus the default_config() factory.
  • Browser fingerprint impersonation (impersonate extra): route requests
    through curl_cffi for a real Chrome/Firefox TLS (JA3/JA4) and HTTP/2
    fingerprint, with the spoofed User-Agent family aligned to the target.
  • Browser-assisted clearance: apply_browser_clearance() to reuse a
    cf_clearance cookie + User-Agent solved by an external real browser.
  • Accurate Client Hints: sec-ch-ua / platform / mobile derived from the
    chosen User-Agent (Chromium only) instead of hardcoded values.
  • Stealth mode, proxy rotation with Tor identity refresh, TLS cipher rotation,
    rate limiting, and cooperative abort().
  • py.typed marker (PEP 561) and full type coverage.
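
Deriving Client Hints from the chosen User-Agent, rather than hardcoding them, can be sketched like this. The helper name, the platform table, and the brand list (including the placeholder "Not-A.Brand" entry) are simplified assumptions; real Chromium builds vary the brand list, and other Chromium-based browsers report different brands.

```python
import re
from typing import Dict

_CHROME = re.compile(r"Chrome/(\d+)")
# Order matters: Android user agents also contain "Linux".
_PLATFORMS = (
    ("Windows", "Windows"),
    ("Android", "Android"),
    ("Mac OS X", "macOS"),
    ("Linux", "Linux"),
)


def client_hints(user_agent: str) -> Dict[str, str]:
    """Return sec-ch-ua headers for Chrome-like user agents, else an empty dict."""
    match = _CHROME.search(user_agent)
    if match is None:
        # Non-Chromium browsers such as Firefox do not send these headers.
        return {}
    major = match.group(1)
    platform = next((name for token, name in _PLATFORMS if token in user_agent), "Unknown")
    mobile = "?1" if "Mobile" in user_agent else "?0"
    return {
        "sec-ch-ua": f'"Chromium";v="{major}", "Google Chrome";v="{major}", "Not-A.Brand";v="99"',
        "sec-ch-ua-mobile": mobile,
        "sec-ch-ua-platform": f'"{platform}"',
    }


if __name__ == "__main__":
    ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")
    print(client_hints(ua))
```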

Full Changelog: https://github.com/lncrawl/scraper/commits/v0.1.0