Releases: lncrawl/scraper
v0.4.2
Changed
- Engine.trigger_cancel and Scraper.trigger_cancel renamed to abort_on for
  clarity (the method registers a signal that will trigger an abort; it does
  not trigger one immediately).
Fixed
- abort_on now uses an asyncio polling loop instead of run_in_executor, so no
  non-daemon executor thread is created. The coroutine exits immediately when
  the engine is aborted or closed, and no longer blocks process exit.
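The polling approach described above can be sketched with the standard library alone. This is an illustration of the technique, not the library's code: the function signature, the closed event, the abort callback, and the 0.05 s interval are all assumptions made for the example.

```python
import asyncio
import threading
from typing import Callable


async def abort_on(
    signal: threading.Event,
    abort: Callable[[], None],
    closed: asyncio.Event,
    interval: float = 0.05,
) -> bool:
    """Poll a threading.Event from the event loop; call abort once it is set.

    Returns True if the abort fired, False if closed was set first.
    All parameter names here are hypothetical.
    """
    while not closed.is_set():
        if signal.is_set():
            abort()
            return True
        try:
            # Sleep for one interval, but wake at once if the engine closes.
            await asyncio.wait_for(closed.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
    return False


async def demo() -> list:
    calls = []
    signal = threading.Event()
    closed = asyncio.Event()
    # Set the signal from another thread shortly after the watcher starts.
    threading.Timer(0.1, signal.set).start()
    fired = await abort_on(signal, lambda: calls.append("aborted"), closed)
    return [fired, calls]


result = asyncio.run(demo())
```

Because the watcher is an ordinary coroutine, no executor thread sits blocked in Event.wait(), so nothing is left to hold the interpreter open at shutdown.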
Full Changelog: v0.4.1...v0.4.2
v0.4.1
Added
- Engine.trigger_cancel(signal) and Scraper.trigger_cancel(signal): abort
  the engine once a threading.Event is set. Uses run_in_executor on the
  engine's own event loop -- no extra polling thread needed.
Full Changelog: v0.4.0...v0.4.1
v0.4.0
Breaking Changes
- requests dependency replaced with httpx[http2,socks]. Code that accesses
  scraper.engine.transport or constructs requests-style sessions directly
  must be updated.
- Scraper no longer has an abort_event attribute. Use CancelToken (now
  exported from scraper) and pass it as cancel_token= to any request method.
- EventLock removed from scraper.utils - it was an internal primitive
  superseded by CancelToken.
- JS-interpreter-based challenge handlers (CloudflareV1, CloudflareV2,
  CloudflareV3, TurnstileSolver) removed from scraper.engine.challenges.
  Challenge detection is now handled by CloudflareDetector; solving is
  delegated to pluggable ClearanceSolver implementations.
- scraper.engine.challenges module removed entirely; challenge classes are now
  at scraper.challenges (detection) and implemented by RemoteSolver /
  BrowserSolver.
- CloudflareConfig.solver (single) replaced by CloudflareConfig.solvers
  (list). The engine tries each in order.
- ClearanceSolver.solve_async renamed to solve (the sync solve wrapper is
  gone; the solver protocol is now purely async).
- ProxyManager.get_proxy() now returns str | None instead of
  dict[str, str] | None.
- RequestChain and RequestContext (engine.state / engine.context) merged
  into RequestState; import paths change accordingly.
- UrllibTransport removed; HttpxTransport is the new fallback when
  curl_cffi is unavailable.
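For callers affected by the get_proxy() return type change, a small adapter may ease migration. The sketch below is not part of the library; proxy_as_mapping is a hypothetical helper, and it assumes the old dictionary followed the requests-style layout with http and https keys, which the notes do not state.

```python
from typing import Dict, Optional


def proxy_as_mapping(proxy: Optional[str]) -> Optional[Dict[str, str]]:
    """Expand a single proxy URL into a requests-style scheme mapping.

    Hypothetical shim for code that still expects the pre-0.4.0 dict shape.
    """
    if proxy is None:
        return None
    return {"http": proxy, "https": proxy}


mapping = proxy_as_mapping("socks5://127.0.0.1:9050")
```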
Added
- CancelToken - thread-safe per-request cancellation. Exported from scraper.
- ClearanceResult - dataclass holding a solved Cloudflare clearance
  (cookie, user_agent, domain, expires, cf_bm_expires, proxy_key).
  Exported from scraper.
- ClearanceSolver - ABC for pluggable challenge solvers. Exported from scraper.
- RemoteSolver - ClearanceSolver adapter for FlareSolverr / Byparr remote
  solving. Exported from scraper.
- BrowserSolver - ClearanceSolver using an in-process nodriver browser
  (requires the browser extra). Exported from scraper.
- scraper.challenges package: CloudflareDetector, CloudflareChallengeKind
  for programmatic challenge detection without a full solver.
- RequestHeaders utility (case-insensitive dict for HTTP headers) in
  scraper.utils.
- ClearanceStore - in-memory + optional on-disk cache for cf_clearance
  records, keyed by (domain, proxy_key).
- HttpxTransport - httpx.AsyncClient-backed fallback transport with
  per-proxy client pooling (replaces UrllibTransport).
- Engine pipeline is fully async: all middleware and the transport are
  coroutines; a daemon asyncio event loop runs in a background thread.
- anyio[trio] added as dev dependency; respx replaces responses for
  HTTP mocking in tests.
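The last engine item, an async pipeline driven from a daemon event loop in a background thread, follows a common standard-library pattern. The sketch below shows that general pattern only; BackgroundLoop and fetch are hypothetical names, and the library's actual wiring may differ.

```python
import asyncio
import threading


class BackgroundLoop:
    """Hypothetical helper: owns an event loop running in a daemon thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro, timeout=None):
        """Submit a coroutine from synchronous code and block for its result."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout)

    def close(self) -> None:
        # Ask the loop to stop from its own thread, then clean up.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fetch(url: str) -> str:
    # Stand-in for an async middleware chain plus transport.
    await asyncio.sleep(0.01)
    return "response for " + url


runner = BackgroundLoop()
body = runner.run(fetch("https://example.com"), timeout=5)
runner.close()
```

A daemon thread lets synchronous helpers keep a blocking call style while the work itself runs as coroutines, and it does not keep the process alive on exit.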
Changed
- SVG image support removed (cairosvg required a system-level library
  unavailable on Windows; dropped to keep the package installable on all
  platforms).
- build_transport now falls back to HttpxTransport when CurlCffiTransport
  initialization fails (e.g. missing native library), instead of raising.
- Package is now also published to GitHub Packages on each release (alongside
  PyPI), installable with pip install --index-url https://pypi.pkg.github.com/lncrawl/lncrawl-scraper.
Fixed
- Python 3.9 compatibility in CurlCffiTransport: the perk kwarg (absent in
  curl_cffi 0.13.x, the last release supporting Python 3.9) is now guarded by
  a runtime capability flag.
- Tor cooldown boundary: requests arriving exactly at the cooldown threshold
  are now correctly held back (<= comparison instead of <).
- Tor cooldown sentinel changed to float('-inf') so the first rotation always
  succeeds - the previous sentinel of 0 caused flakiness on fresh CI containers
  where time.monotonic() had not yet exceeded the cooldown window.
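Both Tor cooldown fixes can be illustrated with a few lines of arithmetic. The class below is a hypothetical stand-in, not the library's implementation; the name RotationCooldown, the try_rotate method, and the 10-second window are assumptions chosen for the example.

```python
import time
from typing import Optional


class RotationCooldown:
    """Hypothetical gate that allows one identity rotation per window."""

    def __init__(self, cooldown: float) -> None:
        self.cooldown = cooldown
        # -inf means "never rotated", so the first check always passes,
        # however small time.monotonic() is on a freshly booted machine.
        self._last = float("-inf")

    def try_rotate(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # <= holds back a request that lands exactly on the threshold.
        if now - self._last <= self.cooldown:
            return False
        self._last = now
        return True


gate = RotationCooldown(cooldown=10.0)
results = [
    gate.try_rotate(now=1.0),   # first rotation, clock barely started
    gate.try_rotate(now=11.0),  # exactly at the threshold: held back
    gate.try_rotate(now=11.5),  # past the threshold: allowed
]
```

With a sentinel of 0 instead of -inf, the first call at now=1.0 would compute an elapsed time of 1.0, which is inside the window, and the rotation would be refused.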
Full Changelog: v0.3.0...v0.4.0
v0.3.0
Breaking Changes
- Scraper is now a composition facade rather than a requests.Session
  subclass. The public helper API (get_soup, get_json, get_file, etc.)
  is unchanged, but isinstance(scraper, requests.Session) no longer holds
  and Session-internal overrides will not work.
- engine/ and utils/ replace the former private _engine/ and _utils/
  packages. Direct imports from scraper._engine.* or scraper._utils.*
  must be updated to scraper.engine.* / scraper.utils.*.
Added
- CloudflareConfig, ImpersonateConfig, HttpVersion, ProxyUrl, and
  TorProxyUrl are now exported from the top-level scraper package.
- engine/ is now a documented public extension surface: the middleware
  pipeline and pluggable transport layer are importable and subclassable.
- New middleware pipeline with 11 single-concern middleware classes
  (throttle, stealth, proxy, retry_403, challenge, tls_rotation,
  concurrency, refresh, ssl_retry, hooks, abort) replacing the
  former monolithic engine.
- New Transport abstraction with CurlCffiTransport (primary) and
  UrllibTransport (fallback) implementations.
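To show what a pipeline of single-concern middleware around a pluggable transport can look like, here is a generic sketch. It does not reproduce the library's classes or signatures: build_pipeline, the function-style middleware, the dict-based request, and retry_once are all assumptions, and stealth here only borrows the name from the list above.

```python
from typing import Callable, List

Handler = Callable[[dict], str]
Middleware = Callable[[dict, Handler], str]


def _bind(middleware: Middleware, next_handler: Handler) -> Handler:
    return lambda request: middleware(request, next_handler)


def build_pipeline(middlewares: List[Middleware], transport: Handler) -> Handler:
    """Wrap the transport so the first middleware in the list runs outermost."""
    handler = transport
    for middleware in reversed(middlewares):
        handler = _bind(middleware, handler)
    return handler


def stealth(request: dict, call_next: Handler) -> str:
    # Single concern: decorate the outgoing request.
    request.setdefault("headers", {})["user-agent"] = "example-agent"
    return call_next(request)


def retry_once(request: dict, call_next: Handler) -> str:
    # Single concern: retry one time on a connection failure.
    try:
        return call_next(request)
    except ConnectionError:
        return call_next(request)


attempts = []


def flaky_transport(request: dict) -> str:
    # Stand-in transport that fails on its first call only.
    attempts.append(request["url"])
    if len(attempts) == 1:
        raise ConnectionError("first attempt fails")
    return "200 " + request["headers"]["user-agent"]


send = build_pipeline([stealth, retry_once], flaky_transport)
response = send({"url": "https://example.com"})
```

Keeping each concern in its own callable means a transport can be swapped, or one middleware replaced, without touching the others.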
Changed
- curl_cffi is now a core dependency (not an optional extra). The
  impersonate extra is retained for backwards compatibility but is a no-op;
  plain pip install lncrawl-scraper includes it automatically.
- Default impersonation target is "chrome", so requests ride a real Chrome
  TLS/HTTP-2 fingerprint out of the box.
- Updated dependencies to latest compatible versions.
Fixed
- Atomic-group regex syntax that caused SyntaxError on Python 3.9 and 3.10.
- None-valued session options no longer forwarded to curl_cffi, fixing
  compatibility with older curl_cffi releases.
Security Fixes
- Potential fix for code scanning alert no. 1: Workflow does not contain permissions by @dipu-bd in #1
- Fix for Redundant assignment by @dipu-bd in #2
Full Changelog: v0.2.0...v0.3.0
v0.2.0
Changed
- brotli is now an optional extra instead of a core dependency. Install
  lncrawl-scraper[brotli] (or [all]) to decode brotli (br) responses;
  without it the scraper no longer advertises br encoding, so bodies stay
  decodable.
- default_config() no longer pins a Firefox/Windows identity: the default
  User-Agent (and its matching Client Hints) is now randomized across desktop
  browsers and platforms.
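The brotli behaviour above, advertising br only when a decoder is installed, can be sketched as follows. This is an illustration under assumptions: accept_encoding is a hypothetical helper, and the gzip and deflate defaults are chosen for the example rather than taken from the library.

```python
import importlib.util


def accept_encoding() -> str:
    """Advertise br only when a brotli decoder can be imported."""
    encodings = ["gzip", "deflate"]
    if importlib.util.find_spec("brotli") is not None:
        encodings.append("br")
    return ", ".join(encodings)


header = accept_encoding()
```

A server only sends a content encoding the client has offered, so leaving br out of the header is what keeps response bodies decodable without the extra installed.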
Added
- all optional extra that installs every optional dependency
  (brotli, image, impersonate).
Full Changelog: v0.1.0...v0.2.0
v0.1.0
Initial public release of lncrawl-scraper, extracted from
lightnovel-crawler.
Added
- Scraper - a requests.Session subclass with transparent Cloudflare
  challenge handling (v1, v2, v3, Turnstile) and helpers: get_soup,
  post_soup, get_json, post_json, get_file, get_image, submit_form,
  ping.
- PageSoup - a null-safe BeautifulSoup wrapper; selection methods never return
  None and text/HTML accessors always return str.
- Typed configuration: ScraperConfig, StealthConfig, ProxyConfig,
  BrowserConfig, plus the default_config() factory.
- Browser fingerprint impersonation (impersonate extra): route requests
  through curl_cffi for a real Chrome/Firefox TLS (JA3/JA4) and HTTP/2
  fingerprint, with the spoofed User-Agent family aligned to the target.
- Browser-assisted clearance: apply_browser_clearance() to reuse a
  cf_clearance cookie + User-Agent solved by an external real browser.
- Accurate Client Hints: sec-ch-ua / platform / mobile derived from the
  chosen User-Agent (Chromium only) instead of hardcoded values.
- Stealth mode, proxy rotation with Tor identity refresh, TLS cipher rotation,
  rate limiting, and cooperative abort().
- py.typed marker (PEP 561) and full type coverage.
Full Changelog: https://github.com/lncrawl/scraper/commits/v0.1.0