feat(waterdata): Migrate to httpx and add async parallel chunker #285
Draft
thodson-usgs wants to merge 1 commit into
Replace ``requests`` with ``httpx`` package-wide and add an opt-in async parallel fan-out for the multi-value chunker, gated on the ``API_USGS_CONCURRENT`` env var.

* ``httpx`` ships sync and async clients on a unified API, so the same request shape powers both the synchronous getters callers use today and the new ``_fan_out_async`` parallel path; the unmaintained ``requests`` had no async story.
* ``API_USGS_CONCURRENT=1`` (default in tests) keeps the serial ``ChunkedCall.resume()`` path over one shared ``httpx.Client``. ``API_USGS_CONCURRENT=N`` (N > 1; default 16 in production) or ``unbounded`` fans the plan out through ``_fan_out_async`` over one shared ``httpx.AsyncClient``, bounded by ``asyncio.Semaphore(N)``.
* Both paths publish their client on a ``ContextVar`` (``_chunked_session`` / ``_chunked_async_session``) so paginated helpers downstream reuse the connection pool across every sub-request of a chunked call.
* The parallel path preserves the same safety contracts as the serial path: it probes the first sub-request alone to read ``x-ratelimit-remaining`` before fanning out the rest (``RequestExceedsQuota``), and uses ``asyncio.gather(return_exceptions=True)`` so a transient failure surfaces as a ``ChunkInterrupted`` whose ``.call`` is a ``ChunkedCall`` holding the sparse-indexed completed sub-requests; ``exc.call.resume()`` re-issues only the unfinished ones via the sync path.
* The wrapper falls back to the serial path (with a ``UserWarning``) in two cases: when ``asyncio.get_running_loop()`` finds a loop already running, so Jupyter / IPython kernels and async apps don't see a confusing ``RuntimeError``, and when the decorator was set up without a ``fetch_async=`` sibling.
* Three defensive helpers smooth over httpx behaviours that ``requests`` didn't have:
  * ``_safe_request_bytes`` swallows ``httpx.InvalidURL`` so the planner's halving loop keeps shrinking past httpx's 64 KB URL cap;
  * ``_safe_elapsed`` falls back to ``timedelta(0)`` when ``.elapsed`` is missing (mock transports);
  * ``_set_response_url`` rewrites the URL via the bound request, since httpx makes ``Response.url`` read-only.

Tests: ``pyproject.toml`` switches ``requests``/``requests-mock`` to ``httpx``/``pytest-httpx``; ``tests/conftest.py`` adds a ``requests_mock``-shaped shim over ``httpx_mock`` and an autouse fixture pinning ``API_USGS_CONCURRENT=1`` so historical tests stay on the deterministic serial path. New async-mode tests cover the parallel fan-out, the probe-first quota check, the resumable ``ChunkInterrupted.call`` after a mid-fan-out failure, the running-event-loop fallback, and the missing-``fetch_async`` warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
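
For reviewers, the ``API_USGS_CONCURRENT`` contract described in the commit message can be sketched as a small parser. This is only an illustration under stated assumptions: ``parse_concurrency`` and ``concurrency_from_env`` are hypothetical names, not functions from this PR, and representing ``unbounded`` as ``math.inf`` is a choice made here for brevity.

```python
import math
import os

_CONCURRENCY_DEFAULT = 16  # default value quoted in the PR description


def parse_concurrency(raw):
    """Hypothetical parser for the API_USGS_CONCURRENT value.

    Returns 1 for the serial path, an int > 1 for bounded parallel,
    or math.inf for "unbounded". Raises ValueError for 0, negative,
    or malformed input.
    """
    if raw is None:  # variable not set: use the default
        return _CONCURRENCY_DEFAULT
    text = raw.strip().lower()
    if text == "unbounded":
        return math.inf
    try:
        value = int(text)
    except ValueError:
        raise ValueError(
            f"API_USGS_CONCURRENT must be a positive integer or 'unbounded', got {raw!r}"
        ) from None
    if value < 1:
        raise ValueError(f"API_USGS_CONCURRENT must be >= 1, got {value}")
    return value


def concurrency_from_env(environ=os.environ):
    """Read the setting at call time so a bad value fails on use, not on import."""
    return parse_concurrency(environ.get("API_USGS_CONCURRENT"))
```

Reading the variable inside the call, rather than at import, matches the "``ValueError`` at call time" behaviour the PR describes and lets tests pin the value per test.
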
Summary
Sits on top of latest main, which already includes PR #283 (chunker arch) and PR #288 (progress bar). Mergeable.

* Replace ``requests`` with ``httpx`` package-wide.
* Add an async parallel fan-out for the multi-value chunker, gated on ``API_USGS_CONCURRENT`` (default 16 for the server-friendly sweet spot; set ``=1`` for the legacy sequential path).
* Reuse the ``ChunkPlan`` / ``ChunkedCall`` arch from #283 (feat(waterdata): Auto-chunk OGC requests over the URL byte limit): sync drives ``ChunkedCall.resume()`` over one shared ``httpx.Client``; parallel uses ``_fan_out_async`` to iterate the same plan via ``asyncio.gather`` + ``asyncio.Semaphore`` over one shared ``httpx.AsyncClient``.

Benchmarked at 5.84× speedup vs latest main on a 19,602-site / 6-state ``get_daily`` call (PR async @ ``API_USGS_CONCURRENT=16``: 4.91s; main serial: 28.65s; distinct date windows per side so the USGS cache can't bias either run).

Why httpx
``httpx`` ships sync and async clients on a unified API, so the same request shape powers both the existing synchronous getters and the new parallel path. ``requests`` is unmaintained and has no async story; a thread-pool bolt-on would have been a one-off rather than a primitive reusable elsewhere.

| ``API_USGS_CONCURRENT`` | Behavior |
| --- | --- |
| unset | parallel, default 16 (``_CONCURRENCY_DEFAULT``) |
| ≥ 2 | parallel, bounded at N |
| ``1`` | serial (``ChunkedCall.resume()`` path) |
| ``unbounded`` | parallel, no concurrency cap |
| ``0``, negative, malformed | ``ValueError`` at call time |

Both modes share one connection pool across all sub-requests of a single chunked call via the ``_chunked_session`` (sync) / ``_chunked_async_session`` (async) ``ContextVar``s: ``_walk_pages`` / ``_walk_pages_async`` / ``get_stats_data`` read them as fallbacks before opening a fresh client.

Parallel path safety contracts
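
As orientation for the contracts in this section, the probe-first check and the partial-result bookkeeping can be sketched with plain ``asyncio``. This is a minimal sketch, not the PR's ``_fan_out_async``: ``fan_out``, ``QuotaExceeded``, and ``Interrupted`` are hypothetical stand-ins, ``fetch`` is assumed to be a coroutine returning ``(result, remaining_quota)``, and one quota unit per sub-request is an assumption.

```python
import asyncio


class QuotaExceeded(Exception):
    """Hypothetical stand-in for RequestExceedsQuota."""


class Interrupted(Exception):
    """Hypothetical stand-in for ChunkInterrupted; keeps the finished work."""

    def __init__(self, completed, cause):
        super().__init__(str(cause))
        self.completed = completed  # sparse {index: result}
        self.cause = cause


async def fan_out(chunks, fetch, limit):
    """Probe chunk 0 alone, then run the remaining chunks under a semaphore."""
    first, remaining = await fetch(chunks[0])
    rest = chunks[1:]
    if remaining < len(rest):  # assumed: each sub-request costs one quota unit
        raise QuotaExceeded(f"{len(rest)} requests left, quota {remaining}")

    sem = asyncio.Semaphore(limit)

    async def bounded(chunk):
        async with sem:
            result, _ = await fetch(chunk)
            return result

    # return_exceptions=True: one failure does not discard sibling results.
    outcomes = await asyncio.gather(
        *(bounded(chunk) for chunk in rest), return_exceptions=True
    )

    completed = {0: first}
    failure = None
    for index, outcome in enumerate(outcomes, start=1):
        if isinstance(outcome, BaseException):
            if failure is None:
                failure = outcome
        else:
            completed[index] = outcome
    if failure is not None:
        raise Interrupted(completed, failure)
    return [completed[i] for i in range(len(chunks))]
```

A caller that catches ``Interrupted`` can re-issue only the indices missing from ``exc.completed``, which is the idea behind ``exc.call.resume()`` in the PR.
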
The parallel branch preserves the same safety contracts the serial path provides:
* Probe-first quota check. ``_fan_out_async`` issues the first sub-request alone, reads ``x-ratelimit-remaining`` from its response, and raises ``RequestExceedsQuota`` before dispatching the rest if the remaining plan can't fit the window. Matches ``ChunkedCall._check_quota_after_first``.
* Resumable interruption. ``asyncio.gather`` runs with ``return_exceptions=True``, so a sibling's transient failure (``RateLimited`` / ``ServiceUnavailable``) doesn't lose the completed work. The raised ``ChunkInterrupted.call`` is a ``ChunkedCall`` holding the sparse-indexed completed sub-requests; ``exc.call.resume()`` re-issues only the unfinished indices via the sync ``fetch_once`` path.
* Running-loop fallback. ``asyncio.run()`` raises inside an already-running loop (Jupyter / IPython kernels, async apps). The wrapper calls ``asyncio.get_running_loop()`` first and, when one is active, falls back to the serial path with a ``UserWarning`` instead of crashing.
* Missing ``fetch_async`` warning. If ``API_USGS_CONCURRENT`` requests parallel but the decorator wasn't wired with ``fetch_async=``, the wrapper warns and runs serial rather than silently ignoring the env var.

Three httpx behavior diffs handled defensively
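
To make the ``_safe_request_bytes`` idea in this section concrete, here is a rough stdlib-only sketch of a halving planner that treats "too big to construct" as "doesn't fit". Everything here is illustrative: ``UrlTooLong`` stands in for ``httpx.InvalidURL``, ``request_bytes`` and ``plan_chunks`` are hypothetical helpers, and the cap is applied to the whole URL for simplicity, whereas the PR describes a per-component limit.

```python
MAX_URL_BYTES = 64 * 1024  # stand-in for the 64 KB cap described in the PR


class UrlTooLong(Exception):
    """Stand-in for httpx.InvalidURL so the sketch needs no third-party import."""


def request_bytes(base, values):
    """Hypothetical URL builder that refuses oversized URLs."""
    url = f"{base}?id={','.join(values)}"
    size = len(url.encode("utf-8"))
    if size > MAX_URL_BYTES:
        raise UrlTooLong(f"{size} bytes")
    return size


def safe_request_bytes(base, values):
    """Report an unconstructible URL as infinitely large instead of raising."""
    try:
        return request_bytes(base, values)
    except UrlTooLong:
        return float("inf")


def plan_chunks(base, values, budget):
    """Halve the chunk size until every sub-request fits the byte budget."""
    size = max(len(values), 1)
    while True:
        chunks = [values[i:i + size] for i in range(0, len(values), size)]
        if all(safe_request_bytes(base, chunk) <= budget for chunk in chunks):
            return chunks
        if size == 1:
            raise ValueError("a single value exceeds the byte budget")
        size = (size + 1) // 2  # keep shrinking, even past the hard cap
```

Without the ``safe_`` wrapper, the first oversized attempt would raise and abort planning before the loop had a chance to shrink the chunk.
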
* ``httpx.InvalidURL`` is raised when a URL component exceeds 64 KB (e.g. all California stream sites comma-joined in one query). It is caught by ``_safe_request_bytes`` (which treats "too big to construct" as "doesn't fit", so the planner's halving loop keeps shrinking) and again in ``ChunkPlan.__init__`` so canonical-URL recovery can fall through to a worst-case sub-request URL.
* ``httpx.Response.elapsed`` is only populated on close (not by ``httpx.MockTransport`` / ``pytest-httpx``). ``_safe_elapsed`` falls back to ``timedelta(0)``.
* ``httpx.Response.url`` is a read-only property. ``_set_response_url`` rewrites it via the bound request, with a fallback path for ``Mock``-shaped test responses.

Backwards-compat
* ``BaseMetadata.header`` is now ``httpx.Headers`` instead of ``requests.structures.CaseInsensitiveDict``. Case-insensitive lookups (``md.header.get("x-ratelimit-remaining")``) keep working; literal dict equality (``md.header == {"k": "v"}``) no longer holds because ``httpx.Headers`` carries auto-added entries (content-type, content-length).
* ``BaseMetadata.url`` is coerced to ``str`` (previously ``str`` on ``requests.Response``; now ``str(httpx.Response.url)``).
* ``API_USGS_CONCURRENT`` defaults to 16 (parallel). Set ``=1`` to opt back into the sequential path.

Test plan
* Test suite switched from ``requests-mock`` to ``pytest-httpx``. ``tests/conftest.py`` (new) is a ``requests_mock``-shaped shim over ``httpx_mock`` (URL-prefix match, ``complete_qs`` strict-mode parity, request-history view); an autouse fixture pins ``API_USGS_CONCURRENT=1`` so the historical mocked suite stays on the deterministic serial path.
* New async-mode tests in ``tests/waterdata_chunking_test.py`` (``test_async_fan_out_*``, one parametrized over running-loop + missing-async) cover (a) successful fan-out, (b) probe-first ``RequestExceedsQuota``, (c) resumable ``ChunkInterrupted.call`` after a mid-fan-out failure (resume runs serially on the unfinished indices), (d) the running-event-loop fallback warns and runs serial, (e) the missing-``fetch_async`` warning fires when the env asks for parallel. Plus three new async progress-integration tests covering reporter calls from ``_paginate_async`` and ``_fan_out_async`` (``tests/waterdata_progress_test.py``, ``tests/waterdata_utils_test.py``).
* ``ruff check`` and ``ruff format --check`` pass.
* Includes ``_get_resp_data`` / ``_handle_stats_nesting`` from main, so most pre-existing column-drift failures (``get_daily`` / ``get_stats_*`` / ``get_channel``) should be resolved.

Out of scope (follow-ups)
* ``_fan_out_async`` materializes all ``(df, response)`` pairs before combining. Consider streaming-combine via ``asyncio.as_completed`` if users push concurrency very high.
* ``NEWS.md`` entry: left for the merger to draft.

🤖 Generated with Claude Code
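
The streaming-combine follow-up could look roughly like the sketch below, which folds each sub-request's rows into an accumulator as soon as it finishes instead of holding every result until the end. It is a sketch under assumptions, not part of this PR: ``stream_combine`` is a hypothetical name, ``fetch`` is assumed to be a coroutine returning a list of rows, and plain lists stand in for DataFrames. Error handling and resumption are left out.

```python
import asyncio


async def stream_combine(chunks, fetch, limit):
    """Combine results in completion order, then restore plan order."""
    sem = asyncio.Semaphore(limit)

    async def bounded(index, chunk):
        async with sem:
            return index, await fetch(chunk)

    tasks = [
        asyncio.create_task(bounded(index, chunk))
        for index, chunk in enumerate(chunks)
    ]

    rows = []
    for finished in asyncio.as_completed(tasks):
        index, chunk_rows = await finished
        # Fold into the accumulator right away; the per-chunk list can be freed.
        rows.extend((index, row) for row in chunk_rows)

    # Stable sort by chunk index keeps row order within each chunk.
    rows.sort(key=lambda pair: pair[0])
    return [row for _, row in rows]
```

Tagging rows with their chunk index keeps the output deterministic even though chunks complete in arbitrary order.
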