Skip to content

feat(waterdata): Migrate to httpx and add async parallel chunker#285

Draft
thodson-usgs wants to merge 1 commit into
DOI-USGS:mainfrom
thodson-usgs:httpx-migration
Draft

feat(waterdata): Migrate to httpx and add async parallel chunker#285
thodson-usgs wants to merge 1 commit into
DOI-USGS:mainfrom
thodson-usgs:httpx-migration

Conversation

@thodson-usgs
Copy link
Copy Markdown
Collaborator

@thodson-usgs thodson-usgs commented May 21, 2026

Summary

Sits on top of latest main, which already includes PR #283 (chunker arch) and PR #288 (progress bar). Mergeable.

  • Replaces requests with httpx package-wide.
  • Adds an opt-in async parallel branch to the multi-value chunker via API_USGS_CONCURRENT (default 16 for the server-friendly sweet spot; set =1 for the legacy sequential path).
  • Integrates the async path with the ChunkPlan / ChunkedCall arch from feat(waterdata): Auto-chunk OGC requests over the URL byte limit #283 — sync drives ChunkedCall.resume() over one shared httpx.Client; parallel uses _fan_out_async to iterate the same plan via asyncio.gather + asyncio.Semaphore over one shared httpx.AsyncClient.

Benchmarked at 5.84× speedup vs latest main on a 19,602-site / 6-state get_daily call (PR async @ API_USGS_CONCURRENT=16: 4.91s; main serial: 28.65s; distinct date windows per side so the USGS cache can't bias either run).

Why httpx

httpx ships sync and async clients on a unified API, so the same request shape powers both the existing synchronous getters and the new parallel path. requests is unmaintained and has no async story — a thread-pool bolt-on would have been a one-off rather than a primitive reusable elsewhere.

API_USGS_CONCURRENT

Value Behavior
unset / blank parallel, cap = 16 (_CONCURRENCY_DEFAULT)
≥ 2 parallel, semaphore-capped at that value
1 serial (sync ChunkedCall.resume() path)
unbounded parallel, no per-call cap — caller owns the burst risk
0, negative, malformed ValueError at call time

Connection-pool sharing across all sub-requests of a single chunked call in both modes via the _chunked_session (sync) / _chunked_async_session (async) ContextVars — _walk_pages / _walk_pages_async / get_stats_data read them as fallbacks before opening a fresh client.

Parallel path safety contracts

The parallel branch preserves the same safety contracts the serial path provides:

  • Probe-first quota check. _fan_out_async issues the first sub-request alone, reads x-ratelimit-remaining from its response, and raises RequestExceedsQuota before dispatching the rest if the remaining plan can't fit the window. Matches ChunkedCall._check_quota_after_first.
  • Resumable interruptions. asyncio.gather runs with return_exceptions=True, so a sibling's transient failure (RateLimited / ServiceUnavailable) doesn't lose the completed work. The raised ChunkInterrupted.call is a ChunkedCall holding the sparse-indexed completed sub-requests; exc.call.resume() re-issues only the unfinished indices via the sync fetch_once path.
  • Event-loop detection. asyncio.run() raises inside an already-running loop (Jupyter / IPython kernels, async apps). The wrapper calls asyncio.get_running_loop() first and, when one is active, falls back to the serial path with a UserWarning instead of crashing.
  • Missing-fetch_async warning. If API_USGS_CONCURRENT requests parallel but the decorator wasn't wired with fetch_async=, the wrapper warns + runs serial rather than silently no-op'ing the env var.

Three httpx behavior diffs handled defensively

  • httpx.InvalidURL raised when a URL component > 64 KB (e.g. all California stream sites comma-joined in one query). Caught by _safe_request_bytes (treats "too big to construct" as "doesn't fit", so the planner's halving loop keeps shrinking) and again in ChunkPlan.__init__ so canonical-URL recovery can fall through to a worst-case sub-request URL.
  • httpx.Response.elapsed only populated on close (not by httpx.MockTransport / pytest-httpx). _safe_elapsed falls back to timedelta(0).
  • httpx.Response.url is a read-only property. _set_response_url rewrites it via the bound request, with a fallback path for Mock-shaped test responses.

Backwards-compat

  • BaseMetadata.header is now httpx.Headers instead of requests.structures.CaseInsensitiveDict. Case-insensitive lookups (md.header.get("x-ratelimit-remaining")) keep working; literal dict equality (md.header == {"k": "v"}) no longer holds because httpx.Headers carries auto-added entries (content-type, content-length).
  • BaseMetadata.url is coerced to str (previously str on requests.Response; now str(httpx.Response.url)).
  • API_USGS_CONCURRENT defaults to 16 (parallel). Set =1 to opt back into the sequential path.

Test plan

  • 397 mocked tests pass after migrating to pytest-httpx. tests/conftest.py (new) is a requests_mock-shaped shim over httpx_mock (URL-prefix match, complete_qs strict-mode parity, request-history view); an autouse fixture pins API_USGS_CONCURRENT=1 so the historical mocked suite stays on the deterministic serial path.
  • Async-path coverage: four new async tests in tests/waterdata_chunking_test.py (test_async_fan_out_*, one parametrized over running-loop + missing-async) cover (a) successful fan-out, (b) probe-first RequestExceedsQuota, (c) resumable ServiceInterrupted.call after a mid-fan-out failure (resume runs serially on the unfinished indices), (d) the running-event-loop fallback warns and runs serial, (e) the missing-fetch_async warning fires when the env asks for parallel. Plus three new async progress-integration tests covering reporter calls from _paginate_async and _fan_out_async (tests/waterdata_progress_test.py, tests/waterdata_utils_test.py).
  • ruff check and ruff format --check pass.
  • Live-API CI sweep — picked up the schema-aware _get_resp_data / _handle_stats_nesting from main, so most pre-existing column-drift failures (get_daily / get_stats_* / get_channel) should be resolved.

Out of scope (follow-ups)

  • High-concurrency memory: _fan_out_async materializes all (df, response) pairs before combining. Consider streaming-combine via asyncio.as_completed if users push concurrency very high.
  • NEWS.md entry — left for the merger to draft.

🤖 Generated with Claude Code

Replace ``requests`` with ``httpx`` package-wide and add an opt-in
async parallel fan-out for the multi-value chunker, gated on the
``API_USGS_CONCURRENT`` env var.

* ``httpx`` ships sync and async clients on a unified API, so the
  same request shape powers both the synchronous getters callers
  use today and the new ``_fan_out_async`` parallel path; the
  unmaintained ``requests`` had no async story.
* ``API_USGS_CONCURRENT=1`` (default in tests) keeps the serial
  ``ChunkedCall.resume()`` path over one shared ``httpx.Client``.
  ``API_USGS_CONCURRENT=N`` (N > 1; default 16 in production) or
  ``unbounded`` fans the plan out through ``_fan_out_async`` over
  one shared ``httpx.AsyncClient``, bounded by
  ``asyncio.Semaphore(N)``.
* Both paths publish their client on a ``ContextVar``
  (``_chunked_session`` / ``_chunked_async_session``) so paginated
  helpers downstream reuse the connection pool across every
  sub-request of a chunked call.
* The parallel path preserves the same safety contracts as the
  serial path: it probes the first sub-request alone to read
  ``x-ratelimit-remaining`` before fanning out the rest
  (``RequestExceedsQuota``), and uses ``asyncio.gather(
  return_exceptions=True)`` so a transient failure surfaces as a
  ``ChunkInterrupted`` whose ``.call`` is a ``ChunkedCall`` holding
  the sparse-indexed completed sub-requests; ``exc.call.resume()``
  re-issues only the unfinished ones via the sync path.
* The wrapper falls back to the serial path (with a
  ``UserWarning``) when ``asyncio.get_running_loop()`` returns —
  so Jupyter / IPython kernels and async apps don't see a
  confusing ``RuntimeError`` — and when the decorator was set up
  without a ``fetch_async=`` sibling.
* Three defensive helpers smooth over httpx behaviours that
  ``requests`` didn't have: ``_safe_request_bytes`` swallows
  ``httpx.InvalidURL`` so the planner's halving loop keeps
  shrinking past httpx's 64 KB URL cap; ``_safe_elapsed`` falls
  back to ``timedelta(0)`` when ``.elapsed`` is missing (mock
  transports); ``_set_response_url`` rewrites the URL via the
  bound request, since httpx makes ``Response.url`` read-only.

Tests: ``pyproject.toml`` switches ``requests``/``requests-mock``
to ``httpx``/``pytest-httpx``; ``tests/conftest.py`` adds a
``requests_mock``-shaped shim over ``httpx_mock`` and an autouse
fixture pinning ``API_USGS_CONCURRENT=1`` so historical tests
stay on the deterministic serial path. New async-mode tests cover
the parallel fan-out, the probe-first quota check, the resumable
``ChunkInterrupted.call`` after a mid-fan-out failure, the
running-event-loop fallback, and the missing-``fetch_async``
warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant