
feat(waterdata): Auto-chunk OGC requests over the URL byte limit #283

Merged
thodson-usgs merged 1 commit into DOI-USGS:main from thodson-usgs:chunker-unified
May 23, 2026



@thodson-usgs thodson-usgs commented May 18, 2026

Summary

The OGC waterdata getters (get_daily, get_continuous, get_field_measurements, and the rest of the multi-value-capable functions) previously failed with HTTP 414 once the request URL exceeded the server's ~8 KB byte limit. The common chained-query pattern — pull a long site list from get_monitoring_locations, then feed it into get_daily — was the main offender.

This PR adds a transparent joint chunker: any over-budget request is fanned out across multiple sub-requests under the hood, and one combined DataFrame is returned. Callers see no API change.

from dataretrieval.waterdata import get_daily, get_monitoring_locations

sites_df, _ = get_monitoring_locations(state_name="Ohio", site_type="Stream")
# Before: HTTP 414 once `sites_df` exceeded ~500 rows.
# After: transparently chunked, one combined DataFrame returned.
df, md = get_daily(
    monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
    parameter_code="00060",
    time="P7D",
)

Mirrors R dataRetrieval's #870, generalized from one filter axis to N joint axes.

What's new

  • Joint planner: every multi-value list parameter AND the cql-text filter (split on its top-level OR clauses) is modeled as a chunkable axis. Greedy halving repeatedly splits the biggest chunk across all axes until every sub-request URL fits under the byte limit.
  • Resumable interruption: mid-stream 429 / 5xx / transport failures surface as a typed ChunkInterrupted subclass (QuotaExhausted or ServiceInterrupted) carrying the partial result plus a .call handle. Call exc.call.resume() once the underlying condition clears — only still-pending sub-requests are re-issued.
  • Pre-emptive quota guard: after every non-final sub-request, ChunkedCall reads x-ratelimit-remaining and raises RequestExceedsQuota if the rest of the plan won't fit the window. Set API_USGS_LIMIT=0 to bypass.
  • Shared requests.Session: opened once per chunked call and published via a ContextVar so the paginated-loop helpers downstream reuse the same connection pool across every sub-request (saves one TCP/TLS handshake per sub-request after the first).
  • New public types: ChunkPlan, ChunkedCall, RequestTooLarge, RequestExceedsQuota, ChunkInterrupted / QuotaExhausted / ServiceInterrupted, RateLimited / ServiceUnavailable, multi_value_chunked decorator.
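
To make the greedy-halving idea in the first bullet concrete, here is a minimal sketch. It is not the code in `dataretrieval.waterdata.chunking`: the names `url_bytes`, `plan_chunks`, and `sub_requests`, the toy URL format, and the byte budget passed in are all assumptions made for illustration, and the filter axis (OR clauses) is left out.

```python
from itertools import product
from typing import Dict, Iterator, List

Axes = Dict[str, List[List[str]]]  # parameter name -> list of chunks


def url_bytes(base: str, args: Dict[str, List[str]]) -> int:
    """Byte length of a toy query URL with comma-joined list values."""
    query = "&".join(f"{k}={','.join(v)}" for k, v in sorted(args.items()))
    return len(f"{base}?{query}".encode("utf-8"))


def plan_chunks(base: str, args: Dict[str, List[str]], limit: int) -> Axes:
    """Halve the largest chunk on any axis until the worst-case URL fits."""
    axes: Axes = {name: [list(values)] for name, values in args.items()}

    def worst_case() -> int:
        # URL length is additive per parameter, so the longest sub-request
        # pairs the byte-largest chunk of every axis.
        biggest = {
            name: max(chunks, key=lambda c: len(",".join(c)))
            for name, chunks in axes.items()
        }
        return url_bytes(base, biggest)

    while worst_case() > limit:
        # Candidates are chunks that can still be split (more than one value).
        candidates = [
            (len(",".join(chunk)), name, i)
            for name, chunks in axes.items()
            for i, chunk in enumerate(chunks)
            if len(chunk) > 1
        ]
        if not candidates:
            raise ValueError("request cannot be split further")
        _, name, i = max(candidates)
        chunk = axes[name].pop(i)
        mid = len(chunk) // 2
        axes[name][i:i] = [chunk[:mid], chunk[mid:]]
    return axes


def sub_requests(axes: Axes) -> Iterator[Dict[str, List[str]]]:
    """Yield one argument dict per element of the cartesian product."""
    names = list(axes)
    for combo in product(*(axes[n] for n in names)):
        yield dict(zip(names, combo))
```

Each dict yielded by `sub_requests` would correspond to one sub-request; the `ValueError` plays the role that `RequestTooLarge` plays in the PR, firing only when nothing is left to split.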

Behavior changes (documented in NEWS.md)

  • BaseMetadata.url still reflects the user's original query.
  • BaseMetadata.header now carries the last page / sub-request headers (so x-ratelimit-remaining is current) rather than the first.
  • BaseMetadata.query_time is now cumulative wall-clock across every page / sub-request rather than the first page's elapsed.

Module layout

  • New dataretrieval.waterdata.chunking: joint planner, exception hierarchy, ChunkPlan, ChunkedCall, multi_value_chunked decorator, shared-session plumbing.
  • dataretrieval.waterdata.utils: paginated-loop body unified behind a _paginate(req, parse_response, follow_up, client) strategy helper shared by _walk_pages and get_stats_data.
  • dataretrieval.waterdata.filters: slimmed to a CQL-parsing leaf; URL-budget and filter-chunking logic moved into the joint planner.

Test plan

  • 51 new unit tests in tests/waterdata_chunking_test.py covering the planner, axis extraction, cartesian-product enumeration, rate-limit gating, resume idempotency and equivalence, transient-error classification, shared-session reuse.
  • URL-construction stress test against the real _construct_api_requests builder: 500 USGS site IDs × 20 datetime OR-clauses, asserting every sub-request URL stays under 8000 bytes.
  • Regression tests for the pre-merge code review's bug fixes (empty-frame GDF preservation, single-frame aliasing, iter_sub_args passthrough copy, quota check on resume, RequestExceedsQuota.call handle, missing-features defensiveness).
  • Mid-pagination 429 / 5xx covered for both the OGC and stats paginators.
  • CI matrix (Ubuntu / Windows × Python 3.9 / 3.13 / 3.14) all green.
  • End-to-end live API verification against get_monitoring_locations(state_name="Ohio") → get_daily(...): 2,888 sites → ~52 KB of comma-joined IDs (6.5× the byte limit) → transparently chunked, 1,444 rows returned in ~7s.

🤖 Generated with Claude Code


@thodson-usgs thodson-usgs left a comment


Pay close attention to the layout: are all variables and functions placed logically in their modules, or has the logic been mixed up?

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment on lines +125 to +130
class QuotaExhausted(RuntimeError):
    """Raised mid-chunked-call when the API's reported remaining quota
    (``x-ratelimit-remaining`` header) drops below the configured safety
    floor. The chunker stops before issuing the next sub-request to
    avoid a mid-call HTTP 429 that would silently truncate paginated
    results.

This seems like a bug. A mid-call HTTP 429 should not silently truncate. If it does, fix it, then we won't need to defend against this case.

Comment thread dataretrieval/waterdata/filters.py Outdated
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 18, 2026
…r helpers, clarify docs

Three review responses bundled together:

- chunking.py module docstring: define ``k`` as the candidate filter
  chunk count before using it in the planner description.
- ``QuotaExhausted`` docstring: drop the stale "silently truncate"
  framing. PR DOI-USGS#273 / DOI-USGS#279 already raise on a mid-pagination 429, so
  this exception is the structured-recovery alternative (partial
  frames in hand) rather than a defense against silent truncation.
- Move chunker-only orphans from filters.py to chunking.py:
  ``_WATERDATA_URL_BYTE_LIMIT`` (the URL byte ceiling),
  ``_FetchOnce`` TypeVar, ``_combine_chunk_frames``, and
  ``_combine_chunk_responses``. filters.py was a leftover home from
  the pre-unification two-decorator stack; these helpers have no
  callers outside the chunker. Test ``test_multi_value_chunked_lazy_url_limit``
  now monkeypatches the constant on its new module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 18, 2026
Three test docstrings/comments still framed their reasoning against the
removed two-decorator stack (PR DOI-USGS#283 unified them). Reword to describe
the current joint-planner behavior on its own terms:

- ``test_plan_joint_fans_out_filter_when_list_alone_cannot_fit``: drop
  the "previous two-decorator design" aside.
- ``test_chunkable_params_skips_filter_passed_as_list``: rewrite the
  "inner filters.chunked is the only place that may shrink filter"
  line to point at ``_plan_joint``.
- ``stress_chunker._bail_floor_baseline``: reframe the baseline as
  "degenerate singleton plan" rather than "worst case the old
  two-decorator design produced."

No behavioral changes; prose only. Chunker tests + offline stress
test still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@thodson-usgs thodson-usgs left a comment


Please fix my comments

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment on lines +97 to +101
# When ``x-ratelimit-remaining`` drops below this between sub-requests,
# the chunker bails with ``QuotaExhausted`` rather than risk a mid-call
# HTTP 429. Carries the partial result so callers can resume from a
# known offset instead of retrying the whole chunked call from scratch.
_DEFAULT_QUOTA_SAFETY_FLOOR = 50

After retrieving the first chunk, just check x-ratelimit-remaining. If the rest of the plan will not fit within our current rate limit, bail and return an error message saying that the query would exceed our rate limit, and report by how much.

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment thread dataretrieval/waterdata/chunking.py Outdated
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 19, 2026
…mic rate-limit gate

Addresses PR DOI-USGS#283 review feedback. The static caps
(``_DEFAULT_MAX_CHUNKS=1000``, ``_DEFAULT_QUOTA_SAFETY_FLOOR=50``) and
the matching ``max_chunks`` / ``quota_safety_floor`` decorator
parameters are replaced by a quota check that runs after the first
sub-request, using the real ``x-ratelimit-remaining`` value rather
than a guessed cap.

Behavior:

- After the first sub-request the wrapper reads
  ``x-ratelimit-remaining``. If the rest of the plan won't fit in
  the current rate-limit window, it raises a new
  ``RequestExceedsQuota(ValueError)`` carrying ``planned_chunks``,
  ``available``, and ``deficit`` so the message reports exactly how
  far over budget the call is. The first chunk has already been
  issued; the wrapper stops there rather than burn the rest of the
  quota on a call that will fail mid-way.

- ``QuotaExhausted`` is now triggered only when an actual HTTP 429
  propagates from a sub-request (detected by walking ``__cause__``
  for ``RuntimeError("429: ...")``, the shape ``_raise_for_non_200``
  produces and ``_walk_pages`` wraps). A single-process caller
  should not normally see this — ``RequestExceedsQuota``
  short-circuits in chunk 1; arrival here implies a concurrent
  consumer drained the bucket faster than predicted. Carries the
  partial frame for resume. ``partial_response`` becomes ``None``
  when the 429 hits chunk 0 (no banked responses).

- A non-429 ``RuntimeError`` (e.g. 500) propagates unchanged so the
  real cause surfaces to the caller.

- When the server doesn't echo ``x-ratelimit-remaining``,
  ``_read_remaining`` returns ``_QUOTA_UNKNOWN``; the wrapper skips
  the post-first-chunk quota check (no signal → don't synthesize a
  block).

Planner: ``_plan_list_chunks`` / ``_plan_joint`` no longer carry a
``max_chunks`` cap. ``RequestTooLarge`` fires only when nothing more
can be split (the genuine URL-byte floor). The rate-limit gate
replaces the static cap.

Module docstring rewritten to summarize the current design (joint
planning + dynamic quota gate); historical PR 233 / two-decorator
references dropped.

Tests: ten obsolete cap/floor tests removed; eight new tests added
covering ``RequestExceedsQuota`` after chunk 0, deficit reporting,
the no-header skip path, mid-call 429 → ``QuotaExhausted`` with
partial frame, the first-chunk 429 (partial_response=None) edge
case, and non-429 ``RuntimeError`` pass-through.

``_fetch_once`` in ``utils.py`` calls the decorator with defaults
only, so no call-site changes are needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
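
The post-first-chunk quota check described in this commit can be sketched roughly as follows. This is an illustration, not the library's code: `read_remaining` and `check_quota` are hypothetical helpers, the exception is a toy re-creation that only borrows the attribute names from the commit message, and the sketch assumes each pending sub-request costs exactly one request (paginated sub-requests may cost more). Treating `planned_chunks` as the number of sub-requests still pending is also an assumption.

```python
from typing import Mapping, Optional


class RequestExceedsQuota(ValueError):
    """Toy version: the remaining plan does not fit the rate-limit window."""

    def __init__(self, planned_chunks: int, available: int) -> None:
        self.planned_chunks = planned_chunks
        self.available = available
        self.deficit = planned_chunks - available
        super().__init__(
            f"{planned_chunks} sub-requests remain but only {available} "
            f"are available in this window (over by {self.deficit})"
        )


def read_remaining(headers: Mapping[str, str]) -> Optional[int]:
    """Return x-ratelimit-remaining as an int, or None if absent or malformed."""
    lowered = {key.lower(): value for key, value in headers.items()}
    try:
        return int(lowered["x-ratelimit-remaining"])
    except (KeyError, ValueError):
        return None


def check_quota(headers: Mapping[str, str], chunks_left: int) -> None:
    """Raise if the rest of the plan cannot fit; do nothing without a signal."""
    remaining = read_remaining(headers)
    if remaining is None:
        return  # server did not report a quota, so do not invent a block
    if chunks_left > remaining:
        raise RequestExceedsQuota(chunks_left, remaining)
```

Returning early when the header is missing mirrors the "no signal, don't synthesize a block" rule stated in the commit message.
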
@thodson-usgs

Pushed 01e579e reworking the quota machinery per all the inline comments. Summary:

Module docstring (line 8 comment): tightened. No more PR 233 / two-decorator references — it just describes the current design (joint planner + dynamic quota gate).

_DEFAULT_MAX_CHUNKS / _DEFAULT_QUOTA_SAFETY_FLOOR (line 95, 101 comments): both deleted. After the first sub-request the wrapper reads x-ratelimit-remaining directly; if the remaining plan won't fit in the current window, it raises a new RequestExceedsQuota(ValueError) carrying planned_chunks, available, and deficit so the message reports exactly how far over budget the call is.

QuotaExhausted (line 144 comment): rewritten. It now fires only when an actual HTTP 429 propagates from a sub-request (detected by walking __cause__ for RuntimeError("429: ..."), the shape _raise_for_non_200 produces). The docstring states explicitly that a single-process caller should not normally see this — RequestExceedsQuota short-circuits in chunk 1; arrival here implies a concurrent consumer drained the bucket faster than predicted. Carries the partial frame for resume. partial_response is None when the 429 hits chunk 0.

max_chunks / quota_safety_floor decorator params (line 552, 564 comments): removed. _plan_list_chunks and _plan_joint no longer carry a max_chunks cap; RequestTooLarge fires only on the genuine "nothing left to split" floor. The rate-limit gate replaces the static cap.

Side effects: _fetch_once in utils.py already called the decorator with defaults only, so no call-site changes were needed. Ten obsolete cap/floor tests were removed and eight new ones added covering RequestExceedsQuota after chunk 0, deficit reporting, the no-header skip path, mid-call 429 → QuotaExhausted with partial frame, the first-chunk 429 (partial_response=None) edge case, and non-429 RuntimeError pass-through. All chunker tests + offline stress test pass.
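
For readers unfamiliar with exception chaining, the `__cause__` walk mentioned above might look something like the sketch below. The helper names `find_status_in_chain` and `is_rate_limited` are made up for this example, and the only thing taken from the comment is the assumed message shape `RuntimeError("429: ...")`.

```python
from typing import Optional


def find_status_in_chain(exc: BaseException) -> Optional[int]:
    """Walk __cause__ links looking for a RuntimeError shaped like '429: ...'."""
    seen = set()
    current: Optional[BaseException] = exc
    # Track visited exceptions so a cyclic chain cannot loop forever.
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        if isinstance(current, RuntimeError):
            head = str(current).split(":", 1)[0].strip()
            if head.isdecimal():
                return int(head)
        current = current.__cause__
    return None


def is_rate_limited(exc: BaseException) -> bool:
    """True when some RuntimeError in the chain reports HTTP 429."""
    return find_status_in_chain(exc) == 429
```

A wrapper that re-raises with `raise ... from inner` sets `__cause__`, which is why the original status code is still recoverable after a paginator wraps the error.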


@thodson-usgs thodson-usgs left a comment


fix these things

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment on lines +439 to +440
@classmethod
def from_args(

I think it would be cleaner to refactor this into the class's `__init__`.

Comment thread dataretrieval/waterdata/chunking.py Outdated
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 21, 2026
…or, axis-symmetric docstring

Addresses three PR DOI-USGS#283 review comments:

- **Module docstring reframed for axis symmetry.** The previous text
  read as "filter is the outer loop, list dims are inner," which
  obscured that both axis kinds are chunkable dimensions. The new
  framing leads with "every multi-value list parameter and the filter
  are chunkable axes" and explains *why* the algorithm enumerates
  filter counts in the outer loop (filter chunking is discrete in
  OR-clause cardinality; list dims are continuously halvable) rather
  than presenting the asymmetry as arbitrary.

- **``ChunkPlan.from_args`` → ``ChunkPlan.__init__``.** Now that the
  passthrough case is just a trivial plan (never ``None``), the
  classmethod-constructor pattern was unjustified. ``__init__`` does
  the planning directly: ``ChunkPlan(args, build_request, url_limit)``
  reads as "construct a plan for these args." Dropped ``@dataclass``;
  the fields are still simple attributes, just assigned in ``__init__``.
  Extracted the search loop to a free helper ``_search_best_chunking``
  so ``__init__`` stays readable.

- **``_ChunkExecution`` → ``_ChunkExecutor``.** Classes should be nouns;
  "Execution" reads as an event, "Executor" as an actor. Pairs cleanly
  with ``ChunkPlan`` — the plan is the recipe, the executor runs it.

The wrapper is unchanged in shape:

    return ChunkPlan(args, build_request, limit).execute(fetch_once)

Tests updated to use the direct constructor; all 145 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 22, 2026
Replaces ``requests`` with ``httpx`` package-wide and adds an async
parallel branch to the multi-value chunker, governed by
``API_USGS_CONCURRENT`` (parallel-by-default; set ``=1`` for the
legacy sequential path). Benchmarked at ~5.3x speedup on a 52k-site /
10-state ``get_daily`` call.

The parallel fan-out runs on a single shared ``httpx.AsyncClient`` so
sub-requests amortize one TCP+TLS handshake — impossible with the
sync ``requests`` stack without a thread pool. Built on top of the
``ChunkPlan`` / ``ChunkedCall`` arch from DOI-USGS#283: the sync path drives
``ChunkedCall.resume()`` (resumable, with ``ChunkInterrupted``
guarantees); the parallel path uses ``_fan_out_async`` to iterate the
same plan via ``asyncio.gather`` + ``Semaphore``. Both paths publish
their client via ``ContextVar`` so ``_walk_pages`` and
``get_stats_data`` reuse one client across sub-requests.

Backwards-compat: ``BaseMetadata.header`` is now ``httpx.Headers``
(case-insensitive dict reads still work; literal dict equality breaks
because ``httpx.Headers`` carries auto-added entries like ``host``).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
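
The `asyncio.gather` plus `Semaphore` pattern this commit refers to can be illustrated with a generic, network-free sketch. The function `fan_out` and its parameters are hypothetical and are not the `_fan_out_async` helper itself; a real version would pass a coroutine that issues requests on the shared async client.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def fan_out(
    items: Sequence[T],
    fetch: Callable[[T], Awaitable[R]],
    max_concurrent: int,
) -> List[R]:
    """Run fetch over items with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(item: T) -> R:
        # Each task waits here until one of the concurrency slots frees up.
        async with semaphore:
            return await fetch(item)

    # gather preserves input order, so results line up with the plan.
    return await asyncio.gather(*(bounded(item) for item in items))
```

Because `gather` returns results in the order the awaitables were passed, the combined output does not depend on which sub-request happens to finish first.
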
@thodson-usgs thodson-usgs changed the title from "refactor(waterdata): Unify list and filter chunkers into one joint planner" to "feat(waterdata): Auto-chunk OGC requests over the URL byte limit" May 22, 2026
@thodson-usgs thodson-usgs force-pushed the chunker-unified branch 6 times, most recently from 4e2e194 to 738cd93 May 23, 2026 22:38
@thodson-usgs thodson-usgs marked this pull request as ready for review May 23, 2026 22:43
The OGC `waterdata` getters previously failed with HTTP 414 when the
request URL exceeded the server's ~8 KB byte limit. A common pattern
— pulling a long site list from `get_monitoring_locations` and
feeding it into `get_daily` — was the main offender:

    sites_df, _ = get_monitoring_locations(state_name="Ohio")
    df, md = get_daily(
        monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
        parameter_code="00060",
        time="P7D",
    )

Introduces a joint chunker that models every multi-value list
parameter and the cql-text `filter` (split on top-level `OR`) as a
chunkable axis. Greedy halving splits the biggest chunk across all
axes until each sub-request URL fits; the chunker fans out under the
hood and returns one combined DataFrame. Callers see no API change.
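
Splitting a filter on its top-level `OR` means ignoring any `OR` that sits inside parentheses or inside a quoted literal. The sketch below shows one way to do that; `split_top_level_or` is a hypothetical helper, not the parser in `dataretrieval.waterdata.filters`. It assumes single-quoted literals with SQL-style doubled quotes for escaping, and it only recognizes `OR` when surrounded by whitespace.

```python
from typing import List


def split_top_level_or(expr: str) -> List[str]:
    """Split a CQL-text expression on OR keywords outside parens and quotes."""
    clauses: List[str] = []
    depth = 0
    in_quote = False
    start = 0
    i = 0
    while i < len(expr):
        ch = expr[i]
        if ch == "'":
            # Assumed escape rule: a doubled quote inside a literal.
            if in_quote and expr[i + 1 : i + 2] == "'":
                i += 2
                continue
            in_quote = not in_quote
        elif not in_quote:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif (
                depth == 0
                and i > 0
                and expr[i - 1].isspace()
                and expr[i : i + 2].upper() == "OR"
                and expr[i + 2 : i + 3].isspace()
            ):
                # Found a standalone OR at nesting depth zero.
                clauses.append(expr[start:i].strip())
                start = i + 2
                i += 2
                continue
        i += 1
    clauses.append(expr[start:].strip())
    return clauses
```

Each returned clause can then be treated as one unit on the filter axis, with a chunk of clauses rejoined by `OR` to form a sub-request's filter.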

Mid-stream 429 / 5xx surface as `ChunkInterrupted` subclasses
(`QuotaExhausted` / `ServiceInterrupted`) carrying the partial result
plus a `.call` resumable handle — `exc.call.resume()` continues only
the still-pending sub-requests. Pre-emptive `RequestExceedsQuota`
catches plans that won't fit the remaining rate-limit window;
`API_USGS_LIMIT=0` bypasses the check.
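
The resume contract (only still-pending sub-requests are re-issued) can be shown with a toy model. The classes below reuse the name `ChunkInterrupted` and the `.call.resume()` shape from this message, but they are simplified re-creations for illustration; `ResumableCall`, its attributes, and the list-of-ints "frames" are all assumptions.

```python
from typing import Callable, Dict, List, Sequence


class ChunkInterrupted(RuntimeError):
    """Toy version: carries the partial result and a resumable handle."""

    def __init__(self, call: "ResumableCall", cause: Exception) -> None:
        super().__init__(
            f"interrupted after {len(call.done)} sub-requests: {cause}"
        )
        self.call = call
        self.partial = call.combined()


class ResumableCall:
    """Runs a fixed plan of sub-requests, remembering which ones finished."""

    def __init__(
        self, plan: Sequence[str], fetch: Callable[[str], List[int]]
    ) -> None:
        self.plan = list(plan)
        self.fetch = fetch
        self.done: Dict[int, List[int]] = {}

    def combined(self) -> List[int]:
        # Concatenate finished results in plan order.
        return [row for i in sorted(self.done) for row in self.done[i]]

    def resume(self) -> List[int]:
        for i, item in enumerate(self.plan):
            if i in self.done:
                continue  # already fetched, so it is not re-issued
            try:
                self.done[i] = self.fetch(item)
            except Exception as exc:
                raise ChunkInterrupted(self, exc) from exc
        return self.combined()
```

A caller would wrap the first `resume()` in `try` / `except ChunkInterrupted as exc`, wait for the rate-limit window to reset, and then call `exc.call.resume()` to pick up where the plan stopped.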

Behavior changes for paginated / chunked calls:

- `BaseMetadata.url` still reflects the user's original query.
- `BaseMetadata.header` now carries the LAST page's headers so
  `x-ratelimit-remaining` is current (was: first page's).
- `BaseMetadata.query_time` is now cumulative wall-clock across pages
  (was: first page's elapsed).

Mirrors R `dataRetrieval`'s
[#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@thodson-usgs thodson-usgs merged commit 092f1b0 into DOI-USGS:main May 23, 2026
8 checks passed