feat(waterdata): Auto-chunk OGC requests over the URL byte limit #283
thodson-usgs
left a comment
Pay close attention to the layout: are all variables and functions placed logically into their modules, or has the logic been mixed up?
    class QuotaExhausted(RuntimeError):
        """Raised mid-chunked-call when the API's reported remaining quota
        (``x-ratelimit-remaining`` header) drops below the configured safety
        floor. The chunker stops before issuing the next sub-request to
        avoid a mid-call HTTP 429 that would silently truncate paginated
        results.
This seems like a bug. A mid-call HTTP 429 should not silently truncate. If it does, fix it, then we won't need to defend against this case.
…r helpers, clarify docs

Three review responses bundled together:

- chunking.py module docstring: define ``k`` as the candidate filter chunk count before using it in the planner description.
- ``QuotaExhausted`` docstring: drop the stale "silently truncate" framing. PR DOI-USGS#273 / DOI-USGS#279 already raise on a mid-pagination 429, so this exception is the structured-recovery alternative (partial frames in hand) rather than a defense against silent truncation.
- Move chunker-only orphans from filters.py to chunking.py: ``_WATERDATA_URL_BYTE_LIMIT`` (the URL byte ceiling), ``_FetchOnce`` TypeVar, ``_combine_chunk_frames``, and ``_combine_chunk_responses``. filters.py was a leftover home from the pre-unification two-decorator stack; these helpers have no callers outside the chunker. Test ``test_multi_value_chunked_lazy_url_limit`` now monkeypatches the constant on its new module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three test docstrings/comments still framed their reasoning against the removed two-decorator stack (PR DOI-USGS#283 unified them). Reword to describe the current joint-planner behavior on its own terms:

- ``test_plan_joint_fans_out_filter_when_list_alone_cannot_fit``: drop the "previous two-decorator design" aside.
- ``test_chunkable_params_skips_filter_passed_as_list``: rewrite the "inner filters.chunked is the only place that may shrink filter" line to point at ``_plan_joint``.
- ``stress_chunker._bail_floor_baseline``: reframe the baseline as "degenerate singleton plan" rather than "worst case the old two-decorator design produced."

No behavioral changes; prose only. Chunker tests + offline stress test still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thodson-usgs
left a comment
Please fix my comments
    # When ``x-ratelimit-remaining`` drops below this between sub-requests,
    # the chunker bails with ``QuotaExhausted`` rather than risk a mid-call
    # HTTP 429. Carries the partial result so callers can resume from a
    # known offset instead of retrying the whole chunked call from scratch.
    _DEFAULT_QUOTA_SAFETY_FLOOR = 50
After retrieving the first chunk, just check x-ratelimit-remaining. If the plan will not fit within our current rate limit, bail and return an error message saying that the query would exceed our rate limit, and report by how much.
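A minimal sketch of the gate this comment asks for, under stated assumptions: the `x-ratelimit-remaining` value read after the first response already accounts for that request, and each remaining chunk costs one request. The exception and field names follow the wording of the follow-up commit message; the helper names `read_remaining` and `check_quota_after_first_chunk` and the arithmetic are illustrative, not the PR's code.

```python
from typing import Mapping, Optional


class RequestExceedsQuota(ValueError):
    """Illustrative error: the rest of the plan will not fit the rate-limit window."""

    def __init__(self, planned_chunks: int, available: int) -> None:
        self.planned_chunks = planned_chunks
        self.available = available
        # Chunks still to issue after the first, minus what the window allows.
        self.deficit = (planned_chunks - 1) - available
        super().__init__(
            f"query needs {planned_chunks - 1} more sub-requests but only "
            f"{available} remain in the rate-limit window (over by {self.deficit})"
        )


def read_remaining(headers: Mapping[str, str]) -> Optional[int]:
    """Return the server-reported remaining quota, or None if the header is absent."""
    raw = headers.get("x-ratelimit-remaining")
    return int(raw) if raw is not None else None


def check_quota_after_first_chunk(planned_chunks: int, headers: Mapping[str, str]) -> None:
    """Raise if the chunks left after the first one exceed the remaining quota."""
    remaining = read_remaining(headers)
    if remaining is None:
        return  # no signal from the server, so do not block the call
    if planned_chunks - 1 > remaining:
        raise RequestExceedsQuota(planned_chunks, remaining)
```

Reporting the deficit in the message lets a caller decide whether to narrow the query or wait for the window to reset.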
…mic rate-limit gate

Addresses PR DOI-USGS#283 review feedback. The static caps (``_DEFAULT_MAX_CHUNKS=1000``, ``_DEFAULT_QUOTA_SAFETY_FLOOR=50``) and the matching ``max_chunks`` / ``quota_safety_floor`` decorator parameters are replaced by a quota check that runs after the first sub-request, using the real ``x-ratelimit-remaining`` value rather than a guessed cap.

Behavior:

- After the first sub-request the wrapper reads ``x-ratelimit-remaining``. If the rest of the plan won't fit in the current rate-limit window, it raises a new ``RequestExceedsQuota(ValueError)`` carrying ``planned_chunks``, ``available``, and ``deficit`` so the message reports exactly how far over budget the call is. The first chunk has already been issued; the wrapper stops there rather than burn the rest of the quota on a call that will fail mid-way.
- ``QuotaExhausted`` is now triggered only when an actual HTTP 429 propagates from a sub-request (detected by walking ``__cause__`` for ``RuntimeError("429: ...")``, the shape ``_raise_for_non_200`` produces and ``_walk_pages`` wraps). A single-process caller should not normally see this: ``RequestExceedsQuota`` short-circuits in chunk 1, so arrival here implies a concurrent consumer drained the bucket faster than predicted. Carries the partial frame for resume. ``partial_response`` becomes ``None`` when the 429 hits chunk 0 (no banked responses).
- A non-429 ``RuntimeError`` (e.g. 500) propagates unchanged so the real cause surfaces to the caller.
- When the server doesn't echo ``x-ratelimit-remaining``, ``_read_remaining`` returns ``_QUOTA_UNKNOWN``; the wrapper skips the post-first-chunk quota check (no signal → don't synthesize a block).

Planner: ``_plan_list_chunks`` / ``_plan_joint`` no longer carry a ``max_chunks`` cap. ``RequestTooLarge`` fires only when nothing more can be split (the genuine URL-byte floor). The rate-limit gate replaces the static cap.

Module docstring rewritten to summarize the current design (joint planning + dynamic quota gate); historical PR 233 / two-decorator references dropped.

Tests: ten obsolete cap/floor tests removed; eight new tests added covering ``RequestExceedsQuota`` after chunk 0, deficit reporting, the no-header skip path, mid-call 429 → ``QuotaExhausted`` with partial frame, the first-chunk 429 (partial_response=None) edge case, and non-429 ``RuntimeError`` pass-through.

``_fetch_once`` in ``utils.py`` calls the decorator with defaults only, so no call-site changes are needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
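The 429 detection described in this commit (walking ``__cause__`` for a ``RuntimeError("429: ...")``) could look roughly like the sketch below. The function name `find_429` is hypothetical, and the only assumption about the message shape is that it starts with ``429``, as the commit text suggests.

```python
from typing import Optional


def find_429(exc: Optional[BaseException]) -> Optional[RuntimeError]:
    """Walk the __cause__ chain; return the first RuntimeError whose message starts with '429'."""
    seen = set()  # guards against a cyclic cause chain
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        if isinstance(exc, RuntimeError) and str(exc).startswith("429"):
            return exc
        exc = exc.__cause__
    return None
```

A wrapper could call this in its ``except RuntimeError`` handler: a non-``None`` result would map to ``QuotaExhausted``, while anything else (a 500, for instance) would be re-raised unchanged.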
Pushed Module docstring (line 8 comment): tightened. No more PR 233 / two-decorator references — it just describes the current design (joint planner + dynamic quota gate).
Side effects: |
1d148f3 to f615db8
thodson-usgs
left a comment
fix these things
    @classmethod
    def from_args(
I think it would be cleaner to refactor this into the class's `__init__`.
…or, axis-symmetric docstring

Addresses three PR DOI-USGS#283 review comments:

- **Module docstring reframed for axis symmetry.** The previous text read as "filter is the outer loop, list dims are inner," which obscured that both axis kinds are chunkable dimensions. The new framing leads with "every multi-value list parameter and the filter are chunkable axes" and explains *why* the algorithm enumerates filter counts in the outer loop (filter chunking is discrete in OR-clause cardinality; list dims are continuously halvable) rather than presenting the asymmetry as arbitrary.
- **``ChunkPlan.from_args`` → ``ChunkPlan.__init__``.** Now that the passthrough case is just a trivial plan (never ``None``), the classmethod-constructor pattern was unjustified. ``__init__`` does the planning directly: ``ChunkPlan(args, build_request, url_limit)`` reads as "construct a plan for these args." Dropped ``@dataclass``; the fields are still simple attributes, just assigned in ``__init__``. Extracted the search loop to a free helper ``_search_best_chunking`` so ``__init__`` stays readable.
- **``_ChunkExecution`` → ``_ChunkExecutor``.** Classes should be nouns; "Execution" reads as an event, "Executor" as an actor. Pairs cleanly with ``ChunkPlan``: the plan is the recipe, the executor runs it.

The wrapper is unchanged in shape:

    return ChunkPlan(args, build_request, limit).execute(fetch_once)

Tests updated to use the direct constructor; all 145 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces ``requests`` with ``httpx`` package-wide and adds an async parallel branch to the multi-value chunker, governed by ``API_USGS_CONCURRENT`` (parallel-by-default; set ``=1`` for the legacy sequential path). Benchmarked at ~5.3x speedup on a 52k-site / 10-state ``get_daily`` call.

The parallel fan-out runs on a single shared ``httpx.AsyncClient`` so sub-requests amortize one TCP+TLS handshake, which is impossible with the sync ``requests`` stack without a thread pool.

Built on top of the ``ChunkPlan`` / ``ChunkedCall`` arch from DOI-USGS#283: the sync path drives ``ChunkedCall.resume()`` (resumable, with ``ChunkInterrupted`` guarantees); the parallel path uses ``_fan_out_async`` to iterate the same plan via ``asyncio.gather`` + ``Semaphore``. Both paths publish their client via ``ContextVar`` so ``_walk_pages`` and ``get_stats_data`` reuse one client across sub-requests.

Backwards-compat: ``BaseMetadata.header`` is now ``httpx.Headers`` (case-insensitive dict reads still work; literal dict equality breaks because ``httpx.Headers`` carries auto-added entries like ``host``).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
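The ``asyncio.gather`` + ``Semaphore`` pattern named in this commit can be sketched with the standard library alone. This is not the PR's ``_fan_out_async``: the names `fan_out`, `fetch`, and `max_concurrent` are illustrative, and the HTTP client is left out so the bounded-concurrency logic stands on its own.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def fan_out(
    items: Sequence[T],
    fetch: Callable[[T], Awaitable[R]],
    max_concurrent: int,
) -> List[R]:
    """Run fetch over items with at most max_concurrent in flight; results keep input order."""
    gate = asyncio.Semaphore(max_concurrent)

    async def guarded(item: T) -> R:
        # The semaphore caps how many fetches run at once.
        async with gate:
            return await fetch(item)

    # gather returns results in the order the awaitables were passed in.
    return list(await asyncio.gather(*(guarded(item) for item in items)))
```

Because ``gather`` preserves argument order, the combined result can be assembled in plan order regardless of which sub-request finishes first.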
1829cb1 to e51beef
4e2e194 to 738cd93
The OGC `waterdata` getters previously failed with HTTP 414 when the
request URL exceeded the server's ~8 KB byte limit. A common pattern
— pulling a long site list from `get_monitoring_locations` and
feeding it into `get_daily` — was the main offender:
    sites_df, _ = get_monitoring_locations(state_name="Ohio")
    df, md = get_daily(
        monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
        parameter_code="00060",
        time="P7D",
    )
Introduces a joint chunker that models every multi-value list
parameter and the cql-text `filter` (split on top-level `OR`) as a
chunkable axis. Greedy halving splits the biggest chunk across all
axes until each sub-request URL fits; the chunker fans out under the
hood and returns one combined DataFrame. Callers see no API change.
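As a rough illustration of greedy halving across several axes, the sketch below repeatedly splits the largest chunk until the worst-case sub-request URL fits a byte limit. It is a simplified stand-in rather than the PR's planner: `BASE_URL`, `url_len`, `plan_chunks`, and `iter_sub_requests` are hypothetical names, every axis is treated as a comma-joined list, and the `filter` axis is not modeled separately.

```python
from itertools import product
from typing import Dict, Iterator, List
from urllib.parse import quote_plus, urlencode

BASE_URL = "https://example.invalid/collections/daily/items"  # placeholder endpoint

Params = Dict[str, List[str]]
Plan = Dict[str, List[List[str]]]


class RequestTooLarge(ValueError):
    """Raised when no axis can be split further and the URL still does not fit."""


def url_len(params: Params) -> int:
    """Byte length of the URL built from one chunk per axis."""
    query = urlencode({name: ",".join(values) for name, values in params.items()})
    return len(f"{BASE_URL}?{query}".encode("utf-8"))


def _cost(chunk: List[str]) -> int:
    """Encoded length this chunk contributes to the query string."""
    return len(quote_plus(",".join(chunk)))


def plan_chunks(axes: Params, limit: int) -> Plan:
    """Halve the costliest chunk until the worst-case sub-request fits under limit."""
    plan: Plan = {name: [list(values)] for name, values in axes.items()}
    while True:
        # The longest URL pairs the costliest chunk of every axis.
        worst = {name: max(chunks, key=_cost) for name, chunks in plan.items()}
        if url_len(worst) <= limit:
            return plan
        splittable = [name for name, chunk in worst.items() if len(chunk) > 1]
        if not splittable:
            raise RequestTooLarge(f"cannot fit the request under {limit} bytes")
        name = max(splittable, key=lambda n: _cost(worst[n]))
        chunk = worst[name]
        at = plan[name].index(chunk)
        mid = len(chunk) // 2
        plan[name][at : at + 1] = [chunk[:mid], chunk[mid:]]


def iter_sub_requests(plan: Plan) -> Iterator[Params]:
    """Yield one parameter set per element of the cartesian product of chunks."""
    names = list(plan)
    for combo in product(*(plan[name] for name in names)):
        yield dict(zip(names, combo))
```

Checking only the worst-case combination is enough here because the URL length grows with each axis's chunk cost independently, so no other combination can be longer.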
Mid-stream 429 / 5xx surface as `ChunkInterrupted` subclasses
(`QuotaExhausted` / `ServiceInterrupted`) carrying the partial result
plus a `.call` resumable handle — `exc.call.resume()` continues only
the still-pending sub-requests. Pre-emptive `RequestExceedsQuota`
catches plans that won't fit the remaining rate-limit window;
`API_USGS_LIMIT=0` bypasses the check.
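The resume contract can be illustrated with stand-in classes. The sketch below reuses the names `ChunkInterrupted` and `ChunkedCall` only for readability; the constructor signatures, the `partial` attribute, and the list-of-rows result are assumptions made for the example, not the library's API.

```python
from typing import Callable, Dict, List


class ChunkInterrupted(RuntimeError):
    """Stand-in: carries the partial result and a handle for resuming."""

    def __init__(self, message: str, partial: List[int], call: "ChunkedCall") -> None:
        super().__init__(message)
        self.partial = partial
        self.call = call


class ChunkedCall:
    """Stand-in resumable call that remembers which sub-requests already finished."""

    def __init__(self, chunks: List[str], fetch: Callable[[str], List[int]]) -> None:
        self._chunks = chunks
        self._fetch = fetch
        self._done: Dict[int, List[int]] = {}

    def _combined(self) -> List[int]:
        # Concatenate finished chunks in plan order.
        return [row for i in sorted(self._done) for row in self._done[i]]

    def resume(self) -> List[int]:
        for i, chunk in enumerate(self._chunks):
            if i in self._done:
                continue  # finished earlier, so it is not re-issued
            try:
                self._done[i] = self._fetch(chunk)
            except RuntimeError as err:
                raise ChunkInterrupted(str(err), self._combined(), self) from err
        return self._combined()
```

A caller would catch the exception, inspect `exc.partial` if the partial data is already useful, wait for the quota window or outage to clear, and then call `exc.call.resume()`.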
Behavior changes for paginated / chunked calls:
- `BaseMetadata.url` still reflects the user's original query.
- `BaseMetadata.header` now carries the LAST page's headers so
`x-ratelimit-remaining` is current (was: first page's).
- `BaseMetadata.query_time` is now cumulative wall-clock across pages
(was: first page's elapsed).
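A small sketch of how these three fields could be combined across pages. `PageResult`, `CombinedMetadata`, and `combine_metadata` are hypothetical names, and summing per-page elapsed times is used here as an approximation of cumulative wall-clock time; the library may measure it differently.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict, List


@dataclass
class PageResult:
    """Hypothetical record of one page or sub-request."""

    headers: Dict[str, str]
    elapsed: timedelta


@dataclass
class CombinedMetadata:
    url: str
    header: Dict[str, str]
    query_time: timedelta


def combine_metadata(original_url: str, pages: List[PageResult]) -> CombinedMetadata:
    """Keep the user's URL, the last page's headers, and the summed elapsed time."""
    return CombinedMetadata(
        url=original_url,
        header=pages[-1].headers,  # latest headers, so the quota value is current
        query_time=sum((page.elapsed for page in pages), timedelta()),
    )
```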
Mirrors R `dataRetrieval`'s
[#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
738cd93 to 714d3cd
Summary

The OGC `waterdata` getters (`get_daily`, `get_continuous`, `get_field_measurements`, and the rest of the multi-value-capable functions) previously failed with HTTP 414 once the request URL exceeded the server's ~8 KB byte limit. The common chained-query pattern (pull a long site list from `get_monitoring_locations`, then feed it into `get_daily`) was the main offender.

This PR adds a transparent joint chunker: any over-budget request is fanned out across multiple sub-requests under the hood, and one combined DataFrame is returned. Callers see no API change.

Mirrors R `dataRetrieval`'s #870, generalized from one filter axis to N joint axes.

What's new

- Every multi-value list parameter and the cql-text `filter` (split on its top-level `OR` clauses) is modeled as a chunkable axis. Greedy halving splits the biggest chunk across all axes until each sub-request URL fits.
- Mid-stream 429 / 5xx surface as a `ChunkInterrupted` subclass (`QuotaExhausted` or `ServiceInterrupted`) carrying the partial result plus a `.call` handle. Call `exc.call.resume()` once the underlying condition clears; only still-pending sub-requests are re-issued.
- `ChunkedCall` reads `x-ratelimit-remaining` and raises `RequestExceedsQuota` if the rest of the plan won't fit the window. Set `API_USGS_LIMIT=0` to bypass.
- `requests.Session`: opened once per chunked call and published via a `ContextVar` so the paginated-loop helpers downstream reuse the same connection pool across every sub-request (saves one TCP/TLS handshake per sub-request after the first).
- `ChunkPlan`, `ChunkedCall`, `RequestTooLarge`, `RequestExceedsQuota`, `ChunkInterrupted` / `QuotaExhausted` / `ServiceInterrupted`, `RateLimited` / `ServiceUnavailable`, `multi_value_chunked` decorator.

Behavior changes (documented in NEWS.md)

- `BaseMetadata.url` still reflects the user's original query.
- `BaseMetadata.header` now carries the last page / sub-request headers (so `x-ratelimit-remaining` is current) rather than the first.
- `BaseMetadata.query_time` is now cumulative wall-clock across every page / sub-request rather than the first page's elapsed.

Module layout

- `dataretrieval.waterdata.chunking`: joint planner, exception hierarchy, `ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator, shared-session plumbing.
- `dataretrieval.waterdata.utils`: paginated-loop body unified behind a `_paginate(req, parse_response, follow_up, client)` strategy helper shared by `_walk_pages` and `get_stats_data`.
- `dataretrieval.waterdata.filters`: slimmed to a CQL-parsing leaf; URL-budget and filter-chunking logic moved into the joint planner.

Test plan

- `tests/waterdata_chunking_test.py` covering the planner, axis extraction, cartesian-product enumeration, rate-limit gating, resume idempotency and equivalence, transient-error classification, shared-session reuse.
- `_construct_api_requests` builder: 500 USGS site IDs × 20 datetime OR-clauses, asserting every sub-request URL stays under 8000 bytes.
- `iter_sub_args` passthrough copy, quota check on resume, `RequestExceedsQuota.call` handle, missing-features defensiveness.
- `get_monitoring_locations(state_name="Ohio")` → `get_daily(...)`: 2,888 sites → ~52 KB of comma-joined IDs (6.5× the byte limit) → transparently chunked, 1,444 rows returned in ~7s.

🤖 Generated with Claude Code