Skip to content

feat(waterdata): replace per-page logger.info with a single progress line#288

Draft
thodson-usgs wants to merge 12 commits into
DOI-USGS:mainfrom
thodson-usgs:worktree-waterdata-progress-bar
Draft

feat(waterdata): replace per-page logger.info with a single progress line#288
thodson-usgs wants to merge 12 commits into
DOI-USGS:mainfrom
thodson-usgs:worktree-waterdata-progress-bar

Conversation

@thodson-usgs
Copy link
Copy Markdown
Collaborator

@thodson-usgs thodson-usgs commented May 23, 2026

Summary

Paginated and chunked Water Data queries narrate their progress with per-page logger.info calls — Requesting, Next URL, and Remaining requests this hour. These scroll the console one line per page and are silent unless the caller has configured logging.

This replaces them with a single self-updating status line on stderr, rewritten in place as data arrives:

Progress: chunk 2/5 · 14 pages · 8,421 rows · 4,870 requests remaining, resets in 47m

It surfaces the two kinds of work the caller otherwise can't see — CQL-filter chunks (known up front) and next-link pages (count unknown up front) — plus rows accumulated and the requests remaining for the hour (with a countdown to the limit reset, from x-ratelimit-remaining / x-ratelimit-reset, when the server reports them). No new dependency.

Usage

import dataretrieval.waterdata as wd

# Pages through `next` links; progress shows on a single line.
df, md = wd.get_monitoring_locations(state_code="WI")

On an interactive terminal the one line is rewritten in place as each page arrives — these frames overwrite each other (\r), they don't stack:

Progress: 1 page · 10,000 rows · 4,998 requests remaining, resets in 52m
Progress: 2 pages · 20,000 rows · 4,997 requests remaining, resets in 52m
Progress: 3 pages · 24,213 rows · 4,996 requests remaining, resets in 51m

When a long CQL filter is split into URL-safe chunks, the line gains a chunk i/N segment:

Progress: chunk 3/5 · 11 pages · 38,400 rows · 4,990 requests remaining, resets in 49m

If no API key is configured (no API_USGS_PAT), the line appends a one-time pointer to API-key registration (unauthenticated callers hit much lower limits):

Progress: 3 pages · 24,213 rows
No API key detected — register for higher rate limits at https://api.waterdata.usgs.gov/signup/

The line is shown automatically when stderr is an interactive terminal. A source-documented API_USGS_PROGRESS env var (1/0) can force it on/off; per maintainer preference it's intentionally not advertised in the README — only in the module docstring.

What changed

  • New dataretrieval/waterdata/_progress.py — a ProgressReporter (chunks / pages / rows / rate-limit), a progress_context() contextmanager, and a current() accessor. The active reporter lives in a ContextVar so the chunked decorator (chunk counts) and the page-walking loops (page / row / rate-limit counts) both update it without threading a parameter through every signature. On close, if no API_USGS_PAT is set, it prints a one-time API-key-registration pointer.
  • utils.py_walk_pages and get_stats_data report pages + rate-limit; get_ogc_data / get_stats_data wrap their fetch in progress_context(). _next_req_url is now a pure helper. Surviving request-URL logger.info calls drop to logger.debug. The "Geopandas not installed" advisory now logs at WARNING in both the OGC and stats paths (the stats path was previously silent INFO).
  • chunking.pyChunkedCall.resume() reports the total chunk count and per-chunk start (this branch was merged with main's chunking refactor, which moved fan-out out of filters.py into chunking.py; the page/row/rate hooks live in the shared _paginate helper).
  • api.py — the two single-request CSV paths (get_samples, get_samples_summary) drop their Request: logger.info to debug.
  • README.md — the README's "view request URLs and remaining requests at INFO" note is stale after this change (URLs moved to DEBUG, the hourly count is now in the progress line), so it's replaced with an accurate "Tracking progress" section (progress line + a DEBUG-logging pointer for the request URL; no env-var docs).

Logging vs. warnings

Only logger.info (the per-page narration) is replaced by the progress line. The logger.warning / logger.error failure messages stay as logging — Python's "last resort" handler already prints WARNING+ to stderr with no logging setup, so they're visible by default.

Behavior

  • Progress line shown only on an interactive terminal by default, so notebooks, redirected logs, and CI stay clean.
  • The API-key pointer is shown at most once per process, and never when API_USGS_PAT is set.
  • Failure-path logger.warning / logger.error are unchanged.

Tests

  • tests/waterdata_progress_test.py: rendering, no-op-when-disabled, plural/chunk formatting, rate-limit persistence, TTY + env gating, progress_context nesting, the API-key pointer (shown only when unauthenticated, at most once), and a _walk_pages integration check.
  • Tests asserting both code paths warn when geopandas is missing.
  • Full waterdata suite passes; ruff clean.

Open question for review

  • On a failure mid-pagination, a logger.warning routed to stderr can land on the same line as the in-progress status text (cosmetic only). Worth a line-break-before-warning hook, or leave it?

🤖 Generated with Claude Code

thodson-usgs and others added 12 commits May 23, 2026 18:45
…line

Paginated and chunked Water Data queries narrated their progress with
per-page `logger.info` calls ("Requesting", "Next URL", "Remaining
requests this hour"), which scrolled the console and were silent unless
logging was configured.

Replace them with one self-updating status line on stderr, rewritten in
place as data arrives:

    waterdata · chunk 2/5 · 14 pages · 8,421 rows · 4,870 requests left

- New `_progress.py`: a `ProgressReporter` (chunks / pages / rows /
  rate-limit) plus a `progress_context()` contextmanager and `current()`
  accessor. The active reporter lives in a ContextVar so the `chunked`
  decorator (chunk counts) and the page-walking loops (page/row/rate-limit
  counts) update it without threading a param through every signature.
- Shown only on an interactive terminal by default so notebooks, logs,
  and CI stay clean; `DATARETRIEVAL_PROGRESS=1/0` forces it on/off.
- `_next_req_url` is now a pure helper; rate-limit + page reporting moved
  into `_walk_pages` and `get_stats_data`. Surviving request-URL logs drop
  to `debug`. Warnings/errors on failure are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Render the x-ratelimit-remaining value like the row count (e.g.
"4,998 requests left") when it's a plain integer, so the status-line
segments read consistently. Non-numeric header values pass through
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ssing geopandas

- Rename the progress toggle env var DATARETRIEVAL_PROGRESS ->
  API_USGS_PROGRESS, matching the existing API_USGS_PAT convention.
- The "Geopandas not installed" advisory in the stats path was logged at
  INFO (silent by default) while the OGC path logged it at WARNING. Bump
  the stats one to WARNING so both paths surface it. Kept as logging
  rather than warnings.warn: logging's last-resort handler already prints
  WARNING+ to stderr without any logging setup, so the message is visible
  by default.
- Add tests asserting both paths warn when geopandas is absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The README told users `logging.basicConfig(level=logging.INFO)` would show
request URLs and the hourly remaining-requests count. After the progress-line
change neither is true: request URLs moved to DEBUG and the remaining-requests
count is now part of the progress line, so INFO shows nothing.

Replace that note with an accurate "Tracking progress" section documenting the
self-updating status line and the API_USGS_PROGRESS toggle, plus a pointer to
DEBUG logging for the request URL. Add a NEWS.md entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…env var from README

When a query finishes without ever seeing an x-ratelimit-remaining header —
which typically means no API_USGS_PAT is configured — the progress line now
appends a one-time pointer to API-key registration. Guarded so it never nags a
caller who has a key, and shown at most once per process.

Also stop documenting the API_USGS_PROGRESS toggle in the README (it stays in
the module docstring); there is no environment-variable section in the docs to
host it. README's "Tracking progress" note now just describes the line and its
interactive-terminal default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ate-limit header

Anonymous requests still return an `x-ratelimit-remaining` header on some
endpoints (and omit it on others), so "no rate-limit header seen" is an
unreliable proxy for "no API key" — the pointer would fire inconsistently or
not at all. Key the one-time pointer directly off the absence of `API_USGS_PAT`
instead, which is exactly the condition it's nudging about.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…data"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code review surfaced two real defects:

- ProgressReporter rendering wrote to stderr unguarded. Because the per-page
  `reporter.add_page()` call sits inside `_walk_pages`' broad `except
  Exception`, a render failure — BrokenPipeError when output is piped to
  `head`, or UnicodeEncodeError for the `·` separator on a non-UTF-8 stderr —
  was misread as a failed request and silently truncated the result; on the
  first (pre-loop) page it crashed the query outright. `_render` and `close`
  now swallow stream errors and disable further rendering, so progress is
  strictly best-effort.

- The "Geopandas not installed" advisory lived in the per-page
  `_handle_stats_nesting`, so a multi-page stats query (now at WARNING) emitted
  one warning per page. Moved it to `get_stats_data` so it fires once, matching
  `_walk_pages`.

Also harden rate-limit formatting against non-decimal unicode digits
(`isascii() and isdigit()`), and add regression tests: a broken progress
stream must not truncate pagination, and the geopandas advisory fires once.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ogress-bar

# Conflicts:
#	README.md
#	dataretrieval/waterdata/filters.py
#	dataretrieval/waterdata/utils.py
#	tests/waterdata_utils_test.py
…set; review fixes

The live USGS API returns X-Ratelimit-Limit + X-Ratelimit-Remaining but no
reset header, so the earlier "resets in X" countdown read a header that never
arrives. Replace it with the available "<remaining> / <limit> requests
remaining" (e.g. "995 / 1,000"); the reset parsing/countdown and its tests are
removed.

Also fixes two issues from code review:
- close() used current_chunk as a proxy for "drew something", but start_chunk
  sets it even when it doesn't render, so a query that failed before any page
  printed a stray newline and burned the once-per-process API-key hint. Now
  gated on a _rendered flag set only after a successful write.
- _maybe_hint_api_key set its latch before the write, so a failed write burned
  the hint process-wide; the latch is now set only after a successful write.

Plus doc-comment accuracy: the ContextVar isolates one query's reporter (it does
not give concurrent queries on one stderr separate lines), and references to the
removed ``filters.chunked`` decorator now point at ``chunking`` / ``_paginate``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eved

The status line now leads with "Retrieving: <service>" (e.g. "Retrieving:
continuous · 3 pages · 1,345 rows · 998 / 1,000 requests remaining"). The
service string each getter already passes to get_ogc_data / get_stats_data is
threaded into the reporter via progress_context(service=...). With no service
(bare reporter), it falls back to the "Progress:" label.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant