Feat/cmip7 awiesm3 veg hr#266

Open
JanStreffing wants to merge 374 commits into prep-release from feat/cmip7-awiesm3-veg-hr

@JanStreffing JanStreffing commented Apr 2, 2026

Contributor

CMIP7 cmorization for AWI-ESM3-VEG-HR

Adds full CMIP7 support targeting AWI-ESM3-VEG-HR, including a native compound-name
architecture that replaces the legacy cmip6-table-based data request lookup.

Key changes

CMIP7 data request

  • Load DataRequest from CMIP7_DReq_metadata JSON instead of cmip6 tables
  • Refactor to native compound-name architecture (ocean.tos.tavg-u-hxy-sea.mon.GLB)
  • Fix JSON key mismatch: cmip6_table → cmip6_cmor_table in vendored metadata
  • Improve compound_name matching against cmip6_compound_name and cmip7_compound_name attributes
  • Derive table_id from compound name when not set explicitly
  • Strict ValueError on zero DRV matches (instead of silent skip)

Pipeline

  • Add generic vertical_integrate custom pipeline step
  • Remove duplicate convert() step from DefaultPipeline
  • Fix Prefect State objects not being unwrapped to actual results in parallel runs
  • Propagate pipeline/flow errors instead of silently logging them

Standard library

  • Add time bounds support (src/pycmor/std_lib/time_bounds.py)
  • Fix dimension mapping to use getattr + _pycmor_cfg fallback
  • Fix global_attributes to derive table_id from CMIP6/CMIP7 compound names

Xarray accessor API

  • Lazy accessor registration and StdLibAccessor with .process()

Test infrastructure

  • Modernize with entry-point model discovery (pycmor.fixtures.model_runs)
  • Add pycmor.tutorial dataset system (xarray.tutorial-style API)
  • Fix stub generator to use monotonic coordinate values for multi-file datasets

Misc fixes

  • Python 3.9 entry_points() compatibility
  • Guard pyfesom2 imports for environments without it
  • Fix tarball double-nesting extraction on Python 3.12+
  • Rename non-standard time dimension on load (OpenIFS support)

Test plan

  • Unit tests pass: pytest tests/unit/
  • pycmor process examples/awiesm3-cmip7-minimal.yaml runs successfully on Levante
  • core_atm runs with tco95_core2 on Levante
  • core_land runs with tco95_core2 on Levante
  • core_ocean runs with tco95_core2 on Levante
  • core_seaice runs with tco95_core2 on Levante
  • lrcs_land runs with tco95_core2 on Levante
  • lrcs_oce runs with tco95_core2 on Levante
  • lrcs_seaice runs with tco95_core2 on Levante
  • cap7_atm runs with tco95_core2 on Levante
  • cap7_ocean runs with tco95_core2 on Levante
  • cap7_seaice runs with tco95_core2 on Levante
  • cap7_land runs with tco95_core2 on Levante
  • cap7_aerosol runs with tco95_core2 on Levante
  • veg_atm runs with tco95_core2 on Levante
  • veg_land runs with tco95_core2 on Levante
  • veg_seaice runs with tco95_core2 on Levante
  • extra_atm runs with tco95_core2 on Levante
  • extra_land runs with tco95_core2 on Levante
  • final check of all rules.

@JanStreffing JanStreffing changed the base branch from main to prep-release April 2, 2026 12:55
@JanStreffing JanStreffing force-pushed the feat/cmip7-awiesm3-veg-hr branch from 1a42875 to 5617a18 Compare April 2, 2026 13:18
@JanStreffing

Contributor Author

I was able to run both tos and the more complex absscint with the latest commit on this branch. We may want to work on #267, and certainly need to work on #265. But neither should block us from starting to build up more rules for the picontrol variables.

@JanStreffing

JanStreffing commented Apr 5, 2026

Contributor Author

The single failing test (test_library_process[fesom_dev-native-dask-cmip6]) is unrelated to this PR. It failed due to a network issue — the test tried to download pi_uxarray.tar from nextcloud.awi.de but got ConnectionError: [Errno 111] Connection refused. The AWI Nextcloud server was unreachable from the GitHub Actions runner.

Something to worry about? @pgierz

@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@mandresm

mandresm commented Apr 7, 2026

Contributor

@pgierz, will you review this or should I go for it?

@pgierz

pgierz commented Apr 7, 2026

Member

I will look

@esm-tools esm-tools deleted a comment from github-actions Bot Apr 8, 2026
If the FESOM diag file's 'nod_area' lives on half-levels (nl1),
tripyview's nod_area-rename leaves it on 'nz1', producing a result
with both 'nz' (from w) and 'nz1' (from nod_area). Detect the diag
file vdim and rename w to match so only one vertical dim survives.
Brings pycmor output much closer to wcrp_cmip7 compliance (38 → 30
cchecker findings on sidmassth; remaining are CV lookups blocked on
AWI source_id registration plus a few structural gaps).

- global_attributes.py (CMIP7): emit branded_variable, branding_suffix,
  temporal/vertical/horizontal/area_label, region, drs_specs, license_id,
  parent_experiment_id, title, history. Fix Conventions to
  "CF-1.11 CMIP-7.0". data_specs_version now parses CMIP7_DReq_metadata
  path (v1.2.2.2). tracking_id prefix overridable.
- files.create_filepath: drop institution prefix from filename; add
  CMIP7 DRS format <var>_<branding>_<freq>_<region>_<grid>_<source>_
  <exp>_<variant>[_<time>].nc (detected via compound_name).
- files.save_dataset / _save_dataset_with_native_timespan: strip
  fractional-seconds from time:units (preserve epoch), drop stale
  time:bounds pointing to missing variable, drop stale per-var
  `coordinates` encoding, clear _FillValue on coordinate variables.
- variable_attributes.py: set `missing_value` as CF attribute with
  dtype matching _FillValue (was only in encoding, caused dtype
  mismatch warnings).
- chunking.py / config.py: default _FillValue and missing_value to
  1.0e20 (CMIP/CMOR spec), was 1e30.
- timeaverage.py: default adjust_timestamp to "mid" for MEAN so time
  sits at interval midpoint (CMIP spec).
- cmip7_*_test yamls: use temp source_id AWI-ESM3-VEG-LR, piControl
  (case-sensitive), add activity_id: CMIP and parent_experiment_id.
… dir tree

Brings sidmassth verify run from 30 → 27 wcrp_cmip7 findings; the
remaining are all CV lookups that depend on AWI registrations.

- timeaverage.timeavg: only subtract 1d at offset==1.0 (was always
  subtracting), so default mid-month timestamp lands on the actual
  midpoint (Jan 16, not Jan 15).
- data_request/variable.py: surface cell_measures from CMIP7
  all_var_info.json onto CMIP7DataRequestVariable.attrs, so
  std_lib.variable_attributes.set_variable_attrs propagates it onto
  the data variable. Drop None/empty/`::MODEL` placeholder values.
- std_lib/files.py: emit lat/lon bounds via existing
  std_lib.bounds.add_bounds_from_coords, gated on 1-D, monotonic,
  size>=2 coords (FESOM unstructured node arrays correctly skipped).
  Wired into all four save paths.
- std_lib/global_attributes.CMIP7GlobalAttributes.subdir_path:
  rewrite to emit the 13-component CMIP7 DRS layout
  (drs_specs/mip_era/activity/institution/source/experiment/variant/
  region/frequency/variable/branding_suffix/grid_label/directory_date)
  in place of the CMIP6 10-component path.
- examples/cmip7_*_test.yaml: enable_output_subdirs: true so the
  rewritten DRS path actually surfaces.
- doc/design-qc-integration.md: document the FESOM unstructured
  monotonicity limitation (cchecker VAR005); fix is regrid or
  restructure to `node` dimension.
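The gating described for lat/lon bounds (1-D, monotonic, size>=2) can be sketched roughly as follows. This is an illustration only, not pycmor's code: `can_infer_bounds` and `infer_bounds` are hypothetical names, "monotonic" is assumed to mean strictly monotonic, and the midpoint-based edges are an assumed inference rule. Taking a flat sequence stands in for the 1-D condition.

```python
def can_infer_bounds(values):
    """True when cell bounds can be inferred from a flat list of cell centres."""
    values = list(values)
    if len(values) < 2:
        return False
    steps = [b - a for a, b in zip(values, values[1:])]
    # Assumption: "monotonic" means strictly increasing or strictly decreasing.
    return all(s > 0 for s in steps) or all(s < 0 for s in steps)


def infer_bounds(values):
    """Edges halfway between neighbours, extrapolated at both ends."""
    values = list(values)
    if not can_infer_bounds(values):
        return None  # e.g. unstructured node arrays are skipped
    mids = [(a + b) / 2 for a, b in zip(values, values[1:])]
    first = values[0] - (mids[0] - values[0])
    last = values[-1] + (values[-1] - mids[-1])
    edges = [first] + mids + [last]
    return list(zip(edges[:-1], edges[1:]))
```

A regular axis such as `[-1.0, 0.0, 1.0]` gets bounds, while a node array in mesh order such as `[10.0, -3.0, 42.0]` returns `None` and is left untouched.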
_ensure_lat_lon_bounds now splits regular coords (infer from
centers) from unstructured ones (copy lat_bnds/lon_bnds from
rule.grid_file when size + center values match). Clears ATTR001
bounds findings for FESOM2 native output.
Plumbing:
- Add qc_tests: [cf, wcrp_cmip7] to every tier's inherit block (17 yamls).
  The cc-plugin-wcrp suite must be installed separately:
  pip install 'cc-plugin-wcrp @ git+ssh://git@github.com/ESGF/cc-plugin-wcrp.git@master'

CF fixes surfaced by cli64 (first QC-on full-tier run):

1. CF §1.2 wo nz1 not strictly monotonic. The DARS mesh's deepest
   midpoint in mesh.depth is a seabed BC artefact (drops back from
   ~6125 m to ~3160 m at the last index). average_w_interfaces_to_midpoints
   now detects the non-monotonic tail and trims it, and sets full CF
   attrs (long_name, standard_name, units, axis, positive) on the new
   vertical coord.

2. CF §3.3 bnds aux variable missing long_name. time_bounds now
   attaches long_name='bounds index' to the bnds index coord, so the
   serialised int64 bnds(bnds) carries a CF-compliant description.

3. CF §2.5.1 coordinate _FillValue. _encoding_from_dask_chunks now
   sets _FillValue=None for every coord variable (lat/lon/time/lev/...),
   not just for data variables. Eliminates the spurious lat:_FillValue
   on areacella/areacello/fx files.

4. CF §3.1 vsfcorr units 'm s-1 psu' not UDUNITS-recognised. The
   CMIP7 DReq still ships 'psu' in vsfcorr's units. variable_attrs
   now rewrites 'psu' -> '1e-3' (the CMIP6+ convention; UDUNITS-
   accepted scaling factor) just before applying attrs.
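The tail trimming from item 1 can be pictured with a small sketch. It assumes a plain list of midpoint depths and a hypothetical helper name; the real average_w_interfaces_to_midpoints also sets CF attributes, which is omitted here. Only the ~6125 m and ~3160 m values come from the text above, the others are made up.

```python
def trim_non_monotonic_tail(depths):
    """Keep the leading strictly increasing run of a depth axis."""
    depths = list(depths)
    for i in range(1, len(depths)):
        if depths[i] <= depths[i - 1]:
            # Everything from the first non-increasing step onward is dropped.
            return depths[:i]
    return depths


# Illustrative axis: the last entry drops back, like the seabed artefact.
example = [2.5, 7.5, 6125.0, 3160.0]
trimmed = trim_non_monotonic_tail(example)
```

A strictly increasing axis passes through unchanged, so the step would be safe to apply unconditionally under these assumptions.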
Source registration: WCRP-CMIP/Essential-Model-Documentation#640
(merged 2026-06-11; AWI-ESM3-4-2-veg-HR ingested into cmip7 1.2.6 at
2026-06-16 10:32 UTC).

Per-tier inherit:
- source_id: AWI-ESM-3 -> AWI-ESM3-4-2-veg-HR  (matches CV drs_name)
- experiment_id: picontrol -> piControl        (CV drs_name)
- activity_id: CMIP                            (new, required by wcrp_cmip7)
- parent_experiment_id: 'no parent'            (new; CMIP6 convention)
- calendar: proleptic_gregorian                (new; matches AWI registration)
- institution: '...Helmholtz...Bremerhaven...' ->
                'Alfred Wegener Institute for Polar and Marine Research'
                                                (matches CV description verbatim)

Per-tier grid_label (from EMD horizontal_computational_grid + subgrids):
- atm + cap7_aerosol (XIOS-regridded 0.25 deg regular): g113
- land tiers + veg_atm (LPJ-GUESS on OIFS reduced gaussian unstructured): g122
- ocean tiers (FESOM unstructured triangular native): g130
- seaice tiers (FESIM embedded in FESOM mesh): g130

Per-tier pycmor:
- enable_output_subdirs: true                  (flips DRS subdir layout on)

Surfaced by cli65 wcrp_cmip7 plugin: 147/147 files had ATTR004 (activity_id,
experiment_id, grid_label, source_id), ATTR009 (institution match), FILE001
(no MIP-DRS7 in path), PATH001/002 (DRS structure vs attrs). Should all
clear now.
The fix_inherit_cv.py edit dropped the institution line everywhere.
pycmor.global_attributes.get_institution() falls back to institution_id
when not user-supplied, so files ended up with institution='AWI' — the
ATTR009 / ATTR004 wcrp checks expect the CV-canonical description.

Adds: institution: "Alfred Wegener Institute for Polar and Marine Research"
(verbatim from CMIP7-CVs cmip7 collection 'institution' term 'AWI').

cli66 surfaced this on 163/163 files. Clears once submitted as cli67.
Three new high/medium wcrp_cmip7 + cf findings appeared once the
DRS/CV plumbing landed clean. None of them block ESGF; all three are
data-quality issues with single-point pycmor fixes.

1. cf:medi:§7.1 'lat outside lat_bnds' on FESOM ocean (63 files).
   _attach_bounds_from_mesh now promotes both the centroid coord and
   the bnds polygon to float64 explicitly. Same geometric content as
   the DARS mesh's float32 storage; the extra decimals give the CF
   in-bounds check enough margin to absorb the ULP drift that put
   ~21k of ~3.1M cells a fraction outside their own polygon. No
   actual mesh change.

2. wcrp_cmip7:medi:[TIME003a] 'time.calendar='standard' recommended
   'proleptic_gregorian'' (37 files). XIOS-produced FESOM files
   arrive with calendar='standard'; time_bounds now forces the encoded
   calendar to 'proleptic_gregorian' on the main time coord (semantic
   no-op for any date past 1582-10-15, which covers every CMIP7
   experiment). Clears the entire FESOM TIME003a class.

3. wcrp_cmip7:high:[VAR004] 'time_centered_bounds missing' (12 files)
   on LPJ-GUESS daily output. The auxiliary 'time_centered' coord
   (NEMO/FESOM XIOS convention for centre-of-averaging-period) carries
   bounds='time_centered_bounds' but pycmor only rebuilds time_bnds,
   leaving the reference dangling. _drop_xios_aux_time_coords now
   strips both time_centered and time_centered_bounds in the pre-save
   pass. Also clears the TIME003a finding on time_centered:calendar
   (which had survived the time-coord fix because it's a different
   variable).
- Add time_bounds to DefaultPipeline so rules that use it pick up the
  TIME001 midpoint realignment without each yaml having to wire it in.
- Loosen _create_mean_bounds month detection: previously required
  approx_interval AND data_freq_days both in the 28-32 day band;
  broadcast-style pipelines (cfc11, ch4, n2o ...) don't carry a
  DReq-derived interval so the rule had approx_interval=None and the
  monthly path was silently skipped. Now any dataset with data spacing
  in the monthly range gets month-start bounds; approx_interval is
  treated as an optional veto only.
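The loosened month detection reads as "data spacing decides, approx_interval can only veto". A minimal sketch of that rule, assuming inclusive 28-32 day limits and a hypothetical function name:

```python
MONTH_BAND = (28.0, 32.0)  # days; inclusive limits are an assumption


def _in_month_band(days):
    return MONTH_BAND[0] <= days <= MONTH_BAND[1]


def looks_monthly(data_freq_days, approx_interval=None):
    """Monthly when the data spacing is monthly and approx_interval does not object."""
    if not _in_month_band(data_freq_days):
        return False
    # approx_interval only vetoes when it is present and disagrees.
    if approx_interval is not None and not _in_month_band(approx_interval):
        return False
    return True
```

Under the previous behaviour, `looks_monthly(30.4, None)` would have been false because the missing interval failed the band test; here it is true.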

Companion to 70bd8af (the TIME001 midpoint write, the §7.1 float64
promotion, the TIME003a calendar override, and the time_centered drop).
source files often write fixed-day-of-month timestamps (e.g. 16th at
12:00) that drift by 1-2 days from the bnds midpoint for non-31-day
months. that's what trips the wcrp TIME001 "time axis check" on every
broadcast-style monthly file (cfc11, ch4, lpj yearly→monthly, ...).

time_bounds.py: after creating bnds, overwrite the time coord with
midpoint(time_bnds). preserves attrs/encoding via DataArray.copy(data=)
and skips the instantaneous path. also handles cftime calendars
(proleptic_gregorian, gregorian, ...) in _midpoint_bounds and
_create_monthly_bounds, since pd.Timestamp can't accept cftime objects
and np.mean can't average them. relaxed the monthly-detection check so
the month-start bnds fire when data_freq_days looks monthly even if
approx_interval is unset. promotes calendar="standard"/"gregorian" to
"proleptic_gregorian" at write so TIME003a stops recommending it.

pipeline.py: added the step to DefaultPipeline right after timeavg.

awi-esm3-veg-hr-variables/: swept set_time_bounds into 17 yamls,
111 pipelines total. uses the std_lib alias (DataArray-safe) so the
pipeline payload still round-trips through the Dataset-only inner fn.
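To make the drift concrete, here is a small stdlib-only sketch of month-start bounds and their midpoint. It uses `datetime` (proleptic Gregorian) rather than cftime, and the helper names are hypothetical, not the ones in time_bounds.py.

```python
from datetime import datetime


def month_bounds(stamp):
    """Return (month start, next month start) for the month containing stamp."""
    start = datetime(stamp.year, stamp.month, 1)
    if stamp.month == 12:
        end = datetime(stamp.year + 1, 1, 1)
    else:
        end = datetime(stamp.year, stamp.month + 1, 1)
    return start, end


def midpoint(start, end):
    return start + (end - start) / 2


# A fixed "16th at 12:00" stamp matches the midpoint only for 31-day months.
jan_stamp = datetime(1851, 1, 16, 12)
feb_stamp = datetime(1851, 2, 16, 12)
jan_mid = midpoint(*month_bounds(jan_stamp))
feb_mid = midpoint(*month_bounds(feb_stamp))
feb_drift_days = (feb_stamp - feb_mid).total_seconds() / 86400.0
```

For January the stamp and the midpoint coincide; for February 1851 (28 days) the midpoint is the 15th at 00:00, so the source stamp sits 1.5 days late, which is the kind of offset TIME001 reports.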
CF §7.1 was reporting 21k+ lat / 23k+ lon points outside their own
bounding boxes on DARS2 (3.1M nodes). Two issues, two fixes.

1) FESOM/XIOS output writes bounds_lat/bounds_lon truncated to nvertex=8.
   For many cells the 8 slots are filled with only 1-2 unique vertices
   padded with the last value — degenerate line / thin-triangle polygons
   whose bbox really doesn't contain the stored centroid. The full
   polygon (up to 16 vertices on DARS2) lives in mesh.nc.

   Fix: when rule.grid_file is configured and the mesh carries matching-
   size lat_bnds/lon_bnds, drop whatever was renamed/recovered from the
   input file and let _attach_bounds_from_mesh pull the canonical mesh
   polygons. Only fires when a mesh is actually configured, so existing
   non-FESOM runs are unaffected.

2) FESOM stores lat/lon as float32 even though the mesh natively uses
   float64. CMIP6/7 cmor-tables specify type=double for latitude /
   longitude (CMOR writes them that way too). pycmor was forwarding the
   downcast.

   Fix: promote lat/lon (and their bnds) to float64 — both in-memory and
   in encoding so the dtype survives the to_netcdf round-trip — for all
   1-D unstructured coords, not just the mesh-attach path. Pure
   precision recovery from the mesh's native values; not invented.

End-to-end on HR DARS2 siconc: cf high=0 medium=0, wcrp_cmip7 high=0
medium=0. Was: cf medium=1 (21k+ outliers).
…-universe#190 in flight

snapshot after the TIME001 + mesh-prefer + float64 fixes landed and the
esgvoc db caught up with the source thin entry. publishable today; only
upstream packaging + registry housekeeping remain.
batch of cli68 follow-up fixes. HR seaice mon + day both verified clean
(cf 0/0, wcrp_cmip7 0/0) on local + slurm runs.

time_bounds.py
- _create_daily_bounds snaps day bnds to midnight regardless of source
  timestamp (FESOM noon stamps no longer offset bnds, midpoint = noon).
- existing-bnds path realigns time = midpoint(time_bnds) instead of
  early-returning (so source-shipped bnds still get the wcrp TIME001
  midpoint convention applied).
- canonical encoding helper unifies calendar (-> proleptic_gregorian),
  units (-> days since), dtype (-> float64) and propagates the same to
  time_bnds. fixes A3 / A4 / A6.
- bnds DataArray no longer carries its own long_name / comment so
  cf §7.1 doesn't flag mismatched boundary attrs. fixes A8.
- fx-style datasets (no time coord) are now a silent no-op instead of
  raising ValueError.

files.py
- save_dataset reruns time_bounds() per resample group, so canonical
  bnds survive the DataArray round-trip through the pipeline (xarray
  refuses to attach the foreign 'bnds' dim as an aux coord). fixes A1.
- _save_dataset_with_native_timespan drops the time coord + time_bnds
  on fx / ofx frequencies; CMIP fx files are time-invariant by spec.
  fixes A5 / A6 fx path.
- _ensure_lat_lon_bounds_impl drops dangling lat.attrs['bounds'] when
  the referenced bnds variable can't be recovered. fixes A7.

__init__.py
- set_time_bounds wrapper preserves the realigned time coord (with its
  encoding) on the returned DataArray; time_bnds re-attaches in
  save_dataset.

core/gather_inputs.py
- _check_compatible_schemas pre-opens each file header and fails fast
  with a clear error when input list has heterogeneous primary dims
  (e.g. native nod2 + regridded lat/lon both matched by a loose .*
  pattern). avoids the multi-petabyte allocation that dask.tokenize
  triggers inside xarray's merge_collected on schema mismatch.

awi-esm3-veg-hr-variables/lrcs_seaice
- 12 hemispheric-scalar sea-ice rules (-u-hm-u branding) now carry
  cell_measures: "" per CMIP7 registry. fixes B1 ATTR001.
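The fail-fast schema check in core/gather_inputs.py can be sketched without opening any files. The version below assumes the dimension names per file have already been read from the headers and compares the full dimension sets rather than only "primary" dims; the function name mirrors the commit text but the body is hypothetical.

```python
def check_compatible_schemas(dims_by_file):
    """Raise early when input files disagree on their dimension names."""
    groups = {}
    for path, dims in dims_by_file.items():
        groups.setdefault(frozenset(dims), []).append(path)
    if len(groups) > 1:
        summary = "; ".join(
            "{}: {}".format(sorted(dims), sorted(paths))
            for dims, paths in groups.items()
        )
        raise ValueError("Input files have heterogeneous dimensions: " + summary)


# Example: a loose pattern picked up both native and regridded output.
mixed = {
    "sst.fesom.1851.nc": ("time", "nod2"),
    "sst.regridded.1851.nc": ("time", "lat", "lon"),
}
```

Calling `check_compatible_schemas(mixed)` raises `ValueError` with both groups listed, before any merge is attempted.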
two-layer bug. the model_level regex was ^(model_)?level(_\w+)?$ which
misses the OIFS-canonical 'model_levels' (trailing s). value-based
detection then sees the integer indices 1..137 fall inside the 0..360
longitude window and tags the dim as 'longitude'. result: every
atmospheric-model-level variable (cl, cli, clw, hur on al, hus on al,
ta on al, ...) writes its vertical dim as 'longitude' in the output,
which the cf §2.4 dim-order check then flags on 15 files.

regex now matches 'model_levels' / 'levels' too. value-side adds an
early integer-index-sequence check so 1..N or 0..N-1 returns
'model_level' before lat/lon range tests get a chance.

verified on the OIFS atmos_mon_ml_cli source: model_levels now maps
to alevel as expected.
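Both layers of the bug can be reproduced in a few lines. `OLD_PATTERN` is the regex quoted above; `NEW_PATTERN`, the range tests, and their ordering are assumptions made for illustration, not the actual dimension_mapping code.

```python
import re

OLD_PATTERN = re.compile(r"^(model_)?level(_\w+)?$")
# Hypothetical widened pattern: allows an optional trailing "s".
NEW_PATTERN = re.compile(r"^(model_)?levels?(_\w+)?$")


def is_index_sequence(values):
    """True for 1..N or 0..N-1 integer-like sequences."""
    values = list(values)
    n = len(values)
    if n < 2:
        return False
    return values == list(range(1, n + 1)) or values == list(range(n))


def classify_by_values(values, index_check=True):
    values = list(values)
    if index_check and is_index_sequence(values):
        return "model_level"
    # Assumed fallback order: latitude window first, then longitude.
    if all(-90 <= v <= 90 for v in values):
        return "latitude"
    if all(0 <= v <= 360 for v in values):
        return "longitude"
    return None


levels = range(1, 138)  # 137 model levels
```

With `index_check=False` the levels are classified as `"longitude"`, matching the failure described above. One caveat of this sketch: a 1-degree longitude axis stored as 0..359 would also pass the index test, so a real implementation likely needs the name-based match to take precedence.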
Cli69 left 126 TIME003 and 44 TIME001 HIGH findings on LPJ-GUESS and
FESOM yearly variables. The wcrp filename regex only matches 6- or
8-digit tokens, so the legacy YYYY-YYYY form for yr/yrPt/dec was being
rejected as no token at all; switching to YYYYMMDD-YYYYMMDD (Jan 1 to
Dec 31 of the start/end years) clears TIME003 and stays inside the
canonical CMIP DRS for yearly files. TIME001 was failing because the
single-stamp yearly write path returned early without bnds and the
midpoint check then compared the raw stamp against an implied year
midpoint. Added _create_yearly_bounds (year_start, next_year_start)
that handles datetime64 and cftime, taught _create_mean_bounds to use
it for both single-stamp and multi-stamp yearly input, and gated the
fx-like short-circuit so yearly files no longer fall through to it.
Also reordered the assign_coords pair so the midpoint time is attached
before time_bnds; the previous order silently NaN'd every bnds entry
for any case where the source stamp wasn't already at the midpoint.
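A stdlib sketch of the yearly pieces described above: bounds as (year_start, next_year_start), the derived midpoint, and the 8-digit filename token. The names are hypothetical and cftime handling is left out.

```python
from datetime import datetime


def yearly_bounds(stamp):
    """Return (year start, next year start) for the year containing stamp."""
    return datetime(stamp.year, 1, 1), datetime(stamp.year + 1, 1, 1)


def yearly_filename_token(first_year, last_year):
    """YYYYMMDD-YYYYMMDD token spanning Jan 1 of the first year to Dec 31 of the last."""
    return "{:04d}0101-{:04d}1231".format(first_year, last_year)


start, end = yearly_bounds(datetime(1851, 7, 1))
mid = start + (end - start) / 2
token = yearly_filename_token(1851, 1851)
```

For 1851 the midpoint lands on July 2 at 12:00 and the token reads `18510101-18511231`, so the time value, its bounds, and the filename all describe the same year.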
cli69 production batch surfaced 28 HIGH findings (14 ATTR004 + 14 FILE001)
for the `30s-90s` region term that IS in upstream cmip7@1.2.6 but was
missing from the esgvoc snapshot installed on the SLURM compute nodes.
Locally the validation passes; only the workers see it as HIGH. Add
`esgvoc use universe@latest && esgvoc use cmip7@latest` right after
`conda activate pycmor_py312` in run_hr_shard.sh and run_walker_compute.sh
so every compute node refreshes its CV cache before pycmor or the walker
imports esgvoc. Each call is idempotent and ~1s if already current, and
soft-fails so a briefly-down registry doesn't abort the run; a stale-cache
HIGH finding reappearing is itself the signal that the refresh missed.
Catches the 68 MEDIUM TIME003a (calendar=standard) and 13 HIGH VAR005 (time
dtype int64) findings on FESOM ocean monthly + yearly files. The
set_time_bounds pipeline step already calls _force_canonical_time_encoding,
but the helper only patches ds[time_label].encoding. The save path then
builds a per-variable encoding dict and passes it to xr.save_mfdataset,
and that dict overrides anything the dataset-level encoding had set. The
yearly path also fell back to _save_dataset_with_native_timespan, which
never ran the helper at all.

Adds canonicalize_time_in_encoding_dict so the encoding dict passed to
the writer also carries proleptic_gregorian / days since / float64, and
calls _force_canonical_time_encoding at every save site (resample-group
loop, native-timespan loop, scalar-time path, and once on the parent
dataset before the resample / native fork). Both helpers are idempotent
and silently skip when there is no time coord, so fx / ofx files are
unchanged.

Also strips fractional seconds and ISO T separators from the units
reference date and collapses it to date-only (xarray's CF encoder
rewrites days since YYYY-M-D HH:MM:SS back to YYYY-M-DTHH:MM:SS on
write, which the wcrp_cmip7 ATTR004 / cchecker units regex rejects).
Derives a clean days since YYYY-MM-DD when the source has no units at
all, so xarray's default encoder no longer mints a .000000 reference.
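The units clean-up can be approximated with one regular expression. This sketch is not the pycmor helper: the function name and pattern are hypothetical, and it only collapses the reference date when the time-of-day part is zero so that the epoch is preserved (an assumed safeguard).

```python
import re

_REFERENCE = re.compile(
    r"^(?P<unit>\w+)\s+since\s+"
    r"(?P<y>\d{1,4})-(?P<m>\d{1,2})-(?P<d>\d{1,2})"
    r"(?:[T ](?P<h>\d{1,2}):(?P<mi>\d{2})"
    r"(?::(?P<s>\d{2})(?:\.(?P<frac>\d+))?)?)?$"
)


def canonical_time_units(units):
    """Collapse '<unit> since Y-M-D[T ]00:00:00[.000]' to '<unit> since YYYY-MM-DD'."""
    match = _REFERENCE.match(units.strip())
    if match is None:
        return units  # unknown layout: leave untouched
    parts = match.groupdict()
    time_of_day = [parts[key] for key in ("h", "mi", "s", "frac")]
    if any(p is not None and int(p) != 0 for p in time_of_day):
        return units  # a non-midnight reference would shift the epoch
    return "{} since {:04d}-{:02d}-{:02d}".format(
        parts["unit"], int(parts["y"]), int(parts["m"]), int(parts["d"])
    )
```

For example, `canonical_time_units("days since 1850-1-1T00:00:00.000000")` gives `"days since 1850-01-01"`, while a reference such as `"hours since 2000-01-01 12:00:00"` is returned unchanged.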
cli69 left 17 MEDIUM cf §7.1 findings on LPJ-GUESS daily files of the
shape "Bounds variable time_bnds and parent variable time have non
matching boundary related attributes: ['long_name']". The freshly-built
bnds path (commit a705361, A8) already constructs the bnds DataArray
with attrs={} so cf §7.1 is silent there.

The existing-bnds branch — used when the source file ships time_bnds
that pycmor passes through (LPJ-GUESS daily output is one such case) —
never stripped them. Source attrs like long_name="time bounds" survive
through save and the parent time coord's own long_name doesn't match.

Add _strip_bnds_inheritable_attrs and call it in the existing-bnds
branch of time_bounds(). The helper drops the full CF §7.1 list of
boundary-related attrs (long_name, standard_name, units, calendar,
axis, positive, leap_month, leap_year, month_lengths, climatology,
bounds, comment), leaving unrelated user attrs alone. Mirrors the
empty-attrs construction on the new-bnds path so both code paths emit
the same surface.
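In dict terms, the stripping step might look like the sketch below. The attribute list is the one given above; the function operates on a plain attrs dict and is a stand-in for `_strip_bnds_inheritable_attrs`, whose real signature is not shown in this PR text.

```python
# Boundary-related attributes listed in the commit message above.
BOUNDARY_ATTRS = frozenset({
    "long_name", "standard_name", "units", "calendar",
    "axis", "positive", "leap_month", "leap_year",
    "month_lengths", "climatology", "bounds", "comment",
})


def strip_bnds_inheritable_attrs(attrs):
    """Return a copy of attrs without the boundary-related keys."""
    return {key: value for key, value in attrs.items() if key not in BOUNDARY_ATTRS}


source_attrs = {
    "long_name": "time bounds",
    "units": "days since 1850-01-01",
    "source_note": "kept as-is",  # unrelated user attribute
}
cleaned = strip_bnds_inheritable_attrs(source_attrs)
```

Returning a new dict keeps the source attrs intact, which makes the step easy to test; the unrelated `source_note` key survives.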
Three single-rule single-year YAMLs that exercise the code paths the
cli69 round had to fix. Mirrors the existing HR seaice smoke pattern
(local dask cluster, qc_enabled, qc_repack) so they run in a couple of
minutes each and produce a clean QC sidecar when the fixes are in.

- cmip7_hr_qc_smoke_lpj_yearly.yaml exercises E1 (8-digit
  YYYYMMDD-YYYYMMDD filename for yr frequency) and E2 (year-start
  bnds + midpoint realign). Source: HR LPJ-GUESS baresoilFrac yearly.
- cmip7_hr_qc_smoke_fesom_ocean_mon.yaml exercises E3 (calendar to
  proleptic_gregorian), E4 (units canonicalised to days since YYYY-MM-DD,
  no fractional seconds), and E6 (time dtype float64). Source: HR FESOM
  sst monthly. The wcrp_cmip7 plugin gets the FESOM int64 / standard
  paths that DefaultPipeline rules would otherwise miss.
- cmip7_hr_qc_smoke_lpj_daily.yaml exercises E5 (strip CF-inheritable
  attrs from source-shipped time_bnds). Source: HR OIFS atmos_day_land
  via the temporal_diff_pipeline used by dcw_day.

All three reuse the canonical HR identifiers (source_id, grid_label,
DARS2 mesh path) so the wcrp_cmip7 ATTR004 checks land on real CV
terms, not placeholders.
Three small fixes turned up by smoke runs of the cli69 LPJ-daily and
FESOM ocean-monthly fix targets:

- _force_canonical_time_encoding now sets the parent time coord's
  standard_name='time', long_name='time', axis='T' if absent. FESOM
  monthly via timeavg rebuilds the time coord and drops these, which
  trips wcrp ATTR001 and cf §3.3 / §5.1.
- Same helper now also sets encoding['_FillValue']=None on both the
  time coord and time_bnds. xarray's default CF encoder emits a NaN
  fill on every float variable; cf §7.1 explicitly forbids _FillValue
  on bnds.
- dimension_mapping.apply_mapping renames the matching `{src}_bnds`
  aux variable when it renames a dim X→Y, and fixes the parent coord's
  `bounds` attr pointer. LPJ-GUESS daily output uses `time_counter`,
  which lands a stray `time_counter_bnds` and no `time` variable; wcrp
  TIME003 then reports "Missing 'time' variable". The `time_counter`
  name is added to the time regex so apply_mapping recognises it.
get_encoding_with_chunks() iterates only ds.data_vars, so it skips
coord variables — including bounds aux like time_bnds. The companion
_encoding_from_dask_chunks() suppresses _FillValue on every coord
after the data_vars loop, but the trigger_compute-eager path (after
the pipeline materialises) routes through this chunking helper
instead, and time_bnds ends up written with the default NaN fill.
cf §7.1 then flags "Boundary variables 'time_bnds' should not have
the attributes: ['_FillValue']".
CMIP convention is that the on-disk time-axis variable is ALWAYS
named "time"; the numeric suffix on time1/time2/time3 only drives
the cell_methods string ("time: point" vs "time: mean") inside CMOR,
not a separate dim name. When the data request says axis=time1 (e.g.
land.cLitterLut.tpt-u-hxy-multi.yr.glb for LUT-style instantaneous
yearly snapshots), apply_mapping previously rewrote the file dim to
"time1" and shipped time1 + time1_bnds. wcrp_cmip7 TIME003 ("Missing
'time' variable.") and ATTR001 hardcode the name "time" and trip.

Collapse the mapping target before the rename: any source dim mapped
to time1/2/3 becomes "time" in the file. The matching bnds aux is
already renamed by the loop further down, and the parent coord's
`bounds` attr is repointed there too.
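The collapse amounts to rewriting the mapping targets before any rename happens. A minimal sketch with hypothetical names, using a plain dict as the source-to-target dim mapping and a set of variable names in place of a dataset:

```python
TIME_AXIS_IDS = {"time1", "time2", "time3"}


def collapse_time_targets(mapping):
    """Map any time1/time2/time3 target to plain 'time'."""
    return {
        src: ("time" if dst in TIME_AXIS_IDS else dst)
        for src, dst in mapping.items()
    }


def rename_plan(mapping, variables):
    """Collapsed dim renames plus matching '<src>_bnds' renames."""
    plan = collapse_time_targets(mapping)
    for src, dst in list(plan.items()):
        bnds = src + "_bnds"
        if bnds in variables:
            plan[bnds] = dst + "_bnds"
    return plan


plan = rename_plan(
    {"time_counter": "time1"},
    {"time_counter", "time_counter_bnds", "cLitterLut"},
)
```

The resulting plan renames `time_counter` to `time` and `time_counter_bnds` to `time_bnds`, so no `time1` name reaches the file.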
extra_land/cmip7_awiesm3-veg-hr_extra_land.yaml mapped the LPJ-GUESS
monthly source (lai_monthly.out) to compound .day.glb. The data is
genuinely monthly (one stamp per month), but the compound told the
pipeline to write a day-frequency file with day-frequency filename
token. wcrp_cmip7 TIME001 then reported "expected 1.5, got 30" — the
gap between time[0] and time[1] is 30 days, not 1 day.

Switch to land.lai.tavg-u-hxy-lnd.mon.glb. core_land already covers
the .mon variant from OpenIFS land output; this rule provides the
LPJ-GUESS variant of the same compound.
Three fixes for the residual HIGHs / MEDIUMs that cli71 surfaced
after the LUT axis-id collapse and lai_mon recipe correction:

1. set_time_bounds: auto-detect cell_methods 'time: point' and treat
   the rule as instantaneous. For yearly + instantaneous (tpt yearly
   LUT compounds like land.cLitterLut.tpt-u-hxy-multi.yr.glb), build
   year-snap bnds (so the bnds span the period) but pin time to the
   bnds start, not the midpoint. wcrp TIME001 with use_midpoint=False
   then sees time = filename_start (≈ -182 in days since year_mid
   epoch) and matches. Clears 15× cLitterLut TIME001.

2. _save_dataset_impl: re-run set_time_bounds on the dataset after
   the DataArray→Dataset conversion when time_bnds is still missing.
   The std_lib wrapper drops the bnds aux when the upstream payload
   is a DataArray, and the existing resample-group path re-attaches
   them per group — but the native-timespan path doesn't. Which
   path fires depends on a non-deterministic
   pd.Timestamp.now()+YS-now comparison; today (June) it goes
   native, in November it goes resample. Run set_time_bounds once
   before the branch so both paths inherit a bnds-carrying dataset.
   Clears 14× FESOM yearly difmxylo TIME003 ("time coverage
   (1851,7)-(1851,7) does not cover filename 18510101-18511231").

3. DataRequestVariable.attrs (CMIP7): keep cell_measures='' through
   the empty-string filter. siextent / sivol / siarea / sisnmass and
   other scalar sea-ice compounds ship cell_measures='' in the data
   request as the canonical "no spatial measure" value, and wcrp
   ATTR001 requires the attribute to exist on the variable. The
   filter dropped any "" field; restricting the exemption to
   cell_measures preserves the safety net for accidental blank
   standard_name / long_name. Clears 24× ATTR001 cell_measures
   missing on sea-ice scalars.
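The filter change in item 3 can be sketched as follows. The function name is hypothetical, and treating the placeholder as an exact match on `::MODEL` is an assumption about how the data request encodes it.

```python
PLACEHOLDER = "::MODEL"            # assumed exact placeholder string
KEEP_WHEN_EMPTY = {"cell_measures"}  # "" is a meaningful value for this key


def clean_attrs(raw):
    """Drop None, placeholder, and blank values, except blank cell_measures."""
    cleaned = {}
    for key, value in raw.items():
        if value is None or value == PLACEHOLDER:
            continue
        if value == "" and key not in KEEP_WHEN_EMPTY:
            continue
        cleaned[key] = value
    return cleaned


scalar_seaice = {
    "cell_measures": "",
    "long_name": "",
    "standard_name": None,
    "units": "m2",
    "comment": "::MODEL",
}
```

Here `clean_attrs(scalar_seaice)` keeps `cell_measures=""` and `units`, while the blank `long_name` is still dropped, which is the safety net the commit wants to preserve.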

Skipped this round: the 10× cf §7.1 lon-outside-lon_bnds on FESOM
3D elem-dim files. lat / lon and bnds are already float64 (cli67
fdf444f promotion confirmed on cli71 output), and mesh-bnds-prefer
doesn't fire because mesh.nc carries node-dim bnds only — the 3D
fields are on elem-dim, ~2× larger. ~0.03% of triangle bboxes don't
contain their centroid (likely quantization or dateline-crossing).
Real fix needs geometry work (either elem-dim mesh bnds or a
post-hoc bbox-expand step) — out of scope for this iteration.
Four recipe + helper fixes for residual cli71 HIGHs whose root causes
are unit-string mismatch or missing CF metadata, not pipeline bugs:

- variable_attributes psu rewrite uses ``1E-03`` (CMIP7 registry form)
  instead of ``1e-3``. wcrp ATTR004 does literal string compare against
  the registry; ``1e-3`` and ``1E-03`` are the same scaling factor to
  UDUNITS but trip the registry check. Clears 1× sob HIGH.

- compute_hfbasin_tripyview / compute_sltbasin_tripyview attach
  ``standard_name=region`` + ``long_name=Region Selection`` to the
  basin sector axis at construction. set_coordinates only knows about
  spatiotemporal axes, so the basin coord otherwise landed bare and
  tripped wcrp ATTR001. Clears 1× sltbasin HIGH.

- cfc11_mon / cfc12_mon / ch4_mon / n2o_mon: drop the
  scale_factor=1e-12/1e-9 multiplication and label files as ``1E-12`` /
  ``1E-09`` (the CMIP7 registry form for scaled volume-mixing ratios).
  Previous behaviour scaled source-ppt → mol/mol and labelled ``mol mol-1``;
  the registry expects values to STAY in ppt/ppb with the scaling
  factor declared via units. scale_factor=1.0 keeps the pipeline's
  scale_by_constant step happy as a no-op. Clears 4× ATTR004 HIGH.

- vsfcorr: source is m s-1 psu (velocity × per-mille mass fraction);
  registry wants kg m-2 s-1 (mass flux). Conversion: m/s × g/kg × ρ₀
  (1025 kg/m³) = 1.025 kg/m²/s; the 1e-3 prefix on psu cancels the
  1000 g/kg in seawater density. New nan_to_zero_scale_pipeline runs
  scale_by_constant after nan_to_zero; vsfcorr rule sets
  scale_factor=1.025 + scaled_units=kg m-2 s-1. The old rule's
  ``model_unit: kg m-2 s-1`` was a metadata-only claim that the source
  data was already in CMIP units, which was wrong. Clears 1× ATTR004
  + 1× §3.1 UDUNITS HIGH.
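
The vsfcorr factor can be checked with plain arithmetic. This is a minimal sketch, assuming 1 psu is treated as 1 g/kg (1e-3 kg/kg) and using the 1025 kg/m³ reference density quoted above. The constant and function names are illustrative and are not pycmor identifiers.

```python
# Illustrative names (hypothetical); the numeric values come from the text above.
RHO_0 = 1025.0            # reference seawater density, kg m-3
PSU_AS_KG_PER_KG = 1e-3   # assumption: 1 psu treated as 1 g/kg = 1e-3 kg/kg


def vsfcorr_scale_factor(rho_0=RHO_0, psu=PSU_AS_KG_PER_KG):
    """Factor that turns a value in (m s-1 psu) into (kg m-2 s-1)."""
    return rho_0 * psu


def to_mass_flux(value_m_per_s_psu):
    """Apply the constant factor to a single source value."""
    return value_m_per_s_psu * vsfcorr_scale_factor()
```

The factor is approximately 1.025, which matches the scale_factor set on the vsfcorr rule.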

Defers: 4× FILE004d uncompressed chunk size (cmip7repack tooling
issue — runs but doesn't grow chunks enough on FESOM high-res
unstructured) and 1× ``No geophysical variable detected`` on the
basin fx file (wcrp upstream check, int32 flag variable not
recognised as geophysical).
wcrp FILE004d (HIGH) requires each data-variable chunk to be at
least 4 MiB uncompressed (CMIP7 storage convention). The dask-
aligned path in ``_encoding_from_dask_chunks`` mirrors whatever
dask chunks the pipeline produced; on FESOM unstructured HR fields
(sfx: 12 × 3146761 nodes, dask chunks (12, 18724)) that lands at
1.8 MiB per netCDF chunk and trips the check. cmip7repack is the
documented post-hoc fix but for these 2D high-res cases it
re-uses the same shape and the file stays under threshold.

Add a per-variable floor: if dask-aligned chunks are below 4 MiB,
grow the trailing (rightmost, typically horizontal) dim's chunk
until product × wordsize crosses the threshold, capped at the dim
size. Leaves the time-axis chunking pycmor already picked alone so
streaming writes stay aligned with dask.
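
The floor described above can be sketched as follows. This is not the code in ``_encoding_from_dask_chunks``; the function name, its signature and the float32 word size in the usage note are assumptions made for illustration.

```python
from math import ceil, prod

MIN_CHUNK_BYTES = 4 * 1024 * 1024  # 4 MiB floor quoted in the text above


def grow_trailing_chunk(chunks, shape, itemsize, min_bytes=MIN_CHUNK_BYTES):
    """Grow only the last dim's chunk until the chunk reaches min_bytes.

    chunks, shape: tuples of ints, one entry per dimension.
    itemsize: bytes per element (for example 4 for float32).
    Leading (typically time) chunk lengths are left untouched.
    """
    chunks = tuple(chunks)
    if not chunks or prod(chunks) * itemsize >= min_bytes:
        return chunks  # nothing to grow, or already at/above the floor
    leading = prod(chunks[:-1])  # product of the non-trailing chunk lengths
    needed = ceil(min_bytes / (leading * itemsize))
    trailing = min(max(needed, chunks[-1]), shape[-1])  # cap at the dim size
    return chunks[:-1] + (trailing,)
```

For the sfx layout quoted above, and assuming 4-byte values, ``grow_trailing_chunk((12, 18724), (12, 3146761), 4)`` returns ``(12, 87382)``, which is just over 4 MiB per chunk while the time chunk stays at 12.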

The non-dask path (``calculate_chunks_simple``) already targets
100 MB so it doesn't need the floor.

Clears 4× FILE004d HIGH (sfx, somint, phcint, absscint) on the
cli71 production batch.
Two cli72 residual HIGH fixes whose root causes were in pycmor:

- time_bounds: extend the period-start pin (previously yearly-only)
  to every rule wcrp_cmip7 TIME001 treats as instantaneous. wcrp's
  use_midpoint=False fires for (a) any rule with cell_methods
  "time: point" OR (b) any frequency NOT in its AVG_CORRECTION_FREQ
  set {day, mon, monPt, yr, yrPt, 1hrCM, sem}. So monthly tpt
  (sistressave, sistressmax, fracLut) and decadal (masso, thkcello,
  tauvo, so, volo, thetao, tauuo) need time = bnds[:, 0], not the
  bnds midpoint. Clears 10x TIME001.

- DRS region token: stop lower-casing parts[4] of the compound when
  building the filename and the directory path. CMIP7's region CV
  carries lowercase for simple codes (glb, nh, sh) but uppercase for
  latitude-band tokens (30S-90S, 30N-90N, ...). Forcing lowercase
  trips wcrp FILE001 (DRS directory + filename) and ATTR004 (region
  CV) for the 1hr/3hr south-30 hemispheric files. Preserve whatever
  case the compound carries; the data-request corpus is the source
  of truth. Clears 36x FILE001 + 18x ATTR004 region.
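
The TIME001 rule described in the first bullet can be sketched as below. This restates the rule as worded above rather than quoting wcrp's code; the function names and the ``dec`` frequency string in the usage note are assumptions.

```python
# Frequencies for which the check expects midpoint stamps (set quoted above).
AVG_CORRECTION_FREQ = {"day", "mon", "monPt", "yr", "yrPt", "1hrCM", "sem"}


def uses_midpoint(cell_methods, frequency):
    """True when the time stamp is expected at the bounds midpoint."""
    if "time: point" in cell_methods:
        return False
    return frequency in AVG_CORRECTION_FREQ


def expected_time(bounds, cell_methods, frequency):
    """bounds: list of (start, end) numeric pairs; returns one stamp per row."""
    if uses_midpoint(cell_methods, frequency):
        return [(start + end) / 2 for start, end in bounds]
    return [start for start, _ in bounds]  # period-start pin, i.e. bnds[:, 0]
```

A monthly ``time: mean`` rule keeps the midpoint, while a monthly ``time: point`` rule or a decadal rule (frequency assumed to be spelled ``dec``) is pinned to the start of the period.
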
Source files from OpenIFS-XIOS / NEMO-XIOS / FESOM ship ``time_counter``
as the time dim. Pycmor only renamed it to ``time`` when each rule set
``time_dimname`` explicitly; otherwise time_counter survived load and
the downstream behaviour was inconsistent across paths:

- The primary loader (core.gather_inputs.load_mfdataset) skipped its
  year filter because ``"time" in dims`` was False.
- The secondary-input loader (custom_steps._load_secondary_mf*) ran a
  drop_vars block that REMOVED time_counter from coords without first
  renaming it. The secondary returned without any time coord.
- Custom compute_X steps then did primary × secondary arithmetic where
  the primary still carried time_counter and the secondary carried no
  time at all. xarray broadcasting collapsed the result to a single
  time stamp at best, an empty time dim at worst.

cli72 surfaced this on ``sfcWind_1hr_south30``: output file shipped
``time: 0`` (zero-length unlimited dim) with no time variable, tripping
8 HIGH (ATTR004 coordinates, DIM002, cf §5, cf §5.1 each twice).

Add auto-detect for time_counter / time_centered in both loaders so
the rename happens without each rule having to opt in. Explicit
``time_dimname`` still wins. Companion bnds variables (time_counter_bounds,
time_counter_bnds) get renamed alongside the dim.
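
A minimal sketch of the auto-detect logic, written as a pure function that returns a mapping which could be passed to xarray's ``Dataset.rename``. The function name and the target names for the bounds variables (``time_bounds`` / ``time_bnds``) are assumptions; the text above only states that the companions are renamed alongside the dim.

```python
# Dim names treated as aliases of "time" (both are named in the text above).
TIME_ALIASES = ("time_counter", "time_centered")
BOUNDS_SUFFIXES = ("_bounds", "_bnds")


def build_time_rename(dims, variables, time_dimname=None):
    """Return an {old: new} mapping for the time dim and its bounds variables.

    dims, variables: iterables of names present in the dataset.
    time_dimname: explicit per-rule setting; it wins over auto-detection.
    """
    dims, variables = set(dims), set(variables)
    if "time" in dims:
        return {}  # already canonical, nothing to rename
    candidates = (time_dimname,) if time_dimname is not None else TIME_ALIASES
    source = next((name for name in candidates if name in dims), None)
    if source is None:
        return {}
    mapping = {source: "time"}
    for suffix in BOUNDS_SUFFIXES:
        if source + suffix in variables:
            # Assumption: the companion keeps its suffix after the rename.
            mapping[source + suffix] = "time" + suffix
    return mapping
```

Applying the same function in both the primary and the secondary loader would give both inputs the same ``time`` dim before any arithmetic.
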
Two rules sharing cmor_variable + branding_suffix + frequency but
differing in region (rlds_1hr_south30 vs rlds_1hr glb, rsds_1hr_south30
vs rsds_1hr glb, ...) used the same _rule_files prefix, so each rule's
qc step picked up the other rule's files. On 1hr / 3hr / 6hr regional
rules this raced against the glb rule's concurrent cmip7repack: the
south30 rule's strip cleared _QuantizeBitGroom* on the glb file, then
the glb rule rewrote that file (re-introducing the attr transiently),
then the south30 cchecker read the stale state and the sidecar recorded
a §2.3 _QuantizeBitGroom* finding that wasn't actually on disk.

Anchor _rule_files on <var>_<branding>_<freq>_<region>_ so each rule
only sees its own files. The DRS filename layout
<var>_<branding>_<freq>_<region>_<grid>_... makes the region token a
clean prefix boundary.
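
The effect of the longer prefix can be sketched as follows. The helper names are hypothetical, and the file names in the usage note are made up to follow the DRS layout quoted above.

```python
def rule_file_prefix(var, branding, freq, region):
    """Prefix anchored on the region token, trailing underscore included."""
    return f"{var}_{branding}_{freq}_{region}_"


def rule_files(filenames, var, branding, freq, region):
    """Keep only the files that belong to this exact rule."""
    prefix = rule_file_prefix(var, branding, freq, region)
    return [name for name in filenames if name.startswith(prefix)]
```

Given ``rlds_tavg-u-hxy-u_1hr_glb_gn_x.nc`` and ``rlds_tavg-u-hxy-u_1hr_30S-90S_gn_x.nc``, a prefix that stops at the frequency matches both files, while the region-anchored prefix matches exactly one per rule. The trailing underscore prevents one region token from matching another that merely starts with the same characters.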

Clears the 1× §2.3 finding on cli72 (rlds_1hr_south30) and removes the
class of stale-finding races on every south30 / hemispheric rule.
cmip7repack writes ``<file>.nc_cmip7repack``, then atomically renames
it over the original. If the wrapping SLURM job times out mid-repack
(3 h walltime, daily 137-level 3D atm fields like cl_day / pfull_day
each take well over an hour), cmip7repack gets killed between
"successfully created" and the rename. The partial intermediate stays
on disk forever: corrupted (HDF error on open) and taking gigabytes
per file.

cli72 left 30 GB of zombies (cl_day and pfull_day at 14.8 GB and 15.0 GB
respectively). The next batch would inherit them and the disk accounting
would keep climbing every cli iteration.

Add ``_clean_cmip7repack_orphans`` and invoke it before AND after every
``_run_cmip7repack`` pass so:
- before: a previous job's killed-rename zombies are removed before this
  job's repack starts (otherwise a half-baked intermediate could shadow
  the in-progress repack on Lustre's metadata cache).
- after: any zombies left by THIS job (timeouts, OOM, signal) are
  removed at the end of the qc_repack step. No more 30 GB silent
  accumulation per batch.
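
A minimal sketch of such a cleanup pass, assuming the intermediates can be recognised purely by the ``.nc_cmip7repack`` suffix mentioned above. The function below is illustrative and is not the ``_clean_cmip7repack_orphans`` implementation from this PR.

```python
from pathlib import Path

ORPHAN_SUFFIX = ".nc_cmip7repack"  # intermediate suffix named in the text above


def clean_repack_orphans(directory):
    """Delete leftover repack intermediates under directory.

    Returns the list of removed paths so the caller can log them.
    """
    removed = []
    for path in sorted(Path(directory).rglob(f"*{ORPHAN_SUFFIX}")):
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed
```

A pass like this is only safe while no repack is running on the same directory, since it would also delete an in-progress intermediate; calling it strictly before and after the repack pass respects that.
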
Investigation of the 1hr / 3hr / 6hr atm files in cli72 found three
distinct issues, all addressed here.

1. timeaverage._get_time_method only recognised CMIP6's "Pt" suffix on
   the frequency string. CMIP7 dropped the suffix and moved the
   tavg-vs-tpt signal into cell_methods ("time: point" vs "time:
   mean"). Every CMIP7 tpt sub-daily / monthly rule was treated as
   MEAN: timeavg ran .mean() + the midpoint shift on already-instant
   data, so stamps moved +3h (6hr) / +1.5h (3hr) / +0.5h (1hr) and the
   on-disk cell_methods still said "time: point". The psl_6hr output
   in cli72 had stamps at 15:00, 21:00, 03:00, 09:00 instead of 06,
   12, 18, 00 — exactly the +3h shift, plus boundary trim.

   Add an inline cell_methods check that overrides MEAN -> INSTANTANEOUS
   when the data request says "time: point". The CMIP6 Pt path stays
   intact; this only kicks in for CMIP7 rules.

2. ps_1hr (cap7_atm) was branded ``atmos.ps.tpt-u-hxy-u.1hr.glb`` but
   used source ``atmos_1h_sfc_sp`` which the OIFS XIOS file_def
   declares as operation="average" (time_centered stamps at HH:30).
   CMIP7 has both ``atmos.ps.tavg-u-hxy-u.1hr.glb`` and the tpt form;
   producing tpt from averaged data is wrong. Switch to the tavg
   compound. The 3hr / 6hr ps rules already use the matching ``_pt_``
   instant sources (atmos_3h_pt_sp, atmos_6h_pt_sp), so no change there.

3. ps_1hr_south30 (extra_atm) and ts_3hr (extra_atm) hit the same
   source mismatch but with no tavg compound available in the CMIP7
   registry for that region/frequency. Disabled with TODO comments
   pointing at the OIFS XIOS config that would need to add the
   ``atmos_1h_pt_sp`` / ``atmos_3h_pt_ts`` instant fields. Re-enable
   once the source data exists.
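
The override in item 1 can be sketched as below. It is simplified to two outcomes and is not the actual ``timeaverage._get_time_method``; the signature and the returned strings are assumptions.

```python
def get_time_method(frequency, cell_methods=""):
    """Classify a rule as 'INSTANTANEOUS' or 'MEAN' (simplified sketch)."""
    if frequency.endswith("Pt"):          # CMIP6 convention, e.g. "6hrPt"
        return "INSTANTANEOUS"
    if "time: point" in cell_methods:     # CMIP7 convention
        return "INSTANTANEOUS"
    return "MEAN"


def midpoint_shift_hours(interval_hours):
    """Shift applied to a stamp when a MEAN rule is moved to the midpoint."""
    return interval_hours / 2
```

With only the first branch, a CMIP7 rule with frequency ``6hr`` and ``time: point`` falls through to MEAN and picks up the half-interval shift (3 h, 1.5 h and 0.5 h for 6hr, 3hr and 1hr), which is the symptom described above.
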
Both are CMIP7 HIGH-priority variables but the OIFS XIOS file_def in
this production run only emits averaged sources where CMIP7 asks for
``time: point``:

- atmos.ts.tpt-u-hxy-u.3hr.glb wants instants to capture diurnal-
  cycle peaks; we only have ``atmos_1h_ts_ts`` (averaged at HH:30).
- atmos.ps.tpt-u-hxy-u.1hr.30S-90S wants instants to capture sub-
  hourly Southern-Ocean pressure-wave variability; we only have
  ``atmos_1h_sfc_sp`` (averaged at HH:30).

Sampling the averages and labeling them as instants would ship the
wrong data type — the variability the variables are meant to capture
isn't in the source. Production is already underway, so adding the
XIOS fields would require a model restart from an earlier checkpoint.

Skip them honestly. variable_coverage.md is updated with the gap
explanation; YAML rules stay commented out with an inline note.
Two MEDIUM cleanups for cli73:

- variable_attributes ``is_temperature_sn`` now accepts compound CF
  standard names that don't strictly end with ``temperature`` (e.g.
  ``sea_water_potential_temperature_at_sea_floor`` on tob,
  ``sea_water_conservative_temperature``, ``sea_surface_temperature``)
  via an ``_temperature`` substring test. Also accepts ``degC`` /
  ``Celsius`` as absolute-temperature units. Clears CF §3.1.2 ``tob``
  units_metadata MEDIUM.

- When the data-request ``cell_measures`` is the CMIP7 ``--MODEL``
  placeholder, substitute the realm default rather than dropping it:
  ``area: areacello`` for ocean / seaIce / landIce / ocnbgchem,
  ``area: areacella`` for atmos / aerosol / land / atmosChem. wcrp
  ATTR001 requires the attribute to be PRESENT, and the existing
  placeholder-drop left it absent on siu / siv / sidmasstranx /
  sistrxdtop and similar variables whose data-request entry says
  ``--MODEL``. Clears 8x ATTR001 cell_measures MEDIUM.
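
The placeholder substitution can be sketched as follows. The function name, the case-insensitive realm comparison and the ``ValueError`` for an unlisted realm are assumptions; the realm groups and the substituted values are the ones listed above.

```python
PLACEHOLDER = "--MODEL"
OCEAN_LIKE = {"ocean", "seaice", "landice", "ocnbgchem"}
ATMOS_LIKE = {"atmos", "aerosol", "land", "atmoschem"}


def resolve_cell_measures(cell_measures, realm):
    """Replace the --MODEL placeholder with the realm default."""
    if cell_measures != PLACEHOLDER:
        return cell_measures  # explicit data-request value is kept as-is
    key = realm.lower()  # assumption: tolerate differences in realm casing
    if key in OCEAN_LIKE:
        return "area: areacello"
    if key in ATMOS_LIKE:
        return "area: areacella"
    # Assumption: fail loudly rather than silently dropping the attribute.
    raise ValueError(f"no default cell_measures for realm {realm!r}")
```

Returning a value in every handled case keeps the attribute present, which is what ATTR001 checks for.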

§2.4 dim-order MEDIUMs on unstructured-grid files (mrsll, zg, hur,
ua_6hr, tsl, mrsol, ...) NOT touched. The CF check classifies single-
horizontal-dim variables as T,Z,(A) and reports they're not in
T,Z,Y,X order. Same shape affects ROMS curvilinear in compliance-
checker's own test corpus; upstream limitation, not real non-
conformance. Leave as advisory MEDIUM.
FESOM unstructured triangles crossing the dateline ship vertices on
both branches (e.g. (179°, -179°, -180°)) while FESOM's element
centroid stores ``lon`` in one branch only. cf §7.1 then sees the
centroid outside the [min, max] vertex bbox, reporting "N points lie
outside the bounding box". cli73 cleared most of the FESOM yearlies
but still tripped on 1814 elements per file for difmxylo / tauuo /
tauvo / sistressave (~0.03 % of cells, all on the dateline).

This isn't a precision issue — both ``lon`` and ``lon_bnds`` are
already double on disk (cli67 fdf444f promotion was symmetric across
lat and lon). lat passes because it doesn't wrap. The real fix is
geometric: shift each vertex longitude by a multiple of 360° into the
window centred on the centroid (centroid ± 180°). The shift is the
identity on the sphere, so the represented geometry does not change,
and afterwards the bbox contains the centroid by construction.
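
The shift can be sketched per cell in pure Python. This is an illustration and not the vectorised code in ``_ensure_lat_lon_bounds_impl``; the function names are hypothetical and the centroid value in the usage note is made up.

```python
def shift_into_centroid_window(centroid_lon, vertex_lons):
    """Shift each vertex longitude by a multiple of 360 degrees so that it
    lands in [centroid - 180, centroid + 180)."""
    return [
        centroid_lon + ((lon - centroid_lon + 180.0) % 360.0) - 180.0
        for lon in vertex_lons
    ]


def bbox_contains(centroid_lon, vertex_lons):
    """True when the centroid lies inside the [min, max] vertex range."""
    return min(vertex_lons) <= centroid_lon <= max(vertex_lons)
```

For a dateline cell with vertices (179°, -179°, -180°) and an assumed centroid at 179.7°, the shifted vertices are approximately (179°, 181°, 180°), so the range contains the centroid. A cell away from the dateline is returned unchanged. The same expression should carry over to NumPy arrays, since ``%`` there follows the same sign convention for a positive divisor.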

Implemented at the tail of ``_ensure_lat_lon_bounds_impl`` so it
applies to every path that lands lon_bnds in the dataset (XIOS
rename, _recover_bounds_from_inputs, _attach_bounds_from_mesh,
add_bounds_from_coords). Vectorised — the 6.2M-element 3D files don't
churn Python.

Verified two ways:
- Direct math on cli73 difmxylo on-disk file: 1814 → 0 outliers.
- Isolated test through _ensure_lat_lon_bounds_impl on a 5-cell
  synthetic dataset with two dateline crossings: 2 → 0 outliers,
  non-dateline cells (and all of lat_bnds) untouched, log line
  emits the expected "shifted N rows" count.