Feat/cmip7 awiesm3 veg hr#266
Open
JanStreffing wants to merge 374 commits into
1a42875 to
5617a18
Compare
The single failing test ( Something to worry about? @pgierz
@pgierz, will you review this or should I go for it?
I will look
If the FESOM diag file's 'nod_area' lives on half-levels (nl1), tripyview's nod_area-rename leaves it on 'nz1', producing a result with both 'nz' (from w) and 'nz1' (from nod_area). Detect the diag file vdim and rename w to match so only one vertical dim survives.
Brings pycmor output much closer to wcrp_cmip7 compliance (38 → 30 cchecker findings on sidmassth; remaining are CV lookups blocked on AWI source_id registration plus a few structural gaps).
- global_attributes.py (CMIP7): emit branded_variable, branding_suffix, temporal/vertical/horizontal/area_label, region, drs_specs, license_id, parent_experiment_id, title, history. Fix Conventions to "CF-1.11 CMIP-7.0". data_specs_version now parses CMIP7_DReq_metadata path (v1.2.2.2). tracking_id prefix overridable.
- files.create_filepath: drop institution prefix from filename; add CMIP7 DRS format <var>_<branding>_<freq>_<region>_<grid>_<source>_<exp>_<variant>[_<time>].nc (detected via compound_name).
- files.save_dataset / _save_dataset_with_native_timespan: strip fractional-seconds from time:units (preserve epoch), drop stale time:bounds pointing to missing variable, drop stale per-var `coordinates` encoding, clear _FillValue on coordinate variables.
- variable_attributes.py: set `missing_value` as CF attribute with dtype matching _FillValue (was only in encoding, caused dtype mismatch warnings).
- chunking.py / config.py: default _FillValue and missing_value to 1.0e20 (CMIP/CMOR spec), was 1e30.
- timeaverage.py: default adjust_timestamp to "mid" for MEAN so time sits at interval midpoint (CMIP spec).
- cmip7_*_test yamls: use temp source_id AWI-ESM3-VEG-LR, piControl (case-sensitive), add activity_id: CMIP and parent_experiment_id.
… dir tree
Brings sidmassth verify run from 30 → 27 wcrp_cmip7 findings; the remaining are all CV lookups that depend on AWI registrations.
- timeaverage.timeavg: only subtract 1d at offset==1.0 (was always subtracting), so default mid-month timestamp lands on the actual midpoint (Jan 16, not Jan 15).
- data_request/variable.py: surface cell_measures from CMIP7 all_var_info.json onto CMIP7DataRequestVariable.attrs, so std_lib.variable_attributes.set_variable_attrs propagates it onto the data variable. Drop None/empty/`::MODEL` placeholder values.
- std_lib/files.py: emit lat/lon bounds via existing std_lib.bounds.add_bounds_from_coords, gated on 1-D, monotonic, size>=2 coords (FESOM unstructured node arrays correctly skipped). Wired into all four save paths.
- std_lib/global_attributes.CMIP7GlobalAttributes.subdir_path: rewrite to emit the 13-component CMIP7 DRS layout (drs_specs/mip_era/activity/institution/source/experiment/variant/region/frequency/variable/branding_suffix/grid_label/directory_date) in place of the CMIP6 10-component path.
- examples/cmip7_*_test.yaml: enable_output_subdirs: true so the rewritten DRS path actually surfaces.
- doc/design-qc-integration.md: document the FESOM unstructured monotonicity limitation (cchecker VAR005); fix is regrid or restructure to `node` dimension.
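A minimal sketch of why the default mid-month stamp should land on Jan 16 rather than Jan 15, using only the standard library. `interval_midpoint` is a hypothetical helper for illustration, not the pycmor timeavg code:

```python
from datetime import datetime


def interval_midpoint(start: datetime, end: datetime) -> datetime:
    """Return the instant halfway between two interval edges."""
    return start + (end - start) / 2


# January 2000: 31 days, so the midpoint is 15.5 days after Jan 1.
jan_mid = interval_midpoint(datetime(2000, 1, 1), datetime(2000, 2, 1))
# February 2001: 28 days, so the midpoint is 14 days after Feb 1.
feb_mid = interval_midpoint(datetime(2001, 2, 1), datetime(2001, 3, 1))

print(jan_mid)  # 2000-01-16 12:00:00
print(feb_mid)  # 2001-02-15 00:00:00
```

Subtracting a further day from the January midpoint gives Jan 15, which is consistent with the behaviour the commit describes as "was always subtracting".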
…gate atmos file blocks by (level, freq, op)
_ensure_lat_lon_bounds now splits regular coords (infer from centers) from unstructured ones (copy lat_bnds/lon_bnds from rule.grid_file when size + center values match). Clears ATTR001 bounds findings for FESOM2 native output.
Plumbing:
- Add qc_tests: [cf, wcrp_cmip7] to every tier's inherit block (17 yamls). The cc-plugin-wcrp suite must be installed separately:
pip install 'cc-plugin-wcrp @ git+ssh://git@github.com/ESGF/cc-plugin-wcrp.git@master'
CF fixes surfaced by cli64 (first QC-on full-tier run):
1. CF §1.2 wo nz1 not strictly monotonic. The DARS mesh's deepest midpoint in mesh.depth is a seabed BC artefact (drops back from ~6125 m to ~3160 m at the last index). average_w_interfaces_to_midpoints now detects the non-monotonic tail and trims it, and sets full CF attrs (long_name, standard_name, units, axis, positive) on the new vertical coord.
2. CF §3.3 bnds aux variable missing long_name. time_bounds now attaches long_name='bounds index' to the bnds index coord, so the serialised int64 bnds(bnds) carries a CF-compliant description.
3. CF §2.5.1 coordinate _FillValue. _encoding_from_dask_chunks now sets _FillValue=None for every coord variable (lat/lon/time/lev/...), not just for data variables. Eliminates the spurious lat:_FillValue on areacella/areacello/fx files.
4. CF §3.1 vsfcorr units 'm s-1 psu' not UDUNITS-recognised. The CMIP7 DReq still ships 'psu' in vsfcorr's units. variable_attrs now rewrites 'psu' -> '1e-3' (the CMIP6+ convention; UDUNITS-accepted scaling factor) just before applying attrs.
Source registration: WCRP-CMIP/Essential-Model-Documentation#640 (merged 2026-06-11; AWI-ESM3-4-2-veg-HR ingested into cmip7 1.2.6 at 2026-06-16 10:32 UTC).
Per-tier inherit:
- source_id: AWI-ESM-3 -> AWI-ESM3-4-2-veg-HR (matches CV drs_name)
- experiment_id: picontrol -> piControl (CV drs_name)
- activity_id: CMIP (new, required by wcrp_cmip7)
- parent_experiment_id: 'no parent' (new; CMIP6 convention)
- calendar: proleptic_gregorian (new; matches AWI registration)
- institution: '...Helmholtz...Bremerhaven...' -> 'Alfred Wegener Institute for Polar and Marine Research' (matches CV description verbatim)
Per-tier grid_label (from EMD horizontal_computational_grid + subgrids):
- atm + cap7_aerosol (XIOS-regridded 0.25 deg regular): g113
- land tiers + veg_atm (LPJ-GUESS on OIFS reduced gaussian unstructured): g122
- ocean tiers (FESOM unstructured triangular native): g130
- seaice tiers (FESIM embedded in FESOM mesh): g130
Per-tier pycmor:
- enable_output_subdirs: true (flips DRS subdir layout on)
Surfaced by cli65 wcrp_cmip7 plugin: 147/147 files had ATTR004 (activity_id, experiment_id, grid_label, source_id), ATTR009 (institution match), FILE001 (no MIP-DRS7 in path), PATH001/002 (DRS structure vs attrs). Should all clear now.
The fix_inherit_cv.py edit dropped the institution line everywhere. pycmor.global_attributes.get_institution() falls back to institution_id when not user-supplied, so files ended up with institution='AWI' — the ATTR009 / ATTR004 wcrp checks expect the CV-canonical description. Adds: institution: "Alfred Wegener Institute for Polar and Marine Research" (verbatim from CMIP7-CVs cmip7 collection 'institution' term 'AWI'). cli66 surfaced this on 163/163 files. Clears once submitted as cli67.
Three new high/medium wcrp_cmip7 + cf findings appeared once the DRS/CV plumbing landed clean. None of them block ESGF; all three are data-quality issues with single-point pycmor fixes.
1. cf:medi:§7.1 'lat outside lat_bnds' on FESOM ocean (63 files). _attach_bounds_from_mesh now promotes both the centroid coord and the bnds polygon to float64 explicitly. Same geometric content as the DARS mesh's float32 storage; the extra decimals give the CF in-bounds check enough margin to absorb the ULP drift that put ~21k of ~3.1M cells a fraction outside their own polygon. No actual mesh change.
2. wcrp_cmip7:medi:[TIME003a] 'time.calendar='standard' recommended 'proleptic_gregorian'' (37 files). XIOS-produced FESOM files arrive with calendar='standard'; time_bounds now forces the encoded calendar to 'proleptic_gregorian' on the main time coord (semantic no-op for any date past 1582-10-15, which covers every CMIP7 experiment). Clears the entire FESOM TIME003a class.
3. wcrp_cmip7:high:[VAR004] 'time_centered_bounds missing' (12 files) on LPJ-GUESS daily output. The auxiliary 'time_centered' coord (NEMO/FESOM XIOS convention for centre-of-averaging-period) carries bounds='time_centered_bounds' but pycmor only rebuilds time_bnds, leaving the reference dangling. _drop_xios_aux_time_coords now strips both time_centered and time_centered_bounds in the pre-save pass. Also clears the TIME003a finding on time_centered:calendar (which had survived the time-coord fix because it's a different variable).
- Add time_bounds to DefaultPipeline so rules that use it pick up the TIME001 midpoint realignment without each yaml having to wire it in.
- Loosen _create_mean_bounds month detection: previously required approx_interval AND data_freq_days both in the 28-32 day band; broadcast-style pipelines (cfc11, ch4, n2o ...) don't carry a DReq-derived interval so the rule had approx_interval=None and the monthly path was silently skipped. Now any dataset with data spacing in the monthly range gets month-start bounds; approx_interval is treated as an optional veto only.
Companion to 70bd8af (the TIME001 midpoint write, the §7.1 float64 promotion, the TIME003a calendar override, and the time_centered drop).
source files often write fixed-day-of-month timestamps (e.g. 16th at 12:00) that drift by 1-2 days from the bnds midpoint for non-31-day months. that's what trips the wcrp TIME001 "time axis check" on every broadcast-style monthly file (cfc11, ch4, lpj yearly→monthly, ...).
time_bounds.py: after creating bnds, overwrite the time coord with midpoint(time_bnds). preserves attrs/encoding via DataArray.copy(data=) and skips the instantaneous path. also handles cftime calendars (proleptic_gregorian, gregorian, ...) in _midpoint_bounds and _create_monthly_bounds, since pd.Timestamp can't accept cftime objects and np.mean can't average them. relaxed the monthly-detection check so the month-start bnds fire when data_freq_days looks monthly even if approx_interval is unset. promotes calendar="standard"/"gregorian" to "proleptic_gregorian" at write so TIME003a stops recommending it.
pipeline.py: added the step to DefaultPipeline right after timeavg.
awi-esm3-veg-hr-variables/: swept set_time_bounds into 17 yamls, 111 pipelines total. uses the std_lib alias (DataArray-safe) so the pipeline payload still round-trips through the Dataset-only inner fn.
CF §7.1 was reporting 21k+ lat / 23k+ lon points outside their own bounding boxes on DARS2 (3.1M nodes). Two issues, two fixes.
1) FESOM/XIOS output writes bounds_lat/bounds_lon truncated to nvertex=8. For many cells the 8 slots are filled with only 1-2 unique vertices padded with the last value: degenerate line / thin-triangle polygons whose bbox really doesn't contain the stored centroid. The full polygon (up to 16 vertices on DARS2) lives in mesh.nc.
Fix: when rule.grid_file is configured and the mesh carries matching-size lat_bnds/lon_bnds, drop whatever was renamed/recovered from the input file and let _attach_bounds_from_mesh pull the canonical mesh polygons. Only fires when a mesh is actually configured, so existing non-FESOM runs are unaffected.
2) FESOM stores lat/lon as float32 even though the mesh natively uses float64. CMIP6/7 cmor-tables specify type=double for latitude / longitude (CMOR writes them that way too). pycmor was forwarding the downcast.
Fix: promote lat/lon (and their bnds) to float64, both in-memory and in encoding so the dtype survives the to_netcdf round-trip, for all 1-D unstructured coords, not just the mesh-attach path. Pure precision recovery from the mesh's native values; not invented.
End-to-end on HR DARS2 siconc: cf high=0 medium=0, wcrp_cmip7 high=0 medium=0. Was: cf medium=1 (21k+ outliers).
…-universe#190 in flight snapshot after the TIME001 + mesh-prefer + float64 fixes landed and the esgvoc db caught up with the source thin entry. publishable today; only upstream packaging + registry housekeeping remain.
batch of cli68 follow-up fixes. HR seaice mon + day both verified clean (cf 0/0, wcrp_cmip7 0/0) on local + slurm runs.
time_bounds.py
- _create_daily_bounds snaps day bnds to midnight regardless of source timestamp (FESOM noon stamps no longer offset bnds, midpoint = noon).
- existing-bnds path realigns time = midpoint(time_bnds) instead of early-returning (so source-shipped bnds still get the wcrp TIME001 midpoint convention applied).
- canonical encoding helper unifies calendar (-> proleptic_gregorian), units (-> days since), dtype (-> float64) and propagates the same to time_bnds. fixes A3 / A4 / A6.
- bnds DataArray no longer carries its own long_name / comment so cf §7.1 doesn't flag mismatched boundary attrs. fixes A8.
- fx-style datasets (no time coord) are now a silent no-op instead of raising ValueError.
files.py
- save_dataset reruns time_bounds() per resample group, so canonical bnds survive the DataArray round-trip through the pipeline (xarray refuses to attach the foreign 'bnds' dim as an aux coord). fixes A1.
- _save_dataset_with_native_timespan drops the time coord + time_bnds on fx / ofx frequencies; CMIP fx files are time-invariant by spec. fixes A5 / A6 fx path.
- _ensure_lat_lon_bounds_impl drops dangling lat.attrs['bounds'] when the referenced bnds variable can't be recovered. fixes A7.
__init__.py
- set_time_bounds wrapper preserves the realigned time coord (with its encoding) on the returned DataArray; time_bnds re-attaches in save_dataset.
core/gather_inputs.py
- _check_compatible_schemas pre-opens each file header and fails fast with a clear error when input list has heterogeneous primary dims (e.g. native nod2 + regridded lat/lon both matched by a loose .* pattern). avoids the multi-petabyte allocation that dask.tokenize triggers inside xarray's merge_collected on schema mismatch.
awi-esm3-veg-hr-variables/lrcs_seaice
- 12 hemispheric-scalar sea-ice rules (-u-hm-u branding) now carry cell_measures: "" per CMIP7 registry. fixes B1 ATTR001.
two-layer bug. the model_level regex was ^(model_)?level(_\w+)?$ which misses the OIFS-canonical 'model_levels' (trailing s). value-based detection then sees the integer indices 1..137 fall inside the 0..360 longitude window and tags the dim as 'longitude'. result: every atmospheric-model-level variable (cl, cli, clw, hur on al, hus on al, ta on al, ...) writes its vertical dim as 'longitude' in the output, which the cf §2.4 dim-order check then flags on 15 files. regex now matches 'model_levels' / 'levels' too. value-side adds an early integer-index-sequence check so 1..N or 0..N-1 returns 'model_level' before lat/lon range tests get a chance. verified on the OIFS atmos_mon_ml_cli source: model_levels now maps to alevel as expected.
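A sketch of the two-layer fix described above. `OLD_LEVEL_RE` is the pattern quoted in the message; `NEW_LEVEL_RE`, `is_index_sequence` and `classify` are hypothetical names and one possible way to write the widened regex and the early integer-index check, not the actual pycmor code:

```python
import re

OLD_LEVEL_RE = re.compile(r"^(model_)?level(_\w+)?$")
# Assumed widening: allow an optional trailing "s" ("levels", "model_levels").
NEW_LEVEL_RE = re.compile(r"^(model_)?levels?(_\w+)?$")


def is_index_sequence(values) -> bool:
    """True when values are exactly 1..N or 0..N-1."""
    vals = list(values)
    if not vals or any(v != int(v) for v in vals):
        return False
    ints = [int(v) for v in vals]
    n = len(ints)
    return ints == list(range(1, n + 1)) or ints == list(range(n))


def classify(name, values) -> str:
    """Name-based detection first, then the value-based fallbacks."""
    if NEW_LEVEL_RE.match(name):
        return "model_level"
    # Early exit: index sequences must not reach the lat/lon range tests.
    if is_index_sequence(values):
        return "model_level"
    if all(0 <= v <= 360 for v in values):
        return "longitude"
    return "unknown"


print(OLD_LEVEL_RE.match("model_levels"))          # None
print(classify("model_levels", range(1, 138)))     # model_level
print(classify("x", [0.5, 1.5, 359.5]))            # longitude
```

An integer-valued 1-degree longitude axis with an unrecognised name would also pass the index check in this sketch, so the ordering relative to name-based lon detection matters.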
Cli69 left 126 TIME003 and 44 TIME001 HIGH findings on LPJ-GUESS and FESOM yearly variables. The wcrp filename regex only matches 6- or 8-digit tokens, so the legacy YYYY-YYYY form for yr/yrPt/dec was being rejected as no token at all; switching to YYYYMMDD-YYYYMMDD (Jan 1 to Dec 31 of the start/end years) clears TIME003 and stays inside the canonical CMIP DRS for yearly files. TIME001 was failing because the single-stamp yearly write path returned early without bnds and the midpoint check then compared the raw stamp against an implied year midpoint. Added _create_yearly_bounds (year_start, next_year_start) that handles datetime64 and cftime, taught _create_mean_bounds to use it for both single-stamp and multi-stamp yearly input, and gated the fx-like short-circuit so yearly files no longer fall through to it. Also reordered the assign_coords pair so the midpoint time is attached before time_bnds; the previous order silently NaN'd every bnds entry for any case where the source stamp wasn't already at the midpoint.
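For illustration, the yearly filename token and the (year_start, next_year_start) bounds can be sketched with plain `datetime`. The helper names are hypothetical, and the real `_create_yearly_bounds` also handles cftime, which this sketch does not:

```python
from datetime import datetime


def yearly_filename_token(first_year: int, last_year: int) -> str:
    """8-digit form: Jan 1 of the first year to Dec 31 of the last year."""
    return f"{first_year:04d}0101-{last_year:04d}1231"


def yearly_bounds(stamp: datetime):
    """Bounds spanning the calendar year that contains the stamp."""
    return datetime(stamp.year, 1, 1), datetime(stamp.year + 1, 1, 1)


token = yearly_filename_token(1851, 1851)
start, end = yearly_bounds(datetime(1851, 7, 1))
midpoint = start + (end - start) / 2

print(token)     # 18510101-18511231
print(midpoint)  # 1851-07-02 12:00:00
```

1851 has 365 days, so the midpoint sits 182.5 days after Jan 1; a raw stamp on some other day in July would not match it, which is the mismatch TIME001 reported.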
cli69 production batch surfaced 28 HIGH findings (14 ATTR004 + 14 FILE001) for the `30s-90s` region term that IS in upstream cmip7@1.2.6 but was missing from the esgvoc snapshot installed on the SLURM compute nodes. Locally the validation passes; only the workers see it as HIGH. Add `esgvoc use universe@latest && esgvoc use cmip7@latest` right after `conda activate pycmor_py312` in run_hr_shard.sh and run_walker_compute.sh so every compute node refreshes its CV cache before pycmor or the walker imports esgvoc. Each call is idempotent and ~1s if already current, and soft-fails so a briefly-down registry doesn't abort the run; a stale-cache HIGH finding reappearing is itself the signal that the refresh missed.
Catches the 68 MEDIUM TIME003a (calendar=standard) and 13 HIGH VAR005 (time dtype int64) findings on FESOM ocean monthly + yearly files. The set_time_bounds pipeline step already calls _force_canonical_time_encoding, but the helper only patches ds[time_label].encoding. The save path then builds a per-variable encoding dict and passes it to xr.save_mfdataset, and that dict overrides anything the dataset-level encoding had set. The yearly path also fell back to _save_dataset_with_native_timespan, which never ran the helper at all. Adds canonicalize_time_in_encoding_dict so the encoding dict passed to the writer also carries proleptic_gregorian / days since / float64, and calls _force_canonical_time_encoding at every save site (resample-group loop, native-timespan loop, scalar-time path, and once on the parent dataset before the resample / native fork). Both helpers are idempotent and silently skip when there is no time coord, so fx / ofx files are unchanged. Also strips fractional seconds and ISO T separators from the units reference date and collapses it to date-only (xarray's CF encoder rewrites days since YYYY-M-D HH:MM:SS back to YYYY-M-DTHH:MM:SS on write, which the wcrp_cmip7 ATTR004 / cchecker units regex rejects). Derives a clean days since YYYY-MM-DD when the source has no units at all, so xarray's default encoder no longer mints a .000000 reference.
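A hedged sketch of the reference-date cleanup only (not the calendar or dtype handling). `clean_time_units` and its regex are assumptions for illustration; this version deliberately leaves the string untouched when the time of day is non-zero, since dropping it would move the epoch:

```python
import re

# "<unit> since Y-M-D", optionally followed by an all-zero time of day
# written with either a space or an ISO "T" separator.
_REF_DATE_RE = re.compile(
    r"^(\w+)\s+since\s+(\d{1,4})-(\d{1,2})-(\d{1,2})"
    r"(?:[T ]0{1,2}:0{1,2}(?::0{1,2}(?:\.0+)?)?)?$"
)


def clean_time_units(units: str) -> str:
    """Collapse a midnight reference date to zero-padded date-only form."""
    match = _REF_DATE_RE.match(units.strip())
    if match is None:
        return units  # unknown shape or non-midnight epoch: do not touch
    unit, year, month, day = match.groups()
    return f"{unit} since {int(year):04d}-{int(month):02d}-{int(day):02d}"


print(clean_time_units("days since 1850-1-1T00:00:00.000000"))
# days since 1850-01-01
print(clean_time_units("hours since 2000-01-01 06:00:00"))
# hours since 2000-01-01 06:00:00
```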
cli69 left 17 MEDIUM cf §7.1 findings on LPJ-GUESS daily files of the shape "Bounds variable time_bnds and parent variable time have non matching boundary related attributes: ['long_name']". The freshly-built bnds path (commit a705361, A8) already constructs the bnds DataArray with attrs={} so cf §7.1 is silent there. The existing-bnds branch — used when the source file ships time_bnds that pycmor passes through (LPJ-GUESS daily output is one such case) — never stripped them. Source attrs like long_name="time bounds" survive through save and the parent time coord's own long_name doesn't match. Add _strip_bnds_inheritable_attrs and call it in the existing-bnds branch of time_bounds(). The helper drops the full CF §7.1 list of boundary-related attrs (long_name, standard_name, units, calendar, axis, positive, leap_month, leap_year, month_lengths, climatology, bounds, comment), leaving unrelated user attrs alone. Mirrors the empty-attrs construction on the new-bnds path so both code paths emit the same surface.
Three single-rule single-year YAMLs that exercise the code paths the cli69 round had to fix. Mirrors the existing HR seaice smoke pattern (local dask cluster, qc_enabled, qc_repack) so they run in a couple of minutes each and produce a clean QC sidecar when the fixes are in.
- cmip7_hr_qc_smoke_lpj_yearly.yaml exercises E1 (8-digit YYYYMMDD-YYYYMMDD filename for yr frequency) and E2 (year-start bnds + midpoint realign). Source: HR LPJ-GUESS baresoilFrac yearly.
- cmip7_hr_qc_smoke_fesom_ocean_mon.yaml exercises E3 (calendar to proleptic_gregorian), E4 (units canonicalised to days since YYYY-MM-DD, no fractional seconds), and E6 (time dtype float64). Source: HR FESOM sst monthly. The wcrp_cmip7 plugin gets the FESOM int64 / standard paths that DefaultPipeline rules would otherwise miss.
- cmip7_hr_qc_smoke_lpj_daily.yaml exercises E5 (strip CF-inheritable attrs from source-shipped time_bnds). Source: HR OIFS atmos_day_land via the temporal_diff_pipeline used by dcw_day.
All three reuse the canonical HR identifiers (source_id, grid_label, DARS2 mesh path) so the wcrp_cmip7 ATTR004 checks land on real CV terms, not placeholders.
Three small fixes turned up by smoke runs of the cli69 LPJ-daily and
FESOM ocean-monthly fix targets:
- _force_canonical_time_encoding now sets the parent time coord's
standard_name='time', long_name='time', axis='T' if absent. FESOM
monthly via timeavg rebuilds the time coord and drops these, which
trips wcrp ATTR001 and cf §3.3 / §5.1.
- Same helper now also sets encoding['_FillValue']=None on both the
time coord and time_bnds. xarray's default CF encoder emits a NaN
fill on every float variable; cf §7.1 explicitly forbids _FillValue
on bnds.
- dimension_mapping.apply_mapping renames the matching `{src}_bnds`
aux variable when it renames a dim X→Y, and fixes the parent coord's
`bounds` attr pointer. LPJ-GUESS daily output uses `time_counter`,
which lands a stray `time_counter_bnds` and no `time` variable; wcrp
TIME003 then reports "Missing 'time' variable". The `time_counter`
name is added to the time regex so apply_mapping recognises it.
get_encoding_with_chunks() iterates only ds.data_vars, so it skips coord variables — including bounds aux like time_bnds. The companion _encoding_from_dask_chunks() suppresses _FillValue on every coord after the data_vars loop, but the trigger_compute-eager path (after the pipeline materialises) routes through this chunking helper instead, and time_bnds ends up written with the default NaN fill. cf §7.1 then flags "Boundary variables 'time_bnds' should not have the attributes: ['_FillValue']".
CMIP convention is that the on-disk time-axis variable is ALWAYS
named "time"; the numeric suffix on time1/time2/time3 only drives
the cell_methods string ("time: point" vs "time: mean") inside CMOR,
not a separate dim name. When the data request says axis=time1 (e.g.
land.cLitterLut.tpt-u-hxy-multi.yr.glb for LUT-style instantaneous
yearly snapshots), apply_mapping previously rewrote the file dim to
"time1" and shipped time1 + time1_bnds. wcrp_cmip7 TIME003 ("Missing
'time' variable.") and ATTR001 hardcode the name "time" and trip.
Collapse the mapping target before the rename: any source dim mapped
to time1/2/3 becomes "time" in the file. The matching bnds aux is
already renamed by the loop further down, and the parent coord's
`bounds` attr is repointed there too.
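The collapse plus the companion bnds rename can be sketched as a pure name-planning step. `collapse_time_targets` and `plan_renames` are hypothetical helpers operating on plain names, not the real `dimension_mapping.apply_mapping`:

```python
import re

_TIME_AXIS_ID = re.compile(r"^time[123]$")


def collapse_time_targets(dim_mapping: dict) -> dict:
    """Map any target of time1/time2/time3 to the on-disk name 'time'."""
    return {
        src: ("time" if _TIME_AXIS_ID.match(dst) else dst)
        for src, dst in dim_mapping.items()
    }


def plan_renames(variable_names, dim_mapping: dict) -> dict:
    """Return old->new names for dims and their '<dim>_bnds' companions."""
    renames = {}
    for src, dst in collapse_time_targets(dim_mapping).items():
        renames[src] = dst
        bnds = f"{src}_bnds"
        if bnds in variable_names:
            renames[bnds] = f"{dst}_bnds"
    return renames


plan = plan_renames(
    ["time_counter", "time_counter_bnds", "cLitterLut"],
    {"time_counter": "time1"},
)
print(plan)  # {'time_counter': 'time', 'time_counter_bnds': 'time_bnds'}
```

The parent coordinate's `bounds` attribute would still need repointing to the new bnds name after applying such a plan.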
extra_land/cmip7_awiesm3-veg-hr_extra_land.yaml mapped the LPJ-GUESS monthly source (lai_monthly.out) to compound .day.glb. The data is genuinely monthly (one stamp per month), but the compound told the pipeline to write a day-frequency file with day-frequency filename token. wcrp_cmip7 TIME001 then reported "expected 1.5, got 30" — the gap between time[0] and time[1] is 30 days, not 1 day. Switch to land.lai.tavg-u-hxy-lnd.mon.glb. core_land already covers the .mon variant from OpenIFS land output; this rule provides the LPJ-GUESS variant of the same compound.
Three fixes for the residual HIGHs / MEDIUMs that cli71 surfaced
after the LUT axis-id collapse and lai_mon recipe correction:
1. set_time_bounds: auto-detect cell_methods 'time: point' and treat
the rule as instantaneous. For yearly + instantaneous (tpt yearly
LUT compounds like land.cLitterLut.tpt-u-hxy-multi.yr.glb), build
year-snap bnds (so the bnds span the period) but pin time to the
bnds start, not the midpoint. wcrp TIME001 with use_midpoint=False
then sees time = filename_start (≈ -182 in days since year_mid
epoch) and matches. Clears 15× cLitterLut TIME001.
2. _save_dataset_impl: re-run set_time_bounds on the dataset after
the DataArray→Dataset conversion when time_bnds is still missing.
The std_lib wrapper drops the bnds aux when the upstream payload
is a DataArray, and the existing resample-group path re-attaches
them per group — but the native-timespan path doesn't. Which
path fires depends on a non-deterministic
pd.Timestamp.now()+YS-now comparison; today (June) it goes
native, in November it goes resample. Run set_time_bounds once
before the branch so both paths inherit a bnds-carrying dataset.
Clears 14× FESOM yearly difmxylo TIME003 ("time coverage
(1851,7)-(1851,7) does not cover filename 18510101-18511231").
3. DataRequestVariable.attrs (CMIP7): keep cell_measures='' through
the empty-string filter. siextent / sivol / siarea / sisnmass and
other scalar sea-ice compounds ship cell_measures='' in the data
request as the canonical "no spatial measure" value, and wcrp
ATTR001 requires the attribute to exist on the variable. The
filter dropped any "" field; restricting the exemption to
cell_measures preserves the safety net for accidental blank
standard_name / long_name. Clears 24× ATTR001 cell_measures
missing on sea-ice scalars.
Skipped this round: the 10× cf §7.1 lon-outside-lon_bnds on FESOM
3D elem-dim files. lat / lon and bnds are already float64 (cli67
fdf444f promotion confirmed on cli71 output), and mesh-bnds-prefer
doesn't fire because mesh.nc carries node-dim bnds only — the 3D
fields are on elem-dim, ~2× larger. ~0.03% of triangle bboxes don't
contain their centroid (likely quantization or dateline-crossing).
Real fix needs geometry work (either elem-dim mesh bnds or a
post-hoc bbox-expand step) — out of scope for this iteration.
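The cell_measures exemption from fix 3 above amounts to a filter with a small allow-list. A minimal sketch, assuming the placeholder is matched by equality on `::MODEL`; `filter_attrs` and `KEEP_EMPTY` are illustrative names, not the DataRequestVariable code:

```python
PLACEHOLDER = "::MODEL"           # assumed to be matched by equality
KEEP_EMPTY = {"cell_measures"}    # attrs where "" is a meaningful value


def filter_attrs(raw: dict) -> dict:
    """Drop None, placeholder and blank values, except allowed blanks."""
    kept = {}
    for key, value in raw.items():
        if value is None:
            continue
        if isinstance(value, str):
            stripped = value.strip()
            if stripped == PLACEHOLDER:
                continue
            if stripped == "" and key not in KEEP_EMPTY:
                continue
        kept[key] = value
    return kept


attrs = filter_attrs(
    {
        "cell_measures": "",      # canonical "no spatial measure"
        "standard_name": "",      # accidental blank, still dropped
        "long_name": None,
        "units": "m2",
    }
)
print(attrs)  # {'cell_measures': '', 'units': 'm2'}
```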
Four recipe + helper fixes for residual cli71 HIGHs whose root causes are unit-string mismatch or missing CF metadata, not pipeline bugs:
- variable_attributes psu rewrite uses ``1E-03`` (CMIP7 registry form) instead of ``1e-3``. wcrp ATTR004 does literal string compare against the registry; ``1e-3`` and ``1E-03`` are the same scaling factor to UDUNITS but trip the registry check. Clears 1× sob HIGH.
- compute_hfbasin_tripyview / compute_sltbasin_tripyview attach ``standard_name=region`` + ``long_name=Region Selection`` to the basin sector axis at construction. set_coordinates only knows about spatiotemporal axes, so the basin coord otherwise landed bare and tripped wcrp ATTR001. Clears 1× sltbasin HIGH.
- cfc11_mon / cfc12_mon / ch4_mon / n2o_mon: drop the scale_factor=1e-12/1e-9 multiplication and label files as ``1E-12`` / ``1E-09`` (the CMIP7 registry form for scaled volume-mixing ratios). Previous behaviour scaled source-ppt → mol/mol and labelled ``mol mol-1``; the registry expects values to STAY in ppt/ppb with the scaling factor declared via units. scale_factor=1.0 keeps the pipeline's scale_by_constant step happy as a no-op. Clears 4× ATTR004 HIGH.
- vsfcorr: source is m s-1 psu (velocity × per-mille mass fraction); registry wants kg m-2 s-1 (mass flux). Conversion: m/s × g/kg × ρ₀ (1025 kg/m³) = 1.025 kg/m²/s; the 1e-3 prefix on psu cancels the 1000 g/kg in seawater density. New nan_to_zero_scale_pipeline runs scale_by_constant after nan_to_zero; vsfcorr rule sets scale_factor=1.025 + scaled_units=kg m-2 s-1. The old rule's ``model_unit: kg m-2 s-1`` was a metadata-only claim that the source data was already in CMIP units, which was wrong. Clears 1× ATTR004 + 1× §3.1 UDUNITS HIGH.
Defers: 4× FILE004d uncompressed chunk size (cmip7repack tooling issue: runs but doesn't grow chunks enough on FESOM high-res unstructured) and 1× ``No geophysical variable detected`` on the basin fx file (wcrp upstream check, int32 flag variable not recognised as geophysical).
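The vsfcorr factor above is simple arithmetic and can be checked in a few lines. The function names are hypothetical; the density value 1025 kg m-3 is the one quoted in the message:

```python
RHO_0 = 1025.0     # reference seawater density, kg m-3 (from the message)
G_PER_KG = 1000.0  # psu read as grams of salt per kilogram of seawater


def vsfcorr_scale_factor(rho0: float = RHO_0) -> float:
    """Factor turning 'm s-1 psu' values into 'kg m-2 s-1'."""
    # (m/s) * (g/kg) * (kg/m3) = g m-2 s-1; dividing by 1000 g/kg gives kg.
    return rho0 / G_PER_KG


def to_mass_flux(values_m_s_psu):
    """Apply the constant factor element-wise to a list of values."""
    factor = vsfcorr_scale_factor()
    return [v * factor for v in values_m_s_psu]


print(vsfcorr_scale_factor())  # 1.025
```

This mirrors what a scale_by_constant step with scale_factor=1.025 would do; it says nothing about how the real pipeline step is implemented.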
wcrp FILE004d (HIGH) requires each data-variable chunk to be at least 4 MiB uncompressed (CMIP7 storage convention). The dask-aligned path in ``_encoding_from_dask_chunks`` mirrors whatever dask chunks the pipeline produced; on FESOM unstructured HR fields (sfx: 12 × 3146761 nodes, dask chunks (12, 18724)) that lands at 1.8 MiB per netCDF chunk and trips the check. cmip7repack is the documented post-hoc fix but for these 2D high-res cases it re-uses the same shape and the file stays under threshold.
Add a per-variable floor: if dask-aligned chunks are below 4 MiB, grow the trailing (rightmost, typically horizontal) dim's chunk until product × wordsize crosses the threshold, capped at the dim size. Leaves the time-axis chunking pycmor already picked alone so streaming writes stay aligned with dask. The non-dask path (``calculate_chunks_simple``) already targets 100 MB so it doesn't need the floor.
Clears 4× FILE004d HIGH (sfx, somint, phcint, absscint) on the cli71 production batch.
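A minimal sketch of such a floor, assuming 8-byte values for the sfx example (the message does not state the dtype). `apply_chunk_floor` is a hypothetical name, not the function in chunking.py:

```python
import math

MIN_CHUNK_BYTES = 4 * 1024 * 1024  # 4 MiB floor from FILE004d


def apply_chunk_floor(chunks, shape, itemsize, min_bytes=MIN_CHUNK_BYTES):
    """Grow only the trailing chunk until the chunk reaches min_bytes."""
    chunks = list(chunks)
    lead = math.prod(chunks[:-1])  # elements contributed by leading dims
    if lead * chunks[-1] * itemsize >= min_bytes:
        return tuple(chunks)  # already large enough, leave untouched
    needed = math.ceil(min_bytes / (lead * itemsize))
    # Never exceed the dimension size, never shrink the existing chunk.
    chunks[-1] = min(shape[-1], max(chunks[-1], needed))
    return tuple(chunks)


grown = apply_chunk_floor((12, 18724), (12, 3146761), itemsize=8)
print(grown)  # (12, 43691)
```

With these numbers the grown chunk holds 12 × 43691 × 8 = 4,194,336 bytes, just over the 4 MiB threshold, while the leading (time) chunk is unchanged.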
Two cli72 residual HIGH fixes whose root causes were in pycmor:
- time_bounds: extend the period-start pin (previously yearly-only)
to every rule wcrp_cmip7 TIME001 treats as instantaneous. wcrp's
use_midpoint=False fires for (a) any rule with cell_methods
"time: point" OR (b) any frequency NOT in its AVG_CORRECTION_FREQ
set {day, mon, monPt, yr, yrPt, 1hrCM, sem}. So monthly tpt
(sistressave, sistressmax, fracLut) and decadal (masso, thkcello,
tauvo, so, volo, thetao, tauuo) need time = bnds[:, 0], not the
bnds midpoint. Clears 10x TIME001.
- DRS region token: stop lower-casing parts[4] of the compound when
building the filename and the directory path. CMIP7's region CV
carries lowercase for simple codes (glb, nh, sh) but uppercase for
latitude-band tokens (30S-90S, 30N-90N, ...). Forcing lowercase
trips wcrp FILE001 (DRS directory + filename) and ATTR004 (region
CV) for the 1hr/3hr south-30 hemispheric files. Preserves whatever
case the compound carries; the data-request corpus is the source
of truth. Clears 36x FILE001 + 18x ATTR004 region.
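The instantaneous-vs-midpoint rule from the first fix reduces to a small predicate. The frequency set is copied from the message; the function name and the substring test on cell_methods are assumptions about how such a check could be written:

```python
AVG_CORRECTION_FREQ = {"day", "mon", "monPt", "yr", "yrPt", "1hrCM", "sem"}


def pins_time_to_period_start(frequency: str, cell_methods: str) -> bool:
    """True when time should equal bnds[:, 0] instead of the midpoint."""
    if "time: point" in (cell_methods or ""):
        return True  # case (a): instantaneous by cell_methods
    return frequency not in AVG_CORRECTION_FREQ  # case (b)


print(pins_time_to_period_start("mon", "area: mean time: point"))  # True
print(pins_time_to_period_start("dec", "area: mean time: mean"))   # True
print(pins_time_to_period_start("mon", "area: mean time: mean"))   # False
```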
Source files from OpenIFS-XIOS / NEMO-XIOS / FESOM ship ``time_counter`` as the time dim. Pycmor only renamed it to ``time`` when each rule set ``time_dimname`` explicitly; otherwise time_counter survived load and the downstream behaviour was inconsistent across paths:
- The primary loader (core.gather_inputs.load_mfdataset) skipped its year filter because ``"time" in dims`` was False.
- The secondary-input loader (custom_steps._load_secondary_mf*) ran a drop_vars block that REMOVED time_counter from coords without first renaming it. The secondary returned without any time coord.
- Custom compute_X steps then did primary × secondary arithmetic where the primary still carried time_counter and the secondary carried no time at all. xarray broadcasting collapsed the result to a single time stamp at best, an empty time dim at worst.
cli72 surfaced this on ``sfcWind_1hr_south30``: output file shipped ``time: 0`` (zero-length unlimited dim) with no time variable, tripping 8 HIGH (ATTR004 coordinates, DIM002, cf §5, cf §5.1 each twice).
Add auto-detect for time_counter / time_centered in both loaders so the rename happens without each rule having to opt in. Explicit ``time_dimname`` still wins. Companion bnds variables (time_counter_bounds, time_counter_bnds) get renamed alongside the dim.
Two rules sharing cmor_variable + branding_suffix + frequency but differing in region (rlds_1hr_south30 vs rlds_1hr glb, rsds_1hr_south30 vs rsds_1hr glb, ...) used the same _rule_files prefix, so each rule's qc step picked up the other rule's files. On 1hr / 3hr / 6hr regional rules this raced against the glb rule's concurrent cmip7repack: the south30 rule's strip cleared _QuantizeBitGroom* on the glb file, then the glb rule rewrote that file (re-introducing the attr transiently), then the south30 cchecker read the stale state and the sidecar recorded a §2.3 _QuantizeBitGroom* finding that wasn't actually on disk. Anchor _rule_files on <var>_<branding>_<freq>_<region>_ so each rule only sees its own files. The DRS filename layout <var>_<branding>_<freq>_<region>_<grid>_... makes the region token a clean prefix boundary. Clears the 1× §2.3 finding on cli72 (rlds_1hr_south30) and removes the class of stale-finding races on every south30 / hemispheric rule.
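The effect of adding the region token to the prefix can be shown with two made-up filenames. The branding string, grid label placement and the `_example.nc` tails are placeholders for illustration only, and `rule_file_prefix` is a hypothetical helper:

```python
def rule_file_prefix(var: str, branding: str, freq: str, region: str) -> str:
    """Prefix that ends on the region token, including the trailing '_'."""
    return f"{var}_{branding}_{freq}_{region}_"


# Hypothetical filenames following <var>_<branding>_<freq>_<region>_<grid>_...
files = [
    "rlds_tavg-u-hxy-u_1hr_glb_g113_example.nc",
    "rlds_tavg-u-hxy-u_1hr_30S-90S_g113_example.nc",
]

old_prefix = "rlds_tavg-u-hxy-u_1hr_"  # no region: matches both rules
new_prefix = rule_file_prefix("rlds", "tavg-u-hxy-u", "1hr", "30S-90S")

old_matches = [f for f in files if f.startswith(old_prefix)]
new_matches = [f for f in files if f.startswith(new_prefix)]

print(len(old_matches))  # 2
print(new_matches)       # ['rlds_tavg-u-hxy-u_1hr_30S-90S_g113_example.nc']
```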
cmip7repack writes ``<file>.nc_cmip7repack``, then atomically renames it over the original. If the wrapping SLURM job times out mid-repack (3 h walltime, daily 137-level 3D atm fields like cl_day / pfull_day each take well over an hour), cmip7repack gets killed between "successfully created" and the rename. The partial intermediate stays on disk forever, corrupted (HDF error on open) and taking GB per file.
cli72 left 30 GB of zombies (cl_day + pfull_day at 14.8 GB and 15.0 GB each). The next batch would inherit them and the disk accounting would keep climbing every cli iteration.
Add ``_clean_cmip7repack_orphans`` and invoke it before AND after every ``_run_cmip7repack`` pass so:
- before: a previous job's killed-rename zombies are removed before this job's repack starts (otherwise a half-baked intermediate could shadow the in-progress repack on Lustre's metadata cache).
- after: any zombies left by THIS job (timeouts, OOM, signal) are removed at the end of the qc_repack step.
No more 30 GB silent accumulation per batch.
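A rough sketch of what such an orphan sweep could look like with `pathlib`. `clean_repack_orphans` is a stand-in name, and the recursive search is an assumption; the real `_clean_cmip7repack_orphans` may scope its search differently:

```python
import tempfile
from pathlib import Path


def clean_repack_orphans(directory) -> list:
    """Delete leftover '*.nc_cmip7repack' intermediates and return them."""
    removed = []
    for orphan in sorted(Path(directory).rglob("*.nc_cmip7repack")):
        if orphan.is_file():
            orphan.unlink()
            removed.append(orphan)
    return removed


# Demo on a throwaway directory: one finished file, one killed intermediate.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "cl_day.nc").write_bytes(b"")
    (Path(tmp) / "cl_day.nc_cmip7repack").write_bytes(b"")
    removed_names = [p.name for p in clean_repack_orphans(tmp)]
    survivors = sorted(p.name for p in Path(tmp).iterdir())

print(removed_names)  # ['cl_day.nc_cmip7repack']
print(survivors)      # ['cl_day.nc']
```

Calling the sweep both before and after a repack pass, as the message describes, keeps the operation idempotent: a second call simply finds nothing to remove.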
Investigation of the 1hr / 3hr / 6hr atm files in cli72 found three
distinct issues, all addressed here.
1. timeaverage._get_time_method only recognised CMIP6's "Pt" suffix on
the frequency string. CMIP7 dropped the suffix and moved the
tavg-vs-tpt signal into cell_methods ("time: point" vs "time:
mean"). Every CMIP7 tpt sub-daily / monthly rule was treated as
MEAN: timeavg ran .mean() + the midpoint shift on already-instant
data, so stamps moved +3h (6hr) / +1.5h (3hr) / +0.5h (1hr) and the
on-disk cell_methods still said "time: point". The psl_6hr output
in cli72 had stamps at 15:00, 21:00, 03:00, 09:00 instead of 06,
12, 18, 00 — exactly the +3h shift, plus boundary trim.
Add an inline cell_methods check that overrides MEAN -> INSTANTANEOUS
when the data request says "time: point". The CMIP6 Pt path stays
intact; this only kicks in for CMIP7 rules.
2. ps_1hr (cap7_atm) was branded ``atmos.ps.tpt-u-hxy-u.1hr.glb`` but
used source ``atmos_1h_sfc_sp`` which the OIFS XIOS file_def
declares as operation="average" (time_centered stamps at HH:30).
CMIP7 has both ``atmos.ps.tavg-u-hxy-u.1hr.glb`` and the tpt form;
producing tpt from averaged data is wrong. Switch to the tavg
compound. The 3hr / 6hr ps rules already use the matching ``_pt_``
instant sources (atmos_3h_pt_sp, atmos_6h_pt_sp), so no change there.
3. ps_1hr_south30 (extra_atm) and ts_3hr (extra_atm) hit the same
source mismatch but with no tavg compound available in the CMIP7
registry for that region/frequency. Disabled with TODO comments
pointing at the OIFS XIOS config that would need to add the
``atmos_1h_pt_sp`` / ``atmos_3h_pt_ts`` instant fields. Re-enable
once the source data exists.
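The cell_methods override from item 1 can be sketched as follows. This is a simplified stand-in for ``timeaverage._get_time_method``, assuming only two outcomes (MEAN and INSTANTANEOUS) and a hypothetical signature that takes the frequency and the data-request cell_methods as plain strings.

```python
import re

# Matches the CMIP7 "time: point" signal anywhere in a cell_methods string.
_TIME_POINT = re.compile(r"\btime:\s*point\b")


def get_time_method(frequency, cell_methods=""):
    """Classify a rule as instantaneous or time-mean (hypothetical helper)."""
    # CMIP6 path: the "Pt" suffix on the frequency, e.g. "6hrPt".
    if frequency.endswith("Pt"):
        return "INSTANTANEOUS"
    # CMIP7 path: the suffix is gone, so consult cell_methods instead.
    if _TIME_POINT.search(cell_methods or ""):
        return "INSTANTANEOUS"
    return "MEAN"
```

For an instantaneous rule the averaging and the midpoint shift are then skipped, so 6hr stamps stay at 00, 06, 12 and 18 instead of moving by +3h.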
Both are CMIP7 HIGH-priority variables but the OIFS XIOS file_def in this production run only emits averaged sources where CMIP7 asks for ``time: point``:
- atmos.ts.tpt-u-hxy-u.3hr.glb wants instants to capture diurnal-cycle peaks; we only have ``atmos_1h_ts_ts`` (averaged at HH:30).
- atmos.ps.tpt-u-hxy-u.1hr.30S-90S wants instants to capture sub-hourly Southern-Ocean pressure-wave variability; we only have ``atmos_1h_sfc_sp`` (averaged at HH:30).

Sampling the averages and labeling them as instants would ship the wrong data type: the variability the variables are meant to capture isn't in the source. Production is already underway, so adding the XIOS fields would require a model restart from an earlier checkpoint. Skip them honestly.

variable_coverage.md is updated with the gap explanation; YAML rules stay commented out with an inline note.
Two MEDIUM cleanups for cli73:
- variable_attributes ``is_temperature_sn`` now accepts compound CF standard names that don't strictly end with ``temperature`` (e.g. ``sea_water_potential_temperature_at_sea_floor`` on tob, ``sea_water_conservative_temperature``, ``sea_surface_temperature``) via an ``_temperature`` substring test. Also accepts ``degC`` / ``Celsius`` as absolute-temperature units. Clears CF §3.1.2 ``tob`` units_metadata MEDIUM.
- When the data-request ``cell_measures`` is the CMIP7 ``--MODEL`` placeholder, substitute the realm default rather than dropping it: ``area: areacello`` for ocean / seaIce / landIce / ocnbgchem, ``area: areacella`` for atmos / aerosol / land / atmosChem. wcrp ATTR001 requires the attribute to be PRESENT, and the existing placeholder-drop left it absent on siu / siv / sidmasstranx / sistrxdtop and similar variables whose data-request entry says ``--MODEL``. Clears 8x ATTR001 cell_measures MEDIUM.

§2.4 dim-order MEDIUMs on unstructured-grid files (mrsll, zg, hur, ua_6hr, tsl, mrsol, ...) NOT touched. The CF check classifies single-horizontal-dim variables as T,Z,(A) and reports they're not in T,Z,Y,X order. Same shape affects ROMS curvilinear in compliance-checker's own test corpus; upstream limitation, not real non-conformance. Leave as advisory MEDIUM.
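Both cleanups are small lookups, sketched below under assumptions. The function names and the single-token ``realm`` argument are hypothetical, and the real ``is_temperature_sn`` also checks units, which this sketch leaves out.

```python
# Realm defaults for the CMIP7 "--MODEL" cell_measures placeholder,
# as listed in the commit message above.
OCEAN_REALMS = {"ocean", "seaIce", "landIce", "ocnbgchem"}
ATMOS_REALMS = {"atmos", "aerosol", "land", "atmosChem"}


def resolve_cell_measures(cell_measures, realm):
    """Replace the '--MODEL' placeholder with the realm default.

    Any other value is passed through unchanged. Returns None when the
    realm is not in either set (an assumption of this sketch).
    """
    if cell_measures != "--MODEL":
        return cell_measures
    if realm in OCEAN_REALMS:
        return "area: areacello"
    if realm in ATMOS_REALMS:
        return "area: areacella"
    return None


def is_temperature_standard_name(standard_name):
    """Accept names ending in 'temperature' or containing '_temperature'."""
    return standard_name.endswith("temperature") or "_temperature" in standard_name
```

The substring test is deliberately loose, so it should stay paired with a units check; a name such as a temperature tendency would also match on the name alone.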
FESOM unstructured triangles crossing the dateline ship vertices on both branches (e.g. (179°, -179°, -180°)) while FESOM's element centroid stores ``lon`` in one branch only. CF §7.1 then sees the centroid outside the [min, max] vertex bbox, reporting "N points lie outside the bounding box". cli73 cleared most of the FESOM yearlies but still tripped on 1814 elements per file for difmxylo / tauuo / tauvo / sistressave (~0.03 % of cells, all on the dateline).

This isn't a precision issue: both ``lon`` and ``lon_bnds`` are already double on disk (cli67 fdf444f promotion was symmetric across lat and lon). lat passes because it doesn't wrap.

The real fix is geometric: shift each vertex into the 360° window centred on the centroid, identity on the sphere, no change to the represented geometry. After the shift the bbox contains the centroid by construction.

Implemented at the tail of ``_ensure_lat_lon_bounds_impl`` so it applies to every path that lands lon_bnds in the dataset (XIOS rename, _recover_bounds_from_inputs, _attach_bounds_from_mesh, add_bounds_from_coords). Vectorised, so the 6.2M-element 3D files don't churn Python.

Verified two ways:
- Direct math on cli73 difmxylo on-disk file: 1814 → 0 outliers.
- Isolated test through _ensure_lat_lon_bounds_impl on a 5-cell synthetic dataset with two dateline crossings: 2 → 0 outliers, non-dateline cells (and all of lat_bnds) untouched, log line emits the expected "shifted N rows" count.
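The geometric shift can be sketched with NumPy as below. This is one possible vectorised formulation, not the code inside ``_ensure_lat_lon_bounds_impl``: the function name and the choice to subtract whole multiples of 360° (so that cells away from the dateline stay bit-identical) are assumptions of this sketch.

```python
import numpy as np


def shift_lon_bounds_to_centroid(lon, lon_bnds):
    """Move each vertex into the 360-degree window centred on its centroid.

    lon      : (ncells,) centroid longitudes in degrees
    lon_bnds : (ncells, nvertices) vertex longitudes in degrees
    Returns the shifted bounds and the number of rows that changed.
    """
    lon = np.asarray(lon, dtype=np.float64)
    lon_bnds = np.asarray(lon_bnds, dtype=np.float64)
    # Number of whole 360-degree wraps separating each vertex from its centroid.
    wraps = np.round((lon_bnds - lon[:, None]) / 360.0)
    shifted = lon_bnds - 360.0 * wraps
    n_rows = int(np.any(wraps != 0, axis=1).sum())
    return shifted, n_rows


# Synthetic example: one ordinary cell and two dateline-crossing cells.
lon = np.array([10.0, 179.5, -179.5])
lon_bnds = np.array([
    [9.0, 11.0, 10.0],
    [179.0, -179.0, -180.0],
    [-179.0, 179.0, 180.0],
])
shifted, n_rows = shift_lon_bounds_to_centroid(lon, lon_bnds)
```

For the second cell the vertices become (179, 181, 180), so the centroid at 179.5 lies inside the vertex range again while the geometry on the sphere is unchanged.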
CMIP7 cmorization for AWI-ESM3-VEG-HR
Adds full CMIP7 support targeting AWI-ESM3-VEG-HR, including a native compound-name
architecture that replaces the legacy cmip6-table-based data request lookup.
Key changes
CMIP7 data request
- DataRequest from CMIP7_DReq_metadata JSON instead of cmip6 tables
- Compound names (ocean.tos.tavg-u-hxy-sea.mon.GLB)
- cmip6_table → cmip6_cmor_table in vendored metadata
- compound_name matching against cmip6_compound_name and cmip7_compound_name attributes
- table_id from compound name when not set explicitly
- ValueError on zero DRV matches (instead of silent skip)

Pipeline
- vertical_integrate custom pipeline step
- convert() step from DefaultPipeline
- State objects not being unwrapped to actual results in parallel runs

Standard library
- (src/pycmor/std_lib/time_bounds.py)
- getattr + _pycmor_cfg fallback
- global_attributes to derive table_id from CMIP6/CMIP7 compound names

Xarray accessor API
- StdLibAccessor with .process()

Test infrastructure
- (pycmor.fixtures.model_runs)
- pycmor.tutorial dataset system (xarray.tutorial-style API)

Misc fixes
- entry_points() compatibility
- pyfesom2 imports for environments without it

Test plan
- pytest tests/unit/
- pycmor process examples/awiesm3-cmip7-minimal.yaml runs successfully on Levante