Shadow performance audit

As investigated by Claude.

# Shadow Performance Investigation — Report

Five parallel audits covered the CPU-side scene assembly, the draw pass, the OpenGL backend, the shader code, and the redundancy/frequency of shadow builds. Below is a consolidated picture of what's actually expensive, then prioritized recommendations.

## The situation in one paragraph

The shadow system uses four-cascade VSM (variance shadow mapping) rendered into a layered `RGBA32F + D32` texture array, fanned to all four cascade layers via a geometry shader with `layout(invocations = 4)`. The pass is built fresh **every rendered frame, even when paused**, runs **twice per frame** (world + cockpit) and **four times** in VR, and is invoked again per UI frame in mission select, tech room, briefing, and lab. Caster-side culling walks every object up to `Highest_object_index` with no `OBJ_NONE` / `OF_RENDERS` / `Should_be_dead` gate, applies up to four full vector rotates per object, and then feeds the survivors through the *full* `ship_render` path (dock walks, warp checks, external weapon model traversal, alpha-cutout texture lookups) before stripping back to a depth-only material. Receiver-side, every shaded pixel of the directional light pays a 16-tap manual Poisson filter over the VSM moments (32 taps at cascade boundaries) against a texture that stores two channels of data but is allocated with four. Every individual issue is multiplied by the GS 4× amplification.

## Dominant cost drivers, in priority order

### Tier 1 — biggest wins

1. **Geometry-shader 4× cascade fan-out** (`code/def_files/data/effects/main-g.sdr:17`). Every shadow-cast triangle is amplified 4× by GS invocations. Geometry shaders are notoriously slow on AMD/Intel/Apple drivers, and they are not reliably a fast path on recent NVIDIA hardware either. The `gl_ClipDistance` clamp at `main-g.sdr:167` forces every out-of-frustum primitive to rasterize as a degenerate primitive at the near plane rather than being clipped. This single technique probably costs more than everything else combined on AMD/Intel hardware.

2. **No object pre-filter in `shadows_render_all`** (`code/graphics/shadows.cpp:496-553`). The loop runs every slot up to `Highest_object_index` (including `OBJ_NONE` holes, weapons, fireballs, shockwaves, jump nodes, beams, and debris with `!Used`) through up to 4 cascade rotates *before* the switch silently drops them. In a firefight with hundreds of weapon objects this is ~1,000 wasted vector rotates/frame just for shadows. The loop also lacks the `OF_RENDERS` / `Should_be_dead` gate that the main render path has.

3. **Far cascade engulfs the whole map** (`code/mod_table/mod_table.cpp:1834`). Defaults are `(200, 600, 2500, 8000)`. At 8000 m the far cascade typically contains every fighter, capship, asteroid, and debris piece in the mission, each contributing 1-4 shadow texels at Low quality — full draw-call cost for invisible contribution. A 400-asteroid field plus a 50-ship engagement enters the queue almost in full.

4. **Receiver: 16-tap manual VSM, doubled at cascade boundaries** (`code/def_files/data/effects/shadows.sdr:33-67`, `shadows.sdr:96-100`). The shadow texture is sampled as `sampler2DArray` (not `sampler2DArrayShadow`), so there is no hardware PCF; every receiving pixel does 16 fetches from the `RGBA32F` array (using only `.xy`) plus a Chebyshev evaluation, which puts heavy pressure on the texture cache. In the 20 % cascade-blend band, it's 32 taps. The same shadow texture is bound on **two** sampler units (deferred unit 4 and forward unit 8, `code/graphics/opengl/gropengltnl.cpp:944-955`) so shadow-receiving forward materials pay the kernel a second time.
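
The pre-filter described in item 2 could look roughly like the sketch below. The `Obj` struct, the `ObjType` enum, and both function names are hypothetical stand-ins, not the engine's real types, and the set of types treated as shadow casters (ships, debris, asteroids) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the engine's object type and flags.
enum class ObjType { None, Ship, Weapon, Fireball, Debris, Asteroid };

struct Obj {
    ObjType type;
    bool renders;         // stands in for OF_RENDERS
    bool should_be_dead;  // stands in for Should_be_dead
};

// Cheap rejection that runs before any frustum math.
bool is_shadow_candidate(const Obj& o) {
    if (o.type == ObjType::None) return false;
    if (!o.renders || o.should_be_dead) return false;
    switch (o.type) {
    case ObjType::Ship:
    case ObjType::Debris:
    case ObjType::Asteroid:
        return true;   // assumed caster types
    default:
        return false;  // weapons, fireballs, etc. are dropped here
    }
}

// Counts the objects that would go on to the per-cascade frustum tests.
std::size_t count_shadow_candidates(const std::vector<Obj>& objs) {
    std::size_t n = 0;
    for (const Obj& o : objs) {
        if (is_shadow_candidate(o)) ++n;
    }
    return n;
}
```

With the type switch ahead of the cull, a rejected slot costs a few integer compares and no rotates.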

### Tier 2 — substantial wins

5. **Cockpit shadow pass is a second full layered pass** (`code/ship/ship.cpp:8576-8607`). It clears the full 4-layer FBO and re-renders the player ship + cockpit at locked LOD0 — the densest single mesh in most missions — every frame. In VR this becomes 4 builds per frame (2 eyes × world+cockpit at `freespace2/freespace.cpp:4319, 4326`).

6. **Full layered clear of an oversized FBO every pass**. At Ultra (4096), the array texture is **1.25 GiB** (1280 MiB: D32 array + RGBA32F array × 4 layers), and the entire thing is cleared on every shadow start (`code/graphics/opengl/gropengltnl.cpp:697-698`). Even at Medium it's 80 MiB cleared 2-4× per frame.

7. **Color format wastes half its bandwidth**. VSM only needs `RG` (depth, depth²) — `code/def_files/data/effects/main-f.sdr:208` writes `0, 1` to `.ba` and the receiver reads `.xy`. The FBO is allocated `GL_RGBA32F` (fallback `RGBA16F`) at `code/graphics/opengl/gropengltnl.cpp:554-555`.

8. **No shadow-pass LOD**. `model_render_determine_detail` uses *eye* distance (`code/model/modelrender.cpp:1037`), so a ship 5 km away is drawn at the same LOD into the shadow map as into the color pass. Cockpit pass forces LOD0 explicitly. Per-submodel `render_box`/`render_sphere` culling also uses eye position, so subobjects the light sees but the eye doesn't are skipped while subobjects only the eye sees are still drawn into the shadow map.

9. **Built every frame regardless of state**. No camera-stationary / paused / unchanged-scene caching. `game_render_frame` runs even when paused (`freespace2/freespace.cpp:6477`) and the shadow map is rebuilt unconditionally.
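
The memory figures in items 6 and 7 can be checked with a few lines of arithmetic. This is a sketch: the helper names are made up, and the Medium resolution of 1024 is inferred from the 80 MiB figure rather than read from the code.

```cpp
#include <cassert>
#include <cstdint>

// Total bytes for a layered color + depth array of size x size texels.
constexpr std::uint64_t shadow_array_bytes(std::uint64_t size, std::uint64_t layers,
                                           std::uint64_t color_bytes_per_texel,
                                           std::uint64_t depth_bytes_per_texel) {
    return size * size * layers * (color_bytes_per_texel + depth_bytes_per_texel);
}

constexpr std::uint64_t to_mib(std::uint64_t bytes) { return bytes / (1024 * 1024); }

// RGBA32F = 16 bytes/texel, D32 = 4 bytes/texel, 4 cascade layers.
static_assert(to_mib(shadow_array_bytes(4096, 4, 16, 4)) == 1280, "Ultra, current format");
static_assert(to_mib(shadow_array_bytes(1024, 4, 16, 4)) == 80, "Medium (assumed 1024)");

// RG32F = 8 bytes/texel, D16 = 2 bytes/texel.
static_assert(to_mib(shadow_array_bytes(4096, 4, 8, 2)) == 640, "Ultra, RG32F + D16");
```

At 4096 the current format comes to 1280 MiB, and an `RG32F` + 16-bit depth layout halves that to 640 MiB.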

### Tier 3 — meaningful but smaller

10. **`shadows_construct_light_frustum` does ~50 % redundant work across cascades** (`code/graphics/shadows.cpp:248-425`). `tanf(fov/2)` 8× per frame, unit `uvec`/`rvec` re-copied 64× per frame, near-plane corners of cascade N+1 ≈ far-plane corners of cascade N (modulo the 20 % overlap), all recomputed from scratch.

11. **Light matrix rebuilt every frame** even though `Static_light.front()` never moves in normal play (`code/graphics/shadows.cpp:437-438`).

12. **`obj_queue_render` runs the entire `ship_render` body** during the shadow pass and then overwrites `render_flags = MR_NO_TEXTURING | MR_NO_LIGHTING` at `code/ship/ship.cpp:21971`. All the setup before that — dock evaluation, warp clip-plane, team color, replacement-texture array dereference, external weapon model traversal at `code/ship/ship.cpp:22026` — runs and is then thrown away.

13. **`model_clear_instance` called twice per frame per object** (once for shadow, once for the lit pass). For big capships with 30+ textures × 6 texture types this is ~180 redundant writes per ship.

14. **Receiver pays unconditional matrix work in deferred** (`code/def_files/data/effects/deferred-f.sdr:260-270`). Every shaded pixel of the directional light's full-screen pass does `mat4*mat4*vec4` + four `transformToShadowMap` calls *per pixel*, regardless of whether the pixel ends up shadowed, because there's no per-vertex precompute for the fullscreen quad.

15. **Vertex shader during shadow pass is not stripped** (`code/def_files/data/effects/main-v.sdr:155-209`). Tangent matrix, texcoord transform, normal pipeline all computed even though the shadow fragment shader only reads `vertIn.position.z`. Cost multiplied by GS fan-out / instancing fallback.

16. **Re-entrancy bug in `shadow_maybe_start_frame`** (`code/graphics/shadows.cpp:570-588`) — uses static `last_override` / `shadow_override_backup`. Not strictly a perf issue, but it means nested UI/render scenarios will leak `Shadow_override` state. Worth fixing alongside any restructuring.

17. **Quality changes don't take effect without a restart** (`code/graphics/shadows.cpp:71` gates the listener on `if (initial)`, FBO is only built in `opengl_tnl_init`). Not a perf bug but a UX bug worth fixing once we touch the FBO sizing.
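
For item 10, one way to remove the repeated `tanf(fov/2)` work is to evaluate it once and derive every split plane's half-extents from it. The sketch below assumes a vertical field of view and an aspect ratio as inputs; `PlaneExtent` and `frustum_extents` are hypothetical names, and the real function also builds corner points from `uvec`/`rvec`, which is omitted here.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

struct PlaneExtent {
    float dist;    // distance of the plane from the eye
    float half_w;  // half width of the frustum cross-section
    float half_h;  // half height of the frustum cross-section
};

// Half-extents of the view frustum at each plane distance.
// tan(fov/2) is evaluated once and reused for every plane.
template <std::size_t N>
std::array<PlaneExtent, N> frustum_extents(const std::array<float, N>& dists,
                                           float fov_y_radians, float aspect) {
    const float t = std::tan(fov_y_radians * 0.5f);
    std::array<PlaneExtent, N> out{};
    for (std::size_t i = 0; i < N; ++i) {
        const float half_h = dists[i] * t;
        out[i] = {dists[i], half_h * aspect, half_h};
    }
    return out;
}
```

Each cascade can then take its near and far extents from this table instead of recomputing them.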

## Recommendations, prioritized

### Quick wins (small diffs, big returns)

- **Gate the object loop.** Add `if (objp->type == OBJ_NONE) continue;` plus an `OF_RENDERS` / `Should_be_dead` check before any frustum math. Switch the body of the loop to a `switch` on `objp->type` *first*, then per-type cull. Eliminates the vast majority of wasted rotates in busy missions.
- **Precompute `pos_rot` once per object** for `shadows_obj_in_frustum` and pass it to all 4 cascade tests. Removes up to three of the four rotates per object (~75 % of the rotate cost in the worst case).
- **Cache the light matrix and cascade distances**; only recompute when `Static_light.front()`, the eye, or `Shadow_distances` actually changes.
- **Skip the shadow pass when paused** (and when no camera/object has moved since last frame, if you want to go further) — `Shadow_map_texture` persists across frames, so the existing texture remains valid for stationary scenes.
- **Drop the unused `.ba` channels.** Switch the color attachment to `GL_RG32F` (or `GL_RG16F` for Low/Medium). Cuts color VRAM and bandwidth in half with no shader change beyond the format.
- **Drop the depth attachment to `GL_DEPTH_COMPONENT16`**. Nothing samples it; it only exists for primitive ordering during the cast.
- **Bind the shadow texture once** for both deferred and forward paths instead of duplicating to unit 8.
- **Default `Shadow_distances` more aggressively** — `(100, 400, 1200, 3000)` is closer to what's actually visually meaningful given the cascade resolutions, and dramatically shrinks the far-cascade object set. Mods can override.
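
The `pos_rot` suggestion above can be sketched as follows. All types here (`Vec3`, `Mat3`, `CascadeBox`) and the conservative sphere-versus-box test are simplified assumptions, not the engine's actual math library or the real body of `shadows_obj_in_frustum`. The point is that every cascade of a directional light shares one orientation, so one rotate serves all four tests.

```cpp
#include <array>
#include <cassert>

struct Vec3 { float x, y, z; };
struct Mat3 { Vec3 r, u, f; };  // rows: right, up, forward (hypothetical layout)

static float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Rotate a world-space position into light space.
Vec3 rotate(const Mat3& m, const Vec3& v) {
    return {dot(m.r, v), dot(m.u, v), dot(m.f, v)};
}

struct CascadeBox { Vec3 min, max; };  // light-space bounds of one cascade

// Conservative overlap test: the box is grown by the sphere radius.
bool sphere_in_box(const Vec3& p, float radius, const CascadeBox& b) {
    return p.x + radius >= b.min.x && p.x - radius <= b.max.x &&
           p.y + radius >= b.min.y && p.y - radius <= b.max.y &&
           p.z + radius >= b.min.z && p.z - radius <= b.max.z;
}

// Returns a 4-bit mask: bit i is set when the object touches cascade i.
unsigned cascades_hit(const Mat3& light, const Vec3& pos, float radius,
                      const std::array<CascadeBox, 4>& cascades) {
    const Vec3 pos_rot = rotate(light, pos);  // single rotate, reused four times
    unsigned mask = 0;
    for (unsigned i = 0; i < 4; ++i) {
        if (sphere_in_box(pos_rot, radius, cascades[i])) mask |= 1u << i;
    }
    return mask;
}
```

Returning a mask rather than a single boolean also gives per-cascade draw lists something to consume later.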

### Medium-effort wins

- **Bypass `obj_queue_render` / `ship_render` for the shadow pass.** Write a slim shadow-cast path that takes the ship hull + visible external weapon models, sets `render_flags = MR_NO_TEXTURING | MR_NO_LIGHTING` up front, and never touches dock evaluation, warp clip planes, team color, decals, etc.
- **Shadow-pass LOD bias.** Compute LOD against the light-distance / cascade index instead of eye distance for the shadow pass. Far cascades should aggressively prefer the lowest LOD.
- **Per-cascade culling between GS invocations**, or move to per-cascade culled draw lists (the four cascades have very different object sets — cascade 0 has the cockpit; cascade 3 has capships and asteroids). This both reduces GS amplification waste and unlocks per-cascade LOD.
- **Hardware PCF.** Switch to `sampler2DArrayShadow` with `GL_TEXTURE_COMPARE_MODE` set to `GL_COMPARE_REF_TO_TEXTURE`, render depth-only into a `GL_DEPTH_COMPONENT32F` array, and use a 4-tap `textureGather`-based PCF. This removes VSM moments entirely (no need for the color attachment, no need for the scale hack, no light-bleeding) and lets the texture units do the compare-and-filter in hardware. This is the single biggest receiver-side win.
- **Reorganize cockpit pass to share state.** At minimum, only clear the cockpit's cascade layers (not all 4) — or, if you keep the layered FBO, scissor the clear. Better: render cockpit as cascade 0 of the same pass.
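
A minimal sketch of the shadow-pass LOD bias, assuming the simplest possible policy of one extra detail level per cascade. The function name `shadow_lod` and that policy are illustrative assumptions; a real implementation might derive the level from light-space distance instead.

```cpp
#include <algorithm>
#include <cassert>

// Picks a detail level for the shadow pass. Level 0 is the most detailed.
// The eye-distance level is pushed coarser by the cascade index, then
// clamped to the levels the model actually has.
int shadow_lod(int eye_lod, int cascade, int num_lods) {
    assert(num_lods > 0 && cascade >= 0 && eye_lod >= 0);
    const int biased = eye_lod + cascade;  // assumed: one level per cascade
    return std::min(biased, num_lods - 1);
}
```

Under this policy cascade 0 keeps the eye-distance level, and cascade 3 always lands on the coarsest level for models with four or fewer LODs.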

### Bigger architectural changes

- **Replace the GS layering with multi-view rendering** (`GL_OVR_multiview2`, or writing `gl_Layer` from the vertex shader via `GL_ARB_shader_viewport_layer_array` / `GL_NV_viewport_array2`) where supported. Eliminates GS amplification entirely. On drivers without those extensions, fall back to one draw per cascade with culled draw lists, which is almost always faster than a GS on AMD/Intel.
- **Or drop CSM entirely on Low quality** and use a single tight shadow frustum around the player. Most FS2 gameplay is close-quarters; the far cascade rarely contributes visible detail at 1024² shadow resolution.
- **Reactive shadow rebuild.** Detect camera/light/scene movement and rebuild only when needed. Even partial updates (e.g. cascade 0 every frame, cascade 3 every 4th) are valid for a directional sun.
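
The partial-update idea can be sketched as a per-cascade interval table. The report only fixes the endpoints (cascade 0 every frame, cascade 3 every fourth); the intervals for cascades 1 and 2, the `force_all` flag, and the function name are assumptions for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Rebuild interval in frames for cascades 0..3.
// The middle two values are assumed.
constexpr std::array<std::uint32_t, 4> kRebuildInterval = {{1, 2, 2, 4}};

// Returns a 4-bit mask of cascades to rebuild on this frame.
// force_all covers events such as a light or quality change.
unsigned cascades_to_rebuild(std::uint32_t frame, bool force_all) {
    if (force_all) return 0xFu;
    unsigned mask = 0;
    for (unsigned i = 0; i < 4; ++i) {
        if (frame % kRebuildInterval[i] == 0) mask |= 1u << i;
    }
    return mask;
}
```

A layer that is skipped keeps its previous contents, so the clear would also have to become per-layer for this to work.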

### Cleanups worth bundling

- Make `Shadow_quality` changes apply at runtime by rebuilding `shadow_fbo` in the option listener instead of gating on `if (initial)`.
- Replace the static-`last_override` mechanism in `shadow_maybe_start_frame` with a stack or an RAII guard to make the override re-entrancy-safe.
- Remove the redundant `model_clear_instance` calls between the shadow pass and the subsequent main pass (one reset per frame per object is enough).
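
An RAII guard for the override could look like the sketch below. `Shadow_override` is modeled here as a plain global `bool`, which is an assumption about its type, and `ShadowOverrideGuard` is a hypothetical class name. Each guard saves the value it found and restores it on scope exit, so nested scopes unwind in order.

```cpp
#include <cassert>

// Stand-in for the engine's global override flag (type assumed).
static bool Shadow_override = false;

class ShadowOverrideGuard {
public:
    explicit ShadowOverrideGuard(bool value) : saved_(Shadow_override) {
        Shadow_override = value;
    }
    ~ShadowOverrideGuard() { Shadow_override = saved_; }

    // Copying a guard would restore the saved value twice.
    ShadowOverrideGuard(const ShadowOverrideGuard&) = delete;
    ShadowOverrideGuard& operator=(const ShadowOverrideGuard&) = delete;

private:
    bool saved_;
};

// Exercises a nested override and reports whether each level was restored.
bool nested_override_restores() {
    Shadow_override = false;
    {
        ShadowOverrideGuard outer(true);
        {
            ShadowOverrideGuard inner(false);
            if (Shadow_override) return false;  // inner value should be active
        }
        if (!Shadow_override) return false;     // outer value should be back
    }
    return !Shadow_override;                    // original value should be back
}
```

Unlike a pair of function-local statics, the saved value lives on the caller's stack, so a nested UI render cannot overwrite the outer backup.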

## Biggest-bang-for-buck order

If you want to ship visible frame-rate gains fastest:

1. Object-loop pre-filter + cached `pos_rot` (1 day, kills CPU stalls in busy missions)
2. Tighter default `Shadow_distances` (15 minutes, kills far-cascade draw count)
3. Switch FBO to `RG` + `D16`, drop duplicate sampler binding (half a day, halves VRAM/bandwidth)
4. Hardware PCF via `sampler2DArrayShadow` (1-2 days, cuts receiver-side cost to a half or a quarter)
5. Slim shadow-cast path bypassing `ship_render` (2-3 days, halves caster CPU + GS load)
6. Eventually: replace GS layering + per-cascade culled draw lists (1-2 weeks, but unlocks AMD/Intel/Mac perf parity)


Shadow performance audit #7499
