Shadow performance audit #7499

@Goober5000

As investigated by Claude.

Shadow Performance Investigation — Report

Five parallel audits covered the CPU-side scene assembly, the draw pass, the OpenGL backend, the shader code, and the redundancy/frequency of shadow builds. Below is a consolidated picture of what's actually expensive, then prioritized recommendations.

The situation in one paragraph

The shadow system uses four-cascade VSM (variance shadow mapping) rendered into a layered RGBA32F + D32 texture array, fanned to all four cascade layers via a geometry shader with layout(invocations = 4). The pass is built fresh every rendered frame, even when paused, runs twice per frame (world + cockpit) and four times in VR, and is invoked again per UI frame in mission select, tech room, briefing, and lab. Caster-side culling walks every object up to Highest_object_index with no OBJ_NONE / OF_RENDERS / Should_be_dead gate, applies up to four full vector rotates per object, and then feeds the survivors through the full ship_render path (dock walks, warp checks, external weapon model traversal, alpha-cutout texture lookups) before stripping back to a depth-only material. Receiver-side, every shaded pixel of the directional light pays a 16-tap manual Poisson PCF (32 taps at cascade boundaries) against a texture that needs only two channels but is allocated with four. Every caster-side issue is multiplied by the GS 4× amplification.

Dominant cost drivers, in priority order

Tier 1 — biggest wins

  1. Geometry-shader 4× cascade fan-out (code/def_files/data/effects/main-g.sdr:17). Every shadow-cast triangle is amplified 4× by GS invocations. Geometry shaders are widely reported to be slow on AMD, Intel, and Apple drivers, and a fast path is not guaranteed on recent NVIDIA hardware either. The gl_ClipDistance clamp at main-g.sdr:167 forces every out-of-frustum primitive to rasterize as a degenerate at the near plane rather than being clipped. This single technique probably costs more than everything else combined on AMD/Intel hardware.

  2. No object pre-filter in shadows_render_all (code/graphics/shadows.cpp:496-553). The loop tests every slot up to Highest_object_index — including OBJ_NONE holes, weapons, fireballs, shockwaves, jump nodes, beams, debris with !Used — through up to 4 cascade rotates before the switch silently drops them. In a firefight with hundreds of weapon objects this is ~1,000 wasted vector rotates/frame just for shadows. There is also no OF_RENDERS / Should_be_dead gate that the main render path has.

  3. Far cascade engulfs the whole map (code/mod_table/mod_table.cpp:1834). Defaults are (200, 600, 2500, 8000). At 8000 m the far cascade typically contains every fighter, capship, asteroid, and debris piece in the mission, each contributing 1-4 shadow texels at Low quality — full draw-call cost for invisible contribution. A 400-asteroid field plus a 50-ship engagement enters the queue almost in full.

  4. Receiver: 16-tap manual VSM, doubled at cascade boundaries (code/def_files/data/effects/shadows.sdr:33-67, shadows.sdr:96-100). The shadow texture is sampled as sampler2DArray (not sampler2DArrayShadow), so there is no hardware PCF — every receiving pixel does 16 dependent RG32F fetches plus a Chebyshev evaluation, blowing through L2. In the 20 % cascade-blend band, it's 32 taps. The same shadow texture is bound on two sampler units (deferred unit 4 and forward unit 8, code/graphics/opengl/gropengltnl.cpp:944-955) so shadow-receiving forward materials pay the kernel a second time.
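
For reference, the per-tap arithmetic in a VSM receiver is the one-sided Chebyshev bound on the two stored moments. The following is a minimal C++ sketch of that math only, not a transcription of shadows.sdr; the `Moments` struct, the function name, and the `min_variance` clamp are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Two VSM moments as a receiver would read them from the .xy channels:
// m1 = E[depth], m2 = E[depth^2]. (Hypothetical struct, not an engine type.)
struct Moments {
    float m1;
    float m2;
};

// One-sided Chebyshev upper bound on the lit fraction of the filtered
// region for a receiver at depth t. min_variance is an illustrative clamp
// against numerical noise, not a value taken from the engine.
inline float chebyshev_upper_bound(Moments m, float t, float min_variance = 1e-6f)
{
    assert(min_variance > 0.0f);
    if (t <= m.m1) {
        return 1.0f; // receiver is in front of the mean occluder depth: lit
    }
    const float variance = std::max(m.m2 - m.m1 * m.m1, min_variance);
    const float d = t - m.m1;
    return variance / (variance + d * d);
}
```

With the current receiver, something equivalent to this runs once per tap, so 16 or 32 times per shaded pixel, on top of the texture fetches themselves.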

Tier 2 — substantial wins

  1. Cockpit shadow pass is a second full layered pass (code/ship/ship.cpp:8576-8607). It clears the full 4-layer FBO and re-renders the player ship + cockpit at locked LOD0 — the densest single mesh in most missions — every frame. In VR this becomes 4 builds per frame (2 eyes × world+cockpit at freespace2/freespace.cpp:4319, 4326).

  2. Full layered clear of an oversized FBO every pass. At Ultra (4096), the array texture is 1.25 GiB (1280 MiB: D32 array + RGBA32F array × 4 layers), and the entire thing is cleared on every shadow start (code/graphics/opengl/gropengltnl.cpp:697-698). Even at Medium it's 80 MiB cleared 2-4× per frame.

  3. Color format wastes half its bandwidth. VSM only needs RG (depth, depth²) — code/def_files/data/effects/main-f.sdr:208 writes 0, 1 to .ba and the receiver reads .xy. The FBO is allocated GL_RGBA32F (fallback RGBA16F) at code/graphics/opengl/gropengltnl.cpp:554-555.

  4. No shadow-pass LOD. model_render_determine_detail uses eye distance (code/model/modelrender.cpp:1037), so a ship 5 km away is drawn at the same LOD into the shadow map as into the color pass. Cockpit pass forces LOD0 explicitly. Per-submodel render_box/render_sphere culling also uses eye position, so subobjects the light sees but the eye doesn't are skipped while subobjects only the eye sees are still drawn into the shadow map.

  5. Built every frame regardless of state. No camera-stationary / paused / unchanged-scene caching. game_render_frame runs even when paused (freespace2/freespace.cpp:6477) and the shadow map is rebuilt unconditionally.
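
The memory figures above can be checked with a few lines of arithmetic. This sketch assumes a square, mip-less array with four layers; the 1024 case is simply the resolution at which the 80 MiB figure works out, so treating it as the Medium setting is an inference rather than something read from the code. The function name is hypothetical.

```cpp
#include <cstdint>

// Bytes needed for one layered shadow target: a square color array plus a
// square depth array, each with `layers` slices and no mipmaps.
constexpr std::uint64_t shadow_target_bytes(std::uint64_t size,
                                            std::uint64_t layers,
                                            std::uint64_t color_bytes_per_texel,
                                            std::uint64_t depth_bytes_per_texel)
{
    return size * size * layers * (color_bytes_per_texel + depth_bytes_per_texel);
}

constexpr std::uint64_t MiB = 1024ull * 1024ull;

// Current layout: RGBA32F (16 B) + D32 (4 B), four cascades.
static_assert(shadow_target_bytes(4096, 4, 16, 4) == 1280 * MiB, "4096: 1280 MiB = 1.25 GiB");
static_assert(shadow_target_bytes(1024, 4, 16, 4) == 80 * MiB, "1024: 80 MiB");

// An RG32F (8 B) + D16 (2 B) layout would halve both figures.
static_assert(shadow_target_bytes(4096, 4, 8, 2) == 640 * MiB, "4096: 640 MiB");
static_assert(shadow_target_bytes(1024, 4, 8, 2) == 40 * MiB, "1024: 40 MiB");
```

Because the whole target is cleared on every shadow start, these sizes are also the per-pass clear cost in bytes, before any geometry is drawn.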

Tier 3 — meaningful but smaller

  1. shadows_construct_light_frustum does ~50 % redundant work across cascades (code/graphics/shadows.cpp:248-425). tanf(fov/2) 8× per frame, unit uvec/rvec re-copied 64× per frame, near-plane corners of cascade N+1 ≈ far-plane corners of cascade N (modulo the 20 % overlap), all recomputed from scratch.

  2. Light matrix rebuilt every frame even though Static_light.front() never moves in normal play (code/graphics/shadows.cpp:437-438).

  3. obj_queue_render runs the entire ship_render body during the shadow pass and then overwrites render_flags = MR_NO_TEXTURING | MR_NO_LIGHTING at code/ship/ship.cpp:21971. All the setup before that — dock evaluation, warp clip-plane, team color, replacement-texture array dereference, external weapon model traversal at code/ship/ship.cpp:22026 — runs and is then thrown away.

  4. model_clear_instance called twice per frame per object (once for shadow, once for the lit pass). For big capships with 30+ textures × 6 texture types this is ~180 redundant writes per ship.

  5. Receiver pays unconditional matrix work in deferred (code/def_files/data/effects/deferred-f.sdr:260-270). Every shaded pixel of the directional light's full-screen pass does mat4*mat4*vec4 + four transformToShadowMap calls per pixel, regardless of whether the pixel ends up shadowed, because there's no per-vertex precompute for the fullscreen quad.

  6. Vertex shader during shadow pass is not stripped (code/def_files/data/effects/main-v.sdr:155-209). Tangent matrix, texcoord transform, normal pipeline all computed even though the shadow fragment shader only reads vertIn.position.z. Cost multiplied by GS fan-out / instancing fallback.

  7. Re-entrancy bug in shadow_maybe_start_frame (code/graphics/shadows.cpp:570-588) — uses static last_override / shadow_override_backup. Not strictly a perf issue, but it means nested UI/render scenarios will leak Shadow_override state. Worth fixing alongside any restructuring.

  8. Quality changes don't take effect without a restart (code/graphics/shadows.cpp:71 gates the listener on if (initial), FBO is only built in opengl_tnl_init). Not a perf bug but a UX bug worth fixing once we touch the FBO sizing.
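
As an illustration of the redundant frustum work in item 1, the tangent can be evaluated once and reused for every split distance. This is a sketch under assumptions: `fov_y` is taken to be a vertical field of view in radians, the names are hypothetical, and it does not model the 20 % cascade overlap or the engine's actual corner layout.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Half-extents of the view frustum cross-section at a given distance.
struct SliceExtent {
    float half_width;
    float half_height;
};

// Evaluates tan(fov/2) once and reuses it for every split distance instead
// of recomputing it per cascade plane. aspect is width / height.
template <std::size_t N>
std::array<SliceExtent, N> frustum_slice_extents(float fov_y, float aspect,
                                                 const std::array<float, N>& distances)
{
    assert(fov_y > 0.0f && aspect > 0.0f);
    const float tan_half = std::tan(fov_y * 0.5f);
    std::array<SliceExtent, N> out{};
    for (std::size_t i = 0; i < N; ++i) {
        const float half_h = distances[i] * tan_half;
        out[i] = SliceExtent{half_h * aspect, half_h};
    }
    return out;
}
```

The corner positions for each cascade then follow from the eye position, the shared rvec/uvec/fvec basis, and these extents, so the basis vectors only need to be read once per frame as well.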

Recommendations, prioritized

Quick wins (small diffs, big returns)

  • Gate the object loop. Add if (objp->type == OBJ_NONE) continue; plus an OF_RENDERS / Should_be_dead check before any frustum math. Switch the body of the loop to a switch on objp->type first, then per-type cull. Eliminates the vast majority of wasted rotates in busy missions.
  • Precompute pos_rot once per object for shadows_obj_in_frustum and pass it to all 4 cascade tests. Saves up to ~75 % of the rotate work (one rotate instead of up to four per object).
  • Cache the light matrix and cascade distances; only recompute when Static_light.front(), the eye, or Shadow_distances actually changes.
  • Skip the shadow pass when paused (and when no camera/object has moved since last frame, if you want to go further) — Shadow_map_texture persists across frames, so the existing texture remains valid for stationary scenes.
  • Drop the unused blue and alpha channels: switch the color attachment to GL_RG32F (or GL_RG16F for Low/Medium). Cuts color VRAM and bandwidth in half with zero shader change beyond the format.
  • Drop the depth attachment to GL_DEPTH_COMPONENT16. Nothing samples it; it only exists for depth testing during the cast.
  • Bind the shadow texture once for both deferred and forward paths instead of duplicating to unit 8.
  • Default Shadow_distances more aggressively. (100, 400, 1200, 3000) is closer to what's actually visually meaningful given the cascade resolutions, and dramatically shrinks the far-cascade object set. Mods can override.
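
The first two quick wins (gating the loop and rotating once) might look roughly like the sketch below. All types here are hypothetical stand-ins, not the engine's object, vec3d, or matrix types; the choice of which object types cast shadows and the use of light-space axis-aligned boxes for the cascade test are assumptions made to keep the example self-contained.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the engine's object and math types.
enum class ObjType { None, Ship, Weapon, Asteroid, Debris };

struct Vec3 { float x, y, z; };

struct Object {
    ObjType type;
    bool renders;         // stand-in for the OF_RENDERS flag
    bool should_be_dead;  // stand-in for the Should_be_dead flag
    Vec3 pos;
    float radius;
};

// Light-space orientation as three row vectors (right, up, forward).
struct Basis { Vec3 r, u, f; };

// Axis-aligned cascade bounds in light space (assumed representation).
struct CascadeBox { Vec3 min, max; };

inline float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Cheap rejects first: no vector math is done for slots that can never cast.
inline bool can_cast_shadow(const Object& o)
{
    if (o.type == ObjType::None || !o.renders || o.should_be_dead) return false;
    return o.type == ObjType::Ship || o.type == ObjType::Asteroid || o.type == ObjType::Debris;
}

inline bool sphere_in_box(const Vec3& p, float r, const CascadeBox& b)
{
    return p.x + r >= b.min.x && p.x - r <= b.max.x &&
           p.y + r >= b.min.y && p.y - r <= b.max.y &&
           p.z + r >= b.min.z && p.z - r <= b.max.z;
}

// Returns indices of objects that touch at least one cascade. The position
// is rotated into light space once and reused for every cascade test.
template <std::size_t N>
std::vector<std::size_t> collect_casters(const std::vector<Object>& objs, const Vec3& eye,
                                         const Basis& light,
                                         const std::array<CascadeBox, N>& cascades)
{
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < objs.size(); ++i) {
        const Object& o = objs[i];
        if (!can_cast_shadow(o)) continue;
        const Vec3 rel{o.pos.x - eye.x, o.pos.y - eye.y, o.pos.z - eye.z};
        const Vec3 pos_rot{dot(rel, light.r), dot(rel, light.u), dot(rel, light.f)};
        for (const CascadeBox& c : cascades) {
            if (sphere_in_box(pos_rot, o.radius, c)) { out.push_back(i); break; }
        }
    }
    return out;
}
```

In this shape, weapons, empty slots, and dying objects cost one branch each, and a surviving object costs one rotate regardless of how many cascades it is tested against.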

Medium-effort wins

  • Bypass obj_queue_render / ship_render for the shadow pass. Write a slim shadow-cast path that takes the ship hull + visible external weapon models, sets render_flags = MR_NO_TEXTURING | MR_NO_LIGHTING up front, and never touches dock evaluation, warp clip planes, team color, decals, etc.
  • Shadow-pass LOD bias. Compute LOD against the light-distance / cascade index instead of eye distance for the shadow pass. Far cascades should aggressively prefer the lowest LOD.
  • Per-cascade culling between GS invocations, or move to per-cascade culled draw lists (the four cascades have very different object sets — cascade 0 has the cockpit; cascade 3 has capships and asteroids). This both reduces GS amplification waste and unlocks per-cascade LOD.
  • Hardware PCF. Switch to sampler2DArrayShadow and GL_COMPARE_REF_TO_TEXTURE, render depth-only into a DEPTH_COMPONENT32F array, and use a 4-tap textureGather-based PCF. This removes VSM moments entirely (no need for the color attachment, no need for the scale hack, no light-bleeding) and lets the GPU do filtering for free in TMU hardware. This is the single biggest receiver-side win.
  • Reorganize cockpit pass to share state. At minimum, only clear the cockpit's cascade layers (not all 4) — or, if you keep the layered FBO, scissor the clear. Better: render cockpit as cascade 0 of the same pass.
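
A rough sketch of per-cascade draw lists with a cascade-driven LOD choice follows. It is deliberately simplified: casters are binned by view-space depth slab only, whereas a real implementation would test light-space bounds so that off-screen casters between the light and the view frustum are kept. The `Caster` and `ShadowDraw` records and the "cascade k skips k detail levels" policy are assumptions for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kCascades = 4;

// Hypothetical caster record: view-space depth of the bounding sphere
// centre, its radius, and the model's LOD count (LOD 0 is most detailed).
struct Caster {
    float depth;
    float radius;
    int num_lods;
};

struct ShadowDraw {
    std::size_t caster_index;
    int lod;
};

// Illustrative policy: cascade k skips k detail levels, clamped to the
// coarsest LOD the model actually has.
inline int shadow_lod_for_cascade(std::size_t cascade, int num_lods)
{
    assert(num_lods > 0);
    return std::min(static_cast<int>(cascade), num_lods - 1);
}

// Builds one draw list per cascade. far_dist[k] is the far distance of
// cascade k; cascade k spans from far_dist[k-1] (0 for k = 0) to far_dist[k].
inline std::array<std::vector<ShadowDraw>, kCascades>
build_cascade_lists(const std::vector<Caster>& casters,
                    const std::array<float, kCascades>& far_dist)
{
    std::array<std::vector<ShadowDraw>, kCascades> lists;
    for (std::size_t i = 0; i < casters.size(); ++i) {
        const Caster& c = casters[i];
        float near_dist = 0.0f;
        for (std::size_t k = 0; k < kCascades; ++k) {
            // A sphere overlaps the slab if its depth interval intersects it.
            if (c.depth + c.radius >= near_dist && c.depth - c.radius <= far_dist[k]) {
                lists[k].push_back({i, shadow_lod_for_cascade(k, c.num_lods)});
            }
            near_dist = far_dist[k];
        }
    }
    return lists;
}
```

Each list can then be submitted as its own draw to a single layer, which avoids sending geometry to cascades it cannot touch and lets the far cascades use the cheapest meshes.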

Bigger architectural changes

  • Replace the GS layering with multi-view or vertex-shader layer selection (GL_OVR_multiview2, GL_NV_viewport_array2, or gl_Layer written from the vertex shader via GL_ARB_shader_viewport_layer_array) where supported. Eliminates GS amplification entirely. On drivers without those extensions, fall back to one draw per cascade with culled draw lists, which is almost always faster than a GS on AMD/Intel.
  • Or drop CSM entirely on Low quality and use a single tight shadow frustum around the player. Most FS2 gameplay is close-quarters; the far cascade rarely contributes visible detail at 1024² shadow resolution.
  • Reactive shadow rebuild. Detect camera/light/scene movement and rebuild only when needed. Even partial updates (e.g. cascade 0 every frame, cascade 3 every 4th) are valid for a directional sun.
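
The staggered-update idea can be expressed as a small scheduling function. This is only a sketch: the rebuild periods, the `ShadowFrameState` fields, and the decision to keep the cached maps while paused are illustrative assumptions, not engine behaviour.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCascades = 4;

// Rebuild period per cascade, in frames. The near cascades refresh every
// frame, the far ones less often. These values are illustrative only.
constexpr std::array<std::uint32_t, kCascades> kRebuildPeriod = {1, 1, 2, 4};

struct ShadowFrameState {
    std::uint32_t frame;   // monotonically increasing rendered-frame counter
    bool paused;           // game simulation is paused
    bool light_changed;    // sun direction or Shadow_distances changed
};

// Returns a bitmask with bit k set when cascade k should be rebuilt.
inline std::uint32_t cascades_to_rebuild(const ShadowFrameState& s)
{
    constexpr std::uint32_t all = (1u << kCascades) - 1u;
    if (s.light_changed) return all;   // cached maps are stale: rebuild everything
    if (s.paused) return 0u;           // nothing moves: keep last frame's maps
    std::uint32_t mask = 0u;
    for (std::size_t k = 0; k < kCascades; ++k) {
        if (s.frame % kRebuildPeriod[k] == 0u) mask |= 1u << k;
    }
    return mask;
}
```

A schedule like this only pays off if clears and draws can target individual layers; with the current full layered clear, skipping a cascade would leave it empty rather than stale.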

Cleanups worth bundling

  • Make Shadow_quality changes apply at runtime by rebuilding shadow_fbo in the option listener instead of gating on if (initial).
  • Replace the static-last_override mechanism in shadow_maybe_start_frame with a stack or an RAII guard to make the override re-entrancy-safe.
  • Remove the redundant model_clear_instance calls between the shadow pass and the subsequent main pass (one reset per frame per object is enough).
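
For the override cleanup, an RAII guard is one possible shape. The sketch below uses a file-local `Shadow_override` as a stand-in for the engine global, and `ShadowOverrideGuard` is a hypothetical name; whether a scoped guard fits depends on whether the override's begin and end live in the same scope in the real call sites.

```cpp
// Hypothetical stand-in for the engine's global Shadow_override flag.
static bool Shadow_override = false;

// Scoped override: saves the current value on construction and restores it
// on destruction, so nested scopes unwind in the right order without any
// function-local static state.
class ShadowOverrideGuard {
public:
    explicit ShadowOverrideGuard(bool value) : saved_(Shadow_override)
    {
        Shadow_override = value;
    }
    ~ShadowOverrideGuard() { Shadow_override = saved_; }

    // Copying would restore the saved value twice, so it is disabled.
    ShadowOverrideGuard(const ShadowOverrideGuard&) = delete;
    ShadowOverrideGuard& operator=(const ShadowOverrideGuard&) = delete;

private:
    bool saved_;
};
```

Because each guard keeps its own saved value, a UI render nested inside a game render restores the inner state first and the outer state second, which a single pair of statics cannot do.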

Biggest-bang-for-buck order

If you want to ship visible frame-rate gains fastest:

  1. Object-loop pre-filter + cached pos_rot (1 day, kills CPU stalls in busy missions)
  2. Tighter default Shadow_distances (15 minutes, kills far-cascade draw count)
  3. Switch FBO to RG + D16, drop duplicate sampler binding (half a day, halves VRAM/bandwidth)
  4. Hardware PCF via sampler2DArrayShadow (1-2 days, cuts receiver-side cost to roughly a half or a quarter)
  5. Slim shadow-cast path bypassing ship_render (2-3 days, halves caster CPU + GS load)
  6. Eventually: replace GS layering + per-cascade culled draw lists (1-2 weeks, but unlocks AMD/Intel/Mac perf parity)

Labels: gameplay, graphics