As investigated by Claude.
Shadow Performance Investigation — Report
Five parallel audits covered the CPU-side scene assembly, the draw pass, the OpenGL backend, the shader code, and the redundancy/frequency of shadow builds. Below is a consolidated picture of what's actually expensive, then prioritized recommendations.
The situation in one paragraph
The shadow system uses four-cascade VSM (variance shadow mapping) rendered into a layered RGBA32F + D32 texture array, fanned to all four cascade layers via a geometry shader with layout(invocations = 4). The pass is built fresh every rendered frame, even when paused, runs twice per frame (world + cockpit) and four times in VR, and is invoked again per UI frame in mission select, tech room, briefing, and lab. Caster-side culling walks every object up to Highest_object_index with no OBJ_NONE / OF_RENDERS / Should_be_dead gate, applies up to four full vector rotates per object, and then feeds the survivors through the full ship_render path (dock walks, warp checks, external weapon model traversal, alpha-cutout texture lookups) before stripping back to a depth-only material. Receiver-side, every shaded pixel of the directional light pays a 16-tap manual Poisson filter over the VSM moments (32 taps at cascade boundaries), reading a texture that needs 2 channels but is allocated with 4. Every caster-side issue is multiplied by the 4× GS amplification.
Dominant cost drivers, in priority order
Tier 1 — biggest wins
- Geometry-shader 4× cascade fan-out (code/def_files/data/effects/main-g.sdr:17). Every shadow-cast triangle is amplified 4× by GS invocations. Geometry shaders are widely reported to perform poorly on AMD, Intel, and Apple drivers, and they are not free on recent NVIDIA hardware either. The gl_ClipDistance clamp at main-g.sdr:167 forces every out-of-frustum primitive to rasterize as a degenerate at the near plane rather than being clipped. On AMD/Intel hardware this single technique probably costs more than everything else combined.
- No object pre-filter in shadows_render_all (code/graphics/shadows.cpp:496-553). The loop tests every slot up to Highest_object_index, including OBJ_NONE holes, weapons, fireballs, shockwaves, jump nodes, beams, and debris with !Used, through up to 4 cascade rotates before the switch silently drops them. In a firefight with hundreds of weapon objects this is on the order of 1,000 wasted vector rotates per frame just for shadows. The loop also lacks the OF_RENDERS / Should_be_dead gate that the main render path has.
- Far cascade engulfs the whole map (code/mod_table/mod_table.cpp:1834). Defaults are (200, 600, 2500, 8000). At 8000 m the far cascade typically contains every fighter, capship, asteroid, and debris piece in the mission, each covering only 1-4 shadow texels at Low quality: full draw-call cost for an invisible contribution. A 400-asteroid field plus a 50-ship engagement enters the queue almost in full.
- Receiver: 16-tap manual VSM, doubled at cascade boundaries (code/def_files/data/effects/shadows.sdr:33-67, shadows.sdr:96-100). The shadow texture is sampled as sampler2DArray (not sampler2DArrayShadow), so there is no hardware PCF: every receiving pixel does 16 fetches from the 32-bit float array plus a Chebyshev evaluation per fetch, which puts heavy pressure on the texture caches. In the 20 % cascade-blend band it is 32 taps. The same shadow texture is bound on two sampler units (deferred unit 4 and forward unit 8, code/graphics/opengl/gropengltnl.cpp:944-955), so shadow-receiving forward materials pay the kernel a second time.
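To make the per-tap receiver cost concrete, here is a minimal C++ sketch of the textbook one-sided Chebyshev bound that variance shadow mapping evaluates for each tap, averaged over a set of taps. It is not a transcription of shadows.sdr: the names vsm_chebyshev_upper_bound, vsm_filter, Moments, and the min_variance clamp are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One VSM texel: first and second depth moments (the .xy channels).
struct Moments {
    float mean;     // E[depth]
    float mean_sq;  // E[depth^2]
};

// One-sided Chebyshev upper bound on the lit fraction for a single tap.
// min_variance is a hypothetical clamp that avoids dividing by ~0.
inline float vsm_chebyshev_upper_bound(float mean, float mean_sq,
                                       float receiver_depth,
                                       float min_variance = 1e-6f)
{
    assert(min_variance > 0.0f);
    if (receiver_depth <= mean) {
        return 1.0f;  // receiver is in front of the average occluder: fully lit
    }
    const float variance = std::max(mean_sq - mean * mean, min_variance);
    const float d = receiver_depth - mean;
    return variance / (variance + d * d);
}

// Average the bound over all taps (16 in the pass described above).
inline float vsm_filter(const std::vector<Moments>& taps, float receiver_depth)
{
    assert(!taps.empty());
    float sum = 0.0f;
    for (const Moments& m : taps) {
        sum += vsm_chebyshev_upper_bound(m.mean, m.mean_sq, receiver_depth);
    }
    return sum / static_cast<float>(taps.size());
}
```

Each tap costs one texture fetch plus the arithmetic above, so a 16-tap kernel repeats this 16 times per shaded pixel, and 32 times inside the cascade-blend band.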
Tier 2 — substantial wins
- Cockpit shadow pass is a second full layered pass (code/ship/ship.cpp:8576-8607). It clears the full 4-layer FBO and re-renders the player ship + cockpit at locked LOD0 (the densest single mesh in most missions) every frame. In VR this becomes 4 builds per frame (2 eyes × world+cockpit at freespace2/freespace.cpp:4319, 4326).
- Full layered clear of an oversized FBO every pass. At Ultra (4096), the array textures total 1.25 GiB (4096² texels × 4 layers × 20 bytes per texel for RGBA32F + D32), and the entire thing is cleared on every shadow start (code/graphics/opengl/gropengltnl.cpp:697-698). Even at Medium it's 80 MiB cleared 2-4× per frame.
- Color format wastes half its bandwidth. VSM only needs RG (depth, depth²): code/def_files/data/effects/main-f.sdr:208 writes 0, 1 to .ba and the receiver reads only .xy. The FBO is allocated GL_RGBA32F (fallback RGBA16F) at code/graphics/opengl/gropengltnl.cpp:554-555.
- No shadow-pass LOD. model_render_determine_detail uses eye distance (code/model/modelrender.cpp:1037), so a ship 5 km away is drawn at the same LOD into the shadow map as into the color pass. Cockpit pass forces LOD0 explicitly. Per-submodel render_box/render_sphere culling also uses eye position, so subobjects the light sees but the eye doesn't are skipped while subobjects only the eye sees are still drawn into the shadow map.
- Built every frame regardless of state. No camera-stationary / paused / unchanged-scene caching. game_render_frame runs even when paused (freespace2/freespace.cpp:6477) and the shadow map is rebuilt unconditionally.
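The footprint figures above follow from simple arithmetic: resolution² × layers × (color bytes + depth bytes per texel). The sketch below reproduces it in C++. The helper name shadow_array_bytes is hypothetical, and the 1024 resolution for Medium is an assumption inferred from the 80 MiB figure.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ull * 1024ull;

// Total bytes of a layered color + depth shadow target.
// color_bytes_per_texel: 16 for RGBA32F, 8 for RG32F, 4 for RG16F.
// depth_bytes_per_texel: 4 for D32, 2 for D16.
inline std::uint64_t shadow_array_bytes(std::uint64_t resolution,
                                        std::uint64_t layers,
                                        std::uint64_t color_bytes_per_texel,
                                        std::uint64_t depth_bytes_per_texel)
{
    assert(resolution > 0 && layers > 0);
    return resolution * resolution * layers *
           (color_bytes_per_texel + depth_bytes_per_texel);
}
```

With four layers, RGBA32F + D32 at 1024 gives 80 MiB and at 4096 gives 1280 MiB (1.25 GiB). Switching to RG32F + D16 halves both, to 40 MiB and 640 MiB respectively.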
Tier 3 — meaningful but smaller
- shadows_construct_light_frustum does ~50 % redundant work across cascades (code/graphics/shadows.cpp:248-425). tanf(fov/2) is evaluated 8× per frame, unit uvec/rvec are re-copied 64× per frame, and the near-plane corners of cascade N+1 ≈ far-plane corners of cascade N (modulo the 20 % overlap), all recomputed from scratch.
- Light matrix rebuilt every frame even though Static_light.front() never moves in normal play (code/graphics/shadows.cpp:437-438).
- obj_queue_render runs the entire ship_render body during the shadow pass and then overwrites render_flags = MR_NO_TEXTURING | MR_NO_LIGHTING at code/ship/ship.cpp:21971. All the setup before that (dock evaluation, warp clip-plane, team color, replacement-texture array dereference, external weapon model traversal at code/ship/ship.cpp:22026) runs and is then thrown away.
- model_clear_instance called twice per frame per object (once for shadow, once for the lit pass). For big capships with 30+ textures × 6 texture types this is ~180 redundant writes per ship.
- Receiver pays unconditional matrix work in deferred (code/def_files/data/effects/deferred-f.sdr:260-270). Every shaded pixel of the directional light's full-screen pass does mat4*mat4*vec4 + four transformToShadowMap calls per pixel, regardless of whether the pixel ends up shadowed, because there's no per-vertex precompute for the fullscreen quad.
- Vertex shader during shadow pass is not stripped (code/def_files/data/effects/main-v.sdr:155-209). Tangent matrix, texcoord transform, normal pipeline all computed even though the shadow fragment shader only reads vertIn.position.z. Cost multiplied by GS fan-out / instancing fallback.
- Re-entrancy bug in shadow_maybe_start_frame (code/graphics/shadows.cpp:570-588): it uses static last_override / shadow_override_backup. Not strictly a perf issue, but it means nested UI/render scenarios will leak Shadow_override state. Worth fixing alongside any restructuring.
- Quality changes don't take effect without a restart (code/graphics/shadows.cpp:71 gates the listener on if (initial), FBO is only built in opengl_tnl_init). Not a perf bug but a UX bug worth fixing once we touch the FBO sizing.
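As an illustration of the light-matrix point above, the following C++ sketch rebuilds a light-space basis only when the light direction actually changes. Vec3, LightBasis, build_light_basis, and LightBasisCache are hypothetical stand-ins, not engine types, and the sketch covers only the light orientation; the cascade frusta still depend on the eye.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct LightBasis { Vec3 right, up, forward; };

inline Vec3 cross(const Vec3& a, const Vec3& b)
{
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

inline Vec3 normalized(const Vec3& v)
{
    const float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    assert(len > 0.0f);
    return {v.x / len, v.y / len, v.z / len};
}

// Orthonormal basis whose forward axis points along the light direction.
inline LightBasis build_light_basis(const Vec3& light_dir)
{
    const Vec3 forward = normalized(light_dir);
    // Pick an up hint that is not parallel to the light direction.
    const Vec3 hint = (std::fabs(forward.y) > 0.99f) ? Vec3{1.0f, 0.0f, 0.0f}
                                                     : Vec3{0.0f, 1.0f, 0.0f};
    const Vec3 right = normalized(cross(hint, forward));
    const Vec3 up = cross(forward, right);
    return {right, up, forward};
}

// Recomputes the basis only when the input direction differs from last time.
class LightBasisCache {
public:
    const LightBasis& get(const Vec3& light_dir)
    {
        const bool changed = !valid_ || light_dir.x != cached_dir_.x ||
                             light_dir.y != cached_dir_.y ||
                             light_dir.z != cached_dir_.z;
        if (changed) {
            basis_ = build_light_basis(light_dir);
            cached_dir_ = light_dir;
            valid_ = true;
            ++rebuilds_;
        }
        return basis_;
    }
    int rebuilds() const { return rebuilds_; }

private:
    bool valid_ = false;
    int rebuilds_ = 0;
    Vec3 cached_dir_{};
    LightBasis basis_{};
};
```

Exact float comparison is deliberate here: the question is whether the input changed at all, not whether it is numerically close.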
Recommendations, prioritized
Quick wins (small diffs, big returns)
- Gate the object loop. Add if (objp->type == OBJ_NONE) continue; plus an OF_RENDERS / Should_be_dead check before any frustum math. Switch the body of the loop to a switch on objp->type first, then per-type cull. Eliminates the vast majority of wasted rotates in busy missions.
- Precompute pos_rot once per object for shadows_obj_in_frustum and pass it to all 4 cascade tests. One rotate instead of up to four saves up to ~75 % of the function's rotate cost.
- Cache the light matrix and cascade distances; only recompute when Static_light.front(), the eye, or Shadow_distances actually changes.
- Skip the shadow pass when paused (and when no camera/object has moved since last frame, if you want to go further). Shadow_map_texture persists across frames, so the existing texture remains valid for stationary scenes.
- Drop the unused .ba channels: switch the color attachment to GL_RG32F (or GL_RG16F for Low/Medium). Cuts color VRAM and bandwidth in half with zero shader change beyond the format.
- Reduce the depth attachment to GL_DEPTH_COMPONENT16. Nothing samples it; it exists only for depth testing during the cast, so this is safe as long as 16 bits resolve the cascade depth range.
- Bind the shadow texture once for both deferred and forward paths instead of duplicating to unit 8.
- Default Shadow_distances more aggressively. (100, 400, 1200, 3000) is closer to what's actually visually meaningful given the cascade resolutions, and dramatically shrinks the far-cascade object set. Mods can override.
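The first two quick wins can be sketched together: reject slots by type and state before any math, rotate the position into light space once, then test that single rotated position against every cascade. The C++ below is a simplified model under stated assumptions. ObjType, Obj, Aabb, and collect_shadow_casters are hypothetical stand-ins for the engine's object array and shadows_obj_in_frustum, and each cascade is approximated by a light-space box.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for engine types.
enum class ObjType { None, Ship, Asteroid, Debris, Weapon, Beam };
struct Vec3 { float x, y, z; };
struct Mat3 { Vec3 right, up, forward; };  // rows of the light orientation
struct Aabb { Vec3 min, max; };            // one cascade in light space
struct Obj {
    ObjType type;
    bool renders;
    bool should_be_dead;
    Vec3 pos;
    float radius;
};
struct Caster { std::size_t index; std::uint8_t cascade_mask; };
struct CullResult { std::vector<Caster> casters; std::size_t rotates; };

inline float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

inline Vec3 to_light_space(const Mat3& m, const Vec3& v)
{
    return {dot(m.right, v), dot(m.up, v), dot(m.forward, v)};
}

inline bool can_cast_shadow(ObjType t)
{
    switch (t) {
    case ObjType::Ship:
    case ObjType::Asteroid:
    case ObjType::Debris:
        return true;
    default:
        return false;  // empty slots, weapons, beams, ... never reach the math
    }
}

// Conservative sphere-vs-box overlap in light space.
inline bool sphere_touches_box(const Vec3& c, float r, const Aabb& b)
{
    return c.x + r >= b.min.x && c.x - r <= b.max.x &&
           c.y + r >= b.min.y && c.y - r <= b.max.y &&
           c.z + r >= b.min.z && c.z - r <= b.max.z;
}

inline CullResult collect_shadow_casters(const std::vector<Obj>& objs, const Mat3& light,
                                         const std::vector<Aabb>& cascades)
{
    assert(cascades.size() <= 8);  // the mask has 8 bits
    CullResult out;
    out.rotates = 0;
    for (std::size_t i = 0; i < objs.size(); ++i) {
        const Obj& o = objs[i];
        if (!can_cast_shadow(o.type)) continue;          // type gate first
        if (!o.renders || o.should_be_dead) continue;    // state gate second
        const Vec3 p = to_light_space(light, o.pos);     // one rotate per object
        ++out.rotates;
        std::uint8_t mask = 0;
        for (std::size_t c = 0; c < cascades.size(); ++c) {
            if (sphere_touches_box(p, o.radius, cascades[c])) {
                mask |= static_cast<std::uint8_t>(1u << c);
            }
        }
        if (mask != 0) out.casters.push_back({i, mask});
    }
    return out;
}
```

The returned mask also records which cascades each object touches, which is the information per-cascade draw lists would need.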
Medium-effort wins
- Bypass obj_queue_render / ship_render for the shadow pass. Write a slim shadow-cast path that takes the ship hull + visible external weapon models, sets render_flags = MR_NO_TEXTURING | MR_NO_LIGHTING up front, and never touches dock evaluation, warp clip planes, team color, decals, etc.
- Shadow-pass LOD bias. Compute LOD against the light-distance / cascade index instead of eye distance for the shadow pass. Far cascades should aggressively prefer the lowest LOD.
- Per-cascade culling between GS invocations, or move to per-cascade culled draw lists (the four cascades have very different object sets: cascade 0 has the cockpit; cascade 3 has capships and asteroids). This both reduces GS amplification waste and unlocks per-cascade LOD.
- Hardware PCF. Switch to sampler2DArrayShadow and GL_COMPARE_REF_TO_TEXTURE, render depth-only into a DEPTH_COMPONENT32F array, and use a 4-tap textureGather-based PCF. This removes VSM moments entirely (no need for the color attachment, no need for the scale hack, no light-bleeding) and moves the comparison and filtering into the texture units, where they are close to free. This is the single biggest receiver-side win.
- Reorganize cockpit pass to share state. At minimum, only clear the cockpit's cascade layers (not all 4) or, if you keep the layered FBO, scissor the clear. Better: render cockpit as cascade 0 of the same pass.
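One way the per-cascade draw lists and the shadow-pass LOD bias could fit together is sketched below in C++. The policy (minimum LOD equal to the cascade index, clamped to the LODs a model has) is an assumption chosen for illustration, and MaskedCaster, ShadowDraw, and build_cascade_lists are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kNumCascades = 4;

struct MaskedCaster {
    int object_id;
    unsigned cascade_mask;  // bit i set: object touches cascade i
    int num_lods;           // LODs available on the model
};
struct ShadowDraw { int object_id; int lod; };
using CascadeLists = std::array<std::vector<ShadowDraw>, kNumCascades>;

// Hypothetical policy: farther cascades use coarser LODs, independent of eye distance.
inline int shadow_lod_for_cascade(int cascade_index, int num_lods)
{
    assert(cascade_index >= 0 && cascade_index < kNumCascades && num_lods > 0);
    return std::min(cascade_index, num_lods - 1);
}

// Split the culled casters into one draw list per cascade, each with its own LOD.
inline CascadeLists build_cascade_lists(const std::vector<MaskedCaster>& casters)
{
    CascadeLists lists;
    for (const MaskedCaster& c : casters) {
        for (int i = 0; i < kNumCascades; ++i) {
            if (c.cascade_mask & (1u << i)) {
                lists[static_cast<std::size_t>(i)].push_back(
                    {c.object_id, shadow_lod_for_cascade(i, c.num_lods)});
            }
        }
    }
    return lists;
}
```

An object that spans two cascades appears in both lists, but at a coarser LOD in the farther one, which a single GS-amplified draw cannot express.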
Bigger architectural changes
- Replace the GS layering with multi-view or vertex-shader layer selection (GL_OVR_multiview2, GL_NV_viewport_array2, or writing gl_Layer / gl_ViewportIndex from the vertex shader via GL_ARB_shader_viewport_layer_array) where supported. Eliminates GS amplification entirely. On drivers without those extensions, fall back to one draw per cascade with culled draw lists, which is almost always faster than a GS on AMD/Intel.
- Or drop CSM entirely on Low quality and use a single tight shadow frustum around the player. Most FS2 gameplay is close-quarters; the far cascade rarely contributes visible detail at 1024² shadow resolution.
- Reactive shadow rebuild. Detect camera/light/scene movement and rebuild only when needed. Even partial updates (e.g. cascade 0 every frame, cascade 3 every 4th) are valid for a directional sun.
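The partial-update idea can be reduced to a small scheduling predicate. The C++ sketch below uses an assumed table of refresh periods (every frame for cascades 0 and 1, every 2nd for cascade 2, every 4th for cascade 3); the table and the name cascade_needs_rebuild are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int kNumCascades = 4;

// Hypothetical refresh periods in frames, one per cascade.
constexpr std::array<std::uint32_t, kNumCascades> kRefreshPeriod = {{1, 1, 2, 4}};

// True when the given cascade should be re-rendered this frame.
// A light or split-distance change invalidates every cascade immediately.
inline bool cascade_needs_rebuild(int cascade, std::uint32_t frame, bool light_or_splits_changed)
{
    assert(cascade >= 0 && cascade < kNumCascades);
    if (light_or_splits_changed) {
        return true;
    }
    return frame % kRefreshPeriod[static_cast<std::size_t>(cascade)] == 0;
}
```

For a stale cascade to stay correct, the receiver would have to keep sampling it with the matrices it was last rendered with, so those would need to be stored per cascade.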
Cleanups worth bundling
- Make Shadow_quality changes apply at runtime by rebuilding shadow_fbo in the option listener instead of gating on if (initial).
- Replace the static-last_override mechanism in shadow_maybe_start_frame with a stack or an RAII guard to make the override re-entrancy-safe.
- Remove the redundant model_clear_instance calls between the shadow pass and the subsequent main pass (one reset per frame per object is enough).
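For the override cleanup, an RAII guard could look roughly like the C++ sketch below. It models Shadow_override as a plain bool, which is an assumption, and ShadowOverrideGuard and nested_override_demo are hypothetical names. Each guard remembers the value it found and restores it on scope exit, so nested scopes unwind in the right order without shared static state.

```cpp
#include <cassert>

// Sets the flag for the lifetime of the guard and restores the previous value afterwards.
class ShadowOverrideGuard {
public:
    ShadowOverrideGuard(bool& flag, bool value) : flag_(flag), saved_(flag) { flag_ = value; }
    ~ShadowOverrideGuard() { flag_ = saved_; }

    ShadowOverrideGuard(const ShadowOverrideGuard&) = delete;
    ShadowOverrideGuard& operator=(const ShadowOverrideGuard&) = delete;

private:
    bool& flag_;
    bool saved_;
};

// Nested use: the inner guard restores the outer guard's value, not the original one.
inline bool nested_override_demo(bool& shadow_override)
{
    const bool before = shadow_override;
    {
        ShadowOverrideGuard outer(shadow_override, true);
        {
            ShadowOverrideGuard inner(shadow_override, false);
            assert(!shadow_override);
        }
        assert(shadow_override);  // back to the outer guard's value
    }
    return shadow_override == before;  // original value restored
}
```

A guard also restores the flag on early returns and exceptions, which a manual backup/restore pair does not.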
Biggest-bang-for-buck order
If you want to ship visible frame-rate gains fastest:
- Object-loop pre-filter + cached pos_rot (1 day, kills CPU stalls in busy missions)
- Tighter default Shadow_distances (15 minutes, cuts far-cascade draw count)
- Switch FBO to RG + D16, drop duplicate sampler binding (half a day, halves VRAM/bandwidth)
- Hardware PCF via sampler2DArrayShadow (1-2 days, cuts receiver-side cost to roughly a half or a quarter)
- Slim shadow-cast path bypassing ship_render (2-3 days, halves caster CPU + GS load)
- Eventually: replace GS layering and add per-cascade culled draw lists (1-2 weeks, but unlocks AMD/Intel/Mac perf parity)