Batch variable-length cell header reads #163

Open
marklam wants to merge 1 commit into Apollo3zehn:master from marklam:batched-vl-headers

marklam commented Jun 4, 2026

(Updated to remove references to a JIT problem that turned out to be my flaky CPU.)

Summary

For variable-length sequences and strings, every cell in a Read<T> decode pass reads its own header from the dataset stream before resolving the heap object and decoding the payload.
On an N-cell read, that is N IH5ReadStream.ReadDataset calls, each with its own overhead.

This PR adds a batched code path in DatatypeMessage.GetDecodeInfoForReferenceMemory<T>, selected when the variable-length sequence/string case is detected at decoder-construction time. When active, it bulk-reads target.Length * cellHeaderSize bytes from the source in one ReadDataset call into a buffer rented from ArrayPool<byte>.Shared, wraps the result in a SystemMemoryStream, and feeds the per-cell ElementDecodeDelegate from that in-memory wrapper. The per-cell decoder itself is unchanged: it still reads its header through IH5ReadStream, but now from memory, without the per-call overhead of the dataset stream.
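To make the batching idea concrete, here is a minimal standalone sketch. It is not the code in this PR: CountingStream, BatchedHeaderDemo and the 16-byte header size are hypothetical stand-ins, and a plain MemoryStream plays the role of both the dataset stream and the in-memory wrapper. It only shows how one bulk read into a pooled buffer replaces N small reads against the source.

```csharp
using System;
using System.Buffers;
using System.IO;

// Hypothetical stand-in for the dataset stream; counts the reads it serves.
sealed class CountingStream : MemoryStream
{
    public int ReadCalls { get; private set; }

    public CountingStream(byte[] data) : base(data) { }

    public override int Read(byte[] buffer, int offset, int count)
    {
        ReadCalls++;
        return base.Read(buffer, offset, count);
    }
}

static class BatchedHeaderDemo
{
    // Assumed header size: 4-byte length + 12-byte global heap id.
    public const int CellHeaderSize = 16;

    // Unbatched shape: one read against the source per cell.
    public static int ReadPerCell(CountingStream source, int cellCount)
    {
        var header = new byte[CellHeaderSize];
        for (var i = 0; i < cellCount; i++)
            Fill(source, header, CellHeaderSize);
        return source.ReadCalls;
    }

    // Batched shape: one bulk read into a pooled buffer, then in-memory reads.
    public static int ReadBatched(CountingStream source, int cellCount)
    {
        var total = cellCount * CellHeaderSize;
        var rented = ArrayPool<byte>.Shared.Rent(total); // may be longer than total
        try
        {
            Fill(source, rented, total);
            using var headers = new MemoryStream(rented, 0, total, writable: false);
            var header = new byte[CellHeaderSize];
            for (var i = 0; i < cellCount; i++)
                Fill(headers, header, CellHeaderSize); // served from memory
            return source.ReadCalls;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }

    // Reads exactly count bytes into buffer or throws at end of stream.
    static void Fill(Stream stream, byte[] buffer, int count)
    {
        var done = 0;
        while (done < count)
        {
            var n = stream.Read(buffer, done, count - done);
            if (n == 0) throw new EndOfStreamException();
            done += n;
        }
    }

    static void Main()
    {
        const int cells = 600;
        var data = new byte[cells * CellHeaderSize];
        Console.WriteLine(ReadPerCell(new CountingStream(data), cells)); // 600
        Console.WriteLine(ReadBatched(new CountingStream(data), cells)); // 1
    }
}
```

The per-cell loop is identical in both methods; only the stream it reads from differs, which mirrors the "per-cell decoder is unchanged" property described above.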

This should improve the performance of reading multiple cells at a time, but not penalise the single-cell read, or reads of non-blittable types.

Other decoders (fixed-length string, unmanaged element, reference compound, …) are unaffected.

Performance

The performance improvement in this PR builds on top of other PRs. Without those, the improvement is modest.

I used AI to turn the multiple benchmark results folders from my tests into this synopsis:

Benchmark: benchmarks/PureHDF.Benchmarks/VariableLengthCompoundRead.cs — 1-D dataset of 600 variable-length sequences, each holding 200 elements of a 12-byte blittable struct. Three access patterns:

  • ReadAll — one Read<Sample[][]> for the whole dataset (1× per-Read dispatch, 600 cells)
  • ReadByWindow — 10 reads of 60 cells each
  • ReadPerCell — 600 reads of 1 cell each (1 cell per read, so batching has nothing to batch)

Hardware: i9-13900KS, Windows 11, .NET 8.0.27, BenchmarkDotNet default job.

1. On master (no other PRs applied)

| Method | Before | After | Δ wall | Δ alloc |
| --- | --- | --- | --- | --- |
| ReadAll | 5.836 ms | 4.843 ms | −17% | 0% |
| ReadByWindow | 5.670 ms | 4.812 ms | −15% | 0% |
| ReadPerCell | 6.747 ms | 6.739 ms | 0% | 0% |

The win is modest because the per-cell decoder on master is dominated by per-element MethodInfo.Invoke plus per-cell Array.CreateInstance and per-element boxing — batching only collapses the per-cell stream-dispatch overhead, which is a small fraction of total time.

2. On top of #161 (reflection caching)

| Method | Before | After | Δ wall | Δ alloc |
| --- | --- | --- | --- | --- |
| ReadAll | 4.302 ms | 3.351 ms | −22% | 0% |
| ReadByWindow | 4.233 ms | 3.304 ms | −22% | 0% |
| ReadPerCell | 4.689 ms | 4.685 ms | 0% | 0% |

Slightly larger relative gain than master alone — #161 makes the dispatch part of each Read call cheaper, so the share of total time owed to per-cell stream dispatch grows.

3. On top of #162 (blittable VL fast path)

| Method | Before | After | Δ wall | Δ alloc |
| --- | --- | --- | --- | --- |
| ReadAll | 940.0 µs | 108.1 µs | −88% | 0% |
| ReadByWindow | 919.9 µs | 136.4 µs | −85% | 0% |
| ReadPerCell | 2 201.9 µs | 2 342.5 µs | +6% | +1% |

This is where the change earns its keep. #162 makes the per-cell decode itself essentially free (one MemoryMarshal.Cast + CopyTo), so on the bulk paths the per-cell stream dispatch was the dominant remaining cost; collapsing 600 small header reads into 1 (ReadAll) or 10 (ReadByWindow) cuts wall time by ~85–88%. ReadPerCell (1 cell per read, nothing to batch) measured 6% slower in this run; scenarios 1, 2 and 4 show 0% for the same method, so this looks like run-to-run noise rather than a regression.
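For readers who have not looked at #162, the "one MemoryMarshal.Cast + CopyTo" decode can be sketched roughly as follows. This is an illustration under assumptions, not the code from that PR: the Sample field names and the BlittableDecodeDemo class are hypothetical, and the struct is simply laid out to be 12 bytes like the benchmark element.

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.InteropServices;

// Hypothetical 12-byte blittable element; field names are illustrative only.
[StructLayout(LayoutKind.Sequential, Pack = 4)]
struct Sample
{
    public int Id;
    public float X;
    public float Y;
}

static class BlittableDecodeDemo
{
    // Reinterprets the heap payload as elements and copies them out in one call.
    public static Sample[] Decode(ReadOnlySpan<byte> payload)
    {
        ReadOnlySpan<Sample> elements = MemoryMarshal.Cast<byte, Sample>(payload);
        var result = new Sample[elements.Length];
        elements.CopyTo(result);
        return result;
    }

    static void Main()
    {
        var payload = new byte[200 * 12];
        // Write Id = 7 into the second element (byte offset 12).
        BinaryPrimitives.WriteInt32LittleEndian(payload.AsSpan(12), 7);
        var decoded = Decode(payload);
        Console.WriteLine(decoded.Length); // 200
        Console.WriteLine(decoded[1].Id);  // 7 on little-endian hardware
    }
}
```

With the payload decode reduced to a single copy like this, the remaining per-cell cost is the header read, which is the part this PR batches.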

4. On top of both #161 + #162

| Method | Before | After | Δ wall | Δ alloc |
| --- | --- | --- | --- | --- |
| ReadAll | 920.6 µs | 103.3 µs | −89% | 0% |
| ReadByWindow | 902.9 µs | 115.5 µs | −87% | 0% |
| ReadPerCell | 1 386.6 µs | 1 391.4 µs | 0% | +1% |

Bulk-path numbers are essentially the same as scenario 3 — #161's contribution on this benchmark is concentrated in ReadPerCell (where it cuts per-Read reflection-dispatch cost), not in the per-cell decoder. On ReadAll / ReadByWindow the dominant remaining cost after #162 is per-cell stream dispatch, which is what this PR removes.

  For variable-length sequence/string datatypes, the per-element decoder
  reads a small fixed-size cell header (4-byte length + global heap id)
  from the dataset stream per cell. An N-cell decode pass becomes N small
  ReadDataset calls into the underlying IH5ReadStream, each paying virtual
  dispatch + position-tracking overhead.

  GetDecodeInfoForReferenceMemory<T> now detects the variable-length
  sequence/string case at decoder construction time and, when active,
  bulk-reads target.Length * cellHeaderSize bytes from source in one
  ReadDataset, wraps the result in a SystemMemoryStream, and feeds the
  per-cell ElementDecodeDelegate from that in-memory wrapper. The per-cell
  decoder is unchanged. Other decoders (fixed-length string, unmanaged
  element, reference compound) are unaffected. The bulk buffer is rented
  from ArrayPool<byte>.Shared so steady-state allocation is unchanged.
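
As a rough picture of what each fixed-size cell header holds and how cell i is located inside the bulk-read buffer, consider the sketch below. It assumes 8-byte file offsets, so the global heap id is an 8-byte collection address followed by a 4-byte object index, giving a 16-byte header. VlCellHeader, ParseAt and HeaderDemo are hypothetical names for illustration and do not correspond to types in the library.

```csharp
using System;
using System.Buffers.Binary;

// Assumed layout: 4-byte length, 8-byte heap collection address, 4-byte object index.
readonly record struct VlCellHeader(uint Length, ulong HeapAddress, uint ObjectIndex)
{
    public const int Size = 16;

    // Decodes one header from the first Size bytes (HDF5 stores these little-endian).
    public static VlCellHeader Parse(ReadOnlySpan<byte> bytes) => new(
        BinaryPrimitives.ReadUInt32LittleEndian(bytes),
        BinaryPrimitives.ReadUInt64LittleEndian(bytes.Slice(4)),
        BinaryPrimitives.ReadUInt32LittleEndian(bytes.Slice(12)));

    // Header i of a bulk-read buffer starts at byte i * Size.
    public static VlCellHeader ParseAt(ReadOnlySpan<byte> batch, int cell) =>
        Parse(batch.Slice(cell * Size, Size));
}

static class HeaderDemo
{
    static void Main()
    {
        // Three headers' worth of bytes; fill in the second one by hand.
        var batch = new byte[3 * VlCellHeader.Size];
        var second = batch.AsSpan(VlCellHeader.Size);
        BinaryPrimitives.WriteUInt32LittleEndian(second, 200);
        BinaryPrimitives.WriteUInt64LittleEndian(second.Slice(4), 0x1000);
        BinaryPrimitives.WriteUInt32LittleEndian(second.Slice(12), 5);
        Console.WriteLine(VlCellHeader.ParseAt(batch, 1));
    }
}
```

Because every header has the same size, the bulk read needs only target.Length * cellHeaderSize bytes and no per-cell seeking.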