Batch variable-length cell header reads by marklam · Pull Request #163 · Apollo3zehn/PureHDF

marklam · 2026-06-04T18:35:29Z

(Updated to remove references to a JIT problem that turned out to be caused by my flaky CPU.)

Summary

For variable-length sequences and strings, every cell in a Read<T> decode pass reads its own header from the dataset stream before resolving the heap object and decoding the payload.
An N-cell read therefore makes N IH5ReadStream.ReadDataset calls, each with its own overhead.

This PR adds a batched code path in DatatypeMessage.GetDecodeInfoForReferenceMemory<T>, selected when the variable-length sequence/string case is detected at decoder-construction time. When active, it bulk-reads target.Length * cellHeaderSize bytes from the source in one ReadDataset call into a buffer rented from ArrayPool<byte>.Shared, wraps the result in a SystemMemoryStream, and feeds the per-cell ElementDecodeDelegate from that in-memory wrapper. The per-cell decoder itself is unchanged: it still reads its header through IH5ReadStream, but the read is now served from memory, without the per-call overhead of the dataset stream.
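A minimal sketch of the batching idea described above, not the code in this PR. ICellSource, CountingSource, BatchedHeaderSketch, and the Func<Stream, T> decoder are hypothetical stand-ins for IH5ReadStream, a real dataset stream, the batched path, and ElementDecodeDelegate; a plain MemoryStream stands in for SystemMemoryStream.

```csharp
using System;
using System.Buffers;
using System.IO;

// Hypothetical stand-in for the dataset stream (PureHDF's IH5ReadStream differs).
interface ICellSource
{
    void ReadDataset(Span<byte> buffer);
}

// In-memory fake that counts how many reads reach the "dataset stream".
sealed class CountingSource : ICellSource
{
    private readonly byte[] _data;
    private int _position;

    public int Calls { get; private set; }

    public CountingSource(byte[] data) => _data = data;

    public void ReadDataset(Span<byte> buffer)
    {
        Calls++;
        _data.AsSpan(_position, buffer.Length).CopyTo(buffer);
        _position += buffer.Length;
    }
}

static class BatchedHeaderSketch
{
    // Decodes target.Length cells, fetching all cell headers with one source read.
    public static void DecodeBatched<T>(
        ICellSource source,
        Span<T> target,
        int cellHeaderSize,
        Func<Stream, T> decodeCell) // stand-in for the per-cell decode delegate
    {
        int byteCount = checked(target.Length * cellHeaderSize);
        byte[] rented = ArrayPool<byte>.Shared.Rent(byteCount);

        try
        {
            // One bulk read instead of target.Length small ones.
            source.ReadDataset(rented.AsSpan(0, byteCount));

            // Rent may return a larger array, so bound the stream explicitly.
            using var headers = new MemoryStream(rented, 0, byteCount, writable: false);

            // The per-cell decoder is unchanged; it just reads from memory now.
            for (int i = 0; i < target.Length; i++)
                target[i] = decodeCell(headers);
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```

Calling DecodeBatched<int> on a CountingSource holding three 4-byte values, with a decoder that reads one Int32 per cell, fills a three-element target and leaves Calls at 1. The try/finally returns the rented buffer even if a cell decoder throws.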

This should improve the performance of reading multiple cells at a time, but not penalise the single-cell read, or reads of non-blittable types.

Other decoders (fixed-length string, unmanaged element, reference compound, …) are unaffected.

Performance

The performance improvement in this PR builds on top of other PRs (#161 and #162). Without those, the improvement is modest.

I used AI to turn the multiple benchmark results folders from my tests into this synopsis:

Benchmark: benchmarks/PureHDF.Benchmarks/VariableLengthCompoundRead.cs — 1-D dataset of 600 variable-length sequences, each holding 200 elements of a 12-byte blittable struct. Three access patterns:

ReadAll — one Read<Sample[][]> for the whole dataset (1× per-Read dispatch, 600 cells)
ReadByWindow — 10 reads of 60 cells each
ReadPerCell — 600 reads of 1 cell each (1 cell per read, so batching has nothing to batch)
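To make the three patterns concrete, here is a small sketch of how many cell-header reads each one issues against the dataset stream. It assumes one decode pass per Read<T> call (so the batched path issues one header read per call); HeaderReadCount is a hypothetical helper, not part of PureHDF or the benchmark.

```csharp
using System;

static class HeaderReadCount
{
    // reads: number of Read<T> calls; cellsPerRead: cells decoded by each call.
    // Unbatched: every cell reads its own header.
    // Batched (assumed): each Read<T> call fetches all its headers in one read.
    public static (int Unbatched, int Batched) Count(int reads, int cellsPerRead)
    {
        if (reads < 0 || cellsPerRead < 0)
            throw new ArgumentOutOfRangeException(nameof(reads), "counts must be non-negative");

        return (checked(reads * cellsPerRead), reads);
    }
}
```

Under that assumption, Count(1, 600) gives 600 unbatched reads against 1 batched (ReadAll), Count(10, 60) gives 600 against 10 (ReadByWindow), and Count(600, 1) gives 600 against 600 (ReadPerCell), which is why the last pattern has nothing to gain.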

Hardware: i9-13900KS, Windows 11, .NET 8.0.27, BenchmarkDotNet default job.

1. On master (no other PRs applied)

Method	Before	After	Δ wall	Δ alloc
ReadAll	5.836 ms	4.843 ms	−17%	0%
ReadByWindow	5.670 ms	4.812 ms	−15%	0%
ReadPerCell	6.747 ms	6.739 ms	0%	0%

The win is modest because the per-cell decoder on master is dominated by per-element MethodInfo.Invoke plus per-cell Array.CreateInstance and per-element boxing — batching only collapses the per-cell stream-dispatch overhead, which is a small fraction of total time.

2. On top of #161 (reflection caching)

Method	Before	After	Δ wall	Δ alloc
ReadAll	4.302 ms	3.351 ms	−22%	0%
ReadByWindow	4.233 ms	3.304 ms	−22%	0%
ReadPerCell	4.689 ms	4.685 ms	0%	0%

Slightly larger relative gain than master alone — #161 makes the dispatch part of each Read call cheaper, so the share of total time owed to per-cell stream dispatch grows.

3. On top of #162 (blittable VL fast path)

Method	Before	After	Δ wall	Δ alloc
ReadAll	940.0 µs	108.1 µs	−88%	0%
ReadByWindow	919.9 µs	136.4 µs	−85%	0%
ReadPerCell	2 201.9 µs	2 342.5 µs	+6%	+1%

This is where the change earns its keep. #162 makes the per-cell decode itself essentially free (one MemoryMarshal.Cast + CopyTo), so on the bulk paths the per-cell stream dispatch was the dominant remaining cost; collapsing 600 small reads into 1 takes wall-time down by ~85–88%. ReadPerCell (1 cell per read, nothing to batch) shows +6% in this run, which is treated as noise here: scenario 4 shows 0% for the same method.
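For readers unfamiliar with the "one MemoryMarshal.Cast + CopyTo" pattern, a minimal sketch follows. It is not the code from #162: BlittableDecodeSketch is a hypothetical helper, and Sample is a hypothetical 12-byte struct that matches the benchmark's element size only, not its actual fields. It assumes the payload bytes already have the struct's in-memory layout and the host's byte order.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical 12-byte blittable element (two ints and a float).
[StructLayout(LayoutKind.Sequential)]
struct Sample
{
    public int A;
    public int B;
    public float C;
}

static class BlittableDecodeSketch
{
    // Reinterprets the raw payload as T elements and copies them into a new array.
    // No per-element reflection, boxing, or delegate invocation is involved.
    public static T[] Decode<T>(ReadOnlySpan<byte> payload) where T : unmanaged
    {
        // Cast changes only the view of the bytes; trailing bytes that do not
        // fill a whole T are dropped.
        ReadOnlySpan<T> elements = MemoryMarshal.Cast<byte, T>(payload);

        var result = new T[elements.Length];
        elements.CopyTo(result); // single bulk copy
        return result;
    }
}
```

With the decode reduced to one bulk copy per cell, fixed per-cell costs such as a stream read for each header become the larger share of the remaining time.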

4. On top of both #161 + #162

Method	Before	After	Δ wall	Δ alloc
ReadAll	920.6 µs	103.3 µs	−89%	0%
ReadByWindow	902.9 µs	115.5 µs	−87%	0%
ReadPerCell	1 386.6 µs	1 391.4 µs	0%	+1%

Bulk-path numbers are essentially the same as scenario 3 — #161's contribution on this benchmark is concentrated in ReadPerCell (where it cuts per-Read reflection-dispatch cost), not in the per-cell decoder. On ReadAll / ReadByWindow the dominant remaining cost after #162 is per-cell stream dispatch, which is what this PR removes.

For variable-length sequence/string datatypes, the per-element decoder reads a small fixed-size cell header (4-byte length + global heap id) from the dataset stream per cell. An N-cell decode pass becomes N small ReadDataset calls into the underlying IH5ReadStream, each paying virtual dispatch + position-tracking overhead. GetDecodeInfoForReferenceMemory<T> now detects the variable-length sequence/string case at decoder construction time and, when active, bulk-reads target.Length * cellHeaderSize bytes from source in one ReadDataset, wraps the result in a SystemMemoryStream, and feeds the per-cell ElementDecodeDelegate from that in-memory wrapper. The per-cell decoder is unchanged. Other decoders (fixed-length string, unmanaged element, reference compound) are unaffected. The bulk buffer is rented from ArrayPool<byte>.Shared so steady-state allocation is unchanged.

marklam wants to merge 1 commit into Apollo3zehn:master from marklam:batched-vl-headers
