Batch variable-length cell header reads#163
Open
marklam wants to merge 1 commit into
Open
Conversation
For variable-length sequence/string datatypes, the per-element decoder reads a small fixed-size cell header (4-byte length + global heap id) from the dataset stream per cell. An N-cell decode pass becomes N small ReadDataset calls into the underlying IH5ReadStream, each paying virtual dispatch + position-tracking overhead. GetDecodeInfoForReferenceMemory<T> now detects the variable-length sequence/string case at decoder construction time and, when active, bulk-reads target.Length * cellHeaderSize bytes from source in one ReadDataset, wraps the result in a SystemMemoryStream, and feeds the per-cell ElementDecodeDelegate from that in-memory wrapper. The per-cell decoder is unchanged. Other decoders (fixed-length string, unmanaged element, reference compound) are unaffected. The bulk buffer is rented from ArrayPool<byte>.Shared so steady-state allocation is unchanged.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
(Updated to remove references to a JIT problem that was my flaky CPU)
Summary
For variable-length sequences and strings, every cell in a
Read<T>decode pass reads its own header from the dataset stream before resolving the heap object and decoding the payload.On an N-cell read that's N
IH5ReadStream.ReadDatasetcalls, each with their own overhead.This PR adds a batched code path in
DatatypeMessage.GetDecodeInfoForReferenceMemory<T>which is selected if it detects the variable-length sequence/string case at decoder-construction time. When active, it bulk-readstarget.Length * cellHeaderSizebytes from the source in oneReadDatasetcall, wraps the result in aSystemMemoryStream, and feeds the per-cellElementDecodeDelegatefrom that in-memory wrapper (the buffer is rented fromArrayPool<byte>.Shared). The per-cell decoder itself is unchanged - it still reads its header throughIH5ReadStream, but that's now from memory with no per-call overhead.This should improve the performance of reading multiple cells at a time, but not penalise the single-cell read, or reads of non-blittable types.
Other decoders (fixed-length string, unmanaged element, reference compound, …) are unaffected.
Performance
The performance improvement in this PR builds on top of other PRs. Without those, the improvement is modest.
I used AI to turn the multiple benchmark results folders from my tests into this synopsis:
Benchmark:
benchmarks/PureHDF.Benchmarks/VariableLengthCompoundRead.cs— 1-D dataset of 600 variable-length sequences, each holding 200 elements of a 12-byte blittable struct. Three access patterns:Read<Sample[][]>for the whole dataset (1× per-Read dispatch, 600 cells)Hardware: i9-13900KS, Windows 11, .NET 8.0.27, BenchmarkDotNet default job.
1. On master (no other PRs applied)
The win is modest because the per-cell decoder on master is dominated by per-element
MethodInfo.Invokeplus per-cellArray.CreateInstanceand per-element boxing — batching only collapses the per-cell stream-dispatch overhead, which is a small fraction of total time.2. On top of #161 (reflection caching)
Slightly larger relative gain than master alone — #161 makes the dispatch part of each
Readcall cheaper, so the share of total time owed to per-cell stream dispatch grows.3. On top of #162 (blittable VL fast path)
This is where the change earns its keep. #162 makes the per-cell decode itself essentially free (one
MemoryMarshal.Cast+CopyTo), so on the bulk paths the per-cell stream dispatch was the dominant remaining cost — collapsing 600 small reads into 1 takes wall-time down by ~85–88%.ReadPerCell(1 cell per read, nothing to batch) is unchanged within noise.4. On top of both #161 + #162
Bulk-path numbers are essentially the same as scenario 3 — #161's contribution on this benchmark is concentrated in
ReadPerCell(where it cuts per-Readreflection-dispatch cost), not in the per-cell decoder. OnReadAll/ReadByWindowthe dominant remaining cost after #162 is per-cell stream dispatch, which is what this PR removes.