Scan API and Engine Integrations by gatesn · Pull Request #44 · vortex-data/rfcs

gatesn · 2026-04-08T14:28:50Z

This RFC looks at how we can expose deeper integration with query engine internals such as scheduling, threading models, and buffer pools.

Signed-off-by: Nicholas Gates <nick@nickgates.com>

robert3005 · 2026-04-08T16:30:06Z

+
+- The Scan API is not itself a full relational query engine.
+- `LayoutReader` should not grow unknown-cardinality operator semantics.
+- Vortex should not require a specific Rust async runtime such as Tokio.


It's weird that we would go through all of this and still assume Tokio, but I haven't read all of it yet.

We already don't assume Tokio; it just continues to be an explicit goal.

The double negation here implies the opposite? You want the goal to be that the runtime doesn't assume Tokio? Maybe I am reading too much into random AI-generated strings.
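
To make the "no specific async runtime" goal concrete, here is a minimal sketch of a scan-like future driven to completion using only the standard library. `block_on` and `next_batch_len` are hypothetical names for illustration and are not part of the RFC; a host would substitute its own executor.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the thread that is blocked inside `block_on`.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal executor: polls `fut` on the calling thread until it completes.
/// Nothing here depends on Tokio or any other runtime crate.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            // Sleep until the waker unparks this thread, then poll again.
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical stand-in for one scan step; it names no runtime.
pub async fn next_batch_len() -> usize {
    1024
}
```

Calling `block_on(next_batch_len())` runs the future on the caller's thread. The point is only that the future's type signature leaves the choice of executor to the host.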

robert3005 · 2026-04-08T16:53:31Z

+
+- the host may provide a CPU scheduler
+- Vortex may use it for bounded split-local CPU work
+- Vortex must not assume ownership of the whole query runtime


What does this statement mean in practice? I think there's intent behind it, but I fail to understand what it means concretely.

I'd guess it is mainly about resource use, e.g. spawning threads, but also Unix process ownership, e.g. Vortex should never crash the host. @gatesn, correct me if you had something else in mind.

Vortex should never crash the host.

Error handling might deserve a small section in this PR. I briefly talked about this with @myrrc, but I think we'll need a panic handler (which the host can maybe configure) to ensure that we never crash a host.
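
One way such a panic boundary could look is sketched below, assuming the library is built with unwinding panics (`std::panic::catch_unwind` cannot intercept a panic when the binary is compiled with `panic = "abort"`). `ScanError` and `guard_split` are hypothetical names, not part of the RFC.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical error type surfaced to the host instead of an unwinding panic.
#[derive(Debug, PartialEq)]
pub enum ScanError {
    Panicked(String),
}

/// Runs one unit of split-local work and converts a panic into an error value.
pub fn guard_split<T>(work: impl FnOnce() -> T) -> Result<T, ScanError> {
    panic::catch_unwind(AssertUnwindSafe(work)).map_err(|payload| {
        // Panic payloads are usually `&'static str` or `String`.
        let msg = if let Some(s) = payload.downcast_ref::<&str>() {
            (*s).to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic payload".to_string()
        };
        ScanError::Panicked(msg)
    })
}
```

A host-configurable handler could then decide what to do with `ScanError::Panicked`, for example failing the query while keeping the process alive. The default panic hook still prints to stderr unless the host replaces it.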

robert3005 · 2026-04-08T17:35:45Z

+- split lookahead policy
+- efficient materialization of output batches
+
+### What `Partitioning` Means


Words are hard. Partitioning usually means some arrangement of data, which is not what this is about. But maybe this is Partitioning and the other thing is Arrangement.

robert3005 · 2026-04-08T18:02:52Z

+
+Correctness is more important than maximal pushdown.
+
+## Ordering, Limits, and Future Dynamic Filters


You should mention Partitioning here (or, as I redefined it, Arrangement). It's a superset of ordering.

0ax1

Just a thought: it may be worthwhile clauding some ASCII diagrams to illustrate some of the aspects.

Signed-off-by: Nicholas Gates <nick@nickgates.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: Nicholas Gates <gatesn@users.noreply.github.com>

AdamGS · 2026-04-20T12:07:15Z

+
+This layer belongs in `vortex-scan`.
+
+### 3. Operator Layer


This seems like the only place where this part is called an "operator", maybe "Host Layer" is more consistent with the rest of the doc?

AdamGS · 2026-04-20T12:20:41Z

+
+The same Scan API should support all of these, but not all with the same level of host control.
+
+### DataFusion


We currently have two DataFusion code paths; I think it's worth expanding on the difference here.

AdamGS · 2026-04-20T12:21:13Z

+- moderate
+- simpler than Velox
+
+## Why `Partition` and `Split` Both Exist


Just editing-wise: if it's this important, should it be earlier on?

AdamGS · 2026-04-20T12:25:53Z

+
+That keeps the current implementation direction intact.
+
+## `VortexTableScan`


What is this? Aside from "a scan of a vortex table"?

AdamGS · 2026-04-20T12:59:07Z

+### `Partition`
+
+A `Partition` is the unit of work exposed by the Scan API to a host engine.


Is a partition completely opaque? What does it return?

AdamGS · 2026-04-20T13:04:01Z

+
+This lines up with:
+
+- DataFusion task-per-partition execution


The DataFusion FileSource API is moving towards a lower-level, morsel-like API, where partitions return a collection of smaller splits that can end up being stolen by other partitions.
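
The morsel-like shape described in this comment can be sketched as a shared queue of splits that every partition pulls from, so an idle partition naturally picks up work another one has not reached yet. This is an illustration only: `Split`, `SplitQueue`, and `plan_splits` are hypothetical names and do not reflect the actual DataFusion or Vortex APIs.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

/// Hypothetical split descriptor: a half-open row range `[row_start, row_end)`.
#[derive(Debug, Clone, PartialEq)]
pub struct Split {
    pub row_start: u64,
    pub row_end: u64,
}

/// Shared pool of pending splits. Each partition holds a clone of the handle.
#[derive(Clone)]
pub struct SplitQueue {
    inner: Arc<Mutex<VecDeque<Split>>>,
}

impl SplitQueue {
    pub fn new(splits: impl IntoIterator<Item = Split>) -> Self {
        Self {
            inner: Arc::new(Mutex::new(splits.into_iter().collect())),
        }
    }

    /// Takes the next pending split, or `None` once the scan is drained.
    pub fn next_split(&self) -> Option<Split> {
        self.inner.lock().unwrap().pop_front()
    }
}

/// Chops `[0, total_rows)` into splits of at most `rows_per_split` rows.
pub fn plan_splits(total_rows: u64, rows_per_split: u64) -> Vec<Split> {
    assert!(rows_per_split > 0, "split size must be non-zero");
    (0..total_rows)
        .step_by(rows_per_split as usize)
        .map(|start| Split {
            row_start: start,
            row_end: (start + rows_per_split).min(total_rows),
        })
        .collect()
}
```

A single mutex-guarded queue is the simplest form of this idea; a real implementation might use per-partition deques with stealing to reduce contention.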

AdamGS · 2026-04-20T13:08:10Z

+pub struct ScanBudget {
+    pub max_active_splits: usize,
+    pub max_prefetch_splits: usize,
+    pub max_prefetch_bytes: u64,
+    pub max_inflight_reads: usize,
+    pub max_buffered_batches: usize,
+}


If we expose lower-level pieces, why do we need a budget? These are 5 extra knobs that I think could more clearly be part of the API, allowing each host to navigate it in whichever way makes sense for that system/language/paradigm.
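
For context on what the budget would be used for, here is a sketch of the kind of admission check these fields imply. The `ScanBudget` fields are copied from the hunk above; `ScanUsage` and `admits_prefetch` are hypothetical and only illustrate one possible reading of two of the five knobs.

```rust
/// Field names copied from the RFC hunk under review.
#[derive(Debug, Clone, Copy)]
pub struct ScanBudget {
    pub max_active_splits: usize,
    pub max_prefetch_splits: usize,
    pub max_prefetch_bytes: u64,
    pub max_inflight_reads: usize,
    pub max_buffered_batches: usize,
}

/// Hypothetical snapshot of what the scan has already prefetched.
#[derive(Debug, Clone, Copy, Default)]
pub struct ScanUsage {
    pub prefetched_splits: usize,
    pub prefetched_bytes: u64,
}

impl ScanBudget {
    /// Returns true if prefetching one more split of `split_bytes` bytes
    /// stays within both the split-count and the byte limits.
    pub fn admits_prefetch(&self, usage: &ScanUsage, split_bytes: u64) -> bool {
        usage.prefetched_splits < self.max_prefetch_splits
            && usage.prefetched_bytes.saturating_add(split_bytes) <= self.max_prefetch_bytes
    }
}
```

If the lower-level pieces were exposed directly, as the comment suggests, the host would perform an equivalent check itself before asking for the next split.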

AdamGS · 2026-04-20T13:12:20Z

+    pub offset: u64,
+    pub len: u64,
+    pub alignment: usize,
+    pub priority: IoPriority,


What's priority here? How is that different from intent? I don't know how I/O looks in all the systems we're looking at; I think ideally the API should only include things we surely know how to use.

AdamGS · 2026-04-20T13:15:35Z

+
+This keeps the public API stable while allowing Vortex to remain storage-aware internally.
+
+## Pushdown and Host Functions


This seems like a key issue here; it might be worth pulling up before the section about specific engines, so it's more natural to expand on the differences in each one.

AdamGS · 2026-04-20T13:16:24Z

+### Predicate Model
+
+The scan request should not treat all filters as equally pushdown-safe.
+
+Conceptually:
+
+```rust
+pub struct Predicate {
+    pub exact: Option<Expression>,
+    pub residual: Option<Expression>,
+}
+```
+
+Meaning:
+
+- `exact`: safe for Vortex to apply fully
+- `residual`: must still be evaluated by the host after scan
+
+In practice an engine adapter may derive these from its own expression IR.


Is this part of the interface or just a general comment about predicate pushdown?
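
To illustrate the adapter step mentioned in the hunk ("an engine adapter may derive these from its own expression IR"), here is a sketch that splits a conjunction into `exact` and `residual` parts. The `Expression` enum and the `is_pushdown_safe` rule are toy assumptions invented for this example; only the `Predicate` field names come from the RFC.

```rust
/// Toy expression IR, assumed for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum Expression {
    Column(String),
    Literal(i64),
    Gt(Box<Expression>, Box<Expression>),
    HostFn(String, Vec<Expression>),
    And(Box<Expression>, Box<Expression>),
}

/// Same shape as the RFC's conceptual `Predicate`.
#[derive(Debug, Default, PartialEq)]
pub struct Predicate {
    pub exact: Option<Expression>,
    pub residual: Option<Expression>,
}

/// Assumed rule: anything containing an opaque host function is not safe.
fn is_pushdown_safe(expr: &Expression) -> bool {
    match expr {
        Expression::Column(_) | Expression::Literal(_) => true,
        Expression::Gt(l, r) | Expression::And(l, r) => {
            is_pushdown_safe(l) && is_pushdown_safe(r)
        }
        Expression::HostFn(..) => false,
    }
}

/// ANDs `next` onto an optional accumulated expression.
fn conjoin(acc: Option<Expression>, next: Expression) -> Option<Expression> {
    Some(match acc {
        Some(prev) => Expression::And(Box::new(prev), Box::new(next)),
        None => next,
    })
}

/// Walks top-level AND nodes and routes each conjunct to one side.
fn push_conjunct(expr: Expression, out: &mut Predicate) {
    match expr {
        Expression::And(l, r) => {
            push_conjunct(*l, out);
            push_conjunct(*r, out);
        }
        e if is_pushdown_safe(&e) => out.exact = conjoin(out.exact.take(), e),
        e => out.residual = conjoin(out.residual.take(), e),
    }
}

pub fn split_predicate(filter: Expression) -> Predicate {
    let mut out = Predicate::default();
    push_conjunct(filter, &mut out);
    out
}
```

Splitting only at top-level `AND` keeps the result equivalent to the original filter: the host evaluates `residual` over rows that already satisfy `exact`.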

AdamGS · 2026-04-20T13:17:24Z

+Host-specific functions should be wrapped as registered Vortex functions with extra semantic
+metadata.
+
+```rust


Is this per-function or per-host? Do we expect host implementation to wrap every function we want to push down?
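
A per-function reading of "registered Vortex functions with extra semantic metadata" could look like the sketch below. Everything in it is hypothetical: the RFC's own code block is not visible in this hunk, so `FunctionSemantics`, its fields, and `HostFunctionRegistry` are invented here purely to make the question concrete.

```rust
use std::collections::HashMap;

/// Hypothetical semantic metadata attached to one host function.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct FunctionSemantics {
    pub deterministic: bool,
    pub null_propagating: bool,
}

/// Hypothetical registry keyed by function name, i.e. one entry per function.
#[derive(Default)]
pub struct HostFunctionRegistry {
    by_name: HashMap<String, FunctionSemantics>,
}

impl HostFunctionRegistry {
    pub fn register(&mut self, name: &str, semantics: FunctionSemantics) {
        self.by_name.insert(name.to_string(), semantics);
    }

    /// Assumed policy: only registered, deterministic functions are pushed down.
    pub fn can_push_down(&self, name: &str) -> bool {
        self.by_name.get(name).map_or(false, |s| s.deterministic)
    }
}
```

Under this reading an unregistered function is never pushed down, so a host would only need to wrap the functions it actually wants evaluated inside the scan.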

AdamGS · 2026-04-20T13:21:42Z

+### Probabilistic Prefetch
+
+The source should be free to rank candidate splits with a probabilistic score, for example:
+
+`priority ~= P(split still needed) * stall_saved / resource_cost`
+
+Where:
+
+- `P(split still needed)` depends on selectivity and limits, and may later incorporate
+  late-arriving filter information if the Scan API grows that extension
+- `stall_saved` estimates the latency hidden by early I/O
+- `resource_cost` estimates bytes, memory pressure, and decode work
+
+This is not a host-engine concern. It is a scan-source concern, constrained by host budgets.


Can a host reject a prefetch request?
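
The scoring formula in the hunk, together with one possible answer to this question, can be sketched as follows. The `Candidate` fields map onto the three terms of the formula; the `host_admits` callback is a hypothetical veto hook and is not something the RFC defines.

```rust
/// One candidate split with the three inputs to the RFC's score.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub split_id: u32,
    /// P(split still needed), in [0, 1].
    pub p_needed: f64,
    /// Estimated latency hidden by issuing the I/O early.
    pub stall_saved_ms: f64,
    /// Estimated cost in bytes, memory pressure, and decode work (unitless here).
    pub resource_cost: f64,
}

impl Candidate {
    /// priority ~= P(split still needed) * stall_saved / resource_cost
    pub fn priority(&self) -> f64 {
        // Guard against division by zero for a free split.
        self.p_needed * self.stall_saved_ms / self.resource_cost.max(f64::EPSILON)
    }
}

/// Ranks candidates by descending priority, then lets the host veto each one.
pub fn plan_prefetch(
    mut candidates: Vec<Candidate>,
    mut host_admits: impl FnMut(&Candidate) -> bool,
) -> Vec<u32> {
    candidates.sort_by(|a, b| b.priority().total_cmp(&a.priority()));
    candidates
        .into_iter()
        .filter(|c| host_admits(c))
        .map(|c| c.split_id)
        .collect()
}
```

In this shape the source owns the ranking, which matches "a scan-source concern", while the host keeps the final say on each request, which is one way to read "constrained by host budgets".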

Scan API

83863ca

Signed-off-by: Nicholas Gates <nick@nickgates.com>

robert3005 reviewed Apr 8, 2026


0ax1 reviewed Apr 9, 2026


gatesn and others added 2 commits April 12, 2026 11:09

Blobs

5c09c8b

Signed-off-by: Nicholas Gates <nick@nickgates.com>

Merge branch 'develop' into ngates/scan-api

868958c

gatesn deployed to rfc-preview April 16, 2026 15:53 — with GitHub Actions View deployment

Move RFC to rfcs/ directory

80a2335

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gatesn deployed to rfc-preview April 16, 2026 15:54 — with GitHub Actions View deployment

gatesn changed the title ~~Scan API and Engine Integrations~~ RFC 0034: Scan API and Engine Integrations Apr 16, 2026

gatesn changed the title ~~RFC 0034: Scan API and Engine Integrations~~ Scan API and Engine Integrations Apr 16, 2026

Update author format in Scan API RFC

2aab5cd

Signed-off-by: Nicholas Gates <gatesn@users.noreply.github.com>

gatesn deployed to rfc-preview April 16, 2026 17:58 — with GitHub Actions View deployment

AdamGS reviewed Apr 20, 2026

