
Scan API and Engine Integrations #44

Open
gatesn wants to merge 5 commits into develop from ngates/scan-api


Contributor

@gatesn commented Apr 8, 2026

This RFC looks at how we can expose deeper integration with query engine internals such as scheduling, threading models, and buffer pools.

Signed-off-by: Nicholas Gates <nick@nickgates.com>
Comment thread rfcs/0034-scan-api.md

- The Scan API is not itself a full relational query engine.
- `LayoutReader` should not grow unknown-cardinality operator semantics.
- Vortex should not require a specific Rust async runtime such as Tokio.

It's weird that we would go through all of this and still assume Tokio, but I haven't read all of it yet.

Contributor Author

We already don't assume Tokio; not requiring it just continues to be an explicit goal.


The double negation here implies the opposite? You want the goal to be that the runtime doesn't assume Tokio? Maybe I am reading too much into random AI-generated strings.
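
To make the non-goal concrete: one way to avoid depending on a specific async runtime is to have the host hand Vortex a small spawning interface. The sketch below is an assumption for illustration only; `Executor`, `BoxFuture`, `ThreadPerTask`, and `block_on` are hypothetical names, not part of the RFC or of Vortex. It uses only the standard library, with no Tokio involved.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Hypothetical boxed task type handed from the scan to the host.
pub type BoxFuture = Pin<Box<dyn Future<Output = ()> + Send + 'static>>;

/// Hypothetical host-provided executor; the scan never names a runtime.
pub trait Executor: Send + Sync {
    fn spawn(&self, fut: BoxFuture);
}

/// Wakes a parked thread when the future can make progress.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal single-future driver built only on the standard library.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            // Sleep until `wake` unparks this thread, then poll again.
            Poll::Pending => thread::park(),
        }
    }
}

/// Toy host executor: one OS thread per task. A real host would plug in
/// its own scheduler here (Tokio, a thread pool, or an engine's drivers).
pub struct ThreadPerTask;

impl Executor for ThreadPerTask {
    fn spawn(&self, fut: BoxFuture) {
        thread::spawn(move || block_on(fut));
    }
}
```

A Tokio-based host could implement the same trait by forwarding to its own runtime handle, which is the sense in which the scan "does not require" Tokio.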

Comment thread rfcs/0034-scan-api.md

- the host may provide a CPU scheduler
- Vortex may use it for bounded split-local CPU work
- Vortex must not assume ownership of the whole query runtime

What does this statement mean in practice? I think there's intent behind it, but I fail to understand what it implies concretely.


I'd guess it is mainly about resource use, e.g. spawning threads, but also Unix process ownership, e.g. Vortex should never crash the host. @gatesn, correct me if you had something else in mind.


> Vortex should never crash the host.

Error handling might deserve a small section in this PR. I briefly talked about this with @myrrc, and I think we'll need a panic handler (which the host can perhaps configure) to ensure that we never crash a host.
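
As a rough illustration of the panic-containment idea, the boundary between host and scan could convert panics into ordinary errors with `std::panic::catch_unwind`. This is a minimal sketch under assumptions: `ScanError` and `run_guarded` are hypothetical names, and the approach only works when the crate is built with unwinding panics (it cannot intercept `panic = "abort"`).

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical error a host would receive instead of an unwinding panic.
#[derive(Debug, PartialEq)]
pub enum ScanError {
    Panicked(String),
}

/// Runs `work` and turns any panic into a `ScanError` for the host.
pub fn run_guarded<T>(work: impl FnOnce() -> T) -> Result<T, ScanError> {
    // AssertUnwindSafe: the caller promises not to reuse state that a
    // panic may have left half-updated.
    panic::catch_unwind(AssertUnwindSafe(work)).map_err(|payload| {
        // Panic payloads are usually `&'static str` or `String`.
        let message = if let Some(s) = payload.downcast_ref::<&str>() {
            (*s).to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic payload".to_string()
        };
        ScanError::Panicked(message)
    })
}
```

The default panic hook still prints to stderr; a host that wants to configure reporting could additionally install its own hook with `std::panic::set_hook`.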

Comment thread rfcs/0034-scan-api.md
- split lookahead policy
- efficient materialization of output batches

### What `Partitioning` Means

Words are hard. "Partitioning" usually means some arrangement of data, which is not what this is about. But maybe this is Partitioning and the other thing is Arrangement.

Comment thread rfcs/0034-scan-api.md

Correctness is more important than maximal pushdown.

## Ordering, Limits, and Future Dynamic Filters

You should mention Partitioning here (or, as I redefined it, Arrangement). It's a superset of ordering.

@0ax1 left a comment

Just a thought: it may be worthwhile clauding some ASCII diagrams to illustrate some of the aspects.

gatesn and others added 2 commits April 12, 2026 11:09
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gatesn changed the title from "Scan API and Engine Integrations" to "RFC 0034: Scan API and Engine Integrations" on Apr 16, 2026
@gatesn changed the title from "RFC 0034: Scan API and Engine Integrations" to "Scan API and Engine Integrations" on Apr 16, 2026
Signed-off-by: Nicholas Gates <gatesn@users.noreply.github.com>
Comment thread rfcs/0034-scan-api.md

This layer belongs in `vortex-scan`.

### 3. Operator Layer
Collaborator

This seems like the only place where this part is called an "operator", maybe "Host Layer" is more consistent with the rest of the doc?

Comment thread rfcs/0034-scan-api.md

The same Scan API should support all of these, but not all with the same level of host control.

### DataFusion
Collaborator

We currently have 2 DataFusion code paths; I think it's worth expanding on the difference here.

Comment thread rfcs/0034-scan-api.md
- moderate
- simpler than Velox

## Why `Partition` and `Split` Both Exist
Collaborator

Just editing-wise: if it's this important, should it be earlier on?

Comment thread rfcs/0034-scan-api.md

That keeps the current implementation direction intact.

## `VortexTableScan`
Collaborator

What is this? Aside from "a scan of a vortex table"?

Comment thread rfcs/0034-scan-api.md
Comment on lines +158 to +160
### `Partition`

A `Partition` is the unit of work exposed by the Scan API to a host engine.
Collaborator

Is a partition completely opaque? What does it return?

Comment thread rfcs/0034-scan-api.md

This lines up with:

- DataFusion task-per-partition execution
Collaborator

The DataFusion FileSource API is moving towards a lower-level, morsel-like API, where partitions return a collection of smaller splits that can end up being stolen by other partitions.

Comment thread rfcs/0034-scan-api.md
Comment on lines +381 to +387
pub struct ScanBudget {
    pub max_active_splits: usize,
    pub max_prefetch_splits: usize,
    pub max_prefetch_bytes: u64,
    pub max_inflight_reads: usize,
    pub max_buffered_batches: usize,
}
Collaborator

@AdamGS Apr 20, 2026

If we expose lower-level pieces, why do we need a budget? These are 5 extra knobs that I think could more clearly be part of the API, allowing each host to navigate them in whichever way makes sense for that system/language/paradigm.
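
For context on how such a budget might be consumed, here is a minimal sketch of an admission check against the quoted `ScanBudget`. The struct fields are copied from the RFC snippet; `ScanUsage` and `admits_prefetch` are hypothetical and only illustrate one possible reading of the knobs.

```rust
/// Budget fields as quoted from the RFC snippet under discussion.
#[derive(Debug, Clone, Copy)]
pub struct ScanBudget {
    pub max_active_splits: usize,
    pub max_prefetch_splits: usize,
    pub max_prefetch_bytes: u64,
    pub max_inflight_reads: usize,
    pub max_buffered_batches: usize,
}

/// Hypothetical snapshot of what the scan is currently consuming.
#[derive(Debug, Clone, Copy, Default)]
pub struct ScanUsage {
    pub prefetch_splits: usize,
    pub prefetch_bytes: u64,
    pub inflight_reads: usize,
}

impl ScanBudget {
    /// Returns true if prefetching one more split of `split_bytes`
    /// keeps every prefetch-related limit satisfied.
    pub fn admits_prefetch(&self, usage: &ScanUsage, split_bytes: u64) -> bool {
        usage.prefetch_splits < self.max_prefetch_splits
            && usage.inflight_reads < self.max_inflight_reads
            && usage.prefetch_bytes.saturating_add(split_bytes) <= self.max_prefetch_bytes
    }
}
```

The alternative raised in the comment would move this decision to the host: instead of passing limits in, the host would be asked before each prefetch and answer with its own policy.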

Comment thread rfcs/0034-scan-api.md
pub offset: u64,
pub len: u64,
pub alignment: usize,
pub priority: IoPriority,
Collaborator

What's `priority` here? How is it different from `intent`? I don't know how I/O looks in all the systems we're considering; ideally, the API should only include things we definitely know how to use.

Comment thread rfcs/0034-scan-api.md

This keeps the public API stable while allowing Vortex to remain storage-aware internally.

## Pushdown and Host Functions
Collaborator

This seems like a key issue here. It might be worth pulling up before the section about specific engines, so it's more natural to expand on the differences in each one.

Comment thread rfcs/0034-scan-api.md
Comment on lines +931 to +949
### Predicate Model

The scan request should not treat all filters as equally pushdown-safe.

Conceptually:

```rust
pub struct Predicate {
    pub exact: Option<Expression>,
    pub residual: Option<Expression>,
}
```

Meaning:

- `exact`: safe for Vortex to apply fully
- `residual`: must still be evaluated by the host after scan

In practice an engine adapter may derive these from its own expression IR.
Collaborator

Is this part of the interface or just a general comment about predicate pushdown?
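
To illustrate what "an engine adapter may derive these from its own expression IR" could look like, the sketch below splits a conjunctive filter into `exact` and `residual` parts. It is an assumption-laden toy: the `Expression` enum here is a stand-in (the real type is not shown in this thread), and `split_filter`, `conjuncts`, and `conjoin` are hypothetical helpers.

```rust
/// Toy stand-in for the RFC's `Expression`; the real type is not shown here.
#[derive(Debug, Clone, PartialEq)]
pub enum Expression {
    /// A leaf predicate, identified by the function it calls.
    Call(String),
    And(Box<Expression>, Box<Expression>),
}

/// Shape quoted from the RFC snippet.
#[derive(Debug, PartialEq)]
pub struct Predicate {
    pub exact: Option<Expression>,
    pub residual: Option<Expression>,
}

/// Flattens nested `And` nodes into a list of conjuncts, left to right.
fn conjuncts(expr: Expression, out: &mut Vec<Expression>) {
    match expr {
        Expression::And(left, right) => {
            conjuncts(*left, out);
            conjuncts(*right, out);
        }
        leaf => out.push(leaf),
    }
}

/// Rebuilds a left-nested conjunction, or `None` for an empty list.
fn conjoin(parts: Vec<Expression>) -> Option<Expression> {
    parts
        .into_iter()
        .reduce(|acc, e| Expression::And(Box::new(acc), Box::new(e)))
}

/// Splits a conjunctive filter: conjuncts the scan can apply fully go to
/// `exact`; everything else stays with the host as `residual`.
pub fn split_filter(filter: Expression, is_exact: impl Fn(&str) -> bool) -> Predicate {
    let mut all = Vec::new();
    conjuncts(filter, &mut all);
    let (exact, residual): (Vec<_>, Vec<_>) = all.into_iter().partition(|e| match e {
        Expression::Call(name) => is_exact(name.as_str()),
        // Not expected after flattening; treated as residual to stay conservative.
        Expression::And(..) => false,
    });
    Predicate {
        exact: conjoin(exact),
        residual: conjoin(residual),
    }
}
```

Splitting is only safe across `AND`; a disjunction containing a non-pushable branch would have to stay residual as a whole, in line with "correctness is more important than maximal pushdown".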

Comment thread rfcs/0034-scan-api.md
Host-specific functions should be wrapped as registered Vortex functions with extra semantic
metadata.

Collaborator

Is this per-function or per-host? Do we expect host implementation to wrap every function we want to push down?

Comment thread rfcs/0034-scan-api.md
Comment on lines +713 to +726
### Probabilistic Prefetch

The source should be free to rank candidate splits with a probabilistic score, for example:

`priority ~= P(split still needed) * stall_saved / resource_cost`

Where:

- `P(split still needed)` depends on selectivity and limits, and may later incorporate
late-arriving filter information if the Scan API grows that extension
- `stall_saved` estimates the latency hidden by early I/O
- `resource_cost` estimates bytes, memory pressure, and decode work

This is not a host-engine concern. It is a scan-source concern, constrained by host budgets.
Collaborator

Can a host reject a prefetch request?
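
A small worked example of the quoted scoring formula may help ground the discussion. The sketch below ranks candidate splits by `P(split still needed) * stall_saved / resource_cost`; the `SplitCandidate` type, its field names, and the units are assumptions made for illustration, not part of the RFC.

```rust
/// Hypothetical prefetch candidate with the three inputs of the formula.
#[derive(Debug, Clone, PartialEq)]
pub struct SplitCandidate {
    pub id: u32,
    /// Estimated probability that the split is still needed, in [0, 1].
    pub p_needed: f64,
    /// Estimated latency hidden by issuing the I/O early (milliseconds).
    pub stall_saved_ms: f64,
    /// Estimated cost in arbitrary units (bytes, memory, decode work).
    pub resource_cost: f64,
}

impl SplitCandidate {
    /// priority ~= P(split still needed) * stall_saved / resource_cost
    pub fn priority(&self) -> f64 {
        // Clamp the divisor so a zero-cost estimate cannot divide by zero.
        self.p_needed * self.stall_saved_ms / self.resource_cost.max(f64::EPSILON)
    }
}

/// Orders candidates from highest to lowest priority.
pub fn rank(mut candidates: Vec<SplitCandidate>) -> Vec<SplitCandidate> {
    candidates.sort_by(|a, b| b.priority().total_cmp(&a.priority()));
    candidates
}
```

With this shape, a split that is likely needed but cheap to wait for can rank below a less certain split that hides a long stall. Whether a host can veto an individual prefetch is a separate question; under the RFC's wording the ranking would simply operate within whatever budget the host has granted.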

4 participants