
Add PersistentProgramCache (sqlite + filestream backends) #1912

Open
cpcloud wants to merge 3 commits into NVIDIA:main from cpcloud:persistent-program-cache-178

Conversation

@cpcloud
Contributor

@cpcloud cpcloud commented Apr 14, 2026

Summary

  • Convert cuda.core.utils from a module to a package; expose cache APIs lazily via __getattr__ so from cuda.core.utils import StridedMemoryView stays lightweight.
  • Add ProgramCacheResource ABC with bytes | str keys, context manager, pickle-safety warning, and rejection of path-backed ObjectCode at write time.
  • Add make_program_cache_key() — blake2b(32) digest with backend-specific gates that mirror Program/Linker:
    • Versions: cuda-core, NVRTC (c++), libNVVM lib+IR (nvvm), linker backend+version (ptx); driver only on the cuLink path.
    • Validates code_type/target_type against Program.compile's SUPPORTED_TARGETS; rejects bytes-like code for non-NVVM and extra_sources for non-NVVM.
    • NVRTC side-effect (create_pch, time, fdevice_time_trace) and external-content (include_path, pre_include, pch, use_pch, pch_dir) options require extra_digest; NVVM use_libdevice=True likewise.
    • PTX (Linker) options pass through per-field gates that match _prepare_nvjitlink_options / _prepare_driver_options; ptxas_options canonicalised across str/list/tuple/empty shapes; driver-linker hard rejections (time, ptxas_options, split_compile) raise at key time; ftz/prec_div/prec_sqrt/fma collapse under driver linker.
    • Failed env probes mix the exception class name into a *_probe_failed label so broken environments never collide with working ones, while staying stable across processes and repeated calls.
  • Add SQLiteProgramCache — single-file sqlite3 (WAL + autocommit), LRU eviction, optional size cap, wal_checkpoint(TRUNCATE) + VACUUM after evictions so the cap bounds real on-disk usage. __contains__ is read-only; __len__ validates and prunes corrupt rows. threading.RLock serialises connection use. Schema-mismatch on open drops tables and rebuilds; corrupt / non-SQLite files reinitialise empty; OperationalError (lock/busy) propagates without nuking the file (and closes the partial connection).
  • Add FileStreamProgramCache — multi-process-safe writes via tmp + os.replace. Hash-based filenames so arbitrary-length keys don't overflow filesystem limits. Reader pruning, clear(), and _enforce_size_cap are all stat-guarded (snapshot (ino, size, mtime_ns), refuse unlink on mismatch) so a concurrent writer's os.replace is preserved. Stale temp files swept on open; live temps count toward the size cap. Windows ERROR_SHARING_VIOLATION/ERROR_LOCK_VIOLATION on os.replace are retried with bounded backoff (~185ms) before being treated as a non-fatal cache miss; other PermissionError and all POSIX failures propagate. __len__ also rejects stored_key/path mismatch.
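
The digest construction described above can be sketched as follows. The field names and the JSON canonicalisation are illustrative stand-ins for the PR's per-field gates, not its exact encoding:

```python
import hashlib
import json


def sketch_cache_key(code: str, code_type: str, options: dict, versions: dict) -> bytes:
    """Illustrative 32-byte blake2b key: hash a canonical encoding of every
    input that affects compiler output, plus toolchain versions, so any
    change to either yields a fresh key."""
    h = hashlib.blake2b(digest_size=32)
    payload = {
        "code": code,
        "code_type": code_type.lower(),  # lower-cased, as the PR describes
        "options": options,
        "versions": versions,
    }
    # sort_keys makes the encoding order-independent and stable across runs.
    h.update(json.dumps(payload, sort_keys=True).encode())
    return h.digest()
```

Because the versions feed the digest, a toolchain upgrade invalidates old entries instead of silently reusing them.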

Program.compile(cache=...) integration is out of scope (tracked by #176/#179).

Test plan

  • 177 cache tests — single-process CRUD; LRU/size-cap (logical and on-disk); corruption + __len__ pruning; schema-mismatch table-DROP; threaded SQLite (4 writers + 4 readers × 200 ops); cross-process FileStream stress (writer/reader race exercising the stat-guard prune; clear/eviction race injection via generator cleanup); Windows vs POSIX PermissionError narrowing (winerror 32/33 swallow + retry, others propagate; partial-conn close on OperationalError); lazy-import subprocess test; _SUPPORTED_TARGETS_BY_CODE_TYPE parity test that parses _program.pyx via tokenize + ast.literal_eval.
  • End-to-end: real CUDA C++ compile → store in cache → reopen → get_kernel on the deserialised ObjectCode, parametrized over both backends.
  • CI: clean across all platforms.
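
The parity test's extraction technique can be sketched like this. The brace-counting slice is a simplification of the tokenize-based approach (it would be fooled by braces inside string literals), and the excerpt stands in for _program.pyx:

```python
import ast

SOURCE = '''\
# stand-in excerpt; a real .pyx file cannot be ast.parse'd wholesale
_SUPPORTED_TARGETS_BY_CODE_TYPE = {
    "c++": ("ptx", "cubin", "ltoir"),
    "ptx": ("cubin",),
}
cdef class Program: pass
'''


def extract_dict_literal(source: str, name: str):
    """Slice out the `{...}` assigned to `name` by brace counting and parse
    it with ast.literal_eval, so no Cython compilation or import is needed."""
    start = source.index(name)
    brace = source.index("{", start)
    depth, i = 0, brace
    while True:
        if source[i] == "{":
            depth += 1
        elif source[i] == "}":
            depth -= 1
            if depth == 0:
                break
        i += 1
    return ast.literal_eval(source[brace : i + 1])


targets = extract_dict_literal(SOURCE, "_SUPPORTED_TARGETS_BY_CODE_TYPE")
```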

Closes #178

@cpcloud cpcloud added this to the cuda.core v1.0.0 milestone Apr 14, 2026
@cpcloud cpcloud added P0 High priority - Must do! feature New feature or request cuda.core Everything related to the cuda.core module labels Apr 14, 2026
@cpcloud cpcloud self-assigned this Apr 14, 2026
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from de57bd8 to ac38a68 on April 14, 2026 at 22:15

@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch 23 times, most recently from f1ae40e to b27ed2c on April 19, 2026 at 13:28
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch 2 times, most recently from 4407cef to c534df1 on April 20, 2026 at 11:54
Convert cuda.core.utils to a package and add persistent, on-disk caches
for compiled ObjectCode produced by Program.compile.

Public API (cuda.core.utils):
  * ProgramCacheResource  -- abstract bytes|str -> ObjectCode mapping
    with context manager and pickle-safety warning. Path-backed
    ObjectCode is rejected at write time (would store only the path).
  * SQLiteProgramCache    -- single-file sqlite3 backend (WAL mode,
    autocommit) with LRU eviction against an optional size cap. A
    threading.RLock serialises connection use so one cache object is
    safe across threads. wal_checkpoint(TRUNCATE) + VACUUM run after
    evictions so the size cap bounds real on-disk usage. __contains__
    is read-only -- it does not bump LRU. __len__ counts only entries
    that survive validation and prunes corrupt rows. Schema-version
    mismatch on open drops the tables and rebuilds; corrupt /
    non-SQLite files are detected and the cache reinitialises empty.
    Transient OperationalError ("database is locked") propagates
    without nuking the file (and closes the partial connection).
  * FileStreamProgramCache -- directory of atomically-written entries
    (tmp + os.replace) safe across concurrent processes. On-disk
    filenames are blake2b(32) hashes of the key so arbitrary-length
    keys never overflow filesystem name limits. Reader pruning is
    stat-guarded: only delete a corrupt-looking file if its inode/
    size/mtime have not changed since the read, so a concurrent
    os.replace by a writer is preserved. clear() and _enforce_size_cap
    use the same stat guard. Stale temp files (older than 1 hour) are
    swept on open and during eviction; live temp files count toward
    the size cap. Windows ERROR_SHARING_VIOLATION (32) and
    ERROR_LOCK_VIOLATION (33) on os.replace are retried with bounded
    backoff (~185ms) before being treated as a non-fatal cache miss;
    other PermissionErrors and all POSIX failures propagate. __len__
    matches __getitem__ semantics (rejects schema/key/value mismatch).
  * make_program_cache_key -- stable 32-byte blake2b key over code,
    code_type, ProgramOptions, target_type, name expressions, cuda
    core/NVRTC versions, NVVM lib+IR version, linker backend+version
    for PTX inputs (driver version included only on the cuLink path).
    Backend-specific gates mirror Program/Linker:
      * code_type lower-cased to match Program.__init__.
      * code_type/target_type combination validated against Program's
        SUPPORTED_TARGETS matrix.
      * NVRTC side-effect options (create_pch, time, fdevice_time_trace)
        and external-content options (include_path, pre_include, pch,
        use_pch, pch_dir) require an extra_digest from the caller. The
        per-field set/unset predicate (_option_is_set) mirrors the
        compiler's emission gates; collections.abc.Sequence is the
        is_sequence check, matching _prepare_nvrtc_options_impl.
      * NVVM use_libdevice=True requires extra_digest because libdevice
        bitcode comes from the active toolkit. extra_sources is
        rejected for non-NVVM. Bytes-like ``code`` is rejected for
        non-NVVM (Program() requires str there).
      * PTX (Linker) input options are normalised through per-field
        gates that match _prepare_nvjitlink_options /
        _prepare_driver_options. ftz/prec_div/prec_sqrt/fma collapse
        to a sentinel under the driver linker (it ignores them).
        ptxas_options canonicalises across str/list/tuple/empty shapes.
        The driver linker's hard rejections (time, ptxas_options,
        split_compile) raise at key time.
      * name_expressions are gated on backend == "nvrtc"; PTX/NVVM
        ignore them, matching Program.compile.
  * Failed environment probes mix the exception class name into a
    *_probe_failed label so broken environments never collide with
    working ones, while staying stable across processes and across
    repeated calls within a process.
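
The tmp + os.replace write and the stat-guarded delete can be sketched as follows. Function names are illustrative, and the stat-then-unlink window that remains here exists in any stat-guard scheme:

```python
import os
import tempfile
from pathlib import Path


def atomic_write(path: Path, data: bytes) -> None:
    """Write via a temp file in the same directory, then os.replace(), so
    readers only ever observe a complete file (atomic on POSIX and Windows)."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)  # atomic rename within one filesystem
    except BaseException:
        os.unlink(tmp)
        raise


def guarded_unlink(path: Path, snapshot: os.stat_result) -> bool:
    """Delete only if the file is unchanged since `snapshot`; a concurrent
    writer's os.replace swaps in a new inode, which this refuses to remove."""
    try:
        st = path.stat()
        if (st.st_ino, st.st_size, st.st_mtime_ns) != (
            snapshot.st_ino,
            snapshot.st_size,
            snapshot.st_mtime_ns,
        ):
            return False
        path.unlink()
        return True
    except FileNotFoundError:
        return False
```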

Lazy import: ``from cuda.core.utils import StridedMemoryView`` does
NOT pull in the cache backends. The cache classes are exposed via
module __getattr__. sqlite3 is imported lazily inside
SQLiteProgramCache.__init__ so the package is usable on interpreters
built without libsqlite3.
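
The lazy-exposure mechanism is module-level ``__getattr__`` (PEP 562). A self-contained demonstration with a throwaway package (names are illustrative, not cuda.core's):

```python
import sys
import tempfile
import textwrap
from pathlib import Path

# Build a disposable package: explicit light imports stay cheap; the heavy
# submodule loads only when one of its names is first requested.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_utils"
pkg.mkdir()
(pkg / "_heavy.py").write_text("HeavyCache = 'loaded heavy backend'\n")
(pkg / "__init__.py").write_text(textwrap.dedent("""
    __all__ = ["light", "HeavyCache"]
    light = "cheap symbol"

    def __getattr__(name):
        if name == "HeavyCache":
            from . import _heavy          # imported only on demand
            return _heavy.HeavyCache
        raise AttributeError(name)
"""))

sys.path.insert(0, str(root))
import demo_utils

assert "demo_utils._heavy" not in sys.modules          # backend not loaded yet
assert demo_utils.HeavyCache == "loaded heavy backend"  # triggers the import
assert "demo_utils._heavy" in sys.modules
```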

Tests: 177 cache tests covering single-process CRUD, LRU/size-cap
(logical and on-disk, including stat-guarded race scenarios),
corruption + __len__ pruning, schema-mismatch table-DROP, threaded
SQLite, cross-process FileStream stress (writer/reader race exercising
the stat-guard prune; clear/eviction race injection via generator
cleanup), Windows vs POSIX PermissionError narrowing (winerror 32/33
swallow + retry, others propagate; partial-conn close on
OperationalError), lazy-import subprocess test, an end-to-end test
that compiles a real CUDA C++ kernel, stores the ObjectCode, reopens
the cache, and calls get_kernel on the deserialised copy, and a test
that parses _program.pyx via tokenize + ast.literal_eval to assert
the cache's _SUPPORTED_TARGETS_BY_CODE_TYPE matches Program.compile's
matrix. Public API is documented in cuda_core/docs/source/api.rst.
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from 2dc5c8f to 5da111b on April 20, 2026 at 12:18
@cpcloud cpcloud requested review from leofang and rwgk April 20, 2026 13:21
@rwgk
Contributor

rwgk commented Apr 20, 2026

Generated with the help of Cursor GPT-5.4 Extra High Fast


High: make_program_cache_key() misses implicit source-directory header dependencies

make_program_cache_key() only forces extra_digest for explicit include/PCH options in cuda_core/cuda/core/utils/_program_cache.py:393 and cuda_core/cuda/core/utils/_program_cache.py:592, but NVRTC also implicitly searches the source file's directory unless no_source_include is set in cuda_core/cuda/core/_program.pyx:1001.

Program passes options.name straight to nvrtcCreateProgram() in cuda_core/cuda/core/_program.pyx:635, while the key builder only hashes that path string in cuda_core/cuda/core/utils/_program_cache.py:778. That means a workflow like options.name="/path/to/kernel.cu" plus #include "local.h" can reuse a stale cached ObjectCode after local.h changes.

The new tests cover explicit include/PCH knobs, but not this default source-directory include path (cuda_core/tests/test_program_cache.py:765).
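
One possible mitigation, sketched here with an illustrative helper (not code from the PR): fold the source directory's headers into the caller-supplied extra_digest, so edits to implicitly included files change the key:

```python
import hashlib
from pathlib import Path


def source_dir_digest(source_path: str) -> bytes:
    """Illustrative extra_digest input: hash the names and contents of
    headers next to the source file, since NVRTC implicitly searches that
    directory unless no_source_include is set. Only *.h is scanned here
    for brevity; a real version would cover all header extensions."""
    h = hashlib.blake2b(digest_size=32)
    for header in sorted(Path(source_path).parent.glob("*.h")):
        h.update(header.name.encode())
        h.update(header.read_bytes())
    return h.digest()
```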

Medium: FileStreamProgramCache._enforce_size_cap() can over-evict under concurrent capped writers

After the re-stat at cuda_core/cuda/core/utils/_program_cache.py:1515, a concurrent deleter can remove the candidate before path.unlink(). That FileNotFoundError is suppressed at cuda_core/cuda/core/utils/_program_cache.py:1530, but total is not adjusted, so eviction continues and can delete newer entries unnecessarily.

For a backend explicitly documented for multi-process use, that turns ordinary contention into avoidable cache data loss. The current multiprocess coverage exercises concurrent writes and prune races, but not max_size_bytes under concurrency (cuda_core/tests/test_program_cache_multiprocess.py).

Reduced simulation I used locally:

```python
import time
from pathlib import Path

from cuda.core._module import ObjectCode
from cuda.core.utils import FileStreamProgramCache

cache = FileStreamProgramCache("/tmp/cuda_cache_review_race", max_size_bytes=1000)
cache[b"old"] = ObjectCode._init(b"a" * 600, "cubin", name="old")
time.sleep(0.01)

old_path = cache._path_for_key(b"old")
orig_unlink = Path.unlink
state = {"done": False}


def flaky_unlink(self, *args, **kwargs):
    if self == old_path and not state["done"]:
        state["done"] = True
        # Simulate another process deleting the file after stat() but before
        # _enforce_size_cap() updates its bookkeeping.
        orig_unlink(self, *args, **kwargs)
        raise FileNotFoundError(self)
    return orig_unlink(self, *args, **kwargs)


Path.unlink = flaky_unlink
try:
    cache[b"new"] = ObjectCode._init(b"b" * 600, "cubin", name="new")
finally:
    Path.unlink = orig_unlink

remaining = [key for key in (b"old", b"new") if cache.get(key) is not None]
print(remaining)  # []
```

That produced [] for me: once the first deletion race is swallowed without decrementing total, the loop keeps evicting and drops the fresh entry too.
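
A sketch of the suggested bookkeeping fix (the entry shape and function name are hypothetical, and the stat guard is omitted for brevity): subtract the candidate's snapshot size from the running total even when the unlink loses the race, since the bytes are gone either way:

```python
from pathlib import Path


def enforce_size_cap(entries, max_size_bytes: int) -> None:
    """Evict oldest-first until under the cap. `entries` is a hypothetical
    [(path, stat_snapshot), ...] list, oldest first. The key point: when a
    candidate vanishes before unlink, its size is still subtracted, so
    eviction stops instead of deleting newer entries unnecessarily."""
    total = sum(snap.st_size for _, snap in entries)
    for path, snap in entries:
        if total <= max_size_bytes:
            break
        try:
            path.unlink()
        except FileNotFoundError:
            pass  # another process freed the space; fall through to decrement
        total -= snap.st_size  # decrement whether or not *we* did the unlink
```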

Low: from cuda.core.utils import * now eagerly imports the cache stack

The package conversion keeps explicit imports like from cuda.core.utils import StridedMemoryView lightweight, but from cuda.core.utils import * walks __all__, resolves the lazy cache symbols, and imports _program_cache (cuda_core/cuda/core/utils/__init__.py:10, cuda_core/cuda/core/utils/__init__.py:32).

I verified that star-import now loads cuda.core.utils._program_cache. That said, this only affects import *, which is already discouraged. I think a short comment explaining that the laziness guarantee is intended for explicit imports, not star-import, seems sufficient here.



Development

Successfully merging this pull request may close these issues.

Add cuda.core.utils.PersistentProgramCache
