Convert cuda.core.utils to a package and add persistent, on-disk caches
for compiled ObjectCode produced by Program.compile.
Public API (cuda.core.utils):
* ProgramCacheResource -- abstract bytes|str -> ObjectCode mapping
with context manager and pickle-safety warning. Path-backed
ObjectCode is rejected at write time (would store only the path).
* SQLiteProgramCache -- single-file sqlite3 backend (WAL mode,
autocommit) with LRU eviction against an optional size cap. A
threading.RLock serialises connection use so one cache object is
safe across threads. wal_checkpoint(TRUNCATE) + VACUUM run after
evictions so the size cap bounds real on-disk usage. __contains__
is read-only -- it does not bump LRU. __len__ counts only entries
that survive validation and prunes corrupt rows. Schema-version
mismatch on open drops the tables and rebuilds; corrupt /
non-SQLite files are detected and the cache reinitialises empty.
Transient OperationalError ("database is locked") propagates
without nuking the file (and closes the partial connection).
* FileStreamProgramCache -- directory of atomically-written entries
(tmp + os.replace) safe across concurrent processes. On-disk
filenames are blake2b(32) hashes of the key so arbitrary-length
keys never overflow filesystem name limits. Reader pruning is
stat-guarded: only delete a corrupt-looking file if its inode/
size/mtime have not changed since the read, so a concurrent
os.replace by a writer is preserved. clear() and _enforce_size_cap
use the same stat guard. Stale temp files (older than 1 hour) are
swept on open and during eviction; live temp files count toward
the size cap. Windows ERROR_SHARING_VIOLATION (32) and
ERROR_LOCK_VIOLATION (33) on os.replace are retried with bounded
backoff (~185ms) before being treated as a non-fatal cache miss;
other PermissionErrors and all POSIX failures propagate. __len__
matches __getitem__ semantics (rejects schema/key/value mismatch).
* make_program_cache_key -- stable 32-byte blake2b key over code,
code_type, ProgramOptions, target_type, name expressions, cuda
core/NVRTC versions, NVVM lib+IR version, linker backend+version
for PTX inputs (driver version included only on the cuLink path).
Backend-specific gates mirror Program/Linker:
* code_type lower-cased to match Program_init.
* code_type/target_type combination validated against Program's
SUPPORTED_TARGETS matrix.
* NVRTC side-effect options (create_pch, time, fdevice_time_trace)
and external-content options (include_path, pre_include, pch,
use_pch, pch_dir) require an extra_digest from the caller. The
per-field set/unset predicate (_option_is_set) mirrors the
compiler's emission gates; collections.abc.Sequence is the
is_sequence check, matching _prepare_nvrtc_options_impl.
* NVVM use_libdevice=True requires extra_digest because libdevice
bitcode comes from the active toolkit. extra_sources is
rejected for non-NVVM. Bytes-like ``code`` is rejected for
non-NVVM (Program() requires str there).
* PTX (Linker) input options are normalised through per-field
gates that match _prepare_nvjitlink_options /
_prepare_driver_options. ftz/prec_div/prec_sqrt/fma collapse
to a sentinel under the driver linker (it ignores them).
ptxas_options canonicalises across str/list/tuple/empty shapes.
The driver linker's hard rejections (time, ptxas_options,
split_compile) raise at key time.
* name_expressions are gated on backend == "nvrtc"; PTX/NVVM
ignore them, matching Program.compile.
* Failed environment probes mix the exception class name into a
*_probe_failed label so broken environments never collide with
working ones, while staying stable across processes and across
repeated calls within a process.
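To make the key derivation above concrete, here is a minimal sketch, not the code in this change. It hashes a few fields into a 32-byte blake2b digest, length-prefixes each field so adjacent values cannot run together, lower-cases code_type, and canonicalises ptxas_options across str/list/tuple/empty shapes. The names sketch_cache_key, _canonical_ptxas_options and _update are hypothetical, and the field set is a small subset of what make_program_cache_key covers (no versions, no option gates, no validation).

```python
import hashlib


def _canonical_ptxas_options(value):
    """Map None, "", [] and () to an empty tuple, and a bare str to a 1-tuple."""
    if value is None or value == "":
        return ()
    if isinstance(value, str):
        return (value,)
    return tuple(value)


def _update(hasher, label, payload):
    # Length-prefix the label and the payload so ("ab", "c") and ("a", "bc")
    # can never produce the same byte stream.
    for part in (label.encode(), payload):
        hasher.update(len(part).to_bytes(8, "little"))
        hasher.update(part)


def sketch_cache_key(code, code_type, target_type, ptxas_options=None, extra_digest=None):
    """Hypothetical, heavily reduced stand-in for make_program_cache_key."""
    hasher = hashlib.blake2b(digest_size=32)
    _update(hasher, "code", code.encode())
    _update(hasher, "code_type", code_type.lower().encode())  # case-insensitive code_type
    _update(hasher, "target_type", target_type.encode())
    options = _canonical_ptxas_options(ptxas_options)
    _update(hasher, "ptxas_count", str(len(options)).encode())
    for option in options:
        _update(hasher, "ptxas_option", option.encode())
    if extra_digest is not None:
        # Caller-supplied digest standing in for external content (headers, PCH, ...).
        _update(hasher, "extra_digest", extra_digest)
    return hasher.digest()
```

With this sketch, sketch_cache_key(src, "C++", "cubin", ptxas_options="-v") and sketch_cache_key(src, "c++", "cubin", ptxas_options=["-v"]) produce the same key, while passing an extra_digest changes it.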
Lazy import: ``from cuda.core.utils import StridedMemoryView`` does
NOT pull in the cache backends. The cache classes are exposed via
module __getattr__. sqlite3 is imported lazily inside
SQLiteProgramCache.__init__ so the package is usable on interpreters
built without libsqlite3.
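The lazy exposure described above relies on module-level __getattr__ (PEP 562). The following is a minimal sketch of that pattern, not the package's actual __init__.py: demo_pkg, _LAZY_ATTRS and _lazy_getattr are hypothetical names, and the stdlib fractions module stands in for a heavy backend module. In a real package the function would simply be defined as __getattr__ at module scope.

```python
import importlib
import types

# Attribute name -> module that provides it (illustrative mapping only).
_LAZY_ATTRS = {"Fraction": "fractions"}

demo_pkg = types.ModuleType("demo_pkg")


def _lazy_getattr(name):
    # Invoked only when normal attribute lookup on the module fails.
    try:
        module_name = _LAZY_ATTRS[name]
    except KeyError:
        raise AttributeError(f"module 'demo_pkg' has no attribute {name!r}") from None
    value = getattr(importlib.import_module(module_name), name)
    setattr(demo_pkg, name, value)  # cache so later lookups bypass __getattr__
    return value


# Equivalent to writing "def __getattr__(name): ..." inside the package's __init__.py.
demo_pkg.__getattr__ = _lazy_getattr
```

Nothing is imported until demo_pkg.Fraction is first accessed; unknown names still raise AttributeError, so hasattr and from-imports behave as usual.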
Tests: 177 cache tests covering single-process CRUD, LRU/size-cap
(logical and on-disk, including stat-guarded race scenarios),
corruption + __len__ pruning, schema-mismatch table-DROP, threaded
SQLite, cross-process FileStream stress (writer/reader race exercising
the stat-guard prune; clear/eviction race injection via generator
cleanup), Windows vs POSIX PermissionError narrowing (winerror 32/33
swallow + retry, others propagate; partial-conn close on
OperationalError), lazy-import subprocess test, an end-to-end test
that compiles a real CUDA C++ kernel, stores the ObjectCode, reopens
the cache, and calls get_kernel on the deserialised copy, and a test
that parses _program.pyx via tokenize + ast.literal_eval to assert
the cache's _SUPPORTED_TARGETS_BY_CODE_TYPE matches Program.compile's
matrix. Public API is documented in cuda_core/docs/source/api.rst.
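The writer/reader race exercised by the FileStream tests rests on two mechanisms: entries are written to a temp file and published with os.replace, and a reader deletes a corrupt-looking file only if its stat snapshot is unchanged since the read. The sketch below illustrates both under simplified assumptions; entry_path, atomic_write, snapshot and prune_if_unchanged are hypothetical helpers, not the functions in this change, and the sketch omits the Windows retry and size-cap logic.

```python
import contextlib
import hashlib
import os
import tempfile
from pathlib import Path


def entry_path(root, key):
    # Fixed-length hex name derived from the key, so long keys cannot
    # exceed filesystem name limits.
    return Path(root) / hashlib.blake2b(key, digest_size=32).hexdigest()


def atomic_write(root, key, payload):
    target = entry_path(root, key)
    fd, tmp_name = tempfile.mkstemp(dir=root, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
        # Readers observe either the old file or the new one, never a partial write.
        os.replace(tmp_name, target)
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
    return target


def snapshot(path):
    st = os.stat(path)
    return (st.st_ino, st.st_size, st.st_mtime_ns)


def prune_if_unchanged(path, seen):
    """Unlink path only if it still matches the snapshot taken at read time."""
    try:
        if snapshot(path) != seen:
            return False  # a writer replaced the file in the meantime; keep it
        # A small window remains between the stat and the unlink; the guard
        # narrows the race rather than eliminating it.
        os.unlink(path)
        return True
    except FileNotFoundError:
        return False
```

A reader would call snapshot() right after reading an entry, and pass that tuple to prune_if_unchanged() only if the bytes it read failed validation.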
Summary
* Convert cuda.core.utils from a module to a package; expose cache APIs lazily via __getattr__ so ``from cuda.core.utils import StridedMemoryView`` stays lightweight.
* ProgramCacheResource ABC with bytes | str keys, context manager, pickle-safety warning, and rejection of path-backed ObjectCode at write time.
* make_program_cache_key() -- blake2b(32) digest with backend-specific gates that mirror Program/Linker:
  * Validates code_type/target_type against Program.compile's SUPPORTED_TARGETS; rejects bytes-like ``code`` for non-NVVM and extra_sources for non-NVVM.
  * NVRTC side-effect (create_pch, time, fdevice_time_trace) and external-content (include_path, pre_include, pch, use_pch, pch_dir) options require extra_digest; NVVM use_libdevice=True likewise.
  * PTX (Linker) options pass through per-field gates matching _prepare_nvjitlink_options / _prepare_driver_options; ptxas_options canonicalised across str/list/tuple/empty shapes; driver-linker hard rejections (time, ptxas_options, split_compile) raise at key time; ftz/prec_div/prec_sqrt/fma collapse under driver linker.
  * Failed environment probes mix the exception class name into a *_probe_failed label so broken environments never collide with working ones, while staying stable across processes and repeated calls.
* SQLiteProgramCache -- single-file sqlite3 (WAL + autocommit), LRU eviction, optional size cap, wal_checkpoint(TRUNCATE) + VACUUM after evictions so the cap bounds real on-disk usage. __contains__ is read-only; __len__ validates and prunes corrupt rows. threading.RLock serialises connection use. Schema-mismatch on open drops tables and rebuilds; corrupt / non-SQLite files reinitialise empty; OperationalError (lock/busy) propagates without nuking the file (and closes the partial connection).
* FileStreamProgramCache -- multi-process via tmp + os.replace. Hash-based filenames so arbitrary-length keys don't overflow filesystem limits. Reader pruning, clear(), and _enforce_size_cap are all stat-guarded (snapshot (ino, size, mtime_ns), refuse unlink on mismatch) so a concurrent writer's os.replace is preserved. Stale temp files swept on open; live temps count toward the size cap. Windows ERROR_SHARING_VIOLATION / ERROR_LOCK_VIOLATION on os.replace are retried with bounded backoff (~185ms) before being treated as a non-fatal cache miss; other PermissionError and all POSIX failures propagate. __len__ also rejects stored_key/path mismatch.
* Program.compile(cache=...) integration is out of scope (tracked by #176/#179).
Test plan
* Cache tests: single-process CRUD; LRU/size-cap; corruption + __len__ pruning; schema-mismatch table-DROP; threaded SQLite (4 writers + 4 readers × 200 ops); cross-process FileStream stress (writer/reader race exercising the stat-guard prune; clear/eviction race injection via generator cleanup); Windows vs POSIX PermissionError narrowing (winerror 32/33 swallow + retry, others propagate; partial-conn close on OperationalError); lazy-import subprocess test; _SUPPORTED_TARGETS_BY_CODE_TYPE parity test that parses _program.pyx via tokenize + ast.literal_eval.
* End-to-end: compile a real CUDA C++ kernel, store the ObjectCode, reopen the cache, and call get_kernel on the deserialised ObjectCode, parametrized over both backends.
Closes #178