Conversation
Each subpackage's `Pkg.test` runner is now a minimal call to PTR's `runtests`, which spawns one worker process per test file and runs them concurrently. `setup.jl` is loaded via `init_code` so each worker picks up the shared fixtures.

Side effects:
- Extract cuTENSOR's inline "kernel cache" testset to its own file (`lib/cutensor/test/kernel_cache.jl`), since `runtests.jl` is no longer the place for test code.
- cuSPARSE's `array.jl` had three show-output tests that implicitly relied on `using cuSPARSE, SparseArrays` being in `Main` (where CUDA.jl's top-level runner incidentally loaded them). PTR workers run tests in an isolated submodule, so pass an explicit `:module => @__MODULE__` context to `sprint(show, …)` so that type names are qualified against the worker module's bindings rather than `Main`'s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrites `test/runtests.jl` on top of `ParallelTestRunner.runtests`, unifying CUDA.jl with the subpackages migrated in 94dedf5. The homegrown runner (~580 lines in `runtests.jl` + ~210 in `setup.jl`) becomes a thin wrapper plus a CUDA-specific `AbstractTestRecord`:

- `CUDATestRecord` carries the standard fields plus `gpu_bytes`, `gpu_time`, and `gpu_rss`; an `execute(::Type{CUDATestRecord}, ...)` method uses `CUDA.@timed` to capture GPU alloc stats and queries NVML for per-process GPU RSS. `print_test_finished`/`print_test_failed` overrides add `GPU Alloc (MB)` and `GPU RSS (MB)` columns.
- Worker count is capped by free GPU memory (~2 GiB/worker) in addition to PTR's CPU/RAM default.
- `--sanitize[=tool]` wraps every worker by passing a compute-sanitizer `Cmd` as `runtests`'s `exename` kwarg (new in PTR 2.6).
- `--all` (or an explicit `libraries/*` positional) includes subpackage tests under `lib/*/test/`, using `Base.set_active_project` to activate the subpackage's Project.toml.
- Context-destroying tests (`core/initialization`, `core/cudadrv`) are isolated on a fresh worker via the `test_worker` hook and use plain Julia timing (since CUDA events are invalidated along with the context).

Per-worker setup (`CUDATestRecord`, NVML helpers, the GPUArrays TestSuite include, `CUDA.precompile_runtime`) lives in `test/setup.jl` and runs via `init_worker_code`. Per-test helpers (`testf`, `sink`, `@grab_output`, `@on_device`, `julia_exec`) are in a new `test/helpers.jl` included via `init_code`, so subpackage `setup.jl` redefinitions of `testf` don't clash with an imported binding.

Drops: `--gpu=…` multi-device selection, the exclusive-mode downgrade, and the interactive `?` key. GPU selection now goes through `CUDA_VISIBLE_DEVICES`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member
Author
51/41/37 mins here vs 42/34/33 on master for Julia 1.11/1.12/1.13. And this surprisingly looks like the actual tests slowing down, e.g. on Julia 1.12.
Contributor
Try with
Member
Author
Good idea. It's probably related to my tuning of the memory pool heuristics though, and not because of PTR.
A compute-sanitizer-wrapped worker starts by printing its banner
('========= COMPUTE-SANITIZER') to stdout, which collides with Malt's
port handshake (the first stdout line must be parseable as a UInt16 port
number). Passing `--log-file=<dir>/%p.log` redirects sanitizer text to a
per-process file, leaving the worker's stdout clean for Malt.
After `runtests` returns (or throws), scan the directory and surface any
logs missing the "ERROR SUMMARY: 0 errors" line; emit a colored
one-liner summary otherwise. This preserves the signal while keeping
clean runs quiet.
Also silence the `Pkg.activate`/`Pkg.add` chatter during CUDA_SDK_jll
install (`io = devnull`) — the only output we want is the sanitizer
version banner we explicitly print.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
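The post-run scan described above amounts to plain string matching over the per-process logs. A minimal sketch, assuming only the `--log-file=<dir>/%p.log` layout; `scan_sanitizer_logs` is an illustrative name, not the helper actually added:

```julia
# Illustrative sketch: collect every sanitizer log in the directory and
# surface those missing a clean "ERROR SUMMARY" line.
function scan_sanitizer_logs(logdir::AbstractString)
    failed = String[]
    for file in readdir(logdir; join=true)
        endswith(file, ".log") || continue
        occursin("ERROR SUMMARY: 0 errors", read(file, String)) || push!(failed, file)
    end
    return failed
end
```

An empty result keeps clean runs quiet; any returned paths would be surfaced after `runtests` finishes.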
Workers run many tests back-to-back, and pool-cached buffers stay resident because the release threshold is unbounded and the idle pool-cleanup task only runs when `isinteractive()`. Calling `CUDA.reclaim()` after the post-test GC trims the pool and empties library handle caches, reducing GPU RSS accumulation without invalidating compiled kernels. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match ParallelTestRunner's new composition pattern for `AbstractTestRecord`: carry a `base::TestRecord` field and delegate Julia-timed execution to `ParallelTestRunner.execute(TestRecord, …)` instead of redeclaring every baseline field and re-implementing the non-CUDA timing path inline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
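The composition pattern reads roughly like the toy below. These are stand-in types, not PTR's actual `TestRecord`/`execute` (whose signatures differ); the point is the `base` field plus delegation:

```julia
# Toy stand-ins illustrating composition + delegation (not PTR's real API).
struct BaseRecord            # plays the role of ParallelTestRunner.TestRecord
    name::String
    time::Float64
end
run_record(::Type{BaseRecord}, name) = BaseRecord(name, 0.0)

struct GPURecord             # plays the role of CUDATestRecord
    base::BaseRecord         # baseline fields composed, not redeclared
    gpu_bytes::Int
end

# Delegate the baseline timing path, then attach the GPU-specific stats.
function run_record(::Type{GPURecord}, name; gpu_bytes=0)
    GPURecord(run_record(BaseRecord, name), gpu_bytes)
end
```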
Per-test GPU RSS data (collected after adding the post-test `CUDA.reclaim()`) showed a handful of tests blowing well past the 4 GiB per-worker budget:

- `test/core/array.jl`: the 512^3 `sum!()` case allocated ~1 GiB of Float64 to exercise the big-mapreduce path; (85, 1320, 100) already exercises the same serial kernel path. Drop the 512^3 case.
- `test/core/sorting.jl`: the "large sizes" quicksort input at 2^25 Float32 was 128 MiB; 2^22 still exercises the multi-block quicksort path.
- `examples/peakflops.jl`: the default n=5000 built four 5000x5000 Float32 matrices (~400 MiB); n=1024 is enough to demonstrate the example.
- `lib/cutensornet/test/contractions.jl`: max_ws_size=2^32 (a 4 GiB workspace hint) was inflating cuTensorNet to ~1.5 GiB; 2^28 covers the same tuning paths.

Library tests (cusolver/cusparse/cudnn/cutensor/etc.) still sit at 1-2 GiB due to persistent library workspace that's not pool-allocated and therefore not released by `CUDA.reclaim()` between tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
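The sizes cited above can be sanity-checked directly (`sizeof(Float64) == 8`, `sizeof(Float32) == 4`):

```julia
# Arithmetic behind the quoted footprints.
@assert 512^3 * sizeof(Float64) == 2^30               # sum!() input: exactly 1 GiB
@assert 2^25 * sizeof(Float32) == 128 * 2^20          # quicksort input: 128 MiB
@assert 4 * 5000^2 * sizeof(Float32) == 400_000_000   # peakflops matrices: 400 MB (~381 MiB)
```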
The `@test_throws UndefRefError current_context()` / `current_device()`
assertions at the top of initialization.jl require that CUDA hasn't been
touched yet in the current Julia process. With PTR, every worker runs
`setup.jl` as `init_worker_code`, and that already does
`CUDA.functional(true)` / `precompile_runtime` / pool config — so the
worker is never in a fresh state by the time the test runs, and these
assertions fail ("Expected: UndefRefError, No exception thrown").
Run those four assertions (and the paired "now cause initialization"
check) in a subprocess instead, the same way the issue-1331 test at the
bottom of the file already does. The rest of initialization.jl doesn't
depend on fresh state and runs fine on a normal worker.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
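A minimal sketch of the subprocess mechanism, assuming nothing about the helper CUDA.jl actually uses (`run_in_fresh_process` is a made-up name; `Base.julia_cmd` and `success` are standard Julia APIs):

```julia
# Hypothetical helper: run a snippet in a pristine Julia process, so no
# prior CUDA initialization from the worker's setup.jl can leak in.
run_in_fresh_process(code::AbstractString) =
    success(`$(Base.julia_cmd()) --startup-file=no -e $code`)

# The fresh-state assertions would then be shipped as a string, e.g.:
# run_in_fresh_process("""
#     using CUDA, Test
#     @test_throws UndefRefError current_context()
#     @test_throws UndefRefError current_device()
# """)
```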
The top-band GPU label spanned 30 columns but the bottom-band GPU cells (GC/Alloc/RSS) sum to 33, shifting every pipe after the GPU section three columns left. Widen the dashes (12 + 13) to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Needs JuliaTesting/ParallelTestRunner.jl#129