Draft: Switch to ParallelTestRunner #3110

maleadt wants to merge 12 commits into master from tb/ptr

Conversation

@maleadt (Member) commented Apr 18, 2026

maleadt and others added 2 commits April 18, 2026 15:54
Each subpackage's `Pkg.test` runner is now a minimal call to PTR's
`runtests`, which spawns one worker process per test file and runs them
concurrently. `setup.jl` is loaded via `init_code` so each worker picks
up the shared fixtures.
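
As a sketch, a subpackage's `test/runtests.jl` could then reduce to something like the following (the exact `runtests` keyword names here are assumptions for illustration, not PTR's documented API):

```julia
# Hypothetical minimal subpackage test entry point under ParallelTestRunner.
using ParallelTestRunner

runtests(ARGS;
         # run setup.jl in every worker so the shared fixtures are available
         init_code = :(include(joinpath(@__DIR__, "setup.jl"))))
```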

Side effects:
- Extract cuTENSOR's inline "kernel cache" testset to its own file
  (lib/cutensor/test/kernel_cache.jl) since runtests.jl is no longer
  the place for test code.
- cuSPARSE's array.jl had three show-output tests that implicitly
  relied on `using cuSPARSE, SparseArrays` being in Main (where
  CUDA.jl's top-level runner incidentally loaded them). PTR workers
  run tests in an isolated submodule, so pass an explicit
  `:module => @__MODULE__` context to `sprint(show, …)` so the type
  names are qualified against the worker module's bindings rather
  than Main's.
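
The effect of the `:module` context key can be seen in plain Julia (a self-contained toy, not the cuSPARSE tests themselves):

```julia
# Type names are qualified relative to the module passed via the :module
# IOContext key; Base's type printing consults it, defaulting to Main.
module Work
    struct Foo end
end

qualified = sprint(show, Work.Foo; context = :module => Main)  # includes the defining module
local_name = sprint(show, Work.Foo; context = :module => Work) # just "Foo"
```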

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrites `test/runtests.jl` on top of `ParallelTestRunner.runtests`,
unifying CUDA.jl with the subpackages migrated in 94dedf5. The homegrown
runner (~580 lines in runtests.jl + ~210 in setup.jl) becomes a thin
wrapper plus a CUDA-specific `AbstractTestRecord`:

- `CUDATestRecord` carries the standard fields plus `gpu_bytes`,
  `gpu_time`, and `gpu_rss`; an `execute(::Type{CUDATestRecord}, ...)`
  method uses `CUDA.@timed` to capture GPU alloc stats and queries NVML
  for per-process GPU RSS. `print_test_finished`/`print_test_failed`
  overrides add `GPU Alloc (MB)` and `GPU RSS (MB)` columns.
- Worker count is capped by free GPU memory (~2 GiB/worker) in addition
  to PTR's CPU/RAM default.
- `--sanitize[=tool]` wraps every worker by passing a compute-sanitizer
  `Cmd` as `runtests`'s `exename` kwarg (new in PTR 2.6).
- `--all` (or an explicit `libraries/*` positional) includes subpackage
  tests under `lib/*/test/`, using `Base.set_active_project` to activate
  the subpackage's Project.toml.
- Context-destroying tests (`core/initialization`, `core/cudadrv`) are
  isolated on a fresh worker via the `test_worker` hook and use plain
  Julia timing (CUDA events become invalid when the context is destroyed).
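
The memory-based worker cap is simple arithmetic; a self-contained sketch of the heuristic (the real code queries CUDA for free memory, and the function name here is hypothetical):

```julia
# Cap the worker count by free GPU memory, at ~2 GiB per worker,
# in addition to whatever CPU/RAM-based default PTR computes.
const GPU_MEM_PER_WORKER = 2 * 2^30   # ~2 GiB

function gpu_worker_cap(free_gpu_bytes::Integer, default_jobs::Integer)
    max(1, min(default_jobs, free_gpu_bytes ÷ GPU_MEM_PER_WORKER))
end
```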

Per-worker setup (`CUDATestRecord`, NVML helpers, GPUArrays TestSuite
include, `CUDA.precompile_runtime`) lives in `test/setup.jl` and runs
via `init_worker_code`. Per-test helpers (`testf`, `sink`, `@grab_output`,
`@on_device`, `julia_exec`) are in a new `test/helpers.jl` included via
`init_code`, so subpackage setup.jl's `testf` redefinitions don't clash
with an imported binding.

Drops: `--gpu=…` multi-device selection, exclusive-mode downgrade,
interactive `?` key. GPU selection now goes through `CUDA_VISIBLE_DEVICES`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maleadt (Member, Author) commented Apr 19, 2026

51/41/37 mins here vs. 42/34/33 on master for Julia 1.11/1.12/1.13. Surprisingly, this looks like the actual tests themselves slowing down, e.g. on Julia 1.12:

gpuarrays/linalg/core                          (9) │   152.89 │   0.02 │      28.29 │   148.00 │   4.48 │  2.9 │   13223.51 │  4035.78 │
gpuarrays/linalg/norm                         (13) │   281.17 │   0.02 │       0.03 │   130.00 │   7.46 │  2.7 │   17295.61 │  3769.99 │

vs

gpuarrays/linalg/core                         (3) |   126.39 |   0.02 |  0.0 |      28.29 |   226.00 |   3.43 |  2.7 |   11518.49 | 10666.16 |
gpuarrays/linalg/norm                         (4) |   250.23 |   0.02 |  0.0 |       0.03 |   146.00 |   6.00 |  2.4 |   14957.74 |  6904.70 |

@giordano (Contributor) commented:

Try with `--verbose`, which also shows the init time?

@maleadt (Member, Author) commented Apr 19, 2026

Good idea. It's probably related to my tuning of the memory pool heuristics though, and not because of PTR.

maleadt and others added 10 commits April 19, 2026 10:24
A compute-sanitizer-wrapped worker starts by printing its banner
('========= COMPUTE-SANITIZER') to stdout, which collides with Malt's
port handshake (the first stdout line must be parseable as a UInt16 port
number). Passing `--log-file=<dir>/%p.log` redirects sanitizer text to a
per-process file, leaving the worker's stdout clean for Malt.

After `runtests` returns (or throws), scan the directory and surface any
logs missing the "ERROR SUMMARY: 0 errors" line; emit a colored
one-liner summary otherwise. This preserves the signal while keeping
clean runs quiet.
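
A sketch of both pieces, with flag names per NVIDIA's compute-sanitizer CLI (`%p` is expanded by the sanitizer to the process id; the helper names here are hypothetical):

```julia
# Build a sanitizer-wrapped `exename`: sanitizer output goes to per-PID
# log files, keeping the worker's stdout clean for Malt's port handshake.
function sanitizer_exename(tool::AbstractString, logdir::AbstractString)
    `compute-sanitizer --tool=$tool --log-file=$(joinpath(logdir, "%p.log")) $(Base.julia_cmd())`
end

# After the run, a log is "clean" iff it reports zero errors.
clean_log(path) = occursin("ERROR SUMMARY: 0 errors", read(path, String))
```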

Also silence the `Pkg.activate`/`Pkg.add` chatter during CUDA_SDK_jll
install (`io = devnull`) — the only output we want is the sanitizer
version banner we explicitly print.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workers run many tests back-to-back, and pool-cached buffers stay
resident because the release threshold is unbounded and the idle
pool-cleanup task only runs when `isinteractive()`. Calling
`CUDA.reclaim()` after the post-test GC trims the pool and empties
library handle caches, reducing GPU RSS accumulation without
invalidating compiled kernels.
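
In hook form this is just two calls (a sketch; the hook name and its placement in the runner are assumptions):

```julia
# Post-test cleanup per worker: collect dead array wrappers first, then
# return pool-cached GPU buffers to the driver and empty handle caches.
using CUDA

function after_testfile()
    GC.gc(true)
    CUDA.reclaim()
end
```

Compiled kernels live outside the memory pool, so they survive the reclaim.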

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match ParallelTestRunner's new composition pattern for `AbstractTestRecord`:
carry a `base::TestRecord` field and delegate Julia-timed execution to
`ParallelTestRunner.execute(TestRecord, …)` instead of redeclaring every
baseline field and re-implementing the non-CUDA timing path inline.
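
The shape of the pattern, as a self-contained mock (the field names mirror this PR; ParallelTestRunner's real types differ):

```julia
# Stand-ins for ParallelTestRunner's types, to illustrate composition.
abstract type AbstractTestRecord end

struct TestRecord <: AbstractTestRecord   # mock baseline record
    time::Float64
    bytes::Int
end

struct CUDATestRecord <: AbstractTestRecord
    base::TestRecord      # carry the baseline record instead of copying fields
    gpu_time::Float64
    gpu_bytes::Int
end

# Forward unknown properties to the wrapped baseline record.
Base.getproperty(r::CUDATestRecord, s::Symbol) =
    hasfield(CUDATestRecord, s) ? getfield(r, s) :
                                  getproperty(getfield(r, :base), s)
```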

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-test GPU RSS data (collected after adding post-test CUDA.reclaim())
showed a handful of tests blowing well past the 4 GiB per-worker budget:

- test/core/array.jl: the 512^3 sum!() case allocated ~1 GiB Float64 to
  exercise the big-mapreduce path; (85, 1320, 100) already exercises the
  same serial kernel path. Drop the 512^3 case.
- test/core/sorting.jl: the "large sizes" quicksort input at 2^25 Float32
  was 128 MiB; 2^22 still exercises the multi-block quicksort path.
- examples/peakflops.jl: default n=5000 built four 5000x5000 Float32
  matrices (~400 MiB); n=1024 is enough to demonstrate the example.
- lib/cutensornet/test/contractions.jl: max_ws_size=2^32 (4 GiB
  workspace hint) was inflating cuTensorNet to ~1.5 GiB; 2^28 covers
  the same tuning paths.
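
The size arithmetic behind these reductions checks out:

```julia
# Allocation sizes quoted above, verified in bytes.
mib(b) = b / 2^20

@assert mib(512^3 * sizeof(Float64)) == 1024.0   # 512^3 Float64 → 1 GiB
@assert mib(2^25 * sizeof(Float32)) == 128.0     # 2^25 Float32 → 128 MiB
# four 5000×5000 Float32 matrices ≈ 381 MiB (~400 MB decimal)
@assert round(mib(4 * 5000^2 * sizeof(Float32))) == 381
```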

Library tests (cusolver/cusparse/cudnn/cutensor/etc.) still sit at
1-2 GiB due to persistent library workspace that's not pool-allocated
and therefore not released by CUDA.reclaim() between tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `@test_throws UndefRefError current_context()` / `current_device()`
assertions at the top of initialization.jl require that CUDA hasn't been
touched yet in the current Julia process. With PTR, every worker runs
`setup.jl` as `init_worker_code`, and that already does
`CUDA.functional(true)` / `precompile_runtime` / pool config — so the
worker is never in a fresh state by the time the test runs, and these
assertions fail ("Expected: UndefRefError, No exception thrown").

Run those four assertions (and the paired "now cause initialization"
check) in a subprocess instead, the same way the issue-1331 test at the
bottom of the file already does. The rest of initialization.jl doesn't
depend on fresh state and runs fine on a normal worker.
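
The subprocess pattern itself, stripped of CUDA (a generic sketch; the real test script loads CUDA and uses `@test_throws`):

```julia
# Run a snippet in a fresh Julia process and report whether it succeeded;
# a fresh process is guaranteed not to have touched CUDA yet.
function fresh_process_passes(code::AbstractString)
    success(`$(Base.julia_cmd()) --startup-file=no -e $code`)
end
```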

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The top-band GPU label spanned 30 columns but the bottom-band GPU cells
(GC/Alloc/RSS) sum to 33, shifting every pipe after the GPU section
three columns left. Widen the dashes (12 + 13) to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>