Fix misleading data_time / batch_time / data_wait metrics on CUDA (async bleed-through)#749
Fix misleading data_time / batch_time / data_wait metrics on CUDA (async bleed-through)#749DLemming wants to merge 6 commits into
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
/review |
mrpositron
left a comment
There was a problem hiding this comment.
Thanks for the contribution! I left a few comments.
| if trainer.strategy.root_device.type == "cuda": | ||
| # Record end event — queried at next step start | ||
| self._step_end_event = torch.cuda.Event(enable_timing=True) # type: ignore[no-untyped-call] | ||
| self._step_end_event.record() # type: ignore[no-untyped-call] |
There was a problem hiding this comment.
record() only enqueues the marker; elapsed_time() needs the GPU to have actually reached _step_end_event. On a compute-bound run the CPU can reach the next on_train_batch_start while the GPU is still finishing the previous step, then elapsed_time raises RuntimeError and the run crashes.
| self.batch_time: float | None = None | ||
| # CUDA-only: events bracketing the GPU step. | ||
| self._step_start_event: torch.cuda.Event | None = None | ||
| self._step_end_event: torch.cuda.Event | None = None |
There was a problem hiding this comment.
Event-bracketing timing logic is now duplicated in both tqdm_progress_bar.py and method.py, each with its own _step_start_event / _step_end_event state. Could the progress bar instead read the metrics the Method already logs (profiling/data_time / profiling/batch_time) rather than re-timing the GPU independently?
|
Thanks for the feedback. Implemented your suggestions. The progress bar now reads One honest limitation worth noting: When the GPU hasn't reached the end event yet we skip that step's update rather than block, resulting in batch_time/data_time going stale, and metrics being slightly biased toward less compute-bound steps. I think that's an acceptable trade-off vs. the old behavior, which reported flat-out wrong numbers (~60% data_wait instad of ~10%). In practice I haven't been able to trigger the stale path once even on clearly compute-bound runs (e.g. DINOv2), so something in the loop seems to synchronize often enough that the end event is usually ready by the next step start anyway. Happy to look more into that edge case later, but for now this seems like a solid improvement over the currently available version. |
What has changed and why?
On CUDA,
on_train_batch_endfires before the GPU finishes the backward pass and optimizer step. The previous implementation recordedbatch_end_timeat that point, so the remaining async GPU work bled into the next step's data_time, rendering all data/batch-time realted metrics misleading / wildly inaccurate.The result: data_wait reported ~60% when the true value was ~10%. This is actively misleading — users would chase data-loading bottlenecks that don't exist while the GPU was fully utilized the entire time.
_callbacks/tqdm_progress_bar.py:profiling/data_timeandprofiling/batch_timefromtrainer.callback_metrics_methods/method.pytorch.cuda.Eventaton_train_batch_start/on_train_batch_end, measuring real gpu timedata_time = wall_gap (start → start) − GPU duration (from events)time.perf_counter()approach, which remains accurate when compute is synchronous.How has it been tested?
Changes have been tested on cuda running a SimCLR-Pretraining (bs=1232, res=224px) and a DINOv2-Pretraining (bs=128, global_crop_res=196px).
data_waitdropped from ~60% to ~10% for SimCLR, and increased from ~0.8% to ~2.5% for DINOv2.Did you update CHANGELOG.md?
Did you update the documentation?