Fix misleading data_time / batch_time / data_wait metrics on CUDA (async bleed-through) by DLemming · Pull Request #749 · lightly-ai/lightly-train

DLemming · 2026-05-26T15:53:33Z

What has changed and why?

On CUDA, on_train_batch_end fires before the GPU finishes the backward pass and optimizer step. The previous implementation recorded batch_end_time at that point, so the remaining asynchronous GPU work bled into the next step's data_time, making all data_time / batch_time related metrics wildly inaccurate.

The result: data_wait reported ~60% when the true value was ~10%. This is actively misleading: users would chase data-loading bottlenecks that don't exist while the GPU is in fact fully utilized the entire time.

_callbacks/tqdm_progress_bar.py:

Read profiling/data_time and profiling/batch_time from trainer.callback_metrics

_methods/method.py:

Separate branches for CPU and CUDA
For CUDA, record non-blocking torch.cuda.Event markers at on_train_batch_start / on_train_batch_end to measure the actual GPU time
data_time = wall_gap (start → start) − GPU duration (from events)
CPU-only training falls back to the previous time.perf_counter() approach, which remains accurate when compute is synchronous.
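
The subtraction described above can be sketched in plain Python. This is a minimal illustration, not the code from this PR: `split_step_time` is a hypothetical helper, treating the GPU duration as batch_time is an assumption, the clamp to zero is an assumption, and computing data_wait as data_time divided by the start-to-start wall gap is an assumption about how the percentage is derived. `gpu_elapsed_ms` stands for the value returned by `torch.cuda.Event.elapsed_time`, which is in milliseconds.

```python
from __future__ import annotations


def split_step_time(
    wall_gap_s: float, gpu_elapsed_ms: float
) -> tuple[float, float, float]:
    """Return (data_time, batch_time, data_wait) for one training step.

    wall_gap_s:     wall-clock seconds between two consecutive step starts.
    gpu_elapsed_ms: GPU duration of the step as reported by CUDA events (ms).
    """
    # CUDA events report milliseconds; the other metrics are in seconds.
    batch_time = gpu_elapsed_ms / 1000.0
    # Whatever part of the wall gap the GPU did not account for is treated
    # as data time. Clamping to zero is an assumption for noisy measurements.
    data_time = max(wall_gap_s - batch_time, 0.0)
    # Assumed definition: fraction of the wall gap spent waiting for data.
    data_wait = data_time / wall_gap_s if wall_gap_s > 0 else 0.0
    return data_time, batch_time, data_wait
```

With a 1.0 s wall gap and 900 ms of GPU time, this yields roughly 0.1 s of data time and a data_wait of about 10%.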

How has it been tested?

Changes have been tested on CUDA with a SimCLR pretraining run (bs=1232, res=224px) and a DINOv2 pretraining run (bs=128, global_crop_res=196px). data_wait dropped from ~60% to ~10% for SimCLR, and increased from ~0.8% to ~2.5% for DINOv2.

Did you update CHANGELOG.md?

Yes
Not needed (internal change)

Did you update the documentation?

Yes
Not needed (internal change without effects for user)

chatgpt-codex-connector · 2026-05-26T15:53:38Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

CLAassistant · 2026-05-26T15:53:48Z

All committers have signed the CLA.

mrpositron · 2026-05-27T06:48:43Z

/review

mrpositron

Thanks for the contribution! I left a few comments.

mrpositron · 2026-05-27T13:39:19Z

+        if trainer.strategy.root_device.type == "cuda":
+            # Record end event — queried at next step start
+            self._step_end_event = torch.cuda.Event(enable_timing=True)  # type: ignore[no-untyped-call]
+            self._step_end_event.record()  # type: ignore[no-untyped-call]


record() only enqueues the marker; elapsed_time() needs the GPU to have actually reached _step_end_event. On a compute-bound run the CPU can reach the next on_train_batch_start while the GPU is still finishing the previous step, in which case elapsed_time() raises a RuntimeError and the run crashes.

mrpositron · 2026-05-27T13:47:01Z

        self.batch_time: float | None = None
+        # CUDA-only: events bracketing the GPU step.
+        self._step_start_event: torch.cuda.Event | None = None
+        self._step_end_event: torch.cuda.Event | None = None


Event-bracketing timing logic is now duplicated in both tqdm_progress_bar.py and method.py, each with its own _step_start_event / _step_end_event state. Could the progress bar instead read the metrics the Method already logs (profiling/data_time / profiling/batch_time) rather than re-timing the GPU independently?

DLemming · 2026-06-03T15:05:09Z

Thanks for the feedback.

Implemented your suggestions. The progress bar now reads profiling/data_time / profiling/batch_time from trainer.callback_metrics instead of re-timing the GPU, so the event-bracketing logic lives only in Method. I also guarded the elapsed_time crash possibility you mentioned with a non-blocking _end_event.query().
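
For illustration, reading the two metrics from a callback_metrics-like mapping could look roughly like the sketch below. `read_profiling_metrics` is a hypothetical helper and not the function in tqdm_progress_bar.py; only the two metric keys are taken from this PR. It assumes the values are Python floats or zero-dimensional tensors, both of which `float()` accepts.

```python
from __future__ import annotations

from typing import Any, Mapping

# Metric keys named in this PR.
PROFILING_KEYS = ("profiling/data_time", "profiling/batch_time")


def read_profiling_metrics(callback_metrics: Mapping[str, Any]) -> dict[str, float]:
    """Pick the timing metrics out of a callback_metrics-like mapping.

    Keys that have not been logged yet are skipped, so the caller can
    decide what to display before the first timing is available.
    """
    out: dict[str, float] = {}
    for key in PROFILING_KEYS:
        value = callback_metrics.get(key)
        if value is not None:
            # float() handles plain numbers and 0-dim tensors alike.
            out[key] = float(value)
    return out
```

Because the helper only reads what was already logged, the progress bar needs no timing state of its own.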

One honest limitation worth noting: when the GPU hasn't reached the end event yet, we skip that step's update rather than block. As a result, batch_time/data_time go stale for that step and the metrics are slightly biased toward less compute-bound steps. I think that's an acceptable trade-off vs. the old behavior, which reported flat-out wrong numbers (~60% data_wait instead of ~10%).

In practice I haven't been able to trigger the stale path once even on clearly compute-bound runs (e.g. DINOv2), so something in the loop seems to synchronize often enough that the end event is usually ready by the next step start anyway.

Happy to look more into that edge case later, but for now this seems like a solid improvement over the currently available version.

DLemming added 2 commits May 26, 2026 17:17

Fix misleading data_wait on cuda

7c844a8

Update CHANGELOG.md

936f64f

Format CHANGELOG.md

e6eba50

mrpositron reviewed May 27, 2026


DLemming added 2 commits June 3, 2026 15:58

Deduplicate event-bracketing timing logic by reading from method's logs

e281e14

Ensure _end_event has been recorded before elapsed_time()

0883027

Merge branch 'main' into dlemming-fix-cuda-async-data-time

6392502

DLemming wants to merge 6 commits into lightly-ai:main from DLemming:dlemming-fix-cuda-async-data-time
