
Enable rocm/7.2.1 #67

Merged
PatrickRMiles merged 4 commits into LBANN:main from michaelmckinsey1:rocm-721 on May 27, 2026


michaelmckinsey1 (Collaborator) commented May 7, 2026

Enable rocm/7.2.1 using the public AMD wheels (https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2.1/) and the torch wheels (https://download.pytorch.org/whl/torch/). Apparently, going forward we should not rely on the WCI wheels, so we will likely maintain scripts/install-tuolumne-torchpypi.sh. This also uses the pre-installed rccl plugin at /collab/usr/global/tools/rccl/toss_4_x86_64_ib_cray/rocm-7.2.0/install/lib/librccl-net.so.
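The description does not say how the pre-installed rccl plugin is picked up at run time. One common approach, shown here only as a sketch, is to put the plugin's directory first on `LD_LIBRARY_PATH` before launching the job. The helper name `env_with_plugin_dir` is hypothetical, and the use of `LD_LIBRARY_PATH` is an assumption rather than something this PR confirms; only the plugin path itself comes from the description.

```python
import os

# Path copied from the PR description; everything else below is illustrative.
RCCL_PLUGIN = (
    "/collab/usr/global/tools/rccl/toss_4_x86_64_ib_cray/"
    "rocm-7.2.0/install/lib/librccl-net.so"
)


def env_with_plugin_dir(plugin_path, base_env=None):
    """Return a copy of base_env with the plugin's directory first on LD_LIBRARY_PATH.

    Assumption: the plugin is found through the dynamic loader search path.
    """
    env = dict(os.environ if base_env is None else base_env)
    plugin_dir = os.path.dirname(plugin_path)
    current = env.get("LD_LIBRARY_PATH", "")
    # Drop empty entries and any existing copy of plugin_dir to avoid duplicates.
    parts = [p for p in current.split(":") if p and p != plugin_dir]
    env["LD_LIBRARY_PATH"] = ":".join([plugin_dir] + parts)
    return env
```

The returned mapping could be passed as `env=` to `subprocess.run` when starting a training command, which leaves the calling shell's environment untouched.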

At scale 7 with 1,1,2 sharding and 10 epochs, rocm/7.2.1 is slightly faster than 7.1.0:

  • 1% faster on 1 node
  • 6% faster on 2 nodes
  • 4% faster on 4 nodes
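The PR does not state how the "% faster" figures were derived. A minimal sketch, assuming they are the relative reduction in wall-clock time against the older ROCm version, is below. The function name `percent_faster` is hypothetical, and the timings in the usage line are placeholders, not measurements from this PR.

```python
def percent_faster(baseline_seconds, new_seconds):
    """Relative wall-clock reduction of new_seconds versus baseline_seconds, in percent."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return 100.0 * (baseline_seconds - new_seconds) / baseline_seconds


# Placeholder timings for illustration only (not taken from the PR).
example = percent_faster(200.0, 188.0)
```

If the figures were instead computed from throughput (for example, samples per second), the formula would differ slightly, so it is worth confirming the definition before comparing against other runs.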

michaelmckinsey1 self-assigned this May 7, 2026
michaelmckinsey1 (Collaborator, Author) commented May 8, 2026

Caveat: I found a regression with this version where you get the error "No suitable algorithm was found to execute the required convolution" for batch_size>1. This does not happen in rocm/7.1.1.
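To narrow down which batch sizes trigger the error above, a small probe loop can help. The sketch below is not the reproduction used in this PR: the helper names and the convolution shapes are hypothetical, and the actual model's layers may differ. It assumes the failure surfaces as a Python `RuntimeError` whose message contains the quoted text.

```python
ERROR_TEXT = "No suitable algorithm was found to execute the required convolution"


def failing_batch_sizes(probe, batch_sizes):
    """Call probe(batch_size) for each size and collect those that hit ERROR_TEXT."""
    failing = []
    for batch_size in batch_sizes:
        try:
            probe(batch_size)
        except RuntimeError as exc:
            # Only swallow the specific convolution error; re-raise anything else.
            if ERROR_TEXT not in str(exc):
                raise
            failing.append(batch_size)
    return failing


def torch_conv_probe(batch_size):
    """Run one small 2D convolution; the shapes are placeholders, not the model's."""
    import torch  # imported lazily so the helper above works without torch

    # ROCm builds of PyTorch expose GPUs through the torch.cuda API.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(batch_size, 3, 16, 16, device=device)
    w = torch.randn(4, 3, 3, 3, device=device)
    torch.nn.functional.conv2d(x, w)
    if device == "cuda":
        torch.cuda.synchronize()
```

Calling `failing_batch_sizes(torch_conv_probe, [1, 2, 4])` under each ROCm install would give a quick side-by-side comparison, though the placeholder shapes may need to be replaced with the model's real layer shapes to trigger the regression.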

I still think we should proceed with updating the version, since the 7.1.1 WCI install is separate.

PatrickRMiles merged commit fd254cd into LBANN:main May 27, 2026
1 check passed