Normalize NVIDIA library paths to /usr/lib/ #919

Open
maherthomsi wants to merge 2 commits into
bottlerocket-os:develop from
maherthomsi:nvidia-normalization

maherthomsi commented May 5, 2026
Contributor

Merge with bottlerocket-os/bottlerocket-kernel-kit#425

Description of changes:

Normalize NVIDIA library paths in containers so libraries appear at /usr/lib/ instead of the Bottlerocket cross-compilation sysroot path (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/).
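
To illustrate the normalization described above, here is a minimal sketch that strips the sysroot prefix from a host library path to get the path a container would see. `normalize_path` and `SYSROOT` are hypothetical names used only for illustration; the real mapping is done by the device plugin and nvidia-ctk, not by a script like this.

```shell
#!/bin/sh
# Assumption: the x86_64 sysroot prefix quoted in this PR description.
SYSROOT="/x86_64-bottlerocket-linux-gnu/sys-root"

# Hypothetical helper: print the container-visible path for a host path.
# Paths outside the sysroot are printed unchanged.
normalize_path() {
    case "$1" in
        "$SYSROOT"/*) printf '%s\n' "${1#"$SYSROOT"}" ;;
        *)            printf '%s\n' "$1" ;;
    esac
}

normalize_path "$SYSROOT/usr/lib/libcuda.so.580.159.03"
```

The prefix is removed only when the path actually starts with it, so paths that are already normalized pass through untouched.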

  • nvidia-container-toolkit: Add --additional-symlinks flag patch to nvidia-ctk cdi generate that creates backwards-compatibility symlinks in a specified directory pointing to each discovered library. Configure generate-cdi-specs.service with --additional-symlinks /usr/lib/nvidia/tesla.
  • nvidia-k8s-device-plugin: Set containerDriverRoot to the Bottlerocket sysroot path so the device plugin discovers libraries correctly and generates CDI specs with normalized /usr/lib/ container paths. Enable CreateLibSymlinksHook via patch 1002 so the device plugin's CDI spec includes backwards-compat symlinks. Add --cdi-enabled-hooks create-lib-symlinks to the exec-start template to pass the hook enablement through the systemd drop-in. Fix containerDriverRoot macro expansion from %{_cross_sysroot} (undefined) to /%{_target_cpu}-bottlerocket-linux-gnu/sys-root.

Result: containers see libraries at /usr/lib/libcuda.so.580.159.03 with backwards-compat symlinks at /usr/lib/nvidia/tesla/libcuda.so.580.159.03 pointing to /usr/lib/libcuda.so.580.159.03.
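
The resulting layout can be reproduced in a scratch directory. This is a sketch only: `make_compat_symlinks` is a hypothetical helper, not the code in the `--additional-symlinks` patch, and the library file is an empty stand-in.

```shell
#!/bin/sh
# Hypothetical helper: for every lib*.so* in LIBDIR, create a symlink with
# the same name in COMPAT pointing at the normalized LIBDIR path.
make_compat_symlinks() {
    root=$1      # scratch directory standing in for the container rootfs
    libdir=$2    # normalized library directory, e.g. /usr/lib
    compat=$3    # backwards-compat directory, e.g. /usr/lib/nvidia/tesla
    mkdir -p "$root$compat"
    for lib in "$root$libdir"/lib*.so*; do
        [ -e "$lib" ] || continue      # skip when the glob matches nothing
        name=${lib##*/}
        ln -sf "$libdir/$name" "$root$compat/$name"
    done
}

root=$(mktemp -d)
mkdir -p "$root/usr/lib"
: > "$root/usr/lib/libcuda.so.580.159.03"   # empty stand-in for the library
make_compat_symlinks "$root" /usr/lib /usr/lib/nvidia/tesla
readlink "$root/usr/lib/nvidia/tesla/libcuda.so.580.159.03"
```

The symlink targets are absolute container paths, so they resolve inside a container rootfs rather than inside the scratch directory.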

Note: Backwards-compat symlinks are supported in cdi-cri mode (the default device-list-strategy). The legacy volume-mounts mode does not use CDI hooks and therefore does not create symlinks in containers.

Testing done:

  • Built core-kit for x86_64, published, and built aws-k8s-1.35-nvidia variant AMI
  • Launched g6.xlarge node on EKS cluster in ap-northeast-1
  • Verified containerDriverRoot resolves to /x86_64-bottlerocket-linux-gnu/sys-root
  • Verified device plugin starts and registers GPUs with kubelet
  • Verified nvidia-smi works in container
  • Verified libraries at /usr/lib/ inside container
  • Verified backwards-compat symlinks at /usr/lib/nvidia/tesla/ point to /usr/lib/
  • Verified generate-cdi-specs.service passes on boot
  • Verified CDI spec at /var/run/cdi/ includes symlink hooks
  • Verified ldconfig resolves libcuda.so.1 on host
  • Verified no ld.so.conf.d entry needed (libs in standard path)
  • Verified nvidia-smi works on host (dlopen via both paths)
  • Tested negative case: without --cdi-enabled-hooks, symlinks do not appear in containers
  • Ran nvidia smoke tests - all passed
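
A check along the lines of the symlink verification above could be scripted roughly as follows. `check_compat_symlinks` is a hypothetical helper, and the scratch directory stands in for `/usr/lib/nvidia/tesla` inside a container; this is not the script used for the testing listed here.

```shell
#!/bin/sh
# Hypothetical check: every symlink in COMPAT_DIR must point into LIB_DIR.
# Returns non-zero and reports on stderr if any symlink points elsewhere.
check_compat_symlinks() {
    compat_dir=$1
    lib_dir=$2
    status=0
    for link in "$compat_dir"/*; do
        [ -L "$link" ] || continue     # ignore anything that is not a symlink
        target=$(readlink "$link")
        case "$target" in
            "$lib_dir"/*) ;;           # target is inside the expected directory
            *) echo "unexpected target: $link -> $target" >&2; status=1 ;;
        esac
    done
    return "$status"
}

# Demo against a scratch directory with one well-formed symlink.
scratch=$(mktemp -d)
ln -s /usr/lib/libcuda.so.1 "$scratch/libcuda.so.1"
check_compat_symlinks "$scratch" /usr/lib && echo "all symlinks point into /usr/lib"
```

Inside a container the same function would be called as `check_compat_symlinks /usr/lib/nvidia/tesla /usr/lib`.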

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@maherthomsi maherthomsi changed the title from "Normalize library paths to /usr/lib/" to "Normalize NVIDIA library paths to /usr/lib/" May 5, 2026
@maherthomsi maherthomsi requested review from arnaldo2792 and mgsharm May 5, 2026 23:54
@maherthomsi maherthomsi force-pushed the nvidia-normalization branch 2 times, most recently from cf2022c to 7614bfc Compare May 6, 2026 00:04
@maherthomsi

Contributor Author

Added Signed-off-by to commits.

Comment thread packages/nvidia-container-toolkit/0002-add-additional-symlinks-flag.patch Outdated
Comment thread packages/nvidia-container-toolkit/generate-cdi-specs.service
Comment thread packages/nvidia-k8s-device-plugin/1003-vendor-add-CreateLibSymlinksHook.patch Outdated
@maherthomsi maherthomsi force-pushed the nvidia-normalization branch 2 times, most recently from 64e28c4 to d3f12f0 Compare May 7, 2026 23:15
@maherthomsi maherthomsi requested a review from piyush-jena May 8, 2026 19:12
@maherthomsi maherthomsi force-pushed the nvidia-normalization branch from d3f12f0 to b16ac81 Compare May 18, 2026 19:17
{{/if}}
deviceIDStrategy: {{default "index" settings.kubelet-device-plugins.nvidia.device-id-strategy}}
-containerDriverRoot: "/"
+containerDriverRoot: "/x86_64-bottlerocket-linux-gnu/sys-root"

This is not correct: you are hardcoding the x86_64 path for both arches. If this is the way to go from now on, you will have to make this file an "input" templated file, similar to what was done in the files you removed from the kernel kit.

@maherthomsi maherthomsi force-pushed the nvidia-normalization branch from b16ac81 to 270b737 Compare May 21, 2026 21:28
@maherthomsi maherthomsi requested a review from arnaldo2792 May 21, 2026 22:16
@maherthomsi maherthomsi force-pushed the nvidia-normalization branch from 270b737 to 8be6943 Compare May 26, 2026 21:02
@maherthomsi maherthomsi force-pushed the nvidia-normalization branch from 8be6943 to d73ab01 Compare June 8, 2026 21:36
Add --additional-symlinks flag to nvidia-ctk cdi generate that creates
symlinks in a specified directory pointing to each discovered library.

Configure generate-cdi-specs.service with:
- --driver-root /x86_64-bottlerocket-linux-gnu/sys-root
- --dev-root /
- --additional-symlinks /usr/lib/nvidia/tesla

This ensures libraries appear at /usr/lib/ in containers with
backwards-compat symlinks at /usr/lib/nvidia/tesla/.

Signed-off-by: Maher Homsi <maherhom@amazon.com>
@maherthomsi maherthomsi force-pushed the nvidia-normalization branch from d73ab01 to 11e6c8d Compare June 8, 2026 21:37
…rmalization

Set containerDriverRoot to the Bottlerocket sysroot path so the device
plugin discovers libraries correctly and generates CDI specs with
normalized /usr/lib/ container paths.

Add --additional-symlinks support patches (1002, 1003) to the device
plugin vendored nvidia-container-toolkit code.

Signed-off-by: Maher Homsi <maherhom@amazon.com>
@maherthomsi maherthomsi force-pushed the nvidia-normalization branch from 11e6c8d to f5e65aa Compare June 15, 2026 21:02
@maherthomsi

Contributor Author

Force-push: Added --cdi-enabled-hooks create-lib-symlinks to exec-start template

install -d %{buildroot}%{_cross_unitdir}/nvidia-k8s-device-plugin.service.d
install -d %{buildroot}%{_cross_unitdir}/nvidia-mps-control-daemon.service.d
install -D -m 0644 %{S:2} %{buildroot}%{_cross_templatedir}/nvidia-k8s-device-plugin-conf
sed -e 's|__PREFIX__|/%{_target_cpu}-bottlerocket-linux-gnu/sys-root|g' %{S:2} > nvidia-k8s-device-plugin-conf

%{_target_cpu} is not the correct macro; you should use %{_cross_arch}.
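
The `__PREFIX__` substitution in the spec snippet can be sketched outside of RPM as follows. `render_conf` and its architecture argument are hypothetical; in the spec file the architecture would come from whichever RPM macro is settled on in this review.

```shell
#!/bin/sh
# Hypothetical helper: read a template on stdin and replace __PREFIX__ with
# the sysroot path for the given architecture.
render_conf() {
    arch=$1
    sed -e "s|__PREFIX__|/${arch}-bottlerocket-linux-gnu/sys-root|g"
}

# The same template line rendered for both supported architectures.
printf 'containerDriverRoot: "__PREFIX__"\n' | render_conf x86_64
printf 'containerDriverRoot: "__PREFIX__"\n' | render_conf aarch64
```

Rendering from a single template keeps the path from being hardcoded for one architecture.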

KCSesh commented Jun 16, 2026
Contributor

Nit: Please call out that we will remove this symlink in future variant releases. This is a change for existing variants 👍

6 participants