
Added support for AMD GPUs in "docker run --gpus". #49952


Merged: 1 commit merged into moby:master on Jun 11, 2025

Conversation

@sgopinath1 (Contributor) commented May 10, 2025

This change adds support for AMD GPUs in the docker run --gpus command.

- What I did

Added backend code to support the exact same interface used today for Nvidia GPUs, allowing customers to use the same docker commands for both Nvidia and AMD GPUs.

- How I did it

  • Followed the same approach as Nvidia by registering a new driver with the gpu capability.
  • Similar to the Nvidia GPU driver, the AMD driver maps the --gpus input of the docker command to an environment variable, AMD_VISIBLE_DEVICES, which is handled by the AMD container runtime (a rough sketch follows this list).
  • The AMD driver is registered only if the Nvidia container runtime is not installed on the system and the AMD container runtime is installed.
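
A minimal, self-contained sketch of that mapping, using invented stand-ins (deviceRequest, toVisibleDevices) rather than Moby's actual internal types:

    package main

    import (
        "fmt"
        "strings"
    )

    // deviceRequest is a simplified stand-in for the parsed --gpus flag.
    type deviceRequest struct {
        Count     int      // -1 means "all available GPUs"
        DeviceIDs []string // explicit IDs, e.g. ["1", "2", "3"]
    }

    // toVisibleDevices builds the value for AMD_VISIBLE_DEVICES.
    func toVisibleDevices(req deviceRequest) string {
        if len(req.DeviceIDs) > 0 {
            return strings.Join(req.DeviceIDs, ",")
        }
        if req.Count < 0 {
            return "all"
        }
        ids := make([]string, req.Count)
        for i := range ids {
            ids[i] = fmt.Sprint(i)
        }
        return strings.Join(ids, ",")
    }

    func main() {
        // --gpus 2 -> AMD_VISIBLE_DEVICES=0,1
        fmt.Println("AMD_VISIBLE_DEVICES=" + toVisibleDevices(deviceRequest{Count: 2}))
        // --gpus '"device=1,2,3"' -> AMD_VISIBLE_DEVICES=1,2,3
        fmt.Println("AMD_VISIBLE_DEVICES=" + toVisibleDevices(deviceRequest{DeviceIDs: []string{"1", "2", "3"}}))
    }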

- How to verify it

The AMD container runtime must be installed on the system to verify this functionality; it is expected to be published as an open-source project soon.

The following commands specify which GPUs should be made available inside the container; the rocm-smi output can then be used to verify that the correct GPUs are visible.

  • To use all available GPUs

    docker run --runtime=amd --gpus all rocm/rocm-terminal rocm-smi

    OR

    docker run --runtime=amd --gpus device=all rocm/rocm-terminal rocm-smi

  • To use any 2 GPUs

    docker run --runtime=amd --gpus 2 rocm/rocm-terminal rocm-smi

  • To use a set of specific GPUs

    docker run --runtime=amd --gpus 1,2,3 rocm/rocm-terminal rocm-smi

    OR

    docker run --runtime=amd --gpus '"device=1,2,3"' rocm/rocm-terminal rocm-smi

- Human readable description for the release notes

Add support for AMD GPUs in `docker run --gpus`.

@elezar (Contributor) commented May 13, 2025

@sgopinath1 as a maintainer of the NVIDIA Container Toolkit and its components, I would strongly recommend against using the environment variable to control this behaviour -- even as an interim solution. Adding this behaviour now means that we have to keep it in mind when implementing the --gpus flag to CDI mapping discussed in #49824.

@sgopinath1 (Contributor, Author) commented May 15, 2025

@elezar a couple of points:

  1. This PR is neither introducing new user-visible behavior nor changing the existing behavior. The backend code is also identical to the Nvidia driver. So, IMO, this should not add any new variables / considerations when we move to the long-term solution of mapping --gpus flag to CDI.
  2. The AMD container toolkit will support CDI. However, this PR is for customers who are insisting on parity with Nvidia w.r.t. the --gpus flag. As I understand it, there is no timeline for the long-term solution yet. We need to provide customers a way to use the --gpus flag with AMD GPUs as soon as possible.

@deke997 commented May 17, 2025

We would love to be able to use --gpus for AMD!

BTW, the AMD Container Toolkit is now published: https://instinct.docs.amd.com/projects/container-toolkit/en/latest/container-runtime/overview.html

Comment on lines 56 to 63

    // countToDevicesAMD returns the list 0, 1, ... count-1 of deviceIDs.
    func countToDevicesAMD(count int) string {
        devices := make([]string, count)
        for i := range devices {
            devices[i] = strconv.Itoa(i)
        }
        return strings.Join(devices, ",")
    }
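
(For reference: with this helper, a request like --gpus 2 expands to "0,1", which, per the env-var mapping in the PR description, ends up as AMD_VISIBLE_DEVICES=0,1 inside the container.)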
Contributor: This is the same implementation as countToDevices in nvidia_linux.go. Does it make sense to just use that function?

Contributor Author: Yes, makes sense. Changed accordingly.


    const amdContainerRuntime = "amd-container-runtime"

    func init() {
Contributor: @sgopinath1 instead of having a separate init function for nvidia and amd GPUs, does it make sense to refactor this and the code in nvidia_linux.go to have a single init function that checks for the existence of the various executables and registers the drivers accordingly?

Contributor Author: Agreed. I have modified the init function in nvidia_linux.go to register the AMD driver also.

@sgopinath1 (Contributor, Author)

@elezar thanks for reviewing. I have updated the code as per your suggestions. Please review.

    const nvidiaHook = "nvidia-container-runtime-hook"
    const (
        nvidiaHook = "nvidia-container-runtime-hook"
        amdHook    = "amd-container-runtime"

Contributor (suggested change):

    -   amdHook = "amd-container-runtime"
    +   amdContainerRuntimeExecutableName = "amd-container-runtime"

Contributor Author: Done as suggested.

@sgopinath1 (Contributor, Author) commented May 26, 2025

@elezar Let me know if there are any further comments on the changes. Thanks.

@elezar (Contributor) commented May 27, 2025

LGTM

@thaJeztah (Member)

Could you do a quick rebase and squash the commits?

@sgopinath1 (Contributor, Author)

> Could you do a quick rebase and squash the commits?

Done.

Comment on lines 55 to 56

    } else {
        // no "gpu" capability
    }
Member: Hm.. looks like the linter doesn't like this (even with a comment inside the branch 🤔)

    daemon/nvidia_linux.go:55:9: SA9003: empty branch (staticcheck)
        } else {
               ^

Comment on lines 38 to 56

    if _, err := exec.LookPath(nvidiaHook); err == nil {
        capset := capabilities.Set{"gpu": struct{}{}, "nvidia": struct{}{}}
        nvidiaDriver := &deviceDriver{
            capset:     capset,
            updateSpec: setNvidiaGPUs,
        }
        for c := range allNvidiaCaps {
            nvidiaDriver.capset[string(c)] = struct{}{}
        }
        registerDeviceDriver("nvidia", nvidiaDriver)
    } else if _, err := exec.LookPath(amdContainerRuntimeExecutableName); err == nil {
        capset := capabilities.Set{"gpu": struct{}{}, "amd": struct{}{}}
        amdDriver := &deviceDriver{
            capset:     capset,
            updateSpec: setAMDGPUs,
        }
        registerDeviceDriver("amd", amdDriver)
    } else {
        // no "gpu" capability
    }
Member: Perhaps an early return would work;

Suggested change (replace the if/else chain quoted above with early returns):

    if _, err := exec.LookPath(nvidiaHook); err == nil {
        capset := capabilities.Set{"gpu": struct{}{}, "nvidia": struct{}{}}
        for c := range allNvidiaCaps {
            capset[string(c)] = struct{}{}
        }
        registerDeviceDriver("nvidia", &deviceDriver{
            capset:     capset,
            updateSpec: setNvidiaGPUs,
        })
        return
    }
    if _, err := exec.LookPath(amdContainerRuntimeExecutableName); err == nil {
        registerDeviceDriver("amd", &deviceDriver{
            capset:     capabilities.Set{"gpu": struct{}{}, "amd": struct{}{}},
            updateSpec: setAMDGPUs,
        })
        return
    }
    // no "gpu" capability
Member: Curious though; should amd and nvidia be considered mutually exclusive? Would splitting this into two init funcs (one in the nvidia file, one in the amd file) and having both register a driver (if present) work?

Contributor: They are mutually exclusive, but I suggested using the same init function because @sgopinath1 was also checking for the NVIDIA runtime in the AMD init function. The intent was to ensure that the AMD runtime is not used if the NVIDIA hook is present and that the nvidia logic takes precedence.

It may be cleaner to repeat some code here to keep these logically separate; a sketch of that alternative follows.
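
For illustration only, a minimal sketch of that "repeat some code" alternative: a standalone AMD init that still lets the NVIDIA logic take precedence by bailing out when the NVIDIA hook is on PATH. The stub types and functions mirror identifiers quoted in this review but are not the daemon's real declarations:

    package main

    import "os/exec"

    type capabilitySet map[string]struct{}

    type deviceDriver struct {
        capset     capabilitySet
        updateSpec func() // simplified; the real signature takes the OCI spec
    }

    func registerDeviceDriver(name string, d *deviceDriver) { /* stub */ }
    func setAMDGPUs()                                        { /* stub */ }

    // initAMD registers the AMD driver only when the AMD runtime is
    // installed and the NVIDIA hook is absent, so nvidia wins if both exist.
    func initAMD() {
        if _, err := exec.LookPath("nvidia-container-runtime-hook"); err == nil {
            return // NVIDIA logic takes precedence
        }
        if _, err := exec.LookPath("amd-container-runtime"); err != nil {
            return // no AMD runtime installed
        }
        registerDeviceDriver("amd", &deviceDriver{
            capset:     capabilitySet{"gpu": {}, "amd": {}},
            updateSpec: setAMDGPUs,
        })
    }

    func main() { initAMD() }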

Member: Gotcha; yup, makes sense

Contributor Author:

> Perhaps an early return would work;

@thaJeztah Changed as suggested.

@sgopinath1 (Contributor, Author)

@thaJeztah Let me know if there are any further comments or the changes look good. Thanks.

@thaJeztah (Member)

> I have renamed amd_linux.go to gpu_amd_linux.go in this PR. Could you please rename nvidia_linux.go in your PR?

It's probably good to rename it as part of this PR to have them both follow the same naming pattern; git should be smart enough to handle the rename if the other PR is rebased.

Can you do the above, and squash the commits?

@thaJeztah (Member) commented Jun 4, 2025

Hm... actually, I just realised that the related selection code effectively lives in devices.go. Perhaps, rather than my earlier suggestion of a common gpu_ prefix, it would make more sense to use devices_ as the prefix; WDYT?

  • daemon/devices.go
  • daemon/devices_amd_linux.go
  • daemon/devices_nvidia_linux.go

Added backend code to support the exact same interface
used today for Nvidia GPUs, allowing customers to use
the same docker commands for both Nvidia and AMD GPUs.

Signed-off-by: Sudheendra Gopinath <[email protected]>

Reused common functions from nvidia_linux.go.

Removed duplicate code in amd_linux.go by reusing
the init() and countToDevices() functions in
nvidia_linux.go. AMD driver is registered in init().

Signed-off-by: Sudheendra Gopinath <[email protected]>

Renamed amd-container-runtime constant

Signed-off-by: Sudheendra Gopinath <[email protected]>

Removed empty branch to keep linter happy.

Also renamed amd_linux.go to gpu_amd_linux.go.

Signed-off-by: Sudheendra Gopinath <[email protected]>

Renamed nvidia_linux.go and gpu_amd_linux.go.

Signed-off-by: Sudheendra Gopinath <[email protected]>

@sgopinath1 (Contributor, Author)

> Hm... actually, I just realised that the related selection code effectively lives in devices.go. Perhaps, rather than my earlier suggestion of a common gpu_ prefix, it would make more sense to use devices_ as the prefix; WDYT?
>
>   • daemon/devices.go
>   • daemon/devices_amd_linux.go
>   • daemon/devices_nvidia_linux.go

@thaJeztah I have renamed the files as above and squashed the commits.
cc: @elezar

@thaJeztah (Member) left a comment

thanks!

LGTM

@vvoland vvoland self-assigned this Jun 6, 2025
@vvoland vvoland requested review from vvoland and elezar June 6, 2025 17:31
Comment on lines +49 to +55

    // Register AMD driver if AMD helper binary is present.
    if _, err := exec.LookPath(amdContainerRuntimeExecutableName); err == nil {
        registerDeviceDriver("amd", &deviceDriver{
            capset:     capabilities.Set{"gpu": struct{}{}, "amd": struct{}{}},
            updateSpec: setAMDGPUs,
        })
        return
Contributor: I think it's okay-ish for now, but since this already needs the AMD container toolkit to be installed... what I'd really love is:

  • Shell out to amd-ctk cdi generate to generate the actual CDI configs (on init, and then perhaps on every gpu requests?)
  • And then just rewrite the gpus value to a proper CDI device ID and pass it to the CDI driver

This way, the user won't need to pass --runtime=amd to override the container runtime to the AMD runc wrapper. A rough sketch of this flow follows.
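
A very rough sketch of the idea, under loudly stated assumptions: the amd-ctk cdi generate invocation and the amd.com/gpu CDI kind are taken from the AMD Container Toolkit docs linked above, not from this PR, and the helper functions are invented for illustration:

    package main

    import (
        "fmt"
        "os/exec"
    )

    // generateCDISpecs shells out to the AMD toolkit to (re)generate CDI
    // specs; the exact flags are an assumption based on the toolkit docs.
    func generateCDISpecs() error {
        return exec.Command("amd-ctk", "cdi", "generate", "--output=/etc/cdi/amd.json").Run()
    }

    // toCDIDevice rewrites a bare --gpus index into a fully qualified CDI
    // device name that the existing CDI driver could then resolve.
    func toCDIDevice(index string) string {
        return fmt.Sprintf("amd.com/gpu=%s", index)
    }

    func main() {
        if err := generateCDISpecs(); err != nil {
            fmt.Println("CDI spec generation failed:", err)
            return
        }
        fmt.Println(toCDIDevice("0")) // amd.com/gpu=0
    }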

Contributor Author: @vvoland We should consider this as part of the long-term plan of migrating --gpus to the CDI format. @elezar mentioned that he would be working on a plan for this here.

@vvoland vvoland added this to the 28.3.0 milestone Jun 9, 2025
@laurazard (Member) left a comment

LGTM

@thaJeztah (Member)

I "hijacked" the original ticket to repurpose it for the discussion on reimplementing --gpus through CDI; the ticket already had a lot of information captured around implementing through CDI, so I thought it was better to repurpose it than to create a new one (and lose that information), but we can still do so.

@thaJeztah (Member) left a comment

still LGTM

let's bring this one in

@thaJeztah thaJeztah merged commit 3b1d2f7 into moby:master Jun 11, 2025
231 of 232 checks passed
@vvoland vvoland added the kind/feature Functionality or other elements that the project doesn't currently have. Features are new and shiny label Jun 13, 2025
@dmcgowan dmcgowan moved this from Open to Complete in Maintainer spotlight Jun 26, 2025
Labels: area/daemon, docs/revisit, impact/changelog, kind/feature (Functionality or other elements that the project doesn't currently have. Features are new and shiny), status/2-code-review

Projects: Status: Complete