
--set devicePlugin.deviceMemoryScaling=2 has no effect #705

Open
su161021 opened this issue Dec 13, 2024 · 10 comments
Labels
kind/bug Something isn't working

Comments

@su161021

helm install hami hami-charts/hami --set devicePlugin.deviceMemoryScaling=2 -n kube-system

This is the command I ran, but it did not enlarge the GPU memory. I tested with v2.4.1; everything else works fine, only the memory-scaling feature has no effect.

@su161021 su161021 added the kind/bug Something isn't working label Dec 13, 2024
@elrondwong
Contributor

elrondwong commented Dec 13, 2024

Reason

In the chart file `HAMi/charts/hami/templates/device-plugin/configmap.yaml`, the configuration looks like this:

```json
{
    "nodeconfig": [
        {
            "name": "m5-cloudinfra-online02",
            "devicememoryscaling": 1.8,
            "devicesplitcount": 10,
            "migstrategy": "none",
            "filterdevices": {
              "uuid": [],
              "index": []
            }
        }
    ]
}
```

In the code, the configuration is processed in the following part of the file:

`pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go: 109-127`

```go
klog.Infof("Device Plugin Configs: %v", fmt.Sprintf("%v", deviceConfigs))
for _, val := range deviceConfigs.Nodeconfig {
	if os.Getenv(util.NodeNameEnvName) == val.Name {
		klog.Infof("Reading config from file %s", val.Name)
		if val.Devicememoryscaling > 0 {
			sConfig.DeviceMemoryScaling = val.Devicememoryscaling
		}
		if val.Devicecorescaling > 0 {
			sConfig.DeviceCoreScaling = val.Devicecorescaling
		}
		if val.Devicesplitcount > 0 {
			sConfig.DeviceSplitCount = val.Devicesplitcount
		}
		if val.FilterDevice != nil && (len(val.FilterDevice.UUID) > 0 || len(val.FilterDevice.Index) > 0) {
			nvidia.DevicePluginFilterDevice = val.FilterDevice
		}
		klog.Infof("FilterDevice: %v", val.FilterDevice)
	}
}
```

The actual node name is:

k exec hami-device-plugin-tp8b9 -c device-plugin -- env | grep NODE_NAME
NODE_NAME=node1

Therefore, the configuration does not take effect.
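For context, a device plugin typically receives its node name through the Kubernetes Downward API; a sketch of the relevant DaemonSet `env` entry (standard Kubernetes fields, not copied from the HAMi chart) shows why the pod reports `NODE_NAME=node1`:

```yaml
# Sketch (not the exact HAMi manifest): NODE_NAME is injected from the
# node the pod is scheduled on, via the Kubernetes Downward API.
env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
```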

Solution

To fix this, update the ConfigMap with the actual host name:

k edit cm hami-device-plugin

```yaml
apiVersion: v1
data:
  config.json: |
    {
        "nodeconfig": [
            {
                "name": "m5-cloudinfra-online02",
                "devicememoryscaling": 2,
                "devicesplitcount": 10,
                "migstrategy": "none",
                "filterdevices": {
                    "uuid": [],
                    "index": []
                }
            }
        ]
    }
```

Change "m5-cloudinfra-online02" to the actual node name (here, `node1`).

Then, restart the `hami-device-plugin` DaemonSet to apply the changes.


@elrondwong
Contributor

/assign elrondwong

@su161021
Author

> (quoting the explanation above)

Thanks for the guidance, I'll give it a try.

@su161021
Author

> (quoting the explanation above)

I changed the parameters and restarted with `kubectl rollout restart daemonset hami-device-plugin -n kube-system`, but GPU memory oversubscription still doesn't take effect. Could you take another look?

(screenshots attached)

@su161021
Author

This is resolved now, thank you. The monitoring page only shows the data once a pod is actually using the GPU; my page looks normal now.

@elrondwong
Contributor

After investigation, there is no issue with the code. You need to configure the global `deviceMemoryScaling` value in `k get cm hami-scheduler-device -o yaml`, or specify it per node in `k get cm hami-device-plugin -o yaml`. The priority order is: `hami-device-plugin` > `hami-scheduler-device`.
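The priority rule can be pictured as a simple per-node override; a minimal sketch (hypothetical function names, not HAMi code):

```python
# Minimal sketch (not HAMi code) of the stated priority:
# a per-node value from hami-device-plugin, when present,
# overrides the global value from hami-scheduler-device.

def effective_memory_scaling(global_scaling, per_node_scaling, node_name):
    """Return the scaling factor that applies to node_name."""
    return per_node_scaling.get(node_name, global_scaling)

per_node = {"node1": 2.0}  # e.g. from the hami-device-plugin ConfigMap
print(effective_memory_scaling(1.0, per_node, "node1"))  # 2.0 (per-node wins)
print(effective_memory_scaling(1.0, per_node, "node2"))  # 1.0 (global default)
```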

@su161021
Author

> (quoting the explanation above)

I've found a new problem. My GPU is a 4090 with 24 GB of VRAM, and the host has 62 GB of RAM. I started a pod requesting 60000 MiB; inside the pod the GPU looks normal, but it errors out when actually used. Could you take a look?

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 784.00 MiB. GPU 0 has a total capacity of 58.59 GiB of which 58.22 GiB is free. Process 167892 has 23.04 GiB memory in use. Of the allocated memory 22.58 GiB is allocated by PyTorch, and 17.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[HAMI-core Msg(11733:140477140360064:multiprocess_memory_limit.c:497)]: Calling exit handler 11733

(screenshots attached)

@elrondwong
Contributor

As I understand it, your GPU has 24 GB of memory, so if your program uses more than 24 GB of GPU memory it will OOM. Your host has 60 GB of RAM, but when you use the GPU, the workload is loaded into GPU memory, not host RAM.

The point of oversubscription is to improve resource utilization. For example, with 24 GB of GPU memory you can run four pods that each request 20 GB, because their actual usage may be distributed as, say, pod1 using 10 GB and pod2 using 3 GB. The pods run normally only as long as the sum of all pods' actual GPU memory usage stays below the physical memory; if the sum exceeds it, you get an OOM. It does not mean that a 24 GB physical resource can suddenly use 25 GB.
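The distinction between scheduled (virtual) capacity and physical capacity can be sketched as a toy calculation (illustrative only, not HAMi code; the 24 GiB and scaling-factor-2 figures are taken from this thread):

```python
# Toy model (not HAMi code) of GPU memory oversubscription:
# scheduling admits pods against the *scaled* (virtual) capacity,
# but at runtime the pods' *actual* usage must fit physical memory.

PHYSICAL_GIB = 24   # e.g. an RTX 4090
SCALING = 2.0       # deviceMemoryScaling=2 -> 48 GiB virtual capacity

def can_schedule(requests_gib):
    """Admission check: requests fit within the scaled (virtual) capacity."""
    return sum(requests_gib) <= PHYSICAL_GIB * SCALING

def runs_without_oom(actual_usage_gib):
    """Runtime check: actual usage must fit the physical 24 GiB."""
    return sum(actual_usage_gib) <= PHYSICAL_GIB

print(can_schedule([20, 20]))      # True  (40 <= 48: both pods are admitted)
print(runs_without_oom([10, 3]))   # True  (13 <= 24: runs fine)
print(runs_without_oom([20, 20]))  # False (40 > 24: OOM despite admission)
```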

@su161021
Author

> (quoting the explanation above)

I don't think that's right. The point of vGPU splitting is that 24 GB can be sliced into multiple GPUs; but with oversubscription I can already see 60 GB of GPU memory inside the container, so I should be able to use all 60 GB, with the underlying layer simply treating host RAM as GPU memory.

@huangjiasingle

@su161021 My test results after enabling GPU memory virtualization match @elrondwong's understanding, so in my view the memory oversubscription HAMi implements is as good as not implemented. My understanding of memory virtualization is that, once enabled, it should fall back to host RAM directly when GPU memory runs out.
