
--set devicePlugin.deviceMemoryScaling=2 has no effect #705

Open
su161021 opened this issue Dec 13, 2024 · 10 comments
Labels
kind/bug Something isn't working

Comments

@su161021

helm install hami hami-charts/hami --set devicePlugin.deviceMemoryScaling=2 -n kube-system

This is the command I ran, but it did not enlarge the GPU memory. I tested with v2.4.1; everything else works fine, only the memory-scaling feature has no effect.

@su161021 su161021 added the kind/bug Something isn't working label Dec 13, 2024
@elrondwong
Contributor

elrondwong commented Dec 13, 2024

Reason

In the chart file `HAMi/charts/hami/templates/device-plugin/configmap.yaml`, the configuration looks like this:

```json
{
    "nodeconfig": [
        {
            "name": "m5-cloudinfra-online02",
            "devicememoryscaling": 1.8,
            "devicesplitcount": 10,
            "migstrategy": "none",
            "filterdevices": {
              "uuid": [],
              "index": []
            }
        }
    ]
}
```

In the code, the configuration is processed in the following part of the file:

`pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go: 109-127`

```go
klog.Infof("Device Plugin Configs: %v", fmt.Sprintf("%v", deviceConfigs))
for _, val := range deviceConfigs.Nodeconfig {
	if os.Getenv(util.NodeNameEnvName) == val.Name {
		klog.Infof("Reading config from file %s", val.Name)
		if val.Devicememoryscaling > 0 {
			sConfig.DeviceMemoryScaling = val.Devicememoryscaling
		}
		if val.Devicecorescaling > 0 {
			sConfig.DeviceCoreScaling = val.Devicecorescaling
		}
		if val.Devicesplitcount > 0 {
			sConfig.DeviceSplitCount = val.Devicesplitcount
		}
		if val.FilterDevice != nil && (len(val.FilterDevice.UUID) > 0 || len(val.FilterDevice.Index) > 0) {
			nvidia.DevicePluginFilterDevice = val.FilterDevice
		}
		klog.Infof("FilterDevice: %v", val.FilterDevice)
	}
}
```

The actual node name is:

k exec hami-device-plugin-tp8b9 -c device-plugin -- env | grep NODE_NAME
NODE_NAME=node1

Therefore, the configuration does not take effect.
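For context, a device plugin typically receives its node name through the Kubernetes Downward API; a sketch of the relevant DaemonSet `env` entry (standard Kubernetes fields, not copied from the HAMi chart) shows why the pod reports `NODE_NAME=node1`:

```yaml
# Sketch (not the exact HAMi manifest): NODE_NAME is injected from the
# node the pod is scheduled on, via the Kubernetes Downward API.
env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
```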

Solution

To fix this, update the ConfigMap with the actual host name:

k edit cm hami-device-plugin

```yaml
apiVersion: v1
data:
  config.json: |
    {
        "nodeconfig": [
            {
                "name": "m5-cloudinfra-online02",
                "devicememoryscaling": 2,
                "devicesplitcount": 10,
                "migstrategy": "none",
                "filterdevices": {
                    "uuid": [],
                    "index": []
                }
            }
        ]
    }
```

Change "m5-cloudinfra-online02" to the actual node name (here, `node1`).

Then, restart the `hami-device-plugin` DaemonSet to apply the changes.


@elrondwong
Contributor

/assign elrondwong

@su161021
Author

> (quoting the explanation above)

Thanks for the guidance, I'll give it a try.

@su161021
Author

> (quoting the explanation above)

I changed the parameters and restarted with `kubectl rollout restart daemonset hami-device-plugin -n kube-system`, but GPU memory oversubscription still doesn't take effect. Could you take another look?

(screenshots attached)

@su161021
Author

This is resolved now, thank you. The monitoring page only shows the data once a pod is actually using the GPU; my page looks normal now.

@elrondwong
Contributor

After investigation, there is no issue with the code. You need to configure the global `deviceMemoryScaling` value in `k get cm hami-scheduler-device -o yaml`, or specify it per node in `k get cm hami-device-plugin -o yaml`. The priority order is: `hami-device-plugin` > `hami-scheduler-device`.
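The priority rule can be pictured as a simple per-node override; a minimal sketch (hypothetical function names, not HAMi code):

```python
# Minimal sketch (not HAMi code) of the stated priority:
# a per-node value from hami-device-plugin, when present,
# overrides the global value from hami-scheduler-device.

def effective_memory_scaling(global_scaling, per_node_scaling, node_name):
    """Return the scaling factor that applies to node_name."""
    return per_node_scaling.get(node_name, global_scaling)

per_node = {"node1": 2.0}  # e.g. from the hami-device-plugin ConfigMap
print(effective_memory_scaling(1.0, per_node, "node1"))  # 2.0 (per-node wins)
print(effective_memory_scaling(1.0, per_node, "node2"))  # 1.0 (global default)
```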

@su161021
Author

> (quoting the explanation above)

I've found a new problem. My GPU is a 4090 with 24 GB of VRAM, and the host has 62 GB of RAM. I started a pod requesting 60000 MiB; inside the pod the GPU looks normal, but it errors out when actually used. Could you take a look?

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 784.00 MiB. GPU 0 has a total capacity of 58.59 GiB of which 58.22 GiB is free. Process 167892 has 23.04 GiB memory in use. Of the allocated memory 22.58 GiB is allocated by PyTorch, and 17.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[HAMI-core Msg(11733:140477140360064:multiprocess_memory_limit.c:497)]: Calling exit handler 11733

(screenshots attached)

@elrondwong
Contributor

As I understand it, your GPU has 24 GB of memory, so if your program uses more than 24 GB of GPU memory it will OOM. Your host has 60 GB of RAM, but when you use the GPU, the workload is loaded into GPU memory, not host RAM.

The point of oversubscription is to improve resource utilization. For example, with 24 GB of GPU memory you can run four pods that each request 20 GB, because their actual usage may be distributed as, say, pod1 using 10 GB and pod2 using 3 GB. The pods run normally only as long as the sum of all pods' actual GPU memory usage stays below the physical memory; if the sum exceeds it, you get an OOM. It does not mean that a 24 GB physical resource can suddenly use 25 GB.
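The distinction between scheduled (virtual) capacity and physical capacity can be sketched as a toy calculation (illustrative only, not HAMi code; the 24 GiB and scaling-factor-2 figures are taken from this thread):

```python
# Toy model (not HAMi code) of GPU memory oversubscription:
# scheduling admits pods against the *scaled* (virtual) capacity,
# but at runtime the pods' *actual* usage must fit physical memory.

PHYSICAL_GIB = 24   # e.g. an RTX 4090
SCALING = 2.0       # deviceMemoryScaling=2 -> 48 GiB virtual capacity

def can_schedule(requests_gib):
    """Admission check: requests fit within the scaled (virtual) capacity."""
    return sum(requests_gib) <= PHYSICAL_GIB * SCALING

def runs_without_oom(actual_usage_gib):
    """Runtime check: actual usage must fit the physical 24 GiB."""
    return sum(actual_usage_gib) <= PHYSICAL_GIB

print(can_schedule([20, 20]))      # True  (40 <= 48: both pods are admitted)
print(runs_without_oom([10, 3]))   # True  (13 <= 24: runs fine)
print(runs_without_oom([20, 20]))  # False (40 > 24: OOM despite admission)
```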

@su161021
Author

> (quoting the explanation above)

I don't think that's right. The point of vGPU splitting is that 24 GB can be sliced into multiple GPUs; but with oversubscription I can already see 60 GB of GPU memory inside the container, so I should be able to use all 60 GB, with the underlying layer simply treating host RAM as GPU memory.

@huangjiasingle

@su161021 My test results after enabling GPU memory virtualization match @elrondwong's understanding, so in my view the memory oversubscription HAMi implements is as good as not implemented. My understanding of memory virtualization is that, once enabled, it should fall back to host RAM directly when GPU memory runs out.
