
Is it possible to run inference with OVIS 1.6 on a single 4090 GPU? #22

Open
Raven625 opened this issue Sep 23, 2024 · 6 comments

@Raven625

Could anyone please advise whether it is possible to run inference with OVIS 1.6 on a single 4090 GPU? After loading, the model appears to consume approximately 20 GB of VRAM. I attempted inference, but the demo exited due to insufficient memory. Are there any solutions to this issue?

@leave-zym

Same question here. Is there a way to run inference with quantization?
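
For what it's worth, transformers' standard bitsandbytes integration might work here. A minimal 4-bit loading sketch, untested with Ovis's custom remote code, so treat it as an assumption rather than a confirmed path:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # 4-bit NF4 quantization via bitsandbytes (assumption: Ovis's
    # trust_remote_code model is compatible with this integration)
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "AIDC-AI/Ovis1.6-Gemma2-9B",
        quantization_config=quant_config,
        multimodal_max_length=8192,
        trust_remote_code=True,
    )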

@thunder95

Same issue.

@FennelFetish

You can offload some layers of the LLM and/or the visual tokenizer to the CPU using a device map.
I use this function to generate the device map:

    def makeDeviceMap(llmGpuLayers: int, visGpuLayers: int) -> dict:
        # Ovis1.6-Gemma2-9B has 42 LLM decoder layers (0-41) and 27 vision
        # encoder layers (0-26). Clamp the requested GPU layer counts so the
        # first and last layers always stay on the GPU.
        llmGpuLayers = min(llmGpuLayers, 41)
        visGpuLayers = min(visGpuLayers, 26)

        deviceMap = dict()
        cpu = "cpu"
        cuda = 0

        # Embeddings, final norm, output head and visual token
        # embedding stay on the GPU.
        deviceMap["llm.model.embed_tokens"] = cuda
        deviceMap["llm.model.norm"] = cuda
        deviceMap["llm.lm_head.weight"] = cuda
        deviceMap["vte.weight"] = cuda

        # First llmGpuLayers decoder layers go to the GPU, the rest to the
        # CPU; the last layer (41) is pinned to the GPU.
        deviceMap["llm.model.layers.0"] = cuda
        for l in range(1, llmGpuLayers):
            deviceMap[f"llm.model.layers.{l}"] = cuda
        for l in range(llmGpuLayers, 41):
            deviceMap[f"llm.model.layers.{l}"] = cpu
        deviceMap["llm.model.layers.41"] = cuda

        # Same scheme for the visual tokenizer's vision encoder layers.
        deviceMap["visual_tokenizer"] = cuda
        deviceMap["visual_tokenizer.backbone.vision_model.encoder.layers.0"] = cuda
        for l in range(1, visGpuLayers):
            deviceMap[f"visual_tokenizer.backbone.vision_model.encoder.layers.{l}"] = cuda
        for l in range(visGpuLayers, 26):
            deviceMap[f"visual_tokenizer.backbone.vision_model.encoder.layers.{l}"] = cpu
        deviceMap["visual_tokenizer.backbone.vision_model.encoder.layers.26"] = cuda

        # print("makeDeviceMap:")
        # for k, v in deviceMap.items():
        #     print(f"{k} -> {v}")

        return deviceMap

It works on my 4090 with the arguments 41 and 6:

        # Requires: import torch; from transformers import AutoModelForCausalLM
        self.model = AutoModelForCausalLM.from_pretrained(
            modelPath,
            torch_dtype=torch.bfloat16,
            multimodal_max_length=8192,
            #attn_implementation='flash_attention_2',
            device_map=self.makeDeviceMap(41, 6),
            trust_remote_code=True
        )
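
Once the model is loaded this way, accelerate records the resolved placement, which makes it easy to verify what actually ended up on the CPU. A small sketch, assuming `model` refers to the model loaded above (here `self.model`):

    import torch

    # hf_device_map is populated by accelerate when a model is
    # loaded with a device_map
    for name, device in model.hf_device_map.items():
        print(f"{name} -> {device}")

    # Peak GPU memory after a test inference shows the remaining headroom
    print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB")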

@nmandic78

nmandic78 commented Sep 28, 2024

I ran their HF demo snippet (https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-9B) on a 3090 without issues. Ubuntu, ~500 MB of VRAM in use before loading the model, ~21.7 GB during inference.

And it is very good!

@dustinjoe

dustinjoe commented Oct 6, 2024

I am trying to run it on a single 3090. The model itself seems very good, but I can only run inference once before hitting this error:
AttributeError: 'HybridCache' object has no attribute 'max_batch_size'

I also added some details here: #31

Thanks to FennelFetish's comment, the VRAM issue this thread is about has been solved for me.

@aceliuchanghong

Same here. Inference works only once, then I hit the error:
AttributeError: 'HybridCache' object has no attribute 'max_batch_size'
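
This usually points to a transformers version whose HybridCache stores the cache size under a different attribute name than the model's remote code expects. Two hedged options, neither confirmed in this thread: pin the transformers version the model card specifies, or alias the attribute before loading the model. A sketch of the latter, under the assumption that the installed HybridCache exposes `batch_size` instead:

    # Assumption: the installed transformers renamed the attribute to
    # batch_size; alias max_batch_size back for Ovis's remote code.
    from transformers.cache_utils import HybridCache

    if not hasattr(HybridCache, "max_batch_size"):
        HybridCache.max_batch_size = property(lambda self: self.batch_size)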
