I was messing with LLMs recently and ended up in a situation where I was running a 3080 and a 7900 XT in the same system. I started with the 3080, so my system had Nvidia drivers already installed. Adding the AMD card was 100% plug and play in both Bazzite and Ubuntu 24.04.
If you want to use both cards simultaneously, you can! I spent way too much time figuring it out. What you have to do is run llama.cpp with the Vulkan backend. Also, don't try doing that in a Docker container, because Vulkan in Docker is broken. ASK ME HOW I KNOW lmao. Performance is actually really good, so have fun!
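If it helps anyone get started, here's a rough sketch of what a Vulkan-backed setup can look like through the llama-cpp-python bindings. I ran the llama.cpp binaries directly rather than the Python bindings, so treat the specifics here as illustrative; the model path is a placeholder.

```python
# Sketch: llama.cpp via the llama-cpp-python bindings, built against the Vulkan backend.
# Install the bindings with Vulkan enabled first (documented install flag):
#   CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python
from llama_cpp import Llama

# Model path is a placeholder; any GGUF quant that fits in VRAM works.
llm = Llama(
    model_path="./models/llama-3.1-8b-instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload every layer to the GPU(s)
    n_ctx=4096,
)

out = llm("Explain in one sentence why Vulkan works across GPU vendors.", max_tokens=128)
print(out["choices"][0]["text"])
```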
Yep. Vulkan is the recommended route for cross-vendor setups, most commonly where integrated graphics are involved.
I actually had the Ti and XTX variants, so VRAM was 12 GB + 24 GB = 36 GB total. Vulkan is implemented cross-vendor, and as a point of reference, Vulkan-based llama.cpp yielded similar (though slightly worse) performance to CUDA on the 3080 Ti.
I don't have this well documented, but from memory, Llama 3.1 8B at Q4 could reliably get around 110 tok/s on CUDA and 100 tok/s on Vulkan on the same machine.
I used this setup specifically to take advantage of the combined VRAM of the two cards. I was able to run 32B Q4 models that wouldn't fit in the VRAM of either card alone, and I tracked power and memory usage with LACT. Performance seemed pretty great compared to my friend running the same models on a 4x 4060 Ti setup using just CUDA.
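For the multi-GPU split specifically, llama.cpp spreads layers across devices for you; in the Python bindings that looks roughly like the sketch below. The tensor_split ratio just mirrors my 12 GB + 24 GB cards, not something I tuned carefully, and the 32B model path is a placeholder.

```python
# Sketch: splitting one model across two cards with the Vulkan backend.
# tensor_split gives each device's relative share of layers; 12 and 24 here
# just mirror the 3080 Ti / 7900 XTX VRAM sizes (assumption, tune as needed).
# Layer-wise splitting is the default split mode in llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/qwen2.5-32b-instruct.Q4_K_M.gguf",  # placeholder 32B Q4 model
    n_gpu_layers=-1,              # offload all layers, spread over both GPUs
    tensor_split=[12.0, 24.0],    # proportional split across device 0 and device 1
    n_ctx=4096,
)

print(llm("Say hi from two GPUs.", max_tokens=32)["choices"][0]["text"])
```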
If this is interesting to a lot of people, I could put the setup back together to answer more questions / do a separate post. I took it apart because it physically needed more space than my case could accommodate, and I had the 3080 Ti literally hanging out of the case on a riser.