cross-posted from: https://lemmy.world/post/1474932

Hi there.

I want to run LLMs locally on my server (for better privacy), and I was wondering:

  1. Could I use Intel Arc or AMD GPUs? These are often less expensive, and AMD has open source drivers, which is something I like.
  2. Would a PCIe Gen 3 x4 slot be enough (it’s a physical x16 slot running at x4 speed)? This is an important consideration.
  3. Would 8GB of GPU memory (I believe it’s called VRAM?) be enough?

I’m looking at language models to train on my Reddit and Lemmy content, with the aim of making them write like me (and maybe even better than me? Who knows). I don’t quite know which models I will train, or how (I certainly won’t be writing anything from scratch), but with the explosion of FOSS AI models, I was wondering whether something like this would be possible within the hardware constraints mentioned above.

Does the speed of the connection between the GPU and the CPU really matter in such applications?
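
To put rough numbers on that question, here is a minimal sketch of the theoretical bandwidth of a PCIe Gen 3 link and how long a one-off copy of model weights would take over it. The 8 GT/s per-lane rate and 128b/130b encoding are the PCIe 3.0 figures; the 8 GB payload is just an assumption matching the VRAM size above, and real-world throughput will be lower than this peak. The function names are made up for illustration.

```python
# Theoretical PCIe 3.0 numbers: 8 GT/s per lane, 128b/130b line encoding.
GT_PER_S_PER_LANE = 8.0
ENCODING_EFFICIENCY = 128 / 130


def pcie3_bandwidth_gb_s(lanes: int) -> float:
    """Peak one-direction bandwidth in GB/s for a PCIe 3.0 link."""
    # Divide by 8 to convert gigabits per second to gigabytes per second.
    return GT_PER_S_PER_LANE * ENCODING_EFFICIENCY / 8 * lanes


def transfer_seconds(size_gb: float, lanes: int) -> float:
    """Best-case time to copy size_gb gigabytes across the link."""
    return size_gb / pcie3_bandwidth_gb_s(lanes)


if __name__ == "__main__":
    payload_gb = 8.0  # assumption: filling an 8GB card once
    for lanes in (4, 16):
        bw = pcie3_bandwidth_gb_s(lanes)
        secs = transfer_seconds(payload_gb, lanes)
        print(f"x{lanes}: {bw:.2f} GB/s peak, {secs:.2f} s to copy {payload_gb:.0f} GB")
```

If the whole model fits in VRAM and stays there, this copy happens mainly at load time, so the narrower link would mostly show up as a slower start. Setups that keep shuffling data between system RAM and the GPU during generation would feel the difference more.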

Thanks!

  • SteveTech@programming.dev
    1 year ago

    I may have been doing something wrong, but in my experience llama.cpp with OpenCL offloading isn’t much faster than CPU-only: CPU usage stays the same, with the addition of my GPU making typewriter noises.

    I have written this gist to run fastchat-t5-3b-v1.0 using Intel’s IPEX, and it runs quite well. I have an A770 16GB, but it seems to use under 8GB when using bfloat16. It could easily be modified to run something else, though.
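
    As a rough sanity check on that memory figure, here is a back-of-the-envelope sketch of how much space the weights alone take at different precisions. It assumes the model has roughly 3 billion parameters (going by the "3b" in its name) and ignores activations, caches, and framework overhead, so actual usage will be higher. The helper name is made up for illustration.

```python
# Bytes used per parameter for a few common numeric formats.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


if __name__ == "__main__":
    n_params = 3e9  # assumption: about 3 billion parameters
    for dtype, size in BYTES_PER_PARAM.items():
        gib = weight_memory_gib(n_params, size)
        print(f"{dtype}: about {gib:.1f} GiB for weights")
```

    By this estimate the bfloat16 weights come to a bit under 6 GiB, which is in line with seeing under 8GB in use, while float32 would not fit on an 8GB card.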

    Or if you want a GUI (or a nice CLI), I’ve added support for Intel XPUs in FastChat.