• SabinStargem@lemmy.today
    16 hours ago
You can use something like KoboldCpp on Linux, which lets a model run from RAM and VRAM combined. O’course, not as fast as pure VRAM or the Mac unified-memory approach, but it is an option. I use my 128 GB of RAM with some GPUs for running models.
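For the curious, a launch along these lines does the RAM/VRAM split (the model path and layer count are placeholders; `--gpulayers` controls how many layers go to VRAM, with the rest staying in system RAM):

```shell
# Sketch of a KoboldCpp launch with partial GPU offload.
# Model filename and the layer/context numbers are illustrative.
python koboldcpp.py --model ./my-model-q8_0.gguf \
    --usecublas \
    --gpulayers 30 \
    --contextsize 8192
```

Raise `--gpulayers` until you run out of VRAM; whatever doesn’t fit stays in RAM, just slower.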

      • SabinStargem@lemmy.today
        1 minute ago

        Speed depends on how much of the model sits in VRAM, and on whether that model is dense or MoE. The RAM’s benefit is more about being able to run the model in the first place. In any case, a dense Qwen3.6 27b would take up about 27–33 GB of memory, plus whatever context size you set.
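Back-of-the-envelope version of that estimate (assuming a ~Q8 quant at roughly 1 byte per weight and an fp16 KV cache; the layer/head numbers are made up for illustration, not the real model config):

```python
# Rough memory math for a dense model: weights + KV cache for the context.
# ~1 byte per weight assumes a Q8-ish quant; all figures are illustrative.
def weights_gb(params_billion, bytes_per_weight=1.0):
    return params_billion * bytes_per_weight  # 27B @ 1 byte/weight -> 27 GB

def kv_cache_gb(context_tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    # factor of 2 covers keys + values; fp16 = 2 bytes per element
    return 2 * context_tokens * layers * kv_heads * head_dim * bytes_per_elem / 1e9

total = weights_gb(27) + kv_cache_gb(8192, layers=48, kv_heads=8, head_dim=128)
print(round(total, 1))  # lands inside the quoted 27-33 GB range
```

Bigger context or a fatter quant pushes you toward the top of that range fast.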

        The upcoming implementation of MTP (multi-token prediction) will increase the size of models, but in exchange they will also run faster: about a 30%-ish boost for dense models, a bit less for Mixture-of-Experts varieties, from the looks of it.