• Yes, it is. But I run llama-swap and Open WebUI. If you spend some time on the llama-swap configuration, you have a good chance of running the model across 2 cards through llama.cpp (a rough llama-swap config sketch is further down). The gains, of course, won’t be 2x and scale non-linearly with the number of cards. You also need a motherboard with enough PCI-E lanes (2× PCI-E x16 or more). But it’s still cheaper than one large card. Example:

    HIP_VISIBLE_DEVICES=0,1 \
    /opt/llama.cpp/build/bin/llama-server \
      --host 127.0.0.1 \
      --port 8082 \
      --model /storage/models/model.gguf \
      --n-gpu-layers all \
      --split-mode layer \
      --tensor-split 1,1 \
      --ctx-size 32768 \
      --batch-size 512 \
      --ubatch-size 512 \
      --flash-attn on \
      --parallel 1
    

    There is a less stable but higher-throughput alternative: --split-mode row
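
    And here is a rough sketch of the llama-swap side, wrapping the command above into a swappable model entry. The field names (models, cmd, env, ttl) and the ${PORT} macro are from my memory of llama-swap’s README, so treat it as an assumption and check the current docs before copying:

    # config.yaml for llama-swap (field names assumed from the README)
    models:
      "my-model":
        # environment passed to the spawned llama-server process (assumed env field)
        env:
          - "HIP_VISIBLE_DEVICES=0,1"
        # llama-swap substitutes ${PORT} and proxies requests to the server it starts
        cmd: |
          /opt/llama.cpp/build/bin/llama-server
          --port ${PORT}
          --model /storage/models/model.gguf
          --n-gpu-layers all
          --split-mode layer
          --tensor-split 1,1
          --ctx-size 32768
          --flash-attn on
        # unload the model after 5 minutes of inactivity (optional)
        ttl: 300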

    P.S. By the way, a single RX 9070 XT on my instance handles translating posts and comments. You can test it if you want. =)