Do you host your own ML / AI / LLM? What do you use, and what do you use it for?

  • alexcleac@szmer.info
    link
    fedilink
    English
    arrow-up
    1
    ·
    27 minutes ago

    I’ve been running ministral on CPU on a home-server: works pretty nicely, not very performant for everyday tasks and the savings were not sufficient for it to make sense. It still was cheaper and faster to just use Mistral API and get better models.

    • zutto@lemmy.fedi.zutto.fi
      link
      fedilink
      English
      arrow-up
      1
      ·
      47 minutes ago

      This dwarfstar looks interesting, can you elaborate on your setup and what kind of inference speeds you are getting?

  • placebo@lemmy.zip
    link
    fedilink
    English
    arrow-up
    3
    ·
    3 hours ago

    I tried Qwen 3.6 a3b and Gemma 4 a4b, but both were too stupid for everyday work.

  • dfgxx@lemmy.zip
    link
    fedilink
    English
    arrow-up
    4
    ·
    4 hours ago

    I ran through lmstudio because it really eazy, I ran some kind of qwen 3.6 27b imatrix neo code DI, it is the best local model for coding I tried, I think it can be better than some cloud model

  • JustEnoughDucks@slrpnk.net
    link
    fedilink
    English
    arrow-up
    1
    ·
    3 hours ago

    I run Handy with Parakeet for speech to text, and home assistant with Whiper for the same. Whisper+ on my phone.

    I think that counts. But I have more relevant and useful things to do on my hardware and no 2000€+ to get LLM-capable hardware 😂

  • wrinkle2409@lemmy.cafe
    link
    fedilink
    English
    arrow-up
    4
    arrow-down
    1
    ·
    5 hours ago

    I set up ollama on our thinkstation in the lab and I use it for looking up documentation, generating readmes, searching papers, and sometimes coding when I know what to do but don’t feel it is worth it to spend time on it myself. So basically the chat with web search.

    • pinball_wizard@lemmy.zip
      link
      fedilink
      English
      arrow-up
      2
      ·
      4 hours ago

      I agree that the concerns listed there are smells, and I wasn’t aware of some of the options listed there.

      Thank you for sharing this!

    • comrademiao@piefed.social
      link
      fedilink
      English
      arrow-up
      6
      arrow-down
      3
      ·
      edit-2
      6 hours ago

      looks like extreme nitpicking without any real issues beyond some VC funding a FOSS issues.

      //whyre you spamming the comment to everyone? its quite alarmist actually

      • brucethemoose@lemmy.world
        link
        fedilink
        English
        arrow-up
        7
        arrow-down
        1
        ·
        edit-2
        4 hours ago

        I completely disagree.

        Frankly, I find the description “VC funding a FOSS” offensive. They aren’t funding the engine. I’ve been messing with LLM inference engines since 2022, and Ollama is the worst I’ve seen in the community.

        They misname models for SEO. They leech off llama.cpp while deliberately hiding attribution yet redirecting GH support requests there. They sometimes make their own GGUFs+forked releases which are broken and incompatibile with upstream llama.cpp, just so they can get a release out a day ahead for hype, even though it doesn’t really work and they’ll never upstream one line. They set a default context size thats basically unusable, they screw up chat templates and deep internal code with no obvious indicators, they release suboptimal quants without iMatrix, they gate you into their internal quantization repo and model card format, they hide model downloads on your hard drive, they mess with standard APIs for no good reason other than to mess up other backends. I could go on and on.

        And if that’s all fine, they’re enshittifying the app with closed code, and pointers to cloud models.

        They GIVE LLM inference a bad name, by making it a terrible quality engine that happens to show up in search as the “default.” Hence the comments below of people being unimpressed with local inference. And they sap attention from actual llama.cpp devs, without contributing a single dime. Everyone in the localllama communtity hates their guts, and that’s not even getting into the interpersonal drama they’ve stirred.

        They are a leech that’s a net drag to the whole community, that we can’t get rid of because they’re attention grifters. And they’ve gotten worse and worse over time.


        It’s more morale to use any cloud API over Ollama, in my eyes. They’re a grift.


        EDIT: And, to be clear, I’m not against VC funded downstream stuff.

        LM Studio is good! Even though it’s closed source.

        Tons of downstream projects are great.

  • algernon@lemmy.ml
    link
    fedilink
    English
    arrow-up
    57
    arrow-down
    8
    ·
    11 hours ago

    Yes. My Actual Intelligence lives in my head, and runs mostly on coffee.

  • D_Air1@lemmy.ml
    link
    fedilink
    English
    arrow-up
    14
    arrow-down
    1
    ·
    9 hours ago

    Yeah, I’m using qwen 31b a3b on an amd 9070xt requires a bit of cpu offloading, but still plenty fast. Using it wall llama.cpp. Combine that with some mcp’s such as ddg-search to make it truly useful by actually being able to search online.

    I mostly use it for small tedious tasks with well defined inputs and outputs. For example when hyprland recently changed from their own configuration language to lua. At first I started going line by line translating my config to the new lua language until I realized oh wait this is exactly the type of thing that ML is useful for. Going from the well defined hyprland configuration language to their also well defined lua syntax. It banged it out in less than a minute with only a single mistake which I easily fixed. The mistake it made was that it forgot to translate the comments to lua. It did it in less than a minute and worked first try. Where as I had made several typos and gotten a few lines wrong when I was doing it by hand.

    Not to say that I couldn’t do it. I would have gotten it done in about half an hour, but less than a minute is a lot faster.

    I also used it to transform a bunch of unstructured data into json data, so that I could then use purpose built tools like jq to parse that. If I’m having trouble finding certain information. I’ll ask it to find me some resources to look at.

    Basically small well defined tasks and parsing data is what I use it for and it seems to be pretty good at that.

    What I don’t like is the way companies try to market it to people. I don’t believe people should be trying to summarize emails or messages from loved ones, writing essays or any other creative tasks for the most part. Translating is okay. I don’t expect a machine to be able to decide things for me or to be some filter between me and others.

  • Steve@startrek.website
    link
    fedilink
    English
    arrow-up
    7
    arrow-down
    1
    ·
    8 hours ago

    I recently gave it a try with qwen3.5 and deepseek coder v2. I have a RTX3090 and these are the largest models that can run comfortably on it.

    Conclusion, they are both fucking useless. Free tier claude runs circles.

    • e0qdk@reddthat.com
      link
      fedilink
      English
      arrow-up
      3
      ·
      3 hours ago

      If you just pulled the default version of qwen3.5 from ollama’s repo you downloaded a mediocre one that only uses ~6GB.

      Check ollama show qwen3.5 and see if you get something like this in the result:

        Model
          architecture        qwen35    
          parameters          9.7B      
          context length      262144    
          embedding length    4096      
          quantization        Q4_K_M 
      

      This is the default version I got when I first tried using ollama without any experience. It worked, but it’s a heavily quantized, lower parameter version of the model – i.e. it’s pretty dumb – compared to what you can actually run on your hardware.

    • brucethemoose@lemmy.world
      link
      fedilink
      English
      arrow-up
      2
      arrow-down
      1
      ·
      8 hours ago

      Did you serve them with ollama?

      It’s basically broken, if you did. Try the same models over API, and you’ll see what I mean.

        • brucethemoose@lemmy.world
          link
          fedilink
          English
          arrow-up
          5
          arrow-down
          2
          ·
          edit-2
          7 hours ago

          https://sleepingrobots.com/dreams/stop-using-ollama/

          And that’s not even all of it. Basically they break models in many ways, and they’re slimey Tech Bros.

          LM Studio is better, and easy.

          If you’re on Nvidia, and want to run optimally, I would use the ik_llama.cpp fork. On AMD, regular llama.cpp. On a Mac, use an MLX runner (Like LM Studio) with an MLX quant (ideally an MLX-DWQ quant).

          It’s all pretty technical, and… thats kinda the point. LLMs are just too performance sensitive and too finicky to not have a grasp of how they work. There is no “easy button” to run them without bad results, there can’t be.

          But if you don’t have time for that and just want to see if it’s worth it, I’d suggest self hosing your own UI, and trying the dirt cheap APIs of models you can theoretically run on your setup. This will give you a “best case” taste of what they’re capable of.

        • brucethemoose@lemmy.world
          link
          fedilink
          English
          arrow-up
          3
          arrow-down
          1
          ·
          edit-2
          7 hours ago

          Oh, and I just saw you have a 3090.

          To get more specific, you can actually run way better models than Qwen 3.5 and Deepseek coder (both of which are very obsolete now). The best that’s practical depends on how much CPU RAM you have, but at the minimum you can do Qwen 3.6 27B, with a more optimal quant like ones here: https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF/tree/main

          Or Gemma 31B QAT: https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF

          If you have 128GB CPU RAM, I can upload my custom MiMo 2.5 quant. That should “beat” the cheapest Claude, give or take.

          If you have 64GB, I’d suggest a quantization of Step 3.7.

          If you have 32GB or 48, I’m not sure. I’d need to look if any “small” MoE is actually better than Qwen 27B now.