How to Setup a Local Coding Agent on macOS

dofm • 1 hour ago

Useful stuff in here that I wish I'd seen a few days ago :-)

I am not convinced that the MTP setup for the QAT model adds very much in terms of speed on my M1 Max, but it is definitely worth experimenting with.

Fiddling about with local models has done so much for my conceptual understanding of what is going on.

FWIW, and YMMV, but I also found that the Gemma 4 MTP head was occasionally breaking markup in Opencode, so the thinking displayed untidily and, in some cases, the stop token was ultimately missed. So I've stopped using MTP there for now.

Recent Qwen 3.6 models have developer role support, so they will occasionally surprise you with a structured multiple-choice questionnaire.

ig0r0 • 1 hour ago

I wrote a similar post some time ago, but using ollama and opencode: https://blog.kulman.sk/running-local-llm-coding-server/

namnnumbr • 1 hour ago

oMLX (https://github.com/jundot/omlx) makes running the MLX inference server quite easy for those interested in UI-based hosting. oMLX also supports MTP or dflash drafting.

c-hendricks • 2 hours ago

Not sure you really need huggingface-cli to download anything if you're just using llama.cpp. You can pass `-hf ...` and it will download the models for you. Set `LLAMA_CACHE` to change where the downloads go:

  LLAMA_CACHE="models" ./llama-server \
    -hf unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL \
    ...

dofm • 2 hours ago

Yes.

And `-hfd` does the same for the draft model.
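
Roughly, the two flags can be combined as in the sketch below. The draft repo name is a made-up placeholder (swap in whichever draft model you actually use), and the remaining server flags are left out, as in the command above. It only prints the command so you can check it before running anything.

```shell
# Main model repo, taken from the command earlier in the thread.
MAIN_REPO="unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL"
# Hypothetical placeholder, not a real repository: replace before use.
DRAFT_REPO="your-org/your-draft-model-GGUF"
# Where llama.cpp should store downloaded models.
export LLAMA_CACHE="models"

# Print the llama-server invocation instead of running it.
build_cmd() {
  echo "./llama-server -hf $MAIN_REPO -hfd $DRAFT_REPO"
}

build_cmd
```

Once the printed line looks right, paste it into a terminal and append your other options.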

cdolan • 2 hours ago

Is there a link to the video? It did not render when I went to the page. I'm curious about the real-time feel of this.