GoatLLM is a VS Code extension that chats with open-source models running on your own hardware. No accounts. No API keys. No tokens billed. Drop in MLX, Ollama, LM Studio, or any OpenAI-compatible server and code with an agent that has read, write, and shell tools — entirely offline.
Refactor src/server.ts to stream tokens and add a /healthz endpoint

I refactored handleRequest to stream via res.write chunks and added a /healthz route that returns { status: "ok", uptime }. Should I run the tests?
Works out-of-the-box with Ollama, LM Studio, MLX, and llama.cpp.
Everything Cursor and Copilot do, against a model you control. No prompt logging, no rate limits, no "your code helped train…" footnote.
One click probes the common ports — 11434 (Ollama), 1234 (LM Studio), 8013 (MLX), 8080 (llama.cpp) — and wires endpoints up.
Native tool calling for read_file, write_file, list_directory, and run_command, with approval gates on writes and shell.
A hands-off mode where the agent iterates against your codebase until the task is done. Great for refactors, scaffolding, and migrations.
Run MLX on this Mac, vLLM on a Thunderbolt-attached box, and a remote exo cluster — flip between them from the status bar.
Streaming responses with token throughput, prompt tokens, and total latency surfaced right in the status bar.
No telemetry. No logins. API keys (if you bring any) sit in VS Code SecretStorage, never in settings.json.
Pick whichever you prefer. Both give you the same extension.
Download the .vsix below (~100 KB), then drag the goatllm-vscode-1.0.1.vsix file directly onto the Extensions panel.
Alternative: from the command line, run code --install-extension goatllm-vscode-1.0.1.vsix.
Open the Extensions panel, type GoatLLM in the search box, and install the extension published by goatllm.
Marketplace publication is in review — until it lands, use the .vsix method above.
Once published, this page will update with a direct install link and you'll receive auto-updates.
GoatLLM speaks the OpenAI-compatible HTTP API, so anything that exposes /v1/chat/completions works. Here are the four most common paths.
The fastest path on any platform. Cross-platform binary, model registry, zero config.
# 1. Install Ollama
brew install ollama # macOS
curl -fsSL https://ollama.com/install.sh | sh # Linux
# 2. Pull a coding model
ollama pull qwen2.5-coder:32b
# 3. Start the server (usually auto-starts)
ollama serve
# 4. In VS Code: open the GoatLLM sidebar →
# click "Detect local servers". Done.
Default endpoint: http://localhost:11434/v1
LM Studio is the friendliest GUI: search for models, download them, and toggle a local server from the menu bar.
Pick a coding model (e.g. Qwen2.5 Coder 32B Instruct), download it, start the local server, then run Detect local servers in the GoatLLM sidebar.
Default endpoint: http://localhost:1234/v1
Native Apple Silicon performance via mlx-lm. Pulls models directly from Hugging Face.
# 1. Install MLX (Apple Silicon only)
pip install -U "git+https://github.com/ml-explore/mlx-lm.git"
# 2. Start the server with any HF model
mlx_lm.server \
--model mlx-community/Qwen2.5-Coder-32B-Instruct-4bit \
--port 8013 --host localhost
# 3. The model auto-downloads from huggingface.co on first run.
# 4. In VS Code: GoatLLM sidebar → Detect local servers.
Default endpoint: http://localhost:8013/v1. Browse compatible weights on huggingface.co/mlx-community.
The bare-metal option. GGUF quantizations, runs everywhere, smallest footprint.
# 1. Build llama.cpp (or grab a release binary)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
# 2. Download a GGUF from Hugging Face, then:
./llama-server -m ./models/qwen2.5-coder-32b-instruct-q4_k_m.gguf \
--port 8080 --host 127.0.0.1
# 3. In VS Code: GoatLLM sidebar → Detect local servers.
Default endpoint: http://localhost:8080/v1
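Whichever runtime you pick, it's worth sanity-checking the endpoint from a terminal before wiring it into GoatLLM. A minimal sketch, assuming Ollama's default port and the model pulled above (swap the port and model id for your setup):
# List the models the server exposes
curl http://localhost:11434/v1/models
# Request a completion over the same API GoatLLM uses
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-coder:32b",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}]
      }'
If both calls return JSON, the endpoint is ready for Detect local servers.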
GoatLLM ships with three operating modes. Pick one from the dropdown at the top of the chat panel. Approval gates only apply in Agent mode.
| Mode | Tools available | Approval policy | Best for |
|---|---|---|---|
| Chat | none — pure conversation | — | Q&A, explanations, brainstorming |
| Agent | read_file, list_directory, write_file, run_command | Reads auto-approve. Writes & shell prompt for confirmation. | Reviewed edits, debugging with file context |
| Agent (full access) | All four tools | Everything auto-approves (deny list still enforced) | Hands-off refactors, scaffolding, long-running tasks |
Agent modes use OpenAI-style tool_choice: auto. Your local model must support tool calling.
Verified to work: Qwen 2.5-Coder, Llama 3.1+, Gemma 2+, DeepSeek-Coder-V2, Mistral 0.3+,
Phi-3.5, and most fine-tunes thereof.
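For reference, an OpenAI-style tool-calling request looks roughly like the sketch below. This is the standard shape of the protocol rather than GoatLLM's exact payload, and the list_directory schema shown is only illustrative:
# Illustrative tool-calling request against a local OpenAI-compatible server
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-coder:32b",
        "messages": [{"role": "user", "content": "What is in the src directory?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "list_directory",
            "description": "List the entries of a directory",
            "parameters": {
              "type": "object",
              "properties": { "path": { "type": "string" } },
              "required": ["path"]
            }
          }
        }],
        "tool_choice": "auto"
      }'
# A tool-capable model replies with a tool_calls array instead of plain text;
# a model without tool support typically ignores the tools field or errors out.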
Every option is under goatllm.* in VS Code settings. API keys live in SecretStorage — they never get written to settings.json.
| Setting | What it does | Default |
|---|---|---|
| goatllm.endpoints | Array of {name, baseUrl, apiKey?} — every server you've connected. | auto-populated by Detect |
| goatllm.activeEndpoint | Name of the currently-selected endpoint. | first detected |
| goatllm.defaultModel | Default model id; falls back to the first entry from /v1/models. | unset |
| goatllm.temperature | 0 = deterministic → 2 = creative. | 0.4 |
| goatllm.maxTokens | Cap on tokens generated per response. | 4096 |
| goatllm.systemPrompt.chat | Override the system prompt for Chat mode. | built-in |
| goatllm.systemPrompt.agent | Override the system prompt for Agent mode. | built-in |
| goatllm.systemPrompt.agentFull | Override the system prompt for Agent (full access). | built-in |
| goatllm.commandDenyList | Extra substring patterns to block in Agent modes. | [] |
| goatllm.allowSudo | Allow sudo in Agent modes. Off by default for safety. | false |
Run a model on another machine and point GoatLLM at it. Useful for a beefy Mac Studio or a Linux box with an RTX GPU.
// In VS Code settings.json
{
"goatllm.endpoints": [
{ "name": "Studio (MLX)", "baseUrl": "http://10.0.0.20:8013/v1" },
{ "name": "GPU box (vLLM)", "baseUrl": "http://10.0.0.30:8000/v1" },
{ "name": "Local Ollama", "baseUrl": "http://localhost:11434/v1" }
],
"goatllm.activeEndpoint": "Local Ollama"
}
Click the GoatLLM status bar item to flip between endpoints without leaving your editor.
Agent modes are powerful — they execute commands on your machine. Here's exactly what GoatLLM allows.
Even in Agent (full access), commands matching these patterns are refused:
- rm -rf / and variants targeting /, /*, ~, $HOME
- mkfs and dd writes to block devices
- Fork bombs (:(){ :|:& };: and friends)
- sudo — unless goatllm.allowSudo is explicitly set to true
- Anything matching goatllm.commandDenyList

In Agent mode, these prompt before running:
- write_file — every write, with a diff preview
- run_command — every shell call, with the full command shown

Reads (read_file, list_directory) auto-approve in both Agent modes because they're side-effect-free.
GoatLLM only makes HTTP requests to endpoints you've configured under goatllm.endpoints; there is no telemetry and no phone-home traffic of any kind.
If you connect to a server that requires an API key:
- The key is stored in SecretStorage (OS keychain on macOS, libsecret on Linux, DPAPI on Windows)
- It is never written to settings.json or any workspace file

GoatLLM polls GET /v1/models from your active server. Use the model picker at the top of the sidebar — switching is instant and per-conversation, so you can prototype with a small model and finalize with a bigger one.
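If your server does require a key, most OpenAI-compatible runtimes expect it as a Bearer token, so a quick manual check might look like this (the host and key are placeholders taken from the remote-endpoints example above, not values GoatLLM requires):
# Hypothetical authenticated endpoint; substitute your own host and key
curl http://10.0.0.30:8000/v1/models \
  -H "Authorization: Bearer $MY_API_KEY"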
For a 7B coding model in Q4: any modern laptop with 8 GB RAM. For 32B at decent quality: 32 GB+ unified memory (M1/M2/M3 Pro/Max) or a 24 GB GPU. The extension itself is <1 MB and uses negligible resources — the heavy lifting is in your runtime of choice.
Two common reasons: (1) the model wasn't fine-tuned for tools — try Qwen 2.5-Coder or Llama 3.1+ instead; (2) the runtime doesn't pass the tools field through. Ollama, LM Studio, MLX, and llama.cpp all support tool calling on recent versions. Check the runtime's release notes.
The OpenAI endpoint works as a fallback (baseUrl: "https://api.openai.com/v1" + API key), but GoatLLM is designed for local use — that's the entire point. For cloud, Copilot and Cursor are perfectly good. For "I want this code never to leave my laptop," that's where GoatLLM shines.
Right now the source repo is private. Email b.charleson1@gmail.com with a description and (if relevant) the contents of the GoatLLM output channel (View → Output → GoatLLM). Public issue tracking is coming once the Marketplace listing goes live.
Open VS Code's Output panel (⇧⌘U / Ctrl+Shift+U) and pick GoatLLM from the dropdown. Network traffic, tool calls, and any tokenizer/throughput weirdness all surface there.
Tokens per second on the assistant's response, computed once the stream completes. GoatLLM uses a tuned 3.7 chars/token fallback when the runtime doesn't report token counts directly, with a 150 ms minimum-sample guard so very short replies don't post inflated numbers.
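As a rough worked example of that fallback (illustrative numbers, not measurements): a 1,480-character reply streamed over 4 seconds estimates to 1480 / 3.7 = 400 tokens, or about 100 tokens/second.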
Yes. Install the .vsix on a machine with no network. Point GoatLLM at http://localhost:<port>/v1. It will run forever without internet — no license server, no activation, no phone-home.
- Code styling inside cells.
- Model picker fed by /v1/models with one-click switching.
- Tool calling: read_file, list_directory, write_file, run_command.
- Command deny list (rm -rf /, mkfs, fork bombs, sudo) with user-extendable patterns.
- API keys in SecretStorage.
- Editor menus: Explain Selection, Generate Code.

Stop renting intelligence by the token.
Install GoatLLM