GoatLLM is a VS Code extension that chats with open-source models running on your own hardware. No accounts. No API keys. No tokens billed. Drop in MLX, Ollama, LM Studio, or any OpenAI-compatible server and code with an agent that has read, write, and shell tools — entirely offline.
Refactor src/server.ts to stream tokens and add a /healthz endpoint

I refactored handleRequest to stream via res.write chunks and added a /healthz route that returns { status: "ok", uptime }. Should I run the tests?
Works out-of-the-box with Ollama, LM Studio, MLX, and llama.cpp.
Everything Cursor and Copilot do, against a model you control. No prompt logging, no rate limits, no "your code helped train…" footnote.
One click probes the common ports — 11434 (Ollama), 1234 (LM Studio), 8013 (MLX), 8080 (llama.cpp) — and wires endpoints up.
Native tool calling for read_file, write_file, list_directory, and run_command, with approval gates on writes and shell.
A hands-off mode where the agent iterates against your codebase until the task is done. Great for refactors, scaffolding, and migrations.
Run MLX on this Mac, vLLM on a Thunderbolt-attached box, and a remote exo cluster — flip between them from the status bar.
Streaming responses with token throughput, prompt tokens, and total latency surfaced right in the status bar.
No telemetry. No logins. API keys (if you bring any) sit in VS Code SecretStorage, never in settings.json.
Pick whichever you prefer. Both give you the same extension.
Download the .vsix below (~100 KB), then drag the goatllm-vscode-1.0.1.vsix file directly onto the Extensions panel.
Alternative: from the command line, run code --install-extension goatllm-vscode-1.0.1.vsix.
Open the Extensions panel, type GoatLLM in the search box, and install the extension published by goatllm.
Marketplace publication is in review — until it lands, use the .vsix method above.
Once published, this page will update with a direct install link and you'll receive auto-updates.
GoatLLM speaks the OpenAI-compatible HTTP API, so anything that exposes /v1/chat/completions works. Here are the four most common paths.
The fastest path on any platform. Cross-platform binary, model registry, zero config.
# 1. Install Ollama
brew install ollama # macOS
curl -fsSL https://ollama.com/install.sh | sh # Linux
# 2. Pull a coding model
ollama pull qwen2.5-coder:32b
# 3. Start the server (usually auto-starts)
ollama serve
# 4. In VS Code: open the GoatLLM sidebar →
# click "Detect local servers". Done.
Default endpoint: http://localhost:11434/v1
LM Studio is the friendliest GUI: search for models, download them, and toggle a local server from the menu bar.
Pick a coding model (e.g. Qwen2.5 Coder 32B Instruct), download it, start the local server, then run Detect local servers in the GoatLLM sidebar.
Default endpoint: http://localhost:1234/v1
Native Apple Silicon performance via mlx-lm. Pulls models directly from Hugging Face.
# 1. Install MLX (Apple Silicon only)
pip install -U "git+https://github.com/ml-explore/mlx-lm.git"
# 2. Start the server with any HF model
mlx_lm.server \
--model mlx-community/Qwen2.5-Coder-32B-Instruct-4bit \
--port 8013 --host localhost
# 3. The model auto-downloads from huggingface.co on first run.
# 4. In VS Code: GoatLLM sidebar → Detect local servers.
Default endpoint: http://localhost:8013/v1. Browse compatible weights on huggingface.co/mlx-community.
The bare-metal option. GGUF quantizations, runs everywhere, smallest footprint.
# 1. Build llama.cpp (or grab a release binary)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
# 2. Download a GGUF from Hugging Face, then:
./llama-server -m ./models/qwen2.5-coder-32b-instruct-q4_k_m.gguf \
--port 8080 --host 127.0.0.1
# 3. In VS Code: GoatLLM sidebar → Detect local servers.
Default endpoint: http://localhost:8080/v1
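Whichever runtime you pick, it's worth sanity-checking the endpoint from a terminal before wiring it into GoatLLM. A minimal sketch, assuming Ollama's default port and the model pulled above (swap the port and model id for your setup):
# List the models the server exposes
curl http://localhost:11434/v1/models
# Request a completion over the same API GoatLLM uses
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-coder:32b",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}]
      }'
If both calls return JSON, the endpoint is ready for Detect local servers.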
GoatLLM ships with three operating modes. Pick one from the dropdown at the top of the chat panel. Approval gates only apply in Agent mode.
| Mode | Tools available | Approval policy | Best for |
|---|---|---|---|
| Chat | none — pure conversation | — | Q&A, explanations, brainstorming |
| Agent | read_file, list_directory, write_file, run_command | Reads auto-approve. Writes & shell prompt for confirmation. | Reviewed edits, debugging with file context |
| Agent (full access) | All four tools | Everything auto-approves (deny list still enforced) | Hands-off refactors, scaffolding, long-running tasks |
Agent modes use OpenAI-style tool_choice: auto. Your local model must support tool calling.
Verified to work: Qwen 2.5-Coder, Llama 3.1+, Gemma 2+, DeepSeek-Coder-V2, Mistral 0.3+,
Phi-3.5, and most fine-tunes thereof.
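For reference, an OpenAI-style tool-calling request looks roughly like the sketch below. This is the standard shape of the protocol rather than GoatLLM's exact payload, and the list_directory schema shown is only illustrative:
# Illustrative tool-calling request against a local OpenAI-compatible server
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-coder:32b",
        "messages": [{"role": "user", "content": "What is in the src directory?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "list_directory",
            "description": "List the entries of a directory",
            "parameters": {
              "type": "object",
              "properties": { "path": { "type": "string" } },
              "required": ["path"]
            }
          }
        }],
        "tool_choice": "auto"
      }'
# A tool-capable model replies with a tool_calls array instead of plain text;
# a model without tool support typically ignores the tools field or errors out.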
Every option is under goatllm.* in VS Code settings. API keys live in SecretStorage — they never get written to settings.json.
| Setting | What it does | Default |
|---|---|---|
| goatllm.endpoints | Array of {name, baseUrl, apiKey?} — every server you've connected. | auto-populated by Detect |
| goatllm.activeEndpoint | Name of the currently-selected endpoint. | first detected |
| goatllm.defaultModel | Default model id; falls back to the first entry from /v1/models. | unset |
| goatllm.temperature | 0 = deterministic → 2 = creative. | 0.4 |
| goatllm.maxTokens | Cap on tokens generated per response. | 4096 |
| goatllm.systemPrompt.chat | Override the system prompt for Chat mode. | built-in |
| goatllm.systemPrompt.agent | Override the system prompt for Agent mode. | built-in |
| goatllm.systemPrompt.agentFull | Override the system prompt for Agent (full access). | built-in |
| goatllm.commandDenyList | Extra substring patterns to block in Agent modes. | [] |
| goatllm.allowSudo | Allow sudo in Agent modes. Off by default for safety. | false |
Run a model on another machine and point GoatLLM at it. Useful for a beefy Mac Studio or a Linux box with an RTX GPU.
// In VS Code settings.json
{
"goatllm.endpoints": [
{ "name": "Studio (MLX)", "baseUrl": "http://10.0.0.20:8013/v1" },
{ "name": "GPU box (vLLM)", "baseUrl": "http://10.0.0.30:8000/v1" },
{ "name": "Local Ollama", "baseUrl": "http://localhost:11434/v1" }
],
"goatllm.activeEndpoint": "Local Ollama"
}
Click the GoatLLM status bar item to flip between endpoints without leaving your editor.
Agent modes are powerful — they execute commands on your machine. Here's exactly what GoatLLM allows.
Even in Agent (full access), commands matching these patterns are refused:
- rm -rf / and variants targeting /, /*, ~, $HOME
- mkfs and dd writes to block devices
- Fork bombs (:(){ :|:& };: and friends)
- sudo — unless goatllm.allowSudo is explicitly set to true
- Anything matching goatllm.commandDenyList

In Agent mode, these prompt before running:
- write_file — every write, with a diff preview
- run_command — every shell call, with the full command shown

Reads (read_file, list_directory) auto-approve in both Agent modes because they're side-effect-free.
GoatLLM only makes HTTP requests to endpoints you've configured under goatllm.endpoints; there is no telemetry and no phone-home traffic of any kind.
If you connect to a server that requires an API key:
- The key is stored in SecretStorage (OS keychain on macOS, libsecret on Linux, DPAPI on Windows)
- It is never written to settings.json or any workspace file

GoatLLM polls GET /v1/models from your active server. Use the model picker at the top of the sidebar — switching is instant and per-conversation, so you can prototype with a small model and finalize with a bigger one.
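If your server does require a key, most OpenAI-compatible runtimes expect it as a Bearer token, so a quick manual check might look like this (the host and key are placeholders taken from the remote-endpoints example above, not values GoatLLM requires):
# Hypothetical authenticated endpoint; substitute your own host and key
curl http://10.0.0.30:8000/v1/models \
  -H "Authorization: Bearer $MY_API_KEY"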
For a 7B coding model in Q4: any modern laptop with 8 GB RAM. For 32B at decent quality: 32 GB+ unified memory (M1/M2/M3 Pro/Max) or a 24 GB GPU. The extension itself is <1 MB and uses negligible resources — the heavy lifting is in your runtime of choice.
Two common reasons: (1) the model wasn't fine-tuned for tools — try Qwen 2.5-Coder or Llama 3.1+ instead; (2) the runtime doesn't pass the tools field through. Ollama, LM Studio, MLX, and llama.cpp all support tool calling on recent versions. Check the runtime's release notes.
The OpenAI endpoint works as a fallback (baseUrl: "https://api.openai.com/v1" + API key), but GoatLLM is designed for local use — that's the entire point. For cloud, Copilot and Cursor are perfectly good. For "I want this code never to leave my laptop," that's where GoatLLM shines.
Right now the source repo is private. Email b.charleson1@gmail.com with a description and (if relevant) the contents of the GoatLLM output channel (View → Output → GoatLLM). Public issue tracking is coming once the Marketplace listing goes live.
Open VS Code's Output panel (⇧⌘U / Ctrl+Shift+U) and pick GoatLLM from the dropdown. Network traffic, tool calls, and any tokenizer/throughput weirdness all surface there.
Tokens per second on the assistant's response, computed once the stream completes. GoatLLM uses a tuned 3.7 chars/token fallback when the runtime doesn't report token counts directly, with a 150 ms minimum-sample guard so very short replies don't post inflated numbers.
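As a rough worked example of that fallback (illustrative numbers, not measurements): a 1,480-character reply streamed over 4 seconds estimates to 1480 / 3.7 = 400 tokens, or about 100 tokens/second.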
Yes. Install the .vsix on a machine with no network. Point GoatLLM at http://localhost:<port>/v1. It will run forever without internet — no license server, no activation, no phone-home.
- Code styling inside cells.
- Model picker fed by /v1/models with one-click switching.
- Tool calling: read_file, list_directory, write_file, run_command.
- Command deny list (rm -rf /, mkfs, fork bombs, sudo) with user-extendable patterns.
- API keys in SecretStorage.
- Editor menus: Explain Selection, Generate Code.

Stop renting intelligence by the token.
Install GoatLLM