Open-Weight Models

An open-weight model ships its trained parameters publicly (Llama 3, Mistral, Qwen, Gemma) so you can download, run, fine-tune, and quantize it on your own hardware.

Why it matters

Open weights give an agent three things hosted APIs can’t: data stays in your VPC, the model can’t be deprecated out from under you, and marginal cost is GPU-hours not per-token — decisive at high volume or under strict privacy. The trade is that you own the serving stack, the quality ceiling is usually a step behind the frontier closed-weight-models, and “open weights” rarely means open data or a truly free license.

How it works

You pull a checkpoint (often from Hugging Face) and serve it behind an OpenAI-compatible endpoint with an inference engine like vLLM, TGI, or llama.cpp.

Quantization shrinks weights from fp16 → 8/4-bit so a model fits a given GPU; a 70B model needs ~140 GB at fp16 but ~40 GB at 4-bit, at a small quality cost.
Serving throughput comes from batching and paged-KV-cache (vLLM); a single A100 can serve an 8B model to many concurrent agents.
Customization you fully control: LoRA/QLoRA fine-tunes (see fine-tuning-vs-prompt-engineering), custom tool grammars, logit bias, and forced-grammar JSON.

Size class	fp16 VRAM	4-bit VRAM	Typical use
7–8B	~16 GB	~6 GB	fast tool-calling agents
13B	~26 GB	~10 GB	better reasoning
70B	~140 GB	~40 GB	near-frontier, multi-GPU

Example

Serve Llama-3-8B and call it like OpenAI:

vllm serve meta-llama/Meta-Llama-3-8B-Instruct --port 8000
POST http://localhost:8000/v1/chat/completions
body: { model:"...8B-Instruct", messages:[...] }
→ runs entirely in your VPC; no per-token bill, just GPU time

Swapping a hosted endpoint for this is a base-URL change when the server is OpenAI-compatible.

Pitfalls

License ≠ open source. Llama’s license restricts some commercial use and forbids training competitors; read it before shipping.
Quantize too aggressively. 3-bit can quietly wreck tool-call formatting and math; benchmark your own evals, not just perplexity.
Underestimating ops. GPUs, autoscaling, KV-cache OOMs, and upgrades are now your problem — a 429 becomes a 3am page.
Weaker native function calling. Many open models need a server-enforced JSON grammar to match hosted tool-use reliability.

tech-studies

Explorer

Open-Weight Models

Open-Weight Models

Why it matters

How it works

Example

Pitfalls

See also

Graph View

Table of Contents

Backlinks