Run local LLMs on Apple Silicon, around twice as fast.

MTPLX is a native Mac app and a command-line tool for running local language models with multi-token prediction (MTP). Modern models like Qwen 3.5/3.6 ship with built-in MTP heads, but almost no runtime uses them. MTPLX does: the model drafts several tokens ahead of itself, verifies them in one batched forward pass, and keeps only what passes exact rejection sampling. Same model, same output distribution, measured 1.6x faster on a 16 GB M4 Mac mini and 2.24x on an M5 Max.

There is no second draft model eating your RAM, and no greedy shortcut that quietly changes what the model would have said at real sampling settings. The acceptance math is the Leviathan and Chen rejection sampling theorem with residual correction, so temperature=0.6, top_p=0.95 behaves exactly like normal decoding, just faster.
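The accept-or-resample rule described above can be sketched in a few lines. This is a minimal illustration of speculative rejection sampling with residual correction over a toy vocabulary, not MTPLX's own code: the names `accept_or_resample`, `residual_distribution`, `target_probs`, and `draft_probs` are hypothetical, and the distributions are plain Python lists rather than model outputs.

```python
import random

def residual_distribution(target_probs, draft_probs):
    """Normalized max(0, p - q): the distribution to resample from after a rejection."""
    residual = [max(0.0, p - q) for p, q in zip(target_probs, draft_probs)]
    total = sum(residual)
    if total == 0.0:
        # p and q are identical, so a rejection cannot happen; fall back to p.
        return list(target_probs)
    return [r / total for r in residual]

def accept_or_resample(token, target_probs, draft_probs, rng=random):
    """Return (token, accepted) for one drafted token.

    `token` is assumed to have been sampled from draft_probs, so q > 0.
    It is kept with probability min(1, p / q); otherwise a replacement
    is drawn from the residual distribution.
    """
    p, q = target_probs[token], draft_probs[token]
    if rng.random() < min(1.0, p / q):
        return token, True
    residual = residual_distribution(target_probs, draft_probs)
    replacement = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return replacement, False
```

Whenever the target assigns a drafted token at least as much probability as the draft did, the token is always kept. The residual step is what makes the combined procedure sample from the target distribution rather than from the draft.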

Get it

The Mac app is the easiest way in. Download the DMG at mtplx.com, drag it to Applications, and the app takes care of everything else: it checks your hardware, recommends a model that actually fits your memory, downloads it, sets up its own Python engine (no Homebrew needed), installs fan control, puts mtplx on your PATH, and then measures your machine to pick the fastest decoding depth.

The CLI on its own:

brew install youssofal/mtplx/mtplx
mtplx start

Or, if you prefer pip, run python3 -m pip install mtplx. All releases are listed at mtplx.com/releases.

Requirements: Apple Silicon (M1 or newer), macOS 14+. 16 GB of memory runs the 4B and 9B models comfortably; 27B wants 32 GB and up. The app checks this for you before recommending anything.

The app

The dashboard shows what your model is doing while it does it: live tokens per second, acceptance rate by draft depth, the verify waterfall, cache state, and system pressure. When you start a chat, code an agent against the local server, or run a benchmark, the numbers are right there.

Chat is native, streams with thinking cards, takes file attachments, and can search the web. One click launches OpenCode, Pi, Hermes, Open WebUI, or anything else that speaks the OpenAI or Anthropic API against your local server. There is also a built-in AIME benchmark runner with fully disclosed, coaching-free prompts, so you can score a model yourself instead of trusting a chart.

Auto-tune

The right draft depth depends on your specific Mac: chip, memory bandwidth, thermals. During onboarding (and any time after), MTPLX runs the real model on your machine at each depth, with fans pinned for clean timing, and keeps autoregressive decoding as the baseline. If an MTP depth beats it, that depth is saved. If nothing beats the baseline, nothing is saved and the app says so. From the terminal it is one command:

mtplx tune --model <model-or-path> --retune

On a 16 GB M4 Mac mini, tuning the 9B model lands on depth 1: 14.4 tok/s baseline becomes 23.0 tok/s.
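The selection rule is simple enough to sketch. The function below is an illustration of the "save a depth only if it beats the baseline" logic described above, assuming throughput has already been measured; `pick_depth` and its arguments are hypothetical names, not part of the MTPLX API.

```python
def pick_depth(baseline_tps, depth_tps):
    """Return (depth, speedup) for the fastest depth that beats the baseline, else None.

    baseline_tps: autoregressive tokens per second.
    depth_tps: mapping of draft depth -> measured tokens per second.
    """
    best = None
    for depth, tps in depth_tps.items():
        if tps > baseline_tps and (best is None or tps > best[1]):
            best = (depth, tps)
    if best is None:
        return None  # nothing beat autoregressive decoding, so nothing is saved
    depth, tps = best
    return depth, tps / baseline_tps
```

With the Mac mini figures above, `pick_depth(14.4, {1: 23.0})` selects depth 1 at roughly 1.6x.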

Forge: make your own MTP models

Forge takes a Hugging Face repo and turns it into an MTPLX-ready MTP model: convert to MLX, train the MTP adapter, verify that the result is actually faster and still exact, and publish back to the Hub if you want to share it. The honest part matters: Forge measures before and after on your hardware and shows you the verdict ("Depth 1 is fastest: 227.1 to 296.1, 1.30x") rather than assuming the adapter helped. Available in the app and as mtplx forge.

The official catalog lives on Hugging Face under Youssofal: Qwen 3.5 (4B, 9B), Qwen 3.6 (27B, 35B MoE) in speed, balance, and quality builds, plus Gemma 4. The app recommends from these based on your hardware.

The server

mtplx start (or the app's play button) serves an OpenAI-compatible API on 127.0.0.1:8000: /v1/chat/completions, /v1/completions, /v1/models, plus an Anthropic-compatible /v1/messages with streaming, tool calls in both styles, /health, and /metrics. Claude Code, Cline, Continue, Open WebUI, curl, the openai and anthropic Python clients: if it speaks the API, it works. The app and CLI share one server, so mtplx start attaches to the app's running model instead of loading a second copy.

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"mtplx","messages":[{"role":"user","content":"hi"}],"stream":true}'
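The same request can be made from Python with only the standard library. This is a sketch, assuming the server streams OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`); the helper names `build_request`, `iter_content`, and `stream_chat` are hypothetical, and the model name "mtplx" is taken from the curl example.

```python
import json
import urllib.request

URL = "http://127.0.0.1:8000/v1/chat/completions"

def build_request(prompt, model="mtplx"):
    """Build the same streaming POST as the curl command above."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def iter_content(lines):
    """Yield text deltas from an iterable of server-sent-event lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        if choices:
            text = (choices[0].get("delta") or {}).get("content")
            if text:
                yield text

def stream_chat(prompt):
    """Print the reply as it arrives. Requires a running local server."""
    with urllib.request.urlopen(build_request(prompt)) as response:
        for text in iter_content(response):
            print(text, end="", flush=True)
    print()
```

Calling `stream_chat("hi")` with the server running prints the reply token by token. The parsing is kept separate from the network call so it can be checked against canned lines without a server.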

Sessions survive: a warm-prefix session bank keeps multi-turn chats fast, and an optional SSD cache restores sessions near-instantly across restarts.

CLI quick reference

mtplx start                # interactive: pick model, mode, surface, then chat
mtplx serve --port 8000    # API server only
mtplx stop                 # stop the running server cleanly
mtplx pull <hf-repo>       # download a model safely
mtplx models               # what is cached, sizes, validation
mtplx inspect <model>      # compatibility report before anything runs
mtplx tune --retune        # measure AR vs D1/D2/D3 on your Mac
mtplx forge                # build, verify, and publish MTP models
mtplx bench aime --quick   # run the AIME benchmark from the terminal
mtplx doctor               # install and integration health
mtplx max --install        # fan control (one sudo prompt, crash-safe)
mtplx settings get/set     # read or change live server settings

Every command takes --json and --help. The CLI works without MLX installed for everything that does not need a model, so doctor and inspect run on any machine.

Modes

Mode	What it does	When
Sustained	Default. Long-context MTP path with chunked prefill and request-sized KV	Everyday use, big files, 16K-200K prompts
Sustained Max	Sustained with fans pinned at 100%	Long work where you want maximum cooling
Burst	Legacy short-context benchmark lane, loud	Short prompts and benchmarks only

Fan-backed modes restore your fans to automatic if MTPLX dies for any reason, including kill -9 and closing the terminal. A detached watchdog handles it; this is verified on hardware, not assumed.

Compatibility, honestly

mtplx inspect classifies any model into four tiers before anything runs: verified, architecture-compatible but unverified, incompatible architecture, or no MTP heads at all. Unverified models refuse to run unless you explicitly force them. There are no silent fallbacks: if MTPLX cannot run a model correctly, it tells you instead of running it badly.
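The gating rule can be pictured as a small decision function. The sketch below is illustrative only: the `Tier` enum and `may_run` are hypothetical names, and treating the two lowest tiers as never runnable, even when forced, is an assumption drawn from the "no silent fallbacks" statement rather than from MTPLX's code.

```python
from enum import Enum

class Tier(Enum):
    VERIFIED = "verified"
    UNVERIFIED = "architecture-compatible but unverified"
    INCOMPATIBLE = "incompatible architecture"
    NO_MTP = "no MTP heads"

def may_run(tier, force=False):
    """Decide whether a model in the given tier is allowed to run."""
    if tier is Tier.VERIFIED:
        return True
    if tier is Tier.UNVERIFIED:
        return force  # runs only when the user explicitly forces it
    return False      # assumption: forcing cannot rescue these tiers
```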

What MTPLX is not

Not an external-drafter system. The drafter is the target model's own MTP heads.
Not a greedy-argmax trick. Acceptance is exact rejection sampling, correct at any temperature.
Not a CUDA project. MTPLX is MLX-native and Apple Silicon first. For Linux, use vLLM.

License and credit

Apache-2.0: use it, modify it, ship it commercially. Keep the license and NOTICE attribution if you redistribute. MTPLX builds on MLX and the Qwen and Gemma model families; the speculative sampling math follows Leviathan and Chen (2023). Fan control via ThermalForge. Model weights remain governed by their upstream licenses.

If MTPLX powers a public project, benchmark, or paper, please credit it:

Powered by MTPLX by Youssof Altoukhi https://github.com/youssofal/MTPLX

Built by Youssof Altoukhi. Bug reports and benchmark replications welcome via Issues.
