Mistral dropped their first "flagship merged" model on April 29, 2026. Mistral Medium 3.5 consolidates three separate model lines into one dense 128B set of weights, and it changes how you actually use Mistral models.

Architecture

Mistral Medium 3.5 is dense, not MoE—all 128B parameters fire on every token. Context window is 256K tokens. A vision encoder was trained from scratch to support variable image sizes and aspect ratios. The model can run on as few as 4 GPUs for self-hosted inference.

| Spec | Value |
| --- | --- |
| Parameters | 128B (dense) |
| Context window | 256K tokens |
| Architecture | Dense transformer + vision encoder |
| Inference | vLLM, llama.cpp, Ollama, SGLang |
| Min GPUs | 4 (self-hosted) |
| Release | April 29, 2026 |
| License | Modified MIT |
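
Since vLLM is first on the supported-engine list, here is a minimal self-hosting sketch using its offline Python API. The Hugging Face repo id comes from the Sources list below, and tensor_parallel_size=4 mirrors the "min 4 GPUs" spec; day-one vLLM support for this architecture is an assumption, not something the release notes confirm.

```python
# Self-hosting sketch using vLLM's offline inference API.
# Assumption: vLLM supports the Mistral-Medium-3.5 architecture.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Medium-3.5-128B",  # repo id from Sources
    tensor_parallel_size=4,  # shard the dense 128B weights across 4 GPUs
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Explain speculative decoding in two sentences."], params)
print(outputs[0].outputs[0].text)
```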

The key architectural feature: configurable reasoning_effort per request. Set it to "none" for instant chat replies, "high" for complex agentic tasks with internal reasoning traces wrapped in [THINK] tags. Same weights, different behavior depending on what you ask.
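
To make the per-request control concrete, here is a hedged sketch against Mistral's chat-completions endpoint. The endpoint and response shape follow Mistral's public API; the model alias and the exact reasoning_effort field name are taken from this article and not verified against the official docs.

```python
# Toggling reasoning effort per request (field name assumed from the article).
import os
import requests

def ask(prompt: str, effort: str) -> str:
    resp = requests.post(
        "https://api.mistral.ai/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
        json={
            "model": "mistral-medium-3.5",  # hypothetical API alias
            "messages": [{"role": "user", "content": prompt}],
            "reasoning_effort": effort,  # "none" for chat, "high" for [THINK] traces
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(ask("Quick status update, please.", "none"))          # instant reply
print(ask("Refactor plan for a legacy monolith?", "high"))  # internal reasoning on
```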

Benchmarks

| Benchmark | Mistral Medium 3.5 | Devstral 2 | Qwen 3.5 397B-A17B |
| --- | --- | --- | --- |
| SWE-Bench Verified | 77.6% | ~72% | ~75% |
| τ³-Telecom | 91.4% | ~85% | ~88% |
| AIME 2025 | Limited data | ~40% | 83.1% |

The SWE-Bench Verified result is the operationally important one. At 77.6%, Mistral Medium 3.5 is ahead of its own coding specialist (Devstral 2) and ahead of Qwen 3.5 397B, a model with roughly 3x the total parameters. It's competitive with proprietary leaders without being best-in-class.

Pricing

| Metric | Cost |
| --- | --- |
| Input (per 1M tokens) | $1.50 |
| Output (per 1M tokens) | $7.50 |

At $1.50/$7.50, it sits between GPT-5.4 mini and Claude Sonnet 4.6 on cost-per-intelligence. Compared to Mistral Large 3 ($0.50/$1.50), it's 3x more expensive on input and 5x on output. But it replaces three separate models, so for teams that were running Magistral + Devstral + Medium 3.1, the blended cost may actually be lower.
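
A quick back-of-envelope check of that multiple, using the prices above. The monthly token volumes here are hypothetical placeholders, not measured traffic.

```python
# Cost comparison at the published per-1M-token prices.
PRICES = {  # (input, output) in USD per 1M tokens
    "mistral-medium-3.5": (1.50, 7.50),
    "mistral-large-3": (0.50, 1.50),
}

def monthly_cost(model: str, input_m_tok: float, output_m_tok: float) -> float:
    p_in, p_out = PRICES[model]
    return input_m_tok * p_in + output_m_tok * p_out

# Hypothetical workload: 100M input / 20M output tokens per month.
print(monthly_cost("mistral-medium-3.5", 100, 20))  # 300.0
print(monthly_cost("mistral-large-3", 100, 20))     # 80.0
```

Whether the merged model wins on blended cost depends on how much of that traffic previously went to the pricier reasoning and coding endpoints.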

What's Actually New

The merged model is the headline, but two features matter more:

Async remote agents in Vibe CLI. You spawn a coding task from your terminal or Le Chat, and Mistral works through it independently in the cloud. Sessions run in parallel. You can inspect them mid-flight with file diffs and tool call traces. A local CLI session can be "teleported" up to the cloud when you need to walk away.

Le Chat Work mode. An agent that reads emails, checks calendars, searches documents, and uses multiple tools in parallel across a single conversation. Sessions persist longer than a typical chat reply, so it can work through trial and error over many turns.

Both connect to GitHub, Linear, Jira, Sentry, Slack, and Teams—so the agent can open PRs, file issues, check CI status, and report back without you watching.

The Community Take

"Mistral Medium 3.5 is the most interesting open-weight release of the year so far, not because it crosses the proprietary frontier, but because it is the first time a Western lab has shipped a single dense 128B model that is genuinely good at coding, reasoning, and chat at the same time."

On r/LocalLLaMA, the reception is split. Performance is impressive for a 128B dense model, but hardware requirements are real. A Q4 quant still needs ~80GB VRAM—you're looking at a Mac Studio with 128GB or dual RTX 6000s for serious local use.
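
The ~80GB figure is easy to sanity-check: a Q4_K_M-style quant averages roughly 4.8 bits per weight, and KV cache plus activations add headroom on top. A sketch, with the bit-width and overhead as assumptions:

```python
# Rough VRAM estimate for a 4-bit quant of a 128B dense model.
params = 128e9
bits_per_weight = 4.8  # approximate Q4_K_M average (assumption)

weights_gb = params * bits_per_weight / 8 / 1e9
print(f"weights alone: {weights_gb:.1f} GB")  # ~76.8 GB before KV cache
```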

"The 77.6% SWE-Bench Verified result puts Mistral Medium 3.5 ahead of Mistral's own previous coding specialist, Devstral 2... it is also competitive with the proprietary leaders without being best-in-class."

The License Gotcha

Mistral released the weights under a "Modified MIT License." It allows most commercial use but restricts building competing hosted-API services and requires attribution/notification for certain use cases. Safe for internal fine-tuning and vertical-specific customer solutions. Friction if you want to build a hosted service that competes with Mistral's API.

Verdict

Mistral Medium 3.5 is the Pareto model for the 128B class: ~85-90% of what 400B+ giants do, at a fraction of the hardware cost. If you're already in the Mistral ecosystem, it simplifies your stack from three models to one. If you're evaluating open-weight options for agentic workflows, this is the one to test against Qwen 3.5 and DeepSeek V4.

The EAGLE head for speculative decoding (a lightweight draft head whose guesses the full model verifies in one forward pass) makes 128B inference practical at scale. The Vibe CLI async agent workflow is genuinely new. The model itself is solid, not spectacular. That's enough.

Sources

https://huggingface.co/mistralai/Mistral-Medium-3.5-128B
https://letsdatascience.com/blog/mistral-medium-3-5-128b-open-weight-merged-model
https://docs.mistral.ai/models/overview
https://www.reddit.com/r/MistralAI/comments/1sz1yxh/mistral_medium_35_benchmarks/
https://ollama.com/library/mistral-medium-3.5
https://artificialanalysis.ai/providers/mistral