What It Is

Mistral Small 4 dropped March 16, 2026. It's the first model to unify three previously separate products into one: Magistral (reasoning), Pixtral (multimodal), and Devstral (agentic coding). One model, three modes.

The headline feature is configurable reasoning. Set reasoning_effort="none" for fast chat. Set it to "high" for deep chain-of-thought. Same deployment, adjustable per request.
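
A sketch of what that per-request control could look like client-side; the model identifier and the helper function here are assumptions for illustration, only the reasoning_effort parameter comes from the release itself:

```python
def build_request(prompt, effort="none"):
    """Build a chat-completions JSON body for one request.

    effort maps straight onto reasoning_effort: "none" for fast
    chat, "high" for deep chain-of-thought, chosen per request.
    """
    return {
        "model": "mistral-small-4",  # assumed identifier
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,
    }

# Same deployment, two very different requests:
fast = build_request("Summarize this support ticket.")
deep = build_request("Prove this loop invariant holds.", effort="high")
```

The point of the design is that `fast` and `deep` hit the same endpoint; only the one parameter differs.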

Key Insight: This isn't just cost optimization. It's architectural innovation. MoE models with configurable routing have been theorized for years. Mistral shipped it.

Technical Specifications

Spec                    Value
Total Parameters        119B
Active Params/Token     ~6.5B
Architecture            MoE (128 experts, 4 active)
Context Window          256K tokens
Modalities              Text + Image
License                 Apache 2.0 (fully open)
Output Speed            137-177 tokens/sec
Time to First Token     0.97-4.84s
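
The parameter gap in the table (119B total, ~6.5B active) comes from top-k expert routing: a router scores all 128 experts for each token and only the top 4 actually run. A minimal sketch of that gating step, with made-up scores (not Mistral's actual router):

```python
import math

def route_token(router_logits, k=4):
    # Indices of the k highest-scoring experts for this token.
    top = sorted(range(len(router_logits)), key=router_logits.__getitem__)[-k:]
    # Softmax over only the selected experts' scores.
    mx = max(router_logits[i] for i in top)
    w = [math.exp(router_logits[i] - mx) for i in top]
    total = sum(w)
    return top, [x / total for x in w]

# 128 router scores, with 4 experts clearly ahead of the rest:
logits = [0.0] * 128
for i, score in [(3, 2.0), (17, 1.0), (42, 3.0), (99, 0.5)]:
    logits[i] = score

experts, weights = route_token(logits)  # set(experts) == {3, 17, 42, 99}
```

Only those four experts' parameters are touched for this token, which is how a 119B model runs at roughly dense-7B cost per token.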

Hardware Requirements

  • Minimum: 4x NVIDIA H100, 2x H200, or 1x DGX B200 at 16-bit; a single 80GB H100 suffices with 4-bit quantization
  • VRAM: ~60-70GB (4-bit quantized), ~240GB (16-bit)
  • Supported runtimes: vLLM, llama.cpp, SGLang, Transformers
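
Those VRAM figures fall out of simple arithmetic on the 119B parameter count. A back-of-envelope check (weights only; this sanity check is mine, not from the spec sheet, and KV cache plus activations add overhead on top):

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int4": 0.5}

def weight_vram_gb(total_params_billion, fmt):
    # Weights-only footprint in GB; runtime overhead (KV cache,
    # activations) is why the quoted range sits above the raw number.
    return total_params_billion * BYTES_PER_PARAM[fmt]

print(weight_vram_gb(119, "int4"))  # 59.5 -> consistent with the ~60-70GB figure
print(weight_vram_gb(119, "fp16"))  # 238.0 -> consistent with ~240GB
```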

Benchmarks

Benchmark             Mistral Small 4    GPT-OSS 120B    DeepSeek R1
AIME 2025             93%                ~85%            76.0%
LiveCodeBench         64%                63%             77.0%
GPQA Diamond          71.2%              -               81.3%
Intelligence Index    27.8               -               -
AA LCR                0.72               -               -

Efficiency: On AA LCR, Mistral scores 0.72 while emitting about 1.6K characters. Qwen needs 5.8-6.1K characters for comparable performance, roughly 3.6-3.8x more output.
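
That ratio is just the character counts divided out, and because output is billed per token, output spend scales down by roughly the same factor:

```python
mistral_kchars = 1.6          # thousands of characters per response
qwen_kchars = (5.8, 6.1)      # Qwen's range for comparable AA LCR performance

ratios = [q / mistral_kchars for q in qwen_kchars]  # roughly 3.6x and 3.8x
```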

DeepSeek R1 leads on most raw-performance benchmarks, though Mistral takes AIME 2025. But Mistral's input pricing is 9x lower. Different tradeoffs for different workflows.

Pricing

Model                Input ($/1M)        Output ($/1M)
Mistral Small 4      $0.15               $0.60
GPT-5.4 Mini         $0.75 (5x more)     $4.50 (7.5x more)
DeepSeek R1          $1.35 (9x more)     $4.20 (7x more)
Gemini Flash-Lite    $0.075              -

The value proposition: at $0.15/M input, Mistral Small 4 is among the cheapest multimodal reasoning models available. Flash-Lite is cheaper but lacks configurable reasoning.

Community Sentiment

Reddit r/LocalLLaMA (PROS):

  • "Best open-weight small model for combined workloads"
  • "$0.60/1M output is a steal"
  • Apache 2.0 praised for commercial freedom

Reddit r/MistralAI (CONS):

  • "Kind of awful with images" (API testing feedback)
  • "Lost to Chinese/Korean/Saudi models badly"
  • Document OCR: Qwen 85.5 vs Mistral 66 (math OCR weakest)

Hacker News (#47404575):

  • "MoE models keep beating much larger dense ones"
  • "Just enough to fit onto single H100 with 4-bit quant"
  • Mixed views on benchmark trustworthiness

Known Limitations

  1. Image handling: Multiple reports of poor multimodal performance
  2. Spatial reasoning: SVG generation failures in testing
  3. Context limit: 256K vs competitors' 400K-1M+
  4. Math OCR: 66 vs Qwen 85.5 on document math
  5. Benchmark transparency: Selective publishing vs DeepSeek

Real-World Use Cases

Best For:

  • Cost-conscious high-volume deployments
  • Single-model simplicity requirements
  • Open-source/self-hosting needs (Apache 2.0)
  • EU-hosted inference (data sovereignty)
  • Variable-complexity pipelines (configurable reasoning)

Not Best For:

  • Maximum reasoning performance (use DeepSeek R1)
  • Image-intensive workflows (reported issues)
  • Contexts beyond 256K tokens
  • Computer use/autonomous agents

The Bottom Line

Mistral Small 4 isn't trying to beat DeepSeek on raw benchmarks. It's trying to win on value: 5x cheaper input and 7.5x cheaper output than GPT-5.4 Mini, 9x cheaper input than DeepSeek R1, an Apache 2.0 license, and the first configurable reasoning architecture shipped to production.

For enterprise buyers running millions of tokens daily, the math is straightforward. DeepSeek R1 costs $1.35/M input. Mistral Small 4 costs $0.15/M. That's $1.20 saved per million tokens. Scale that across a year.
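
The per-token delta scales linearly with volume. With an assumed 10M input tokens/day (the volume is illustrative; the prices are from the table above):

```python
def annual_input_savings(mtok_per_day, price_per_m_a, price_per_m_b):
    # Yearly savings from moving input traffic from price b to price a,
    # in dollars, given daily volume in millions of tokens.
    return mtok_per_day * (price_per_m_b - price_per_m_a) * 365

# Mistral Small 4 ($0.15/M) vs DeepSeek R1 ($1.35/M) at 10M tokens/day:
print(annual_input_savings(10, 0.15, 1.35))  # about $4,380/year, input side only
```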

The configurable reasoning feature is the real innovation. One model handles both fast chat and deep reasoning. No need to maintain separate deployments. No need to route requests between Magistral and Small 3.2. Same API endpoint, different reasoning_effort parameter.

March 2026 was a blitz for Mistral: 6 products in 15 days. Small 4, Voxtral TTS, Leanstral, Forge, Spaces CLI, and founding membership in the NVIDIA Nemotron Coalition. ARR hit $400M. Valuation $13.8B. The "European OpenAI" label is starting to look less like hype.