Executive Summary
DeepSeek just dropped V4 on April 24, 2026. Two MoE models: V4-Pro (1.6T total / 49B activated) and V4-Flash (284B total / 13B activated). Both with 1M context window. MIT license.
This is the first open-weights model with native 1M-token context. The efficiency gains are dramatic: at 1M tokens, V4 needs 27% of V3.2's FLOPs and 10% of its KV cache.
Model Specifications
| Model | Total Params | Activated | Context | Precision | License |
|---|---|---|---|---|---|
| V4-Flash-Base | 284B | 13B | 1M | FP8 Mixed | MIT |
| V4-Flash | 284B | 13B | 1M | FP4+FP8 | MIT |
| V4-Pro-Base | 1.6T | 49B | 1M | FP8 Mixed | MIT |
| V4-Pro | 1.6T | 49B | 1M | FP4+FP8 | MIT |
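A back-of-envelope sketch of what those numbers mean for memory and per-token compute. It assumes roughly 1 byte per parameter at FP8 and 0.5 bytes at FP4, and ignores activations, KV cache, and serving overhead:

```python
# Back-of-envelope memory math for the V4 variants. Assumptions: ~1 byte per
# parameter at FP8, ~0.5 bytes at FP4; activations, KV cache, and serving
# overhead are ignored.
GB = 1e9

models = {
    "V4-Flash": {"total": 284e9, "active": 13e9},
    "V4-Pro":   {"total": 1.6e12, "active": 49e9},
}

for name, m in models.items():
    fp8_gb = m["total"] * 1.0 / GB        # FP8 weights
    fp4_gb = m["total"] * 0.5 / GB        # FP4 weights
    active = m["active"] / m["total"]     # fraction of params used per token
    print(f"{name}: ~{fp8_gb:,.0f} GB (FP8) / ~{fp4_gb:,.0f} GB (FP4), "
          f"{active:.1%} of params active per token")
```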
Architectural Innovations
Hybrid Attention (CSA + HCA)
Compressed Sparse Attention plus Heavily Compressed Attention. This isn't just an optimization; it's an architectural change. In 1M-context scenarios:
- 27% of V3.2's single-token inference FLOPs
- 10% of V3.2's KV cache
The practical upshot: V3.2 would choke on a 1M-token context; V4 doesn't. Same capability, radically lower compute.
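To make the KV-cache figure concrete, here is an illustrative calculation at 1M tokens. The layer, head, and dimension values are hypothetical placeholders, not published V4 numbers; the point is what "10% of the KV cache" means at this scale:

```python
# Illustrative KV-cache arithmetic at 1M tokens. Layer/head/dim values are
# hypothetical placeholders, not published V4 numbers.
def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    # K and V per token per layer: 2 * kv_heads * head_dim elements
    return tokens * layers * 2 * kv_heads * head_dim * bytes_per_elem / 1e9

baseline = kv_cache_gb(1_000_000, layers=60, kv_heads=8, head_dim=128, bytes_per_elem=1)
print(f"uncompressed-style cache at 1M tokens: ~{baseline:.0f} GB")
print(f"at 10% of that (the V4 claim):         ~{0.10 * baseline:.0f} GB")
```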
Manifold-Constrained Hyper-Connections (mHC)
Strengthened residual connections that improve signal propagation across layers without sacrificing expressivity. The technical report calls it "stability enhancement"; in practice, it means deeper networks that train without degrading.
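For intuition, here is a toy sketch of the general hyper-connections idea: several residual streams mixed by learnable weights around each block. This is not DeepSeek's mHC formulation, and the normalization standing in for the "manifold constraint" is purely illustrative:

```python
import numpy as np

# Toy sketch of hyper-connection-style residual mixing: keep several residual
# streams and let learnable weights combine them around each block. Not the
# mHC formulation; the weight normalization is a stand-in for a constraint.
def hyper_connection_step(streams, block, mix_in, mix_out):
    # streams: (n_streams, d) residual states; mix_in / mix_out: (n_streams,)
    mix_in = mix_in / mix_in.sum()            # crude constraint: weights sum to 1
    x = mix_in @ streams                      # combine streams into block input
    y = block(x)                              # any layer: attention, MLP, ...
    return streams + np.outer(mix_out, y)     # write block output back to streams

rng = np.random.default_rng(0)
streams = rng.normal(size=(4, 16))            # 4 residual streams, width 16
out = hyper_connection_step(streams, block=np.tanh,
                            mix_in=np.ones(4), mix_out=np.full(4, 0.25))
print(out.shape)  # (4, 16)
```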
Muon Optimizer
Faster convergence and greater stability during training. Pre-training ran on more than 32T tokens; at that scale, the optimizer choice matters.
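For readers unfamiliar with Muon, a minimal sketch of its core update as described in the public reference implementation: momentum, followed by Newton-Schulz orthogonalization of matrix-shaped updates. Nothing below comes from DeepSeek's own training code:

```python
import numpy as np

# Minimal Muon-style update sketch: momentum, then approximate orthogonalization
# of the matrix-shaped update via Newton-Schulz iterations. Coefficients follow
# the public Muon reference implementation.
def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)        # normalize so the iteration converges
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # quintic Newton-Schulz step
    return X.T if transposed else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    momentum = beta * momentum + grad        # plain momentum accumulation
    update = newton_schulz_orthogonalize(momentum)
    return W - lr * update, momentum
```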
Benchmarks
Base Model Comparison
| Benchmark | V3.2-Base | V4-Flash-Base | V4-Pro-Base |
|---|---|---|---|
| MMLU (EM) | 87.8 | 88.7 | 90.1 |
| MMLU-Pro | 65.5 | 68.3 | 73.5 |
| HumanEval (Pass@1) | 62.8 | 69.5 | 76.8 |
| GSM8K | 91.1 | 90.8 | 92.6 |
| MATH | 60.5 | 57.4 | 64.5 |
| LongBench-V2 | 40.2 | 44.7 | 51.5 |
The jump from 62.8 to 76.8 on HumanEval is real: 14 points absolute, roughly a 22% relative improvement in code generation.
V4-Pro-Max vs Frontier Models
| Benchmark | Claude Opus-4.6 | GPT-5.4 | Gemini-3.1 | V4-Pro-Max |
|---|---|---|---|---|
| LiveCodeBench | 88.8 | - | 91.7 | 93.5 |
| Codeforces | - | 3168 | 3052 | 3206 |
| Apex Shortlist | 85.9 | 78.1 | 89.1 | 90.2 |
V4-Pro-Max posts the top score on every coding benchmark in the table. Open weights. MIT license.
The Codeforces rating of 3206 is competitive-programmer territory: not "AI-assisted coding," but genuine competitive programming performance.
API Pricing
| Model | Input, cache hit ($ / 1M tokens) | Input, cache miss ($ / 1M tokens) | Output ($ / 1M tokens) |
|---|---|---|---|
| V4-Flash | $0.028 | $0.14 | $0.28 |
| V4-Pro | $0.145 | $1.74 | $3.48 |
V4-Flash at $0.14 per million input tokens (cache miss) and $0.28 per million output tokens. That's cheaper than most closed-source alternatives for short queries, and the $0.028 cache-hit price is aggressive.
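As a rough illustration of what those rates mean for a long-context workload. The prompt size, cache-hit ratio, and output length below are assumptions, not measurements:

```python
# Rough cost for one long-context request on V4-Flash, using the rates above.
# The prompt size, cache-hit ratio, and output length are assumptions.
CACHE_HIT, CACHE_MISS, OUTPUT = 0.028, 0.14, 0.28   # $ per 1M tokens

prompt_tokens, hit_ratio, output_tokens = 1_000_000, 0.80, 4_000
cost = (prompt_tokens * hit_ratio * CACHE_HIT
        + prompt_tokens * (1 - hit_ratio) * CACHE_MISS
        + output_tokens * OUTPUT) / 1e6
print(f"~${cost:.3f} per request")   # ~$0.052 with these assumptions
```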
Community Sentiment
Hacker News: #1 story with 535+ points, 237+ comments.
"The documentation is cleaner than OpenAI's. They explain the architecture, show the benchmarks, give you the weights. No marketing fluff." — HN commenter
Reddit r/LocalLLaMA: 6 viral threads.
"1M context in open weights finally. This democratizes what Google and Anthropic have been gatekeeping." — Reddit user
"Flash is incredibly inexpensive. $0.14/M input is basically free for most use cases." — Reddit user
The hardware requirements thread drew real discussion:
"V4-Flash (284B/13B) should run on consumer hardware with proper quantization. The 13B activated path is the key." — LocalLLaMA
Limitations
- No multimodality — text-only, confirmed in technical report
- Preview version — not production-stable yet
- Hardware requirements — V4-Pro needs significant compute for local deployment
Quick Start
# API access
pip install openai

# Point the OpenAI-compatible client at the DeepSeek API
from openai import OpenAI
client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")
# Models: deepseek-v4-pro, deepseek-v4-flash
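A minimal request sketch, assuming the endpoint is OpenAI-compatible as above; the model name comes from the list in the comment and the prompt is illustrative:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")

# Cheaper Flash variant; swap in deepseek-v4-pro for harder tasks
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Explain MoE routing in two sentences."}],
)
print(response.choices[0].message.content)
```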
Weights available on HuggingFace under MIT license.
The Bottom Line
1M context. MIT license. Beats frontier models on coding benchmarks. 27% of V3.2's FLOPs at full context. The question isn't whether this matters; it's how quickly the ecosystem adapts.
Sources: HuggingFace, DeepSeek API docs, Hacker News, Reddit r/LocalLLaMA