Executive Summary
DeepSeek just dropped V4 on April 24, 2026. Two MoE models: V4-Pro (1.6T total / 49B activated) and V4-Flash (284B total / 13B activated). Both with 1M context window. MIT license.
This is the first open-weights model with native 1M-token context. The efficiency gains are dramatic: at 1M tokens, V4 needs 27% of V3.2's FLOPs and 10% of its KV cache.
Model Specifications
| Model | Total Params | Activated | Context | Precision | License |
|---|---|---|---|---|---|
| V4-Flash-Base | 284B | 13B | 1M | FP8 Mixed | MIT |
| V4-Flash | 284B | 13B | 1M | FP4+FP8 | MIT |
| V4-Pro-Base | 1.6T | 49B | 1M | FP8 Mixed | MIT |
| V4-Pro | 1.6T | 49B | 1M | FP4+FP8 | MIT |
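A back-of-envelope sketch of what those numbers mean for memory and per-token compute. It assumes roughly 1 byte per parameter at FP8 and 0.5 bytes at FP4, and ignores activations, KV cache, and serving overhead:

```python
# Back-of-envelope memory math for the V4 variants. Assumptions: ~1 byte per
# parameter at FP8, ~0.5 bytes at FP4; activations, KV cache, and serving
# overhead are ignored.
GB = 1e9

models = {
    "V4-Flash": {"total": 284e9, "active": 13e9},
    "V4-Pro":   {"total": 1.6e12, "active": 49e9},
}

for name, m in models.items():
    fp8_gb = m["total"] * 1.0 / GB        # FP8 weights
    fp4_gb = m["total"] * 0.5 / GB        # FP4 weights
    active = m["active"] / m["total"]     # fraction of params used per token
    print(f"{name}: ~{fp8_gb:,.0f} GB (FP8) / ~{fp4_gb:,.0f} GB (FP4), "
          f"{active:.1%} of params active per token")
```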
Architectural Innovations
Hybrid Attention (CSA + HCA)
Compressed Sparse Attention plus Heavily Compressed Attention. This isn't just an optimization; it's an architectural change. In 1M-context scenarios:
- 27% of V3.2's single-token inference FLOPs
- 10% of V3.2's KV cache
The practical upshot: V3.2 would choke on a 1M-token context; V4 doesn't. Same capability, radically lower compute.
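To make the KV-cache figure concrete, here is an illustrative calculation at 1M tokens. The layer, head, and dimension values are hypothetical placeholders, not published V4 numbers; the point is what "10% of the KV cache" means at this scale:

```python
# Illustrative KV-cache arithmetic at 1M tokens. Layer/head/dim values are
# hypothetical placeholders, not published V4 numbers.
def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    # K and V per token per layer: 2 * kv_heads * head_dim elements
    return tokens * layers * 2 * kv_heads * head_dim * bytes_per_elem / 1e9

baseline = kv_cache_gb(1_000_000, layers=60, kv_heads=8, head_dim=128, bytes_per_elem=1)
print(f"uncompressed-style cache at 1M tokens: ~{baseline:.0f} GB")
print(f"at 10% of that (the V4 claim):         ~{0.10 * baseline:.0f} GB")
```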
Manifold-Constrained Hyper-Connections (mHC)
Strengthened residual connections that improve signal propagation across layers without sacrificing expressivity. The technical report calls it "stability enhancement"; in practice, it means deeper networks that train without degrading.
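For intuition, here is a toy sketch of the general hyper-connections idea: several residual streams mixed by learnable weights around each block. This is not DeepSeek's mHC formulation, and the normalization standing in for the "manifold constraint" is purely illustrative:

```python
import numpy as np

# Toy sketch of hyper-connection-style residual mixing: keep several residual
# streams and let learnable weights combine them around each block. Not the
# mHC formulation; the weight normalization is a stand-in for a constraint.
def hyper_connection_step(streams, block, mix_in, mix_out):
    # streams: (n_streams, d) residual states; mix_in / mix_out: (n_streams,)
    mix_in = mix_in / mix_in.sum()            # crude constraint: weights sum to 1
    x = mix_in @ streams                      # combine streams into block input
    y = block(x)                              # any layer: attention, MLP, ...
    return streams + np.outer(mix_out, y)     # write block output back to streams

rng = np.random.default_rng(0)
streams = rng.normal(size=(4, 16))            # 4 residual streams, width 16
out = hyper_connection_step(streams, block=np.tanh,
                            mix_in=np.ones(4), mix_out=np.full(4, 0.25))
print(out.shape)  # (4, 16)
```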
Muon Optimizer
Faster convergence and greater stability during training. Pre-training ran on more than 32T tokens; at that scale, the optimizer choice matters.
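For readers unfamiliar with Muon, a minimal sketch of its core update as described in the public reference implementation: momentum, followed by Newton-Schulz orthogonalization of matrix-shaped updates. Nothing below comes from DeepSeek's own training code:

```python
import numpy as np

# Minimal Muon-style update sketch: momentum, then approximate orthogonalization
# of the matrix-shaped update via Newton-Schulz iterations. Coefficients follow
# the public Muon reference implementation.
def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)        # normalize so the iteration converges
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # quintic Newton-Schulz step
    return X.T if transposed else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    momentum = beta * momentum + grad        # plain momentum accumulation
    update = newton_schulz_orthogonalize(momentum)
    return W - lr * update, momentum
```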
Benchmarks
Base Model Comparison
| Benchmark | V3.2-Base | V4-Flash-Base | V4-Pro-Base |
|---|---|---|---|
| MMLU (EM) | 87.8 | 88.7 | 90.1 |
| MMLU-Pro | 65.5 | 68.3 | 73.5 |
| HumanEval (Pass@1) | 62.8 | 69.5 | 76.8 |
| GSM8K | 91.1 | 90.8 | 92.6 |
| MATH | 60.5 | 57.4 | 64.5 |
| LongBench-V2 | 40.2 | 44.7 | 51.5 |
The jump from 62.8 to 76.8 on HumanEval is real: 14 points absolute, roughly a 22% relative improvement in code generation.
V4-Pro-Max vs Frontier Models
| Benchmark | Claude Opus-4.6 | GPT-5.4 | Gemini-3.1 | V4-Pro-Max |
|---|---|---|---|---|
| LiveCodeBench | 88.8 | - | 91.7 | 93.5 |
| Codeforces | - | 3168 | 3052 | 3206 |
| Apex Shortlist | 85.9 | 78.1 | 89.1 | 90.2 |
V4-Pro-Max posts the top score on every coding benchmark in the table. Open weights. MIT license.
The Codeforces rating of 3206 is competitive-programmer territory: not "AI-assisted coding," but genuine competitive programming performance.
API Pricing
| Model | Input, cache hit ($ / 1M tokens) | Input, cache miss ($ / 1M tokens) | Output ($ / 1M tokens) |
|---|---|---|---|
| V4-Flash | $0.028 | $0.14 | $0.28 |
| V4-Pro | $0.145 | $1.74 | $3.48 |
V4-Flash at $0.14 per million input tokens (cache miss) and $0.28 per million output tokens. That's cheaper than most closed-source alternatives for short queries, and the $0.028 cache-hit price is aggressive.
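As a rough illustration of what those rates mean for a long-context workload. The prompt size, cache-hit ratio, and output length below are assumptions, not measurements:

```python
# Rough cost for one long-context request on V4-Flash, using the rates above.
# The prompt size, cache-hit ratio, and output length are assumptions.
CACHE_HIT, CACHE_MISS, OUTPUT = 0.028, 0.14, 0.28   # $ per 1M tokens

prompt_tokens, hit_ratio, output_tokens = 1_000_000, 0.80, 4_000
cost = (prompt_tokens * hit_ratio * CACHE_HIT
        + prompt_tokens * (1 - hit_ratio) * CACHE_MISS
        + output_tokens * OUTPUT) / 1e6
print(f"~${cost:.3f} per request")   # ~$0.052 with these assumptions
```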
Community Sentiment
Hacker News: #1 story with 535+ points, 237+ comments.
"The documentation is cleaner than OpenAI's. They explain the architecture, show the benchmarks, give you the weights. No marketing fluff." — HN commenter
Reddit r/LocalLLaMA: 6 viral threads.
"1M context in open weights finally. This democratizes what Google and Anthropic have been gatekeeping." — Reddit user
"Flash is incredibly inexpensive. $0.14/M input is basically free for most use cases." — Reddit user
The hardware requirements thread drew real discussion:
"V4-Flash (284B/13B) should run on consumer hardware with proper quantization. The 13B activated path is the key." — LocalLLaMA
Limitations
- No multimodality — text-only, confirmed in technical report
- Preview version — not production-stable yet
- Hardware requirements — V4-Pro needs significant compute for local deployment
Quick Start
# API access
pip install openai

# Point the OpenAI-compatible client at the DeepSeek API
from openai import OpenAI
client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")
# Models: deepseek-v4-pro, deepseek-v4-flash
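A minimal request sketch, assuming the endpoint is OpenAI-compatible as above; the model name comes from the list in the comment and the prompt is illustrative:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")

# Cheaper Flash variant; swap in deepseek-v4-pro for harder tasks
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Explain MoE routing in two sentences."}],
)
print(response.choices[0].message.content)
```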
Weights available on HuggingFace under MIT license.
The Bottom Line
1M context. MIT license. Beats frontier models on coding benchmarks. 27% of V3.2's FLOPs at full context. The question isn't whether this matters; it's how quickly the ecosystem adapts.
Sources: HuggingFace, DeepSeek API docs, Hacker News, Reddit r/LocalLLaMA