Executive Summary

DeepSeek just dropped V4 on April 24, 2026. Two MoE models: V4-Pro (1.6T total / 49B activated) and V4-Flash (284B total / 13B activated). Both with 1M context window. MIT license.

This is the first open-weights model with native 1M-token context. The efficiency gains are brutal: 27% of the FLOPs and 10% of the KV cache of V3.2 when processing 1M tokens.

Model Specifications

Model           Total Params   Activated   Context   Precision   License
V4-Flash-Base   284B           13B         1M        FP8 Mixed   MIT
V4-Flash        284B           13B         1M        FP4+FP8     MIT
V4-Pro-Base     1.6T           49B         1M        FP8 Mixed   MIT
V4-Pro          1.6T           49B         1M        FP4+FP8     MIT

Architectural Innovations

Hybrid Attention (CSA + HCA)

Compressed Sparse Attention + Heavily Compressed Attention. This isn't just optimization—it's architectural innovation. In 1M-context scenarios:

  • 27% of V3.2's single-token inference FLOPs
  • 10% of V3.2's KV cache

The math: V3.2 would choke on 1M context. V4 doesn't. Same capability, radically lower compute.
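
To make that concrete, a back-of-envelope in Python. The baseline cache size below is an assumed placeholder, not a number from the report; only the 10% and 27% ratios come from the release notes:

# What "10% KV cache vs V3.2" means at 1M tokens.
baseline_kv_gb = 60.0    # ASSUMED V3.2-style KV cache at 1M tokens, placeholder only
kv_ratio = 0.10          # V4 vs V3.2 KV cache, from the release notes
flops_ratio = 0.27       # single-token inference FLOPs, V4 vs V3.2

print(f"KV cache @ 1M tokens: {baseline_kv_gb:.0f} GB -> {baseline_kv_gb * kv_ratio:.0f} GB")
print(f"Per-token FLOPs: {100 * flops_ratio:.0f}% of V3.2, i.e. a ~{100 * (1 - flops_ratio):.0f}% reduction")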

Manifold-Constrained Hyper-Connections (mHC)

Strengthened residual connections. Better signal propagation across layers without sacrificing expressivity. The technical paper calls it "stability enhancement"—but what it really means is deeper networks that don't degrade.
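
The report stays high-level here. As a purely illustrative picture of the hyper-connections idea (several parallel residual streams mixed by learnable weights), and explicitly not the mHC formulation:

import numpy as np

def hyper_connection_block(streams, layer_fn, alpha, beta):
    # streams: (n, d) parallel residual streams instead of a single residual vector
    # alpha:   (n, n) learnable mixing across streams
    # beta:    (n,)   learnable weights for writing the layer output back
    mixed = alpha @ streams                    # remix the residual streams
    layer_out = layer_fn(mixed.sum(axis=0))    # ordinary attention/FFN sub-layer
    return mixed + np.outer(beta, layer_out)   # broadcast the output into every stream

# Toy usage with an identity "layer"
streams = np.zeros((4, 8)); streams[0] = np.random.randn(8)
out = hyper_connection_block(streams, lambda x: x, np.eye(4), np.ones(4) / 4)

The point of the extra streams is more paths for signal to cross layers without being squeezed through a single residual vector.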

Muon Optimizer

Faster convergence, greater stability during training. Pre-training ran on 32T+ tokens. The optimizer choice mattered.
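
For context, the core move in Muon is to orthogonalize the momentum of each 2-D weight matrix with a few Newton-Schulz iterations before applying the update. A minimal sketch with illustrative hyperparameters, not DeepSeek's training code:

import numpy as np

def newton_schulz(G, steps=5):
    # Approximately orthogonalize G with a cubic Newton-Schulz iteration.
    X = G / (np.linalg.norm(G) + 1e-7)   # scale so singular values are <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_step(W, grad, buf, lr=0.02, beta=0.95):
    # Momentum, then an orthogonalized update direction, for a 2-D weight matrix.
    buf = beta * buf + grad
    W = W - lr * newton_schulz(buf)
    return W, buf

The orthogonalized update has roughly uniform scale across directions, which is part of why Muon tends to converge faster on matrix-shaped parameters; it is usually paired with a standard optimizer for embeddings and norms.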

Benchmarks

Base Model Comparison

Benchmark            V3.2-Base   V4-Flash-Base   V4-Pro-Base
MMLU (EM)            87.8        88.7            90.1
MMLU-Pro             65.5        68.3            73.5
HumanEval (Pass@1)   62.8        69.5            76.8
GSM8K                91.1        90.8            92.6
MATH                 60.5        57.4            64.5
LongBench-V2         40.2        44.7            51.5

The jump from 62.8 to 76.8 on HumanEval is real: a 14-point gain, roughly a 22% relative improvement in code generation.

V4-Pro-Max vs Frontier Models

Benchmark        Claude Opus-4.6   GPT-5.4   Gemini-3.1   V4-Pro-Max
LiveCodeBench    88.8              -         91.7         93.5
Codeforces       -                 3168      3052         3206
Apex Shortlist   85.9              78.1      89.1         90.2

V4-Pro-Max tops every frontier model with a reported score on these coding benchmarks. Open weights. MIT license.

The Codeforces 3206 rating is competitive programmer territory. Not "AI-assisted coding"—actual competitive programming performance.

API Pricing

Model      Input (Cache Hit)   Input (Cache Miss)   Output
V4-Flash   $0.028              $0.14                $0.28
V4-Pro     $0.145              $1.74                $3.48

All prices in USD per 1M tokens; cache hit/miss refers to input tokens.

V4-Flash at $0.14 in / $0.28 out per million tokens is cheaper than most closed-source alternatives for short queries. The $0.028 cache-hit price is aggressive.
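
A quick cost estimator built from the table above (prices in USD per 1M tokens; the 80%-cached workload below is a made-up example):

def request_cost_usd(input_tokens, output_tokens, cache_hit_ratio, model="v4-flash"):
    # Prices from the table above, USD per 1M tokens.
    prices = {
        "v4-flash": {"hit": 0.028, "miss": 0.14, "out": 0.28},
        "v4-pro":   {"hit": 0.145, "miss": 1.74, "out": 3.48},
    }
    p = prices[model]
    hit = input_tokens * cache_hit_ratio
    miss = input_tokens - hit
    return (hit * p["hit"] + miss * p["miss"] + output_tokens * p["out"]) / 1_000_000

# 200k-token repo prompt, 80% of it cache-hit, 4k tokens of output:
print(f"${request_cost_usd(200_000, 4_000, 0.8):.4f}")            # ~ $0.0112 on V4-Flash
print(f"${request_cost_usd(200_000, 4_000, 0.8, 'v4-pro'):.3f}")  # ~ $0.107 on V4-Pro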

Community Sentiment

Hacker News: #1 story with 535+ points, 237+ comments.

"The documentation is cleaner than OpenAI's. They explain the architecture, show the benchmarks, give you the weights. No marketing fluff." — HN commenter

Reddit r/LocalLLaMA: 6 viral threads.

"1M context in open weights finally. This democratizes what Google and Anthropic have been gatekeeping." — Reddit user

"Flash is incredibly inexpensive. $0.14/M input is basically free for most use cases." — Reddit user

The hardware requirements thread drew real discussion:

"V4-Flash (284B/13B) should run on consumer hardware with proper quantization. The 13B activated path is the key." — LocalLLaMA

Limitations

  1. No multimodality — text-only, confirmed in technical report
  2. Preview version — not production-stable yet
  3. Hardware requirements — V4-Pro needs significant compute for local deployment
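
On point 3, rough weights-only arithmetic from the published parameter counts (KV cache and activations come on top; MoE activates only 13B / 49B params per token, but the full expert set still has to be resident or paged):

def weight_memory_gb(total_params_billion, bits_per_param):
    # Weights only: params x bits per param, converted to GiB.
    return total_params_billion * 1e9 * bits_per_param / 8 / 1024**3

for name, params_b in [("V4-Flash", 284), ("V4-Pro", 1600)]:
    for bits in (8, 4):
        print(f"{name} @ {bits}-bit: ~{weight_memory_gb(params_b, bits):.0f} GB")

Even at 4-bit, V4-Flash is roughly 130 GB of weights, so "consumer hardware" in the LocalLLaMA sense means multi-GPU boxes or heavy offloading; V4-Pro stays in datacenter territory.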

Quick Start

# API access (shell)
pip install openai

# Point the OpenAI client at the DeepSeek API (Python)
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")

# Models: deepseek-v4-pro, deepseek-v4-flash
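
A minimal request with that client, assuming the V4 endpoints keep the OpenAI-compatible chat completions format DeepSeek's existing API uses:

# Continues from the client defined above.
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Summarize the V4 release in three bullet points."}],
)
print(response.choices[0].message.content)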

Weights available on HuggingFace under MIT license.

The Bottom Line

1M context. MIT license. Beats frontier models on coding benchmarks. 27% of V3.2's FLOPs at 1M context. The question isn't whether this matters; it's how quickly the ecosystem adapts.


Sources: HuggingFace, DeepSeek API docs, Hacker News, Reddit r/LocalLLaMA