
What It Is
HY-World 2.0 is Tencent's multi-modal world model. It generates actual 3D geometry, rather than video, from text, image, or video input. Its outputs are 3D Gaussian Splats (3DGS), meshes, and point clouds that can be imported directly into Unity, Unreal Engine, Blender, and NVIDIA Isaac Sim.
It is presented as the first open-source 3D world model to compete with closed-source alternatives such as World Labs' Marble on Stanford's WorldScore benchmark.
Architecture & Specs
| Component | Parameters | Purpose |
|---|---|---|
| WorldMirror 2.0 | ~1.2B | Feed-forward reconstruction model |
| HY-Pano 2.0 | TBD | Panorama generation (text/image to 360°) |
| WorldNav | TBD | Trajectory planning and navigation |
| WorldStereo 2.0 | TBD | View generation with memory |
WorldMirror 2.0 Architecture: A unified Transformer backbone with DPT (Dense Prediction Transformer) decoder heads predicts depth, normals, camera parameters, and 3DGS attributes simultaneously in a single forward pass.
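To make the "shared backbone, several heads, one forward pass" idea concrete, here is a toy sketch in plain Python. It is not WorldMirror's code: the class name `ToyMultiHeadModel`, the feature width, and every output size in `HEAD_SIZES` are assumptions chosen for illustration, and a single dense layer stands in for the Transformer and the DPT heads.

```python
import random

TOKEN_DIM = 8  # toy feature width (assumption; the real backbone is far larger)

# Illustrative output widths per head (assumptions, not the real model's sizes).
HEAD_SIZES = {
    "depth": 1,
    "normal": 3,
    "camera": 7,
    "gaussian": 14,
}


def random_matrix(rows, cols, rng):
    """Build a rows x cols weight matrix filled with small random values."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear(weights, vector):
    """Dense layer without bias: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


class ToyMultiHeadModel:
    """One shared feature extractor feeding several task-specific heads."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Stand-in for the unified Transformer backbone.
        self.backbone = random_matrix(TOKEN_DIM, TOKEN_DIM, rng)
        # Stand-ins for the decoder heads, one per predicted quantity.
        self.heads = {
            name: random_matrix(size, TOKEN_DIM, rng)
            for name, size in HEAD_SIZES.items()
        }

    def forward(self, token):
        # Shared features are computed once (dense layer followed by ReLU)...
        features = [max(0.0, x) for x in linear(self.backbone, token)]
        # ...and every head reads the same features in the same pass.
        return {name: linear(w, features) for name, w in self.heads.items()}


model = ToyMultiHeadModel()
outputs = model.forward([0.5] * TOKEN_DIM)
for name, values in outputs.items():
    print(name, len(values))
```

The point of the design is that the expensive shared computation happens once, and each additional output only costs a small extra head.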
| Improvement | v1.0 | v2.0 |
|---|---|---|
| Position Encoding | Absolute RoPE | Normalized RoPE |
| Depth Supervision | GT depth only | GT depth + normals |
| Resolution Range (pixels) | 100K-250K | 50K-500K |
| Curriculum | 2 stages | 3 stages |
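As a rough illustration of what a 50K-500K pixel range means for real inputs, the sketch below rescales an image size so its pixel count falls inside that range while roughly preserving aspect ratio. The source does not describe how inputs are preprocessed, so the helper `fit_to_pixel_budget` and its rounding choices are hypothetical; only the two bounds come from the table.

```python
import math

MIN_PIXELS = 50_000   # lower bound of the v2.0 range in the table
MAX_PIXELS = 500_000  # upper bound of the v2.0 range in the table


def fit_to_pixel_budget(width, height,
                        min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS):
    """Return (width, height) rescaled so width * height lies in the budget.

    Hypothetical preprocessing helper: scales both sides by the same factor,
    rounding down when shrinking and up when enlarging.
    """
    pixels = width * height
    if pixels > max_pixels:
        scale = math.sqrt(max_pixels / pixels)
        return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))
    if pixels < min_pixels:
        scale = math.sqrt(min_pixels / pixels)
        return math.ceil(width * scale), math.ceil(height * scale)
    return width, height


for size in [(1920, 1080), (640, 480), (200, 100)]:
    new_w, new_h = fit_to_pixel_budget(*size)
    print(size, "->", (new_w, new_h), new_w * new_h, "pixels")
```

For scale, a 1920x1080 frame (about 2.07M pixels) is well above the upper bound, while a 640x480 frame (307,200 pixels) already fits.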
Benchmarks
WorldStereo 2.0 Camera Control
| Method | RotErr ↓ | TransErr ↓ | CLIP-I ↑ |
|---|---|---|---|
| SEVA | 1.690 | 1.578 | 77.16 |
| Gen3C | 0.944 | 1.580 | 82.33 |
| WorldStereo 2.0 | 0.492 | 0.968 | 89.43 |
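To put the table in relative terms, the short calculation below compares WorldStereo 2.0 with the best baseline in each column. It uses only the numbers in the table and assumes that lower is better for RotErr and TransErr and higher is better for CLIP-I; the helper name `relative_reduction` is ours.

```python
# Values copied from the camera control table above.
results = {
    "SEVA": {"RotErr": 1.690, "TransErr": 1.578, "CLIP-I": 77.16},
    "Gen3C": {"RotErr": 0.944, "TransErr": 1.580, "CLIP-I": 82.33},
    "WorldStereo 2.0": {"RotErr": 0.492, "TransErr": 0.968, "CLIP-I": 89.43},
}

ours = results["WorldStereo 2.0"]
baselines = {name: row for name, row in results.items() if name != "WorldStereo 2.0"}


def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0


# Best baseline per metric: minimum for errors, maximum for CLIP-I.
best_rot = min(row["RotErr"] for row in baselines.values())
best_trans = min(row["TransErr"] for row in baselines.values())
best_clip = max(row["CLIP-I"] for row in baselines.values())

rot_gain = relative_reduction(best_rot, ours["RotErr"])
trans_gain = relative_reduction(best_trans, ours["TransErr"])
clip_gain = ours["CLIP-I"] - best_clip

print(f"RotErr reduction vs best baseline:   {rot_gain:.1f}%")    # about 47.9%
print(f"TransErr reduction vs best baseline: {trans_gain:.1f}%")  # about 38.7%
print(f"CLIP-I gain vs best baseline:        {clip_gain:.2f} points")  # 7.10
```

Note that the best baseline differs by column: Gen3C has the lower rotation error, while SEVA has the marginally lower translation error.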
Reconstruction Quality (Tanks-and-Temples / MipNeRF360)
| Method | Tanks-and-Temples F1 | MipNeRF360 F1 |
|---|---|---|
| SEVA | 36.73 | 28.75 |
| Lyra | 32.54 | 36.05 |
| WorldStereo 2.0 | 41.43 | 51.27 |
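For readers unfamiliar with the metric, an F1 score for 3D reconstruction is typically the harmonic mean of precision (the share of reconstructed points that lie close to the ground truth) and recall (the share of ground-truth points that lie close to the reconstruction). The sketch below implements that general definition with a brute-force nearest-neighbour search. The distance threshold and the evaluation protocol behind the table are not given in the source, so the values here are made up for illustration.

```python
import math


def nearest_distance(point, cloud):
    """Distance from `point` to its closest neighbour in `cloud` (brute force)."""
    return min(math.dist(point, other) for other in cloud)


def f_score(reconstructed, ground_truth, threshold):
    """Harmonic mean of precision and recall at a given distance threshold."""
    precision = sum(
        nearest_distance(p, ground_truth) <= threshold for p in reconstructed
    ) / len(reconstructed)
    recall = sum(
        nearest_distance(q, reconstructed) <= threshold for q in ground_truth
    ) / len(ground_truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (made up): two reconstructed points are accurate, one is far off,
# and one ground-truth point has no nearby reconstructed point.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
reconstructed = [(0.0, 0.0, 0.01), (1.0, 0.0, 0.01), (5.0, 0.0, 0.0)]

score = f_score(reconstructed, ground_truth, threshold=0.05)
print(f"F1 = {score:.3f}")  # precision = recall = 2/3, so F1 is about 0.667
```

The brute-force search is quadratic in the number of points; real evaluations on dense scans would use a spatial index such as a k-d tree.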
Capabilities
- Real 3D Assets: Generates actual geometry—not pixel videos
- Persistent Worlds: Build once, keep forever; unlimited duration
- Native 3D Consistency: No flickering, inherent spatial coherence
- Engine Import: Direct to Unity, Unreal, Blender, Isaac Sim
- Physics Support: Collision detection, real-time rendering
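As an example of what "engine import" involves at the file level, the sketch below serializes a small coloured point cloud to ASCII PLY, a common interchange format that Blender can import. This is not HY-World's export code. The function names and the sample points are hypothetical, and a real 3DGS export stores many more per-point attributes than position and colour.

```python
def points_to_ply(points):
    """Serialize (x, y, z, r, g, b) tuples to an ASCII PLY string.

    Coordinates are floats; colour channels are integers in 0..255.
    """
    header = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(points)}",
        "property float x",
        "property float y",
        "property float z",
        "property uchar red",
        "property uchar green",
        "property uchar blue",
        "end_header",
    ]
    body = [f"{x} {y} {z} {r} {g} {b}" for x, y, z, r, g, b in points]
    return "\n".join(header + body) + "\n"


def write_ply(path, points):
    """Write the point cloud to `path` as an ASCII PLY file."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(points_to_ply(points))


# Made-up sample: three points forming a small red, green, and blue triangle.
sample = [
    (0.0, 0.0, 0.0, 255, 0, 0),
    (1.0, 0.0, 0.0, 0, 255, 0),
    (0.0, 1.0, 0.0, 0, 0, 255),
]
ply_text = points_to_ply(sample)
print(ply_text)
```

Calling `write_ply("scene.ply", sample)` would produce a file that can be opened with a PLY importer; the file name is only an example.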
Competitor Comparison
| Aspect | HY-World 2.0 | Marble | Genie 3 |
|---|---|---|---|
| Access | Open source | Commercial ($) | Google AI Ultra |
| Output | 3DGS, meshes, point clouds | 3DGS | Pixel video |
| Duration | Unlimited | Downloadable | ~1 min |
| Editability | Fully editable | Partial | Non-editable |
| Self-host | Yes | No | No |
Community Reality Check
Reddit's r/LocalLLaMA discussion (51 upvotes, 19 comments):
"Some BIG asterisks here. The code available is for making Gaussian splats from images and videos. Many of the more interesting features and models are not available yet."
What's actually released: WorldMirror 2.0 only. HY-Pano 2.0, WorldNav, and WorldStereo 2.0 are listed as coming soon.
License note: The release is described as open source, but it is not FOSS; commercial restrictions apply.
Quality concern: "If you look at the video full screen, both texture and mesh resolution are very low." The model also generates entire scenes, not individual editable objects.
Key Takeaway
HY-World 2.0 marks a shift from video world models (ephemeral playback) to persistent, navigable 3D environments. The partial release and license restrictions are real limitations, but it is the first genuinely competitive open-source option for 3D world generation, and it outputs geometry that can be used in production pipelines.