What It Is
ShowUI is a 2B parameter Vision-Language-Action model for GUI agents. Developed by Show Lab (NUS) and Microsoft, it achieved 75.1% accuracy on ScreenSpot grounding — beating GPT-4V (~70%) and Qwen2-VL-72B (~68%) while being 36x smaller.
Published at CVPR 2025. MIT license. Fully open-source.
Technical Specs
| Component | Details |
|-----------|---------|
| Base Model | Qwen2-VL-2B-Instruct |
| Parameters | 2B |
| License | MIT |
| Training Data | 256K samples |
| Platforms | Web, Mobile, Desktop |
| VRAM | ~10GB |
Three Key Innovations
1. UI-Guided Visual Token Selection
Formulates the screenshot as a UI-connected graph. Flags visually redundant patches (backgrounds, whitespace between UI elements) so their tokens can be skipped. Result: 33% fewer visual tokens, 1.4x speedup.
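The paper's exact construction isn't reproduced here, but the core idea can be sketched: merge visually uniform neighbouring patches into connected components with a union-find, then keep only a few visual tokens per component. The patch size, colour tolerance, and helper names below are illustrative assumptions, not ShowUI's actual settings.

```python
import numpy as np

def patch_components(image, patch=28, tol=8.0):
    """Group neighbouring patches whose mean colour barely differs into
    connected components -- a rough stand-in for the UI-connected graph."""
    h, w, _ = image.shape
    gh, gw = h // patch, w // patch
    # Mean RGB of every patch on a gh x gw grid.
    means = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, 3).mean(axis=(1, 3))

    parent = list(range(gh * gw))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Link each patch to its right/bottom neighbour when the colours are near-identical.
    for r in range(gh):
        for c in range(gw):
            idx = r * gw + c
            if c + 1 < gw and np.abs(means[r, c] - means[r, c + 1]).max() < tol:
                union(idx, idx + 1)
            if r + 1 < gh and np.abs(means[r, c] - means[r + 1, c]).max() < tol:
                union(idx, idx + gw)
    return [find(i) for i in range(gh * gw)]

def select_tokens(components, keep_per_component=1, seed=0):
    """Keep only a few token indices from each redundant component."""
    rng = np.random.default_rng(seed)
    groups = {}
    for idx, comp in enumerate(components):
        groups.setdefault(comp, []).append(idx)
    kept = []
    for members in groups.values():
        k = min(keep_per_component, len(members))
        kept.extend(rng.choice(members, size=k, replace=False).tolist())
    return sorted(kept)
```

The intent is that large uniform regions (blank backgrounds, toolbars) collapse to a single token while dense UI areas keep full resolution, which is where savings on the visual token budget come from.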
2. Interleaved Vision-Language-Action Streaming
Unifies diverse GUI tasks. Handles visual-action history in navigation. Supports multi-turn query-action sequences.
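The released model emits actions as small JSON-like objects (action type, optional text value, normalized [x, y] position), and a navigation episode interleaves observations, the query, and those actions in one stream. The keys and file names below are illustrative, not the exact training schema.

```python
# Hypothetical interleaved episode: screenshots, query, and structured actions in one stream.
episode = [
    {"type": "query",       "text": "Search for 'wireless mouse' and open the first result"},
    {"type": "observation", "image": "step_0.png"},
    {"type": "action",      "value": {"action": "CLICK", "value": None, "position": [0.48, 0.07]}},
    {"type": "observation", "image": "step_1.png"},
    {"type": "action",      "value": {"action": "INPUT", "value": "wireless mouse", "position": [0.48, 0.07]}},
    {"type": "action",      "value": {"action": "ENTER", "value": None, "position": None}},
    {"type": "observation", "image": "step_2.png"},
    {"type": "action",      "value": {"action": "CLICK", "value": None, "position": [0.31, 0.24]}},
]
```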
3. Small-scale High-quality Training
256K curated samples — vs 13M for OS-ATLAS. Resampling strategy addresses data imbalance.
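The paper's recipe isn't reproduced here, but inverse-frequency resampling is the generic version of the idea: oversample under-represented groups so dominant ones don't drown them out. The function name and grouping key below are illustrative.

```python
import random
from collections import Counter

def balanced_resample(samples, group_key, n, seed=0):
    """Draw n samples with probability inversely proportional to group size
    (e.g. platform or element type), so rare groups appear more often."""
    rng = random.Random(seed)
    counts = Counter(group_key(s) for s in samples)
    weights = [1.0 / counts[group_key(s)] for s in samples]
    return rng.choices(samples, weights=weights, k=n)

# Example: rebalance a mixed web/mobile/desktop pool toward the rarer platforms.
pool = [{"platform": "web"}] * 900 + [{"platform": "mobile"}] * 80 + [{"platform": "desktop"}] * 20
resampled = balanced_resample(pool, lambda s: s["platform"], n=300)
```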
Benchmarks
ScreenSpot Grounding
| Model | Accuracy | Size | Training |
|---|---|---|---|
| ShowUI-2B | 75.1% | 2B | 256K |
| GPT-4V | ~70% | ~1.8T | Large |
| OS-Atlas-7B | ~72% | 7B | 13M |
| Qwen2-VL-72B | ~68% | 72B | Large |
Key achievement: SOTA accuracy with smallest model and smallest training dataset.
OSWorld (ShowUI-Aloha)
| Agent | Success Rate |
|---|---|
| ShowUI-Aloha | 60.1% (217/361) |
| Claude Computer Use | ~35-40% |
Competitor Comparison
vs OS-ATLAS
| Aspect | ShowUI | OS-ATLAS |
|---|---|---|
| Size | 2B | 4B/7B |
| Training | 256K | 13M |
| Approach | Unified VLA | Multi-mode |
vs Claude Computer Use
| Aspect | ShowUI | Claude |
|---|---|---|
| Architecture | Open-source | Closed API |
| Customization | Full | Limited |
| Cost | Free (local) | API costs |
Community Sentiment
Reddit r/computervision
"ShowUI-2B is simultaneously impressive and frustrating as hell. Dual output modes are chef's kiss. But it uses TAP on desktop randomly — zero environment awareness."
Voxel51 Review
"This thing is genuinely fast. Positioning is like having a friend point at your screen from across the room — technically correct, practically useless."
HuggingFace
"Beat GPT-4V without needing HTML/DOM. Visual-only grounding is the future."
Pros & Cons
Pros
- Lightweight 2B, local deployment
- MIT license, fully open-source
- SOTA 75.1% ScreenSpot
- Multi-platform
- Visual-only (no DOM needed)
- 1.4x speedup
Cons
- Can't distinguish web vs mobile
- OCR struggles with small text
- Points around elements, not at them
- ~10GB VRAM (high for 2B)
- 7.7% on ScreenSpot-Pro
- No reasoning/planning
Installation
```bash
pip install transformers torch
```

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# ShowUI-2B is a Qwen2-VL checkpoint; load it with the Qwen2-VL class (not AutoModelForCausalLM)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "showlab/ShowUI-2B", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("showlab/ShowUI-2B")
```
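Continuing from the snippet above, a minimal grounding call follows the standard Qwen2-VL chat flow (it also needs `pip install qwen-vl-utils` for the vision helper). The screenshot path and instruction are placeholders; for grounding queries the model typically answers with a normalized [x, y] click position.

```python
from PIL import Image
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

screenshot = Image.open("screenshot.png")  # placeholder path
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": screenshot},
        {"type": "text", "text": "Click the search box"},  # placeholder instruction
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```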
GitHub: https://github.com/showlab/ShowUI
HuggingFace: https://huggingface.co/showlab/ShowUI-2B