What It Is

ShowUI is a 2B-parameter vision-language-action (VLA) model for GUI agents, developed by Show Lab (NUS) and Microsoft. It achieved 75.1% accuracy on ScreenSpot grounding, beating GPT-4V (~70%) and Qwen2-VL-72B (~68%) while being 36x smaller than the latter.

Published at CVPR 2025. MIT license. Fully open-source.

Technical Specs

| Component | Details |
|-----------|---------|
| Base Model | Qwen2-VL-2B-Instruct |
| Parameters | 2B |
| License | MIT |
| Training Data | 256K samples |
| Platforms | Web, Mobile, Desktop |
| VRAM | ~10GB |

Three Key Innovations

1. UI-Guided Visual Token Selection

Formulates each screenshot as a UI-connected graph and uses it to identify visually redundant patches within the same UI element, which are pruned during token selection. Result: 33% fewer visual tokens and a 1.4x speedup in training.
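
The paper's mechanism is more involved, but a minimal sketch of the underlying idea looks roughly like this, assuming a patch grid where identically colored neighbors are merged with union-find and large (redundant) components are subsampled; the function names and keep ratio are illustrative, not the released implementation.

# Illustrative sketch of UI-guided token selection (not the official code):
# group visually redundant patches into connected components, then keep
# only a subset of tokens from each large component.
import numpy as np

def connected_components(patch_colors, grid_h, grid_w):
    """Union-find over the patch grid; neighboring patches with identical colors are merged."""
    parent = list(range(grid_h * grid_w))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for r in range(grid_h):
        for c in range(grid_w):
            idx = r * grid_w + c
            if c + 1 < grid_w and np.array_equal(patch_colors[idx], patch_colors[idx + 1]):
                union(idx, idx + 1)
            if r + 1 < grid_h and np.array_equal(patch_colors[idx], patch_colors[idx + grid_w]):
                union(idx, idx + grid_w)
    return [find(i) for i in range(grid_h * grid_w)]

def select_tokens(components, keep_ratio=0.67, rng=np.random):
    """Keep singleton components; randomly subsample redundant (multi-patch) components."""
    by_comp = {}
    for idx, comp in enumerate(components):
        by_comp.setdefault(comp, []).append(idx)
    keep = []
    for members in by_comp.values():
        if len(members) == 1:
            keep.extend(members)
        else:
            k = max(1, int(len(members) * keep_ratio))
            keep.extend(rng.choice(members, size=k, replace=False).tolist())
    return sorted(keep)

# Example: a 4x4 grid whose top half is a uniform background color
colors = [np.array([255, 255, 255])] * 8 + [np.array([i, 0, 0]) for i in range(8)]
comps = connected_components(colors, grid_h=4, grid_w=4)
kept = select_tokens(comps)  # background patches are subsampled; distinct patches are kept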

2. Interleaved Vision-Language-Action Streaming

Unifies diverse GUI tasks in a single interleaved format: screenshots, language queries, and past actions are streamed together, so the model can manage visual-action history during navigation and handle multi-turn query-action sequences.
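
As a rough illustration of what an interleaved query-action trajectory might look like as model context, here is a hedged sketch; the action schema (CLICK/INPUT with normalized coordinates) and field names are assumptions, not ShowUI's exact prompt format.

# Hypothetical interleaved vision-language-action history for a navigation task.
# Each turn pairs an observation (screenshot) with the query and the action taken,
# so the model conditions on the full trajectory when predicting the next action.
history = [
    {"observation": "screenshot_step0.png",
     "query": "Log in to the dashboard",
     "action": {"type": "CLICK", "position": [0.72, 0.18]}},    # click the Login button
    {"observation": "screenshot_step1.png",
     "query": "Log in to the dashboard",
     "action": {"type": "INPUT", "text": "user@example.com"}},  # type into the email field
]

# The next prediction is conditioned on the interleaved history plus the newest screenshot.
next_step = {"observation": "screenshot_step2.png", "query": "Log in to the dashboard"}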

3. Small-scale High-quality Training

256K curated samples, versus 13M for OS-ATLAS. A resampling strategy addresses imbalance across data types.
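
The report doesn't spell out the resampling recipe, but a common way to counter this kind of imbalance is inverse-frequency sampling; the sketch below uses PyTorch's WeightedRandomSampler with a made-up platform split and is an assumption, not ShowUI's exact strategy.

# Illustrative resampling for an imbalanced GUI dataset (not ShowUI's exact recipe).
from collections import Counter
from torch.utils.data import WeightedRandomSampler

# Toy split of 256K samples across platforms (numbers are illustrative, not the real distribution)
sources = ["web"] * 180_000 + ["mobile"] * 60_000 + ["desktop"] * 16_000
counts = Counter(sources)
weights = [1.0 / counts[s] for s in sources]  # inverse-frequency weight per sample

# Each platform then contributes more evenly per epoch; pass sampler=sampler to a DataLoader
sampler = WeightedRandomSampler(weights, num_samples=len(sources), replacement=True)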

Benchmarks

ScreenSpot Grounding

| Model | Accuracy | Size | Training Data |
|-------|----------|------|---------------|
| ShowUI-2B | 75.1% | 2B | 256K |
| GPT-4V | ~70% | ~1.8T | Large |
| OS-Atlas-7B | ~72% | 7B | 13M |
| Qwen2-VL-72B | ~68% | 72B | Large |

Key achievement: SOTA accuracy with smallest model and smallest training dataset.

OSWorld (ShowUI-Aloha)

| Agent | Success Rate |
|-------|--------------|
| ShowUI-Aloha | 60.1% (217/361) |
| Claude Computer Use | ~35-40% |

Competitor Comparison

vs OS-ATLAS

| Aspect | ShowUI | OS-ATLAS |
|--------|--------|----------|
| Size | 2B | 4B/7B |
| Training Data | 256K samples | 13M samples |
| Approach | Unified VLA | Multi-mode |

vs Claude Computer Use

| Aspect | ShowUI | Claude |
|--------|--------|--------|
| Architecture | Open-source | Closed API |
| Customization | Full | Limited |
| Cost | Free (local) | API costs |

Community Sentiment

Reddit r/computervision

"ShowUI-2B is simultaneously impressive and frustrating as hell. Dual output modes are chef's kiss. But it uses TAP on desktop randomly — zero environment awareness."

Voxel51 Review

"This thing is genuinely fast. Positioning is like having a friend point at your screen from across the room — technically correct, practically useless."

HuggingFace

"Beat GPT-4V without needing HTML/DOM. Visual-only grounding is the future."

Pros & Cons

Pros

  • Lightweight 2B, local deployment
  • MIT license, fully open-source
  • SOTA 75.1% ScreenSpot
  • Multi-platform
  • Visual-only (no DOM needed)
  • 1.4x speedup

Cons

  • Can't distinguish web vs mobile
  • OCR struggles with small text
  • Points around elements, not at them
  • ~10GB VRAM (high for 2B)
  • 7.7% on ScreenSpot-Pro
  • No reasoning/planning

Installation

pip install transformers torch

# ShowUI-2B is a fine-tune of Qwen2-VL-2B-Instruct, so load it with the Qwen2-VL class
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
model = Qwen2VLForConditionalGeneration.from_pretrained("showlab/ShowUI-2B", torch_dtype="auto")
processor = AutoProcessor.from_pretrained("showlab/ShowUI-2B")
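
A minimal grounding query, loosely following the standard Qwen2-VL chat-template interface; the screenshot path, prompt wording, and output handling below are illustrative assumptions, not the project's official example.

# Illustrative grounding query (file name and prompt wording are assumptions)
from PIL import Image

image = Image.open("screenshot.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Locate the 'Sign in' button and return its click position."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])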

GitHub: https://github.com/showlab/ShowUI
HuggingFace: https://huggingface.co/showlab/ShowUI-2B