What It Is

ShowUI is a 2B-parameter vision-language-action (VLA) model for GUI agents, developed by Show Lab (NUS) and Microsoft. It achieved 75.1% accuracy on ScreenSpot grounding, beating GPT-4V (~70%) and Qwen2-VL-72B (~68%) while being 36x smaller than the latter.

Published at CVPR 2025. MIT license. Fully open-source.

Technical Specs

| Component | Details |
|-----------|---------|
| Base Model | Qwen2-VL-2B-Instruct |
| Parameters | 2B |
| License | MIT |
| Training Data | 256K samples |
| Platforms | Web, Mobile, Desktop |
| VRAM | ~10GB |

Three Key Innovations

1. UI-Guided Visual Token Selection

Formulates each screenshot as a UI-connected graph and uses it to identify visually redundant patches within the same UI element, which are pruned during token selection. Result: 33% fewer visual tokens and a 1.4x speedup in training.
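
The paper's mechanism is more involved, but a minimal sketch of the underlying idea looks roughly like this, assuming a patch grid where identically colored neighbors are merged with union-find and large (redundant) components are subsampled; the function names and keep ratio are illustrative, not the released implementation.

# Illustrative sketch of UI-guided token selection (not the official code):
# group visually redundant patches into connected components, then keep
# only a subset of tokens from each large component.
import numpy as np

def connected_components(patch_colors, grid_h, grid_w):
    """Union-find over the patch grid; neighboring patches with identical colors are merged."""
    parent = list(range(grid_h * grid_w))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for r in range(grid_h):
        for c in range(grid_w):
            idx = r * grid_w + c
            if c + 1 < grid_w and np.array_equal(patch_colors[idx], patch_colors[idx + 1]):
                union(idx, idx + 1)
            if r + 1 < grid_h and np.array_equal(patch_colors[idx], patch_colors[idx + grid_w]):
                union(idx, idx + grid_w)
    return [find(i) for i in range(grid_h * grid_w)]

def select_tokens(components, keep_ratio=0.67, rng=np.random):
    """Keep singleton components; randomly subsample redundant (multi-patch) components."""
    by_comp = {}
    for idx, comp in enumerate(components):
        by_comp.setdefault(comp, []).append(idx)
    keep = []
    for members in by_comp.values():
        if len(members) == 1:
            keep.extend(members)
        else:
            k = max(1, int(len(members) * keep_ratio))
            keep.extend(rng.choice(members, size=k, replace=False).tolist())
    return sorted(keep)

# Example: a 4x4 grid whose top half is a uniform background color
colors = [np.array([255, 255, 255])] * 8 + [np.array([i, 0, 0]) for i in range(8)]
comps = connected_components(colors, grid_h=4, grid_w=4)
kept = select_tokens(comps)  # background patches are subsampled; distinct patches are kept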

2. Interleaved Vision-Language-Action Streaming

Unifies diverse GUI tasks in a single interleaved format: screenshots, language queries, and past actions are streamed together, so the model can manage visual-action history during navigation and handle multi-turn query-action sequences.
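
As a rough illustration of what an interleaved query-action trajectory might look like as model context, here is a hedged sketch; the action schema (CLICK/INPUT with normalized coordinates) and field names are assumptions, not ShowUI's exact prompt format.

# Hypothetical interleaved vision-language-action history for a navigation task.
# Each turn pairs an observation (screenshot) with the query and the action taken,
# so the model conditions on the full trajectory when predicting the next action.
history = [
    {"observation": "screenshot_step0.png",
     "query": "Log in to the dashboard",
     "action": {"type": "CLICK", "position": [0.72, 0.18]}},    # click the Login button
    {"observation": "screenshot_step1.png",
     "query": "Log in to the dashboard",
     "action": {"type": "INPUT", "text": "user@example.com"}},  # type into the email field
]

# The next prediction is conditioned on the interleaved history plus the newest screenshot.
next_step = {"observation": "screenshot_step2.png", "query": "Log in to the dashboard"}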

3. Small-scale High-quality Training

256K curated samples, versus 13M for OS-ATLAS. A resampling strategy addresses imbalance across data types.
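
The report doesn't spell out the resampling recipe, but a common way to counter this kind of imbalance is inverse-frequency sampling; the sketch below uses PyTorch's WeightedRandomSampler with a made-up platform split and is an assumption, not ShowUI's exact strategy.

# Illustrative resampling for an imbalanced GUI dataset (not ShowUI's exact recipe).
from collections import Counter
from torch.utils.data import WeightedRandomSampler

# Toy split of 256K samples across platforms (numbers are illustrative, not the real distribution)
sources = ["web"] * 180_000 + ["mobile"] * 60_000 + ["desktop"] * 16_000
counts = Counter(sources)
weights = [1.0 / counts[s] for s in sources]  # inverse-frequency weight per sample

# Each platform then contributes more evenly per epoch; pass sampler=sampler to a DataLoader
sampler = WeightedRandomSampler(weights, num_samples=len(sources), replacement=True)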

Benchmarks

ScreenSpot Grounding

| Model | Accuracy | Size | Training Data |
|-------|----------|------|---------------|
| ShowUI-2B | 75.1% | 2B | 256K |
| GPT-4V | ~70% | ~1.8T | Large |
| OS-Atlas-7B | ~72% | 7B | 13M |
| Qwen2-VL-72B | ~68% | 72B | Large |

Key achievement: SOTA accuracy with smallest model and smallest training dataset.

OSWorld (ShowUI-Aloha)

| Agent | Success Rate |
|-------|--------------|
| ShowUI-Aloha | 60.1% (217/361) |
| Claude Computer Use | ~35-40% |

Competitor Comparison

vs OS-ATLAS

| Aspect | ShowUI | OS-ATLAS |
|--------|--------|----------|
| Size | 2B | 4B/7B |
| Training Data | 256K samples | 13M samples |
| Approach | Unified VLA | Multi-mode |

vs Claude Computer Use

| Aspect | ShowUI | Claude |
|--------|--------|--------|
| Architecture | Open-source | Closed API |
| Customization | Full | Limited |
| Cost | Free (local) | API costs |

Community Sentiment

Reddit r/computervision

"ShowUI-2B is simultaneously impressive and frustrating as hell. Dual output modes are chef's kiss. But it uses TAP on desktop randomly — zero environment awareness."

Voxel51 Review

"This thing is genuinely fast. Positioning is like having a friend point at your screen from across the room — technically correct, practically useless."

HuggingFace

"Beat GPT-4V without needing HTML/DOM. Visual-only grounding is the future."

Pros & Cons

Pros

  • Lightweight 2B, local deployment
  • MIT license, fully open-source
  • SOTA 75.1% ScreenSpot
  • Multi-platform
  • Visual-only (no DOM needed)
  • 1.4x speedup

Cons

  • Can't distinguish web vs mobile
  • OCR struggles with small text
  • Points around elements, not at them
  • ~10GB VRAM (high for 2B)
  • 7.7% on ScreenSpot-Pro
  • No reasoning/planning

Installation

pip install transformers torch

# ShowUI-2B is a fine-tune of Qwen2-VL-2B-Instruct, so load it with the Qwen2-VL class
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
model = Qwen2VLForConditionalGeneration.from_pretrained("showlab/ShowUI-2B", torch_dtype="auto")
processor = AutoProcessor.from_pretrained("showlab/ShowUI-2B")
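
A minimal grounding query, loosely following the standard Qwen2-VL chat-template interface; the screenshot path, prompt wording, and output handling below are illustrative assumptions, not the project's official example.

# Illustrative grounding query (file name and prompt wording are assumptions)
from PIL import Image

image = Image.open("screenshot.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Locate the 'Sign in' button and return its click position."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])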

GitHub: https://github.com/showlab/ShowUI
HuggingFace: https://huggingface.co/showlab/ShowUI-2B