What It Is

A 1.7B parameter language model compressed to 290MB using 1-bit quantization, running entirely in-browser via WebGPU. No server. No API fees. No data leaving your device.

Developed by Prism ML and based on the Qwen3-1.7B architecture, with the WebGPU implementation hosted by the WebML Community on Hugging Face.


Technical Specifications

Specification   Value
Parameters      1.7B (1.4B non-embedding)
Quantization    Q1_0 g128 (1.125 bits/weight)
Context Length  32,768 tokens
Vocabulary      151,936 tokens
Layers          28 Transformer blocks
Attention       GQA (16 query heads / 8 KV heads)
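Grouped-query attention (GQA) matters here because the KV cache, not just the weights, dominates memory at long context. A back-of-envelope calculation using the table above, with an assumed head dimension of 128 (standard for Qwen3 models, but not stated in the table) and an FP16 cache:

```python
# KV-cache size at full context, from the spec table above.
# HEAD_DIM = 128 is an assumption; the other constants come from the table.
LAYERS, KV_HEADS, HEAD_DIM, CTX = 28, 8, 128, 32768
BYTES_PER_ELEM = 2  # FP16 cache

# Each layer caches keys and values (factor of 2) for every KV head.
kv_bytes = 2 * LAYERS * KV_HEADS * HEAD_DIM * CTX * BYTES_PER_ELEM
print(f"KV cache at 32k context: {kv_bytes / 2**30:.2f} GiB")  # 3.50 GiB

# With 16 KV heads (full multi-head attention) the cache would double:
mha_bytes = kv_bytes * (16 / 8)
print(f"Same cache without GQA:  {mha_bytes / 2**30:.2f} GiB")  # 7.00 GiB
```

Even with GQA halving the cache versus full multi-head attention, a full 32k-token context costs several gigabytes, far more than the 290 MB of weights, which is why long-context use remains the practical memory limit in-browser.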

Memory Compression

Format           Size     Compression
FP16             3.44 GB  Baseline
Browser (1-bit)  290 MB   11.9x
GGUF Q1_0        0.24 GB  14.2x

The quantization method stores 1 sign bit per weight plus one FP16 scale per group of 128 weights, i.e. 1 + 16/128 = 1.125 bits per weight. Each bit dequantizes to a signed scale: bit 0 maps to -scale, bit 1 to +scale. All layers are quantized: embeddings, attention, MLP, and the LM head.
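A minimal NumPy sketch of this sign-bit scheme: one bit per weight plus an FP16 scale per group of 128. The scale here is the group's mean absolute value, a common choice for binary quantization but an assumption; this is illustrative and does not reproduce the exact GGUF Q1_0 bit packing.

```python
import numpy as np

GROUP = 128  # weights per scale group, per the Q1_0 g128 layout

def quantize_q1(weights: np.ndarray):
    """Reduce each weight to its sign bit, with one FP16 scale per group.

    Scale choice (mean |w| per group) is an illustrative assumption.
    """
    groups = weights.reshape(-1, GROUP)
    scales = np.abs(groups).mean(axis=1).astype(np.float16)  # 16 bits per 128 weights
    bits = (groups >= 0).astype(np.uint8)                    # 1 bit per weight
    return bits, scales

def dequantize_q1(bits: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct each weight as +scale (bit 1) or -scale (bit 0)."""
    signs = bits.astype(np.float32) * 2.0 - 1.0              # {0,1} -> {-1,+1}
    return (signs * scales[:, None].astype(np.float32)).reshape(-1)

w = np.random.randn(256).astype(np.float32)
bits, scales = quantize_q1(w)
w_hat = dequantize_q1(bits, scales)
# Storage: 256 sign bits + 2 FP16 scales = 288 bits / 256 weights = 1.125 bits/weight
```

Only the signs survive quantization; per-group magnitude information is carried entirely by the single scale, which is why this reaches 1.125 bits per weight.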


Performance Benchmarks

Platform             Throughput  vs FP16
RTX 4090 (CUDA)      674 tok/s   3.0x faster
M4 Pro 48GB (Metal)  250 tok/s   3.8x faster
iPhone (MLX Swift)   130 tok/s

The 1-bit kernels are faster than FP16 because autoregressive decoding is memory-bandwidth-bound: each generated token requires streaming the model's weights from memory, so fetching ~14x less data per token translates directly into faster decoding.
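The bandwidth argument can be made concrete with a back-of-envelope roofline estimate: if decoding is purely bandwidth-bound, tokens per second is roughly memory bandwidth divided by model size in bytes. The 1 TB/s bandwidth figure below is an illustrative assumption, not a measured value.

```python
def decode_tok_per_s(params: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Bandwidth-bound decode estimate: each token streams all weights once."""
    model_bytes = params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes

PARAMS = 1.7e9
fp16_est = decode_tok_per_s(PARAMS, 16, 1000)     # ~294 tok/s at 1 TB/s
onebit_est = decode_tok_per_s(PARAMS, 1.125, 1000)  # ~4180 tok/s at 1 TB/s
print(f"FP16:  ~{fp16_est:.0f} tok/s")
print(f"1-bit: ~{onebit_est:.0f} tok/s")
```

The pure-bandwidth model predicts a ~14x speedup, while the benchmarks above show 3-4x; the gap is expected, since real kernels also pay for dequantization compute, KV-cache reads, and activation traffic that do not shrink with weight precision.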


Competitor Analysis

Solution         Size      Key Feature
Bonsai 1.7B      290 MB    1-bit quantization
WebLLM           2-4 GB    OpenAI API compatible
Transformers.js  Variable  No GPU required
Secret Llama     2-4 GB    Privacy-focused UI

Bonsai is 7-14x smaller than typical Q4 browser models. The tradeoff is output quality for size: for prototyping, quick completions, or memory-constrained devices, 290 MB is compelling.


What This Means

Privacy-first AI with zero infrastructure cost.

  1. Wearables & IoT: 290MB fits on constrained devices
  2. Offline capability: Works after initial download
  3. Cross-platform: Same model runs on CUDA, Metal, WebGPU, iOS

Resources