What It Is

A 1.7B parameter language model compressed to 290MB using 1-bit quantization, running entirely in-browser via WebGPU. No server. No API fees. No data leaving your device.

Developed by Prism ML and based on the Qwen3-1.7B architecture, with the WebGPU implementation hosted by the WebML Community on Hugging Face.


Technical Specifications

Specification   Value
Parameters      1.7B (1.4B non-embedding)
Quantization    Q1_0 g128 (1.125 bits/weight)
Context Length  32,768 tokens
Vocabulary      151,936 tokens
Layers          28 Transformer blocks
Attention       GQA (16 query heads / 8 KV heads)
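Grouped-query attention (GQA) matters here because the KV cache, not just the weights, dominates memory at long context. A back-of-envelope calculation using the table above, with an assumed head dimension of 128 (standard for Qwen3 models, but not stated in the table) and an FP16 cache:

```python
# KV-cache size at full context, from the spec table above.
# HEAD_DIM = 128 is an assumption; the other constants come from the table.
LAYERS, KV_HEADS, HEAD_DIM, CTX = 28, 8, 128, 32768
BYTES_PER_ELEM = 2  # FP16 cache

# Each layer caches keys and values (factor of 2) for every KV head.
kv_bytes = 2 * LAYERS * KV_HEADS * HEAD_DIM * CTX * BYTES_PER_ELEM
print(f"KV cache at 32k context: {kv_bytes / 2**30:.2f} GiB")  # 3.50 GiB

# With 16 KV heads (full multi-head attention) the cache would double:
mha_bytes = kv_bytes * (16 / 8)
print(f"Same cache without GQA:  {mha_bytes / 2**30:.2f} GiB")  # 7.00 GiB
```

Even with GQA halving the cache versus full multi-head attention, a full 32k-token context costs several gigabytes, far more than the 290 MB of weights, which is why long-context use remains the practical memory limit in-browser.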

Memory Compression

Format           Size     Compression
FP16             3.44 GB  Baseline
Browser (1-bit)  290 MB   11.9x
GGUF Q1_0        0.24 GB  14.2x

The quantization method stores 1 sign bit per weight plus one FP16 scale per group of 128 weights, i.e. 1 + 16/128 = 1.125 bits per weight. Each bit dequantizes to a signed scale: bit 0 maps to -scale, bit 1 to +scale. All layers are quantized: embeddings, attention, MLP, and the LM head.
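A minimal NumPy sketch of this sign-bit scheme: one bit per weight plus an FP16 scale per group of 128. The scale here is the group's mean absolute value, a common choice for binary quantization but an assumption; this is illustrative and does not reproduce the exact GGUF Q1_0 bit packing.

```python
import numpy as np

GROUP = 128  # weights per scale group, per the Q1_0 g128 layout

def quantize_q1(weights: np.ndarray):
    """Reduce each weight to its sign bit, with one FP16 scale per group.

    Scale choice (mean |w| per group) is an illustrative assumption.
    """
    groups = weights.reshape(-1, GROUP)
    scales = np.abs(groups).mean(axis=1).astype(np.float16)  # 16 bits per 128 weights
    bits = (groups >= 0).astype(np.uint8)                    # 1 bit per weight
    return bits, scales

def dequantize_q1(bits: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct each weight as +scale (bit 1) or -scale (bit 0)."""
    signs = bits.astype(np.float32) * 2.0 - 1.0              # {0,1} -> {-1,+1}
    return (signs * scales[:, None].astype(np.float32)).reshape(-1)

w = np.random.randn(256).astype(np.float32)
bits, scales = quantize_q1(w)
w_hat = dequantize_q1(bits, scales)
# Storage: 256 sign bits + 2 FP16 scales = 288 bits / 256 weights = 1.125 bits/weight
```

Only the signs survive quantization; per-group magnitude information is carried entirely by the single scale, which is why this reaches 1.125 bits per weight.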


Performance Benchmarks

Platform             Throughput  vs FP16
RTX 4090 (CUDA)      674 tok/s   3.0x faster
M4 Pro 48GB (Metal)  250 tok/s   3.8x faster
iPhone (MLX Swift)   130 tok/s

The 1-bit kernels are faster than FP16 because autoregressive decoding is memory-bandwidth-bound: each generated token requires streaming the model's weights from memory, so fetching ~14x less data per token translates directly into faster decoding.
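The bandwidth argument can be made concrete with a back-of-envelope roofline estimate: if decoding is purely bandwidth-bound, tokens per second is roughly memory bandwidth divided by model size in bytes. The 1 TB/s bandwidth figure below is an illustrative assumption, not a measured value.

```python
def decode_tok_per_s(params: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Bandwidth-bound decode estimate: each token streams all weights once."""
    model_bytes = params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes

PARAMS = 1.7e9
fp16_est = decode_tok_per_s(PARAMS, 16, 1000)     # ~294 tok/s at 1 TB/s
onebit_est = decode_tok_per_s(PARAMS, 1.125, 1000)  # ~4180 tok/s at 1 TB/s
print(f"FP16:  ~{fp16_est:.0f} tok/s")
print(f"1-bit: ~{onebit_est:.0f} tok/s")
```

The pure-bandwidth model predicts a ~14x speedup, while the benchmarks above show 3-4x; the gap is expected, since real kernels also pay for dequantization compute, KV-cache reads, and activation traffic that do not shrink with weight precision.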


Competitor Analysis

Solution         Size      Key Feature
Bonsai 1.7B      290 MB    1-bit quantization
WebLLM           2-4 GB    OpenAI API compatible
Transformers.js  Variable  No GPU required
Secret Llama     2-4 GB    Privacy-focused UI

Bonsai is 7-14x smaller than typical Q4 browser models. The tradeoff is output quality for size: for prototyping, quick completions, or memory-constrained devices, 290 MB is compelling.


What This Means

Privacy-first AI with zero infrastructure cost.

  1. Wearables & IoT: 290MB fits on constrained devices
  2. Offline capability: Works after initial download
  3. Cross-platform: Same model runs on CUDA, Metal, WebGPU, iOS

Resources