No. The model weights are downloaded once from the Hugging Face CDN and cached in your browser. All inference happens on your GPU via WebGPU, so no token, message, or prompt ever leaves your device. You could disconnect from the internet after the model has loaded and it would keep working.
Granite is IBM's family of open-source foundation models, developed by IBM Research and released under the Apache 2.0 license. Unlike many open-weight models, Granite was built with enterprise use in mind: transparent training data, rigorous safety evaluations, and strong performance on structured tasks like code generation, JSON extraction, tool calling, and RAG (retrieval-augmented generation). Granite 4.0 introduced a hybrid SSM (State Space Model) + attention architecture for the smaller models, making them more efficient at edge inference — which is why the Nano variants can run in a browser.
Granite 4.1 (3B / 8B / 30B) requires an ONNX-web conversion before it can run in the browser via Transformers.js. IBM published ONNX-web builds for the Granite 4.0 models offered here (the 350M and 1B Nano models and the 3.4B Micro), but 4.1 conversions haven't been released yet. The picker includes a "coming soon" slot that will activate as soon as IBM or the onnx-community publishes an ONNX-web build.
Granite 4.0 — 350M — Pick this for speed. At ~250 MB it loads fast and works on devices with limited GPU memory. The hybrid SSM architecture gives it surprising efficiency for its size. Best for quick rewrites, simple Q&A, and testing on constrained hardware.
Granite 4.0 — 1B ⭐ (recommended) — The best starting point. 1B parameters with the hybrid SSM architecture, ~700 MB download. Strong instruction-following, tool-call awareness, and good JSON output. The right balance of download size and quality for everyday tasks.
Granite 4.0 Micro — 3.4B — Best quality. A dense transformer at 3.4B parameters, with noticeably better reasoning, code, and structured output. It requires a modern GPU and a ~2.2 GB download, but it is the closest thing to production Granite quality available in-browser.
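As a rough illustration of the trade-off between the three options, here is a minimal sketch that picks the largest model fitting a download budget. The download sizes are the approximate figures quoted above; the `ModelOption` type, the `pickModel` function, and the idea of choosing by a megabyte budget are hypothetical and are not how this page's picker necessarily works.

```typescript
// Hypothetical description of one picker entry.
interface ModelOption {
  label: string;
  approxDownloadMb: number; // approximate sizes quoted in this FAQ
}

const SMALL: ModelOption = { label: "Granite 4.0 350M", approxDownloadMb: 250 };
const MEDIUM: ModelOption = { label: "Granite 4.0 1B", approxDownloadMb: 700 };
const LARGE: ModelOption = { label: "Granite 4.0 Micro 3.4B", approxDownloadMb: 2200 };

const MODELS: ModelOption[] = [SMALL, MEDIUM, LARGE];

// Return the largest model whose download fits the budget.
// Falls back to the smallest model when nothing fits.
function pickModel(budgetMb: number): ModelOption {
  let best: ModelOption = SMALL;
  for (const model of MODELS) {
    const fits = model.approxDownloadMb <= budgetMb;
    if (fits && model.approxDownloadMb > best.approxDownloadMb) {
      best = model;
    }
  }
  return best;
}
```

With a budget of 1000 MB this selects the 1B model; with 5000 MB it selects Micro. Download size is only a proxy here, since available GPU memory matters as well.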
The 350M and 1B Granite 4.0 Nano models use a hybrid architecture that combines State Space Models (SSM) with standard attention layers. SSM layers process sequences more efficiently than attention alone: their cost scales linearly with context length rather than quadratically. This makes the Nano models more efficient at longer contexts and reduces power consumption, which is why they are well suited to edge and browser inference.
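To make the linear-versus-quadratic difference concrete, here is a small sketch that compares how the two costs grow with context length. The functions count abstract "units of work" only and ignore constant factors, hidden size, and the fact that the hybrid models still contain some attention layers, so the numbers illustrate the growth rate rather than real timings.

```typescript
// Self-attention compares every token with every other token: O(n^2).
function attentionCost(tokens: number): number {
  return tokens * tokens;
}

// An SSM layer makes one pass over the sequence: O(n).
function ssmCost(tokens: number): number {
  return tokens;
}

// How much more work a layer does when the context grows from `from` to `to` tokens.
function growthFactor(cost: (tokens: number) => number, from: number, to: number): number {
  return cost(to) / cost(from);
}

// Doubling the context from 1024 to 2048 tokens:
const attentionGrowth = growthFactor(attentionCost, 1024, 2048); // 4
const ssmGrowth = growthFactor(ssmCost, 1024, 2048); // 2
```

Each doubling of the context quadruples the attention work but only doubles the SSM work, and that gap widens as the context grows.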
WebGPU is a modern browser API that gives JavaScript direct access to your GPU for compute workloads, not just graphics. Running a language model requires billions of arithmetic operations, mostly inside large matrix multiplications. A GPU runs these in parallel, so work that takes milliseconds on a GPU can take seconds on a CPU. Without WebGPU, browser AI is impractically slow for models larger than ~100M parameters. Chrome 113+, Edge 113+, and Brave support WebGPU by default on desktop. Safari requires enabling it in Develop → Feature Flags.
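A page can check for WebGPU before offering to download a model. The sketch below uses the standard `navigator.gpu.requestAdapter()` entry point; the `NavigatorLike` and `GpuLike` interfaces and the `detectWebGpu` name are hypothetical helpers, introduced only so the example compiles without extra type packages.

```typescript
// Minimal structural types (assumptions) standing in for the real WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<unknown | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type WebGpuStatus = "available" | "no-adapter" | "unsupported";

// True when the browser exposes the WebGPU API at all.
function hasWebGpuApi(nav: NavigatorLike): boolean {
  return nav.gpu !== undefined;
}

// Distinguishes "API missing" from "API present but no usable GPU adapter".
async function detectWebGpu(nav: NavigatorLike): Promise<WebGpuStatus> {
  if (!nav.gpu) {
    return "unsupported";
  }
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "available" : "no-adapter";
  } catch {
    // Treat a rejected adapter request the same as having no adapter.
    return "no-adapter";
  }
}
```

In a browser this would be called as `detectWebGpu(navigator as NavigatorLike)`. A result of `"no-adapter"` usually means the API exists but the GPU or driver is blocked or unsupported.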
Transformers.js is Hugging Face's JavaScript port of the Python transformers library. It runs ONNX-exported models in the browser using ONNX Runtime Web, with optional WebGPU acceleration. It handles tokenization, chat template formatting, and streaming generation — everything you'd do in Python, running directly in a browser tab. IBM Granite's ONNX-web models are converted specifically for use with Transformers.js.
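For a sense of what this looks like in code, here is a minimal sketch of a chat-style generation call. It assumes the `pipeline` function exported by the `@huggingface/transformers` package and passes it in as a parameter so the example stays self-contained. The `PipelineFactory` type is a simplified stand-in for the library's real typings, and the `"q4"` dtype and the model ID you pass are assumptions that depend on which ONNX-web build you use. This is not necessarily how this page is implemented.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Simplified stand-in (assumption) for the library's text-generation pipeline typings.
type Generator = (
  messages: ChatMessage[],
  options: { max_new_tokens: number },
) => Promise<Array<{ generated_text: ChatMessage[] }>>;
type PipelineFactory = (
  task: string,
  model: string,
  options: { device: string; dtype: string },
) => Promise<Generator>;

// Build the message list that the model's chat template will format.
function buildMessages(userPrompt: string, systemPrompt?: string): ChatMessage[] {
  const messages: ChatMessage[] = [];
  if (systemPrompt) {
    messages.push({ role: "system", content: systemPrompt });
  }
  messages.push({ role: "user", content: userPrompt });
  return messages;
}

async function generateReply(
  pipeline: PipelineFactory,
  modelId: string, // an ONNX-web model repository ID, supplied by the caller
  userPrompt: string,
): Promise<string> {
  // Request WebGPU execution; "q4" is an assumed quantization choice.
  const generator = await pipeline("text-generation", modelId, {
    device: "webgpu",
    dtype: "q4",
  });
  const output = await generator(buildMessages(userPrompt), { max_new_tokens: 128 });
  // For chat input, generated_text holds the conversation; the last entry is the reply.
  const conversation = output[0]?.generated_text ?? [];
  const last = conversation[conversation.length - 1];
  return last ? last.content : "";
}
```

In a real page you would import `pipeline` from `@huggingface/transformers` and pass it as the first argument, possibly with a type cast, since the library's own typings are broader than the simplified `PipelineFactory` used here.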
Yes, completely free. No account, no subscription, no ads, no tracking beyond basic analytics. One of 120 free browser-based tools at jasperbernaers.com.