IBM Granite AI

IBM's open-source enterprise LLM · running 100% in your browser · no login · no server · no data leaves your device

Granite 4.0 Nano running now via WebGPU · Granite 4.1 slot ready — awaiting ONNX-web build · Apache 2.0

Select a Granite model above and hit "Load model" to begin. Weights are cached after the first download — this page works offline once loaded.

About IBM Granite

Runtime: Transformers.js + WebGPU
Where inference runs: Your GPU. Your tab.
Data leaving device: Zero
License: Apache 2.0
Origin: IBM Research
After first download: Works offline

IBM Granite 4.1: The full 4.1 family (3B / 8B / 30B) is trained on ~15 trillion tokens with five-phase pre-training, long-context extension up to 512K tokens, and improved tool calling. The 3B model will slot in here as soon as an ONNX-web build becomes available. Source: ibm-granite/granite-4.1-language-models

Frequently Asked Questions — IBM Granite AI

Does my data get sent to IBM or any server?

No. The model weights are downloaded once from the Hugging Face CDN and cached in your browser. All inference happens on your GPU via WebGPU — no token, message, or prompt ever leaves your device. You could disconnect your internet after loading the model and it would keep working.
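The caching behaviour described above can be inspected from the page itself. The following is a minimal sketch, not the site's actual code: it assumes Transformers.js stores downloaded files in the browser's Cache Storage under the name "transformers-cache" (an assumption that may vary between library versions), and the helper names `filesForModel` and `cachedModelFiles` are hypothetical.

```typescript
// Minimal structural types so the sketch also type-checks outside a browser.
interface RequestLike { url: string }
interface CacheLike { keys(): Promise<readonly RequestLike[]> }
interface CacheStorageLike { open(name: string): Promise<CacheLike> }

// Assumption: Transformers.js keeps downloaded files in this Cache Storage bucket.
const ASSUMED_CACHE_NAME = "transformers-cache";

// Pure helper: keep only the cached URLs that belong to one model repository.
function filesForModel(urls: readonly string[], modelId: string): string[] {
  return urls.filter((url) => url.includes(`/${modelId}/`));
}

// List the cached files for a model.
// In a browser, pass the global `caches` object as `storage`.
async function cachedModelFiles(
  storage: CacheStorageLike,
  modelId: string,
  cacheName: string = ASSUMED_CACHE_NAME,
): Promise<string[]> {
  const cache = await storage.open(cacheName);
  const requests = await cache.keys();
  return filesForModel(requests.map((request) => request.url), modelId);
}
```

If the returned list is non-empty after the first load, the weights are being served from the local cache, which is what allows the page to keep working offline.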

What is IBM Granite?

Granite is IBM's family of open-source foundation models, developed by IBM Research and released under the Apache 2.0 license. Unlike many open-weight models, Granite was built with enterprise use in mind: transparent training data, rigorous safety evaluations, and strong performance on structured tasks like code generation, JSON extraction, tool calling, and RAG (retrieval-augmented generation). Granite 4.0 introduced a hybrid SSM (State Space Model) + attention architecture for the smaller models, making them more efficient at edge inference — which is why the Nano variants can run in a browser.

Why is the app showing Granite 4.0 instead of 4.1?

Granite 4.1 (3B / 8B / 30B) requires an ONNX-web conversion before it can run in the browser via Transformers.js. IBM published ONNX-web builds for the Granite 4.0 models offered here (Nano 350M, Nano 1B, and Micro 3.4B), but 4.1 conversions haven't been released yet. The picker includes a "coming soon" slot that will activate as soon as IBM or the onnx-community organization publishes an ONNX-web build.

Which model should I pick?

Granite 4.0 — 350M — Pick this for speed. At ~250 MB it loads fast and works on devices with limited GPU memory. The hybrid SSM architecture gives it surprising efficiency for its size. Best for quick rewrites, simple Q&A, and testing on constrained hardware.

Granite 4.0 — 1B ⭐ (recommended) — The best starting point. 1B parameters with the hybrid SSM architecture, ~700 MB download. Strong instruction-following, tool-call awareness, and good JSON output. The right balance of download size and quality for everyday tasks.

Granite 4.0 Micro (3.4B): Best quality. Dense transformer at 3.4B parameters, with noticeably better reasoning, code, and structured output. Requires a modern GPU and a ~2.2 GB download, but comes closest to production Granite quality of the models available in-browser.
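The trade-off in the list above can be expressed as a simple rule: take the largest model whose download you are willing to wait for. The sketch below uses the approximate download sizes quoted above; the `ModelOption` type and `pickModel` function are hypothetical, and the rule deliberately ignores GPU memory, which the page does not quantify.

```typescript
interface ModelOption {
  label: string;
  downloadMB: number; // approximate download size quoted in the list above
}

const MODELS: readonly ModelOption[] = [
  { label: "Granite 4.0 350M", downloadMB: 250 },
  { label: "Granite 4.0 1B", downloadMB: 700 },
  { label: "Granite 4.0 Micro 3.4B", downloadMB: 2200 },
];

// Return the largest model whose download fits the budget, or null if none fits.
function pickModel(
  budgetMB: number,
  models: readonly ModelOption[] = MODELS,
): ModelOption | null {
  const fitting = models.filter((model) => model.downloadMB <= budgetMB);
  if (fitting.length === 0) return null;
  return fitting.reduce((best, model) =>
    model.downloadMB > best.downloadMB ? model : best,
  );
}
```

With a budget of 1000 MB this selects the 1B model, which matches the recommendation above for everyday use.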

What is the hybrid SSM architecture in Granite 4.0?

The 350M and 1B Granite 4.0 Nano models use a hybrid architecture that combines State Space Model (SSM) layers with standard attention layers. SSM layers process sequences more efficiently than attention alone because their cost scales linearly with context length rather than quadratically. This makes the small Nano models more efficient at longer contexts and lowers their power consumption, which makes them well suited to edge and browser inference.
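To make the linear-versus-quadratic difference concrete, the sketch below compares how the two costs grow as the context gets longer. It is a simplified illustration, not a model of Granite itself: it counts only token interactions and ignores constant factors such as hidden size and SSM state size. The function names are hypothetical.

```typescript
// Full self-attention compares every token with every other token: n * n.
function attentionCost(tokens: number): number {
  return tokens * tokens;
}

// An SSM layer performs one state update per token: n.
function ssmCost(tokens: number): number {
  return tokens;
}

// How many times more work a layer does when the context grows from `from` to `to`.
function growth(
  cost: (tokens: number) => number,
  from: number,
  to: number,
): number {
  return cost(to) / cost(from);
}

const attentionGrowth = growth(attentionCost, 1024, 4096); // 16
const ssmGrowth = growth(ssmCost, 1024, 4096); // 4
```

Quadrupling the context from 1,024 to 4,096 tokens multiplies the attention cost by 16 but the SSM cost by only 4, and the gap keeps widening as contexts get longer.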

What is WebGPU and why does this need it?

WebGPU is a modern browser API that gives JavaScript direct access to your GPU for compute workloads, not just graphics. Running a language model takes billions of multiply-add operations, organized as large matrix multiplications. A GPU runs these in parallel in milliseconds, where a CPU can take seconds. Without WebGPU, browser AI is impractically slow for models larger than ~100M parameters. Chrome 113+, Edge 113+, and Brave support WebGPU by default on desktop. Safari requires enabling it in Develop → Feature Flags.
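A page can check for WebGPU before offering to load a model. The sketch below relies on the standard `navigator.gpu.requestAdapter()` call; the `NavigatorLike` type, the `WebGpuStatus` values, and the `checkWebGPU` function are hypothetical names introduced for illustration, and this is not necessarily how this page performs the check.

```typescript
// Minimal structural types so the sketch type-checks without WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type WebGpuStatus = "ready" | "no-api" | "no-adapter";

// Report whether WebGPU can be used.
// In a browser, call it as: checkWebGPU(navigator as NavigatorLike)
async function checkWebGPU(nav: NavigatorLike): Promise<WebGpuStatus> {
  // The browser does not expose the WebGPU API at all.
  if (!nav.gpu) return "no-api";
  // The API exists, but requestAdapter() resolves to null when no usable GPU is found.
  const adapter = await nav.gpu.requestAdapter();
  return adapter ? "ready" : "no-adapter";
}
```

Distinguishing "no-api" from "no-adapter" lets the page tell a user whether to switch browsers or whether the browser is fine but the hardware or driver is not supported.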

What is Transformers.js?

Transformers.js is Hugging Face's JavaScript port of the Python transformers library. It runs ONNX-exported models in the browser using ONNX Runtime Web, with optional WebGPU acceleration. It handles tokenization, chat template formatting, and streaming generation — everything you'd do in Python, running directly in a browser tab. IBM Granite's ONNX-web models are converted specifically for use with Transformers.js.
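As a rough illustration of that workflow, the sketch below runs one chat turn. It is written under several assumptions: `createPipeline` stands in for the `pipeline` function exported by `@huggingface/transformers` and is passed in as a parameter so the example stays self-contained; the types here are narrowed, hypothetical versions of the library's own; and the `dtype` and `max_new_tokens` values are illustrative, not this page's settings.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface GenerationOptions {
  max_new_tokens: number;
  temperature: number;
  do_sample: boolean;
}

// Narrowed shape of a text-generation pipeline when called with chat messages.
type ChatGenerator = (
  messages: ChatMessage[],
  options: GenerationOptions,
) => Promise<Array<{ generated_text: ChatMessage[] }>>;

// Stand-in for `pipeline` from "@huggingface/transformers" (assumption).
type PipelineFactory = (
  task: "text-generation",
  model: string,
  options: { device: "webgpu"; dtype: string },
) => Promise<ChatGenerator>;

// Load a model and return the assistant's reply to a single user message.
async function chatOnce(
  createPipeline: PipelineFactory,
  modelId: string,
  systemPrompt: string,
  userText: string,
): Promise<string> {
  const generator = await createPipeline("text-generation", modelId, {
    device: "webgpu", // run inference on the GPU through WebGPU
    dtype: "q4f16", // illustrative quantization choice
  });
  const messages: ChatMessage[] = [
    { role: "system", content: systemPrompt },
    { role: "user", content: userText },
  ];
  const output = await generator(messages, {
    max_new_tokens: 256,
    temperature: 0.7,
    do_sample: true,
  });
  // With chat input, the result holds the conversation with the reply appended.
  const turns = output[0]?.generated_text ?? [];
  const reply = turns[turns.length - 1];
  if (!reply || reply.role !== "assistant") {
    throw new Error("No assistant reply found in the pipeline output");
  }
  return reply.content;
}
```

In a real page the factory argument would be the library's own `pipeline` function, possibly with a type cast since its real signature is broader than the one sketched here.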

Is this tool free?

Yes, completely free. No account, no subscription, no ads, no tracking beyond basic analytics. One of 120 free browser-based tools at jasperbernaers.com.