Google just released Gemma 4, its most capable open model family to date. Daneel now supports it across both WebGPU (in-browser) and Ollama (local server) backends, with two model variants and three ways to run them.
Why Gemma 4 matters
Gemma 4 represents a significant leap from Gemma 3 on every measurable axis. The improvements are not incremental — on long-context retrieval (MRCR v2 at 128K tokens), the new family scores 66.4% where Gemma 3 managed 13.5%. On visual document understanding (OmniDocBench), edit distance drops from 0.365 to 0.131. Reasoning benchmarks follow the same trend across the board.
The practical takeaway: Gemma 4 handles longer documents, understands questions better, and reasons more carefully than its predecessor, and you can run it locally without sending a single byte to the cloud.
What's new in the architecture
Gemma 4 introduces a hybrid attention mechanism that interleaves local sliding window attention with full global attention. This is what makes the 128K context window practical rather than theoretical: the model can actually retrieve and reason over information buried deep in long documents.
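To make the interleaving concrete, here is an illustrative TypeScript sketch of how a schedule like this can work. The local-to-global ratio and window size below are placeholders, not published Gemma 4 hyperparameters.

```typescript
// Illustrative only: a toy schedule of local vs. global attention layers.
// GLOBAL_EVERY and WINDOW are placeholder values, not Gemma 4's real settings.
const GLOBAL_EVERY = 6; // assume every 6th layer uses full global attention
const WINDOW = 1024;    // assume a 1K-token sliding window on local layers

function isGlobalLayer(layer: number): boolean {
  return layer % GLOBAL_EVERY === GLOBAL_EVERY - 1;
}

// Causal attention mask: may the query at position q attend to position k?
function canAttend(layer: number, q: number, k: number): boolean {
  if (k > q) return false;               // causal: never look ahead
  if (isGlobalLayer(layer)) return true; // global layers see the whole prefix
  return q - k < WINDOW;                 // local layers see a sliding window
}
```

The point is that most layers only pay for a fixed-size window, while the occasional global layer keeps long-range retrieval possible across the full 128K context.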
Other highlights from Google's release:
- Built-in reasoning mode with step-by-step thinking (similar to chain-of-thought)
- Native function calling for agentic workflows and tool use (see the sketch after this list)
- 35+ languages supported, with pre-training covering 140+ languages
- Per-Layer Embeddings in smaller variants, maximizing quality per parameter for on-device deployment
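To show what the function-calling piece looks like in practice, here is a minimal TypeScript sketch against a local Ollama server. It assumes the gemma4:e4b tag supports Ollama's standard /api/chat tool-calling format; the get_weather tool is purely hypothetical.

```typescript
// Minimal sketch: tool use via Ollama's /api/chat endpoint.
// Assumes a local Ollama server and that gemma4:e4b supports standard tool calls.
const response = await fetch("http://localhost:11434/api/chat", {
  method: "POST",
  body: JSON.stringify({
    model: "gemma4:e4b",
    stream: false,
    messages: [{ role: "user", content: "What's the weather in Oslo?" }],
    tools: [{
      type: "function",
      function: {
        name: "get_weather", // hypothetical tool, for illustration only
        description: "Get current weather for a city",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
      },
    }],
  }),
});

const data = await response.json();
// If the model decides to call the tool, the call shows up here:
console.log(data.message?.tool_calls);
```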
Three ways to run it in Daneel
WebGPU — in-browser, zero setup
Gemma 4 E2B (2.3B parameters) runs directly in your browser at q4f16 quantization, using roughly 2 GB of GPU memory. No server, no API key, no network connection required after the initial model download. Select it from Settings > Models — it appears automatically.
Requires a GPU with shader-f16 support (most GPUs from the last 3-4 years).
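If you are not sure whether your hardware qualifies, a quick check from the browser's developer console looks like this:

```typescript
// Check whether the browser's GPU adapter exposes the shader-f16 feature,
// which the in-browser q4f16 build of Gemma 4 E2B relies on.
const adapter = await navigator.gpu?.requestAdapter();
if (adapter?.features.has("shader-f16")) {
  console.log("shader-f16 available: the WebGPU backend should work");
} else {
  console.log("shader-f16 missing: fall back to the Ollama backend");
}
```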
Ollama E2B — local server, lightweight
Pull gemma4:e2b for the same 2.3B model running on your local Ollama instance. Good for machines where WebGPU isn't available or when you want to share the model across multiple tools.
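If you would rather script the download than use the CLI, the same pull can go through Ollama's REST API; this is a sketch assuming the default server address:

```typescript
// Sketch: pull the Gemma 4 E2B weights through Ollama's REST API,
// equivalent to running "ollama pull gemma4:e2b" on the command line.
const res = await fetch("http://localhost:11434/api/pull", {
  method: "POST",
  body: JSON.stringify({ model: "gemma4:e2b", stream: false }),
});
console.log(await res.json()); // reports status once the download completes
```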
Ollama E4B — local server, higher quality
Pull gemma4:e4b for the 4.5B dense variant. Stronger reasoning and broader knowledge at the cost of roughly 4 GB VRAM. This is the sweet spot for users with 8+ GB GPUs who want the best local quality.
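Once the tag is pulled, a one-off request from TypeScript is only a few lines. This sketch uses Ollama's /api/generate endpoint with an illustrative prompt:

```typescript
// Sketch: a single completion from the local gemma4:e4b model.
const res = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  body: JSON.stringify({
    model: "gemma4:e4b",
    prompt: "In two sentences, when should I prefer E4B over E2B?",
    stream: false,
  }),
});
const { response } = await res.json();
console.log(response);
```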
How it stacks up
| Model | Params | Context | Thinking | Languages | Runs in browser |
|---|---|---|---|---|---|
| Granite 4.0 Micro | 3B | 8K | No | English | Yes |
| Gemma 3 1B | 1B | 32K | No | Limited | Yes |
| Gemma 4 E2B | 2.3B | 128K | Yes | 35+ | Yes |
| Gemma 4 E4B | 4.5B | 128K | Yes | 35+ | No (Ollama) |
Gemma 4 E2B is now the most capable model in Daneel's WebGPU catalog: largest context window, built-in reasoning, and multilingual support — all at 2 GB.
Learning more about Gemma 4
What's next
Gemma 4 is natively multi-modal at the model level — it understands images and short audio alongside text. Daneel currently uses it for text tasks. Image and audio input support is on our roadmap and will unlock these capabilities in a future update.
Getting started
Update Daneel to the latest version from the Chrome Web Store. The new models appear automatically in Settings > Models.
For Ollama users, make sure your server is up to date, since Gemma 4 requires a recent Ollama version. If you run Ollama in Docker: docker pull ollama/ollama:latest
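To confirm the server you are running is recent enough, you can ask it directly. This sketch queries Ollama's version endpoint; check the reported number against the minimum noted in Ollama's release notes for Gemma 4 support.

```typescript
// Sketch: confirm the local Ollama server version before pulling Gemma 4.
const res = await fetch("http://localhost:11434/api/version");
const { version } = await res.json();
console.log(`Ollama server version: ${version}`);
```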