---
id: "three-walls-browser-llm-inference-2026"
date: "2026-04-25"
title: "The three walls of in-browser LLM inference: the state of affairs in April 2026"
summary: "In April 2026 browser LLM inference passed an inflection point: runtime architecture is solved, and per-tab VRAM is the new ceiling."
abstract: "In-browser LLM inference has lived under three constraint walls in sequence: protobuf's 2 GB cap on `.onnx` files, WebAssembly's 4 GB linear-memory limit, and the browser-allocated WebGPU VRAM budget.

As of April 2026, the first two are effectively cleared — the first by ONNX's External Data format, the second by ONNX Runtime's new C++ WebGPU execution provider — while the third remains the binding constraint, with no portable spec query and substantial per-platform variance.

This paper traces how each wall arose, how the February 2026 transformers.js v4 / ORT C++ EP inflection collapsed two of them, and what genuinely fits in a browser tab today across Apple Silicon, NVIDIA, AMD, Intel, and mobile."
header: "Research"
topic: "local-inference"
tags: ["WebGPU", "transformers.js", "ONNX Runtime", "in-browser LLM", "WebAssembly"]
authors: ["Julien Borrel", "Claude Opus"]
image: "/medias/research.webgpu.color.png"
---

On 9 February 2026, Hugging Face shipped a transformers.js v4 preview with a single headline benchmark: GPT-OSS 20B at `q4f16` running at roughly 60 tokens per second in a browser tab on an M4 Pro Max[^v4-blog]. Six months earlier that number was not possible at any quantization, on any consumer hardware, in any tab.

The reason is not that GPUs got faster. The reason is that two of the three walls that had constrained in-browser LLM inference for years quietly fell.

This paper is about those walls — what each one was, why it mattered, how each was breached, and which one is left.

## TL;DR

- The historical sequence of binding constraints on `.onnx` LLMs in the browser has been: (1) the protobuf 2 GB cap on `ModelProto`, (2) the WebAssembly 32-bit 4 GB linear-memory cap, and (3) the per-tab WebGPU VRAM budget allocated by the browser.
- Wall 1 has been solved for years by the **ONNX External Data** format: a `.onnx` graph plus a sidecar weight blob[^onnx-extdata].
- Wall 2 was effectively retired by **direct GPU-buffer weight loading**, shipped in ONNX Runtime 1.20 (November 2024) and made the default path in the new C++ WebGPU execution provider that arrived for browsers in early 2026[^ort-large][^ort-1-20]. WebAssembly Memory64 turned out to be a sideshow for ORT — the build option was deleted in PR #25181 (June 2025) as "incomplete"[^pr-25181].
- Wall 3 — **the per-tab WebGPU VRAM budget** — has no portable spec query[^gpuweb-5505] and is set by browser policy on top of adapter limits. Empirically, ~3–4 GB on Apple Silicon, ~4 GB single-buffer on RTX 4090 (D3D12), ~2 GB on Snapdragon X Elite. This is the new ceiling.
- Practical April 2026 ceilings: **~20 B q4 on Apple's high-end** (the 60 tok/s GPT-OSS number), **~7–8 B q4f16 on 8 GB-class consumer GPUs**, **≤ 2 B for broad device compatibility including mobile**[^v4-blog].
- WebGPU shipped by default in all four major browsers as of November 25, 2025 — Chrome 113 (May 2023), Edge 113, Firefox 141 on Windows (July 2025) / 145 on macOS, Safari 26 (September 2025)[^webgpu-com-news][^web-dev-supported].

The browser is finally a real LLM runtime. What follows is the mechanics of how it got there.

```mermaid
timeline
    title From wall to wall — a sequence of binding constraints
    Pre-2024 : Wall 1 — protobuf 2 GB ModelProto cap
             : Solved by ONNX External Data format
    Feb 2024 (ORT 1.17) : JSEP ships — first widely-available WebGPU EP for ORT-Web
                        : Wall 2 (4 GB WASM heap) is the binding limit
    Nov 2024 (ORT 1.20) : Direct GPU-buffer weight loading
                        : Wall 2 effectively bypassed
    Jan-Feb 2025 : Memory64 ships in Chrome 133 + Firefox 134
                 : But ORT pivots away from wasm64
    Jun 2025 : ORT deletes wasm64 build option (PR 25181)
    Feb 2026 : transformers.js v4 + new C++ WebGPU EP in ORT
             : 60 tok/s GPT-OSS 20B q4f16 on M4 Pro Max
    Now : Wall 3 — per-tab WebGPU VRAM is the ceiling
```

## The three walls

### Wall 1 — protobuf's 2 GB cap on `.onnx` files

ONNX serializes models as protobuf. Protobuf uses signed `int32` byte offsets, which means a single `ModelProto` cannot exceed 2,147,483,647 bytes — a hard, encoded-in-the-spec ceiling[^onnx-3275][^ort-15349]. Hit it and you get the well-known `ValueError: Message onnx.ModelProto exceeds maximum protobuf size of 2GB`.

The fix shipped years ago and is now table stakes: the **External Data format**[^onnx-extdata]. The `.onnx` file holds only the graph; weights live in a sidecar (`.onnx_data` or `.data`) referenced by `location` (relative path), `offset` (with 4 KB alignment recommended, 64 KB on Windows), `length`, and an optional SHA-1 `checksum`. A 7 B model fits cleanly in this layout; a 70 B does too in principle, modulo what happens later in the WASM and GPU layers.

ORT Web has a JavaScript-specific wrinkle: the runtime cannot probe a filesystem to find the sidecar, so callers must pass `externalData` explicitly to `InferenceSession.create()`. transformers.js triggers this path with `use_external_data_format: true` — that is how, for example, `Xenova/Phi-3-mini-4k-instruct_fp16` loads at `dtype: 'q4'`[^ort-large].
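A minimal sketch of that call path, assuming the graph and sidecar are served from the same directory; the option shape follows the ORT-Web large-models guide, the bundle entry point varies by ORT-Web version, and the file names and URLs are illustrative:

```ts
import * as ort from 'onnxruntime-web/webgpu'; // entry point varies by ORT-Web version

// Hedged sketch: the sidecar is named explicitly because ORT Web cannot probe a
// filesystem for it. File names and URLs below are illustrative.
const session = await ort.InferenceSession.create('https://example.com/model.onnx', {
  executionProviders: ['webgpu'],
  externalData: [
    {
      path: 'model.onnx_data',                     // the name referenced inside the graph
      data: 'https://example.com/model.onnx_data', // URL (or a Uint8Array) to load it from
    },
  ],
});
```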

There is one subtlety worth knowing if you are pushing the upper end. **Chrome's `ArrayBuffer` cap is `0x7fe00000` — about 2 GiB.** For external-data files larger than 2 GB, ORT Web does not allocate via `fetch().arrayBuffer()`; it allocates via `new WebAssembly.Memory()`. The consequence: such buffers are not *Transferable* to a Web Worker, which means the ORT-Web Proxy worker feature is incompatible with very large models[^ort-large]. If you need both > 2 GB weights and a worker-based session, the architecture forces you into the same-thread path. This is rarely a binding concern at q4 quantizations on consumer-scale models, but it shapes deployment for the 14 B+ class.

### Wall 2 — the 4 GB WebAssembly linear-memory cap

WebAssembly 1.0 / wasm32 uses 32-bit pointers, which gives a hard 4 GiB linear-memory ceiling. V8 raised its previous internal 2 GB cap to the full 4 GB years ago[^v8-4gb]. ORT's own large-models documentation acknowledges the consequence in plain language: *"WebAssembly has a memory limit of 4GB. … We may support it in the future either by using WASM64 or by using direct GPU weight loading."*[^ort-large]

Two escape hatches were proposed. Only one panned out for ORT.

#### Escape A — WebAssembly Memory64 (`wasm64`)

Memory64 reached phase-4 on 5 November 2024. Chrome 133 shipped it by default on 4 February 2025[^chromestatus-mem64][^chrome-blink], and Firefox 134 followed on 7 January 2025[^firefox-134]. Safari, as of April 2026, still has not shipped Memory64 — Safari 26.2 (December 2025) added resizable Wasm Memory and JS String Builtins, but not 64-bit memory[^safari-26-2].

Two practical caveats apply even where it has shipped.

The first is the **16 GB browser cap**. The theoretical Memory64 ceiling is 16 EB; the browser ceiling is 16 GB[^spider-mem64]. SpiderMonkey explained the gap directly: *"because WebAssembly makes no distinction between 'reserved' and 'committed' memory, browsers cannot freely allocate large quantities of memory without running into system commit limits."*[^spider-mem64] You do not get to address 4 TB of RAM from a browser tab; you get 16 GB of declared linear memory, and the browser may even decline some of that.

The second is the **performance penalty**, which SpiderMonkey put bluntly: *"It is impossible to beat the absolute removal of all bounds checks found in 32-bit WebAssembly."*[^spider-mem64] In wasm32, engines reserve a full 4 GB virtual address region per module and rely on hardware page protection to elide every bounds check. Memory64 cannot do that trick — every load and store needs an explicit bounds check, costing 10 % to over 100 % depending on workload. V8 has work in flight on bounds-check elimination via value-range analysis[^v8-bce-cl], but as of April 2026 there is no engine-level fix.

ORT's initial Memory64 effort was PR #21260 (5 July 2024), closed on 23 August 2024 in favor of follow-up work in #21836[^pr-21260]. Then on 25 June 2025, PR #25181 deleted the wasm64 build option entirely with a one-line rationale: *"Delete WASM64 build option, because the feature was incomplete. Likely we will need to reimplement it later."*[^pr-25181] The team was already pursuing the second escape hatch and judged Memory64 not worth carrying as a parallel build path.

#### Escape B — direct GPU-buffer weight loading

This is the path ORT actually took. The idea: weights skip WASM linear memory entirely. Instead of `fetch` → WASM heap → GPU upload, the path is `fetch` → stream directly into `GPUBuffer`s via `device.queue.writeBuffer()` or `mappedAtCreation`, with `usage = STORAGE | COPY_DST`. The WASM module holds only handles — about 150 bytes per buffer, per the WebGPU explainer — and orchestration state. The model weights themselves never live in the 4 GB heap.
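Stripped to its essentials, the upload path looks something like the sketch below — a simplified illustration, not ORT's internal loader, which streams in aligned chunks rather than materializing one `ArrayBuffer` (see the Chrome cap discussed under Wall 1):

```ts
// Simplified sketch: put a weight blob straight into a GPUBuffer so it never
// transits WASM linear memory. A production loader streams chunk-by-chunk instead
// of calling arrayBuffer() on the whole response.
async function uploadWeights(device: GPUDevice, url: string): Promise<GPUBuffer> {
  const bytes = new Uint8Array(await (await fetch(url)).arrayBuffer());
  const padded = new Uint8Array(Math.ceil(bytes.byteLength / 4) * 4); // WebGPU sizes are 4-byte aligned
  padded.set(bytes);
  const buffer = device.createBuffer({
    size: padded.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(buffer, 0, padded); // upload; the WASM module only ever sees the handle
  return buffer;
}
```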

```mermaid
flowchart LR
    A[".onnx graph<br/>(protobuf)"] --> Fetch[fetch]
    B[".onnx_data sidecar<br/>(weights blob)"] --> Fetch
    Fetch --> OPFS[OPFS cache]
    OPFS --> GPU["GPUBuffer<br/>STORAGE | COPY_DST"]
    Kernel["C++ kernel<br/>in WASM module"] --> GPU
    Kernel --> Pipeline[WebGPU compute pipeline]
    style GPU fill:#1f4068,stroke:#3aa,color:#fff
    style Kernel fill:#1f4068,stroke:#3aa,color:#fff
```

The capability shipped in **ONNX Runtime 1.20 (November 2024)**, where the release notes describe it as *"on-demand weight loading support (offloads Wasm32 heap and enables 8B-parameter LLMs)"*[^ort-1-20]. PR #23910, "[WebGPU] Direct CPU→GPU buffer upload for UMA", added the unified-memory-architecture optimization that eliminates staging buffers when the device has UMA — reported memory drop on Tiger Lake from 5.1 GB to 3.2 GB without any throughput regression[^pr-23910].

The structural consequence is what makes this the *correct* fix: this path **does not require Memory64 at all**. Memory64 raises the ceiling on a thing (the WASM heap) that the new architecture has stopped putting weights into. ORT's June 2025 deletion of wasm64 was not a retreat. It was a recognition that the problem had moved.

### Wall 3 — WebGPU buffer limits and the per-tab VRAM budget

The hardware-allocation problem is the new frontier. It has three layers: spec defaults (the floor), adapter-exposed limits (what your card claims), and the per-tab VRAM budget the browser actually grants (what you actually get). All three must be reasoned about.

The **WebGPU spec defaults** are conservative[^webgpu-limits][^webgpu-spec]:

- `maxBufferSize` — 256 MiB
- `maxStorageBufferBindingSize` — 128 MiB
- `maxUniformBufferBindingSize` — 64 KiB
- `maxComputeWorkgroupStorageSize` — 16 KiB
- `maxComputeInvocationsPerWorkgroup` — 256

Apps must opt into higher limits via `requiredLimits` on `requestDevice()`. Chrome 133 even nudges you with the diagnostic: *"Buffer size (268435457) exceeds the max buffer size limit (268435456). This adapter supports a higher maxBufferSize of 4294967296, which can be specified in requiredLimits…"*[^chrome-133]
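In application code the opt-in is a couple of lines — a sketch of the usual pattern, requesting whatever the adapter actually advertises rather than hard-coding numbers:

```ts
// Sketch: lift buffer limits to whatever this adapter can grant, and take
// shader-f16 only if it is present (see the mobile caveat later in this paper).
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error('WebGPU unavailable');
const device = await adapter.requestDevice({
  requiredLimits: {
    maxBufferSize: adapter.limits.maxBufferSize,                             // e.g. 4 GiB on desktop D3D12
    maxStorageBufferBindingSize: adapter.limits.maxStorageBufferBindingSize, // e.g. ~2 GiB
  },
  requiredFeatures: adapter.features.has('shader-f16') ? ['shader-f16'] : [],
});
```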

In practice, **adapter-exposed `maxBufferSize` is typically 4 GiB on desktop** — both NVIDIA and AMD running on Windows D3D12 cap there. **`maxStorageBufferBindingSize` is typically about 2 GiB** — just under `INT32_MAX`, the D3D12 raw-buffer SRV limit. Both are real ceilings, not soft preferences.

The 2 GiB storage-binding cap has a direct architectural consequence for LLMs. **A single weight matrix in an 8 B q4 model can exceed 2 GiB.** Frameworks must shard one logical tensor across multiple `GPUBuffer`s and rebind per kernel invocation. ORT Web's WebGPU EP forces `memory_pattern = false` precisely because *"the WebGPU data is not presented by a general memory model (a buffer can be represented by offset + size)"* — one logical tensor can span multiple physical buffers[^pr-23697][^pr-23910]. This is invisible to model authors but visible in kernel design.
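What the sharding looks like in the abstract — an illustrative sketch, not ORT's allocator, assuming the blob length is already a multiple of 4 bytes (external-data exporters align their chunks):

```ts
// Illustrative only: split one logical weight blob across GPUBuffers that each
// stay under the ~2 GiB storage-binding cap; kernels then rebind per shard.
function shardWeights(device: GPUDevice, blob: Uint8Array, maxBinding: number): GPUBuffer[] {
  const shardSize = maxBinding & ~3; // keep every shard 4-byte aligned
  const shards: GPUBuffer[] = [];
  for (let offset = 0; offset < blob.byteLength; offset += shardSize) {
    const chunk = blob.subarray(offset, Math.min(offset + shardSize, blob.byteLength));
    const buffer = device.createBuffer({
      size: chunk.byteLength,
      usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
    });
    device.queue.writeBuffer(buffer, 0, chunk);
    shards.push(buffer);
  }
  return shards;
}
```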

Above adapter limits sits the layer that has no spec at all: **the per-tab VRAM budget**. The W3C spec language is candid: *"A WebGPU implementation may limit the available GPU memory to an application, in order to keep other applications responsive."*[^webgpu-spec] Issue gpuweb#5505, "Support for querying maximum/available GPU memory," is open[^gpuweb-5505]. A Dawn engineer summarized the absence: *"There isn't a way to measure the VRAM usage with the WebGPU API, for a variety of reasons ranging from non-portability, to fingerprinting, to not having a clear way to do this on some underlying APIs."*[^dawn-graphics]

Chrome 121 added experimental `requestAdapterInfo().memoryHeaps[]` behind `chrome://flags/#enable-webgpu-developer-features`[^chrome-121], but this is an adapter-info plumbing feature, not a portable application-side budget query. Production code cannot rely on it.

What is left is empirical observation. As of April 2026 the practical per-tab ceilings reported by community demos and benchmarks settle around: **Apple Silicon 3–4 GB even on 36–64 GB unified-memory machines**; **NVIDIA RTX 4090/5090 on Windows D3D12 about 4 GB single buffer**, with multi-buffer totals higher; **Snapdragon X Elite Adreno about 2 GB**. These are observed regimes, not policy. The exact heuristic that Chromium and Dawn use to set them does not appear to be documented in any public source we could verify; treat the numbers as where things break in practice today, not as commitments for tomorrow.

The shape of Wall 3 is therefore qualitatively different from Walls 1 and 2. Walls 1 and 2 were single, well-defined, specced numbers — 2 GB protobuf, 4 GB WASM — that yielded to engineering pushes. Wall 3 is a budget set by browser policy, varies by adapter and OS, and has no portable query. It will almost certainly improve, but it will not be removed by a single PR.

## The February 2026 inflection — transformers.js v4 + the new C++ WebGPU runtime

Two changes shipped tightly together in early 2026 and produced a step-function in what runs in the browser. Naming them gives the ground truth on which everything else rests.

### transformers.js v4

The preview shipped on 9 February 2026; v4.0.0 GA followed on 30 March 2026 — about a year of development by Joshua "Xenova" Lochner and Nico Martin starting in March 2025[^v4-blog][^v4-tag]. Install is `npm i @huggingface/transformers`, with `@next` for the preview.

The headline change is verbatim: *"The biggest change is undoubtedly the adoption of a new WebGPU Runtime, completely rewritten in C++. We've worked closely with the ONNX Runtime team to thoroughly test this runtime across our ~200 supported model architectures."*[^v4-blog]

The same code now runs WebGPU-accelerated in browsers, Node.js, Bun, and Deno (server-side WebGPU lives behind Dawn or wgpu; Deno PR #1546 added the Deno path). For practitioners the consequence is large: a single inference path covers four runtimes.

v4 leverages four ORT contrib operators with non-trivial impact: `com.microsoft.GroupQueryAttention`, `com.microsoft.MatMulNBits`, `com.microsoft.QMoE`, and `com.microsoft.MultiHeadAttention`. The blog calls out an outsized BERT-class win specifically: **MultiHeadAttention adoption gave roughly a 4× speedup for BERT-based embedding models.** The headline LLM benchmark is **GPT-OSS 20B at `q4f16`, ~60 tokens per second on an M4 Pro Max**[^v4-blog]. As of April 2026 this is the strongest published browser-side number for a 20 B-class model; it is also currently a single-vendor benchmark, with no comparable RTX 4090 / 5090 figure for the same model in the same runtime.

v4 also adds new architectures inaccessible in v3 — GPT-OSS (`QMoE`-based), Chatterbox, GraniteMoeHybrid, LFM2-MoE, HunYuanDenseV1, Apertus, Olmo3, FalconH1, Youtu-LLM. Three architecture families had to be added to the runtime to support these: **Mamba (state-space), Multi-head Latent Attention, and Mixture of Experts**. None of those is a pure inference detail; each requires kernel work.

The build and packaging changes are useful to know about because they reflect a real focus on perf-savvy adoption. Webpack was replaced by esbuild — build time **2 s → 200 ms (10×)**, bundle ~10 % smaller, and `transformers.web.js` 53 % smaller. The repo moved to a pnpm workspaces monorepo, and the 8000-line `models.js` was split into per-model files (PR #1498). A new `@huggingface/tokenizers` standalone package shipped at **8.8 kB gzipped, zero dependencies** — a useful primitive for downstream tooling[^v4-blog].

A new `ModelRegistry` API exposes `get_pipeline_files`, `get_file_metadata`, `is_pipeline_cached`, `clear_pipeline_cache`, `get_available_dtypes`, plus a `progress_total` event for cleaner download UX. New env knobs include `env.useWasmCache`, `env.fetch`, and `env.logLevel` (`DEBUG/INFO/WARNING/ERROR/NONE`); ORT WebGPU warnings are hidden by default. PR #1549 introduces an experimental Cross-Origin Storage cache backend[^v4-cos].
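For orientation, typical v4 usage looks roughly like this — a sketch using a small catalog model; the option names follow the v4 docs and release notes, and the model id is illustrative:

```ts
import { pipeline, env } from '@huggingface/transformers';

// Sketch of the common v4 path; exact env knobs per the release notes.
env.logLevel = 'WARNING';
const generate = await pipeline('text-generation', 'onnx-community/Qwen3-0.6B-ONNX', {
  device: 'webgpu',
  dtype: 'q4f16',                            // needs shader-f16; use 'q4' on Adreno-class mobile
  progress_callback: (p) => console.log(p),  // drive a download progress UI
});
const out = await generate('Explain the three walls in one sentence.', { max_new_tokens: 64 });
console.log(out);
```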

One caveat: on Apple M2 Pro under Chrome 145, **Qwen3.5-4B at `q4f16` is roughly 3× slower in decode and 20× slower in time-to-first-token versus Qwen3-4B**, likely due to a missing kernel for hybrid sliding-window/full attention[^v4-issue-1599]. v4 is not uniformly faster than v3 on every model on every platform. New architectures sometimes hit kernel gaps before they hit kernel optimizations.

### The new C++ WebGPU EP in ONNX Runtime Web

The architectural change underneath transformers.js v4 has been quietly underway in ORT for a year and a half. It deserves a precise account because it is what made Wall 2 collapse.

**The old path was JSEP** (JavaScript Execution Provider, ORT 1.17–1.21). Yulong Wang (`fs-eire`) introduced it in PR #14579 in February 2023[^pr-14579]. JSEP is hybrid: ORT's WASM module exposes hooks (`jsepInit`, `jsepAlloc`, `jsepFree`, `jsepCopy`, `jsepCopyAsync`, `jsepCreateKernel`, `jsepReleaseKernel`, `jsepRun`); the kernels themselves are TypeScript and WGSL on the JS side, communicating with C++ via Emscripten `EM_ASM_*` and Asyncify. Build flag `--use_jsep`; artifact `ort-wasm-simd-threaded.jsep.wasm/.mjs`. JSEP shipped in ORT 1.17 (February 2024), which Microsoft Open Source called *"the official launch of ONNX Runtime Web featuring WebGPU"*[^ms-feb24].

**The new path is the C++ WebGPU EP**, in ORT 1.22+. The migration PR is #23697, titled "[WIP] migrate WebGPU EP to WebAssembly to replace JSEP"[^pr-23697]. Build flag `--use_webgpu`; link `-s USE_WEBGPU=1`. Kernels are reimplemented in **C++** under `onnxruntime/core/providers/webgpu/`. The C++ code calls WebGPU directly from the WASM module via Emscripten's native WebGPU bindings, eliminating the cross-language dance.

The single most useful side effect: **the same EP runs natively on desktop**. Issue #25952 confirms `WebGpuExecutionProvider` is enumerable from `Ort::GetAvailableProviders()` in ORT 1.23.0 native; nightly `onnxruntime-webgpu` Python wheels are on PyPI (1.25.0.dev wheels from 12 February 2026; 1.26.0.dev from 13 April 2026)[^ort-25952][^pypi-ort-webgpu]. One EP, two surfaces.

The version timeline for practitioners trying to pin behavior:

| Version | Date | Notable |
|---|---|---|
| 1.17 | Feb 2024 | JSEP first ships; "official WebGPU launch" |
| 1.20 | Nov 2024 | On-demand weight loading offloads Wasm32 heap; enables 8 B LLMs |
| 1.22 | mid-2025 | First widely-available native WebGPU EP via `--use_webgpu`; macOS / Linux / Windows |
| 1.23 | mid-2025 | `WebGpuExecutionProvider` available; `onnxruntime-web` exposes both legacy JSEP and the C++ EP |
| 1.24.x | Q4 2025 / early 2026 | Currently published as `onnxruntime-web@1.24.3`; release notes call out Flash Attention optimizations, graph capture, Split-K MatMul, qMoE, WGSL templates |
| 1.24.1 | 6 Feb 2026 | Patch |
| 1.25.0 | ~Feb 2026 | C++20 required; ArmNN EP removed; CUDA Plugin EP. WebGPU: deterministic Split-K; binary-size reductions |
| 1.26.0.dev | 13 Apr 2026 | Current dev branch |

Side-by-side, the architectural change is large enough that it is worth summarizing as a single comparison:

| Aspect | JSEP (1.17–1.21) | C++ WebGPU EP (1.22+) |
|---|---|---|
| Kernel impl | TS / WGSL on the JS side | C++ in WASM, calls WebGPU directly |
| Build flag | `--use_jsep` | `--use_webgpu` |
| WASM artifact | `ort-wasm-simd-threaded.jsep.wasm` | unified `ort-wasm-simd-threaded.wasm` (`USE_WEBGPU=1`) |
| Cross-language barrier | `EM_ASM`, Asyncify, `jsepAlloc` / `jsepRun` | Native WebGPU bindings |
| Reusable on desktop? | No (browser-only) | Yes (Win / macOS / Linux + Android / iOS via 1.20+) |
| 4 GB limit | Hits it | Bypassed via direct WebGPU buffer loading |
| Operators | TS implementations | Contrib ops: `GroupQueryAttention`, `MatMulNBits`, `QMoE`, `MultiHeadAttention`, Flash Attention |

Three caveats deserve a place in any honest read.

The first is labeling. The WebGPU EP is **still labeled "experimental"** on the ORT build documentation as of April 2026[^onnx-build-web], a long-running discrepancy with marketing claims about the runtime elsewhere. This is not a real blocker for use, but it is a real signal about how the ORT team views surface stability.

The second is Safari / WebKit 26 in JSEP mode. Issue #26827 (filed November 2025) documents severe bugs: CPU at 400 %+ and memory growth past 14 GB after inference, affecting ORT-Web 1.20.0 through 1.23.2[^ort-26827]. If you need to support Safari today, the JSEP path is risky and the C++ EP path needs validation.

The third is binary footprint. The ORT WebGPU build still requires a WASM module — about 20 MB default, 8 MB optimized, 3 MB with `--minimal_build`[^ort-large]. The runtime and kernel code live in WASM; only the *weights* are off-heap. The 4 GB wall does not return as a binding constraint at consumer-scale models, but the WASM bundle itself is not zero.

## Hardware reality — what actually runs where

The empirical picture varies sharply by adapter, OS, and browser policy. What follows is what runs where, with the per-platform numbers that anchor each regime.

### Apple Silicon — Chrome on macOS (Metal backend)

Chrome 113 (May 2023) shipped WebGPU on by default; Chrome 94 beta had already enabled WebGPU on Metal in August 2021[^chrome-113]. Apple's WWDC25 talk "Unlock GPU computing with WebGPU" (Mike Wyrzykowski, WebKit) gave the architectural summary in one sentence: *"Most calls are one-to-one mapping with Metal framework calls."*[^wwdc25] `GPUBuffer` → `MTLBuffer`; `GPUTexture` → `MTLTexture`; `GPUBindGroup` → Metal argument buffers (creating a bind group allocates a new `MTLBuffer`); `GPURenderPipeline` → `MTLRenderPipelineState`; `GPUCommandQueue` → `MTLCommandQueue`. Apple GPUs are tile-based deferred renderers, which matters for graphics workloads more than for LLMs.

Unified memory is Apple's structural advantage and a source of confusion. Metal's `maxBufferLength` is typically about 25–75 % of unified memory (~9 GB on a 16 GB M2; ~24 GB on a 36 GB M3 Max), but Dawn surfaces a much more conservative WebGPU limit. The historical `MTLBuffer` cap on early macOS was 256 MB[^mtl-256]. Reported `maxBufferSize` on M-series GPUs via webgpureport.org: **2 GB to 4 GB** depending on chip and unified memory; `maxStorageBufferBindingSize` typically 1–2 GB[^webgpu-report].

The practical per-tab ceiling on Apple is **about 3–4 GB**. A representative user report on a 36 GB M3 Max: *"I get an AI error… A context size of 32768 with 4 sequences is too large for the available VRAM."* That phrasing is almost a caricature of Wall 3 — a machine with 36 GB of unified memory, but the browser tab gets a sliver of it.

In-Chrome tok/s benchmarks, all from independent sources, give the empirical floor:

- WebLLM, M3 Max, Llama 3.1 8B Q4 → **~41 tok/s decode**, about 71 % of native MLC-LLM[^arxiv-webllm]
- WebLLM, M3 Max, Phi 3.5 mini → **~71 tok/s**[^arxiv-webllm]
- WebLLM paper: *"~90 tokens/s on an Apple M3 laptop"* for a 4-bit-quantized 3 B model[^arxiv-webllm]
- HF transformers.js v4, M4 Pro Max, GPT-OSS 20B `q4f16` → **~60 tok/s**[^v4-blog]
- WeInfer (ACM Web Conf 2025): Qwen2-0.5B got only a 1.12× WebLLM boost on M2 versus 2.01× for SmolLM-135M, *"due to the unique characteristics of Metal APIs or Apple's GPU hardware."*[^acm-weinfer]

The headline reading is consistent: Apple high-end runs 8 B models comfortably, 20 B models at acceptable speed if quantized aggressively, and small models with kernel-dependent variance.

### NVIDIA discrete (Chrome on Windows / Linux)

Backend is Dawn → D3D12 on Windows, Vulkan on Linux. Linux Chrome WebGPU shipped for Intel Gen12+ in Chrome 144 Beta and for NVIDIA driver 535.183.01+ on Wayland in Chrome 147[^webgpu-impl-status]. Chrome 121 also switched D3D12 from FXC to **DXC** for SM6+, reporting *"a 20 % average increase in compute shader compilation speed"*[^chrome-121].

Typical reported limits on a 24 GB RTX 4090 via webgpureport.org:

- `maxBufferSize` 4,294,967,296 (4 GiB)
- `maxStorageBufferBindingSize` 2,147,483,644 (~2 GiB; the D3D12 raw-buffer SRV cap)
- `maxComputeWorkgroupStorageSize` 32,768 (32 KiB)
- `maxComputeInvocationsPerWorkgroup` 1024

`chrome://flags/#enable-unsafe-webgpu` removes blocklist guardrails; `#enable-webgpu-developer-features` exposes unquantized GPU timestamps and `powerPreference` adapter info as of Chrome 137[^chrome-137]. There is one subtle constraint: **Chrome does not support using multiple GPU adapters simultaneously**[^chrome-troubleshoot]. A multi-GPU machine is a single-GPU machine from a tab's perspective.

A 2026 arXiv preprint on RTX 5090 (32 GB, Ubuntu 24.04, Dawn + Chrome) reports that *"on Vulkan, kernel fusion reduces dispatches from 876 to 564, improving throughput by 53 %"* — and that WebLLM reaches roughly 80 % of native MLC-LLM performance in the best case[^arxiv-rtx5090]. ORT 1.17 had already reported *Stable Diffusion Turbo end-to-end < 1 s* on RTX 4090 in the browser[^ms-feb24]. ORT WebGPU on Phi-3-mini, RTX 4090: **>70 tok/s**[^ms-phi3]. A background-removal model: **20× over multi-threaded CPU and 550× over single-threaded CPU** on M3 Max[^bgremoval]. Segment Anything: **19× encoder, 3.8× decoder** versus WASM on RTX 3060 + i9[^ms-feb24].

The pattern is the same as Apple: WebGPU narrows the gap to native dramatically without closing it.

### AMD discrete

Backend is Dawn D3D12 on Windows; Vulkan on Linux (rolling per Chrome 144). Limits are similar to NVIDIA on D3D12: `maxBufferSize` 4 GB and `maxStorageBufferBindingSize` ~2 GB on RX 7900 XTX. The subgroup-matrix path (used heavily by Intel; see below) is currently Intel-only on Vulkan; AMD support is tracked as future work in Dawn. AMD on WebGPU sees less independent benchmarking than Apple or NVIDIA, and the public picture is correspondingly thinner.

### Intel integrated — the XMX path

Intel deserves a sidebar because it is the most architecturally distinct route to LLM acceleration on integrated graphics in early 2026. The primary source is the Intel Web Platform Team's "Boost AI Inference Performance with WebGPU on Intel Platforms"[^intel-xmx]. Two quotes give the shape:

> "Intel® Xe Matrix Extensions (Intel® XMX) is a dedicated hardware engine on Intel® Arc™ GPUs for Artificial Intelligence workloads … accelerating matrix multiplication with specialized Dot Product Accumulate Systolic (DPAS) instructions in 2D systolic arrays."[^intel-xmx]

> "As Microsoft has not yet released D3D12 related API specifications, Intel has leveraged the Vulkan API to accelerate LLMs on WebGPU EP of ONNX Runtime with the Intel XMX engine."[^intel-xmx]

The path is **Dawn → Vulkan → XMX** via the WGSL extension `chromium_experimental_subgroup_matrix`. Build flags: `onnxruntime_ENABLE_DAWN_BACKEND_VULKAN=ON`, `onnxruntime_ENABLE_DAWN_BACKEND_D3D12=OFF`. The kernel that benefits most directly is ORT's `MatMulNBits`. Typical config on Lunar Lake: `T = f16, M = 8, K = 16, N = 16`, with best tile `m = 64, n = 64`. WGSL types added to support this: `subgroup_matrix_left/right/result<T,K,M>`; functions `subgroupMatrixLoad`, `subgroupMatrixStore`, `subgroupMatrixMultiplyAccumulate`. Direct register loads bypass workgroup shared memory.

The Windows wrinkle — and it is a real one: **"Dawn SubgroupMatrix is not currently available on D3D12. This prevents wide adoption of Intel XMX on Microsoft Windows."** Until that path opens, Intel XMX acceleration in WebGPU is effectively a Linux-Vulkan story.

For grounding, gpuweb 2024-09 F2F notes from an Intel rep: *"Got about 3× perf improvement by using Dawn with subgroups under ONNX … slightly worse than native vulkan. On an M1 without dedicated tensor units, ~80 % perf."*[^gpuweb-2024-09] Hardware tested in the Intel article: Intel Core Ultra 7 258V (Lunar Lake) plus Arc 140V, driver 32.0.101.6913, Phi-3.5 ONNX (`web-accuracy4-gqa`), ORT commit 440ac68a. Models with verified XMX benefit: Phi-3.5-mini-instruct-ONNX-GQA, Qwen3-0.6B-ONNX, DeepSeek-R1-Distill-Qwen-1.5B-ONNX. The roadmap visible in the article: switch FP16 → INT8 XMX, plus inline 4-bit dequantization with per-block scales. The author byline and publication date on the Intel article are not machine-readable; the Intel Web Platform Team attribution should be treated as inferred.

### Mobile (Android Chrome)

WebGPU was enabled by default in Chrome 121 on Android on 17 January 2024 (general-availability rollout 23 January per Phoronix)[^chrome-121][^phoronix-121]. Brandon Jones (Google) at Vulkanised 2024, in "Shipping WebGPU on Android," summarized the porting effort succinctly: *"Mostly worked first try!"* — with caveats around resource sharing and one vendor's compute-then-sample texture-corruption bug[^vulk-2024].

Snapdragon X Elite (laptop, Adreno-8xx), Chrome 131: 2 GB grantable `maxBufferSize`[^stable-diff-issue]. Adreno 750 (Snapdragon 8 Gen 3 flagship class): peak ALU about 5.7 TFLOPS FP32; arXiv 2410.03613 measures roughly 42.9 GB/s actual bandwidth[^arxiv-mobile]. The mobile LLM ceiling generally lands at **3–7 B Q4 for 12 GB+ RAM phones**, with a hard caveat below.

Chrome 133 standardized **`featureLevel: "compatibility"`**, which lets WebGPU run on OpenGL ES 3.1 phones with a subset of features[^chrome-133]. This broadens reach beyond Vulkan-1.1-capable phones. Chrome 144 announced the first alpha of **`androidx.webgpu` Kotlin bindings**[^chrome-144], which signals where the Android WebGPU story is heading on the native side.

The mobile-specific quantization caveat is the biggest gotcha. **Most Qualcomm Adreno mobile GPUs cannot expose 16-bit values in uniforms or storage**, which means `shader-f16` cannot be enabled — and `q4f16` and `fp16` weights will not load[^gpuweb-5006]. On Android-flagship targets, plan around `q4` (4-bit weights with fp32 activations) and `q8`. This is the difference between "model works" and "model does not load," not a perf detail.
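A defensive loader therefore probes the adapter before choosing a dtype — a sketch of the check, with the fallback choice (`q4`) per the guidance above:

```ts
// Sketch: pick a weight dtype based on whether the adapter can actually take fp16
// in storage buffers. 'q4f16' vs 'q4' follows the guidance above; dtype names per transformers.js.
async function pickDtype(): Promise<'q4f16' | 'q4'> {
  const adapter = await navigator.gpu?.requestAdapter();
  if (!adapter) throw new Error('WebGPU unavailable — fall back to WASM');
  return adapter.features.has('shader-f16') ? 'q4f16' : 'q4';
}
```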

### Per-platform `requiredLimits` cheat sheet

| Platform / Hardware | `maxBufferSize` (grantable) | `maxStorageBufferBindingSize` | Practical VRAM ceiling per tab | Backend |
|---|---|---|---|---|
| Spec floor | 256 MiB | 128 MiB | n/a | any |
| Apple M1 (8 GB) | ~2 GB | ~1 GB | ~3 GB | Metal |
| Apple M3 Max (36–64 GB) | 2–4 GB | 1–2 GB | ~3–4 GB per tab | Metal |
| RTX 4090 24 GB / RTX 5090 32 GB | 4 GB (D3D12 cap) | ~2 GB | ~4 GB single buffer | D3D12 |
| Radeon (Win) | 4 GB | ~2 GB | similar to NVIDIA | D3D12 |
| Intel Arc 140V (LNL) | ~2 GB | ~1 GB | ~3 GB | Vulkan (Linux), D3D12 (Win) |
| Snapdragon X Elite Adreno-8xx | 2 GB | ~1 GB | ~2 GB | Vulkan |
| Adreno 7xx / 8xx phone | 256–512 MB | 128 MiB | ~1–2 GB | Vulkan |

These are **observed grantable limits**, not policy guarantees. The 6–8 GB / 4–6 GB ranges that circulate in some community writeups are empirical and we could not locate authoritative Chromium / Dawn source-code references nailing the exact heuristic. Treat the table as where things break in April 2026.

## What actually fits, today

### Quantization formats

Source: the transformers.js dtype guide[^v4-dtypes]. The model card determines which dtypes are exported, but the common ones are:

| dtype | Meaning | WebGPU? | Notes |
|---|---|---|---|
| `fp32` | full precision | default for WebGPU | universally compatible; largest |
| `fp16` | half precision | requires `shader-f16` | ~25 % ALU + up to 50 % memory-bound speedup |
| `bf16` | brain-float | rare in HF ONNX repos | mostly training-side |
| `q8` / `int8` / `uint8` | 8-bit weights | WASM default; works on WebGPU | INT8 GPU op coverage gappy |
| `q4` | 4-bit weights, fp32 activations | broad | recommended for low-end |
| `q4f16` | 4-bit weights + fp16 activations | best perf where `shader-f16` is supported | preferred for ≥1 B on WebGPU |
| `bnb4` | bitsandbytes-style 4-bit | where exported | only some models |

`shader-f16` shipped in Chrome 120 (December 2023)[^chrome-120]. Chrome 120's own perf numbers on Llama-2 7B M1 Pro: **prefill +28 %, decode +41 %**[^chrome-120]. Apple Silicon, NVIDIA Pascal+, recent Intel (Arc, 11th-gen+), and AMD RX 6000+ all expose it. **Most Qualcomm Adreno mobile GPUs do not** — `q4f16` and `fp16` will simply not load on those targets[^gpuweb-5006].

There is one known active model bug worth flagging: **Gemma-3 `fp16`/`q4f16` is broken on WebGPU**, outputting `<unused56>` repetitions. The workaround is `fp32` or `q4`[^ort-26732]. This is a current ORT issue (#26732), not a historical artifact.

### Practical hardware tiers

Three tiers describe the April 2026 reality. The HF v4 blog sets the broad-compatibility floor directly: stick to models under 2 B parameters for broad device compatibility[^v4-blog].

- **≤ 2 B for broad compatibility** — integrated GPUs, 4 GB VRAM machines, mobile with `shader-f16`-capable adapters.
- **≤ 7–8 B with v4 + WebGPU + 8 GB VRAM** — Qwen3-8B-ONNX, Granite-3.3-8B, SmolLM3-3B, LFM2-2.6B, Phi-3.5-mini at `q4f16`. NVIDIA's Phi-3.5-mini int4 card states: *"6 GB or higher VRAM GPUs are recommended."*[^nv-phi35] Note that "8 GB VRAM is comfortable" overstates the reality on long contexts: empirically, 3 B-class LLMs with extended KV cache push 8 GB cards into OOM territory.
- **≤ 20 B on high-end (24 GB+ unified memory)** — GPT-OSS 20B, LFM2-24B-A2B, Hermes-4-14B. HF has tested on M4 Pro Max at the headline ~60 tok/s.

A useful corner case: Liquid AI's own LFM2-8B-A1B-ONNX (Mixture-of-Experts, 8.3 B total / 1.5 B active) card carries the explicit warning *"This model is too large for WebGPU browser inference"* — Node, Deno, or Bun WebGPU only[^liquid-lfm2-card]. The WebGPU HF Space showcase calls it *"an 8.3-billion-parameter language model running entirely in your browser tab"*[^liquid-space], which is true *only* on very-high-end GPUs. The vendor's own card is the more cautious read.

### A representative catalog

The full model surface is larger than what fits here; below is a trimmed slice usable as a working set. All hosted at `https://huggingface.co/<org>/<repo>`.

| Model | Repo | Params | q4f16 disk | Tier |
|---|---|---|---|---|
| Qwen3-0.6B | onnx-community/Qwen3-0.6B-ONNX | 0.6 B | ~0.45 GB | low |
| Qwen3-1.7B | onnx-community/Qwen3-1.7B-ONNX | 1.7 B | ~1.1 GB | low-mid |
| Qwen3-4B | onnx-community/Qwen3-4B-ONNX | 4 B | ~2.4 GB | mid |
| Qwen3-8B | onnx-community/Qwen3-8B-ONNX | 8 B | ~4.6 GB | mid-high (v4) |
| Llama 3.2 1B | onnx-community/Llama-3.2-1B-Instruct-ONNX | 1 B | ~1.24 GB | low (Meta-gated) |
| Llama 3.2 3B | onnx-community/Llama-3.2-3B-Instruct-ONNX | 3 B | ~2 GB | mid |
| Phi-3.5-mini | onnx-community/Phi-3.5-mini-instruct-onnx-web | 3.8 B | ~2.2 GB | mid (int4 web RTN) |
| gemma-3-1b-it | onnx-community/gemma-3-1b-it-ONNX | 1 B | — | low; **avoid `fp16`/`q4f16`** |
| SmolLM2-360M | HuggingFaceTB/SmolLM2-360M-Instruct | 360 M | ~250 MB | very low |
| SmolLM3-3B | HuggingFaceTB/SmolLM3-3B-ONNX | 3 B | — | mid (dual reasoning) |
| LFM2-1.2B | onnx-community/LFM2-1.2B-ONNX | 1.2 B | — | low (hybrid conv + GQA) |
| LFM2-2.6B | onnx-community/LFM2-2.6B-ONNX | 2.6 B | — | mid |
| LFM2-8B-A1B (MoE) | LiquidAI/LFM2-8B-A1B-ONNX | 8.3 B / 1.5 B active | — | **Node/Deno/Bun only** |
| GPT-OSS 20B | onnx-community/gpt-oss-20b-ONNX | 21 B / 3.6 B active | — | high (v4 only; QMoE) |
| Granite 4.0 micro | onnx-community/granite-4.0-micro-ONNX | — | — | low (v4) |
| Granite 3.3 8B | onnx-community/Granite-3.3-8B-Instruct-Onnx | 8 B | — | mid-high (needs ORT-GenAI) |
| DeepSeek-R1-Distill-Qwen-1.5B | onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX | 1.5 B | — | low-mid |
| Ministral-3-3B | mistralai/Ministral-3-3B-Instruct-2512-ONNX | 3 B | — | mid (vision/text) |
| Hermes-4-14B | NousResearch/Hermes-4-14B-Onnx | 14 B | — | high (v4) |

### Announced but not actually shipping (yet)

A practitioner-honest section. As of 25 April 2026, the following are referenced or implied as runnable but **do not yet have working browser-targeted ONNX builds**:

- GPT-OSS-120B (browser; only desktop ORT-GenAI builds exist)
- Phi-4 14B (no transformers.js WebGPU build; `microsoft/Phi-4-mini-instruct-onnx` requires ORT-GenAI)
- Llama 3.3 70B / 8B (no `onnx-community` entry)
- Gemma-3 4B / 12B / 27B
- Qwen3-VL-MoE / Qwen3 Next (referenced in v4 release notes; ONNX repos still being uploaded)
- Apertus / FalconH1 / HunYuanDenseV1 / Youtu-LLM (v4-supported but limited public ONNX checkpoints; usually requires self-conversion)
- DeepSeek-R1-Distill-Qwen-7B (clean transformers.js port)

These are fast-moving — re-check at any deployment decision; this list is a snapshot of an active uploading window.

## Caching, cold start, and the download problem

A 4 GB `q4f16` weight blob over a household connection is not an inference problem; it is a UX problem. Two storage layers and two cold-start details determine whether the second visit is fast.

**Cache API quotas** are larger than commonly believed. Chromium grants **60 % per origin and 80 % browser-wide**; Chromium Incognito is around 5 % of disk (~100 MB). Firefox is best-effort: `min(10 % disk, 10 GiB eTLD+1)`. Safari macOS 14+ / iOS 17+ is roughly 20 % per origin in browser, ~60 % if installed as a web app[^web-dev-storage][^mdn-storage-quota]. (Earlier writeups citing "~6 % of disk" for Cache API are outdated; the current Chromium quota docs are explicit at 60 % per origin.)
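Before committing to a multi-GB download it is worth asking the browser what it will actually grant — a small sketch using the Storage API (the threshold number is illustrative):

```ts
// Sketch: check remaining quota and request persistence before pulling ~4.5 GB of weights.
const { usage = 0, quota = 0 } = await navigator.storage.estimate();
const neededBytes = 4.5 * 1024 ** 3;                 // illustrative q4f16 weight blob
if (quota - usage < neededBytes) {
  console.warn(`Only ${((quota - usage) / 1e9).toFixed(1)} GB of quota left — warn before downloading`);
}
const persisted = await navigator.storage.persist(); // ask not to evict cached weights under pressure
console.log(`persistent storage granted: ${persisted}`);
```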

**Opaque-response inflation** is a real footgun: a few-KB opaque (no-CORS) response counts as roughly 7 MB against the quota[^chrome-workbox]. Anything cached cross-origin without explicit `Access-Control-*` headers eats budget disproportionately.

**OPFS — the Origin Private File System** — is the right primitive for multi-GB weight files in 2026. Chrome has shipped it since 102 (2022); Safari since macOS 12.2 / iOS 15.2; Firefox via the November 2022 intent-to-ship[^web-dev-opfs][^mdn-opfs]. The crucial OPFS API is `FileSystemSyncAccessHandle.write()` — **synchronous, in-Worker, byte-range writes**. RxDB benchmarks measured roughly 2–4× faster cold reads than IndexedDB; OPFS also avoids the opaque-response inflation issue because it is direct file I/O, not HTTP-cached responses. **For multi-GB model weights, OPFS is the right answer; Cache API is not.**
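A sketch of that write path, assuming it runs inside a dedicated Worker (sync access handles are worker-only); the file name and chunk handling are illustrative:

```ts
// Sketch (worker-only): stream a weight file into OPFS with synchronous byte-range writes.
async function cacheWeightsInOPFS(url: string, fileName: string): Promise<void> {
  const root = await navigator.storage.getDirectory();   // OPFS root for this origin
  const file = await root.getFileHandle(fileName, { create: true });
  const access = await file.createSyncAccessHandle();    // sync, byte-addressed handle
  const reader = (await fetch(url)).body!.getReader();
  let offset = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    access.write(value, { at: offset });                 // write this chunk at its byte offset
    offset += value.byteLength;
  }
  access.flush();
  access.close();
}
```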

The cold-start story has a second component beyond the network: **shader compilation**. SitePoint summarized the user-visible behavior accurately: *"Watch for first-run shader compilation stalls … WebGPU pipelines may compile on initial use, causing a brief pause. Subsequent runs benefit from pipeline caching in the browser."*[^sitepoint] An LLM with O(20–40) distinct WGSL kernels can stall for several seconds on the first run while Tint compiles and caches.

ORT exposes a real mitigation: **`enableGraphCapture: true`** captures WebGPU command sequences on the first run and replays them, bypassing the per-call dispatch overhead. The session option requires static shapes and all kernels on the WebGPU EP[^ort-env]. Used correctly it eliminates the second-run "warm-up tax" that otherwise persists across the first prefill.
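The option sits on the ORT-Web session options — a sketch, with the model URL illustrative and the flag names per the ORT-Web env/session-options docs:

```ts
import * as ort from 'onnxruntime-web/webgpu'; // entry point varies by ORT-Web version

// Sketch: opt into graph capture; valid only when shapes are static and every
// kernel runs on the WebGPU EP, per the ORT-Web docs.
const session = await ort.InferenceSession.create('decoder_model_merged.onnx', {
  executionProviders: ['webgpu'],
  enableGraphCapture: true,              // record WebGPU command sequences once, replay afterwards
  preferredOutputLocation: 'gpu-buffer', // keep outputs on-GPU between decode steps
});
```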

Chrome's pipeline cache infrastructure has been improving — Chrome 134 release notes mention better caching of pipelines with automatically generated layouts. Whether that cache reliably survives across tab close and browser restart is implementation-defined; the safest mental model is "warm runs benefit from in-process caching," not "cold start is free after the first run."

## Roadmap — the next 12-24 months

Roadmap content is high-variance by definition; we restrict ourselves to four themes with credible signals.

### Memory64 cross-browser parity

Safari has not shipped Memory64. Safari 26.2 (December 2025) added resizable Wasm Memory and JS String Builtins, but not 64-bit memory[^safari-26-2]. WebKit's cadence makes any specific timing speculative — there is no public commitment we could find. The 16 GB browser cap is also a hard ceiling on Chrome and Firefox; engines are exploring bounds-check elimination via value-range analysis to reduce the 10–100 % perf penalty[^v8-bce-cl][^spider-mem64]. Memory64 will mostly affect non-LLM workloads (large compiled C++ apps, scientific computing) before it affects LLMs — direct GPU loading already captures the LLM use case.

### WebGPU spec evolution

There is **no official "WebGPU 2.0"** at W3C. The closest formal grouping is `core-features-and-limits` (Intent to Ship 2025). The shipped feature drumbeat is the more honest signal:

- **`shader-f16`** — Chrome 120 (December 2023)[^chrome-120].
- **Subgroups** — shipped Chrome 134 (February 2025) after origin trial 128–131. Google Meet measured 2.3–2.9× speedup vs integer dot products for matrix-vector multiply[^chrome-134].
- **`subgroup_id` / `num_subgroups`** — Intent to Ship October 2025; shipping Chrome 144[^chrome-144].
- **`subgroups-f16`** — deprecated as of Chrome 133; use `shader-f16` + `subgroups`[^chrome-133].
- **64-bit atomics** — proposal stage (gpuweb#5071), not shipped[^gpuweb-5071]. The note in-thread is honest: *"64 bit atomic operations are impossible (?) to emulate but are desired by some applications like Nanite."*
- **Cooperative-matrix / cooperative-matrix-2** — discussed at the 2024-09 F2F; not shipped[^gpuweb-2024-09].
- **Bindless** — proposal stage (`gpuweb/proposals/bindless.md`)[^gpuweb-bindless]. Quote: *"With current WebGPU, limited set of resources for each shader invocation … 16 is not enough."* This proposal is the most LLM-relevant of the list because it directly addresses the multi-buffer sharding cost imposed by the 2 GB binding cap.
- **Ray tracing** — no official spec; only third-party WebRTX compute-emulation. Not relevant to LLM workloads.

The pattern: per-feature shipping, not a versioned step. The reader should not look for a "WebGPU 2026" announcement; they should track Chrome blog posts.

### WebNN and the NPU question

The W3C published an updated WebNN Candidate Recommendation snapshot on 22 January 2026, with feedback open through 22 March 2026[^webnn-cr][^webnn-news]. Chrome and Edge expose WebNN behind `--enable-features=WebMachineLearningNeuralNetwork`; Microsoft Learn (April 2026) explicitly states *"GPU and NPU support remain in preview"* — not for production[^webnn-io].

The backend stack is straightforward: Windows → ONNX Runtime (Win11 24H2+, `kWebNNOnnxRuntime` flag) → DirectML → TFLite; Apple → Core ML → TFLite. Firefox and Safari have no implementation. NPU mapping: DirectML (Intel AI Boost, Qualcomm Hexagon, AMD XDNA) on Windows; Core ML (Neural Engine) on Apple; QNN for Snapdragon.

ORT itself ships a WebNN EP with `deviceType: 'cpu' | 'gpu' | 'npu'` and uses `MLTensor` for IO binding to avoid CPU↔NPU copy. ORT 1.25 added GQA local attention, GatherBlockQuantized, ConvInteger, and MatMulInteger to the WebNN EP[^ort-webnn].

The marketing-versus-reality flag: *"A model that runs at 10× speed on WebGPU might run at 50× on a dedicated NPU via WebNN"* is vendor-promotional and should be treated as speculation. The actual NPU win on existing 40 TOPS Copilot+ baseline silicon, in a browser tab, in production, has not been published in a form we could cite.

### When 30B+ becomes browser-realistic (speculation, flagged)

The conditions for comfortable 30 B Q4 in-browser:

1. Safari Memory64 ships.
2. The WebGPU / Wasm 16 GB JS-API cap is raised — at present a 30 B Q4 model is roughly 15 GB of weights plus KV cache, putting it over the cap.
3. Bindless or larger `maxStorageBufferBindingSize` to avoid sharding monolithic weights.
4. Consumer 32 GB+ VRAM common (RTX 5090 32 GB GDDR7 is shipping; M-series Max up to 128 GB unified — but the browser-side 16 GB cap binds before VRAM does).
5. NPU pathway via WebNN matures to bypass GPU memory limits via DirectML / Core ML allocators.

Each is plausible on a 2026–2028 timeline; none is guaranteed. **Realistic projection: late 2027 to 2028 for "comfortable" 30 B Q4 in-browser.** This is speculation grounded in shipped specs and vendor cadence, not a commitment. Frame conservatively.

### Hardware trends (briefly)

Useful anchors: the Copilot+ baseline is **40 TOPS NPU**. Qualcomm Snapdragon X2 Elite/Plus targets **80 TOPS NPU**[^pcworld]. Intel Lunar Lake is 48 TOPS NPU; Intel Panther Lake (Core Ultra Series 3, mid-2026) targets 50 TOPS NPU and 180 platform TOPS versus Lunar Lake's 120. AMD Ryzen AI 400 (XDNA 2+): 60 TOPS; 300 series: 50 TOPS. Apple Neural Engine on M4 / M5 sits in the 38–48 TOPS range. **NPU-mandatory framing has cooled** — Microsoft's CES 2026 messaging around Windows AI Foundry (as covered by PCWorld) explicitly routes between GPU / CPU / NPU[^pcworld]. The RTX 5090 (January 2025) at 32 GB GDDR7, 1.79 TB/s, 21,760 CUDA cores, 575 W, and FP4 hardware is the desktop reference for the next 18 months.

## Alternatives — when WebGPU+ORT isn't the right tool

WebGPU + ORT is one stack among several. A practitioner-grade comparison helps decide when it is right and when it is not.

### WebLLM / MLC AI

WebLLM compiles WGSL kernels via TVM (vs ORT's general-purpose ops) and auto-tunes per model. The npm package `@mlc-ai/web-llm` is at 0.2.82 per the docs[^webllm-docs]. Supported models include Llama, Phi, Gemma, RedPajama, Mistral, Qwen, plus custom MLC-format models. The API is OpenAI-compatible — streaming, JSON-mode, function-calling (work-in-progress), with Web Worker / Service Worker / Chrome Extension support out of the box. The reference paper (Ruan et al., arXiv:2412.15803, updated 2026) reports on M3 Max benchmarks: Llama 3.1 8B Q4 ~41 tok/s, Phi-3.5 mini ~71 tok/s, and roughly 80 % of native MLC-LLM on NVIDIA in the best case[^arxiv-webllm].
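A sketch of that surface, with the model id illustrative (it must match an entry in MLC's prebuilt model list):

```ts
import { CreateMLCEngine } from '@mlc-ai/web-llm';

// Sketch of the OpenAI-compatible WebLLM surface; the model id must exist in the
// MLC prebuilt-model config and is illustrative here.
const engine = await CreateMLCEngine('Llama-3.1-8B-Instruct-q4f16_1-MLC', {
  initProgressCallback: (report) => console.log(report.text), // weights download + compile progress
});
const reply = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Summarize the three walls in one sentence each.' }],
});
console.log(reply.choices[0].message.content);
```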

When to pick it: chat-style LLMs, OpenAI-compatible API needed, willing to live within MLC-format model availability.

### Ollama / llama.cpp (native baseline)

Pure C/C++ with CUDA, Metal, ROCm, SYCL, and Vulkan backends. GGUF format. Quantizations from 1.58-bit to 8-bit. Ollama wraps llama.cpp in a Go server; v0.19 (March 2026) reportedly auto-uses MLX backend on Apple Silicon. Practitioner numbers worth carrying: llama.cpp on RTX 4060 + Qwen3-14B Q4_K_M reaches 32 K context; M4 Max at ~15–18 tok/s for 13 B models (roughly RTX 4070 territory).

The structural reason native is faster than browser is unchanged: no sandbox, no Wasm bounds checks, no 16 GB JS-API cap, no buffer-size caps, direct CUDA / Metal, GGUF mmap. The headline "70B on a workstation" demos exist because of that stack, not despite it.

When to pick it: max performance, server / desktop targets, no sandbox required, GGUF acceptable.

### Chrome Prompt API / Gemini Nano

Chrome built-in inference of Gemini Nano via the Prompt API. The Prompt API for Chrome Extensions is **stable in Chrome 138** (announced Google I/O 2025). The Prompt API for web pages is in origin trial Chrome 139–144, ending **24 March 2026** — re-check at deployment time, since OTs are routinely extended or graduated[^chrome-ai-builtin][^chrome-ai-io25]. Multimodal (audio + image) is included.

Sibling APIs in the family: Summarizer / Language Detector / Translator are stable since Chrome 138; Writer / Rewriter are in origin trial; Proofreader is in early-preview program. Multilingual support since Chrome 140 (English / Spanish / Japanese in/out). Default ~6 K token context.

Hardware requirements: Win10/11, macOS 13+, Linux, ChromeOS Plus. ≥ 22 GB free storage. > 4 GB VRAM **or** 16 GB RAM + 4 cores. **Audio input requires GPU.** Mobile not supported. Model parameters: `defaultTopK 3, maxTopK 128, defaultTemperature 1, maxTemperature 2`.

When to pick it: lightweight Chrome-only feature with no model selection, willing to ship inside Chrome's built-in Gemini Nano. Not when you need a specific model, cross-browser support, or anything beyond the fixed Nano context window.

### Decision matrix

| Attribute | ORT-Web (WebGPU EP) | transformers.js v4 | WebLLM / MLC | Chrome Prompt API | Ollama / llama.cpp |
|---|---|---|---|---|---|
| Cross-browser | Yes (Chrome / FF / Safari) | Yes | Yes | Chrome only | n/a (native) |
| Model format | ONNX | ONNX (HF Hub exports) | MLC (TVM-compiled) | Built-in (Gemini Nano) | GGUF |
| Browser-realistic max (Apr 2026) | ~8 B FP16/Q4 | 20 B Q4 (M4 Max) | 7–8 B class | Fixed (Gemini Nano) | Hardware-limited (70B+ on 32 GB) |
| NPU support | Yes (via WebNN EP) | Indirect (via ORT) | No (WebGPU only) | Yes (Core ML / DML) | Limited |
| API style | Low-level `InferenceSession` | High-level pipelines | OpenAI-compatible | Promise / streaming | OpenAI-compatible REST |
| Best for | Production custom ONNX | Rapid prototyping, HF models | Chat-style LLMs | Lightweight Chrome features | Max perf, server / desktop |
| Setup overhead | Medium | Low (`npm i`) | Low (`npm i`) | Trivial | Low (Ollama) / Medium (llama.cpp) |

Pick the column whose constraints best match the deployment, not the column with the highest published tok/s on someone else's hardware.

## Conclusion

The dominant constraint changed in early 2026. The runtime architecture problem — the 4 GB WASM heap — was not solved by raising the WASM ceiling. It was solved by routing around it: weights into `GPUBuffer`s, not into the heap. Memory64 turned out to be a sideshow for ORT, with the build flag deleted in June 2025; the C++ EP path won.

The practical numbers that anchor April 2026: **20 B is real on Apple's high-end** (the GPT-OSS 60 tok/s figure on M4 Pro Max); **~8 B is the realistic mass-market browser ceiling** with v4 + WebGPU + 8 GB VRAM; **≤ 2 B remains the rule for broad device compatibility**, especially on `shader-f16`-incapable Qualcomm Adreno mobile.

The new ceiling is the per-tab WebGPU VRAM budget. It is not specced. It varies by adapter and by browser policy. There is no portable query and no public commitment to add one. The next year's binding question is whether NPU paths via WebNN, DirectML, and Core ML — or further WebGPU work via bindless and a raised JS-API memory cap — opens the 30 B class to broad in-browser use. Both are on credible roadmaps. Neither is shipping at scale.

A walls metaphor implies a strict hierarchy of obstacles, each removed in turn. The reality has been more like a sequence of binding constraints under a moving frontier, where the runtime architecture caught up with the model architecture and the next bottleneck quietly moved from software to hardware allocation. The 4 GB wall is gone. The VRAM budget is the wall now.

## What this means for Daneel

Daneel ships exactly this stack — `@huggingface/transformers` for inference, the WebGPU EP under it, ONNX models with external-data sidecars — so the three walls map directly onto choices visible in the codebase.

The **model catalog tiering** in `modelSelector.ts` is exactly the broad / 8 B / 20 B split this paper describes: small-class models (LFM2, SmolLM, Granite micro) for broad-compatibility targets; mid-class (Qwen3-4B, Phi-3.5-mini) for 8 GB-VRAM machines; high-tier reserved for unified-memory Apple machines. Wall 3 is the reason the catalog is tiered at all, not flat.

The **dtype matrix in `WebGPUProvider.stream()`** — `q4f16` default with `fp32` fallback — is the practical answer to the `shader-f16` Qualcomm caveat. The provider does not gamble on `shader-f16` being present; it has a degradation path that does not break inference on adapters that lack 16-bit storage values.

The **empirical observation** in the project's memory store that "8 GB VRAM is borderline for 3 B LLMs (OOM on long contexts)" is Wall 3 in action. The number on the box (8 GB VRAM) is not the number the tab actually gets, and the discrepancy is not a bug — it is the per-tab VRAM budget surfacing as OOM under long-context KV-cache pressure.

The choice to use **OPFS via transformers.js for model weight caching**, rather than Cache API, is the same answer the broader practitioner community has converged on, for the same reasons: no opaque-response inflation, byte-range I/O, near-IndexedDB-for-cold-reads performance.

The shipping question for Daneel in the next year is the same as the field's: does the WebNN EP path become production-grade enough to add NPU acceleration, and does WebGPU `bindless` (or a raised storage-binding cap) reduce the sharding cost on >7 B models? Both are tracked in the registry; neither has shipped enough to justify a default switch yet.

> **Further reading inside Daneel:** the `shared/` model registry (`registry.json`, `evaluation/`); the WebGPU / Ollama / Gemini Nano / Claude provider modules under [`src/providers/llm/`](https://doc.daneel.injen.io/reference/) for the alternative-stack matrix in practice; the Settings → Models panel for the runtime tier observations users hit empirically.

[^v4-blog]: Hugging Face, "transformers.js v4." 9 Feb 2026. <https://huggingface.co/blog/transformersjs-v4>
[^v4-tag]: transformers.js v4.0.0 release. <https://github.com/huggingface/transformers.js/releases/tag/4.0.0>
[^v4-cos]: Cross-Origin Storage cache backend (PR #1549). <https://github.com/huggingface/transformers.js/pull/1549>
[^v4-issue-1599]: Qwen3.5 vs Qwen3 perf regression on M2 Pro Chrome 145 (issue #1599). <https://github.com/huggingface/transformers.js/issues/1599>
[^v4-dtypes]: transformers.js dtypes guide. <https://huggingface.co/docs/transformers.js/en/guides/dtypes>
[^ort-large]: ONNX Runtime Web — "Large models." <https://onnxruntime.ai/docs/tutorials/web/large-models.html>
[^ort-1-20]: ONNX Runtime 1.20 release notes. <https://github.com/microsoft/onnxruntime/releases/tag/v1.20.0>
[^ort-env]: ONNX Runtime Web — env flags and session options. <https://onnxruntime.ai/docs/tutorials/web/env-flags-and-session-options.html>
[^onnx-build-web]: ONNX Runtime — Build for web. <https://onnxruntime.ai/docs/build/web.html>
[^ort-webnn]: ONNX Runtime Web — WebNN EP. <https://onnxruntime.ai/docs/tutorials/web/ep-webnn.html>
[^pr-14579]: ORT PR #14579 — JSEP origin. <https://github.com/microsoft/onnxruntime/pull/14579>
[^pr-21260]: ORT PR #21260 — initial wasm64; closed for #21836. <https://github.com/microsoft/onnxruntime/pull/21260>
[^pr-23697]: ORT PR #23697 — migrate WebGPU EP to WASM (replace JSEP). <https://github.com/microsoft/onnxruntime/pull/23697>
[^pr-23910]: ORT PR #23910 — UMA direct CPU→GPU buffer upload. <https://github.com/microsoft/onnxruntime/pull/23910>
[^pr-25181]: ORT PR #25181 — delete wasm64 build option. <https://github.com/microsoft/onnxruntime/pull/25181>
[^ort-25952]: ORT issue #25952 — `WebGpuExecutionProvider` enumerable in 1.23. <https://github.com/microsoft/onnxruntime/issues/25952>
[^ort-26732]: ORT issue #26732 — Gemma-3 fp16/q4f16 broken on WebGPU. <https://github.com/microsoft/onnxruntime/issues/26732>
[^ort-26827]: ORT issue #26827 — Safari/WebKit 26 JSEP CPU + memory bug. <https://github.com/microsoft/onnxruntime/issues/26827>
[^onnx-extdata]: ONNX External Data spec. <https://github.com/onnx/onnx/blob/main/docs/ExternalData.md>
[^onnx-3275]: ONNX issue #3275 — protobuf 2 GB cap. <https://github.com/onnx/onnx/issues/3275>
[^webgpu-spec]: W3C WebGPU spec. <https://www.w3.org/TR/webgpu/>
[^webgpu-limits]: WebGPU limits (gpuweb). <https://gpuweb.github.io/gpuweb/#limits>
[^gpuweb-5505]: gpuweb#5505 — Support for querying maximum/available GPU memory. <https://github.com/gpuweb/gpuweb/issues/5505>
[^gpuweb-5006]: gpuweb#5006 — `shader-f16` on Qualcomm Adreno. <https://github.com/gpuweb/gpuweb/issues/5006>
[^gpuweb-5071]: gpuweb#5071 — 64-bit atomics. <https://github.com/gpuweb/gpuweb/issues/5071>
[^gpuweb-bindless]: WebGPU bindless proposal. <https://github.com/gpuweb/gpuweb/blob/main/proposals/bindless.md>
[^gpuweb-2024-09]: gpuweb 2024-09 F2F notes. <https://github.com/gpuweb/gpuweb/wiki/GPU-Web-2024-09-F2F>
[^webgpu-impl-status]: WebGPU implementation status. <https://github.com/gpuweb/gpuweb/wiki/Implementation-Status>
[^chrome-113]: Chrome 113 — WebGPU release. <https://developer.chrome.com/blog/webgpu-release>
[^chrome-120]: Chrome 120 — `shader-f16`. <https://developer.chrome.com/blog/new-in-webgpu-120>
[^chrome-121]: Chrome 121 — Android WebGPU; D3D12 DXC. <https://developer.chrome.com/blog/new-in-webgpu-121>
[^chrome-133]: Chrome 133 — Memory64 era; limits messaging; `featureLevel: "compatibility"`. <https://developer.chrome.com/blog/new-in-webgpu-133>
[^chrome-134]: Chrome 134 — subgroups GA. <https://developer.chrome.com/blog/new-in-webgpu-134>
[^chrome-137]: Chrome 137 — `powerPreference`. <https://developer.chrome.com/blog/new-in-webgpu-137>
[^chrome-144]: Chrome 144 — Linux Intel; Kotlin bindings. <https://developer.chrome.com/blog/new-in-webgpu-144>
[^chrome-troubleshoot]: Chrome WebGPU troubleshooting tips. <https://developer.chrome.com/docs/web-platform/webgpu/troubleshooting-tips>
[^chrome-ai-builtin]: Chrome built-in AI APIs. <https://developer.chrome.com/docs/ai/built-in>
[^chrome-ai-io25]: Chrome AI API updates, Google I/O 2025. <https://developer.chrome.com/blog/ai-api-updates-io25>
[^chromestatus-mem64]: chromestatus — Memory64. <https://chromestatus.com/feature/5070065734516736>
[^chrome-blink]: Intent to Ship — Memory64 (blink-dev). <https://groups.google.com/a/chromium.org/g/blink-dev/c/5vTbd1dttwc>
[^chrome-workbox]: Chrome / Workbox — understanding storage quota. <https://developer.chrome.com/docs/workbox/understanding-storage-quota>
[^v8-4gb]: V8 — 4 GB Wasm memory. <https://v8.dev/blog/4gb-wasm-memory>
[^v8-bce-cl]: V8 — bounds-check elimination CL. <https://chromium-review.googlesource.com/c/v8/v8/+/6980318>
[^spider-mem64]: SpiderMonkey — "Is Memory64 actually worth using?" 15 Jan 2025. <https://spidermonkey.dev/blog/2025/01/15/is-memory64-actually-worth-using.html>
[^firefox-134]: Firefox 134 release notes. <https://github.com/mozilla/release-notes/blob/master/releases/firefox-134.0-release.json>
[^safari-26-2]: WebKit — Safari 26.2 release notes. <https://webkit.org/blog/17640>
[^wwdc25]: Apple WWDC25 — "Unlock GPU computing with WebGPU." <https://developer.apple.com/videos/play/wwdc2025/236/>
[^mtl-256]: Apple Developer Forums — `MTLBuffer` 256 MB. <https://developer.apple.com/forums/thread/61188>
[^intel-xmx]: Intel — "Boost AI Inference Performance with WebGPU on Intel Platforms." <https://www.intel.com/content/www/us/en/developer/articles/community/boost-ai-inference-performance-with-webgpu.html>
[^arxiv-webllm]: Ruan et al. — "WebLLM: A High-Performance In-Browser LLM Inference Engine." arXiv:2412.15803 (updated 2026). <https://arxiv.org/html/2412.15803v2>
[^arxiv-rtx5090]: WebGPU dispatch overhead on RTX 5090 / Ubuntu / Dawn (arXiv 2604.02344). <https://arxiv.org/html/2604.02344>
[^arxiv-mobile]: Mobile LLM bandwidth on Adreno 750 / Mali-G720 (arXiv 2410.03613). <https://arxiv.org/html/2410.03613v1>
[^acm-weinfer]: WeInfer — ACM Web Conf 2025. <https://dl.acm.org/doi/10.1145/3696410.3714553>
[^ms-feb24]: Microsoft Open Source Blog — "ONNX Runtime Web unleashes generative AI in the browser using WebGPU." 29 Feb 2024. <https://opensource.microsoft.com/blog/2024/02/29/onnx-runtime-web-unleashes-generative-ai-in-the-browser-using-webgpu/>
[^ms-phi3]: Hugging Face / Emma N. — "Enjoy the power of Phi-3 with ONNX Runtime." <https://huggingface.co/blog/Emma-N/enjoy-the-power-of-phi-3-with-onnx-runtime>
[^bgremoval]: img.ly — "Browser background removal using ONNX Runtime WebGPU." <https://img.ly/blog/browser-background-removal-using-onnx-runtime-webgpu/>
[^vulk-2024]: Brandon Jones — "Shipping WebGPU on Android" (Vulkanised 2024). <https://www.khronos.org/assets/uploads/developers/presentations/WebGPU_Meetup_-_WebGPU_on_Android.pdf>
[^webgpu-com-news]: WebGPU.com news — "WebGPU hits critical mass: all major browsers." 25 Nov 2025. <https://www.webgpu.com/news/webgpu-hits-critical-mass-all-major-browsers/>
[^web-dev-supported]: web.dev — "WebGPU now supported in major browsers." <https://web.dev/blog/webgpu-supported-major-browsers>
[^web-dev-storage]: web.dev — "Storage for the web." <https://web.dev/articles/storage-for-the-web>
[^web-dev-opfs]: web.dev — "Origin Private File System." <https://web.dev/articles/origin-private-file-system>
[^mdn-storage-quota]: MDN — Storage API quotas and eviction. <https://developer.mozilla.org/en-US/docs/Web/API/Storage_API/Storage_quotas_and_eviction_criteria>
[^mdn-opfs]: MDN — Origin private file system. <https://developer.mozilla.org/en-US/docs/Web/API/File_System_API/Origin_private_file_system>
[^pypi-ort-webgpu]: `onnxruntime-webgpu` on PyPI. <https://pypi.org/project/onnxruntime-webgpu/>
[^webllm-docs]: WebLLM docs — `@mlc-ai/web-llm`. <https://webllm.mlc.ai/docs>
[^webnn-cr]: W3C — WebNN Candidate Recommendation snapshot 22 Jan 2026. <https://www.w3.org/TR/2026/CR-webnn-20260122/>
[^webnn-news]: W3C news — updated WebNN CR. <https://www.w3.org/news/2026/updated-candidate-recommendation-web-neural-network-webnn-api/>
[^webnn-io]: WebNN.io reference. <https://webnn.io>
[^stable-diff-issue]: Snapdragon X Elite WebGPU `maxBufferSize` (`softwiredtech/stable-diffusion-webgpu#1`). <https://github.com/softwiredtech/stable-diffusion-webgpu/issues/1>
[^liquid-lfm2-card]: Liquid AI — LFM2-8B-A1B-ONNX model card. <https://huggingface.co/LiquidAI/LFM2-8B-A1B-ONNX>
[^liquid-space]: Liquid AI — LFM2-MoE-WebGPU Space. <https://huggingface.co/spaces/LiquidAI/LFM2-MoE-WebGPU>
[^nv-phi35]: NVIDIA — Phi-3.5-mini-Instruct-ONNX-INT4 card. <https://huggingface.co/nvidia/Phi-3.5-mini-Instruct-ONNX-INT4>
[^webgpu-report]: webgpureport.org — adapter limits report tool. <https://webgpureport.org/>
[^sitepoint]: SitePoint — "WebGPU vs WebAssembly transformers.js." <https://www.sitepoint.com/webgpu-vs-webasm-transformers-js/>
[^dawn-graphics]: dawn-graphics Google Group — VRAM measurement thread. <https://groups.google.com/g/dawn-graphics/c/Rw_E21KjWAU>
[^pcworld]: PCWorld coverage — Snapdragon X2 Elite NPU, CES 2026 NPU positioning. (Cited via secondary aggregation; check vendor source before quoting.)
[^phoronix-121]: Phoronix — Chrome 121 release. <https://www.phoronix.com/news/Chrome-121-Released>

---

[Read on site](https://daneel.injen.io/research/three-walls-browser-llm-inference-2026.html?utm_source=extension_research_reader&utm_medium=extension_settings&utm_campaign=extension)
