In-depth research papers from Daneel AI on local inference, model architecture, quantization, and adjacent technologies.

Local inference - 2026-04-25 - 45 min
In late 2024 the in-browser open-source LLM catalog was effectively three community ports: Microsoft Phi-3, Meta Llama 3.2, and HuggingFace SmolLM2. By March 2026 it had grown to roughly two dozen first-party releases from a multi-vendor cohort that did not exist eighteen months earlier, including IBM Granite 4 with explicit ONNX-web variants, OpenAI's first open-weight family in years, Liquid AI's hybrid LFM2.5 line, and a 1-bit entrant from Caltech that emerged from stealth on the last day of the window. This paper maps that trajectory: its inflection points, the labs driving releases, the catalog as it stands, and what 2026-2027 has telegraphed.

Local inference - 2026-04-25 - 51 min
In-browser LLM inference has lived under three constraint walls in sequence: protobuf's 2 GB cap on `.onnx` files, WebAssembly's 4 GB linear-memory limit, and the browser-allocated WebGPU VRAM budget. As of April 2026, the first two are effectively cleared — the first by ONNX's External Data format, the second by ONNX Runtime's new C++ WebGPU execution provider — while the third remains the binding constraint, with no portable spec query and substantial per-platform variance. This paper traces how each wall arose, how the February 2026 transformers.js v4 / ORT C++ EP inflection collapsed two of them, and what genuinely fits in a browser tab today across Apple Silicon, NVIDIA, AMD, Intel, and mobile.
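The "no portable spec query" point can be made concrete: WebGPU exposes per-adapter buffer limits but no total-VRAM figure, so a tab can only bound single allocations, not the whole budget. The sketch below is not from the paper; it assumes only the standard WebGPU API (typed via `@webgpu/types` in TypeScript) and shows what a page can actually learn before committing to a model download.

```ts
// Minimal sketch: probe the GPU budget a browser tab is willing to expose.
// WebGPU reports per-buffer limits, not total VRAM, so these numbers bound
// individual weight shards rather than the full model footprint.
async function probeWebGpuBudget(): Promise<void> {
  if (!("gpu" in navigator)) {
    console.log("WebGPU not available in this browser");
    return;
  }
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    console.log("No suitable GPU adapter");
    return;
  }
  // Largest single buffer and largest storage binding the adapter supports.
  // Staying under these does not guarantee the whole model fits in VRAM.
  console.log("maxBufferSize:", adapter.limits.maxBufferSize);
  console.log("maxStorageBufferBindingSize:", adapter.limits.maxStorageBufferBindingSize);

  // Ask for the adapter's maxima explicitly; if requiredLimits is omitted the
  // device falls back to spec defaults (256 MiB buffers, 128 MiB bindings),
  // which is far too small for multi-gigabyte weight shards.
  const device = await adapter.requestDevice({
    requiredLimits: {
      maxBufferSize: adapter.limits.maxBufferSize,
      maxStorageBufferBindingSize: adapter.limits.maxStorageBufferBindingSize,
    },
  });
  console.log("device granted maxBufferSize:", device.limits.maxBufferSize);
}

probeWebGpuBudget();
```

Because nothing here reports free or total device memory, runtimes are left to infer headroom from platform heuristics and allocation failures, which is why the third wall varies so much across Apple Silicon, NVIDIA, AMD, Intel, and mobile.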