The in-browser open LLM catalog grew up in 2025-2026: from three community ports to two dozen first-party releases in eighteen months.


In January 2025, when a public WebGPU LLM demo needed a small reasoning model, the practical answer was Microsoft Phi-3-mini, ported to ONNX by the one-engineer effort of Xenova nearly a year earlier.

The other two production-class options were Meta's Llama 3.2 1B (gated, but among the most-downloaded ONNX text-generation models on Hugging Face) and Hugging Face's own SmolLM2 family. The catalog was three models deep, two of them ported by hand[1].

By March 2026 a similar demo can pick from roughly two dozen first-party releases.

IBM ships granite-4.0-micro-ONNX-web with a naming convention that explicitly targets transformers.js[2]. OpenAI's first open-weight family in years lands at onnx-community/gpt-oss-20b-ONNX with first-party MXFP4 quantization on the Mixture-of-Experts layers[3].

Liquid AI ships LiquidAI/LFM2.5-1.2B-Thinking-ONNX directly from its own org[4].

Mistral ships mistralai/Ministral-3-3B-Instruct-2512-ONNX[5].

Hugging Face's own State of Open Source 2026 captures the structural change in one sentence: "most major model developers now release families of models spanning a range of sizes."[6] The community-port era is not over, but it is no longer the catalog.

This paper is about that change.

TL;DR

  • The in-browser open-source LLM catalog (≤ 14B parameters, transformers.js v4 plus ONNX, text generation as a first-class output, including multimodal models that emit text) expanded from a handful of community-converted Phi and Llama ports in late 2024 to a multi-vendor catalog of two-dozen-plus first-party releases by March 2026[6][7].
  • Hugging Face's State of Open Source Spring 2026 reports that the median size of downloaded open models barely moved (326M parameters in 2023 to 406M in 2025), while the mean rose from 827M to 20.8B. Quantization and MoE pull the high end; small-model usage is flat to stable[6].
  • Three structural shifts shape the catalog: first-party browser builds from labs that previously shipped only PyTorch or GGUF, hybrid Mamba-plus-attention architectures arriving at small scale, and sub-4-bit quantization moving from research to shipped product[2][8][9][10].
  • The 2026 cohort spans roughly fifteen organizations across four quadrants. Established labs include Liquid AI, IBM, Microsoft, Mistral, OpenAI, AllenAI, Cohere, Tencent, AI21 Labs, and Alibaba Qwen. New entrants targeting on-device specifically include PrismML and the ServiceNow-NVIDIA Apriel partnership. Academic and national-sovereign cohorts include HuggingFaceTB, the swiss-ai consortium (EPFL, ETH Zurich, CSCS), TII (UAE), and a five-lab Korean sovereign-AI initiative. The Hugging Face onnx-community/ org acts as the catalog spine[6][10][11].
  • Forward look: an NVIDIA Research position paper from June 2025 makes the explicit thesis that small language models are the future of agentic AI[12]. Industry analysis projects edge-AI device count growing from 1.2 billion in 2024 to 2.5 billion in 2027, paired with NPU silicon (40-80 TOPS Copilot+ baseline) that creates a real target surface[13].

The pre-2025 baseline

The 2024 picture is worth one paragraph as anchor. The transformers.js v3 release in October 2024 added WebGPU support across roughly 120 architectures[14]. The browser-runnable LLM set in production demos was effectively three: Xenova/Phi-3-mini-4k-instruct (Microsoft, ported by Xenova), onnx-community/Llama-3.2-1B-Instruct-ONNX (Meta, gated, ported by the community), and the HuggingFaceTB SmolLM2 family. Apple had released OpenELM in April 2024 with a permissive sample-code license at sizes from 270M to 3B, and explicitly framed it as an on-device language model, but did not ship browser or ONNX builds. OpenELM stayed a research artifact rather than a production catalog member[15]. The 2024 catalog was small, the porting effort was concentrated in a handful of community contributors, and labs were not part of the loop.
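
For concreteness, this is roughly what that January 2025 answer looked like in code: a minimal transformers.js sketch, assuming the standard text-generation pipeline API and the q4 quantized weights the community port shipped with (the dtype choice here is an assumption, not a catalog fact).

```ts
import { pipeline } from "@huggingface/transformers";

// January 2025: one community port, loaded by repo id.
const generator = await pipeline(
  "text-generation",
  "Xenova/Phi-3-mini-4k-instruct",
  { device: "webgpu", dtype: "q4" }, // q4: the port's quantized weights (assumption)
);

const output = await generator(
  [{ role: "user", content: "Summarize WebGPU in one sentence." }],
  { max_new_tokens: 64 },
);
console.log(output[0].generated_text.at(-1).content);
```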

The 2025-2026 trajectory

The eighteen-month window divides cleanly into four phases. The first is shaped by external pressure (DeepSeek-R1's release in January 2025), the second by the first wave of first-party browser builds (LFM2, gpt-oss, SmolLM3, Apertus), the third by a near-simultaneous cluster of releases from four major labs (IBM, Mistral, NVIDIA, AllenAI), and the fourth by transformers.js v4 itself plus a flurry of agentic-focused small models in early 2026.

timeline
    title Small/medium browser-LLM releases, 2025-Q1 2026
    Jan 2025 : DeepSeek-R1 (MIT) + 6 distilled checkpoints
    Apr 2025 : Microsoft BitNet b1.58 2B4T (native 1-bit)
             : Alibaba Qwen3 family (0.6B-32B)
    May 2025 : TII Falcon-H1 hybrid family (0.5B-34B)
    Jul 2025 : HuggingFaceTB SmolLM3-3B
             : Liquid AI LFM2 (350M, 700M, 1.2B)
    Aug 2025 : OpenAI gpt-oss (20B, 120B)
             : Tencent Hunyuan dense (0.5B-7B)
    Sep 2025 : Apertus 8B/70B (swiss-ai)
    Nov 2025 : AllenAI OLMo 3 (7B, 32B)
    Dec 2025 : Mistral 3 family (3B-14B)
             : NVIDIA Nemotron 3 Nano
             : IBM Granite 4 (Micro 3B, H-Tiny 7B/1B MoE, H-Small 32B/9B MoE)
    Jan 2026 : Tencent Youtu-LLM 2B (agentic)
    Feb 2026 : transformers.js v4 preview
             : Alibaba Qwen3.5
             : Korean sovereign cohort trends on HF Hub
    Mar 2026 : transformers.js v4.0.0 GA
             : PrismML emerges from stealth (1-bit Bonsai)

Q1-Q2 2025: DeepSeek and the Big Lab response

DeepSeek-R1's release on 20 January 2025 under MIT license, with six distilled checkpoints based on Qwen2.5 and Llama 3 at 1.5B, 7B, 8B, 14B, 32B, and 70B, did two things at once[16]. It collapsed the perceived gap between Chinese open-weight reasoning models and Western frontier closed models.

And it triggered the geopolitical inflection that runs through the rest of the trajectory. Hugging Face's own State of Open Source captures the consequence in plain language: "Western organizations increasingly seek commercially deployable alternatives to Chinese models, creating urgency around efforts like OpenAI's GPT-OSS, AI2's OLMo, and Google's Gemma to offer competitive open options from US and European developers."[6] The distilled 1.5B variant landed at onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX quickly enough to be the headline transformers.js demo for the spring[17].

Three responses followed in close succession. Microsoft Research published BitNet b1.58 2B4T on 14 April 2025: the first LLM trained natively at 1.58-bit (1-trit) weights from scratch on a 4-trillion-token corpus, MIT-licensed, with a 400 MB on-disk footprint that beats Llama 3.2 1B and Qwen 2.5 1.5B on GSM8K and PIQA[9][18]. Alibaba's Qwen3 family arrived on 28 April 2025 under Apache 2.0, with dense variants at 0.6B, 1.7B, 4B, 8B, 14B, and 32B, plus MoE at 30B-A3B and 235B-A22B[19]. Most of the dense Qwen3 family fits the browser tier and most of it is now on onnx-community. TII followed on 20 May 2025 with Falcon-H1, a parallel hybrid Transformer plus State Space Model architecture in six dense variants from 0.5B to 34B with adjustable attention-to-SSM ratios; the 0.5B delivers, by TII's claim, "performance on par with typical 7B models from 2024."[8] The first Falcon-H1 ONNX port (Falcon-H1-Tiny-90M-Instruct-ONNX) appeared shortly afterward[20].

The shape of the response is informative. Microsoft chose the architectural-research path (native sub-2-bit training). Alibaba chose the breadth path (a full size pyramid, fully Apache, immediately available). TII chose the hybrid-architecture path (Transformer plus SSM, a structurally different model class). None of the three are gestures; all three put non-trivial engineering and compute behind a small-model release.

Q3 2025: first-party browser builds arrive

July and August 2025 are when the catalog stops being a community project. SmolLM3-3B, released by Hugging Face's own training and benchmarking org on 8 July 2025 under Apache 2.0, is a 3B decoder-only model with grouped-query attention and a 3:1 RoPE-to-NoPE layer pattern, trained on 11.2 trillion tokens with 128K context (64K trained plus YARN extrapolation), dual-mode reasoning via /think and /no_think flags, and six languages[21]. Two days later, Liquid AI shipped LFM2 with 350M, 700M, and 1.2B variants[22]. The architecture is "a hybrid Liquid model with multiplicative gates and short convolutions": 16 blocks composed of 10 double-gated short-range convolution blocks and 6 grouped-query attention blocks. The licensing is unusual and worth flagging: it is "based on Apache 2.0" but adds a $10M revenue threshold above which a commercial license is required. The initial release shipped PyTorch via ExecuTorch and llama.cpp; ONNX builds followed at onnx-community/LFM2-1.2B-ONNX and onnx-community/LFM2-350M-ONNX, and the LFM2.5 evolution is now distributed first-party from the LiquidAI org with explicit ONNX builds (LiquidAI/LFM2.5-1.2B-Thinking-ONNX, LiquidAI/LFM2.5-VL-450M-ONNX)[4].
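
The dual-mode flags are worth one illustrative sketch. This assumes a community ONNX build at onnx-community/SmolLM3-3B-ONNX and the system-prompt toggle described in the SmolLM3 release blog; the repo id and dtype choice are assumptions, not verified catalog facts.

```ts
import { pipeline } from "@huggingface/transformers";

// Community ONNX build of SmolLM3-3B (repo id is an assumption).
const generator = await pipeline("text-generation", "onnx-community/SmolLM3-3B-ONNX", {
  device: "webgpu",
  dtype: "q4f16",
});

// Reasoning off: the /no_think system flag suppresses the thinking trace.
// Swap in "/think" to get the dual-mode reasoning described above.
const output = await generator(
  [
    { role: "system", content: "/no_think" },
    { role: "user", content: "What is 17 * 24?" },
  ],
  { max_new_tokens: 128 },
);
console.log(output[0].generated_text.at(-1).content);
```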

OpenAI's gpt-oss release on 5 August 2025 is the cohort headline of Q3. Two models: gpt-oss-20b (21B total parameters with 3.6B active in a Mixture-of-Experts) and gpt-oss-120b (117B total, 5.1B active). Both Apache 2.0. MXFP4 4-bit quantization on the MoE layers, 128K context with sliding-window attention, learned attention sink per head[3]. The 20B variant runs on a 16 GB consumer GPU. Microsoft ships GPU-optimized ONNX builds for Windows, and onnx-community/gpt-oss-20b-ONNX is the transformers.js v4 target. The framing in Hugging Face's welcome blog is candid about motivation: "This release is a meaningful step in their commitment to the open-source ecosystem, in line with their stated mission to make the benefits of AI broadly accessible. Many use cases rely on private and/or local deployments."[3] Read in context with the geopolitical pressure noted earlier, the meaning is plain: gpt-oss exists because DeepSeek does.

Tencent's Hunyuan dense small-models release on 6 August 2025 covers 0.5B, 1.8B, 4B, and 7B Instruct variants, with 256K context, grouped-query attention, and FP8, GPTQ-Int4, and AWQ-Int4 quantized variants for edge deployment positioned at "smartphones, smart vehicles, smart home, smart cabin"[23]. The notable gap, candidly: as of March 2026 there is no onnx-community/ build of the Hunyuan dense models, and Tencent itself has not shipped one. Hunyuan is the largest such gap in the catalog.

The swiss-ai consortium (EPFL, ETH Zurich, and the Swiss National Supercomputing Centre) released Apertus on 2 September 2025 at 8B and 70B, both Base and Instruct, under Apache 2.0[24][25]. The 8B is browser-tier. The model is trained on 15 trillion tokens spanning more than a thousand languages (40% of training data is non-English), including underrepresented languages like Swiss German and Romansh. The mission is stated explicitly: "Democratizing Open and Compliant LLMs for Global Language Environments."[24] The Apertus architecture is listed among the v4-supported families in the transformers.js v4 release notes[7]; a community ONNX port was not directly verified at the time of writing.

Q4 2025: a four-lab cluster

Between mid-November and mid-December 2025, four labs shipped releases that are now central to the small-LLM catalog. AllenAI's OLMo 3 arrived on 20 November 2025 (with a 3.1 update on 12 December), Apache 2.0, at 7B and 32B[26][27]. The 7B is browser-relevant. AllenAI's framing of "fully open" is structural and goes beyond weights: the Dolma 3 corpus (~9.3T tokens), the Dolci post-training suite, and all intermediate training checkpoints are released, alongside the open tooling (Olmo-core, Open Instruct, datamap-rs, OLMES, decon). On the 32B Think variant, AllenAI's own positioning is a "glass-box reasoning model where the data, the training code, the post-training recipe, and even the provenance of individual answers are all on the table."[26]

Mistral 3 launched on 2 December 2025 with nine dense models across three sizes (3B, 8B, 14B) and three variants per size (Base, Instruct, Reasoning), all Apache 2.0[5]. Each model is natively multimodal and multilingual (40+ languages, with image understanding). The Ministral 3 line, especially the 3B, is positioned for smartphones and IoT. The reasoning 14B variant scores 85% on AIME '25. The first-party ONNX build at mistralai/Ministral-3-3B-Instruct-2512-ONNX puts the 3B in the catalog; ONNX status for the 8B and 14B is still rolling out.

NVIDIA's Nemotron 3 Nano on 15 December 2025 is the most architecturally ambitious of the cluster: a hybrid Mamba-Transformer Mixture-of-Experts with 3.2B active and 31.6B total parameters, with a 1M-token context window. NVIDIA reports it delivers "4× higher throughput than Nemotron 2 Nano" and beats GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507 on benchmarks[28]. The total parameter count exceeds the browser tier, but the active-parameter count is well within. NVIDIA also released 3 trillion tokens of pre-training data and 18 million post-training samples alongside the model, the first time the company has packaged training data with the model itself.

IBM's Granite 4.0 closed the quarter[29]. The variants run from Granite-4.0-H-Small (32B with 9B active in an MoE), through Granite-4.0-H-Tiny (7B with 1B active MoE), to Granite-4.0-H-Micro (a 3B dense hybrid). The architecture is Mamba-2 plus transformer in a 9:1 ratio, with the Mamba-2 layers handling global context efficiently and the transformer blocks doing local-context parsing. IBM's positioning is the most explicit of any lab in the cohort. As covered in the announcement, "the launch of Granite 4.0 initiates a new era for IBM's family of enterprise-ready large language models, leveraging novel architectural advancements to double down on small, efficient language models that provide competitive performance at reduced costs and latency."[30] The IBM strategic framing here is described as "betting that enterprises will prioritize cost, governance and reliability over raw scale"[30]. What sets IBM apart in the catalog is the publishing convention: the Micro, 350M, and 1B variants ship as granite-4.0-micro-ONNX-web, granite-4.0-350m-ONNX-web, and granite-4.0-1b-ONNX-web on the onnx-community org, with explicit "ONNX-web" naming. This is a tier above the cohort norm: IBM did not just ship a model that could be ported; it shipped the port itself, named for the runtime[2].

Q1 2026: agentic SLMs and the runtime catches up

The first quarter of 2026 has three threads. The first is agentic-focused small models. Tencent's Youtu-LLM 2B on 4 January 2026 is a 1.96B-parameter model with 128K context and "native agentic talents", released alongside Youtu-VL 4B (vision-language) and powering Youtu-Tip, an offline agent that runs on macOS via llama.cpp[31]. Both Tencent models lack onnx-community/ builds at the time of writing. The Youtu-LLM 2B catalog gap is the second-largest of the year after the Hunyuan dense gap.

The second thread is the runtime catching up. transformers.js v4 preview shipped on 9 February 2026, with v4.0.0 GA on 30 March 2026[7][32]. The v4 release adds Mamba (state-space), Multi-head Latent Attention, and Mixture of Experts as new architecture families, plus the ORT contrib operators (GroupQueryAttention, MatMulNBits, QMoE, MultiHeadAttention) that make these architectures fast in the browser. The headline benchmark is GPT-OSS 20B at q4f16 running at roughly 60 tokens per second on an M4 Pro Max. The list of v4-exclusive architectures is the catalog map for the year: "GPT-OSS, Chatterbox, GraniteMoeHybrid, LFM2-MoE, HunYuanDenseV1, Apertus, Olmo3, FalconH1, Youtu-LLM."[7] These are the labs that shipped in 2025, and the runtime now supports them all.
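
A minimal sketch of that headline scenario, assuming the transformers.js v4 pipeline API and the q4f16 build named above; the streaming callback is illustrative.

```ts
import { pipeline, TextStreamer } from "@huggingface/transformers";

// The v4 headline scenario: the 20B MoE at q4f16 on WebGPU. Only the 3.6B
// active parameters run per token; the weights still occupy roughly 11 GB.
const generator = await pipeline("text-generation", "onnx-community/gpt-oss-20b-ONNX", {
  device: "webgpu",
  dtype: "q4f16",
});

// Stream tokens as they decode instead of waiting for the full completion.
const streamer = new TextStreamer(generator.tokenizer, {
  skip_prompt: true,
  callback_function: (chunk: string) => console.log(chunk), // e.g. append to the DOM
});

await generator(
  [{ role: "user", content: "Explain MXFP4 quantization in two sentences." }],
  { max_new_tokens: 256, streamer },
);
```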

The third thread is multimodal expansion at small scale. Liquid AI's LFM2.5-VL series (450M and 1.6B image-text-to-text models) lands on onnx-community and LiquidAI simultaneously[4]. Alibaba's Qwen3.5 family ships on 16 February 2026 in browser-relevant sizes (onnx-community/Qwen3.5-0.8B-ONNX, Qwen3.5-2B-ONNX, Qwen3.5-4B-ONNX, Qwen3.5-9B-ONNX)[33]. South Korea's national-sovereign cohort, backed by a 240-billion-won government program with five consortia (LG AI Research, SK Telecom, Naver Cloud, NC AI, Upstage), produces three simultaneously-trending HF Hub models in February 2026[34][35][6].

The window closes on 31 March 2026 with PrismML, a Caltech-rooted lab backed by Khosla Ventures, Cerberus Ventures, and Google compute, emerging from stealth with Bonsai 1-bit models at 1.7B, 4B, and 8B[10]. Bonsai is end-to-end 1-bit: embeddings, attention layers, MLP layers, and the LM head are all 1-bit, across 8.2 billion parameters in the 8B variant. The licensing is Apache 2.0. The vendor claims 14× smaller, 8× faster, and 4-5× more energy efficient than Llama 3 8B at FP16. PrismML's CEO, Babak Hassibi, frames the architectural bet plainly: "We see 1-bit not as an endpoint, but as a starting point."[10] The catch is that Bonsai ships only as GGUF, MLX, and llama.cpp at launch; there is no ONNX or transformers.js build, and the browser inference path PrismML demos goes through Google Colab rather than the v4 runtime[36]. PrismML is in the cohort to watch. It is not in the catalog today.

Why big labs are shipping small open models

Three of the established labs in the cohort have explicit strategic framings for why they are doing this. They are different framings, and that is informative.

IBM's positioning is the cleanest. The Granite 4.0 announcement frames "a new era" of enterprise-ready language models that "leverages novel architectural advancements to double down on small, efficient language models that provide competitive performance at reduced costs and latency."[30] The strategic bet, as covered in industry reporting, is that enterprises will "prioritize cost, governance and reliability over raw scale."[30] The architecture (Mamba-2 plus transformer, hybrid with explicit MoE in the H-Tiny variant) is in service of that bet, not the other way around. The naming convention (granite-4.0-micro-ONNX-web) commits to the deployment surface, not just the model.

OpenAI's positioning is more subtle, partly because gpt-oss is the company's first open-weight family in years and the gestures are still being read. The official framing in the Hugging Face welcome blog is mission-aligned: "This release is a meaningful step in their commitment to the open-source ecosystem, in line with their stated mission to make the benefits of AI broadly accessible. Many use cases rely on private and/or local deployments."[3] The unstated half is geopolitical: the same Hugging Face report (in a different document) names the pressure directly. "Western organizations increasingly seek commercially deployable alternatives to Chinese models, creating urgency around efforts like OpenAI's GPT-OSS, AI2's OLMo, and Google's Gemma to offer competitive open options from US and European developers. Whether these efforts can match the adoption momentum of Qwen and DeepSeek will be a defining question of 2026."[6] gpt-oss exists because DeepSeek-R1 took the most-liked-models slot on the Hugging Face Hub in early 2025.

Microsoft's positioning is ecosystem-centric. The Phi family is positioned as the on-device anchor for Snapdragon Copilot+ PCs, optimized via Microsoft Olive plus the ONNX GenAI Runtime, deployed via Foundry Local and the AI Toolkit for VS Code[37]. Phi-4-mini at 3.8B with 128K context is described as designed for "compute-constrained inference environments." The strategic frame is NPU plus low latency plus reasoning at small size, complementing Microsoft's own Copilot+ PC silicon push. It is also the lab whose transformers.js v4 support is most explicitly in progress: tracking issue #1460 in the transformers.js repository indicates that Phi-4-mini support is partial as of this writing.

NVIDIA is the interesting case because the company straddles research and product simultaneously. NVIDIA Research published "Small Language Models are the Future of Agentic AI" in June 2025 as an arXiv position paper[12][38]. The thesis is unambiguous: "small language models (SLMs) are sufficiently powerful, inherently more suitable, and necessarily more economical for many invocations in agentic systems, and are therefore the future of agentic AI." The supporting argument is that agentic applications differ fundamentally from general-purpose chat: agents "perform a small number of specialized tasks repetitively and with little variation", which favors smaller, specialized models over larger generalists. NVIDIA's Nemotron 3 Nano, plus the company's product partnership with ServiceNow on the Apriel family, is the direct enactment of that thesis at the silicon vendor[11][39].

Underneath the per-lab framings, three pressures converge. The cost economics are stark: industry estimates put serving a 7B SLM at 10-30× cheaper than running a 70-175B frontier model, cutting operational expenses by up to 75%, and a frequently-cited statistic suggests that "nearly 80% of corporate LLM calls could have been handled more accurately and at 1/10th of the latency by a tuned SLM."[13] The privacy and governance pressure is structural: GDPR, HIPAA, and equivalent regulations make on-premise or on-device deployment non-negotiable for sensitive data, and SLMs are the only model class that fits.

The hardware availability pressure is the third leg: the edge-AI device count is projected to grow from 1.2 billion in 2024 to 2.5 billion in 2027, and NPU silicon (40 TOPS Copilot+ baseline, 80 TOPS on Snapdragon X2 Elite, 48-50 TOPS on Intel Lunar Lake and Panther Lake) creates a real target surface[13].

Hugging Face captures the structural shift in one sentence: "As a result, most major model developers now release families of models spanning a range of sizes."[6] The sentence is not editorial. It is empirical.

The 2026 lab cohort

The cohort separates cleanly into four quadrants. The first quadrant is established labs with first-party browser builds today. The second is new entrants targeting on-device specifically. The third is academic and national-sovereign cohorts. The fourth is the aggregator-community spine.

Established labs with first-party browser builds today

| Lab | Browser-runnable models | First-party ONNX status | License | Notes |
|---|---|---|---|---|
| Liquid AI | LFM2 350M, 700M, 1.2B; LFM2.5 350M, 1.2B (Instruct, Thinking, JP); LFM2.5-VL-450M (multimodal-text) | Yes (LiquidAI/*-ONNX first-party plus onnx-community/) | LFM Open License v1.0 (Apache-based, $10M revenue threshold) | Hybrid conv plus GQA; explicit on-device positioning |
| IBM | Granite 4.0 Micro (3B), 350M, 1B, h-350m | Yes (onnx-community/granite-4.0-*-ONNX-web) | Apache 2.0 | Hybrid Mamba-2 plus transformer; explicit "ONNX-web" naming |
| Microsoft | Phi-3.5-mini (3.8B), Phi-4-mini (3.8B, partial v4) | Phi-3.5 yes (Xenova/Phi-3-mini-4k-instruct); Phi-4-mini transformers.js v4 in progress | MIT | Olive plus ONNX GenAI Runtime path; Snapdragon NPU partner |
| OpenAI | gpt-oss-20b (21B / 3.6B active MoE) | Yes (onnx-community/gpt-oss-20b-ONNX) | Apache 2.0 | First OpenAI open-weight family in years; MXFP4 |
| Mistral | Ministral 3 family (3B, 8B, 14B; Base, Instruct, Reasoning; image understanding) | First-party for 3B Instruct (mistralai/Ministral-3-3B-Instruct-2512-ONNX); broader catalog rolling | Apache 2.0 | Multimodal plus multilingual native (40+ languages) |
| AllenAI | OLMo 3 7B (Base, Think, Instruct) | No first-party ONNX yet; v4-supported architecture | Apache 2.0 | Fully-open glass-box (Dolma 3 corpus, all checkpoints) |
| Alibaba Qwen | Qwen3 0.6B, 1.7B, 4B, 8B, 14B; Qwen3.5 0.8B, 2B, 4B, 9B | Extensive (onnx-community/Qwen3-*-ONNX and Qwen3.5-*-ONNX) | Apache 2.0 | HF reports 200K+ Qwen-derivative models on Hub |
| DeepSeek | DeepSeek-R1-Distill-Qwen-1.5B (browser-relevant) | Yes (onnx-community/...-ONNX); also onnxruntime/...-ONNX | MIT | Triggered the geopolitical inflection |
| Tencent | Hunyuan dense 0.5B, 1.8B, 4B, 7B; Youtu-LLM 2B; Youtu-VL 4B | Gap: no onnx-community/ build verified | Tencent license (per-model) | Native FP8/GPTQ/AWQ multi-quant for edge |
| AI21 Labs | Jamba 2 3B; Jamba Reasoning 3B (250K context) | Gap: no onnx-community/ build verified | Apache 2.0 | Hybrid Mamba-Transformer; 35 tok/s on standard MacBook Pro |
| Cohere | Command R7B (open-weights research release) | Audio model onnx-community/cohere-transcribe-03-2026-ONNX confirmed; Command R7B status unverified | CC-BY-NC research-release | 23-language multilingual; on-device positioning explicit |

The pattern across the eleven labs is uneven but trending. IBM and Liquid AI have shipped ONNX-web variants as a first-party concern. OpenAI distributes through onnx-community. Microsoft is partial. Tencent, AI21, and Cohere are gaps. Alibaba is the leader by volume. AllenAI relies on transformers.js v4 architecture-family support without (yet) shipping its own ONNX.

New entrants targeting on-device specifically

PrismML (also operating as Mintplex Labs in the GitHub fork) emerged from stealth on 31 March 2026 with the Bonsai 1-bit family (1.7B, 4B, 8B), backed by Khosla Ventures, Cerberus Ventures, and compute grants from Google and Caltech[10][36]. The CEO is Babak Hassibi, a Caltech professor; the technical research is described as "developed at Caltech." The Apache 2.0 license and the radical 1-bit architecture (1 GB memory footprint for the 8B variant) make Bonsai notable. As of March 2026 there is no ONNX or transformers.js build, and the browser inference path is via Google Colab rather than transformers.js.

PrismML is in the cohort but not yet in the catalog. The CEO's framing of the architectural direction ("We see 1-bit not as an endpoint, but as a starting point") suggests Bonsai is the first of a planned line[10].

ServiceNow and NVIDIA's Apriel family is the other notable new entrant in the on-device quadrant[11][39]. Apriel 5B is positioned to "run on both powerful data center GPUs and everyday consumer GPUs—and even on some edge devices, such as laptops and high-end smartphones, when optimized." Apriel 1.6 is multimodal and scores 57 on the Artificial Analysis Index, which the announcement positions as on par with Qwen3 235B-A22B and DeepSeek-V3.2 (15× larger models). Apriel 2.0 is expected in production by Q1 2026. ONNX availability has not been verified at the time of writing.

Academic and national-sovereign cohorts

HuggingFaceTB (Hugging Face Training and Benchmarking) is academic-style in its release discipline.

SmolLM3-3B was trained on 384 H100 GPUs over 24 days with a fully-published recipe, training data (SmolTalk2), and per-stage configurations, all licensed Apache 2.0[21]. The cohort role is to set a fully-open baseline that other labs are measured against.

The swiss-ai consortium (EPFL, ETH Zurich, and the Swiss National Supercomputing Centre) released Apertus on 2 September 2025, with both 8B and 70B variants under Apache 2.0[24][25]. The mission statement ("Democratizing Open and Compliant LLMs for Global Language Environments") aligns with the structural choice to release intermediate training checkpoints alongside the final weights, plus the multilingual coverage spanning more than 1,000 languages. The 8B variant is browser-tier and the architecture is in the v4-supported family list.

TII (the Technology Innovation Institute, Abu Dhabi) shipped the Falcon-H1 family in May 2025 across 0.5B, 1.5B, 1.5B-Deep, 3B, 7B, and 34B parameters, with native support for 18 languages, scalable to over 100[8]. The cohort posture is similar to swiss-ai's: a national-research-institute release pattern rather than a startup or vendor pattern. In January 2026 the Falcon-H1-Arabic variants were added at 3B, 7B, and 34B, deepening the Arabic-specific capability.

The South Korean national-sovereign cohort is a new structural feature[34][35][40].

A 240-billion-won (~$170M USD) Ministry of Science and ICT program selected five consortia in mid-2025 to develop sovereign LLMs operating on local infrastructure: Naver Cloud (HyperClova X Think, advanced June 2025), SK Telecom (AX 3.1 Lite, a 7B model trained on 1.65 trillion multilingual tokens with explicit on-device focus), Upstage (Solar Pro 2 at 31B parameters, the first Korean model recognized as a frontier model by Artificial Analysis), LG AI Research (Exaone 4.0 hybrid reasoning), and NC AI.

Every six months the government reviews progress and culls underperformers; the program will eventually narrow to two leaders. The cohort's open-source obligation is structural: "All five teams are required to make public more than half of their models as open-source technology." The result is visible on the platform: Hugging Face's own State of Open Source notes that "three models from South Korea trended simultaneously on Hugging Face Hub in February 2026."[6] ONNX availability for the cohort has not been individually verified at the time of writing; this is a near-term gap to track.

The aggregator-community spine

Hugging Face's onnx-community/ org is the de facto staging ground for browser-targeted ONNX builds[41]. As of April 2026 the org has 1,110-plus models.

The recent uploads tell the cohort story:

Qwen3.5 series, gemma-4-E2B-it-ONNX and gemma-4-E4B-it-ONNX (Google, gated and out-of-scope for this paper but worth noting as the most-downloaded ONNX text-generation models on the Hub), LFM2.5-VL-450M, Falcon-H1-Tiny-90M-Instruct. The org also hosts a convert-to-onnx Space for community conversion, which is how many of the catalog members got there in the first place.

The webml-community/ org and the transformers.js-examples GitHub repo carry the demo applications: Phi-3.5 WebGPU, SmolLM WebGPU, GPT-OSS-WebGPU, deepseek-r1-webgpu, janus-pro-webgpu[1]. The aggregator is not a lab; it is the substrate the labs land on.
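
For readers who want to watch the spine directly rather than take the snapshot above, the public Hub REST API can enumerate the org. A minimal sketch; the filter and sort choices here are illustrative.

```ts
// List browser-targeted text-generation builds in the onnx-community org,
// sorted by downloads, via the documented huggingface.co REST API.
const res = await fetch(
  "https://huggingface.co/api/models?author=onnx-community&pipeline_tag=text-generation&sort=downloads&limit=20",
);
const models: { id: string; downloads: number }[] = await res.json();
for (const m of models) console.log(`${m.id}\t${m.downloads}`);
```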

Discovered but not cataloguable today

Two notable mentions sit outside the catalog as of March 2026 but are worth flagging.

The first is Pleias, a French and EU lab with strong focus on multilingual training data and recipes. Pleias does not actively ship ONNX builds and is primarily focused on training-data curation and methodology rather than inference deployment, so its models are not in the browser catalog. If Pleias pivots toward inference deployment, the cohort gains a strong European entry.

The second is Apple OpenELM (April 2024). The original release covered 270M to 3B parameters under a permissive sample-code license that allowed commercial use, but Apple did not ship browser or ONNX builds, and OpenELM has stayed a research artifact rather than a production catalog member[15].

Apple's silicon is the dominant browser-LLM target on Mac (the GPT-OSS 20B 60-tok/s benchmark runs on M4 Pro Max), but the company's own model releases are not part of the cohort's deployment-focused work.

What actually runs in a browser today

The trimmed catalog below covers the representative working set as of March 2026. The full Hugging Face org listings have many additional variants (instruct versus base, plus quantized versions) but the table captures the cross-section a practitioner can rely on.

| Model | Repo (primary) | Params | Provider | ONNX via onnx-community | License | Tier |
|---|---|---|---|---|---|---|
| SmolLM3-3B | HuggingFaceTB/SmolLM3-3B | 3B | HuggingFace TB | v4-supported | Apache 2.0 | mid |
| LFM2.5-1.2B-Instruct | LiquidAI/LFM2.5-1.2B-Instruct | 1.2B | Liquid AI | First-party (LiquidAI/*-ONNX) | LFM Open v1.0 | low-mid |
| LFM2-1.2B | onnx-community/LFM2-1.2B-ONNX | 1.2B | Liquid AI | Yes | LFM Open v1.0 | low-mid |
| Granite 4.0-Micro | onnx-community/granite-4.0-micro-ONNX-web | 3B | IBM | Yes (web variant) | Apache 2.0 | mid |
| Granite 4.0-H-Tiny | (community ONNX in progress) | 7B / 1B active MoE | IBM | Pending | Apache 2.0 | mid-high |
| Phi-3.5-mini | Xenova/Phi-3-mini-4k-instruct | 3.8B | Microsoft | Yes (legacy onnx-web) | MIT | mid |
| Phi-4-mini | microsoft/Phi-4-mini-instruct | 3.8B | Microsoft | Partial (transformers.js v4 in progress) | MIT | mid |
| GPT-OSS 20B | onnx-community/gpt-oss-20b-ONNX | 21B / 3.6B active MoE | OpenAI | Yes | Apache 2.0 | high (MoE) |
| Ministral 3-3B | mistralai/Ministral-3-3B-Instruct-2512-ONNX | 3B | Mistral | First-party | Apache 2.0 | mid |
| Ministral 3-8B | (HF; ONNX status TBD) | 8B | Mistral | Pending | Apache 2.0 | mid-high |
| Qwen3-1.7B | onnx-community/Qwen3-1.7B-ONNX | 1.7B | Alibaba | Yes | Apache 2.0 | low-mid |
| Qwen3-4B | onnx-community/Qwen3-4B-ONNX | 4B | Alibaba | Yes | Apache 2.0 | mid |
| Qwen3-8B | onnx-community/Qwen3-8B-ONNX | 8B | Alibaba | Yes | Apache 2.0 | mid-high |
| Qwen3-14B | onnx-community/Qwen3-14B-ONNX | 14B | Alibaba | Yes | Apache 2.0 | high |
| Qwen3.5-2B | onnx-community/Qwen3.5-2B-ONNX | 2B | Alibaba | Yes | Apache 2.0 | low-mid |
| Qwen3.5-4B | onnx-community/Qwen3.5-4B-ONNX | 4B | Alibaba | Yes | Apache 2.0 | mid |
| Qwen3-4B-VL | onnx-community/Qwen3-4B-VL-ONNX | 4B | Alibaba | Yes (multimodal-text) | Apache 2.0 | mid |
| DeepSeek-R1-Distill-Qwen-1.5B | onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX | 1.5B | DeepSeek (Qwen base) | Yes | MIT | low-mid |
| Falcon-H1-Tiny | onnx-community/Falcon-H1-Tiny-90M-Instruct-ONNX | 90M | TII | Yes (sub-billion) | Falcon LLM (Apache-based) | low |
| Apertus-8B-Instruct | swiss-ai/Apertus-8B-Instruct-2509 | 8B | swiss-ai | v4-supported | Apache 2.0 | mid-high |
| OLMo 3 7B | allenai/OLMo-3-7B | 7B | AllenAI | v4-supported | Apache 2.0 | mid-high |
| BitNet b1.58 2B4T | microsoft/bitnet-b1.58-2B-4T | 2B | Microsoft Research | Exportable via optimum-cli; no pre-built community | MIT | low-mid (1.58-bit) |
| LFM2.5-VL-450M | onnx-community/LFM2.5-VL-450M-ONNX | 450M (multimodal-text) | Liquid AI | Yes | LFM Open v1.0 | low |

A few honest caveats.

The table omits Hunyuan dense models and Youtu-LLM 2B from Tencent, Jamba 2 3B from AI21, Cohere Command R7B, and the South Korean cohort entries because none had a confirmed onnx-community/ ONNX build at the time of writing. These are gaps, not denials. Several are likely to fill in over Q2-Q3 2026 as the v4-architecture support stabilizes and community ports follow. The catalog also explicitly excludes gated families: Llama 3.x (Meta), Gemma 3 and Gemma 4 (Google).

These are the three most-downloaded ONNX text-generation models on the Hub by aggregate, and excluding them is a real omission for a "what runs in the browser" picture; the rationale is that the open-source / no-gate distinction matters for the trajectory and cohort questions this paper is answering.

Architecture in this window

The architecture story of the 2025-2026 catalog was covered in depth elsewhere; the version that matters for this paper is shorter. Three trends shape what is actually shipped and what is feasible.

Hybrid Mamba plus attention is the new default at small scale.

IBM's Granite 4.0-H series uses Mamba-2 plus transformer in a 9:1 ratio, with the Mamba-2 blocks handling global context efficiently and the transformer blocks doing local-context parsing[2]. TII's Falcon-H1 uses a parallel hybrid design where the attention-to-SSM ratio is adjustable per model variant, claiming up to "4× speedup in input throughput and 8× in output throughput" on long contexts versus comparable pure-Transformer models[8]. AI21's Jamba 2 is hybrid Mamba-Transformer, with 256K context windows on the small variant; the Jamba Reasoning 3B handles 250K-token contexts and runs at 35 tok/s on a standard MacBook Pro[42][43]. Liquid AI's LFM2 is hybrid in a different sense: 10 short-convolution blocks with multiplicative gates plus 6 grouped-query-attention blocks per 16-block stack[22]. NVIDIA's Nemotron 3 Nano combines hybrid Mamba-Transformer with Mixture-of-Experts at 3.2B active and 31.6B total[28]. A pure attention-only Transformer is no longer the assumed architecture for new small-LLM releases.

Mixture-of-Experts at small scale moved from research to shipped.

transformers.js v4 added MoE as a new architecture family with the QMoE ORT contrib operator[7]. The browser headline is GPT-OSS 20B at q4f16 running at 60 tok/s on M4 Pro Max; the 3.6B active parameters keep it within reachable browser memory[3]. Liquid AI's own LFM2-8B-A1B-ONNX (8.3B total / 1.5B active MoE) marks the explicit upper edge: the LiquidAI model card itself states "This model is too large for WebGPU browser inference", restricted to Node, Deno, and Bun WebGPU instead[4]. Granite 4.0-H-Tiny at 7B / 1B active is the smallest hybrid-MoE that fits comfortably in browser memory budgets, though it has not landed as a community ONNX build at the time of writing. The browser-MoE ceiling is well-defined and the catalog respects it.
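
The ceiling is ultimately a per-tab memory question, and it can be probed before a model is picked. A minimal sketch using the standard WebGPU adapter-limits API; the weight-size arithmetic in the comments is an illustrative assumption, not a catalog rule.

```ts
// Probe the per-tab WebGPU budget before choosing a catalog tier.
const adapter = await navigator.gpu?.requestAdapter();
if (!adapter) throw new Error("WebGPU unavailable; fall back to WASM.");

// GPUAdapter.limits is the standard WebGPU API surface for these numbers.
const { maxBufferSize, maxStorageBufferBindingSize } = adapter.limits;
console.log({ maxBufferSize, maxStorageBufferBindingSize });

// Rough sizing: ~4-bit weights cost ~0.5 bytes per total parameter, so a
// 21B-total MoE like gpt-oss-20b needs on the order of 10+ GB of weights,
// while a 7B / 1B-active hybrid-MoE sits near 3.5 GB -- inside most budgets.
```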

Sub-4-bit quantization shipped.

Microsoft BitNet b1.58 2B4T was trained natively at 1.58-bit weights from scratch on 4 trillion tokens, MIT-licensed, with a 400 MB on-disk footprint and benchmark wins over Llama 3.2 1B, Gemma 3 1B, and Qwen 2.5 1.5B on GSM8K and PIQA[9][18]. ONNX export works via optimum-cli; no pre-built community ONNX ships with the runtime today, and the kernel-level support in transformers.js v4 plus ORT WebGPU for BitNet's specific quantization format is unverified, so calling BitNet "browser-runnable" carries real caveats. OpenAI's gpt-oss uses MXFP4 for 4-bit weights only on the MoE layers, keeping the non-MoE precision higher and reducing footprint without a uniform precision drop[3]. Tencent's Hunyuan dense models normalize multi-quant releases (FP8, GPTQ-Int4, AWQ-Int4) per parameter size[23]. PrismML's Bonsai 1-bit is end-to-end across embeddings, attention, MLP, and LM head, but again ships only GGUF and MLX, not ONNX[10]. The shift is real but uneven: the formats and tooling are converging on sub-4-bit as a first-class option, not just a research target.

Forward look: 2026 to 2027

The clearest forward-looking thesis comes from the NVIDIA Research position paper of June 2025:

Small language models are sufficient, suitable, and economical for most agentic-AI invocations, and are therefore the future of agentic AI[12]. The argument is structural rather than rhetorical. Agents "perform a small number of specialized tasks repetitively and with little variation", which is a fundamentally different workload from open-ended chat. Heterogeneous edge-plus-cloud architectures, where SLMs handle 90-95% of agent invocations and frontier LLMs handle the remaining 5-10% requiring broad knowledge, are projected to become the production default[13]. The Plan-and-Execute pattern, where a capable model creates a strategy that cheaper models execute, is reported to "reduce costs by 90% compared to using frontier models for everything."[13]
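
The economics of that split are easy to sanity-check. An illustrative blended-cost calculation, assuming a 20× per-call cost ratio (mid-range of the 10-30× estimate cited earlier) and the 90% routing share from the same analysis:

```ts
// Illustrative blended-cost arithmetic for the heterogeneous pattern.
const frontierCost = 1.0;          // normalized cost per frontier call
const slmCost = frontierCost / 20; // assumption: mid-range of the 10-30x estimate
const slmShare = 0.9;              // share of invocations routed to the SLM

const blended = slmShare * slmCost + (1 - slmShare) * frontierCost;
console.log(blended);                                      // 0.145
console.log(`${((1 - blended) * 100).toFixed(0)}% saved`); // ~86% saved
```

A ~86% saving under these round numbers is consistent with the "reduce costs by 90%" figure the pattern's proponents report.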

The hardware envelope is widening.

Edge AI Vision projects the edge-device count growing from 1.2 billion in 2024 to 2.5 billion in 2027[13]. NPU silicon is gaining capacity: Copilot+ baseline 40 TOPS, Snapdragon X2 Elite at 80 TOPS in January 2026, Intel Panther Lake expected at 50 TOPS for the NPU and 180 platform TOPS in mid-2026. The structural binding constraint, however, is not compute but bandwidth. The same Edge AI Vision analysis frames it precisely: "Phones didn't become GPUs. The field learned to treat memory bandwidth, not compute, as the binding constraint." Mobile NPU bandwidth runs 50-90 GB/s versus data-center GPU bandwidth at 2-3 TB/s, a 30-50× gap that defines the practical SLM-on-edge ceiling[13]. The sub-4-bit native-training story (BitNet, Bonsai) is the cleanest way to compress around the bandwidth limit, and the cohort is reading that signal: more native sub-2-bit releases through 2026-2027 are likely.
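
The bandwidth framing also yields a useful back-of-envelope ceiling: at batch size 1, every generated token reads the full weight set once, so decode speed is roughly bandwidth divided by weight bytes. An illustrative calculation under that simplification (KV-cache traffic ignored):

```ts
// Bandwidth-bound decode ceiling: tokens/s ~= bandwidth / weight bytes.
const tokPerSecCeiling = (bandwidthGBps: number, weightsGB: number) =>
  bandwidthGBps / weightsGB;

const weights1_2B_q4 = (1.2e9 * 0.5) / 1e9; // ~0.6 GB at ~4 bits per weight

console.log(tokPerSecCeiling(60, weights1_2B_q4));   // ~100 tok/s on a 60 GB/s mobile NPU
console.log(tokPerSecCeiling(2500, weights1_2B_q4)); // ~4,167 tok/s at data-center bandwidth
```

The ~40× spread between the two numbers is the 30-50× bandwidth gap expressed in tokens per second, which is why halving the bits per weight buys more on edge than adding TOPS does.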

Architecturally, the catalog suggests a convergence.

Hybrid Mamba plus attention is now the default at small scale; pure Transformer is the exception, not the assumption. MoE at sub-10B total is the new browser frontier. Multimodal models that emit text are arriving in browser-runnable sizes (Liquid AI's LFM2.5-VL-450M and 1.6B, Qwen3-4B-VL, Mistral's Ministral 3 family with image understanding native to every variant). Speculative decoding on edge offers reported 2-3× speedups via small-draft-plus-large-target patterns[13]. The combination of these trends suggests the 2027 catalog will look qualitatively different in a way the 2026 catalog already hints at: smaller, faster, more architecturally diverse, and routed against larger models rather than replacing them.

The geopolitical dimension is the wildcard:

Hugging Face's State of Open Source frames the question for 2026 directly: "Whether [GPT-OSS, OLMo, and Gemma's] efforts can match the adoption momentum of Qwen and DeepSeek will be a defining question of 2026."[6] The numbers behind the question are already striking. Qwen has more derivative models on the Hugging Face Hub than Google and Meta combined, with the Qwen family alone supporting over 113,000 derivative models, ballooning to over 200,000 when everything tagged Qwen is included. ByteDance and Tencent each "increased releases by eight to nine times" between 2024 and 2025. Baidu "went from zero releases on the Hub in 2024 to over 100 in 2025." National-sovereign initiatives (Switzerland's Apertus, the UAE's Falcon, South Korea's five-consortium program) are a new structural feature on top of the established lab cohort. The cohort is not consolidating; it is expanding, and the geographic distribution is shifting.

Closing

The picture in March 2026 is cleaner than the picture in late 2024 was.

The browser is now a real LLM runtime (the prior question, addressed elsewhere). The catalog is now a real catalog, with first-party releases from a multi-vendor cohort that did not exist eighteen months earlier. The labs participate. The trajectory is outward in two senses: more labs, broader geography. And the architectural diversity (hybrid SSM, MoE at small scale, native sub-4-bit) is the catalog's response to the hardware envelope it is shipping into. Where the runtime paper closed with the observation that the binding constraint had moved from runtime architecture to per-tab VRAM budget, this paper closes with the parallel observation: the binding constraint on what runs has moved from "is the model ported" to "which model fits which workload." That is the better problem to have.

The catalog still has gaps.

Tencent's Hunyuan dense models, AI21's Jamba 2 series, Cohere's Command R7B, and the South Korean cohort all lack confirmed onnx-community/ builds at the time of writing. PrismML ships GGUF and MLX but not ONNX. The IBM strategic-positioning quotes were sourced through industry reporting because the direct IBM announcement page was inaccessible during research; the quotes are accurate but the chain of attribution should be checked at any later citation. The 1.58-bit and 1-bit kernel support in the browser runtime is real but uneven, and the 1-bit cohort (BitNet, Bonsai) is not yet uniformly catalogued via transformers.js. These are honest gaps, not failures, and most are likely to fill in over Q2-Q3 2026.

The shift this paper documents is structural rather than incremental.

A practitioner reading the cohort table can identify, lab by lab, who is shipping what, where the gaps are, and what to expect next. That was not a thing one could do in late 2024.

What this means for Daneel

Daneel ships exactly this stack: @huggingface/transformers for inference, the WebGPU EP under it, ONNX models with external-data sidecars, and a model catalog curated against the lab cohort this paper maps. The professionalization shift shows up in concrete codebase choices.

The model catalog tiering in MODEL_CATALOG and modelSelector.ts is structurally aligned with the cohort table: small-class entries (LFM2, SmolLM, Granite Micro) cover the broad-compatibility tier, mid-class (Qwen3-4B, Phi-3.5-mini, Granite Micro 3B) cover the 8 GB-VRAM range, and the unified-memory Apple high-end tier carries the larger MoE variants where they fit. The catalog updates track the cohort: when IBM ships granite-4.0-micro-ONNX-web, when Liquid AI ships LiquidAI/LFM2.5-1.2B-Thinking-ONNX, when Mistral ships mistralai/Ministral-3-3B-Instruct-2512-ONNX, the registry pulls them in directly rather than waiting on community ports.
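
A hypothetical sketch of what that tiering looks like as data; the real MODEL_CATALOG and modelSelector.ts shapes in the Daneel codebase are not reproduced here and may differ.

```ts
// Hypothetical catalog-entry and tier-selection shapes (illustrative only).
type Tier = "small" | "mid" | "apple-high";

interface CatalogEntry {
  repo: string;                    // Hugging Face repo id
  totalParams: number;             // total parameters
  tier: Tier;
  dtype: "q4" | "q4f16" | "fp32";  // default quantization to load
}

const MODEL_CATALOG: CatalogEntry[] = [
  { repo: "onnx-community/LFM2-1.2B-ONNX", totalParams: 1.2e9, tier: "small", dtype: "q4f16" },
  { repo: "onnx-community/granite-4.0-micro-ONNX-web", totalParams: 3e9, tier: "mid", dtype: "q4f16" },
  { repo: "onnx-community/gpt-oss-20b-ONNX", totalParams: 21e9, tier: "apple-high", dtype: "q4f16" },
];

// Pick the broadest tier the detected hardware can carry.
function selectTier(vramGB: number, unifiedMemory: boolean): Tier {
  if (unifiedMemory && vramGB >= 24) return "apple-high"; // Apple unified-memory high end
  if (vramGB >= 8) return "mid";                          // the 8 GB-VRAM range
  return "small";                                         // broad-compatibility tier
}
```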

The Settings → Models panel in the extension surfaces the lab cohort live: the browser shows model cards from onnx-community/, LiquidAI/, mistralai/, microsoft/, and the rest of the orgs noted in the cohort table, with the same per-model evaluation logic across providers. This is the cohort table made interactive: a user picks a model, sees the lab attribution, license, hardware fit, and capability profile, then runs it.

Daneel inherits two of the broader catalog's gaps. The Tencent Hunyuan and AI21 Jamba 2 ONNX gaps mean the registry cannot include those families until the ports land. The Phi-4-mini transformers.js v4 partial-support gap means the registry currently anchors Microsoft's row on Phi-3.5-mini until v4 support is complete. These are tracked rather than worked around.

The architectural diversity question (which hybrid, which quantization tier) is the cleanest place where the runtime paper and this paper meet. The runtime paper said: VRAM is the new ceiling. This paper says: small models with hybrid architectures and aggressive quantization are how the cohort responds. Daneel's adoption of LFM2 prominently in the catalog, plus the q4f16 default with fp32 fallback in the LFM2 provider, reflects both observations.

Further reading inside Daneel: the model registry under shared/; the WebGPU and Ollama provider modules under src/providers/llm/ for the alternative-stack matrix; the Settings → Models panel for the per-tier observations users actually hit.

Footnotes

  1. transformers.js-examples GitHub repo. https://github.com/huggingface/transformers.js-examples

  2. Hugging Face onnx-community/granite-4.0-micro-ONNX-web. https://huggingface.co/onnx-community/granite-4.0-micro-ONNX-web

  3. Hugging Face, "Welcome GPT OSS, the new open-source model family from OpenAI!" 5 August 2025. https://huggingface.co/blog/welcome-openai-gpt-oss

  4. LiquidAI organization on Hugging Face. https://huggingface.co/LiquidAI

  5. Mistral AI, "Introducing Mistral 3." December 2025. https://mistral.ai/news/mistral-3

  6. Hugging Face, "State of Open Source on Hugging Face: Spring 2026." https://huggingface.co/blog/huggingface/state-of-os-hf-spring-2026

  7. Hugging Face, "Transformers.js v4." 9 February 2026. https://huggingface.co/blog/transformersjs-v4

  8. TII Falcon LM, "Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance." 20 May 2025. https://falcon-lm.github.io/blog/falcon-h1/

  9. microsoft/bitnet-b1.58-2B-4T model card. https://huggingface.co/microsoft/bitnet-b1.58-2B-4T

  10. PrismML, "Launches World's First 1-Bit AI Model to Redefine Intelligence at the Edge." 31 March 2026. https://prismml.com/news/prismml-launches-worlds-first-1-bit-ai-model

  11. ServiceNow, "Apriel 5B: Small but mighty enterprise language model." https://www.servicenow.com/blogs/2025/apriel-5b-small-enterprise-language-model

  12. Belcak et al., "Small Language Models are the Future of Agentic AI." NVIDIA Research, June 2025. https://research.nvidia.com/labs/lpr/slm-agents/

  13. Edge AI and Vision Alliance, "On-Device LLMs in 2026: What Changed, What Matters, What's Next." January 2026. https://www.edge-ai-vision.com/2026/01/on-device-llms-in-2026-what-changed-what-matters-whats-next/

  14. Hugging Face, "Transformers.js v3." 22 October 2024. https://huggingface.co/blog/transformersjs-v3

  15. Apple Machine Learning Research, "OpenELM: An Efficient Language Model Family with Open Training and Inference Framework." April 2024. https://machinelearning.apple.com/research/openelm

  16. deepseek-ai/DeepSeek-R1 model card. https://huggingface.co/deepseek-ai/DeepSeek-R1

  17. onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX. https://huggingface.co/onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX

  18. Microsoft BitNet inference framework GitHub. https://github.com/microsoft/BitNet

  19. onnx-community/Qwen3-1.7B-ONNX (representative of Qwen3 family). https://huggingface.co/onnx-community/Qwen3-1.7B-ONNX

  20. Hugging Face onnx-community/Falcon-H1-Tiny-90M-Instruct-ONNX. https://huggingface.co/onnx-community/Falcon-H1-Tiny-90M-Instruct-ONNX/blob/main/chat_template.jinja

  21. Hugging Face, "SmolLM3: smol, multilingual, long-context reasoner." 8 July 2025. https://huggingface.co/blog/smollm3

  22. Liquid AI, "Introducing LFM2: The Fastest On-Device Foundation Models on the Market." 10 July 2025. https://www.liquid.ai/blog/liquid-foundation-models-v2-our-second-series-of-generative-ai-models

  23. Tencent Hunyuan Dense Model Hugging Face collection. https://huggingface.co/collections/tencent/hunyuan-dense-model

  24. Hugging Face swiss-ai/Apertus LLM collection. https://huggingface.co/collections/swiss-ai/apertus-llm

  25. ETH Zurich, "Apertus: a fully open, transparent, multilingual language model." 2 September 2025. https://ethz.ch/en/news-and-events/eth-news/news/2025/09/press-release-apertus-a-fully-open-transparent-multilingual-language-model.html

  26. AllenAI (Ai2), "Olmo 3: Charting a path through the model flow to lead open-source AI." 20 November 2025. https://allenai.org/blog/olmo3

  27. "Olmo 3" preprint. https://arxiv.org/abs/2512.13961

  28. NVIDIA Newsroom, "NVIDIA Debuts Nemotron 3 Family of Open Models." December 2025. https://nvidianews.nvidia.com/news/nvidia-debuts-nemotron-3-family-of-open-models

  29. IBM, "Granite 4.0: Hyper-efficient, High Performance Hybrid Models for Enterprise." https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models

  30. IBM, "Granite 4.0 bets big on small models." Cited via secondary industry reporting; verify against direct IBM source before further citation. https://www.ibm.com/think/news/granite-4-bets-big-on-small-models

  31. tencent/Youtu-LLM-2B. https://huggingface.co/tencent/Youtu-LLM-2B

  32. transformers.js v4.0.0 release. https://github.com/huggingface/transformers.js/releases/tag/4.0.0

  33. onnx-community/Qwen3.5-2B-ONNX (representative of Qwen3.5 family). https://huggingface.co/onnx-community/Qwen3.5-2B-ONNX

  34. MarkTechPost, "Meet South Korea's LLM Powerhouses: HyperClova, AX, Solar Pro, and More." 21 August 2025. https://www.marktechpost.com/2025/08/21/meet-south-koreas-llm-powerhouses-hyperclova-ax-solar-pro-and-more/

  35. KED Global, "Naver, LG, SK, NC, Upstage named to build S.Korea's sovereign AI model." https://www.kedglobal.com/artificial-intelligence/newsView/ked202508040010

  36. Mintplex Labs prism-ml-llama.cpp GitHub. https://github.com/Mintplex-Labs/prism-ml-llama.cpp

  37. transformers.js issue #1460 (Phi-4 Mini support tracking). https://github.com/huggingface/transformers.js/issues/1460

  38. arXiv:2506.02153. https://arxiv.org/abs/2506.02153

  39. ServiceNow, "Apriel Model Family: Frontier Reasoning." https://www.servicenow.com/blogs/2025/apriel-model-family-frontier-reasoning

  40. TechCrunch, "How South Korea plans to best OpenAI, Google, others with homegrown AI." September 2025. https://techcrunch.com/2025/09/27/how-south-korea-plans-to-best-openai-google-others-with-homegrown-ai/

  41. Hugging Face onnx-community organization. https://huggingface.co/onnx-community

  42. AI21 Labs, "Introducing Jamba2." https://www.ai21.com/blog/introducing-jamba2/

  43. VentureBeat, "AI21's Jamba reasoning 3B redefines what 'small' means in LLMs." https://venturebeat.com/ai/ai21s-jamba-reasoning-3b-redefines-what-small-means-in-llms-250k-context-on