WebLLM Prompt Lab

Two language models side-by-side, running in the browser. Built so kids can poke at how AI works instead of just chatting with it.

WebGPU · WebLLM · RAG (bge-small) · IndexedDB · Vanilla JS · Firebase Hosting

Live at webllm-promptlab.web.app. Pick a model, watch it download once into IndexedDB, and from then on it runs offline on the student's own laptop. No API key, no data leaving the device. Two side-by-side panes let you run the same prompt against two models and compare the streamed answers.

WebLLM Prompt Lab desktop view, two panes ready to load models, header with Compare/Sync/New/Knowledge/Cache/Export buttons
Desktop view. Each pane has its own model, system prompt, generation settings, and Knowledge toggle.

Why build this

I was working on a project that needed something students could experiment with privately in their after-school program. The hosted assistants (ChatGPT, Claude, Gemini) are slick, but they're also opaque: you don't see the model, you don't see the prompt template, you don't see what knowledge the model "has".

Running the model in the browser via WebGPU makes the pipeline visible and keeps every prompt on the student's device. The student loads weights they can see in their browser's storage. They edit a system prompt and watch the answer change. They turn retrieval on and off and watch the same model get smarter or dumber on the same question.

How it works

The inference runtime is WebLLM from MLC, a TVM-compiled WebGPU stack that ships quantized transformer weights to the browser and runs generation there. Each pane has its own Web Worker hosting a WebLLM engine, so generation in pane A doesn't block pane B's UI.
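
A minimal sketch of that wiring, following web-llm's documented worker API (the per-pane UI glue is omitted):

// worker.js — hosts one WebLLM engine per worker
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg) => handler.onmessage(msg);

// main thread — spin up an engine for a pane
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateWebWorkerMLCEngine(
  new Worker(new URL("./worker.js", import.meta.url), { type: "module" }),
  "Llama-3.2-3B-Instruct-q4f16_1-MLC",
  { initProgressCallback: (report) => console.log(report.text) } // download/compile progress
);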

The model list is curated down to 10 featured models with friendly names and one-line taglines ("Best all-around for Chromebooks," "Loves structured prompts. Great for RAG."), but the full ~85-model WebLLM catalog is one tab click away in the picker. Each tile shows tier (Quick / Balanced / Smart), size in MB or GB, last-used timestamp, and whether the weights are already cached locally.
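
The featured list is just data. Here's one plausible shape for a tile (field names are my guesses; the id is a real WebLLM catalog model_id, and web-llm's hasModelInCache helper can back the cached badge):

const FEATURED = [
  {
    modelId: "Llama-3.2-3B-Instruct-q4f16_1-MLC", // raw WebLLM catalog id
    name: "Llama 3.2 3B",
    tagline: "Best all-around for Chromebooks",
    tier: "Balanced", // Quick | Balanced | Smart
    sizeMB: 1900,
  },
  // …nine more; last-used timestamps live in local storage, and
  // hasModelInCache(modelId) fills in the cached badge at render time.
];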

Model picker dialog showing 10 featured models as tiles
The picker replaces a raw <select> of cryptic filenames like Llama-3.2-3B-Instruct-q4f16_1-MLC with friendly names, taglines, and cached badges.

The Knowledge panel: client-side RAG!

Drop in a text file, Markdown, PDF, or Word doc and the page:

  1. Chunks the text hierarchically (parent paragraphs ~1500 chars, child sentences ~200 chars with one-sentence overlap)
  2. Embeds every chunk with Xenova/bge-small-en-v1.5 running in a Web Worker via transformers.js (sketched just after this list)
  3. Stores text + embeddings in IndexedDB so they survive page reloads
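
A sketch of step 2 using transformers.js's pipeline API; embed is my naming, and the chunking and IndexedDB plumbing are elided:

import { pipeline } from "@xenova/transformers";

// Created once and reused; in the real app this lives in a Web Worker.
const embedder = await pipeline("feature-extraction", "Xenova/bge-small-en-v1.5");

async function embed(text) {
  // Mean-pool and L2-normalize, so cosine similarity reduces to a dot product.
  const out = await embedder(text, { pooling: "mean", normalize: true });
  return Array.from(out.data); // Float32Array -> plain array for storage and scoring
}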

Each pane has a per-pane "Use knowledge" toggle. When a student sends a message, the page embeds the query once, retrieves top-k unique parent passages by cosine similarity, and folds them into the system message wrapped in <reference><doc> tags. The user message stays clean (small models get confused by "Context:" headers that look like document continuations).
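
Retrieval in the same spirit, reusing embed from the sketch above (chunk fields, helper names, and the dedupe-to-parent step are assumptions based on the description here):

// Embeddings are normalized, so cosine similarity is a plain dot product.
const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

async function retrieve(query, chunks, k = 4, threshold = 0.55) {
  const q = await embed(query);
  const scored = chunks
    .map((c) => ({ ...c, score: dot(q, c.embedding) }))
    .sort((a, b) => b.score - a.score);
  const best = scored[0]?.score ?? 0;
  if (best < threshold) return { passages: [], best }; // drives the "no strong match" footer
  // Collapse child sentences to their parent paragraphs, keep top-k unique parents.
  const parents = [...new Map(scored.map((c) => [c.parentId, c.parentText])).values()];
  return { passages: parents.slice(0, k), best };
}

// Folded into the system message, never the user turn.
const withReferences = (systemPrompt, passages) =>
  systemPrompt +
  "\n<reference>\n" +
  passages.map((p) => `<doc>${p}</doc>`).join("\n") +
  "\n</reference>";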

Knowledge dialog with the sample stick-insect doc loaded
Knowledge dialog with the bundled sample document. VIEW shows the raw text that was chunked. REMOVE deletes it from IndexedDB.

The side-by-side demo: leave pane A's Use knowledge off, turn pane B's on, load the sample doc (the true-but-obscure story of the Lord Howe Island stick insect's 2001 rediscovery on Ball's Pyramid), then ask both panes "where was the Lord Howe Island stick insect rediscovered?" Pane A confabulates, pane B answers correctly. Same model, same temperature, same prompt. The only variable is retrieval.

The match threshold is exposed as a per-pane slider (default 55% cosine similarity). When a question doesn't clear the bar, the response gets a small dashed footer: "Searched knowledge · best match 42% (below 55% threshold)". Kids can see retrieval ran, see why it didn't fire, and move the slider if they disagree.

Engine sharing for memory-constrained devices

An early version of the app ran two isolated workers, one per pane. That meant picking the same model in both panes loaded the weights into VRAM twice. Fine on a developer laptop, fatal on a Chromebook with 4 GB of RAM.

The fix was a small registry:

const engineRegistry = new Map(); // modelId -> { engine, worker, refCount }

Each pane holds a reference. Picking the same model in both panes shares one engine and one copy of the weights. Picking different models gives each pane its own worker. When a pane switches away, the refcount decrements, and the underlying engine is unloaded only when nothing's holding it.
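
Acquire/release around that registry looks roughly like this (a sketch; the only web-llm calls are CreateWebWorkerMLCEngine and engine.unload(), the rest is my naming):

async function acquireEngine(modelId) {
  let entry = engineRegistry.get(modelId);
  if (!entry) {
    const worker = new Worker(new URL("./worker.js", import.meta.url), { type: "module" });
    const engine = await CreateWebWorkerMLCEngine(worker, modelId);
    entry = { engine, worker, refCount: 0 };
    engineRegistry.set(modelId, entry);
  }
  entry.refCount += 1;
  return entry.engine;
}

async function releaseEngine(modelId) {
  const entry = engineRegistry.get(modelId);
  if (!entry || --entry.refCount > 0) return;
  await entry.engine.unload(); // frees the weights in VRAM; the IndexedDB cache stays
  entry.worker.terminate();
  engineRegistry.delete(modelId);
}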

Generation is already serialized between panes (one WebGPU device, one timeline), so sharing the engine costs nothing. A defensive engine.resetChat() runs between calls when the previous caller was a different pane, so KV-cache state doesn't leak between conversations.

Mobile (it works, barely)

The app runs on phones. Not the ideal experience, but it works, which matters when students don't all have laptops at home.

Mobile portrait view with two stacked panes and a fitted send button
On phones the picker collapses to a single column, the brand wordmark wraps, and the Send button stretches to match the textarea height (44 px minimum tap target).

A few mobile-specific tweaks:

  • Fresh phone visits default to the two smallest models (SmolLM2 360M for A, Qwen 2.5 0.5B for B) instead of a 3B Llama, because 3B is brutal on a phone (sketched after this list).
  • The model picker tiles get a "Slow on phone" pill for anything >1 GB on phone-sized viewports.
  • A one-time advisory banner: "On a phone? These models run…barely (or not at all!) Phone browsers are more limited than laptops. Stick to Quick-tier models, or open this site on a Chromebook or other computer for the full experience."
  • Textareas use a 16 px font so iOS Safari doesn't auto-zoom on focus.
  • Header buttons are 36 × 44 (WCAG 2.5.8 allows the narrower width with adequate spacing, and vertical tap accuracy is what really matters on phones).
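
The phone default is about as simple as it sounds. A sketch, with the breakpoint, storage key, setter, and exact catalog IDs all assumptions:

// Fresh visit on a phone-sized viewport? Default both panes to Quick-tier models.
const isPhone = matchMedia("(max-width: 640px)").matches; // breakpoint assumed
if (isPhone && !localStorage.getItem("paneModels")) {
  setPaneModel("A", "SmolLM2-360M-Instruct-q4f16_1-MLC"); // setPaneModel is hypothetical
  setPaneModel("B", "Qwen2.5-0.5B-Instruct-q4f16_1-MLC");
}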

Accessibility (it's school software)

This needs to work in a public-school setting, so I did a proper accessibility pass:

  • :focus-visible rings on every interactive element (cyan default; violet inside pane B's context).
  • Skip link to the message input as the first focusable element on the page.
  • role="banner" / "main" / "contentinfo" landmarks; dialogs have aria-labelledby pointing at their titles.
  • All decorative Lucide SVG icons are auto-stamped with aria-hidden="true" and focusable="false"; the surrounding button has the accessible name (see the sketch after this list).
  • RAG threshold sliders use aria-labelledby; the percentage badge next to them is aria-hidden so screen readers don't double-announce.
  • prefers-reduced-motion rule kills the pulse / blink / indeterminate-bar animations.
  • forced-colors: active rule keeps structural borders visible in Windows High Contrast mode.
  • Muted text color darkened from #6B7A92 to #5A6982 to clear WCAG AA contrast on the icy-blue background.
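
The icon stamp is a few lines. A sketch, assuming Lucide's icons have been rendered as inline <svg> elements inside their labeled buttons:

// Decorative icons: hide from the accessibility tree; the button keeps the name.
document.querySelectorAll("button svg, a svg").forEach((svg) => {
  svg.setAttribute("aria-hidden", "true");
  svg.setAttribute("focusable", "false"); // legacy IE/Edge otherwise tab-stop on SVGs
});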

Verified at 375×667 via Playwright: zero overflowing elements, zero unlabeled visible inputs, all SVGs aria-hidden, all 30 buttons have accessible names.

Next up: classroom model mirror

If school wifi can't handle 30 students each pulling 2 GB the first time, the plan is a Pi 5 8 GB with a USB SSD plugged into the classroom switch. The Pi doesn't run inference (the Chromebooks do that), it just mirrors the model weights from HuggingFace once and serves them locally with Access-Control-Allow-Origin: *. About $130 in hardware, one afternoon of setup. The app already supports an appConfig.model_list override, so the change to point at a Pi instead of HF is roughly 30 lines.
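
web-llm already exposes the hook for this: spread its exported prebuiltAppConfig and rewrite the weight URLs (the mirror origin here is hypothetical):

import { prebuiltAppConfig } from "@mlc-ai/web-llm";

const MIRROR = "http://pi.local:8080"; // classroom Pi, hypothetical address

const appConfig = {
  ...prebuiltAppConfig,
  model_list: prebuiltAppConfig.model_list.map((m) => ({
    ...m,
    // Weight shards now come from the Pi; model_lib (the compiled wasm)
    // can be rewritten the same way if the Pi mirrors it too.
    model: m.model.replace("https://huggingface.co", MIRROR),
  })),
};

// Then pass { appConfig } in the engineConfig argument of CreateWebWorkerMLCEngine.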

Insights

Small models hate clever prompts

My first RAG injection wrote "Context:\n[1] doc\n…" into the user message. TinyLlama echoed back garbled "Preferring Context: [Context, followed by another question…" Switching the injection to <reference><doc> tags inside the system message (never touching the user turn) fixed it across every model.

"Grounded" should be honest

First version badged every response "grounded · N sources" even when the retrieved chunks were 45% similar (i.e., not relevant). The badge was lying. Now the threshold gates whether grounding fires, and below-threshold queries get a separate "searched · no strong match" footer instead.

WCAG ≠ iOS HIG

Slapping min-width: 44px on every header button made six icons plus the brand wordmark overflow a 375 px viewport, causing horizontal scroll. WCAG 2.5.8 permits narrower targets with adequate spacing, and a full 44 px of height is what really matters for tap accuracy on phones.

white-space: pre-wrap hates HTML

User bubbles need pre-wrap to preserve their newlines. Assistant bubbles (rendered HTML from marked) get a literal \n between every block element. Under pre-wrap, those newlines render as visible blank lines, producing ridiculous gaps between paragraphs and lists. Scoping pre-wrap to .msg.user only fixed it.