Building an LLM from Scratch

I want to be able to read a frontier paper and know which choices are load-bearing and which are taste. The only way I've found that works is to rebuild the floor.

There's a second, slower motivation. No language model actually works for Zazakî, the language I grew up around in Basel. The big foundation models hallucinate it, and no one is going to train a small one for free. If I want it to exist, I need to know how to build one. This project is the warm-up.

Chapter 1 — tokens

Text doesn't reach the model as text. It arrives as integers. The piece in between is the tokenizer.
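
To make "text becomes integers" concrete, here is the crudest tokenizer I can think of: raw UTF-8 bytes. This is only an illustration, not the scheme used in the rest of the chapter, and `bytes_encode` / `bytes_decode` are names invented for this sketch. Every string maps to integers in the range 0..255 and decodes back losslessly.

```python
def bytes_encode(text: str) -> list[int]:
    """Map text to integer ids using raw UTF-8 bytes (vocabulary size 256)."""
    return list(text.encode("utf-8"))


def bytes_decode(ids: list[int]) -> str:
    """Inverse of bytes_encode."""
    return bytes(ids).decode("utf-8")


# "\u00ee" is the precomposed letter î, written as an escape to avoid
# any ambiguity about Unicode normalisation.
ids = bytes_encode("Zazak\u00ee")
print(ids)                # [90, 97, 122, 97, 107, 195, 174]
print(bytes_decode(ids))  # Zazakî
```

Six letters turn into seven integers, because the non-ASCII letter costs two bytes.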

Most modern LLMs use byte-pair encoding (BPE) or a close variant of it. BPE is a subword scheme that starts from single characters (or raw bytes) and repeatedly merges the most frequent adjacent pair into one token. Frequent words become single tokens, rare words decompose into pieces, and the vocabulary stays bounded.

The training loop for BPE itself is small enough to fit on a screen:

from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    # `corpus` is a list of pre-split words, not raw documents.
    # Start from characters. Each word is a tuple of symbols, closed by an
    # end-of-word marker, so we can track adjacency cheaply.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in Counter(corpus).items()}
    merges: list[tuple[str, str]] = []

    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq

        if not pairs:
            break

        best = pairs.most_common(1)[0][0]
        merges.append(best)
        vocab = {_merge(symbols, best): freq for symbols, freq in vocab.items()}

    return merges


def _merge(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    a, b = pair
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

The bit that matters: pairs.most_common(1) is the entire algorithm. You repeatedly pick the most frequent adjacent pair and merge it. Everything else is bookkeeping.
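
Training only produces the merge list. To tokenize new text, you replay those merges over each word in the order they were learned. A minimal sketch, assuming the same `</w>` end-of-word marker as above. `encode_word` and `example_merges` are names invented for illustration, and the merge list is written by hand rather than taken from a real training run.

```python
def encode_word(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Segment one word by replaying merges in the order they were learned."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            # Fuse every adjacent occurrence of (a, b) into one symbol.
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merge list, hand-written for this example.
example_merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("er", "</w>")]

print(encode_word("lower", example_merges))  # ['low', 'er</w>']
print(encode_word("low", example_merges))    # ['low', '</w>']
```

Scanning the whole merge list for every word is slow, but it keeps the logic obvious. Mapping the resulting symbols to integer ids is then a plain dictionary lookup.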

There are alternatives (WordPiece, unigram language models). I'll come back to these when we discuss multilingual tokenization, because tokenization for low-resource languages is a different problem.

Chapter 2 — attention

Attention is the single mechanism that makes the whole thing work. Each token attends to the other tokens in the sequence through weights computed from learned projections.

The naive formulation is straightforward. Where it gets interesting is when you stack several heads in parallel — each one learning a different relationship — and then mask out the future positions during training so the model can't cheat.

import torch
import torch.nn.functional as F
from torch import nn

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, block_size: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        # Lower-triangular mask: position t can only attend to <= t.
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
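        B, T, C = x.shape  # batch, sequence length, d_model
        assert T <= self.mask.size(-1), "sequence longer than block_size"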
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (B, T, C) → (B, n_heads, T, d_head)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        att = (q @ k.transpose(-2, -1)) / self.d_head**0.5
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)

        y = att @ v                                  # (B, n_heads, T, d_head)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)

A useful frame: the attention block is the only part of a transformer where tokens interact. Every other layer operates on each position independently. That's what makes attention the lever.
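
To see the mask at work without any framework in the way, here is a stripped-down sketch in plain Python: one head, no learned projections, lists instead of tensors. `causal_attention` is a name invented for this example. Rather than filling future scores with -inf, it skips them, which gives the same softmax weights. The check at the end changes only the last token and confirms that earlier positions never notice.

```python
import math


def causal_attention(q, k, v):
    """Single-head causal attention on plain lists.

    q, k, v hold one vector (list of floats) per position.
    Returns one output vector per position.
    """
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Position t scores positions 0..t only; this is the causal mask.
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        exps = [math.exp(x - m) for x in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the weighted average of the visible value vectors.
        out.append([
            sum(w * v[s][j] for s, w in enumerate(weights))
            for j in range(len(v[0]))
        ])
    return out


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
before = causal_attention(q, k, v)

# Change only the last token's key and value.
k2 = k[:2] + [[9.0, -9.0]]
v2 = v[:2] + [[-7.0, 7.0]]
after = causal_attention(q, k2, v2)

print(before[:2] == after[:2])  # True: earlier positions never saw the change
print(before[2] == after[2])    # False: only the last position is affected
```

The same perturbation test is a handy check on the PyTorch module above: change a late token and compare the outputs at earlier positions.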

Chapter 3 — training loop

Training is unromantic. You assemble batches, compute loss, backprop, step the optimizer. It's almost all boilerplate; the magic is in the data and the hyperparameters.

def train_step(model, optimizer, batch):
    x, y = batch  # y is x shifted left by one: y[t] is the token after x[t]
    logits = model(x)
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        y.view(-1),
        ignore_index=-100,
    )
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
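
The `x, y` pair in that step comes from slicing the token stream twice, one position apart. A minimal sketch in plain Python; `make_example` and the token ids are invented for illustration, and a real loader would stack many such pairs into (B, T) integer tensors.

```python
def make_example(
    token_ids: list[int], start: int, block_size: int
) -> tuple[list[int], list[int]]:
    """Cut one (input, target) pair out of a token stream.

    The target is the input shifted left by one, so y[t] is the token
    that follows x[t] in the stream.
    """
    x = token_ids[start : start + block_size]
    y = token_ids[start + 1 : start + block_size + 1]
    return x, y


stream = [10, 11, 12, 13, 14, 15]  # hypothetical token ids
x, y = make_example(stream, start=1, block_size=4)
print(x)  # [11, 12, 13, 14]
print(y)  # [12, 13, 14, 15]
```

Every position in the block is a training example: the model sees the tokens up to position t and is scored on predicting y[t].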

I start with a small model (~12M parameters) on a small corpus to keep the loop tight. Once it makes coherent next-token predictions on the tiny domain, I know the plumbing is right. Then scale.
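
One cheap plumbing check before any training: if a freshly initialised model produces logits that are close to uniform (an assumption, though usually a fair one), the first loss should land near ln(vocab_size). The sketch below shows the arithmetic in plain Python. `cross_entropy` here is a list-based stand-in written for this example, not `F.cross_entropy`, and the vocabulary size is a made-up number.

```python
import math


def cross_entropy(logits: list[float], target: int) -> float:
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


vocab_size = 512  # hypothetical
uniform_logits = [0.0] * vocab_size

# With uniform logits every token has probability 1 / vocab_size,
# so the loss is ln(vocab_size) whatever the target is.
print(cross_entropy(uniform_logits, target=3))  # about 6.24
print(math.log(vocab_size))                     # about 6.24
```

A first logged loss far from that value usually points to a problem in the data pipeline or the target shift rather than in the model itself.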

Chapter 4 — what's next

Once the base model trains end-to-end, the interesting choices begin: positional embeddings (sinusoidal vs RoPE), layer norm placement (pre vs post), activation choice (GELU vs SwiGLU), and the data work — which is where most of the real performance comes from.

The final ambition for this project is the Zazakî model: a tokenizer that respects the dialect splits, a dataset built from the platform translations, a small transformer trained on it, and evaluation that goes through native speakers, not BLEU.

That's a long road. The chapters above are the road's first 100 metres.