How LLMs Actually Work

Original: How LLMs work

A practical walkthrough of transformer-based LLM internals, from tokens and embeddings to attention, FFNs, and next-token generation.

The article explains how modern LLMs convert text into token IDs, embeddings, and position-aware vectors before passing them through stacked transformer blocks. It covers attention, multi-head attention, the KV cache, grouped-query attention (GQA), feed-forward networks (FFNs), mixture-of-experts (MoE), residual streams, normalization, and decoding. Its goal is educational: to help readers understand the architecture shared by many current model families and to read model cards and papers more confidently.

"How LLMs Actually Work" is a tutorial on LLM architecture aimed at beginner-to-intermediate readers. Its central theme is that most modern LLMs belong to the Transformer family and are built by stacking the same transformer block many times, so understanding that one mechanism gives you the high-level skeleton shared by models such as GPT, Claude, Gemini, and LLaMA. The article begins with tokenization, explaining that the model does not read text directly; it reads the integer token IDs produced by the tokenizer. This also explains why LLMs sometimes make mistakes on tasks such as counting letters: they process subword tokens rather than the individual characters a human sees. The article then introduces how the embedding matrix converts token IDs into high-dimensional vectors, noting that semantically similar tokens end up at nearby positions in the vector space, a structure that emerges from training rather than being hard-coded by hand.
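
The tokenization and embedding steps described above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the article's code or any real tokenizer: the vocabulary `VOCAB`, the greedy longest-match rule in `tokenize`, the 4-dimensional size `EMBED_DIM`, and the randomly initialized `EMBEDDINGS` table are all hypothetical. Real tokenizers learn vocabularies of tens of thousands of subwords, and real embedding values come from training.

```python
import random

# Hypothetical toy vocabulary; real tokenizers learn theirs from data.
VOCAB = {"straw": 0, "berry": 1, "s": 2, "t": 3, "r": 4,
         "a": 5, "w": 6, "b": 7, "e": 8, "y": 9}


def tokenize(text):
    """Split text into token IDs by greedy longest match against VOCAB."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


# Hypothetical embedding matrix: one row of EMBED_DIM floats per token ID.
# Random here; in a trained model these rows are learned parameters.
EMBED_DIM = 4
rng = random.Random(0)
EMBEDDINGS = [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
              for _ in range(len(VOCAB))]


def embed(token_ids):
    """Look up one embedding row per token ID."""
    return [EMBEDDINGS[t] for t in token_ids]


token_ids = tokenize("strawberry")
vectors = embed(token_ids)
print(token_ids)                      # [0, 1]
print(len(vectors), len(vectors[0]))  # 2 4
```

In this toy, the ten-character word "strawberry" reaches the model as only two IDs, which hints at why a question like "how many r's are in this word?" is harder than it looks: the individual letters are not directly visible in the input.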

Summaries are AI-generated; the original article is authoritative.