What are Large Language Models?

After this lesson, you will be able to explain how LLMs work: tokens, transformers, context windows, and why they seem intelligent.

Large Language Models predict the next token given the previous tokens. That simple task, done at scale with the transformer architecture, produces systems like ChatGPT. This lesson demystifies what's under the hood.

Prerequisites: Neural Networks and Deep Learning

Tokens, not words

LLMs see text as tokens: sub-word units. 'Hello world' might be 2 or 3 tokens, depending on the tokenizer. A long, rare word like 'antidisestablishmentarianism' is split into many. Tokens are the atomic unit of LLM input and output, and they are what you pay for in API pricing.
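To make the idea of sub-word units concrete, here is a minimal sketch of a tokenizer. It is an illustration only: the vocabulary `TOY_VOCAB` and the function `toy_tokenize` are hypothetical, and the greedy longest-match rule stands in for the learned schemes (such as byte-pair encoding) that production tokenizers use, so real token counts will differ.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
TOY_VOCAB = ["Hello", " world", "anti", "dis", "establish", "ment", "arian", "ism"]


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into sub-word pieces by greedy longest match.

    Any character not covered by the vocabulary becomes its own token.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Longest vocabulary entry starting at position i, else the single character.
        piece = max(
            (p for p in vocab if text.startswith(p, i)),
            key=len,
            default=text[i],
        )
        tokens.append(piece)
        i += len(piece)
    return tokens


print(toy_tokenize("Hello world"))
# ['Hello', ' world']
print(toy_tokenize("antidisestablishmentarianism"))
# ['anti', 'dis', 'establish', 'ment', 'arian', 'ism']
```

Note that the model only ever receives the pieces, not the letters inside them, and that the length of the returned list is the quantity an API would bill for.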

How an LLM generates

1. Your prompt is tokenized.
2. Tokens flow through transformer layers.
3. Last layer outputs probabilities over the vocabulary.
4. Sample one token (greedy, top-k, top-p).
5. Append to prompt, repeat for next token.
6. Stop at end-of-sequence or max tokens.
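The loop above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any particular model is implemented: `VOCAB` is a made-up five-entry vocabulary, and `toy_logits` is a hypothetical stand-in for the transformer layers that simply favors the next entry in the vocabulary. Only the shape of the loop (score, convert to probabilities, pick, append, stop) mirrors the steps.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "down", "<eos>"]  # hypothetical tiny vocabulary
EOS_ID = VOCAB.index("<eos>")


def toy_logits(token_ids):
    """Stand-in for the transformer: one score per vocabulary entry.

    Assumption for illustration: it favors the entry after the last token.
    """
    favored = (token_ids[-1] + 1) % len(VOCAB)
    return [3.0 if i == favored else 0.0 for i in range(len(VOCAB))]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(probs, strategy="greedy", k=2, p=0.9, rng=random):
    """Choose the next token id with greedy, top-k, or top-p selection."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if strategy == "greedy":
        return ranked[0]  # always the most likely token
    if strategy == "top_k":
        keep = ranked[:k]  # the k most likely tokens
    elif strategy == "top_p":
        keep, mass = [], 0.0
        for i in ranked:  # smallest set whose probability mass reaches p
            keep.append(i)
            mass += probs[i]
            if mass >= p:
                break
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return rng.choices(keep, weights=[probs[i] for i in keep])[0]


def generate(prompt_ids, max_new_tokens=10, strategy="greedy"):
    """Append one token at a time until end-of-sequence or the token limit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(ids))
        next_id = pick_token(probs, strategy)
        ids.append(next_id)
        if next_id == EOS_ID:
            break
    return ids


prompt_ids = [VOCAB.index(word) for word in "the cat".split()]  # step 1: tokenize
print([VOCAB[i] for i in generate(prompt_ids)])
# ['the', 'cat', 'sat', 'down', '<eos>']
```

Greedy selection is deterministic, which is why the output above is fixed. With `strategy="top_k"` or `strategy="top_p"` the same loop can produce different continuations on each run.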

Context window

The context window is the total number of tokens the model can see at once (prompt + response). GPT-4 Turbo: 128K. Claude: 200K. Gemini 1.5: 1M+. Larger windows let you fit longer documents in one shot, at the cost of more compute per token.
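Because prompt and response share the same window, the arithmetic is simple budgeting. The sketch below shows one way to reason about it. The function names `response_budget` and `truncate_to_fit` are hypothetical helpers, not part of any provider's API, and dropping the oldest tokens is just one possible truncation policy.

```python
def response_budget(context_window, prompt_tokens):
    """Tokens left for the response after the prompt occupies the window."""
    return max(0, context_window - prompt_tokens)


def truncate_to_fit(token_ids, context_window, response_tokens):
    """Keep the most recent prompt tokens so prompt + response fits the window."""
    room = context_window - response_tokens
    if room <= 0:
        raise ValueError("response_tokens leaves no room for a prompt")
    return token_ids[-room:]


# A 120K-token prompt in a 128K window leaves 8K tokens for the answer.
print(response_budget(128_000, 120_000))
# 8000

# With a window of 8 and 3 tokens reserved for the reply, only 5 prompt tokens fit.
print(truncate_to_fit(list(range(10)), context_window=8, response_tokens=3))
# [5, 6, 7, 8, 9]
```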

Quick Check

Why are LLMs bad at counting characters?

Think about how text is represented inside the model.

They process text one character at a time and lose count over long inputs.
They see tokens, not characters. Words can be one or more tokens, so individual letters aren't atomic to the model.
Their training data didn't include enough spelling examples.
They round all integer outputs to the nearest 10.
