After this lesson, you will be able to name the major models and the companies behind them, and evaluate which model to use when.
The LLM landscape changes every quarter. This lesson maps the players (OpenAI, Anthropic, Google, Meta, Mistral), their models' strengths, and how to think about evaluation.
Anthropic. Claude family (Opus, Sonnet, Haiku). Strong reasoning, long context, safety focus.
OpenAI. GPT-4, GPT-4o, o1 (reasoning). Broadest ecosystem.
Google. Gemini family. Multimodal, longest context.
Meta. Llama family. Open weights, runnable locally.
Mistral. Open-weight and commercial models. Strong European alternative.
Modern frontier models read images, parse PDFs, transcribe audio, and (in some cases) watch video. Claude 3.5+ and Claude 4 handle images and PDFs natively. GPT-4o is multimodal in and out (text, image, audio). Gemini 2.x handles the longest video inputs. When picking a model, ask 'what input shapes will my users send?' before looking at benchmark scores.
Context window = how much input the model can read at once. Modern frontier models offer 128K-1M tokens (Claude Sonnet 4.x ~200K, GPT-4o 128K, Gemini 2 Pro 1M+). Large context lets you skip RAG for medium documents (one long prompt with the document inline). Long contexts cost more per request, and models attend less reliably to content in the middle of a long prompt (the 'lost in the middle' effect). Always test retrieval quality at the size you'll deploy.
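A quick way to sanity-check whether a document fits is to estimate its token count before sending it. The sketch below is a rough illustration only: the 4-characters-per-token ratio is a crude assumption for English text, the model keys are hypothetical labels, and the window sizes are the approximate figures quoted above. For real decisions, count tokens with the provider's own tokenizer.

```python
# Rough context-fit check. The 4-characters-per-token ratio is a crude
# assumption for English text, not a provider-documented constant.
CHARS_PER_TOKEN = 4

# Hypothetical labels mapped to the approximate window sizes quoted in the lesson.
CONTEXT_WINDOWS = {
    "claude-sonnet-4": 200_000,
    "gpt-4o": 128_000,
    "gemini-2-pro": 1_000_000,
}


def estimate_tokens(text: str) -> int:
    """Estimate the token count from character length, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(text: str, model: str, reserved_for_output: int = 4_000) -> bool:
    """Return True if the estimated input plus an output reserve fits the window."""
    return estimate_tokens(text) + reserved_for_output <= CONTEXT_WINDOWS[model]


if __name__ == "__main__":
    document = "word " * 150_000  # 750,000 characters
    for model in CONTEXT_WINDOWS:
        print(model, fits_in_context(document, model))
```

Reserving room for the output matters because the response shares the same window as the input on most APIs. A document that "just fits" leaves no space for the answer.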
1. Capability tier (frontier vs cheap fast).
2. Latency requirements (fastest to slowest: Haiku, Sonnet, Opus on Anthropic; gpt-4o-mini, then gpt-4o on OpenAI).
3. Cost per token (compare via OpenRouter, llmprices.com).
4. Modality requirements (text only? image input? long PDFs?).
5. Context window (does your worst-case input fit?).
6. Privacy (data residency, training opt-out, BYOK).
7. Tools/features (function calling, vision, computer use, structured output).
8. Always benchmark on YOUR data, not generic leaderboards.
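Several items in the checklist above (context window, modality, cost) are hard constraints you can apply mechanically before running any evals. The sketch below shows one possible way to do that. The catalogue entries, model names, and prices are placeholders invented for illustration, not real quotes; substitute current figures from each provider's pricing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    context_tokens: int
    accepts_images: bool
    usd_per_million_input: float   # placeholder price, not a real quote
    usd_per_million_output: float  # placeholder price, not a real quote


# Hypothetical catalogue: names and numbers are illustrative only.
CATALOGUE = [
    ModelSpec("small-fast", 128_000, False, 0.50, 1.50),
    ModelSpec("mid-tier", 200_000, True, 3.00, 15.00),
    ModelSpec("frontier", 200_000, True, 15.00, 75.00),
]


def request_cost(spec: ModelSpec, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the spec's per-million-token prices."""
    total = (input_tokens * spec.usd_per_million_input
             + output_tokens * spec.usd_per_million_output)
    return total / 1_000_000


def shortlist(catalogue, worst_case_input: int, needs_images: bool,
              output_tokens: int = 1_000):
    """Drop models that fail a hard constraint, then sort the rest by cost."""
    candidates = [
        m for m in catalogue
        if m.context_tokens >= worst_case_input
        and (m.accepts_images or not needs_images)
    ]
    return sorted(
        candidates,
        key=lambda m: request_cost(m, worst_case_input, output_tokens),
    )


if __name__ == "__main__":
    for spec in shortlist(CATALOGUE, worst_case_input=150_000, needs_images=True):
        print(spec.name, round(request_cost(spec, 150_000, 1_000), 4))
```

The shortlist only narrows the field. Quality on your own task (item 8) still decides between the models that survive the filter.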
MMLU (broad knowledge), HumanEval (code), GPQA (graduate-level science), SWE-bench (real GitHub issues), LMSYS Arena (human preference). Useful as filters; the final choice should rest on your own evals on your own task.
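Running your own eval does not need heavy tooling to start with. The sketch below assumes a small labelled set drawn from your product and treats each model as a plain function from prompt to answer. The examples, labels, and the two stand-in "models" are hypothetical; in practice each function would wrap a real API call.

```python
from typing import Callable

# Hypothetical labelled examples; replace with samples from your own product.
EVAL_SET = [
    ("Refund request for order 1182", "billing"),
    ("App crashes when I upload a photo", "bug"),
    ("How do I export my data?", "how-to"),
]


def accuracy(model: Callable[[str], str], eval_set) -> float:
    """Fraction of examples where the model's answer matches the expected label."""
    correct = sum(
        1 for prompt, expected in eval_set
        if model(prompt).strip().lower() == expected
    )
    return correct / len(eval_set)


# Stand-ins for real model calls, used here so the sketch runs offline.
def keyword_model(prompt: str) -> str:
    text = prompt.lower()
    if "refund" in text:
        return "billing"
    if "crash" in text:
        return "bug"
    return "how-to"


def constant_model(prompt: str) -> str:
    return "bug"


if __name__ == "__main__":
    for name, model in [("keyword", keyword_model), ("constant", constant_model)]:
        print(name, round(accuracy(model, EVAL_SET), 2))
```

Exact-match scoring suits classification-style tasks. Open-ended outputs usually need a rubric or human review instead, which this sketch does not cover.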
Picking by leaderboard, not by task. Arena rankings shift weekly; the task that matters for your product probably isn't a benchmark.
Defaulting to the most expensive model. Sonnet/GPT-4o-mini handle ~80% of production tasks at a fraction of the cost.
Ignoring multimodal options when input includes screenshots, scanned forms, or audio.
Locking in a single provider. Build a thin abstraction (or use OpenRouter, LiteLLM) so swapping providers takes a config change, not a rewrite.
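To make the "thin abstraction" idea concrete, here is a minimal sketch of one way to structure it. All class and function names are hypothetical, and the two providers are offline stand-ins rather than real SDK clients; a real adapter would wrap the vendor's client inside `complete`. This is not how OpenRouter or LiteLLM work internally, only an illustration of the pattern.

```python
from typing import Protocol


class ChatModel(Protocol):
    """The only interface application code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Offline stand-in for a real provider adapter (hypothetical)."""

    def __init__(self, model: str) -> None:
        self.model = model

    def complete(self, prompt: str) -> str:
        return f"[{self.model}] {prompt}"


class UppercaseProvider:
    """A second stand-in, to show that swapping needs no application changes."""

    def __init__(self, model: str) -> None:
        self.model = model

    def complete(self, prompt: str) -> str:
        return prompt.upper()


# Registry mapping config values to adapter classes.
PROVIDERS = {"echo": EchoProvider, "upper": UppercaseProvider}


def build_model(config: dict) -> ChatModel:
    """Construct the adapter named in the config."""
    provider_cls = PROVIDERS[config["provider"]]
    return provider_cls(config["model"])


def summarize(model: ChatModel, text: str) -> str:
    """Application code: knows nothing about which provider is behind `model`."""
    return model.complete(f"Summarize: {text}")


if __name__ == "__main__":
    config = {"provider": "echo", "model": "demo-1"}  # swap providers here
    print(summarize(build_model(config), "quarterly report"))
```

Because `summarize` depends only on the `ChatModel` protocol, changing `config["provider"]` is the whole migration. Provider-specific features (tool calling, vision inputs) would need their own methods on the protocol.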