After this lesson, you will be able to take an AI idea to a deployed feature, covering scope, cost, latency, UX, and the honest question of when AI is the wrong tool.
Most AI product launches fail not on capability but on cost, latency, or UX framing. This lesson covers the questions every senior AI engineer asks before writing the first prompt.
1. What's the user-visible job? Describe it in one sentence without the word 'AI'.
2. What does failure cost? A wrong word in a draft is recoverable; a wrong word in a medication dose is not.
3. Is this a hidden classifier or a generative feature? Many 'AI features' are really classifiers (route, tag, score) and would work better with a simpler approach.
4. Where does the user catch errors? If the model is wrong, can the user see it? Can they fix it cheaply?
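To make question 3 concrete, here is a minimal sketch of a routing feature built without any model call. The categories, keywords, and the `route_ticket` function are hypothetical and invented for illustration. The only point is that a "route" job can sometimes be served by explicit rules, with ambiguous cases sent to a human queue.

```python
# Hypothetical keyword router: a non-LLM baseline for a "route" feature.
# Categories and keywords are illustrative assumptions, not from a real system.
ROUTES = {
    "billing": ("invoice", "refund", "charge"),
    "login": ("password", "sign in", "locked out"),
}


def route_ticket(text: str) -> str:
    """Return the single matching category, or 'needs_review' if unclear."""
    lowered = text.lower()
    matches = [
        category
        for category, keywords in ROUTES.items()
        if any(keyword in lowered for keyword in keywords)
    ]
    # Exactly one match is treated as confident; zero or several go to a human.
    return matches[0] if len(matches) == 1 else "needs_review"


print(route_ticket("I was charged twice, please refund me"))
print(route_ticket("Refund me, I'm locked out of my account"))
```

If a baseline like this handles most traffic acceptably, the model may only be needed for the `needs_review` remainder, or not at all.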
Anthropic Sonnet 4.x pricing is roughly $3/M input, $15/M output tokens (verify at anthropic.com/pricing — these change). Use these to ballpark.
# Per-request cost example
input_tokens = 2000        # system prompt + user content + context
output_tokens = 500        # response budget
rate_in = 3 / 1_000_000    # $/token
rate_out = 15 / 1_000_000  # $/token

cost_per_request = input_tokens * rate_in + output_tokens * rate_out
print(f"${cost_per_request:.4f} per request")
# $0.0135 per request

# At scale
daily_requests = 10_000
monthly_cost = cost_per_request * daily_requests * 30
print(f"${monthly_cost:,.2f}/month")
# $4,050.00/month; this surprises companies that 'just want to try AI'

# Cost levers in order of impact:
# 1. Use a smaller model (Haiku is several times cheaper than Sonnet; the exact ratio depends on the model version)
# 2. Use prompt caching for static system prompts (Anthropic: ~90% cheaper on cached input)
# 3. Cap max_tokens hard
# 4. Batch off-realtime requests (Anthropic Message Batches: ~50% off)
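The cost levers can be compared with the same arithmetic. The sketch below is a rough estimator, not a billing calculator: the rates and discounts are the approximate figures quoted in this lesson, the function name and the 1,500-token cached prefix are assumptions, and it ignores details such as any extra charge for writing to the cache.

```python
# Rates and discounts are the approximate figures quoted in this lesson;
# verify current pricing before relying on them.
RATE_IN = 3 / 1_000_000     # $/input token
RATE_OUT = 15 / 1_000_000   # $/output token
CACHE_READ_DISCOUNT = 0.90  # assumed ~90% off cached input tokens
BATCH_DISCOUNT = 0.50       # assumed ~50% off batched requests


def cost_per_request(input_tokens, output_tokens, cached_tokens=0, batched=False):
    """Estimate dollars per request; cached_tokens is the cached share of input."""
    uncached = input_tokens - cached_tokens
    cost = (
        uncached * RATE_IN
        + cached_tokens * RATE_IN * (1 - CACHE_READ_DISCOUNT)
        + output_tokens * RATE_OUT
    )
    return cost * (1 - BATCH_DISCOUNT) if batched else cost


scenarios = {
    "baseline": cost_per_request(2000, 500),
    "cached": cost_per_request(2000, 500, cached_tokens=1500),  # static prompt cached
    "capped": cost_per_request(2000, 250),                      # max_tokens halved
    "batched": cost_per_request(2000, 500, batched=True),
}

for name, cost in scenarios.items():
    # Same scale as above: 10,000 requests per day for 30 days.
    print(f"{name:>8}: ${cost:.4f}/request, ${cost * 10_000 * 30:,.2f}/month")
```

Running the scenarios side by side shows which lever matters for a given prompt shape: caching helps most when the input is dominated by a static prefix, while capping output helps most when responses are long.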
Chat UI: 200ms first token feels instant, 1s feels normal, 3s feels slow, 5s+ feels broken. Always stream tokens so the user sees output starting immediately.
Background job: latency is irrelevant; throughput and cost are. Use batch APIs and queues.
On-keystroke (autocomplete): need sub-200ms total. Use a small fast model, aggressive caching, and pre-warm.
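For the chat case, the number to watch is time to first token, not total response time. The following sketch measures both on any iterable of tokens. `fake_token_stream` is a hypothetical stand-in for a streaming model response, used so the example runs without an API key; a real integration would pass the provider's streaming iterator instead.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_token_stream(delay_s: float = 0.05) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(delay_s)  # simulate per-token generation time
        yield token


def consume_stream(tokens: Iterable[str]) -> Tuple[str, float, float]:
    """Return (full_text, seconds_to_first_token, total_seconds)."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for token in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        parts.append(token)  # a real UI would render the token here
    total = time.perf_counter() - start
    if first_token_at is None:  # empty stream: no first token was ever seen
        first_token_at = total
    return "".join(parts), first_token_at, total


text, ttft, total = consume_stream(fake_token_stream())
print(f"first token after {ttft * 1000:.0f} ms, full response after {total * 1000:.0f} ms")
```

Logging the first-token time separately from the total makes it possible to check the chat thresholds above against what users actually experience.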
Walk this list before shipping any AI feature to real users.
Eval set with at least 20 cases is in place (see ai-evals)
Cost-per-request projected at expected scale; finance has signed off
Latency target is documented and tested at p50 and p99
User knows they're talking to AI; the failure mode is visible (citations, regenerate button, edit button)
There's a non-AI fallback (or the feature degrades to plain text) when the API is down
Rate limiting per user/team is in place; runaway prompts can't bankrupt you
PII handling is documented; user data is not sent to a model that trains on it without consent
Logging captures prompts + responses with PII redaction; you can audit any complaint
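As one illustration of the rate-limiting item, here is a minimal sketch of a per-user daily token budget. The `DailyTokenBudget` class and its in-memory storage are assumptions made for the example. A production system would more likely keep counters in a shared store and derive the day from a clock, but the check itself has the same shape.

```python
from collections import defaultdict


class DailyTokenBudget:
    """Hypothetical per-user token budget, tracked in memory for one process."""

    def __init__(self, max_tokens_per_day: int) -> None:
        self.max_tokens_per_day = max_tokens_per_day
        self._used = defaultdict(int)  # (user_id, day) -> tokens spent

    def try_spend(self, user_id: str, day: str, tokens: int) -> bool:
        """Record the spend and return True, or return False if over budget."""
        key = (user_id, day)
        if self._used[key] + tokens > self.max_tokens_per_day:
            return False  # caller should reject the request or use a fallback
        self._used[key] += tokens
        return True


budget = DailyTokenBudget(max_tokens_per_day=5_000)
print(budget.try_spend("alice", "2025-01-01", 3_000))  # within budget
print(budget.try_spend("alice", "2025-01-01", 3_000))  # would exceed 5,000
print(budget.try_spend("bob", "2025-01-01", 3_000))    # separate user
print(budget.try_spend("alice", "2025-01-02", 3_000))  # new day
```

Checking the budget before the model call, using the estimated input tokens plus `max_tokens`, keeps a single runaway user from driving the monthly bill.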
Treating AI features as identical to non-AI features in design reviews. They have different failure modes, a different cost shape, and different UX requirements.
Promising determinism. LLMs are non-deterministic by default; users WILL get different answers on identical inputs. Set expectations accordingly.
Skipping the kill switch. Have a config flag that disables the AI path and falls back to a non-AI alternative; you'll need it.
Forgetting model deprecation. Frontier models retire on roughly 12-month cycles. Build a model abstraction now, not when you're scrambling.
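A kill switch and a model abstraction can be very small. The sketch below assumes a summarization feature; the environment variable names, the placeholder model name, and the `call_model` callable are all hypothetical. The model call is injected as a function so the model can be swapped in one place, and any failure falls back to plain truncation.

```python
import os
from typing import Callable

# Hypothetical settings; a real system would likely read these from a config service.
AI_ENABLED_FLAG = "AI_SUMMARY_ENABLED"
MODEL_NAME = os.environ.get("AI_SUMMARY_MODEL", "placeholder-model-name")


def fallback_summary(text: str, max_chars: int = 80) -> str:
    """Non-AI fallback: truncate the text at a fixed length."""
    return text if len(text) <= max_chars else text[:max_chars].rstrip() + "..."


def summarize(text: str, call_model: Callable[[str, str], str]) -> str:
    """Use the model when the flag is on; otherwise, or on error, fall back."""
    if os.environ.get(AI_ENABLED_FLAG, "true").lower() != "true":
        return fallback_summary(text)  # kill switch has disabled the AI path
    try:
        return call_model(MODEL_NAME, text)
    except Exception:
        # Deliberately broad for a sketch: any model failure degrades gracefully.
        return fallback_summary(text)


def broken_model(model: str, text: str) -> str:
    """Simulates the API being down."""
    raise ConnectionError("simulated outage")


print(summarize("A short note.", broken_model))
```

Because callers only ever see `summarize`, retiring a model means changing one setting or one injected function, not every call site.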