BiTree
© 2026 BiTree. All rights reserved.
40 min · Intermediate

Bot Protection and Anti-Scraping

After this lesson, you will be able to: Defend forms and content against bad bots and scrapers using Cloudflare Bot Fight Mode, Turnstile, robots.txt, honeypot fields, and layered user-agent checks, while understanding how scrapers work.

Automated traffic makes up a large share of all internet traffic; some industry reports put it at roughly half. Some bots are good (search crawlers); many are bad (scrapers, credential stuffers, vulnerability scanners). This lesson covers Cloudflare Bot Fight Mode, Turnstile as a privacy-friendly CAPTCHA, robots.txt and what it does and does not enforce, honeypot fields, user-agent checks, and how understanding scrapers helps you defend against them.

Prerequisites: Rate Limiting and Abuse Prevention

Good bots vs bad bots

Good bots identify themselves and respect rules: Googlebot, Bingbot, uptime monitors. Bad bots do not: content scrapers that copy your data, credential stuffers replaying leaked logins, and vulnerability scanners probing for known holes. The goal is not to block all automation; it is to let the good ones through and make the bad ones expensive.

Cloudflare Bot Fight Mode and challenge types

Bot Fight Mode (available on the free plan) is a single on/off setting: Cloudflare identifies requests that match known bot patterns and challenges them. It cannot be scoped to specific paths or tuned per rule. For finer control, write WAF custom rules and choose an action per rule. A managed challenge lets Cloudflare pick a lightweight, often non-interactive, check; a JS challenge requires the client to run JavaScript (most simple scrapers cannot); block denies the request outright. Start with managed challenge on sensitive paths so you do not break legitimate users, and reserve block for confirmed-bad patterns.

Cloudflare Turnstile: a privacy-friendly CAPTCHA

Turnstile replaces reCAPTCHA without sending users to label crosswalks or tracking them across the web. Render the widget, then verify the token server-side.

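ts
// Client: load https://challenges.cloudflare.com/turnstile/v0/api.js, then
// render the widget. On success it adds a token to the form as the
// "cf-turnstile-response" field.
// <div class="cf-turnstile" data-sitekey={SITE_KEY}></div>

// Server: verify the token before trusting the form
async function verifyTurnstile(token: string, ip: string): Promise<boolean> {
  const secret = process.env.TURNSTILE_SECRET_KEY;
  if (!secret || !token) return false; // fail closed on missing config or token

  const res = await fetch(
    "https://challenges.cloudflare.com/turnstile/v0/siteverify",
    {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({
        secret,
        response: token,
        remoteip: ip,
      }),
    },
  );
  if (!res.ok) return false; // treat a failed verification call as a rejection

  const data = (await res.json()) as { success?: boolean };
  return data.success === true; // reject the request if false
}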
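
The siteverify response carries more than `success`: it also reports the `hostname` the token was issued for, the optional widget `action`, and any `error-codes`. As a minimal sketch of a stricter check, the helper below accepts a result only when those fields match what the server expects. The helper name `isAcceptableVerification` and the `SiteverifyResult` interface are assumptions made for illustration, not part of any Cloudflare SDK.

```typescript
// Fields this sketch reads from the siteverify JSON response.
interface SiteverifyResult {
  success: boolean;
  hostname?: string;
  action?: string;
  "error-codes"?: string[];
}

// Hypothetical helper: accept the token only if verification succeeded AND
// it was issued for the hostname (and, optionally, the action) we expect.
function isAcceptableVerification(
  result: SiteverifyResult,
  expectedHostname: string,
  expectedAction?: string,
): boolean {
  if (result.success !== true) return false;
  if (result.hostname !== expectedHostname) return false;
  if (expectedAction !== undefined && result.action !== expectedAction) {
    return false;
  }
  return true;
}
```

Checking the hostname guards against a token that was solved on a different site being replayed against your form. Whether that extra strictness is worth it depends on how your widgets are deployed.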

robots.txt: a request, not a wall

robots.txt tells well-behaved crawlers which paths to avoid. It is a polite request, not enforcement: malicious bots ignore it, and worse, the file is public, so it advertises every path you list. Use it to steer search engines away from duplicate or low-value pages, never to protect anything sensitive. Sensitive paths need real authentication, not a Disallow line.
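
To make the format concrete, here is a small sketch that renders crawl rules into robots.txt text. The `RobotsRule` shape, the `buildRobotsTxt` function, and the example paths `/search` and `/print/` are all hypothetical; the point is that the file lists only low-value pages and nothing you would mind an attacker reading.

```typescript
// Hypothetical rule shape for this sketch; not part of any library.
interface RobotsRule {
  userAgent: string;
  disallow: string[];
}

// Render rules into robots.txt text: one group per user agent,
// with groups separated by a blank line.
function buildRobotsTxt(rules: RobotsRule[]): string {
  const groups = rules.map((rule) =>
    [
      `User-agent: ${rule.userAgent}`,
      ...rule.disallow.map((path) => `Disallow: ${path}`),
    ].join("\n"),
  );
  return groups.join("\n\n") + "\n";
}

// Only duplicate or low-value pages are listed; nothing sensitive appears here.
const robotsTxt = buildRobotsTxt([
  { userAgent: "*", disallow: ["/search", "/print/"] },
]);
```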

Honeypot fields and user-agent checks

A honeypot is a form field hidden from humans with CSS; real users never fill it, naive bots do. User-agent checks are a weak signal (trivially spoofed) but still worth using as one layer.

tsx
// Honeypot: hidden field, reject the submission if it is filled
// (HTML shown; in JSX use tabIndex, autoComplete, and a style object)
// <input name="website" tabindex="-1" autocomplete="off"
//   style="position:absolute;left:-9999px" aria-hidden="true" />
export async function POST(req: Request) {
  const form = await req.formData();
  if (form.get("website")) {
    // A human never fills a hidden field. This is a bot.
    return new Response("OK", { status: 200 }); // fail silently
  }

  // user-agent is spoofable, so treat it as ONE weak signal among many
  const ua = req.headers.get("user-agent") ?? "";
  const suspiciousUa = !ua || /curl|python-requests|scrapy/i.test(ua);
  if (suspiciousUa) {
    // raise suspicion / add friction, do not rely on this alone
  }

  // ...handle the legitimate submission here...
  return new Response("OK", { status: 200 });
}
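
One way to treat the user-agent as a single weak signal is to combine it with other signals into a verdict. The sketch below is illustrative only: the `RequestSignals` fields, the weights, and the threshold of 30 requests are assumptions to tune against your own traffic, not recommended values.

```typescript
// Hypothetical signals gathered for one request; names are illustrative.
interface RequestSignals {
  honeypotFilled: boolean;
  userAgent: string;
  recentRequestCount: number; // requests from this client in the current window
}

type Verdict = "allow" | "challenge" | "reject";

// Assumed weights and thresholds; adjust them for your own traffic.
function assessRequest(signals: RequestSignals): Verdict {
  // A filled honeypot is a strong signal on its own.
  if (signals.honeypotFilled) return "reject";

  let score = 0;
  if (signals.userAgent === "") {
    score += 2; // a missing user-agent is unusual for real browsers
  } else if (/curl|python-requests|scrapy/i.test(signals.userAgent)) {
    score += 1; // tool-like user-agent: weak, trivially spoofed
  }
  if (signals.recentRequestCount > 30) score += 2;

  // Weak signals only add friction (for example a Turnstile challenge);
  // in this sketch they never cause an outright rejection.
  return score >= 2 ? "challenge" : "allow";
}
```

With these weights, a tool-like user-agent alone is allowed through, while the same user-agent combined with a high request count earns a challenge.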

Knowing your enemy: how scrapers work

Modern scrapers use headless browsers (Puppeteer, Playwright) that run real JavaScript, so JS challenges alone do not stop them. They look for stable HTML structure and predictable API endpoints. You can make scraping harder without harming users by requiring auth for bulk data, paginating and rate-limiting API responses, not exposing a clean JSON API that returns everything, and using Turnstile on the endpoints that matter. You cannot make scraping impossible for a determined attacker; you can make it expensive enough that they move on.
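
Capping page size is one concrete form of "paginate and do not return everything". The sketch below clamps client-supplied `limit` and `offset` query values on the server. The parameter names, the cap of 50, and the default of 20 are assumptions for illustration.

```typescript
const MAX_PAGE_SIZE = 50; // assumed cap; pick one that suits your data
const DEFAULT_PAGE_SIZE = 20; // assumed default when the client sends nothing usable

interface PageRequest {
  limit: number;
  offset: number;
}

// Parse untrusted query values and clamp them to safe bounds, so a single
// request can never ask for the whole dataset.
function parsePageRequest(
  query: Record<string, string | undefined>,
): PageRequest {
  const rawLimit = Number.parseInt(query.limit ?? "", 10);
  const rawOffset = Number.parseInt(query.offset ?? "", 10);

  const limit = Number.isNaN(rawLimit)
    ? DEFAULT_PAGE_SIZE
    : Math.min(Math.max(rawLimit, 1), MAX_PAGE_SIZE);
  const offset = Number.isNaN(rawOffset) ? 0 : Math.max(rawOffset, 0);

  return { limit, offset };
}
```

A scraper asking for `limit=100000` gets at most 50 rows per request, which forces many requests and gives your rate limiter something to count.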

Quick Check

Can you rely on robots.txt to keep attackers away from your /admin path?


Common mistakes only experienced devs catch

  • Treating user-agent as proof of identity instead of a weak hint.
  • Putting sensitive paths in robots.txt and thereby advertising them.
  • A honeypot field with a name like 'honeypot' that bots learn to skip (name it something plausible like 'website' or 'phone2').
  • Blocking by IP so aggressively that you ban a whole corporate NAT or a mobile carrier.
  • Forgetting that headless-browser scrapers pass JS challenges, so layering (auth + rate limit + Turnstile) matters more than any single trick.
