Bot Protection and Anti-Scraping

After this lesson, you will be able to: Defend forms and content against bad bots and scrapers using Cloudflare Bot Fight Mode, Turnstile, robots.txt, honeypot fields, and layered user-agent checks, while understanding how scrapers work.

Automated traffic makes up a large share of the internet; some industry reports put it near half of all requests. Some bots are good (search crawlers); many are bad (scrapers, credential stuffers, vulnerability scanners). This lesson covers Cloudflare Bot Fight Mode, Turnstile as a privacy-friendly CAPTCHA, robots.txt and what it does and does not enforce, honeypot fields, user-agent checks, and how understanding scrapers helps you defend against them.

Prerequisites: Rate Limiting and Abuse Prevention

Good bots vs bad bots

Good bots identify themselves and respect rules: Googlebot, Bingbot, uptime monitors. Bad bots do not: content scrapers that copy your data, credential stuffers replaying leaked logins, and vulnerability scanners probing for known holes. The goal is not to block all automation; it is to let the good ones through and make the bad ones expensive.

Cloudflare Bot Fight Mode and challenge types

Bot Fight Mode (free tier) uses behavioral signals and machine learning to spot likely bots and challenge them. It is a site-wide toggle, so per-path decisions are made with WAF custom rules, which can apply one of several actions. A managed challenge presents a lightweight, often invisible, check; a JS challenge requires the client to run JavaScript (most simple scrapers cannot); block refuses the request outright. Start with managed challenge on sensitive paths so you do not break legitimate users, and reserve block for confirmed-bad patterns.
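The escalation advice above can be sketched as a small decision function. This is only an illustration of the policy, not Cloudflare's API: the `IncomingRequestInfo` shape, the `SENSITIVE_PREFIXES` list, and the `chooseAction` name are all hypothetical, and in practice you would express the same logic as rules in the Cloudflare dashboard.

```typescript
// Hypothetical policy sketch. None of these names come from Cloudflare.
type BotAction = "allow" | "managed_challenge" | "js_challenge" | "block";

interface IncomingRequestInfo {
  path: string;
  // True only for patterns you have already confirmed as abusive.
  confirmedBad: boolean;
}

// Assumed list of paths worth challenging; adjust to your own app.
const SENSITIVE_PREFIXES = ["/login", "/signup", "/api/checkout"];

function chooseAction(req: IncomingRequestInfo): BotAction {
  // Reserve the harshest action for confirmed-bad traffic.
  if (req.confirmedBad) return "block";
  // Sensitive paths get the gentlest challenge first.
  if (SENSITIVE_PREFIXES.some((prefix) => req.path.startsWith(prefix))) {
    return "managed_challenge";
  }
  return "allow";
}
```

Starting from the mildest action and escalating only on evidence keeps false positives cheap: a wrongly challenged human passes the check, while a wrongly blocked one is simply lost.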

Cloudflare Turnstile: a privacy-friendly CAPTCHA

Turnstile replaces reCAPTCHA without sending users to label crosswalks or tracking them across the web. Render the widget, then verify the token server-side.

ts

// Client: render the widget (gives you a token on success)
// <div class="cf-turnstile" data-sitekey={SITE_KEY}></div>

// Server: verify the token before trusting the form
async function verifyTurnstile(token: string, ip: string): Promise<boolean> {
  const secret = process.env.TURNSTILE_SECRET_KEY;
  // Fail closed if the secret is not configured or no token was submitted
  if (!secret || !token) return false;
  const res = await fetch(
    "https://challenges.cloudflare.com/turnstile/v0/siteverify",
    {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ secret, response: token, remoteip: ip }),
    },
  );
  // Treat a failed verification call as a failed check
  if (!res.ok) return false;
  const data = (await res.json()) as { success?: boolean };
  return data.success === true; // reject the request if false
}
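To show where the verification step sits in a form handler, here is a minimal sketch. It assumes a verifier with the same shape as `verifyTurnstile` above, passed in as a parameter so the sketch stays self-contained. The widget adds a hidden form field named `cf-turnstile-response`, and Cloudflare sets the `CF-Connecting-IP` header on proxied requests. The `Fields` type and the `readTurnstileInput` and `guardSubmission` names are hypothetical.

```typescript
// Assumed shape: matches the verifyTurnstile(token, ip) function shown earlier.
type Verifier = (token: string, ip: string) => Promise<boolean>;

// Plain records keep the sketch framework-neutral (an assumption; adapt this
// to your framework's form and header objects). Header names are lowercased.
type Fields = Record<string, string | undefined>;

// Pull the Turnstile token and the client IP out of a submission.
function readTurnstileInput(
  fields: Fields,
  headers: Fields,
): { token: string; ip: string } | null {
  const token = fields["cf-turnstile-response"];
  if (!token) return null; // widget never ran, or the field was stripped
  // Only meaningful when the request really came through Cloudflare.
  const ip = headers["cf-connecting-ip"] ?? "";
  return { token, ip };
}

// Returns the HTTP status the handler should respond with.
async function guardSubmission(
  fields: Fields,
  headers: Fields,
  verify: Verifier,
): Promise<number> {
  const input = readTurnstileInput(fields, headers);
  if (input === null) return 400; // no token at all
  const passed = await verify(input.token, input.ip);
  return passed ? 200 : 403; // only process the form on 200
}
```

The key design point is ordering: the token is verified on the server before any form data is trusted, because a client-side success callback can be faked.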

robots.txt: a request, not a wall

robots.txt tells well-behaved crawlers which paths to avoid. It is a polite request, not enforcement: malicious bots ignore it, and worse, it advertises every path you list. Use it to steer search engines away from low-value or duplicate pages, never to protect anything sensitive. Sensitive paths need real authentication, not a Disallow line.
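As a rough illustration of what belongs in the file, the sketch below builds a robots.txt body from a list of paths. The `buildRobotsTxt` helper and the example paths are hypothetical; the point is that the list holds only low-value pages such as search results and print views, and nothing in it is a secret.

```typescript
// Hypothetical helper: builds a robots.txt body for all crawlers.
function buildRobotsTxt(disallow: string[], sitemapUrl?: string): string {
  const lines = ["User-agent: *"];
  if (disallow.length === 0) {
    lines.push("Disallow:"); // an empty Disallow means nothing is off limits
  } else {
    for (const path of disallow) lines.push(`Disallow: ${path}`);
  }
  if (sitemapUrl) lines.push("", `Sitemap: ${sitemapUrl}`);
  return lines.join("\n") + "\n";
}

// Assumed example paths: duplicate or low-value pages only.
const robotsBody = buildRobotsTxt(["/search", "/print/"]);
```

Anything you would be unhappy to see in a public file should be protected by authentication and left out of this list entirely.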

Honeypot fields and user-agent checks

A honeypot is a form field hidden from humans with CSS; real users never fill it, naive bots do. User-agent checks are a weak signal (trivially spoofed) but still worth using as one layer.

tsx

// Honeypot: hidden field, reject the submission if it is filled
// <input name="website" tabindex="-1" autocomplete="off"
//   style="position:absolute;left:-9999px" aria-hidden="true" />

export async function POST(req: Request) {
  const form = await req.formData();
  if (form.get("website")) {
    // A human never fills a hidden field. This is a bot.
    return new Response("OK", { status: 200 }); // fail silently
  }
  // user-agent is spoofable, so treat it as ONE weak signal among many
  const ua = req.headers.get("user-agent") ?? "";
  if (!ua || /curl|python-requests|scrapy/i.test(ua)) {
    // raise suspicion / add friction, do not rely on this alone
  }
  // ...process the legitimate submission here, then respond
  return new Response("OK", { status: 200 });
}
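One way to turn the two signals above into a decision is a small pure function that a handler can call. This is a sketch under assumptions: the `Verdict` values and the `assessSubmission` name are hypothetical, and "add friction" stands for whatever extra check you choose, such as requiring a Turnstile token.

```typescript
// Hypothetical verdicts for a form submission.
type Verdict = "allow" | "add_friction" | "drop_silently";

// Same pattern as the handler above: common scripting tools.
const TOOL_USER_AGENT = /curl|python-requests|scrapy/i;

function assessSubmission(
  honeypotValue: string | null,
  userAgent: string | null,
): Verdict {
  // A filled honeypot is a strong signal: pretend success, store nothing.
  if (honeypotValue !== null && honeypotValue.trim() !== "") {
    return "drop_silently";
  }
  // A missing or tool-like user-agent is a weak, spoofable signal,
  // so it only raises friction and never rejects on its own.
  if (!userAgent || TOOL_USER_AGENT.test(userAgent)) {
    return "add_friction";
  }
  return "allow";
}
```

Keeping the strong signal (honeypot) and the weak one (user-agent) on separate outcomes means a spoofed or unusual user-agent alone never locks out a real person.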

Knowing your enemy: how scrapers work

Modern scrapers use headless browsers (Puppeteer, Playwright) that run real JavaScript, so JS challenges alone do not stop them. They look for stable HTML structure and predictable API endpoints. You can make scraping harder without harming users by requiring auth for bulk data, paginating and rate-limiting API responses, not exposing a clean JSON API that returns everything in one call, and using Turnstile on the endpoints that matter. You will never make scraping impossible for a determined attacker; you make it expensive enough that they move on.
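Two of the defenses above, capped pagination and per-client rate limiting, can be sketched in a few lines. The numbers and names here (`MAX_PAGE_SIZE`, `DEFAULT_PAGE_SIZE`, `clampPageSize`, `FixedWindowLimiter`) are assumptions for illustration, and the limiter is in-memory only, so it would not work across multiple server instances as written.

```typescript
const MAX_PAGE_SIZE = 50; // assumed cap: no single call returns everything
const DEFAULT_PAGE_SIZE = 20; // assumed default

// Never trust the client's requested page size.
function clampPageSize(requested: number | undefined): number {
  if (requested === undefined || !Number.isFinite(requested)) {
    return DEFAULT_PAGE_SIZE;
  }
  return Math.min(Math.max(Math.floor(requested), 1), MAX_PAGE_SIZE);
}

// Fixed-window counter keyed by client (for example user ID or IP).
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly hits = new Map<string, { windowStart: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is within the limit for the current window.
  allow(key: string, now: number = Date.now()): boolean {
    const entry = this.hits.get(key);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.hits.set(key, { windowStart: now, count: 1 }); // start a new window
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}
```

With a cap of 50 items per page and, say, 60 requests per minute, a client can pull at most 3,000 records a minute, which turns a one-request dump into a slow, visible crawl.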

Quick Check

Can you rely on robots.txt to keep attackers away from your /admin path?

Pick the correct statement.

No. robots.txt is advisory; only authentication actually protects a path, and listing it can even advertise it.
Yes, all crawlers and bots are required to obey robots.txt.
Yes, as long as you also add a meta noindex tag.
No, but adding it to .gitignore will hide it.

Common mistakes only experienced devs catch

Treating user-agent as proof of identity instead of a weak hint. Putting sensitive paths in robots.txt and thereby advertising them. A honeypot field with a name like 'honeypot' that bots learn to skip (name it something plausible like 'website' or 'phone2'). Blocking by IP so aggressively that you ban a whole corporate NAT or a mobile carrier. Forgetting that headless-browser scrapers pass JS challenges, so layering (auth + rate limit + Turnstile) matters more than any single trick.
