Bots, scrapers, and proxies: defending Drupal sites in an automated internet
Over half of all web traffic in 2024 was automated. That is the headline number from the Imperva 2025 Bad Bot Report, and it is the first time bots have outnumbered humans in more than a decade. Drupal sites sit squarely in that traffic mix, and the old defensive playbook — block an IP, ban a user agent, drop a robots.txt entry, lean on Fail2ban — does not hold up anymore.
This is the companion post to my DrupalSouth Wellington 2026 talk, Bots, scrapers, and proxies: defending Drupal sites in an automated internet. The talk walked through the defences I actually use at amazee.io and recommend on client sites. The post covers the same ground, with a bit more room to show config and link out to the projects.
What actually changed
The technical context underneath bot defence has shifted in three ways that matter:
- Residential proxy networks. Scrapers no longer come from a handful of cloud subnets you can block. They route through real consumer IP addresses, often unwittingly donated by free-VPN users or piggy-backed off shady SDKs in mobile apps.
- Headless browsers everywhere. Playwright and Puppeteer have made it trivial to render JavaScript-heavy pages at scale. A page that needed a real human five years ago can be scraped today by anyone with a laptop.
- AI-driven scraping. Volume is up sharply because every new LLM needs training data, and there is now a steady drip of new crawlers showing up. Meta's meta-externalagent is one recent example. There will be more.
Mimicry is now the baseline, not the edge case. A modern scraper will rotate IPs, randomise user agents, replay realistic TLS fingerprints, and pace itself slowly enough to look like a real user. You cannot rely on any signal that lives in a single HTTP header.
The scale of it
If this still sounds like a niche problem, the numbers say otherwise.
- 51% - share of web traffic that was automated in 2024, per the Imperva 2025 Bad Bot Report.
- +96% - year-on-year growth in traffic from some popular bots across Pantheon's hosting fleet, per their July 2025 data.
- 1B+ - unique monthly visitors Pantheon sees across its platform, which gives a sense of the size of the dataset those numbers come from.
On the amazee.io platform globally, 13% of incoming requests can be flagged as non-human based on the user agent alone. That is the lazy bots. The actual share of automated traffic is higher once you account for the ones that try to blend in. In absolute terms it adds up to hundreds of millions of requests every month.
The goal is not to block all bots
Before going through the defences, one thing I am careful to say up front, both on stage and here: the goal is not to block all bots. That is unwinnable, and the closer you get to it, the more real users you break.
Search crawlers, RSS readers, uptime monitors, link-preview generators in Slack and iMessage, accessibility tooling - all bots, all wanted. The goal is to reduce abuse where it hurts most, on the endpoints that cost you real money or real performance, while leaving everything else alone.
Drupal-native defences
The defences closest to your application are the smartest. They can see the path, the user, the form, the cache state. They are also the most expensive per blocked request, because every block at this layer has already cost you a full PHP bootstrap.
Perimeter
The Perimeter module drops requests matching known-bad patterns: /wp-admin, /.env, xmlrpc.php, all the WordPress scanner noise that hits every Drupal site daily. It is the cheapest win on the list. It will not stop a serious scraper, but it will keep your logs clean and your error rate honest.
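To make the shape of this concrete, here is a minimal sketch of the same idea as a request event subscriber — not the Perimeter module's actual code. The Drupal\mymodule namespace and class name are hypothetical, the pattern list is a tiny sample, and a real deployment would also register the class in mymodule.services.yml:

```php
<?php

namespace Drupal\mymodule\EventSubscriber;

use Symfony\Component\EventDispatcher\EventSubscriberInterface;
use Symfony\Component\HttpFoundation\Response;
use Symfony\Component\HttpKernel\Event\RequestEvent;
use Symfony\Component\HttpKernel\KernelEvents;

/**
 * Rejects requests for paths no Drupal site legitimately serves.
 */
class BadPathSubscriber implements EventSubscriberInterface {

  // A small sample of scanner noise; Perimeter ships a much longer list.
  private const BAD_PATTERNS = ['#^/wp-admin#', '#^/\.env#', '#xmlrpc\.php#'];

  public static function getSubscribedEvents(): array {
    // Run early, before routing does any real work.
    return [KernelEvents::REQUEST => ['onRequest', 100]];
  }

  public function onRequest(RequestEvent $event): void {
    $path = $event->getRequest()->getPathInfo();
    foreach (self::BAD_PATTERNS as $pattern) {
      if (preg_match($pattern, $path)) {
        // Short-circuit with a bare 403 and skip the rest of the pipeline.
        $event->setResponse(new Response('', Response::HTTP_FORBIDDEN));
        return;
      }
    }
  }

}
```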
CrowdSec and AbuseIPDB
CrowdSec is a local agent plus a community blocklist. Every site running CrowdSec contributes detected attacks back to a shared signal, and pulls down the latest list of bad actors. It is the closest thing the open-source world has to a distributed reputation system.
AbuseIPDB is a reputation lookup service. You query an IP, you get a confidence score. It is most useful on the forms and login flows where you can afford the latency of an external API call. Both are available as Drupal modules.
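For those form and login flows, the reputation check is only a few lines. This is a hedged sketch against AbuseIPDB's v2 check endpoint; the mymodule_ function name, the settings.php key name, and the 75-point threshold are all my own placeholders:

```php
<?php

use Drupal\Core\Site\Settings;

/**
 * Looks up an IP's abuse confidence score before an expensive submission.
 */
function mymodule_ip_is_abusive(string $ip): bool {
  try {
    $response = \Drupal::httpClient()->get('https://api.abuseipdb.com/api/v2/check', [
      'headers' => [
        'Key' => Settings::get('abuseipdb_api_key'),
        'Accept' => 'application/json',
      ],
      'query' => ['ipAddress' => $ip, 'maxAgeInDays' => 90],
      // Keep the timeout tight; this runs inline with a real user's request.
      'timeout' => 2,
    ]);
    $data = json_decode((string) $response->getBody(), TRUE);
    return ($data['data']['abuseConfidenceScore'] ?? 0) >= 75;
  }
  catch (\Exception $e) {
    // Fail open: a slow or unreachable reputation service should not
    // lock real users out.
    return FALSE;
  }
}
```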
Facet Bot Blocker
If you run Search API with facets, this is the single biggest cheap win available to you. Faceted search URLs are catnip for scrapers: every combination of filters is a new URL, every URL is uncached, every uncached request hits the database. A bot that crawls a faceted listing can take a site down without trying.
The Facet Bot Blocker module acts as a rate limit on requests that include at least one facet in the URL. Configure it to use Redis or memcache for the counter so you are not making the problem worse by hitting the database to record the request. On one of our hosting customers, this one module cut Search API load by more than half.
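For a feel of what that looks like, here is a sketch of the underlying idea using core's flood service, run early in the request (for example from an event subscriber) — not the module's real code. The 'f' facet query key, the flood event name, and the 30-requests-a-minute threshold are illustrative:

```php
<?php

use Symfony\Component\HttpKernel\Exception\TooManyRequestsHttpException;

// Count faceted requests per IP and reject the excess. With the Redis
// module configured as the flood backend, the counting itself stays out
// of the database.
$request = \Drupal::request();
if ($request->query->has('f')) {
  $flood = \Drupal::flood();
  $ip = $request->getClientIp();
  // Allow at most 30 faceted requests per IP per 60 seconds.
  if (!$flood->isAllowed('mymodule.facet_request', 30, 60, $ip)) {
    throw new TooManyRequestsHttpException(60, 'Too many faceted requests.');
  }
  $flood->register('mymodule.facet_request', 60, $ip);
}
```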
Form-side defences
Logins, registrations, password resets and contact forms all need their own treatment, separate from page-level defence (a minimal sketch of the honeypot idea follows the list):
- Honeypot - invisible field plus a time-based check. Cheap, fast, surprisingly effective against the dumb half of form spam.
- Antibot - requires JavaScript to submit, blocks the bots that do not run JS.
- CAPTCHA, reCAPTCHA, or Cloudflare Turnstile - full challenge. Use the lightest option that works, and ideally only after Honeypot and Antibot have already rejected the easy cases.
- Hidden CAPTCHA - bridges the gap when you want a CAPTCHA-style check without the accessibility cost of a visible challenge.
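Here is the honeypot idea in miniature — not the Honeypot module's actual implementation. The form ID, field names, and five-second floor are placeholders, and the real module is considerably more robust about hiding the field and protecting the time check from tampering:

```php
<?php

use Drupal\Core\Form\FormStateInterface;

/**
 * Adds an invisible field plus a timestamp check to a contact form.
 */
function mymodule_form_contact_message_feedback_form_alter(array &$form, FormStateInterface $form_state): void {
  $form['homepage_url'] = [
    '#type' => 'textfield',
    '#title' => t('Leave this field blank'),
    // Hidden from humans via CSS, but still present in the markup bots parse.
    '#prefix' => '<div style="position:absolute;left:-9999px" aria-hidden="true">',
    '#suffix' => '</div>',
    '#attributes' => ['autocomplete' => 'off', 'tabindex' => '-1'],
  ];
  $form['hp_time'] = [
    '#type' => 'hidden',
    '#value' => \Drupal::time()->getRequestTime(),
  ];
  $form['#validate'][] = 'mymodule_honeypot_validate';
}

/**
 * Rejects submissions that fill the trap field or arrive implausibly fast.
 */
function mymodule_honeypot_validate(array &$form, FormStateInterface $form_state): void {
  $elapsed = \Drupal::time()->getRequestTime() - (int) $form_state->getValue('hp_time');
  if ($form_state->getValue('homepage_url') !== '' || $elapsed < 5) {
    $form_state->setErrorByName('', t('Your submission could not be processed.'));
  }
}
```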
Every block at the Drupal layer has already cost you a PHP bootstrap. That is fine when the absolute volume is small. It is not fine when you are eating hundreds of millions of bot requests and bootstrapping PHP for each one. This is why you cannot stop at the application layer.
Web server and infrastructure
One layer out, the web server can drop requests before PHP ever runs. The trade-off flips: you save the bootstrap cost, but you lose access to application context.
Rate limiting and geo blocking
nginx ships with limit_req_zone; on the Apache side, mod_evasive covers similar ground. Both are blunt but effective on volume. A starting point for nginx looks roughly like this:
```nginx
limit_req_zone $binary_remote_addr zone=search:10m rate=10r/m;

location /search {
    limit_req zone=search burst=5 nodelay;
    proxy_pass http://drupal;
}
```
Ten search requests per minute, per IP, with a burst of five. Tune to taste. The $binary_remote_addr key is cheap on memory; a 10MB zone holds around 160,000 IPs.
Geo blocking is the other infrastructure-level lever. It is pragmatic and occasionally controversial. If your audience is the New Zealand public sector, blocking inbound from regions you do not serve is a defensible call. If your audience is global, it is not. Know your traffic before reaching for it.
ModSecurity and the OWASP CRS
ModSecurity with the OWASP Core Rule Set is a proper WAF you can self-host. Once tuned, it is real protection. The tuning is the catch: out of the box it will flag Drupal admin actions, file uploads, anything that looks like SQL in a body or a query string. Expect to spend real time pruning rules and adding exceptions for legitimate site behaviour before you stop generating false positives.
Cache discipline
A request that hits the cache costs you nothing. Whatever else you do, get your cache headers right. Vary on the bits that need to vary, cache aggressively on the bits that do not, and lean on the page cache or the reverse proxy in front of Drupal. The cheapest bot is the one that asks for a page you have already served.
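In Drupal terms, cache discipline mostly means declaring honest cacheability metadata on what you render. A small sketch, with illustrative values:

```php
<?php

// Declare what a render array varies on and how long it can live, so the
// page cache and any reverse proxy can serve repeat requests without a
// full Drupal bootstrap. $items is a hypothetical pre-computed listing.
$build = [
  '#theme' => 'item_list',
  '#items' => $items,
  '#cache' => [
    'max-age' => 3600,                      // safe to serve for an hour
    'contexts' => ['url.query_args:page'],  // vary only on the pager
    'tags' => ['node_list'],                // flush when content changes
  ],
];
```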
(Shameless plug) You can also use my site Caching Score to review your current caching setup, to see if there is anything better you could be doing.
What this layer is bad at
Rate limits, ModSecurity rules and geo blocks are great at volume and bad at quality. They cannot tell a scraper trickling one request per minute apart from a real user. For that you need either the edge or the application.
Edge and paid bot management
The edge is where the big vendors live, and it is where you push the cheapest blocks. A scraper rejected by Cloudflare at the network edge never gets to your origin at all.
Cloudflare
The free tier already includes Bot Fight Mode, basic challenges, and Turnstile. For most small-to-medium Drupal sites, this is a good baseline at zero extra cost. The paid Bot Management product adds custom rule logic, JA3 and JA4 TLS fingerprinting, and machine-learning-based bot scoring you can wire into firewall rules. The jump from free to paid is significant in price; the jump in capability is also significant.
Fastly, Akamai, and the rest
Fastly offers the Next-Gen WAF (originally Signal Sciences) with a Bot Management add-on. Akamai sits at the enterprise tier with the most sophisticated fingerprinting available, and a price tag to match. Beyond those, there is AWS WAF with Bot Control, DataDome, HUMAN, and Imperva: all credible, all paid, all priced for sites where bot abuse is costing real money.
The trade-offs nobody puts on the sales deck
Bot Management at the edge solves real problems. It also comes with real costs that the vendor demos skip past:
- Cost. Bot Management is almost always an add-on to the core WAF subscription, and the pricing escalates fast with traffic.
- Vendor lock-in. Your rules, your dashboards, your observability all live in the vendor's UI. Migrating off is painful.
- Accessibility and SEO. Aggressive challenges break real users, and search bots that fail a challenge will hurt your rankings. Test both before turning anything up.
- Rules live outside your application codebase. They drift, they are not versioned alongside the code that depends on them, and a rule change can break a feature without any commit to point to.
- False positives are invisible to you. By default, blocked requests do not reach your logs. You will not know which real users were turned away unless you specifically ask for that signal.
Anubis
The newest piece in the picture, and the one that has me genuinely interested.
What Anubis is
Anubis is an open-source reverse proxy (MIT licensed) that sits in front of your site and issues a proof-of-work challenge to clients before letting them through. It was built specifically for the AI scraper era: for the case where the scraper is mimicking a real browser well enough that classifying it on signal alone has stopped working.
Why proof-of-work, not CAPTCHA
The interesting move with Anubis is who pays the cost. A real user pays a few hundred milliseconds of CPU once when they first arrive, and never sees it again for the lifetime of the cookie. A scraper hitting you a million times pays the cost a million times.
That asymmetry is the whole point. CAPTCHAs put the cost on humans (the people who lose patience trying to identify traffic lights). Anubis puts it on whoever is doing the hammering. That is closer to the right shape of the trade.
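To make the asymmetry concrete, here is an illustrative sketch of a hash-based proof-of-work check in plain PHP. This is the general shape of the technique, not Anubis's actual protocol, and the difficulty of four hex digits is an arbitrary example:

```php
<?php

// The client must find a nonce such that sha256(challenge . nonce) starts
// with $difficulty zero hex digits. Verifying costs the server one hash.
function pow_verify(string $challenge, string $nonce, int $difficulty = 4): bool {
  $digest = hash('sha256', $challenge . $nonce);
  return str_starts_with($digest, str_repeat('0', $difficulty));
}

// Finding the nonce is the expensive side the client pays: tens of
// thousands of hashes on average at this difficulty, and a scraper pays
// it again for every fresh session.
function pow_solve(string $challenge, int $difficulty = 4): string {
  for ($nonce = 0; ; $nonce++) {
    if (pow_verify($challenge, (string) $nonce, $difficulty)) {
      return (string) $nonce;
    }
  }
}
```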
Where to put it
You do not want Anubis in front of your whole site. You want it in front of the endpoints that are expensive and uncacheable. From the talk, my shortlist:
- Search endpoints
- Facet and filter URLs
- Pagination tails - ?page=2348 is not a real user
- Login, register, password reset
- Spicy forms (contact, anything that triggers an email)
- Authenticated user flows
- Anything expensive and uncacheable
Static pages stay fast. The cache stays warm. The PoW cost only applies on the routes where it earns its keep.
But what about Googlebot?
This is the first question every site owner asks, and the answer is reassuring. Anubis ships with allowlists for known good crawlers, matching IP ranges against the published lists from Google, Bing, and the rest. The allowlist is maintained upstream, which means you need to redeploy Anubis on a reasonable cadence to pull in the latest changes. New legitimate crawlers do show up.
Demo site
You can see Anubis in action with a demo Drupal 11 site I put together: the login form has Anubis in front of it; the homepage does not.
Putting the layers together
None of these defences is a silver bullet on its own. Each layer is cheap at one thing and bad at another, and the trick is matching the layer to the threat.
[Diagram: layered defence, with requests flowing from clients through edge/CDN, Anubis, the web server, and Drupal, and the cost to block rising as traffic gets closer to the application.]
Block the cheap traffic at the edge. Block the lazy bots with rate limits and ModSecurity at the web server. Put Anubis in front of the endpoints that are expensive and uncacheable. Let Drupal-native modules handle the application-aware decisions where you actually need to see the user, the form, or the facet state.
Five things to take away
- No single layer is enough. Stack them. The edge handles raw volume, the web server handles patterns, and the application is the only thing that can see real user and form context.
- Match the protection to the threat. A login form needs different defence to a faceted search results page.
- Measure before you defend. Look at your actual traffic. Find your most-hit uncacheable endpoints. Defend those first.
- Watch accessibility and SEO. Every challenge you add is a tax on a real user or a real crawler. The cost of false positives is invisible unless you go looking.
- Plan for adversarial improvement. Whatever you deploy today, the scrapers get a turn next. Pick defences you can iterate on.
You do not need to win the bot war. You just need to make your site a worse target than the next one.
The slides from the talk are on the DrupalSouth schedule page. The recording will be posted here once the DrupalSouth team have edited and uploaded it — check back in a few weeks.