The Silent Siege: How AI Scrapers Are Destabilizing the Internet’s Knowledge Hubs

The Hidden Cost of LLM Training
For years, the ‘open web’ has operated on a quiet agreement: websites provide information, and in exchange, search engines index that data to drive traffic back to the source. But as the race for Large Language Model (LLM) supremacy intensifies, that agreement is collapsing. Website administrators, particularly those running community-driven wikis, are reporting a surge in aggressive, anonymous scraping that is pushing servers to their breaking point.
The scale of the problem is becoming evident at Weird Gloop, the entity hosting some of the most visited gaming wikis in the world, including those for Minecraft, Old School RuneScape (OSRS), and League of Legends. According to the site’s administration, bot traffic has become a disproportionate financial and technical burden. Without constant mitigation, these bots would consume roughly ten times more compute resources than the millions of actual human users and thousands of daily editors combined.
Beyond the ‘Official’ Bots
Much of the public discourse around AI scraping has focused on the ‘polite’ bots: those operated by industry giants like OpenAI (GPTBot), Anthropic (ClaudeBot), and Perplexity. These agents typically identify themselves in their User-Agent strings and, for the most part, respect robots.txt files, so webmasters can block them with simple rules in tools such as Cloudflare or Nginx.
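As a rough illustration of the robots.txt mechanism mentioned above, the sketch below uses Python's standard urllib.robotparser module to show how a well-behaved crawler might decide whether it may fetch a URL. The robots.txt contents, the wiki.example host, and the is_allowed helper are assumptions made for this example, not rules taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out two named AI crawlers entirely and
# keep every other agent away from script URLs under /w/.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /w/
"""


def is_allowed(agent_token, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits agent_token to fetch url.

    agent_token is the crawler's short product name (for example "GPTBot"),
    which is what the parser matches against User-agent lines.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, url)


print(is_allowed("GPTBot", "https://wiki.example/Dragon"))         # False
print(is_allowed("SomeBrowser", "https://wiki.example/Dragon"))    # True
print(is_allowed("SomeBrowser", "https://wiki.example/w/index.php?action=edit"))  # False
```

Nothing in this check is enforced by the server. It only describes what a cooperative crawler would do, which is why it has no effect on bots that never read the file.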
However, a more insidious trend has emerged. As site owners began blocking identified AI bots, scrapers evolved. A growing wave of ‘stealth’ bots now meticulously mimic human browser requests, spoofing headers to appear as legitimate versions of Google Chrome. This eliminates the obvious signals that sysadmins previously used to distinguish a researcher from a bot.
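To make the limitation concrete, here is a minimal sketch of a User-Agent deny-list of the kind a site owner might apply. The token list, both sample header strings, and the blocked_by_user_agent helper are illustrative assumptions, not a real configuration.

```python
# Hypothetical deny-list built from the bot names that identify themselves.
BLOCKED_UA_TOKENS = ("GPTBot", "ClaudeBot")


def blocked_by_user_agent(user_agent):
    """Return True if the User-Agent header contains a deny-listed token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in BLOCKED_UA_TOKENS)


# Illustrative header from a crawler that announces itself.
declared_bot = "Mozilla/5.0 (compatible; GPTBot/1.0)"

# Illustrative header copied from an ordinary desktop Chrome browser.
spoofed_bot = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

print(blocked_by_user_agent(declared_bot))  # True
print(blocked_by_user_agent(spoofed_bot))   # False
```

The second request passes because nothing in the header separates it from a real browser, so header matching on its own no longer identifies the scraper.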
The Rise of Residential Proxies
The technical battle has shifted from identifying what the bot is to where it is coming from. Historically, scrapers operated from a handful of IP addresses or specific data centers, making them easy to blacklist. Today, bad actors use residential proxies: networks of millions of IP addresses belonging to home internet users (such as Comcast or AT&T customers) who are often unaware that their connection is serving as an exit node for a scraper.
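The sketch below shows why the two classic defenses, blocking known data center ranges and rate-limiting individual IP addresses, lose their grip once traffic is spread across many home connections. All addresses come from reserved documentation ranges, and the network list, the limit of 100, and both helper functions are assumptions for illustration. A real residential pool would be scattered across many unrelated networks rather than sitting in one range.

```python
import ipaddress
from collections import Counter

# Hypothetical blocklist of data center ranges (documentation addresses).
DATACENTER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(ip):
    """Return True if ip falls inside any blocklisted network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in DATACENTER_NETWORKS)


def ips_over_limit(request_ips, limit):
    """Return the set of IPs that sent more than `limit` requests."""
    counts = Counter(request_ips)
    return {ip for ip, n in counts.items() if n > limit}


# 500 requests from a single data center address.
datacenter_burst = ["192.0.2.10"] * 500

# 500 requests spread over 250 addresses, two requests each.
residential_burst = [f"203.0.113.{i}" for i in range(1, 251)] * 2

print(is_blocked("192.0.2.10"))                        # True
print(ips_over_limit(datacenter_burst, 100))           # {'192.0.2.10'}
print(any(is_blocked(ip) for ip in residential_burst)) # False
print(ips_over_limit(residential_burst, 100))          # set()
```

Both bursts contain the same number of requests, but only the first trips either check.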
Furthermore, some scrapers are leveraging ‘cloaking’ techniques through third-party services. By routing requests through Google Translate or the facebookexternalhit link preview tool, scrapers can mask their identity behind the trusted infrastructure of Google and Meta. In some cases, wiki administrators have been forced to disable Google Translate functionality entirely because the vast majority of requests coming through the tool were purely abusive.
Inefficiency and Infrastructure Strain
What makes this surge particularly damaging is the sheer inefficiency of the scraping methods. Many of these bots ignore sitemaps and robots.txt entirely, crawling blindly through a site’s architecture. For a wiki like OSRS, which has roughly 40,000 primary articles but billions of navigable URLs (including every historical revision and edit screen), blind crawling of this kind is catastrophic.
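One way to picture the gap between articles and navigable URLs is to sort requests into plain article views and revision, diff, or edit views. The sketch below assumes MediaWiki-style query parameters (action=edit, action=history, diff, oldid). The parameter sets, the wiki.example URLs, and the is_expensive helper are illustrative assumptions, not the actual rules used by any wiki host.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed markers of dynamically generated pages on a MediaWiki-style site.
EXPENSIVE_ACTIONS = {"edit", "history"}
EXPENSIVE_PARAMS = {"diff", "oldid"}


def is_expensive(url):
    """Return True if url looks like an edit, history, diff, or old-revision view."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if any(param in query for param in EXPENSIVE_PARAMS):
        return True
    return any(action in EXPENSIVE_ACTIONS for action in query.get("action", []))


crawl = [
    "https://wiki.example/w/Dragon",
    "https://wiki.example/w/index.php?title=Dragon&action=edit",
    "https://wiki.example/w/index.php?title=Dragon&action=history",
    "https://wiki.example/w/index.php?title=Dragon&diff=123&oldid=122",
]

for url in crawl:
    print(is_expensive(url), url)
```

In this sample a single article yields one plain view and three dynamically generated views. A crawler that follows every link multiplies that ratio across every revision of every page.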
These ‘junk’ requests are far more expensive to serve than standard page views. While a cached page might load in 20 milliseconds, an old page diff or a special edit screen can take up to two seconds of processing time, roughly 100 times longer. As a result, CPU time, not bandwidth, becomes the limiting resource. When these scrapers operate in bursts of over 1,000 requests per second, the result is indistinguishable from a distributed denial-of-service (DDoS) attack, leading to site-wide slowness and frequent outages.
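The figures above can be turned into a back-of-the-envelope load estimate. The sketch assumes, purely for illustration, that each request keeps one CPU core busy for its full service time and that every uncached request hits the two-second upper bound. The busy_cores helper is a hypothetical name.

```python
def busy_cores(requests_per_second, service_time_ms):
    """Average number of cores kept busy: arrival rate times service time."""
    return requests_per_second * service_time_ms / 1000


cached = busy_cores(1000, 20)      # 1,000 req/s at 20 ms each
uncached = busy_cores(1000, 2000)  # 1,000 req/s at 2 s each

print(cached)             # 20.0
print(uncached)           # 2000.0
print(uncached / cached)  # 100.0
```

Under these assumptions the same request rate needs about 100 times the compute when it lands on diffs and edit screens instead of cached pages, while the volume of data transferred need not change much.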
The identity of the entities behind this surge remains a mystery. Whether they are data brokers, frontier AI labs ‘double-dipping’ into data, or independent projects with access to cheap residential proxies, the impact is the same: the infrastructure of the open web is being cannibalized to fuel the next generation of AI.