Robots attack

Once you have a Web site that gets a moderate amount of traffic, it’s likely to attract the attention of "robots" that crawl your site to archive or index its content, or otherwise make the site available offline. On one of our sites, the number of "not viewed" pages is more than three times the number of (presumably) human-viewed pages, according to our AWStats numbers. Our hosting provider’s bandwidth allocation is generous, but because the site runs on a particularly resource-demanding application server, I’m much more concerned about the strain this activity puts on the server’s other resources, especially memory and CPU.

The real answer in our particular case is to tame the site with a caching strategy and other measures so that it tolerates high traffic better, but that takes an effort we haven’t had the time or resources to undertake. So Plan B is to take a closer look at exactly who these robots are and whether they really need to be looking at your site, because there are a couple of different ways to block their access.

Some of them, like Googlebot, you obviously want to give unrestricted access to your site, because they’re the mechanism by which search engines index your site and let their users find your content through keyword searches.

These legitimate bots, though, are not as well-behaved as you might expect. Yahoo!’s "Slurp" indexing bot, for example, single-handedly accounted for well over a gigabyte of bandwidth (and well over 10% of the traffic) in a month (it’s an extensive site, but not that extensive). Is this really necessary? I don’t know enough about how these bots work to know for sure, but it strikes me as excessive.

These bots typically identify themselves in your Web server access logs along with other information such as the IP address:

MISSING TEXT

Legitimate search engines can be instructed not to crawl your site, or to ignore sections of it, using the robots exclusion standard (a robots.txt file at the root of the site), but most non-mainstream robots ignore this file anyway, so a more reliable method is to use Apache’s Rewrite module.

So if you wanted to block this bot, you might use Apache’s Rewrite module to send it a "forbidden" response like so:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^(.*)Yahoo(.*)
RewriteRule .* - [F,L]

(You can probably use regular expressions with a bit more precision than I do here.)

There’s the additional matter of whether these bots are really who they say they are. The information illustrated in the above access log entry can be falsified: the user agent string at the end is trivial to spoof, and the IP address can be disguised, for example by routing requests through proxies. Bad bots can even impersonate a browser like Microsoft Internet Explorer. If IPs aren’t being spoofed, you can also use Apache Rewrites to block by IP, but if they are, what can be done?
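
For the case where the address in the log is genuine, blocking by IP might look something like the following. This is only a sketch: the addresses are placeholders from the ranges reserved for documentation, not addresses taken from our logs, and you would substitute whatever your own access logs show.

```apache
# Sketch only: 192.0.2.15 and 198.51.100.x are placeholder addresses
# from reserved documentation ranges, not real offenders.
RewriteEngine On
# Match one exact client address (dots escaped so they match literally).
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.15$ [OR]
# Or match a whole block of addresses by prefix.
RewriteCond %{REMOTE_ADDR} ^198\.51\.100\.
# Answer any matching request with 403 Forbidden.
RewriteRule .* - [F,L]
```

The [OR] flag joins a condition to the one that follows it, so it belongs on every condition except the last.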

What I haven’t had much time to investigate is what purpose these rogue bots serve, and for whom. Going down that path just a little, you find yourself in a sordid world of things like referrer log spamming (which I now realize we’re also victims of) and in the company of lots of other folks who are trying to navigate the treacherous waters of a sea that used to be pretty calm as far as these things go.

For now, we’re sticking with the blunt instrument of blocking virtually all non-browser agents except Googlebot, and hoping that user agent spoofing is relatively rare.
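
A rough approximation of that kind of rule set is sketched below. It is not our actual configuration: the keyword list is an assumption chosen for illustration, and it only catches agents that announce themselves as crawlers, so it is really a keyword deny-list with a Googlebot exception rather than a true browsers-only allow-list.

```apache
# Sketch only: the keywords below are illustrative assumptions,
# not a vetted list of crawler user agents.
RewriteEngine On
# Skip the rule for anything that identifies itself as Googlebot.
RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
# Consecutive conditions are ANDed: the agent must also look like a crawler.
RewriteCond %{HTTP_USER_AGENT} (bot|crawler|spider|slurp) [NC]
# Answer any matching request with 403 Forbidden.
RewriteRule .* - [F,L]
```

Because the test relies entirely on the user agent string, a bot that claims to be Googlebot or a browser passes straight through, which is exactly the spoofing gap described above.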
