What can you do when web robots are hammering your site?
Is your web site constantly getting hit by automated web robots, causing service denials and excess bandwidth consumption?
What are automated web robots?
Automated web robots are a necessary and usually a "good thing" on the web. A web robot is a program that automates a specific task on the web. For example, almost all search engines use robots to grab copies of your web pages and index them so that other Internet users can find your pages. Such a robot may also revisit your pages from time to time to check whether they have been updated and re-index them if necessary. Another example of a robot is a program that follows the links on a web page and retrieves all related pages, images, and other objects, making a copy of the web site on your computer that you can read off-line.
What's the problem here?
So what's the problem, you ask? Most web sites are designed to be read by real humans, not robots. The main difference is that a person follows only the links that interest him or her, and does so at a "human speed," while a robot can be set up to download all linked pages and objects on those pages at speeds limited only by how quickly the web server and the Internet connection can serve them. To add to the problem, some robots may download multiple pages and objects at the same time.
So what? Someone liked my pages enough to read them off-line!
Isn't it a sign that someone liked your web pages if they took the trouble of downloading them to view at a later time? Probably, if they knew the content of your pages beforehand. Whether you have hundreds of small pages, a smaller number of large pages, or just a single time-consuming script, a web robot can have problematic side effects on your site, such as:
- Temporary denial of service caused by heavy bursts of page downloading. Other visitors to your web site may not be able to access your pages at normal speed, or in extreme cases at all, because a robot is busy downloading your pages in a heavy burst and consuming the majority of your web server resources and Internet connection bandwidth.
- Excessive consumption of bandwidth. As explained earlier, because the content that robots download may or may not be useful to a person at a later time, the chances of content being downloaded and then thrown away are higher, which wastes your bandwidth. If you pay for bandwidth by usage, or if you have daily or monthly bandwidth limits, frequent visits from robots could consume a large portion of this bandwidth.
Your web site may not be affected by robots
NOTE:
It's very important to note that you should not be afraid of robots, because most robots, such as search engine robots, are friendly. They will go through your site at a pace that your web server can handle and will usually obey your suggestions on how your pages should be indexed. If you have a small personal site with pages that load quickly, and/or you do not use server-side scripts that take a long time to execute, it's unlikely that even an unfriendly robot could cause any significant problems to your site. Also note that most web sites can handle the "burst effect" of an unfriendly robot once in a while. It's when such a robot visits your site frequently that you may have to take action. Unless your pages are updated often, it's unlikely that someone would set up a robot to download them that frequently.
Possible preventative steps you can take
- Check whether you're actually being visited by robots often enough before taking any preventative measures:
You can check whether robots are visiting your pages by examining your web server log files. Look at the "user agent" field (browser name/signature) for unusual names, the frequency of their visits, and the type of files being retrieved; a simple log-scanning sketch follows this item. You may not have to take any further steps if robots aren't causing significant problems, other than setting up a robots.txt file as described below.
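If your server writes logs in the common "combined" format, where the user agent is the last quoted field on each line, a short Perl script can give you a quick tally of which user agents visit most often. This is only a sketch: the log file path is an assumption, and the log format may differ on your server.
#!/usr/bin/perl -w
use strict;

# hypothetical log location -- adjust for your server
my $logfile = '/var/log/httpd/access_log';
my %agents;

open( LOG, $logfile ) or die "Can't open $logfile: $!\n";
while( <LOG> )
{
    # in the "combined" log format the user agent is the
    # last quoted field on the line
    if( /"([^"]*)"\s*$/ )
    {
        $agents{ $1 }++;
    }
}
close( LOG );

# print the most frequently seen user agents first
foreach my $ua ( sort { $agents{$b} <=> $agents{$a} } keys %agents )
{
    print "$agents{$ua}\t$ua\n";
}
Unusually high counts for a single user agent, or a user agent name you don't recognize, are good starting points for a closer look at what that visitor is retrieving.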
- Use the robots.txt file, meta tags, and other ways to suggest to robots how your site should be indexed (or not indexed): You should first try suggesting to robots how you would like your site to be indexed or downloaded; a sample robots.txt file is shown below. Unfortunately, these are just suggestions, and unfriendly robots may simply ignore them and continue to download all pages anyway.
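As a minimal illustration, a robots.txt file placed in the root directory of your site might look like the following; the robot name and directory names are examples only.
# sample robots.txt -- robot name and directories are examples only
User-agent: BadRobot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Individual pages can also carry a <meta name="robots" content="noindex,nofollow"> tag in their <head> section, which well-behaved robots interpret as a request not to index that page or follow its links.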
- Convert time-consuming scripts to static pages if possible: It takes longer for a web server to execute scripts that generate web pages than to serve static HTML pages. Because of this, your server is less likely to suffer a denial of service when robots download static files in heavy bursts. If possible, save the results of time-consuming scripts to static web pages and serve those files instead (a small sketch follows this item). There are many tools that can help you convert scripts to multiple HTML files. If you can use ActiveX server-side components (within ASP scripts, for example), you may be able to use our free URL2File component.
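One approach is to run the time-consuming script on a schedule (from cron or the NT task scheduler, for example) and save its output to a static file that the web server can hand out cheaply. The following is a minimal sketch, assuming the LWP::Simple module is installed; the URL and output path are made-up examples.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;

# the URL and output path below are examples only
my $url  = 'http://www.example.com/scripts/complex-graph.asp';
my $file = '/usr/local/www/htdocs/complex-graph.html';

# fetch the expensive script's output once and save it as a
# static page that the web server can serve cheaply
my $status = getstore( $url, $file );
die "Couldn't regenerate $file (HTTP status $status)\n"
    unless $status == 200;
Visitors (and robots) then request the static complex-graph.html page, and the expensive script runs only as often as you schedule it.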
- Check whether a possible robot requested the script before performing time-consuming functions: If you have to use scripts on your web site, check whether a possibly unfriendly robot called the script before going into the time-consuming parts of the script. For example, if your script is supposed to display a complex graph that takes a long time to generate, exit the script gracefully when a known robot calls it, to reduce server resource consumption.
Following is a sample function written in ASP/PerlScript that checks for the presence of a robot. It can easily be modified, as documented in the comments, to be called from any regular Perl script.
#
# return values:
#   1 / true  = is a robot
#   0 / false = probably not a robot
#
sub IsWebRobot
{
    my( $temp );

    # use:
    #   my( $ua ) = $ENV{ 'HTTP_USER_AGENT' };
    # if using regular Perl
    my( $ua ) = $Request->ServerVariables(
                    'HTTP_USER_AGENT' )->item;

    # list of user agent IDs for known robots
    # that you want to avoid. partial names
    # are okay.
    my( @UA ) = (
        'robot-name-1',
        'robot-name-2',
        # ...
        'robot-name-n'
    );

    foreach $temp (@UA)
    {
        # \Q quotes any special characters in the name
        if($ua =~ /\Q$temp/i)
        {
            return 1;
        }
    }

    # use:
    #   my( $ra ) = $ENV{ 'REMOTE_ADDR' };
    # if using regular Perl
    my( $ra ) = $Request->ServerVariables(
                    'REMOTE_ADDR' )->item;

    # list of IPs / IP prefixes that you want to
    # restrict script access to
    my( @RA ) = (
        '0.0.0.',    # sample IPs / IP classes
        '0.0.',
        # ...
        '0.0.0.0'
    );

    foreach $temp (@RA)
    {
        # matching at the start of the address also
        # catches partial prefixes such as '0.0.0.'
        if(index($ra, $temp) == 0)
        {
            return 1;
        }
    }

    # didn't find a suspicious user agent
    # or an IP address/group
    return 0;
}
Listing #1: Perl code. Download chk4robo (0.6 KB).
Sample script using the above "IsWebRobot()" function:
# include the IsWebRobot() function at the top of the script,
# then check if we're being called by a robot
if(IsWebRobot())
{
    $Response->Write("Couldn't retrieve ...");
}
else
{
    # perform the actual task of the script
}
Listing #2: Perl code. Download sample (0.29 KB).
- If a particular IP address constantly causes problems even after taking the above steps, consider limiting access to your server: If you have positive proof that a particular IP address, or a particular range of addresses, is causing problems for your web server, try contacting the owners of those Internet connections. If all else fails, you may want to consider blocking such addresses at your web server (refer to your web server documentation for details; a sample Apache configuration fragment is shown below).
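For example, on the Apache web server, requests from specific addresses can be refused with the access-control directives below, placed in the server configuration or a .htaccess file. The directory path and addresses are placeholders; other servers, such as Internet Information Server, offer equivalent IP restriction settings in their administration tools.
# refuse requests from a single address and from a partial
# address range (the path and addresses are examples only)
<Directory "/usr/local/www/htdocs">
    Order allow,deny
    Allow from all
    Deny from 10.0.0.1
    Deny from 192.168.1.
</Directory>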
- Use server-specific components if necessary: If you're still having severe problems with unfriendly robots, writing a server-specific component may be another solution. This means writing an extension to your web server that dynamically examines the properties of incoming requests to check for possible robots. For example, Internet Information Server users can write ISAPI filters, while Apache web server users can write modules or add extensions to the source code itself.
Don't forget, web robots themselves, such as programs that automate the downloading of complete web sites, are not necessarily the problem here. It's the problematic use of such tools by a small number of users, trying to download the largest amount of content in the least amount of time, that creates heavy demands on web servers and Internet connections and creates headaches for webmasters.