Block web bots using Apache config files

Web crawlers and bots can be blocked server-wide using mod_rewrite rules in httpd.conf or in a global include file.

Note: By default, mod_rewrite configuration settings in the main server context are not inherited by virtual hosts. To make the main server rules apply, each <VirtualHost> section needs a RewriteOptions Inherit directive.
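
On a server managed by hand (not cPanel), that simply means adding the directive to each virtual host yourself. A minimal sketch, assuming a placeholder domain and document root:

<VirtualHost *:80>
    ServerName example.com
    DocumentRoot /home/example/public_html

    # Pull in the mod_rewrite rules defined in the main server context
    RewriteOptions Inherit
</VirtualHost>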

On cPanel servers this means modifying the global Apache templates under /var/cpanel/templates/apache2: copy vhost.default and ssl_vhost.default to vhost.local and ssl_vhost.local, add the directive to both .local files, and then rebuild the Apache configuration. A good spot for the directive is just after the vhost.hascgi section.

cd /var/cpanel/templates/apache2
cp vhost.default vhost.local
cp ssl_vhost.default ssl_vhost.local
vim *.local

In each .local file, locate the vhost.hascgi block and add the RewriteOptions Inherit line after it, so the section looks like this:

[% IF !vhost.hascgi -%]
    Options -ExecCGI -Includes
    RemoveHandler cgi-script .cgi .pl .plx .ppl .perl
[% END -%]
    RewriteOptions Inherit

Save changes and rebuild the httpd.conf file.

/scripts/rebuildhttpdconf
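
To confirm the directive landed in the rebuilt configuration, a quick grep works (the path below is the standard cPanel location; adjust it if your build differs):

grep -c 'RewriteOptions Inherit' /usr/local/apache/conf/httpd.conf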

Add global rules to the /usr/local/apache/conf/includes/pre_virtualhost_global.conf file.

Example rules:

<IfModule mod_rewrite.c>
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} sosospider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} BaiduSpider [NC]

# Always allow robots.txt and the 403 error document;
# without the 403 exclusion the rule loops when the error page is requested
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{REQUEST_URI} !^/403\.shtml$
RewriteRule ^.* - [F,L]
</IfModule>
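
The /403.shtml exclusion matches cPanel's default 403 error document. If your server serves its 403 page from a different path, exclude that path instead; the location below is only an assumed example:

# Assumption: the 403 error document lives at /errors/403.html
RewriteCond %{REQUEST_URI} !^/errors/403\.html$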

Another example, this time blocking by referer:

# block unwanted referers
<IfModule mod_rewrite.c>
RewriteEngine on
RewriteCond %{HTTP_REFERER} buttons-for-website\.com [NC,OR]
RewriteCond %{HTTP_REFERER} lax1\.ib\.adnxs\.com [NC,OR]
RewriteCond %{HTTP_REFERER} forum\.topic2988964.darodar\.com [NC,OR]
RewriteCond %{HTTP_REFERER} ilovevitaly\.co [NC,OR]
RewriteCond %{HTTP_REFERER} priceg\.com [NC,OR]
RewriteCond %{HTTP_REFERER} blackhatworth\.com [NC,OR]
RewriteCond %{HTTP_REFERER} hulfingtonpost\.com [NC,OR]
RewriteCond %{HTTP_REFERER} semalt\.com [NC,OR]
RewriteCond %{HTTP_REFERER} ilovevitaly\.com [NC,OR]
RewriteCond %{HTTP_REFERER} cenoval\.ru [NC,OR]
RewriteCond %{HTTP_REFERER} sprtkw\.com [NC,OR]
RewriteCond %{HTTP_REFERER} xvideos\.com [NC,OR]
RewriteCond %{HTTP_REFERER} dansmovies\.com [NC,OR]
RewriteCond %{HTTP_REFERER} i-part\.com\.tw [NC,OR]
RewriteCond %{HTTP_REFERER} t007\.uee\.ly200\.net [NC,OR]
RewriteCond %{HTTP_REFERER} 18avday\.com [NC,OR]
RewriteCond %{HTTP_REFERER} 141jav\.com [NC,OR]
RewriteCond %{HTTP_REFERER} cs600\.wpc\.edgecastdns\.net [NC,OR]
RewriteCond %{HTTP_REFERER} dongtaiwang\.com [NC,OR]
RewriteCond %{HTTP_REFERER} funny*\.tv [NC,OR]
RewriteCond %{HTTP_REFERER} gm99\.com [NC,OR]
RewriteCond %{HTTP_REFERER} lichaozheng\.info [NC,OR]
RewriteCond %{HTTP_REFERER} plming\.tistory\.com [NC,OR]
RewriteCond %{HTTP_REFERER} ticklingtv\.com [NC,OR]
RewriteCond %{HTTP_REFERER} dualshine\.com [NC,OR]
RewriteCond %{HTTP_REFERER} hcomicbook\.com [NC,OR]
RewriteCond %{HTTP_REFERER} nicovideo\.jp [NC,OR]
RewriteCond %{HTTP_REFERER} jamo\.tv [NC,OR]
RewriteCond %{HTTP_REFERER} eyosup\.com [NC,OR]
RewriteCond %{HTTP_REFERER} fatstube\.com [NC,OR]
RewriteCond %{HTTP_REFERER} aboluowang\.com [NC,OR]
RewriteCond %{HTTP_REFERER} gayteentwink\.com [NC,OR]
RewriteCond %{HTTP_REFERER} offers.bycontext\.com [NC,OR]
RewriteCond %{HTTP_REFERER} 40maturetube\.com [NC,OR]
RewriteCond %{HTTP_REFERER} able-hd\.com [NC,OR]
RewriteCond %{HTTP_REFERER} blog.naver\.com [NC,OR]
RewriteCond %{HTTP_REFERER} clicrbs\.com\.br [NC,OR]
RewriteCond %{HTTP_REFERER} epochtimes\.com [NC,OR]
RewriteCond %{HTTP_REFERER} tap2-cdn\.rubiconproject\.com [NC,OR]
RewriteCond %{HTTP_REFERER} paisleyanne\.com [NC,OR]
RewriteCond %{HTTP_REFERER} cartier\.cn [NC]
RewriteRule .* - [F]
</IfModule>
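
A long chain of conditions like this is easier to maintain as a single alternation. A sketch of the equivalent form, showing only a few of the referers from the list above:

<IfModule mod_rewrite.c>
RewriteEngine on
# One condition covering several of the referers listed above
RewriteCond %{HTTP_REFERER} (semalt\.com|buttons-for-website\.com|darodar\.com|ilovevitaly\.|priceg\.com) [NC]
RewriteRule .* - [F]
</IfModule>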

Save your changes and restart Apache.
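
On a cPanel server Apache can be restarted with the bundled script; on other systems use your init system (the second command assumes a systemd unit named httpd):

/scripts/restartsrv_httpd
# or, on a non-cPanel systemd host:
systemctl restart httpd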

Test with curl to verify that matching requests are blocked; a user agent matching one of the conditions should get a 403 Forbidden response.

curl -I -A 'Baiduspider' http://example.com
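
The referer rules can be checked the same way; example.com stands in for a domain hosted on the server:

curl -I -e 'http://semalt.com/some-page' http://example.com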

Blocking can also be done locally in a site's .htaccess file:

# BLOCK USER AGENTS
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} xenu [NC,OR]
RewriteCond %{HTTP_USER_AGENT} nutch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} curl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} heritrix [NC,OR]
RewriteCond %{HTTP_USER_AGENT} wget [NC]
RewriteRule !^robots\.txt$ - [F]

# BLOCK BLANK USER AGENTS
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule ^ - [F]
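
Both rules can be verified with curl as well; the first request matches the wget condition, and the second clears the User-Agent so the blank-agent rule fires (example.com stands in for the site carrying the .htaccess file):

curl -I -A 'Wget/1.21' http://example.com
curl -I -A '' http://example.com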

Blocking Spiders via robots.txt

Shamelessly stolen from http://searchenginewatch.com/sew/news/2067357/bye-bye-crawler-blocking-parasites.

For a general introduction to the robots.txt protocol, please see: http://www.robotstxt.org/

Search engines are expected to publish the robots.txt code needed to deny their spiders access to a site's pages, and the page documenting that code should be easy to find.

Regrettably, most of the spiders listed below document their robots.txt handling only in Chinese, Japanese, Russian, or Korean, which is not much help to the average English-speaking webmaster.

The following list provides info links for webmasters along with the robots.txt code to deploy to block each spider.

Yandex (RU)
Info: http://yandex.com/bots gives us no information on Yandex-specific robots.txt usage.

Required robots.txt code:

User-agent: Yandex
Disallow: /

Goo (JP) 
Info (Japanese): http://help.goo.ne.jp/help/article/704/
Info (English): http://help.goo.ne.jp/help/article/853/

Required robots.txt code:

User-agent: moget
User-agent: ichiro
Disallow: /

Naver (KR) 
Info: http://help.naver.com/customer/etc/webDocument02.nhn

Required robots.txt code:

User-agent: NaverBot
User-agent: Yeti
Disallow: /

Baidu (CN) 
Info: http://www.baidu.com/search/spider.htm

Required robots.txt code:

User-agent: Baiduspider
User-agent: Baiduspider-video
User-agent: Baiduspider-image
Disallow: /

SoGou (CN) 
Info: http://www.sogou.com/docs/help/webmasters.htm#07

Required robots.txt code:

User-agent: sogou spider
Disallow: /

Youdao (CN) 
Info: http://www.youdao.com/help/webmaster/spider/

Required robots.txt code:

User-agent: YoudaoBot
Disallow: /
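
To deny all of the spiders above at once, the individual blocks can be combined into a single robots.txt at the document root, for example:

User-agent: Yandex
User-agent: moget
User-agent: ichiro
User-agent: NaverBot
User-agent: Yeti
User-agent: Baiduspider
User-agent: Baiduspider-video
User-agent: Baiduspider-image
User-agent: sogou spider
User-agent: YoudaoBot
Disallow: /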

Because the robots.txt protocol doesn't allow blocking by IP address, you'll have to resort to one of the two following methods to block Copyscape spiders.