Why is MSNBot ignoring robots.txt?

Today, the root file system on our public SVN server nearly ran out of disk space. The reason? The /tmp directory was quickly filling up with temporary files created by websvn, which I had set up alongside the FishEye repository browser for testing purposes. A quick investigation of the Apache log files revealed the culprit: a crawler from Microsoft was running haywire and ignoring the rules in the robots.txt file, even though it had actually fetched the file beforehand!

Here is what robots.txt looked like (I have now changed it to disallow everything):

User-agent: *
Disallow: /fisheye/
Disallow: /websvn/
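
To check which request paths these rules cover, a small shell function can do plain prefix matching against the Disallow lines. This is only a sketch: is_disallowed is a made-up helper name, and it assumes a robots.txt with a single "User-agent: *" group and simple path prefixes (no wildcards, no Allow lines), as in the file above.

```shell
#!/bin/sh
# is_disallowed ROBOTS_FILE URL_PATH   (hypothetical helper, not a standard tool)
# Returns 0 if URL_PATH starts with any non-empty Disallow prefix in
# ROBOTS_FILE, and 1 otherwise.
# Simplification: every Disallow line is treated as applying to all
# crawlers, which only holds for a file with one "User-agent: *" group.
is_disallowed() {
    awk -v path="$2" '
        # Match "Disallow: /some/prefix" lines, ignoring case of the keyword.
        tolower($1) == "disallow:" && $2 != "" {
            # index() == 1 means the path begins with the prefix.
            if (index(path, $2) == 1) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

For example, `is_disallowed robots.txt '/websvn/rss.php?repname=Eventum' && echo blocked` prints "blocked" for the rules shown above, while a path such as /index.html is not matched.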

If I am not mistaken, no crawler should even consider going into the SVN browser directories. Some snippets from the Apache log:

$ grep robots.txt /var/log/apache2/access_log | grep msn
65.55.208.178 - - [03/Aug/2008:16:58:35 +0200] "GET /robots.txt HTTP/1.1" 200 53 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.212.64 - - [03/Aug/2008:19:05:55 +0200] "GET /robots.txt HTTP/1.0" 200 53 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
65.55.235.139 - - [03/Aug/2008:22:14:47 +0200] "GET /robots.txt HTTP/1.0" 200 53 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
65.55.25.136 - - [04/Aug/2008:00:31:32 +0200] "GET /robots.txt HTTP/1.1" 200 53 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.212.64 - - [04/Aug/2008:00:57:38 +0200] "GET /robots.txt HTTP/1.0" 200 53 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
65.55.235.139 - - [04/Aug/2008:06:49:33 +0200] "GET /robots.txt HTTP/1.0" 200 53 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
65.55.212.64 - - [04/Aug/2008:07:16:21 +0200] "GET /robots.txt HTTP/1.0" 200 53 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
65.55.25.136 - - [04/Aug/2008:09:29:17 +0200] "GET /robots.txt HTTP/1.1" 200 53 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.104.156 - - [04/Aug/2008:11:08:24 +0200] "GET /robots.txt HTTP/1.1" 200 53 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.208.164 - - [04/Aug/2008:11:29:34 +0200] "GET /robots.txt HTTP/1.1" 200 53 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.212.64 - - [05/Aug/2008:13:30:20 +0200] "GET /robots.txt HTTP/1.0" 200 53 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
65.55.208.178 - - [05/Aug/2008:16:17:59 +0200] "GET /robots.txt HTTP/1.1" 200 53 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"

Good boy, it checks the robots.txt file. But what is this?

$ grep msnbot /var/log/apache2/access_log | tail -20
65.55.208.164 - - [05/Aug/2008:22:48:15 +0200] "GET /websvn/filedetails.php?repname=MySQL+Documentation&path=%2Fworkbench%2Fall-entities.ent&rev=9981&sc=1 HTTP/1.1" 200 6408 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.208.164 - - [05/Aug/2008:22:48:15 +0200] "GET /websvn/dl.php?repname=MySQL+Connector%2FJ&path=%2Fbranches%2Fbranch_5_0%2Fconnector-j%2F&rev=6600&isdir=1 HTTP/1.1" 200 40960 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.208.164 - - [05/Aug/2008:22:48:19 +0200] "GET /websvn/rss.php?repname=MySQL+Documentation&path=%2Fproto-doc%2F&rev=9994&sc=1&isdir=1 HTTP/1.1" 200 36907 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.208.164 - - [05/Aug/2008:22:48:21 +0200] "GET /websvn/rss.php?repname=MySQL+Documentation&path=%2Ffalcon%2F&rev=8323&sc=0&isdir=1 HTTP/1.1" 200 15278 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.208.164 - - [05/Aug/2008:22:48:21 +0200] "GET /websvn/rss.php?repname=MySQL+Proxy&path=%2Ftrunk%2FDoxyfile&rev=365&sc=1&isdir=0 HTTP/1.1" 200 4162 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.208.164 - - [05/Aug/2008:22:48:21 +0200] "GET /websvn/rss.php?repname=Eventum&path=%2Feventum%2Freports%2F&rev=3542&sc=1&isdir=1 HTTP/1.1" 200 90591 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.208.164 - - [05/Aug/2008:22:48:23 +0200] "GET /websvn/log.php?repname=MySQL+Documentation&path=%2Fndbapi%2F&rev=9749&sc=0&isdir=1 HTTP/1.1" 200 21440 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.208.164 - - [05/Aug/2008:22:48:23 +0200] "GET /websvn/log.php?repname=MySQL+Documentation&path=%2Ffalcon%2F&rev=8511&sc=0&isdir=1 HTTP/1.1" 200 18541 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"

As you can see, it is happily crawling everything below /websvn/, which also includes links named "Tarball". Guess what they are good for? Yes, they create tarballs of a given SVN directory, using /tmp to build up the archive file... Within a very short time, this used up more than 6 GB of disk space, because websvn apparently leaves these temporary directories behind if the connection gets aborted or times out. We do have a cron job that removes files older than a certain number of days from /tmp, but the directory currently fills up much faster than the cron job clears it. I need to investigate whether it is actually a bug in websvn to leave these temporary directories behind.
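
To see which websvn scripts the crawler requested most often, the access log can be summarized per script. The following is a minimal sketch, assuming Apache's combined log format (where the request target is the seventh whitespace-separated field); hits_per_script is a hypothetical helper name, and the "msnbot" pattern also matches msnbot-media.

```shell
#!/bin/sh
# hits_per_script LOGFILE   (hypothetical helper)
# Prints "COUNT SCRIPT" for every script below /websvn/ that was requested
# by a client whose log line contains "msnbot", most frequent first.
hits_per_script() {
    grep 'msnbot' "$1" |
    awk '{
        url = $7                            # request target, e.g. /websvn/rss.php?...
        if (index(url, "/websvn/") != 1) next   # keep only paths below /websvn/
        sub(/\?.*/, "", url)                # drop the query string
        count[url]++
    }
    END { for (u in count) print count[u], u }' |
    sort -rn
}
```

It could be run as `hits_per_script /var/log/apache2/access_log`. If dl.php is the script behind the "Tarball" links (an assumption worth checking against the installed websvn version), its count gives a rough idea of how many archives the bot asked for.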

Hello Microsoft? Can you please fix your bots so that they not only read but also honor robots.txt files, and stop DoSing our site? Thanks :-)
