Why can’t web crawlers use some ethics…and some intelligence?
So I’ve been up since 4am battling what is essentially a Distributed Denial of Service attack…basically a bunch of computers sending requests to our web servers over and over and over again. After two hours, the culprit was found and disabled.
The culprit was 80legs, a company that offers to crawl data on websites via customizable code. Their business practices, however, are definitely questionable…a Google search is most enlightening. This web crawler hit our site over 7,000 times in a 10-minute span, which works out to roughly a dozen requests per second, sustained. And based on that Google search, we are not the only ones.
Now there are a couple of things I simply don’t understand. First of all, who’s the genius at 80legs.com who thinks hitting any site on the web at this volume is a good idea? I understand that they have a business and that they are selling crawling technology, but how much do they expect to sell if the end result of running their crawler is that the unwitting victim immediately blocks it? Certainly whoever is paying them to crawl our site is now going to be disappointed.
Second, why would anyone think this sort of crawling is ethical in this day and age of botnets and hackers? If I were building a business on this technology, I would at a minimum make sure targets could remove themselves from the line of fire (80legs claims it does, but it doesn’t work…they don’t honor robots.txt like they say they do), and I would keep my bot’s request rate within reason. Google, Bing, and Yahoo all crawl the web without causing mass chaos and overwhelming servers. Certainly if you have the intelligence to write a crawler, you have the intelligence to throttle one.
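To show how low the bar is, here’s a rough sketch in Python of what I mean by a polite crawler. The user agent name, delay, and URLs are made up for illustration, but checking robots.txt and throttling your requests takes maybe a dozen lines of standard library code:

```python
import time
import urllib.robotparser
import urllib.request

USER_AGENT = "ExampleBot/1.0"   # hypothetical crawler name, for illustration only
DEFAULT_DELAY = 5               # seconds between requests; pick something sane

def polite_crawl(base_url, paths):
    # Fetch and parse the site's robots.txt before crawling anything.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(base_url.rstrip("/") + "/robots.txt")
    robots.read()

    # If the site declares a Crawl-delay for this user agent, respect it.
    delay = robots.crawl_delay(USER_AGENT) or DEFAULT_DELAY

    for path in paths:
        url = base_url.rstrip("/") + path
        # Skip anything robots.txt says we may not fetch.
        if not robots.can_fetch(USER_AGENT, url):
            print("robots.txt disallows", url, "- skipping")
            continue
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            print(url, response.status, len(response.read()), "bytes")
        # Throttle: one request every few seconds, not 7,000 in ten minutes.
        time.sleep(delay)

if __name__ == "__main__":
    # Example usage against a placeholder site.
    polite_crawl("http://example.com", ["/", "/about", "/contact"])
```

That’s it. Respect the disallow rules, respect the crawl delay, and sleep between requests. None of it is hard.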
Or maybe my standards are too high.