This is a copy of a support e-mail exchange that others might find useful.
=================================
Customer
=================================
Today, I want to direct your attention to your SPIDER SPEED.
I bet most people writing you about spidering want to make
things faster. In contrast, I am suggesting building into
Zoom 5.1 some mechanism to make your spider run SLOWER,
sometimes considerably slower.
Let me explain:
In a scenario where the Zoom owner runs the spider against
their own dedicated web server, maximum speed is
naturally desirable.
However, in a shared-hosting scenario, and even more so when
running your program against someone else's server, the sheer
number of pages accessed in rapid succession may:
a) alarm the webmaster that the site is under attack
b) on a shared server, have the ISP cancel the site entirely
c) on larger sophisticated sites, trigger their anti-attack
countermeasures, and have access blocked from your IP
and possibly other undesirable effects.
I am sure you know large spiders like Google, Yahoo and
others access web sites rather slowly, meaning with pauses
between pages ranging from 1 second to more than 20.
Discussing the subject with Amazon.com engineers (not that
I would like to index their site, but I know them well and they
have a LOT of insight) I was told their site would block my
access in a heartbeat; to be precise after the first 500 pages
accessed in rapid succession. The access blockage is triggered
automatically.
It is not the number of pages that would get me in trouble,
it is the speed of the access. Slowing down my access to, say,
one page every 5 seconds would let me index their entire site,
if I so desire - but your program does not make it possible.
<snip>
I can imagine at this point you will probably say I can reduce
the number of threads all the way to 1. I think (don't hold me
to it) I saw somewhere in your FAQs that setting the number of
threads to 1 makes your indexer behave like 1 regular site user.
<snip>
I would suggest giving the user a sliding scale from
zero (0) to 300, where zero would insert no pause at all
while the maximum of 300 would insert a 30-second pause.
In other words, the user could fine-tune your spider speed
in increments of 1/10 of a second, from zero to 30 secs.
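To make the suggestion concrete, here is a minimal sketch of such
a throttle (plain Python, nothing to do with Zoom's internals; the
function and parameter names are made up for illustration). The
slider value of 0 to 300 simply maps to a pause of value/10 seconds
between single-threaded page requests:

    import time
    import urllib.request

    def crawl(urls, slider=0):
        """Fetch each URL in turn, pausing slider/10 seconds between requests.

        slider: 0..300, where 0 means no pause and 300 means a 30-second pause.
        """
        delay = max(0, min(slider, 300)) / 10.0   # clamp, then convert to seconds
        pages = {}
        for url in urls:
            with urllib.request.urlopen(url) as resp:
                pages[url] = resp.read()
            time.sleep(delay)                     # politeness pause before the next page
        return pages

    # Example: roughly one page every 5 seconds, as in the Amazon example above.
    # crawl(["http://example.com/a.html", "http://example.com/b.html"], slider=50)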
=================================
Response
=================================
Yes, you are correct that Zoom can place a reasonable amount of load on a server if you are using a lot of threads.
And yes, a throttle control is already on our to-do list, probably for a 5.1 release, but that isn't certain yet, as V5.1 hasn't been defined yet.
However, I would argue that the current situation isn't nearly as bad as you think it is, because:
1) Servers that are under load start to report errors, and on the vast majority of sites we don't see any errors, even with 10 threads running. In fact, I think you would be hard pressed to find a site that does have a problem. Can you point to an example?
2) The load from 1 PC running Zoom is a drop in the bandwidth ocean for big sites like Amazon.
3) Google and other big engines have 100,000+ machines that could hit and disable a site, so they need to be careful. Zoom only runs on 1 PC, so there is a natural limit on how much load you can generate from a single machine.
4) There is an argument that you are better off hitting a site harder at a known quiet time (e.g. 4 a.m.) rather than adding more background load during your busy period.
5) Amazon certainly doesn't block in a 'heartbeat', and doesn't block after 500 hits either. We ran a test and got to over 600 hits (with 1 thread) without any blocking. To be fair, a lot of these hits just resulted in Amazon redirecting the HTTP request to another page, but they aren't as draconian as their engineers make out.
6) Servers naturally throttle their own load. In the Amazon example above we only got about 1 hit per second. If your site allows 30 hits/second, it can probably deal with that kind of load from 1 PC without a significant problem.
7) Google and some of the other big engines do hit a site multiple times per second (at least that's what our logs show). But I agree they are significantly slower than Zoom.
8) A reasonable server can serve 20 to 100 HTTP requests per second, depending on caching, page type, etc. We get a peak of 83 pages/sec from our server. But I am betting the figures you are seeing with 1 thread from Zoom are more like 2 to 3 pages per second (~5% of total capacity).
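As a rough sanity check on those figures (this is just the arithmetic implied by the numbers quoted above, not a new measurement):

    # ~20-100 requests/sec of server capacity vs ~2-3 pages/sec from Zoom with 1 thread
    server_capacity = (20, 100)   # requests per second
    zoom_rate = (2, 3)            # pages per second with 1 thread

    best_case = zoom_rate[0] / server_capacity[1]    # 2 / 100 = 2% of capacity
    worst_case = zoom_rate[1] / server_capacity[0]   # 3 / 20  = 15% of capacity
    print(f"Roughly {best_case:.0%} to {worst_case:.0%} of server capacity")
    # Around the middle of both ranges (~2.5 pages/sec vs ~50 requests/sec) that is ~5%.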
In 6 years of selling Zoom, we have never had a report of an IP address being blocked, nor of anyone's account being cancelled by an ISP.
Of course, you can present it in such a way as to make it sound much scarier than it is.
This is not to say you don't have a valid point, because you do, and we will address the issue. But it doesn't cause nearly as many problems in the real world as you might imagine.