The following question came from a customer, but we have posted it here as it may help others.
========
In order to give you an idea about our testing so far, I describe everything
with plenty of detail. While it makes this email rather long, I see no
alternative.
Where I need your advice most is:
a) Speed of indexing (our project ran for almost 40 hours)
b) Speed of search execution (it takes about 6 seconds)
c) Result paging (changing pages with your template re-executes the entire search)
Details follow.
I have used ver 4.2 for a few days on small projects (+/-50,000 pages) just
to "learn the ropes". Then I switched to v5beta because v4.2 simply could
not handle the size of our sites. No surprise there, you warn about that
quite clearly.
- - - -
Hardware setup for testing:
The OS on all 3 machines is Win2000 Advanced Server.
The machine I used for creating the indexes is a dual-P3-Xeon 500MHz-2MB
cache, with 2GB RAM and quite large hard drives with plenty of free space.
The target website sits on IIS5, 2xP-Pro 200MHz and 512MB RAM. (Note: Our
web server farm was set up in 1999 when Win2000 came out, and since IIS5
runs with respectable speed on just about anything we have not felt the urge
to upgrade the web hosting hardware.)
The final indexes are installed in two places: On the indexed web server
itself (see previous paragraph), and also on another one with faster CPU:
Dual-PII-333MHz, same 512MB RAM.
- - - - -
Indexing environment:
I have used (naturally) the CGI version, in spider mode, since the indexing
machine and the web server are in different locations far apart from each
other.
The connection is a 7 Mbit DSL pipe, a figure that is rather misleading:
because the line runs in asymmetric mode and the Cisco DSL router is limited
to 896K in throughput, the actual connection speed was about 0.7 Mbit. Still,
that is a respectable speed, and I did not see the indexing machine being
starved for bandwidth.
Number of threads: 5 (I own both locations so I wasn't worried about causing
problems by usurping too much bandwidth).
"Verbose" was turned on since I wanted to get as much detail as possible.
ATTACHED: The configuration file from our latest test run,
"buysellmusic_superfile_F1M_W2-5M_verboseOn.zcfg"
- - - - -
I ran two separate indexing runs on the same set of approx. 408,000 files.
The files were plain .HTM design-wise, about 25K in size.
Those pages/website were selected simply because they already existed; their
actual page design is unimportant for the indexing test.
The only difference in page design between the 1st and 2nd run was the
insertion of your ZOOMSTOP/RESTART and ZOOMSTOPFOLLOW/RESTART tags. The
first run indexed the entire HTM pages; the second one excluded page
navigation menus and page headers/footers. This had a very positive effect
on index file sizes, total number of words, unique words, discarded external
links and many other figures.
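For readers unfamiliar with those tags: the effect is that anything between the stop and restart markers is excluded from the index. A rough sketch of the idea, assuming the tags are embedded as HTML comments (the exact comment syntax here is an assumption, not taken from the vendor's documentation):

```python
import re

def strip_zoom_sections(html: str) -> str:
    """Drop content between ZOOMSTOP and ZOOMRESTART comment markers,
    roughly mimicking what the indexer would skip. Illustrative only."""
    return re.sub(
        r"<!--\s*ZOOMSTOP\s*-->.*?<!--\s*ZOOMRESTART\s*-->",
        "",
        html,
        flags=re.DOTALL,  # markers may span multiple lines
    )

page = ("<p>Article body text</p>"
        "<!--ZOOMSTOP--><div>nav menu, footer</div><!--ZOOMRESTART-->"
        "<p>More body text</p>")
print(strip_zoom_sections(page))
# → <p>Article body text</p><p>More body text</p>
```

Only the body text survives, which is why the second run produced smaller index files and fewer unique words.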
- - - - -
The first project was set to 1 Mil pages and 1 Mil unique words.
The RAM requirement was about 1.5 GB, which the indexing machine had.
It stopped itself after about 27 hours because the 1 Mil unique-word limit
had been reached.
This project indexed entire HTM pages (see above).
So I changed the limits to 500K pages and 2.5 Mil unique words.
The RAM requirement was 2GB, which the machine had, but after Win2000
overhead it lacked about 300K.
The project ran with no problems regardless, and finished in approx. 40
hours. This project indexed only partial HTM pages, using the ZOOMSTOP tags (see above).
- - - - -
The few errors listed in the log were caused intentionally: I removed some
of the HTM pages from the target website after they were already in the
indexer's cache. Website management being a dynamic thing, I wanted to see
if your indexer would get confused by HTM pages removed during the indexing
process. It didn't.
- - - - -
The resulting index filesizes:
Although the indexer warned that the sizes may reach the 4GB limit, they in
fact remained way below it. That's probably because the HTM pages indexed are
relatively small (25K on average).
Project 1 (242,900 whole HTM pages; stopped after reaching 1 Mil unique
words):
zoom_dictionary.zdat 21 MB
zoom_pagedata.zdat 35 MB
zoom_pageinfo.zdat 7 MB
zoom_pagetext.zdat 219 MB
zoom_wordmap.zdat 302 MB
Project 2 (408,000 partial HTM pages with ZOOMSTOP tags):
zoom_dictionary.zdat 34 MB
zoom_pagedata.zdat 58 MB
zoom_pageinfo.zdat 12 MB
zoom_pagetext.zdat 238 MB
zoom_wordmap.zdat 366 MB
- - - - -
My questions (all pertain to ver. 5):
1) Indexing requirements:
It seems that even after the rewrite swapping the temp files to the hard
drive, the indexer still requires huge amounts of RAM for large jobs.
Is there perhaps a setting somewhere I have missed, or is the only solution
a really powerful (read: expensive) machine with 4/8GB RAM?
2) Indexing speed (408,000 pages):
408,000 pages with 5 threads in spider mode took 40 hours, meaning a daily
reindex is out of the question.
Any suggestions to cut the time down?
Number of threads: Would an increase (say to 10) speed up the indexing
appreciably?
What would you recommend as a maximum number of threads before the target webserver gets overwhelmed?
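Some back-of-the-envelope arithmetic on that run, assuming the 25K average page size quoted earlier, shows where the time may be going:

```python
pages = 408_000
hours = 40
avg_page_kb = 25  # average page size from the test set, per the email

pages_per_sec = pages / (hours * 3600)
kbit_per_sec = pages_per_sec * avg_page_kb * 8  # sustained download rate

print(f"{pages_per_sec:.2f} pages/sec")  # ~2.83 pages/sec
print(f"{kbit_per_sec:.0f} Kbit/s")      # ~567 Kbit/s
```

The resulting ~567 Kbit/s is close to the ~0.7 Mbit effective link speed described above, which suggests the run may already be near bandwidth-bound; if so, raising the thread count alone would not cut the time much, and a faster link (or indexing closer to the web server) would matter more.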
3) Speed of search execution (408,000 pages):
Testing has established that the search takes about the same time regardless
of what I'm searching for, be it 1 result or 400,000. Also, executing 10
separate searches in parallel from 10 separate browser windows did not seem
to cause any appreciable slowdown. That's all good.
I have seen somewhere on your site or in the forums, however, that the
search should take about 3 seconds, and ours takes twice that.
To be exact it takes about 5.7 seconds on a dual-PII-333MHz machine, and
about 8.7 seconds on a dual-PPro-200MHz machine. This seems to indicate a
direct correlation between CPU speed and search time.
Are those figures on target, or would you consider them somewhat slow?
Any suggestions for speeding up the search? Files are completely
defragmented and the HDs are SCSI type.
4) Index results pagination: (This is probably my MOST IMPORTANT question)
I have used your "search_template.html" page unaltered, default settings of
10 results per page.
It appears that simply changing the page number re-runs the search from
scratch, causing a 6-second delay when just moving from page to page. I can
imagine the actual website users being annoyed by that.
Do you have a suggestion for how I can advance from page to page of results
without actually re-running the search every time, or is your program
written such that it HAS TO re-run the search anew for every page of the
results?
Please advise, this is really crucial.
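If the search program itself cannot keep state between requests, one common workaround is to put a small caching layer in front of it, so the expensive search runs once per query and page changes only slice the cached result list. This is a sketch of the general technique under stated assumptions, not a description of the vendor's CGI; `run_search` is a hypothetical stand-in for the real 6-second search:

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def run_search(query: str) -> tuple:
    # Hypothetical stand-in for the expensive search: returns all
    # matching page IDs in ranked order. Cached per unique query.
    return tuple(f"page-{i}" for i in range(100))

def results_page(query: str, page: int, per_page: int = 10) -> list:
    # After the first request, every page of the same query is
    # served from the cache instead of re-running the search.
    hits = run_search(query)
    start = (page - 1) * per_page
    return list(hits[start:start + per_page])

print(results_page("guitar", 1))  # first page: runs the search
print(results_page("guitar", 2))  # second page: served from cache
```

The same idea works at other layers too, e.g. a reverse-proxy or session cache keyed by the query string; the trade-off is memory for cached result lists versus the cost of re-executing the search.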
======================
I think that's more than enough questions for today.
I will appreciate your insight into the issues raised in my questions when
you have some time to devote to them. Thank you.