Old hardware, large site, making indexing & search faster

  • #1

    The following question came from a customer, but we have posted it here as it may help others.

    ========
    To give you an idea of our testing so far, I describe everything
    in plenty of detail. While it makes this email rather long, I see no
    alternative.
    Where I need your advice most is:
    a) Speed of indexing (our project ran for almost 40 hours)
    b) Speed of search execution (it takes about 6 seconds)
    c) Result paging (changing pages using your template results in re-execution
    of the entire search)
    Details follow.

    I used ver 4.2 for a few days on small projects (+/-50,000 pages) just
    to "learn the ropes". Then I switched to the v5 beta because v4.2 simply
    could not handle the size of our sites. No surprise there; you warn about
    that quite clearly.

    - - - -
    Hardware setup for testing:
    The OS on all 3 machines is Win2000 Advanced Server.
    The machine I used for creating the indexes is a dual-P3-Xeon 500MHz with
    2MB cache, 2GB RAM and quite large hard drives with plenty of free space.
    The target website sits on IIS5, 2x P-Pro 200MHz and 512MB RAM. (Note: our
    web server farm was set up in 1999 when Win2000 came out, and since IIS5
    runs with respectable speed on just about anything, we have not felt the
    urge to upgrade the web hosting hardware.)
    The final indexes are installed in two places: on the indexed web server
    itself (see previous paragraph), and also on another one with a faster
    CPU: dual-PII-333MHz, same 512MB RAM.

    - - - - -
    Indexing environment:
    I have used (naturally) the CGI version, in spider mode, since the indexing
    machine and the web server are in different locations far apart from each
    other.
    The connection is a 7 Mbit DSL pipe, which is actually rather misleading:
    because it runs in asymmetric mode and the Cisco DSL router is limited
    to 896K of throughput, the actual connection speed was about 0.7 Mbit.
    Still, it is a respectable speed, and I did not see the indexing machine
    being starved for bandwidth.
    Number of threads: 5 (I own both locations, so I wasn't worried about
    causing problems by consuming too much bandwidth).
    "Verbose" was turned on since I wanted to get as much detail as possible.
    ATTACHED: The configuration file from our latest test run,
    "buysellmusic_superfile_F1M_W2-5M_verboseOn.zcfg"

    - - - - -
    I ran two separate indexing runs on the same set of approx. 408,000 files.
    The files were plain .HTM pages, about 25K in size.
    Those pages/websites were selected simply because they already existed;
    their actual page design is unimportant for the indexing test.
    The only difference in page design between the 1st and 2nd run was the
    insertion of your ZOOMSTOP/RESTART and ZOOMSTOPFOLLOW/RESTART tags (a
    small sketch of the effect follows below). The first index run indexed
    the entire HTM pages; the second one excluded page navigation menus and
    page headers/footers. This had a very positive effect on index file
    sizes, total number of words, unique words, discarded external links and
    many other figures.
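
    For clarity, the effect we relied on is roughly this (a minimal Python
    sketch of how I understand the tags; the indexer's actual parsing is of
    course your code, not this):

        import re

        SAMPLE_PAGE = """<html><body>
        <!--ZOOMSTOP--><!--ZOOMSTOPFOLLOW-->
        <div>Navigation menu, header and footer links...</div>
        <!--ZOOMRESTARTFOLLOW--><!--ZOOMRESTART-->
        <p>The actual page content we want indexed.</p>
        </body></html>"""

        def visible_to_indexer(html: str) -> str:
            # Text between ZOOMSTOP and ZOOMRESTART comments is skipped by
            # the indexer; the STOPFOLLOW/RESTARTFOLLOW pair does the same
            # for link following.
            return re.sub(r"<!--ZOOMSTOP-->.*?<!--ZOOMRESTART-->", "",
                          html, flags=re.DOTALL)

        print(visible_to_indexer(SAMPLE_PAGE))  # the nav <div> is gone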

    - - - - -
    The first project was set to 1 Mil pages and 1 Mil unique words.
    The RAM requirement was about 1.5 GB, which the indexing machine had.
    It stopped itself after about 27 hours because the 1 Mil unique word
    limit had been reached.
    This project indexed entire HTM pages (see above).
    So I changed the limits to 500K pages and 2.5 Mil unique words.
    The RAM requirement was 2GB, which the machine had, but after Win2000
    overhead it lacked about 300K.
    The project ran with no problem regardless, and finished in approx. 40
    hours. This project indexed only partial HTM pages, using the ZOOMSTOP
    tags (see above).

    - - - - -
    The few errors listed in the log were caused intentionally by removing
    some of the HTM pages from the target website after they were already in
    the indexer's cache. Website management being a dynamic thing, I wanted
    to see whether your indexer would get confused by HTM pages removed
    during the indexing process. It didn't.

    - - - - -
    The resulting index file sizes:
    Although the indexer warned that the sizes might reach the 4GB limit,
    they in fact remained well below it. That's probably because the HTM
    pages indexed are relatively small (25K on average); see the quick
    arithmetic after the file listings below.

    Project 1 (242,900 whole HTM pages; stopped after reaching the 1 Mil
    unique word limit):
    zoom_dictionary.zdat 21 MB
    zoom_pagedata.zdat 35 MB
    zoom_pageinfo.zdat 7 MB
    zoom_pagetext.zdat 219 MB
    zoom_wordmap.zdat 302 MB

    Project 2 (408,000 partial HTM pages with ZOOMSTOP tags):
    zoom_dictionary.zdat 34 MB
    zoom_pagedata.zdat 58 MB
    zoom_pageinfo.zdat 12 MB
    zoom_pagetext.zdat 238 MB
    zoom_wordmap.zdat 366 MB
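
    For perspective, summing the Project 2 files against the source corpus
    (a quick arithmetic sketch in Python, using only the rounded figures
    listed above):

        # Rounded sizes (MB) from the Project 2 listing above.
        sizes_mb = {
            "zoom_dictionary.zdat": 34,
            "zoom_pagedata.zdat": 58,
            "zoom_pageinfo.zdat": 12,
            "zoom_pagetext.zdat": 238,
            "zoom_wordmap.zdat": 366,
        }
        total_mb = sum(sizes_mb.values())
        print(total_mb)                 # ~708 MB, well under the 4GB limit

        # Source corpus: 408,000 pages at ~25K each.
        source_mb = 408_000 * 25 / 1024
        print(round(source_mb))         # ~9,961 MB, i.e. roughly 10GB

        # So the whole index is only about 7% of the raw HTML it covers.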
    - - - - -


    My questions (all pertain to ver. 5):

    1) Indexing requirements:
    It seems that even after the rewrite that swaps the temp files to the
    hard drive, the indexer still requires huge amounts of RAM for large
    jobs. Is there perhaps a setting somewhere that I have missed, or is the
    only solution a really powerful (read: expensive) machine with 4/8GB RAM?

    2) Indexing speed (408,000 pages):
    408,000 pages with 5 threads in spider mode takes 40 hours, meaning a
    daily reindex is out of the question.
    Any suggestions to cut the time down?
    Number of threads: would an increase (say, to 10) speed up the indexing
    appreciably?
    What would you recommend as the maximum number of threads before the
    target web server gets overwhelmed?

    3) Speed of search execution (408,000 pages):
    Testing has established that the search takes about the same time
    regardless of what I'm searching for, be it 1 result or 400,000. Also,
    executing 10 separate searches in parallel from 10 separate browser
    windows did not seem to cause any appreciable slowdown. That's all good.
    However, I have seen somewhere on your site or in the forums that the
    search should take about 3 seconds, and ours takes twice that.
    To be exact, it takes about 5.7 seconds on a dual-PII-333MHz machine and
    about 8.7 seconds on a dual-PPro-200MHz machine. This seems to indicate
    a direct correlation between CPU speed and search time.
    Are those figures on target, or would you consider them somewhat slow?
    Any suggestions for speeding up the search? The files are completely
    defragmented and the HDs are SCSI.

    4) Result pagination (this is probably my MOST IMPORTANT question):
    I have used your "search_template.html" page unaltered, with the default
    setting of 10 results per page.
    It appears that by simply changing the page number, the search is re-run
    from scratch, causing a 6-second delay when simply moving from page to
    page. I can imagine the actual website users being annoyed by that.
    Do you have a suggestion for how I can advance from page to page of
    results without actually re-running the search every time, or is your
    program written in such a way that it HAS TO re-run the search anew for
    every page of the results?
    Please advise; this is really crucial.

    ======================
    I think that's more than enough questions for today.
    I will appreciate your insight into the issues raised in my questions
    when you have some time left to devote to them. Thank you.

  • #2
    The Pentium Pro 200MHz is a 10-year-old CPU (released in 1995). You have achieved a truly fantastic result on such old hardware. To search nearly half a million pages on a $90 machine like this is nothing short of miraculous!

    Anyway, on to your questions.

    To make indexing faster
    ================
    - Use offline mode if possible. It is often around 5 times quicker.
    - Don't index remote sites if you can avoid it. It will be much quicker if you can index a machine on the same local network, e.g. one connected via CAT6 and a Gigabit switch.
    - The optimal number of threads depends on your hardware, the speed of the network and the size of the pages. It is hard to imagine that going above 5 will help, however.
    - Don't log the indexing process to file unless you need to.
    - Minimise the use of features like content filtering, synonyms and skip lists, and turn off duplicate page checking if you can.
    - Check your server configuration; features like KeepAlive in Apache (i.e. "KeepAlive On" in httpd.conf) can help indexing speed in spider mode.
    - Your indexing speed was 2.9 pages/sec, which I think is good for such old hardware over ADSL. Expected V4 indexing speed benchmarks can be found here, and as you can see, with a newer PC and local indexing you could get a 28-times speed increase. This probably equates to around 35 times faster with a brand new PC and V5 of Zoom. (A quick sanity check of the throughput figures follows this list.)
    - Try using incremental indexing in V5. It should make the 2nd and subsequent passes quicker.
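
    For what it's worth, here is that sanity check of the throughput figures
    (a Python sketch using only numbers quoted in this thread):

        pages = 408_000          # pages indexed in the 2nd project
        hours = 40               # reported wall-clock time
        page_size_kb = 25        # average page size reported

        pages_per_sec = pages / (hours * 3600)
        print(round(pages_per_sec, 2))    # ~2.83 pages/sec

        # Implied download bandwidth, ignoring HTTP overhead:
        mbit_per_sec = pages_per_sec * page_size_kb * 8 / 1000
        print(round(mbit_per_sec, 2))     # ~0.57 Mbit/s

    Given the effective link speed was about 0.7 Mbit, the crawl may have
    been closer to bandwidth-bound than it looked, which is another reason
    more threads are unlikely to help much here.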


    Search speed
    =========
    - You have a huge number of words. Have a look in the zoom_dictionary.zdat file with a good text editor and confirm that all 1.6 million of your unique words are real words that need to be indexed. For example, using the wrong character set or indexing binary files can result in a bloated, garbage-filled dictionary. (A rough sketch of such a check follows this list.)
    - I think 5 - 9 sec for 408,000 pages is OK on your decade-old hardware. With a new server this would drop to around 1 sec, I think.
    - Search speed depends mainly on hard disk speed, CPU speed and hard disk caching. Multiple CPUs don't help much unless you have several searches happening at the same time.
    - Yes, a search is re-executed upon every search request, new page or not, unless you are using MasterNode, which does search caching (a sketch of the caching idea also follows below).
    Normally this is OK, as a typical search takes < 1 sec.
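
    As a rough illustration of the dictionary check suggested above, the
    sketch below tallies suspicious entries in a word list. It assumes you
    have dumped the dictionary to a plain text file with one word per line;
    the real zoom_dictionary.zdat layout may differ, so treat the file
    handling, dump filename and encoding as hypothetical:

        import re

        # Heuristic for a "real" word: letters, apostrophes and hyphens.
        word_re = re.compile(r"^[A-Za-z][A-Za-z'\-]*$")

        total = suspicious = 0
        with open("zoom_dictionary_dump.txt", encoding="latin-1") as f:
            for line in f:
                word = line.strip()
                if not word:
                    continue
                total += 1
                if not word_re.match(word):
                    suspicious += 1   # digits, markup debris, binary junk

        print(suspicious, "of", total, "entries look like junk")

    A dictionary full of tokens that fail a test like this usually points
    at a character set problem or binary files being indexed.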
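
    And to make the caching point concrete: the idea (in MasterNode, or in
    any front end you might put in front of the CGI) is to run the expensive
    search once, keep the full ordered result list keyed by the query, and
    serve each page as a cheap slice. A minimal sketch of the pattern, with
    a hypothetical run_full_search() standing in for the real search call:

        from functools import lru_cache

        PAGE_SIZE = 10  # the template's default of 10 results per page

        def run_full_search(query: str) -> list:
            # Hypothetical stand-in for the actual (expensive) search.
            return [f"result {i} for {query!r}" for i in range(100)]

        @lru_cache(maxsize=128)
        def cached_search(query: str) -> tuple:
            # Runs once per distinct query; paging reuses the cached tuple.
            return tuple(run_full_search(query))

        def get_page(query: str, page: int) -> tuple:
            start = (page - 1) * PAGE_SIZE
            return cached_search(query)[start:start + PAGE_SIZE]

    Note this only works in a long-running process. A plain CGI program is a
    fresh process per request, which is why the stock script has to re-run
    the search for every page.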


    Indexing requirements
    ===============
    Yes, Zoom can use a lot of RAM. V5 is much, much better than V4, and the amount required does not grow linearly with your page count. For example, going from 0.5M pages to 1M pages does not double the RAM required. I think you should get to 1M pages with 2GB of RAM, but it will be slow on your 500MHz CPU. Upgrading to 3GB would be good, but there is no point going much above 3GB with a 32-bit O/S.

    Going beyond 1 million pages
    ====================
    Yes, we will release a 64-bit version. We don't have a date for this release as yet, but around Jan 07 is likely. We expect this release to cover the 1M to 2M page range, maybe even as high as 3M for some sites.

    We have also just released the first beta of the MasterNode search aggregation software, which should cover the 1M to 10M page range.

    Using Old Hardware
    =============
    While it is great to be putting such old PC hardware to good use, you need to take into account the extra time you are going to lose getting it set up and running smoothly. Indexing and searching 1M pages is a computationally intensive task. You would save yourself a lot of time and grief by using a newer PC.
