V5 development progress - Indexing enormous sites


  • #16
    Originally posted by Ray
    You can upgrade to V5 now and receive a key which will work for the beta as well as the final release.

    See details for upgrading to V5 here:
    http://www.wrensoft.com/forum/showthread.php?t=1124

    Contact us by email if you have any questions.

    Okay, purchased, installed, registered.

    If I put in these limits, I observe the following:

    Max files to Index: 100000
    Max unique words: 2500000
    Max file size: 50000
    Max description: 150
    Those settings say I need 935 MB of RAM.

    If I change to this...

    Max files to Index: 100001
    Max unique words: 2500000
    Max file size: 50000
    Max description: 150
    This says I need 3.2 GB of RAM to index, and it will not run the index.

    Why does changing the max files to scan by "1" make such a huge jump in RAM usage? This behaviour is not present in version 4.2.

    Thanks

    • #17
      First of all, 2.5 million unique words is an enormous limit for this setting, and it is very likely to be much more than you need. Please note that this refers to "unique words", not the "total number of words" on your website. For example, there are only about 50,000 unique words in the English dictionary. You can work out a more reasonable value for your website with some trial indexing attempts: there is a "Unique words found" count on the "Indexing Status" tab, and it is also shown at the end of indexing. Adjusting this setting to a more reasonable number will lower your memory requirements significantly.
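
      To illustrate the distinction, here is a minimal sketch in Python (not Zoom's actual word-counting code) showing how the total word count can be far larger than the unique word count:

      # Illustrative only; not how Zoom counts words internally.
      text = "the cat sat on the mat and the dog sat on the rug"
      words = text.lower().split()

      total_words = len(words)        # every occurrence counts
      unique_words = len(set(words))  # each distinct word counts once

      print(total_words)   # 13
      print(unique_words)  # 8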

      You are right that the big jump in the memory estimation is inaccurate at the moment, and we will need to correct this. The reason for it is the new "data flushing" mechanism we have in place, which writes the indexed data out to disk at a certain threshold (in this case, 100,000 files). As such, a different estimation method is employed at 100,001 pages. However, it seems our estimate for the first 100,000 files is not very accurate when given an exceedingly large unique words limit such as yours. The first estimate you referred to is likely an underestimate; indexing 2.5 million unique words across 100,000 pages should use more than 1 GB of memory. It is worth noting that Version 4.2 estimates 2.6 GB of memory required for your first set of limits.
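
      As a rough sketch of why the estimate can jump at the threshold (purely illustrative, with made-up constants; this is not the formula Zoom actually uses):

      FLUSH_THRESHOLD = 100_000  # files per flush to disk, per the paragraph above

      def estimate_ram_mb(max_files, max_unique_words,
                          bytes_per_word=64, bytes_per_file=4096):
          # Below the flush threshold, assume the whole index stays in memory;
          # above it, a different per-chunk model applies, so the estimate can
          # jump sharply at the boundary. All constants here are invented.
          if max_files <= FLUSH_THRESHOLD:
              est = max_unique_words * bytes_per_word + max_files * bytes_per_file
          else:
              chunks = -(-max_files // FLUSH_THRESHOLD)  # ceiling division
              est = (max_unique_words * bytes_per_word * 2
                     + chunks * FLUSH_THRESHOLD * bytes_per_file)
          return est / (1024 * 1024)

      print(estimate_ram_mb(100_000, 2_500_000))  # smaller estimate below the threshold
      print(estimate_ram_mb(100_001, 2_500_000))  # jumps once the flushing model applies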

      I should point out that the memory estimation is just that: an estimation at best. It is not possible to calculate the exact memory requirements in advance without knowing how many unique words will be found on each page, etc. We have to make an estimation based on average, realistic scenarios, so there will always be rare exceptions which can undermine the estimation (e.g. a single page which contains over 2 million unique words, while all your other pages have a low number of words). In such cases, though, you should verify that your limits are reasonable and necessary. For example, if you find that your unique word count exceeds 2 million, something else may be wrong, e.g. you may be indexing binary files and a lot of meaningless data.
      --Ray
      Wrensoft Web Software
      Sydney, Australia
      Zoom Search Engine

      • #18
        The settings I have in 4.2 for this particular search engine are as follows.

        Max files to scan: 100000
        Max unique words: 2500000
        Max file size: 12000
        Max description: 150
        This estimates 1.2 GB.

        This indexes fine even on my backup server with only 1 GB of RAM.

        There are 100000 HTML files this is indexing, and the max unique words value was found by increasing it every time the limit was reached. I think the final number was 2.2 million, and I buffered it to 2.5 million as the files update.

        These HTML files are the output of dump analysis from GDB backtraces, so the unique word count is high, as they show millions of possible memory locations and millions of possible lines of code called out.

        The way I run my website now is with three different search engines.

        Engine 1 - Documents: a mix of DOC, PDF, HTML, PPT, TXT, etc.
        This is ~4000 files and 300000 unique words.

        Engine 2 - Emails saved as PDF files.
        This is ~40000 files and 300000 unique words.

        Engine 3 - GDB output saved as HTML files.
        This is ~100000 files with 2.5 million unique words.

        My hope was that V5 would allow me to have one engine where I could set max files to 200000, unique words to 3000000, and file size to 50 MB within the confines of my hardware. I really think it's possible... maybe not the fastest indexing, but possible.

        • #19
          We certainly weren't expecting to be indexing debugger dumps of memory locations! And yes, we can see how that number of unique words would cause our current memory estimation method to be inaccurate. As mentioned above, assumptions have to be made for our estimation, and we thought it would be rare for the number of unique words to exceed the number of pages so greatly. But it is true that such a case would give unreasonable estimations, and we will need to address that. We'll look into some possible changes to our calculations in the next few days and take this into consideration.

          On a side note, indexing that many unique words within 1 GB of memory would be really pushing it to the edge (it may well be swapping out to disk a lot and running very slowly; past a certain point, it becomes too slow to be practical). Memory is cheap these days, and I would consider adding some if you are hoping to index the full content mentioned above.
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine

          • #20
            The new version is working much better now.

            I think Zoom is now going to help me finally reach my goal of having one search engine for my gigantic site.

            The status from my latest index:
            Total files indexed: 84,478
            Unique Words: 2,060,950
            Total Words: 105,525,921

            It took 5.5 hours to index, but that does not bother me as I run it off-shift.

            It is performing beautifully. I even ran a very difficult search string for an exact phrase in the "All" category, and the speed was very impressive.

            The next step is adding about 50,000 more HTML files.

            Great work.
