indexing large volume across a number of days?


  • indexing large volume across a number of days?

    I have a really large volume that I'm attempting to index (around 6-7 TB). Much of the information is graphical in nature, but it still takes a long time to open the files and pull out the text (5+ hrs for a 1TB test even on a fast machine: Win 7, 64 bit, 12GB RAM, Xeon W3530).

    I haven't been able to index the entire volume yet (for political reasons I can't explain), but I was curious to know whether there is a way to automatically start the indexer at night (e.g. 7pm), pause it the next morning (e.g. 5am), and then resume indexing (i.e. not starting over) the following night...

    I'm happy to write a script that uses command-line options, but I can't figure out whether there is a way to pause and resume.

    Alternatively, would stopping and then starting again with the "-update" option for incremental indexing work for me in this case?

    Thanks,

    Ken

  • #2
    To start with, have a read of this FAQ:
    Indexing Enormous Sites - Hints and Tips

    I would suggest not trying to pause the indexing (although there is a pause function in the toolbar).

    The reasons are that:
    1) Even when paused, the indexer is going to be holding a bunch of system resources (mainly RAM).
    2) The longer you have it running, the more chance there is of a disruptive event of some sort, e.g. a power failure, a Windows patch with a reboot, a software crash, etc.

    So it is better to break up the job. Are your documents arranged into several folders? Maybe index one folder at a time and use the incremental function to keep adding to the index. This also has the advantage that in the event of a failure you can fall back to the last valid set of index files (which you should back up as you go).

    The command line option you want is not -update. It is -addstartpt.

    You can also use the Windows scheduler to schedule the start time.
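    As a very rough illustration, a nightly job launched from the scheduler could look like the sketch below: index one top-level folder per night, record which folders are done, and simply re-run the script if a night gets interrupted. The executable path and the -s / -c flags are assumptions about a typical installation; check them (and -addstartpt / -update) against the command line documentation for your version before relying on them.

    ```python
    # nightly_index.py - rough sketch only, not a tested recipe.
    # Assumptions: ZoomIndexer.exe accepts a config file plus the -addstartpt
    # and -update options discussed above, and a zero exit code means the run
    # finished. Adjust paths and flag spellings to your own installation.
    import subprocess
    from pathlib import Path

    INDEXER = r"C:\Program Files\Zoom Search Engine\ZoomIndexer.exe"  # assumed install path
    CONFIG  = r"C:\indexing\bigvolume.zcfg"                           # your saved configuration
    ROOT    = Path(r"\\fileserver\bigvolume")                         # volume being indexed (placeholder)
    DONE    = Path(r"C:\indexing\done_folders.txt")                   # folders already indexed

    done = set(DONE.read_text().splitlines()) if DONE.exists() else set()
    pending = [p for p in sorted(ROOT.iterdir()) if p.is_dir() and str(p) not in done]

    if pending:
        folder = pending[0]
        # Add this folder as a start point and update the existing index incrementally.
        result = subprocess.run([INDEXER, "-s", "-c", CONFIG,
                                 "-addstartpt", str(folder), "-update"])
        if result.returncode == 0:
            with DONE.open("a") as f:
                f.write(str(folder) + "\n")  # only mark the folder done on success
    ```

    Scheduling it for 7pm is then a one-liner with the standard Windows task scheduler, along the lines of: schtasks /create /tn "NightlyIndex" /tr "python C:\indexing\nightly_index.py" /sc daily /st 19:00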

    • #3
      Thanks David, that's very helpful.

      We actually have a volume that has 100s of folders beneath the root, so I think I'll create a text file of start points and use -addstartpts.

      I also want to be able to provide an option to search within each folder individually. Is the best way to do that to create a categories CSV file with each of the folder names and paths? If so, is there a way to automatically add these to the config so I can automate the whole process? (the top level folders don't change often, but they do change).
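      For the start points file I'm picturing something as simple as the sketch below (I'm assuming one folder path per line is an acceptable format, which I still need to confirm against the documentation):

      ```python
      # make_startpoints.py - sketch only: list every top-level folder of the
      # volume, one path per line (assumed format for the start points file).
      from pathlib import Path

      ROOT = Path(r"\\fileserver\bigvolume")       # volume root (placeholder)
      OUT  = Path(r"C:\indexing\startpoints.txt")

      folders = sorted(str(p) for p in ROOT.iterdir() if p.is_dir())
      OUT.write_text("\n".join(folders) + "\n", encoding="utf-8")
      ```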

      • #4
        You could either create a search function per folder (or for a group of folders), which would keep each index small and make for easy updates, OR you can have one large index and then use the categories feature to provide a drop-down selection for the search category.

        Yes, you can import a list of categories from CSV. But this is a manual task from the user interface. To automate this you would need to write code to parse and rewrite the Zoom configuration file (xxxxx.zcfg), which is a Unicode text file.
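        As a very rough sketch of that approach: save a config from the user interface with a couple of categories already defined, look at the lines Zoom writes for them, and then have a script regenerate those lines from the current folder list. The CatName/CatPattern key names below are invented placeholders, not the real ones, so substitute whatever your saved config actually contains.

        ```python
        # rewrite_categories.py - illustrative only. The .zcfg is a Unicode text
        # file, but the "CatName"/"CatPattern" keys used here are placeholders;
        # replace them with the lines your own saved configuration uses.
        from pathlib import Path

        CFG  = Path(r"C:\indexing\bigvolume.zcfg")
        ROOT = Path(r"\\fileserver\bigvolume")      # volume root (placeholder)

        folders = sorted(p for p in ROOT.iterdir() if p.is_dir())

        text  = CFG.read_text(encoding="utf-16")    # Zoom configs are Unicode text
        lines = [ln for ln in text.splitlines()
                 if not ln.startswith(("CatName", "CatPattern"))]  # drop old category lines

        for i, folder in enumerate(folders):        # one category per top-level folder
            lines.append(f"CatName{i}={folder.name}")
            lines.append(f"CatPattern{i}={folder}")

        CFG.write_text("\n".join(lines) + "\n", encoding="utf-16")
        ```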

        You also need to think about how the updates are done. It isn't trivial at all to index 7TB of data and then keep the index up to date. Having the index split into several smaller indexes can help with this. Ideally you could avoid re-indexing all the files when an update is required. There are command line options to add and remove files from the index, but if you are using offline mode you need to have some way to track what is updated and what is deleted.

        • #5
          I like the idea of having several smaller indexes to help with maintenance. I was thinking I could do rolling weekly updates where I index about 1TB each night of the week.

          The thing I wasn't sure of is how to combine the indexes in a search query so that the page rank algorithm still works. I know you support federated search using MasterNode, but does it handle combining search results based on their score in an intelligent fashion so that page 1 of the search results has the top 10 ranked pages across all indexes?

          I really appreciate your prompt responses to all my questions,

          Thanks,
          Ken

          • #6
            You could present the user with multiple search boxes. This is the easy solution for multiple sets of index files.

            You could write a small piece of JavaScript (or a server-side script) that picks a set of index files based on a user selection from a drop-down list. Both of these suggestions assume that some form of sensible categories can be made out of your data, and that a user is happy to search one category at a time.
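            As a rough illustration of the server-side variant: keep each set of index files in its own sub-folder with its own search page, and have a tiny script map the drop-down choice onto the right one. The folder layout, script name and query parameter below are assumptions, not something Zoom requires, so verify them against your own installation.

            ```python
            # pick_index.py - CGI-style sketch only. Assumes each set of index
            # files lives under /indexes/<set>/ with its own search page, and
            # that the search page takes the query in a "zoom_query" parameter.
            import os
            import urllib.parse

            ALLOWED = {"contracts", "drawings", "reports"}   # hypothetical category names

            params = urllib.parse.parse_qs(os.environ.get("QUERY_STRING", ""))
            chosen = params.get("set", [""])[0]
            query  = params.get("q", [""])[0]

            if chosen in ALLOWED:
                target = f"/indexes/{chosen}/search.php?zoom_query={urllib.parse.quote(query)}"
            else:
                target = "/index.html"                       # fall back to the page with the drop-down

            print("Status: 302 Found")
            print("Location: " + target)
            print()                                          # blank line ends the CGI headers
            ```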

            But you can also use MasterNode (which can be a bit tricky to install), and this will search all the sets of index files at the same time and sort the results by ranking.

            • #7
              There are command line options to add and remove files from the index, but if you are using offline mode you need to have some way to track what is updated and what is deleted.
              I had assumed that if I selected "Incremental Indexing > Update Existing Index" it would skip all files whose last modified time was before the last index operation.

              But if I understand the above quote correctly, I would need to crawl the volume using another application to pull out the names of all files updated since the last index, add these to a text file, and then use that file either with the command line or through the UI option to "Add new or updated pages to existing index".

              Is that correct?
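              If so, the crawl itself seems simple enough; something along the lines of the sketch below is what I'd write. The timestamp file is my own bookkeeping, and I still need to work out exactly how the resulting list gets fed back to the indexer.

              ```python
              # find_updated.py - sketch of the crawl described above: list every
              # file modified since the last recorded index run. Note it only
              # finds new/updated files; deletions would have to be tracked
              # separately, as mentioned earlier in the thread.
              import time
              from pathlib import Path

              ROOT    = Path(r"\\fileserver\bigvolume")        # volume root (placeholder)
              STAMP   = Path(r"C:\indexing\last_index_time.txt")
              UPDATED = Path(r"C:\indexing\updated_files.txt")

              last_run = float(STAMP.read_text()) if STAMP.exists() else 0.0

              with UPDATED.open("w", encoding="utf-8") as out:
                  for path in ROOT.rglob("*"):                 # walk the whole tree
                      if path.is_file() and path.stat().st_mtime > last_run:
                          out.write(str(path) + "\n")

              STAMP.write_text(str(time.time()))               # record this crawl's time
              ```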

              • #8
                But you can also use MasterNode (which can be a bit tricky to install), and this will search all the sets of index files at the same time and sort the results by ranking.
                Can MasterNode be installed and configured to search multiple index files on the same server? Or does it require additional servers to use as slaves, one per index?

                • #9
                  You can have all the index files on the same server in different sub-folders. But at some point, once you have lots of large sets of index files and lots of searching being done by users, you are better off using multiple servers.
