I know I've touched on this subject before, and some posts cover aspects of it, but all things considered I'm still not content with how this is working. I accept that some of this may come down to strategy, to my lack of understanding of best practice with ZOOM, or to the way ZOOM works as opposed to the way I'd like it to work. But make no mistake, I'm a big fan of ZOOM.
It seems to me there are several barriers to indexing large systems with ZOOM despite Wrensoft's performance claims (that's not a criticism - I accept the metrics; it's the practicality of taking ZOOM that far that is the issue for me). Let me expand on some of the issues, questions, concerns and suggestions I have:
Changes to your .cfg file - I know you can add new starting points and keep going, which is great, but if you add new starting points you're often inclined to tweak your settings too, in which case ZOOM requires you to start all over again. I usually tweak settings to improve search results or to reduce indexing load and time. On large operations, restarting or redoing it entirely is a pain and a great consumer of resources.
System disruption - if something goes pear-shaped during a long indexing operation (e.g. loss of connectivity, an app crash [the most common one for me], etc.) then you have the joy of starting all over again. It would be nice if, on longer operations, ZOOM saved its state in the background, say every 10 minutes or so - something like an FTP-style resume would then let it pick up from the last save (rough sketch below).
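For what it's worth, here's a minimal Python sketch of the kind of checkpoint/resume behaviour I'm imagining - this is not how ZOOM works internally, and the file name, interval and start URL are all made up for illustration:

```python
import json
import time
from pathlib import Path

CHECKPOINT = Path("crawl_checkpoint.json")  # hypothetical state file
SAVE_INTERVAL = 600                          # save every 10 minutes

def load_state():
    # Resume from the last checkpoint if one exists
    if CHECKPOINT.exists():
        state = json.loads(CHECKPOINT.read_text())
        return state["queue"], set(state["visited"])
    return ["https://example.com/"], set()   # hypothetical starting point

def save_state(queue, visited):
    CHECKPOINT.write_text(json.dumps({"queue": queue, "visited": sorted(visited)}))

def crawl():
    queue, visited = load_state()
    last_save = time.monotonic()
    while queue:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        # ... fetch and index `url`, append discovered links to `queue` ...
        if time.monotonic() - last_save >= SAVE_INTERVAL:
            save_state(queue, visited)       # periodic background save
            last_save = time.monotonic()
    save_state(queue, visited)

if __name__ == "__main__":
    crawl()
```

The point is just that a crash or dropped connection would cost at most the last save interval, not the whole run.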
Maximum file size - this is a real pain for me. In some cases I'm indexing (or trying to index) files (usually PDFs) way beyond ZOOM's default limit - I have played with various settings, but that's not the real issue. It looks like ZOOM downloads the file up to the maximum configured size and, once that's reached, ditches it - is that right? If so, that's a great waste of resources, and ultimately the document isn't in the index in any way. What would work better is for ZOOM to index the file up to the maximum size, if that's possible - or alternatively, when the maximum size is reached (or preferably detected before download), to index the first X pages of the document (see the sketch after this paragraph). This is actually a good strategy, because if you're indexing a lot of vintage magazines, as I am, the contents of the magazine are usually listed within the first 10 pages or so, so you pick up a "snapshot" of the magazine in your index rather than it being ditched and not indexed at all.
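To illustrate the two ideas - detecting the size before download, and falling back to a first-pages snapshot - here's a rough sketch. This is not ZOOM's actual behaviour; it uses the third-party requests and pypdf libraries, and the size limit and page count are made-up values:

```python
import requests
from pypdf import PdfReader

MAX_BYTES = 50 * 1024 * 1024  # hypothetical maximum file size
FIRST_PAGES = 10              # enough to capture a magazine's contents page

def size_ok(url: str) -> bool:
    # Check the reported size with a HEAD request before downloading
    resp = requests.head(url, allow_redirects=True, timeout=30)
    length = resp.headers.get("Content-Length")
    return length is not None and int(length) <= MAX_BYTES

def snapshot_text(path: str) -> str:
    # Extract text from only the first few pages of an oversized PDF,
    # giving a "snapshot" for the index instead of ditching the file
    reader = PdfReader(path)
    count = min(FIRST_PAGES, len(reader.pages))
    return "\n".join(reader.pages[i].extract_text() or "" for i in range(count))

# Usage: index normally when size_ok(url) passes; otherwise index
# snapshot_text("vintage_magazine.pdf") rather than nothing at all.
```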
I did have a tinker with MasterNode, creating smaller indexes and then using MasterNode to search them collectively, but I didn't particularly like it due to the trade-off in features and the need to synchronize several .cfg files. As an interim solution I have deployed an ad-free Google Custom Search, but I prefer the level of control that ZOOM gives me over the UX, what is indexed, etc., so I want to get this working.
Thanks folks!!!