64bit Edition of Zoom Search Engine


  • 64bit Edition of Zoom Search Engine

    Hi,

    The warning you get in the configuration "Limits" tab suggests using the "64-bit edition of Zoom".

    Is there such a thing? I can't find it on your website.

    Best Regards,

    Will

  • #2
    We have a 64-bit version of the software almost ready to go. But we figured that there was no point releasing it until there was some demand.

    64-bit hardware and a 64-bit operating system will be required. The native 64-bit version should immediately double the capacity of the 32-bit software. And there is the potential to increase capacity 10-fold in the medium term (with the right hardware).

    But at the moment the million-page capacity of the 32-bit version is large enough for almost all of our customers. And the customers who want more than a million tend to rather irrationally ask for infinite capacity.

    What is your project?

    If it is just capacity you are after, then there is also the MasterNode distributed search solution.



    • #3
      Hi,

      Are you sure that the Enterprise 5 edition has a million page capacity in all cases?

      My index of 100,000 pages overloads it and I am trying to figure out why. Before indexing starts it says it needs 2.8GB RAM (which we have).

      It overloads before it finishes indexing rather than when it tries to write the files at the end of indexing.

      Does that mean it is a RAM issue, or that the 32-bit edition cannot cope?

      Please reply saying either:
      - the 32-bit version can handle it with enough computer power, or
      - I can have a free upgrade to the 64-bit version when it is released

      As for infinite capacity, you can download the internet here:
      http://www.w3schools.com/downloadwww.htm



      • #4
        We have tested it to just over a million pages / documents. We did most of our tests with moderately sized HTML files. But there are certainly conditions under which you are not going to reach a million pages / documents. If you are indexing large PDF files for example, you are not going to get to 1M large PDF files. If you don't have enough RAM, then you aren't going to make it either.

        Generally the blocking factor will be either 1) you run out of RAM, or 2) the files grow so large that the internal 32-bit pointers that cross-reference records within the index aren't large enough any more. But there are sometimes other subtle effects in play, like older versions of Linux not being able to seek within files greater than 2GB, etc.
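
        To make the second point concrete, here is a rough sketch in C. It is not our actual index file format, just an illustration of why 32-bit cross-reference offsets cap how large an index file can usefully grow:

            #include <stdint.h>
            #include <stdio.h>

            /* Hypothetical index record: it refers to another record elsewhere
               in the same file by its byte offset, stored in 32 bits. */
            struct index_record {
                uint32_t word_id;
                uint32_t next_offset;   /* byte position of a related record */
            };

            int main(void)
            {
                /* An unsigned 32-bit offset can address at most 4 GiB of file,
                   a signed one only 2 GiB. Records past that point simply
                   cannot be referenced any more. */
                printf("unsigned 32-bit limit: %llu bytes\n", (unsigned long long)UINT32_MAX + 1);
                printf("signed 32-bit limit:   %lld bytes\n", (long long)INT32_MAX + 1);
                return 0;
            }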

        But 100,000 shouldn't be so hard to reach.

        What limits did you set on the limits tab?

        What exactly do you mean by "overload"? We don't have any message to that effect in the software as far as I know.

        Are you using spider mode or offline mode?

        What type of content are you indexing?

        What hardware (CPU, RAM, Free disk space, Internet connection) do you have?



        • #5
          Hi,

          I think it's a unique-word issue (several million of them) which stops the Enterprise edition exceeding ~50,000 files. Once the 2GB limit is reached, I guess your software cuts off indexing. Is there any way to "disable" that cut-off and let it carry on? After all, Zoom lets indexing start when it predicts 2.8GB will be needed.



          • #6
            The 2GB limit I was referring to was a file size limit for some of the index files. Old Linux operating systems are not able to deal with individual files greater than 2GB (technically speaking, they used a signed 32-bit integer for seeking to a file position). That limit has nothing to do with available RAM.
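
            If you want to see why that works out to exactly 2GB: a signed 32-bit offset tops out at 2^31 - 1 bytes. A trivial C illustration (nothing to do with Zoom's own code):

                #include <stdio.h>

                int main(void)
                {
                    /* Old seek interfaces (e.g. fseek with a 32-bit long, or lseek
                       with a 32-bit off_t) take a signed 32-bit file offset. */
                    long max_offset = 2147483647L;   /* 2^31 - 1 */

                    printf("largest addressable file position: %ld bytes\n", max_offset);
                    printf("which is about %.2f GB\n", max_offset / (1024.0 * 1024.0 * 1024.0));
                    /* Any index file larger than this cannot be seeked into
                       with such an offset. */
                    return 0;
                }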

            It would help if you answered my other questions.

            How many million is "several million"? Having so many unique words is not typical; why do you have / need so many?



            • #7
              I would like to add that, in the past, we have often found that when users seemingly index a huge number of unique words, there might be something else at play. For example, they could be indexing a file type that is not supported, and Zoom ends up indexing a lot of binary garbage, pushing the unique word count up with data that would not be meaningfully searchable anyway.

              On very rare occasions it may be more valid, and the user may actually have 100,000 PDF files, each of which is over 100 MB in size or similar. These are certainly exceptions to the rule, and in such cases there are still alternative methods which may be suitable (for example, limiting the number of words to index per file). So the better you can clarify your scenario, the more likely we can be of help.
              --Ray
              Wrensoft Web Software
              Sydney, Australia
              Zoom Search Engine



              • #8
                That's great. So the 64-bit version will have double the capacity.
                Can the limits go further with the right hardware? For example, if you have 5 servers with MasterNode, I see it uses the 5 servers for searching, but could it be done for indexing as well?
                For example, if you put up 5 servers with 2GB of RAM each, could they all work together to index the data? Also, what happens if I have an 8GB server: will I then be able to index more than a million files (HTML files)?



                • #9
                  That's great. So the 64-bit version will have double the capacity.
                  At least double. We hope. It depends on how much we change the index file format and if we are prepared to drop support for older operating systems.

                  With MasterNode you can have multiple machines doing the indexing and multiple machines serving up search results.

                  4GB of RAM is almost always a waste on a 32-bit Windows system.
                  Windows applications are limited to using 2GB of RAM due to a lack of virtual address space.
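
                  The arithmetic: a 32-bit process has 2^32 = 4GB of virtual address space in total, and standard 32-bit Windows reserves half of it for the kernel, leaving 2GB for the application. A quick way to see the ceiling for yourself, as a rough sketch (assuming you build it as a 32-bit binary), is to keep allocating until malloc gives up:

                      #include <stdio.h>
                      #include <stdlib.h>

                      int main(void)
                      {
                          const size_t chunk = 16 * 1024 * 1024;   /* ask for 16 MB at a time */
                          size_t total = 0;

                          /* Keep allocating until a request fails. In a 32-bit Windows
                             process the failure normally comes from exhausting the ~2GB
                             of user-mode address space, not from lack of physical RAM. */
                          while (malloc(chunk) != NULL)
                              total += chunk;

                          printf("allocation failed after about %lu MB\n",
                                 (unsigned long)(total / (1024 * 1024)));
                          return 0;
                      }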



                  • #10
                    Well, I did not know that. So if you buy 64-bit servers, what would be the more recommended configuration: 2 servers with 8GB of RAM, or 4 servers with 4GB each? For example, you said 4GB on a 32-bit system is a waste. I don't have 32-bit systems any more, but we had a server some years ago with Windows NT and a couple of GB of RAM, so I guess that was a waste. I don't remember exactly, but I think it had 6GB at the time.



                    • #11
                      Windows applications are limited to using 2GB of RAM
                      Another source says 32-bit apps are limited to 3GB.

                      Can we not have 3GB from Zoom please?



                      • #12
                        Something about setting the IMAGE_FILE_LARGE_ADDRESS_AWARE flag?

                        With regard to what Ray says: I'm pretty sure they are PDFs, but how do I know whether Zoom is indexing binary garbage? Also, how would I stop this from happening?
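
                        For reference: that flag lives in the PE header of the executable, and is normally set at build time with the MSVC linker's /LARGEADDRESSAWARE switch, or afterwards with editbin /LARGEADDRESSAWARE. Whether a given binary already has it can be checked with a small sketch like the one below; it is generic PE-header reading, not anything Zoom-specific, and assumes a little-endian machine.

                            #include <stdio.h>
                            #include <stdint.h>

                            /* Report whether a Windows executable has IMAGE_FILE_LARGE_ADDRESS_AWARE
                               (bit 0x0020 of the Characteristics field in its PE file header). */
                            int main(int argc, char **argv)
                            {
                                if (argc != 2) {
                                    fprintf(stderr, "usage: %s program.exe\n", argv[0]);
                                    return 1;
                                }
                                FILE *f = fopen(argv[1], "rb");
                                if (!f) { perror("fopen"); return 1; }

                                uint32_t pe_offset = 0;
                                uint16_t characteristics = 0;
                                unsigned char sig[4] = {0};

                                fseek(f, 0x3C, SEEK_SET);              /* e_lfanew: where the PE header starts */
                                fread(&pe_offset, 4, 1, f);
                                fseek(f, pe_offset, SEEK_SET);
                                fread(sig, 1, 4, f);                   /* should be "PE\0\0" */
                                fseek(f, pe_offset + 22, SEEK_SET);    /* Characteristics field */
                                fread(&characteristics, 2, 1, f);
                                fclose(f);

                                if (sig[0] != 'P' || sig[1] != 'E') {
                                    fprintf(stderr, "not a PE executable\n");
                                    return 1;
                                }
                                printf("large address aware: %s\n", (characteristics & 0x0020) ? "yes" : "no");
                                return 0;
                            }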



                        • #13
                          I think you are confusing the 2GB file system limit and the 2GB Virtual memory limit.

                          Once the 2GB limit is reached, I guess your software cuts off indexing.
                          From this description I am guessing you are hitting the file system limit and not the virtual memory limit. If you are hitting the virtual memory limit then either indexing will not start at all with a message about lack of RAM, or indexing will fail rather catastrophically with a memory allocation failure of some sort.

                          There is no workaround for the file system limit until we drop support for old Linux, and then we can use 4GB files.

                          It would have helped if you had answered my other questions from my second post.



                          • #14
                            Originally posted by will
                            With regard to what Ray says: I'm pretty sure they are PDFs, but how do I know whether Zoom is indexing binary garbage? Also, how would I stop this from happening?
                            If they are just PDF files, then there should be no problem, unless your PDF files were created unusually (e.g. with a third-party application besides Acrobat) and they contain an invalid searchable text layer. You can confirm this by opening "zoom_dictionary.zdat" in a text editor and seeing whether it contains a list of normal-looking words (do NOT modify this file, however, or you'll break functionality).
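
                            If the file is too large to eyeball comfortably, a rough sanity check is to measure how much of it looks like printable text. A quick sketch along those lines (it assumes the dictionary file is plain text as described above, the 90% threshold is only a guess, and a multi-lingual index will score lower simply because of non-ASCII bytes):

                                #include <stdio.h>
                                #include <ctype.h>

                                /* Rough check: what fraction of zoom_dictionary.zdat is printable text?
                                   A dictionary of real words should be almost entirely printable; a large
                                   share of other bytes suggests binary garbage was indexed. */
                                int main(void)
                                {
                                    FILE *f = fopen("zoom_dictionary.zdat", "rb");
                                    if (!f) { perror("fopen"); return 1; }

                                    unsigned long printable = 0, total = 0;
                                    int c;
                                    while ((c = fgetc(f)) != EOF) {
                                        total++;
                                        if (isprint(c) || isspace(c))
                                            printable++;
                                    }
                                    fclose(f);

                                    if (total == 0) { printf("file is empty\n"); return 0; }
                                    double ratio = 100.0 * printable / total;
                                    printf("%.1f%% of %lu bytes look like text\n", ratio, total);
                                    if (ratio < 90.0)
                                        printf("a lot of non-text content: possibly binary garbage was indexed\n");
                                    return 0;
                                }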

                            What I was suggesting before was more related to cases where you might be indexing unrecognized file types (e.g. "myfile.abc") which Zoom would treat as a text/HTML file. Another possibility is that your server is not serving the PDF files correctly and serves them with a text/HTML content-type. I can't guess all the possibilities without seeing anything or having any more details though.

                            How big are some of these PDF files? As we have mentioned several times in this thread, it would help if you could describe your project: the file types indexed, the number of files, the sizes of the files, whether you are indexing a multi-lingual site, whether you are indexing multiple sites, ...etc.
                            --Ray
                            Wrensoft Web Software
                            Sydney, Australia
                            Zoom Search Engine



                            • #15
                              Yes, it is the file system limit, not RAM, that I was referring to.

                              Can you not allow 3GB for the CGI, seeing as you have separate Windows and Linux options for the CGI version?

                              Also:
                              - I'm not trying to index 100MB PDFs
                              - I'm not trying to index the whole internet (just 100,000 PDFs)

                              Average words per file is ~5,000.
                              Average unique words per file is ~40.

                              The high unique word count is mostly due to people's names in the documents, chemical names, and numbers.

                              There is some rubbish in some of the PDFs but not enough to alter the situation.
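
                              For what it's worth, those figures are roughly consistent with the "several million unique words" mentioned earlier; a quick back-of-the-envelope check, treating the per-file unique words as an upper bound since words repeat across files:

                                  #include <stdio.h>

                                  int main(void)
                                  {
                                      long files = 100000;            /* number of PDFs */
                                      long words_per_file = 5000;     /* average words per file */
                                      long unique_per_file = 40;      /* average unique words per file */

                                      printf("total word occurrences: %ld\n", files * words_per_file);     /* 500,000,000 */
                                      printf("unique words, upper bound: %ld\n", files * unique_per_file); /* 4,000,000 */
                                      return 0;
                                  }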

