Question about Memory Limits - RAM usage


  • Question about Memory Limits - RAM usage

    This has been bothering me ever since I started using Zoom.

    When I set the limits, Zoom checks how many pages or search words it can index against my system memory. Doesn't Zoom use the HD as a virtual buffer like, say, every other program? Does it really keep the entire index in RAM?

    My company does a lot of video editing, and we're now offering some training DVDs in a format that will play on our clients' cell phones. So we have to downconvert the huge 8.5GB DVD files to something with a 320x320 resolution that takes up no more than a single 256MB memory card.

    The video encoding software has to deal with the 8.5GB file, and there is only 2GB of RAM on the system the video is encoded on (which isn't really enough, but it works). How does it do it? It writes to the hard drive.

    So what's up with Zoom? Why handle everything in memory?
    My Zoom-searchable poetry archives web site.
    http://poetryx.com

  • #2
    Zoom will use the Windows paging file (like most applications do when short on RAM).

    Zoom also writes out some of the indexed data as it goes. You can see this in the .tmp files that Zoom creates while indexing is in progress, for example zoom_pagetext.tmp.

    But even so it uses a lot of RAM, because it holds the core of the index in RAM. It does this for fast access. Zoom can read from RAM at speeds greater than 1GB/sec, but disk speeds are more like 5MB/sec (when seeking for data at random in a file). So we can get a roughly 200-times speed increase by not using the disk for the core part of the index.

    We did try making much heavier use of the disk, but it was painfully slow once about 20% of the index was swapped out to disk. The technical term is thrashing.

    The reason you can process a DVD video, even with a small amount of RAM, is that the data is processed in a sequential fashion. A small block of data is read, processed and written out. Only a small block of data is required at a time. This is not analogous to creating a search index, where the output data is not sequential with the input data (if this makes sense?).
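
    To illustrate the sequential pattern, here is a rough sketch in Python (not Zoom's or any particular encoder's code); the file names, the transcode() function and the chunk size are just placeholders:

    # Sketch of sequential (streaming) processing: only one small block
    # is ever held in RAM, regardless of how big the input file is.
    CHUNK_SIZE = 4 * 1024 * 1024          # 4 MB per block (illustrative)

    def transcode(block):
        # Placeholder for the real encoding step.
        return block

    with open("input.vob", "rb") as src, open("output.3gp", "wb") as dst:
        while True:
            block = src.read(CHUNK_SIZE)
            if not block:
                break
            dst.write(transcode(block))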

    We have reduced the memory footprint a lot over the last few releases, but we are now at the point of diminishing returns.

    Our logic has also been that if you have a web site with more than 200,000 pages (which is when the RAM usage really becomes a problem), then you probably:
    1) Have the budget to purchase some extra RAM, or already have a machine with the RAM maxed out.
    2) Would prefer better performance at the cost of extra RAM usage.

    ------
    David

    • #3
      The output data may not be sequential, but the input data is (or rather its sequence doesn't matter). In other words, you could marshal data from the web and store it on the hard drive (which I believe the spider is already doing in its cache, no?) and then grab chunks at a time to process and index. I was under the impression that it was already doing that, which is why I was surprised that the number of pages spidered (and copied to the hard drive) was contingent on the size of the system memory.

      Or you could do what pretty much EVERY other spidering program does, and that's download, process, write to HD, repeat.

      Then you could spider an infinite number of pages.

      One use of Zoom could be to spider pages from MANY sources, a la a web search engine. That isn't really possible with the current version, as the impractical utilisation of memory puts artificial limits on the size of an index.
      My Zoom-searchable poetry archives web site.
      http://poetryx.com

      • #4
        For most pages, a copy of the spidered page is not kept on the hard drive. (The exception is a certain percentage of pages that end up in the IE cache.)

        It is not the pages that are kept in RAM (or on disk). It is the index that is kept in RAM. The process you suggest cannot work for this problem.

        The Zoom index is very similar to an index that you would find at the back of a large book. So for each word you get a list of pages that the word appears on.

        For example,
        APPLE,4,7,34,786
        ZEBRA,987
        would mean that the word APPLE appears on pages 4, 7, 34 and 786, and ZEBRA only on page 987.

        The index in a typical book only lists key words. Zoom indexes every word and stores additional information (like the page title and ranking information), but otherwise works in a manner similar to this.
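
        To make the structure concrete, here is a rough sketch in Python of a book-style word index like the one above. It is only an illustration, not Zoom's actual code or file format:

        # Sketch: map each word to the list of pages it appears on.
        from collections import defaultdict

        index = defaultdict(list)              # word -> [page numbers]

        def add_page(page_number, text):
            for word in set(text.upper().split()):
                index[word].append(page_number)

        add_page(4, "apple pie recipe")
        add_page(987, "zebra crossing")
        print(index["APPLE"], index["ZEBRA"])  # [4] [987]

        The important point is that the whole table lives in RAM and grows with every page indexed, which is exactly the trade-off discussed below.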

        If you were given a truly large book with 200,000 pages, imagine the process that would be required to build up a complete word index. As you process each page the index gets larger, and eventually it becomes very large as more words are processed.

        At this point you might think you can write out parts of the index to disk. But which parts? If you write out, for example, the part of the index that covers words starting with the letters A to C, then you can be fairly sure that the next page you index will contain a word that starts with A, B or C. So you need to immediately read the A to C part of the index back from disk again in order to add the new page to the index. This type of algorithm would result in a huge amount of disk activity but very little actual indexing.

        The indexing problem is more complex than it may appear at first glance. There are trade-offs to be made between RAM usage and processing speed.

        Or you could do what pretty much EVERY other spidering program does...
        I would never claim to know how EVERY other spidering program works, but the ones we have looked at work in a similar manner to ours. Most are less efficient, slower, and use even more RAM than ours. In particular, the ones written in scripting languages (PHP, ASP, etc.) are either appallingly inefficient in their RAM use or appallingly slow.

        Then you could spider an infinite number of pages.
        Infinite is a big number and I don't think we'll ever index that many pages.

        One use of Zoom could be to spider pages from MANY sources, a la a web search engine.
        It already can. We have one customer that indexed a portion of every web site in Poland (the country). We have another customer building an Australian search engine (~50,000 sites).

        impractical utilisation of memory puts artificial limits on the size of an index
        There is no impractical utilisation and no artificial limits. If you want to claim otherwise you'll need to back up your claim with detailed benchmarks showing RAM usage and indexing speed. We have spent a long time (years) looking at the RAM vs speed issue and think we have made a good trade-off.

        --------
        David

        • #5
          I realise there are certain limitations of Zoom because it uses flat files for its index rather than a relational database. If the indexer were processing the word list of one page after another it could store the words/pages from each page as it went and again, like a river, an infinite amount of pages could "flow" through the engine. Since CPUs are fast and hard drive access is FAR FAR faster than most internet connections, the bottleneck would be where it is currently: how fast you can spider the pages.
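
          Roughly, the approach I have in mind looks something like this (a sketch only; the file name and record format are just placeholders):

          # Append each page's word list straight to disk as it is spidered,
          # so RAM use stays flat however many pages flow through.
          def index_page(page_number, text):
              with open("postings.txt", "a") as out:
                  for word in set(text.upper().split()):
                      out.write("%s,%d\n" % (word, page_number))

          # Searching still needs these records grouped by word
          # (e.g. APPLE -> 4,7,34,786), so a later pass over postings.txt
          # would be needed to build that grouping.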

          Zoom isn't really competing with services like Google, but if you used a database for storing the search words and index you could extend Zoom's scalability immeasurably.

          Right now Zoom is limited to about a million pages and keywords per 5GB of system RAM (based on a max 1MB page - and while it may not need to index that many words, halving or quartering the keyword limit doesn't lower the RAM utilisation by a large degree).

          Granted, that's an estimate, because the engine will pop up an ERROR message if you set the limits higher than the indexer's estimate of your RAM and how large a page will be.

          Since most consumer PCs are limited to 4GB (that's the upper memory limit of 32-bit Windows XP), that's fewer than a million keywords from fewer than a million pages - in other words, an upper limit to how many pages Zoom can spider.

          Google, for instance, brags that they have over 8 BILLION pages in their index.

          Not a fair comparison, I know, because Google uses proper databases and more powerful servers than the consumer PCs Zoom is intended to run on, but spidering every site in even a small country is unlikely when you can only index about 800,000 total pages on a Windows XP system.
          My Zoom-searchable poetry archives web site.
          http://poetryx.com

          • #6
            I realise there are certain limitations of Zoom because it uses flat files for its index
            Incorrect. Zoom does not use flat files. The index files are indexed. You don't get sub-one-second search times from flat files.

            If the indexer were processing the word list of one page after another it could store the words/pages from each page as it went and again, like a river
            Although you might think the image of a river is appealing, it holds no water for the indexing problem.

            ...an infinite amount of pages could "flow" through the engine
            Our aim is not to index an infinite amount of data. It is not a realistic aim.

            and hard drive access is FAR FAR faster than most internet connections
            We have many customers on intranets. Their network speeds are 1Gbit/sec. A disk doing random access does not keep up - not even close, in fact. We also have another big group of users indexing local files (without network access). So your suggestion to slow indexing down would punish these customers.

            Since most consumer PCs are limited to 4Gb, since that's the upper memory limit of 32-bit Windows XP
            The limit is 2GB in fact (3GB on servers). A PC's full virtual address space is not all usable by any application. The upper 2GB is always reserved by the O/S.
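
            If you want to see this for yourself, here is a small sketch (Python calling the Win32 GlobalMemoryStatusEx function; nothing Zoom-specific) that reports the user-mode virtual address space of the current process. Run as an ordinary 32-bit process it reports roughly 2GB, no matter how much physical RAM is installed:

            import ctypes

            class MEMORYSTATUSEX(ctypes.Structure):
                _fields_ = [("dwLength", ctypes.c_ulong),
                            ("dwMemoryLoad", ctypes.c_ulong),
                            ("ullTotalPhys", ctypes.c_ulonglong),
                            ("ullAvailPhys", ctypes.c_ulonglong),
                            ("ullTotalPageFile", ctypes.c_ulonglong),
                            ("ullAvailPageFile", ctypes.c_ulonglong),
                            ("ullTotalVirtual", ctypes.c_ulonglong),
                            ("ullAvailVirtual", ctypes.c_ulonglong),
                            ("ullAvailExtendedVirtual", ctypes.c_ulonglong)]

            status = MEMORYSTATUSEX()
            status.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
            ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(status))

            # ullTotalVirtual is the address space the process can actually use.
            print("User address space: %.1f GB" % (status.ullTotalVirtual / 2.0**30))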

            But the solution is obvious. 64bit machines are now available and we have already moved some of our software over to 64bit. Anyone with a million page web site isn't going to be worried about the cost of a couple of GB of extra RAM. So we aren't worried.

            Not a fair comparison, I know, because Google uses proper databases
            Like Zoom, Google uses a custom designed database. I am not sure where you get your information from?

            ...and more powerful servers than consumer PCs for which Zoom is intended to run
            Also incorrect. They use fairly standard PCs, but lots of them - I believe more than 30,000 PCs running simultaneously.

            In fact we believe, based on published specs, that Zoom outperforms Google's low-end (single PC) solutions in a few areas.

            -----
            David

            • #7
              Originally posted by Wrensoft
              Incorrect. Zoom does not use flat files. The index files are indexed. You don't get sub 1 second search times from flat files.
              I noticed that the indexes were compressed, but I wasn't sure of their format.

              BTW, the Zoom manual says that it produces a "pre-indexed flat file database" (section 1.1, page 5).

              So is it a flat file or a database?

              Although you might think the image of a river is appealing it holds no water for the indexing problem.
              Excuse the pun? A river flows water, the incoming pages flow data - the analogy is apt. It's just that most rivers can flow a lot more water, like, say, more than 800,000 units of water.

              Our aim is not to index an infinite amount of data. It is not a realistic aim.
              Well, the upper limit right now is less than 800,000.

              We have many customers on intranets. Their network speeds are 1Gbit/sec...So your suggestion to slow indexing down would punish these customers
              So make it an option. Right now *everyone* is punished with the 800K limit.

              The limit is 2GB in fact (3GB on servers). A PC's full virtual address space is not all usable by any application. The upper 2GB is always reserved by the O/S.
              So you're saying that Zoom can really only index 400,000 pages (assuming 1024KB pages and 500,000 indexed terms)?

              But the solution is obvious. 64bit machines are now available and we have already moved some of our software over to 64bit. Anyone with a million page web site isn't going to be worried about the cost of a couple of GB of extra RAM. So we aren't worried.
              There are very few million page web sites, of course, but once you have a spidering search engine you aren't limited to spidering only your own site(s). The internet has, at best estimate, hundreds of billions of pages. Zoom can't spider even a hundredth of a percent of them.

              Like Zoom, Google uses a custom designed database. I am not sure where you get your information from?
              I've been to Google's campus and have seen some of their server racks.

              In fact we believe, based on published specs, that Zoom outperforms Google's low-end (single PC) solutions in a few areas.
              That would be hard to prove (either way) but even if that's the case, my point is that Google wouldn't be in business long if they could only spider 400,000 web sites.

              However, barring spidering that many sites in one swell foop, is there a way to merge multiple Zoom indices? If they were run consecutively, or on different machines?
              My Zoom-searchable poetry archives web site.
              http://poetryx.com

              • #8
                You think we should be building a search engine with infinite capabilities (or at least one as big as Google's). We don't think this is realistic, and in addition we don't believe there is enough market demand to justify the development cost.

                I think this summarizes the entire discussion and don't think there is much to add.

                ----
                David

                • #9
                  I think you *could* make Zoom handle more than 400,000 pages, and that generating an error from an estimate is wrong (most pages won't be close to 1MB in size, but for the one or two that are larger you have to set that as your limit for EVERY page, and thus limit the number of pages you can spider).

                  It is impossible to index, say, every web site in Australia, with the current limits, without breaking the index up into multiple parts.

                  And you don't offer a way to merge indexes. So Zoom has built-in limits that you don't mention in your documentation.

                  I think that summarizes the discussion a little more clearly.
                  My Zoom-searchable poetry archives web site.
                  http://poetryx.com

                  • #10
                    The warning and error messages about low RAM are by design.

                    The limits due to RAM requirements have always been documented. See,
                    http://www.wrensoft.com/zoom/editions.html
                    and
                    http://www.wrensoft.com/zoom/support...tml#largesites

                    --
                    David
