Zoom V5 is looking to be a great enhancement over the existing software. This is a short update on one aspect of the development process of Zoom V5.
But before I get into that, I would like to remind everyone that we offer free upgrades for six months after a purchase, so if you purchase V4 now, you will get a free upgrade to V5 when it becomes available.
Over the last couple of weeks we have been looking deeply into the problem of indexing enormous web sites. By enormous, we mean one or more web sites having more than 250,000 pages in total.
At the moment, the V4 indexer requires a fair amount of RAM to index this many pages (around 1.5GB for 250,000 pages). It uses a lot of RAM because it holds part of the index in RAM while it is being built. This gives better indexing speed, provided you have enough RAM, but not having enough RAM made indexing enormous sites impossible. So the challenge was to move some of this data from RAM onto the hard disk without significantly reducing the indexing speed. (Access to the hard disk is at least 10 times slower than access to RAM.)
So our plan was to write additional partial index files to disk during indexing and merge the partial files at the end into a larger index, with the hope that the merge process would not take too long or use too much RAM.
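For the technically curious, the general technique looks something like the sketch below. This is illustrative Python only, not our actual code; the memory budget, the file format and all the names are invented for the example. Postings accumulate in RAM until a budget is hit, then get written out as a sorted partial file, and the partial files are merged in a single pass at the end:

    import heapq
    import os
    import tempfile

    MEMORY_BUDGET = 100_000  # invented figure: max postings held in RAM before spilling

    def spill_run(postings, run_paths):
        # Write the in-RAM postings out as one sorted partial index file.
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            for term, doc_id in sorted(postings):
                f.write(term + "\t" + doc_id + "\n")
        run_paths.append(path)
        postings.clear()

    def read_run(path):
        # Yield (term, doc_id) pairs back from a sorted partial file.
        with open(path) as f:
            for line in f:
                term, doc_id = line.rstrip("\n").split("\t")
                yield term, doc_id

    def build_index(documents, out_path):
        # documents: iterable of (doc_id, text) pairs.
        postings, run_paths = [], []
        for doc_id, text in documents:
            for term in text.lower().split():
                postings.append((term, str(doc_id)))
            if len(postings) >= MEMORY_BUDGET:
                spill_run(postings, run_paths)
        if postings:
            spill_run(postings, run_paths)
        # heapq.merge streams the sorted runs, holding only one entry
        # per run in RAM, so memory use stays flat however big the site is.
        with open(out_path, "w") as out:
            for term, doc_id in heapq.merge(*(read_run(p) for p in run_paths)):
                out.write(term + "\t" + doc_id + "\n")
        for p in run_paths:
            os.remove(p)

The extra disk writes during indexing and the single merge pass at the end are exactly where the time overhead described below comes from.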
Today was the first test of this new V5 code. For the first time we successfully indexed 500,000 small HTML documents on an old machine with only 512MB of RAM! A huge improvement on the ~2.5GB that would have been required to do the same thing with V4.
The downside was that writing out and merging the partial indexes on disk added nine minutes to the overall indexing time, which came to 56 minutes in total for the 500,000 files.
So we have reduced RAM usage five-fold for this enormous site, at the expense of around 19% longer indexing time (the 9-minute overhead on top of the roughly 47 minutes the indexing itself took).
This new code only kicks in when you index more than 65,000 pages. For small sites under this limit there is no impact from this change.
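Conceptually the switch is just a guard like this (again, purely an illustrative sketch; the constant name is invented):

    PAGE_LIMIT = 65_000  # sites at or under this limit are indexed entirely in RAM, as in V4

    def use_disk_spilling(page_count: int) -> bool:
        # Only enormous sites pay any cost for the new code path.
        return page_count > PAGE_LIMIT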
But this is just the first run. With further profiling and code optimisation, we hope to get the performance drop down to maybe only 5%, while still saving just as much RAM. Even this 5% will probably be offset by optimisations in other areas of the code, so V5 should still be faster overall. During the next week we also plan to push our test scenarios out to 1,000,000 HTML documents on the same old machine (1.8GHz CPU, 512MB of RAM).
As I get time I'll write about some of the other aspects of V5.
-----
David