hi,
we use Zoom Search v6 Enterprise, and I have to crawl a large e-commerce site with nearly 60,000 items/pages.
I use some custom datafields for article number, EAN, price, and manufacturer.
I stripped out as much as possible via ZOOMSTOP/ZOOMRESTART, use a lot of noindex,follow meta tags, and adjusted the robots.txt so that only the needed article data gets indexed. I also use the CGI option.
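For illustration, the relevant bits of the page templates look roughly like this (the datafield names and values are placeholders for my setup, not the real shop markup):

```html
<!-- Category/listing pages: follow the links to the articles, but don't index the page itself -->
<meta name="robots" content="noindex, follow">

<!-- Article pages: custom datafields as meta tags (names are just examples) -->
<meta name="articlenumber" content="A-12345">
<meta name="ean" content="4006381333931">
<meta name="price" content="19.99">
<meta name="manufacturer" content="ExampleBrand">

<!-- Article pages: boilerplate wrapped so the indexer skips it -->
<!-- ZOOMSTOP -->
  ...navigation, footer, cross-selling boxes...
<!-- ZOOMRESTART -->
<h1>Article title and description text that should be indexed</h1>
```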
So far so good.
Indexing the whole site takes around two hours on the first crawl.
Sending a search query against this dataset takes up to 2 seconds. That seems like a lot, since your comparison of crawling/searching large websites lists lower query times.
On the other hand, I suspect it's because all the technical data being indexed bloats the index files. How can I optimize this?
So this is all OK when building the index for the first time. But I have one problem.
I tried incremental indexing, and it takes too long to add new and changed pages (and changes happen every 15 minutes).
The pages return a proper Last-Modified header, so it doesn't re-index the whole site, but it's still slow.
What can I do?
I saw the option to provide a text file with new/changed pages and use console mode.
Would it be faster to provide all changed/new pages in this text file?
I suppose I can't provide a URL that returns the list of all new pages?
I could set up a simple SQL query to print out all pages that are newer than the zoom_index file time.
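The SQL side would be trivial; a rough sketch (table and column names are made up, and :index_mtime would be the file time of the existing zoom_index files, passed in by whatever script runs the query):

```sql
-- Placeholder schema: a `products` table with a `last_modified` timestamp
-- and a `url_path` column holding each article page's path.
SELECT CONCAT('https://shop.example.com', url_path) AS url
FROM products
WHERE last_modified > :index_mtime
ORDER BY last_modified;
-- The output (one URL per line) would be dumped into the text file handed to the indexer.
```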
What is the best strategy to handle this case:
re-indexing all new/updated files every 15-30 minutes, provided via HTTP as a text file?
Thanks, and sorry for all the blabla ^^ I just wanted to describe my scenario as thoroughly as possible.