Using ZoomSearch Pro in Spider mode, I'm indexing a site comprising 50+ .asp pages and 200+ .doc/.pdf files (each typically a few MB in size). Because the URLs of the latter are generated dynamically when the former are executed, I have produced a list of these files in an additional page, "indexed.asp"
(in the form:
<a href=file1.pdf>x</a>
<a href=file2.pdf>x</a>
<a href=file3.pdf>x</a>
<a href=file4.doc>x</a> etc)
I then added this indexed.asp file to the "List of Start Points" (follow links only) after the start point URLs for .asp pages.
(Note, all the .pdf and .doc files have .desc files in the same directory - this may/may not be relevant.)
Indexing the .asp pages proceeds perfectly, and on reaching the indexed.asp file, the 200+ .doc/.pdf files are queued and then progressively processed. This works about 98% to 99% of the time: on each run, 2 to 8 files fail to be processed, each giving an error of the form:
10:52:06 - [ERROR] Can not write file C:\output directory\zoom_plugin.in (Error code: 5)
10:52:06 - [ERROR] Failed to write plugin file to disk: http:// site name/filexx.pdf
The files which fail (identified as filexx.pdf in the ERROR message above) appear to be completely random.
Re-running the indexing with identical parameters again produces between 2 and 8 failures, but of different files (and those which failed last time are successfully indexed this time). Out of a dozen or so attempts, I've never had fewer than 2 failures or more than 8.
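To check that the failures really do move between runs, the failed URLs can be pulled out of two saved indexing logs and compared. This is only a rough sketch: it assumes each log has been saved as plain text and that every failure produces a line in exactly the "Failed to write plugin file to disk:" form quoted above. The function names and the sample file names are my own, not anything provided by Zoom.

```python
import re

# Matches the second ERROR line quoted above and captures the URL after the colon.
FAILED_LINE = re.compile(r"\[ERROR\] Failed to write plugin file to disk: (.+)$")


def failed_urls(log_text):
    """Return the set of URLs reported as failed in one indexing log."""
    urls = set()
    for line in log_text.splitlines():
        match = FAILED_LINE.search(line)
        if match:
            urls.add(match.group(1).strip())
    return urls


def repeated_failures(first_log, second_log):
    """Return URLs that failed in both runs (empty if failures never repeat)."""
    return failed_urls(first_log) & failed_urls(second_log)


if __name__ == "__main__":
    # Hypothetical log excerpts in the format shown in this post.
    run1 = "10:52:06 - [ERROR] Failed to write plugin file to disk: http://example.com/file07.pdf\n"
    run2 = "11:15:41 - [ERROR] Failed to write plugin file to disk: http://example.com/file31.pdf\n"
    print(repeated_failures(run1, run2))
```

An empty result across several pairs of runs would be consistent with the failures being tied to timing rather than to particular files.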
Examining the failed files (in Acrobat Pro) reveals no errors, and producing a "List of Start Points" comprising just the URLs of the "failed" files results in a perfect index for this limited list.
So instead of producing the "indexed.asp" file, I produced a .txt file which just lists the URLs of the .doc/.pdf files, and imported this into the "List of Start Points" after the start point URLs for .asp pages. This works perfectly.
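For anyone wanting to build the same .txt list automatically, the links can be extracted from the HTML that indexed.asp returns. The sketch below is just one way to do it, assuming the page output has already been saved or fetched as a string and that the links are in the simple `<a href=...>` form shown earlier. The base URL, the function names and the sample markup are placeholders of mine.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag, quoted or unquoted."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def document_urls(html, base_url, extensions=(".pdf", ".doc")):
    """Return absolute URLs of linked files whose names end in the given extensions."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [
        urljoin(base_url, href)
        for href in parser.hrefs
        if href.lower().endswith(extensions)
    ]


if __name__ == "__main__":
    # Placeholder markup and base URL; substitute the real output of indexed.asp.
    sample = "<a href=file1.pdf>x</a>\n<a href=file4.doc>x</a>"
    urls = document_urls(sample, "http://example.com/docs/indexed.asp")
    # One URL per line, ready to save as a .txt file and import.
    print("\n".join(urls))
```

The output is one absolute URL per line, which is the layout of the .txt file I imported into the "List of Start Points".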
I'm curious as to why the second method works but the first does not. The only difference appears to be the large queue. (It's not that large - I see from other messages that some people have a queue >4000 long.)