This post continues the series on the new features coming in V5 of Zoom (when it is finally released).
If you are building an internet search engine, you will often be indexing PDF and DOC files that you don't control, because they are on someone else's web site.
A lot of the time the authors of these documents don't know how to, or forget to, correctly set the document properties. Having invalid meta data means searches are not as accurate as they should be, and results which display the meta data appear to be wrong.
This new feature will largely solve this problem by allowing the owner of the search engine to override the incorrect meta data in the document. New meta data is placed in .desc files.
To enable this feature, click on "Configure"->"Scan Options" and check the "Use the offline folder for all plugin .desc files" option. Then specify or select the folder where your .desc files are to be found.
With this setup, you can now index external sites using Spider Mode, and the Indexer will look in the local folder for the .desc files for any plugin-supported file formats (such as .pdf, .doc, etc.). This allows you to specify custom .desc files without having to host them on the remote web site.
Each offline .desc file needs to include the full domain name and URL path in its filename. This is usually everything after the "http://" or "https://" prefix. The filename must also end in ".desc" (see examples below).
However, since a number of characters that can appear in a URL are not valid in filenames, you must encode these characters in their hexadecimal form, preceded by a "%" sign. This is similar to the HTTP encoding required for URLs. The following is a list of the characters in a URL which must be encoded.
Character Encoded
\ %5C
/ %2F
: %3A
* %2A
? %3F
" %22
< %3C
> %3E
| %7C
For each of the above characters in a URL, substitute the Encoded form of the character when naming the .desc file for that URL.
Here are some examples of URLs and their corresponding .desc filenames:
URL: http://www.mysite.com/files/mydocument.pdf
.desc filename: www.mysite.com%2Ffiles%2Fmydocument.pdf.desc
URL: http://www.mysite.com/download.php?fileid=123
.desc filename: www.mysite.com%2Fdownload.php%3Ffileid=123.desc
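If you have a lot of URLs to cover, the naming rule above is easy to script. The following is only a rough sketch in Python, not part of Zoom itself: the function name desc_filename and the ENCODED table are made up for illustration, and it assumes the URL starts with a lowercase "http://" or "https://" prefix. Any character not in the table (including a "%" already present in the URL) is left unchanged, which is an assumption, since that case is not covered above.

```python
# Characters from the table above, mapped to their "%XX" encoded form.
ENCODED = {
    "\\": "%5C",
    "/": "%2F",
    ":": "%3A",
    "*": "%2A",
    "?": "%3F",
    '"': "%22",
    "<": "%3C",
    ">": "%3E",
    "|": "%7C",
}


def desc_filename(url: str) -> str:
    """Return the offline .desc filename for a URL (hypothetical helper)."""
    # Drop the "http://" or "https://" prefix, if present.
    for prefix in ("http://", "https://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    # Replace each character listed in the table; keep everything else as-is.
    encoded = "".join(ENCODED.get(ch, ch) for ch in url)
    # The filename must end in ".desc".
    return encoded + ".desc"


if __name__ == "__main__":
    print(desc_filename("http://www.mysite.com/files/mydocument.pdf"))
    print(desc_filename("http://www.mysite.com/download.php?fileid=123"))
```

Run against the two example URLs, this should print the same two .desc filenames listed above.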
Of course, the preferred solution would be to create documents with correct meta data in the first place. But when this hasn't been done, local .desc files can provide more accurate searches and better-looking results.
------
David