Hi, I am using Spider mode to crawl a site which predominantly contains links to a lot of files of various types (PDFs, MS Office, Zip). However, after I have uploaded the DB and do a search, most of the titles in the search results contain ASCII control characters, such as NAK, ACK, SO, VT, HT etc. These are displayed as question marks and/or the spaces between the words of the titles are removed. Any clues as to what might be causing this?

Second question. At the moment, if I choose any of the options to index the content, I end up with a very large DB, which takes about 30 seconds to return the results. If I deselect the option to index content, the DB is obviously significantly smaller, resulting in a search which takes a fraction of a second to complete. As I see it, the Indexing options are global. It would be useful if the Indexing options were extended to cater for different file types. Is this something which is planned for a future release?

Third question. When can we expect the next release?

Many many thanks,
Kind regards,
Russ
ASCII control characters appearing in the titles of the search results
-
1) My first guess regarding the strange characters you are seeing is that your web server is serving the file types in question with an incorrect MIME / Content-Type header. You can check this with your browser's Developer Tools (in both Chrome and Firefox): under the "Network" tab, there are options to view/copy the "Response Headers".
If your web server returns a Content-Type header of "text/plain" for a .PDF file, then Zoom will obey your web server and treat the file as a text file instead of a PDF file. In doing so, it will index a lot of garbage (thus your large database and slow search) instead of actual content.
So first confirm if the above is happening, then you should look into configuring your web server to return the proper content-type headers for the file extensions necessary. Ask your web hosting company if this is new to you.
2) I think the problem with the large index is due to the above, so fixing that would also fix the issue with your index size. Having said that, if you want to configure any file extension to not index the content, you can do so under "Configure"->"Scan options"->"Scan extensions". Here you can add/remove extensions and change how they are treated (note, however, that the abovementioned Content-Type header will take priority over the file type specified here). There is a file type here for "Binary file" which, by default, will only index the filename (and not the content). So, for example, you can have all ".zip" files indexed by filename only by giving them this file type.
3) V7.1 is the current release. We don't have any set schedule for the next release. All minor version increments (e.g. V7.x) will be a free upgrade. Major version increments (e.g. V8.0) will be a free upgrade for anyone who purchased the software within 6 months of its release. So you can rest assured you won't be caught with an old version right after purchasing.
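For anyone who would rather script the header check from point 1 than click through the browser tools, here is a rough Python sketch. It is not part of Zoom; the `EXPECTED_TYPES` table and the function names are my own assumptions, and it assumes your server answers HEAD requests (some do not, in which case a normal GET would be needed).

```python
from urllib.request import Request, urlopen

# Assumed mapping of extensions to the MIME type we expect the server to send.
# Extend this for the file types on your own site.
EXPECTED_TYPES = {
    ".pdf": "application/pdf",
    ".zip": "application/zip",
    ".doc": "application/msword",
}


def media_type(content_type_header):
    """Return the bare media type, dropping parameters such as charset."""
    return content_type_header.split(";", 1)[0].strip().lower()


def is_expected_type(url, content_type_header):
    """True if the header matches the type expected for the URL's extension."""
    path = url.split("?", 1)[0].lower()
    for ext, expected in EXPECTED_TYPES.items():
        if path.endswith(ext):
            return media_type(content_type_header) == expected
    return True  # unknown extension: nothing to compare against


def fetch_content_type(url):
    """Send a HEAD request and return the Content-Type header (or "")."""
    with urlopen(Request(url, method="HEAD")) as response:
        return response.headers.get("Content-Type", "")
```

Usage would be something like `is_expected_type(url, fetch_content_type(url))` for each document URL; any URL where this returns False is a file the server is labelling with an unexpected type.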
-
Hi Ray,
Thanks for the prompt reply.
1) To try and isolate the problem, I copied the site over to a local instance of IIS running on Windows 7. I made sure that the MIME type for PDFs was correct, i.e. "application/pdf", and checked the HTTP response headers using Fiddler; all seemed to be in order. However, I am still getting the problem. To elaborate a little, I have set the Spidering option for the Start URL to "Follow links only". Because of the performance hit that I am currently experiencing, I have selected the Indexing options "Title of page", "Meta description", "Meta keywords", "Filename" and "Link text". I am not indexing the "Page content". If I do a search after the DB has been built, I notice that random spaces in the filename have been replaced by any of the bottom 32 ASCII characters and sometimes by a question mark. The spaces in the full URL, and in the hyperlink behind the title, are intact, albeit replaced by Hex 20 or %20. I also looked in "zoom_pagedata.zdat" using Notepad++ and could see that the indexing has replaced some of the spaces in the title. For example, the URL could be:
http://localhost/international/Conte...dy%20(USA).pdf
But the title following the bar symbol is:
Twenty-TenFFandVTBrand?Case Study (USA).pdf
where "FF", "VT" and "?" represent spaces that have been replaced by Form Feed, Vertical Tab and question mark characters. Is there anything else you can think of that would help me to resolve this issue?
2) Accepted, but until I have fixed point 1), I won't know for sure.
3) Noted, and thanks.
Kind regards,
Russ
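To make the invisible characters described above easier to spot, a small Python sketch like the following can list which control codes appear in a title and where. It works on a title string you have already copied out of the index; it does not parse "zoom_pagedata.zdat" itself, and the function names are my own.

```python
# Standard abbreviations for the 32 ASCII control codes (0x00-0x1F).
CONTROL_NAMES = [
    "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
    "BS", "HT", "LF", "VT", "FF", "CR", "SO", "SI",
    "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
    "CAN", "EM", "SUB", "ESC", "FS", "GS", "RS", "US",
]


def find_control_chars(title):
    """Return (index, name) for every ASCII control character in title."""
    return [
        (i, CONTROL_NAMES[ord(ch)])
        for i, ch in enumerate(title)
        if ord(ch) < 32
    ]


def make_visible(title):
    """Replace each control character with a readable <NAME> marker."""
    return "".join(
        f"<{CONTROL_NAMES[ord(ch)]}>" if ord(ch) < 32 else ch
        for ch in title
    )


# The title from the post, with the real Form Feed and Vertical Tab bytes.
sample = "Twenty-Ten\x0cand\x0bBrand?Case Study (USA).pdf"
print(make_visible(sample))
```

Note that a literal question mark is an ordinary printable character, so it is left alone here; only the bottom 32 codes are flagged.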
-
Hi Ray,
Thanks for the prompt reply. Sorry for the delay in responding. I am having problems with your forum rejecting my posts (getting "Missing Human Verification Information"), so here is a rerun of what I tried to send yesterday.
1) I copied the site over to a local copy of IIS so that I could check MIME types and HTTP response headers. I can confirm that all MIME types are correct, e.g. 'application/pdf' is being used for PDF. I used Fiddler to monitor the session and confirmed that the Content-Type in the HTTP response header is also set to 'application/pdf'. I then reindexed the local version of the site and copied the DB over to the local instance. Unfortunately, I still have the problem. Just to reiterate what I was saying in my earlier post: some of the spaces in some of the titles of some (not all) of the search results are being replaced either by one of the bottom 32 ASCII characters or by a question mark. Because the ASCII control characters are non-printable, some words appear run together, or are separated by a question mark. Checking the "zoom_pagedata.zdat" file in Notepad++, I can see that the hyperlink behind the title is unaffected, albeit with spaces replaced by Hex 20 or %20. It is just the title representing the filename that is affected. So, for example, the following is an example URL behind a title:
http://localhost/international/Conte...1%20Lookup.pdf
But the title is:
Product?USA2006FFtoVT2011 Lookup.pdf
Where ?, FF and VT represent question mark, Form Feed and Vertical Tab characters that have replaced some of the spaces in the title. This means that when rendered in HTML, the title is displayed as:
Product?USA2006to2011 Lookup.pdf
I am running Zoom Indexer in Spidering mode with an ASP script output. The Start URL has been set to "Follow links only" and for the Indexing options I have turned off "Page content" (purely because of the performance problems I am facing).
2) Noted. This might be the case, but until point 1 has been fixed I am unable to confirm.
3) Noted and thanks.
Kind regards and many thanks,
Russ
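Since the URL is intact and only the title is damaged, one way to pin down exactly which positions were altered is to decode the filename from the URL and compare it character by character with the stored title. The Python sketch below does that; the URL and title in it are hypothetical stand-ins (the real URLs above are truncated), and the function names are my own.

```python
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of url with %XX escapes decoded."""
    path = urlsplit(url).path
    return unquote(path.rsplit("/", 1)[-1])


def mismatched_positions(url, title):
    """List (index, expected_char, actual_char) where title differs from the
    filename in url. Returns None if the two differ in length, since a
    position-by-position comparison would then be meaningless."""
    expected = filename_from_url(url)
    if len(expected) != len(title):
        return None
    return [
        (i, e, t)
        for i, (e, t) in enumerate(zip(expected, title))
        if e != t
    ]


# Hypothetical example: two spaces replaced by Form Feed and "?".
url = "http://localhost/files/My%20Sample%20File.pdf"
title = "My\x0cSample?File.pdf"
print(mismatched_positions(url, title))
```

If every mismatch turns out to be a space in the URL-derived name, that supports the observation that only spaces are being replaced.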
-
Those characters appearing in the Title are unusual; I can't say I've seen that before. What you should check is opening one of these PDF files in Adobe Acrobat, looking at the Properties (File->Document properties) and seeing what Title is stored within the PDF file. It is possible the PDF file was created with this title, perhaps by whatever program you used to generate the PDF file in the first place. For example, if you were printing to a "PDF Printer", this could explain the Form Feed and other unusual characters in question. But like I said, I've never seen this.
If you can't figure this out, e-mail us the PDF file in question, along with your .zcfg file with your indexer configuration. And we will try to reproduce it here and have a closer look.
Another note -- if you are not indexing dynamic web pages (e.g. PHP, ASP, etc.) and only static documents like PDF and DOC files, then you might want to consider using Offline Mode. Your search result links will still point to "http://localhost/" if you set your Base URL correctly.
-
If the problem was avoided by switching to Offline Mode, this means it has something to do with how the files are served via your web site. For example, if your web site serves the PDF files via a download script (e.g. "download.php?mydoc=1234") then the script would determine what the filename is, and it would declare this in the HTTP header. So any funny characters in the filename are likely generated at that point.
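If a download script is involved, the filename it declares would normally travel in the Content-Disposition response header. As a rough way to check that header for stray characters, here is a Python sketch. The parsing is deliberately simplistic (it only handles the plain `filename=` form, not the extended `filename*=` form), the function names are my own, and the sample header values are made up.

```python
import re


def filename_from_content_disposition(header_value):
    """Extract the filename parameter from a Content-Disposition value.

    Handles filename="quoted name" and filename=token only; returns None
    if neither form is present.
    """
    match = re.search(r'filename="([^"]*)"', header_value, flags=re.IGNORECASE)
    if match:
        return match.group(1)
    match = re.search(r"filename=([^;]+)", header_value, flags=re.IGNORECASE)
    return match.group(1).strip() if match else None


def has_control_chars(text):
    """True if text contains any of the bottom 32 ASCII characters."""
    return any(ord(ch) < 32 for ch in text)


# Made-up header value with a Form Feed where a space should be.
header = 'attachment; filename="My\x0cFile.pdf"'
name = filename_from_content_disposition(header)
print(repr(name), has_control_chars(name))
```

You can copy the real header value from Fiddler or the browser's "Network" tab and run it through these two functions.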