
**jerry2** · May-18-2011, 02:02 PM

And yes, I am very, very worried about this issue. I admit I didn't test this on the free version, but I've read that DOC files cannot be indexed in the free version.

If I am not able to solve this (words with our characters aren't found either, so it is not just a visual handicap), I will not be able to use this great software I bought. The site that I am using it for has about 2000 Word documents.

Using CGI directly without ASP also produces wrong characters, so this is a problem in Zoom, not in the combination of ASP with CGI, I guess...

http://www.delavska-participacija.com/search/default.asp?zoom_sort=0&zoom_xml=0&zoom_query=minister&zoom_per_page=10&zoom_and=1

**Ray** · May-19-2011, 08:05 AM

We had a look at the file. Was this created in Microsoft Word and with the Slovenian Language Pack installed and selected?
http://office.microsoft.com/en-us/language/

The language in Word can be selected in file properties or file options, and shown for the selected paragraph in the bottom bar.

This is important because without this setting, the Word plugin would not be able to convert this to UTF-8 as needed.

However, if the above isn't possible, it would index fine if you change your Indexer to use windows-1252 encoding (under "Configure"->"Languages"). It may also affect the necessary codepage settings used for your wrapper page, etc. though.
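For anyone weighing the windows-1252 option, a quick way to see which of the letters in question each codepage can represent at all is sketched below. This is only an illustration using Python's built-in codecs (`cp1252`, `cp1250`, `utf-8`); the helper name `unencodable` is made up for this example, and it says nothing about how Zoom itself handles these encodings internally.

```python
def unencodable(text: str, codec: str) -> list[str]:
    """Return the distinct characters of `text` that `codec` cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


# Compare the Western (1252) and Central European (1250) Windows codepages
# against UTF-8 for the Slovenian letters discussed in this thread.
for codec in ("cp1252", "cp1250", "utf-8"):
    print(codec, unencodable("čšžČŠŽ", codec))
```

With Python's standard codec tables, š and ž exist in both Windows codepages, while č and Č are only available in windows-1250 and UTF-8.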

**jerry2** · May-19-2011, 08:19 AM

I am not sure how the file was written; I guess with the Slovenian language pack and MS Word. I am sure it was written in MS Word.

I have tried running your plugin "by hand": word2txt.exe with our doc. The zoom_plugin.out seems to be encoded in ANSI, not UTF-8, but the characters are OK in the output text file. So the zoom_plugin.out is OK in my editor (but it shows ANSI, not UTF-8). I tried importing the zoom_plugin.out in Dreamweaver and I can confirm it is not in the UTF-8 codepage, because I get Western European accents instead of Eastern European accents.
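The symptom described here (ANSI rather than UTF-8, and Western accents in place of Eastern European ones) can be reproduced with a few lines of Python. This is a rough sketch, not a description of what word2txt.exe does: it assumes the plugin output is windows-1250 bytes, and the helper `looks_like_utf8` is a hypothetical name used only for this illustration. To test a real file, the bytes could be read with `open("zoom_plugin.out", "rb").read()` instead of the in-memory sample.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Assumption: the plugin wrote the text in the windows-1250 ("ANSI") codepage.
raw = "kažejo čšž".encode("cp1250")

print(looks_like_utf8(raw))   # is this byte string valid UTF-8?
print(raw.decode("cp1250"))   # read with the Central European codepage
print(raw.decode("cp1252"))   # read with the Western European codepage
```

Read as windows-1252, the same bytes show è where č was written, while š and ž survive because both codepages place them at the same byte values.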

PS - In my Word (I have Office 97, yes I know...), there is no option to see the language. Could you download one of the documents on this page and see if the language is set? The documents are not mine, so I am not sure in which Office version they were made, but yes, with Slovenian Windows, using the MS1250 codepage I guess.

**jerry2** · May-19-2011, 08:32 AM

Unfortunately using ms-1252 in Zoom indexer settings and in response codepage doesn't help.

:((( There seems to be no solution for this at the moment.

**Ray** · May-24-2011, 08:13 AM

We've investigated the issue and found a problem in the Word DOC file plugin. This has been fixed in a new plugin release, available here:
http://www.wrensoft.com/zoom/plugins.html

**jerry2** · May-24-2011, 09:01 AM

There is also a PDF problem with č and ž, but not š (this one is OK). See:

http://www.delavska-participacija.com/search/default.asp?zoom_sort=0&zoom_xml=0&zoom_query=%22ampak+se+v+tem+pogledu%22&zoom_per_page=10&zoom_and=1

See link 1 after the yellow highlight: "ka ejo" instead of "kažejo"

Any hope for this? I am 100% sure this is not only a problem for Slovenian letters but for at least all other Slavic and Eastern European languages; they have the same characters as we do, and more.

I tried to mail you a PM but your box is full.

**Ray** · May-25-2011, 04:16 AM

I had a look at the PDF file in question. The problem is in the PDF file itself.

Do you know how the PDF file was created? The properties indicate it was created with "Jaws PDF Creator v3.4.1834". Was it originally scanned from a paper document? Was it a batch of pre-created PDF files that you ran through an OCR process to make searchable?

A PDF file contains an invisible "text layer" under the visible page. Because a PDF file is much like an image, what you see is not necessarily the text data that is stored. When you use a search engine, it extracts the text layer. The same happens when you use the cursor in Acrobat Reader to select and copy text to your clipboard.

So if you open that PDF file in Acrobat Reader, find the passage of text in question and select and copy it with your mouse, then paste the selection into NotePad, or Word or something... you'll see that the text is actually what Zoom displayed.

This can happen with bad OCR. If the paper document, or the original PDF file was created without a text layer, an OCR process is needed to try and recognize the characters and create the text layer based on the visual data. It's never perfect, and some characters can be mis-recognized, as is the case here.

To fix this problem, you would have to fix the OCR process. If you have the original document that these PDF files were created from (e.g. a Word .doc file), then you should use that and try to create the PDF again.
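The copy-and-paste check described above can also be done roughly in code. The sketch below assumes the text has already been copied or extracted from the PDF's text layer by some other means (it does not read PDF files itself); the function name `letters_absent` and the sample string are hypothetical, with "ka ejo" taken from the search result mentioned earlier in the thread.

```python
def letters_absent(extracted: str, expected: str = "čšžČŠŽ") -> list[str]:
    """Return the expected letters that never occur in the extracted text."""
    return [ch for ch in expected if ch not in extracted]


# Hypothetical sample pasted from a PDF's text layer; "ka ejo" should read "kažejo".
pasted = "kot ka ejo izkušnje"

print(letters_absent(pasted, "čšž"))
```

A letter that never shows up across a long Slovenian document is a hint that the text layer lost it. On a short sample like this one it proves little, since a letter may simply not occur in the passage.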

**jerry2** · May-25-2011, 06:13 AM

Thank you for your explanation. I'll try to generate a PDF myself from Word with our characters and I'll test whether Zoom extracts the text properly.

Problem with čšž in DOC and PDF
