Problem with čšž in DOC and PDF

  • Problem with čšž in DOC and PDF

    Hi there again

    I have a site with a lot of PDF and DOC files. The HTML files are encoded correctly, so their results are fine, but the DOC and PDF files have our characters mangled: č is missing, there is a � instead of š, etc.

    How does the Zoom indexer process the PDF and Word files? Our Word files are always in the MS1250 codepage, so I have used:

    ' ASP wrapper: run the Zoom CGI and relay its output through this page
    Response.Charset = "utf-8"
    Response.CodePage = 1250
    Dim WshShell, env, oExec
    Set WshShell = CreateObject("WScript.Shell")
    ' Pass the request to the CGI through process environment variables
    Set env = WshShell.Environment("Process")
    env.Item("REQUEST_METHOD") = "GET"
    env.Item("QUERY_STRING") = Request.QueryString
    Set oExec = WshShell.Exec(Server.MapPath("/search/search.cgi"))
    oExec.StdOut.ReadLine() ' skip the HTTP header line
    Response.Write(oExec.StdOut.ReadAll())
    Response.CodePage = 65001

    in my ASP page. But the DOC and PDF results are all mangled. Any ideas? Is there a setting to choose the correct codepage for PDF and DOC files?
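
One way to reproduce the symptom described above (a � where š or ž should be) is to read Windows-1250 bytes as if they were UTF-8. The sketch below is only an illustration in Python, not part of Zoom or the ASP wrapper; the helper name `misread_as_utf8` is hypothetical, and it assumes the document text really is stored as Windows-1250.

```python
def misread_as_utf8(text: str, source_codepage: str = "cp1250") -> str:
    """Encode text in a legacy codepage, then decode those bytes as UTF-8.

    Bytes that are not valid UTF-8 are replaced with U+FFFD, the
    replacement character that browsers usually draw as a question-mark
    diamond.
    """
    raw = text.encode(source_codepage)
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Assumption: these words are stored as Windows-1250 in the source file.
    for word in ("kažejo", "član", "minister"):
        print(word, "->", misread_as_utf8(word))
```

Plain ASCII words such as "minister" pass through unchanged, which would be consistent with only the accented letters being affected.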

  • #2
    And yes, I am very worried about this issue. I admit I didn't test this on the free version, but I've read that DOC files cannot be indexed in the free version.

    If I am not able to solve this (words containing our characters aren't found either, so it is not just a visual handicap), I will not be able to use this great software I bought. The site I am using it for has about 2000 Word documents.

    Using the CGI directly without ASP also produces wrong characters, so I guess this is a problem in Zoom and not in the combination of ASP and CGI.

    http://www.delavska-participacija.com/search/default.asp?zoom_sort=0&zoom_xml=0&zoom_query=minister&zoom_per_page=10&zoom_and=1


    • #3
      We had a look at the file. Was this created in Microsoft Word and with the Slovenian Language Pack installed and selected?
      http://office.microsoft.com/en-us/language/

      The language in Word can be selected in file properties or file options, and shown for the selected paragraph in the bottom bar.

      This is important because without this setting, the Word plugin would not be able to convert this to UTF-8 as needed.

      However, if the above isn't possible, it would index fine if you change your Indexer to use windows-1252 encoding (under "Configure"->"Languages"). It may also affect the necessary codepage settings used for your wrapper page, etc. though.
      --Ray
      Wrensoft Web Software
      Sydney, Australia
      Zoom Search Engine


      • #4
        I am not sure how the file was written; I guess with the Slovenian language pack and MS Word. I am sure it was written in MS Word.

        I have tried running your plugin "by hand": word2txt.exe with our DOC file. The resulting zoom_plugin.out seems to be encoded in ANSI, not UTF-8, but the characters are OK in the output text file. So zoom_plugin.out looks fine in my editor (although it reports ANSI, not UTF-8). I also tried importing zoom_plugin.out into Dreamweaver, and I can confirm it is not UTF-8, because I get Western European accents instead of Eastern European accents.

        PS - In my Word (I have Office 97, yes I know...), there is no option to see the language. Could you download one of the documents on this page and see if the language is set? The documents are not mine, so I am not sure which version of Office they were made in, but I guess it was on Slovenian Windows using the MS1250 codepage.
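
The "ANSI, not UTF-8" observation above can be checked outside an editor. The following is a minimal Python sketch, not Wrensoft's method: the helper names `is_valid_utf8` and `to_utf8` are hypothetical, and the Windows-1250 fallback is an assumption taken from the poster's description. It also shows why Dreamweaver displayed Western European accents: Windows-1252 and Windows-1250 place š and ž at the same byte values, but the byte for č is è in Windows-1252.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(data: bytes, fallback_codepage: str = "cp1250") -> bytes:
    """Return UTF-8 bytes, assuming non-UTF-8 input is in fallback_codepage."""
    if is_valid_utf8(data):
        return data
    return data.decode(fallback_codepage).encode("utf-8")


if __name__ == "__main__":
    # Stand-in for the plugin output; assumed to be Windows-1250 bytes.
    sample = "kažejo član".encode("cp1250")
    print(is_valid_utf8(sample))             # legacy bytes fail the UTF-8 check
    print(sample.decode("cp1252"))           # Western European reading of the same bytes
    print(to_utf8(sample).decode("utf-8"))   # converted text
```

To test a real file, the bytes could be read with `pathlib.Path("zoom_plugin.out").read_bytes()`. The validity check is a heuristic only: a very short legacy-encoded byte string can happen to be valid UTF-8.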


        • #5
          Unfortunately, using windows-1252 in the Zoom indexer settings and in the response codepage doesn't help. :( There seems to be no solution for this at the moment.
          Last edited by jerry2; May-19-2011, 12:19 PM.


          • #6
            We've investigated the issue and found a problem in the Word DOC file plugin. This has been fixed in a new plugin release, available here:
            http://www.wrensoft.com/zoom/plugins.html
            --Ray


            • #7
              There is also a PDF problem with č and ž, but not with š (that one is OK). See:

              http://www.delavska-participacija.com/search/default.asp?zoom_sort=0&zoom_xml=0&zoom_query=%22ampak+se+v+tem+pogledu%22&zoom_per_page=10&zoom_and=1

              See link 1, after the yellow highlight: "ka ejo" instead of "kažejo".

              Any hope for this? I am 100% sure this is a problem not only for Slovenian letters but also for at least the other Slavic and Eastern European languages; they have the same characters as we do, and more.

              I tried to send you a PM, but your inbox is full.


              • #8
                I had a look at the PDF file in question. The problem is in the PDF file itself.

                Do you know how the PDF file was created? The properties indicate it was created with "Jaws PDF Creator v3.4.1834". Was it originally scanned from a paper document? Was it a batch of pre-created PDF files that you ran through an OCR process to make searchable?

                A PDF file contains an invisible "text layer" under the visible page. Because a PDF file is much like an image, what you see is not necessarily the text data that is stored. When you use a search engine, it extracts the text layer. The same happens when you use the cursor in Acrobat Reader to select the text and copy it to your clipboard.

                So if you open that PDF file in Acrobat Reader, find the passage of text in question and select and copy it with your mouse, then paste the selection into NotePad, or Word or something... you'll see that the text is actually what Zoom displayed.

                This can happen with bad OCR. If the paper document, or the original PDF file was created without a text layer, an OCR process is needed to try and recognize the characters and create the text layer based on the visual data. It's never perfect, and some characters can be mis-recognized, as is the case here.

                To fix this problem, you would have to fix the OCR process. If you have the original document that these PDF files were created from (e.g. a Word .doc file), then you should use that and try to create the PDF again.
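
The copy-and-paste check described above can be made a little more systematic. This is a rough Python sketch under stated assumptions, not a Zoom feature: the helper name `missing_letters` is hypothetical, and the input is whatever text was copied out of Acrobat Reader or otherwise extracted from the PDF. If a long Slovenian passage contains š but no č or ž at all, the text layer itself is the likely culprit.

```python
def missing_letters(text: str, letters: str = "čšž") -> list:
    """List the expected letters that never occur in the extracted text.

    The comparison is case-insensitive, so Č, Š and Ž count as present.
    """
    lowered = text.lower()
    return [ch for ch in letters if ch not in lowered]


if __name__ == "__main__":
    # Assumption: text pasted from the PDF's text layer, as in the
    # "ka ejo" example reported earlier in the thread.
    pasted = "ka ejo, da še"
    print(missing_letters(pasted))
```

A short passage may legitimately lack one of the letters, so this is only meaningful on a reasonably long sample of text.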
                --Ray


                • #9
                  Thank you for your explanation. I'll try to generate a PDF myself from Word with our characters, and I'll test whether Zoom extracts the text properly.
