Incorrect characters in Context description

  • Incorrect characters in Context description

    Despite downloading the latest version the Context Description or Meta Description shows a ? instead of '.

    For example :

    'Record Purge' menu is displayed incorrectly as:

    ?Record Purge? menu

    Is there any way of solving this?

    Thanks

  • #2
    It might just be a character set issue.
    What script are you using, PHP, ASP, CGI or JS?
    What character set are your source HTML pages in?
    What character set are you using in Zoom?
    Is the character really a single quote character, or are they 'smart quote' characters from a Word document?



    • #3
      The script is php.
      Not too sure what you mean by character set. I can't find any options for this in Zoom. The pages are constructed using RoboHelp v5.

      Cheers



      • #4
        It may be a problem I've just 'solved': if the original text in the page is pasted from Word, then the apostrophe characters inserted into the page are not plain ' but the opening and closing single quotation marks used by Word. Even if you set Zoom to use the same character set as the source HTML page (in my case charset=iso-8859-1), I still find that Zoom does not recognise them. The original HTML page looks fine, but in the context the apostrophes are replaced with ?.

        My only solution so far is to replace the Word quotation marks with the standard ASCII '. But if the character set is set up correctly, why does Zoom not recognise the characters?
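
A minimal sketch of the workaround described above (swapping Word's quotation marks for plain ASCII ones), assuming the page text is already available as a Unicode string. The function name `straighten_quotes` and the mapping table are made up for illustration; they are not part of Zoom or RoboHelp.

```python
# Hypothetical helper: map Word-style "smart" quotes to plain ASCII quotes.
SMART_TO_ASCII = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (also used as an apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}
_TABLE = str.maketrans(SMART_TO_ASCII)


def straighten_quotes(text: str) -> str:
    """Return text with curly quotes replaced by their ASCII equivalents."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(straighten_quotes("\u2018Record Purge\u2019 menu"))
```

Note that this loses the distinction between opening and closing quotes, and it leads to the word-joining side effect described in the next paragraph.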

        Replacing the ' as I have done then brings on another problem. If the phrase you are looking for is within single quotes in the HTML, then you don't appear to be able to search for it. Suppose your HTML says: Zoom is, according to their site, the 'best search engine'. Now searching for "best search engine" (with double quotes to treat it as a phrase) doesn't find it; you have to look for "'best search engine'" and include the apostrophes, which of course no one is going to do!

        I guess this is because the ' is regarded as a character that can join other words. (The same search used to work with the Word quotation marks in place as these were not regarded as joining characters.)

        If my logic is correct, then it would be nice if the next Zoom fix could stop the apostrophe character from being regarded as a joining character when it appears at what would otherwise be a word boundary, so that couldn't would be one word but 'could' is really just could when it comes to searches. I'm not sure how this would work with plural possessives....

        Vernon



        • #5
          Originally posted by caff
          The script is php.
          Not too sure what you mean by character set. I can't find any options for this in Zoom. The pages are constructed using RoboHelp v5.
          You will find character set options in Zoom on the "Languages" tab of the Configuration window, as "Encoding". Character set is also commonly known as "charset", "page encoding", or "code page". This is required knowledge for all web developers (or anyone dealing with HTML files). You can find out more online:
          http://en.wikipedia.org/wiki/Charset
          http://www.w3.org/International/O-charset

          I suspect your search template page is in a different encoding than the one specified in Zoom, and this is causing the curly quotes to appear incorrectly. For example, if you are indexing your files in "iso-8859-1" encoding, but your template file is set to UTF-8 encoding.
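
To illustrate how an encoding mismatch can turn curly quotes into "?" or similar, here is a small sketch using only Python's standard codecs. It is not a description of what Zoom does internally; it only shows the general effect. The sample string is the one from the original post.

```python
# The curly-quoted text from the original post.
text = "\u2018Record Purge\u2019 menu"

# ISO-8859-1 has no code points for curly quotes, so a lossy conversion
# substitutes "?" for each of them.
as_latin1 = text.encode("iso-8859-1", errors="replace")

# Windows-1252 (what Word typically produces) does have them: 0x91 and 0x92.
as_cp1252 = text.encode("cp1252")

# Reading those Windows-1252 bytes as UTF-8 is invalid; a tolerant decoder
# substitutes U+FFFD, which browsers usually draw as a "?"-like glyph.
misread_as_utf8 = as_cp1252.decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(as_latin1)
    print(as_cp1252)
    print(misread_as_utf8)
```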

          Originally posted by AmbitNewMedia
          It may be a problem I've just 'solved': if the original text in the page is pasted from Word then the apostrophe characters inserted into the page are not plain ' but the opening and closing single quotation marks used by Word.
          These "curly" quotation characters are commonly known as smart quotes or, of course, curly quotes (as opposed to Larry and Moe quotes). More information here:
          http://en.wikipedia.org/wiki/Quotation_mark,_glyphs

          Scroll down on the above Wikipedia page to "Quotation marks in English" and you will find some more details about the problems with these characters in certain character sets and their use in HTML files.

          To summarize briefly, you should not use the curly quote characters in an HTML file as they are. You should use HTML entities to specify the quote characters, that is, "&lsquo;" (left single quote) and "&rsquo;" (right single quote).
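
A short sketch of that replacement, assuming the HTML source has already been read into a Unicode string. The helper name `quotes_to_entities` is hypothetical; the entity names themselves (`&lsquo;`, `&rsquo;`, `&ldquo;`, `&rdquo;`) are standard HTML.

```python
import html

# Curly quote characters and their standard HTML named entities.
ENTITY_FOR = {
    "\u2018": "&lsquo;",  # left single quote
    "\u2019": "&rsquo;",  # right single quote
    "\u201c": "&ldquo;",  # left double quote
    "\u201d": "&rdquo;",  # right double quote
}
_TABLE = str.maketrans(ENTITY_FOR)


def quotes_to_entities(html_text: str) -> str:
    """Replace raw curly quotes with named entities (hypothetical helper)."""
    return html_text.translate(_TABLE)


if __name__ == "__main__":
    converted = quotes_to_entities("\u2018Record Purge\u2019 menu")
    print(converted)
    # Decoding the entities again gives back the original characters.
    print(html.unescape(converted) == "\u2018Record Purge\u2019 menu")
```

Because the result is pure ASCII, it reads the same whichever of iso-8859-1, windows-1252 or UTF-8 the page is served in.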

          Originally posted by AmbitNewMedia
          Even if you set Zoom to use the same character set as the source HTML page (in my case charset=iso-8859-1), I still find that Zoom does not recognise them. The original HTML page looks fine, but in the context the apostrophes are replaced with ?.
          I tried this, but could not reproduce the behaviour reported. First thing I would recommend is checking that you are using the latest version and build of Zoom available (V5.1.1003) by clicking on "Help"->"About". If not, you can download the latest build from our "What's new" page.

          Second, you should make sure the charset of your search_template.html page is also set to "iso-8859-1" to correspond to your source HTML, as well as your Zoom configuration (Zoom should have warned you on output if otherwise, but worth checking). If you are embedding the Zoom search script in another script or page, you will also have to check this too. The only way I could reproduce the behaviour you reported is if the template page is set to UTF-8 instead.
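
One way to double-check this by hand is to compare the charset each page declares in its meta tag against the encoding you chose in Zoom. The sketch below is only an illustration: the function names are made up, the regular expression is a rough approximation of how a meta charset is declared, and it does not look at HTTP headers, which can override the meta tag.

```python
import re
from typing import Dict, List, Optional

# Rough match for <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def declared_charset(html_bytes: bytes) -> Optional[str]:
    """Return the charset declared in a meta tag, lowercased, or None."""
    match = _META_CHARSET.search(html_bytes[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def mismatched_pages(expected: str, pages: Dict[str, bytes]) -> List[str]:
    """Names of pages whose declared charset differs from `expected`."""
    wanted = expected.lower()
    return [name for name, data in pages.items()
            if declared_charset(data) != wanted]


if __name__ == "__main__":
    pages = {
        "source.html": b'<meta http-equiv="Content-Type" '
                       b'content="text/html; charset=iso-8859-1">',
        "search_template.html": b'<meta charset="UTF-8">',
    }
    print(mismatched_pages("iso-8859-1", pages))
```

With the sample data above, only the template is reported, which corresponds to the UTF-8 template case mentioned in the post.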

          We have already looked into this area a number of times and have addressed the majority of the issues, but the use of native (i.e. non-entity) smart quote characters is typically troublesome when dealing with encodings (servers and browsers have all sorts of quirks with charsets). So the best solution, really, is to replace them with their corresponding HTML entities (which most reasonable HTML authoring programs should do automatically).

          Originally posted by AmbitNewMedia
          My only solution so far is to replace the Word quotation marks with the standard ASCII '.
          ...
          I guess this is because the ' is regarded as a character that can join other words. (The same search used to work with the Word quotation marks in place as these were not regarded as joining characters.)
          Just in case you do not already realize, you can turn off the ' character (we refer to it as the apostrophe character) from joining words on the "Indexing Options" tab of the Configuration window.
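
The effect of a word-join character on phrase matching can be pictured with a toy tokenizer. This is purely an illustration and makes no claim about how Zoom's indexer is implemented; `tokenize` and its `join_chars` parameter are invented for the example.

```python
import re
from typing import List


def tokenize(text: str, join_chars: str = "") -> List[str]:
    """Split text into lowercase words; join_chars count as word characters."""
    extra = re.escape(join_chars)
    return [word.lower() for word in re.findall(rf"[\w{extra}]+", text)]


if __name__ == "__main__":
    sentence = "the 'best search engine'."
    # Apostrophe joins words: the quotes stick to the first and last word,
    # so the phrase "best search engine" no longer matches.
    print(tokenize(sentence, join_chars="'"))
    # Apostrophe does not join: the quoted phrase is found, but a
    # contraction such as "couldn't" is split in two.
    print(tokenize(sentence))
    print(tokenize("couldn't"))
```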

          Originally posted by AmbitNewMedia
          If my logic is correct, then it would be nice if the next Zoom fix could stop the apostrophe character from being regarded as a joining character when it appears at what would otherwise be a word boundary, so that couldn't would be one word but 'could' is really just could when it comes to searches. I'm not sure how this would work with plural possessives....
          As described above, none of this should be necessary, if you change to the proposed solution with HTML entities. The quotes will then be separated from the word.

          But just as an FYI: there is already some behaviour like this, which determines the join based on the word boundary. However, we made the case of a join character preceding a word an exception, because it is commonly expected (admittedly more so for the other word-join characters than for the apostrophe), as in ".NET", ".PDF", "$30", "#ID", etc. There are also some minor cases where single quotes precede the word in abbreviations, and in foreign languages such as Dutch. So we do not plan to change this behaviour at this point.
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine



          • #6
            I changed the encoding in Zoom to "iso-8859-1" and this has pretty much resolved the problem, although you get:

            ' Record Purge ' instead of 'Record Purge'. I'm sure I can sort this spacing issue out, unless you know of anything to get rid of these spaces?

            Thanks for the responses.
