French Site - Query Problems

  • David
    replied
    As far as we are aware, there is no issue to be resolved. We did some testing and posted the results of our tests (see above), but didn't see the problem you are talking about. Once you get the config correct, it works fine for French as far as we know, and no one has provided an example to the contrary.

    So unless you are prepared to provide exact details of your configuration and copies of your input files we don't plan on investigating this issue. Otherwise there is nothing for us to investigate.


  • JCF1976
    Guest replied
    Further Work

    David, I did further work on this problem. I made another copy of the whole site and did a find and replace function to replace all of the ASCII characters with normal characters. I did download the most recent version of your software to use in crawling the site again.

    I am restricted by doing offline searches. I don't know if that makes a difference. I am sure you would say that it does not. ...If Zoom would obey my online robots.txt file, I could try to crawl the site online to see if there would be a difference. This is another reason why you would not be able to crawl the site and do testing.

    I see a slight improvement after the work I did, and possibly the work you have done on your program. If I do a one-word search with accents in the word, it does come up in the results, so it appears that Zoom did indeed index the accented words. If I do a multi-word search, it also gives results, but it's hard to tell exactly what kind of results I am getting. What I have tried to do is a multi-word search in quotes, where the words contain accented characters. This does not work. So, this appears to be where your indexing breaks down.

    Also, the words are not being highlighted, if I do searches with accented words. This is disappointing, of course. I hope you will be able to fix this.

    I did download your files and took a look at them. I also reviewed the searches you performed. Again, I saw that you did not try to perform any searches with two or more words in quotes, where the words are accented. This kind of search is critical on my site.

    I look forward to your further responses. Based on m00di's posts, it's evident that others are also interested in this being resolved.


  • JCF1976
    Guest replied
    Originally posted by m00di
    Hi

    Did you find a solution for this? I am having the same issue.

    Thanks

    m00di, I have been meaning to take the time to test this and get back to David. I have been swamped with other project work. This is still very important to me. Please contribute to this thread with your own findings and maybe David will be able to offer a fix for this.


  • m00di
    replied
    Hi

    Did you find a solution for this? I am having the same issue.

    Thanks


  • JCF1976
    Guest replied
    Thanks! I'll download it and test it this evening!


  • David
    replied
    I have uploaded the set of working example files I made to our server. So instead of us trying to reproduce the problem with your files (which we don't have), you can try and provoke the problem by editing our files or work out what is different by comparing your files to our files.

    You can download the set of files here,
    http://www.wrensoft.com/test/french/accenttest.zip

    and see it working here,
    http://www.wrensoft.com/test/french/...uery=m%C3%AAme
    http://www.wrensoft.com/test/french/...oom_query=meme

    This set of index files was generated with UTF-8 selected in Zoom 5 on a Windows XP machine. I tested the search behaviour on Windows/PHP and Unix/PHP and it was the same.
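    As an aside, the `m%C3%AAme` in the first link is the word "même" percent-encoded from its UTF-8 bytes. The sketch below uses Python's standard library (not anything from Zoom) to show that the same word produces a different query string under ISO-8859-15, which is one reason the encoding of the search page and the index need to agree. The variable names are illustrative only.

```python
from urllib.parse import quote, unquote

word = "même"

# Percent-encode the query term from its UTF-8 bytes (urllib's default).
utf8_query = quote(word)

# The same term encoded as ISO-8859-15 yields a different byte sequence.
latin9_query = quote(word, encoding="iso-8859-15")

print(utf8_query)    # m%C3%AAme
print(latin9_query)  # m%EAme

# Decoding only round-trips when it uses the encoding the bytes came from.
print(unquote(utf8_query, encoding="utf-8"))
```

    If a search script expects one form and the browser sends the other, the accented letter will not match what was indexed.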


  • JCF1976
    Guest replied
    I'd rather not

    Originally posted by wrensoft
    Can you put the HTML pages in question on a public web site where we can see the files? Or put the entire search function on a public site and post the URL. E-mailing us your Zoom configuration file would also help us match your configuration.

    I know it is limiting, but I'd rather not (and there are a lot of people who feel the same way I do). So, let's continue to communicate through the forum. What other questions can you think of?


  • David
    replied
    Can you put the HTML pages in question on a public web site where we can see the files? Or put the entire search function on a public site and post the URL. E-mailing us your Zoom configuration file would also help us match your configuration.


  • JCF1976
    Guest replied
    UTF-8

    Okay, I switched everything over to UTF-8 again and recrawled the files that had been converted from ASCII text to French accents. Reposted everything with the changes. Now we're back to the old/original problem. The queries are not pulling up results when I do a search with accented characters.


  • JCF1976
    Guest replied
    Vouliez-vous dire: tenebres autant au la surface?

    I noticed that the suggested search isn't correct either:

    Vouliez-vous dire: tenebres autant au la surface?

    instead of:

    Vouliez-vous dire: tenebres etaient a la surface?


  • JCF1976
    Guest replied
    ISO-8859-15 setting

    By the way, it should go without saying that when I crawled the files locally with Zoom, I used the ISO-8859-15 setting.


  • JCF1976
    Guest replied
    ténèbres étaient à la surface

    Okay, I just changed all of the ASCII characters to actual accented French vowels. I also changed all of the encoding to ISO-8859-15 (I did that prior to changing all of the vowels). I reran the zoom crawler (locally). I uploaded the new files and ran a query with the following words:

    ténèbres étaient à la surface

    The search result page displayed:

    Résultats de la recherche pour : ta©našbres a©taient a la surface dans toutes les categories

    and in fact, the actual search field displays:

    ténÚbres étaient à la surface

    instead of:

    ténèbres étaient à la surface

    I had changed the encoding on the search template to ISO-8859-15 too. So, I am not sure what to make of this.
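    For anyone comparing notes, here is a minimal sketch of one way output like this can arise: the query is sent as UTF-8 bytes but read as ISO-8859-15. This is an assumption about the cause, not something established in this thread, and the code uses only Python's built-in codecs rather than anything from Zoom.

```python
query = "ténèbres étaient à la surface"

# Bytes as a UTF-8 page or form would send them.
raw = query.encode("utf-8")

# Reading those bytes as ISO-8859-15 turns each accented letter
# into two unrelated characters.
misread = raw.decode("iso-8859-15")
print(misread)

# Decoding with the matching encoding restores the original text.
restored = raw.decode("utf-8")
print(restored == query)
```

    The first word comes out as `tÃ©nÃšbres`, which, once lowercased and with the tilde dropped, resembles the `ta©našbres` shown in the result heading above.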


  • JCF1976
    Guest replied
    More testing

    I am going to do some more testing this weekend, including converting the ASCII characters to real French characters IN the code. (Don't worry! I'll do testing on a copy of the site.)


  • JCF1976
    Guest replied
    ISO-8859-15

    I found this information helpful (found at http://en.wikipedia.org/wiki/ISO_8859-1 ):

    ISO 8859-1 encodes what it refers to as "Latin alphabet no. 1," consisting of 191 characters from the Latin script. Each character is encoded as a single eight-bit code value. These code values can be used in almost any data interchange system to communicate in the following European languages (with a few exceptions due to missing characters, as noted):

    ...# French (missing Œ, œ and rare Ÿ)

    * Note that Windows-1252 and ISO-8859-15 do contain these

    ...Relationship to ISO/IEC 8859-15

    Although ISO/IEC 8859-1 has enough characters for most French text, it is missing a few less-common letters. It is also missing a single-glyph representation for the letter IJ, two Finnish letters used for transcription of some foreign names and in a few loanwords (Š and Ž), typographic quotation marks and dashes, and common symbols such as the euro sign (€) and dagger (†).

    In order to provide some of these characters, ISO/IEC 8859-15 was developed as an update of ISO/IEC 8859-1. This required, however, the removal of some infrequently-used characters from ISO/IEC 8859-1, including fraction symbols and letter-free diacritics: ¤, ¦, ¨, ´, ¸, ¼, ½, and ¾.
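    The differences described in that excerpt can be checked directly. The following is a small sketch using Python's built-in `iso-8859-1` and `iso-8859-15` codecs; it lists the byte values that decode to different characters in the two encodings and shows that œ can only be encoded in the second.

```python
# Find the byte values that decode differently in the two encodings.
differences = []
for value in range(0xA0, 0x100):
    byte = bytes([value])
    latin1 = byte.decode("iso-8859-1")
    latin9 = byte.decode("iso-8859-15")
    if latin1 != latin9:
        differences.append((hex(value), latin1, latin9))

for row in differences:
    print(row)

# œ exists in ISO-8859-15 but cannot be encoded in ISO-8859-1.
oe_bytes = "œ".encode("iso-8859-15")
print(oe_bytes)
try:
    "œ".encode("iso-8859-1")
except UnicodeEncodeError as exc:
    print("not in ISO-8859-1:", exc.reason)
```

    Only eight positions differ, and none of them is é, è, à or ê, so for the words discussed in this thread the two encodings produce identical bytes.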


  • David
    replied
    The test files were made using a text editor.

    Please explain more, exactly what you mean.
    Multi-byte is when more than 1 byte is required to represent a character in the alphabet. ASCII is always single byte. UTF-8 is a mix of single byte and multi-byte: ASCII characters require 1 byte, while accented and other non-ASCII characters require 2 or 3 or 4. In single-byte encodings such as ISO-8859-1 and ISO-8859-15, an accented character requires 1 byte.

    HTML character entities are special strings, defined in the WWW standards, that are used to represent special characters. Including accented characters in some character sets.

    It should not matter if you cut and paste or type in the accented characters, provided of course that you aren't forcing a Unicode to single byte conversion on a multibyte character. That should not be the case here, as the accented characters in question are represented by a single byte.
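    To make the byte counts and the entity idea concrete, here is a short sketch using only Python's standard library. The sample characters and the entity string are illustrative choices, not taken from anyone's site files.

```python
import html

# Byte length of the same characters under two encodings.
for ch in ["e", "é", "è", "€"]:
    sizes = {enc: len(ch.encode(enc)) for enc in ("iso-8859-15", "utf-8")}
    print(ch, sizes)

# HTML character entities are plain ASCII text that stands for a character.
entity_text = "t&eacute;n&egrave;bres"
decoded = html.unescape(entity_text)
print(decoded)
```

    In ISO-8859-15 every character listed is a single byte, while in UTF-8 only the plain "e" is; the entity form is pure ASCII and so looks the same in either encoding until it is decoded.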

    So we need more details & maybe copies of your HTML pages if we are going to reproduce the problem.
