Could not download file with accented character in file name


  • Could not download file with accented character in file name

    When the program spiders my site, it delivers an error message "could not download file" for every file which has an accented character in the filename.

    Is there any way of getting the program to index such files??? What rules does the program follow in regard to filenames?

  • #2
    Accented characters are not valid characters in a URL. They need to be encoded on the server.

    Valid characters in a URL are,
    a to z
    A to Z
    $ - _ @ . & +
    0 to 9

    See also the official spec for what a URL must look like,
    http://www.w3.org/Addressing/URL/url-spec.txt

    If you think the URL is being encoded correctly, then can you post the URL of the page that links to the document in question.
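A minimal sketch of the check described above, assuming the allowed set is exactly the characters listed in this post. The helper name `invalid_chars` and the file name `résumé.html` are made up for illustration; `quote` is the standard `urllib.parse` function.

```python
import string
from urllib.parse import quote

# Allowed set as listed in the post above (an assumption of this sketch;
# the current RFC defines the unreserved set slightly differently).
ALLOWED = set(string.ascii_letters + string.digits + "$-_@.&+")

def invalid_chars(name: str) -> list:
    """Return the characters in `name` that fall outside the allowed set."""
    return [ch for ch in name if ch not in ALLOWED]

# Hypothetical file name, for illustration only.
filename = "résumé.html"
print(invalid_chars(filename))  # the two accented characters
print(quote(filename))          # percent-encoded form, safe to put in a link
```

Any name for which `invalid_chars` returns a non-empty list would need encoding before it is used in a link.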


    • #3
      See for example http://drupal.org/node/1162252


      • #4
        From: http://en.wikipedia.org/wiki/Percent-encoding

        The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values.
        The key here being that there is the need for percent-encoding.
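The two steps in the quoted passage (convert to UTF-8 bytes, then percent-encode) can be sketched by hand as follows. This is only an illustration; the function name `percent_encode` is hypothetical, and in practice a library routine such as `urllib.parse.quote` would normally be used instead.

```python
# Unreserved characters pass through untranslated.
UNRESERVED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789-._~"
)

def percent_encode(text: str) -> str:
    """Percent-encode every character outside the unreserved set."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # Step 1: character -> UTF-8 bytes. Step 2: each byte -> %XX.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)

print(percent_encode("éditeur"))  # %C3%A9diteur
```

The single character é becomes two bytes in UTF-8, which is why it shows up as two `%XX` groups.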


        • #5
          This is a very vexing and complicated issue. To me, the key point is that the users should never have to deal with percentage encoding.

          At the present time, browsers differ in whether they send queries as UTF-8 or not. But it seems to me that data sent from a form on a utf-8 encoded page, in a site that is entirely utf-8 encoded, should be received as utf-8 by the search engine.
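To illustrate why the encoding used for a form query matters, here is a small sketch using Python's standard `urllib.parse`. The field name `q` and the search term are assumptions made up for the example; it does not show how any particular browser or the Zoom script behaves.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical form field and search term.
query = {"q": "éditeur"}

as_utf8 = urlencode(query, encoding="utf-8")
as_latin1 = urlencode(query, encoding="latin-1")
print(as_utf8)    # q=%C3%A9diteur
print(as_latin1)  # q=%E9diteur

# A receiver that assumes UTF-8 only recovers the first form correctly;
# the Latin-1 byte is replaced with U+FFFD.
print(parse_qs(as_utf8, encoding="utf-8"))
print(parse_qs(as_latin1, encoding="utf-8"))
```

The same word produces two different query strings, so the receiving script has to know (or be told) which encoding the sender used.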

          I notice that if I enter a url containing a file name encoded in utf-8 in Google or Bing, it returns the proper page.

          The 1994 standard is pretty old and modified by I think at least two generations of RFCs (the Internationalization stuff).

          Perhaps it would be possible to add an option in configuring the Zoom Indexer that specified the query was guaranteed to be in utf-8?


          • #6
            You are confusing the matter. You started off talking about URL links, but have now switched to talking about forms, which is a different subject.

            My understanding is that the current standard is RFC 3986 from 2005. (Not something from 1994).

            Some (maybe all) of the browsers are hiding what is going on.

            So for example if you look at this page,
            http://en.wiktionary.org/wiki/éditeur

            Then you find links on the page are % encoded, like this,
            <a href="/wiki/%C3%A9rudite#French" title="érudite">érudite</a>

            Note how % encoding is used in the link, but the user never sees it. This is how it should be done.

            The URL display bar in most new browsers decodes the % encoding to display the accent, effectively hiding what is being sent to the server.

            So even if you type in the accent in the URL bar, what actually gets sent to the server is,
            HTTP GET /wiki/%C3%A9diteur

            In short, you can't have raw UTF-8 in URLs. (Although some browsers allow you to enter UTF-8 URLs, they convert them to % encoding in the background.)
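A rough sketch of the round trip described above, using Python's standard `urllib.parse`. It mimics the idea only (encode before sending, decode for display) and is not how any specific browser is implemented.

```python
from urllib.parse import quote, unquote, urlsplit

# What the user types or sees in the address bar.
typed = "http://en.wiktionary.org/wiki/éditeur"

# What goes on the wire: the path is percent-encoded as UTF-8.
request_path = quote(urlsplit(typed).path)
request_line = f"GET {request_path} HTTP/1.1"
print(request_line)  # GET /wiki/%C3%A9diteur HTTP/1.1

# What the address bar shows: the decoded form.
displayed = unquote(request_path)
print(displayed)     # /wiki/éditeur
```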


            • #7
              So, if I understand correctly, you are saying that in order to use the Zoom Search Engine, the URLs in all the <a href=""> links in the HTML need to be encoded (assuming the link contains non-ASCII characters), even though, unencoded, these links work perfectly well in current browsers.

              The problem I have with this, aside from the work involved in encoding all these links, is that I can't read the encoded version myself.

              I will do a few experiments and see what happens.

              Thanks for your patience.


              • #8
                "assuming the link contains non-ASCII characters"
                The range of allowed characters is a sub-set of ASCII. Much of the ASCII table also needs encoding.

                "...is that I can't read the encoded version myself."
                Good HTML editors will encode for you. As will good CMS systems.

                "Even though, unencoded, these links work perfectly well in current browsers."
                I haven't tested it, but I am sure you are probably correct some of the time. However, I am sure it wouldn't work in all cases. Whether it works or not depends on the characters in the file name (e.g. Chinese, Japanese, French, etc.), the code page in use, the browser in use, and the web server which serves the pages.
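A small sketch of the code page dependency mentioned above, using Python's standard `urllib.parse`. The two encodings chosen (UTF-8 and Latin-1) are just examples; the point is only that the same character can reach the server as different bytes.

```python
from urllib.parse import quote, unquote

name = "é"

utf8_form = quote(name, encoding="utf-8")
latin1_form = quote(name, encoding="latin-1")
print(utf8_form)    # %C3%A9
print(latin1_form)  # %E9

# A server that assumes UTF-8 recovers only the first form; the second
# decodes to the replacement character U+FFFD.
print(unquote(utf8_form) == name)
print(unquote(latin1_form) == "\ufffd")

# It only round-trips if the server happens to assume the same code page.
print(unquote(latin1_form, encoding="latin-1") == name)
```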

                In the case of Zoom, it performs several steps to fix up URLs it finds. Technically this is known as canonicalization. And at least in some cases it fixes up accents with % encoding.
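For readers wondering what such a fix-up step might look like, here is a very rough sketch. It is NOT Zoom's actual algorithm, which is not described in this thread; the function name `canonicalize` and the `example.com` URLs are hypothetical, and the sketch only handles the path and query.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    """Percent-encode unsafe characters in the path and query of a URL.

    "%" is treated as safe so that escapes already present are not
    encoded a second time. (A stray literal "%" is therefore left alone,
    which a real implementation would need to handle.)
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

# Hypothetical links as they might appear in an href attribute.
print(canonicalize("http://example.com/docs/café.pdf"))
print(canonicalize("http://example.com/docs/caf%C3%A9.pdf"))  # unchanged
```

Leaving already-encoded links unchanged matters because a spider may meet both forms of the same link on one site.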

                But our general comment is that if it doesn't work, then make it comply with the standards. Then if it still doesn't work get back to us with an example showing the problem. Which you haven't done as yet.


                • #9
                  Haven't provided an example yet because I haven't been able to get the php memory_limit on the server set high enough to enable Zoom to run. As soon as (or rather, if) that is done I will provide an example.

                  The use of the name attribute in a <a> tag is, as you know, deprecated in html 4.01. I'm afraid the approach wikipedia is taking is liable to break in HTML5, i.e., at the moment, on any mobile device.

                  Being a php newbie, I had not realized the effect of its one byte=one character nature. Apparently only certain parts of it can deal with "international" strings. Coding for this must be very challenging.
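The byte versus character distinction can be shown in a few lines. The sketch below uses Python rather than PHP (an assumption made for consistency of the examples), treating the encoded `bytes` value as a stand-in for a byte-oriented string.

```python
word = "éditeur"
raw = word.encode("utf-8")  # the bytes a byte-oriented function would see

print(len(word))  # 7 characters
print(len(raw))   # 8 bytes, because é takes two bytes in UTF-8

# Cutting after the first *byte* splits é in half and leaves invalid UTF-8.
first_byte = raw[:1]
print(first_byte.decode("utf-8", errors="replace"))  # replacement character
```

Any function that counts or slices by byte will be off by one for each accented character like this.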


                  • #10
                    "When the program spiders my site, it delivers an error message 'could not download file' for every file which has an accented character in the filename."
                    The problem you described was an indexing problem. Shouldn't have anything to do with a PHP memory limit. Just the URL of the page that links to the accented files would be enough to verify the behavior of the indexer.

                    "Coding for this must be very challenging"
                    As we support ASP, PHP, JS, ASP.NET and CGI, getting all the character set issues sorted out across all platforms has been a bit of a nightmare. It still doesn't fully work in some cases, like case conversion in Cyrillic.
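As an illustration of why case conversion is awkward outside ASCII, here is a sketch in Python. It says nothing about how Zoom's own scripts are written; it only shows the general difference between byte-wise and Unicode-aware lowercasing.

```python
text = "ПРИВЕТ"
raw = text.encode("utf-8")

# Byte-wise lowercasing only touches ASCII A-Z, so the Cyrillic bytes
# come back unchanged.
print(raw.lower() == raw)

# Unicode-aware lowercasing handles the Cyrillic letters.
print(text.lower())  # привет
```

A case-insensitive search built on the byte-wise version would fail to match "привет" against "ПРИВЕТ".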

                    "The use of the name attribute in a <a> tag is, as you know, deprecated in html 4.01."
                    Not sure how the name attribute as used by wikipedia relates to the issue of accents in links. Not that I think they are using it in any case.
