
Not indexing PDF or Doc files in Spider mode


  • Not indexing PDF or Doc files in Spider mode

    Hi

    Have this problem:

    If I index my site in Offline mode I get all the PDF, doc files etc. indexed all right.

    But if I use Spider mode, it does not index any files except ASP files. There are no error messages in the log. It just doesn't index those file types.

    I have installed all plugins and it says that plugins are installed and ready.

    I have the Pro-edition of Zoom-search 4.2

    "Why don't you stick to offline mode then?", you may say - but then I cannot use my breadcrumb ASP script, because it uses <% pagetitle %> as the <Title>, and this will not show up as a title in the search results page. The ASP pages have to be "spidered" to get this info.

    What must/can I do? Any clues?

    /Morten
    Denmark

  • #2
    To start with, check these FAQ pages.

    Q. Why are some of my pages being skipped by the indexer?

    Q. I am indexing with spider mode but it is not finding all the pages on my web site

    ------
    David



    • #3
      Spider mode appends extra URL to pdf URLs

      We use absolute references to link to our PDFs. When I index the files, Zoom appends the base URL to the links to the PDFs, so the links don't work. Is there a way to stop it from appending the base URL, or is there a way I can edit the file to remove the extra URL?

      Thanks.

      elizabeth oshea



      • #4
        Elizabeth, are you using spider mode or offline mode? Spider mode I guess, from the post title, but in spider mode, there should be no appending of base URLs.

        What are the settings you have for the start point and base URL?
        Can you give us a URL to a page that links to your PDFs so that we can have a look at it and try it out (or a sample of your HTML code).

        -----
        David



        • #5
          Originally posted by Wrensoft
          Elizabeth, are you using spider mode or offline mode? Spider mode I guess, from the post title, but in spider mode, there should be no appending of base URLs.
          Oh! But there is a base URL box on the Spider mode tabbed page. I can't change the information in it.

          Originally posted by Wrensoft
          What are the settings you have for the start point and base URL?
          As a start point I have used both www.virtualaccess.com and www.virtualaccess.com/index.htm. The base URL is www.virtualaccess.com.

          When I spider the website, it doesn't pick up all the pdfs, even though all the pdfs have at least one link from an html page. When I use offline mode, it finds all the pdfs.

          Here is a link to a page that links to pdfs that spider mode doesn't spider:

          http://www.virtualaccess.com/MsAppliances.htm

          Here is a link to a page that links to pdfs that spider mode does spider:

          http://www.virtualaccess.com/SupportDocsHowToGuides.htm

          Thanks for your help.

          elizabeth



          • #6
            Originally posted by Elizabeth OShea
            Oh! But there is a base URL box on the Spider mode tabbed page. I can't change the information in it.
            The base URL in spider mode is not appended to the link. It is used to determine when a link points to an "external site", so it is unlikely to be the cause of your problem. As an FYI, the base URL can be changed (but should not need to be, in your case) by clicking on the "More" button and selecting "Edit".
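
For illustration only, the "external site" test described above can be sketched in Python roughly as follows. This is not Zoom's actual code: the function name `is_external` and the simple prefix comparison are assumptions made for the example, and the base URL is written with an explicit "http://" so that it can be compared against resolved links.

```python
from urllib.parse import urljoin

def is_external(href, page_url, base_url):
    """Hypothetical check: a link counts as external when, once resolved
    against the page it appears on, it does not start with the base URL."""
    absolute = urljoin(page_url, href)
    return not absolute.startswith(base_url)

# Assumed values, modelled on the settings mentioned in this thread.
base_url = "http://www.virtualaccess.com"
page_url = "http://www.virtualaccess.com/SupportDocsHowToGuides.htm"

print(is_external("pdf/guide.pdf", page_url, base_url))               # relative link on the same site
print(is_external("http://example.com/guide.pdf", page_url, base_url))  # link to another host
```

In this sketch the base URL only decides whether a link is followed; it is never added to the link itself.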

            Here is a link to a page that links to pdfs that spider mode doesn't spider:

            http://www.virtualaccess.com/MsAppliances.htm
            The PDF links on that page are broken. Try opening that page in a browser yourself and clicking on the "GW4000 datasheet" or "GW5000 datasheet".

            These links are broken because you have left out the "http://" part of the absolute URL. This causes the browser (and Zoom) to treat them as relative URLs, so you have actually mistakenly linked to "http://www.virtualaccess.com/www.virtualaccess.com/pdf/DSHT_GW4000.pdf" etc. Of course, these URLs don't exist, which is why Zoom is unable to find them. If you fix the links and re-index, Zoom should be able to find them just fine.
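
The relative-URL behaviour described above can be reproduced with Python's standard `urllib.parse.urljoin`, which follows the same resolution rules a browser uses. This is only a sketch to show the effect: the helper `looks_schemeless` is a hypothetical heuristic for spotting such links and is not part of Zoom.

```python
from urllib.parse import urljoin

page_url = "http://www.virtualaccess.com/MsAppliances.htm"

# The href as written on the page (no "http://") and the corrected form.
broken_href = "www.virtualaccess.com/pdf/DSHT_GW4000.pdf"
fixed_href = "http://www.virtualaccess.com/pdf/DSHT_GW4000.pdf"

def looks_schemeless(href):
    """Hypothetical heuristic: an href that starts with "www." was most
    likely meant to be an absolute URL but is missing its scheme."""
    return href.lower().startswith("www.")

# Without a scheme, the href is resolved relative to the page's directory.
print(urljoin(page_url, broken_href))
# With the scheme, the href is already absolute and is left unchanged.
print(urljoin(page_url, fixed_href))

print(looks_schemeless(broken_href), looks_schemeless(fixed_href))
```

Running a check like `looks_schemeless` over every href in the site's HTML is one way to find any other links with the same mistake before re-indexing.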
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine



            • #7
              Ah. And I thought I had checked the HTML. Well, thank you for your help and patience!
