Double Pages in Index File

  • Double Pages in Index File

    Hi,

    I've checked 'Use CRC to skip files with identical content', but I'm getting the pages twice. Any idea why?

    It's pretty messy right now, but you can see the search at www.minusforty.ca (search on the top bar); try searching minus forty.

    P.S. - Sorry about that last double submission. Sticky fingers; it's humid in Canada today.

    Thanks again,
    Michele J. Jones, PMP
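
For readers wondering what a CRC-based "identical content" check can and cannot catch, here is a minimal sketch. It is not Zoom's actual implementation; the function name `skip_identical`, the page paths, and the page bodies are all made up for illustration. It only shows the general idea that a checksum match requires byte-for-byte identical content.

```python
import zlib

def skip_identical(pages):
    """Keep the first URL for each distinct CRC-32 of the page bytes.

    `pages` is a list of (url, content_bytes) tuples. Hypothetical helper,
    not part of any real indexer.
    """
    seen = {}      # CRC-32 value -> first URL seen with that content
    kept = []      # URLs that would be indexed
    skipped = []   # (skipped URL, URL it duplicates)
    for url, content in pages:
        crc = zlib.crc32(content)
        if crc in seen:
            skipped.append((url, seen[crc]))
        else:
            seen[crc] = url
            kept.append(url)
    return kept, skipped

# Hypothetical pages: the first two are byte-for-byte identical,
# the third differs from them by a single character.
pages = [
    ("/index.htm", b"<html><body>Home</body></html>"),
    ("/fs/index.htm", b"<html><body>Home</body></html>"),
    ("/fs/about.htm", b"<html><body>home</body></html>"),
]
kept, skipped = skip_identical(pages)
print(kept)
print(skipped)
```

Two pages that look the same in search results (same title, same description) but differ anywhere in their bytes would not be treated as duplicates by a check like this.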

  • #2
    I had a look at the site but didn't see any duplication in the results. Maybe you already fixed the problem?


    • #3
      You don't see the duplicates because I manually take the pages out every time I reindex. It's just a pain in the neck doing this every time.

      I'm thinking it might be related to the fact that there are two index.htm pages. The one in the root has only one link right now, and it points to the one in the fs directory, which all other pages stem from. Shortly, there will be a total of three index.htm files in separate directories under the root like this.

      If you need to see them, I can put the duplicates back for a quick minute, but the site is live.

      Thanks again,
      Michele J. Jones, PMP


      • #4
        Yes, you can upload the set of files with the problem to a temporary sub-folder, so that we can see the problem.


        • #5
          I've got it figured out for now. It was caused by the iframes.

          I'm able to put the iframe page names in the skip options for now, but I'll have to figure out how to deal with them later. A few would be nice to have in the search, but not a biggie right now. Most iframes are just images which can't be indexed anyway.

          If you want to look at it, all the directories under fs/home/ have pages called producthome.htm and the iframes are product.htm. Instead of indexing product.htm, it's indexing producthome.htm twice.

          One little thing. Do you know how to prevent the TM symbol after Minus Forty from turning into an ! ? If you search for 'about', you'll see a document called C3-2360-ExternalJobOpportunitiesWithMinusForty.doc in the home directory (not fs/home).

          Thanks again for your excellent support as always,

          Michele
          Michele J. Jones, PMP


          • #6
            Just to clarify the problem with duplicate pages... when Zoom (in Spider Mode only) indexes a page within an IFRAME, it will index the content within the IFRAME, but it will point the link to the page containing the IFRAME.

            So in your case, you have a page such as this one:
            http://www.minusforty.ca/fs/home/bottle/producthome.htm

            which contains an IFRAME that loads this page:
            http://www.minusforty.ca/fs/home/bottle/product.htm

            When Zoom indexes the latter page by following the IFRAME link on the first page, it indexes the content of the second page, but the search result for it links to the first one (if that makes sense). The reasoning is that most of the time, when a site has an IFRAME, the site owner does not want the end user to be given a link to the page within the IFRAME in the search results. They would usually want the end user to be pointed to the page containing the IFRAME that loads it.

            Because your second page had very little content, and the same meta description as the page containing the IFRAME, it was not obvious what was happening, and it did indeed look like the same page was indexed twice. I thought so too at first, and thought it was a bug, but it turns out this is not the case.

            Since the pages within your IFRAMEs do not really contain any indexable content (just images), you could do what you said, and simply add those pages to the skip list.
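
The behaviour described above can be modelled with a toy spider. This is only a sketch of the idea, not how Zoom is implemented: the `index_site` function, the `pages` dictionary layout, the placeholder page text, and the way the skip list matches on full URLs are all assumptions made for illustration. The two URLs are the ones from this thread.

```python
def index_site(pages, start_url, skip=()):
    """Toy spider. `pages` maps URL -> {"text": str, "iframes": [url, ...]}.

    Returns a list of (result_link, indexed_text) tuples.
    """
    results = []
    queue = [(start_url, start_url)]   # (URL to fetch, link to show in results)
    visited = set()
    while queue:
        url, link = queue.pop(0)
        if url in visited or url in skip:
            continue
        visited.add(url)
        page = pages[url]
        results.append((link, page["text"]))
        for framed in page["iframes"]:
            # The framed page's content gets indexed, but its search
            # result links to the page that contains the IFRAME.
            queue.append((framed, link))
    return results

PARENT = "http://www.minusforty.ca/fs/home/bottle/producthome.htm"
FRAMED = "http://www.minusforty.ca/fs/home/bottle/product.htm"

# Placeholder text; the real page content is not reproduced here.
pages = {
    PARENT: {"text": "product home page", "iframes": [FRAMED]},
    FRAMED: {"text": "framed product images", "iframes": []},
}

without_skip = index_site(pages, PARENT)
with_skip = index_site(pages, PARENT, skip=[FRAMED])
print([link for link, _ in without_skip])
print([link for link, _ in with_skip])
```

Without the skip entry, both results carry the producthome.htm link, which is why it looks like one page indexed twice. With the framed page skipped, only one result remains.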

            The trademark symbol issue may need further investigation. The problem is likely that the trademark symbol is not being processed as UTF-8 while Zoom is set to expect UTF-8.
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine
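
To illustrate the kind of encoding mismatch suggested for the trademark symbol, here is a small sketch. It assumes, purely for illustration, that the symbol arrives as a Windows-1252 byte; the thread does not confirm which encoding the document actually uses, and this does not reproduce Zoom's own handling or the exact "!" that shows up in the results.

```python
TM = "\u2122"  # the trademark sign

cp1252_bytes = TM.encode("cp1252")   # a single byte in Windows-1252
utf8_bytes = TM.encode("utf-8")      # three bytes in UTF-8

# A lone 0x99 byte is not valid UTF-8, so a strict decoder rejects it.
try:
    cp1252_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder substitutes a replacement character instead.
lenient = cp1252_bytes.decode("utf-8", errors="replace")

# The reverse mismatch (UTF-8 bytes read as Windows-1252) gives mojibake.
mojibake = utf8_bytes.decode("cp1252")

print(cp1252_bytes, utf8_bytes, strict_ok)
print(repr(lenient), repr(mojibake))
```

Either way, the symbol only survives when the bytes are decoded with the same encoding they were written in, so checking the source document's encoding against the indexer's setting is a reasonable first step.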
