Indexer finding tens of thousands of files


  • Indexer finding tens of thousands of files

    The indexer is finding files from my database multiple times
    See here
    http://www.e-yaji.com/search/search.asp?zoom_query=jh206&zoom_cat[]=-1
    for an example (it will take a while to load due to the thousands of files it has to search)
    Each has a slightly different url but points to the same file when clicked. (332 results all of the same page in this case)
    It makes the index unusable for me.
    Is there anything I can do to make it find each of these just once?

    I tried checking the box to skip identical files by CRC checking and it made no difference.

    I didn't write the php script that my site is running by the way

  • #2
    If you look at the URLs, they are all different:

    http://www.e-yaji.com/auction/photo.php?photo=165&u=9,3
    http://www.e-yaji.com/auction/photo.php?photo=165&u=*.*

    And so on.

    How is your "Base URL" set?
    Have you tried offline indexing?
    It looks like a variable is creating the *.* in the URL.

    Try offline mode
    and index the directory containing the data, not the template PHP page.

    and set the base url to
    http://www.e-yaji.com/auction/photo.php?photo=


    As for the PHP script: can you edit it?

    On another note, the site seems much slower than it should be.

    Are you using server-side includes with PHP?

    Code that looks like this:

    include("http://www.e-yaji.com/myfile.php");

    If so, do not write your include that way if the files are on the same server. Do this instead:

    include("/myfolder/mypath/myfile.php");

    Otherwise you can tie up the server and slow it down.


    Also,
    http://www.e-yaji.com/menu/JS/stmenu.js

    should be written as

    /menu/JS/stmenu.js



    / = root directory of server
    ./ = current folder
    Last edited by z00m user; Apr-29-2010, 10:13 AM.



    • #3
      These URLs all appear to be real URLs generated by your site. So it is normal that Zoom locates all the pages on your site.

      The CRC function won't work as not only are the URLs different, but the page content is also significantly different.

      As an example this URL displays the product with a large picture.
      http://www.e-yaji.com/snuff2/photo.php?photo=820&u=12,135
      and this URL displays the product with a small picture.
      http://www.e-yaji.com/snuff2/photo.php?photo=820&u=1,46
      so both the page content and page URLs are unique in each case.

      The final "&u=" parameter in the URL seems to vary depending on the listing position in your photo gallery. As there are many ways to browse the gallery on your site, there will be hundreds of URLs per product. I think the u=1,46 parameter in the URL allows the Next and Previous buttons to function on your pages.
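
      To see how many distinct products sit behind those hundreds of URLs, a minimal sketch (not a Zoom feature) is to drop the "u" parameter and count what remains. The helper name canonical_url is made up for illustration, and since "u" also changes what the page displays, this only counts products; it does not prove the pages are identical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url, drop=("u",)):
    """Return the URL with the listed query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in drop
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# URLs quoted in this thread; all three point at the same product.
urls = [
    "http://www.e-yaji.com/snuff2/photo.php?photo=820&u=12,135",
    "http://www.e-yaji.com/snuff2/photo.php?photo=820&u=1,46",
    "http://www.e-yaji.com/snuff2/photo.php?photo=820&u=12,142",
]

# A set collapses the variants that differ only in "u".
unique = sorted({canonical_url(u) for u in urls})
print(len(urls), "URLs ->", len(unique), "product page(s)")
```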

      Your site is script generated, so I don't think offline mode will help you at all.

      I also don't think setting the base URL as 'zOOm User' suggested will resolve the issue.

      I would normally suggest using the page and folder skip list function in Zoom to remove the unwanted URLs. But in your case it is hard to define which URLs are wanted and which are unwanted.

      So I think the best idea might be to see if you can create a list of good URLs on your site (probably fewer than 200 URLs in total). Your scripting package might already do this for SEO reasons.

      Then once you have this list of URLs in a HTML file you can start Zoom on this 'sitemap' page and tell Zoom to only follow 1 level of links from this sitemap page. This will limit indexing to just the pages you want.
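
      As a rough sketch of such a 'sitemap' page, the snippet below writes one link per product into a single HTML page. The function name build_sitemap and the list of photo IDs are hypothetical; on the real site the IDs would come from the database. It also assumes photo.php returns a usable page when the "u" parameter is left out, which has not been checked here.

```python
from html import escape


def build_sitemap(photo_ids, base="http://www.e-yaji.com/snuff2/photo.php"):
    """Build a plain HTML page with exactly one link per product."""
    items = []
    for pid in photo_ids:
        # Assumption: the page can be requested without the "u" parameter.
        href = escape("%s?photo=%d" % (base, pid), quote=True)
        items.append('  <li><a href="%s">Photo %d</a></li>' % (href, pid))
    return "<html><body>\n<ul>\n" + "\n".join(items) + "\n</ul>\n</body></html>\n"


# Hypothetical IDs; in practice these would be read from the product table.
page = build_sitemap([819, 820, 826])
print(page)
```

      Zoom would then be started on this page and limited to following one level of links, as described above.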



      • #4
        Thanks for the replies. I didn't get email notification of them or I'd have been back sooner.
        The pages are dynamically generated so I can use offline mode.
        I realise that the urls are different but whilst some of them have different content many of them have exactly the same content (the first 4 on that search for instance, I didn't go any further but those 4 are exactly the same page despite having different urls)

        I'm not sure I fully understand the last part of your post, but I will give it a go and see what I can come up with. The rate of new pages on the site might make it a PITA though.



        • #5
          The pages are dynamically generated so I can use offline mode
          I think you mean "cannot".

          If your site is dynamic then it shouldn't be too hard to have an auto-generated sitemap page. In your case this is probably a single page of simple links: one link per product, with all product links on a single page.

          There might be other solutions, but they are probably more complex.



          • #6
            Originally posted by malum
            I realise that the urls are different but whilst some of them have different content many of them have exactly the same content (the first 4 on that search for instance, I didn't go any further but those 4 are exactly the same page despite having different urls)
            I had a closer look at the pages in question, and this is not true.

            When I request this URL:
            http://www.e-yaji.com/snuff2/photo.php?photo=820&u=12,135

            The "PREVIOUS", "NEXT" and various other links on the page that is returned look like this (note the number after "u=12," in each link):

            Code:
            <div class="ee_css_previous_photo_link ee_previous_photo_link">
               <a href="photo.php?photo=1486&amp;u=12,134" title="Previous">PREVIOUS</a>  </div>

            <div class="ee_filler">&nbsp;&nbsp;</div>

            <div class="ee_css_next_photo_link ee_next_photo_link">
               <a href="photo.php?photo=819&amp;u=12,136" title="Next">NEXT</a>  </div>

            <div class="ee_filler">&nbsp;&nbsp;</div>

            <div class="ee_css_photo_browser_link ee_photo_browser_link">
               <a href="list.php?exhibition=3&amp;u=12,135" title="Back to Thumbnails">SEARCH/INDEX</a>

            </div>
            When I then request the page with this URL:
            http://www.e-yaji.com/snuff2/photo.php?photo=820&u=12,142

            The same links look like this:
            Code:
            <div class="ee_css_previous_photo_link ee_previous_photo_link">
               <a href="photo.php?photo=819&amp;u=12,141" title="Previous">PREVIOUS</a>  </div>

            <div class="ee_filler">&nbsp;&nbsp;</div>

            <div class="ee_css_next_photo_link ee_next_photo_link">
               <a href="photo.php?photo=826&amp;u=12,143" title="Next">NEXT</a>  </div>

            <div class="ee_filler">&nbsp;&nbsp;</div>

            <div class="ee_css_photo_browser_link ee_photo_browser_link">
               <a href="list.php?exhibition=3&amp;u=12,142" title="Back to Thumbnails">SEARCH/INDEX</a>

            </div>
            It is evident that your "photo.php" script changes the links it generates depending on the "u=xx,xxx" parameter that is passed to it in the URL.

            Because of this, the different URLs you are seeing here each create a distinct page. There is no reason for Zoom to assume that these pages are identical when they contain different links.

            What you can do to address this, is to wrap the offending links (or parts of the page which contain such changing links) with <!--ZOOMSTOP--> and <!--ZOOMRESTART-->. This will exclude the section from being indexed, as well as excluding it from the CRC duplicate page detection method.

            More information on using the ZOOMSTOP tags can be found in section 7.5 of the Users Guide.

            Note that you need to exclude ALL sections which contain links that can change dynamically. This is more than just the bits I highlighted above. I noted that there's also an image tag at the bottom of those pages which also changes.

            It benefits in other ways as well since you really don't want to index words like "NEXT", "PREVIOUS" and "SEARCH", which appear on most pages. Ideally, you would skip everything but the textual content that is unique (and important) to the page.
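
            To illustrate why excluding the changing links matters for duplicate detection, here is a small sketch. It is not Zoom's actual implementation: it simply strips everything between the two tags and compares a CRC32 of what is left. The functions content_crc and page, and the sample markup, are made up for the example.

```python
import re
import zlib

# Anything between the two tags is treated as excluded from the checksum.
EXCLUDED = re.compile(r"<!--ZOOMSTOP-->.*?<!--ZOOMRESTART-->", re.DOTALL)


def content_crc(html):
    """CRC32 of the page after removing the excluded sections."""
    visible = EXCLUDED.sub("", html)
    return zlib.crc32(visible.encode("utf-8"))


def page(prev_u, next_u, wrap):
    """Build a toy product page whose navigation links depend on 'u'."""
    nav = (
        '<a href="photo.php?photo=1486&amp;u=12,%d">PREVIOUS</a>'
        '<a href="photo.php?photo=819&amp;u=12,%d">NEXT</a>' % (prev_u, next_u)
    )
    if wrap:
        nav = "<!--ZOOMSTOP-->" + nav + "<!--ZOOMRESTART-->"
    return "<h1>Item 820</h1>" + nav + "<p>Description of the item.</p>"


# Without the tags the two variants differ; with them they compare equal.
print(content_crc(page(134, 136, False)) == content_crc(page(141, 143, False)))
print(content_crc(page(134, 136, True)) == content_crc(page(141, 143, True)))
```

            The same idea applies to any other part of the page that changes with the URL, such as the image tag mentioned above.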
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine



            • #7
              Your search page is also unusually slow. If you have modified "search.asp" or "settings.asp" or any of the index files, then this may be the cause of the problem. Or it may be because you have turned on "Substring match for all searches" and this is causing a huge number of unnecessary matches. I noticed searching for "test" matches "greatest" and "attested", which only happens when substring match is enabled and is generally not recommended for English (this option can be found under "Configure"->"Languages").
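
              The difference between the two matching modes can be sketched in a few lines. This is only an illustration of the idea, not how Zoom matches internally; the function names and the word list are made up.

```python
def whole_word_match(query, words):
    """Return only the words that equal the query (case-insensitive)."""
    return [w for w in words if w.lower() == query.lower()]


def substring_match(query, words):
    """Return every word that contains the query anywhere inside it."""
    return [w for w in words if query.lower() in w.lower()]


# Hypothetical indexed words, including the two examples mentioned above.
words = ["test", "greatest", "attested", "snuff"]

print(whole_word_match("test", words))
print(substring_match("test", words))
```

              Every extra substring hit is another result to score and display, which is one way a large index can become slow to search.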
              --Ray



              • #8
                Thanks guys
                Yes, I meant that I cannot run in offline mode as the site is dynamic.

                The site is bilingual English/Chinese so I left the "substring match for all searches" on.
                Presumably the search will be considerably faster when it is searching 1,500 pages rather than 65,000, even with this on.

                The dynamically created pages are only part of the site BTW. I can create a sitemap page for the dynamic part but then I need to incorporate the non dynamic part into the scan as well.

                What you can do to address this, is to wrap the offending links (or parts of the page which contain such changing links) with <!--ZOOMSTOP--> and <!--ZOOMRESTART-->. This will exclude the section from being indexed, as well as excluding it from the CRC duplicate page detection method.
                Can I do this for dynamic pages?

                Edit

                All right, I'm getting there: using the <!--ZOOMSTOP--><!--ZOOMSTOPFOLLOW--> tags is working.
                Finding where to put them in the php files that control the database was a freaking nightmare.
                I'm having trouble with the Chinese now, but the number of indexed pages has dropped from hitting the 65,000 limit to less than 2000.

                I may start a new thread about the Chinese if I can't get it working. Any further queries on the limiting of the index I'll put in here.

                Thanks for the help so far!
                Last edited by malum; May-05-2010, 02:18 PM. Reason: Update

