Is it possible to run a script before uploading?

  • Is it possible to run a script before uploading?

    I'm running the Pro edition of v5 and I love it. One problem I am running into is duplicate pages showing up in the results. I read through the forums and realized that, since I have ads displaying, the CRC check doesn't help. I've tried the other suggestions and still no joy.

    What I do now is manually remove the dupes and re-upload the files. Since I want to rely on the scheduled tasks, I was wondering if there was a way to filter the results after a scan and have the filtered results uploaded?

    Any help would be greatly appreciated.

  • #2
    It is better to filter results during the indexing process, rather than trying to do it after.

    Use the skip list in the Zoom configuration window to do this.

    We are also looking at more flexible CRC options for V6 of the software, that will enable similar pages to be filtered. (The current software only filters identical pages)
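To illustrate why an exact checksum cannot catch pages that differ only in a rotating ad, here is a minimal Python sketch. It is not Zoom's CRC implementation; the `<!--AD_START-->` / `<!--AD_END-->` markers and both function names are hypothetical, chosen only for this example.

```python
import re
import zlib

# Hypothetical markers around the ad block; a real site would use its own.
AD_BLOCK = re.compile(r"<!--AD_START-->.*?<!--AD_END-->", re.DOTALL)


def page_crc(html: str) -> int:
    """CRC32 over the whole page, ads included."""
    return zlib.crc32(html.encode("utf-8"))


def content_crc(html: str) -> int:
    """CRC32 over the page with the marked ad block removed."""
    return page_crc(AD_BLOCK.sub("", html))


# Two fetches of the same page that differ only in which ad was served.
page_a = "<h1>Top stories</h1><!--AD_START-->Ad 17<!--AD_END--><p>Same article text</p>"
page_b = "<h1>Top stories</h1><!--AD_START-->Ad 42<!--AD_END--><p>Same article text</p>"

print(page_crc(page_a) == page_crc(page_b))        # whole-page checksums differ
print(content_crc(page_a) == content_crc(page_b))  # checksums match once ads are stripped
```

An identical-page check behaves like `page_crc`, which is why pages carrying different ads are never treated as duplicates.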


    • #3
      I am currently filtering during the process as well, but I run into a problem. I display a list of links on index.php that are brought in by an include file. This list contains 100 links, and then "Next" is displayed to get to the rest of the links (paging). The spider follows the "Next" link and indexes as required.

      The problem is that it dupes the links, since all of the pages have this list of links. I filter out the "Back" link to keep the spider from indexing the links three times, but I have not been able to avoid the dupes.

      I'm using PHP and the url to my site is: http://www.siliconjournal.com/

      This link is the one causing the problem:

      /index.php?pageNum_rsFeedsList=1&totalRows_rsFeedsList=140

      If I filter it during the scan, the spider won't index the links after the initial 100. I've kept the spider from creating three copies in the results by placing "pageNum_rsFeedsList=0" in the config so it doesn't scan backwards as well.

      I hope this all makes sense.


      • #4
        You can insert the <!--ZOOMSTOPFOLLOW--> and <!--ZOOMRESTARTFOLLOW--> tags around your "Back" links, or any other links that you don't want the spider to follow.

        I think that should solve your problem. See the Users Guide for more details.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine
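As a rough sketch of the suggestion above, the pager could be generated so that only the "Previous" link sits between the two tags while "Next" stays followable. The site in question is PHP; this Python version only shows the shape of the output. `render_pager` and its parameters are hypothetical, and the URL pattern is copied from the link quoted earlier in the thread.

```python
STOP = "<!--ZOOMSTOPFOLLOW-->"
RESTART = "<!--ZOOMRESTARTFOLLOW-->"


def render_pager(page: int, last_page: int, total_rows: int) -> str:
    """Build pager HTML; only the backward link is wrapped in the no-follow tags."""

    def url(n: int) -> str:
        return f"/index.php?pageNum_rsFeedsList={n}&amp;totalRows_rsFeedsList={total_rows}"

    parts = []
    if page > 0:
        # Backward link: wrapped so the spider does not crawl back over earlier pages.
        parts.append(f'{STOP}<a href="{url(page - 1)}">&lt;&lt; Previous</a>{RESTART}')
    if page < last_page:
        # Forward link: left outside the tags so the remaining links are still reached.
        parts.append(f'<a href="{url(page + 1)}">Next &gt;&gt;</a>')
    return " ".join(parts)


print(render_pager(page=1, last_page=2, total_rows=240))
```

On the first page no tags are emitted at all, and on the last page only the wrapped "Previous" link appears.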


        • #5
          Wouldn't that keep the spider from following the link? I need it to follow the link to find the other 40 links.

          For instance, a search for kosovo on my site would return the following duplicate results:

          index.php?pageNum_rsFeedsList=1&totalRows_rsFeedsList=140&feed=http://rss.cnn.com/rss/cnn_topstories.rss

          and

          index.php?feed=http://rss.cnn.com/rss/cnn_topstories.rss

          Both are valid, but there is no point in showing both. It also doubles the size of the index unnecessarily.

          I'll try your suggestion and report back.
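The two URLs above differ only in their paging parameters, so one way to spot such pairs is to compare URLs after dropping those parameters. The sketch below is a hypothetical post-processing step written in Python, not a Zoom feature, and it assumes nothing about Zoom's index file format: it works on a plain list of URLs. The names `canonical_key` and `drop_duplicates` are made up for this example.

```python
from urllib.parse import parse_qsl, urlsplit

# Query parameters that only control paging and do not change the page content.
PAGING_PARAMS = {"pageNum_rsFeedsList", "totalRows_rsFeedsList"}


def canonical_key(url: str) -> tuple:
    """Return (path, sorted remaining query pairs) with paging parameters removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in PAGING_PARAMS
    ]
    return (parts.path, tuple(sorted(kept)))


def drop_duplicates(urls):
    """Keep the first URL seen for each canonical key, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


results = [
    "index.php?feed=http://rss.cnn.com/rss/cnn_topstories.rss",
    "index.php?pageNum_rsFeedsList=1&totalRows_rsFeedsList=140"
    "&feed=http://rss.cnn.com/rss/cnn_topstories.rss",
]
print(drop_duplicates(results))
```

Here both entries reduce to the same key, so only the first URL is kept.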


          • #6
            As I suspected, the spider doesn't index the remaining 40 links since it doesn't page the results.

            I guess there's no other way around the duplicate entries?


            • #7
              I found a way around the problem. I created a page on the site that lists all of the links and set the spider to start at that page. I then wrapped the following tags around the links that the spider was duplicating in the results.

              <!--ZOOMSTOP-->
              <!--ZOOMSTOPFOLLOW-->
              <!--ZOOMRESTARTFOLLOW-->
              <!--ZOOMRESTART-->

              This keeps the spider from looping and the index is now much smaller and doesn't require any maintenance.

              This is truly a great product and I thank you for the fast support. I look forward to the next version with the beefed up CRC check.


              • #8
                I was talking about only stopping the "Previous" or "Back" links from being followed, but not stopping the "Next" link from being crawled. The idea then was that the spider would go forward through the pagination but not backwards, which I thought might solve your problem from your description.

                But I don't really understand what's happening on your site. I didn't see a situation where the feed= links had the pageNum_rsFeedsList= parameters, nor do I really see why that would be necessary. If at all possible, you should not have multiple URLs that go to the same page content. It's generally search engine unfriendly and it would likely get your page ranked down by Google.

                I tried to index your site just then, but it seems that you've put ZOOMSTOPFOLLOW tags around the whole links block. That's not what I was talking about at all... I said just to put it around the "Back" (or rather, "<< Previous") text link.
                --Ray


                • #9
                  OK, I must have caught your site in mid-change. The above post was written while your previous post wasn't posted yet, so you can disregard the above. Either way, so long as you've achieved what you're after, that's good news.
                  --Ray
