Automated indexing

  • Automated indexing

    Looks like a great product (which we've already purchased)...now it's been given to me to 'make it happen'.

    I have a database of URLs that I'd like to index. As it is now, I see two options: 1) create a dynamic page that pulls the list of URLs and output to the screen (so Zoom Search can index that page) or 2) somehow get Zoom to pull the URLs from the database directly.

    1) If Zoom indexes the page that has the list of 2000 URLs, will any reference point to the list? I mean, I don't want someone searching for a word that happens to be someone's URL too, and have it display the list of URLs.
    2) Can Zoom console build interface with a database?

    Thanks!

  • #2
    Does the list of URLs vary much over time? If the list doesn't vary much, export the list to a text file, then import it into Zoom as a list of start points. (From the "More" button in the main Zoom window).

    You could then maintain the URL list from Zoom.
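If the URLs already live in a database, the export step can be scripted. The following is only a rough sketch: it assumes a SQLite database with a hypothetical `sites` table and `url` column, and it assumes a plain text file with one URL per line is acceptable for the start point import (check the Zoom documentation for the exact format it expects).

```python
import sqlite3
from pathlib import Path


def export_start_points(db_path: str, out_path: str) -> int:
    """Write every distinct URL from the database to a text file, one per line.

    The `sites` table and `url` column are hypothetical names used for
    illustration; substitute whatever your own schema uses.
    Returns the number of URLs written.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT DISTINCT url FROM sites ORDER BY url"
        ).fetchall()
    finally:
        conn.close()

    # Drop NULL or blank entries and trim stray whitespace.
    urls = [row[0].strip() for row in rows if row[0] and row[0].strip()]
    Path(out_path).write_text("\n".join(urls) + "\n", encoding="utf-8")
    return len(urls)
```

Calling `export_start_points("urls.db", "start_points.txt")` before each indexing run would keep the exported list in step with the database.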

    Are the URLs all from the same web site? Or do the 2000 URLs point to 2000 different home pages of different web sites? Do you want to follow outgoing links that appear on the 2000 pages, or just index the single page pointed to by each link?

    Indexing 2000 full web sites could be a big job. The total number of pages might be in excess of 200,000 pages!

    ----
    David



    • #3
      The list could change on a daily basis...I'm setting up a cron job to reindex nightly.

      The URLs are totally different sites (2000 URLS point to 2000 different home pages of different web sites). I want to follow the links on the list of links, but not externally from those links.

      And yes, it's possible that the total number of pages might be > 200,000. Is Zoom Search able to support such a large request?



      • #4
        "I'm setting up a cron job to reindex nightly"
        Strictly speaking this doesn't make sense, but I think I know what you mean. The Zoom indexer is a Windows application and cron is a Unix tool, so the two can't be used together directly.

        Zoom has a built-in scheduling function that uses the Windows scheduler (the Windows equivalent of cron).

        If the list of web sites to index changes every day there are a couple of options.
        1) You maintain the URL list in Zoom by hand. You just add & remove URLs as required in Zoom.
        2) You write a script or some small tool to create a Zoom configuration file for you (this is a Unicode text file). A Zoom configuration file contains a URL list. Zoom will read the configuration file each time it is run as a scheduled job.

        "the total number of pages might be > 200,000. Is Zoom Search able to support such a large request?"
        That is a definite maybe. Have a look at these benchmark pages:
        Indexing speed
        http://www.wrensoft.com/forum/viewtopic.php?t=525
        Search speed
        http://www.wrensoft.com/zoom/benchmarks.html

        The potential problems are

        Indexing time
        If you assume you are indexing these remote sites at a rate of 5 pages per second, you are looking at about 11 hours to index 200,000 pages. This assumes an effective load time of 200 milliseconds per page. If they were all slow servers, however, it might be 400ms per page (22 hours). On fast servers it might be <100ms (<6 hours). (Technically speaking, the load time of each individual page would be longer, but pages are downloaded asynchronously across multiple download threads, which gives the rates stated above.)
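The estimate is simple arithmetic and easy to rerun with your own numbers. A small sketch, where `ms_per_page` is the effective time per page (one divided by the overall pages-per-second rate), not the raw download time of a single page:

```python
def hours_to_index(pages: int, ms_per_page: float) -> float:
    """Estimated wall-clock hours to index `pages` pages.

    `ms_per_page` is the effective time per page after parallel
    downloading is taken into account, in milliseconds.
    """
    seconds = pages * ms_per_page / 1000.0
    return seconds / 3600.0


for ms in (100, 200, 400):
    print(f"{ms} ms/page -> {hours_to_index(200_000, ms):.1f} hours")
```

For 200,000 pages this gives roughly 5.6, 11.1 and 22.2 hours at 100, 200 and 400 ms per page respectively.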

        RAM requirements
        Zoom writes out a lot of index data to disk as it indexes. But even so, it needs to hold part of the index in RAM during indexing, for efficiency reasons. 200,000 pages at 40KB each is 8GB of data. But as Zoom doesn't store the entire page, you don't need 8GB of RAM (or disk space). About 1.3GB of RAM should be enough during indexing.

        Bad sites
        There are many badly coded sites on the web, some of which have an infinite number of pages (like a loop). So you need to restrict the number of pages indexed per site. Zoom can do this.
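To see why a page limit matters, here is a toy sketch of a crawl loop with a cap. It is not how Zoom itself is implemented; `crawl_site`, `get_links` and the example "endless calendar" site are all made up for illustration.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_site(start: str,
               get_links: Callable[[str], Iterable[str]],
               max_pages: int) -> List[str]:
    """Breadth-first crawl from `start`, stopping after `max_pages` pages."""
    seen = {start}
    queue = deque([start])
    visited: List[str] = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited


def endless_calendar(url: str) -> List[str]:
    """Hypothetical bad site: every page links to a brand-new 'next' page."""
    page = int(url.rsplit("=", 1)[1])
    return [f"http://example.com/cal?page={page + 1}"]


pages = crawl_site("http://example.com/cal?page=0", endless_calendar, max_pages=50)
print(len(pages))  # 50
```

Tracking visited URLs alone is not enough here, because every "next" page has a URL that has never been seen before. Without the `max_pages` check the loop would never finish.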

        Search Times
        If you use the CGI option and a reasonable server, then the search times should be OK on 200,000 pages. Maybe <2 sec.

        If you are thinking about indexing a lot more than 200,000 pages, then you might also want to read this past post, which explains why the entire internet isn't going to fit on your hard disk:
        http://www.wrensoft.com/forum/viewtopic.php?t=712

        Let me know how it goes.

        ------
        David



        • #5
          I should've been clearer...yes, we will use Windows scheduler (using cron job as a generic term of scheduling).

          I want a limit of 50 per site, but only on some of the sites. On other URLs, I want it to fully index the site.

          Okay, on the Zoom configuration file that I'm going to have to create manually, how do I specify different base hrefs?

          For example, I have www.site1.com, www.site2.com, www.site3.com, ... www.site20.com. Each of these needs to be indexed and followed to 50 pages. (As such, the base href needs to be changed each time.) At the bottom of my URL list I have another site (www.mastersite.com) that needs to be fully indexed (no limit).

          Suggestions on how the configuration file needs to look for this?



          • #6
            "I want a limit of 50 per site, but only on some of the sites. On other URLs, I want it to fully index the site."
            You can't. You can allow indexing of all pages on a site, you can set a global limit on the total number of pages indexed, or you can set a single limit that applies to all start points. You can't set a different page limit for each start point.

            This might be something we look at in V5 of the software.

            For each start point that you create you can (and should) have a different base URL. You can enter this when you enter in the start point URL. But you shouldn't need multiple base URLs for a single start point, at least not from what you have described.

            ------
            David



            • #7
              Are there any command line options for Zoom that might help?
              ____________________________
              Terry Remsik



              • #8
                No. There are no command line options that affect the number of pages indexed per start point.

                ----
                David



                • #9
                  Actually, I do need different base URLs, since the first page Zoom visits is just a list of external URLs and is not indexed (follow only).

                  For what it's worth, I think having a parameter to specify the max number of pages/links per URL would be very useful.

                  Thanks.
