
File Based Indexing


  • File Based Indexing

    Very sorry if this question has been asked; I did search the board but couldn't find an answer. Can I import a list of file-based HTML pages and provide a matching remote URL for each to appear within the SERP? Your software is very good overall; however, I would prefer to use another crawler.

  • #2
    I am not sure exactly what you are asking for. Can you give an example? It sounds a bit like offline mode in Zoom (where you index local files without spidering a remote site).

    ------
    David



    • #3
      Right, currently I import a list like http://www.somesite.xxx/somepage.html/,INDEX_ONLY

      But what I’d like is file://C:/health/somepage.html, URL_http://www.somesite.xxx/somepage.html,INDEX_ONLY

      Your indexer is great for local files; I'd just like to use a remote crawler that is a little more feature-rich.



      • #4
        I think I see what you want to do. You want to download a complete web site to your local drive and then index the web site from your local drive, but still have the search results pointing to the remote site where the files originally came from.

        You might be able to use offline mode if there aren't many different sites.

        In offline mode you can enter:

        Start folder
        C:/health/

        Base URL
        http://www.somesite.xxx/

        This assumes the files (somepage.html, etc.) are in the /health folder.
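
The mapping implied by a start folder and base URL pair can be sketched in a few lines of Python. This is only an illustration of the idea, not Zoom's actual implementation: the function name `local_to_remote` is hypothetical, and it assumes the local path uses forward slashes, as in the example above.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def local_to_remote(local_path: str, start_folder: str, base_url: str) -> str:
    """Map a file under start_folder to the matching URL under base_url.

    Hypothetical helper for illustration; raises ValueError if local_path
    is not inside start_folder.
    """
    # Path of the file relative to the start folder, e.g. "somepage.html"
    relative = PurePosixPath(local_path).relative_to(PurePosixPath(start_folder))
    # Percent-encode unsafe characters (spaces etc.) but keep "/" separators
    return base_url.rstrip("/") + "/" + quote(relative.as_posix())


print(local_to_remote("C:/health/somepage.html", "C:/health/", "http://www.somesite.xxx/"))
```

With the values from the post, this prints http://www.somesite.xxx/somepage.html. Files in subfolders keep their relative path under the base URL.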

        But this doesn't work so well if you have lots of sites (too much maintenance) or if you mix up the pages from various different sites into a single directory on your local drive.

        What is the reason you don't want to use Zoom to directly download the files in Spider mode? That would seem to be the easier solution.

        ------
        David



        • #5
          What I like about Zoom is the way it delivers the final result. For niche web sites that have low-volume traffic, say less than a thousand unique visitors per day per site, you could possibly run 10 or more niche sites with Zoom on a single box, each with a fairly large database (over a hundred thousand pages) of remote web pages dedicated to that niche. The search is offered as a resource to the niche, rather than just a site search.

          Try doing that with Nutch and you’ll soon realize that it can’t be done on one box. The reason is that this type of system is designed to be capable of delivering many, many searches per second. By demanding more resources and memory dedicated up front, Nutch can deliver more searches per second with less CPU overhead. All well and good if you have a very busy site, but complete overkill for small web sites that index only a few thousand pages.

          Zoom doesn’t demand resources “upfront”. You can offer as many databases as you like with the only limit being CPU time, and a modern machine with low level traffic has lots of CPU time to spare.

          I’m new to Zoom, and it does a great job for web sites that wish to offer search as part of the site navigation system. It’s easy to use, a great price and just plain works. It does exactly what it was designed to do.

          The problem with Zoom is that it is now “too good”. In other words, it fills its marketplace in an excellent fashion, but what it is now capable of doing is starting to outgrow what it was originally designed for. I’m in a different market: I’m not looking for an add-on to my site navigation, I want to create mini-Googles. Zoom has all the pieces of the puzzle I need, except for its remote page crawler. It does a good job for single remote domains and a very good job for local file indexing, but it is very slow for what I intend to use it for.

          Reading your threads about the upcoming v5, it is clear that Zoom will be even more suited to what I want to do. Eventually you’ll attract more and more webmasters like myself, and when that happens, I’m sure you’ll improve the remote crawler. But in the meantime, would you consider adding the capability to import a list of local files with matching remote addresses? I’ve got disk space to spare, and the tools to easily create the list.

          This feature would also solve another problem: adding a few new pages to the database requires rebuilding the complete index. I understand the reason for that, and I agree that by rebuilding the database you end up with a smaller and more efficient index. That said, much of what I’ll index is “evergreen” content that never changes, and downloading 200,000 remote pages just to add one page isn’t very practical. However, rebuilding the index from the local file system is very fast, so it would then be practical to add a few pages and quickly rebuild the index.

          So … regardless of what features you add to the future Zoom, I’m always going to prefer using up cheap disk space, in exchange for bandwidth and time.



          • #6
            Your case is unusual, as you want to index just a couple of pages from a large number of domains. You never want to check whether the pages have been updated, but at the same time you want to add additional pages to the existing index.

            And as you point out, while Zoom can do this, it is not the most efficient scenario, especially with V4.x of Zoom.

            Incremental indexing (planned in V5) does address part of this problem. You will be able to add just a few new files without downloading all the unchanged files, thus using less bandwidth and time. It still won't match the speed of a local hard drive, however.

            Maybe what you want can mostly be done by using multiple start points in offline mode. But there is no import function for this list, meaning you will need to enter the data manually or write a script to generate a Zoom configuration file (which is a Unicode text file).
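
As a rough sketch of the scripted route, the Python below builds the list of (start folder, base URL) pairs for a set of downloaded sites. It rests on assumptions not stated in the thread: each site is stored in its own folder named after its host (for example `C:/mirror/www.somesite.xxx/`), and the function name `start_points` is hypothetical. It deliberately stops short of writing a Zoom configuration file, since the file format is not described here.

```python
from pathlib import Path


def start_points(mirror_root: str, scheme: str = "http") -> list[tuple[str, str]]:
    """Return one (start folder, base URL) pair per site folder.

    Assumes (hypothetically) that every subfolder of mirror_root is named
    after the host it was downloaded from.
    """
    pairs = []
    for site_dir in sorted(Path(mirror_root).iterdir()):
        if site_dir.is_dir():  # skip stray files in the mirror root
            start_folder = site_dir.as_posix() + "/"
            base_url = f"{scheme}://{site_dir.name}/"
            pairs.append((start_folder, base_url))
    return pairs


if __name__ == "__main__":
    # "." is a placeholder; point this at the folder holding the mirrored sites
    for folder, url in start_points("."):
        print(folder, url, sep="\t")
```

The output could then be entered as offline-mode start points by hand, or fed into whatever script writes the configuration file.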

            ------
            David

