

Scoring HTML over PDFs


  • Scoring HTML over PDFs

    Hi. I'm running a Professional Edition implementation of Zoom on a reasonably large website. Generally, the results are very good thanks to the many options and settings available. However, we have encountered an issue which is now causing major problems.

    There are a large number of large PDF files on the website – these are swamping the HTML results simply due to the large number of words (and repetition of words) within the PDFs.

    We have given all the HTML pages a +5 boost and the PDFs a -5 deboost (using .desc files). We've set the Content Density adjustment to Strong, as well as Word Positioning, and we've even given Body Content a -5 deboost.
    However, the documents still take precedence.
    Note that Recommended Links are not an option.

    We have currently disabled body content indexing so as to make use of the reasonably comprehensive metadata used across the site, but unsurprisingly, the results are now very much dependent on specific keywords.

    My client has asked if it is possible to still score body content, but only on the first hit for each unique word in the HTML page/document. This would allow for reasonably accurate results and prevent the PDFs from always taking precedence.
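    To illustrate the idea (this is a simplified sketch of the scoring concept, not Zoom's actual ranking algorithm), compare raw term-frequency scoring, which rewards repetition, with "first hit only" scoring, which counts each unique query term at most once per document:

    ```python
    # Sketch: why repetitive PDFs swamp shorter HTML pages under raw
    # term-frequency scoring, and how counting each unique word once
    # levels the field. Illustrative only -- not Zoom's internal logic.

    def tf_score(words, query):
        """Score = total occurrences of query terms (repetition rewarded)."""
        return sum(1 for w in words if w in query)

    def first_hit_score(words, query):
        """Score = number of distinct query terms present (repetition ignored)."""
        unique_words = set(words)
        return sum(1 for term in query if term in unique_words)

    html_page = "zoom search configuration guide".split()
    pdf_doc = ("zoom " * 40 + "manual appendix").split()  # repetitive PDF text
    query = {"zoom", "configuration"}

    print(tf_score(html_page, query), tf_score(pdf_doc, query))                # 2 40 -> PDF wins
    print(first_hit_score(html_page, query), first_hit_score(pdf_doc, query))  # 2 1  -> HTML wins
    ```

    Under raw counts the repetitive PDF dominates; under first-hit scoring the more relevant HTML page comes out ahead.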

    Is this a possibility? Is there any way of indexing like this without asking for custom development? Is there another way we can approach the problem?

    Thanks in advance for any help.

  • #2
    Are you using the V6 CGI scripting option?
    We have found a bug in the CGI script that is causing page boosting to be ignored. This has been fixed in the last few days and will be in the 6.0.1019 release out later this week.

    You might also try indexing just the first 500 (or so) words from each page / document.
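    The effect of that suggestion can be sketched roughly as follows (a hypothetical pre-processing step, not Zoom's actual implementation; the 500-word cap comes from the suggestion above):

    ```python
    # Sketch: cap each document's indexed body at its first N words, so a long,
    # repetitive PDF contributes no more scoring weight than a shorter HTML page.
    # Illustrative only -- Zoom applies its own word limit internally.

    def truncate_for_index(text, max_words=500):
        """Keep only the first max_words words of a document's body text."""
        words = text.split()
        return " ".join(words[:max_words])

    long_pdf_text = "term " * 2000          # a very repetitive 2000-word document
    indexed = truncate_for_index(long_pdf_text)
    print(len(indexed.split()))             # 500
    ```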



    • #3
      Hi, and thanks for the quick response.
      I should have specified details – we're using the ASP.NET version, and to be honest, I think everything is functioning as it should.

      I think indexing the first 500 words might be worth a go, but it's possible we'll still miss required hits. I'll give it a try and see how the results look.

      Thanks.



      • #4
        The bug also affects the .NET option (which is based on the CGI source code), so make sure you get the new release later this week as well.



        • #5
          V6.0 build 1019 is now available with the above-mentioned fix:
          http://www.wrensoft.com/zoom/whatsnew.html

          Make sure to get the ASP.NET Server Control as well (you will need to uninstall and reinstall the control on the server):
          http://www.wrensoft.com/zoom/aspdotnet.html
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine
