.docx & .doc files being indexed but not showing in search results


  • .docx & .doc files being indexed but not showing in search results

    I am indexing a website with a mix of .htm, .docx and .pdf files, but the contents of the .doc and .docx files are not showing up in the search results.

    For example, the log for the file "DescendantsOfJamesGalbraith,Antrim.docx" says:
    • 14|01/03/15 11:50:04|DL Thread #1, got URL (http://www.clangalbraith.org/MembersOnly/Library/Canada-Ontario/DescendantsOfJamesGalbraith,Antrim.docx) off queue
    • 04|01/03/15 11:50:04|Downloading file http://www.clangalbraith.org/MembersOnly/Library/Canada-Ontario/DescendantsOfJamesGalbraith,Antrim.docx
    • http://www.clangalbraith.org/MembersOnly/Library/Canada-Ontario/DescendantsOfJamesGalbraith,Antrim.docx (Content-type: Plain text)
    • 11|01/03/15 11:50:05|Spidering for links on http://www.clangalbraith.org/MembersOnly/Library/Canada-Ontario/DescendantsOfJamesGalbraith,Antrim.docx
    • 00|01/03/15 11:50:05|Indexing http://www.clangalbraith.org/MembersOnly/Library/Canada-Ontario
    • DescendantsOfJamesGalbraith,Antrim.docx


    There are no error messages so this all seems to be OK.
    • However, the contents do not appear to be indexed.
    • When searching using only the document title it shows up in search results with the extension .docx, so the docs are being seen.
    • NB. PDF files are being indexed OK, including one linked via the same page as the docx file. The PDF files are bigger than the doc files so size is not a problem.
    • I do have doc & docx added in the Scan options.
    • After seeing in the log (Content-type: Plain text) I changed the scan options for these docs to plain text from the standard setting but they are still not showing in the index.
    • I have amended the settings to one single thread, 5 sec, reload files (do not use cache), increased all the limits etc. to no effect.
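One way to see how many files are affected is to scan the saved index log for Office documents that were reported as plain text. This is only a rough sketch: it assumes the log has been saved to a text file and that the content-type lines look like the one quoted above; the function name is made up for illustration.

```python
import re

# Matches a line such as:
#   http://host/path/file.docx (Content-type: Plain text)
# The format is assumed from the single log excerpt quoted in this post.
CONTENT_TYPE_LINE = re.compile(r"(\S+) \(Content-type: ([^)]+)\)")
OFFICE_EXTENSIONS = (".doc", ".docx")


def office_files_seen_as_text(log_lines):
    """Return URLs of .doc/.docx files that the log reports as plain text."""
    flagged = []
    for line in log_lines:
        match = CONTENT_TYPE_LINE.search(line)
        if not match:
            continue
        url, content_type = match.groups()
        is_office = url.lower().endswith(OFFICE_EXTENSIONS)
        is_text = content_type.strip().lower() == "plain text"
        if is_office and is_text:
            flagged.append(url)
    return flagged


# Usage (the file name is hypothetical):
#   with open("zoom_log.txt", encoding="utf-8", errors="replace") as f:
#       for url in office_files_seen_as_text(f):
#           print(url)
```

If every .doc and .docx URL turns up in that list while the PDFs do not, the problem is more likely in how the files are served than in the files themselves.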


    I have been checking the help and the forum but have hit a brick wall. What am I doing wrong?

    Tassie Simon

  • #2
    In the Zoom configuration window, .DOC files MUST be set to be Word documents and .DOCX files must be set to be Office 2007 documents. If you force them to be treated as text it won't work.

    Your files are hidden behind a login page, so I can't check them. But my guess is that your server is returning the wrong MIME type for the files. See:
    http://en.wikipedia.org/wiki/Internet_media_type
    So check this first by looking at the HTTP headers.
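A minimal sketch of that header check in Python, using only the standard library. It assumes the file can be fetched without logging in (files behind a login page would need a session cookie or credentials, which this does not handle). The function names and the table of expected types are illustrative, not part of Zoom.

```python
import posixpath
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

# Registered media types for Word documents.
EXPECTED = {
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}


def media_type(header_value):
    """Drop parameters such as '; charset=UTF-8' and normalise case."""
    return header_value.split(";", 1)[0].strip().lower()


def check(url, header_value):
    """Return (expected, actual, ok) for a URL and its Content-Type header."""
    ext = posixpath.splitext(urlsplit(url).path)[1].lower()
    expected = EXPECTED.get(ext)
    actual = media_type(header_value or "")
    return expected, actual, expected is None or expected == actual


def fetch_content_type(url):
    """Send a HEAD request and return the raw Content-Type header."""
    with urlopen(Request(url, method="HEAD"), timeout=10) as response:
        return response.headers.get("Content-Type", "")


if __name__ == "__main__":
    import sys

    # Pass one or more URLs on the command line.
    for target in sys.argv[1:]:
        expected, actual, ok = check(target, fetch_content_type(target))
        print(target, "->", actual, "(OK)" if ok else f"(expected {expected})")
```

A .docx URL that comes back as text/plain or text/html would point to the server configuration rather than to the indexer settings.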

    Edit: The second thing that is strange is the use of a comma in the URL, "Galbraith,Antrim". A comma is a reserved character in a URL (see RFC 3986), so in this context the comma should probably be percent-encoded. Does it work for Word documents whose file names don't contain a comma?
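To test the comma theory, the link can be rewritten with the comma percent-encoded as %2C. A small sketch using Python's standard urllib.parse; the function name is made up for illustration, and whether the comma actually matters here is still only a guess.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_path(url):
    """Percent-encode reserved characters (such as ',') in the path only.

    '/' is kept so the directory structure survives, and '%' is kept so
    that escapes already present in the URL are not encoded a second time.
    """
    parts = urlsplit(url)
    encoded_path = quote(parts.path, safe="/%")
    return urlunsplit(parts._replace(path=encoded_path))


if __name__ == "__main__":
    print(encode_path(
        "http://www.clangalbraith.org/MembersOnly/Library/Canada-Ontario/"
        "DescendantsOfJamesGalbraith,Antrim.docx"
    ))
```

If the encoded link indexes correctly and the original does not, the comma is the culprit; if both behave the same, the MIME type is the more likely cause.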



    • #3
      The doc files were initially set as you said. I only tried text in trying different things to get it to work. I will look at the HTTP headers as suggested.



      • #4
        Solution :-

        In this case the fix was to go into the settings on the website's host server and add the missing MIME content types.

        My server runs the Apache web server.

        The settings for Apache are in a file called '.htaccess'. Logging into the host server for the website in question (an Association website) brings up a control panel where the '.htaccess editor' is located under 'additional tools'.
        Another host, which I use for my own websites, provides the common CPanel; there the equivalent tool is in the 'Advanced' group and is named 'MIME Types'.

        In the .htaccess editor use the 'Add new MIME Type' or similar. In CPanel it is named 'Create a MIME Type'.

        You need to add lines like this, e.g. for docx:
        In 'AddType' :
        application/vnd.openxmlformats-officedocument.wordprocessingml.document docx
        In 'Extension (s)' :
        .docx

        The type and extension will differ depending on the application, e.g. .docx, .xlsx etc.
        For Office 2007 see the list at :
        http://www.wrensoft.com/forum/showth...X-PPTX-XLSX%29
        For a complete list of all MIME types see :
        http://www.sitepoint.com/web-foundat...complete-list/
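For anyone editing .htaccess by hand rather than through a control panel, the lines can be generated from a small table. This is a sketch only: the table below covers the common Word, Excel and PowerPoint types and is not the full list from the links above, and the function name is made up. The output uses Apache's `AddType media-type extension` form.

```python
# Extension -> media type for the common Office formats (not a complete list).
OFFICE_MIME_TYPES = {
    "doc": "application/msword",
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def addtype_lines(mapping):
    """Build one Apache 'AddType <media-type> <extension>' line per entry."""
    return [f"AddType {mime} {ext}" for ext, mime in sorted(mapping.items())]


if __name__ == "__main__":
    # Print the lines so they can be pasted into .htaccess.
    print("\n".join(addtype_lines(OFFICE_MIME_TYPES)))
```

After adding the lines, re-check the HTTP headers for one of the documents before re-indexing, so that a typo in the media type is caught early.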

        Depending on the version of Apache and the setup, other files might also be involved:
        /etc/mime.types
        /usr/share/mime/application
        /etc/apache2/mods-enabled/mime.conf

        In my case these other files were not needed.

        However, all this should be set up correctly by your hosting company. Unfortunately, in this case the website hosting company had not done so. On the other host I mentioned, which uses CPanel, all the content types in the list of MIME types in the 2nd link above were already included.
