Zoom not indexing post attachments in vBulletin v4.0


  • Zoom not indexing post attachments in vBulletin v4.0

    Hi,

    Zoom keeps running into a nofollow meta tag when trying to index attachments (PDFs) in a thread or post in vBulletin 4.0.

    Here is the Zoom log error (attachment.php?attachmentid=4&d=1275696187 meta robots "noindex" tag found)

    For the life of me I can't figure out where it is picking up this tag from. What am I missing here?

  • #2
    I assume the meta tag would be generated by the attachment.php script. Possibly in response to a setting in the vB config settings.

    Did you do a wget on the URL and then examine the headers?

    What is the full URL? We can check it from here.



    • #3
      Thanks mate...unfortunately it's behind a firewall.

      If I look at the page source of the thread there is no rel tag there.

      Code:
      <li>
          <img class="inlineimg" src="images/attach/pdf.gif" alt="File Type: pdf" />
          <a href="attachment.php?attachmentid=6&amp;d=1275717490">this is a file name.pdf</a> 
      (76.4 KB, 0 views)
      </li>
      Also I have gone through the attachment.php and there is no "nofollow" code there. It's really frustrating.

      Are you able to index the attachments on this board?

      wget actually dies on the URL

      wget -v -S http://localhost/forums/attachment.php?attachmentid=6&d=1275717490
      SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
      syswgetrc = C:\Program Files\GnuWin32/etc/wgetrc
      --2010-06-05 17:16:00-- http://localhost/forums/attachment.php?attachmentid=6
      Resolving localhost... 127.0.0.1, ::1
      Connecting to localhost|127.0.0.1|:80... connected.
      HTTP request sent, awaiting response...
      HTTP/1.1 200 OK
      Date: Sat, 05 Jun 2010 07:16:00 GMT
      Server: Apache/2.2.11 (Win32) PHP/5.3.0
      X-Powered-By: PHP/5.3.0
      Cache-Control: private
      Vary: User-Agent
      Connection: close
      Content-Type: text/html; charset=ISO-8859-1
      Length: unspecified [text/html]
      Saving to: `attachment.php@attachmentid=6.2'

      [ <=> ] 14,882 --.-K/s in 0.001s

      2010-06-05 17:16:01 (26.4 MB/s) - `attachment.php@attachmentid=6.2' saved [14882]

      'd' is not recognized as an internal or external command,
      operable program or batch file.

      I also posted a question over at vB to find out where it could possibly be picking up the noindex tag.
      Last edited by MikeR; Jun-05-2010, 07:29 AM.



      • #4
        'd' is not recognized as an internal or external command
        You are not using wget correctly. I assume you are using this from DOS? You need to put the URL in quotes to avoid DOS splitting up the URL and treating bits of the URL like command line parameters.
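To illustrate the splitting described here, below is a minimal Python sketch of how an unquoted `&` breaks a command line into separate commands. It is a simplified model and not a reimplementation of the real command interpreter; the function name `split_on_unquoted_ampersand` is made up for this example, and escape characters such as `^` are deliberately ignored.

```python
def split_on_unquoted_ampersand(command_line):
    """Split a command line at each '&' found outside double quotes.

    Simplified model: '&' separates commands unless it sits inside a
    double-quoted string. Escape sequences are not handled.
    """
    commands = []
    current = []
    in_quotes = False
    for ch in command_line:
        if ch == '"':
            # Toggle quote state and keep the quote character.
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == "&" and not in_quotes:
            # Unquoted '&' ends the current command.
            commands.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    commands.append("".join(current).strip())
    return commands


url = "http://localhost/forums/attachment.php?attachmentid=6&d=1275717490"
unquoted = split_on_unquoted_ampersand("wget -v -S " + url)
quoted = split_on_unquoted_ampersand('wget -v -S "' + url + '"')
print(unquoted)  # two pieces: the wget call and a stray "d=..." command
print(quoted)    # one piece: the URL stays intact
```

The unquoted form comes out as two pieces, which matches the earlier log: wget only requested `attachment.php?attachmentid=6`, and the rest was then run as a command named 'd'.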

        You might also want to look to see if there is a robots.txt file blocking this link.
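If you want to test a robots.txt against a specific link, a small sketch using Python's standard `urllib.robotparser` could look like the following. The rules in `sample_rules` and the thread URL are hypothetical and only there to show the check; paste in the contents of your own robots.txt (if one exists) instead.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text disallows fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that would block the attachment script.
sample_rules = """\
User-agent: *
Disallow: /forums/attachment.php
"""

attachment_url = "http://localhost/forums/attachment.php?attachmentid=6&d=1275717490"
# Hypothetical thread URL, used only for comparison.
thread_url = "http://localhost/forums/showthread.php?t=1"

print(is_blocked(sample_rules, attachment_url))
print(is_blocked(sample_rules, thread_url))
```

An empty rule set blocks nothing, so if there really is no robots.txt on the site, this check rules it out as the cause.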

        This board is vB 3, and we don't allow attachments. So we can't test it here.



        • #5
          Okay here it is with the quotes.

          There is no robots.txt anywhere in the directory structure either. Now, for some added weirdness, I went into the vB templates and removed all instances of rel="nofollow", and Zoom is still getting that noindex error.


          wget -v -S -d "http://localhost/forums/attachment.php?attachmentid=6&d=1275717490"
          SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
          syswgetrc = C:\Program Files\GnuWin32/etc/wgetrc
          DEBUG output created by Wget 1.11.4 on Windows-MinGW.

          --2010-06-06 07:09:23-- http://localhost/forums/attachment.php?attachmentid=6&d=1275717490
          Resolving localhost... seconds 0.00, 127.0.0.1, ::1
          Caching localhost => 127.0.0.1 ::1
          Connecting to localhost|127.0.0.1|:80... seconds 0.00, connected.
          Created socket 336.
          Releasing 0x0033ae98 (new refcount 1).

          ---request begin---
          GET /forums/attachment.php?attachmentid=6&d=1275717490 HTTP/1.0
          User-Agent: Wget/1.11.4
          Accept: */*
          Host: localhost
          Connection: Keep-Alive

          ---request end---
          HTTP request sent, awaiting response...
          ---response begin---
          HTTP/1.1 200 OK
          Date: Sat, 05 Jun 2010 21:09:23 GMT
          Server: Apache/2.2.11 (Win32) PHP/5.3.0
          X-Powered-By: PHP/5.3.0
          Cache-Control: private
          Vary: User-Agent
          Connection: close
          Content-Type: text/html; charset=ISO-8859-1

          ---response end---

          HTTP/1.1 200 OK
          Date: Sat, 05 Jun 2010 21:09:23 GMT
          Server: Apache/2.2.11 (Win32) PHP/5.3.0
          X-Powered-By: PHP/5.3.0
          Cache-Control: private
          Vary: User-Agent
          Connection: close
          Content-Type: text/html; charset=ISO-8859-1
          Length: unspecified [text/html]
          Saving to: `attachment.php@attachmentid=6&d=1275717490.4'

          [ <=> ] 14,860 --.-K/s in 0.001s

          Closed fd 336
          2010-06-06 07:09:23 (24.9 MB/s) - `attachment.php@attachmentid=6&d=1275717490.4' saved [14860]



          • #6
            Okay...just for fun I did a completely new install of my board and left it totally vanilla. I made 1 post and attached 1 PDF. I have a completely different outcome now.

            07:45:27 - [QUEUED] Queued URL: http://localhost/forums/forums/attachment.php?attachmentid=1&d=1275773290

            07:45:30 - DL Thread #1, got URL (http://localhost/forums/forums/attachment.php?attachmentid=1&d=1275773290) off queue

            07:45:30 - [DOWNLOAD] Downloading file http://localhost/forums/forums/attachment.php?attachmentid=1&d=1275773290

            7:45:30 - Index Thread got ready buffer for http://localhost/forums/forums/attachment.php?attachmentid=1&d=1275773290 (Content-type: HTML text)

            07:45:30 - [QUEUED] Spidering for links on http://localhost/forums/forums/attachment.php?attachmentid=1&d=1275773290

            07:45:30 - [INDEXED] Indexing http://localhost/forums/forums/attachment.php?attachmentid=1&d=1275773290

            Problem this time is Zoom isn't finding the PDF at that link and indexing the contents of the PDF.



            • #7
              Okay made some progress.

              There was something in my Zoom block list that was preventing scanning the PDF. It's strange that it would find it (like in my post above) but something was preventing it from being scanned further.

              Now my next challenge is Zoom is not even finding the attachments in the vB CMS articles even though there is a URL when looking at the page source.

              The next thing is trying to limit the spider when scanning threads and posts. Zoom will scan threads and posts multiple times because vBulletin offers so many ways to view the same thread. For example, on my test board, just 1 thread gives me 4 pages of results because it is scanned in so many different ways. Any ideas on that one?



              • #8
                Length: unspecified [text/html]
                Saving to: `attachment.php@attachmentid=6&d=1275717490.4'
                Strange that Wget reports this as HTML, when you say it is a PDF file?
                Did you have a look inside this wget saved file to see what was actually in the file?
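As a quick way to look inside the saved file, a sketch like the one below checks the first bytes: a real PDF starts with `%PDF-`, while a login or error page will look like HTML. The function name `sniff_saved_file` and the sample byte strings are made up for illustration; to use it on the wget output, read the saved file in binary mode and pass the bytes in.

```python
def sniff_saved_file(data):
    """Guess whether downloaded bytes are a PDF or an HTML page."""
    head = data[:1024].lstrip()
    if head.startswith(b"%PDF-"):
        # Every PDF file begins with this header.
        return "pdf"
    lowered = head.lower()
    if lowered.startswith(b"<!doctype html") or b"<html" in lowered:
        return "html"
    return "unknown"


# Made-up samples standing in for the contents of a saved download.
print(sniff_saved_file(b"%PDF-1.4\n..."))
print(sniff_saved_file(b"<!DOCTYPE html>\n<html><body>Please log in</body></html>"))

# For a real file (path is whatever wget reported under "Saving to"):
# with open(saved_path, "rb") as fh:
#     print(sniff_saved_file(fh.read()))
```

If this reports "html" for the attachment URL, then the server is sending a web page and not the PDF itself.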

                07:45:30 - [INDEXED] Indexing http://localhost/forums/forums/attachment.php?attachmentid=1&d=1275773290
                Problem this time is Zoom isn't finding the PDF at that link and indexing the contents of the PDF.
                The log would seem to indicate that it is in fact being indexed.

                If you have an example of this on a live site we can see, I am sure we can get to the bottom of the issue in five minutes.

                For limiting the scope of indexing, see this FAQ
                http://www.wrensoft.com/zoom/support/msgboards.html
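As a rough picture of why limiting the scope removes duplicate results: the same thread is reachable under several URLs, and skipping the alternate views leaves a single copy. The Python sketch below is only an illustration. The URL patterns and skip entries are assumptions (check the real URLs in your own index log), and the plain substring matching is a simplified model, not Zoom's actual skip-list logic.

```python
def filter_urls(urls, skip_entries):
    """Keep only URLs that contain none of the skip entries (substring match)."""
    return [u for u in urls if not any(s in u for s in skip_entries)]


# Hypothetical URLs that all display the same thread in different ways.
crawled = [
    "http://localhost/forums/showthread.php?t=1",
    "http://localhost/forums/showthread.php?t=1&mode=threaded",
    "http://localhost/forums/showthread.php?p=1",
    "http://localhost/forums/printthread.php?t=1",
]

# Hypothetical skip entries targeting the alternate views.
skip = ["mode=", "showthread.php?p=", "printthread.php"]

kept = filter_urls(crawled, skip)
print(kept)
```

Entries that are too broad will also drop pages you want, so it is worth testing each one against the list of URLs the spider actually found.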



                • #9
                  Well, it is indexing the PDFs in the threads now; something in my Zoom skip list was preventing it for some reason. Now the problem is that it can't even see the attachments in articles in the CMS.

                  Let me try and get a test site open on the web for you.

                  Thanks



                  • #10
                    It's definitely something in the Zoom skip list that is preventing the PDF from being scanned; I just don't know which entry. If I don't have anything in the skip list, the PDF gets indexed. If I start adding things to the skip list, the PDF gets indexed by file name, but Zoom does not scan inside the PDF. Weird!

                    I will try to get that test site set up for you today.



                    • #11
                      I sent you a PM with the info.

                      Thanks!



                      • #12
                        David will get back to you soon when he gets to your PM.

                        But I just wanted to add that if you disable robots support (click "Configure"->"Spider options" and uncheck "Enable robots.txt support"), Zoom will also ignore meta robots tags such as the nofollow tag in question.
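If you would rather track down where the tag comes from than ignore it, a sketch like this lists the robots meta directives in a saved copy of the page (for example, the file wget wrote to disk). It uses Python's standard `html.parser`; the class and function names and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for valueless attributes.
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for part in attr_map.get("content", "").split(","):
                part = part.strip().lower()
                if part:
                    self.directives.append(part)


def find_robots_directives(html_text):
    """Return the list of robots directives found in the HTML text."""
    finder = RobotsMetaFinder()
    finder.feed(html_text)
    return finder.directives


# Made-up page head containing a robots meta tag.
sample = '<html><head><meta name="robots" content="noindex,nofollow" /></head></html>'
print(find_robots_directives(sample))
```

An empty list means the saved page has no robots meta tag, in which case the directive would have to be coming from somewhere else.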
                        --Ray
                        Wrensoft Web Software
                        Sydney, Australia
                        Zoom Search Engine



                        • #13
                          Another thought: do you need to be logged in to your forum for the attachments to be downloadable?

                          Because this indicates that it is not Zoom that is "not identifying the file as PDF", the attachment here is actually reported by the web server (or the forum software) as text/html:

                          Originally posted by MikeR View Post
                          HTTP/1.1 200 OK
                          Date: Sat, 05 Jun 2010 21:09:23 GMT
                          Server: Apache/2.2.11 (Win32) PHP/5.3.0
                          X-Powered-By: PHP/5.3.0
                          Cache-Control: private
                          Vary: User-Agent
                          Connection: close
                          Content-Type: text/html; charset=ISO-8859-1
                          Length: unspecified [text/html]
                          Saving to: `attachment.php@attachmentid=6&d=1275717490.4'
                          Note the Content-Type field. I assume this was supposed to be the PDF attachment URL.

                          If your forum does not permit attachments to be downloaded unless you are logged in, it may serve up an HTML page asking the user to log in, despite the URL. So this might be what's actually being indexed. You should take a look at the content of what wget actually saved to disk.
                          --Ray
                          Wrensoft Web Software
                          Sydney, Australia
                          Zoom Search Engine



                          • #14
                            Hi Ray,

                            Thanks for the tips. I actually did make sure that I had "use robots.txt" un-ticked and I also made sure to log into my forum first and then used cookies for authentication. Also on the board I made sure that all users (including guests) can download attachments.

                            So my latest update on this is that if I allow Zoom to search the whole site except for calendar.php it finds 2 of my 3 attachments. It finds the one in the CMS and it finds the one in the thread, but not the one in the blog post.



                            • #15
                              Your details were passed to me, and I've taken a look at your site and made some notes.

                              Originally posted by MikeR View Post
                              Thanks for the tips. I actually did make sure that I had "use robots.txt" un-ticked and I also made sure to log into my forum first and then used cookies for authentication.
                              If you have "Enable robots support" unchecked, then you should no longer be seeing the "nofollow" skipped messages you mentioned before. Can you confirm this?

                              Note that enabling "Use cookies from Windows and IE" and logging in via IE does not guarantee that the Indexer will be allowed to log in to the site. It depends on what the forum expects and how its authentication is implemented. It may be stricter than that and not allow cookies from a different session to be reused. Indeed, when I tried this on your site, it became evident that the Indexer was not recognized as being logged in: it was redirected to login/register pages that you do not see once logged in. Also, the links that were visible on the logged-in page in my browser were not available to the spider.

                              Originally posted by MikeR View Post
                              Also on the board I made sure that all users (including guests) can download attachments.
                              This does not appear to be true. Even from a browser, when I was logged out and tried to access the forum thread attachment, it prompted me to log in with the message: "You are not logged in or you do not have permission to access this page". This is most likely what Zoom is indexing instead of the PDF, and it is why Zoom does not recognize it as a PDF file (because it isn't).

                              Originally posted by MikeR View Post
                              So my latest update on this is that if I allow Zoom to search the whole site except for calendar.php it finds 2 of my 3 attachments. It finds the one in the CMS and it finds the one in the thread, but not the one in the blog post.
                              Send us your ZCFG configuration file with your indexer settings saved, because that's very different to what we're seeing.

                              Can you also tell us which version of IE you are using, as it might make a difference with whether the cookie sharing is working the same way. There might also be a difference with you indexing via localhost as opposed to over the Internet as we are doing.

                              When we tested it from here, we were able to index the attachment to the blog post fine. But it didn't index the thread attachment (because it wasn't logged in) and it didn't see the CMS article at all (again since it wasn't logged in).

                              It might make more sense to correspond further via e-mail so we can mention more specific URLs and quote log entries directly (unless you don't mind me quoting the URLs here). PM me for my e-mail address (or provide me with yours).
                              --Ray
                              Wrensoft Web Software
                              Sydney, Australia
                              Zoom Search Engine

