Hello, I need to index the site http://cms.hhs.gov for all of its PDFs. However, all the directory listings are denied, and simply scanning the known pages does not index enough of the right documents. Google has succeeded at indexing everything; is there something else I can do?
Indexing w/o Directory Listing
-
It might be because the Google index is several months out of date, and documents that were available several months ago are no longer available?
Are you saying that they have documents that they are making public, but they are not providing any links to the documents?
Maybe the real issue lies elsewhere. For example, maybe you are not indexing .ASP files and so are missing many of the links to PDF files. Or maybe you set the page limit too low, so indexing stops before all documents have been indexed?
-
You might want to take a look at these FAQs too:
Q. Why are some of my pages being skipped by the indexer?
Q. I am indexing with spider mode but it is not finding all the pages on my web site
-
Originally posted by DAaaMan64: Hello, I need to index the site http://cms.hhs.gov for all of its PDFs. However, all the directory listings are denied, and simply scanning the known pages does not index enough of the right documents. Google has succeeded at indexing everything; is there something else I can do?
-
Is this actually a site that you are maintaining? Or a third-party site that you wish to index?
Because if it is not your own site, there may be other things set up to allow Google to index the files that you are not aware of. For example, they may have generated a Google XML Sitemap file and submitted it to Google, giving Google a comprehensive list of URLs to all the files on their site.
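For context, a sitemap is just an XML file listing URLs for a crawler to fetch. A minimal sketch is below; the two URLs are hypothetical placeholders, not actual documents known to exist on cms.hhs.gov:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Hypothetical entries for illustration only -->
  <url><loc>http://cms.hhs.gov/example-document-1.pdf</loc></url>
  <url><loc>http://cms.hhs.gov/example-document-2.pdf</loc></url>
</urlset>
```

A file like this, submitted to Google, would let Google find documents even when no page on the site links to them and directory listings are denied.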
And as mentioned above, the problem could simply be a spider configuration issue, if you have not set up Zoom correctly to allow it to find the required files and pages. Please check the information David and I posted above.
-
I think I figured it out after messing with the settings. I looked around some more, but I can't figure out how to make it scan all types of pages while only indexing PDFs. How do I do that? Pardon my ignorance, thank you.
I would just categorize it, but with these limits on how much I can index, I need to be able to index only PDFs while still scanning everything.
Last edited by DAaaMan64; Mar-23-2007, 05:51 PM.
-
If you use Offline Mode to index your files (i.e. all the PDF files locally on your hard disk or networked drive), then you can simply restrict it to indexing only .PDF files, since Offline Mode does not need to crawl pages for links.
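To preview which files an Offline Mode scan restricted to PDFs would pick up, a rough sketch in Python (a hypothetical helper, not part of Zoom) that walks a local folder and collects only .pdf files:

```python
import os

def find_pdfs(root):
    """Collect paths of all .pdf files under `root`, matching case-insensitively."""
    pdfs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                pdfs.append(os.path.join(dirpath, name))
    return pdfs

if __name__ == "__main__":
    # Point this at the local copy of the site's files
    for path in find_pdfs("."):
        print(path)
```

Because no link-following is involved, the file extension alone decides what gets indexed, which is why Offline Mode sidesteps the scan-everything-but-index-only-PDFs problem.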
If you are relying on Spider Mode to follow the links from your HTML pages to your PDF files, then you will either have to:
a) Add your HTML page containing the links to your PDF files as a start point with the "Follow links only" spider option. You can do this by clicking on the "More" button and "Edit" the start point. If you are depending on a number of HTML pages for links, then you will need to add each of them as a separate start point.
b) Modify the HTML pages so that they contain ZOOMSTOP and ZOOMRESTART tags, which will exclude their page content from being indexed.
However, both of these options only exclude the HTML pages from appearing in the search results. They will still count towards the limits specified on the Limits tab.
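For option (b), the tags are HTML comments placed around the content to exclude. A sketch, assuming a hypothetical link page, based on the behaviour described above (the page's text is kept out of the index while the spider can still reach the PDF links):

```html
<html>
<body>
<!--ZOOMSTOP-->
<p>This page text will not appear in the search index.</p>
<a href="document1.pdf">Document 1</a>
<a href="document2.pdf">Document 2</a>
<!--ZOOMRESTART-->
</body>
</html>
```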