page has been skipped problem


  • Ray
    replied
    Sorry, but it is honestly very difficult to make sense of what you're talking about when you won't give actual information and continue to describe made-up scenarios in broken English. If you at least quoted actual log messages from the Indexer, it might help a lot more.

    Originally posted by deandean
    OK...but i wanna to confirm that .....
    input this "http://mysite.com/test/" ,then it can goto the page "http://mysite.com/test/index.html",
    This can mean several things. Are you saying that the URL gets redirected? In which case, there is an actual message on the "Log" tab of the Indexer, which says "URL redirected to ..." (if you have "Downloading" messages enabled).

    Or are you just saying that when you go to the URL http://mysite.com/test/ the actual file that is indexed is "index.html" (which is very common and normal)? But do you understand that a web server can serve the same file via different URLs? I have explained this earlier as well, but my impression is that you don't understand this, and we're going around in circles.

    Originally posted by deandean
    and the CRC is on, so the page "http://mysite.com/test/index.html" will also be skipped, right?
    and the linkage(content) inside the page "http://mysite.com/test/index.html" will be indexed or not
    If it was an HTTP redirection, then the page wouldn't have been skipped by CRC because they would be different pages.

    If the page was skipped by CRC, then the same page was already indexed which means the same content, links, would all have been indexed.

    So in any of the given scenarios, what you are describing doesn't make sense.

    Honestly, please provide us with some more substantial information. You can send us the index log (Make sure you have "Show all" messages clicked on the "Log" tab, then click on "File"->"Save index log to file") and refer to the actual URLs you are indexing so we can see what you are really talking about. Otherwise you are wasting our time and your own.
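
The point about CRC skipping made in this reply can be illustrated with a toy crawler. This is only a minimal sketch, not Zoom's actual Indexer logic: the `spider` function, the `SITE` dictionary and the `contact.html` page are hypothetical, links are assumed to be absolute, and `zlib.crc32` stands in for whatever checksum the Indexer really uses.

```python
import zlib
from collections import deque

def spider(site, start):
    """Crawl `site` (url -> (body, links)) from `start`, skipping duplicate content."""
    seen_urls = {start}   # each URL is queued at most once
    seen_crcs = set()     # checksums of page bodies already indexed
    queue = deque([start])
    indexed, skipped = [], []
    while queue:
        url = queue.popleft()
        body, links = site[url]
        crc = zlib.crc32(body.encode("utf-8"))
        if crc in seen_crcs:
            # Same content was already indexed under another URL,
            # so its links have already been queued as well.
            skipped.append(url)
            continue
        seen_crcs.add(crc)
        indexed.append(url)
        for link in links:
            if link in site and link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)
    return indexed, skipped

# Hypothetical site: the same page is served at two URLs.
SITE = {
    "http://mysite.com/": ("home", ["http://mysite.com/test/",
                                    "http://mysite.com/test/index.html"]),
    "http://mysite.com/test/": ("test page", ["http://mysite.com/test/contact.html"]),
    "http://mysite.com/test/index.html": ("test page", ["http://mysite.com/test/contact.html"]),
    "http://mysite.com/test/contact.html": ("contact", []),
}

indexed, skipped = spider(SITE, "http://mysite.com/")
```

In this sketch `http://mysite.com/test/index.html` ends up in `skipped`, yet `contact.html` is still indexed, because the identical copy at `http://mysite.com/test/` was crawled first and supplied the same links.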


  • deandean
    replied
    Originally posted by Ray
    As explained above, this shouldn't be possible. I think you might be confused. Are you referring to URLs like the following as the same:

    http://mysite.com/test/
    http://mysite.com/test/index.html
    http://www.mysite.com/test/default.html

    These are NOT the same URLs. They may go to the same page but that is dependent on the way your server is configured.

    The CRC detection prevents this from happening by actually looking at the contents of each page and determining if they are duplicates. Although ideally, you should not be linking to the same page with different names and URLs throughout your site, so that this would not be necessary.

    OK...but i wanna to confirm that .....
    input this "http://mysite.com/test/" ,then it can goto the page "http://mysite.com/test/index.html",
    and the CRC is on, so the page "http://mysite.com/test/index.html" will also be skipped, right?
    and the linkage(content) inside the page "http://mysite.com/test/index.html" will be indexed or not
    thanks so much


  • Ray
    replied
    I agree. This should not be possible unless:

    (a) You have a corrupted set of index files, by mixing files from different indexing sessions or uploading them incorrectly.

    (b) You have modified our search script source code and broken it.

    (c) You are missing details in your fictitious example, and it is not an accurate representation of what you are seeing.

    If you are sure it is not any of the above reasons, then ZIP up your search files (all files generated by the Indexer) and the ZCFG configuration file, and e-mail them to us.


  • David
    replied
    Unless you are using a corrupted set of index files, the fictitious example you describe can't happen in real life. As already pointed out, each URL will appear at most once in the search results. So your scenario is impossible regardless of the CRC setting.
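
Why three pages linking to the same z.htm should still give a single result can be sketched as follows. This is a simplified illustration under assumptions, not the Indexer's real code: `crawl_urls` and `LINKS` are hypothetical names, and the crawler is assumed to keep a set of URLs it has already visited.

```python
from collections import deque

def crawl_urls(links, start_pages):
    """Return the URLs visited, in order, given a url -> [linked urls] map."""
    seen = set()
    visited = []
    queue = deque(start_pages)
    while queue:
        url = queue.popleft()
        if url in seen:
            continue  # already visited via another link, do not index again
        seen.add(url)
        visited.append(url)
        queue.extend(links.get(url, []))
    return visited

# a.htm, b.htm and c.htm each contain <a href="z.htm">
LINKS = {"a.htm": ["z.htm"], "b.htm": ["z.htm"], "c.htm": ["z.htm"]}
pages = crawl_urls(LINKS, ["a.htm", "b.htm", "c.htm"])
```

Here z.htm is discovered three times but appears only once in `pages`, independent of any CRC setting.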


  • deandean
    replied
    Yes, I'm so sorry, English is not my first language,
    but I think the sample I gave you is very clear.
    OK, let me explain it this way.

    a.htm
    <a href="z.htm" ...........>

    b.htm
    <a href="z.htm" ...........>

    c.htm
    <a href="z.htm" ...........>

    The link <a href="z.htm" ...........> is in all 3 of these pages, right?
    Say z.htm contains the text "Hello zoom search". I turn CRC off and start indexing. After that, I search for the keyword "Hello".
    The result will be:
    =========================================================
    z page
    "Hello zoom search"
    Terms matched: 1 - Score: 14 - 20 Mar 2009 - URL: http://content.abc.com/z.htm

    z page
    "Hello zoom search"
    Terms matched: 1 - Score: 14 - 20 Mar 2009 - URL: http://content.abc.com/z.htm

    z page
    "Hello zoom search"
    Terms matched: 1 - Score: 14 - 20 Mar 2009 - URL: http://content.abc.com/z.htm
    =========================================================
    Right? That is what I meant by "3 results".
    And sorry again... I cannot give you the real URL because it is not a private website; it belongs to the company I am working for, so I need to give the example in my own way.
    Please help, many thanks.


  • David
    replied
    I'm sorry, but I have read your post a couple of times and I really can't make much sense of it. Statements like "the result will be 3......so that's why i need to turn on the CRC", are illogical.

    You also contradict your initial post, and your use of fictitious examples rather than the real URLs and real search words further confuses matters.

    I am guessing English isn't your first language, but this is no excuse for only supplying half the information.

    You can't insert files into the forum. You need to E-Mail them.


  • deandean
    replied
    I found that the input URL "http://content.abc.com/ContactList/" goes to the "http://content.abc.com/ContactList/index.htm" page.
    So is that why "http://content.abc.com/ContactList/index.htm" will be indexed?
    And so the links inside "http://content.abc.com/ContactList/index.htm" cannot be found?

    You know I cannot turn off CRC,
    because I have 3 pages, e.g. a.htm, b.htm, c.htm,
    and there is a link, e.g. to z.htm, in each of these 3 pages.
    After searching for a keyword inside z.htm, the result will be 3......so that's why i need to turn on the CRC......

    And I found that the link used is <a href="ContactList/index.htm" ...........>

    Now with CRC turned off it can index, but many links are the same.
    With CRC turned on, the page is skipped, so the links inside the page are not indexed either.

    How do I insert the ZCFG file here?


  • Ray
    replied
    Originally posted by deandean
    if i turn off CRC function...it will be many same url appear after search
    As explained above, this shouldn't be possible. I think you might be confused. Are you referring to URLs like the following as the same:

    http://mysite.com/test/
    http://mysite.com/test/index.html
    http://www.mysite.com/test/default.html

    These are NOT the same URLs. They may go to the same page but that is dependent on the way your server is configured.

    The CRC detection prevents this from happening by actually looking at the contents of each page and determining if they are duplicates. Although ideally, you should not be linking to the same page with different names and URLs throughout your site, so that this would not be necessary.

    Originally posted by deandean
    if i turn on CRC function...the page cannot be indexed, and the linkage in that page also not be indexed
    if the another page's content is the same, i should have the linkage indexed inside the page.
    You've made an assumption here. The problem might actually be that the page (or an identical copy of the page somewhere else) was indexed, but the links were not crawled for other reasons (the links are inside JavaScript, the links are external to the base URL, etc.). You should verify this first.

    It would likely be much quicker if you can give us a URL to the site in question, and maybe even provide us with the ZCFG file, and tell us what page or link you are trying to index.


  • David
    replied
    "it will be many same url appear after search"
    I think this is impossible. Each URL will appear at most once in the search results.


  • deandean
    replied
    Originally posted by wrensoft
    The concept isn't complex. The content on one page is the same as the content on another page. It is a duplicate.

    But as suggested just turn off the CRC function if it is all too complex to understand.

    If you are using Spider mode then links within your site are followed to find other pages.
    yes~im using spider mode (follow links only)
    if i turn off CRC function...it will be many same url appear after search
    if i turn on CRC function...the page cannot be indexed, and the linkage in that page also not be indexed
    if the another page's content is the same, i should have the linkage indexed inside the page.
    Now it can't. Please help, thanks.


  • David
    replied
    "how does it different URL but the same page"
    The concept isn't complex. The content on one page is the same as the content on another page. It is a duplicate.

    But as suggested just turn off the CRC function if it is all too complex to understand.

    "if the page already indexed, the linkage in this page also indexed"
    If you are using Spider mode then links within your site are followed to find other pages.


  • deandean
    replied
    Originally posted by wrensoft
    The duplicate page could have any URL. But the content will be the same.

    Just turn off CRC in the Scan Options configuration window
    I don't understand "The duplicate page could have any URL. But the content will be the same."
    how does it different URL but the same page
    if the page already indexed, the linkage in this page also indexed...right?
    But no links in this page are being indexed.


  • David
    replied
    The duplicate page could have any URL. But the content will be the same.

    Just turn off CRC in the Scan Options configuration window


  • deandean
    replied
    Originally posted by wrensoft
    CRC = Cyclic redundancy check.
    It detects duplicate pages on your site and prevents the duplicate being indexed.

    You might have this page appearing at two different URLs. e.g.
    http://content.abc.com/ContactList/index.htm
    and
    http://content.abc.com/ContactList/
    or
    http://content.abc.com/ContactList/index.html
    But I cannot find any other successfully indexed page like "/ContactList/index.htm", "/ContactList/", or "/ContactList/index.html" in the index log.
    Now I cannot index the links inside this page.


  • David
    replied
    CRC = Cyclic redundancy check.
    It detects duplicate pages on your site and prevents the duplicate being indexed.

    You might have this page appearing at two different URLs. e.g.
    http://content.abc.com/ContactList/index.htm
    and
    http://content.abc.com/ContactList/
    or
    http://content.abc.com/ContactList/index.html
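
The idea of detecting duplicates by checksum rather than by URL can be sketched roughly like this. It is an illustration only: `content_crc` and `find_duplicates` are hypothetical helpers, the page body is invented, and Python's `zlib.crc32` is assumed as a stand-in for the Indexer's own CRC.

```python
import zlib

def content_crc(body: str) -> int:
    """Checksum of the page content only; the URL plays no part in it."""
    return zlib.crc32(body.encode("utf-8"))

def find_duplicates(pages):
    """Map each duplicate URL to the first URL seen with identical content."""
    first_seen = {}   # crc -> first URL that had this content
    duplicates = {}   # duplicate URL -> original URL
    for url, body in pages:
        crc = content_crc(body)
        if crc in first_seen:
            duplicates[url] = first_seen[crc]
        else:
            first_seen[crc] = url
    return duplicates

# The same (invented) contact page served under three URLs.
BODY = "<html><body>Contact list</body></html>"
PAGES = [
    ("http://content.abc.com/ContactList/", BODY),
    ("http://content.abc.com/ContactList/index.htm", BODY),
    ("http://content.abc.com/ContactList/index.html", BODY),
]
dupes = find_duplicates(PAGES)
```

Whichever URL is crawled first is the one that gets indexed; the later two are reported as duplicates of it, so the "skipped" message for one URL implies an indexed copy under a different URL.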
