page has been skipped problem

Ray replied

Mar-23-2009, 05:00 AM
Sorry, but it is honestly very difficult to make sense of what you're talking about when you won't give actual information and continue to describe made-up scenarios in broken English. If you at least quoted actual log messages from the Indexer, it would help a lot more.

Originally posted by deandean View Post

OK...but i wanna to confirm that .....
input this "http://mysite.com/test/" ,then it can goto the page "http://mysite.com/test/index.html",

This can mean several things. Are you saying that the URL gets redirected? In which case, there is an actual message on the "Log" tab of the Indexer, which says "URL redirected to ..." (if you have "Downloading" messages enabled).

Or are you just saying that when you go to the URL http://mysite.com/test/ the actual file that is indexed is "index.html" (which is very common and normal)? But do you understand that a web server can serve the same file via different URLs? I have explained this earlier as well, but my impression is that you don't understand this, and we're going around in circles.

Originally posted by deandean View Post

and the CRC is on, so the page "http://mysite.com/test/index.html" will also be skipped, right?
and the linkage(content) inside the page "http://mysite.com/test/index.html" will be indexed or not

If it was an HTTP redirection, then the page wouldn't have been skipped by CRC, because they would be different pages.

If the page was skipped by CRC, then the same page was already indexed, which means the same content and links would all have been indexed.

So in any of the given scenarios, what you are describing doesn't make sense.
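
To illustrate the second case, here is a minimal sketch of a spider that skips pages whose content CRC has already been seen. It is not the Indexer's actual code: the in-memory `SITE`, the home page "/" and "/test/page2.html" are hypothetical, and the sketch assumes that a skipped duplicate is not scanned for links. The duplicate URL is skipped, yet the link inside it is still indexed, because the first copy of the page already supplied it.

```python
import zlib
from collections import deque

# Hypothetical in-memory site: url -> (content, links found in that content).
# "/test/" and "/test/index.html" serve the same file, as many servers do.
SAME_PAGE = ("Test page", ["/test/page2.html"])
SITE = {
    "/": ("Home", ["/test/", "/test/index.html"]),
    "/test/": SAME_PAGE,
    "/test/index.html": SAME_PAGE,
    "/test/page2.html": ("Page two", []),
}

def crawl(start, use_crc=True):
    """Follow links from start; optionally skip pages with duplicate content."""
    indexed, skipped = [], []
    queued = {start}      # URLs already queued, so each URL is visited once
    seen_crcs = set()     # CRC-32 values of content already indexed
    queue = deque([start])
    while queue:
        url = queue.popleft()
        content, links = SITE[url]
        crc = zlib.crc32(content.encode("utf-8"))
        if use_crc and crc in seen_crcs:
            skipped.append(url)   # duplicate content: skip, do not scan links
            continue
        seen_crcs.add(crc)
        indexed.append(url)
        for link in links:
            if link not in queued:
                queued.add(link)
                queue.append(link)
    return indexed, skipped

indexed, skipped = crawl("/")
print("indexed:", indexed)
print("skipped:", skipped)
```

Calling `crawl("/", use_crc=False)` indexes both "/test/" and "/test/index.html" as separate results, which is the duplicate-content situation the CRC option is meant to avoid.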

Honestly, please provide us with some more substantial information. You can send us the index log (Make sure you have "Show all" messages clicked on the "Log" tab, then click on "File"->"Save index log to file") and refer to the actual URLs you are indexing so we can see what you are really talking about. Otherwise you are wasting our time and your own.
Leave a comment:
deandean replied

Mar-23-2009, 04:11 AM
Originally posted by Ray View Post

As explained above, this shouldn't be possible. I think you might be confused. Are you referring to URLs like the following as the same:
http://mysite.com/test/
http://mysite.com/test/index.html
http://www.mysite.com/test/default.html

These are NOT the same URLs. They may go to the same page but that is dependent on the way your server is configured.

The CRC detection prevents this from happening by actually looking at the contents of each page and determining if they are duplicates. Although ideally, you should not be linking to the same page with different names and URLs throughout your site, so that this would not be necessary.

OK...but i wanna to confirm that .....
input this "http://mysite.com/test/" ,then it can goto the page "http://mysite.com/test/index.html",
and the CRC is on, so the page "http://mysite.com/test/index.html" will also be skipped, right?
and the linkage(content) inside the page "http://mysite.com/test/index.html" will be indexed or not
thanks so much
Leave a comment:
Ray replied

Mar-22-2009, 11:51 PM
I agree. This should not be possible unless:

(a) You have a corrupted set of index files, by mixing files from different indexing sessions or uploading them incorrectly.

(b) You have modified our search script source code and broken it.

(c) You are missing details in your fictitious example, and it is not an accurate representation of what you are seeing.

If you are sure it is not any of the above reasons, then ZIP up your search files (all files generated by the Indexer) and the ZCFG configuration file, and e-mail them to us.
Leave a comment:
David replied

Mar-20-2009, 07:52 PM
Unless you are using a corrupted set of index files, the fictitious example you describe can't happen in real life. As already pointed out, each URL will appear at most once in the search results. So your scenario is impossible regardless of the CRC setting.
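
As a rough illustration of why one URL cannot show up three times, here is a minimal sketch of a link-following crawl that keeps a set of URLs it has already queued. It is an assumption about how a spider typically works, not the Indexer's real code, and the start page "index.htm" is hypothetical. z.htm is linked from a.htm, b.htm and c.htm, but it is indexed only once.

```python
from collections import deque

# Hypothetical site: page -> links on that page. Three pages link to z.htm.
SITE = {
    "index.htm": ["a.htm", "b.htm", "c.htm"],
    "a.htm": ["z.htm"],
    "b.htm": ["z.htm"],
    "c.htm": ["z.htm"],
    "z.htm": [],
}

def crawl(start):
    """Return the pages indexed, in order, visiting each URL at most once."""
    indexed = []
    queued = {start}          # every URL that has ever been queued
    queue = deque([start])
    while queue:
        url = queue.popleft()
        indexed.append(url)
        for link in SITE[url]:
            if link not in queued:   # a URL seen before is never queued again
                queued.add(link)
                queue.append(link)
    return indexed

indexed = crawl("index.htm")
print(indexed)
```

This URL-level check is independent of the CRC setting, which compares page content rather than URLs.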
Leave a comment:
deandean replied

Mar-20-2009, 10:09 AM
Yes, I'm so sorry, English is not my first language,
but I think the sample I gave you is very clear.
OK, let me explain it this way.

a.htm
<a href="z.htm" ...........>

b.htm
<a href="z.htm" ...........>

c.htm
<a href="z.htm" ...........>

The "<a href="z.htm" ...........>" link is in all 3 of these pages, right?
So z.htm contains words like "Hello zoom search". I turn CRC off and start indexing. After that, I search for the keyword "Hello", and
the result will be
=========================================================
z page
"Hello zoom search"
Terms matched: 1 - Score: 14 - 20 Mar 2009 - URL: http://content.abc.com/z.htm

z page
"Hello zoom search"
Terms matched: 1 - Score: 14 - 20 Mar 2009 - URL: http://content.abc.com/z.htm

z page
"Hello zoom search"
Terms matched: 1 - Score: 14 - 20 Mar 2009 - URL: http://content.abc.com/z.htm
=========================================================
Right? So that is what I meant by "3 results".
And sorry again, I cannot give you the real URL because it is not my own private website; it belongs to the company I am working for, so I need to give the example in my own way.
Please help, many thanks.
Leave a comment:
David replied

Mar-20-2009, 09:29 AM
I'm sorry, but I have read your post a couple of times, and I really can't make much sense of it. Statements like "the result will be 3......so that's why i need to turn on the CRC" are illogical.

You also contradict your initial post, and your use of fictitious examples rather than the real URLs and real search words further confuses matters.

I am guessing English isn't your first language, but this is no excuse for only supplying half the information.

You can't insert files into the forum. You need to e-mail them.
Leave a comment:
deandean replied

Mar-20-2009, 07:11 AM
i found that the input url "http://content.abc.com/ContactList/" can goto "http://content.abc.com/ContactList/index.htm" page....
so that's why "http://content.abc.com/ContactList/index.htm" will be indexed ??
so the linkage inside "http://content.abc.com/ContactList/index.htm" cannot be found ??

You know I cannot turn off CRC,
because I have 3 pages, e.g. a.htm, b.htm, c.htm,
and there is a link, e.g. z.htm, in these 3 pages.
after search the key word inside the z.htm , the result will be 3......so that's why i need to turn on the CRC......

and i found that the linkage is used <a href="ContactList/index.htm" ...........>

Now, with CRC turned off, the pages can be indexed, but many links are the same.
With CRC turned on, the page is skipped, so the links inside the page are not indexed either.

how to insert the ZCFG file here??
Leave a comment:
Ray replied

Mar-19-2009, 11:44 PM
Originally posted by deandean View Post

if i turn off CRC function...it will be many same url appear after search

As explained above, this shouldn't be possible. I think you might be confused. Are you referring to URLs like the following as the same:

http://mysite.com/test/
http://mysite.com/test/index.html
http://www.mysite.com/test/default.html

These are NOT the same URLs. They may go to the same page but that is dependent on the way your server is configured.

The CRC detection prevents this from happening by actually looking at the contents of each page and determining if they are duplicates. Although ideally, you should not be linking to the same page with different names and URLs throughout your site, so that this would not be necessary.
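
A quick way to see that these really are three different URLs is to split them into host and path. This is only a small sketch using Python's standard `urllib.parse`; it says nothing about what the server returns for each one.

```python
from urllib.parse import urlsplit

urls = [
    "http://mysite.com/test/",
    "http://mysite.com/test/index.html",
    "http://www.mysite.com/test/default.html",
]

# Compare each URL by (host, path). Any difference makes it a distinct URL,
# even if the web server happens to return the same file for all of them.
distinct = {(urlsplit(u).netloc, urlsplit(u).path) for u in urls}
for host, path in sorted(distinct):
    print(host, path)
```

Whether the three URLs return identical content can only be decided by fetching and comparing the pages, which is what the CRC check does.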

Originally posted by deandean View Post

if i turn on CRC function...the page cannot be indexed, and the linkage in that page also not be indexed
if the another page's content is the same, i should have the linkage indexed inside the page.

You've made an assumption here. The problem might actually be that the page (or an identical copy of the page somewhere else) was indexed, but the links were not crawled for other reasons (the links are inside JavaScript, the links are external to the base URL, etc.). You should verify this first.

It would likely be much quicker if you can give us a URL to the site in question, and maybe even provide us with the ZCFG file, and tell us what page or link you are trying to index.
Leave a comment:
David replied

Mar-19-2009, 09:48 AM
it will be many same url appear after search

I think this is impossible. Each URL will appear at most once in the search results.
Leave a comment:
deandean replied

Mar-19-2009, 09:13 AM
Originally posted by wrensoft View Post

The concept isn't complex. The content on one page is the same as the content on another page. It is a duplicate.

But as suggested just turn off the CRC function if it is all too complex to understand.

If you are using Spider mode then links within your site are followed to find other pages.

yes~im using spider mode (follow links only)
if i turn off CRC function...it will be many same url appear after search
if i turn on CRC function...the page cannot be indexed, and the linkage in that page also not be indexed
if the another page's content is the same, i should have the linkage indexed inside the page.
now it can't pls help , thanks
Leave a comment:
David replied

Mar-19-2009, 08:58 AM
how does it different URL but the same page

The concept isn't complex. The content on one page is the same as the content on another page. It is a duplicate.

But as suggested just turn off the CRC function if it is all too complex to understand.

if the page already indexed, the linkage in this page also indexed

If you are using Spider mode then links within your site are followed to find other pages.
Leave a comment:
deandean replied

Mar-19-2009, 08:31 AM
Originally posted by wrensoft View Post

The duplicate page could have any URL. But the content will be the same.

Just turn off CRC in the Scan Options configuration window

I dont understand of "The duplicate page could have any URL. But the content will be the same."
how does it different URL but the same page
if the page already indexed, the linkage in this page also indexed...right?
nothing linkage in this page to be indexed
Leave a comment:
David replied

Mar-19-2009, 06:42 AM
The duplicate page could have any URL. But the content will be the same.

Just turn off CRC in the Scan Options configuration window
Leave a comment:
deandean replied

Mar-19-2009, 06:20 AM
Originally posted by wrensoft View Post

CRC = Cyclic redundancy check.
It detects duplicate pages on your site and prevents the duplicate being indexed.

You might have this page appearing at two different URLs. e.g.
http://content.abc.com/ContactList/index.htm
and
http://content.abc.com/ContactList/
or
http://content.abc.com/ContactList/index.html

But I cannot find any other successfully indexed page like "/ContactList/index.htm", "/ContactList/" or "/ContactList/index.html" in the index log.
Now I cannot index the links inside this page.
Leave a comment:
David replied

Mar-19-2009, 05:20 AM
CRC = Cyclic redundancy check.
It detects duplicate pages on your site and prevents the duplicate being indexed.

You might have this page appearing at two different URLs. e.g.
http://content.abc.com/ContactList/index.htm
and
http://content.abc.com/ContactList/
or
http://content.abc.com/ContactList/index.html
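
For anyone wanting to see the idea in code, here is a minimal sketch of content-based duplicate detection using CRC-32 from Python's standard `zlib` module. It only illustrates the concept and is not how the Indexer is implemented; the page contents and the "/About/" URL are made up for the example.

```python
import zlib

def find_duplicates(pages):
    """Return {skipped_url: url_already_indexed_with_the_same_content}."""
    first_seen = {}   # CRC-32 of page content -> first URL with that content
    skipped = {}
    for url, content in pages:
        crc = zlib.crc32(content.encode("utf-8"))
        if crc in first_seen:
            skipped[url] = first_seen[crc]   # same content, different URL
        else:
            first_seen[crc] = url
    return skipped

# Hypothetical page contents; the first two URLs serve the same file.
pages = [
    ("http://content.abc.com/ContactList/", "<html>Contact list</html>"),
    ("http://content.abc.com/ContactList/index.htm", "<html>Contact list</html>"),
    ("http://content.abc.com/About/", "<html>About us</html>"),
]

skipped = find_duplicates(pages)
for url, original in skipped.items():
    print(f"skipped {url} (duplicate of {original})")
```

The URL that is kept is simply whichever copy is reached first, so the skipped page can be the one you expected to see in the index.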
Leave a comment:
