Over the last few months we have been asked a few times how Zoom compares with expensive search solutions such as SLI and the Google Search Appliance in terms of 'learning'.
These expensive solutions come with impressive-sounding marketing blurb like, "Learning Search perpetually learns and improves the search experience based on visitor behavior - continually re-ranking search results based on aggregate click-through behavior."
Zoom doesn't attempt automatic machine learning, but it does give the webmaster control to optimise results.
Our argument is that machine learning only makes sense for a small number of web sites.
Learning isn't as good as it sounds, for a whole bunch of reasons. Some of them are:
- You need to regularly back up the state of the search index, as the state might change after each search.
- It is open to manipulation by visitors to the site (e.g. falsely clicking on particular results to raise their ranking). Spammers and people with a vested interest have an incentive to manipulate the results.
- It assumes people will always click on the "right" result, even if that result is on the 3rd page of results. There are many studies showing that this is not what happens in real life: people click on the top 3 results, and so reinforce the wrong (or at least sub-optimal) results [Joachims, Granka et al, SIGIR '05]. A small sketch of this feedback loop is given after this list.
- There is an implicit assumption that a large number of people are searching for the same search words. While this is the case on sites like Amazon and Google, it is often not the case on smaller sites. Search volumes on many sites never reach statistically significant numbers; you need hundreds of clicks per search term before the differences mean anything (a rough significance check is sketched after this list). And if you then update your site (e.g. add, modify or delete pages), all this data potentially becomes useless and you have to start the process again.
- The field of machine learning is in its infancy. Humans do a far better job of learning and optimising, meaning a webmaster can tweak a site, if required, to return results in an optimal order, or to recommend particular pages for particular searches. Having a machine attempt to do this (sub-optimally) only makes sense for enormous collections of sites, where human optimisation is not feasible because of the scale.
- The search function needs to return results that are not direct links to the documents, in order to track clicks. This means extra web traffic and slower browsing (a minimal click-tracking sketch is given after this list).
- Write access to a database of some sort is required to store the clicks. This means the entire function fails to work if the web host you are using doesn't support SQL (which is common on the cheaper hosting packages), or if your search function is on read-only media, such as a CD or DVD. Even if the host does support SQL, it can be complex to set up with the right permissions.
- Newer web spiders enter search terms into web forms. This means that if your search function is on the web, the search terms and search logs will get polluted with queries from automated spiders, not real users.
- You need to track and store user data (IP addresses) and the searches those users performed, for long periods. This raises privacy issues, like those Google has had.
- Machine learning can be resource intensive (high CPU, RAM and disk space requirements). Many of our smaller customers are on shared hosting packages where resources are constrained.
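
To make the position-bias point above concrete, here is a minimal Python sketch of re-ranking by aggregate clicks. It is not Zoom's code nor any vendor's actual algorithm, and the page names and click probabilities are invented for illustration. Relevance never enters the model; clicks depend only on position, so the accumulated click counts end up entrenching whatever started on top.

```python
# Minimal sketch of "learning" from aggregate click-through
# (hypothetical pages and probabilities, not any vendor's real algorithm).
import random

random.seed(1)

results = ["page-a.html", "page-b.html", "page-c.html"]
clicks = {page: 0 for page in results}

# Rough position bias: most visitors click rank 1 and few look further down,
# in line with eye-tracking studies such as Joachims, Granka et al, SIGIR '05.
CLICK_PROB_BY_RANK = [0.70, 0.20, 0.10]

def simulate_search(ranked):
    """One visitor: the chance of a click depends only on position, not relevance."""
    for rank, page in enumerate(ranked):
        if random.random() < CLICK_PROB_BY_RANK[rank]:
            clicks[page] += 1
            return

for _ in range(1000):
    # The "learning" step: re-rank by accumulated clicks before each search.
    ranked = sorted(results, key=lambda page: clicks[page], reverse=True)
    simulate_search(ranked)

print(sorted(clicks.items(), key=lambda kv: kv[1], reverse=True))
# The page that happened to be listed first keeps nearly all the clicks,
# so the re-ranking mostly just confirms the original order.
```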
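
The point about search volumes can be checked with some rough arithmetic. The sketch below applies a standard two-proportion z-test (the click and impression counts are hypothetical) to the click-through rates of two pages for the same query: a difference that looks large over 30 searches is indistinguishable from noise, and only becomes convincing with hundreds of searches per term.

```python
# Rough significance check on click-through rates (hypothetical counts).
from math import erf, sqrt

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test: is page A's click-through rate really higher than page B's?"""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    p_pool = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal approximation
    return round(z, 2), round(p_value, 4)

# 30 searches for a term: 40% vs ~27% click-through is not significant (p ≈ 0.27).
print(two_proportion_z_test(12, 30, 8, 30))
# The same rates over 300 searches per page are significant (p < 0.001).
print(two_proportion_z_test(120, 300, 80, 300))
```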
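
To illustrate the points above about indirect result links and database writes, here is a minimal sketch of what click tracking involves. The script name, table layout and use of SQLite are assumptions for the example, not taken from Zoom or any particular product: every result links to a tracking script, which must write a row to a database and then redirect the visitor to the real document, adding an extra request to every click and requiring writable storage on the server.

```python
# Minimal click-tracking sketch (hypothetical script name, table and database;
# not from any particular product).
import sqlite3
from urllib.parse import urlencode

# Requires writable storage on the server, which rules out read-only media
# such as a CD or DVD and is awkward on hosts without database support.
db = sqlite3.connect("clicks.db")
db.execute("""CREATE TABLE IF NOT EXISTS clicks
              (query TEXT, target TEXT,
               clicked_at TEXT DEFAULT CURRENT_TIMESTAMP)""")

def tracking_link(query, target):
    """What the results page emits instead of a plain link to the document."""
    return "/track?" + urlencode({"q": query, "url": target})

def handle_click(query, target):
    """What the tracking script does: log the click, then redirect (one extra
    round trip for the visitor on every result they open)."""
    db.execute("INSERT INTO clicks (query, target) VALUES (?, ?)", (query, target))
    db.commit()
    return "302 Found", [("Location", target)]

print(tracking_link("widgets", "/products/widgets.html"))
print(handle_click("widgets", "/products/widgets.html"))
```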