BANGALORE, INDIA: Many websites include a built-in search facility that lets users retrieve the data they are interested in. Developers generally incorporate into their websites the customized search APIs of popular search providers such as Google, Yahoo!, MSLive, and Amazon. These companies crawl the relevant websites and offer search over the documents of those sites as well as the World Wide Web. The embedded search facility also serves as an advertisement for the provider on those websites. As a matter of pride, however, many organizations would prefer to have their own search engine embedded in their website.
A decade ago, search engines such as AltaVista, Lycos, Yahoo, and Ask Jeeves (now Ask.com) were popular. Later, Google, with its sophisticated ranking strategy, delivered acceptable results for different types of user queries. However, Google's customized search service is a paid offering, and many sites may not be able to afford it. Other search engines such as cuil, guruji, khoj, and terrior then came up with their own ranking strategies, supporting multiple languages. Alongside these developments, Open Source search engines also emerged.
Nutch
Nutch is an Open Source search engine developed in Java on top of Lucene, which itself is a free Open Source information retrieval system. Nutch can be deployed in Internet or Intranet environments and can be customized for building small or large scale information retrieval systems supporting multiple languages.
Search Result by Nutch
Installing and configuring Nutch
The latest version of Nutch (ver 0.9) can be downloaded from http://www.apache.org/dist/lucene/nutc. Assume that the login is pcquest and the home folder is /home/pcquest. Create a folder named, say, mySearch, download the file nutch-0.9.tar.gz (size 68 MB) into it, extract the contents there, and then go to the folder /home/pcquest/mySearch/nutch-0.9/, which is the root folder of Nutch. Nutch now has to be configured, which involves two tasks:
1. Configuring the crawl filter: Edit the file conf/crawl-urlfilter.txt and change - to + in only one place, on the line after the comment "# skip everything else", so that it appears as:
# skip everything else
+.
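The filter edit described in step 1 can be sketched in Python as a small text transformation. This is only an illustration: the function name accept_all_urls and the sample filter text are assumptions made for the example, not part of Nutch, and the sketch works on an in-memory string rather than on the real conf/crawl-urlfilter.txt.

```python
def accept_all_urls(filter_text: str) -> str:
    """Flip the rule that follows the 'skip everything else' comment from '-' to '+'.

    Hypothetical helper for illustration; only the first non-blank line
    after the comment is changed, every other rule is left untouched.
    """
    lines = filter_text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip().lower()
        # Match the comment loosely so a misspelt "everthing" is still found.
        if stripped.startswith("#") and "skip" in stripped and "else" in stripped:
            for j in range(i + 1, len(lines)):
                if lines[j].strip():
                    if lines[j].startswith("-"):
                        lines[j] = "+" + lines[j][1:]
                    break
            break
    return "\n".join(lines) + "\n"


# Assumed sample content, shortened for the example.
SAMPLE_FILTER = "-^(file|ftp|mailto):\n\n# skip everything else\n-.\n"

if __name__ == "__main__":
    print(accept_all_urls(SAMPLE_FILTER))
```

Because only the line after the comment is rewritten, the earlier exclusion rules in the file keep their meaning, and running the function a second time leaves the text unchanged.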
2. Modifying the Nutch configuration: This step specifies the folder containing the crawled data, which enables the Nutch searcher to search the crawled web data. Initially, the file conf/nutch-site.xml does not contain any configuration details. We have to modify it to include the target folder that contains the crawled data. Add the following lines between the <configuration> and </configuration> tags:
<property>
  <name>searcher.dir</name>
  <value>/home/pcquest/mySearch/nutch-0.9/myCrawled</value>
  <description>Path to the crawled data of your web site</description>
</property>
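For readers who prefer to script this change, the following is a minimal sketch that adds such a property to a Hadoop-style configuration document using Python's standard xml.etree.ElementTree module. The helper name add_property and the empty sample document are assumptions for illustration. Note that ElementTree may not preserve comments or processing instructions from the original file, so editing conf/nutch-site.xml by hand remains the simpler route.

```python
import xml.etree.ElementTree as ET


def add_property(config_xml: str, name: str, value: str, description: str = "") -> str:
    """Append a <property> entry to a <configuration> document and return the new XML.

    Hypothetical helper for illustration; it works on a string, not on a file.
    """
    root = ET.fromstring(config_xml)
    prop = ET.SubElement(root, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    if description:
        ET.SubElement(prop, "description").text = description
    return ET.tostring(root, encoding="unicode")


# Assumed starting point: a configuration file with no entries yet.
EMPTY_SITE_CONFIG = "<configuration></configuration>"

if __name__ == "__main__":
    print(add_property(
        EMPTY_SITE_CONFIG,
        "searcher.dir",
        "/home/pcquest/mySearch/nutch-0.9/myCrawled",
        "Path to the crawled data of your web site",
    ))
```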
The file conf/nutch-default.xml should also be modified to include the agent name between the <value> and </value> tags of the http.agent.name property. We use 'pcquest' as the agent name, and the final entry looks like:
<property>
  <name>http.agent.name</name>
  <value>pcquest</value>
</property>
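Filling in the value of a property that already exists can be sketched the same way. The example below is a hedged illustration, not a Nutch tool: the helper name set_property_value and the sample document are assumptions, and the code operates on an in-memory string instead of the real conf/nutch-default.xml.

```python
import xml.etree.ElementTree as ET


def set_property_value(config_xml: str, name: str, value: str) -> str:
    """Set the <value> of the property called `name` and return the new XML.

    Hypothetical helper for illustration; raises KeyError if no such property exists.
    """
    root = ET.fromstring(config_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                # Create the <value> element if the entry lacks one.
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            return ET.tostring(root, encoding="unicode")
    raise KeyError(name)


# Assumed sample: an agent-name entry whose value is still empty.
SAMPLE_DEFAULT_CONFIG = (
    "<configuration>"
    "<property><name>http.agent.name</name><value></value></property>"
    "</configuration>"
)

if __name__ == "__main__":
    print(set_property_value(SAMPLE_DEFAULT_CONFIG, "http.agent.name", "pcquest"))
```

Raising an error for an unknown property name, rather than silently adding a new entry, keeps a typo in the name from going unnoticed.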
Copyright © CyberMedia India Online Ltd. All rights reserved. Usage of content from web site is subject to Terms and Conditions.