Embedding open source search engine

CIOL Bureau

BANGALORE, INDIA: A common feature of websites is an inbuilt search facility for retrieving data of interest to the user. Developers generally incorporate into their websites the customized search APIs of popular search engines like Google, Yahoo!, MSLive, Amazon, etc. These companies crawl the relevant websites and provide search over the documents of those sites as well as the worldwide web; the embedded search box also acts as an advertisement for them. As a matter of pride, many organizations would prefer to have their own search engine embedded in their website.


A decade ago, many search engines like AltaVista, Lycos, Yahoo! and AskJeeves (now ask.com) were popular. Later, Google, with its sophisticated ranking strategy, ensured acceptable results for different types of user queries. But Google's customized search service is a paid offering, and many sites may not be able to pay for availing the facility. Subsequently, search engines like Cuil, Guruji, Khoj and Terrier came up on the web with their own ranking strategies, supporting multiple languages. Alongside these developments, open source search engines also emerged.

Nutch
Nutch is an open source search engine developed in Java on top of Lucene, which is itself a free open source information retrieval library. Nutch can be deployed in Internet or intranet environments and can be customized for building small or large scale information retrieval systems supporting multiple languages.

Prerequisites
1. Java (JDK and JRE) should be installed, and the JAVA_HOME and JRE_HOME environment variables should be set.
2. Set the path to the current Ant build, if not done already. Apache Ant is a Java-based build tool that builds a project using XML-based configuration files. Its current version (1.7.1) can be downloaded from www.apache.org/dist/ant/binaries/apache-ant-1.7.1-bin.tar.gz
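As an illustration, the environment setup in the steps above might look like the following; the install locations here are assumptions, so adjust them to your system:

```shell
# minimal sketch of the prerequisite setup (paths are assumptions)
export JAVA_HOME=/usr/lib/jvm/java
export JRE_HOME="$JAVA_HOME/jre"
export ANT_HOME="$HOME/apache-ant-1.7.1"
export PATH="$JAVA_HOME/bin:$ANT_HOME/bin:$PATH"

# the first PATH entry should now point at the JDK binaries
echo "$PATH" | tr ':' '\n' | head -n 1   # prints: /usr/lib/jvm/java/bin
```

With this in place, `java -version` and `ant -version` should report the installed JDK and Ant 1.7.1 respectively.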

[Figure: Search result by Nutch]

Installing and configuring Nutch
The latest version of Nutch (ver 0.9) can be downloaded from http://www.apache.org/dist/lucene/nutc. Assume that the login is pcquest and the home folder is /home/pcquest. Create a folder, say mySearch, download the file nutch-0.9.tar.gz (size 68 MB) into it and extract the contents there; then go to the folder /home/pcquest/mySearch/nutch-0.9/, which is the root folder of Nutch. Now Nutch has to be configured, which involves two tasks:

1. Configuring the crawl filter: Edit the file conf/crawl-urlfilter.txt and change '-' to '+' at only one place, after the line '# skip everything else', so that it appears as:

# skip everything else
+.
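The one-character change in the filter file can also be applied with sed; the sketch below runs it against a stand-in copy of the file (in a real install, point it at conf/crawl-urlfilter.txt in the Nutch root):

```shell
# stand-in copy of the tail of conf/crawl-urlfilter.txt
cat > crawl-urlfilter.txt <<'EOF'
# skip everything else
-.
EOF

# flip the catch-all "skip" rule (-.) into an "accept" rule (+.)
sed -i 's|^-\.$|+.|' crawl-urlfilter.txt

tail -n 1 crawl-urlfilter.txt   # prints: +.
```

The default `-.` rule rejects every URL not matched earlier; changing it to `+.` makes the crawler accept them instead.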

2. Modification to the Nutch configuration: This names the folder containing the crawled data and enables the Nutch searcher to search the crawled web data. Initially, the file conf/nutch-site.xml does not contain any configuration details. We have to modify it by adding the target folder that contains the crawled data. Add the following lines between the <configuration> and </configuration> tags:

<property>
  <name>searcher.dir</name>
  <value>/home/pcquest/mySearch/nutch-0.9/myCrawled</value>
  <description>Path to the crawled data of your web site</description>
</property>

The file conf/nutch-default.xml should be modified to include the agent name between the <configuration> and </configuration> tags. We use 'pcquest' as the agent name, and the final entry looks like:

<property>
  <name>http.agent.name</name>
  <value>pcquest</value>
</property>

Now Nutch is ready for crawling and indexing your website.

Crawling, indexing and searching Website
Nutch first crawls and indexes the website, and is then ready to serve user queries by searching the indexed data.

1. Crawling and indexing websites: In the Nutch folder /home/pcquest/mySearch/nutch-0.9/, make a directory named urls and, inside it, create a text file named seed_urls containing the list of URLs, one per line (we used http://www.iitkgp.ac.in/). Then build the system using the command 'ant && ant war'. Now remove ROOT* from the webapps folder of the Apache Tomcat installation, copy nutch-0.9.war into the webapps folder as ROOT.war, and restart the Tomcat server. Now the system can perform crawling using the following command:

$ ./bin/nutch crawl urls/seed_urls -depth 2 -threads 10 -dir myCrawled

It should be ensured that the folder myCrawled does not already exist. The above command creates the folder myCrawled and stores the crawled and indexed data in it; if the folder already exists, the crawler terminates. The parameters depth and threads are user defined: depth is the level to which the websites are crawled, and threads is the number of concurrent crawling processes. Once crawling is over, searching can start through the Nutch user interface.
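The preparation for the crawl can be sketched as a short shell session; the build and crawl commands are shown as comments since they need a full Nutch 0.9 install (the war file's build path is an assumption):

```shell
# create the seed list as described above
mkdir -p urls
printf 'http://www.iitkgp.ac.in/\n' > urls/seed_urls
cat urls/seed_urls   # prints: http://www.iitkgp.ac.in/

# then, from the Nutch root (requires a full Nutch 0.9 install, so not run here):
#   ant && ant war
#   cp build/nutch-0.9.war $TOMCAT_HOME/webapps/ROOT.war   # path is an assumption
#   bin/nutch crawl urls/seed_urls -depth 2 -threads 10 -dir myCrawled
```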

2. Searching for User Query among Indexed documents

The deployment of the search engine can be tested using the address http://localhost:8080/. This loads the default Nutch user interface (below), which can be modified to fit in your website.
Instead of using the default interface, a form along the following lines can be used to include a search box with a submit button in your website (the search.jsp page and its query parameter are the Nutch 0.9 defaults):

<form method="get" action="http://10.5.16.234:8080/search.jsp">
<input type="text" name="query" size="44">
<input type="submit" value="Search">
</form>

Here 10.5.16.234 is the IP address of our computer running Tomcat. The resulting search box is shown at the bottom.
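Submitting the search form issues a plain GET request. The sketch below shows the URL such a request produces for a sample query; the IP and port are taken from the article, and replacing spaces with '+' is a simplification of full URL encoding:

```shell
QUERY='open source search'

# replace spaces with '+' (a simplification of full URL encoding)
ENC=$(printf '%s' "$QUERY" | tr ' ' '+')

printf 'http://10.5.16.234:8080/search.jsp?query=%s\n' "$ENC"
# prints: http://10.5.16.234:8080/search.jsp?query=open+source+search
```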

Conclusion
The major bottleneck for search engines is the relevance of the retrieved results. Nutch's performance can be tuned either by web developers implementing their own ranking algorithm or by tuning the boosting parameters available in the current build of the Nutch system. Nutch also provides facilities for including various user-defined plugins, and hence cross-lingual information access is possible with multiple-language support.

Nutch internally calls Hadoop, which is built on map-reduce technology and is capable of operating in a distributed fashion. Nutch thus stands out as a powerful open source library that can be used for solving search-related issues in many classical real-world problems of machine learning and information retrieval. Apart from Nutch, users may also try Terrier, another open source search engine with a good ranking strategy.


R. Rajendra Prasath and Sumit Goswami, IIT Kharagpur
