Now Nutch is ready for crawling and indexing your website.
Crawling, indexing and searching a website
Nutch first crawls and indexes websites; it is then ready to serve user queries by searching the indexed data.
1. Crawling and indexing websites: In the Nutch folder, /home/pcquest/mySearch/nutch-0.9/, make a directory named urls and, inside it, create a text file named seed_urls containing the list of URLs to crawl, one per line (we used http://www.iitkgp.ac.in/). Then build the system with the command "ant && ant war". Next, remove ROOT* from the webapps folder of the Apache Tomcat installation, copy nutch-0.9.war into that webapps folder under the name ROOT.war, and restart the Tomcat server. The system can now perform the crawl using the following command:
$ ./bin/nutch crawl urls/seed_urls -depth 2 -threads 10 -dir myCrawled
Make sure that the folder myCrawled does not already exist: the command creates it and stores the crawled and indexed data in it, and the crawler terminates if the folder is already present. The values of depth and threads are chosen by the user. depth sets how many levels of links, starting from the seed URLs, are followed, and threads sets the number of fetcher threads that run concurrently. Once the crawl is over, the index can be searched through the Nutch user interface.
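The steps above can be sketched as a few small shell functions. This is only an illustration under stated assumptions: the function names prepare_seeds, deploy_war and run_crawl are hypothetical helpers, not part of Nutch or Tomcat, and the location of the built war file is passed in by the caller rather than assumed. Restarting Tomcat is left to the reader.

```shell
#!/bin/sh
# Sketch only: the helper names below are assumptions made for illustration.

# Write the seed list (one URL per line) to urls/seed_urls in the Nutch folder.
prepare_seeds() {
    nutch_home=$1
    shift
    mkdir -p "$nutch_home/urls" || return 1
    printf '%s\n' "$@" > "$nutch_home/urls/seed_urls"
}

# Replace Tomcat's ROOT web application with the given Nutch war file.
deploy_war() {
    war_file=$1
    tomcat_home=$2
    [ -f "$war_file" ] || { echo "no such war file: $war_file" >&2; return 1; }
    [ -d "$tomcat_home/webapps" ] || { echo "no webapps folder in $tomcat_home" >&2; return 1; }
    rm -rf "$tomcat_home/webapps"/ROOT*
    cp "$war_file" "$tomcat_home/webapps/ROOT.war"
}

# Start the crawl only when the output folder does not exist yet,
# since the crawler terminates if it finds the folder already present.
run_crawl() {
    nutch_home=$1
    crawl_dir=$2
    depth=${3:-2}
    threads=${4:-10}
    if [ -e "$nutch_home/$crawl_dir" ]; then
        echo "refusing to crawl: $nutch_home/$crawl_dir already exists" >&2
        return 1
    fi
    (cd "$nutch_home" && ./bin/nutch crawl urls/seed_urls \
        -depth "$depth" -threads "$threads" -dir "$crawl_dir")
}
```

With these definitions loaded, the crawl described above would correspond to:
$ prepare_seeds /home/pcquest/mySearch/nutch-0.9 http://www.iitkgp.ac.in/
$ run_crawl /home/pcquest/mySearch/nutch-0.9 myCrawled 2 10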
2. Searching for User Query among Indexed documents
The deployment of the search engine can be tested at the address http://localhost:8080/. This loads the default Nutch user interface (below), which can be modified to fit your website. Instead of the default interface, the following code can be used to include a search box with a submit button in your website:
value="Search">
Here 10.5.16.234 is the IP address of our computer running tomcat. The resulting search box is shown at the bottom.
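A search box of this kind ends up requesting a results page from the Nutch web application. As a rough sketch, assuming the stock search.jsp page with its query parameter and Tomcat listening on port 8080 as in the test address above, the request URL for a query could be built as follows. The function names urlencode and search_url are illustrative only and are not part of Nutch.

```shell
#!/bin/sh
# Sketch only: urlencode and search_url are hypothetical helper names.
# Assumes ASCII queries; other characters would need byte-wise encoding.

# Percent-encode everything except RFC 3986 unreserved characters.
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # string without its first character
        c=${s%"$rest"}       # the first character itself
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
}

# Build the results-page URL for a query; the port defaults to 8080.
search_url() {
    host=$1
    query=$2
    port=${3:-8080}
    printf 'http://%s:%s/search.jsp?query=%s\n' \
        "$host" "$port" "$(urlencode "$query")"
}
```

For example:
$ search_url 10.5.16.234 'open source'
prints http://10.5.16.234:8080/search.jsp?query=open%20source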
Conclusion
The major bottleneck of search engines is the relevance of the retrieved results. Nutch's ranking performance can be tuned either by implementing a custom ranking algorithm or by adjusting the boosting parameters available in the current build of Nutch. Nutch also supports user-defined plugins, so cross-lingual information access with support for multiple languages is possible.
Nutch internally uses Hadoop, which is built on the MapReduce model and can operate in a distributed fashion. This makes Nutch a powerful open source library for solving search-related problems in many classical real-world machine learning and information retrieval tasks. Apart from Nutch, users may also try Terrier, another open source search engine with a good ranking strategy.
R. Rajendra Prasath and Sumit Goswami, IIT Kharagpur