Monday, September 24, 2012

Design of web search engine

USING C :The design of the search engine can be basically done in C using the data structure PATRICIA TREE for indexing the terms.The md5 hash function and porter stemming algorithms can be found on net.The complexity lies in the type of documents indexed.If the search engine is indexing real world documents like web pages containing html data,the design would become  very complex in C.The design can be done in the following steps:
  • Remove the stop words and apply porter stemming for the documents need to be indexed.
  • The clean documents are indexed using the data structure patricia tree which internally uses the md5 hash function for calculating the 128 bit hash.(2^128 unique entries can be indexed)
  • Using the hash values generated by MD5 algorithm index the terms in each document maintaining the document ID,document frequency,term frequency which will be helpful in identifying the search terms while indexing.[For decreasing the index size without compromising with the performance of search engine these details are necessary]
  • The question arises at this juncture is...What is the index size?Is the index fit to  main memory?
  • The answer  lies in the number of documents we are indexing.The main memory is of course not sufficient if we are indexing large collection of data.
  • The solution is...writing the patricia tree ie., index to hard disk(secondary memory) and reading it from disk for reconstructing the tree while performing search or re-indexing the modifying documents based on different measures[time,page history].This similar to writing a tree to secondary memory ie., file and reading the file for reconstruction.
  • The search function includes traversing the tree and checking for posting lists of the search term.
        The efficiency of the design is depicted by the time taken to show the search results.We can include several features including snippets and pagination for SERP.  
Useful links:
MD5 hash function:gcc.gnu.org/svn/gcc/branches/cilkplus/libiberty/md5.c
Porter stemming function:www.cs.cmu.edu/~callan/Teaching/porter.c


USING PYTHON:Alternatively, you can use the Whoosh library in python for indexing the documents parsed by beautiful soup also a python library.You need to just take care of the documents to be indexed.

Firstly parse the html documents through BEAUTIFUL SOUP and obtain the schema ie title,body and relavant text need to be indexed.Then use the whoosh documentation for indexing the text parsed by beautiful soup.You can perform stemming,and all other preprocessing using the library functions available in WHOOSH.

If several indexes are available ie.,LUCENE(search engine index constructed using lucene) can be integrated with the WHOOSH index using HAYSTACK.Many real-world websites are using HAYSTACK to search the indices for a given query.

Pros an Cons:
As the index size grows the efficiency of the search engine falls drastically.An increase of 50% index size may slows down by factor of 0.8 of performance.This is very advantageous for research purposes and for comparing performances of search terms used in indexing of search engine.

http://packages.python.org/Whoosh/indexing.html/
http://pypi.python.org/pypi/Whoosh/
http://pypi.python.org/pypi/BeautifulSoup/

Useful links for integrating several search indexes for having a front end search engine:
http://haystacksearch.org/
http://groups.csail.mit.edu/haystack/

1 comment: