Friday 28 October 2011

How to Code your own Search Engine Like Google, Bing or Yahoo

This subject of this post has been upgraded with new algorithm for set intersection. Go February 2012 posts for a better solution.


Problem: 
How to code your own search engine like Google or Bing? How to design the search engine? The search engine should give results in run time Order of (length of search key words)
Input: 
100 html files which have a lot of english words in it. Here we focus on the words only. Parsing out the html tags can be done too.
Solution: 
Here I use Trie and Binary Tree data structures to design the core search engine. Data can be extracted from pages and the data structures can be built by another program called crawler. This can be combined with the spell suggestion program in the previous post to make it close to real life. 
Design considerations here:
The design for the search engine is shown below. The tries as explained in the previous post gives search results in the Order of length of key words. Here we have two words "search " and "engine". The trie has paths for each of these search terms and end in a list of HTML files that have these search terms in them. This list is maintained as a binary search tree. This makes it to find the set intersection of the files which have both the search terms! Any data structure that efficiently finds set intersection can be used.


Working:
1) Crawl the files. Take the words from the files and add them to the trie. At the trie leaf node set the list of files in a binary tree. 
2) Input the search terms.
3) For each term retrieve the set of files. Here it is available as binary search trees at the leaf nodes. Since we use a trie, for any number of files, I get the set of files for any given word proportional to the length of the search key word.
4) Find the set intersection of these and display to the user. Here a binary search tree is used to represent sets. Any set representation can be as long as the process of finding set intersection is done efficiently.


The following screen shot shows the search engine in action for the key words, "search", "engine" and "search engine"




The run-time:
To find a single word is around 39 microseconds on my computer using the netbeans profiler (as the trie in the previous blog post). The run time for my search engine is shown below. The search took only 62 microseconds. Now that is how Google is fast. As long as your search data structure and set data structure is chosen correct, you can have it as fast as Google or Bing. Plus you can add additional meta data such for paid pages, relevance to terms, page rank etc and make it better. As for memory you can use a better trie as mentioned in the previous post. 

No comments: