Google keeps tweaking its search engine, and now it is more important than ever to better understand its inner workings.
Google lured Mr. Manber from Amazon last year. When he arrived and began to look inside the company's black boxes, he says, he was surprised that Google's methods were so far ahead of those of academic researchers and corporate rivals.
While Google closely guards its secret sauce for obvious reasons, it is still possible to build a pretty solid picture of its engine. To do this, we are going to start by carefully dissecting Google's original engine: how it was conceived back in 1998. Although a newborn baby, it had all the basic elements it needed to survive in the web world.
The plan is to study how it worked originally, and follow all the published research papers and patents in order to put together the missing pieces. It is going to be very interesting.
Google has added and improved many things over the years. The original paper only describes the workings of the web search engine. Missing are features such as the ability to search news, images, documents (PDF, Word, etc.), video, products, addresses, books, patents, maps, blogs, etc.
Also missing are substantial improvements such as local search, mobile search, personalized search, universal search, the supplemental index, freshness, spam detection and PageRank improvements. Something that will be hard to know is how Google uses the data it collects through its other services, like Google Toolbar, Google Analytics, Google AdSense, DoubleClick, Gmail, Gtalk, Feedburner, etc. There is a lot of information there that can be used both for better ads and for better search results.
No matter the type of search you are conducting, conceptually, search engines have three key components: the crawler, the indexer and the searcher.
The crawler's (also known as a search engine robot) job is to collect all the information that will later be searched, whether it's images, video, text or RSS feeds. These documents are stored for later processing by the indexer module. Webmasters and site owners can control how crawlers access their websites via a robots.txt file and the robots exclusion protocol. In this file you basically tell the crawler what pages or sections it is not allowed to crawl. I posted an entry about this several days ago.
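Here is a minimal sketch, in Python, of how a crawler might honor the exclusion protocol using the standard library's robots.txt parser. The rules and the user-agent string are made-up placeholders, not anything Googlebot actually does.

```python
import urllib.robotparser

# A minimal sketch of honoring the robots exclusion protocol.
# The rules and user-agent below are hypothetical placeholders.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)  # normally you would set_url(...) and read() the live file

print(rp.can_fetch("MyCrawler/1.0", "http://www.example.com/private/page.html"))  # False
print(rp.can_fetch("MyCrawler/1.0", "http://www.example.com/public/page.html"))   # True
```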
The indexer module is the one doing the heavy lifting. It has the daunting task of carefully organizing the information collected by the crawler. The power of the search engine lies in this specific task: the better classified the information is, the faster and better the search. Search engines conceptually classify documents much the way you file documents in a cabinet. Without some sort of labeling, you would probably waste a lot of time finding your bank statements, notes, etc. Search engines label documents in a way that makes it easy to find them later by words or phrases (also known as keywords). In the case of text and similar documents, the indexer breaks the document down into words and collects some additional information about those words, such as the frequency of each word in the document.
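To make the idea concrete, here is a toy indexer in Python. It is purely illustrative: it just breaks text into words and counts frequencies in a dictionary, whereas the real system also records positions, font size and capitalization, and stores everything in compact on-disk structures.

```python
import re
from collections import Counter

def index_document(doc_id, text):
    """Toy indexer: break a document into words and record each word's frequency.

    A real indexer also records positions and presentation details,
    and writes the result to compact data structures, not a Python dict.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {"doc_id": doc_id, "term_freq": Counter(words)}

print(index_document(1, "Google indexes the Web. The Web is big."))
# {'doc_id': 1, 'term_freq': Counter({'the': 2, 'web': 2, 'google': 1, ...})}
```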
The searcher module is the one that takes the user's search, cleans it up to remove ambiguities, misspellings, etc., finds the documents in the index that most closely match the search, and ranks them according to the current ranking formula. The ranking formula is the most closely guarded secret of all major commercial search engines.
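A toy searcher might look like the sketch below. It assumes a hypothetical inverted index (a plain dictionary mapping words to docIDs and frequencies) and ranks by nothing more than summed term frequency; the real ranking formula is, of course, the part nobody outside Google knows.

```python
def search(query, inverted_index):
    """Toy searcher: normalize the query, look up each term in the inverted
    index, and rank matching docIDs by a naive score (summed term frequency)."""
    terms = query.lower().split()
    scores = {}
    for term in terms:
        for doc_id, freq in inverted_index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + freq
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical inverted index: term -> {docID: term frequency}
inverted_index = {
    "search": {1: 3, 2: 1},
    "engine": {1: 2, 3: 5},
}
print(search("Search Engine", inverted_index))  # [1, 3, 2]
```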
These basic components remain the same nowadays, but as they say, the devil is in the details. Today, Google's inner workings are far more complex than what I am going to explain, but the basic principles are the same. I will quote the original paper as necessary.
There is quite a bit of recent optimism that the use of more hypertextual information can help improve search and other applications [Marchiori 97] [Spertus 97] [Weiss 96] [Kleinberg 98]. In particular, link structure [Page 98] and link text provide a lot of information for making relevance-related assessments and quality filtering. Google makes use of both link structure and anchor text (see Sections 2.1 and 2.2).
One notable improvement Google brought to the commercial search marketplace was the use of link structure and anchor/link text to improve the quality of results. This proved to be a significant factor that helped fuel their growth. Today, these elements remain significant, but Google uses very sophisticated filters to detect most attempts at manipulation. Proof that they remain significant is the success of the recent Google bombs.
To support novel research uses, Google stores all of the actual documents it crawls in compressed form.
Here is the reference to the caching feature we are all accustomed to using.
Now, let's see how they define PageRank: the measure of how important or high quality a page is for Google's search engine.
…a page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well cited from many places around the web are worth looking at. Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not high quality, or was a broken link, it is quite likely that Yahoo's homepage would not link to it. PageRank handles both these cases and everything in between by recursively propagating weights through the link structure of the web.
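The paper defines PageRank with the formula PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where d is a damping factor (0.85 in the paper) and C(T) is the number of outgoing links on page T. Here is a small Python sketch that iterates that formula over a made-up three-page link graph; it only shows the recursive propagation of weights, not how Google computes it at web scale.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a tiny link graph.

    `links` maps each page to the list of pages it links to.
    Uses the simple recursive definition from the 1998 paper:
    PR(A) = (1 - d) + d * sum(PR(T) / C(T) for each page T linking to A).
    """
    pages = list(links)
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = (pr[p] / len(links[p]) for p in pages if page in links[p])
            new_pr[page] = (1 - damping) + damping * sum(incoming)
        pr = new_pr
    return pr

# Hypothetical three-page web: A and B link to each other, and both link to C.
links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}
print(pagerank(links))
```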
Here is a clear description of why they use anchor text for searching.
The text of links is treated in a special way in our search engine. Most search engines associate the text of a link with the page that the link is on. In addition, we associate it with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases. This makes it possible to return web pages which have not actually been crawled. Note that pages that have not been crawled can cause problems, since they are never checked for validity before being returned to the user. In this case, the search engine can even return a page that never actually existed, but had hyperlinks pointing to it. However, it is possible to sort the results, so that this particular problem rarely happens.
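The key idea is that anchor text gets credited to the page the link points to, not just the page it appears on. A quick Python sketch of that association, with made-up docIDs and link text:

```python
# Sketch of associating anchor text with the *target* document.
# anchors: (source docID, target docID, link text) tuples, as in the anchors file.
anchors = [
    (1, 42, "best pizza in town"),
    (2, 42, "pizza reviews"),
]

# Toy forward index: docID -> list of words credited to that document.
forward_index = {}
for source_id, target_id, text in anchors:
    # The anchor words describe the page being linked TO, so index them there.
    forward_index.setdefault(target_id, []).extend(text.lower().split())

print(forward_index)  # {42: ['best', 'pizza', 'in', 'town', 'pizza', 'reviews']}
```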
Now, let's read about on-page elements that Google considered that were not in regular use back then: proximity, capitalization and font weight, and page caching.
Aside from PageRank and the use of anchor text, Google has several other features. First, it has location information for all hits and so it makes extensive use of proximity in search. Second, Google keeps track of some visual presentation details such as font size of words. Words in a larger or bolder font weigh heavier than other words. Third, full raw HTML of pages is available in a repository.
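Conceptually, a "hit" is a record of a word occurrence together with its position and some presentation details. The sketch below shows only that conceptual content; the paper actually packs each hit into two bytes with a hand-optimized encoding, and nothing here reflects the real format.

```python
from collections import namedtuple

# Conceptual content of a "hit" (the paper packs this into 2 bytes per hit).
Hit = namedtuple("Hit", ["word_id", "position", "font_size", "capitalized"])

def make_hits(doc_words, lexicon):
    """Turn a parsed document, given as (word, font_size, capitalized) tuples,
    into hit records, assigning wordIDs from a growing lexicon."""
    hits = []
    for position, (word, font_size, capitalized) in enumerate(doc_words):
        word_id = lexicon.setdefault(word.lower(), len(lexicon))
        hits.append(Hit(word_id, position, font_size, capitalized))
    return hits

lexicon = {}
doc = [("Search", 4, True), ("engine", 3, False), ("anatomy", 3, False)]
print(make_hits(doc, lexicon))
```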
Now let's take a big-picture view of how everything fits together. This is very technical, but I will try to explain it the best I can.
In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.
The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.
The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
Google uses distributed crawlers/downloaders. If you have ever looked at your server log files, you will have noticed that when Googlebot is visiting your site the hits come from different IPs. That is because the crawling is distributed among several computers. A URL server feeds the crawlers the URLs to download and coordinates the crawling effort.
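The division of labor can be pictured as a central queue of URLs (standing in for the URLserver) that several crawler workers pull from. The toy sketch below simulates this on one machine with threads and does not actually fetch anything; the URLs and worker count are placeholders.

```python
import queue
import threading

# Toy picture of the URLserver/crawler split on a single machine:
# a shared queue stands in for the URLserver, worker threads for the crawlers.
url_server = queue.Queue()
for url in ["http://example.com/a", "http://example.com/b", "http://example.com/c"]:
    url_server.put(url)

def crawler(worker_id):
    while True:
        try:
            url = url_server.get_nowait()
        except queue.Empty:
            return
        # A real crawler would fetch the page here and hand it to the storeserver.
        print(f"crawler {worker_id} fetched {url}")
        url_server.task_done()

workers = [threading.Thread(target=crawler, args=(i,)) for i in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```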
All the fetched pages are sent to the storeserver for compression and storage, and each new URL is assigned an ID (docID). For computers, it is easier and more efficient to use numbers to refer to things.
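A sketch of the storeserver idea: assign an incremental docID to each new URL and keep the page content compressed. The paper mentions that zlib was chosen for its speed, so the sketch uses Python's zlib module; everything else here (the dictionaries, the function name) is just illustrative.

```python
import zlib

repository = {}   # docID -> compressed page content
doc_ids = {}      # URL -> docID

def store_page(url, html):
    """Toy storeserver: assign a docID to a new URL and store the page compressed.
    (The original paper mentions zlib was chosen for its speed.)"""
    doc_id = doc_ids.setdefault(url, len(doc_ids))
    repository[doc_id] = zlib.compress(html.encode("utf-8"))
    return doc_id

doc_id = store_page("http://example.com/", "<html><body>Hello web</body></html>")
print(doc_id, len(repository[doc_id]), "bytes stored")
print(zlib.decompress(repository[doc_id]).decode("utf-8"))
```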
The indexer does some heavy work:
- Reads, uncompresses and parses documents, converting each one into hits (word occurrences).
- Creates a partially sorted forward index, distributed into "barrels".
- Creates an anchors file (link text, plus where each link points from and to). The URLresolver then fixes relative URLs and assigns docIDs.
- Includes anchor text in the forward index, but under the docID of the page the link points to. In other words, the text of the link is associated with the document it points to.
- Maintains a links database (pairs of docIDs) used to compute PageRanks.
- Generates a lexicon, the list of all distinct words in the index.
Basically, the forward index allows you to find the words of a document given its docID. To be useful for searching, this needs to be inverted, i.e., you need to find documents by their words. The sorter does this additional step, resorting the barrels by wordID to create an inverted index that maps wordIDs to docIDs. It also produces a list of wordIDs and their offsets into the inverted index, and DumpLexicon uses this list to update the lexicon used by the searcher. A toy version of the inversion step is sketched below.
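This sketch uses plain words instead of wordIDs and dictionaries instead of barrels; it only shows why the forward index has to be flipped before it can answer "which documents contain this word?"

```python
def invert(forward_index):
    """Turn a forward index (docID -> list of words) into an inverted index
    (word -> {docID: term frequency}). A toy version of the sorter's job."""
    inverted = {}
    for doc_id, words in forward_index.items():
        for word in words:
            postings = inverted.setdefault(word, {})
            postings[doc_id] = postings.get(doc_id, 0) + 1
    return inverted

forward_index = {
    1: ["google", "search", "engine"],
    2: ["search", "engine", "anatomy", "engine"],
}
print(invert(forward_index))
# {'google': {1: 1}, 'search': {1: 1, 2: 1}, 'engine': {1: 1, 2: 2}, 'anatomy': {2: 1}}
```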
Finally, the searcher combines the lexicon, the inverted index and the PageRanks to answer queries.
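How exactly the text-match (IR) score and PageRank are combined is not spelled out in the paper, so the sketch below simply multiplies them as an assumed stand-in, just to show that the final order can differ from either signal alone. All the numbers are made up.

```python
def final_rank(ir_scores, pageranks):
    """Toy combination of a text-match (IR) score with PageRank.
    The paper says the two are combined, but not how; multiplying them here
    is purely an assumption for illustration."""
    return sorted(ir_scores,
                  key=lambda d: ir_scores[d] * pageranks.get(d, 0.15),
                  reverse=True)

ir_scores = {1: 0.9, 2: 0.7, 3: 0.8}   # hypothetical text-match scores
pageranks = {1: 0.2, 2: 1.5, 3: 0.4}   # hypothetical PageRank values
print(final_rank(ir_scores, pageranks))  # [2, 3, 1]
```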
Next, I'll describe each of the processes in more detail. Can't wait? Read the document yourself and draw your own conclusions 🙂
Jez
June 15, 2007 at 7:09 am
How important do you think it is to understand this stuff, or rather how much of an advantage does it give you in terms of optimising a site? It's interesting stuff, but very conceptual. I am not sure how much practical benefit can be derived from this information. Clearly you think this is worthwhile... how much did you gain from researching this?
Hamlet Batista
June 15, 2007 at 9:04 am
Jez, I am glad you are asking this. It all depends on how successful you want to be. If you are not targeting very competitive phrases, you can probably rank naturally by just creating useful content and by getting other blogs and sites to link to it. If you want to compete with the big boys and make a lot of money, every step you take ahead counts. These types of documents are scary, but they provide all the valuable insight that tells you what is the right direction to take. For example, if you had read this post before, you wouldn't have thought that the robots.txt was the problem in John's case. When learning SEO, you will notice that a lot of people have different opinions on the same topic. That is why I like to read from the sources and then make my own judgments. What practical use do I get from this post? I can more easily solve SEO-related problems for my sites. I can tell whether the problem is on my site or on the search engine's side, what is causing it and how to solve it. For example, if you find out that Google is not going to look deeper than X clicks into your site, then you make those pages accessible via a sitemap, etc. Please note that you need to learn a lot of different things to succeed online. SEO and understanding search engines is just one of them. Learning about social media is an excellent complement. I can't post about that yet, as I am learning about it myself. I try to post on several related topics (PPC, conversions, affiliate marketing, etc.). My plan is to work on illustrations to make this content easier to digest. I hope this encourages you to keep learning.
Jez
June 17, 2007 at 1:43 pm
Hi Hamlet, Thanks for the reply, personally I like the level this blog is pitched at, which is why I read it... illustrations would be nice, but I like the fact you are posting scripts and more in depth info... but not too abstract... On that note, why do you choose Python? I only ever used it at University... I liked the fact it is so clean (using indentation), but thought I would benefit more by learning more commercially used languages... Finally, I am in the process of creating multiple sites around a similar theme. I have unique content for all sites, and will host on different servers in Europe and the US, however the whois for each domain will show my name (The company I used does not allow me to hide this info). Is the common whois likely to make much difference when I begin cross linking the sites? I know that whois info is hidden by search engine spammers as it is used to discover / uncover SE spam networks... do you think this applies to WH sites too? Jez
Hamlet Batista
June 18, 2007 at 5:40 am
Jez, Thanks for your continued support. I will try to keep my developer and non-developer readers happy, and adequately informed. I chose Python for educational purposes, and because I am lazy. With Python you only need to think about the problem at hand, not about creating special data structures for your needs, etc. Your question about the domains and cross linking requires a longer explanation. I will write a post about that shortly.
Reg nbs-seo
March 6, 2012 at 3:06 pm
Hi Hamlet. I came across your page when I was doing research for a discussion on LinkedIn about the value of links in regards to SERPs. I hold that with the removal of PageRank's influence, links now do not affect SERPs. My opponents hold that Google has a separate process to determine the worth of links for SERPs. Your description of the process is how I imagined it would be, with anchor text being added to the factors that influence the value of a link and as a pointer to page content. PageRank is applied as a separate modifier after the page's position in the lexicon is established. http://goo.gl/LYgPD is our discussion if you would like to comment. best, Reg http://nbs-seo.com