Talk:Search engine indexing

From Wikipedia, the free encyclopedia
Jump to: navigation, search

About this article[edit]

The goal is to provide an authoritative resource on the architecture, behavior, major processes and challenges in search engine indexing. This should be described for the general audience of the web. Please refrain from adding commercial references to support a NPOV.

TODO[edit]

  • fill out the list of references using correct formats
  • add back in the some of the content removed on the 9th, but in a correct fashion
  • remove annotational garbage, opinionated content
  • integrate with templates
  • get rid of weasel words
  • harmonize with information extraction, controlled vocab


Possible future content[edit]

Search engine sizes[edit]

This comes from http://blog.searchenginewatch.com/blog/041111-084221, so I have not included it at the moment in the article, as I do not want to do anything illegal, and am not sure this is the best reference. The goal is to show the sizes, at least at some point in time, of the number of pages indexed, to help get a feel for the size. The understanding and reference to sizes in application is important to understand the technological challenge and the rationale behind the intense research in compression and forms of indexing and search engine architectures. Josh Froelich 16:44, 15 December 2006 (UTC)

Search Engine Reported Size Page Depth
Google 8.1 billion 101K
MSN 5.0 billion 150K
Yahoo 4.2 billion
(estimate)
500K
Ask Jeeves 2.5 billion 101K+

Controlled vocab[edit]

  • maybe provide link to full text search topic, harmonize with its contents
  • explain controlled vocab style indexing, start lists, weighted lists, other techniques, which are indexed differently in the sense that a specifialized inverted index is created that is not data driven. the keywords (in keyword based controlled vocab searching) are like classes in a classification model or an associate array or map to keywords and specific full text terms/articles.

Notes on WikiPedia as an Example[edit]

WikiPedia's Search option is powered by a search engine which involves an indexing process. The search engine technology is a program called Lucene (wiki), which is built into MediaWiki, the supporting software that WikiPedia uses.

Some of these same concepts are employed by WikiPedia, and you can learn more about these by researching MediaWiki and the architecture of Lucene.

WikiPedia articles are stored in a database, which serves as the corpus. For indexing, primarily only the wikitext part of the article is considered, the other meta data is mostly ignored. The wikitext is periodically indexed by the Lucene indexer, and this index is used when you click the search button on WikiPedia. From wikiehelp - Article does not appear when searching. The database for searching is updated approximately every 4-6 weeks. Changes do not appear there right away. Josh Froelich 02:34, 19 December 2006 (UTC)

I am considering adding an example of WikiPedia itself of the innards of search engine indexing. For the wikipedia lucene example, looking at the SVN source code on mediawiki's webstie, we can see that:

Josh Froelich 20:21, 7 December 2006 (UTC)

Further reading[edit]

Thinking about adding a future reading section or incorporating these sources into the article. Josh Froelich 22:38, 20 December 2006 (UTC)

  • Inverted Index Algorithm and Compression
    • Position may be expressed as section, paragraph, sentence, location within sentence
    • Stop words eliminate about half the size of an inverted index. 'the' occurs in 7 percent of English text.
    • Problem is some terms have very long posting lists -- in Excite’s search engine '1997' occurs 7 million times.
    • Top Docs - Other structures may be built at index, creation to optimize performance. Instead of retrieving the whole posting list, we might want to only retrieve the top x documents where the documents are ranked by weight. A separate structure with sorted, truncated posting lists may be produced. Avoids need to retrieve the entire posting list. Dramatic savings on efficiency for large posting lists. Not feasible for Boolean queries. Can miss some relevant documents due to truncation
  • Harvest http://harvest.transarc.com/
    • this is caching software if i remember, distributed db like. that or it was a crawler, gotta read up
  • Search Engine Watch http://www.searchenginewatch.com/
  • [Bagdikian 97] Ben H. Bagdikian. The Media Monopoly. 5th Edition. Publisher: Beacon, ISBN: 0807061557
  • [Gravano 94] Luis Gravano, Hector Garcia-Molina, and A. Tomasic. The Effectiveness of GlOSS for the Text-Database Discovery Problem. Proc. of the 1994 ACM SIGMOD International Conference On Management Of Data, 1994.
  • [Marchiori 97] Massimo Marchiori. The Quest for Correct Information on the Web: Hyper Search Engines. The Sixth International WWW Conference (WWW 97). Santa Clara, USA, April 7-11, 1997.
  • [McBryan 94] Oliver A. McBryan. GENVL and WWWW: Tools for Taming the Web. First International Conference on the World Wide Web. CERN, Geneva (Switzerland), May 25-26-27 1994. http://www.cs.colorado.edu/home/mcbryan/mypapers/www94.ps
  • [1] The Free Online Dictionary of Computing (http://foldoc.doc.ic.ac.uk/) is edited by Denis Howe <dbh@doc.ic.ac.uk>. inverted index - <database, information science> A sequence of (key, pointer) pairs where each pointer points to a record in a database which contains the key value in some particular field. The index is sorted on the key values to allow rapid searching for a particular key value, using e.g. binary search. The index is "inverted" in the sense that the key value is used to find the record rather than the other way round. For databases in which the records may be searched based on more than one field, multiple indices may be created that are sorted on those keys. An index may contain gaps to allow for new entries to be added in the correct sort order without always requiring the following entries to be shifted out of the way. (1995-02-08)
  • Using a Relational Database for an Inverted Text Index Inverted indices for free-text search are commonly stored and updated using B-Trees. This paper shows how to efficiently maintain a dynamic inverted index for rapidly evolving text document sets using a suitable SQL-based relational database management system. Using a relational system pro-vides performance and reliability features such as efficient index access and maintenance, caching, multi-user transactions, access control, back-up and error recovery. Steve Putz, Xerox Palo Alto Research Center. Copyright January 1991 Xerox Corporation. All rights reserved. System Sciences Laboratory Palo Alto Research Center 3333 Coyote Hill Road Palo Alto, California 94304
    • Pg 1: An inverted index is a data structure that maps a search key (e. g. a word) to a postings list enumerating documents containing the key. It is often useful for the postings list to also include the locations of each word occurrence within each document. An index of this type allows efficient implementation of boolean, extended boolean, proximity and relevance search algorithms [5].
    • Pg1: Because it allows efficient insertion, lookup and deletion, the file-based B-Tree data structure is useful for implement-ing a large dynamically updated inverted index [1]. Cutting and Pedersen have developed several space and time op-timizations for efficiently maintaining an inverted index using a B-Tree and a heap file [2]. Unfortunately good B-Tree packages are not generally available for most computer languages and operating systems, and implementation of efficient and reliable B-Tree software is difficult. The algorithms for B-Tree insertion and deletion can be complex and difficult to debug, especially for B-Trees with variable length records. Adding the capability for multi-user trans-actions and error recovery is even more difficult.
    • Pg2: When a postings list becomes too large for the B-Tree (e. g. more than 16 postings), portions of it are pulsed to a sep-arate heap file. The heap is a binary memory file with contiguous chunks allocated as necessary for the overflow post-ings lists. For very long postings lists (several hundred postings), heap chunks are linked together with pointers. Heap management software handles allocation and deallocation and keeps track of deallocated chunks for reuse.
      • This is good example of trend towards distributed hashing, distributed storage, and the use of the data structures like a heap. This is a factual statement on the heal binary file, which is a very useful source for this article I think
    • Sec3: Document identifiers (after the first) within a document list are delta encoded as a difference from the previous iden-tifier. For example, the sequence of document identifiers (515, 676, 786, 881) becomes (515, 161, 110, 95) when delta encoded. Word positions within a document are also delta encoded. The integer values are each byte encoded using a variable-length format such that small values require less storage space than large values as shown in Table 1. The high bit of each encoded byte indicates whether additional bytes fol-low. The low seven bits of each byte encode a portion of the integer with least significant bytes first.
      • This is exactly the type of content I am looking to cite. Delta encoding. Similar the reference in the Google98 paper too.
    • Sec3: The amount of storage space required to store an inverted text index in a relational system using these methods can vary widely depending on the relational system used and other factors such as the choice of non indexed "drop words." In experiments with the Sybase SQL Server and a list of 72 drop words, the index size tends to be between 30% and 50% of the size of the original documents for document sets between 10 and 64 million bytes.
      • Perfect citation for using stopwords to reduce index size on disk, with a factual example!
    • Rather than allowing postings lists of arbitrary length or linking postings lists together with pointers, long postings lists are split into pieces no longer than 255 bytes and stored in adjacent database table rows. Since most words in a full text inverted index occur relatively infrequently, most postings lists are short and a single table row per word is sufficient to contain the postings for most words. Depending on the document set size, about 5% of the words will re-quire two or more table rows to store all the postings. In a large document set, some words may require a hundred or more table rows for their postings.
      • can cite for horizontal partitioning statement, also shows the thinking that goes into it.
    • Sec8: When separate word and postings tables are used as described in section 4.3, the two tables are joined as in the fol-lowing SQL query, which produces the same results as the previous example: SELECT firstdoc, flags, block FROM wordlist, numpostings WHERE word = "box" AND wordlist. wordnum = numpostings. wordnum AND flags < 128 11 11 Page 12 13 10 5.3 Retrieval with a Limited Document Identifier Range By specifying a constraint on the firstdoc column, it is possible to construct an SQL query to retrieve word postings that occur within a specified range of document identifiers. This is useful if document identifiers are assigned in a meaningful order, such as by creation or entry date.
      • this helps explain how a distributed hash table supports a query for a word across partitions. they give date as example, but it could also be the hash function. good for citation
    • 6.1 Update Optimizations - Cutting and Pedersen have developed efficient algorithms for creating and updating an inverted index by buffering a large number of postings lists in main memory and periodically merging them into the externally stored index [2]. This merge update optimization dramatically reduces the number of secondary storage accesses required to index new documents. This optimization is equally important when storing the inverted index in a relational system. At any given time, the final row( s) containing the document list and positions list for a large number of words should be kept in main memory. Whenever a row's block becomes filled (to its maximum of 255 bytes), it is stored into the database and a new empty row begun. If the relational system's application programming interface provides efficient bulk data copy primitives, they should be used for inserting new rows into the database tables.
      • This can be cited for the merge definition, also helps understand rationale behind forward index/barrels
    • 7.2 Indexing Speed - Indexing speed depends almost entirely on the speed with which rows can be retrieved and stored into the relational system. Our indexing program was able to create an inverted index for the Random House Encyclopedia in 37 minutes 1. This is the equivalent of 380 words per second or about 14 megabytes of text per hour. Due to overhead associated with writing out and later retrieving buffered postings when main memory becomes full, the larger Grolier's Encyclopedia was indexed at a rate of 140 words per second (6.4 megabytes of text per hour). Better buffering and the use of bulk copy routines would improve the indexing speed.


  • Dynamic Inverted Index Maintenance. INTERNATIONAL JOURNAL OF COMPUTER SCIENCE VOLUME 1 NUMBER 2 2006 ISSN 1306-4428. Leo Galambos.
    • Pg1 - The number of rows (number of indexed documents) is also termed 'the size of the index'. The index may contain other values, for instance, positions of words (tokens) in documents. For purposes of this paper an item in an inverted list is termed a 'tuple'.
    • Pg1 - We will assume that inverted lists are stored in one file(inverted file) in an order that reflects the order of their terms. This format ensures that two inverted files T and U can be merged by reading them sequentially. Assuming that the inverted file X is built up for the document collection CX, and its length is LX, the merge operation then produces a new inverted file V
    • Pg1 - Rebuild. This method replaces an obsolete index with a new one which is built from scratch. Obviously, this way is not effective because it always re-scans all documents in a collection. The method could be improved by distributed processing, but it does not change the fact that this method is wasteful for the document collection assumed herein.
    • Pg1/2 - Delta change-append only. Some improvement is achieved when just the modified documents are processed. For this, the index is merged with an index built for new documents. Modification of a document is then realized via Delete+Insert operations. This implementation is very popular and can be found, for instance, in the Lucene engine [1]. Unfortunately, one serious drawback exists - if we execute many Delete operations (also part of the Modification operation), the index structure stops working effectively, and what is more important, it is not possible to easily detect this situation. Therefore, from time to time one must optimize the index. Such an optimization phase may take a lot of time, which again makes this method less effective.
    • Pg2 - Forward index. Another method was proposed by Brin and Page for the Google search engine [3]. This method constructs an auxiliary data structure (forward index) which speeds up the modifications in the main (inverted) index. The term “modifications” means the real changes of the existing values stored in the index. Other modifications (insertion or removal of values to/from the index) are realized using other approaches, e.g. Btree [6].
    • Pg2 - Landmark. Research in this area continues and new approaches are still developed, for instance, a landmark method [10] which introduces special marks in the index. The marks are used as initial points from which some values in the index are derived. Therefore, if one modifies the mark, all values dependent on the mark are also shifted. It was shown in the cited paper that such a case often happens in an index built for WWW and, as a result, the landmark method was faster than the forward index.
  • GIN - Generalized Inverted iNdex Gin fuzzy limit - There are often situations, when full text search returns a very big set of results, which is very difficult to manage, since reading tuples from disk and their ordering could takes a lot of time, which is unacceptable for the production (notice, that search itself is very fast). Such queries are usually contain very frequent lexemes, so results are not very helpful. To facilitate execution of such queries we introduced a configurable soft upper limit of the size of the returned set - GUC variable gin_fuzzy_search_limit, which is 0 on default (no limitation). This subset randomly chosen from the whole result set. Soft means that the actual number of returned results could slightly differs from this limit, depending on the query and the quality of system random generator. From our experience, we found that value about several thousands (5000-20000) is ok, i.e., gin fuzzy limit will have no effects for queries returning result set lesser than this number.
    • this might have more to do on the searching side, but its an important note about how search engines set a false ceiling on recall. Josh Froelich 22:55, 20 December 2006 (UTC)

Sentence and Paragraph Boundaries[edit]

Some search engines identify sentence and paragraph boundaries in natural language and denote the containing sentence for each token, and the position of the token within its containing sentence, referred to as the sentence offset. This is a natural language processing task. Some search engines attempt to automatically identify the part of speech of each token within its containing sentence. Stemming may be performed at this point, which essentially involves ignoring the affixes or suffices of the token. Many search engines store the position of the token within the overall document. One accepted definition of token position is the count of preceding characters prior to the occurrence of the word (citation needed). Position is sometimes referred to as the offset. As the first token in a document contains the first character in the document, so the token's position is 0. Where as the second word might occur 5 characters later, so its recorded position is 5. An alternative definition of token position is the count of preceding tokens (citation needed). The first token is at position 0, the second token at position 1, and so on. Suppose a document consists solely of the sentence "It's a dog eat dog world". The token array (a data structure similar to a list, or subtype of a list) might look like the following:

Token (Word) Position
It's 1
a 6
dog 8
eat 11
dog 15
world 20

Token Hashing[edit]

Storing the token 'dog' repeatedly is a redundancy. Storing a list such as this for all the documents in a large corpus is not technically feasible. Instead, many search engine designs incorporate a token hash, which assigns a unique, identifying value for each word. This word hash is a separately managed data structure commonly referred to as a lexicon. The following table displays what the lexicon looks like:

Word Id Word
1 a
2 dog
3 eat
4 it's
5 world

After hashing the words, the algorithm stores a list of each word hash key (the key is the id or value identifying the word) and its position (and potentially other characteristics such as the part of speech). This listing consumes less computer memory (takes up less space) given the smaller key size.

Word Id Position
4 1
1 6
2 8
3 11
2 15
5 20

Token Conflation[edit]

In the above figure, word id 2 appears twice. This is a redundancy, which means there is room for optimization. As such, the list is conflated or aggregated by the id. The position for each token occurrence, commonly referred to as a 'hit' (citation needed), is stored as a list for each key. The list might look something like the following:

Token Id Positions
1 1,20,500,etc
2 10,45,3445,etc

In addition to grouping together words based on the exact characters in each word (dog is the same as dog...), some search engines also group together 'similar words', or words which share something in common, such as grammatical form or meaning. Using linguistic analysis, stemming is performed which identifies the root lexeme of each word. If two words share the same lexeme, the words are grouped together during conflation.

Stop and Start List Data Structures[edit]

Main article: Stop words

A stop list is a list of words (or numbers or other symbols), typically for a specific language, that should not be indexed. Alternatively, the stop list is the list of words that the search engine user cannot use to search with (and as such, should not always be indexed). At this point in the process, or earlier during initial tokenization or hashing (or just by looking at a property of the word in the lexicon), words may be ignored and then removed to prevent the entries from being stored in one of the later data structures of the indexing process such as the forward index.

Stop lists are also referred to as black lists, or ignore lists. Entries in the stop list are typically referred to as stop words or stopwords.

Stop lists typically contain functional words, sometimes referred to as noise (see information theory, Claude Shannon). Common English stop words include:

  • the
  • it
  • she
  • he
  • and
  • or

These words carry little semantic value for the search effort. Removing the words due to limited value ratio (the disk space or memory consumed in the index data structures and the time spent processing them versus how frequently users search for them or need to search at all for them), is a common practice as it reduces the disk size of later data structures, which also in turn improves processing speed.

A start list is a list of words or symbols that should be indexed. Some indexing processes only index words from a start list. Others provide a weighting method, where start words are treated differently, so that search results containing these words appear higher in the search results.

Lexicographic Conflation[edit]

Lexicography is commonly used to refer to the practice of making dictionaries. Search engine designers may compile dictionaries, or one or more thesauri and incorporate these into the indexing process, to introduce human bias. For example, words which are found in a thesaurus as synonyms may only be indexed as a single entry.

In application, document authors may use lexically distinct symbols to annotate a single concept. Given that search engines pursue quality search results, and that a search engine user enters words which he/she considers to represent the concept, a goal of incorporating the dictionary is to increase the accuracy of the search engine in matching various documents, despite that the symbols (the physical words) differ.

Certain search engines do not incorporate dictionaries, or involve them at a later point as part of query expansion.

The Term Document Matrix[edit]

The class of search engines and natural language processing software which employ Latent Semantic Analysis use the Term-document matrix data structure. This is similar in nature to the forward and inverted indices described above. The term document matrix contains a table where every row is a document, every column is a word, and each intersection (or data cell) contains a frequency count (or sometimes other information). The data structure looks like the following:

the dog
Document 1 5 34243
Document 2 2 0 or null

Queries are performed by locating the column containing the word and obtaining the set of non-null rows in the table.

Note that the term document matrix is a sparse matrix as not every document in the corpus contains every token, and not every token matches all documents. The forward and inverted indices are also forms of this same sparse matrix.

Order of Sections[edit]

This article would flow better if the sections were placed in the general order in which they are executed to create the inverted index (the final step). So something like:

  1. Document Parsing
  2. Forward keyword index
  3. Inverted Keyword index

With the smaller sections in between some of these. This would also involved changing the lead to include a statement of the goal of indexing: "Indexing aims to produce an map from unique keywords or terms to a list of documents containing that keyword. This data structure is referred to as an inverted keyword index."

Thoughts from anyone?

Drmadskills (talk) 18:16, 21 February 2008 (UTC)

Biased?[edit]

This article seems too biased to Internet web search engines. I came here to find information on how full-text indexes work in general (hash tables?) and found lots of stuff about... identifying languages? meta tags? Web 3.0?? —Preceding unsigned comment added by 200.68.94.105 (talk) 19:06, 22 August 2008 (UTC)

I could not agree more: there should be a separate article that describes a unversal concept of search index providing examples of search index types and their applications, such as DBMS, text, and image search engines. Not necessarily an Internet search engine! The article should contain a link to an article that describes (Internet) fulltext search engines. Itman (talk) 19:53, 24 February 2009 (UTC)

Suggested change[edit]

One of the previous subject authors wrote: "Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and computer science." The word "informatics" is linked to another page titled Information technology. Note: Another page titled Informatics (academic field) exists. The Informatics (academic field) page specifies "Not to be confused with Information technology". This discussion requires a SME to identify 1. whether Index design incorporates informatics or information technology and 2. link the appropriate page to the designated word. araffals 17:11, 22 January 2013 (UTC) — Preceding unsigned comment added by Araffals (talkcontribs)