# Boolean model of information retrieval

(Redirected from Standard Boolean model)
Jump to navigation Jump to search

The (standard) Boolean model of information retrieval (BIR) is a classical information retrieval (IR) model and, at the same time, the first and most-adopted one. It is used by many IR systems to this day.[citation needed] The BIR is based on Boolean logic and classical set theory in that both the documents to be searched and the user's query are conceived as sets of terms. Retrieval is based on whether or not the documents contain the query terms.

## Definitions

An index term is a word or expression, which may be stemmed, describing or characterizing a document, such as a keyword given for a journal article. Let

$T=\{t_{1},t_{2},\ \ldots ,\ t_{m}\}$ be the set of all such index terms.

A document is any subset of $T$ . Let

$D=\{D_{1},\ \ldots \ ,D_{n}\}$ be the set of all documents.

A query is a Boolean expression ${\textstyle Q}$ in normal form:

$Q=(W_{1}\ \lor \ W_{2}\ \lor \ \cdots )\land \ \cdots \ \land \ (W_{i}\ \lor \ W_{i+1}\ \lor \ \cdots )$ where ${\textstyle W_{i}}$ is true for $D_{j}$ when $t_{i}\in D_{j}$ . (Equivalently, ${\textstyle Q}$ could be expressed in disjunctive normal form.)

We seek to find the set of documents that satisfy ${\textstyle Q}$ . This operation is called retrieval and consists of the following two steps:

1. For each ${\textstyle W_{j}}$ in ${\textstyle Q}$ , find the set ${\textstyle S_{j}}$ of documents that satisfy ${\textstyle W_{j}}$ :
$S_{j}=\{D_{i}\mid W_{j}\}$ 2. Then the set of documents that satisfy Q is given by:
$(S_{1}\cup S_{2}\cup \cdots )\cap \cdots \cap (S_{i}\cup S_{i+1}\cup \cdots )$ ## Example

Let the set of original (real) documents be, for example

$O=\{O_{1},\ O_{2},\ O_{3}\}$ where

${\textstyle O_{1}}$ = "Bayes' principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution)."

${\textstyle O_{2}}$ = "Bayesian decision theory: A mathematical theory of decision-making which presumes utility and probability functions, and according to which the act to be chosen is the Bayes act, i.e. the one with highest subjective expected utility. If one had unlimited time and calculating power with which to make every decision, this procedure would be the best way to make any decision."

${\textstyle O_{3}}$ = "Bayesian epistemology: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures. A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power."

Let the set ${\textstyle T}$ of terms be:

$T=\{t_{1}={\text{Bayes' principle}},t_{2}={\text{probability}},t_{3}={\text{decision-making}},t_{4}={\text{Bayesian epistemology}}\}$ Then, the set ${\textstyle D}$ of documents is as follows:

$D=\{D_{1},\ D_{2},\ D_{3}\}$ where

{\begin{aligned}D_{1}&=\{{\text{probability}},\ {\text{Bayes' principle}}\}\\D_{2}&=\{{\text{probability}},\ {\text{decision-making}}\}\\D_{3}&=\{{\text{probability}},\ {\text{Bayesian epistemology}}\}\end{aligned}} Let the query ${\textstyle Q}$ be:

$Q={\text{probability}}\land {\text{decision-making}}$ Then to retrieve the relevant documents:

1. Firstly, the following sets ${\textstyle S_{1}}$ and ${\textstyle S_{2}}$ of documents ${\textstyle D_{i}}$ are obtained (retrieved):
{\begin{aligned}S_{1}&=\{D_{1},\ D_{2},\ D_{3}\}\\S_{2}&=\{D_{2}\}\end{aligned}} 2. Finally, the following documents ${\textstyle D_{i}}$ are retrieved in response to ${\textstyle Q}$ $Q:\{D_{1},\ D_{2},\ D_{3}\}\ \cap \ \{D_{2}\}\ =\ \{D_{2}\}$ This means that the original document ${\textstyle O_{2}}$ (corresponding to ${\textstyle D_{2}}$ ) is the answer to ${\textstyle Q}$ .

Obviously, if there is more than one document with the same representation, every such document is retrieved. Such documents are indistinguishable in the BIR (in other words, equivalent).

## Advantages

• Clean formalism
• Easy to implement
• Intuitive concept

## Disadvantages

• Exact matching may retrieve too few or too many documents
• Hard to translate a query into a Boolean expression
• All terms are equally weighted
• More like data retrieval than information retrieval

## Data structures and algorithms

From a pure formal mathematical point of view, the BIR is straightforward. From a practical point of view, however, several further problems should be solved that relate to algorithms and data structures, such as, for example, the choice of terms (manual or automatic selection or both), stemming, hash tables, inverted file structure, and so on.

### Hash sets

Another possibility is to use hash sets. Each document is represented by a hash table which contains every single term of that document. Since hash table size increases and decreases in real time with the addition and removal of terms, each document will occupy much less space in memory. However, it will have a slowdown in performance because the operations are more complex than with bit vectors. On the worst-case performance can degrade from O(n) to O(n2). On the average case, the performance slowdown will not be that much worse than bit vectors and the space usage is much more efficient.