# BK-tree

A BK-tree is a metric tree suggested by Walter Austin Burkhard and Robert M. Keller[1] specifically adapted to discrete metric spaces. For simplicity, consider integer discrete metric ${\displaystyle d(x,y)}$. Then, BK-tree is defined in the following way. An arbitrary element a is selected as root node. The root node may have zero or more subtrees. The k-th subtree is recursively built of all elements b such that ${\displaystyle d(a,b)=k}$. BK-trees can be used for approximate string matching in a dictionary.[2][example needed]

## Example

An example of BK-tree

This picture depicts the BK-tree for the set ${\displaystyle W}$ of words {"book", "books", "cake", "boo", "boon", "cook", "cake", "cape", "cart"} obtained by using the Levenshtein distance

• each node ${\displaystyle u}$ is labeled by a string of ${\displaystyle w_{u}\in W}$;
• each arc ${\displaystyle (u,v)}$ is labeled by ${\displaystyle d_{uv}=d(w_{u},w_{v})}$ where ${\displaystyle w_{u}}$ denotes the word assigned to ${\displaystyle u}$.

The BK-tree is built so that:

• for all node ${\displaystyle u}$ of the BK-tree, the weight assigned to its egress arcs are distinct;
• for all arc ${\displaystyle e=(u,v)}$ labeled by ${\displaystyle k}$, each descendant ${\displaystyle v'}$ of ${\displaystyle v}$ satisfies the following equation: ${\displaystyle d(w_{u},w_{v'})=k}$:
• Example 1: Consider the arc from "book" to "books". The distance between "book" and any word in {"books", "boo", "boon", "cook"} is equal to 1;
• Example 2: Consider the arc from "books" to "boo". The distance between "books" and any word in {"boo", "boon", "cook"} is equal to 2.

## Insertion

The insertion primitive is used to populate a BK-tree ${\displaystyle t}$ according to a discrete metric ${\displaystyle d}$.

Input:

• ${\displaystyle t}$: the BK-tree;
• ${\displaystyle d_{uv}}$ denotes the weight assigned to an arc ${\displaystyle (u,v)}$;
• ${\displaystyle w_{u}}$ denotes word assigned to a node ${\displaystyle u}$);
• ${\displaystyle d}$: the discrete metric used by ${\displaystyle t}$ (e.g. the Levenshtein distance);
• ${\displaystyle w}$: the element to be inserted into ${\displaystyle t}$;

Output:

• The node of ${\displaystyle t}$ corresponding to ${\displaystyle w}$

Algorithm:

• If the ${\displaystyle t}$ is empty:
• Create a root node ${\displaystyle r}$ in ${\displaystyle t}$
• ${\displaystyle w_{r}\leftarrow w}$
• Return ${\displaystyle r}$
• Set ${\displaystyle u}$ to the root of ${\displaystyle t}$
• While ${\displaystyle u}$ exists:
• ${\displaystyle k\leftarrow d(w_{u},w)}$
• If ${\displaystyle k=0}$:
• Return ${\displaystyle u}$
• Find ${\displaystyle v}$ the child of ${\displaystyle u}$ such that ${\displaystyle d_{uv}=k}$
• If ${\displaystyle v}$ is not found:
• Create the node ${\displaystyle v}$
• ${\displaystyle w_{v}\leftarrow w}$
• Create the arc ${\displaystyle (u,v)}$
• ${\displaystyle d_{uv}\leftarrow k}$
• Return ${\displaystyle v}$
• ${\displaystyle u\leftarrow v}$

## Lookup

Given a searched element ${\displaystyle w}$, the lookup primitive traverses the BK-tree to find the closest element of ${\displaystyle w}$. The key idea is to restrict the exploration of ${\displaystyle t}$ to nodes that can only improve the best candidate found so far by taking advantage of the BK-tree organization and of the triangle inequality (cut-off criterion).

Input:

• ${\displaystyle t}$: the BK-tree;
• ${\displaystyle d}$: the corresponding discrete metric (e.g. the Levenshtein distance);
• ${\displaystyle w}$: the searched element;
• ${\displaystyle d_{max}}$: the maximum distance allowed between the best match and ${\displaystyle w}$, defaults to ${\displaystyle +\infty }$;

Output:

• ${\displaystyle w_{best}}$: the closest element to ${\displaystyle w}$ stored in ${\displaystyle t}$ and according to ${\displaystyle d}$ or ${\displaystyle \perp }$ if not found;

Algorithm:

• If ${\displaystyle t}$ is empty:
• Return ${\displaystyle \perp }$
• Create ${\displaystyle S}$ a set of nodes to process, and insert the root of ${\displaystyle t}$ into ${\displaystyle S}$.
• ${\displaystyle (w_{best},d_{best})\leftarrow (\perp ,d_{max})}$
• While ${\displaystyle S\neq \emptyset }$:
• Pop an arbitrary node ${\displaystyle u}$ from ${\displaystyle S}$
• ${\displaystyle d_{u}\leftarrow d(w,w_{u})}$
• If ${\displaystyle d_{u}:
• ${\displaystyle (w_{best},d_{best})\leftarrow (w_{u},d_{u})}$
• For each egress-arc ${\displaystyle (u,v)}$:
• If ${\displaystyle |d_{uv}-d_{u}|: (cut-off criterion)
• Insert ${\displaystyle v}$ into ${\displaystyle S}$.
• Return ${\displaystyle w_{best}}$