# BK-tree

A BK-tree is a metric tree suggested by Walter Austin Burkhard and Robert M. Keller specifically adapted to discrete metric spaces. For simplicity, consider integer discrete metric $d(x,y)$ . Then, BK-tree is defined in the following way. An arbitrary element a is selected as root node. The root node may have zero or more subtrees. The k-th subtree is recursively built of all elements b such that $d(a,b)=k$ . BK-trees can be used for approximate string matching in a dictionary.[example needed]

## Example

This picture depicts the BK-tree for the set $W$ of words {"book", "books", "cake", "boo", "boon", "cook", "cake", "cape", "cart"} obtained by using the Levenshtein distance

• each node $u$ is labeled by a string of $w_{u}\in W$ ;
• each arc $(u,v)$ is labeled by $d_{uv}=d(w_{u},w_{v})$ where $w_{u}$ denotes the word assigned to $u$ .

The BK-tree is built so that:

• for all node $u$ of the BK-tree, the weight assigned to its egress arcs are distinct;
• for all arc $e=(u,v)$ labeled by $k$ , each descendant $v'$ of $v$ satisfies the following equation: $d(w_{u},w_{v'})=k$ :
• Example 1: Consider the arc from "book" to "books". The distance between "book" and any word in {"books", "boo", "boon", "cook"} is equal to 1;
• Example 2: Consider the arc from "books" to "boo". The distance between "books" and any word in {"boo", "boon", "cook"} is equal to 2.

## Insertion

The insertion primitive is used to populate a BK-tree $t$ according to a discrete metric $d$ .

Input:

• $t$ : the BK-tree;
• $d_{uv}$ denotes the weight assigned to an arc $(u,v)$ ;
• $w_{u}$ denotes word assigned to a node $u$ );
• $d$ : the discrete metric used by $t$ (e.g. the Levenshtein distance);
• $w$ : the element to be inserted into $t$ ;

Output:

• The node of $t$ corresponding to $w$ Algorithm:

• If the $t$ is empty:
• Create a root node $r$ in $t$ • $w_{r}\leftarrow w$ • Return $r$ • Set $u$ to the root of $t$ • While $u$ exists:
• $k\leftarrow d(w_{u},w)$ • If $k=0$ :
• Return $u$ • Find $v$ the child of $u$ such that $d_{uv}=k$ • If $v$ is not found:
• Create the node $v$ • $w_{v}\leftarrow w$ • Create the arc $(u,v)$ • $d_{uv}\leftarrow k$ • Return $v$ • $u\leftarrow v$ ## Lookup

Given a searched element $w$ , the lookup primitive traverses the BK-tree to find the closest element of $w$ . The key idea is to restrict the exploration of $t$ to nodes that can only improve the best candidate found so far by taking advantage of the BK-tree organization and of the triangle inequality (cut-off criterion).

Input:

• $t$ : the BK-tree;
• $d$ : the corresponding discrete metric (e.g. the Levenshtein distance);
• $w$ : the searched element;
• $d_{max}$ : the maximum distance allowed between the best match and $w$ , defaults to $+\infty$ ;

Output:

• $w_{best}$ : the closest element to $w$ stored in $t$ and according to $d$ or $\perp$ if not found;

Algorithm:

• If $t$ is empty:
• Return $\perp$ • Create $S$ a set of nodes to process, and insert the root of $t$ into $S$ .
• $(w_{best},d_{best})\leftarrow (\perp ,d_{max})$ • While $S\neq \emptyset$ :
• Pop an arbitrary node $u$ from $S$ • $d_{u}\leftarrow d(w,w_{u})$ • If $d_{u} :
• $(w_{best},d_{best})\leftarrow (w_{u},d_{u})$ • For each egress-arc $(u,v)$ :
• If $|d_{uv}-d_{u}| : (cut-off criterion)
• Insert $v$ into $S$ .
• Return $w_{best}$ 