Information gain ratio

In decision tree learning, the information gain ratio is the ratio of information gain to intrinsic information. It was proposed by Ross Quinlan^[1] to reduce a bias towards multi-valued attributes by taking the number and size of branches into account when choosing an attribute.^[2]

Information gain is also known as mutual information.^[3]

Information gain calculation

Let $Attr$ be the set of all attributes and $Ex$ the set of all training examples. For $x\in Ex$ and $a\in Attr$, $value(x,a)$ is the value of example $x$ for attribute $a$, and $values(a)$ is the set of all possible values of $a$. $H$ denotes the entropy. The information gain for an attribute $a\in Attr$ is defined as follows:

$IG(Ex,a)=H(Ex)-\sum _{v\in values(a)}\left({\frac {|\{x\in Ex|value(x,a)=v\}|}{|Ex|}}\cdot H(\{x\in Ex|value(x,a)=v\})\right)$

The information gain for an attribute equals the total entropy $H(Ex)$ if each of the attribute's values determines a unique classification of the result attribute. In this case, every subset entropy subtracted from the total entropy is 0.
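
As a rough illustration, the definition above can be sketched in Python. The representation of examples as dictionaries, the function names `entropy` and `information_gain`, and the toy `outlook`/`play` data are assumptions made for this sketch; they do not come from the cited sources.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a non-empty sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(examples, attribute, target):
    """IG(Ex, a): entropy of the target minus the weighted entropy of each subset."""
    # Partition the target labels by the value each example takes for `attribute`.
    subsets = {}
    for x in examples:
        subsets.setdefault(x[attribute], []).append(x[target])
    remainder = sum(len(s) / len(examples) * entropy(s) for s in subsets.values())
    return entropy([x[target] for x in examples]) - remainder


# Hypothetical toy data: every value of "outlook" determines "play" uniquely.
examples = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rainy", "play": "yes"},
    {"outlook": "rainy", "play": "yes"},
]

print(information_gain(examples, "outlook", "play"))
print(entropy([x["play"] for x in examples]))
```

Because each subset in this toy data is pure, the subset entropies are 0 and the two printed values agree, which is the special case described above.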

Intrinsic value calculation

The intrinsic value for a test on attribute $a$ is defined as follows:

$IV(Ex,a)=-\sum _{v\in values(a)}{\frac {|\{x\in Ex|value(x,a)=v\}|}{|Ex|}}\cdot \log _{2}\left({\frac {|\{x\in Ex|value(x,a)=v\}|}{|Ex|}}\right)$
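
The intrinsic value depends only on how the examples are distributed over the values of the attribute, not on their class labels. A minimal Python sketch, assuming examples are stored as dictionaries (the function name `intrinsic_value` and the `colour` data are hypothetical):

```python
from collections import Counter
from math import log2


def intrinsic_value(examples, attribute):
    """IV(Ex, a): entropy, in bits, of the distribution of attribute values."""
    total = len(examples)
    counts = Counter(x[attribute] for x in examples)
    return -sum((n / total) * log2(n / total) for n in counts.values())


# Hypothetical toy data: the value proportions are 1/2, 1/4 and 1/4.
examples = [
    {"colour": "red"},
    {"colour": "red"},
    {"colour": "green"},
    {"colour": "blue"},
]

print(intrinsic_value(examples, "colour"))
```

Splitting into more branches, or into branches of more even size, increases this value.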

Information gain ratio calculation

The information gain ratio is the ratio between the information gain and the intrinsic value: $IGR(Ex,a)=IG(Ex,a)/IV(Ex,a)$
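
The ratio can be sketched in Python as follows. The dictionary representation, the function names, and the toy data are assumptions for illustration. The sketch also returns 0 when the intrinsic value is 0 (an attribute with a single value), which is a convention chosen here to avoid division by zero and is not stated in the definition above.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a non-empty sequence of values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(examples, attribute, target):
    """IGR(Ex, a) = IG(Ex, a) / IV(Ex, a)."""
    total = len(examples)
    # Group the target labels by attribute value.
    groups = {}
    for x in examples:
        groups.setdefault(x[attribute], []).append(x[target])
    ig = entropy([x[target] for x in examples]) - sum(
        len(g) / total * entropy(g) for g in groups.values()
    )
    # IV is the entropy of the attribute's own value distribution.
    iv = entropy([x[attribute] for x in examples])
    # Assumed convention: a single-valued attribute gets ratio 0.
    return ig / iv if iv > 0 else 0.0


# Hypothetical toy data; "site" takes the same value for every example.
examples = [
    {"outlook": "sunny", "site": "A", "play": "no"},
    {"outlook": "sunny", "site": "A", "play": "no"},
    {"outlook": "rainy", "site": "A", "play": "yes"},
    {"outlook": "rainy", "site": "A", "play": "yes"},
]

print(gain_ratio(examples, "outlook", "play"))
print(gain_ratio(examples, "site", "play"))
```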

Advantages

Information gain ratio biases the decision tree against considering attributes with a large number of distinct values. This addresses a drawback of information gain: when applied to attributes that can take on a large number of distinct values, information gain might learn the training set too well. For example, suppose that we are building a decision tree for some data describing a business's customers. Information gain is often used to decide which of the attributes are the most relevant, so they can be tested near the root of the tree. One of the input attributes might be the customer's credit card number. This attribute has a high information gain, because it uniquely identifies each customer, but we do not want to include it in the decision tree: deciding how to treat a customer based on their credit card number is unlikely to generalize to customers we haven't seen before.
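
The credit card example can be made concrete with a small Python sketch. All data here is invented for illustration: 16 hypothetical customers, a unique `card_number` per customer, a two-valued `segment` attribute that predicts the outcome imperfectly, and a `responded` label. The helper names are also assumptions.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a non-empty sequence of values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def scores(examples, attribute, target):
    """Return (information gain, gain ratio) for splitting on `attribute`."""
    groups = {}
    for x in examples:
        groups.setdefault(x[attribute], []).append(x[target])
    remainder = sum(len(g) / len(examples) * entropy(g) for g in groups.values())
    ig = entropy([x[target] for x in examples]) - remainder
    iv = entropy([x[attribute] for x in examples])
    # Assumed convention: ratio 0 when the attribute has a single value.
    return ig, (ig / iv if iv > 0 else 0.0)


# Hypothetical data: 7 of the 8 "premium" customers responded, 1 of the 8 "basic" ones did.
responses = ["yes"] * 7 + ["no"] + ["yes"] + ["no"] * 7
customers = [
    {
        "card_number": f"card-{i:02d}",  # unique per customer
        "segment": "premium" if i < 8 else "basic",
        "responded": responses[i],
    }
    for i in range(16)
]

ig_card, ratio_card = scores(customers, "card_number", "responded")
ig_seg, ratio_seg = scores(customers, "segment", "responded")
print(f"card_number: IG={ig_card:.3f}, IGR={ratio_card:.3f}")
print(f"segment:     IG={ig_seg:.3f}, IGR={ratio_seg:.3f}")
```

In this toy data the card number has the larger information gain, while the two-valued segment attribute has the larger gain ratio. The outcome depends on the data: with fewer customers or a weaker segment attribute, the card number could still score higher on both measures.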

References

[1] Quinlan, J. Ross. "Induction of decision trees." Machine learning 1.1 (1986): 81-106.

[2] http://www.ke.tu-darmstadt.de/lehre/archiv/ws0809/mldm/dt.pdf

[3] "Information gain, mutual information and related measures".
