# Association rule learning

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. In any given transaction with a variety of items, association rules are meant to discover the rules that determine how or why certain items are connected.

Based on the concept of strong rules, Rakesh Agrawal, Tomasz Imieliński and Arun Swami introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule $\{\mathrm {onions,potatoes} \}\Rightarrow \{\mathrm {burger} \}$ found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as, e.g., promotional pricing or product placements.

In addition to the above example from market basket analysis association rules are employed today in many application areas including Web usage mining, intrusion detection, continuous production, and bioinformatics. In contrast with sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions.

## Definition

Following the original definition by Agrawal, Imieliński, Swami the problem of association rule mining is defined as:

Let $I=\{i_{1},i_{2},\ldots ,i_{n}\}$ be a set of $n$ binary attributes called items.

Let $D=\{t_{1},t_{2},\ldots ,t_{m}\}$ be a set of transactions called the database.

Each transaction in $D$ has a unique transaction ID and contains a subset of the items in $I$ .

A rule is defined as an implication of the form:

$X\Rightarrow Y$ , where $X,Y\subseteq I$ .

In Agrawal, Imieliński, Swami a rule is defined only between a set and a single item, $X\Rightarrow i_{j}$ for $i_{j}\in I$ .

Every rule is composed by two different sets of items, also known as itemsets, $X$ and $Y$ , where $X$ is called antecedent or left-hand-side (LHS) and $Y$ consequent or right-hand-side (RHS). The antecedent is that item that can be found in the data while the consequent is the item found when combined with the antecedent. The statement $X\Rightarrow Y$ is often read as if $X$ then $Y$ , where the antecedent ($X$ ) is the if and the consequent ($Y$ ) is the then. This simply implies that, in theory, whenever $X$ occurs in a dataset, then $Y$ will as well.

## Useful Concepts

Table 1. Example database with 5 transactions and 5 items
transaction ID milk bread butter beer diapers eggs fruit
1 1 1 0 0 0 0 1
2 0 0 1 0 0 1 1
3 0 0 0 1 1 0 0
4 1 1 1 0 0 1 1
5 0 1 0 0 0 0 0

To illustrate the concepts, we use a small example from the supermarket domain. Table 1 is shows a small database containing the items where, in each entry, the value 1 means the presence of the item in the corresponding transaction, and the value 0 represents the absence of an item in that transaction. The set of items is $I=\{\mathrm {milk,bread,butter,beer,diapers,eggs,fruit} \}$ .

An example rule for the supermarket could be $\{\mathrm {butter,bread} \}\Rightarrow \{\mathrm {milk} \}$ meaning that if butter and bread are bought, customers also buy milk.

In order to select interesting rules from the set of all possible rules, constraints on various measures of significance and interest are used. The best-known constraints are minimum thresholds on support and confidence.

Let $X,Y$ be itemsets, $X\Rightarrow Y$ an association rule and T a set of transactions of a given database.

Note: this example is extremely small. In practical applications, a rule needs a support of several hundred transactions before it can be considered statistically significant,[citation needed] and datasets often contain thousands or millions of transactions.

### Support

Support is an indication of how frequently the itemset appears in the dataset.

The support of X with respect to T is defined as the proportion of transactions in the dataset which contains the itemset X. Denoting a transaction by $(i,t)$ where i is the unique identifier of the transaction and t is its itemset, the support may be written as:

$\mathrm {supp} (X)={\frac {|\{(i,t)\in T:X\subseteq t\}|}{|T|}}$ In our example, it can be easier to explain support by writing $support=P(A\cap B)={\frac {({\text{number of transactions containing }}A{\text{ and }}B)}{\text{ (total number of transactions)}}}$ where A and B are separate transactions that were made within the total set of transactions recorded.

Using Table 1 as an example, the itemset $X=\{\mathrm {beer,diapers} \}$ has a support of $1/5=0.2$ since it occurs in 20% of all transactions (1 out of 5 transactions). The argument of $\mathrm {supp} ()$ is a set of preconditions, and thus becomes more restrictive as it grows (instead of more inclusive).

Furthermore, the itemset $Y=\{\mathrm {milk,bread,butter} \}$ has a support of $1/5=0.2$ as it appears in 20% of all transactions as well.

If we set the support threshold to ≥0.4 then $\{\mathrm {milk,bread} \}\Rightarrow \{\mathrm {fruit} \}$ would be the only transaction to yield a result. This is because in 2/5 of all transactions, if milk and bread are bought, then customers also buy fruit; this is the only circumstance when ≥0.4 is reached.

Minimum support thresholds are useful for determining which itemsets are preferred.

### Confidence

Confidence is the percentage of all transactions satisfying X that also satisfy Y.

With respect to T, the confidence value of an association rule, often denoted as $X\Rightarrow Y$ , is the ratio of transactions containing both X and Y to the total amount of Y values present, where X is the antecedent and Y is the consequent.

Confidence can also be interpreted as an estimate of the conditional probability $P(E_{Y}|E_{X})$ , the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.

It is commonly written as:

$\mathrm {conf} (X\Rightarrow Y)=P(Y|X)={\frac {\mathrm {supp} (X\cup Y)}{\mathrm {supp} (X)}}={\frac {{\text{number of transactions containing }}X{\text{ and }}Y}{{\text{number of transactions containing }}X}}$ The equation illustrates that confidence can be computed by calculating the co-occurrence of transactions X and Y within the dataset in ratio to transactions containing only Y.

Using Table 1, the rule $\{\mathrm {butter,bread} \}\Rightarrow \{\mathrm {milk} \}$ has a confidence of ${\frac {1/5}{1/5}}={\frac {0.2}{0.2}}=1.0$ in the dataset, which means that a every time a customer buys butter and bread, they also buy milk. This particular example demonstrates the rule being correct 100% of the time for transactions containing both butter and bread.

For larger datasets, a minimum threshold, or a percentage cutoff, for the confidence can be useful for determining item relationships. More information about thresholds can be found in the Process section.

### Lift

The lift of a rule is defined as:

$\mathrm {lift} (X\Rightarrow Y)={\frac {\mathrm {supp} (X\cup Y)}{\mathrm {supp} (X)\times \mathrm {supp} (Y)}}$ or the ratio of the observed support to that expected if X and Y were independent.

For example, the rule $\{\mathrm {milk,bread} \}\Rightarrow \{\mathrm {butter} \}$ has a lift of ${\frac {0.2}{0.4\times 0.4}}=1.25$ .

If the rule had a lift of 1, it would imply that the probability of occurrence of the antecedent and that of the consequent are independent of each other. When two events are independent of each other, no rule can be drawn involving those two events.

If the lift is > 1, that lets us know the degree to which those two occurrences are dependent on one another, and makes those rules potentially useful for predicting the consequent in future data sets.

If the lift is < 1, that lets us know the items are substitute to each other. This means that presence of one item has negative effect on presence of other item and vice versa.

The value of lift is that it considers both the support of the rule and the overall data set.

### Conviction

The conviction of a rule is defined as $\mathrm {conv} (X\Rightarrow Y)={\frac {1-\mathrm {supp} (Y)}{1-\mathrm {conf} (X\Rightarrow Y)}}$ .

For example, the rule $\{\mathrm {milk,bread} \}\Rightarrow \{\mathrm {butter} \}$ has a conviction of ${\frac {1-0.4}{1-0.5}}=1.2$ , and can be interpreted as the ratio of the expected frequency that X occurs without Y (that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were independent divided by the observed frequency of incorrect predictions. In this example, the conviction value of 1.2 shows that the rule $\{\mathrm {milk,bread} \}\Rightarrow \{\mathrm {butter} \}$ would be incorrect 20% more often (1.2 times as often) if the association between X and Y was purely random chance.

### Alternative measures of interestingness

In addition to confidence, other measures of interestingness for rules have been proposed. Some popular measures are:

• All-confidence
• Collective strength
• Leverage

Several more measures are presented and compared by Tan et al. and by Hahsler. Looking for techniques that can model what the user has known (and using these models as interestingness measures) is currently an active research trend under the name of "Subjective Interestingness."

## Process Frequent itemset lattice, where the color of the box indicates how many transactions contain the combination of items. Note that lower levels of the lattice can contain at most the minimum number of their parents' items; e.g. {ac} can have only at most $\min(a,c)$ items. This is called the downward-closure property.

Association rules are made by searching data for frequent if-then patterns and by using a certain criterion under Support and Confidence to define what the most important relationships are. Support is the evidence of how frequent an item appears in the data given, as Confidence is defined by how many times the if-then statements are found true. However, there is a third criteria that can be used, it is called Lift and it can be used to compare the expected Confidence and the actual Confidence. Lift will show how many times the if-then statement is expected to be found to be true.

Association rules are calculated from itemsets, which are created by two or more items. If the rules were built from the analyzing from all the possible itemsets from the data then there would be so many rules that they wouldn’t have any meaning. That is why Association rules are typically made from rules that are well represented by the data.

When using Association rules, you are most likely to only use Support and Confidence. However, this means you have to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Usually, the Association rule generation is split into two different steps that needs to be applied:

1. A minimum Support threshold to find all the frequent itemsets that are in the database.
2. A minimum Confidence threshold to the frequent itemsets found to create rules.

To find all the frequent itemsets in a database is not an easy task since it involves going through all the data to find all possible item combinations from all possible itemsets. The set of possible itemsets is the power set over I and has size $2^{n}-1$ , of course this means to exclude the empty set which is not considered to be a valid itemset. However, the size of the power set will grow exponentially in the number of item n that is within the power set I. An efficient search is possible by using the downward-closure property of support (also called anti-monotonicity). This would guarantee that a frequent itemset and all its subsets are also frequent and thus will have no infrequent itemsets as a subset of a frequent itemset. Exploiting this property, efficient algorithms (e.g., Apriori and Eclat) can find all frequent itemsets.

## History

The concept of association rules was popularized particularly due to the 1993 article of Agrawal et al., which has acquired more than 23,790 citations according to Google Scholar, as of April 2021, and is thus one of the most cited papers in the Data Mining field. However, what is now called "association rules" is introduced already in the 1966 paper on GUHA, a general data mining method developed by Petr Hájek et al. The creators developed an algorithm-based way to find relationships between items using point-of-sale systems. Applying the algorithms to stores, they discovered links between items, this information was used to find the likelihood of different products being purchased together.

An early (circa 1989) use of minimum support and confidence to find all association rules is the Feature Based Modeling framework, which found all rules with $\mathrm {supp} (X)$ and $\mathrm {conf} (X\Rightarrow Y)$ greater than user defined constraints.

## Statistically sound associations

One limitation of the standard approach to discovering associations is that by searching massive numbers of possible associations to look for collections of items that appear to be associated, there is a large risk of finding many spurious associations. These are collections of items that co-occur with unexpected frequency in the data, but only do so by chance. For example, suppose we are considering a collection of 10,000 items and looking for rules containing two items in the left-hand-side and 1 item in the right-hand-side. There are approximately 1,000,000,000,000 such rules. If we apply a statistical test for independence with a significance level of 0.05 it means there is only a 5% chance of accepting a rule if there is no association. If we assume there are no associations, we should nonetheless expect to find 50,000,000,000 rules. Statistically sound association discovery controls this risk, in most cases reducing the risk of finding any spurious associations to a user-specified significance level.

## Algorithms

Many algorithms for generating association rules have been proposed.

Some well-known algorithms are Apriori, Eclat and FP-Growth, but they only do half the job, since they are algorithms for mining frequent itemsets. Another step needs to be done after to generate rules from frequent itemsets found in a database.

### Apriori algorithm

Apriori is given by R. Agrawal and R. Srikant in 1994 for frequent item set mining and association rule learning. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often. The name of the algorithm is Apriori because it uses prior knowledge of frequent itemset properties.

### Eclat algorithm

Eclat (alt. ECLAT, stands for Equivalence Class Transformation) is a depth-first search algorithm based on set intersection. It is suitable for both sequential as well as parallel execution with locality-enhancing properties.

### FP-growth algorithm

FP stands for frequent pattern.

In the first pass, the algorithm counts the occurrences of items (attribute-value pairs) in the dataset of transactions, and stores these counts in a 'header table'. In the second pass, it builds the FP-tree structure by inserting transactions into a trie.

Items in each transaction have to be sorted by descending order of their frequency in the dataset before being inserted so that the tree can be processed quickly. Items in each transaction that do not meet the minimum support requirement are discarded. If many transactions share most frequent items, the FP-tree provides high compression close to tree root.

Recursive processing of this compressed version of the main dataset grows frequent item sets directly, instead of generating candidate items and testing them against the entire database (as in the apriori algorithm).

Growth begins from the bottom of the header table i.e. the item with the smallest support by finding all sorted transactions that end in that item. Call this item $I$ .

A new conditional tree is created which is the original FP-tree projected onto $I$ . The supports of all nodes in the projected tree are re-counted with each node getting the sum of its children counts. Nodes (and hence subtrees) that do not meet the minimum support are pruned. Recursive growth ends when no individual items conditional on $I$ meet the minimum support threshold. The resulting paths from root to $I$ will be frequent itemsets. After this step, processing continues with the next least-supported header item of the original FP-tree.

Once the recursive process has completed, all frequent item sets will have been found, and association rule creation begins.

### Others

#### ASSOC

The ASSOC procedure is a GUHA method which mines for generalized association rules using fast bitstrings operations. The association rules mined by this method are more general than those output by apriori, for example "items" can be connected both with conjunction and disjunctions and the relation between antecedent and consequent of the rule is not restricted to setting minimum support and confidence as in apriori: an arbitrary combination of supported interest measures can be used.

#### OPUS search

OPUS is an efficient algorithm for rule discovery that, in contrast to most alternatives, does not require either monotone or anti-monotone constraints such as minimum support. Initially used to find rules for a fixed consequent it has subsequently been extended to find rules with any item as a consequent. OPUS search is the core technology in the popular Magnum Opus association discovery system.

## Lore

A famous story about association rule mining is the "beer and diaper" story. A purported survey of behavior of supermarket shoppers discovered that customers (presumably young men) who buy diapers tend also to buy beer. This anecdote became popular as an example of how unexpected association rules might be found from everyday data. There are varying opinions as to how much of the story is true. Daniel Powers says:

In 1992, Thomas Blischok, manager of a retail consulting group at Teradata, and his staff prepared an analysis of 1.2 million market baskets from about 25 Osco Drug stores. Database queries were developed to identify affinities. The analysis "did discover that between 5:00 and 7:00 p.m. that consumers bought beer and diapers". Osco managers did NOT exploit the beer and diapers relationship by moving the products closer together on the shelves.

## Other types of association rule mining

Multi-Relation Association Rules: Multi-Relation Association Rules (MRAR) are association rules where each item may have several relations. These relations indicate indirect relationship between the entities. Consider the following MRAR where the first item consists of three relations live in, nearby and humid: “Those who live in a place which is nearby a city with humid climate type and also are younger than 20 -> their health condition is good”. Such association rules are extractable from RDBMS data or semantic web data.

Contrast set learning is a form of associative learning. Contrast set learners use rules that differ meaningfully in their distribution across subsets.

Weighted class learning is another form of associative learning in which weight may be assigned to classes to give focus to a particular issue of concern for the consumer of the data mining results.

High-order pattern discovery facilitate the capture of high-order (polythetic) patterns or event associations that are intrinsic to complex real-world data. 

K-optimal pattern discovery provides an alternative to the standard approach to association rule learning that requires that each pattern appear frequently in the data.

Approximate Frequent Itemset mining is a relaxed version of Frequent Itemset mining that allows some of the items in some of the rows to be 0.

Generalized Association Rules hierarchical taxonomy (concept hierarchy)

Quantitative Association Rules categorical and quantitative data

Interval Data Association Rules e.g. partition the age into 5-year-increment ranged

Sequential pattern mining discovers subsequences that are common to more than minsup[clarification needed] sequences in a sequence database, where minsup is set by the user. A sequence is an ordered list of transactions.

Subspace Clustering, a specific type of Clustering high-dimensional data, is in many variants also based on the downward-closure property for specific clustering models.

Warmr is shipped as part of the ACE data mining suite. It allows association rule learning for first order relational rules.